教你一种1分钟，下载1万个网页的方法，你学吗？|163

一：Pycurl介绍

Pycurl是一个用C语言编写的libcurl Python实现，功能非常强大，支持操作协议有FTP，HTTP，HTTPS，TELNET等。与urllib相比，Pycurl的速度要快很多。

二：Pycurl安装

大家可以去官网下载与本地Python一直的whl或exe包。

也可以使用下面的命令行直接安装。

pip install pycurl

三：Pycurl主要方法

pycurl.Curl() #创建一个pycurl对象的方法
pycurl.Curl().setopt(pycurl.URL, “URL”) #设置要访问的URL
pycurl.Curl().setopt(pycurl.MAXREDIRS, 5) #设置最大重定向次数
pycurl.Curl().setopt(pycurl.CONNECTTIMEOUT, 60)
pycurl.Curl().setopt(pycurl.TIMEOUT, 300) #连接超时设置
pycurl.Curl().setopt(pycurl.USERAGENT, "Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)") #模拟浏览器
pycurl.Curl().perform() #服务器端返回的信息
pycurl.Curl().getinfo(pycurl.HTTP_CODE) #查看HTTP的状态类似urllib中status属性
pycurl.HTTP_CODE HTTP 响应代码
pycurl.REDIRECT_COUNT 重定向的次数
pycurl.SIZE_DOWNLOAD 下载的数据大小
pycurl.CONTENT_LENGTH_DOWNLOAD 下载内容长度
pycurl.RESPONSE_CODE 响应代码
pycurl.SPEED_DOWNLOAD 下载速度

四：示例DEMO

下面的示例代码是我2015年开发通用爬虫时，使用的一部分，现在已经改成并发模式了(后面会有简单的介绍)，不过大家还是可以做个参考。

# encoding=utf-8
import pycurl , traceback
from com.fy.utils.html.HtmlCode import HtmlCodeUtils
from io import BytesIO # <-- 这个用到里面的write函数
class Utils_PyCurl:
def __init__(self,):
self.hcu = HtmlCodeUtils() #根据HTML源码获取网页编码
self.c = pycurl.Curl()
self.c.setopt(pycurl.CONNECTTIMEOUT, 60)#链接超时
self.c.setopt(pycurl.TIMEOUT, 60) #下载超时
self.c.setopt(pycurl.MAXREDIRS, 5)#设置重定向次数
self.c.setopt(pycurl.USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)')#模拟浏览器
def PyCurl_Html(self, url):
self.requestCode = 0#URL请求返回码
host = url
for i in range(2):
del i
html, code = self.fetch_html_pycurl(host)
if int(self.requestCode) == 200:
return html, code
elif html == None:
continue
return None, None
def fetch_html_pycurl(self, host):
result = None
try:
self.c.setopt(self.c.URL, host)
self.c.setopt(self.c.HTTPHEADER, ['Accept:'])
e = BytesIO()
self.c.setopt(self.c.WRITEFUNCTION, e.write)
self.c.setopt(self.c.FOLLOWLOCATION, 1)
self.c.perform()
data = e.getvalue();
html_code = self.hcu.getChardet(data)#获取网页编码
result = str(data.decode(html_code, 'ignore'))
self.requestCode = self.c.getinfo(self.c.HTTP_CODE)
if result != None:print(len(result), self.requestCode)
else:print(self.requestCode)
except :print(traceback.print_exc())
return result, self.requestCode
pyc = Utils_PyCurl()
html, code = pyc.PyCurl_Html("文章地址")
print("code：", code)
print(html)

五：并发

在大批量数据采集时，我们需要对上述的代码进行二次封装，实现并发处理，提高下载的速度。多线程是实现并发最基础的方式，这里就不再次介绍。今天我们介绍一下基gevent协程的并发处理。

1：Gevent介绍

gevent是Python世界中最重要的异步网络库，可以大幅度提高系统的性能。最可贵的是，它允许我们几乎不修改代码，把同步程序变为异步程序。

2016年是Python3的重要时刻。Scrapy刚宣布支持Python3不久，Gevent又宣布支持Python3了。

给大家一个编码建议：如果用Python2，只用Python2.7；如果使用Python3，请至少使用Python3.4，最好使用Python3.5。

支持Python版本：

Python2.7及Python>=3.4

Gevent是基于协程的Python网络库。

包含的特性：

基于libev的快速事件循环(Linux上epoll，FreeBSD上kqueue）。

基于greenlet的轻量级执行单元。
API的概念和Python标准库一致(如事件，队列)。
可以配合socket，ssl模块使用。
能够使用标准库和第三方模块创建标准的阻塞套接字(gevent.monkey)。
默认通过线程池进行DNS查询,也可通过c-are(通过GEVENT_RESOLVER=ares环境变量开启）。
TCP/UDP/HTTP服务器
子进程支持（通过gevent.subprocess）
线程池

2：安装

pip install wheel

pip install gevent 或者 pip3 install gevent

具体安装命令视安装的Python 的版本而定。当出现 Successfully installed ...... 的提示后，即表示 gevent 安装成功。

如果中途出现异常，一定是依赖的包没有提前安装。具体的问题，根据提示处理，基本上就可以解决了。

3：基于的Gevent和pycurl 并发模式改造后的完整代码如下：

# encoding=utf-8
import pycurl , time, traceback
#如果没有给gevent打上补丁的话，它是检测不到除gevent它本省自带的IO操作的，当打上了补丁，它就能检测到程序其他所有的IO操作
from com.fy.utils.html.HtmlCode import HtmlCodeUtils
from io import BytesIO # <-- 这个用到里面的write函数
from gevent import monkey;
monkey.patch_all()#把当前程序的所有IO操作给我单独的做上标记
#如果没有给gevent打上补丁的话，它是检测不到除gevent它本省自带的IO操作的，当打上了补丁，它就能检测到程序其他所有的IO操作
import gevent
class Utils_PyCurl:
def __init__(self,):
self.hcu = HtmlCodeUtils() #根据HTML源码获取网页编码
self.c = pycurl.Curl()
self.c.setopt(pycurl.CONNECTTIMEOUT, 60)#链接超时
self.c.setopt(pycurl.TIMEOUT, 60) #下载超时
self.c.setopt(pycurl.MAXREDIRS, 5)#设置重定向次数
self.c.setopt(pycurl.USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)')#模拟浏览器
def geventControl(self, tasks, urlFieldName):
print("待处理的任务【" + str(len(tasks)) + "】个...")
start_time = time.time()
self.urlFieldName = urlFieldName#集合中存放地址的key
jobs = [gevent.spawn(self.fetch_html_pycurl, task) for task in tasks]
htmls = gevent.joinall(jobs, timeout=60, raise_error=False)
end_time = time.time()
print("下载页面【完毕】，共历时：【" + str((end_time - start_time))[0:4] + "】s.....\n")
del htmls
return tasks, (end_time - start_time)
def fetch_html_pycurl(self, task):
host = task[self.urlFieldName]#待处理的地址
html = None#HTML源码
requestCode = 200#返回的状态码
try:
self.c.setopt(self.c.URL, host)
self.c.setopt(self.c.HTTPHEADER, ['Accept:'])
e = BytesIO()
self.c.setopt(self.c.WRITEFUNCTION, e.write)
self.c.setopt(self.c.FOLLOWLOCATION, 1)
self.c.perform()
data = e.getvalue();
html_code = self.hcu.getChardet(data)#获取网页编码
html = str(data.decode(html_code, 'ignore'))
requestCode = self.c.getinfo(self.c.HTTP_CODE)
if html != None:print("HTML-length：", len(html), "HTTP_CODE：", requestCode, "request-url：", host)
else:print("HTML-length：", 0, "HTTP_CODE：", requestCode, "request-url：", host)
except :
requestCode = 400
print(traceback.print_exc())
task['HTML'] = html#HTML源码
task['HTTP_CODE'] = requestCode#返回的状态码
return task
pyc = Utils_PyCurl()
start_time = time.time()
for pageNo in range(1, 11):
tasks = []
for i in range(1, 100):
task = {}
task['url'] = "文章地址"
tasks.append(task)
pyc.geventControl(tasks, "url")
end_time = time.time()
print((end_time - start_time))