Crawler Techniques: Thread Pools · Python Notes

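[TOC]

## 1. **Why use a thread pool**

At its core, a crawler is a client that sends requests in bulk and collects the server's responses. If there are many URLs to crawl and we use a single thread that processes them serially, each request has to finish before the next one can start. Most of that time is spent waiting on the network, so throughput is very poor. How can we improve crawling performance?

1. Multiple processes
2. Multiple threads
3. Process pool
4. **Thread pool**
5. Coroutines

For beginners, the thread pool is the most recommended option, for the following reasons:

1. Starting a new process or thread for every task means creating and destroying them over and over, which wastes resources. A pool creates its workers once and reuses them.
2. Threads are cheaper than processes, and crawling is I/O-bound, so whenever a thread pool is sufficient there is no need for a process pool.
3. Coroutines are efficient, but they are more complex to implement.

## 2. Implementing a thread pool

```python
# 1. Import modules. re is used to pull the video URL out of the JS on the detail page.
import re

import requests
from lxml import etree
from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing API

# 2. Set the URL, headers, etc.
url = 'https://www.pearvideo.com/category_2'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
video_list = []

# 3. Request the listing page and build the element tree.
res = requests.get(url=url, headers=headers).text
tree = etree.HTML(res)

# 4. Parse the listing with XPath.
li_list = tree.xpath('//*[@id="categoryList"]/li')
for li in li_list:
    video_name = li.xpath('./div/a/div[2]/text()')[0] + '.mp4'
    detail_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
    detail_res = requests.get(url=detail_url, headers=headers).text
    # 5. The video URL on the detail page is embedded in JS, so XPath and bs4
    #    cannot reach it; extract it with a regular expression instead.
    video_url = re.findall(r'srcUrl="(.*?)",vdoUrl', detail_res)[0]
    dic = {
        'video_name': video_name,
        'video_url': video_url
    }
    video_list.append(dic)


# 6. Define the function that downloads and saves one video.
def get_video(dic):
    url = dic['video_url']
    name = dic['video_name']
    # Print before the request so the message really marks the start of the download.
    print("Starting download: %s ....." % name)
    video_data = requests.get(url=url, headers=headers).content
    with open(name, 'wb') as f:
        f.write(video_data)


# 7. Create the thread pool and call map; close() and join() wait for the workers to finish.
pool = Pool(4)
pool.map(get_video, video_list)
pool.close()
pool.join()
```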
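
To make the serial versus thread pool difference concrete without depending on a live website, the following is a minimal sketch that replaces the network request with `time.sleep`. The names `fake_fetch`, `run_serial`, `run_pooled`, and `timed`, the 0.2 second delay, and the `example.com` URLs are all assumptions made for illustration; they are not part of the crawler above.

```python
import time
from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing API

DELAY = 0.2  # assumed per-request latency in seconds (illustrative only)


def fake_fetch(url):
    """Hypothetical stand-in for requests.get: block for DELAY, then return a label."""
    time.sleep(DELAY)
    return "fetched " + url


def run_serial(urls):
    # One request at a time: total time is roughly len(urls) * DELAY.
    return [fake_fetch(u) for u in urls]


def run_pooled(urls, workers=4):
    # Up to `workers` requests wait concurrently on separate threads.
    pool = Pool(workers)
    try:
        return pool.map(fake_fetch, urls)
    finally:
        pool.close()
        pool.join()


def timed(func, *args):
    # Return the function's result together with its wall-clock duration.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


urls = ["https://example.com/page/%d" % i for i in range(4)]  # placeholder URLs
serial_result, serial_time = timed(run_serial, urls)
pooled_result, pooled_time = timed(run_pooled, urls)
print("serial: %.2fs, pooled: %.2fs" % (serial_time, pooled_time))
```

`pool.map` returns results in the same order as the input list, so both versions produce the same list; only the elapsed time differs. With four tasks and four workers, the pooled run should take roughly one delay instead of four, although exact timings vary by machine.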