### **Step 1: Create the project**
~~~
scrapy startproject douyu
~~~
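For orientation, `startproject` lays out a skeleton roughly like this (the exact files vary slightly across Scrapy versions):
~~~
douyu/
    scrapy.cfg          # deploy configuration
    douyu/
        __init__.py
        items.py        # item definitions (step 3)
        middlewares.py
        pipelines.py    # item pipelines (step 5)
        settings.py     # project settings (step 6)
        spiders/        # spider modules (step 4)
            __init__.py
~~~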
### **Step 2: Generate the spider**
~~~
scrapy genspider douyucdn capi.douyucdn.cn
~~~
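This generates `spiders/douyucdn.py` from Scrapy's default template, roughly like the following (the exact skeleton depends on your Scrapy version):
~~~
import scrapy


class DouyucdnSpider(scrapy.Spider):
    name = 'douyucdn'
    allowed_domains = ['capi.douyucdn.cn']
    start_urls = ['http://capi.douyucdn.cn/']

    def parse(self, response):
        pass
~~~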
### **Step 3: Write items.py to declare the fields to extract**
~~~
import scrapy


class DouyuItem(scrapy.Item):
    nickname = scrapy.Field()  # anchor's nickname, used to name the saved image
    headimg = scrapy.Field()   # URL of the anchor's vertical cover image
~~~
### **Step 4: Write the spider file spiders/douyucdn.py to handle requests and responses and extract the data (yield item)**
~~~
import json

import scrapy

from douyu.items import DouyuItem


class DouyucdnSpider(scrapy.Spider):
    name = 'douyucdn'
    allowed_domains = ['douyucdn.cn']
    baseUrl = 'http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset='
    offset = 0
    start_urls = [baseUrl + str(offset)]

    def parse(self, response):
        # The API returns JSON; the room list lives under the "data" key
        data_list = json.loads(response.body)['data']
        # An empty list means we have paged past the last room, so stop
        if not len(data_list):
            return
        for data in data_list:
            item = DouyuItem()
            item['headimg'] = data['vertical_src']
            item['nickname'] = data['nickname']
            yield item
        # Move to the next page of 20 rooms and parse it with the same callback
        self.offset += 20
        yield scrapy.Request(self.baseUrl + str(self.offset), callback=self.parse)
~~~
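For reference, each element of the `data` array is a JSON object describing one room. The two keys the spider reads look roughly like this (the key names come from the code above; the nickname value here is hypothetical and other fields are omitted):
~~~
{
    "nickname": "some_anchor",
    "vertical_src": "https://rpic.douyucdn.cn/live-cover/appCovers/2018/02/01/4189383_20180201171138_big.jpg"
}
~~~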
### **Step 5: Write pipelines.py to process the items returned by the spider**
~~~
import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline

from douyu.settings import IMAGES_STORE as images_store


class DouyuPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule a download request for the anchor's cover image
        imgUrl = item['headimg']
        yield scrapy.Request(imgUrl)

    def item_completed(self, results, item, info):
        # Pull the stored file path out of the successful results
        image_path = [x['path'] for ok, x in results if ok]
        # Join it with IMAGES_STORE (imported from settings.py) to get the full path
        old_path = images_store + image_path[0]
        new_path = images_store + 'named/' + item['nickname'] + '.jpg'
        # os.rename fails if the target directory does not exist yet
        os.makedirs(images_store + 'named/', exist_ok=True)
        os.rename(old_path, new_path)
        return item
~~~
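As a design note, the rename-after-download step can be avoided entirely: ImagesPipeline lets you decide the storage path up front by overriding `file_path`. A minimal sketch, assuming Scrapy 2.4+ (where `file_path` receives the item) and a hypothetical pipeline name:
~~~
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class DouyuNamedPipeline(ImagesPipeline):  # hypothetical alternative pipeline
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['headimg'])

    def file_path(self, request, response=None, info=None, *, item=None):
        # Scrapy stores the file at this path, relative to IMAGES_STORE
        return 'named/%s.jpg' % item['nickname']
~~~
If you use this variant, point ITEM_PIPELINES at it instead of DouyuPipeline.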
### **Step 6: Write settings.py to enable the pipeline and configure other settings**
> Since we need to masquerade as a mobile client, we have to set a mobile User-Agent; you can find a UA string for the device you want to impersonate at http://www.fynas.com/ua
~~~
USER_AGENT = 'Mozilla/5.0 (iPhone 84; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.0 MQQBrowser/7.8.0 Mobile/14G60 Safari/8536.25 MttCustomUA/2 QBWebViewType/1 WKType/1'
~~~
> Since we want to save the anchors' photos locally, we need to specify where they are stored
~~~
IMAGES_STORE = "C:/Users/Administrator/Desktop/douyu/images/"
~~~
> Image handling depends on the third-party library Pillow, so install it first if you haven't already, otherwise you will get a PIL-related error
~~~
pip install Pillow
~~~
> Some sites filter crawlers via robots.txt, so disable robots.txt compliance
~~~
ROBOTSTXT_OBEY = False
~~~
> Then register the pipeline (the number is its priority; pipelines with lower numbers run first):
~~~
ITEM_PIPELINES = {
'douyu.pipelines.DouyuPipeline': 300,
}
~~~
### **Step 7: Run the spider**
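From the project root, run the spider by the `name` declared in the spider class:
~~~
scrapy crawl douyucdn
~~~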
Note:
How do you extract the `path` value from the `results` structure below?
~~~
results = [(True, {'url': 'https://rpic.douyucdn.cn/live-cover/appCovers/2018/02/01/4189383_20180201171138_big.jpg', 'path': 'full/811a893386a55177f36abcde290eaf16933e5888.jpg', 'checksum': '0fd2746c8711d9eb6c7bc3db138f0ac4'})]
~~~
Use a list comprehension with tuple unpacking:
~~~
path = [x['path'] for ok, x in results if ok]
~~~
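Each element of `results` is a `(success, info)` tuple, so the comprehension unpacks the pair and keeps only the entries from successful downloads. A quick standalone check of that behavior, with a hypothetical failed entry added:
~~~
results = [
    (True, {'path': 'full/811a893386a55177f36abcde290eaf16933e5888.jpg'}),
    (False, Exception('download failed')),  # hypothetical failure entry
]
path = [x['path'] for ok, x in results if ok]
print(path)  # ['full/811a893386a55177f36abcde290eaf16933e5888.jpg']
~~~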