3 - CrawlSpider in Scrapy · Scrapy Framework

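![](https://img.kancloud.cn/41/e0/41e066af9a6c25a24868d9667253ec98_1241x333.jpg)

*****

In the earlier code, much of our effort went into locating the URL of the next page or of the detail pages. Can this process be made simpler?

Approach:

1. Extract the URLs of all `a` tags from the response.
2. Automatically build Request objects from them and send them to the engine.

URL: `http://www.circ.gov.cn/web/site0/tab5240`

Goal: learn how to use CrawlSpider by writing a crawler.

Command to generate a CrawlSpider: `scrapy genspider -t crawl cf cbrc.gov.cn`

```
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YgSpider(CrawlSpider):
    name = 'yg'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=0']

    rules = (
        # LinkExtractor: the link extractor, which pulls URLs out of a response
        # callback: the response for each extracted URL is handed to this method
        # follow: whether that response is passed through rules again to extract more URLs
        # Detail pages: parse them with parse_item (literal dots are escaped in the regex)
        Rule(LinkExtractor(allow=r'wz\.sun0769\.com/html/question/201811/\d+\.shtml'), callback='parse_item'),
        # List pages: no callback, just keep following the pagination links
        Rule(LinkExtractor(allow=r'http://wz\.sun0769\.com/index\.php/question/questionType\?type=4&page=\d+'), follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['content'] = response.xpath('//div[@class="c1 text14_2"]//text()').extract()
        print(item)
        # Yield the item so that it also reaches the item pipelines
        yield item
```

**Notes**

1. Create a CrawlSpider template with the command `scrapy genspider -t crawl <spider name> <allowed_domain>`; it can also be written by hand.
2. A CrawlSpider must not define a data-extraction method named `parse`. CrawlSpider uses that method itself to implement basic functionality such as URL extraction.
3. A Rule object accepts many parameters. The first is a LinkExtractor object that holds the URL pattern; the other commonly used ones are `callback` and `follow`.
    - `callback`: the method that handles the responses of the URLs extracted by the link extractor.
    - `follow`: whether the responses of the extracted URLs are passed through `rules` again to extract further URLs. It defaults to True when no callback is given and to False otherwise.
4. When no callback is specified and `follow` is True, the URLs that satisfy the rule are still requested.
5. If several Rules match the same URL, the first matching one in `rules` is used.

## CrawlSpider supplement (for reference)

![](https://img.kancloud.cn/88/18/88186f73506d0eff05d88f5fec33cff7_1570x992.png)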
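
As a further supplement, the rule-matching behaviour described in the notes above can be modelled without Scrapy at all. The block below is only a simplified sketch built on the standard library, not Scrapy's actual implementation: `AnchorCollector`, `RULES`, `effective_follow`, `classify_links` and the sample HTML are all hypothetical names and data made up for illustration, and the third rule (`parse_other`) exists only to show that the first matching rule wins.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


# Hypothetical stand-ins for Rule objects: (allow pattern, callback name, follow).
# follow=None means "not specified", as when the argument is omitted.
RULES = [
    (r"/html/question/201811/\d+\.shtml", "parse_item", None),
    (r"/index\.php/question/questionType\?type=4&page=\d+", None, True),
    (r"\.shtml$", "parse_other", None),  # illustration only: never wins for detail pages
]


def effective_follow(callback, follow):
    """When follow is not specified, follow only if there is no callback."""
    return (callback is None) if follow is None else follow


def classify_links(html, base_url, rules=RULES):
    """Return (url, callback, follow) per link, using the first matching rule."""
    collector = AnchorCollector(base_url)
    collector.feed(html)
    results = []
    seen = set()
    for url in collector.links:
        if url in seen:
            continue  # each URL is handled once
        for pattern, callback, follow in rules:
            if re.search(pattern, url):
                seen.add(url)
                results.append((url, callback, effective_follow(callback, follow)))
                break  # first matching rule wins
    return results


SAMPLE_HTML = """
<a href="/html/question/201811/390001.shtml">detail</a>
<a href="/index.php/question/questionType?type=4&amp;page=30">next</a>
<a href="/about.html">about</a>
"""

matched = classify_links(SAMPLE_HTML, "http://wz.sun0769.com/")
for url, callback, follow in matched:
    print(url, callback, follow)
```

In this model the detail link is assigned to `parse_item` with no further following, the pagination link has no callback but is followed, and the link that matches no rule is dropped. In a real spider the same decisions are made by `LinkExtractor` and `Rule` inside CrawlSpider.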