Rules · TUNA-daily

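A typical crawler works like this: given a start page, it requests the page, extracts every link the page contains, puts those links in a queue, and then visits the queued links one by one until a stopping condition is met. To handle the common "list page + detail page" pattern, the link extraction (link extractor) logic has to be restricted. Fortunately Scrapy already provides this; the key is knowing the interface and using it flexibly.

## 1. Extracting specific URLs

~~~
# SgmlLinkExtractor is deprecated in newer Scrapy releases; LinkExtractor from
# scrapy.linkextractors accepts the same allow / restrict_xpaths arguments.
rules = (
    # List pages: follow each pagination page of the category
    Rule(SgmlLinkExtractor(allow=(r'category/20/index_\d+\.html'),
                           restrict_xpaths=("//div[@class='left']"))),
    # Detail pages: parse each one with parse_item
    Rule(SgmlLinkExtractor(allow=(r'a/\d+/\d+\.html'),
                           restrict_xpaths=("//div[@class='left']")),
         callback='parse_item'),
)
~~~

> 1. A Rule defines how links are extracted. The two rules above correspond to the paginated list pages and to the detail pages, respectively. The key point is using **restrict_xpaths** to limit extraction of the links to crawl next to a specific part of the page.
> 2. What follow does. First: this is the rule I use to crawl Douban's new books: rules = (Rule(LinkExtractor(allow=(r'^https://book.douban.com/subject/[0-9]*/'),), callback='parse_item', follow=False), ). With this rule, only the links on the start_urls pages that match the rule are crawled. If I change follow to True, the crawler also looks for matching URLs in the pages reached from start_urls, and so on, until the whole site has been crawled. Second: whether or not a rule has a callback, its responses are handled by the same _parse_response function, which simply checks whether follow and callback are set.

## 2. CrawlSpider in detail

CrawlSpider is built on Spider, but it is fair to say it was made for crawling whole sites. In brief:

> CrawlSpider is the spider commonly used for sites whose URLs follow regular patterns. It is based on Spider and adds a few attributes of its own.
>
> rules: a collection of Rule objects, used to match the target pages and filter out noise.
>
> parse_start_url: called for the responses to the start URLs; it must return an Item, a Request, or an iterable containing either.
>
> Because rules is a collection of Rule objects, Rule needs an introduction too. Its parameters are: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None
>
> The link_extractor can be one you define yourself or the built-in LinkExtractor class, whose main parameters are:
>
> allow: URLs matching the given regular expression (or list of regular expressions) are extracted; if empty, everything matches.
>
> deny: URLs matching this regular expression (or list of regular expressions) are never extracted. It takes precedence over allow.
>
> allow_domains: the domains whose links will be extracted.
>
> deny_domains: the domains whose links will never be extracted.
>
> restrict_xpaths: XPath expressions selecting the regions of the page from which links are extracted; it works together with allow to filter links. There is a similar restrict_css.

Below is the example from the official documentation. I will use it to work through some common questions from the point of view of the source code:

~~~
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).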
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
~~~

Question: how does CrawlSpider work?

Because CrawlSpider inherits from Spider, it has all of Spider's methods. First, start_requests issues a request for each URL in start_urls (via make_requests_from_url), and the response to that request is received by parse. In a plain Spider we have to define parse ourselves, but CrawlSpider defines parse itself to process the response (self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)). _parse_response then does different things depending on whether callback and follow are set and on self._follow_links:

~~~
def _parse_response(self, response, callback, cb_kwargs, follow=True):
    # If a callback was passed in, use it to parse the page and yield
    # the requests or items it produces.
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item
    # Then check follow, and use _requests_to_follow to find links in the
    # response that match the rules.
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
~~~

_requests_to_follow in turn takes the links that the link_extractor (the LinkExtractor we passed in) extracts from the page (link_extractor.extract_links(response)), post-processes them with process_links (optional, user-defined), and creates a Request for each remaining link. Each request is then passed through process_request (also optional and user-defined) before it is yielded.

Question: how does CrawlSpider obtain its rules?

The CrawlSpider class calls the _compile_rules method from its __init__ method. _compile_rules makes a shallow copy of each Rule in rules and, on each copy, resolves the callback, the link processor (process_links), and the request processor (process_request) into callables:

~~~
def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    self._rules = [copy.copy(r) for r in self.rules]
    for rule in self._rules:
        rule.callback = get_method(rule.callback)
        rule.process_links = get_method(rule.process_links)
        rule.process_request = get_method(rule.process_request)
~~~

So how is Rule defined?

~~~
class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow
~~~

So the LinkExtractor is passed in as link_extractor. Responses for a rule with a callback are handled by the specified function, but which function handles the ones without a callback? From the walkthrough above, _parse_response calls the callback on the response when there is one:

    cb_res = callback(response, **cb_kwargs) or ()

while _requests_to_follow sets self._response_downloaded as the callback of every request it creates for the matching URLs on a page, whether or not the rule has a callback of its own:

    r = Request(url=link.url, callback=self._response_downloaded)

How to simulate a login in CrawlSpider

Like Spider, CrawlSpider issues its first requests through start_requests, so that is where the login goes. The code below, borrowed from Andrew_liu, shows how to simulate a login. Replace the original start_requests, with the callback set as follows:

~~~
# These methods belong inside the CrawlSpider subclass. self.headers is
# assumed to be a dict of request headers defined on the spider.
from scrapy.http import Request, FormRequest

def start_requests(self):
    return [Request("http://www.zhihu.com/#signin",
                    meta={'cookiejar': 1},
                    callback=self.post_login)]

def post_login(self, response):
    print('Preparing login')
    # Extract the _xsrf field from the returned page; it is needed for the
    # form submission to succeed.
    xsrf = response.xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print(xsrf)
    # FormRequest.from_response is provided by Scrapy for posting forms.
    # After a successful login, the after_login callback is invoked.
    return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.headers,
                                      formdata={
                                          '_xsrf': xsrf,
                                          'email': 'your_email@example.com',  # placeholder
                                          'password': 'your_password'         # placeholder
                                      },
                                      callback=self.after_login,
                                      dont_filter=True)]

# Requests built by make_requests_from_url have no explicit callback, so they
# fall through to parse, which connects back to CrawlSpider's own parse.
def after_login(self, response):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)
~~~

Finally, here is the source code of scrapy.spiders.CrawlSpider for reference:

~~~
"""
This modules implements the CrawlSpider which is the recommended spider to
use for scraping typical web sites that requires crawling pages.

See documentation in docs/topics/spiders.rst
"""

import copy
import six

from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider


def identity(x):
    return x


class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow


class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            'CRAWLSPIDER_FOLLOW_LINKS', True)
        return spider

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)
~~~
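
Two details of the source above are easy to miss: the default value of follow depends on whether a callback is given, and _compile_rules turns string callbacks into bound methods on shallow copies of the rules. The sketch below reproduces just those two behaviours in plain Python so they can be tried without installing Scrapy. MiniRule and MiniSpider are hypothetical stand-ins written for this illustration, not Scrapy classes, and the pattern attribute is only a label (no link extraction happens here).

~~~python
import copy


class MiniRule:
    """Hypothetical, simplified stand-in for Rule (illustration only)."""

    def __init__(self, pattern, callback=None, follow=None):
        self.pattern = pattern      # just a label in this sketch
        self.callback = callback    # a callable, a method name, or None
        # Same default as in the Rule source: follow unless a callback is given.
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow


class MiniSpider:
    """Hypothetical stand-in that mirrors only the _compile_rules step."""

    rules = (
        MiniRule(r'category\.php'),                                  # no callback
        MiniRule(r'item\.php', callback='parse_item'),               # callback only
        MiniRule(r'tag\.php', callback='parse_item', follow=True),   # both
    )

    def __init__(self):
        self._compile_rules()

    def parse_item(self, url):
        return {'url': url}

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, str):
                return getattr(self, method, None)

        # Shallow copies: the class-level rules keep their string callbacks.
        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)


spider = MiniSpider()
for rule in spider._rules:
    print(rule.pattern, rule.follow, callable(rule.callback))
~~~

Printing the three compiled rules shows why a rule with a callback stops following links unless follow=True is passed explicitly, which is the usual cause of a CrawlSpider that only scrapes the first level of pages.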