CrawlSpider · Python Web Scraping

CrawlSpider makes it simpler to follow pagination and other links: a `Rule(LinkExtractor(...))` captures the URLs that match a pattern, and a callback then parses the response for each of them.

<br/>

The steps are as follows:

**1. Create the CrawlSpider from the project directory**

```
# scrapy genspider -t crawl <spider name> <domain>
> scrapy genspider -t crawl ct_liks www.wxapp-union.com
```

Running the command above generates the following `ct_liks.py` file:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CtLiksSpider(CrawlSpider):
    name = 'ct_liks'
    allowed_domains = ['www.wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    # This method name can be changed freely
    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
```

You could also create `ct_liks.py` by hand, but that is more tedious.

<br/>

**2. Define the rules in `ct_liks.py`**

```python
"""
@Date 2021/4/9
"""
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CtLiksSpider(CrawlSpider):
    name = 'ct_liks'
    allowed_domains = ['www.wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/']

    rules = (
        # 1. On http://www.wxapp-union.com/, capture links that look like
        #    https://www.wxapp-union.com/article-7002-1.html
        # If several Rules match the same URL, the first matching Rule in
        # `rules` is the one that is applied.
        # The dots are escaped so that they match a literal "." only.
        Rule(LinkExtractor(allow=r'www\.wxapp-union\.com/article-\d+-1\.html'), callback='parse_item'),
        # You can define more than one rule
        # Rule(LinkExtractor(allow=r'www\.wxapp-union\.com/article-\d+-1\.html'), callback='parse_item2'),
    )

    # 2. parse_item is called automatically once for every URL that matches
    #    www\.wxapp-union\.com/article-\d+-1\.html
    def parse_item(self, response):
        title = response.xpath("//title").extract_first()
        print(title)
```

<br/>

`LinkExtractor` and `Rule` also accept the following optional parameters:

```python
class LxmlLinkExtractor(FilteringLinkExtractor):
    def __init__(
        self,
        allow=(),            # Allowed URLs: every URL matching these regular expressions is extracted.
        deny=(),             # Denied URLs: no URL matching these regular expressions is extracted.
        allow_domains=(),    # Allowed domains: only URLs on these domains are extracted.
        deny_domains=(),     # Denied domains: URLs on these domains are never extracted.
        restrict_xpaths=(),  # XPaths selecting the regions of the page to extract links from; filters links together with allow.
        tags=('a', 'area'),
        attrs=('href',),
        canonicalize=False,
        unique=True,
        process_value=None,
        deny_extensions=None,
        restrict_css=(),
        strip=True,
        restrict_text=None,
    ):
        ...


class Rule:
    def __init__(
        self,
        link_extractor=None,   # A LinkExtractor object.
        callback=None,         # The callback to run for URLs that match this rule.
                               # CrawlSpider uses parse for its own logic, so do not
                               # override parse or use it as your callback.
        cb_kwargs=None,
        follow=None,           # Whether links extracted from the response by this rule should be followed.
                               # If no callback is given, follow defaults to True, so URLs
                               # matching the rule keep being requested.
        process_links=None,    # Receives the links obtained from link_extractor; used to
                               # filter out links that should not be crawled.
        process_request=None,
        errback=None,
    ):
        ...
```
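To make the `allow` / `deny` filtering and the "first matching rule wins" behaviour concrete, here is a minimal standard-library sketch. It is not Scrapy's implementation: `link_allowed`, `first_matching_callback`, the tuple-based `rules` list, and the `portal.php` URL are all hypothetical names introduced for illustration. It assumes patterns are matched anywhere in the URL and that `deny` takes precedence over `allow`.

```python
import re


def link_allowed(url, allow=(), deny=()):
    """Simplified stand-in for allow/deny filtering (not Scrapy's code)."""
    # An empty allow tuple is treated as "accept every URL".
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # Assumption: a deny match rejects the URL even if allow matched.
    if any(re.search(p, url) for p in deny):
        return False
    return True


def first_matching_callback(url, rules):
    """Return the callback name of the first rule that accepts the URL."""
    for allow, deny, callback in rules:
        if link_allowed(url, allow, deny):
            return callback
    return None


# Dots are escaped so they match a literal "." only.
ARTICLE = r'www\.wxapp-union\.com/article-\d+-1\.html'

# Hypothetical (allow, deny, callback) triples mirroring two Rule entries.
rules = [
    ((ARTICLE,), (), 'parse_item'),
    ((ARTICLE,), (), 'parse_item2'),  # shadowed by the rule above
]

urls = [
    'https://www.wxapp-union.com/article-7002-1.html',
    'https://www.wxapp-union.com/portal.php',  # hypothetical non-article URL
]

matched = {u: first_matching_callback(u, rules) for u in urls}
for url, callback in matched.items():
    print(url, '->', callback)
```

In this sketch the article URL is handled by `parse_item` and never reaches `parse_item2`, which is why a second rule with the same pattern is of little use in practice.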