## 1. spider

Subclass CrawlSpider, define link-extraction rules, and let the rules take care of following the "next page" links.

~~~
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from xinlang.items import ProxyIpItem


class KuaiDaiLi(CrawlSpider):
    name = 'kuaidaili'                    # spider name
    allowed_domains = ['kuaidaili.com']   # restrict crawling to this domain
    # Start page. Unlike a plain Spider, CrawlSpider only uses it to extract links
    # that match the rules; the actual data extraction is driven by `rules`.
    start_urls = ['https://www.kuaidaili.com/free/inha/1/']

    # Crawling proceeds from `rules`
    rules = (
        # Follow every paging link that matches the pattern and parse it with parse_item
        Rule(LinkExtractor(allow=r'free/inha/\d+'), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        trList = response.xpath("//tbody//tr")
        for i in trList:
            ip = i.xpath("./td[1]/text()").extract()[0]
            port = i.xpath("./td[2]/text()").extract()[0]
            ip_type = i.xpath("./td[4]/text()").extract()[0]
            position = i.xpath("./td[5]/text()").extract()[0]
            response_time = i.xpath("./td[6]/text()").extract()[0]

            item = ProxyIpItem()
            item['ip'] = ip
            item['port'] = port
            item['type'] = ip_type
            item['position'] = position
            item['reponseTime'] = response_time
            yield item
~~~

## 2. Downloader middleware: User-Agent

Use fake_useragent to pick a random User-Agent for each request (install it with `pip install fake-useragent`).

### 2.1 Custom middleware

~~~
from fake_useragent import UserAgent


class RandomAgentMiddleWare(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, crawler):
        self.ua = UserAgent()
        # Which UserAgent attribute to use ("random", "chrome", "firefox", ...),
        # read from the RANDOM_UA_TYPE setting and defaulting to "random"
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    # Build the middleware instance from the crawler
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def getAgent():
            userAgent = getattr(self.ua, self.ua_type)
            print("userAgent:{0}".format(userAgent))
            return userAgent

        # Set the User-Agent on the request. Assign directly rather than using
        # setdefault: Scrapy's built-in UserAgentMiddleware runs earlier and has
        # already set a User-Agent header, so setdefault would be a no-op here.
        request.headers['User-Agent'] = getAgent()
~~~

### 2.2 Register the downloader middleware

In settings.py. Middlewares with a higher number run later in `process_request` (closer to the downloader), so 543 places this middleware after Scrapy's built-in UserAgentMiddleware (priority 400) and the header set here takes effect.

~~~
DOWNLOADER_MIDDLEWARES = {
    'xinlang.middlewares.RandomAgentMiddleWare': 543,
}
~~~

## 3. pipeline: save to MySQL

### 3.1 Custom pipeline

~~~
import pymysql


class MysqlPipeline(object):
    # Writes to MySQL synchronously
    def __init__(self):
        self.conn = pymysql.connect(host='192.168.56.130', user='root',
                                    password='tuna', database='proxyip',
                                    charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    # Handle each item
    def process_item(self, item, spider):
        insert_sql = """
            insert into kuaidaili(ip, port, ip_position, ip_type, response_time)
            VALUES (%s, %s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (item["ip"], item["port"], item["position"],
                                         item["type"], item["reponseTime"]))
        self.conn.commit()
        return item
~~~

### 3.2 Register the pipeline

In settings.py. For item pipelines the number only decides the order within the pipeline chain: lower numbers run first, and values are conventionally kept in the 0-1000 range.

~~~
ITEM_PIPELINES = {
    'xinlang.pipelines.MysqlPipeline': 300,
}
~~~

## 4. Things to watch out for

When I first crawled kuaidaili the spider kept failing for no obvious reason, so I debugged it for a while and noticed that everything worked when I stepped through slowly. That points to the most basic anti-crawling measure: limiting how many requests a single IP may make within a given time window.

The fix: slow the crawl down by setting DOWNLOAD_DELAY in settings.py, which delays each request.

~~~
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
~~~

That's it — ready to crawl.
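## 5. Appendix: referenced but not shown

The spider imports `ProxyIpItem` from `xinlang.items`, but the item class itself is not shown above. A minimal sketch, assuming the field names used in `parse_item` and `MysqlPipeline` (including the `reponseTime` spelling):

~~~
# xinlang/items.py -- hypothetical sketch; field names inferred from the spider/pipeline above
import scrapy


class ProxyIpItem(scrapy.Item):
    ip = scrapy.Field()            # proxy IP address
    port = scrapy.Field()          # proxy port
    type = scrapy.Field()          # proxy type, e.g. HTTP / HTTPS
    position = scrapy.Field()      # geographic location
    reponseTime = scrapy.Field()   # response time (spelling kept to match the code above)
~~~

Likewise, the pipeline assumes a `kuaidaili` table already exists in the `proxyip` database. A one-off creation script might look like the sketch below; column names come from the INSERT statement, while the column types and sizes are assumptions.

~~~
# create_table.py -- hypothetical helper; adjust host/credentials to your environment
import pymysql

DDL = """
CREATE TABLE IF NOT EXISTS kuaidaili (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    ip            VARCHAR(64),
    port          VARCHAR(16),
    ip_position   VARCHAR(128),
    ip_type       VARCHAR(16),
    response_time VARCHAR(32)
) DEFAULT CHARSET=utf8
"""

conn = pymysql.connect(host='192.168.56.130', user='root',
                       password='tuna', database='proxyip', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute(DDL)
    conn.commit()
finally:
    conn.close()
~~~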