## 1. spider

Subclass CrawlSpider, define link-extraction rules, and let the rules take care of following the "next page" links.

~~~
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from xinlang.items import ProxyIpItem


class KuaiDaiLi(CrawlSpider):
    name = 'kuaidaili'                    # spider name
    allowed_domains = ['kuaidaili.com']   # restrict crawling to this domain
    # Start page. Unlike a plain Spider, CrawlSpider only uses it to extract links
    # that match the rules; the actual data extraction is driven by `rules`.
    start_urls = ['https://www.kuaidaili.com/free/inha/1/']

    # Crawling proceeds from `rules`
    rules = (
        # Follow every paging link that matches the pattern and parse it with parse_item
        Rule(LinkExtractor(allow=r'free/inha/\d+'), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        trList = response.xpath("//tbody//tr")
        for i in trList:
            ip = i.xpath("./td[1]/text()").extract()[0]
            port = i.xpath("./td[2]/text()").extract()[0]
            ip_type = i.xpath("./td[4]/text()").extract()[0]
            position = i.xpath("./td[5]/text()").extract()[0]
            response_time = i.xpath("./td[6]/text()").extract()[0]

            item = ProxyIpItem()
            item['ip'] = ip
            item['port'] = port
            item['type'] = ip_type
            item['position'] = position
            item['reponseTime'] = response_time
            yield item
~~~

## 2. Downloader middleware: User-Agent

Use fake_useragent to pick a random User-Agent for each request (install it with `pip install fake-useragent`).

### 2.1 Custom middleware

~~~
from fake_useragent import UserAgent


class RandomAgentMiddleWare(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, crawler):
        self.ua = UserAgent()
        # Which UserAgent attribute to use ("random", "chrome", "firefox", ...),
        # read from the RANDOM_UA_TYPE setting and defaulting to "random"
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    # Build the middleware instance from the crawler
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def getAgent():
            userAgent = getattr(self.ua, self.ua_type)
            print("userAgent:{0}".format(userAgent))
            return userAgent

        # Set the User-Agent on the request. Assign directly rather than using
        # setdefault: Scrapy's built-in UserAgentMiddleware runs earlier and has
        # already set a User-Agent header, so setdefault would be a no-op here.
        request.headers['User-Agent'] = getAgent()
~~~

### 2.2 Register the downloader middleware

In settings.py. Middlewares with a higher number run later in `process_request` (closer to the downloader), so 543 places this middleware after Scrapy's built-in UserAgentMiddleware (priority 400) and the header set here takes effect.

~~~
DOWNLOADER_MIDDLEWARES = {
    'xinlang.middlewares.RandomAgentMiddleWare': 543,
}
~~~

## 3. pipeline: save to MySQL

### 3.1 Custom pipeline

~~~
import pymysql


class MysqlPipeline(object):
    # Writes to MySQL synchronously
    def __init__(self):
        self.conn = pymysql.connect(host='192.168.56.130', user='root',
                                    password='tuna', database='proxyip',
                                    charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    # Handle each item
    def process_item(self, item, spider):
        insert_sql = """
            insert into kuaidaili(ip, port, ip_position, ip_type, response_time)
            VALUES (%s, %s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (item["ip"], item["port"], item["position"],
                                         item["type"], item["reponseTime"]))
        self.conn.commit()
        return item
~~~

### 3.2 Register the pipeline

In settings.py. For item pipelines the number only decides the order within the pipeline chain: lower numbers run first, and values are conventionally kept in the 0-1000 range.

~~~
ITEM_PIPELINES = {
    'xinlang.pipelines.MysqlPipeline': 300,
}
~~~

## 4. Things to watch out for

When I first crawled kuaidaili the spider kept failing for no obvious reason, so I debugged it for a while and noticed that everything worked when I stepped through slowly. That points to the most basic anti-crawling measure: limiting how many requests a single IP may make within a given time window.

The fix: slow the crawl down by setting DOWNLOAD_DELAY in settings.py, which delays each request.

~~~
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
~~~

That's it — ready to crawl.
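## 5. Appendix: referenced but not shown

The spider imports `ProxyIpItem` from `xinlang.items`, but the item class itself is not shown above. A minimal sketch, assuming the field names used in `parse_item` and `MysqlPipeline` (including the `reponseTime` spelling):

~~~
# xinlang/items.py -- hypothetical sketch; field names inferred from the spider/pipeline above
import scrapy


class ProxyIpItem(scrapy.Item):
    ip = scrapy.Field()            # proxy IP address
    port = scrapy.Field()          # proxy port
    type = scrapy.Field()          # proxy type, e.g. HTTP / HTTPS
    position = scrapy.Field()      # geographic location
    reponseTime = scrapy.Field()   # response time (spelling kept to match the code above)
~~~

Likewise, the pipeline assumes a `kuaidaili` table already exists in the `proxyip` database. A one-off creation script might look like the sketch below; column names come from the INSERT statement, while the column types and sizes are assumptions.

~~~
# create_table.py -- hypothetical helper; adjust host/credentials to your environment
import pymysql

DDL = """
CREATE TABLE IF NOT EXISTS kuaidaili (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    ip            VARCHAR(64),
    port          VARCHAR(16),
    ip_position   VARCHAR(128),
    ip_type       VARCHAR(16),
    response_time VARCHAR(32)
) DEFAULT CHARSET=utf8
"""

conn = pymysql.connect(host='192.168.56.130', user='root',
                       password='tuna', database='proxyip', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute(DDL)
    conn.commit()
finally:
    conn.close()
~~~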