*****

## Scrapy - Distributed Crawling

### What is scrapy_redis

```
scrapy_redis: Redis-based components for Scrapy
```

GitHub: https://github.com/rmax/scrapy-redis

### Review: the Scrapy workflow

![](https://img.kancloud.cn/87/1c/871ca6c393c6bf98da4f1a81d935e2e1_1130x668.png)

### The scrapy_redis workflow

![](https://img.kancloud.cn/ab/04/ab04d6f3f23aabb3ba18ea8bd0a6cba7_1252x960.png)

### Downloading scrapy_redis

```
# clone the scrapy_redis source from GitHub
git clone https://github.com/rmax/scrapy-redis.git
```

### The settings file in scrapy_redis

```
# Scrapy settings for example project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#
SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'

USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # which dedupe class is used to filter duplicate requests
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # use the scrapy_redis scheduler (the request queue lives in Redis)
SCHEDULER_PERSIST = True  # whether queue contents persist; when False, the Redis queues are cleared when the spider closes

#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,  # scrapy_redis pipeline that saves scraped items to Redis
}

LOG_LEVEL = 'DEBUG'

# Introduce an artificial delay to make use of parallelism, and to be polite
# to the target site.
DOWNLOAD_DELAY = 1
```

### Running scrapy_redis

Using the dmoz spider from the example project:

```
# in the dmoz spider:
allowed_domains = ['dmoztools.net']
start_urls = ['http://www.dmoztools.net/']

# then run the spider as usual:
scrapy crawl dmoz
```

**After the run finishes, three new keys appear in Redis:**

```
dmoz:requests    the queue of pending request objects to crawl
dmoz:items       the scraped items
dmoz:dupefilter  fingerprints of requests already seen
```

![](https://img.kancloud.cn/0a/b5/0ab5c70987d18179e26e99e688b83e66_1389x647.png)
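As a quick sanity check, the three keys can be inspected from Python with the `redis` client. This is an illustrative sketch, not part of the original example; it assumes Redis runs on localhost with the default port, and the default scrapy_redis key layout (requests in a sorted set under the priority queue, items in a list, fingerprints in a set):

```
# inspect_keys.py - illustrative sketch; assumes Redis on localhost:6379
# and the default scrapy_redis key names shown above.
import redis

r = redis.Redis(host='localhost', port=6379)

print(r.keys('dmoz:*'))           # the three keys created by the crawl
print(r.zcard('dmoz:requests'))   # pending requests (a sorted set with the default priority queue)
print(r.llen('dmoz:items'))       # scraped items (a list, one serialized item per entry)
print(r.scard('dmoz:dupefilter')) # request fingerprints already seen (a set)
```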
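Note that the settings file above never says where Redis lives; scrapy_redis then falls back to localhost:6379. For an actual distributed run, every worker machine must point at the same Redis server via `REDIS_URL` (or `REDIS_HOST`/`REDIS_PORT`), which are standard scrapy_redis settings. The hostname below is a placeholder:

```
# settings.py additions - point all workers at one shared Redis instance.
# 'redis-master.example.com' is a placeholder hostname.
REDIS_URL = 'redis://redis-master.example.com:6379'

# Equivalent alternative:
# REDIS_HOST = 'redis-master.example.com'
# REDIS_PORT = 6379
```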
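To make the spider itself fully distributed, scrapy_redis also provides `RedisSpider`, which reads its start URLs from a Redis list instead of a hard-coded `start_urls`, so any number of workers can share one crawl. The sketch below reuses the dmoz example; `RedisSpider` and `redis_key` are the actual scrapy_redis API, while the class body and parse logic are illustrative:

```
# dmoz_redis.py - minimal sketch of a RedisSpider-based spider.
# RedisSpider and redis_key are real scrapy_redis API; the parse
# logic here is illustrative only.
from scrapy_redis.spiders import RedisSpider

class DmozSpider(RedisSpider):
    name = 'dmoz'
    allowed_domains = ['dmoztools.net']
    # Instead of start_urls, the spider blocks on this Redis list and
    # pops any URL pushed into it, so new workers can join at any time.
    redis_key = 'dmoz:start_urls'

    def parse(self, response):
        # Illustrative extraction: record each page's URL and title.
        yield {'url': response.url, 'title': response.css('title::text').get()}
```

Start one or more `scrapy crawl dmoz` processes (on one machine or several), then seed the crawl by pushing a URL onto the list, e.g. `lpush dmoz:start_urls http://www.dmoztools.net/` in redis-cli; every idle worker pulls its next request from the shared Redis queue.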