ThinkChat2.0新版上线,更智能更精彩,支持会话、画图、阅读、搜索等,送10W Token,即刻开启你的AI之旅 广告
## 1. 扩展(Extensions) ### 1.1 Extensions的作用 > 摘取自官方文档:http://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/extensions.html#topics-extensions **scrapy的事件控制是通过信号(Signals)和扩展(Extensions)完成的** > 1. 扩展框架提供一个机制,使得你能将自定义功能绑定到Scrapy。 > 2. 扩展只是正常的类,它们在Scrapy启动时被实例化、初始化。 > 3. Scrapy扩展(包括middlewares和pipelines)的主要入口是 from_crawler 类方法, 它接收一个 Crawler 类的实例.通过这个对象访问settings,signals,stats,控制爬虫的行为。 > 4. 如果 from_crawler 方法抛出 NotConfigured 异常, 扩展会被禁用。否则,扩展会被开启。 ### 1.2 扩展例子(Sample extension) > * 这里我们将实现一个简单的扩展来演示上面描述到的概念。 该扩展会在以下事件时记录一条日志: * spider被打开 * spider被关闭 * 爬取了特定数量的条目(items) > 该扩展通过 MYEXT_ENABLED 配置项开启, items的数量通过 MYEXT_ITEMCOUNT 配置项设置。 以下是扩展的代码: ~~~ import logging from scrapy import signals from scrapy.exceptions import NotConfigured logger = logging.getLogger(__name__) class SpiderOpenCloseLogging(object): def __init__(self, item_count): self.item_count = item_count self.items_scraped = 0 @classmethod def from_crawler(cls, crawler): # first check if the extension should be enabled and raise # NotConfigured otherwise if not crawler.settings.getbool('MYEXT_ENABLED'): raise NotConfigured # get the number of items from settings item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000) # instantiate the extension object ext = cls(item_count) # connect the extension object to signals crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened) crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed) crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped) # return the extension object return ext def spider_opened(self, spider): logger.info("opened spider %s", spider.name) def spider_closed(self, spider): logger.info("closed spider %s", spider.name) def item_scraped(self, item, spider): self.items_scraped += 1 if self.items_scraped % self.item_count == 0: logger.info("scraped %d items", self.items_scraped) ~~~