Scrapy at a glance · Scrapy 1.6 中文文档

# Scrapy at a glance > 译者：[OSGeo 中国](https://www.osgeo.cn/) Scrapy是一个应用程序框架，用于对网站进行爬行和提取结构化数据，这些结构化数据可用于各种有用的应用程序，如数据挖掘、信息处理或历史存档。尽管Scrapy最初是为 [web scraping](https://en.wikipedia.org/wiki/Web_scraping) 它还可以用于使用API提取数据（例如 [Amazon Associates Web Services](https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html) ）或者作为一个通用的网络爬虫。 ## 浏览示例 Spider 为了向您展示Scrapy给桌子带来了什么，我们将用最简单的方法来运行一个Scrapy Spider 的例子。以下是一个 Spider 代码，它从网站http://quotes.toscrape.com上抓取著名的引语，按照以下页码： ```py import scrapy class QuotesSpider(scrapy.Spider): name = 'quotes' start_urls = [ 'http://quotes.toscrape.com/tag/humor/', ] def parse(self, response): for quote in response.css('div.quote'): yield { 'text': quote.css('span.text::text').get(), 'author': quote.xpath('span/small/text()').get(), } next_page = response.css('li.next a::attr("href")').get() if next_page is not None: yield response.follow(next_page, self.parse) ``` 把它放在一个文本文件中，命名为 `quotes_spider.py` 然后用 [`runspider`](../topics/commands.html#std:command-runspider) 命令： ```py scrapy runspider quotes_spider.py -o quotes.json ``` 完成后，您将 `quotes.json` 以JSON格式提交一个引号列表，其中包含文本和作者，如下所示（此处重新格式化以提高可读性）： ```py [{ "author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d" }, { "author": "Groucho Marx", "text": "\u201cOutside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.\u201d" }, { "author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d" }, ...] ``` ### 刚刚发生了什么？当你运行命令时 `scrapy runspider quotes_spider.py` 斯克里奇在里面寻找 Spider 的定义，然后用它的爬行引擎运行。通过向中定义的URL发出请求启动的爬网 `start_urls` 属性（在本例中，只有引号的URL _humor_ 并调用默认回调方法 `parse` ，将响应对象作为参数传递。在 `parse` 回调，我们使用CSS选择器循环引用元素，生成一个包含提取的引号文本和作者的python dict，查找到下一页的链接，并使用它调度另一个请求。 `parse` 方法作为回调。在这里，您注意到Scrapy的一个主要优点：请求是 [scheduled and processed asynchronously](../topics/architecture.html#topics-architecture) . 这意味着Scrapy不需要等待请求完成和处理，它可以同时发送另一个请求或做其他事情。这也意味着，即使某些请求失败或在处理过程中发生错误，其他请求也可以继续进行。虽然这使您能够非常快速地进行爬行（同时以容错的方式发送多个并发请求），但Scrapy还使您能够控制爬行的礼貌性。 [a few settings](../topics/settings.html#topics-settings-ref) . 您可以在每个请求之间设置下载延迟、限制每个域或每个IP的并发请求量，甚至 [using an auto-throttling extension](../topics/autothrottle.html#topics-autothrottle) 它试图自动解决这些问题。注解这是使用 [feed exports](../topics/feed-exports.html#topics-feed-exports) 要生成JSON文件，您可以轻松地更改导出格式（例如XML或CSV）或存储后端（FTP或 [Amazon S3](https://aws.amazon.com/s3/) 例如）。你也可以写一个 [item pipeline](../topics/item-pipeline.html#topics-item-pipeline) 将项目存储在数据库中。 ## 还有什么？你已经看到了如何使用Scrapy从网站中提取和存储项目，但这只是表面现象。Scrapy提供了许多强大的功能，使抓取变得简单和高效，例如： * 内置支持 [selecting and extracting](../topics/selectors.html#topics-selectors) 使用扩展的CSS选择器和XPath表达式从HTML/XML源中获取数据，并使用正则表达式提取助手方法。 * 安 [interactive shell console](../topics/shell.html#topics-shell) （ipython-aware）用于尝试使用css和xpath表达式来获取数据，在编写或调试spider时非常有用。 * 内置支持 [generating feed exports](../topics/feed-exports.html#topics-feed-exports) 以多种格式（json、csv、xml）存储在多个后端（ftp、s3、本地文件系统） * 强大的编码支持和自动检测，用于处理外部、非标准和中断的编码声明。 * [Strong extensibility support](../index.html#extending-scrapy) ，允许您使用 [signals](../topics/signals.html#topics-signals) 以及定义良好的API（中间件， [extensions](../topics/extensions.html#topics-extensions) 和 [pipelines](../topics/item-pipeline.html#topics-item-pipeline) ） * 广泛的内置扩展和用于处理的中间产品： * cookie和会话处理 * HTTP功能，如压缩、身份验证、缓存 * 用户代理欺骗 * robots.txt * 爬行深度限制 * 更多 * A [Telnet console](../topics/telnetconsole.html#topics-telnetconsole) 用于挂接到运行在Scrapy进程中的Python控制台，以便内省和调试爬虫程序 * 还有其他的好东西，比如可重复使用的 Spider [Sitemaps](https://www.sitemaps.org/index.html) 和XML/CSV源，这是 [automatically downloading images](../topics/media-pipeline.html#topics-media-pipeline) （或任何其他媒体）与抓取的项目、缓存DNS解析程序等相关！ ## 下一步是什么？接下来的步骤是 [install Scrapy](install.html#intro-install) ， [follow through the tutorial](tutorial.html#intro-tutorial) 学习如何创建一个完整的 Scrapy 项目和 [join the community](https://scrapy.org/community/) . 感谢您的关注！