Scraping NetEase News with Scrapy and Storing It in MongoDB · Scrapy Crawler Tutorial

It has been a while since I last wrote a crawler, so here is a small Scrapy spider that scrapes NetEase News. The code is based on a crawler I found on GitHub. I have also been reading a bit about MongoDB lately, so I use it here to get a feel for what NoSQL is like.

Back to the point. To run this crawler you need MongoDB installed, along with pymongo. Once the crawl has stored some data, open the database and run a `find` query to see what is in it. A scraped document looks like this:

~~~
{
  "_id" : ObjectId("5577ae44745d785e65fa8686"),
  "from_url" : "http://tech.163.com/",
  "news_body" : [
    "科技讯 6月9日凌晨消息2015",
    "全球开发者大会（WWDC 2015）在旧",
    "召开，网易科技进行了全程图文直播。最新",
    "9操作系统在",
    "上性能得到极大提升，可以实现分屏显示，也可以支持画中画功能。",
    "新版iOS 9 增加了QuickType 键盘，让输入和编辑都更简单快捷。在搭配外置键盘使用 iPad 时，用户可以用快捷键来进行操作，例如在不同 app 之间进行切换。",
    "而且，iOS 9 重新设计了 app 间的切换。iPad的分屏功能可以让用户在不离开当前 app 的同时就能打开第二个 app。这意味着两个app在同一屏幕上，同时开启、并行运作。两个屏幕的比例可以是5：5，也可以是7：3。",
    "另外，iPad还支持“画中画”功能，可以将正在播放的视频缩放到一角，然后利用屏幕其它空间处理其他的工作。",
    "据透露分屏功能只支持iPad Air2；画中画功能将只支持iPad Air, iPad Air2, iPad mini2, iPad mini3。",
    "\r\n"
  ],
  "news_from" : "网易科技报道",
  "news_thread" : "ARKR2G22000915BD",
  "news_time" : "2015-06-09 02:24:55",
  "news_title" : "iOS 9在iPad上可实现分屏功能",
  "news_url" : "http://tech.163.com/15/0609/02/ARKR2G22000915BD.html"
}
~~~

A Scrapy project has a handful of files that need editing:

1. The spider file, which defines the crawl rules, mainly through XPath.
2. items.py, which declares the fields to scrape.
3. pipelines.py, which routes and stores the scraped data. Here we also add a store.py file, which creates the MongoDB database.
4. settings.py, the configuration file, mainly for proxies, the User-Agent, the crawl interval, delays, and so on.

Compared with my earlier crawlers, this one adds two things. First, it connects to a database and stores the results there instead of writing JSON or txt files. Second, the spider sets `follow = True`, which means it keeps crawling from the links found on pages it has already fetched, much like a depth-first search. Let's look at the source code.

We usually start with items.py:

~~~
# -*- coding: utf-8 -*-
import scrapy


class Tech163Item(scrapy.Item):
    news_thread = scrapy.Field()
    news_title = scrapy.Field()
    news_url = scrapy.Field()
    news_time = scrapy.Field()
    news_from = scrapy.Field()
    from_url = scrapy.Field()
    news_body = scrapy.Field()
~~~

Next comes the spider file. The file itself can have any name, because running the crawler only requires the spider name defined inside it, that is, the `name = "news"` attribute. Our spider is called news. If you want to use this spider, you will probably need to edit the `allow` pattern in the Rule and change the date, because NetEase News does not keep articles older than a year. Change it to a recent date: if it is now August 2015, use /15/08.

~~~
# -*- coding: utf-8 -*-
# Older Scrapy releases exposed these classes under scrapy.contrib.linkextractors
# and scrapy.contrib.spiders; current releases use the paths below.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from tech163.items import Tech163Item


class Spider(CrawlSpider):
    name = "news"
    allowed_domains = ["tech.163.com"]
    start_urls = ['http://tech.163.com/']
    rules = (
        Rule(
            # Roughly: match news URLs containing /15/06 followed by
            # digits/digits/ and then anything.
            LinkExtractor(allow=r"/15/06\d+/\d+/*"),
            callback="parse_news",
            # follow=True keeps crawling from the pages already matched.
            follow=True,
        ),
    )

    def parse_news(self, response):
        item = Tech163Item()
        # Last URL segment with the 5-character ".html" suffix cut off.
        item['news_thread'] = response.url.strip().split('/')[-1][:-5]
        self.get_title(response, item)
        self.get_source(response, item)
        self.get_url(response, item)
        self.get_news_from(response, item)
        self.get_from_url(response, item)
        self.get_text(response, item)
        return item

    def get_title(self, response, item):
        title = response.xpath("/html/head/title/text()").extract()
        if title:
            item['news_title'] = title[0][:-5]

    def get_source(self, response, item):
        source = response.xpath("//div[@class='ep-time-soure cDGray']/text()").extract()
        if source:
            # Slice the timestamp out of the surrounding text.
            item['news_time'] = source[0][9:-5]

    def get_news_from(self, response, item):
        news_from = response.xpath("//div[@class='ep-time-soure cDGray']/a/text()").extract()
        if news_from:
            item['news_from'] = news_from[0]

    def get_from_url(self, response, item):
        from_url = response.xpath("//div[@class='ep-time-soure cDGray']/a/@href").extract()
        if from_url:
            item['from_url'] = from_url[0]

    def get_text(self, response, item):
        news_body = response.xpath("//div[@id='endText']/p/text()").extract()
        if news_body:
            item['news_body'] = news_body

    def get_url(self, response, item):
        news_url = response.url
        if news_url:
            item['news_url'] = news_url
~~~

After that we create store.py. This file sets up the database, which the pipeline file then imports and uses to store the data. Here is the source:

~~~
import pymongo

HOST = "127.0.0.1"
PORT = 27017

client = pymongo.MongoClient(HOST, PORT)
NewsDB = client.NewsDB
~~~

In pipelines.py we import the NewsDB database and write each news item into it with an update statement. There are two checks along the way: one tests whether the spider's name is news, the other tests whether the thread ID (`news_thread`) is empty. The most important line is `NewsDB.new.update_one(spec, {"$set": dict(item)}, upsert=True)`, which writes the item's fields into the database, inserting a new document when none with that thread ID exists yet.

~~~
from store import NewsDB  # adjust the import path to wherever store.py lives


class Tech163Pipeline(object):
    def process_item(self, item, spider):
        # Only handle items produced by the "news" spider.
        if spider.name != "news":
            return item
        # Skip items that have no thread ID.
        if item.get("news_thread", None) is None:
            return item
        spec = {"news_thread": item["news_thread"]}
        # Collection.update() from older PyMongo was removed in PyMongo 4;
        # update_one() is the current equivalent for a single document.
        NewsDB.new.update_one(spec, {"$set": dict(item)}, upsert=True)
        # Return the item so any later pipeline still receives it.
        return item
~~~

Finally, we edit the configuration file to set USER_AGENT. We want the crawler to mimic a browser as closely as possible, so that it can fetch the content you are after without trouble.

~~~
BOT_NAME = 'tech163'

SPIDER_MODULES = ['tech163.spiders']
NEWSPIDER_MODULE = 'tech163.spiders'

# Pipelines are given as {path: order}; lower numbers run first.
ITEM_PIPELINES = {'tech163.pipelines.Tech163Pipeline': 300}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tech163 (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'

DOWNLOAD_TIMEOUT = 15
~~~
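
Before starting a crawl, it can help to check the spider's `allow` pattern and its `news_thread` slicing offline, without touching the network. The sketch below is only an illustration: the helper names `is_article_url` and `thread_id` are hypothetical and not part of the project, and it assumes that the link extractor applies `allow` patterns as a regex search against the full URL, which `re.search` imitates here. The sample URL is the one from the stored document shown earlier.

```python
import re

# Same pattern as the spider's allow rule. Matching with re.search is an
# assumption about how the link extractor applies the pattern.
ALLOW = re.compile(r"/15/06\d+/\d+/*")


def is_article_url(url):
    """Return True if the URL matches the spider's allow pattern."""
    return ALLOW.search(url) is not None


def thread_id(url):
    """Mirror the spider: last path segment with the '.html' suffix removed."""
    return url.strip().split("/")[-1][:-5]


article = "http://tech.163.com/15/0609/02/ARKR2G22000915BD.html"

print(is_article_url(article))                  # article page from June 2015
print(is_article_url("http://tech.163.com/"))   # the start page itself
print(thread_id(article))                       # value stored as news_thread
```

If you change the date in the pattern, for example to `/15/08`, rerunning a check like this against a few real article URLs is a quick way to confirm the rule still selects what you expect.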