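Scrapy has a short getting-started tutorial that is worth reading; in my experience the official documentation is the most reliable and accurate source.
First, create a Scrapy project: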
~~~
scrapy startproject weather
~~~
I am using Ubuntu 12.04. After the project is created, a weather folder appears in the home directory. We can inspect its structure with tree, which can be installed with sudo apt-get install tree.
~~~
tree weather
~~~
~~~
weather
├── scrapy.cfg
├── wea.json
├── weather
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── pipelines.py
│   ├── pipelines.py~
│   ├── pipelines.pyc
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── weather_spider1.py
│       ├── weather_spider1.pyc
│       ├── weather_spider2.py
│       ├── weather_spider2.py~
│       ├── weather_spider2.pyc
│       └── weather_spider.pyc
├── weather.json
└── wea.txt
~~~
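The tree above is the project after I had finished writing the crawler. Now let's create a fresh project, weathertest, to see what the files look like initially.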
~~~
weathertest
├── scrapy.cfg
└── weathertest
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
~~~
~~~
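scrapy.cfg: the project's configuration file.
weather/: the project's Python module. You will add your code here.
weather/items.py: defines the fields to extract; it acts as a container.
weather/pipelines.py: code for saving items to a file or sending them elsewhere.
weather/settings.py: the project's settings file.
weather/spiders/: the directory that holds the spider code.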
~~~
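An Item is the container that holds the scraped data. It is used much like a Python dict, and it adds a safeguard: assigning to a field that was never declared (for example because of a typo) raises an error instead of silently creating a new key.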
~~~
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class WeatherItem(scrapy.Item):
    # Fields filled in by the spiders in this post
    city = scrapy.Field()
    date = scrapy.Field()
    dayDesc = scrapy.Field()
    dayTemp = scrapy.Field()
~~~
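To make that safeguard concrete, here is a minimal standard-library sketch of the idea. `StrictItem` is a hypothetical stand-in written only for this illustration; it is not how Scrapy implements `Item`, although a real `scrapy.Item` also raises a `KeyError` when you assign to an undeclared field.

```python
class StrictItem(dict):
    """Hypothetical stand-in for scrapy.Item: a dict that only accepts declared fields."""
    fields = ()

    def __setitem__(self, key, value):
        # Reject any key that was not declared up front
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        super().__setitem__(key, value)


class WeatherItem(StrictItem):
    fields = ("city", "date", "dayDesc", "dayTemp")


item = WeatherItem()
item["city"] = ["Beijing"]        # declared field: accepted
try:
    item["ctiy"] = ["Beijing"]    # misspelled field: rejected instead of silently stored
except KeyError as err:
    print("rejected:", err)
print(dict(item))
```

With a plain dict the misspelled key would have been stored without complaint and only noticed much later, when the output was missing a city.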
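Next we write today's spider number one, which uses XPath (and CSS selectors) to pick apart the tags in the HTML. To create a Spider, you must subclass scrapy.Spider and define the following three members:
1. name: identifies the Spider. It must be unique; you cannot give two different Spiders the same name.
2. start_urls: the list of URLs the Spider starts crawling from. The first pages fetched will therefore be among these; later URLs are extracted from the data retrieved from the start URLs.
3. parse(): a method of the spider. The Response object produced when each start URL finishes downloading is passed to it as its only argument. The method is responsible for parsing the response data, extracting data (producing items), and producing Request objects for any URLs that need further processing.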
~~~
import scrapy
from weather.items import WeatherItem


class WeatherSpider(scrapy.Spider):
    name = 'weather_spider1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://weather.sina.com.cn/beijing']

    def parse(self, response):
        item = WeatherItem()
        item['city'] = response.xpath("//*[@id='slider_ct_name']/text()").extract()
        # Container element that holds the multi-day forecast
        tenDay = response.xpath('//*[@id="blk_fc_c0_scroll"]')
        item['date'] = tenDay.css('p.wt_fc_c0_i_date::text').extract()
        item['dayDesc'] = tenDay.css('img.icons0_wt::attr(title)').extract()
        item['dayTemp'] = tenDay.css('p.wt_fc_c0_i_temp::text').extract()
        return item
~~~
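Scrapy extracts data with a mechanism based on XPath and CSS expressions, called Scrapy Selectors.
Here are some example XPath expressions and what they mean:
/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text of that <title> element
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"
These are only a few simple examples; XPath is far more powerful than this.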
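The expressions above can be tried without Scrapy. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath (for instance it has no `text()` step and paths are written relative to the root element), so it is an approximation of what Scrapy's selectors do, not the same engine. The sample document is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Small well-formed sample document, made up for this illustration
doc = ET.fromstring(
    "<html>"
    "<head><title>Weather</title></head>"
    "<body>"
    "<div class='mine'>first</div>"
    "<div class='other'>second</div>"
    "<table><tr><td>a</td><td>b</td></tr></table>"
    "</body>"
    "</html>"
)

# /html/head/title/text(): ElementTree has no text() step, so read .text instead
title = doc.find("./head/title").text

# //td: every <td> element anywhere below the root
cells = [td.text for td in doc.findall(".//td")]

# //div[@class="mine"]: only the div elements whose class attribute is "mine"
mine = [div.text for div in doc.findall(".//div[@class='mine']")]

print(title, cells, mine)
```

Real web pages are rarely well-formed XML, which is one reason to use Scrapy's selectors or an HTML parser on them rather than `ElementTree`.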
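To work with XPath, Scrapy provides the Selector class, and it also provides shortcut methods on the response so that you do not have to build a selector yourself every time you extract data from it.
A Selector has four basic methods:
xpath(): takes an XPath expression and returns a SelectorList of all nodes matching it.
css(): takes a CSS expression and returns a SelectorList of all nodes matching it.
extract(): serializes the matched nodes to unicode strings and returns them as a list.
re(): extracts data using the given regular expression and returns a list of unicode strings.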
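Now we can write pipelines.py. If you only want to save the scraped data to a file, you can skip this step, leave the generated file unchanged, and append -o weather.json when running the spider: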
~~~
scrapy crawl weather_spider1 -o weather.json
~~~
~~~
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Note: written for Python 2 (Scrapy 0.24), hence the explicit encode() calls.


class WeatherPipeline(object):
    def __init__(self):
        self.file = open('wea.txt', 'w+')

    def process_item(self, item, spider):
        city = item['city'][0].encode('utf-8')
        self.file.write('city:' + str(city) + '\n\n')
        date = item['date']
        desc = item['dayDesc']
        # Split the descriptions: odd indexes are used as day, even as night
        dayDesc = desc[1::2]
        nightDesc = desc[0::2]
        dayTemp = item['dayTemp']
        # Separate loop variables keep `item` intact so it can be returned below
        for d, dd, nd, temp in zip(date, dayDesc, nightDesc, dayTemp):
            ta = temp.split('/')
            dt = ta[0]
            nt = ta[1]
            txt = 'date: {0} \t\t day:{1}({2}) \t\t night:{3}({4}) \n\n'.format(
                d,
                dd.encode('utf-8'),
                dt.encode('utf-8'),
                nd.encode('utf-8'),
                nt.encode('utf-8')
            )
            self.file.write(txt)
        return item

    def close_spider(self, spider):
        # Close the output file when the spider finishes
        self.file.close()
~~~
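The slicing and zipping in `process_item` is easier to follow on a small example. The sketch below is plain Python 3 with no Scrapy involved; the sample values are made up in the shape of the item fields (English placeholders instead of the Chinese descriptions), and the assumption that the description list alternates night, day, night, day is read off the pipeline code above rather than verified against the site.

```python
# Made-up sample values shaped like the item fields
dates = ["05-11", "05-13"]
desc = ["cloudy", "cloudy", "clear", "cloudy"]   # assumed order: night, day, night, day
temps = ["20°C / 11°C", "29°C / 17°C"]           # "day / night" in a single string

day_desc = desc[1::2]     # every second entry starting at index 1
night_desc = desc[0::2]   # every second entry starting at index 0

rows = []
for date, day, night, temp in zip(dates, day_desc, night_desc, temps):
    # Split "day / night" and drop the padding around the slash
    day_temp, night_temp = (part.strip() for part in temp.split("/"))
    rows.append("date: {0}\tday: {1} ({2})\tnight: {3} ({4})".format(
        date, day, day_temp, night, night_temp))

for row in rows:
    print(row)
```

`zip` stops at the shortest input, so a day with a missing description or temperature silently shortens the output instead of raising an error.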
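Finally, adjust settings.py and we are done. This file is where you configure how the crawler identifies itself to the site, such as its user agent or a proxy.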
~~~
# -*- coding: utf-8 -*-
# Scrapy settings for weather project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'weather'
SPIDER_MODULES = ['weather.spiders']
NEWSPIDER_MODULE = 'weather.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weather (+http://www.yourdomain.com)'
# The value is only the agent string; Scrapy adds the header name itself
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://www.weibo.com'
}
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1
}
DOWNLOAD_DELAY = 0.5
~~~
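A header consists of a name and a value, and a setting such as `USER_AGENT` supplies only the value. The sketch below illustrates this with the standard library's `urllib.request.Request` instead of Scrapy; nothing is sent over the network, and the shortened agent string is a made-up example.

```python
from urllib.request import Request

# Made-up, shortened agent string for illustration; no request is actually sent
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64)"
req = Request(
    "http://weather.sina.com.cn/beijing",
    headers={"User-Agent": user_agent, "Referer": "http://www.weibo.com"},
)

# urllib stores header names via str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))
print(req.get_header("Referer"))
```

If the value itself began with "User-Agent: ", the server would receive the header name twice, once as the name and once inside the value.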
The page can also be parsed with BeautifulSoup instead of Scrapy's selectors. Here is today's spider number two.
~~~
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from weather.items import WeatherItem


class WeatherSpider(scrapy.Spider):
    name = "weather_spider2"
    allowed_domains = ["sina.com.cn"]
    start_urls = ['http://weather.sina.com.cn']

    def parse(self, response):
        html_doc = response.body
        # html_doc = html_doc.decode('utf-8')
        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(html_doc, 'html.parser')
        itemTemp = {}
        itemTemp['city'] = soup.find(id='slider_ct_name')
        tenDay = soup.find(id='blk_fc_c0_scroll')
        itemTemp['date'] = tenDay.findAll("p", {"class": 'wt_fc_c0_i_date'})
        itemTemp['dayDesc'] = tenDay.findAll("img", {"class": 'icons0_wt'})
        itemTemp['dayTemp'] = tenDay.findAll('p', {"class": 'wt_fc_c0_i_temp'})
        item = WeatherItem()
        for att in itemTemp:
            item[att] = []
            if att == 'city':
                # Keep city as a one-element list, matching spider one and
                # the item['city'][0] lookup in the pipeline
                item[att] = [itemTemp.get(att).text]
                continue
            for obj in itemTemp.get(att):
                if att == 'dayDesc':
                    # The description is stored in the image's title attribute
                    item[att].append(obj['title'])
                else:
                    item[att].append(obj.text)
        return item
~~~
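Finally, change into the weather folder and run Scrapy.
You can first check which commands Scrapy offers. Note that the list differs depending on whether you run it from your home folder or from inside a project.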
~~~
Scrapy 0.24.6 - project: weather
Usage:
  scrapy <command> [options] [args]
Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  deploy        Deploy project in Scrapyd target
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
~~~
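We can now run scrapy crawl weather_spider1 or scrapy crawl weather_spider2. Either one writes a wea.txt file into the folder the command was run from (the project's top-level weather folder); open it to see the current forecast.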
~~~
city:北京
date: 05-11 day:多云(20°C ) night:多云( 11°C)
date: 05-12 day:晴(27°C ) night:晴( 11°C)
date: 05-13 day:多云(29°C ) night:晴( 17°C)
date: 05-14 day:多云(29°C ) night:多云( 19°C)
date: 05-15 day:晴(26°C ) night:晴( 12°C)
date: 05-16 day:晴(27°C ) night:晴( 16°C)
date: 05-17 day:阴(29°C ) night:晴( 19°C)
date: 05-18 day:晴(29°C ) night:少云( 16°C)
date: 05-19 day:局部多云(31°C ) night:少云( 16°C)
date: 05-20 day:局部多云(29°C ) night:局部多云( 16°C)
~~~