[TOC]

## 1. scrapyd

> 1. scrapyd is a free, open-source tool provided by the scrapy project for managing your scrapy projects through a web interface.
> 2. scrapyd-client is a free, open-source tool for packaging your scrapy project and publishing it to scrapyd. Publishing with scrapyd alone is rather cumbersome; this tool simplifies the steps.

Official documentation: http://scrapyd.readthedocs.io/en/latest/overview.html

### 1.1 Install (Ubuntu)

* Prerequisite: scrapy is already installed: https://doc.scrapy.org/en/latest/topics/ubuntu.html

~~~
# Install dependencies
sudo apt-get install -y libffi-dev libssl-dev libxml2-dev libxslt1-dev zlib1g-dev
sudo apt-get build-dep python-lxml

git clone https://github.com/scrapy/scrapyd
cd scrapyd/
python3 setup.py install
~~~

Or:

~~~
pip3 install scrapyd
~~~

> 1. Error: Invalid environment marker: python_version < '3'. Fix:

~~~
sudo pip3 install --upgrade setuptools
~~~

> 2. Error: Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?

~~~
sudo apt-get install -y libxml2-dev libxslt1-dev zlib1g-dev
~~~

> 3. Error: error: Could not find required distribution pyasn1

~~~
pip3 install pyasn1
~~~

> 4. Error: error: Setup script exited with error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

~~~
sudo apt-get build-dep python-lxml
~~~

> 5. Error: c/_cffi_backend.c:15:17: fatal error: ffi.h: No such file or directory #include <ffi.h>

~~~
sudo apt-get install libffi-dev
~~~

> 6. Error: error: Setup script exited with error in cryptography setup command: Invalid environment marker: platform_python_implementation != 'PyPy'

~~~
sudo pip install --upgrade setuptools
~~~

### 1.2 Configuring scrapyd

> Scrapyd searches for configuration files in the following locations, and parses them in order with the latest one taking more priority:

~~~
/etc/scrapyd/scrapyd.conf (Unix)
c:\scrapyd\scrapyd.conf (Windows)
/etc/scrapyd/conf.d/* (in alphabetical order, Unix)
scrapyd.conf
~/.scrapyd.conf (users home directory)
~~~

By default scrapyd binds to 127.0.0.1. Change bind_address to the server's IP so that the client can send deployment requests to it:

~~~
# Create the directory
mkdir /etc/scrapyd
# Create the file
vim /etc/scrapyd/scrapyd.conf
# Add the configuration
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 192.168.56.130
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
~~~

### 1.3 Running scrapyd

~~~
nohup scrapyd > scrapyd.log 2>&1 &
~~~

## 2. scrapyd-client

GitHub: https://github.com/scrapy/scrapyd-client

### 2.1 Installation

~~~
pip3 install scrapyd-client
~~~

### 2.2 Deploying a spider project with scrapyd-deploy

#### 2.2.1 Configure the spider project

> Edit scrapy.cfg in the spider project and set the target server (the server running scrapyd) that the project will be deployed to:

~~~
[deploy]
url = http://192.168.56.130:6800/
project = proxyscrapy
username = proxyscrapy
password = tuna
~~~
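Before running scrapyd-deploy in the next step, it can be worth confirming that the scrapyd server configured above is actually reachable from the client machine. The following is a minimal sketch (not part of the original steps) that queries the daemonstatus.json endpoint listed in the [services] configuration earlier; it assumes the `requests` package is installed and that SCRAPYD_URL matches the url in your [deploy] section.

~~~
import requests

SCRAPYD_URL = "http://192.168.56.130:6800"  # same host/port as the url in [deploy]

def scrapyd_is_up(base_url: str) -> bool:
    """Return True if scrapyd answers its daemonstatus.json endpoint."""
    try:
        resp = requests.get(f"{base_url}/daemonstatus.json", timeout=5)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"scrapyd not reachable: {exc}")
        return False
    status = resp.json()
    # Typical reply: {"status": "ok", "running": 0, "pending": 0, "finished": 0, "node_name": "..."}
    print(f"node {status.get('node_name')}: running={status.get('running')}, pending={status.get('pending')}")
    return status.get("status") == "ok"

if __name__ == "__main__":
    print("scrapyd reachable:", scrapyd_is_up(SCRAPYD_URL))
~~~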
#### 2.2.2 Deploy

**1. Run the packaging command**

~~~
scrapyd-deploy
~~~

> On Windows this fails with:

~~~
E:\PythonWorkSpace\proxyscrapy>scrapyd-deploy
'scrapyd-deploy' is not recognized as an internal or external command, operable program or batch file.
~~~

> * This happens because on Windows the installed scrapyd-deploy script is not directly executable, so make the following changes:

1. In the Python installation directory, open the Scripts directory and create a file named scrapyd-deploy.bat
![](https://box.kancloud.cn/7ff68824b1f51d9b522ebad4489f2892_1214x501.png)
2. Add the following content:

~~~
@echo off
"D:\Python\Python36\python.exe" "D:\Python\Python36\Scripts\scrapyd-deploy" %1 %2 %3 %4 %5 %6 %7 %8 %9
~~~

> * Run the packaging command again; on success it returns:

~~~
Packing version 1519871059
Deploying to project "proxyscrapy" in http://192.168.56.130:6800/addversion.json
Server response (200):
{"project": "proxyscrapy", "status": "ok", "node_name": "zabbix01", "version": "1519871059", "spiders": 4}
~~~

**2. Run the spider**

On Windows you need to install curl first: https://www.kancloud.cn/tuna_dai_/day01/535005

~~~
curl http://192.168.56.130:6800/schedule.json -d project=proxyscrapy -d spider=yaoq
~~~

scrapyd also provides many other endpoints, including listing all projects, listing all spiders, cancelling running spiders, and so on. Official API: http://scrapyd.readthedocs.io/en/latest/api.html

On success the command returns:

~~~
{"status": "ok", "node_name": "zabbix01", "jobid": "3db9af3e1d0011e88b5c080027a60f41"}
~~~

**3. Check the spider's status**

Open http://192.168.56.130:6800 and click Jobs to see the spiders:

![](https://box.kancloud.cn/d6ae60474a7ec7773d4b0112beec767e_863x565.png)

From there you can view each spider's status and logs:

![](https://box.kancloud.cn/429af5ca9c8be6bb88d0fd3695d59fd4_1189x341.png)

After changing code, just run scrapyd-deploy again to repackage and redeploy. Very convenient!

## 3. Deploying to multiple scrapyd servers

### 3.1 Configure the spider project's scrapy.cfg

> 1. Specify multiple targets (scrapyd servers) using the format [deploy:target_name]:

~~~
[deploy:zabbix01]
url = http://192.168.56.130:6800/
project = proxyscrapy
username = proxyscrapy
password = tuna

[deploy:es01]
url = http://192.168.56.130:6800/
project = proxyscrapy
username = proxyscrapy
password = tuna
~~~

### 3.2 Packaging the project to a scrapyd target

#### 3.2.1 Deploy to a single target

scrapyd-deploy [target_name], for example:

~~~
E:\PythonWorkSpace\proxyscrapy>scrapyd-deploy zabbix01
Packing version 1519951093
Deploying to project "proxyscrapy" in http://192.168.56.130:6800/addversion.json
Server response (200):
{"status": "ok", "version": "1519951093", "node_name": "zabbix01", "spiders": 4, "project": "proxyscrapy"}

E:\PythonWorkSpace\proxyscrapy>scrapyd-deploy es01
Packing version 1519951106
Deploying to project "proxyscrapy" in http://192.168.56.130:6800/addversion.json
Server response (200):
{"status": "ok", "version": "1519951106", "node_name": "zabbix01", "spiders": 4, "project": "proxyscrapy"}
~~~

#### 3.2.2 Deploy to all targets at once (-a)

~~~
E:\PythonWorkSpace\scrapyredis>scrapyd-deploy -a
Packing version 1519952580
Deploying to project "scrapyredis" in http://192.168.56.130:6800/addversion.json
Server response (200):
{"status": "ok", "version": "1519952580", "node_name": "zabbix01", "spiders": 1, "project": "scrapyredis"}
Packing version 1519952580
Deploying to project "scrapyredis" in http://192.168.56.130:6800/addversion.json
Server response (200):
{"status": "ok", "version": "1519952580", "node_name": "zabbix01", "spiders": 1, "project": "scrapyredis"}
~~~

> 1. List the available targets:

~~~
E:\PythonWorkSpace\proxyscrapy>scrapyd-deploy -l
zabbix01             http://192.168.56.130:6800/
es01                 http://192.168.56.130:6800/
~~~

> 2. List which projects are deployed on a given target:

~~~
E:\PythonWorkSpace\proxyscrapy>scrapyd-deploy -L zabbix01
scrapyredis
proxyscrapy
~~~

> 3. Start the spiders on the server, as sketched below.
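To start a spider on one of the servers without typing the curl command by hand, here is a minimal sketch, assuming the `requests` package and the project/spider names (proxyscrapy / yaoq) from the earlier examples. It calls the schedule.json and listjobs.json endpoints from the scrapyd API documented above.

~~~
import requests

SCRAPYD_URL = "http://192.168.56.130:6800"
PROJECT = "proxyscrapy"  # assumed: project name from the examples above
SPIDER = "yaoq"          # assumed: spider name from the examples above

def schedule_spider(base_url: str, project: str, spider: str) -> str:
    """Start a spider run via schedule.json and return the job id scrapyd assigns."""
    resp = requests.post(f"{base_url}/schedule.json",
                         data={"project": project, "spider": spider},
                         timeout=10)
    resp.raise_for_status()
    payload = resp.json()  # e.g. {"status": "ok", "node_name": "...", "jobid": "..."}
    return payload["jobid"]

def list_jobs(base_url: str, project: str) -> dict:
    """Return the pending/running/finished job lists for a project via listjobs.json."""
    resp = requests.get(f"{base_url}/listjobs.json",
                        params={"project": project},
                        timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    job_id = schedule_spider(SCRAPYD_URL, PROJECT, SPIDER)
    jobs = list_jobs(SCRAPYD_URL, PROJECT)
    running_ids = [job["id"] for job in jobs.get("running", [])]
    print(f"job {job_id} scheduled; currently running: {job_id in running_ids}")
~~~

Running the same script against each target's URL starts the spider on every server it has been deployed to.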