7.2 Splash的使用 · python3爬虫笔记

## 1.说明 Lua脚本语法:[http://www.runoob.com/lua/lua-basic-syntax.html](http://www.runoob.com/lua/lua-basic-syntax.html) Lua下载:[https://github.com/rjpcomputing/luaforwindows/releases](https://github.com/rjpcomputing/luaforwindows/releases) 官方文档: * [https://splash.readthedocs.io/en/stable/scripting-ref.html](https://splash.readthedocs.io/en/stable/scripting-ref.html) * [https://splash.readthedocs.io/en/stable/scripting-element-object.html](https://splash.readthedocs.io/en/stable/scripting-element-object.html) * [https://splash.readthedocs.io/en/stable/api.html\#render-json](https://splash.readthedocs.io/en/stable/api.html#render-json) * [https://splash.readthedocs.io/en/stable/api.html\#render-png](https://splash.readthedocs.io/en/stable/api.html#render-png) * [https://splash.readthedocs.io/en/stable/api.html\#render-html](https://splash.readthedocs.io/en/stable/api.html#render-html) ### 2.实例 ![](/assets/7.2.2.png) 脚本语言内容: ``` function main(splash, args) assert(splash:go(args.url)) assert(splash:wait(0.5)) return { html = splash:html(), png = splash:png(), har = splash:har(), } end ``` wait:等待 html:返回页面的源码 png:返回页面的截图 HAR:返回页面的HAR信息 ### 3.Splash Lua脚本 #### 入口及返回值 ``` function main(splash, args) assert(splash:go("https://www.baidu.com")) assert(splash:wait(0.5)) local title = splash:evaljs("document.title") return { title = title } end ``` 通过spalsh:evaljs方法传入javascript脚本 Splash默认调用main方法方法的返回值为字典形式或者是字符串形式，最后都会转化为一个Splash HTTP Response ``` function main(splash, args) return { hello = "world" } end ``` #### 异步处理 ``` function main(splash, args) local example_urls = {"www.baidu.com", "www.taobao.com", "www.zhihu.com"} local urls = args.urls or example_urls local results = {} for index, url in ipairs(urls) do local ok, reason = splash:go("http://" .. url) if ok then splash:wait(2) results[url] = splash:png() end end return results end ``` ![](/assets/7.2.3.png) wait:等待的描述 ..:字符串拼接操作符 ipairs:操作字典进行迭代 #### 4.Splash对象属性 #### args splash对象的args属性可以获取加载时配置的参数 ``` function main(spalsh,args) local url = args.url return url end ``` 运行结果: ``` Splash Response: "https://www.qidian.com/" ``` #### js\_endabled js\_endabled属性是splash的javascript执行开关，可以将其配置为True或False来控制是否可以执行JavaScript代码，默认为True ``` function main(splash,args) splash:go(args.url) splash.js_enabled = false local title = splash:evaljs("document.title") return { title = title, } end ``` 禁用之后，调用evaljs方法执行javascript代码，会抛出异常 ``` { "description": "Error happened while executing Lua script", "info": { "message": "[string \"function main(splash,args)\r...\"]:4: unknown JS error: None", "line_number": 4, "source": "[string \"function main(splash,args)\r...\"]", "splash_method": "evaljs", "type": "JS_ERROR", "error": "unknown JS error: None", "js_error_message": null }, "error": 400, "type": "ScriptError" } ``` 一般情况下默认开启 #### resoure\_timeout 设置加载的超时时间，单位为秒数。如果设置为0或nil就表示不检测超时 ``` function main(splash,args) splash.resource_timeout = 0.1 splash:go(args.url) return splash:png() end ``` 设置超时为0.1秒，如果在0.1秒之内没有得到响应就会抛出异常 ![](/assets/7.2.4.png) ### images\_enabled 设置图片是否加载 ``` function main(splash,args) splash.images_enabled = false splash:go(args.url) return splash:png() end ``` #### plugins\_enabled 控制浏览器插件是否开启，如Flash。默认情况下事是False/不开启 ``` splash.plugins_enabled = flase/true ``` #### scroll\_position 控制页面的滚动偏移 splash.scroll\_position = {x=x,y=y} ``` function main(splash,args) splash.images_enabled = true splash:go(args.url) splash:wait(3) splash.scroll_position = {y=200} return splash:png() end ``` ### 5.Splash对象方法 #### go go:用来请求某个链接的方法，可以模拟GET和POST请求，同时支持传入Headers、From Data等数据 ``` ok, reason = splash:go{url, baseurl=nil, headers=nil, http_method="GET", body=nil, formdata=nil} ``` 参数说明: * url:请求的url * baseurl:资源加载相对路径 * headers:请求的headers * http\_method:默认是get，同时支持post * body:post的时候的表单数据，使用的Content-type为application * fromdata:post的时候表单数据，使用的Content-type为application/x-www-form-urlencoded 返回的结果是结果ok和原因reason的组合，如果ok为空，代表网页加载出现了错误，此时reason变量中包含了错误的原因，否则证明页面加载成功 ``` function main(splash,args) local ok,reason = splash:go{"http://httpbin.org/post",http_method="POST",body="name-Germey"} if ok then return splash:html() end end ``` 模拟post请求，并传入POST的表单数据，如果成功，则返回用页面源代码运行结果: ``` <html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{ "args": {}, "data": "", "files": {}, "form": { "name-Germey": "" }, "headers": { "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Encoding": "gzip, deflate", "Accept-Language": "en,*", "Connection": "close", "Content-Length": "11", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "Origin": "null", "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1" }, "json": null, "origin": "58.16.27.216", "url": "http://httpbin.org/post" } </pre></body></html> ``` #### wait 控制页面等待时间 ``` ok,reason = splash:wait(time,cancel_on_redirect=false,cancle_on_error=True) ``` 参数说明: * time:等待秒数 * cancel\_on\_redirect:默认Fasle,如果发生了重定向就停止等待，并返回重定向结果 * cancel\_on\_error:默认False,如果加载发生了加载错误就停止等待 ``` function main(splash,args) splash:go(args.url) splash:wait(2) return splash:html() end ``` #### jsfunc 直接调用javascript定义的方法，需要用双中括号包围，相等于实现了javascript方法到lua脚本的转换 ``` function main(splash, args) local get_div_count = splash:jsfunc([[ function () { var body = document.body; var divs = body.getElementsByTagName('div'); return divs.length; } ]]) splash:go("https://www.baidu.com") return ("There are %s DIVs"):format( get_div_count()) end ``` 运行结果: ``` Splash Response: "There are 23 DIVs" ``` #### evaljs 执行javascript代码并返回最后一条语句的返回结果 ``` result = splash:evaljs(js) ``` #### runjs 执行JavaScript代码类似于evaljs功能类似，但偏向于执行某些动作或声明某些方法evaljs偏向于获取某些执行结果 ``` function main(splash,args) splash:go("https;//www.baidu.com") splash:runjs("foo = function(){return 'bar'}") local result = splash:evaljs("foo()") return result end ``` 运行结果: ``` Splash Response: "bar" ``` #### autoload 设置每个页面访问时自动加载的对象 ``` ok,reason = splash:autoload{source_or_url,source=nil,url=nil} ``` 参数说明: * source\_or\_url:JavaScript代码或者JavaScript库链接 * source，JavaScript代码 * url，JavaScript库链接只负责加载JavaScript代码或库，不执行任何操作，如果要执行操作可以调用evaljs或runjs方法 ``` function main(splash,args) splash:autoload([[ function get_documetn_title(){ return document.title; } ]]) splash:go("https://www.baidu.com") return splash:evaljs("get_documetn_title()") end ``` 运行结果: ``` Splash Response: "百度一下，你就知道" ``` 加载某些方法库，如JQuery ``` function main(splash,args) splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js") splash:go("https://www.baidu.com") local version = splash:evaljs("$.fn.jquery") return "JQuery version: " .. version end ``` 运行结果： ``` Splash Response: "JQuery version: 1.10.2" ``` #### call\_later 可以通过设置定时任务和延迟时间实现任务延时执行，并且可以在执行前通过 cancel 方法重新执行定时任务 ``` function main(splash,args) local snapshots = {} local timer = splash:call_later(function() snapshots['a'] = splash:png() splash:wait(1.0) snapshots["b"] = splash:png() end,0.2) splash:go("https://www.taobao.com") splash:wait(3.0) return snapshots end ``` ![](/assets/7.2.5.png) #### http\_get {#httpget} 此方法可以模拟发送 HTTP 的 GET 请求，使用方法如下： ``` response = splash:http_get{url, headers=nil, follow_redirects=true} ``` 参数说明如下： * url，请求URL。 * headers，可选参数，默认为空，请求的 Headers。 * follow\_redirects，可选参数，默认为 True，是否启动自动重定向。 #### http\_post {#httppost} 和 http\_get 方法类似，此方法是模拟发送一个 POST 请求，不过多了一个参数 body，使用方法如下 ``` response = splash:http_post{url, headers=nil, follow_redirects=true, body=nil} ``` 参数说明如下： * url，请求URL。 * headers，可选参数，默认为空，请求的 Headers。 * follow\_redirects，可选参数，默认为 True，是否启动自动重定向。 * body，可选参数，默认为空，即表单数据。 ``` function main(splash, args) local treat = require("treat") local json = require("json") local response = splash:http_post{"http://httpbin.org/post", body=json.encode({name="Germey"}), headers={["content-type"]="application/json"} } return { html=treat.as_string(response.body), url=response.url, status=response.status } end ``` #### set\_content {#setcontent} 此方法可以用来设置页面的内容 ``` function main(splash) assert(splash:set_content("<html><body><h1>hello Angle</h1></body></html>")) return splash:png() end ``` #### html {#html} 此方法可以用来获取网页的源代码 #### png {#png} 此方法可以用来获取 PNG 格式的网页截图 #### jpeg {#jpeg} 此方法可以用来获取 JPEG 格式的网页截图 #### har {#har} 此方法可以用来获取页面加载过程描述 #### url {#url} 此方法可以获取当前正在访问的 URL #### get\_cookies {#getcookies} 此方法可以获取当前页面的 Cookies #### add\_cookie {#addcookie} 此方法可以为当前页面添加 Cookie ``` cookies = splash:add_cookie{name, value, path=nil, domain=nil, expires=nil, httpOnly=nil, secure=nil} ``` ``` function main(splash) splash:add_cookie{"sessionid", "asdadasd", "/", domain="http://example.com"} splash:go("http://example.com/") return splash:html() end ``` #### clear\_cookies {#clearcookies} 此方法可以清除所有的 Cookies ``` splash:clear_cookies() ``` #### get\_viewport\_size {#getviewportsize} 此方法可以获取当前浏览器页面的大小，即宽高 #### set\_viewport\_size {#setviewportsize} 此方法可以设置当前浏览器页面的大小，即宽高 ``` spalsh:set_viewport_size(x,y) ``` #### set\_viewport\_full {#setviewportfull} 此方法可以设置浏览器全屏显示 #### set\_user\_agent {#setuseragent} 此方法可以设置浏览器的 User-Agent ``` splash:set_user_agent("Splash") ``` #### set\_custom\_headers {#setcustomheaders} 此方法可以设置请求的 Headers ``` set_custom_headers() 此方法可以设置请求的 Headers ``` #### select {#select} select 方法可以选中符合条件的第一个节点，如果有多个节点符合条件，则只会返回一个，其参数是 CSS 选择器 ``` function main(splash,args) splash:go(args.url) input = splash:select("#kw") input:send_text('Splash') splash:wait(3) return splash:png() end ``` ![](/assets/7.2.6.png) #### select\_all {#selectall} 此方法可以选中所有的符合条件的节点，其参数是 CSS 选择器 ``` function main(splash) local treat = require('treat') assert(splash:go("http://quotes.toscrape.com/")) assert(splash:wait(0.5)) local texts = splash:select_all('.quote .text') local results = {} for index, text in ipairs(texts) do results[index] = text.node.innerHTML end return treat.as_array(results) end ``` #### mouse\_click {#mouseclick} 此方法可以模拟鼠标点击操作，传入的参数为坐标值 x、y，也可以直接选中某个节点直接调用此方法 ``` function main(splash,args) splash:go(args.url) input = splash:select("#kw") input:send_text('Splash') submit = splash:select("#su") submit:mouse_click() splash:wait(3) return splash:png() end ``` ![](/assets/7.2.7.png) ### 6.Splash API调用 ### render.html 用于获取javascript渲染页面的HTML代码 ``` curl http://localhost:8050/render.html?url=https://www.baidu.com ``` ``` import requests url = "http://192.168.99.100:8050/render.html?url=https://www.baidu.com&wait=5" response = requests.get(url) print(response.text) ``` ### render.png 获取网页截图参数: * height:高 * width:宽 ``` curl http://localhost:8050/render.png?url=https://www.taobao.com&wait=5&width=1000&height=700 ``` ``` import requests url = "http://192.168.99.100:8050/render.png?url=https://www.taobao.com&wait=10&width=1000&height=700" response = requests.get(url) with open('taobao.png','wb') as f: f.write(response.content) ``` 图片:[https://splash.readthedocs.io/en/stable/api.html\#render-png](https://splash.readthedocs.io/en/stable/api.html#render-png) ### render.har {#renderhar} 此接口用于获取页面加载的 HAR 数据 ``` curl http://localhost:8050/render.har?url=https://www.jd.com&wait=5 ``` 返回结果非常多，是一个 Json 格式的数据，里面包含了页面加载过程中的 HAR 数据。 ### render.json {#renderjson} 此接口包含了前面接口的所有功能，返回结果是 Json 格式 ``` curl http://localhost:8050/render.json?url=https://httpbin.org ``` 更多参数:[https://splash.readthedocs.io/en/stable/api.html\#render-json](https://splash.readthedocs.io/en/stable/api.html#render-json) ### execute {#execute} 此接口可以实现和 Lua 脚本的对接 ``` function main(splash) return "hello" end ``` ``` curl http://localhost:8050/execute?lua_source=function+main%28splash%29%0D%0A++return+%27hello%27%0D%0Aend ``` ``` import requests from urllib.parse import quote lua = ''' function main(splash) return 'hello' end ''' url = "http://192.168.99.100:8050/execute?lua_source=" + quote(lua) response = requests.get(url) print(response.text) ``` ``` import requests from urllib.parse import quote lua = ''' function main(splash, args) local treat = require("treat") local response = splash:http_get("http://httpbin.org/get") return { html=treat.as_string(response.body), url=response.url, status=response.status } end ''' url = 'http://localhost:8050/execute?lua_source=' + quote(lua) response = requests.get(url) print(response.text) ```