1. Common Requests Methods · Python3 Web Scraping in Practice

## **Common Requests Methods**

### Commonly used functions in the Requests library

```
requests.get()     The main method for fetching an HTML page; sends a GET request
requests.post()    Submits a POST request to a page
requests.put()     Submits a PUT request to a page
requests.patch()   Submits a partial-modification (PATCH) request to a page
requests.delete()  Submits a DELETE request to a page
```

### 1. GET request

~~~
import requests
import json
r = requests.get('http://httpbin.org/get')
html = r.text             # response body as a str
html2 = json.loads(html)  # the same body parsed into a dict
print(html)
print(type(html), type(html2))
print(html["url"])        # raises TypeError: html is a str, not a dict
print(html2["url"])       # not reached because of the error above
~~~

Output:

~~~
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0"
  },
  "origin": "114.248.162.218, 114.248.162.218",
  "url": "https://httpbin.org/get"
}
<class 'str'> <class 'dict'>
Traceback (most recent call last):
  File "F:/Desktop/Project/课件代码/1.py", line 8, in <module>
    print(html["url"])
TypeError: string indices must be integers
~~~

### 2. POST request

~~~
import requests

# Form fields to send in the request body
data = {'name': 'germey', 'age': '22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)
~~~

Output:

~~~
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "18",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0"
  },
  "json": null,
  "origin": "114.248.162.218, 114.248.162.218",
  "url": "https://httpbin.org/post"
}
~~~

### 3. Adding headers

~~~
import requests

# First request: no custom headers
r1 = requests.get("https://www.zhihu.com/explore")
print(r1.text)
print("==============")

# Second request: send a browser-like User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko)'
}
r2 = requests.get("https://www.zhihu.com/explore", headers=headers)
print(r2.text)
~~~

Output:

~~~
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>
==============
<!doctype html>
<html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">发现 - 知乎</title><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"/><meta name="renderer" content="webkit"/><meta name="force-rendering" content="webkit"/><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/><meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg"/><meta name="description" property="og:description" content="有问题，上知乎。知乎，可信赖的问答社区，以让每个人高效获得可信赖的解答为使命。知乎凭借认真、专业和友善的社区氛围，结构化、易获得的优质内容，基于问答的内容生产方式和独特的社区机制，吸引、聚集了各行各业中大量的亲历者、内行人、领域专家、领域爱好者，将高质量的内容透过人的节点来成规模地生产和分享。用户通过问答等交流方式建立信任和连接，打造和提升个人影响力，并发现、获得新机会。"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-152.67c7b278.png"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-152.67c7b278.png" sizes="152x152"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-120.b3e6278d.png" sizes="120x120"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-76.7a750095.png" sizes="76x76"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-60.a4a761d4.png" sizes="60x60"/><link rel="shortcut icon" type="image/x-icon" href="https://static.zhihu.com/static/favicon.ico"/><link rel="search" type="application/opensearchdescription+xml" href="https://static.zhihu.com/static/search.xml" title="知乎"/><link rel="dns-prefetch" href="//static.zhimg.com"/><link rel="dns-prefetch" href="//pic1.zhimg.com"/><link rel="dns-prefetch" href="//pic2.zhimg.com"/><link rel="dns-prefetch" href="//pic3.zhimg.com"/><link rel="dns-prefetch" href="//pic4.zhimg.com"/><style>
.u-safeAreaInset-top {
  height: constant(safe-area-inset-top) !important;
  height: env(safe-area-inset-top) !important;
}
.u-safeAreaInset-bottom {
  height: constant(safe-area-inset-bottom) !important;
  height: env(safe-area-inset-bottom) !important;
}
~~~

### 4. File upload

~~~
import requests

# Open the file in binary mode; the with block closes it after the upload
with open('favicon.png', 'rb') as f:
    files = {'file': f}
    r = requests.post("http://httpbin.org/post", files=files)
print(r.text)
~~~

Output:

~~~
{
  "args": {},
  "data": "",
  "files": {
    "file": "data:application/octet-stream;base64,iVBORw0KGgoAAAANSUhEUgAAAhwAAAECCAMAAACCFP44AAAACXBIWXMAAAsTAAALEwEAmpwYAAAKTWlDQ1BQaG90b3Nob3AgSUNDIHByb2ZpbGUAAHjanVN3WJP3Fj7f92UPVkLY8LGXbIEAIiOsCMgQWaIQkgBhhBASQMWFiApWFBURnEhVxILVCkidiOKgKLhnQYqIWotVXDjuH9yntX167+3t+9f7vOec5/zOec8PgBESJpHmomoAOVKFPDrYH49PS"
  },
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "8024",
    "Content-Type": "multipart/form-data; boundary=ae576c1072214f7675389b19c437283d",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0"
  },
  "json": null,
  "origin": "114.248.162.218, 114.248.162.218",
  "url": "https://httpbin.org/post"
}
~~~

### 5. Proxy settings

Some websites return content normally when you send only a few requests during testing. Once you start crawling at scale, however, the large volume of frequent requests may cause the site to show a CAPTCHA, redirect to a login page, or even ban the client's IP outright, leaving the site inaccessible for a period of time. To prevent this, we can route requests through a proxy, which is what the `proxies` parameter is for. It can be set like this:

~~~
import requests

# Map each URL scheme to the proxy that should handle it
proxies = {
    "http": "http://sun:qq123456.@192.168.66.211:520",
}
r1 = requests.get('http://httpbin.org/get')                   # direct request
r2 = requests.get('http://httpbin.org/get', proxies=proxies)  # through the proxy
print(r1.text)
print(r2.text)
~~~

Output (note the different `origin` in the second response):

~~~
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0"
  },
  "origin": "114.248.162.218, 114.248.162.218",
  "url": "https://httpbin.org/get"
}
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0"
  },
  "origin": "175.98.194.165, 175.98.194.165",
  "url": "https://httpbin.org/get"
}
~~~

### 6. Timeout settings

When the local network is poor, or the server responds slowly or not at all, we may wait a very long time for a response, or never receive one and end up with an error. To guard against a server that does not respond in time, set a timeout: if no response arrives within that time, an exception is raised. This is what the `timeout` parameter is for. The value is in seconds and limits how long Requests waits for the server to respond after the request is sent; it is not a limit on the total download time. For example:

~~~
# Set a timeout
import requests

r = requests.get("https://www.taobao.com", timeout=0.0001)
print(r.status_code)
~~~

Output:

~~~
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.taobao.com', port=443): Read timed out. (read timeout=0.0001)
~~~

~~~
# Never time out
import requests

# A 1-second timeout
r = requests.get("https://www.taobao.com", timeout=1)
print(r.status_code)

# timeout=None waits indefinitely for a response
r = requests.get('https://www.google.com', timeout=None)
print(r.text)
~~~

### 7. Session persistence

In requests, calling get() or post() directly does simulate a web request, but each call is effectively its own session, as if you had opened different pages in two separate browsers. Consider this scenario: the first request uses post() to log in to a site, and the second uses get() to request your profile page after logging in. In practice this is like opening two browsers, two completely unrelated sessions. Can you retrieve the profile that way? Of course not.

You might say: why not just set the same cookies on both requests? That works, but it is tedious, and there is a simpler solution. The key is to keep the same session, which is like opening a new tab in the same browser rather than launching a new browser. If you do not want to set cookies by hand every time, this is where the Session object comes in. With it, we can easily maintain a session without worrying about cookies; it handles them for us automatically.

Test with plain get():

~~~
import requests

# Two independent requests: the cookie set by the first is not sent with the second
requests.get('http://httpbin.org/cookies/set/number/123456789')
r = requests.get('http://httpbin.org/cookies')
print(r.text)
~~~

Output:

~~~
{
  "cookies": {}
}
~~~

Test with a session:

~~~
import requests

# Both requests share one Session, so the cookie persists
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
~~~

Output:

~~~
{
  "cookies": {
    "number": "123456789"
  }
}
~~~
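To see what the Session object does on your behalf, it can help to inspect a request before it is sent. The following is a minimal sketch that needs no network access: it stores a cookie and a default header on a session, then uses `prepare_request()` to show what would go out on the wire. The `my-crawler/0.1` User-Agent and the hand-set cookie are illustrative assumptions, not part of the original example.

```python
import requests

# Sketch: a Session keeps cookies and default headers and attaches them
# to every request it prepares. Nothing is sent over the network here.
with requests.Session() as s:
    # Hypothetical values, chosen only for illustration
    s.headers.update({"User-Agent": "my-crawler/0.1"})
    s.cookies.set("number", "123456789")

    # prepare_request() merges the session state into the outgoing request
    req = requests.Request("GET", "http://httpbin.org/cookies")
    prepared = s.prepare_request(req)

    cookie_header = prepared.headers.get("Cookie")
    user_agent = prepared.headers["User-Agent"]
    print(cookie_header)
    print(user_agent)
```

Using the session as a context manager also closes its pooled connections when the block ends, which is a reasonable habit for longer-running crawlers.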
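For the timeout settings described earlier, a crawler usually wants to catch the timeout instead of letting it crash the program. The sketch below is one way to do that, under some assumptions: `SlowHandler` and `fetch_status` are hypothetical names, and a small local server that answers after half a second stands in for a slow website so the example runs without internet access. It also passes `timeout` as a `(connect, read)` tuple, which Requests accepts in place of a single number.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that answers after a short delay."""

    def do_GET(self):
        time.sleep(0.5)  # simulate a slow server
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        except OSError:
            pass  # the client already gave up waiting

    def log_message(self, *args):
        pass  # keep the example output quiet


session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo


def fetch_status(url, timeout):
    """Return the HTTP status code, or None if the request timed out."""
    try:
        # timeout is a (connect, read) tuple in seconds
        return session.get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:
        return None


# Start the local server on a free port in a background thread
server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

fast = fetch_status(url, timeout=(1, 0.1))  # read timeout shorter than the delay
slow = fetch_status(url, timeout=(1, 2))    # read timeout longer than the delay
print(fast, slow)

server.shutdown()
session.close()
```

`requests.exceptions.Timeout` is the common base class of `ConnectTimeout` and `ReadTimeout`, so a single `except` clause covers both cases.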