chapter28_Stage Assessment 4_Crawling NetEase Auto · Python Quick Start

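## Challenge requirements
>[info] In the previous chapters we learned how to make HTTP requests, parse HTML, read and write files, and use multiple threads.
>
>Building on those topics, let's write a crawler that downloads the pictures from the NetEase Auto website to local disk!
>
>NetEase Auto new-energy car selection home page: http://product.auto.163.com/newpower/#newindex
>1. Download one introduction picture from the detail page of every new-energy car, named in the format: brand_model_lowestprice_highestprice.jpg
>2. Use multiple threads to make the program more efficient

## Notes
* Please complete the challenge on your own first, and only then look at the sample code below
* If you think your code is *more elegant and more efficient*, feel free to **leave a comment** and **share** it with everyone~

:-: Come and take the challenge~

## Reference code:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq
import requests
import queue
import math
import time
import threading


class WyCard(object):
    def __init__(self, card_type):
        self.list_url = {
            "newpower": "http://product.auto.163.com/energy_api/getEnergySeriesList.action?orderType=0&size=20&page="
        }.get(card_type)
        self.q = queue.Queue()

    def put_q_list(self):
        """
        Put car information into the queue.
        :return:
        """
        resp = requests.get(self.list_url + "1")
        page_count = 1
        if resp.status_code == 200:
            data = resp.json()
            total = data.get("total")
            card_list = data.get("list")
            # 20 matches the size=20 parameter in list_url
            page_count = math.ceil(int(total) / 20)
            self.q.put(card_list)
        else:
            print("Failed to fetch list page:", self.list_url + "1")
        # Page 1 is already queued; fetch the remaining pages 2..page_count
        for page in range(2, page_count + 1):
            resp = requests.get(self.list_url + str(page))
            if resp.status_code == 200:
                card_list = resp.json().get("list")
                self.q.put(card_list)
            else:
                print("Failed to fetch list page:", self.list_url + str(page))
        # Append a special marker at the end of the queue so the consumer threads
        # know they have reached the end and no more cars will be added.
        self.q.put("END")

    def get_q_list(self):
        """
        Take car information from the queue, read the brand, model, prices and
        detail page URL, then use the detail page URL to find the picture on the
        detail page and download it to local disk.
        :return:
        """
        while True:
            data = self.q.get()
            if data == "END":
                print("END....")
                # A small trick: when a thread finishes, it puts the string "END"
                # back into the queue, so the next thread that takes "END" also
                # knows it should exit. In effect, each finishing thread notifies
                # the others that they can finish too.
                self.q.put("END")
                break
            for item in data:
                url = item.get("url")
                brand_name = item.get("brand_name")
                name = item.get("name")
                price_min = item.get("price_min")
                price_max = item.get("price_max")
                # 万 = 10,000 yuan; format() also works if the prices are numbers
                title = "{}_{}_{}万_{}万.jpg".format(brand_name, name, price_min, price_max)
                self._download_card(url, title)

    def _download_card(self, series_url, title):
        """
        Download the picture to local disk.
        :param series_url:
        :param title:
        :return:
        """
        resp = requests.get(series_url)
        if resp.status_code == 200:
            d = pq(resp.text)
            img_src = d('#car_pic img').attr("src")
            if not img_src:
                # No matching <img> on the detail page, nothing to download
                print("No picture found on page:", series_url)
                return
            resp = requests.get(img_src)
            if resp.status_code == 200:
                with open(title, "wb") as fp:
                    print("Downloading:", title, flush=True)
                    fp.write(resp.content)
            else:
                print("Failed to fetch picture:", img_src)
        else:
            print("Failed to fetch page:", series_url)


def main(card_type):
    """
    Main function: one thread fetches the car list while several threads
    download the pictures.
    :param card_type:
    :return:
    """
    card = WyCard(card_type)
    thread_list = []
    # Create one thread that fetches the car list
    thread_list.append(threading.Thread(target=card.put_q_list, args=()))
    # Create several threads that download pictures; 10 threads here
    get_thread_count = 10
    for i in range(get_thread_count):
        thread_list.append(threading.Thread(target=card.get_q_list, args=()))
    # Start all threads
    for t in thread_list:
        t.start()
    # Wait for all threads to finish
    for t in thread_list:
        t.join()


if __name__ == '__main__':
    start_time = time.time()
    main("newpower")
    print("total_time:", time.time() - start_time)
```

**Logic analysis:**
1. Visit the home page http://product.auto.163.com/newpower/#newindex. The first thing that comes to mind is finding out how many list pages there are. When we click to turn the page, the URL does not change, which means we cannot page through the list simply by requesting different URLs.
2. Look for the page count: in the source of the list page, the pagination bar element is nowhere to be found.
   ![](https://box.kancloud.cn/9e5be56e8401d016e99f20610b3c4b10_999x332.jpg)
   ![](https://box.kancloud.cn/a2087ef133590d4947e7f112c4b0ee4b_853x353.jpg)
   From this we can conclude that parsing the page's HTML will not give us the car list or the page count.
3. So how is paging implemented? In the browser's Network panel we can see that clicking a page number sends an XHR request that returns the car list, and one of its request parameters is called page. This tells us that the API returns the total number of cars, and that the page parameter selects each page of the list.
   GET http://product.auto.163.com/energy_api/getEnergySeriesList.action?orderType=0&size=20&page=1
4. From the car list response we can directly read the key information: the car's brand, model, detail page URL, and so on.
5. Visit the detail page through its URL, parse the HTML to find the picture on the detail page, then download the picture to local disk.
6. Use one thread to fetch the car list and put it into the queue. Several threads take car lists from the queue, loop over each list to read the individual car information, and save the corresponding picture to local disk.
7. Adjust the number of download threads (there are 156 pictures in total). A single thread is rather slow, but more threads does not always mean faster. On my machine, 1 thread takes 52 s, 10 threads take 7 s, and 100 threads take 8 s.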
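The paging described in step 3 comes down to a little arithmetic: divide the total reported by the API by the page size and round up. Below is a minimal sketch of that calculation. It assumes the API reports a `total` of 156 (the chapter only says there are 156 pictures, so using that number as the API's `total` is an assumption), and `LIST_URL`, `PAGE_SIZE` and `page_urls` are names made up for this illustration. No request is sent.

```python
import math

# Endpoint copied from the chapter; PAGE_SIZE mirrors its size=20 parameter.
LIST_URL = ("http://product.auto.163.com/energy_api/"
            "getEnergySeriesList.action?orderType=0&size={size}&page={page}")
PAGE_SIZE = 20


def page_urls(total, size=PAGE_SIZE):
    """Return one list-API URL per page needed to cover `total` records."""
    page_count = math.ceil(total / size)
    # Pages are numbered from 1, so the last page is page_count itself.
    return [LIST_URL.format(size=size, page=page)
            for page in range(1, page_count + 1)]


urls = page_urls(156)
print(len(urls))  # 8, because ceil(156 / 20) == 8
print(urls[-1])   # the URL ending in page=8
```

Generating the page numbers with `range(1, page_count + 1)` avoids having to adjust the loop variable inside the loop body.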
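The "END" trick from step 6 can be tried without touching the network. The sketch below is a stripped-down version of the same producer/consumer shutdown pattern. It differs from the reference code in one respect that should be stated plainly: instead of the string "END" it uses a unique `object()` as the marker, so it can never be confused with real data. The names `producer`, `consumer` and `run` are hypothetical, and doubling a number stands in for downloading a picture.

```python
import queue
import threading

END = object()  # unique end marker; cannot collide with real queue data


def producer(q, batches):
    """Put every batch into the queue, then the end marker."""
    for batch in batches:
        q.put(batch)
    q.put(END)


def consumer(q, results, lock):
    """Process batches until the end marker shows up."""
    while True:
        data = q.get()
        if data is END:
            q.put(END)  # pass the marker on so the next worker stops too
            break
        for item in data:
            with lock:
                results.append(item * 2)  # stand-in for the download


def run(batches, worker_count=4):
    """Start one producer and `worker_count` consumers, wait for all."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()
    threads = [threading.Thread(target=producer, args=(q, batches))]
    threads += [threading.Thread(target=consumer, args=(q, results, lock))
                for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


print(run([[1, 2], [3], [4, 5, 6]]))  # [2, 4, 6, 8, 10, 12]
```

Because the queue is first-in, first-out and the marker is put in last, every batch is taken before any worker sees the marker. One marker is left in the queue when all threads have exited, which is harmless here.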