Simulating HTTP Requests · Web Crawler Knowledge Summary

# Simulating an HTTP Request

How do you simulate an HTTP request? The previous chapter gave a basic understanding of the HTTP protocol. Simulating an HTTP request means using code to do what a browser does, which you can also think of as doing what a user does. A crawler is simply an unreasonable user, frantically clicking through a browser on every link on the page that it wants.

### Inspecting an HTTP request with Firebug

1. Install the Firebug plugin. Firebug is only one way to inspect the HTTP requests a browser sends. Many other tools do the same job, so there is no need to be tied to one; Firebug is used here purely as an example.
2. Send a request to https://www.baidu.com
3. Open Firebug and look at the HTTP request parameters.

![](httpresponsecode1.png)

### Simulating an HTTP request in code

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;
import java.net.URL;
import java.util.zip.GZIPInputStream;

/**
 * Fetches the body of a URL as a string.
 *
 * @param url             the URL to request
 * @param defaultEncoding charset to use when the response does not declare one
 * @param timeOut         connect and read timeout, in milliseconds
 * @return the response body, "" for a 404, or null on failure
 */
public static final String getUrlString(URL url, String defaultEncoding, int timeOut) {
    // Set to true to ask the server for a gzip-compressed response.
    boolean gzip = false;
    InputStream in = null;
    HttpURLConnection con = null;
    try {
        con = (HttpURLConnection) url.openConnection();
        con.setReadTimeout(timeOut);
        con.setConnectTimeout(timeOut);

        // Set the HTTP request headers.
        // User-Agent: look up its exact meaning on Baidu or Google; here it is
        // enough to treat it as the marker that makes the request look like it
        // comes from a real user.
        con.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10");
        con.setRequestProperty("Keep-Alive", "115");
        con.setRequestProperty("Connection", "Keep-Alive");
        con.setRequestProperty("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        // The headers above are what simulate a browser request. Firebug shows
        // the exact request and response headers of any real request.
        if (gzip) {
            con.setRequestProperty("Accept-Encoding", "gzip");
        }

        con.connect();
        // HTTP status code and declared body length (-1 if unknown).
        int code = con.getResponseCode();
        int length = con.getContentLength();

        // Take the charset from the Content-Type header, falling back to the default.
        String encoding = defaultEncoding;
        String contentType = con.getHeaderField("Content-Type");
        if (contentType != null) {
            int index = contentType.indexOf("charset=");
            if (index >= 0) {
                encoding = contentType.substring(index + "charset=".length())
                        .replace('"', ' ').replace('\'', ' ').trim();
            }
        }

        // The body of a 404 response is of no use.
        if (code == 404) {
            return "";
        }
        in = new BufferedInputStream(con.getInputStream());

        // Redirect target, if the server sent one (not used further here).
        String location = con.getHeaderField("Location");
        if (location == null) {
            location = "";
        }

        // Check whether the page was sent compressed.
        // For background on gzip compression, see http://kb.cnblogs.com/page/163781/
        String contentEncoding = con.getHeaderField("Content-Encoding");
        if (gzip && "gzip".equals(contentEncoding)) {
            in = new GZIPInputStream(in);
        }

        ByteArrayOutputStream urlData = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) >= 0) {
            // Give up on bodies over 2 MB when the server sent no Content-Length.
            if (urlData.size() > 2097152 && length < 0) {
                return null;
            }
            urlData.write(buf, 0, n);
        }

        if (encoding == null) {
            return urlData.toString();
        }
        try {
            return urlData.toString(encoding);
        } catch (UnsupportedEncodingException e) {
            System.out.println("UnsupportedEncodingException detected: " + e.getMessage());
            // Fall back to the platform default charset.
            return urlData.toString();
        }
    } catch (SocketTimeoutException e) {
        e.printStackTrace();
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
        // Close the error stream so the underlying connection can be released.
        if (con != null) {
            try {
                InputStream err = con.getErrorStream();
                if (err != null) {
                    err.close();
                }
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
    } finally {
        // Closing the stream also releases the connection for reuse.
        if (in != null) {
            try {
                in.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
    return null;
}
```

References:

> http://kb.cnblogs.com/page/163781/ HTTP compression
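
The charset detection inside `getUrlString` is the step that most often produces garbled text, so it can be useful to look at it in isolation. What follows is a minimal sketch and not part of the original code: the class name `CharsetSniffer` and the method `charsetFrom` are hypothetical, and it assumes that falling back to a caller-supplied default is acceptable when the `Content-Type` header is missing or names no charset.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, introduced only for illustration.
public class CharsetSniffer {
    // Matches charset=UTF-8, charset="UTF-8" and Charset='gbk', ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\"';\\s]+)", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the charset named in a Content-Type header value, or
     * fallback when the header is null or carries no charset parameter.
     */
    public static String charsetFrom(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=UTF-8", "GBK")); // prints UTF-8
        System.out.println(charsetFrom("text/html", "GBK"));                // prints GBK
    }
}
```

Using a regular expression instead of `indexOf` plus `substring` is a design choice of this sketch: it tolerates quoting, mixed case, and parameters that follow the charset, at the cost of being slightly less obvious to read.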