Crawling Zongheng Novel Intro Pages · Lucene Case Development

Please credit the source when reposting: http://blog.csdn.net/xiaojimanman/article/details/44851419 http://www.llwjy.com/blogdetail/1b5ae17c513d127838c2e02102b5bb87.html

My personal blog is now online at [www.llwjy.com](http://www.llwjy.com). Feedback is welcome.

-------------------------------------------------

In the previous post we did a simple crawl of the Zongheng Chinese novel update list page and obtained the URLs of the novel intro pages. This post therefore covers collecting the information on the intro page itself. Example page: http://book.zongheng.com/book/362857.html

**Page analysis**

Before starting, it is worth looking at an intro page yourself. The screenshot below shows only the region that contains the information we want to collect.

![img](https://box.kancloud.cn/2016-02-22_56ca7bf05a44f.jpg)

From this region we need the book title, author, category, word count, description, latest chapter name, chapter list page URL, and tags. Right-clicking the page and choosing "View page source" reveals the following:

![img](https://box.kancloud.cn/2016-02-22_56ca7bf0800e6.jpg)

For the sake of SEO on 360 search, Zongheng places some of a novel's key information in the page head, which greatly reduces the complexity of the regular expressions we have to write. Because these expressions are nearly identical, only the book title is explained here; the others can be found in the source code at the end of this post. The title is on **line 33** of the screenshot above, and we need to extract the **飞仙诀** in the middle of it, so the regular expression is `<meta name="og:novel:book_name" content="(.*?)"/>`. The expressions for the other fields follow the same form. From this part of the page source we can easily obtain the title, author, latest chapter, description, category, and chapter list page URL. For the tags and the word count we have to look further down the source. A quick search finds the markup shown in the screenshot below, which contains both fields.

![img](https://box.kancloud.cn/2016-02-22_56ca7bf0aa64f.jpg)

The word count can be obtained with the simple regular expression `<span itemprop="wordCount">(\d*?)</span>`. The tags take two steps.

**Step 1**: extract the HTML that contains the keywords, which is **line 234** in the screenshot above. The regular expression for this step is `<div class="keyword">(.*?)</div>`.

**Step 2**: extract the wanted content from the HTML fragment returned by step 1. The regular expression for this step is `<a.*?>(.*?)</a>`.

**Code implementation**

For every page other than the update list page, the crawler class extends CrawlBase. How to disguise the request is covered in the previous post, so here we focus on two methods of the DoRegex class.

**Method 1:**

~~~
String getFirstString(String dealStr, String regexStr, int n)
~~~

The first parameter is the string to process, here the page source. The second is the regular expression for the content to find. The third is the position of the content to extract within the regular expression; the source code below always passes 1, matching the single parenthesized group in each expression. The method finds the first match of the expression in the given string and returns the requested part of it.

**Method 2:**

~~~
String getString(String dealStr, String regexStr, String splitStr, int n)
~~~

Parameters 1, 2, and 4 correspond to parameters 1, 2, and 3 of method 1, and splitStr is the separator. The method finds the content in the given string that matches the regular expression and returns the matches joined by the separator.

**Run result**

![](https://box.kancloud.cn/2016-02-22_56ca7bf0cfe92.jpg)

**Source code**

With the two methods above explained, the source code below should be easy to follow.

~~~
/**
 *@Description: intro page
 */
package com.lulei.crawl.novel.zongheng;

import java.io.IOException;
import java.util.HashMap;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;
import com.lulei.util.ParseUtil;

public class IntroPage extends CrawlBase {
    // Note: the six meta patterns end with a trailing space, as in the original post,
    // so they match only when a space follows the closing "/>" in the page source.
    private static final String NAME = "<meta name=\"og:novel:book_name\" content=\"(.*?)\"/> ";
    private static final String AUTHOR = "<meta name=\"og:novel:author\" content=\"(.*?)\"/> ";
    private static final String DESC = "<meta property=\"og:description\" content=\"(.*?)\"/> ";
    private static final String TYPE = "<meta name=\"og:novel:category\" content=\"(.*?)\"/> ";
    private static final String LATESTCHAPTER = "<meta name=\"og:novel:latest_chapter_name\" content=\"(.*?)\"/> ";
    private static final String CHAPTERLISTURL = "<meta name=\"og:novel:read_url\" content=\"(.*?)\"/> ";
    private static final String WORDCOUNT = "<span itemprop=\"wordCount\">(\\d*?)</span>";
    // KEYWORDS captures the whole tag container; KEYWORD then captures each tag inside it
    private static final String KEYWORDS = "<div class=\"keyword\">(.*?)</div>";
    private static final String KEYWORD = "<a.*?>(.*?)</a>";

    private String pageUrl;
    private static HashMap<String, String> params;

    /**
     * Add request headers to disguise the request
     */
    static {
        params = new HashMap<String, String>();
        params.put("Referer", "http://book.zongheng.com");
        params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    }

    public IntroPage(String url) throws IOException {
        readPageByGet(url, "utf-8", params);
        this.pageUrl = url;
    }

    /**
     * @return
     * @Author:lulei
     * @Description: get the book title
     */
    private String getName() {
        return DoRegex.getFirstString(getPageSourceCode(), NAME, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: get the author name
     */
    private String getAuthor() {
        return DoRegex.getFirstString(getPageSourceCode(), AUTHOR, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: book description
     */
    private String getDesc() {
        return DoRegex.getFirstString(getPageSourceCode(), DESC, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: book category
     */
    private String getType() {
        return DoRegex.getFirstString(getPageSourceCode(), TYPE, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: latest chapter
     */
    private String getLatestChapter() {
        return DoRegex.getFirstString(getPageSourceCode(), LATESTCHAPTER, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: chapter list page URL
     */
    private String getChapterListUrl() {
        return DoRegex.getFirstString(getPageSourceCode(), CHAPTERLISTURL, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: word count
     */
    private int getWordCount() {
        String wordCount = DoRegex.getFirstString(getPageSourceCode(), WORDCOUNT, 1);
        return ParseUtil.parseStringToInt(wordCount, 0);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: tags
     */
    private String keyWords() {
        // Step 1: the HTML of the keyword container; step 2: the text of each link, joined by spaces
        String keyHtml = DoRegex.getFirstString(getPageSourceCode(), KEYWORDS, 1);
        return DoRegex.getString(keyHtml, KEYWORD, " ", 1);
    }

    public static void main(String[] args) throws IOException {
        IntroPage intro = new IntroPage("http://book.zongheng.com/book/362857.html");
        System.out.println(intro.pageUrl);
        System.out.println(intro.getName());
        System.out.println(intro.getAuthor());
        System.out.println(intro.getDesc());
        System.out.println(intro.getType());
        System.out.println(intro.getLatestChapter());
        System.out.println(intro.getChapterListUrl());
        System.out.println(intro.getWordCount());
        System.out.println(intro.keyWords());
    }
}
~~~

-------------------------------------------------

PS: I recently noticed that other sites may repost this blog without a link to the source. To read more about [Lucene-based case development](http://www.llwjy.com/blogtype/lucene.html), [click here](http://blog.csdn.net/xiaojimanman/article/category/2841877), or visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html
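The DoRegex source is not shown in this post, so for readers who want to try the expressions on their own, here is a minimal sketch of how the two methods described above might behave, written with `java.util.regex` only. Everything in it is an assumption rather than the author's implementation: the class name `RegexSketch`, the `DOTALL` flag, the trimming, the empty-string fallback, and the sample HTML with its placeholder values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {

    // Hypothetical stand-in for DoRegex.getFirstString:
    // returns capture group n of the first match, or "" when nothing matches.
    public static String getFirstString(String dealStr, String regexStr, int n) {
        if (dealStr == null || regexStr == null || n < 0) {
            return "";
        }
        // DOTALL is an assumption so that ".*?" may also span line breaks
        Matcher m = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        if (m.find() && n <= m.groupCount()) {
            String group = m.group(n);
            return group == null ? "" : group.trim();
        }
        return "";
    }

    // Hypothetical stand-in for DoRegex.getString:
    // collects capture group n of every match and joins them with splitStr.
    public static String getString(String dealStr, String regexStr, String splitStr, int n) {
        if (dealStr == null || regexStr == null || splitStr == null || n < 0) {
            return "";
        }
        Matcher m = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        List<String> parts = new ArrayList<>();
        while (m.find()) {
            if (n <= m.groupCount() && m.group(n) != null) {
                parts.add(m.group(n).trim());
            }
        }
        return String.join(splitStr, parts);
    }

    public static void main(String[] args) {
        // Made-up fragment shaped like the markup described in the page analysis
        String html = "<meta name=\"og:novel:book_name\" content=\"Sample Title\"/>\n"
                + "<span itemprop=\"wordCount\">123456</span>\n"
                + "<div class=\"keyword\"><a href=\"#\">tagA</a><a href=\"#\">tagB</a></div>";

        String name = getFirstString(html, "<meta name=\"og:novel:book_name\" content=\"(.*?)\"/>", 1);
        String words = getFirstString(html, "<span itemprop=\"wordCount\">(\\d*?)</span>", 1);
        // Two-step tag extraction: container first, then each link text
        String keyHtml = getFirstString(html, "<div class=\"keyword\">(.*?)</div>", 1);
        String tags = getString(keyHtml, "<a.*?>(.*?)</a>", " ", 1);

        System.out.println(name);  // Sample Title
        System.out.println(words); // 123456
        System.out.println(tags);  // tagA tagB
    }
}
```

Returning an empty string instead of null on a failed match is a design choice of this sketch: it lets the second step of the tag extraction run safely even when the keyword container is missing from the page.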