Crawling the Zongheng Novel Reading Page · Lucene Case Development

Please credit the source when reposting: [http://blog.csdn.net/xiaojimanman/article/details/44937073](http://blog.csdn.net/xiaojimanman/article/details/44937073) [http://www.llwjy.com/blogdetail/29bd8de30e8d17871c707b76ec3212b0.html](http://www.llwjy.com/blogdetail/29bd8de30e8d17871c707b76ec3212b0.html)

My personal blog is now online at [www.llwjy.com](http://www.llwjy.com). Feedback is welcome.

-------------------------------------------------------------------------------------------------

The previous three posts covered crawling Zongheng's update list page, synopsis page, and chapter list page. This post focuses on the most important one, the reading page. As before, we work through a single example URL: http://book.zongheng.com/chapter/362857/6001264.html

**Page analysis**

The page at the URL above looks like this:

![img](https://box.kancloud.cn/2016-02-22_56ca7bf1d830c.jpg)

As with the chapter list page, the source of the reading page is not available through the usual right-click --> View page source, so we again use **F12 --> Network --> Ctrl+F5** to find it. The result looks like this:

![img](https://box.kancloud.cn/2016-02-22_56ca7bf217a75.jpg)

A quick search of the page source locates the title, the word count, and the chapter content on **lines 47, 141, and 145** respectively (the line numbers may differ slightly from page to page, so check them against the page you are actually crawling). The regular expressions for these three fields are much like those in the earlier posts, and the method was explained there, so only the final results are given here:

~~~
// Chapter content regex
private static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
// Title regex
private static final String TITLE = "chapterName=\"(.*?)\"";
// Word count regex
private static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";
~~~

**Run result**

![img](https://box.kancloud.cn/2016-02-22_56ca7bf25c180.jpg)

Looking at the screenshot, you may notice that the chapter content still contains some HTML tags. Because this case study ultimately displays the content on a web page, I took a shortcut and left them in. If you need to remove them, you can replace them directly with String's replaceAll method.

**Source code**

For the latest source code, see: http://www.llwjy.com/source/com.lulei.crawl.novel.zongheng.ReadPage.html

~~~
/**
 * @Description: Reading page
 */
package com.lulei.crawl.novel.zongheng;

import java.io.IOException;
import java.util.HashMap;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;
import com.lulei.util.ParseUtil;

public class ReadPage extends CrawlBase {
	private static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
	private static final String TITLE = "chapterName=\"(.*?)\"";
	private static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";
	private String pageUrl;
	private static HashMap<String, String> params;

	/**
	 * Add request headers so the request looks like it comes from a browser
	 */
	static {
		params = new HashMap<String, String>();
		params.put("Referer", "http://book.zongheng.com");
		params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
	}

	public ReadPage(String url) throws IOException {
		readPageByGet(url, "utf-8", params);
		this.pageUrl = url;
	}

	/**
	 * @return
	 * @Author:lulei
	 * @Description: Chapter title
	 */
	private String getTitle() {
		return DoRegex.getFirstString(getPageSourceCode(), TITLE, 1);
	}

	/**
	 * @return
	 * @Author:lulei
	 * @Description: Word count (0 if it cannot be parsed)
	 */
	private int getWordCount() {
		String wordCount = DoRegex.getFirstString(getPageSourceCode(), WORDCOUNT, 1);
		return ParseUtil.parseStringToInt(wordCount, 0);
	}

	/**
	 * @return
	 * @Author:lulei
	 * @Description: Chapter body
	 */
	private String getContent() {
		return DoRegex.getFirstString(getPageSourceCode(), CONTENT, 1);
	}

	public static void main(String[] args) throws IOException {
		ReadPage readPage = new ReadPage("http://book.zongheng.com/chapter/362857/6001264.html");
		System.out.println(readPage.pageUrl);
		System.out.println(readPage.getTitle());
		System.out.println(readPage.getWordCount());
		System.out.println(readPage.getContent());
	}
}
~~~

----------------------------------------------------------------------------------------------------

P.S. I recently noticed that other sites may repost these posts without a link back to the source. For more on [Lucene-based case development](http://www.llwjy.com/blogtype/lucene.html), [click here](http://blog.csdn.net/xiaojimanman/article/category/2841877), or visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html
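
The post mentions that leftover HTML tags in the chapter content can be removed with String's replaceAll, but does not show it. Below is a minimal sketch of that idea using only the JDK. Several things here are assumptions and not part of the original project: `firstGroup` is a hypothetical stand-in for `DoRegex.getFirstString` (whose implementation is not shown in this post), `SAMPLE` is a made-up HTML fragment shaped to fit the three regexes, and the tag pattern `<[^>]+>` is a simple approximation that will not handle every kind of HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReadPageRegexSketch {
    // The three patterns from the post, unchanged
    static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
    static final String TITLE = "chapterName=\"(.*?)\"";
    static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";

    // Made-up page fragment for illustration only (not real Zongheng markup)
    static final String SAMPLE =
            "<h1 chapterName=\"Chapter 1\"></h1>\n"
            + "<span itemprop=\"wordCount\">2048</span>\n"
            + "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">"
            + "<p>First paragraph.</p>\n<p>Second paragraph.</p></div>";

    /**
     * Hypothetical stand-in for DoRegex.getFirstString:
     * returns the requested group of the first match, or "" if nothing matches.
     */
    static String firstGroup(String source, String regex, int group) {
        // DOTALL lets '.' match line breaks, since chapter text usually spans several lines
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(source);
        return m.find() ? m.group(group) : "";
    }

    /** Removes anything that looks like an HTML tag, then trims surrounding whitespace. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", "").trim();
    }

    public static void main(String[] args) {
        String title = firstGroup(SAMPLE, TITLE, 1);
        String wordCount = firstGroup(SAMPLE, WORDCOUNT, 1);
        String content = firstGroup(SAMPLE, CONTENT, 1);

        System.out.println(title);
        System.out.println(wordCount);
        System.out.println(stripTags(content));
    }
}
```

In the `ReadPage` class, the same `replaceAll` call could be applied to the value returned by `getContent()` when plain text is wanted. For messy real-world markup, a proper HTML parser is usually a safer choice than a regex.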