# Introduction to HtmlParser
<div><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:24px;">1. Related resources</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
Official documentation: http://htmlparser.sourceforge.net/samples.html</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
API reference: http://htmlparser.sourceforge.net/javadoc/index.html</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
Other HTML parsers: jsoup, among others. HtmlParser has not been updated since 2006, so many people now recommend jsoup as a replacement.</p><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:24px;">2. Key steps for using HtmlParser</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(1) Create a parser with the Parser class</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(2) Create a Filter or a Visitor</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(3) Have the parser use the filter or visitor to collect every node that matches</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(4) Process the content of the collected nodes</p><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:24px;">3. Creating a parser with the Parser constructors</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;"></p><table border="1" cellpadding="2" cellspacing="0" width="100%" style="color: rgb(0, 0, 0); font-family: Simsun; font-size: 14px;"><tbody><tr style="background-color:rgb(238,238,238);"><td style="height: 41px;"><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>()</code> <br>
Zero argument constructor.</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28org.htmlparser.lexer.Lexer%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class in org.htmlparser.lexer" href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/lexer/Lexer.html" target="_blank" style="color:rgb(106,57,6);">Lexer</a> lexer)</code> <br>
Construct a parser using the provided lexer.</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28org.htmlparser.lexer.Lexer,%20org.htmlparser.util.ParserFeedback%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class in org.htmlparser.lexer" href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/lexer/Lexer.html" target="_blank" style="color:rgb(106,57,6);">Lexer</a> lexer, <a title="interface in org.htmlparser.util" href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/util/ParserFeedback.html" target="_blank" style="color:rgb(106,57,6);">ParserFeedback</a> fb)</code> <br>
Construct a parser using the provided lexer and feedback object.</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28java.lang.String%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class or interface in java.lang" href="http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html" target="_blank" style="color:rgb(106,57,6);">String</a> resource)</code> <br>
Creates a Parser object with the location of the resource (URL or file).</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28java.lang.String,%20org.htmlparser.util.ParserFeedback%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class or interface in java.lang" href="http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html" target="_blank" style="color:rgb(106,57,6);">String</a> resource, <a title="interface in org.htmlparser.util" href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/util/ParserFeedback.html" target="_blank" style="color:rgb(106,57,6);">ParserFeedback</a> feedback)</code> <br>
Creates a Parser object with the location of the resource (URL
or file). You would typically create a DefaultHTMLParserFeedback object
and pass it in.</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28java.net.URLConnection%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class or interface in java.net" href="http://java.sun.com/j2se/1.4.2/docs/api/java/net/URLConnection.html" target="_blank" style="color:rgb(106,57,6);">URLConnection</a> connection)</code> <br>
Construct a parser using the provided URLConnection.</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28java.net.URLConnection,%20org.htmlparser.util.ParserFeedback%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class or interface in java.net" href="http://java.sun.com/j2se/1.4.2/docs/api/java/net/URLConnection.html" target="_blank" style="color:rgb(106,57,6);">URLConnection</a> connection, <a title="interface in org.htmlparser.util" href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/util/ParserFeedback.html" target="_blank" style="color:rgb(106,57,6);">ParserFeedback</a> fb)</code> <br>
Constructor for custom HTTP access.</td></tr></tbody></table><span style="color: rgb(54, 46, 43); font-family: Arial;"> For most users, the usual ways to initialize a Parser are from a </span><span style="color: blue; font-family: Arial;">URLConnection</span><span style="color: rgb(54, 46, 43); font-family: Arial;"> or from a string holding the page content, or by calling a static factory method that returns a Parser object. </span><span style="color: blue; font-family: Arial;">ParserFeedback</span><span style="color: rgb(54, 46, 43); font-family: Arial;"> is simple code aimed at debugging and tracing the parse, and it rarely needs changing. Using a </span><span style="color: green; font-family: Arial;">Lexer</span><span style="color: rgb(54, 46, 43); font-family: Arial;"> is a comparatively advanced topic and is left for later.</span><br style="color: rgb(54, 46, 43); font-family: Arial;"><span style="color: rgb(54, 46, 43); font-family: Arial;"> One point worth noting: none of these constructors takes a page encoding. Without a Lexer, the encoding is supplied through the static factory method (Parser.createParser), or set after construction, as the example in section 5 does with parser.setEncoding("gb2312"). Most Chinese pages are likely to need one of these.</span><br style="color: rgb(54, 46, 43); font-family: Arial;"><p style="color: rgb(54, 46, 43); font-family: Arial;"></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:24px;">4. HtmlParser stores each node's information in a Node object</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><img src="http://note.youdao.com/yws/res/10738/977917BD60E34D578F9EB0747420F7BB" data-media-type="image" /><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(1) Methods for navigating between nodes<br>
Node <span style="color:blue;">getParent</span>(): returns the parent node<br>
NodeList <span style="color:blue;">getChildren</span>(): returns the list of child nodes<br>
Node <span style="color:blue;">getFirstChild</span>(): returns the first child node<br>
Node <span style="color:blue;">getLastChild</span>(): returns the last child node<br>
Node <span style="color:blue;">getPreviousSibling</span>(): returns the previous sibling node<br>
Node <span style="color:blue;">getNextSibling</span>(): returns the next sibling node<br>
(2) Methods for reading the content of a <span style="color:fuchsia;">Node</span><br>
String <span style="color:blue;">getText</span>(): returns the node's text<br>
String <span style="color:blue;">toPlainTextString</span>(): returns the plain-text content<br>
String <span style="color:blue;">toHtml</span>(): returns the node's <span style="color:green;">HTML</span> (the raw <span style="color:green;">HTML</span>)<br>
String <span style="color:blue;">toHtml</span>(boolean verbatim): returns the node's <span style="color:green;">HTML</span>; with verbatim set, the result stays as close as possible to the original page text<br>
String <span style="color:blue;">toString</span>(): returns a string representation of the node<br>
Page <span style="color:blue;">getPage</span>(): returns the <span style="color:green;">Page</span> object this <span style="color:green;">Node</span> belongs to<br>
int <span style="color:blue;">getStartPosition</span>(): returns the start position of this <span style="color:green;">Node</span> in the <span style="color:green;">HTML</span> page<br>
int <span style="color:blue;">getEndPosition</span>(): returns the end position of this <span style="color:green;">Node</span> in the <span style="color:green;">HTML</span> page</p><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:24px;">5. Using Filters to reach Nodes and their content</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:18px;">(1) Types of Filter</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
As the name suggests, a Filter screens the parse results and keeps only the content you need.</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
Every Filter implements the NodeFilter interface, which has a single method, boolean accept(Node node), that decides whether a given node passes the filter.</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
HTMLParser defines 16 filters in the org.htmlparser.filters package, and they fall into a few groups.<br><span style="color:green;"><a href="http://www.baizeju.com/html/HTMLParser/200807/07-121.html#%E5%88%A4%E6%96%AD%E7%B1%BBFilter" target="_blank" style="color:rgb(16,138,198);"><strong>Condition <span style="color:green;">Filter</span>s:</strong></a></span><br><span style="color:blue;">TagNameFilter</span><span style="color:blue;"><br>
HasAttributeFilter</span><br>
HasChildFilter<br>
HasParentFilter<br>
HasSiblingFilter<br>
IsEqualFilter<br><span style="color:green;"><a href="http://www.baizeju.com/html/HTMLParser/200807/07-121.html#%E9%80%BB%E8%BE%91%E8%BF%90%E7%AE%97Filter" target="_blank" style="color:rgb(16,138,198);"><strong>Logical <span style="color:green;">Filter</span>s:</strong></a></span><br><span style="color:blue;">AndFilter</span><span style="color:blue;"><br>
NotFilter</span><br>
OrFilter<br>
XorFilter<br><span style="color:green;"><a href="http://www.baizeju.com/html/HTMLParser/200807/07-121.html#%E5%85%B6%E4%BB%96Filter" target="_blank" style="color:rgb(16,138,198);"><strong>Other <span style="color:green;">Filter</span>s:</strong></a></span><br><span style="color:blue;">NodeClassFilter</span><span style="color:blue;"><br>
StringFilter</span><br>
LinkStringFilter<br>
LinkRegexFilter<br>
RegexFilter<br>
CssSelectorNodeFilter</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
Beyond these, you can define your own Filters to handle special filtering needs.<br><span style="font-size:18px;">(2) Filter usage example</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
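Before the full program, here is a small model of how the logical filters listed above combine other filters. It is only an analogy written with <code>java.util.function.Predicate</code> from the standard library, not HtmlParser code: a tag is modeled as its text, and the names <code>FilterLogicSketch</code>, <code>tagName</code>, <code>hasAttribute</code> and <code>xor</code> are hypothetical helpers that stand in for TagNameFilter, HasAttributeFilter and XorFilter.</p>

```java
import java.util.function.Predicate;

// Stdlib-only model of filter composition; this is NOT HtmlParser code.
// A tag is represented by its text, e.g. "a href=\"x.html\"".
class FilterLogicSketch {

    // Stands in for a condition filter such as TagNameFilter.
    static Predicate<String> tagName(String name) {
        return text -> text.equals(name) || text.startsWith(name + " ");
    }

    // Stands in for HasAttributeFilter.
    static Predicate<String> hasAttribute(String attr) {
        return text -> text.contains(" " + attr + "=");
    }

    // Predicate has and/or/negate built in; exclusive-or needs a helper.
    static <T> Predicate<T> xor(Predicate<T> p, Predicate<T> q) {
        return t -> p.test(t) ^ q.test(t);
    }

    public static void main(String[] args) {
        Predicate<String> linkWithHref = tagName("a").and(hasAttribute("href")); // like AndFilter
        Predicate<String> linkOrFrame = linkWithHref.or(tagName("frame"));       // like OrFilter
        Predicate<String> notFrame = tagName("frame").negate();                  // like NotFilter

        String sample = "frame src=\"menu.html\"";
        System.out.println(linkWithHref.test(sample));
        System.out.println(linkOrFrame.test(sample));
        System.out.println(notFrame.test(sample));
    }
}
```

<p style="color: rgb(54, 46, 43); font-family: Arial;">In HtmlParser itself the same composition is expressed by passing filters to the constructors of the logical filters, as the OrFilter in the program below does.</p><p style="color: rgb(54, 46, 43); font-family: Arial;">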
The following example extracts the links from an HTML document:</p>

<pre><code>package org.ljh.search.html;
//
import java.util.HashSet;
import java.util.Set;
//
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
//
// Utility class for parsing HTML documents.
public class HtmlParserTool {
    // Extracts the links embedded in the HTML document at the given URL.
    public static Set&lt;String&gt; extractLinks(String url, LinkFilter filter) {
        Set&lt;String&gt; links = new HashSet&lt;String&gt;();
        try {
            // 1. Create a Parser and configure it.
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");
            // 2.1 Custom filter that accepts &lt;frame&gt; tags; their src value is read in step 4.
            NodeFilter frameNodeFilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    return node.getText().startsWith("frame src=");
                }
            };
            // 2.2 Second filter: accepts &lt;a&gt; tags.
            NodeFilter aNodeFilter = new NodeClassFilter(LinkTag.class);
            // 2.3 Combine the two filters above into one logical filter.
            OrFilter linkFilter = new OrFilter(frameNodeFilter, aNodeFilter);
            // 3. Have the parser collect every node that matches the filter.
            NodeList nodeList = parser.extractAllNodesThatMatch(linkFilter);
            // 4. Process the matched nodes.
            for (int i = 0; i &lt; nodeList.size(); i++) {
                Node node = nodeList.elementAt(i);
                String linkURL = "";
                if (node instanceof LinkTag) {
                    // The node is an &lt;a&gt; tag.
                    LinkTag link = (LinkTag) node;
                    linkURL = link.getLink();
                } else {
                    // The node is a &lt;frame&gt; tag: cut the quoted src value out of its text.
                    String nodeText = node.getText();
                    int beginPosition = nodeText.indexOf("src=");
                    nodeText = nodeText.substring(beginPosition);
                    int endPosition = nodeText.indexOf(" ");
                    if (endPosition == -1) {
                        endPosition = nodeText.indexOf("&gt;");
                    }
                    if (endPosition == -1) {
                        // Neither a space nor "&gt;" follows the value: it runs to the end of the text.
                        endPosition = nodeText.length();
                    }
                    linkURL = nodeText.substring(5, endPosition - 1);
                }
                // Keep the URL only if it belongs to the current search scope.
                if (filter.accept(linkURL)) {
                    links.add(linkURL);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}
</code></pre>
<p style="color: rgb(54, 46, 43); font-family: Arial;">
Notes on the program:</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(1) Node#getText() returns the node's text as a String.</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(2) node instanceof LinkTag tests for an &lt;a&gt; node. There are many similar node classes, such as TableTag; essentially every common HTML tag has a corresponding tag class. The official documentation describes the packages as follows:</p><p style="color: rgb(54, 46, 43); font-family: Arial;"></p><table border="1" cellpadding="2" cellspacing="0" width="100%" style="color: rgb(0, 0, 0); font-family: Simsun; font-size: 14px;"><tbody><tr><td><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/nodes/package-summary.html" target="_blank" style="color:rgb(106,57,6);">org.htmlparser.nodes</a></strong></td><td>The nodes package has the concrete node implementations.</td></tr><tr><td><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/tags/package-summary.html" target="_blank" style="color:rgb(106,57,6);">org.htmlparser.tags</a></strong></td><td>The tags package contains specific tags.</td></tr></tbody></table><span style="color: rgb(54, 46, 43); font-family: Arial;">An instanceof check can therefore tell directly whether a node is a particular tag.</span><p style="color: rgb(54, 46, 43); font-family: Arial;"></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
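A side note on (1): the frame branch of the program above slices the text returned by <code>getText()</code> with fixed offsets (<code>substring(5, endPosition - 1)</code>), which presumes a quoted value placed right after <code>src=</code>. The sketch below shows one possible, more tolerant way to pull the value out using only <code>java.util.regex</code>. The names <code>SrcAttributeSketch</code> and <code>extractSrc</code> are hypothetical, the input is assumed to be tag text without angle brackets, and nothing here is part of HtmlParser.</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads the src value from tag text such as
// frame src="menu.html" name="left"
class SrcAttributeSketch {

    // Accepts src="double quoted", src='single quoted' or src=bare.
    private static final Pattern SRC = Pattern.compile(
            "\\bsrc\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns the src value, or an empty string when the text has no src attribute.
    static String extractSrc(String tagText) {
        Matcher m = SRC.matcher(tagText);
        if (!m.find()) {
            return "";
        }
        // Exactly one of the three capture groups is non-null after a match.
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(extractSrc("frame src=\"menu.html\" name=\"left\""));
        System.out.println(extractSrc("frame name=main src=main.html"));
    }
}
```

<p style="color: rgb(54, 46, 43); font-family: Arial;">Unlike the fixed-offset version, this does not depend on where <code>src</code> sits in the tag or on how its value is quoted.</p><p style="color: rgb(54, 46, 43); font-family: Arial;">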
The LinkFilter interface used above is defined as follows:</p>

<pre><code>package org.ljh.search.html;
//
// A filter of this type decides whether a URL belongs to the current search scope.
public interface LinkFilter {
    public boolean accept(String url);
}
</code></pre>
<p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
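Because LinkFilter declares a single method, on Java 8 or later it can also be written as a lambda instead of an anonymous class. The following is a minimal standard-library sketch of that idea: the interface is repeated so the example compiles on its own, and <code>LinkFilterSketch</code>, <code>keepAccepted</code> and the sample URLs are hypothetical.</p>

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Same shape as the LinkFilter interface above, repeated so the sketch compiles alone.
interface LinkFilter {
    boolean accept(String url);
}

class LinkFilterSketch {

    // Keeps only the URLs the filter accepts, preserving their input order.
    static Set<String> keepAccepted(List<String> urls, LinkFilter filter) {
        Set<String> kept = new LinkedHashSet<>();
        for (String url : urls) {
            if (filter.accept(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A lambda is enough because LinkFilter has exactly one abstract method.
        LinkFilter sameSite = url -> url.contains("baidu");
        List<String> found = Arrays.asList(
                "http://news.baidu.com", "http://example.com/", "http://map.baidu.com");
        System.out.println(keepAccepted(found, sameSite));
        // prints [http://news.baidu.com, http://map.baidu.com]
    }
}
```

<p style="color: rgb(54, 46, 43); font-family: Arial;">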
The test program is as follows:</p>

<pre><code>package org.ljh.search.html;
//
import java.util.Set;
//
import org.junit.Test;
//
public class HtmlParserToolTest {
    @Test
    public void testExtractLinks() {
        String url = "http://www.baidu.com";
        // Accept only URLs that contain "baidu".
        LinkFilter linkFilter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                return url.contains("baidu");
            }
        };
        Set&lt;String&gt; urlSet = HtmlParserTool.extractLinks(url, linkFilter);
        // Print every extracted link.
        for (String link : urlSet) {
            System.out.println(link);
        }
    }
}
</code></pre>
<p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><span style="color: rgb(54, 46, 43); font-family: Arial;">The output is as follows:</span><p style="color: rgb(54, 46, 43); font-family: Arial;"></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
http://www.hao123.com<br>
http://www.baidu.com/<br>
http://www.baidu.com/duty/<br>
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=<br>
http://music.baidu.com<br>
http://ir.baidu.com<br>
http://www.baidu.com/gaoji/preferences.html<br>
http://news.baidu.com<br>
http://map.baidu.com<br>
http://music.baidu.com/search?fr=ps&key=<br>
http://image.baidu.com<br>
http://zhidao.baidu.com<br>
http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=<br>
http://www.baidu.com/more/<br>
http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w<br>
http://wenku.baidu.com<br>
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=<br>
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F<br>
http://www.baidu.com/cache/sethelp/index.html<br>
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt<br>
http://tieba.baidu.com/f?kw=&fr=wwwt<br>
http://home.baidu.com<br>
https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F<br>
http://v.baidu.com<br>
http://e.baidu.com/?refer=888<br>
;<br>
http://tieba.baidu.com<br>
http://baike.baidu.com<br>
http://wenku.baidu.com/search?word=&lm=0&od=0<br>
http://top.baidu.com<br>
http://map.baidu.com/m?word=&fr=ps01000</p></div>