Working with URLs · Web Crawler Knowledge Compendium

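# Working with URLs
<h2>Problem</h2>
<p>You have an HTML document that contains relative URLs, and you need to convert them to absolute URLs.</p>
<h2>Solution</h2>
<ol>
<li>Make sure a <code>base URI</code> is specified when you parse the document, and</li>
<li>use the <code>abs:</code> attribute prefix to get the absolute URL resolved against that <code>base URI</code>. For example:</li>
</ol>
<pre><code>import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("http://www.open-open.com").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://www.open-open.com/"
</code></pre>
<h2>Description</h2>
<p>In HTML elements, URLs are often written relative to the document's location: <code>&lt;a href="/download"&gt;...&lt;/a&gt;</code>. When you use the <code><a title="Get an attribute's value by its key." href="http://jsoup.org/apidocs/org/jsoup/nodes/Node.html#attr%28java.lang.String%29">Node.attr(String key)</a></code> method to read the <code>href</code> attribute of an <code>a</code> element, it returns the value exactly as it was specified in the HTML source.</p>
<p>If you need an absolute URL, add the <code>abs:</code> prefix to the attribute name. The call <code>attr("abs:href")</code> then returns the URL resolved against the document's base URI.</p>
<p>For this reason, it is important to define the base URI when you parse an HTML document. When the document is loaded with <code>Jsoup.connect(...).get()</code>, as in the example above, the base URI is taken from the URL the document was fetched from.</p>
<p>If you prefer not to use the <code>abs:</code> prefix, the method <code><a title="Get an absolute URL from a URL attribute that may be relative (i.e." href="http://jsoup.org/apidocs/org/jsoup/nodes/Node.html#absUrl%28java.lang.String%29">Node.absUrl(String key)</a></code> does the same thing.</p></div>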
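
To make the role of the base URI more concrete, here is a minimal sketch of the kind of resolution an `abs:` lookup has to perform. It uses only `java.net.URI` from the standard library as a stand-in and is not jsoup's own implementation. The class name `BaseUriDemo`, the helper `resolve`, and the base URL `http://www.open-open.com/docs/index.html` are assumptions made up for illustration.

```java
import java.net.URI;

public class BaseUriDemo {

    /**
     * Hypothetical helper: resolves an href against a base URI.
     * Illustrates the idea behind "abs:" lookups; jsoup's real logic may
     * differ in edge cases (for example, malformed hrefs, which make
     * URI.create throw IllegalArgumentException here).
     */
    public static String resolve(String baseUri, String href) {
        URI target = URI.create(href);
        if (baseUri == null || baseUri.isEmpty()) {
            // Without a base URI, only an already-absolute href can be returned.
            return target.isAbsolute() ? target.toString() : "";
        }
        return URI.create(baseUri).resolve(target).toString();
    }

    public static void main(String[] args) {
        // Assumed base URI, chosen only for this example.
        String base = "http://www.open-open.com/docs/index.html";

        // Path-absolute href: replaces the whole path of the base.
        System.out.println(resolve(base, "/download"));
        // Relative href: resolved against the directory of the base document.
        System.out.println(resolve(base, "guide.html"));
        // Already absolute href: returned unchanged.
        System.out.println(resolve(base, "http://jsoup.org/"));
        // No base URI: a relative href cannot be made absolute.
        System.out.println(resolve("", "/download").isEmpty());
    }
}
```

The last call shows why the base URI matters: without one, a relative link has nothing to be resolved against, so the sketch returns an empty string.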