Navigating the tree · Beautiful Soup 4.2.0 documentation

# Navigating the tree

Here is the "Alice in Wonderland" document again, used as the running example:

```
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
```

The examples below use this document to show how to move from one part of a document to another.

## Child nodes

A tag may contain strings and other tags. These are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

Note: strings in Beautiful Soup do not support any of these attributes, because a string cannot have children.

### Navigating using tag names

The simplest way to navigate the parse tree is to name the tag you want. To get the `<head>` tag, use `soup.head`:

```
soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>
```

You can chain this trick to zoom in on a particular part of the tree. This code gets the first `<b>` tag inside the `<body>` tag:

```
soup.body.b
# <b>The Dormouse's story</b>
```

Using a tag name as an attribute gives you only the first tag with that name:

```
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```

To get all the `<a>` tags, or anything more complicated than the first tag with a given name, use one of the methods described in `Searching the tree`, such as `find_all()`:

```
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

### .contents and .children

A tag's `.contents` attribute returns the tag's children as a list:

```
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
```

The `BeautifulSoup` object itself has children. In this case, the `<html>` tag is the child of the `BeautifulSoup` object:

```
len(soup.contents)
# 1
soup.contents[0].name
# u'html'
```

A string has no `.contents` attribute, because it cannot contain anything:

```
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
```

Instead of getting the children as a list, you can iterate over them with the `.children` generator:

```
for child in title_tag.children:
    print(child)
# The Dormouse's story
```

### .descendants

The `.contents` and `.children` attributes cover only a tag's direct children. For example, the `<head>` tag has a single direct child, the `<title>` tag:

```
head_tag.contents
# [<title>The Dormouse's story</title>]
```

But the `<title>` tag has a child of its own: the string "The Dormouse's story". In a sense that string is also a descendant of the `<head>` tag. The `.descendants` attribute iterates recursively over all of a tag's descendants \[5\]: its direct children, the children of those children, and so on:

```
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
```

In this example the `<head>` tag has only one child but two descendants: the `<title>` tag and the `<title>` tag's child string. The `BeautifulSoup` object has one direct child (the `<html>` tag) but many descendants:

```
len(list(soup.children))
# 1
len(list(soup.descendants))
# 25
```

### .string

If a tag has exactly one child and that child is a `NavigableString`, the child is available as `.string`:

```
title_tag.string
# u'The Dormouse's story'
```

If a tag's only child is another tag, and that child tag has a `.string`, the parent tag has the same `.string` as its child:

```
head_tag.contents
# [<title>The Dormouse's story</title>]

head_tag.string
# u'The Dormouse's story'
```

If a tag contains more than one child, it is not clear which child `.string` should refer to, so `.string` is `None`:

```
print(soup.html.string)
# None
```

### .strings and .stripped_strings

If a tag contains more than one string \[2\], you can iterate over them with the `.strings` generator:

```
for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
```

These strings tend to carry a lot of extra whitespace. Use `.stripped_strings` to remove it:

```
for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'
```

Strings that consist entirely of whitespace are skipped, and whitespace at the beginning and end of each string is removed. Whitespace inside a string, such as the `\n` after the semicolon above, is kept.

## Parent nodes

Continuing up the tree: every tag and every string has a parent, the tag that contains it.

### .parent

Use the `.parent` attribute to get an element's parent. In the "Alice" document, the `<head>` tag is the parent of the `<title>` tag:

```
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
```

The title string itself has a parent too: the `<title>` tag that contains it:

```
title_tag.string.parent
# <title>The Dormouse's story</title>
```

The parent of a top-level tag such as `<html>` is the `BeautifulSoup` object:

```
html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>
```

The `.parent` of the `BeautifulSoup` object is `None`:

```
print(soup.parent)
# None
```

### .parents

The `.parents` attribute iterates over all of an element's ancestors. The example below uses `.parents` to travel from an `<a>` tag buried deep in the document up to the root:

```
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# p
# body
# html
# [document]
# None
```

## Sibling nodes

Consider a simple document:

```
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>
```

The `<b>` tag and the `<c>` tag are at the same level: both are direct children of the same tag, so they are called siblings. When a document is pretty-printed, siblings appear at the same indentation level. You can use this relationship in your code as well.

### .next_sibling and .previous_sibling

Use `.next_sibling` and `.previous_sibling` to move between elements on the same level of the parse tree:

```
sibling_soup.b.next_sibling
# <c>text2</c>

sibling_soup.c.previous_sibling
# <b>text1</b>
```

The `<b>` tag has a `.next_sibling` but its `.previous_sibling` is `None`, because nothing precedes it on the same level. Likewise, the `<c>` tag has a `.previous_sibling` but its `.next_sibling` is `None`:

```
print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None
```

The strings "text1" and "text2" are not siblings, because they do not have the same parent:

```
sibling_soup.b.string
# u'text1'

print(sibling_soup.b.string.next_sibling)
# None
```

In real documents, the `.next_sibling` or `.previous_sibling` of a tag is usually a string, often one that contains only whitespace.
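Because the nearest sibling of a tag is so often a string, code that wants the next sibling *tag* has to skip over the strings in between. The block below is a minimal sketch of that idea, not part of the original documentation: the helper name `next_tag_sibling` and the shortened three-link fragment are illustrative assumptions, and the parser is pinned to `"html.parser"` so the sketch does not depend on an optional parser being installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A shortened stand-in for the paragraph of links in the example document.
html = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a>;</p>'
)
soup = BeautifulSoup(html, "html.parser")


def next_tag_sibling(node):
    """Hypothetical helper: first following sibling that is a Tag, else None."""
    for sibling in node.next_siblings:
        if isinstance(sibling, Tag):
            return sibling
    return None


first = soup.find(id="link1")
second = next_tag_sibling(first)

print(repr(first.next_sibling))  # ',\n'  (a string, not the second link)
print(second["id"])              # link2
```

The helper walks `.next_siblings` and returns at the first `Tag` it meets, so the comma-and-newline string between the links is skipped.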
Take another look at the "Alice" document:

```
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
```

You might expect the `.next_sibling` of the first `<a>` tag to be the second `<a>` tag. It is not. It is a string: the comma and newline that separate the first `<a>` tag from the second:

```
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# u',\n'
```

The second `<a>` tag is the `.next_sibling` of that comma string:

```
link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
```

### .next_siblings and .previous_siblings

Use `.next_siblings` and `.previous_siblings` to iterate over the siblings of the current node:

```
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# u',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# u' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# u';\nand they lived at the bottom of a well.'
# None

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# u' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# u',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# u'Once upon a time there were three little sisters; and their names were\n'
# None
```

## Going back and forth

Look at the beginning of the "Alice" document:

```
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
```

An HTML parser turns this string into a series of events: "open an `<html>` tag", "open a `<head>` tag", "open a `<title>` tag", "add a string", "close the `<title>` tag", "open a `<p>` tag", and so on. Beautiful Soup provides attributes for reconstructing the initial parse of the document.

### .next_element and .previous_element

The `.next_element` attribute of a string or tag points to whatever was parsed immediately afterwards. It may be the same as `.next_sibling`, but usually it is very different.

Here is the last `<a>` tag in the "Alice" document. Its `.next_sibling` is a string: the rest of the sentence that was interrupted by the start of the `<a>` tag \[2\]:

```
last_a_tag = soup.find("a", id="link3")
last_a_tag
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_a_tag.next_sibling
# u';\nand they lived at the bottom of a well.'
```

The `.next_element` of that `<a>` tag, however, is whatever was parsed immediately after the opening `<a>` tag. That is not the rest of the sentence but the string "Tillie":

```
last_a_tag.next_element
# u'Tillie'
```

This is because in the original markup the string "Tillie" appears before the semicolon. The parser encountered the opening `<a>` tag, then the string "Tillie", then the closing `</a>` tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the `<a>` tag, but the string "Tillie" was parsed first.

The `.previous_element` attribute is the exact opposite of `.next_element`. It points to whatever was parsed immediately before the current object:

```
last_a_tag.previous_element
# u' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```

### .next_elements and .previous_elements

The `.next_elements` and `.previous_elements` iterators let you move forward or backward through the document in the order it was parsed, as if the parse were happening again:

```
for element in last_a_tag.next_elements:
    print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
# None
```
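To make the "series of events" idea concrete, the sketch below records the events that Python's standard-library `html.parser.HTMLParser` reports for the first line of the example document. It illustrates parse order only and does not show how Beautiful Soup is implemented; the class name `EventLogger` and the event labels are assumptions made for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: collects parser callbacks as (kind, value) pairs."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the log stays short.
        if data.strip():
            self.events.append(("string", data))


logger = EventLogger()
logger.feed("<html><head><title>The Dormouse's story</title></head>")
logger.close()
events = logger.events

for kind, value in events:
    print(kind, repr(value))
```

The log opens `<html>`, `<head>` and `<title>` before it reaches the title string, and closes `<title>` only afterwards. `.next_element` follows the same order, which is why the `.next_element` of a tag is normally its first child rather than its next sibling.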