Output · Beautiful Soup 4.2.0 Chinese Documentation

# Output

## Pretty-printing

The `prettify()` method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with each XML/HTML tag on its own line:

```
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>
```

You can call `prettify()` on the top-level `BeautifulSoup` object or on any of its tag nodes:

```
print(soup.a.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>
```

## Non-pretty printing

If you only want the resulting string and do not care about formatting, call Python's `unicode()` or `str()` on a `BeautifulSoup` object or a `Tag` object. The examples here are written for Python 2, where `str()` returns a byte string:

```
str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

unicode(soup.a)
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
```

`str()` returns a string encoded in UTF-8; see [Encodings](#id51) for other options. You can also call `encode()` to get a bytestring, or `decode()` to get Unicode.

## Output formatters

If you give Beautiful Soup a document that contains HTML entities such as "&amp;ldquo;", they are converted to Unicode characters:

```
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
unicode(soup)
# u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
```

If you then convert the document to a string, the Unicode characters are encoded as UTF-8. The original HTML entities are not restored:

```
str(soup)
# '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
```

## get_text()

If you only want the text contained in a tag, call the `get_text()` method. It collects all the text inside the tag, including the text of its descendant tags, and returns it as a single Unicode string:

```
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)

soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'
```

You can pass a string to be used as the separator between the pieces of text:

```
soup.get_text("|")
# u'\nI linked to |example.com|\n'
```

You can also strip the whitespace from the beginning and end of each piece of text:

```
soup.get_text("|", strip=True)
# u'I linked to|example.com'
```

Alternatively, use the [.stripped_strings](#strings-stripped-strings) generator to get the pieces of text as a list and process them yourself:

```
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
```
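
The two transformations described in this section (HTML entities becoming Unicode characters, and Unicode text becoming UTF-8 bytes when encoded) can be illustrated without Beautiful Soup. The following is a minimal Python 3 sketch that uses only the standard library. `html.unescape` stands in here for the entity handling that Beautiful Soup performs while parsing; it is an assumption made for illustration, not a description of how the library is implemented, and the variable names are hypothetical.

```python
import html

# Entity-encoded source text, as it might appear in an HTML file.
raw = "&ldquo;Dammit!&rdquo; he said."

# Step 1: entities are replaced by the Unicode characters they name.
text = html.unescape(raw)

# Step 2: encoding the Unicode text as UTF-8 yields bytes. The entity
# names are gone; only the encoded characters remain.
encoded = text.encode("utf-8")

# Decoding reverses step 2 but does not bring the entities back.
decoded = encoded.decode("utf-8")

print(repr(text))
print(encoded)
print(decoded == text)
```

This mirrors the round trip shown with `unicode(soup)` and `str(soup)`: once an entity has been converted to a character, encoding and decoding preserve the character but not the original entity spelling.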