Beautiful Soup 3 · Beautiful Soup 4.2.0 documentation (Chinese translation)

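# Beautiful Soup 3

Beautiful Soup 3 is the previous release series and is no longer maintained. It is still packaged by several major Linux distributions:

`$ apt-get install python-beautifulsoup`

The package published on PyPI is named `BeautifulSoup`:

`$ easy_install BeautifulSoup`

`$ pip install BeautifulSoup`

You can also install it from the [Beautiful Soup 3.2.0 source tarball](http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz).

The Beautiful Soup 3 documentation is archived [online](http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html), and a [Chinese translation](http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html) is also available. Read it first, then come back to this document to see what has changed in Beautiful Soup 4.

## Porting code to BS4

Most code written for Beautiful Soup 3 will run against Beautiful Soup 4 after one small change: the way the `BeautifulSoup` class is imported. Change this:

```
from BeautifulSoup import BeautifulSoup
```

to this:

```
from bs4 import BeautifulSoup
```

* If the code raises an `ImportError` with the message "No module named BeautifulSoup", you are probably trying to run Beautiful Soup 3 code in an environment where only Beautiful Soup 4 is installed.
* If the code raises an `ImportError` with the message "No module named bs4", you are probably trying to run Beautiful Soup 4 code in an environment where only Beautiful Soup 3 is installed.

Although BS4 is compatible with most of BS3, most BS3 methods are now deprecated and have been given new names that follow [PEP 8](http://www.Python.org/dev/peps/pep-0008/). Many methods were renamed, but only a few of the changes break backward compatibility.

The sections below cover what you need to know when migrating from BS3 to BS4.

### You need a parser

Beautiful Soup 3 used Python's `SGMLParser`, a module that has been removed in Python 3. Beautiful Soup 4 uses the built-in `html.parser` by default, and you can use the third-party lxml or html5lib libraries instead. See [Installing a parser](#id9) for details.

Because `html.parser` is not the same parser as `SGMLParser`, the two treat invalid markup differently. Usually the difference is that `html.parser` raises an exception, which is why installing a third-party parser is recommended. Sometimes `html.parser` does not fail but builds a different tree than `SGMLParser` would. If that happens, you need to update your BS3 code to handle the new tree.
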
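
The parser recommendation above can be turned into a small selection helper. The following is only a sketch: `pick_parser` and its `preferred` parameter are illustrative names that are not part of Beautiful Soup, and the preference order (lxml, then html5lib, then the built-in parser) is an assumption rather than a rule from this document.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the name of the first installed third-party parser.

    Falls back to "html.parser", which ships with Python itself.
    `pick_parser` and `preferred` are hypothetical names for this sketch.
    """
    for name in preferred:
        # find_spec() returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


parser_name = pick_parser()
print(parser_name)
```

The returned name is meant to be passed as the second argument of the constructor, for example `BeautifulSoup(markup, pick_parser())`. Since different parsers can build different trees from invalid markup, a project that needs reproducible results may prefer to hard-code one parser name instead.
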
### Method name changes

* `renderContents` -> `encode_contents`
* `replaceWith` -> `replace_with`
* `replaceWithChildren` -> `unwrap`
* `findAll` -> `find_all`
* `findAllNext` -> `find_all_next`
* `findAllPrevious` -> `find_all_previous`
* `findNext` -> `find_next`
* `findNextSibling` -> `find_next_sibling`
* `findNextSiblings` -> `find_next_siblings`
* `findParent` -> `find_parent`
* `findParents` -> `find_parents`
* `findPrevious` -> `find_previous`
* `findPreviousSibling` -> `find_previous_sibling`
* `findPreviousSiblings` -> `find_previous_siblings`
* `nextSibling` -> `next_sibling`
* `previousSibling` -> `previous_sibling`

Some arguments to the `BeautifulSoup` constructor were renamed as well:

* `BeautifulSoup(parseOnlyThese=...)` -> `BeautifulSoup(parse_only=...)`
* `BeautifulSoup(fromEncoding=...)` -> `BeautifulSoup(from_encoding=...)`

One method was renamed for compatibility with Python 3:

* `Tag.has_key()` -> `Tag.has_attr()`

One attribute was renamed to use more accurate terminology:

* `Tag.isSelfClosing` -> `Tag.is_empty_element`

Three attributes were renamed to avoid names that have a special meaning in Python. Unlike the renames above, these changes are not backward compatible: code that used these attributes under BS3 will not run under BS4 until it is updated.

* `UnicodeDammit.unicode` -> `UnicodeDammit.unicode_markup`
* `Tag.next` -> `Tag.next_element`
* `Tag.previous` -> `Tag.previous_element`

### Generators

The following generators were renamed to follow PEP 8 and turned into properties:

* `childGenerator()` -> `children`
* `nextGenerator()` -> `next_elements`
* `nextSiblingGenerator()` -> `next_siblings`
* `previousGenerator()` -> `previous_elements`
* `previousSiblingGenerator()` -> `previous_siblings`
* `recursiveChildGenerator()` -> `descendants`
* `parentGenerator()` -> `parents`

So when migrating to BS4, code like this:

```
for parent in tag.parentGenerator():
    ...
```

becomes:

```
for parent in tag.parents:
    ...
```

(The old form still works.)

Some BS3 generators yielded `None` once they were exhausted and then stopped. That was a bug; the new generators simply stop.

BS4 adds two new generators, [.strings and .stripped_strings](#strings-stripped-strings). `.strings` yields `NavigableString` objects, and `.stripped_strings` yields Python strings with leading and trailing whitespace removed.

### XML

The `BeautifulStoneSoup` class for parsing XML has been removed in BS4. To parse an XML document, pass "xml" as the second argument to the `BeautifulSoup` constructor. For the same reason, the `BeautifulSoup` constructor no longer recognizes the `isHTML` argument.

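
As a rough illustration of the XML change, here is a minimal sketch of parsing a small document through the `BeautifulSoup` constructor. It assumes that lxml is installed, because the "xml" parser is provided by lxml. The sample document and its tag names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; XML tag names are case-sensitive.
xml_doc = '<catalog><book id="b1"><bookTitle>Example</bookTitle></book></catalog>'

# BS3 spelled this BeautifulStoneSoup(xml_doc); in BS4 the parser name goes
# in the second argument of the ordinary constructor.
soup = BeautifulSoup(xml_doc, "xml")

title = soup.find("bookTitle")
book = soup.find("book")

print(title.string)
print(book["id"])
```

If lxml is missing, the constructor call fails instead of silently falling back to an HTML parser, so the dependency is worth declaring explicitly in a project that parses XML.
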
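Beautiful Soup's handling of empty-element XML tags has been improved. Previously, you had to say which tags were empty-element tags when parsing XML. The `selfClosingTags` argument to the constructor is no longer used. Beautiful Soup now treats every empty tag as an empty element, and if you add a child to an empty element, it stops being an empty element.

### Entities

HTML and XML entities are always converted to the corresponding Unicode characters. Beautiful Soup 3 offered a number of ways of handling entities, and all of them have been removed. The `BeautifulSoup` constructor no longer accepts the `smartQuotesTo` or `convertEntities` arguments. ([Unicode, Dammit](#unicode-dammit) still has a `smart_quotes_to` argument, but its default is now to convert smart quotes to Unicode.) The constants `HTML_ENTITIES`, `XML_ENTITIES`, and `XHTML_ENTITIES` have been removed, because the feature they configured is no longer supported.

If you want Unicode characters converted to HTML entities on output, rather than written out as UTF-8, use an [output formatter](#id47).

### Miscellaneous

[Tag.string](#string) now operates recursively. If tag A contains a single tag B and nothing else, then A's `.string` is the same as B's `.string`. (Previously, it was `None`.)

[Multi-valued attributes](#id12) such as `class` now have a list of their values, not a single string. This may affect the way you search for tags by CSS class.

If you pass both a [text argument](#text) and a [name argument](#id32) to one of the `find*` methods, Beautiful Soup searches for tags with the given name whose [Tag.string](#string) matches the text argument. The results contain the tags, not the strings themselves. Previously, Beautiful Soup ignored the name argument and searched for strings only.

The `BeautifulSoup` constructor no longer supports the `markupMassage` argument. It is now the parser's responsibility to handle markup correctly.

The rarely used alternate parser classes, such as `ICantBelieveItsBeautifulSoup` and `BeautifulSOAP`, have been removed. How ambiguous markup is interpreted is now entirely the parser's decision.

The `prettify()` method now returns a Unicode string, not a bytestring.
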
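
Several of the behavior changes listed in this section can be observed with a few lines of code. The sketch below uses made-up sample markup and the built-in `html.parser`. It assumes Python 3, where a Unicode string is a `str`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup used only for illustration.
html = '<p class="lead intro"><b>Hello</b></p><p class="lead">Bye</p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")

# .string recurses through a lone child tag: this <p> holds only <b>,
# so its .string is the same as the .string of <b>.
nested_string = first_p.string

# Multi-valued attributes such as class come back as a list of values.
classes = first_p["class"]

# Searching by one CSS class still matches tags that carry several classes.
lead_paragraphs = soup.find_all("p", class_="lead")

# Passing a name and text together returns the matching tags, not strings.
# (Later BS4 releases also accept string= for this argument.)
bye_tags = soup.find_all("p", text="Bye")

# prettify() returns a Unicode string rather than a bytestring.
pretty = soup.prettify()

print(nested_string, classes, len(lead_paragraphs), bye_tags)
```

Code ported from BS3 that compares `tag["class"]` with a plain string is a likely place for silent breakage, so comparisons of that kind are worth checking during a migration.
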