BeautifulSoup 教程 · ZetCode 中文系列教程

# BeautifulSoup 教程 > 原文： [http://zetcode.com/python/beautifulsoup/](http://zetcode.com/python/beautifulsoup/) BeautifulSoup 教程是 BeautifulSoup Python 库的入门教程。这些示例查找标签，遍历文档树，修改文档和刮取网页。 ## BeautifulSoup BeautifulSoup 是用于解析 HTML 和 XML 文档的 Python 库。它通常用于网页抓取。 BeautifulSoup 将复杂的 HTML 文档转换为复杂的 Python 对象树，例如标记，可导航字符串或注释。 ## 安装 BeautifulSoup 我们使用`pip3`命令安装必要的模块。 ```py $ sudo pip3 install lxml ``` 我们需要安装 BeautifulSoup 使用的`lxml`模块。 ```py $ sudo pip3 install bs4 ``` 上面的命令将安装 BeautifulSoup。 ## HTML 文件在示例中，我们将使用以下 HTML 文件： `index.html` ```py <!DOCTYPE html> <html> <head> <title>Header</title> <meta charset="utf-8"> </head> <body> <h2>Operating systems</h2> <ul id="mylist" style="width:150px"> <li>Solaris</li> <li>FreeBSD</li> <li>Debian</li> <li>NetBSD</li> <li>Windows</li> </ul> <p> FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms. </p> <p> Debian is a Unix-like computer operating system that is composed entirely of free software. </p> </body> </html> ``` ## BeautifulSoup 简单示例在第一个示例中，我们使用 BeautifulSoup 模块获取三个标签。 `simple.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') print(soup.h2) print(soup.head) print(soup.li) ``` 该代码示例将打印三个标签的 HTML 代码。 ```py from bs4 import BeautifulSoup ``` 我们从`bs4`模块导入`BeautifulSoup`类。 `BeautifulSoup`是从事工作的主要类。 ```py with open("index.html", "r") as f: contents = f.read() ``` 我们打开`index.html`文件并使用`read()`方法读取其内容。 ```py soup = BeautifulSoup(contents, 'lxml') ``` 创建了`BeautifulSoup`对象； HTML 数据将传递给构造器。第二个选项指定解析器。 ```py print(soup.h2) print(soup.head) ``` 在这里，我们打印两个标签的 HTML 代码：`h2`和`head`。 ```py print(soup.li) ``` 有多个`li`元素；该行打印第一个。 ```py $ ./simple.py <h2>Operating systems</h2> <head> <title>Header</title> <meta charset="utf-8"/> </head> <li>Solaris</li> ``` 这是输出。 ## BeautifulSoup 标签，名称，文本标记的`name`属性给出其名称，`text`属性给出其文本内容。 `tags_names.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') print("HTML: {0}, name: {1}, text: {2}".format(soup.h2, soup.h2.name, soup.h2.text)) ``` 该代码示例打印`h2`标签的 HTML 代码，名称和文本。 ```py $ ./tags_names.py HTML: <h2>Operating systems</h2>, name: h2, text: Operating systems ``` 这是输出。 ## BeautifulSoup 遍历标签使用`recursiveChildGenerator()`方法，我们遍历 HTML 文档。 `traverse_tree.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') for child in soup.recursiveChildGenerator(): if child.name: print(child.name) ``` 该示例遍历文档树并打印所有 HTML 标记的名称。 ```py $ ./traverse_tree.py html head title meta body h2 ul li li li li li p p ``` 在 HTML 文档中，我们有这些标签。 ## BeautifulSoup 子元素使用`children`属性，我们可以获取标签的子级。 `get_children.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') root = soup.html root_childs = [e.name for e in root.children if e.name is not None] print(root_childs) ``` 该示例检索`html`标记的子代，将它们放置在 Python 列表中，然后将其打印到控制台。由于`children`属性还返回标签之间的空格，因此我们添加了一个条件，使其仅包含标签名称。 ```py $ ./get_children.py ['head', 'body'] ``` `html`标签有两个子元素：`head`和`body`。 ## BeautifulSoup 后继元素使用`descendants`属性，我们可以获得标签的所有后代（所有级别的子级）。 `get_descendants.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') root = soup.body root_childs = [e.name for e in root.descendants if e.name is not None] print(root_childs) ``` 该示例检索`body`标记的所有后代。 ```py $ ./get_descendants.py ['h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p'] ``` 这些都是`body`标签的后代。 ## BeautifulSoup 网页抓取请求是一个简单的 Python HTTP 库。它提供了通过 HTTP 访问 Web 资源的方法。 `scraping.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup import requests as req resp = req.get("http://www.something.com") soup = BeautifulSoup(resp.text, 'lxml') print(soup.title) print(soup.title.text) print(soup.title.parent) ``` 该示例检索一个简单网页的标题。它还打印其父级。 ```py resp = req.get("http://www.something.com") soup = BeautifulSoup(resp.text, 'lxml') ``` 我们获取页面的 HTML 数据。 ```py print(soup.title) print(soup.title.text) print(soup.title.parent) ``` 我们检索标题的 HTML 代码，其文本以及其父级的 HTML 代码。 ```py $ ./scraping.py <title>Something.</title> Something. <head><title>Something.</title></head> ``` 这是输出。 ## BeautifulSoup 美化代码使用`prettify()`方法，我们可以使 HTML 代码看起来更好。 `prettify.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup import requests as req resp = req.get("http://www.something.com") soup = BeautifulSoup(resp.text, 'lxml') print(soup.prettify()) ``` 我们美化了一个简单网页的 HTML 代码。 ```py $ ./prettify.py <html> <head> <title> Something. </title> </head> <body> Something. </body> </html> ``` 这是输出。 ## BeautifulSoup 通过 ID 查找元素使用`find()`方法，我们可以通过各种方式（包括元素 ID）查找元素。 `find_by_id.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') #print(soup.find("ul", attrs={ "id" : "mylist"})) print(soup.find("ul", id="mylist")) ``` 该代码示例查找具有`mylist` ID 的`ul`标签。带注释的行是执行相同任务的另一种方法。 ## BeautifulSoup 查找所有标签使用`find_all()`方法，我们可以找到满足某些条件的所有元素。 `find_all.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') for tag in soup.find_all("li"): print("{0}: {1}".format(tag.name, tag.text)) ``` 该代码示例查找并打印所有`li`标签。 ```py $ ./find_all.py li: Solaris li: FreeBSD li: Debian li: NetBSD ``` 这是输出。 `find_all()`方法可以获取要搜索的元素列表。 `find_all2.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') tags = soup.find_all(['h2', 'p']) for tag in tags: print(" ".join(tag.text.split())) ``` 该示例查找所有`h2`和`p`元素并打印其文本。 `find_all()`方法还可以使用一个函数，该函数确定应返回哪些元素。 `find_by_fun.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup def myfun(tag): return tag.is_empty_element with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') tags = soup.find_all(myfun) print(tags) ``` 该示例打印空元素。 ```py $ ./find_by_fun.py [<meta charset="utf-8"/>] ``` 文档中唯一的空元素是`meta`。也可以使用正则表达式查找元素。 `regex.py` ```py #!/usr/bin/python3 import re from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') strings = soup.find_all(string=re.compile('BSD')) for txt in strings: print(" ".join(txt.split())) ``` 该示例打印包含`"BSD"`字符串的元素的内容。 ```py $ ./regex.py FreeBSD NetBSD FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms. ``` 这是输出。 ## BeautifulSoup CSS 选择器通过`select()`和`select_one()`方法，我们可以使用一些 CSS 选择器来查找元素。 `select_nth_tag.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') print(soup.select("li:nth-of-type(3)")) ``` 本示例使用 CSS 选择器来打印第三个`li`元素的 HTML 代码。 ```py $ ./select_nth_tag.py <li>Debian</li> ``` 这是第三个`li`元素。 CSS 中使用`#`字符通过 ID 属性选择标签。 `select_by_id.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') print(soup.select_one("#mylist")) ``` 该示例打印具有`mylist` ID 的元素。 ## BeautifulSoup 附加元素 `append()`方法将新标签附加到 HTML 文档。 `append_tag.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') newtag = soup.new_tag('li') newtag.string='OpenBSD' ultag = soup.ul ultag.append(newtag) print(ultag.prettify()) ``` 该示例附加了一个新的`li`标签。 ```py newtag = soup.new_tag('li') newtag.string='OpenBSD' ``` 首先，我们使用`new_tag()`方法创建一个新标签。 ```py ultag = soup.ul ``` 我们获得对`ul`标签的引用。 ```py ultag.append(newtag) ``` 我们将新创建的标签附加到`ul`标签。 ```py print(ultag.prettify()) ``` 我们以整齐的格式打印`ul`标签。 ## BeautifulSoup 插入元素 `insert()`方法在指定位置插入标签。 `insert_tag.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') newtag = soup.new_tag('li') newtag.string='OpenBSD' ultag = soup.ul ultag.insert(2, newtag) print(ultag.prettify()) ``` 该示例将第三个位置的`li`标签插入`ul`标签。 ## BeautifulSoup 替换文字 `replace_with()`替换元素的文本。 `replace_text.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') tag = soup.find(text="Windows") tag.replace_with("OpenBSD") print(soup.ul.prettify()) ``` 该示例使用`find()`方法查找特定元素，并使用`replace_with()`方法替换其内容。 ## BeautifulSoup 删除元素 `decompose()`方法从树中删除标签并销毁它。 `decompose_tag.py` ```py #!/usr/bin/python3 from bs4 import BeautifulSoup with open("index.html", "r") as f: contents = f.read() soup = BeautifulSoup(contents, 'lxml') ptag2 = soup.select_one("p:nth-of-type(2)") ptag2.decompose() print(soup.body.prettify()) ``` 该示例删除了第二个`p`元素。在本教程中，我们使用了 Python BeautifulSoup 库。您可能也会对以下相关教程感兴趣： [Pyquery 教程](/python/pyquery)， [Python 教程](/lang/python/)， [Python 列表推导](/articles/pythonlistcomprehensions/)， [OpenPyXL 教程](/articles/openpyxl/)，Python Requests 教程和 [Python CSV 教程](/python/csv/)。