Python 正则表达式 · PythonGuru 中文系列教程

# Python 正则表达式 > 原文： [https://thepythonguru.com/python-regular-expression/](https://thepythonguru.com/python-regular-expression/) * * * 于 2020 年 1 月 7 日更新 * * * 正则表达式广泛用于模式匹配。 Python 具有对常规功能的内置支持。要使用正则表达式，您需要导入`re`模块。 ```py import re ``` 现在您可以使用正则表达式了。 ## `search`方法 * * * `re.search()`用于查找字符串中模式的第一个匹配项。 **语法**： `re.search(pattern, string, flags[optional])` `re.search()`方法接受模式和字符串，并在成功时返回`match`对象；如果找不到匹配项，则返回`None`。 `match`对象具有`group()`方法，该方法在字符串中包含匹配的文本。您必须使用原始字符串来指定模式，即像这样用`r`开头的字符串。 ```py r'this \n' ``` 所有特殊字符和转义序列在原始字符串中均失去其特殊含义，因此`\n`不是换行符，它只是一个反斜杠`\`后跟一个`n`。 ```py >>> import re >>> s = "my number is 123" >>> match = re.search(r'\d\d\d', s) >>> match <_sre.SRE_Match object; span=(13, 16), match='123'> >>> match.group() '123' ``` 上面我们使用`\d\d\d`作为模式。 `\d`正则表达式匹配一位数字，因此 `\d\d\d`将匹配`111`，`222`和`786`之类的数字。它与`12`和`1444`不匹配。 ## 正则表达式中使用的基本模式 * * * | 符号 | 描述 | | --- | --- | | `.` | 点匹配除换行符以外的任何字符 | | `\w` | 匹配任何单词字符，即字母，字母数字，数字和下划线（`_`） | | `\W` | 匹配非单词字符 | | `\d` | 匹配一个数字 | | `\D` | 匹配不是数字的单个字符 | | `\s` | 匹配任何空白字符，例如`\n`，`\t`，空格 | | `\S` | 匹配单个非空白字符 | | `[abc]` | 匹配集合中的单个字符，即匹配`a`，`b`或`c` | | `[^abc]` | 匹配`a`，`b`和`c`以外的单个字符 | | `[a-z]` | 匹配`a`至`z`范围内的单个字符。 | | `[a-zA-Z]` | 匹配`a-z`或`A-Z`范围内的单个字符 | | `[0-9]` | 匹配`0`-`9`范围内的单个字符 | | `^` | 匹配从字符串开头开始 | | `$` | 匹配从字符串末尾开始 | | `+` | 匹配一个或多个前面的字符（贪婪匹配）。 | | `*` | 匹配零个或多个前一个字符（贪婪匹配）。 | 再举一个例子： ```py import re s = "tim email is tim@somehost.com" match = re.search(r'[\w.-]+@[\w.-]+', s) # the above regular expression will match a email address if match: print(match.group()) else: print("match not found") ``` 这里我们使用了`[\w.-]+@[\w.-]+`模式来匹配电子邮件地址。成功后，`re.search()`返回一个`match`对象，其`group()`方法将包含匹配的文本。 ## 捕捉组 * * * 组捕获允许从匹配的字符串中提取部分。您可以使用括号`()`创建组。假设在上面的示例中，我们想从电子邮件地址中提取用户名和主机名。为此，我们需要在用户名和主机名周围添加`()`，如下所示。 ```py match = re.search(r'([\w.-]+)@([\w.-]+)', s) ``` 请注意，括号不会更改模式匹配的内容。如果匹配成功，则`match.group(1)`将包含第一个括号中的匹配，`match.group(2)`将包含第二个括号中的匹配。 ```py import re s = "tim email is tim@somehost.com" match = re.search('([\w.-]+)@([\w.-]+)', s) if match: print(match.group()) ## tim@somehost.com (the whole match) print(match.group(1)) ## tim (the username, group 1) print(match.group(2)) ## somehost (the host, group 2) ``` ## `findall()`函数 * * * 如您所知，现在`re.search()`仅找到模式的第一个匹配项，如果我们想找到字符串中的所有匹配项，这就是`findall()`发挥作用的地方。 **语法**： `findall(pattern, string, flags=0[optional])` 成功时，它将所有匹配项作为字符串列表返回，否则返回空列表。 ```py import re s = "Tim's phone numbers are 12345-41521 and 78963-85214" match = re.findall(r'\d{5}', s) if match: print(match) ``` **预期输出**： ```py ['12345', '41521', '78963', '85214'] ``` 您还可以通过`findall()`使用组捕获，当应用组捕获时，`findall()`返回一个元组列表，其中元组将包含匹配的组。一个示例将清除所有内容。 ```py import re s = "Tim's phone numbers are 12345-41521 and 78963-85214" match = re.findall(r'(\d{5})-(\d{5})', s) print(match) for i in match: print() print(i) print("First group", i[0]) print("Second group", i[1]) ``` **预期输出**： ```py [('12345', '41521'), ('78963', '85214')] ('12345', '41521') First group 12345 Second group 41521 ('78963', '85214') First group 78963 Second group 85214 ``` ## 可选标志 * * * `re.search()`和`re.findall()`都接受可选参数称为标志。标志用于修改模式匹配的行为。 | 标志 | 描述 | | --- | --- | | `re.IGNORECASE` | 忽略大写和小写 | | `re.DOTALL` | 允许（`.`）匹配换行符，默认（`.`）匹配除换行符之外的任何字符 | | `re.MULTILINE` | 这将允许`^`和`$`匹配每行的开始和结束 | ## 使用`re.match()` * * * `re.match()`与`re.search()`非常相似，区别在于它将在字符串的开头开始寻找匹配项。 ```py import re s = "python tuts" match = re.match(r'py', s) if match: print(match.group()) ``` 您可以通过使用`re.search()`将`^`应用于模式来完成同一件事。 ```py import re s = "python tuts" match = re.search(r'^py', s) if match: print(match.group()) ``` 这样就完成了您需要了解的有关`re`模块的所有内容。 * * * * * *