Python 正则表达式 · Programiz 中文系列教程

# Python 正则表达式 > 原文： [https://www.programiz.com/python-programming/regex](https://www.programiz.com/python-programming/regex) #### 在本教程中，您将学习正则表达式（RegEx），并使用 Python 的`re`模块与 RegEx 一起使用（在示例的帮助下）。正则表达式（RegEx）是定义搜索模式的字符序列。例如， ```py ^a...s$ ``` 上面的代码定义了 RegEx 模式。模式为从`a`到`s`的任何五个字母的字符串。使用 RegEx 定义的模式可用于与字符串匹配。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `^a...s$` | `abs` | 没有匹配 | | | `alias` | 匹配 | | | `abyss` | 匹配 | | | `Alias` | 没有匹配 | | | `An abacus` | 没有匹配 | * * * Python 有一个名为`re`的模块可与 RegEx 一起使用。这是一个例子： ```py import re pattern = '^a...s$' test_string = 'abyss' result = re.match(pattern, test_string) if result: print("Search successful.") else: print("Search unsuccessful.") ``` 在这里，我们使用`re.match()`函数在`test_string`中搜索`pattern`。如果搜索成功，该方法将返回一个匹配对象。如果不是，则返回`None`。 * * * `re`模块中定义了其他一些函数，可与 RegEx 一起使用。在探讨之前，让我们学习正则表达式本身。如果您已经了解 RegEx 的基础知识，请跳至 [Python RegEx](#python-regex)。 * * * ## 使用 RegEx 指定模式为了指定正则表达式，使用了元字符。在上面的示例中，`^`和`$`是元字符。 * * * ### 元字符元字符是 RegEx 引擎以特殊方式解释的字符。以下是元字符列表： ```py [] . ^ $ + ? {} () \ | ``` * * * `[]` - **方括号** 方括号指定您要匹配的一组字符。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `[abc]` | `a` | 1 个匹配 | | | `ac` | 2 个匹配 | | | `Hey Jude` | 没有匹配 | | | `abc de ca` | 5 个匹配 | 在这里，如果您要匹配的字符串包含`a`，`b`或`c`中的任何一个，则`[abc]`将匹配。您也可以使用方括号内的`-`指定字符范围。 * `[a-e]`与`[abcde]`相同。 * `[1-4]`与`[1234]`相同。 * `[0-39]`与`[01239]`相同。您可以在方括号的开头使用尖号`^`符号来补充（反转）字符集。 * `[^abc]`表示除`a`或`b`或`c`之外的任何字符。 * `[^0-9]`表示任何非数字字符。 * * * `.` - **句号** 句点匹配任何单个字符（换行符`'\n'`除外）。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `..` | `a` | 没有匹配 | | | `ac` | 1 个匹配 | | | `acd` | 1 个匹配 | | | `acde` | 2 个匹配（包含 4 个字符） | * * * `^` - **脱字符** 插入符号`^`用于检查字符串**是否以某个字符开头**。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `^a` | `a` | 1 个匹配 | | `abc` | 1 个匹配 | | `bac` | 没有匹配 | | `^ab` | `abc` | 1 个匹配 | | `acb` | 没有匹配（以`a`开头，但之后没有`b`） | * * * `$` - **美元** 美元符号`$`用于检查字符串**是否以某个字符结束**。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `a$` | `a` | 1 个匹配 | | `formula` | 1 个匹配 | | `cab` | 没有匹配 | * * * `*` - **星号** 星形符号`*`匹配**模式的零次或多次出现**。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `ma*n` | `mn` | 1 个匹配 | | `man` | 1 个匹配 | | `maaan` | 1 个匹配 | | `main` | 没有匹配（`a`之后没有`n`） | | `woman` | 1 个匹配 | * * * `+`- **加号** 加号`+`**匹配左侧的模式的一个或多个出现**。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `ma+n` | `mn` | 没有匹配（没有`a`字符） | | `man` | 1 个匹配 | | `maaan` | 1 个匹配 | | `main` | 没有匹配（`a`后跟`n`） | | `woman` | 1 个匹配 | * * * `?` - **问号** 问号符号`?`匹配**模式的零或一次出现**。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `ma?n` | `mn` | 1 个匹配 | | `man` | 1 个匹配 | | `maaan` | 没有匹配（一个以上`a`字符） | | `main` | 没有匹配（`a`之后没有`n`） | | `woman` | 1 个匹配 | * * * `{}` - **大括号** 考虑以下代码：`{n,m}`。这意味着至少`n`，并且最多`m`个重复的模式留给它。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `a{2,3}` | `abc dat` | 没有匹配 | | `abc daat` | 1 个匹配（在`daat`） | | `aabc daaat` | 2 个匹配（在`aabc`和`daaat`） | | `aabc daaaat` | 2 个匹配（在`aabc`和`daaaat`） | 让我们再尝试一个示例。此 RegEx `[0-9]{2, 4}`匹配至少 2 位但不超过 4 位 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `[0-9]{2,4}` | `ab123csde` | 1 个匹配（在`ab123csde`） | | `12 and 345673` | 3 个匹配（`12`，`3456`，`73`） | | `1 and 2` | 没有匹配 | * * * `|` - **替代项** 竖线`|`用于交替（`or`运算符）。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `a|b` | `cde` | 没有匹配 | | `ade` | 1 个匹配（在`ade`匹配） | | `acdbea` | 3 个匹配（在`acdbea`） | 在此，`a|b`匹配包含`a`或`b`的任何字符串 * * * `()` - **分组** 括号`()`用于对子模式进行分组。例如，`(a|b|c)xz`匹配与`a`或`b`或`c`匹配的任何字符串，后跟`xz` | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `(a|b|c)xz` | `ab xz` | 没有匹配 | | `abxz` | 1 个匹配（在`abxz`匹配） | | `axz cabxz` | 2 个匹配（在`axzbc cabxz`） | * * * `\` - **反斜线** 反冲`\`用于转义包括所有元字符在内的各种字符。例如，如果字符串包含`$`后跟`a`，则`\$a`匹配。在这里，`$`不是由 RegEx 引擎以特殊方式解释的。如果不确定某个字符是否具有特殊含义，可以在其前面放置`\`。这样可以确保不对字符进行特殊处理。 * * * **特殊序列** 特殊序列使常用模式更易于编写。以下是特殊序列的列表： `\A` - 如果指定字符在字符串的开头，则匹配。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `\Athe` | `the sun` | 匹配 | | `In the sun` | 没有匹配 | * * * `\b` - 如果指定字符在单词的开头或结尾，则匹配。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `\bfoo` | `football` | 匹配 | | `a football` | 匹配 | | `afootball` | 没有匹配 | | `foo\b` | `the foo` | 匹配 | | `the afoo test` | 匹配 | | `the afootest` | 没有匹配 | * * * `\B` - 与`\b`相反。如果指定字符不是单词开头或结尾，则匹配。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `\Bfoo` | `football` | 没有匹配 | | `a football` | 没有匹配 | | `afootball` | 匹配 | | `foo\B` | `the foo` | 没有匹配 | | `the afoo test` | 没有匹配 | | `the afootest` | 匹配 | * * * `\d` - 匹配任何十进制数字。相当于`[0-9]` | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `\d` | `12abc3` | 3 个匹配（在`12abc3`） | | `Python` | 没有匹配 | * * * `\D` - 匹配任何非十进制数字。相当于`[^0-9]` | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `\D` | `1ab34"50` | 3 个匹配（在`1ab34"50`） | | `1345` | 没有匹配 | * * * `\s` - 匹配字符串包含任何空格字符的地方。等效于`[ \t\n\r\f\v]`。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `\s` | `Python RegEx` | 1 个匹配 | | `PythonRegEx` | 没有匹配 | * * * `\S` - 匹配字符串包含任何非空白字符的地方。等效于`[^ \t\n\r\f\v]`。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `\S` | `a b` | 2 个匹配（在`a b`） | | | 没有匹配 | * * * `\w` - 匹配任何字母数字字符（数字和字母）。等效于`[a-zA-Z0-9_]`。顺便说一下，下划线`_`也被认为是字母数字字符。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `\w` | `12&": ;c` | 3 个匹配（在`12&": ;c`） | | `%"> !` | 没有匹配 | * * * `\W` - 匹配任何非字母数字字符。相当于`[^a-zA-Z0-9_]` | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `\W` | `1a2%c` | 1 个匹配（在`1a2%c`） | | `Python` | 没有匹配 | * * * `\Z` - 如果指定字符在字符串的末尾，则匹配。 | 表达式 | 字符串 | 是否匹配 | | --- | --- | --- | | `Python\Z` | `I like Python` | 1 个匹配 | | `I like Python Programming` | 没有匹配 | | `Python is fun.` | 没有匹配 | * * * **提示**：要构建和测试正则表达式，可以使用 RegEx 测试器工具，例如 [regex101](https://regex101.com/ "RegEx tester") 。该工具不仅可以帮助您创建正则表达式，还可以帮助您学习它。现在，您了解了 RegEx 的基础知识，让我们讨论如何在 Python 代码中使用 RegEx。 * * * ## Python RegEx Python 有一个名为`re`的模块可用于正则表达式。要使用它，我们需要导入模块。 ```py import re ``` 该模块定义了一些可与 RegEx 一起使用的函数和常量。 * * * ## `re.findall()` `re.findall()`方法返回包含所有匹配项的字符串列表。 * * * ### 示例 1：`re.findall()` ```py # Program to extract numbers from a string import re string = 'hello 12 hi 89\. Howdy 34' pattern = '\d+' result = re.findall(pattern, string) print(result) # Output: ['12', '89', '34'] ``` 如果找不到该模式，则`re.findall()`返回一个空列表。 * * * ## `re.split()` `re.split`方法在存在匹配项的地方拆分字符串，并返回发生拆分的字符串列表。 * * * ### 示例 2：`re.split()` ```py import re string = 'Twelve:12 Eighty nine:89.' pattern = '\d+' result = re.split(pattern, string) print(result) # Output: ['Twelve:', ' Eighty nine:', '.'] ``` 如果找不到该模式，则`re.split()`返回一个包含原始字符串的列表。 * * * 您可以将`maxsplit`参数传递给`re.split()`方法。这是将要发生的最大拆分次数。 ```py import re string = 'Twelve:12 Eighty nine:89 Nine:9.' pattern = '\d+' # maxsplit = 1 # split only at the first occurrence result = re.split(pattern, string, 1) print(result) # Output: ['Twelve:', ' Eighty nine:89 Nine:9.'] ``` 顺便说一下，`maxsplit`的默认值为 0；意味着所有可能的分裂。 * * * ## `re.sub()` `re.sub()`的语法为： ```py re.sub(pattern, replace, string) ``` 该方法返回一个字符串，其中匹配的匹配项被`replace`变量的内容替换。 * * * ### 示例 3：`re.sub()` ```py # Program to remove all whitespaces import re # multiline string string = 'abc 12\ de 23 \n f45 6' # matches all whitespace characters pattern = '\s+' # empty string replace = '' new_string = re.sub(pattern, replace, string) print(new_string) # Output: abc12de23f456 ``` 如果找不到该模式，则`re.sub()`返回原始字符串。 * * * 您可以将`count`作为第四个参数传递给`re.sub()`方法。如果省略，则结果为 0。这将替换所有出现的事件。 ```py import re # multiline string string = 'abc 12\ de 23 \n f45 6' # matches all whitespace characters pattern = '\s+' replace = '' new_string = re.sub(r'\s+', replace, string, 1) print(new_string) # Output: # abc12de 23 # f45 6 ``` * * * ## `re.subn()` `re.subn()`与`re.sub()`相似，但期望它返回 2 个项目的元组，其中包含新字符串和进行的替换数目。 * * * ### 示例 4：`re.subn()` ```py # Program to remove all whitespaces import re # multiline string string = 'abc 12\ de 23 \n f45 6' # matches all whitespace characters pattern = '\s+' # empty string replace = '' new_string = re.subn(pattern, replace, string) print(new_string) # Output: ('abc12de23f456', 4) ``` * * * ## `re.search()` `re.search()`方法采用两个参数：模式和字符串。该方法查找 RegEx 模式与字符串匹配的第一个位置。如果搜索成功，则`re.search()`返回一个匹配对象。如果不是，则返回`None`。 ```py match = re.search(pattern, str) ``` * * * ### 示例 5：`re.search()` ```py import re string = "Python is fun" # check if 'Python' is at the beginning match = re.search('\APython', string) if match: print("pattern found inside the string") else: print("pattern not found") # Output: pattern found inside the string ``` 在此，`match`包含一个匹配对象。 * * * ## 匹配对象您可以使用[`dir()`](/python-programming/methods/built-in/dir)函数获取匹配对象的方法和属性。匹配对象的一些常用方法和属性是： * * * ### `match.group()` `group()`方法返回字符串中匹配的部分。 ### 示例 6：匹配对象 ```py import re string = '39801 356, 2102 1111' # Three digit number followed by space followed by two digit number pattern = '(\d{3}) (\d{2})' # match variable contains a Match object. match = re.search(pattern, string) if match: print(match.group()) else: print("pattern not found") # Output: 801 35 ``` 在此，`match`变量包含一个匹配对象。我们的模式`(\d{3}) (\d{2})`具有两个子组`(\d{3})`和`(\d{2})`。您可以获取这些带括号的子组的字符串的一部分。这是如何做： ```py >>> match.group(1) '801' >>> match.group(2) '35' >>> match.group(1, 2) ('801', '35') >>> match.groups() ('801', '35') ``` * * * ### `match.start()`，`match.end()`和`match.span()` `start()`函数返回匹配的子字符串的开头的索引。同样，`end()`返回匹配的子字符串的结束索引。 ```py >>> match.start() 2 >>> match.end() 8 ``` `span()`函数返回一个包含匹配部分的开始和结束索引的元组。 ```py >>> match.span() (2, 8) ``` * * * ### `match.re`和`match.string` 匹配对象的`re`属性返回正则表达式对象。同样，`string`属性返回传递的字符串。 ```py >>> match.re re.compile('(\\d{3}) (\\d{2})') >>> match.string '39801 356, 2102 1111' ``` * * * 我们已经介绍了`re`模块中定义的所有常用方法。如果您想了解更多信息，请访问 [Python 3 `re`模块](https://docs.python.org/3/library/re.html)。 * * * ### 在 RegEx 之前使用`r`前缀在正则表达式前使用`r`或`R`前缀时，表示原始字符串。例如，`'\n'`是换行，而`r'\n'`表示两个字符：反斜杠`\`后跟`n`。反冲`\`用于转义包括所有元字符在内的各种字符。但是，使用`r`前缀会使`\`被视为正常字符。 * * * ### 示例 7：使用`r`前缀的原始字符串 ```py import re string = '\n and \r are escape sequences.' result = re.findall(r'[\n\r]', string) print(result) # Output: ['\n', '\r'] ```