正则表达式 · UCB DS100 数据科学的原理与技巧

# 正则表达式 > 原文：[Regular Expressions](https://www.textbook.ds100.org/ch/08/text_regex.html) > > 校验：[Kitty Du](https://github.com/miaoxiaozui2017) > > 自豪地采用[谷歌翻译](https://translate.google.cn/) ```python # HIDDEN # Clear previously defined variables %reset -f # Set directory for data loading to work properly import os os.chdir(os.path.expanduser('~/notebooks/08')) ``` 在本节中，我们将介绍正则表达式，这是一个在字符串中指定模式的重要工具。 ## 动机在较大篇幅的文本中，许多有用的子字符串以特定的格式出现。例如，下面的句子包含了一个美国电话号码。 `"give me a call, my number is 123-456-7890."` 电话号码包含以下模式： 1. 三个数字 2. 然后是破折号 3. 后面是三个数字 4. 然后是破折号 5. 后面是四个数字如果有一段自由格式的文本，我们自然会希望检测和提取电话号码。我们还可能希望提取特定的电话号码片段，例如，通过提取区号，我们可以推断文本中提到的个人的位置。为了检测字符串是否包含电话号码，我们可以尝试编写如下方法： ```python def is_phone_number(string): digits = '0123456789' def is_not_digit(token): return token not in digits # Three numbers for i in range(3): if is_not_digit(string[i]): return False # Followed by a dash if string[3] != '-': return False # Followed by three numbers for i in range(4, 7): if is_not_digit(string[i]): return False # Followed by a dash if string[7] != '-': return False # Followed by four numbers for i in range(8, 12): if is_not_digit(string[i]): return False return True ``` ```python is_phone_number("382-384-3840") ``` > True ```python is_phone_number("phone number") ``` > False 上面的代码令人不快且冗长。比起手动循环字符串中的字符，我们可能更喜欢指定一个模式并命令 python 来匹配该模式。 **正则表达式**（通常缩写为**regex**）通过允许我们为字符串创建通用模式，方便地解决了这个确切的问题。使用正则表达式，我们可以在短短两行 python 中重新实现`is_phone_number`方法： ```python import re def is_phone_number(string): regex = r"[0-9]{3}-[0-9]{3}-[0-9]{4}" return re.search(regex, string) is not None is_phone_number("382-384-3840") ``` > True 在上面的代码中，我们使用 regex`[0-9]{3}-[0-9]{3}-[0-9]{4}`来匹配电话号码。虽然乍一看很神秘，但幸运的是，正则表达式的语法要比 Python 语言本身简单得多；我们仅在本节中介绍几乎所有的语法。我们还将介绍使用 regex 执行字符串操作的内置 python 模块`re`。 ## regex 语法[](#Regex-Syntax) 我们从正则表达式的语法开始。在 Python 中，正则表达式最常见的存储形式是原始字符串。原始字符串的行为类似于普通的 python 字符串，除了需要对反斜杠进行特殊处理。例如，要将字符串`hello \ world`存储在普通的 python 字符串中，我们必须编写： ```python # Backslashes need to be escaped in normal Python strings some_string = 'hello \\ world' print(some_string) ``` > hello \ world 使用原始字符串可以消除对反斜杠的转义： ```python # Note the `r` prefix on the string some_raw_string = r'hello \ world' print(some_raw_string) ``` > hello \ world 因为反斜杠经常出现在正则表达式中，所以我们将在本节中对所有正则表达式使用原始字符串。 ### 文字[](#Literals) 正则表达式中的**文字**字符与字符本身匹配。例如，regex`r"a"`将与`"Say! I like green eggs and ham!"`中的任何`"a"`匹配。所有字母数字字符和大多数标点符号都是 regex 文字。 ```python # HIDDEN def show_regex_match(text, regex): """ Prints the string with the regex match highlighted. """ print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text)) ``` ```python # The show_regex_match method highlights all regex matches in the input string regex = r"green" show_regex_match("Say! I like green eggs and ham!", regex) ``` > Say! I like **green** eggs and ham! ```python show_regex_match("Say! I like green eggs and ham!", r"a") ``` > S**a**y! I like green eggs **a**nd h**a**m! 在上面的示例中，我们观察到正则表达式可以匹配出现在输入字符串中任何位置的模式。在 python 中，此行为的差异取决于匹配 regex 所的方法，有些方法仅在 regex 出现在字符串开头时返回匹配；有些方法在字符串的任何位置返回匹配。还要注意，`show_regex_match`方法突出显示了输入字符串中所有出现的 regex。同样，这取决于使用的python 方法，某些方法返回所有匹配项，而有些方法只返回第一个匹配项。正则表达式区分大小写。在下面的示例中，regex 只匹配`eggs`中的小写`s`，而不匹配`Say`中的大写`S`。 ```python show_regex_match("Say! I like green eggs and ham!", r"s") ``` > Say! I like green egg**s** and ham! ### 通配符[](#Wildcard-Character) 有些字符在正则表达式中有特殊的含义。这些元字符允许正则表达式匹配各种模式。在正则表达式中，句点字符`.`与除换行符以外的任何字符匹配。 ```python show_regex_match("Call me at 382-384-3840.", r".all") ``` > **Call** me at 382-384-3840. 为了只匹配句点文字字符，我们必须用反斜杠转义它： ```python show_regex_match("Call me at 382-384-3840.", r"\.") ``` > Call me at 382-384-3840 **.** 通过使用句点字符来标记不同模式的各个部分，我们构造了一个 regex 来匹配电话号码。例如，我们可以将原始电话号码`382-384-3840`中的数字替换为`.`，将破折号保留为文字。得到的 regex 结果为 `...-...-....`。 ```python show_regex_match("Call me at 382-384-3840.", "...-...-....") ``` > Call me at **382-384-3840**. 但是，由于句点字符与所有字符匹配，下面的输入字符串将产生一个伪匹配。 ```python show_regex_match("My truck is not-all-blue.", "...-...-....") ``` > My truck is **not-all-blue**. ### 字符类[](#Character-Classes) **字符类**匹配指定的字符集，允许我们创建限制比只有`.`字符更严格的匹配。要创建字符类，请将所需字符集括在括号`[ ]`中。 ```python show_regex_match("I like your gray shirt.", "gr[ae]y") ``` > I like your **gray** shirt. ```python show_regex_match("I like your grey shirt.", "gr[ae]y") ``` > I like your **grey** shirt. ```python # Does not match; a character class only matches one character from a set show_regex_match("I like your graey shirt.", "gr[ae]y") ``` > I like your graey shirt. ```python # In this example, repeating the character class will match show_regex_match("I like your graey shirt.", "gr[ae][ae]y") ``` > I like your **graey** shirt. 在字符类中，`.`字符被视为文字，而不是通配符。 ```python show_regex_match("I like your grey shirt.", "irt[.]") ``` > I like your grey sh**irt.** 对于常用的字符类，我们可以使用一些特殊的速记符号： | 速记 | 意义 | | ---: | ---: | | [0－9] | 所有的数字 | | [a-z] | 小写字母 | | [A-Z] | 大写字母 | ```python show_regex_match("I like your gray shirt.", "y[a-z]y") ``` > I like your gray shirt. 字符类允许我们为电话号码创建更具体的 regex。 ```python # We replaced every `.` character in ...-...-.... with [0-9] to restrict # matches to digits. phone_regex = r'[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]' show_regex_match("Call me at 382-384-3840.", phone_regex) ``` > Call me at **382-384-3840**. ```python # Now we no longer match this string: show_regex_match("My truck is not-all-blue.", phone_regex) ``` > My truck is not-all-blue. ### 否定的字符类[](#Negated-Character-Classes) **否定的字符类**匹配**除类中字符以外**的任何字符。要创建否定字符类，请将否定字符括在`[^ ]`中。 ```python show_regex_match("The car parked in the garage.", r"[^c]ar") ``` > The car **par**ked in the **gar**age. ### 量词[](#Quantifiers) 为了创建与电话号码匹配的 regex，我们编写了： ``` [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] ``` 这匹配 3 位数字、一个破折号、3 位数字、一个破折号和 4 位数字。量词允许我们匹配一个模式的多个连续出现。我们通过将数字放在大括号`{ }`中指定重复次数。 ```python phone_regex = r'[0-9]{3}-[0-9]{3}-[0-9]{4}' show_regex_match("Call me at 382-384-3840.", phone_regex) ``` > Call me at **382-384-3840**. ```python # No match phone_regex = r'[0-9]{3}-[0-9]{3}-[0-9]{4}' show_regex_match("Call me at 12-384-3840.", phone_regex) ``` > Call me at 12-384-3840. 量词总是将字符或字符类修改为最左边。下表给出了量词的完整语法。 | 量词 | 意义 | | ---: | ---: | | {m,n} | 与前面的字符匹配 m ~ n 次。 | | {m} | 与前面的字符刚好匹配 m 次。 | | {m,} | 至少与前面的字符匹配 m 次。 | | {,n} | 最多与前面的字符匹配 n 次。 | **速记量词** 一些常用的量词有一个简写： | 符号 | 量词 | 意义 | | ---: | ---: | ---: | | * | {0,} | 与前面的字符匹配 0 次或更多次 | | + | {1,} | 与前面的字符匹配 1 次或更多次 | | ? | {0,1} | 与前面的字符匹配 0 或 1 次 | 在下面的示例中，我们使用`*`字符而不是`{0,}`。 ```python # 3 a's show_regex_match('He screamed "Aaaah!" as the cart took a plunge.', "Aa*h!") ``` > He screamed "**Aaaah!**" as the cart took a plunge. ```python # Lots of a's show_regex_match( 'He screamed "Aaaaaaaaaaaaaaaaaaaah!" as the cart took a plunge.', "Aa*h!" ) ``` > He screamed "**Aaaaaaaaaaaaaaaaaaaah!**" as the cart took a plunge. ```python # No lowercase a's show_regex_match('He screamed "Ah!" as the cart took a plunge.', "Aa*h!") ``` > He screamed "**Ah!**" as the cart took a plunge. ### 量词贪婪[](#Quantifiers-are-greedy) 量词总是返回尽可能长的匹配。这有时会导致令人惊讶的行为： ```python # We tried to match 311 and 911 but matched the ` and ` as well because # `<311> and <911>` is the longest match possible for `<.+>`. show_regex_match("Remember the numbers <311> and <911>", "<.+>") ``` > Remember the numbers **<311> and <911>** 在许多情况下，使用更具体的字符类可以防止这些错误匹配： ```python show_regex_match("Remember the numbers <311> and <911>", "<[0-9]+>") ``` > Remember the numbers **<311>** and **<911>** ### 固定[](#Anchoring) 有时模式应该只在字符串的开头或结尾匹配。特殊字符`^`仅当模式出现在字符串的开头时才固定匹配 regex ；特殊字符`$`仅当模式出现在字符串的结尾时固定匹配 regex 。例如，regex `well$` 只匹配字符串末尾的`well`。 ```python show_regex_match('well, well, well', r"well$") ``` > well, well, **well** 同时使用`^`和`$`需要 regex 匹配整个字符串。 ```python phone_regex = r"^[0-9]{3}-[0-9]{3}-[0-9]{4}$" show_regex_match('382-384-3840', phone_regex) ``` > **382-384-3840** ```python # No match show_regex_match('You can call me at 382-384-3840.', phone_regex) ``` > You can call me at 382-384-3840. ### 转义元字符[](#Escaping-Meta-Characters) 在正则表达式中，所有 regex 元字符都具有特殊意义。为了将元字符匹配为文字，我们使用`\`字符对它们进行转义。 ```python # `[` is a meta character and requires escaping show_regex_match("Call me at [382-384-3840].", "\[") ``` > Call me at **\[** 382-384-3840]. ```python # `.` is a meta character and requires escaping show_regex_match("Call me at [382-384-3840].", "\.") ``` > Call me at [382-384-3840]**.** ## 参考表[](#Reference-Tables) 我们现在已经介绍了最重要的 regex 语法和元字符。为了更完整的参考，我们包括了下表。 **元字符** 此表包含大多数重要的 *元字符* ，这有助于我们在字符串中指定要匹配的某些模式。 | 字符 | 说明 | 例子 | 匹配 | 不匹配 | | ---: | ---: | ---: | ---: | ---: | | . | 除\n以外的任何字符 | `...` | abc | ab abcd | | [] | 括号内的任何字符 | `[cb.]ar` | car .ar | jar | | [^ ] | *除了*括号内的任何字符 | `[^b]ar` | car par | bar ar | | * | ≥0次的前一个字符 | `[pb]*ark` | bbark ark | dark | | + | ≥1次的前一个字符 | `[pb]+ark` | bbpark bark | dark ark | | ? | 0或1次的前一个字符 | `s?he` | she he | the | | {n} | 刚好n次的前一个字符 | `hello{3}` | hellooo | hello | | \| | \|前或\|后的模式 | `we\|[ui]s` | we us is | e s | | \ | 转义下一个字符 | `\[hi\]` | [hi] | hi | | ^ | 行首 | `^ark` | ark&ensp;two | dark | | $ | 行尾 | `ark$` | noahs&ensp;ark | noahs&ensp;arks | **速记字符集** 一些常用的字符集有速记。 | 说明 | 括号形式 | 速记 | | ---: | ---: | ---: | | 字母数字字符 | `[a-zA-Z0-9]` | `\w` | | 不是字母数字字符 | `[^a-zA-Z0-9]` | `\W` | | 数字 | `[0-9]` | `\d` | | 不是数字 | `[^0-9]` | `\D` | | 空白 | `[\t\n\f\r\p{Z}]` | `\s` | | 不是空白 | `[^\t\n\f\r\p{z}]` | `\S` | ## 摘要[](#Summary) 几乎所有的编程语言都有一个库，可以使用正则表达式来匹配模式，使它们无论使用哪种特定的语言都很有用。在本节中，我们介绍了 regex 语法和最有用的元字符。