# Python 和 pandas 中的 Regex
> 原文:[Regex in Python and pandas](https://www.textbook.ds100.org/ch/08/text_re.html)
>
> 校验:[Kitty Du](https://github.com/miaoxiaozui2017)
>
> 自豪地采用[谷歌翻译](https://translate.google.cn/)
```python
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))
```
```python
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
```
在本节中,我们将介绍python内置的`re`模块中 regex 的用法。因为我们只介绍了一些最常用的方法,所以您也可以参考[有关`re`模块的官方文档](https://docs.python.org/3/library/re.html)。
#### `re.search`[](#re.search)
`re.search(pattern, string)`在`string`中的任意位置搜索 regex`pattern`的匹配项。如果找到模式,则返回一个 TruthyMatch 对象;如果没有,则返回`None`。
```python
phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text = "Call me at 382-384-3840."
match = re.search(phone_re, text)
match
```
```
<_sre.SRE_Match object; span=(11, 23), match='382-384-3840'>
```
虽然返回的 match 对象有各种有用的属性,但我们最常用`re.search`来测试模式是否出现在字符串中。
```python
if re.search(phone_re, text):
print("Found a match!")
```
```
Found a match!
```
```python
if re.search(phone_re, 'Hello world'):
print("No match; this won't print")
```
另一个常用的方法`re.match(pattern, string)`的行为与`re.search`相同,但只检查`string`开头的匹配项,而不是字符串中任何位置的匹配项。
#### `re.findall`[](#re.findall)
我们使用`re.findall(pattern, string)`提取与 regex 匹配的子字符串。此方法返回`string`中所有匹配项的列表。
```python
gmail_re = r'[a-zA-Z0-9]+@gmail\.com'
text = '''
From: email1@gmail.com
To: email2@yahoo.com and email3@gmail.com
'''
re.findall(gmail_re, text)
```
```
['email1@gmail.com', 'email3@gmail.com']
```
## Regex 组[](#Regex-Groups)
使用**regex 组**,我们通过将子模式括在括号`( )`中指定要从 regex 提取的子模式。当 regex 包含 regex 组时,`re.findall`返回包含子模式内容的元组列表。
例如,以下是我们熟悉的用 regex 从字符串中提取电话号码:
```python
phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)
```
```
['382-384-3840', '123-456-7890']
```
为了将一个电话号码的三位或四位组成部分分开,我们可以将每个数字组用括号括起来。
```python
# Same regex with parentheses around the digit groups
phone_re = r"([0-9]{3})-([0-9]{3})-([0-9]{4})"
text = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)
```
```
[('382', '384', '3840'), ('123', '456', '7890')]
```
正如所承诺的那样,`re.findall`返回包含匹配电话号码的各个组成部分的元组列表。
#### `re.sub`[](#re.sub)
`re.sub(pattern, replacement, string)`用`replacement`替换`string`中所有出现的`pattern`。此方法的行为类似于 python 字符串方法`str.sub`,但使用 regex 来匹配模式。
在下面的代码中,我们通过用破折号替换日期分隔符来将日期更改为通用格式。
```python
messy_dates = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'
regex = r'[/.:]'
re.sub(regex, '-', messy_dates)
```
```
'03-12-2018, 03-13-18, 03-14-2018, 03-15-2018'
```
#### `re.split`[](#re.split)
`re.split(pattern, string)`在每次出现regex `pattern`时分割输入的`string`。此方法的行为类似于 python 字符串方法`str.split`,但使用 regex 进行分割。
在下面的代码中,我们使用`re.split`将一本书目录的章节名称和它们的页码分开。
```python
toc = '''
PLAYING PILGRIMS============3
A MERRY CHRISTMAS===========13
THE LAURENCE BOY============31
BURDENS=====================55
BEING NEIGHBORLY============76
'''.strip()
# First, split into individual lines
lines = re.split('\n', toc)
lines
```
```
['PLAYING PILGRIMS============3',
'A MERRY CHRISTMAS===========13',
'THE LAURENCE BOY============31',
'BURDENS=====================55',
'BEING NEIGHBORLY============76']
```
```python
# Then, split into chapter title and page number
split_re = r'=+' # Matches any sequence of = characters
[re.split(split_re, line) for line in lines]
```
```
[['PLAYING PILGRIMS', '3'],
['A MERRY CHRISTMAS', '13'],
['THE LAURENCE BOY', '31'],
['BURDENS', '55'],
['BEING NEIGHBORLY', '76']]
```
## Regex 和 pandas[](#Regex-and-pandas)
回想一下,`pandas` Series对象有一个`.str`属性,它支持使用 python 字符串方法进行字符串操作。很方便的是,`.str`属性还支持一些`re`模块的函数。我们演示了regex在`pandas`中的基本用法,完整的方法列表在[有关字符串方法的`pandas`文档](https://pandas.pydata.org/pandas-docs/stable/text.html)中。
我们在下面的DataFrame中保存了小说《小女人》(*Little Women*)前五句话的文本。我们可以使用`pandas`提供的字符串方法来提取每个句子中的口语对话。
```python
# HIDDEN
text = '''
"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
'''.strip()
little = pd.DataFrame({
'sentences': text.split('\n')
})
```
```
little
```
| | sentences |
| --- | ---: |
| 0 | "Christmas won't be Christmas without any pres... |
| 1 | "It's so dreadful to be poor!" sighed Meg, loo... |
| 2 | "I don't think it's fair for some girls to hav... |
| 3 | "We've got Father and Mother, and each other,"... |
| 4 | The four young faces on which the firelight sh... |
由于口语对话位于双引号内,因此我们创建一个 regex,它捕获左双引号、除双引号外的任何字符序列和右双引号。
```python
quote_re = r'"[^"]+"'
little['sentences'].str.findall(quote_re)
```
```
0 ["Christmas won't be Christmas without any pre...
1 ["It's so dreadful to be poor!"]
2 ["I don't think it's fair for some girls to ha...
3 ["We've got Father and Mother, and each other,"]
4 ["We haven't got Father, and shall not have hi...
Name: sentences, dtype: object
```
由于`Series.str.findall`方法返回匹配项列表,`pandas`还提供`Series.str.extract`和`Series.str.extractall`方法将匹配项提取到Series或DataFrame中。这些方法要求 regex 至少包含一个 regex 组。
```python
# Extract text within double quotes
quote_re = r'"([^"]+)"'
spoken = little['sentences'].str.extract(quote_re)
spoken
```
```
0 Christmas won't be Christmas without any prese...
1 It's so dreadful to be poor!
2 I don't think it's fair for some girls to have...
3 We've got Father and Mother, and each other,
4 We haven't got Father, and shall not have him ...
Name: sentences, dtype: object
```
我们可以将此序列添加为`little`DataFrame的列:
```python
little['dialog'] = spoken
little
```
| | sentences | dialog |
| --- | ---: | ---: |
| 0 | "Christmas won't be Christmas without any pres... | Christmas won't be Christmas without any prese... |
| 1 | "It's so dreadful to be poor!" sighed Meg, loo... | It's so dreadful to be poor! |
| 2 | "I don't think it's fair for some girls to hav... | I don't think it's fair for some girls to have... |
| 3 | "We've got Father and Mother, and each other,"... | We've got Father and Mother, and each other, |
| 4 | The four young faces on which the firelight sh... | We haven't got Father, and shall not have him ... |
我们可以通过打印原始文本和提取文本来确认字符串操作在DataFrame中的最后一句话上是否如预期执行:
```python
print(little.loc[4, 'sentences'])
```
```
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
```
```python
print(little.loc[4, 'dialog'])
```
```
We haven't got Father, and shall not have him for a long time.
```
## 摘要[](#Summary)
python 中的`re`模块提供了一组使用正则表达式操作文本的实用方法。在处理DataFrame时,我们经常使用`pandas`中实现的类似的字符串操作方法。
有关`re`模块的完整文档,请参阅[https://docs.python.org/3/library/re.html](https://docs.python.org/3/library/re.html)
有关`pandas`字符串方法的完整文档,请参阅[https://pandas.pydata.org/pandas-docs/stable/text.html](https://pandas.pydata.org/pandas-docs/stable/text.html)
- 一、数据科学的生命周期
- 二、数据生成
- 三、处理表格数据
- 四、数据清理
- 五、探索性数据分析
- 六、数据可视化
- Web 技术
- 超文本传输协议
- 处理文本
- python 字符串方法
- 正则表达式
- regex 和 python
- 关系数据库和 SQL
- 关系模型
- SQL
- SQL 连接
- 建模与估计
- 模型
- 损失函数
- 绝对损失和 Huber 损失
- 梯度下降与数值优化
- 使用程序最小化损失
- 梯度下降
- 凸性
- 随机梯度下降法
- 概率与泛化
- 随机变量
- 期望和方差
- 风险
- 线性模型
- 预测小费金额
- 用梯度下降拟合线性模型
- 多元线性回归
- 最小二乘-几何透视
- 线性回归案例研究
- 特征工程
- 沃尔玛数据集
- 预测冰淇淋评级
- 偏方差权衡
- 风险和损失最小化
- 模型偏差和方差
- 交叉验证
- 正规化
- 正则化直觉
- L2 正则化:岭回归
- L1 正则化:LASSO 回归
- 分类
- 概率回归
- Logistic 模型
- Logistic 模型的损失函数
- 使用逻辑回归
- 经验概率分布的近似
- 拟合 Logistic 模型
- 评估 Logistic 模型
- 多类分类
- 统计推断
- 假设检验和置信区间
- 置换检验
- 线性回归的自举(真系数的推断)
- 学生化自举
- P-HACKING
- 向量空间回顾
- 参考表
- Pandas
- Seaborn
- Matplotlib
- Scikit Learn