python 字符串方法 · UCB DS100 数据科学的原理与技巧

# python 字符串方法 > 原文：[Python String Methods](https://www.textbook.ds100.org/ch/08/text_strings.html) > > 校验：[Kitty Du](https://github.com/miaoxiaozui2017) > > 自豪地采用[谷歌翻译](https://translate.google.cn/) ```python # HIDDEN # Clear previously defined variables %reset -f # Set directory for data loading to work properly import os os.chdir(os.path.expanduser('~/notebooks/08')) ``` ```python # HIDDEN import warnings # Ignore numpy dtype warnings. These warnings are caused by an interaction # between numpy and Cython and can be safely ignored. # Reference: https://stackoverflow.com/a/40846742 warnings.filterwarnings("ignore", message="numpy.dtype size changed") warnings.filterwarnings("ignore", message="numpy.ufunc size changed") import numpy as np import matplotlib.pyplot as plt import pandas as pd import seaborn as sns %matplotlib inline import ipywidgets as widgets from ipywidgets import interact, interactive, fixed, interact_manual import nbinteract as nbi sns.set() sns.set_context('talk') np.set_printoptions(threshold=20, precision=2, suppress=True) pd.options.display.max_rows = 7 pd.options.display.max_columns = 8 pd.set_option('precision', 2) # This option stops scientific notation for pandas # pd.set_option('display.float_format', '{:.2f}'.format) ``` python 为基本的字符串操作提供了多种方法。虽然这些方法简单，但是作为基础组合在一起形成了更加复杂的字符串操作。我们将在“处理文本：数据清洗”的通用用例中介绍 Python 的字符串方法。 ## 清洗文本数据数据通常来自几个不同的来源，每个来源都实现了自己的信息编码方式。在下面的示例中，我们用一个表记录一个县(County)所属的州(State)，另一个表记录该县的人口(Population)。 ```python # HIDDEN state = pd.DataFrame({ 'County': [ 'De Witt County', 'Lac qui Parle County', 'Lewis and Clark County', 'St John the Baptist Parish', ], 'State': [ 'IL', 'MN', 'MT', 'LA', ] }) population = pd.DataFrame({ 'County': [ 'DeWitt ', 'Lac Qui Parle', 'Lewis & Clark', 'St. John the Baptist', ], 'Population': [ '16,798', '8,067', '55,716', '43,044', ] }) ``` ``` state ``` | | County | State | | --- | --- | --- | | 0 | De Witt County | IL(伊利诺伊州) | | 1 | Lac qui Parle County | MN(明尼苏达州) | | 2 | Lewis and Clark County | MT(蒙大拿州) | | 3 | St John the Baptist Parish | LA(路易斯安那州) | ``` population ``` | | County | Population | | --- | --- | --- | | 0 | DeWitt | 16,798 | | 1 | Lac Qui Parle | 8,067 | | 2 | Lewis & Clark | 55,716 | | 3 | St. John the Baptist | 43,044 | 我们当然希望使用`County`列连接`state`和`population`表。不幸的是，两张表中没有一个县的拼写相同。此示例说明了文本数据中存在以下常见问题： 1. 大写：qui 对应 Qui 2. 不同的标点符号习惯：St. 对应 St 3. 缺少单词：在`population`表中缺少单词`County`/`Parish` 4. 空白的使用：DeWitt 对应 De Witt 5. 不同的缩写习惯：& 对应 and ## 字符串方法[](#String-Methods) python 的字符串方法允许我们着手解决这些问题。这些方法在所有 python 字符串上都被方便地定义，因此不需要导入其他模块。虽然有必要熟悉一下[字符串方法的完整列表](https://docs.python.org/3/library/stdtypes.html#string-methods)，但我们在下表中描述了一些最常用的方法。 | 方法 | 说明 | | --- | --- | | `str[x:y]` | 切片`str`，返回索引 x（包含）到 y（不包含） | | `str.lower()` | 返回字符串的副本，所有字母都转换为小写 | | `str.replace(a, b)` | 用子字符串`b`替换`str`中子字符串`a`的所有实例 | | `str.split(a)` | 返回在子字符串`a`处拆分的`str`子字符串 | | `str.strip()` | 从`str`中删除前导空格和尾随空格 | 我们从`state`和`population`表中选择St. John the Baptist parish的字符串，并应用字符串方法去除大写、标点符号和`county`/`parish`的出现。 ```python john1 = state.loc[3, 'County'] john2 = population.loc[3, 'County'] (john1 .lower() .strip() .replace(' parish', '') .replace(' county', '') .replace('&', 'and') .replace('.', '') .replace(' ', '') ) ``` ``` 'stjohnthebaptist' ``` 将同一组方法应用于`john2`，这样我们就能验证两个字符串现在是否相同。 ```python (john2 .lower() .strip() .replace(' parish', '') .replace(' county', '') .replace('&', 'and') .replace('.', '') .replace(' ', '') ) ``` ``` 'stjohnthebaptist' ``` 满意的是，我们创建了一个名为`clean_county`的方法来规范化输入的county。 ```python def clean_county(county): return (county .lower() .strip() .replace(' county', '') .replace(' parish', '') .replace('&', 'and') .replace(' ', '') .replace('.', '')) ``` 现在，我们可以验证`clean_county`方法为两个表中的所有的county生成匹配的county： ```python ([clean_county(county) for county in state['County']], [clean_county(county) for county in population['County']] ) ``` ``` (['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'], ['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist']) ``` 因为两个表中的每个county都有相同的转换表示，所以我们可以使用转换后的county成功地连接两个表。 ## pandas 中的字符串方法[](#String-Methods-in-pandas) 在上面的代码中，我们使用一个循环来转换每个county。`pandas`的Series对象提供了一种将字符串方法应用于序列中每个项的方便方法。首先，`state`表中的county序列： ```python state['County'] ``` ``` 0 De Witt County 1 Lac qui Parle County 2 Lewis and Clark County 3 St John the Baptist Parish Name: County, dtype: object ``` `pandas`的Series的`.str`属性提供了和原生Python 中相同的字符串方法。对`.str`属性调用方法会对序列中的每个项调用该方法。 ```python state['County'].str.lower() ``` ``` 0 de witt county 1 lac qui parle county 2 lewis and clark county 3 st john the baptist parish Name: County, dtype: object ``` 这允许我们在不使用循环的情况下转换序列中的每个字符串。 ```python (state['County'] .str.lower() .str.strip() .str.replace(' parish', '') .str.replace(' county', '') .str.replace('&', 'and') .str.replace('.', '') .str.replace(' ', '') ) ``` ``` 0 dewitt 1 lacquiparle 2 lewisandclark 3 stjohnthebaptist Name: County, dtype: object ``` 我们将转换后的county保存回其原始表： ```python state['County'] = (state['County'] .str.lower() .str.strip() .str.replace(' parish', '') .str.replace(' county', '') .str.replace('&', 'and') .str.replace('.', '') .str.replace(' ', '') ) population['County'] = (population['County'] .str.lower() .str.strip() .str.replace(' parish', '') .str.replace(' county', '') .str.replace('&', 'and') .str.replace('.', '') .str.replace(' ', '') ) ``` 现在，这两个表包含了county的相同字符串表示： ``` state ``` | | County | State | | --- | --- | --- | | 0 | dewitt | IL | | 1 | lacquiparle | MN | | 2 | lewisandclark | MT | | 3 | stjohnthebaptist | LA | ``` population ``` | | County | Population | | --- | --- | --- | | 0 | dewitt | 16,798 | | 1 | lacquiparle | 8,067 | | 2 | lewisandclark | 55,716 | | 3 | stjohnthebaptist | 43,044 | 一旦county匹配，就很容易连接这些表了。 ```python state.merge(population, on='County') ``` | | County | State | Population | | --- | --- | --- | --- | | 0 | dewitt | IL | 16,798 | | 1 | lacquiparle | MN | 8,067 | | 2 | lewisandclark | MT | 55,716 | | 3 | stjohnthebaptist | LA | 43,044 | ## 摘要[](#Summary) python 的字符串方法形成了一组简单而有用的字符串操作。`pandas`的Series实现了相同的方法，将底层 python 方法应用于序列中的每个字符串。您可以在[这里](https://docs.python.org/3/library/stdtypes.html#string-methods)找到关于 python 的`string`方法的完整文档。还可以在[这里](https://pandas.pydata.org/pandas-docs/stable/text.html#method-summary)找到关于 pandas 的文档`str`方法的完整文档。