# 19. Data Wrangling (Part 2)
> Author: [Chris Albon](https://chrisalbon.com/)
>
> Translator: [飞龙](https://github.com/wizardforcel)
>
> License: [CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)
## Joining and Merging DataFrames
```py
# Import modules
import pandas as pd
from IPython.display import display
from IPython.display import Image
# Create the first dataframe
raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])
df_a
```
| | subject_id | first_name | last_name |
| --- | --- | --- | --- |
| 0 | 1 | Alex | Anderson |
| 1 | 2 | Amy | Ackerman |
| 2 | 3 | Allen | Ali |
| 3 | 4 | Alice | Aoni |
| 4 | 5 | Ayoung | Atiches |
```py
# Create a second dataframe
raw_data = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])
df_b
```
| | subject_id | first_name | last_name |
| --- | --- | --- | --- |
| 0 | 4 | Billy | Bonder |
| 1 | 5 | Brian | Black |
| 2 | 6 | Bran | Balwner |
| 3 | 7 | Bryce | Brice |
| 4 | 8 | Betty | Btisan |
```py
# Create a third dataframe
raw_data = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
df_n = pd.DataFrame(raw_data, columns = ['subject_id','test_id'])
df_n
```
| | subject_id | test_id |
| --- | --- | --- |
| 0 | 1 | 51 |
| 1 | 2 | 15 |
| 2 | 3 | 15 |
| 3 | 4 | 61 |
| 4 | 5 | 16 |
| 5 | 7 | 14 |
| 6 | 8 | 15 |
| 7 | 9 | 1 |
| 8 | 10 | 61 |
| 9 | 11 | 16 |
```py
# Join the two dataframes along rows
df_new = pd.concat([df_a, df_b])
df_new
```
| | subject_id | first_name | last_name |
| --- | --- | --- | --- |
| 0 | 1 | Alex | Anderson |
| 1 | 2 | Amy | Ackerman |
| 2 | 3 | Allen | Ali |
| 3 | 4 | Alice | Aoni |
| 4 | 5 | Ayoung | Atiches |
| 0 | 4 | Billy | Bonder |
| 1 | 5 | Brian | Black |
| 2 | 6 | Bran | Balwner |
| 3 | 7 | Bryce | Brice |
| 4 | 8 | Betty | Btisan |
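Note that the row labels 0 through 4 appear twice above, because `pd.concat` keeps each frame's original index. When the positional labels do not matter, the standard `ignore_index=True` parameter produces a fresh index instead; a minimal sketch:

```python
import pandas as pd

# Two small frames with overlapping row labels
df_a = pd.DataFrame({'subject_id': ['1', '2', '3', '4', '5']})
df_b = pd.DataFrame({'subject_id': ['4', '5', '6', '7', '8']})

# Default concat keeps the original labels, so 0-4 repeat
stacked = pd.concat([df_a, df_b])

# ignore_index=True relabels the rows 0-9
reindexed = pd.concat([df_a, df_b], ignore_index=True)
```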
```py
# Join the two dataframes along columns
pd.concat([df_a, df_b], axis=1)
```
| | subject_id | first_name | last_name | subject_id | first_name | last_name |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Alex | Anderson | 4 | Billy | Bonder |
| 1 | 2 | Amy | Ackerman | 5 | Brian | Black |
| 2 | 3 | Allen | Ali | 6 | Bran | Balwner |
| 3 | 4 | Alice | Aoni | 7 | Bryce | Brice |
| 4 | 5 | Ayoung | Atiches | 8 | Betty | Btisan |
```py
# Merge the two dataframes on subject_id
pd.merge(df_new, df_n, on='subject_id')
```
| | subject_id | first_name | last_name | test_id |
| --- | --- | --- | --- | --- |
| 0 | 1 | Alex | Anderson | 51 |
| 1 | 2 | Amy | Ackerman | 15 |
| 2 | 3 | Allen | Ali | 15 |
| 3 | 4 | Alice | Aoni | 61 |
| 4 | 4 | Billy | Bonder | 61 |
| 5 | 5 | Ayoung | Atiches | 16 |
| 6 | 5 | Brian | Black | 16 |
| 7 | 7 | Bryce | Brice | 14 |
| 8 | 8 | Betty | Btisan | 15 |
```py
# Merge the two dataframes on subject_id of both the left and right dataframes
pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id')
```
| | subject_id | first_name | last_name | test_id |
| --- | --- | --- | --- | --- |
| 0 | 1 | Alex | Anderson | 51 |
| 1 | 2 | Amy | Ackerman | 15 |
| 2 | 3 | Allen | Ali | 15 |
| 3 | 4 | Alice | Aoni | 61 |
| 4 | 4 | Billy | Bonder | 61 |
| 5 | 5 | Ayoung | Atiches | 16 |
| 6 | 5 | Brian | Black | 16 |
| 7 | 7 | Bryce | Brice | 14 |
| 8 | 8 | Betty | Btisan | 15 |
Merge with an outer join.
"Full outer join produces the set of all records in Table A and Table B, with matching records from both sides where available. If there is no match, the missing side will contain null." -- [source](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/)
```py
pd.merge(df_a, df_b, on='subject_id', how='outer')
```
| | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | Alex | Anderson | NaN | NaN |
| 1 | 2 | Amy | Ackerman | NaN | NaN |
| 2 | 3 | Allen | Ali | NaN | NaN |
| 3 | 4 | Alice | Aoni | Billy | Bonder |
| 4 | 5 | Ayoung | Atiches | Brian | Black |
| 5 | 6 | NaN | NaN | Bran | Balwner |
| 6 | 7 | NaN | NaN | Bryce | Brice |
| 7 | 8 | NaN | NaN | Betty | Btisan |
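Recent pandas versions also accept an `indicator=True` argument to `pd.merge`, which adds a `_merge` column recording which side each row came from; a small sketch for auditing an outer join:

```python
import pandas as pd

df_a = pd.DataFrame({'subject_id': ['1', '4', '5']})
df_b = pd.DataFrame({'subject_id': ['4', '5', '8']})

# _merge takes the values 'left_only', 'right_only', or 'both'
merged = pd.merge(df_a, df_b, on='subject_id', how='outer', indicator=True)
```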
Merge with an inner join.
"Inner join produces only the set of records that match in both Table A and Table B." -- [source](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/)
```py
pd.merge(df_a, df_b, on='subject_id', how='inner')
```
| | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |
| --- | --- | --- | --- | --- | --- |
| 0 | 4 | Alice | Aoni | Billy | Bonder |
| 1 | 5 | Ayoung | Atiches | Brian | Black |
```py
# Merge with a right join
pd.merge(df_a, df_b, on='subject_id', how='right')
```
| | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |
| --- | --- | --- | --- | --- | --- |
| 0 | 4 | Alice | Aoni | Billy | Bonder |
| 1 | 5 | Ayoung | Atiches | Brian | Black |
| 2 | 6 | NaN | NaN | Bran | Balwner |
| 3 | 7 | NaN | NaN | Bryce | Brice |
| 4 | 8 | NaN | NaN | Betty | Btisan |
Merge with a left join.
"Left outer join produces a complete set of records from Table A, with the matching records where available in Table B. If there is no match, the right side will contain null." -- [source](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/)
```py
pd.merge(df_a, df_b, on='subject_id', how='left')
```
| | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | Alex | Anderson | NaN | NaN |
| 1 | 2 | Amy | Ackerman | NaN | NaN |
| 2 | 3 | Allen | Ali | NaN | NaN |
| 3 | 4 | Alice | Aoni | Billy | Bonder |
| 4 | 5 | Ayoung | Atiches | Brian | Black |
```py
# Merge while adding suffixes to duplicate column names
pd.merge(df_a, df_b, on='subject_id', how='left', suffixes=('_left', '_right'))
```
| | subject_id | first_name_left | last_name_left | first_name_right | last_name_right |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | Alex | Anderson | NaN | NaN |
| 1 | 2 | Amy | Ackerman | NaN | NaN |
| 2 | 3 | Allen | Ali | NaN | NaN |
| 3 | 4 | Alice | Aoni | Billy | Bonder |
| 4 | 5 | Ayoung | Atiches | Brian | Black |
```py
# Merge based on indexes
pd.merge(df_a, df_b, right_index=True, left_index=True)
```
| | subject_id_x | first_name_x | last_name_x | subject_id_y | first_name_y | last_name_y |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Alex | Anderson | 4 | Billy | Bonder |
| 1 | 2 | Amy | Ackerman | 5 | Brian | Black |
| 2 | 3 | Allen | Ali | 6 | Bran | Balwner |
| 3 | 4 | Alice | Aoni | 7 | Bryce | Brice |
| 4 | 5 | Ayoung | Atiches | 8 | Betty | Btisan |
## List Unique Values in a pandas Column
Special thanks to Bob Haffner for pointing out a better way of doing it.
```py
# Import modules
import pandas as pd
# Set iPython's max row display
pd.set_option('display.max_rows', 1000)
# Set iPython's max column display
pd.set_option('display.max_columns', 50)
# Create an example dataframe
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
```py
# List the unique values of df['name']
df.name.unique()
# array(['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], dtype=object)
```
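Two related helpers are worth knowing: `Series.nunique()` counts the distinct values and `Series.value_counts()` tallies how often each appears; a quick sketch:

```python
import pandas as pd

names = pd.Series(['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Molly'])

# Number of distinct values
distinct = names.nunique()

# Frequency of each value, most common first
tally = names.value_counts()
```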
## Load a JSON File
```py
# Load library
import pandas as pd
# Create a URL to the JSON file (alternatively this can be a filepath)
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json'
# Load the JSON file into a dataframe
df = pd.read_json(url, orient='columns')
# View the first ten rows
df.head(10)
```
| | category | datetime | integer |
| --- | --- | --- | --- |
| 0 | 0 | 2015-01-01 00:00:00 | 5 |
| 1 | 0 | 2015-01-01 00:00:01 | 5 |
| 10 | 0 | 2015-01-01 00:00:10 | 5 |
| 11 | 0 | 2015-01-01 00:00:11 | 5 |
| 12 | 0 | 2015-01-01 00:00:12 | 8 |
| 13 | 0 | 2015-01-01 00:00:13 | 9 |
| 14 | 0 | 2015-01-01 00:00:14 | 8 |
| 15 | 0 | 2015-01-01 00:00:15 | 8 |
| 16 | 0 | 2015-01-01 00:00:16 | 2 |
| 17 | 0 | 2015-01-01 00:00:17 | 1 |
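`pd.read_json` accepts an in-memory buffer as well as a URL or filepath, so the pattern can be tried offline; a self-contained sketch with made-up column names:

```python
import io
import pandas as pd

# orient='columns' expects {column -> {row label -> value}}
json_text = '{"category": {"0": 0, "1": 1}, "integer": {"0": 5, "1": 8}}'

df = pd.read_json(io.StringIO(json_text), orient='columns')
```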
## Load an Excel File
```py
# Load library
import pandas as pd
# Create a URL to the Excel file (alternatively this can be a filepath)
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.xlsx'
# Load the first sheet of the Excel file into a dataframe
df = pd.read_excel(url, sheet_name=0, header=1)
# View the first ten rows
df.head(10)
```
| | 5 | 2015-01-01 00:00:00 | 0 |
| --- | --- | --- | --- |
| 0 | 5 | 2015-01-01 00:00:01 | 0 |
| 1 | 9 | 2015-01-01 00:00:02 | 0 |
| 2 | 6 | 2015-01-01 00:00:03 | 0 |
| 3 | 6 | 2015-01-01 00:00:04 | 0 |
| 4 | 9 | 2015-01-01 00:00:05 | 0 |
| 5 | 7 | 2015-01-01 00:00:06 | 0 |
| 6 | 1 | 2015-01-01 00:00:07 | 0 |
| 7 | 6 | 2015-01-01 00:00:08 | 0 |
| 8 | 9 | 2015-01-01 00:00:09 | 0 |
| 9 | 5 | 2015-01-01 00:00:10 | 0 |
## Load an Excel Worksheet as a DataFrame
```py
# Import modules
import pandas as pd
# Load the excel file and assign it to xls_file
xls_file = pd.ExcelFile('../data/example.xls')
xls_file
# <pandas.io.excel.ExcelFile at 0x111912be0>
# View the sheet names in the spreadsheet
xls_file.sheet_names
# ['Sheet1']
# Load the xls file's Sheet1 as a dataframe
df = xls_file.parse('Sheet1')
df
```
| | year | deaths_attacker | deaths_defender | soldiers_attacker | soldiers_defender | wounded_attacker | wounded_defender |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1945 | 425 | 423 | 2532 | 37235 | 41 | 14 |
| 1 | 1956 | 242 | 264 | 6346 | 2523 | 214 | 1424 |
| 2 | 1964 | 323 | 1231 | 3341 | 2133 | 131 | 131 |
| 3 | 1969 | 223 | 23 | 6732 | 1245 | 12 | 12 |
| 4 | 1971 | 783 | 23 | 12563 | 2671 | 123 | 34 |
| 5 | 1981 | 436 | 42 | 2356 | 7832 | 124 | 124 |
| 6 | 1982 | 324 | 124 | 253 | 2622 | 264 | 1124 |
| 7 | 1992 | 3321 | 631 | 5277 | 3331 | 311 | 1431 |
| 8 | 1999 | 262 | 232 | 2732 | 2522 | 132 | 122 |
| 9 | 2004 | 843 | 213 | 6278 | 26773 | 623 | 2563 |
## Load a CSV
```py
# Import modules
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, ".", "."],
'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | 4 | 25,000 |
| 1 | Molly | Jacobson | 52 | 24 | 94,000 |
| 2 | Tina | . | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | . | 62 |
| 4 | Amy | Cooze | 73 | . | 70 |
```py
# Save the dataframe as a csv in the working directory
df.to_csv('pandas_dataframe_importing_csv/example.csv')
# Load the csv back into a dataframe
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv')
df
```
| | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Jason | Miller | 42 | 4 | 25,000 |
| 1 | 1 | Molly | Jacobson | 52 | 24 | 94,000 |
| 2 | 2 | Tina | . | 36 | 31 | 57 |
| 3 | 3 | Jake | Milner | 24 | . | 62 |
| 4 | 4 | Amy | Cooze | 73 | . | 70 |
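The `Unnamed: 0` column above is the old row index, which `to_csv` writes out by default and `read_csv` then reads back as data. Writing with `index=False` avoids the round-trip artifact; a sketch using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

df = pd.DataFrame({'first_name': ['Jason', 'Molly'], 'age': [42, 52]})

# Write without the index column
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# Reading it back gives only the real columns
df2 = pd.read_csv(buf)
```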
```py
# Load a csv with no header
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', header=None)
df
```
| | 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | NaN | first_name | last_name | age | preTestScore | postTestScore |
| 1 | 0.0 | Jason | Miller | 42 | 4 | 25,000 |
| 2 | 1.0 | Molly | Jacobson | 52 | 24 | 94,000 |
| 3 | 2.0 | Tina | . | 36 | 31 | 57 |
| 4 | 3.0 | Jake | Milner | 24 | . | 62 |
| 5 | 4.0 | Amy | Cooze | 73 | . | 70 |
```py
# Specify column names while loading the csv
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df
```
| | UID | First Name | Last Name | Age | Pre-Test Score | Post-Test Score |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | NaN | first_name | last_name | age | preTestScore | postTestScore |
| 1 | 0.0 | Jason | Miller | 42 | 4 | 25,000 |
| 2 | 1.0 | Molly | Jacobson | 52 | 24 | 94,000 |
| 3 | 2.0 | Tina | . | 36 | 31 | 57 |
| 4 | 3.0 | Jake | Milner | 24 | . | 62 |
| 5 | 4.0 | Amy | Cooze | 73 | . | 70 |
```py
# Load the csv while setting the index column to UID
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col='UID', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df
```
| | First Name | Last Name | Age | Pre-Test Score | Post-Test Score |
| --- | --- | --- | --- | --- | --- |
| UID | | | | | |
| NaN | first_name | last_name | age | preTestScore | postTestScore |
| 0.0 | Jason | Miller | 42 | 4 | 25,000 |
| 1.0 | Molly | Jacobson | 52 | 24 | 94,000 |
| 2.0 | Tina | . | 36 | 31 | 57 |
| 3.0 | Jake | Milner | 24 | . | 62 |
| 4.0 | Amy | Cooze | 73 | . | 70 |
```py
# Load the csv while setting the index columns to First Name and Last Name
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col=['First Name', 'Last Name'], names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df
```
| | | UID | Age | Pre-Test Score | Post-Test Score |
| --- | --- | --- | --- | --- | --- |
| First Name | Last Name | | | | |
| first_name | last_name | NaN | age | preTestScore | postTestScore |
| Jason | Miller | 0.0 | 42 | 4 | 25,000 |
| Molly | Jacobson | 1.0 | 52 | 24 | 94,000 |
| Tina | . | 2.0 | 36 | 31 | 57 |
| Jake | Milner | 3.0 | 24 | . | 62 |
| Amy | Cooze | 4.0 | 73 | . | 70 |
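In the last few examples the file's own header row (`first_name`, `last_name`, …) shows up as a data row, because `names=` alone tells pandas the file has no header. Passing `header=0` together with `names=` consumes the header row and relabels the columns; a small sketch:

```python
import io
import pandas as pd

csv_text = 'first_name,age\nJason,42\nMolly,52\n'

# header=0 discards the file's header row; names= supplies the new labels
df = pd.read_csv(io.StringIO(csv_text), header=0, names=['First Name', 'Age'])
```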
```py
# Specify '.' as missing values while loading the csv
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=['.'])
pd.isnull(df)
```
| | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False |
| 2 | False | False | True | False | False | False |
| 3 | False | False | False | False | True | False |
| 4 | False | False | False | False | True | False |
```py
# Load the csv while specifying '.' and 'NA' as missing values in the Last Name column and '.' in the Pre-Test Score column
sentinels = {'Last Name': ['.', 'NA'], 'Pre-Test Score': ['.']}
# Note: the dict keys must match the file's column headers; here they do not
# (the file uses last_name and preTestScore), so the '.' values survive below
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels)
df
```
| | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Jason | Miller | 42 | 4 | 25,000 |
| 1 | 1 | Molly | Jacobson | 52 | 24 | 94,000 |
| 2 | 2 | Tina | . | 36 | 31 | 57 |
| 3 | 3 | Jake | Milner | 24 | . | 62 |
| 4 | 4 | Amy | Cooze | 73 | . | 70 |
```py
# Skip the first 3 rows while loading the csv
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels, skiprows=3)
df
```
| | 2 | Tina | . | 36 | 31 | 57 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | Jake | Milner | 24 | . | 62 |
| 1 | 4 | Amy | Cooze | 73 | . | 70 |
```py
# Interpret ',' in numeric strings as a thousands separator while loading the csv
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', thousands=',')
df
```
| | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Jason | Miller | 42 | 4 | 25000 |
| 1 | 1 | Molly | Jacobson | 52 | 24 | 94000 |
| 2 | 2 | Tina | . | 36 | 31 | 57 |
| 3 | 3 | Jake | Milner | 24 | . | 62 |
| 4 | 4 | Amy | Cooze | 73 | . | 70 |
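The same `thousands=','` option works on any quoted number such as `"25,000"` that would otherwise load as a string; a self-contained check:

```python
import io
import pandas as pd

csv_text = 'name,score\nJason,"25,000"\nMolly,"94,000"\n'

# The commas inside the quoted numbers are treated as thousands separators
df = pd.read_csv(io.StringIO(csv_text), thousands=',')
```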
## Long to Wide Format
```py
# Import modules
import pandas as pd
raw_data = {'patient': [1, 1, 1, 2, 2],
'obs': [1, 2, 3, 1, 2],
'treatment': [0, 1, 0, 1, 0],
'score': [6252, 24243, 2345, 2342, 23525]}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
df
```
| | patient | obs | treatment | score |
| --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 0 | 6252 |
| 1 | 1 | 2 | 1 | 24243 |
| 2 | 1 | 3 | 0 | 2345 |
| 3 | 2 | 1 | 1 | 2342 |
| 4 | 2 | 2 | 0 | 23525 |
Make the data "wide".
Now we will create a "wide" dataframe with the rows by patient number, the columns by observation number, and the cell values being the score values.
```py
df.pivot(index='patient', columns='obs', values='score')
```
| obs | 1 | 2 | 3 |
| --- | --- | --- | --- |
| patient | | | |
| 1 | 6252.0 | 24243.0 | 2345.0 |
| 2 | 2342.0 | 23525.0 | NaN |
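The inverse transformation, wide back to long, is `pd.melt`; a sketch that unpivots a frame with the same shape as the result above (the column names are illustrative):

```python
import pandas as pd

wide = pd.DataFrame({'patient': [1, 2],
                     'obs1': [6252, 2342],
                     'obs2': [24243, 23525]})

# One row per (patient, observation) pair
long_df = pd.melt(wide, id_vars='patient', var_name='obs', value_name='score')
```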
## Lowercase Column Names in a DataFrame
```py
# Import modules
import pandas as pd
# Set iPython's max row display
pd.set_option('display.max_rows', 1000)
# Set iPython's max column display
pd.set_option('display.max_columns', 50)
# Create an example dataframe
data = {'NAME': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'YEAR': [2012, 2012, 2013, 2014, 2014],
'REPORTS': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```
| | NAME | REPORTS | YEAR |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
```py
# Lowercase the column names by mapping str.lower over them
df.columns = map(str.lower, df.columns)
df
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
## Create a New Column With a Function
```py
# Import modules
import pandas as pd
# Example dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
```
| | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
```py
# Create a function that takes two inputs, pre and post
def pre_post_difference(pre, post):
    # Return the difference between the two
    return post - pre

# Create a variable that is the output of the function
df['score_change'] = pre_post_difference(df['preTestScore'], df['postTestScore'])
# View the dataframe
df
```
| | regiment | company | name | preTestScore | postTestScore | score_change |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 | 21 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 | 70 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 | 26 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 | 60 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 | 67 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 | 21 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 | 70 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 | 26 |
| 8 | Scouts | 1st | Sloan | 2 | 62 | 60 |
| 9 | Scouts | 1st | Piger | 3 | 70 | 67 |
| 10 | Scouts | 2nd | Riani | 2 | 62 | 60 |
| 11 | Scouts | 2nd | Ali | 3 | 70 | 67 |
```py
# Create a function that takes one input, x
def score_multipler_2x_and_3x(x):
    # Return two things, x doubled and x tripled
    return x*2, x*3

# Create two new variables from the two outputs of the function
df['post_score_x2'], df['post_score_x3'] = zip(*df['postTestScore'].map(score_multipler_2x_and_3x))
df
```
| | regiment | company | name | preTestScore | postTestScore | score_change | post_score_x2 | post_score_x3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 | 21 | 50 | 75 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 | 70 | 188 | 282 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 | 26 | 114 | 171 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 | 60 | 124 | 186 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 | 67 | 140 | 210 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 | 21 | 50 | 75 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 | 70 | 188 | 282 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 | 26 | 114 | 171 |
| 8 | Scouts | 1st | Sloan | 2 | 62 | 60 | 124 | 186 |
| 9 | Scouts | 1st | Piger | 3 | 70 | 67 | 140 | 210 |
| 10 | Scouts | 2nd | Riani | 2 | 62 | 60 | 124 | 186 |
| 11 | Scouts | 2nd | Ali | 3 | 70 | 67 | 140 | 210 |
## Map External Values Into a DataFrame
```py
# Import modules
import pandas as pd
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'city'])
df
```
| | first_name | last_name | age | city |
| --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | San Francisco |
| 1 | Molly | Jacobson | 52 | Baltimore |
| 2 | Tina | Ali | 36 | Miami |
| 3 | Jake | Milner | 24 | Douglas |
| 4 | Amy | Cooze | 73 | Boston |
```py
# Create a dictionary of values
city_to_state = { 'San Francisco' : 'California',
'Baltimore' : 'Maryland',
'Miami' : 'Florida',
'Douglas' : 'Arizona',
'Boston' : 'Massachusetts'}
df['state'] = df['city'].map(city_to_state)
df
```
| | first_name | last_name | age | city | state |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | San Francisco | California |
| 1 | Molly | Jacobson | 52 | Baltimore | Maryland |
| 2 | Tina | Ali | 36 | Miami | Florida |
| 3 | Jake | Milner | 24 | Douglas | Arizona |
| 4 | Amy | Cooze | 73 | Boston | Massachusetts |
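`Series.map` returns NaN for any value missing from the dictionary, so a default can be supplied afterwards with `fillna`; a quick sketch (the 'Atlantis' entry is made up to show the fallback):

```python
import pandas as pd

city = pd.Series(['Miami', 'Douglas', 'Atlantis'])
city_to_state = {'Miami': 'Florida', 'Douglas': 'Arizona'}

# Unmapped cities become NaN, then get a default label
state = city.map(city_to_state).fillna('Unknown')
```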
## Missing Data in DataFrames
```py
# Import modules
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
'age': [42, np.nan, 36, 24, 73],
'sex': ['m', np.nan, 'f', 'm', 'f'],
'preTestScore': [4, np.nan, np.nan, 2, 3],
'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df
```
| | first_name | last_name | age | sex | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Tina | Ali | 36.0 | f | NaN | NaN |
| 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 |
| 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 |
```py
# Drop missing observations
df_no_missing = df.dropna()
df_no_missing
```
| | first_name | last_name | age | sex | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 |
| 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 |
| 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 |
```py
# Drop rows where all cells are NA
df_cleaned = df.dropna(how='all')
df_cleaned
```
| | first_name | last_name | age | sex | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 |
| 2 | Tina | Ali | 36.0 | f | NaN | NaN |
| 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 |
| 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 |
```py
# Create a new column full of missing values
df['location'] = np.nan
df
```
| | first_name | last_name | age | sex | preTestScore | postTestScore | location |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Tina | Ali | 36.0 | f | NaN | NaN | NaN |
| 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | NaN |
| 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | NaN |
```py
# Drop columns if they only contain missing values
df.dropna(axis=1, how='all')
```
| | first_name | last_name | age | sex | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Tina | Ali | 36.0 | f | NaN | NaN |
| 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 |
| 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 |
```py
# Drop rows that contain fewer than five non-missing values
# This is mostly useful for time series data
df.dropna(thresh=5)
```
| | first_name | last_name | age | sex | preTestScore | postTestScore | location |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | NaN |
| 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | NaN |
| 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | NaN |
```py
# Fill in missing data with zeros
df.fillna(0)
```
| | first_name | last_name | age | sex | preTestScore | postTestScore | location |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | 0.0 |
| 1 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
| 2 | Tina | Ali | 36.0 | f | 0.0 | 0.0 | 0.0 |
| 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | 0.0 |
| 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | 0.0 |
```py
# Fill in missing values in preTestScore with the mean of preTestScore
# inplace=True means the change is saved to df right away
df["preTestScore"].fillna(df["preTestScore"].mean(), inplace=True)
df
```
| | first_name | last_name | age | sex | preTestScore | postTestScore | location |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | NaN |
| 1 | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN |
| 2 | Tina | Ali | 36.0 | f | 3.0 | NaN | NaN |
| 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | NaN |
| 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | NaN |
```py
# Fill in missing values in postTestScore with each sex's mean of postTestScore
df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"), inplace=True)
df
```
| | first_name | last_name | age | sex | preTestScore | postTestScore | location |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | NaN |
| 1 | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN |
| 2 | Tina | Ali | 36.0 | f | 3.0 | 70.0 | NaN |
| 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | NaN |
| 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | NaN |
```py
# Select rows where age is not NaN and sex is not NaN
df[df['age'].notnull() & df['sex'].notnull()]
```
| | first_name | last_name | age | sex | preTestScore | postTestScore | location |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | NaN |
| 2 | Tina | Ali | 36.0 | f | 3.0 | 70.0 | NaN |
| 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | NaN |
| 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | NaN |
## Moving Averages in pandas
```py
# Import modules
import pandas as pd
# Create data
data = {'score': [1,1,1,2,2,2,3,3,3]}
# Create a dataframe
df = pd.DataFrame(data)
# View the dataframe
df
```
| | score |
| --- | --- |
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 2 |
| 6 | 3 |
| 7 | 3 |
| 8 | 3 |
```py
# Calculate the moving average. That is, take the first two values, average them,
# then drop the first value and add the third, and so on.
df.rolling(window=2).mean()
```
| | score |
| --- | --- |
| 0 | NaN |
| 1 | 1.0 |
| 2 | 1.0 |
| 3 | 1.5 |
| 4 | 2.0 |
| 5 | 2.0 |
| 6 | 2.5 |
| 7 | 3.0 |
| 8 | 3.0 |
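The leading NaN above appears because a window of 2 is incomplete at the first row. Passing `min_periods=1` lets the window shrink at the edges instead; a sketch:

```python
import pandas as pd

scores = pd.Series([1, 1, 2, 3])

# The default requires a full window, so the first value is NaN
full = scores.rolling(window=2).mean()

# min_periods=1 averages whatever is available
padded = scores.rolling(window=2, min_periods=1).mean()
```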
## Normalize a Column
```py
# Import required modules
import pandas as pd
from sklearn import preprocessing
# Display charts inline
%matplotlib inline
# Create an example dataframe with a column of unnormalized data
data = {'score': [234,24,14,27,-74,46,73,-18,59,160]}
df = pd.DataFrame(data)
df
```
| | score |
| --- | --- |
| 0 | 234 |
| 1 | 24 |
| 2 | 14 |
| 3 | 27 |
| 4 | -74 |
| 5 | 46 |
| 6 | 73 |
| 7 | -18 |
| 8 | 59 |
| 9 | 160 |
```py
# View the unnormalized data
df['score'].plot(kind='bar')
# <matplotlib.axes._subplots.AxesSubplot at 0x11b9c88d0>
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_normalize_column/pandas_normalize_column_5_1.png)
```py
# Create x, where x is the score column's values as floats
x = df[['score']].values.astype(float)
# Create a minmax scaler object
min_max_scaler = preprocessing.MinMaxScaler()
# Fit the scaler and transform the data
x_scaled = min_max_scaler.fit_transform(x)
# Put the normalized data into a dataframe
df_normalized = pd.DataFrame(x_scaled)
# View the dataframe
df_normalized
```
| | 0 |
| --- | --- |
| 0 | 1.000000 |
| 1 | 0.318182 |
| 2 | 0.285714 |
| 3 | 0.327922 |
| 4 | 0.000000 |
| 5 | 0.389610 |
| 6 | 0.477273 |
| 7 | 0.181818 |
| 8 | 0.431818 |
| 9 | 0.759740 |
```py
# Plot the dataframe
df_normalized.plot(kind='bar')
# <matplotlib.axes._subplots.AxesSubplot at 0x11ba31c50>
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_normalize_column/pandas_normalize_column_9_1.png)
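The same min-max scaling can be done in pandas alone, without scikit-learn; a minimal equivalent sketch:

```python
import pandas as pd

score = pd.Series([234, 24, 14, 27, -74, 46, 73, -18, 59, 160])

# (x - min) / (max - min) maps the column onto [0, 1]
normalized = (score - score.min()) / (score.max() - score.min())
```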
## Pivot Tables in pandas
```py
# Import modules
import pandas as pd
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
'TestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'TestScore'])
df
```
| | regiment | company | TestScore |
| --- | --- | --- | --- |
| 0 | Nighthawks | 1st | 4 |
| 1 | Nighthawks | 1st | 24 |
| 2 | Nighthawks | 2nd | 31 |
| 3 | Nighthawks | 2nd | 2 |
| 4 | Dragoons | 1st | 3 |
| 5 | Dragoons | 1st | 4 |
| 6 | Dragoons | 2nd | 24 |
| 7 | Dragoons | 2nd | 31 |
| 8 | Scouts | 1st | 2 |
| 9 | Scouts | 1st | 3 |
| 10 | Scouts | 2nd | 2 |
| 11 | Scouts | 2nd | 3 |
```py
# Create a pivot table of group means, by regiment and company
pd.pivot_table(df, index=['regiment','company'], aggfunc='mean')
```
| | | TestScore |
| --- | --- | --- |
| regiment | company | |
| Dragoons | 1st | 3.5 |
| | 2nd | 27.5 |
| Nighthawks | 1st | 14.0 |
| | 2nd | 16.5 |
| Scouts | 1st | 2.5 |
| | 2nd | 2.5 |
```py
# Create a pivot table of group counts, by regiment and company
df.pivot_table(index=['regiment','company'], aggfunc='count')
```
| | | TestScore |
| --- | --- | --- |
| regiment | company | |
| Dragoons | 1st | 2 |
| | 2nd | 2 |
| Nighthawks | 1st | 2 |
| | 2nd | 2 |
| Scouts | 1st | 2 |
| | 2nd | 2 |
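`pivot_table` also accepts a list of aggregation functions, producing one column per function under a MultiIndex; a sketch on the same shape of data:

```python
import pandas as pd

df = pd.DataFrame({'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons'],
                   'TestScore': [4, 24, 3, 31]})

# Columns become a MultiIndex of (function, value column)
table = df.pivot_table(index='regiment', values='TestScore', aggfunc=['mean', 'count'])
```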
## Quickly Change String Columns in pandas
I often need or want to change the case of every item in a column of strings (e.g. `BRAZIL` to `Brazil`, etc.). There are many ways to do it, but I've settled on this as the easiest and quickest.
```py
# Import pandas
import pandas as pd
# Create a series of names
first_names = pd.Series(['Steve Murrey', 'Jane Fonda', 'Sara McGully', 'Mary Jane'])
# Print the column
first_names
'''
0 Steve Murrey
1 Jane Fonda
2 Sara McGully
3 Mary Jane
dtype: object
'''
# Print the column in lowercase
first_names.str.lower()
'''
0 steve murrey
1 jane fonda
2 sara mcgully
3 mary jane
dtype: object
'''
# Print the column in uppercase
first_names.str.upper()
'''
0 STEVE MURREY
1 JANE FONDA
2 SARA MCGULLY
3 MARY JANE
dtype: object
'''
# Print the column in title case
first_names.str.title()
'''
0 Steve Murrey
1 Jane Fonda
2 Sara Mcgully
3 Mary Jane
dtype: object
'''
# Print the column split on whitespace
first_names.str.split(" ")
'''
0 [Steve, Murrey]
1 [Jane, Fonda]
2 [Sara, McGully]
3 [Mary, Jane]
dtype: object
'''
# Print the column with only the first letter capitalized
first_names.str.capitalize()
'''
0 Steve murrey
1 Jane fonda
2 Sara mcgully
3 Mary jane
dtype: object
'''
```
You get the idea. Learn more about string methods [here](https://docs.python.org/3.5/library/stdtypes.html#string-methods).
## Randomly Sample a DataFrame
```py
# Import modules
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Molly | Jacobson | 52 | 24 | 94 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
```py
# Select a random subset of 2 rows without replacement
df.take(np.random.permutation(len(df))[:2])
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 1 | Molly | Jacobson | 52 | 24 | 94 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
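Modern pandas also provides `DataFrame.sample` for this task, with `replace=` controlling replacement and `random_state=` making the draw reproducible; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'age': [42, 52, 36, 24, 73]})

# Two distinct rows; the seed makes the draw repeatable
subset = df.sample(n=2, replace=False, random_state=42)
```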
## Rank Rows in a DataFrame
```py
# Import modules
import pandas as pd
# Create a dataframe
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3],
'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```
| | coverage | name | reports | year |
| --- | --- | --- | --- | --- |
| Cochice | 25 | Jason | 4 | 2012 |
| Pima | 94 | Molly | 24 | 2012 |
| Santa Cruz | 57 | Tina | 31 | 2013 |
| Maricopa | 62 | Jake | 2 | 2014 |
| Yuma | 70 | Amy | 3 | 2014 |
5 rows × 4 columns
```py
# Create a new column that is the ascending rank of coverage
df['coverageRanked'] = df['coverage'].rank(ascending=1)
df
```
| | coverage | name | reports | year | coverageRanked |
| --- | --- | --- | --- | --- | --- |
| Cochice | 25 | Jason | 4 | 2012 | 1 |
| Pima | 94 | Molly | 24 | 2012 | 5 |
| Santa Cruz | 57 | Tina | 31 | 2013 | 2 |
| Maricopa | 62 | Jake | 2 | 2014 | 3 |
| Yuma | 70 | Amy | 3 | 2014 | 4 |
5 rows × 5 columns
## Regular Expression Basics
```py
# Import the regex module
import re

text = 'The quick brown fox jumped over the lazy black bear.'

# Compile a pattern that matches three word characters
three_letter_word = r'\w{3}'
pattern_re = re.compile(three_letter_word); pattern_re
# re.compile(r'\w{3}', re.UNICODE)

# Search for any two characters followed by 'own'
re_search = re.search('..own', text)
if re_search:
    # Print the search result
    print(re_search.group())
# brown
```
`re.match`
`re.match()` only matches at the beginning of a string (or against the whole string). For anything else, use `re.search`.
```py
# Try to match '..own'; it fails because the match must start at the first character
re_match = re.match('..own', text)
if re_match:
    # Print the match
    print(re_match.group())
else:
    # Print this
    print('No matches')
# No matches
```
`re.split`
```py
# Split the string on 'e'
re_split = re.split('e', text); re_split
# ['Th', ' quick brown fox jump', 'd ov', 'r th', ' lazy black b', 'ar.']
```
`re.sub`
Replace a regex pattern with something else. The `3` is the maximum number of substitutions to make.
```py
# 用 'E' 替换前三个 'e' 实例,然后打印出来
re_sub = re.sub('e', 'E', text, 3); print(re_sub)
# ThE quick brown fox jumpEd ovEr the lazy black bear.
```
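顺带一提,`re.subn` 与 `re.sub` 类似,但会同时返回实际替换的次数(以下示例沿用上面的 `text`):

```python
import re

text = 'The quick brown fox jumped over the lazy black bear.'

# re.subn 返回 (替换后的字符串, 替换次数) 元组
result, count = re.subn('e', 'E', text)
print(count)
# 5
```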
## 正则表达式示例
```py
# 导入 regex
import re
# 创建一些数据
text = 'A flock of 120 quick brown foxes jumped over 30 lazy brown, bears.'
re.findall('^A', text)
# ['A']
re.findall('bears.$', text)
# ['bears.']
re.findall('f..es', text)
# ['foxes']
# 寻找所有元音
re.findall('[aeiou]', text)
# ['o', 'o', 'u', 'i', 'o', 'o', 'e', 'u', 'e', 'o', 'e', 'a', 'o', 'e', 'a']
# 查找不是小写元音的所有字符
re.findall('[^aeiou]', text)
'''
['A',
' ',
'f',
'l',
'c',
'k',
' ',
'f',
' ',
'1',
'2',
'0',
' ',
'q',
'c',
'k',
' ',
'b',
'r',
'w',
'n',
' ',
'f',
'x',
's',
' ',
'j',
'm',
'p',
'd',
' ',
'v',
'r',
' ',
'3',
'0',
' ',
'l',
'z',
'y',
' ',
'b',
'r',
'w',
'n',
',',
' ',
'b',
'r',
's',
'.']
'''
re.findall('a|A', text)
# ['A', 'a', 'a']
# 寻找 'foxes' 的实例(括号创建分组)
re.findall('(foxes)', text)
# ['foxes']
# 匹配任意连续五个单词字符(注意结果中的 'jumpe' 并非完整单词)
re.findall('\w\w\w\w\w', text)
# ['flock', 'quick', 'brown', 'foxes', 'jumpe', 'brown', 'bears']
re.findall('\W\W', text)
# [', ']
re.findall('\s', text)
# [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
re.findall('\S\S', text)
'''
['fl',
'oc',
'of',
'12',
'qu',
'ic',
'br',
'ow',
'fo',
'xe',
'ju',
'mp',
'ed',
'ov',
'er',
'30',
'la',
'zy',
'br',
'ow',
'n,',
'be',
'ar',
's.']
'''
re.findall('\d\d\d', text)
# ['120']
re.findall('\D\D\D\D\D', text)
'''
['A flo',
'ck of',
' quic',
'k bro',
'wn fo',
'xes j',
'umped',
' over',
' lazy',
' brow',
'n, be']
'''
re.findall('\AA', text)
# ['A']
re.findall('bears.\Z', text)
# ['bears.']
# 注意:非原始字符串中 '\b' 是退格符而非单词边界,因此没有匹配;单词边界应写作 r'\b'
re.findall('\b[foxes]', text)
# []
re.findall('\n', text)
# []
re.findall('[Ff]oxes', 'foxes Foxes Doxes')
# ['foxes', 'Foxes']
re.findall('[a-z]', 'foxes Foxes')
# ['f', 'o', 'x', 'e', 's', 'o', 'x', 'e', 's']
re.findall('[A-Z]', 'foxes Foxes')
# ['F']
re.findall('[a-zA-Z0-9]', 'foxes Foxes')
# ['f', 'o', 'x', 'e', 's', 'F', 'o', 'x', 'e', 's']
re.findall('[^aeiou]', 'foxes Foxes')
# ['f', 'x', 's', ' ', 'F', 'x', 's']
re.findall('[^0-9]', 'foxes Foxes')
# ['f', 'o', 'x', 'e', 's', ' ', 'F', 'o', 'x', 'e', 's']
re.findall('foxes?', 'foxes Foxes')
# ['foxes']
re.findall('ox*', 'foxes Foxes')
# ['ox', 'ox']
re.findall('ox+', 'foxes Foxes')
# ['ox', 'ox']
re.findall('\d{3}', text)
# ['120']
re.findall('\d{2,}', text)
# ['120', '30']
re.findall('\d{2,3}', text)
# ['120', '30']
re.findall('^A', text)
# ['A']
re.findall('bears.$', text)
# ['bears.']
re.findall('\AA', text)
# ['A']
re.findall('bears.\Z', text)
# ['bears.']
re.findall('bears(?=.)', text)
# ['bears']
re.findall('foxes(?!!)', 'foxes foxes!')
# ['foxes']
re.findall('foxes|foxes!', 'foxes foxes!')
# ['foxes', 'foxes']
re.findall('fox(es!)', 'foxes foxes!')
# ['es!']
re.findall('foxes(!)', 'foxes foxes!')
# ['!']
```
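在上面分组的基础上,还可以用命名分组 `(?P<name>...)` 让匹配结果更易读(以下是一个小示例,沿用同样的 `text`):

```python
import re

text = 'A flock of 120 quick brown foxes jumped over 30 lazy brown, bears.'

# 用命名分组提取第一个数字及其后面的单词
match = re.search(r'(?P<count>\d+) (?P<word>\w+)', text)
print(match.group('count'))
# 120
print(match.group('word'))
# quick
```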
## 重索引序列和数据帧
```py
# 导入模块
import pandas as pd
import numpy as np
# 创建亚利桑那州南部的火灾风险序列
brushFireRisk = pd.Series([34, 23, 12, 23], index = ['Bisbee', 'Douglas', 'Sierra Vista', 'Tombstone'])
brushFireRisk
'''
Bisbee 34
Douglas 23
Sierra Vista 12
Tombstone 23
dtype: int64
'''
# 重索引这个序列并创建一个新的序列变量
brushFireRiskReindexed = brushFireRisk.reindex(['Tombstone', 'Douglas', 'Bisbee', 'Sierra Vista', 'Barley', 'Tucson'])
brushFireRiskReindexed
'''
Tombstone 23.0
Douglas 23.0
Bisbee 34.0
Sierra Vista 12.0
Barley NaN
Tucson NaN
dtype: float64
'''
# 重索引序列并在任何缺失的索引处填入 0
brushFireRiskReindexed = brushFireRisk.reindex(['Tombstone', 'Douglas', 'Bisbee', 'Sierra Vista', 'Barley', 'Tucson'], fill_value = 0)
brushFireRiskReindexed
'''
Tombstone 23
Douglas 23
Bisbee 34
Sierra Vista 12
Barley 0
Tucson 0
dtype: int64
'''
# 创建数据帧
data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)
df
```
| | county | reports | year |
| --- | --- | --- | --- |
| 0 | Cochice | 4 | 2012 |
| 1 | Pima | 24 | 2012 |
| 2 | Santa Cruz | 31 | 2013 |
| 3 | Maricopa | 2 | 2014 |
| 4 | Yuma | 3 | 2014 |
```py
# 更改行的顺序(索引)
df.reindex([4, 3, 2, 1, 0])
```
| | county | reports | year |
| --- | --- | --- | --- |
| 4 | Yuma | 3 | 2014 |
| 3 | Maricopa | 2 | 2014 |
| 2 | Santa Cruz | 31 | 2013 |
| 1 | Pima | 24 | 2012 |
| 0 | Cochice | 4 | 2012 |
```py
# 更改列的顺序(索引)
columnsTitles = ['year', 'reports', 'county']
df.reindex(columns=columnsTitles)
```
| | year | reports | county |
| --- | --- | --- | --- |
| 0 | 2012 | 4 | Cochice |
| 1 | 2012 | 24 | Pima |
| 2 | 2013 | 31 | Santa Cruz |
| 3 | 2014 | 2 | Maricopa |
| 4 | 2014 | 3 | Yuma |
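对于单调有序的索引,`reindex` 还支持 `method='ffill'`,用前一个有效值填充缺失位置(以下是一个独立的小示例):

```python
import pandas as pd

# 索引 0、2、4 有值,1、3、5 缺失
s = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# 前向填充:缺失索引沿用前一个值
print(s.reindex(range(6), method='ffill'))
```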
## 重命名列标题
来自 [StackOverflow](http://stackoverflow.com/users/3291077/rgalbo) 上的 [rgalbo](http://stackoverflow.com/questions/31328861/python-pandas-replacing-header-with-top-row)。
```py
# 导入所需模块
import pandas as pd
# 创建列表的字典,作为值
raw_data = {'0': ['first_name', 'Molly', 'Tina', 'Jake', 'Amy'],
'1': ['last_name', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'2': ['age', 52, 36, 24, 73],
'3': ['preTestScore', 24, 31, 2, 3]}
# 创建数据帧
df = pd.DataFrame(raw_data)
# 查看数据帧
df
```
| | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | first_name | last_name | age | preTestScore |
| 1 | Molly | Jacobson | 52 | 24 |
| 2 | Tina | Ali | 36 | 31 |
| 3 | Jake | Milner | 24 | 2 |
| 4 | Amy | Cooze | 73 | 3 |
```py
# 从数据集的第一行创建一个名为 header 的新变量
header = df.iloc[0]
'''
0 first_name
1 last_name
2 age
3 preTestScore
Name: 0, dtype: object
'''
# 将数据帧替换为不包含第一行的新数据帧
df = df[1:]
# 使用标题变量重命名数据帧的列值
df.rename(columns = header)
```
| | first_name | last_name | age | preTestScore |
| --- | --- | --- | --- | --- |
| 1 | Molly | Jacobson | 52 | 24 |
| 2 | Tina | Ali | 36 | 31 |
| 3 | Jake | Milner | 24 | 2 |
| 4 | Amy | Cooze | 73 | 3 |
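一个等价的做法(基于同样的 `raw_data`,仅作演示)是直接把第一行赋给 `df.columns`,再丢弃这一行:

```python
import pandas as pd

raw_data = {'0': ['first_name', 'Molly', 'Tina', 'Jake', 'Amy'],
            '1': ['last_name', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            '2': ['age', 52, 36, 24, 73],
            '3': ['preTestScore', 24, 31, 2, 3]}
df = pd.DataFrame(raw_data)

# 把第一行提升为列名,再去掉这一行
df.columns = df.iloc[0]
df = df[1:]
print(list(df.columns))
# ['first_name', 'last_name', 'age', 'preTestScore']
```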
## 重命名多个数据帧的列名
```py
# 导入模块
import pandas as pd
# 设置 ipython 的最大行显示
pd.set_option('display.max_rows', 1000)
# 设置 ipython 的最大列宽
pd.set_option('display.max_columns', 50)
# 创建示例数据帧
data = {'Commander': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'Date': ['2012, 02, 08', '2012, 02, 08', '2012, 02, 08', '2012, 02, 08', '2012, 02, 08'],
'Score': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```
| | Commander | Date | Score |
| --- | --- | --- | --- |
| Cochice | Jason | 2012, 02, 08 | 4 |
| Pima | Molly | 2012, 02, 08 | 24 |
| Santa Cruz | Tina | 2012, 02, 08 | 31 |
| Maricopa | Jake | 2012, 02, 08 | 2 |
| Yuma | Amy | 2012, 02, 08 | 3 |
```py
# 重命名列名
df.columns = ['Leader', 'Time', 'Score']
df
```
| | Leader | Time | Score |
| --- | --- | --- | --- |
| Cochice | Jason | 2012, 02, 08 | 4 |
| Pima | Molly | 2012, 02, 08 | 24 |
| Santa Cruz | Tina | 2012, 02, 08 | 31 |
| Maricopa | Jake | 2012, 02, 08 | 2 |
| Yuma | Amy | 2012, 02, 08 | 3 |
```py
df.rename(columns={'Leader': 'Commander'}, inplace=True)
df
```
| | Commander | Time | Score |
| --- | --- | --- | --- |
| Cochice | Jason | 2012, 02, 08 | 4 |
| Pima | Molly | 2012, 02, 08 | 24 |
| Santa Cruz | Tina | 2012, 02, 08 | 31 |
| Maricopa | Jake | 2012, 02, 08 | 2 |
| Yuma | Amy | 2012, 02, 08 | 3 |
## 替换值
```py
# 导入模块
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [-999, -999, -999, 2, 1],
'postTestScore': [2, 2, -999, 2, -999]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | -999 | 2 |
| 1 | Molly | Jacobson | 52 | -999 | 2 |
| 2 | Tina | Ali | 36 | -999 | -999 |
| 3 | Jake | Milner | 24 | 2 | 2 |
| 4 | Amy | Cooze | 73 | 1 | -999 |
```py
# 将所有 -999 替换为 NAN
df.replace(-999, np.nan)
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | NaN | 2.0 |
| 1 | Molly | Jacobson | 52 | NaN | 2.0 |
| 2 | Tina | Ali | 36 | NaN | NaN |
| 3 | Jake | Milner | 24 | 2.0 | 2.0 |
| 4 | Amy | Cooze | 73 | 1.0 | NaN |
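`replace` 也接受字典,可以一次替换多个不同的标记值(以下是一个独立的小示例):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [-999, 2, -1000, 5]})

# 用字典一次替换多个标记值
print(df.replace({-999: np.nan, -1000: np.nan}))
```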
## 将数据帧保存为 CSV
```py
# 导入模块
import pandas as pd
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Molly | Jacobson | 52 | 24 | 94 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
将名为`df`的数据帧保存为 csv。
```py
df.to_csv('example.csv')
```
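默认情况下 `to_csv` 会把行索引也写入文件;如果不需要,可以传入 `index=False`。下面的小示例(文件名是假设的)写出后再读回验证:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# index=False 避免把行索引写成额外的一列
df.to_csv('example_no_index.csv', index=False)

# 读回并确认内容一致
df_back = pd.read_csv('example_no_index.csv')
print(df_back.equals(df))
# True
```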
## 在列中搜索某个值
```py
# 导入模块
import pandas as pd
raw_data = {'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
'age': [42, 42, 36, 24, 73],
'preTestScore': [4, 4, 31, 2, 3],
'postTestScore': [25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Jason | Miller | 42 | 4 | 25 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
```py
# 在 postTestScore 大于 50 的位置保留 preTestScore,其余填 NaN
df['preTestScore'].where(df['postTestScore'] > 50)
'''
0 NaN
1 NaN
2 31.0
3 2.0
4 3.0
Name: preTestScore, dtype: float64
'''
```
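`where` 还接受第二个参数,用来替换不满足条件的位置,而不是填 `NaN`(以下是一个独立的小示例):

```python
import pandas as pd

s = pd.Series([25, 25, 57, 62, 70])

# 不满足条件的位置用 0 替换,而不是 NaN
print(s.where(s > 50, 0))
```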
## 选择包含特定值的行和列
```py
# 导入模块
import pandas as pd
# 设置 ipython 的最大行显示
pd.set_option('display.max_rows', 1000)
# 设置 ipython 的最大列宽
pd.set_option('display.max_columns', 50)
# 创建示例数据帧
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
```py
# 按照列值抓取行
value_list = ['Tina', 'Molly', 'Jason']
df[df.name.isin(value_list)]
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
```py
# 获取列值不是某个值的行
df[~df.name.isin(value_list)]
```
| | name | reports | year |
| --- | --- | --- | --- |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
## 选择具有特定值的行
```py
import pandas as pd
# 创建示例数据帧
data = {'name': ['Jason', 'Molly'],
'country': [['Syria', 'Lebanon'],['Spain', 'Morocco']]}
df = pd.DataFrame(data)
df
```
| | country | name |
| --- | --- | --- |
| 0 | [Syria, Lebanon] | Jason |
| 1 | [Spain, Morocco] | Molly |
```py
df[df['country'].map(lambda country: 'Syria' in country)]
```
| | country | name |
| --- | --- | --- |
| 0 | [Syria, Lebanon] | Jason |
## 使用多个过滤器选择行
```py
import pandas as pd
# 创建示例数据帧
data = {'name': ['A', 'B', 'C', 'D', 'E'],
'score': [1,2,3,4,5]}
df = pd.DataFrame(data)
df
```
| | name | score |
| --- | --- | --- |
| 0 | A | 1 |
| 1 | B | 2 |
| 2 | C | 3 |
| 3 | D | 4 |
| 4 | E | 5 |
```py
# 选择数据帧的行,其中 df.score 大于 1 且小于 5
df[(df['score'] > 1) & (df['score'] < 5)]
```
| | name | score |
| --- | --- | --- |
| 1 | B | 2 |
| 2 | C | 3 |
| 3 | D | 4 |
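同样的过滤也可以用 `DataFrame.query` 以字符串表达式写出,多个条件时更易读(小示例,重建上面的数据):

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E'],
                   'score': [1, 2, 3, 4, 5]})

# 与 df[(df['score'] > 1) & (df['score'] < 5)] 等价
print(df.query('1 < score < 5'))
```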
## 根据条件选择数据帧的行
```py
# 导入模块
import pandas as pd
import numpy as np
# 创建数据帧
raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan],
'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
'age': [42, 52, 36, 24, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'nationality', 'age'])
df
```
| | first_name | nationality | age |
| --- | --- | --- | --- |
| 0 | Jason | USA | 42 |
| 1 | Molly | USA | 52 |
| 2 | NaN | France | 36 |
| 3 | NaN | UK | 24 |
| 4 | NaN | UK | 70 |
```py
# 方法 1:使用布尔变量
# 如果国籍是美国,则变量为 TRUE
american = df['nationality'] == "USA"
# 如果年龄大于 50,则变量为 TRUE
elderly = df['age'] > 50
# 选择所有国籍为美国且年龄大于 50 的案例
df[american & elderly]
```
| | first_name | nationality | age |
| --- | --- | --- | --- |
| 1 | Molly | USA | 52 |
```py
# 方法 2:使用变量属性
# 选择所有不缺少名字且国籍为美国的案例
df[df['first_name'].notnull() & (df['nationality'] == "USA")]
```
| | first_name | nationality | age |
| --- | --- | --- | --- |
| 0 | Jason | USA | 42 |
| 1 | Molly | USA | 52 |
## 数据帧简单示例
```py
# 导入模块
import pandas as pd
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Molly | Jacobson | 52 | 24 | 94 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
```py
# 创建第二个数据帧
raw_data_2 = {'first_name': ['Sarah', 'Gueniva', 'Know', 'Sara', 'Cat'],
'last_name': ['Mornig', 'Jaker', 'Alom', 'Ormon', 'Koozer'],
'age': [53, 26, 72, 73, 24],
'preTestScore': [13, 52, 72, 26, 26],
'postTestScore': [82, 52, 56, 234, 254]}
df_2 = pd.DataFrame(raw_data_2, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df_2
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Sarah | Mornig | 53 | 13 | 82 |
| 1 | Gueniva | Jaker | 26 | 52 | 52 |
| 2 | Know | Alom | 72 | 72 | 56 |
| 3 | Sara | Ormon | 73 | 26 | 234 |
| 4 | Cat | Koozer | 24 | 26 | 254 |
```py
# 创建第三个数据帧
raw_data_3 = {'first_name': ['Sarah', 'Gueniva', 'Know', 'Sara', 'Cat'],
'last_name': ['Mornig', 'Jaker', 'Alom', 'Ormon', 'Koozer'],
'postTestScore_2': [82, 52, 56, 234, 254]}
df_3 = pd.DataFrame(raw_data_3, columns = ['first_name', 'last_name', 'postTestScore_2'])
df_3
```
| | first_name | last_name | postTestScore_2 |
| --- | --- | --- | --- |
| 0 | Sarah | Mornig | 82 |
| 1 | Gueniva | Jaker | 52 |
| 2 | Know | Alom | 56 |
| 3 | Sara | Ormon | 234 |
| 4 | Cat | Koozer | 254 |
## 排序数据帧的行
```py
# 导入模块
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [1, 2, 1, 2, 3],
'coverage': [2, 2, 3, 3, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```
| | coverage | name | reports | year |
| --- | --- | --- | --- | --- |
| Cochice | 2 | Jason | 1 | 2012 |
| Pima | 2 | Molly | 2 | 2012 |
| Santa Cruz | 3 | Tina | 1 | 2013 |
| Maricopa | 3 | Jake | 2 | 2014 |
| Yuma | 3 | Amy | 3 | 2014 |
```py
# 按报告对数据框的行降序排序
df.sort_values(by='reports', ascending=False)
```
| | coverage | name | reports | year |
| --- | --- | --- | --- | --- |
| Yuma | 3 | Amy | 3 | 2014 |
| Pima | 2 | Molly | 2 | 2012 |
| Maricopa | 3 | Jake | 2 | 2014 |
| Cochice | 2 | Jason | 1 | 2012 |
| Santa Cruz | 3 | Tina | 1 | 2013 |
```py
# 按 coverage 然后是报告对数据帧的行升序排序
df.sort_values(by=['coverage', 'reports'])
```
| | coverage | name | reports | year |
| --- | --- | --- | --- | --- |
| Cochice | 2 | Jason | 1 | 2012 |
| Pima | 2 | Molly | 2 | 2012 |
| Santa Cruz | 3 | Tina | 1 | 2013 |
| Maricopa | 3 | Jake | 2 | 2014 |
| Yuma | 3 | Amy | 3 | 2014 |
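`ascending` 也可以传入布尔列表,为不同的列指定不同的排序方向(以下用同样的数据重建一个小示例):

```python
import pandas as pd

data = {'coverage': [2, 2, 3, 3, 3],
        'reports': [1, 2, 1, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])

# coverage 升序,reports 降序
print(df.sort_values(by=['coverage', 'reports'], ascending=[True, False]))
```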
## 将经纬度坐标变量拆分为单独的变量
```py
import pandas as pd
import numpy as np
raw_data = {'geo': ['40.0024, -105.4102', '40.0068, -105.266', '39.9318, -105.2813', np.nan]}
df = pd.DataFrame(raw_data, columns = ['geo'])
df
```
| | geo |
| --- | --- |
| 0 | 40.0024, -105.4102 |
| 1 | 40.0068, -105.266 |
| 2 | 39.9318, -105.2813 |
| 3 | NaN |
```py
# 为要放置的循环结果创建两个列表
lat = []
lon = []
# 对于变量中的每一行
for row in df['geo']:
    # 尝试:
    try:
        # 用逗号分隔行,并将逗号前的部分追加到 lat
        lat.append(row.split(',')[0])
        # 将逗号后的部分追加到 lon
        lon.append(row.split(',')[1])
    # 如果该行是缺失值(NaN),split 会抛出 AttributeError
    except AttributeError:
        # 向 lat 添加缺失值
        lat.append(np.NaN)
        # 向 lon 添加缺失值
        lon.append(np.NaN)
# 从 lat 和 lon 创建新的两列
df['latitude'] = lat
df['longitude'] = lon
df
```
| | geo | latitude | longitude |
| --- | --- | --- | --- |
| 0 | 40.0024, -105.4102 | 40.0024 | -105.4102 |
| 1 | 40.0068, -105.266 | 40.0068 | -105.266 |
| 2 | 39.9318, -105.2813 | 39.9318 | -105.2813 |
| 3 | NaN | NaN | NaN |
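上面的循环也可以用向量化的 `str.split(expand=True)` 一步完成,缺失行自动保持为 `NaN`(假设同样的 `raw_data`):

```python
import pandas as pd
import numpy as np

raw_data = {'geo': ['40.0024, -105.4102', '40.0068, -105.266',
                    '39.9318, -105.2813', np.nan]}
df = pd.DataFrame(raw_data)

# 按逗号拆分为两列;expand=True 返回 DataFrame 而非列表
df[['latitude', 'longitude']] = df['geo'].str.split(',', expand=True)
print(df)
```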
## 数据流水线
```py
# 创建一些原始数据
raw_data = [1,2,3,4,5,6,7,8,9,10]
# 定义产生 input+6 的生成器
def add_6(numbers):
    for x in numbers:
        output = x + 6
        yield output

# 定义产生 input-2 的生成器
def subtract_2(numbers):
    for x in numbers:
        output = x - 2
        yield output

# 定义产生 input*100 的生成器
def multiply_by_100(numbers):
    for x in numbers:
        output = x * 100
        yield output
# 流水线的第一步
step1 = add_6(raw_data)
# 流水线的第二步
step2 = subtract_2(step1)
# 流水线的第三步
pipeline = multiply_by_100(step2)
# 流水线的第一个输出:(1+6-2)*100
next(pipeline)
# 500
# 流水线的第二个输出
next(pipeline)
# 600
# 处理流水线中剩余的所有数据
for number in pipeline:
    print(number)
'''
700
800
900
1000
1100
1200
1300
1400
'''
```
## 数据帧中的字符串整理
```py
# 导入模块
import pandas as pd
import numpy as np
import re as re
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'email': ['jas203@gmail.com', 'momomolly@gmail.com', np.NAN, 'battler@milner.com', 'Ames1234@yahoo.com'],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'email', 'preTestScore', 'postTestScore'])
df
```
| | first_name | last_name | email | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | jas203@gmail.com | 4 | 25 |
| 1 | Molly | Jacobson | momomolly@gmail.com | 24 | 94 |
| 2 | Tina | Ali | NaN | 31 | 57 |
| 3 | Jake | Milner | battler@milner.com | 2 | 62 |
| 4 | Amy | Cooze | Ames1234@yahoo.com | 3 | 70 |
```py
# 电子邮件列中的哪些字符串包含 'gmail'
df['email'].str.contains('gmail')
'''
0 True
1 True
2 NaN
3 False
4 False
Name: email, dtype: object
'''
pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
df['email'].str.findall(pattern, flags=re.IGNORECASE)
'''
0 [(jas203, gmail, com)]
1 [(momomolly, gmail, com)]
2 NaN
3 [(battler, milner, com)]
4 [(Ames1234, yahoo, com)]
Name: email, dtype: object
'''
matches = df['email'].str.match(pattern, flags=re.IGNORECASE)
matches
'''
/Users/chrisralbon/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.
if __name__ == '__main__':
0 (jas203, gmail, com)
1 (momomolly, gmail, com)
2 NaN
3 (battler, milner, com)
4 (Ames1234, yahoo, com)
Name: email, dtype: object
'''
matches.str[1]
'''
0 gmail
1 gmail
2 NaN
3 milner
4 yahoo
Name: email, dtype: object
'''
```
## 和 Pandas 一起使用列表推导式
```py
# 导入模块
import pandas as pd
# 设置 ipython 的最大行显示
pd.set_option('display.max_rows', 1000)
# 设置 ipython 的最大列宽
pd.set_option('display.max_columns', 50)
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
先写成等价的循环形式。
```py
# 创建变量
next_year = []
# 对于 df.years 的每一行
for row in df['year']:
    # 为这一行加 1 并追加到 next_year
    next_year.append(row + 1)
# 创建 df.next_year
df['next_year'] = next_year
# 查看数据帧
df
```
| | name | reports | year | next_year |
| --- | --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 | 2013 |
| Pima | Molly | 24 | 2012 | 2013 |
| Santa Cruz | Tina | 31 | 2013 | 2014 |
| Maricopa | Jake | 2 | 2014 | 2015 |
| Yuma | Amy | 3 | 2014 | 2015 |
作为列表推导式。
```py
# 对于 df.year 中的每一行,从行中减去 1
df['previous_year'] = [row-1 for row in df['year']]
df
```
| | name | reports | year | next_year | previous_year |
| --- | --- | --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 | 2013 | 2011 |
| Pima | Molly | 24 | 2012 | 2013 | 2011 |
| Santa Cruz | Tina | 31 | 2013 | 2014 | 2012 |
| Maricopa | Jake | 2 | 2014 | 2015 | 2013 |
| Yuma | Amy | 3 | 2014 | 2015 | 2013 |
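其实这类逐行加减用 pandas 的向量化运算最简洁,无需循环或列表推导式(以下是一个独立的小示例):

```python
import pandas as pd

df = pd.DataFrame({'year': [2012, 2012, 2013, 2014, 2014]})

# 对整列一次性加 1
df['next_year'] = df['year'] + 1
print(df['next_year'].tolist())
# [2013, 2013, 2014, 2015, 2015]
```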
## 使用 Seaborn 来可视化数据帧
```py
import pandas as pd
%matplotlib inline
import random
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame()
df['x'] = random.sample(range(1, 100), 25)
df['y'] = random.sample(range(1, 100), 25)
df.head()
```
| | x | y |
| --- | --- | --- |
| 0 | 18 | 25 |
| 1 | 42 | 67 |
| 2 | 52 | 77 |
| 3 | 4 | 34 |
| 4 | 14 | 69 |
```py
# 散点图
sns.lmplot('x', 'y', data=df, fit_reg=False)
# <seaborn.axisgrid.FacetGrid at 0x114563b00>
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_6_1.png)
```py
# 密度图
sns.kdeplot(df.y)
# <matplotlib.axes._subplots.AxesSubplot at 0x113ea2ef0>
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_8_1.png)
```py
sns.kdeplot(df.y, df.x)
# <matplotlib.axes._subplots.AxesSubplot at 0x113d7fef0>
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_9_1.png)
```py
sns.distplot(df.x)
# <matplotlib.axes._subplots.AxesSubplot at 0x114294160>
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_10_1.png)
```py
# 直方图
plt.hist(df.x, alpha=.3)
sns.rugplot(df.x);
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_12_0.png)
```py
# 箱形图
sns.boxplot([df.y, df.x])
# <matplotlib.axes._subplots.AxesSubplot at 0x1142b8b38>
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_14_1.png)
```py
# 提琴图
sns.violinplot([df.y, df.x])
# <matplotlib.axes._subplots.AxesSubplot at 0x114444a58>
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_16_1.png)
```py
# 热力图
sns.heatmap([df.y, df.x], annot=True, fmt="d")
# <matplotlib.axes._subplots.AxesSubplot at 0x114530c88>
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_18_1.png)
```py
# 聚类图
sns.clustermap(df)
# <seaborn.matrix.ClusterGrid at 0x116f313c8>
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_20_1.png)
# Pandas 数据结构
```py
# 导入模块
import pandas as pd
```
### 序列 101
序列是一维数组(类似 R 的向量)。
```py
# 创建 floodingReports 数量的序列
floodingReports = pd.Series([5, 6, 2, 9, 12])
floodingReports
'''
0 5
1 6
2 2
3 9
4 12
dtype: int64
'''
```
请注意,第一列数字(0 到 4)是索引。
```py
# 将县名设置为 floodingReports 序列的索引
floodingReports = pd.Series([5, 6, 2, 9, 12], index=['Cochise County', 'Pima County', 'Santa Cruz County', 'Maricopa County', 'Yuma County'])
floodingReports
'''
Cochise County 5
Pima County 6
Santa Cruz County 2
Maricopa County 9
Yuma County 12
dtype: int64
'''
floodingReports['Cochise County']
# 5
floodingReports[floodingReports > 6]
'''
Maricopa County 9
Yuma County 12
dtype: int64
'''
```
从字典中创建 Pandas 序列。
注意:执行此操作时,字典的键将成为序列索引。
```py
# 创建字典
fireReports_dict = {'Cochise County': 12, 'Pima County': 342, 'Santa Cruz County': 13, 'Maricopa County': 42, 'Yuma County' : 52}
# 将字典转换为 pd.Series,然后查看它
fireReports = pd.Series(fireReports_dict); fireReports
'''
Cochise County 12
Maricopa County 42
Pima County 342
Santa Cruz County 13
Yuma County 52
dtype: int64
'''
fireReports.index = ["Cochice", "Pima", "Santa Cruz", "Maricopa", "Yuma"]
fireReports
'''
Cochice 12
Pima 42
Santa Cruz 342
Maricopa 13
Yuma 52
dtype: int64
'''
```
### 数据帧 101
数据帧就像 R 的数据帧。
```py
# 从等长列表或 NumPy 数组的字典中创建数据帧
data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)
df
```
| | county | reports | year |
| --- | --- | --- | --- |
| 0 | Cochice | 4 | 2012 |
| 1 | Pima | 24 | 2012 |
| 2 | Santa Cruz | 31 | 2013 |
| 3 | Maricopa | 2 | 2014 |
| 4 | Yuma | 3 | 2014 |
```py
# 使用 columns 属性设置列的顺序
dfColumnOrdered = pd.DataFrame(data, columns=['county', 'year', 'reports'])
dfColumnOrdered
```
| | county | year | reports |
| --- | --- | --- | --- |
| 0 | Cochice | 2012 | 4 |
| 1 | Pima | 2012 | 24 |
| 2 | Santa Cruz | 2013 | 31 |
| 3 | Maricopa | 2014 | 2 |
| 4 | Yuma | 2014 | 3 |
```py
# 添加一列
dfColumnOrdered['newsCoverage'] = pd.Series([42.3, 92.1, 12.2, 39.3, 30.2])
dfColumnOrdered
```
| | county | year | reports | newsCoverage |
| --- | --- | --- | --- | --- |
| 0 | Cochice | 2012 | 4 | 42.3 |
| 1 | Pima | 2012 | 24 | 92.1 |
| 2 | Santa Cruz | 2013 | 31 | 12.2 |
| 3 | Maricopa | 2014 | 2 | 39.3 |
| 4 | Yuma | 2014 | 3 | 30.2 |
```py
# 删除一列
del dfColumnOrdered['newsCoverage']
dfColumnOrdered
```
| | county | year | reports |
| --- | --- | --- | --- |
| 0 | Cochice | 2012 | 4 |
| 1 | Pima | 2012 | 24 |
| 2 | Santa Cruz | 2013 | 31 |
| 3 | Maricopa | 2014 | 2 |
| 4 | Yuma | 2014 | 3 |
```py
# 转置数据帧
dfColumnOrdered.T
```
| | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| county | Cochice | Pima | Santa Cruz | Maricopa | Yuma |
| year | 2012 | 2012 | 2013 | 2014 | 2014 |
| reports | 4 | 24 | 31 | 2 | 3 |
## Pandas 时间序列基础
```py
# 导入模块
from datetime import datetime
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
'battle_deaths': [34, 25, 26, 15, 15, 14, 26, 25, 62, 41]}
df = pd.DataFrame(data, columns = ['date', 'battle_deaths'])
print(df)
'''
date battle_deaths
0 2014-05-01 18:47:05.069722 34
1 2014-05-01 18:47:05.119994 25
2 2014-05-02 18:47:05.178768 26
3 2014-05-02 18:47:05.230071 15
4 2014-05-02 18:47:05.230071 15
5 2014-05-02 18:47:05.280592 14
6 2014-05-03 18:47:05.332662 26
7 2014-05-03 18:47:05.385109 25
8 2014-05-04 18:47:05.436523 62
9 2014-05-04 18:47:05.486877 41
'''
df['date'] = pd.to_datetime(df['date'])
df.index = df['date']
del df['date']
df
```
| | battle_deaths |
| --- | --- |
| date | |
| 2014-05-01 18:47:05.069722 | 34 |
| 2014-05-01 18:47:05.119994 | 25 |
| 2014-05-02 18:47:05.178768 | 26 |
| 2014-05-02 18:47:05.230071 | 15 |
| 2014-05-02 18:47:05.230071 | 15 |
| 2014-05-02 18:47:05.280592 | 14 |
| 2014-05-03 18:47:05.332662 | 26 |
| 2014-05-03 18:47:05.385109 | 25 |
| 2014-05-04 18:47:05.436523 | 62 |
| 2014-05-04 18:47:05.486877 | 41 |
```py
# 查看 2014 年的所有观测
df['2014']
```
| | battle_deaths |
| --- | --- |
| date | |
| 2014-05-01 18:47:05.069722 | 34 |
| 2014-05-01 18:47:05.119994 | 25 |
| 2014-05-02 18:47:05.178768 | 26 |
| 2014-05-02 18:47:05.230071 | 15 |
| 2014-05-02 18:47:05.230071 | 15 |
| 2014-05-02 18:47:05.280592 | 14 |
| 2014-05-03 18:47:05.332662 | 26 |
| 2014-05-03 18:47:05.385109 | 25 |
| 2014-05-04 18:47:05.436523 | 62 |
| 2014-05-04 18:47:05.486877 | 41 |
```py
# 查看 2014 年 5 月的所有观测
df['2014-05']
```
| | battle_deaths |
| --- | --- |
| date | |
| 2014-05-01 18:47:05.069722 | 34 |
| 2014-05-01 18:47:05.119994 | 25 |
| 2014-05-02 18:47:05.178768 | 26 |
| 2014-05-02 18:47:05.230071 | 15 |
| 2014-05-02 18:47:05.230071 | 15 |
| 2014-05-02 18:47:05.280592 | 14 |
| 2014-05-03 18:47:05.332662 | 26 |
| 2014-05-03 18:47:05.385109 | 25 |
| 2014-05-04 18:47:05.436523 | 62 |
| 2014-05-04 18:47:05.486877 | 41 |
```py
# 查看 2014.5.3 及之后的所有观测
df[datetime(2014, 5, 3):]
```
| | battle_deaths |
| --- | --- |
| date | |
| 2014-05-03 18:47:05.332662 | 26 |
| 2014-05-03 18:47:05.385109 | 25 |
| 2014-05-04 18:47:05.436523 | 62 |
| 2014-05-04 18:47:05.486877 | 41 |
### 5 月 3 日到 5 月 4 日之间的观测
```py
# 查看 2014.5.3~4 的所有观测
df['5/3/2014':'5/4/2014']
```
| | battle_deaths |
| --- | --- |
| date | |
| 2014-05-03 18:47:05.332662 | 26 |
| 2014-05-03 18:47:05.385109 | 25 |
| 2014-05-04 18:47:05.436523 | 62 |
| 2014-05-04 18:47:05.486877 | 41 |
```py
# 截断 2014-05-03 00:00:00 之后的观测
df.truncate(after='5/3/2014')
```
| | battle_deaths |
| --- | --- |
| date | |
| 2014-05-01 18:47:05.069722 | 34 |
| 2014-05-01 18:47:05.119994 | 25 |
| 2014-05-02 18:47:05.178768 | 26 |
| 2014-05-02 18:47:05.230071 | 15 |
| 2014-05-02 18:47:05.230071 | 15 |
| 2014-05-02 18:47:05.280592 | 14 |
```py
# 2014.5 的观测
df['5-2014']
```
| | battle_deaths |
| --- | --- |
| date | |
| 2014-05-01 18:47:05.069722 | 34 |
| 2014-05-01 18:47:05.119994 | 25 |
| 2014-05-02 18:47:05.178768 | 26 |
| 2014-05-02 18:47:05.230071 | 15 |
| 2014-05-02 18:47:05.230071 | 15 |
| 2014-05-02 18:47:05.280592 | 14 |
| 2014-05-03 18:47:05.332662 | 26 |
| 2014-05-03 18:47:05.385109 | 25 |
| 2014-05-04 18:47:05.436523 | 62 |
| 2014-05-04 18:47:05.486877 | 41 |
```py
# 计算每个时间戳的观测数
df.groupby(level=0).count()
```
| | battle_deaths |
| --- | --- |
| date | |
| 2014-05-01 18:47:05.069722 | 1 |
| 2014-05-01 18:47:05.119994 | 1 |
| 2014-05-02 18:47:05.178768 | 1 |
| 2014-05-02 18:47:05.230071 | 2 |
| 2014-05-02 18:47:05.280592 | 1 |
| 2014-05-03 18:47:05.332662 | 1 |
| 2014-05-03 18:47:05.385109 | 1 |
| 2014-05-04 18:47:05.436523 | 1 |
| 2014-05-04 18:47:05.486877 | 1 |
```py
# 每天的 battle_deaths 均值
df.resample('D').mean()
```
| | battle_deaths |
| --- | --- |
| date | |
| 2014-05-01 | 29.5 |
| 2014-05-02 | 17.5 |
| 2014-05-03 | 25.5 |
| 2014-05-04 | 51.5 |
```py
# 每天的 battle_deaths 总数
df.resample('D').sum()
```
| | battle_deaths |
| --- | --- |
| date | |
| 2014-05-01 | 59 |
| 2014-05-02 | 70 |
| 2014-05-03 | 51 |
| 2014-05-04 | 103 |
```py
# 绘制每天的总死亡人数
df.resample('D').sum().plot()
# <matplotlib.axes._subplots.AxesSubplot at 0x11187a940>
```
![png](https://chrisalbon.com/python/data_wrangling/pandas_time_series_basics/pandas_time_series_basics_29_1.png)