💎一站式轻松地调用各大LLM模型接口,支持GPT4、智谱、星火、月之暗面及文生图 广告
# 十九、数据整理(下) > 作者:[Chris Albon](https://chrisalbon.com/) > > 译者:[飞龙](https://github.com/wizardforcel) > > 协议:[CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/) ## 连接和合并数据帧 ```py # 导入模块 import pandas as pd from IPython.display import display from IPython.display import Image raw_data = { 'subject_id': ['1', '2', '3', '4', '5'], 'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']} df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name']) df_a ``` | | subject_id | first_name | last_name | | --- | --- | --- | --- | | 0 | 1 | Alex | Anderson | | 1 | 2 | Amy | Ackerman | | 2 | 3 | Allen | Ali | | 3 | 4 | Alice | Aoni | | 4 | 5 | Ayoung | Atiches | ```py # 创建第二个数据帧 raw_data = { 'subject_id': ['4', '5', '6', '7', '8'], 'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']} df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name']) df_b ``` | | subject_id | first_name | last_name | | --- | --- | --- | --- | | 0 | 4 | Billy | Bonder | | 1 | 5 | Brian | Black | | 2 | 6 | Bran | Balwner | | 3 | 7 | Bryce | Brice | | 4 | 8 | Betty | Btisan | ```py # 创建第三个数据帧 raw_data = { 'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'], 'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]} df_n = pd.DataFrame(raw_data, columns = ['subject_id','test_id']) df_n ``` | | subject_id | test_id | | --- | --- | --- | | 0 | 1 | 51 | | 1 | 2 | 15 | | 2 | 3 | 15 | | 3 | 4 | 61 | | 4 | 5 | 16 | | 5 | 7 | 14 | | 6 | 8 | 15 | | 7 | 9 | 1 | | 8 | 10 | 61 | | 9 | 11 | 16 | ```py # 将两个数据帧按行连接 df_new = pd.concat([df_a, df_b]) df_new ``` | | subject_id | first_name | last_name | | --- | --- | --- | --- | | 0 | 1 | Alex | Anderson | | 1 | 2 | Amy | Ackerman | | 2 | 3 | Allen | Ali | | 3 | 4 | Alice | Aoni | | 4 | 5 | Ayoung | Atiches | | 0 | 4 | Billy | Bonder | | 1 | 5 | Brian | Black | | 2 | 6 | Bran | Balwner | | 3 | 7 | Bryce | Brice | | 4 | 8 | Betty | Btisan | ```py # 将两个数据帧按列连接 pd.concat([df_a, df_b], axis=1) ``` | | subject_id | first_name | last_name | subject_id | first_name | last_name | | --- | --- | --- | --- | --- | --- | --- | | 0 | 1 | Alex | Anderson | 4 | Billy | Bonder | | 1 | 2 | Amy | Ackerman | 5 | Brian | Black | | 2 | 3 | Allen | Ali | 6 | Bran | Balwner | | 3 | 4 | Alice | Aoni | 7 | Bryce | Brice | | 4 | 5 | Ayoung | Atiches | 8 | Betty | Btisan | ```py # 按两个数据帧按 subject_id 连接 pd.merge(df_new, df_n, on='subject_id') ``` | | subject_id | first_name | last_name | test_id | | --- | --- | --- | --- | --- | | 0 | 1 | Alex | Anderson | 51 | | 1 | 2 | Amy | Ackerman | 15 | | 2 | 3 | Allen | Ali | 15 | | 3 | 4 | Alice | Aoni | 61 | | 4 | 4 | Billy | Bonder | 61 | | 5 | 5 | Ayoung | Atiches | 16 | | 6 | 5 | Brian | Black | 16 | | 7 | 7 | Bryce | Brice | 14 | | 8 | 8 | Betty | Btisan | 15 | ```py # 将两个数据帧按照左和右数据帧的 subject_id 连接 pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id') ``` | | subject_id | first_name | last_name | test_id | | --- | --- | --- | --- | --- | | 0 | 1 | Alex | Anderson | 51 | | 1 | 2 | Amy | Ackerman | 15 | | 2 | 3 | Allen | Ali | 15 | | 3 | 4 | Alice | Aoni | 61 | | 4 | 4 | Billy | Bonder | 61 | | 5 | 5 | Ayoung | Atiches | 16 | | 6 | 5 | Brian | Black | 16 | | 7 | 7 | Bryce | Brice | 14 | | 8 | 8 | Betty | Btisan | 15 | 使用外连接来合并。 “全外连接产生表 A 和表 B 中所有记录的集合,带有来自两侧的匹配记录。如果没有匹配,则缺少的一侧将包含空值。” -- [来源](http://blog .codinghorror.com/a-visual-explanation-of-sql-joins/) ```py pd.merge(df_a, df_b, on='subject_id', how='outer') ``` | | subject_id | first_name_x | last_name_x | first_name_y | last_name_y | | --- | --- | --- | --- | --- | --- | | 0 | 1 | Alex | Anderson | NaN | NaN | | 1 | 2 | Amy | Ackerman | NaN | NaN | | 2 | 3 | Allen | Ali | NaN | NaN | | 3 | 4 | Alice | Aoni | Billy | Bonder | | 4 | 5 | Ayoung | Atiches | Brian | Black | | 5 | 6 | NaN | NaN | Bran | Balwner | | 6 | 7 | NaN | NaN | Bryce | Brice | | 7 | 8 | NaN | NaN | Betty | Btisan | 使用内连接来合并。 “内联接只生成匹配表 A 和表 B 的记录集。” -- [来源](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/) ```py pd.merge(df_a, df_b, on='subject_id', how='inner') ``` | | subject_id | first_name_x | last_name_x | first_name_y | last_name_y | | --- | --- | --- | --- | --- | --- | | 0 | 4 | Alice | Aoni | Billy | Bonder | | 1 | 5 | Ayoung | Atiches | Brian | Black | ```py # 使用右连接来合并 pd.merge(df_a, df_b, on='subject_id', how='right') ``` | | subject_id | first_name_x | last_name_x | first_name_y | last_name_y | | --- | --- | --- | --- | --- | --- | | 0 | 4 | Alice | Aoni | Billy | Bonder | | 1 | 5 | Ayoung | Atiches | Brian | Black | | 2 | 6 | NaN | NaN | Bran | Balwner | | 3 | 7 | NaN | NaN | Bryce | Brice | | 4 | 8 | NaN | NaN | Betty | Btisan | 使用左连接来合并。 “左外连接从表 A 中生成一组完整的记录,它们在表 B 中有匹配的记录。如果没有匹配,右侧将包含空。” -- [来源](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/) ```py pd.merge(df_a, df_b, on='subject_id', how='left') ``` | | subject_id | first_name_x | last_name_x | first_name_y | last_name_y | | --- | --- | --- | --- | --- | --- | | 0 | 1 | Alex | Anderson | NaN | NaN | | 1 | 2 | Amy | Ackerman | NaN | NaN | | 2 | 3 | Allen | Ali | NaN | NaN | | 3 | 4 | Alice | Aoni | Billy | Bonder | | 4 | 5 | Ayoung | Atiches | Brian | Black | ```py # 合并时添加后缀以复制列名称 pd.merge(df_a, df_b, on='subject_id', how='left', suffixes=('_left', '_right')) ``` | | subject_id | first_name_left | last_name_left | first_name_right | last_name_right | | --- | --- | --- | --- | --- | --- | | 0 | 1 | Alex | Anderson | NaN | NaN | | 1 | 2 | Amy | Ackerman | NaN | NaN | | 2 | 3 | Allen | Ali | NaN | NaN | | 3 | 4 | Alice | Aoni | Billy | Bonder | | 4 | 5 | Ayoung | Atiches | Brian | Black | ```py # 基于索引的合并 pd.merge(df_a, df_b, right_index=True, left_index=True) ``` | | subject_id_x | first_name_x | last_name_x | subject_id_y | first_name_y | last_name_y | | --- | --- | --- | --- | --- | --- | --- | | 0 | 1 | Alex | Anderson | 4 | Billy | Bonder | | 1 | 2 | Amy | Ackerman | 5 | Brian | Black | | 2 | 3 | Allen | Ali | 6 | Bran | Balwner | | 3 | 4 | Alice | Aoni | 7 | Bryce | Brice | | 4 | 5 | Ayoung | Atiches | 8 | Betty | Btisan | ## 列出 pandas 列中的唯一值 特别感谢 Bob Haffner 指出了一种更好的方法。 ```py # 导入模块 import pandas as pd # 设置 ipython 的最大行显示 pd.set_option('display.max_row', 1000) # 设置 ipython 的最大列宽 pd.set_option('display.max_columns', 50) # 创建示例数据帧 data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df ``` | | name | reports | year | | --- | --- | --- | --- | | Cochice | Jason | 4 | 2012 | | Pima | Molly | 24 | 2012 | | Santa Cruz | Tina | 31 | 2013 | | Maricopa | Jake | 2 | 2014 | | Yuma | Amy | 3 | 2014 | ```py # 列出 df['name'] 的唯一值 df.name.unique() # array(['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], dtype=object) ``` ## 加载 JSON 文件 ```py # 加载库 import pandas as pd # 创建 JSON 文件的 URL(或者可以是文件路径) url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json' # 将 JSON 文件加载到数据框中 df = pd.read_json(url, orient='columns') # 查看前十行 df.head(10) ``` | | category | datetime | integer | | --- | --- | --- | --- | | 0 | 0 | 2015-01-01 00:00:00 | 5 | | 1 | 0 | 2015-01-01 00:00:01 | 5 | | 10 | 0 | 2015-01-01 00:00:10 | 5 | | 11 | 0 | 2015-01-01 00:00:11 | 5 | | 12 | 0 | 2015-01-01 00:00:12 | 8 | | 13 | 0 | 2015-01-01 00:00:13 | 9 | | 14 | 0 | 2015-01-01 00:00:14 | 8 | | 15 | 0 | 2015-01-01 00:00:15 | 8 | | 16 | 0 | 2015-01-01 00:00:16 | 2 | | 17 | 0 | 2015-01-01 00:00:17 | 1 | # 加载 Excel 文件 ```py # 加载库 import pandas as pd # 创建 Excel 文件的 URL(或者可以是文件路径) url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.xlsx' # 将 Excel 文件的第一页加载到数据框中 df = pd.read_excel(url, sheetname=0, header=1) # 查看前十行 df.head(10) ``` | | 5 | 2015-01-01 00:00:00 | 0 | | --- | --- | --- | --- | | 0 | 5 | 2015-01-01 00:00:01 | 0 | | 1 | 9 | 2015-01-01 00:00:02 | 0 | | 2 | 6 | 2015-01-01 00:00:03 | 0 | | 3 | 6 | 2015-01-01 00:00:04 | 0 | | 4 | 9 | 2015-01-01 00:00:05 | 0 | | 5 | 7 | 2015-01-01 00:00:06 | 0 | | 6 | 1 | 2015-01-01 00:00:07 | 0 | | 7 | 6 | 2015-01-01 00:00:08 | 0 | | 8 | 9 | 2015-01-01 00:00:09 | 0 | | 9 | 5 | 2015-01-01 00:00:10 | 0 | ## 将 Excel 表格加载为数据帧 ```py # 导入模块 import pandas as pd # 加载 excel 文件并赋给 xls_file xls_file = pd.ExcelFile('../data/example.xls') xls_file # <pandas.io.excel.ExcelFile at 0x111912be0> # 查看电子表格的名称 xls_file.sheet_names # ['Sheet1'] # 将 xls 文件 的 Sheet1 加载为数据帧 df = xls_file.parse('Sheet1') df ``` | | year | deaths_attacker | deaths_defender | soldiers_attacker | soldiers_defender | wounded_attacker | wounded_defender | | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | 1945 | 425 | 423 | 2532 | 37235 | 41 | 14 | | 1 | 1956 | 242 | 264 | 6346 | 2523 | 214 | 1424 | | 2 | 1964 | 323 | 1231 | 3341 | 2133 | 131 | 131 | | 3 | 1969 | 223 | 23 | 6732 | 1245 | 12 | 12 | | 4 | 1971 | 783 | 23 | 12563 | 2671 | 123 | 34 | | 5 | 1981 | 436 | 42 | 2356 | 7832 | 124 | 124 | | 6 | 1982 | 324 | 124 | 253 | 2622 | 264 | 1124 | | 7 | 1992 | 3321 | 631 | 5277 | 3331 | 311 | 1431 | | 8 | 1999 | 262 | 232 | 2732 | 2522 | 132 | 122 | | 9 | 2004 | 843 | 213 | 6278 | 26773 | 623 | 2563 | ## 加载 CSV ```py # 导入模块 import pandas as pd import numpy as np raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, ".", "."], 'postTestScore': ["25,000", "94,000", 57, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) df ``` | | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42 | 4 | 25,000 | | 1 | Molly | Jacobson | 52 | 24 | 94,000 | | 2 | Tina | . | 36 | 31 | 57 | | 3 | Jake | Milner | 24 | . | 62 | | 4 | Amy | Cooze | 73 | . | 70 | ```py # 将数据帧保存为工作目录中的 csv df.to_csv('pandas_dataframe_importing_csv/example.csv') df = pd.read_csv('pandas_dataframe_importing_csv/example.csv') df ``` | | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | --- | | 0 | 0 | Jason | Miller | 42 | 4 | 25,000 | | 1 | 1 | Molly | Jacobson | 52 | 24 | 94,000 | | 2 | 2 | Tina | . | 36 | 31 | 57 | | 3 | 3 | Jake | Milner | 24 | . | 62 | | 4 | 4 | Amy | Cooze | 73 | . | 70 | ```py # 加载无头 CSV df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', header=None) df ``` | | 0 | 1 | 2 | 3 | 4 | 5 | | --- | --- | --- | --- | --- | --- | --- | | 0 | NaN | first_name | last_name | age | preTestScore | postTestScore | | 1 | 0.0 | Jason | Miller | 42 | 4 | 25,000 | | 2 | 1.0 | Molly | Jacobson | 52 | 24 | 94,000 | | 3 | 2.0 | Tina | . | 36 | 31 | 57 | | 4 | 3.0 | Jake | Milner | 24 | . | 62 | | 5 | 4.0 | Amy | Cooze | 73 | . | 70 | ```py # 在加载 csv 时指定列名称 df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score']) df ``` | | UID | First Name | Last Name | Age | Pre-Test Score | Post-Test Score | | --- | --- | --- | --- | --- | --- | --- | | 0 | NaN | first_name | last_name | age | preTestScore | postTestScore | | 1 | 0.0 | Jason | Miller | 42 | 4 | 25,000 | | 2 | 1.0 | Molly | Jacobson | 52 | 24 | 94,000 | | 3 | 2.0 | Tina | . | 36 | 31 | 57 | | 4 | 3.0 | Jake | Milner | 24 | . | 62 | | 5 | 4.0 | Amy | Cooze | 73 | . | 70 | ```py # 通过将索引列设置为 UID 来加载 csv df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col='UID', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score']) df ``` | | First Name | Last Name | Age | Pre-Test Score | Post-Test Score | | --- | --- | --- | --- | --- | --- | | UID | | | | | | | NaN | first_name | last_name | age | preTestScore | postTestScore | | 0.0 | Jason | Miller | 42 | 4 | 25,000 | | 1.0 | Molly | Jacobson | 52 | 24 | 94,000 | | 2.0 | Tina | . | 36 | 31 | 57 | | 3.0 | Jake | Milner | 24 | . | 62 | | 4.0 | Amy | Cooze | 73 | . | 70 | ```py # 在加载 csv 时将索引列设置为名字和姓氏 df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col=['First Name', 'Last Name'], names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score']) df ``` | | | UID | Age | Pre-Test Score | Post-Test Score | | --- | --- | --- | --- | --- | --- | | First Name | Last Name | | | | | | first_name | last_name | NaN | age | preTestScore | postTestScore | | Jason | Miller | 0.0 | 42 | 4 | 25,000 | | Molly | Jacobson | 1.0 | 52 | 24 | 94,000 | | Tina | . | 2.0 | 36 | 31 | 57 | | Jake | Milner | 3.0 | 24 | . | 62 | | Amy | Cooze | 4.0 | 73 | . | 70 | ```py # 在加载 csv 时指定 '.' 为缺失值 df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=['.']) pd.isnull(df) ``` | | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | --- | | 0 | False | False | False | False | False | False | | 1 | False | False | False | False | False | False | | 2 | False | False | True | False | False | False | | 3 | False | False | False | False | True | False | | 4 | False | False | False | False | True | False | ```py # 加载csv,同时指定 '.' 和 'NA' 为“姓氏”列的缺失值,指定 '.' 为 preTestScore 列的缺失值 sentinels = {'Last Name': ['.', 'NA'], 'Pre-Test Score': ['.']} df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels) df ``` | | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | --- | | 0 | 0 | Jason | Miller | 42 | 4 | 25,000 | | 1 | 1 | Molly | Jacobson | 52 | 24 | 94,000 | | 2 | 2 | Tina | . | 36 | 31 | 57 | | 3 | 3 | Jake | Milner | 24 | . | 62 | | 4 | 4 | Amy | Cooze | 73 | . | 70 | ```py # 在加载 csv 时跳过前 3 行 df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels, skiprows=3) df ``` | | 2 | Tina | . | 36 | 31 | 57 | | --- | --- | --- | --- | --- | --- | --- | | 0 | 3 | Jake | Milner | 24 | . | 62 | | 1 | 4 | Amy | Cooze | 73 | . | 70 | ```py # 加载 csv,同时将数字字符串中的 ',' 解释为千位分隔符 df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', thousands=',') df ``` | | Unnamed: 0 | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | --- | | 0 | 0 | Jason | Miller | 42 | 4 | 25000 | | 1 | 1 | Molly | Jacobson | 52 | 24 | 94000 | | 2 | 2 | Tina | . | 36 | 31 | 57 | | 3 | 3 | Jake | Milner | 24 | . | 62 | | 4 | 4 | Amy | Cooze | 73 | . | 70 | ## 长到宽的格式 ```py # 导入模块 import pandas as pd raw_data = {'patient': [1, 1, 1, 2, 2], 'obs': [1, 2, 3, 1, 2], 'treatment': [0, 1, 0, 1, 0], 'score': [6252, 24243, 2345, 2342, 23525]} df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score']) df ``` | | patient | obs | treatment | score | | --- | --- | --- | --- | --- | | 0 | 1 | 1 | 0 | 6252 | | 1 | 1 | 2 | 1 | 24243 | | 2 | 1 | 3 | 0 | 2345 | | 3 | 2 | 1 | 1 | 2342 | | 4 | 2 | 2 | 0 | 23525 | 制作“宽的”数据。 现在,我们将创建一个“宽的”数据帧,其中行数按患者编号,列按观测编号,单元格值为得分值。 ```py df.pivot(index='patient', columns='obs', values='score') ``` | obs | 1 | 2 | 3 | | --- | --- | --- | --- | | patient | | | | | 1 | 6252.0 | 24243.0 | 2345.0 | | 2 | 2342.0 | 23525.0 | NaN | ## 在数据帧中小写列名 ```py # 导入模块 import pandas as pd # 设置 ipython 的最大行显示 pd.set_option('display.max_row', 1000) # 设置 ipython 的最大列宽 pd.set_option('display.max_columns', 50) # 创建示例数据帧 data = {'NAME': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'YEAR': [2012, 2012, 2013, 2014, 2014], 'REPORTS': [4, 24, 31, 2, 3]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df ``` | | NAME | REPORTS | YEAR | | --- | --- | --- | --- | | Cochice | Jason | 4 | 2012 | | Pima | Molly | 24 | 2012 | | Santa Cruz | Tina | 31 | 2013 | | Maricopa | Jake | 2 | 2014 | | Yuma | Amy | 3 | 2014 | ```py # 小写列名称 # Map the lowering function to all column names df.columns = map(str.lower, df.columns) df ``` | | name | reports | year | | --- | --- | --- | --- | | Cochice | Jason | 4 | 2012 | | Pima | Molly | 24 | 2012 | | Santa Cruz | Tina | 31 | 2013 | | Maricopa | Jake | 2 | 2014 | | Yuma | Amy | 3 | 2014 | ## 使用函数创建新列 ```py # 导入模块 import pandas as pd # 示例数据帧 raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3], 'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]} df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore']) df ``` | | regiment | company | name | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | | 0 | Nighthawks | 1st | Miller | 4 | 25 | | 1 | Nighthawks | 1st | Jacobson | 24 | 94 | | 2 | Nighthawks | 2nd | Ali | 31 | 57 | | 3 | Nighthawks | 2nd | Milner | 2 | 62 | | 4 | Dragoons | 1st | Cooze | 3 | 70 | | 5 | Dragoons | 1st | Jacon | 4 | 25 | | 6 | Dragoons | 2nd | Ryaner | 24 | 94 | | 7 | Dragoons | 2nd | Sone | 31 | 57 | | 8 | Scouts | 1st | Sloan | 2 | 62 | | 9 | Scouts | 1st | Piger | 3 | 70 | | 10 | Scouts | 2nd | Riani | 2 | 62 | | 11 | Scouts | 2nd | Ali | 3 | 70 | ```py # 创建一个接受两个输入,pre 和 post 的函数 def pre_post_difference(pre, post): # 返回二者的差 return post - pre # 创建一个变量,它是函数的输出 df['score_change'] = pre_post_difference(df['preTestScore'], df['postTestScore']) # 查看数据帧 df ``` | | regiment | company | name | preTestScore | postTestScore | score_change | | --- | --- | --- | --- | --- | --- | --- | | 0 | Nighthawks | 1st | Miller | 4 | 25 | 21 | | 1 | Nighthawks | 1st | Jacobson | 24 | 94 | 70 | | 2 | Nighthawks | 2nd | Ali | 31 | 57 | 26 | | 3 | Nighthawks | 2nd | Milner | 2 | 62 | 60 | | 4 | Dragoons | 1st | Cooze | 3 | 70 | 67 | | 5 | Dragoons | 1st | Jacon | 4 | 25 | 21 | | 6 | Dragoons | 2nd | Ryaner | 24 | 94 | 70 | | 7 | Dragoons | 2nd | Sone | 31 | 57 | 26 | | 8 | Scouts | 1st | Sloan | 2 | 62 | 60 | | 9 | Scouts | 1st | Piger | 3 | 70 | 67 | | 10 | Scouts | 2nd | Riani | 2 | 62 | 60 | | 11 | Scouts | 2nd | Ali | 3 | 70 | 67 | ```py # 创建一个接受一个输入 x 的函数 def score_multipler_2x_and_3x(x): # 返回两个东西,2x 和 3x return x*2, x*3 # 创建两个新变量,它是函数的两个输出 df['post_score_x2'], df['post_score_x3'] = zip(*df['postTestScore'].map(score_multipler_2x_and_3x)) df ``` | | regiment | company | name | preTestScore | postTestScore | score_change | post_score_x2 | post_score_x3 | | --- | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | Nighthawks | 1st | Miller | 4 | 25 | 21 | 50 | 75 | | 1 | Nighthawks | 1st | Jacobson | 24 | 94 | 70 | 188 | 282 | | 2 | Nighthawks | 2nd | Ali | 31 | 57 | 26 | 114 | 171 | | 3 | Nighthawks | 2nd | Milner | 2 | 62 | 60 | 124 | 186 | | 4 | Dragoons | 1st | Cooze | 3 | 70 | 67 | 140 | 210 | | 5 | Dragoons | 1st | Jacon | 4 | 25 | 21 | 50 | 75 | | 6 | Dragoons | 2nd | Ryaner | 24 | 94 | 70 | 188 | 282 | | 7 | Dragoons | 2nd | Sone | 31 | 57 | 26 | 114 | 171 | | 8 | Scouts | 1st | Sloan | 2 | 62 | 60 | 124 | 186 | | 9 | Scouts | 1st | Piger | 3 | 70 | 67 | 140 | 210 | | 10 | Scouts | 2nd | Riani | 2 | 62 | 60 | 124 | 186 | | 11 | Scouts | 2nd | Ali | 3 | 70 | 67 | 140 | 210 | ## 将外部值映射为数据帧的值 ```py # 导入模块 import pandas as pd raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'city']) df ``` | | first_name | last_name | age | city | | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42 | San Francisco | | 1 | Molly | Jacobson | 52 | Baltimore | | 2 | Tina | Ali | 36 | Miami | | 3 | Jake | Milner | 24 | Douglas | | 4 | Amy | Cooze | 73 | Boston | ```py # 创建值的字典 city_to_state = { 'San Francisco' : 'California', 'Baltimore' : 'Maryland', 'Miami' : 'Florida', 'Douglas' : 'Arizona', 'Boston' : 'Massachusetts'} df['state'] = df['city'].map(city_to_state) df ``` | | first_name | last_name | age | city | state | | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42 | San Francisco | California | | 1 | Molly | Jacobson | 52 | Baltimore | Maryland | | 2 | Tina | Ali | 36 | Miami | Florida | | 3 | Jake | Milner | 24 | Douglas | Arizona | | 4 | Amy | Cooze | 73 | Boston | Massachusetts | ## 数据帧中的缺失数据 ```py # 导入模块 import pandas as pd import numpy as np raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'], 'age': [42, np.nan, 36, 24, 73], 'sex': ['m', np.nan, 'f', 'm', 'f'], 'preTestScore': [4, np.nan, np.nan, 2, 3], 'postTestScore': [25, np.nan, np.nan, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore']) df ``` | | first_name | last_name | age | sex | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | | 1 | NaN | NaN | NaN | NaN | NaN | NaN | | 2 | Tina | Ali | 36.0 | f | NaN | NaN | | 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | | 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | ```py # 丢弃缺失值 df_no_missing = df.dropna() df_no_missing ``` | | first_name | last_name | age | sex | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | | 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | | 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | ### ```py # 删除所有单元格为 NA 的行 df_cleaned = df.dropna(how='all') df_cleaned ``` | | first_name | last_name | age | sex | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | | 2 | Tina | Ali | 36.0 | f | NaN | NaN | | 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | | 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | ```py # 创建一个缺失值填充的新列 df['location'] = np.nan df ``` | | first_name | last_name | age | sex | preTestScore | postTestScore | location | | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | NaN | | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | | 2 | Tina | Ali | 36.0 | f | NaN | NaN | NaN | | 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | NaN | | 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | NaN | ```py # 如果列仅包含缺失值,删除列 df.dropna(axis=1, how='all') ``` | | first_name | last_name | age | sex | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | | 1 | NaN | NaN | NaN | NaN | NaN | NaN | | 2 | Tina | Ali | 36.0 | f | NaN | NaN | | 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | | 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | ```py # 删除少于五个观测值的行 # 这对时间序列来说非常有用 df.dropna(thresh=5) ``` | | first_name | last_name | age | sex | preTestScore | postTestScore | location | | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | NaN | | 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | NaN | | 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | NaN | ```py # 用零填充缺失数据 df.fillna(0) ``` | | first_name | last_name | age | sex | preTestScore | postTestScore | location | | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | 0.0 | | 1 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | | 2 | Tina | Ali | 36.0 | f | 0.0 | 0.0 | 0.0 | | 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | 0.0 | | 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | 0.0 | ```py # 使用 preTestScore 的平均值填充 preTestScore 中的缺失 # inplace=True 表示更改会立即保存到 df 中 df["preTestScore"].fillna(df["preTestScore"].mean(), inplace=True) df ``` | | first_name | last_name | age | sex | preTestScore | postTestScore | location | | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | NaN | | 1 | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | | 2 | Tina | Ali | 36.0 | f | 3.0 | NaN | NaN | | 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | NaN | | 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | NaN | ### ```py # 使用 postTestScore 的每个性别的均值填充 postTestScore 中的缺失 df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"), inplace=True) df ``` | | first_name | last_name | age | sex | preTestScore | postTestScore | location | | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | NaN | | 1 | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | | 2 | Tina | Ali | 36.0 | f | 3.0 | 70.0 | NaN | | 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | NaN | | 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | NaN | ```py # 选择年龄不是 NaN 且性别不是 NaN 的行 df[df['age'].notnull() & df['sex'].notnull()] ``` | | first_name | last_name | age | sex | preTestScore | postTestScore | location | | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42.0 | m | 4.0 | 25.0 | NaN | | 2 | Tina | Ali | 36.0 | f | 3.0 | 70.0 | NaN | | 3 | Jake | Milner | 24.0 | m | 2.0 | 62.0 | NaN | | 4 | Amy | Cooze | 73.0 | f | 3.0 | 70.0 | NaN | ## pandas 中的移动平均 ```py # 导入模块 import pandas as pd # 创建数据 data = {'score': [1,1,1,2,2,2,3,3,3]} # 创建数据帧 df = pd.DataFrame(data) # 查看数据帧 df ``` | | score | | --- | --- | | 0 | 1 | | 1 | 1 | | 2 | 1 | | 3 | 2 | | 4 | 2 | | 5 | 2 | | 6 | 3 | | 7 | 3 | | 8 | 3 | ```py # 计算移动平均。也就是说,取前两个值,取平均值 # 然后丢弃第一个,再加上第三个,以此类推。 df.rolling(window=2).mean() ``` | | score | | --- | --- | | 0 | NaN | | 1 | 1.0 | | 2 | 1.0 | | 3 | 1.5 | | 4 | 2.0 | | 5 | 2.0 | | 6 | 2.5 | | 7 | 3.0 | | 8 | 3.0 | ## 规范化一列 ```py # 导入所需模块 import pandas as pd from sklearn import preprocessing # 设置图表为内联 %matplotlib inline # 创建示例数据帧,带有未规范化的一列 data = {'score': [234,24,14,27,-74,46,73,-18,59,160]} df = pd.DataFrame(data) df ``` | | score | | --- | --- | | 0 | 234 | | 1 | 24 | | 2 | 14 | | 3 | 27 | | 4 | -74 | | 5 | 46 | | 6 | 73 | | 7 | -18 | | 8 | 59 | | 9 | 160 | ```py # 查看为未规范化的数据 df['score'].plot(kind='bar') # <matplotlib.axes._subplots.AxesSubplot at 0x11b9c88d0> ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_normalize_column/pandas_normalize_column_5_1.png) ```py # 创建 x,其中 x 的得分列的值为浮点数 x = df[['score']].values.astype(float) # 创建 minmax 处理器对象 min_max_scaler = preprocessing.MinMaxScaler() # 创建一个对象,转换数据,拟合 minmax 处理器 x_scaled = min_max_scaler.fit_transform(x) # 在数据帧上运行规范化器 df_normalized = pd.DataFrame(x_scaled) # 查看数据帧 df_normalized ``` | | 0 | | --- | --- | | 0 | 1.000000 | | 1 | 0.318182 | | 2 | 0.285714 | | 3 | 0.327922 | | 4 | 0.000000 | | 5 | 0.389610 | | 6 | 0.477273 | | 7 | 0.181818 | | 8 | 0.431818 | | 9 | 0.759740 | ```py # 绘制数据帧 df_normalized.plot(kind='bar') # <matplotlib.axes._subplots.AxesSubplot at 0x11ba31c50> ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_normalize_column/pandas_normalize_column_9_1.png) ## Pandas 中的级联表 ```py # 导入模块 import pandas as pd raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 'TestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]} df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'TestScore']) df ``` | | regiment | company | TestScore | | --- | --- | --- | --- | | 0 | Nighthawks | 1st | 4 | | 1 | Nighthawks | 1st | 24 | | 2 | Nighthawks | 2nd | 31 | | 3 | Nighthawks | 2nd | 2 | | 4 | Dragoons | 1st | 3 | | 5 | Dragoons | 1st | 4 | | 6 | Dragoons | 2nd | 24 | | 7 | Dragoons | 2nd | 31 | | 8 | Scouts | 1st | 2 | | 9 | Scouts | 1st | 3 | | 10 | Scouts | 2nd | 2 | | 11 | Scouts | 2nd | 3 | ```py # 按公司和团队创建分组均值的透视表 pd.pivot_table(df, index=['regiment','company'], aggfunc='mean') ``` | | | TestScore | | --- | --- | --- | | regiment | company | | | Dragoons | 1st | 3.5 | | | 2nd | 27.5 | | Nighthawks | 1st | 14.0 | | | 2nd | 16.5 | | Scouts | 1st | 2.5 | | | 2nd | 2.5 | ```py # 按公司和团队创建分组计数的透视表 df.pivot_table(index=['regiment','company'], aggfunc='count') ``` | | | TestScore | | --- | --- | --- | | regiment | company | | | Dragoons | 1st | 2 | | | 2nd | 2 | | Nighthawks | 1st | 2 | | | 2nd | 2 | | Scouts | 1st | 2 | | | 2nd | 2 | ## 在 Pandas 中快速修改字符串列 我经常需要或想要改变一串字符串中所有项目的大小写(例如`BRAZIL`到`Brazil`等)。 有很多方法可以实现这一目标,但我已经确定这是最容易和最快的方法。 ```py # 导入 pandas import pandas as pd # 创建名称的列表 first_names = pd.Series(['Steve Murrey', 'Jane Fonda', 'Sara McGully', 'Mary Jane']) # 打印列 first_names ''' 0 Steve Murrey 1 Jane Fonda 2 Sara McGully 3 Mary Jane dtype: object ''' # 打印列的小写 first_names.str.lower() ''' 0 steve murrey 1 jane fonda 2 sara mcgully 3 mary jane dtype: object ''' # 打印列的大写 first_names.str.upper() ''' 0 STEVE MURREY 1 JANE FONDA 2 SARA MCGULLY 3 MARY JANE dtype: object ''' # 打印列的标题大小写 first_names.str.title() ''' 0 Steve Murrey 1 Jane Fonda 2 Sara Mcgully 3 Mary Jane dtype: object ''' # 打印以空格分割的列 first_names.str.split(" ") ''' 0 [Steve, Murrey] 1 [Jane, Fonda] 2 [Sara, McGully] 3 [Mary, Jane] dtype: object ''' # 打印首字母大写的列 first_names.str.capitalize() ''' 0 Steve murrey 1 Jane fonda 2 Sara mcgully 3 Mary jane dtype: object ''' ``` 明白了吧。更多字符串方法在[这里](https://docs.python.org/3.5/library/stdtypes.html#string-methods)。 ## 随机抽样数据帧 ```py # 导入模块 import pandas as pd import numpy as np raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, 2, 3], 'postTestScore': [25, 94, 57, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) df ``` | | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42 | 4 | 25 | | 1 | Molly | Jacobson | 52 | 24 | 94 | | 2 | Tina | Ali | 36 | 31 | 57 | | 3 | Jake | Milner | 24 | 2 | 62 | | 4 | Amy | Cooze | 73 | 3 | 70 | ```py # 不放回选择大小为 2 的随机子集 df.take(np.random.permutation(len(df))[:2]) ``` | | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | | 1 | Molly | Jacobson | 52 | 24 | 94 | | 4 | Amy | Cooze | 73 | 3 | 70 | ## 对数据帧的行排名 ```py # 导入模块 import pandas as pd # 创建数据帧 data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3], 'coverage': [25, 94, 57, 62, 70]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df ``` | | coverage | name | reports | year | | --- | --- | --- | --- | --- | | Cochice | 25 | Jason | 4 | 2012 | | Pima | 94 | Molly | 24 | 2012 | | Santa Cruz | 57 | Tina | 31 | 2013 | | Maricopa | 62 | Jake | 2 | 2014 | | Yuma | 70 | Amy | 3 | 2014 | 5 rows × 4 columns ```py # 创建一个新列,该列是 coverage 值的升序排名 df['coverageRanked'] = df['coverage'].rank(ascending=1) df ``` | | coverage | name | reports | year | coverageRanked | | --- | --- | --- | --- | --- | --- | | Cochice | 25 | Jason | 4 | 2012 | 1 | | Pima | 94 | Molly | 24 | 2012 | 5 | | Santa Cruz | 57 | Tina | 31 | 2013 | 2 | | Maricopa | 62 | Jake | 2 | 2014 | 3 | | Yuma | 70 | Amy | 3 | 2014 | 4 | 5 rows × 5 columns ## 正则表达式基础 ```py # 导入正则包 import re import sys text = 'The quick brown fox jumped over the lazy black bear.' three_letter_word = '\w{3}' pattern_re = re.compile(three_letter_word); pattern_re re.compile(r'\w{3}', re.UNICODE) re_search = re.search('..own', text) if re_search: # 打印搜索结果 print(re_search.group()) # brown ``` `re.match` `re.match()`仅用于匹配字符串的开头或整个字符串。对于其他任何内容,请使用`re.search`。 ### Match all three letter words in text ```py # 在文本中匹配所有三个字母的单词 re_match = re.match('..own', text) if re_match: # 打印所有匹配 print(re_match.group()) else: # 打印这个 print('No matches') # No matches ``` `re.split` ```py # 使用 'e' 作为分隔符拆分字符串。 re_split = re.split('e', text); re_split # ['Th', ' quick brown fox jump', 'd ov', 'r th', ' lazy black b', 'ar.'] ``` `re.sub` 用其他东西替换正则表达式模式串。`3`表示要进行的最大替换次数。 ```py # 用 'E' 替换前三个 'e' 实例,然后打印出来 re_sub = re.sub('e', 'E', text, 3); print(re_sub) # ThE quick brown fox jumpEd ovEr the lazy black bear. ``` ## 正则表达式示例 ```py # 导入 regex import re # 创建一些数据 text = 'A flock of 120 quick brown foxes jumped over 30 lazy brown, bears.' re.findall('^A', text) # ['A'] re.findall('bears.$', text) # ['bears.'] re.findall('f..es', text) # ['foxes'] # 寻找所有元音 re.findall('[aeiou]', text) # ['o', 'o', 'u', 'i', 'o', 'o', 'e', 'u', 'e', 'o', 'e', 'a', 'o', 'e', 'a'] # 查找不是小写元音的所有字符 re.findall('[^aeiou]', text) ''' ['A', ' ', 'f', 'l', 'c', 'k', ' ', 'f', ' ', '1', '2', '0', ' ', 'q', 'c', 'k', ' ', 'b', 'r', 'w', 'n', ' ', 'f', 'x', 's', ' ', 'j', 'm', 'p', 'd', ' ', 'v', 'r', ' ', '3', '0', ' ', 'l', 'z', 'y', ' ', 'b', 'r', 'w', 'n', ',', ' ', 'b', 'r', 's', '.'] ''' re.findall('a|A', text) # ['A', 'a', 'a'] # 寻找任何 'fox' 的实例 re.findall('(foxes)', text) # ['foxes'] # 寻找所有五个字母的单词 re.findall('\w\w\w\w\w', text) # ['flock', 'quick', 'brown', 'foxes', 'jumpe', 'brown', 'bears'] re.findall('\W\W', text) # [', '] re.findall('\s', text) # [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '] re.findall('\S\S', text) ''' ['fl', 'oc', 'of', '12', 'qu', 'ic', 'br', 'ow', 'fo', 'xe', 'ju', 'mp', 'ed', 'ov', 'er', '30', 'la', 'zy', 'br', 'ow', 'n,', 'be', 'ar', 's.'] ''' re.findall('\d\d\d', text) # ['120'] re.findall('\D\D\D\D\D', text) ''' ['A flo', 'ck of', ' quic', 'k bro', 'wn fo', 'xes j', 'umped', ' over', ' lazy', ' brow', 'n, be'] ''' re.findall('\AA', text) # ['A'] re.findall('bears.\Z', text) # ['bears.'] re.findall('\b[foxes]', text) # [] re.findall('\n', text) # [] re.findall('[Ff]oxes', 'foxes Foxes Doxes') # ['foxes', 'Foxes'] re.findall('[Ff]oxes', 'foxes Foxes Doxes') # ['foxes', 'Foxes'] re.findall('[a-z]', 'foxes Foxes') # ['f', 'o', 'x', 'e', 's', 'o', 'x', 'e', 's'] re.findall('[A-Z]', 'foxes Foxes') # ['F'] re.findall('[a-zA-Z0-9]', 'foxes Foxes') # ['f', 'o', 'x', 'e', 's', 'F', 'o', 'x', 'e', 's'] re.findall('[^aeiou]', 'foxes Foxes') # ['f', 'x', 's', ' ', 'F', 'x', 's'] re.findall('[^0-9]', 'foxes Foxes') # ['f', 'o', 'x', 'e', 's', ' ', 'F', 'o', 'x', 'e', 's'] re.findall('foxes?', 'foxes Foxes') # ['foxes'] re.findall('ox*', 'foxes Foxes') # ['ox', 'ox'] re.findall('ox+', 'foxes Foxes') # ['ox', 'ox'] re.findall('\d{3}', text) # ['120'] re.findall('\d{2,}', text) # ['120', '30'] re.findall('\d{2,3}', text) # ['120', '30'] re.findall('^A', text) # ['A'] re.findall('bears.$', text) # ['bears.'] re.findall('\AA', text) # ['A'] re.findall('bears.\Z', text) # ['bears.'] re.findall('bears(?=.)', text) # ['bears'] re.findall('foxes(?!!)', 'foxes foxes!') # ['foxes'] re.findall('foxes|foxes!', 'foxes foxes!') # ['foxes', 'foxes'] re.findall('fox(es!)', 'foxes foxes!') # ['es!'] re.findall('foxes(!)', 'foxes foxes!') # ['!'] ``` ## 重索引序列和数据帧 ```py # 导入模块 import pandas as pd import numpy as np # 创建亚利桑那州南部的火灾风险序列 brushFireRisk = pd.Series([34, 23, 12, 23], index = ['Bisbee', 'Douglas', 'Sierra Vista', 'Tombstone']) brushFireRisk ''' Bisbee 34 Douglas 23 Sierra Vista 12 Tombstone 23 dtype: int64 ''' # 重索引这个序列并创建一个新的序列变量 brushFireRiskReindexed = brushFireRisk.reindex(['Tombstone', 'Douglas', 'Bisbee', 'Sierra Vista', 'Barley', 'Tucson']) brushFireRiskReindexed ''' Tombstone 23.0 Douglas 23.0 Bisbee 34.0 Sierra Vista 12.0 Barley NaN Tucson NaN dtype: float64 ''' # 重索引序列并在任何缺失的索引处填入 0 brushFireRiskReindexed = brushFireRisk.reindex(['Tombstone', 'Douglas', 'Bisbee', 'Sierra Vista', 'Barley', 'Tucson'], fill_value = 0) brushFireRiskReindexed ''' Tombstone 23 Douglas 23 Bisbee 34 Sierra Vista 12 Barley 0 Tucson 0 dtype: int64 ''' # 创建数据帧 data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3]} df = pd.DataFrame(data) df ``` | | county | reports | year | | --- | --- | --- | --- | | 0 | Cochice | 4 | 2012 | | 1 | Pima | 24 | 2012 | | 2 | Santa Cruz | 31 | 2013 | | 3 | Maricopa | 2 | 2014 | | 4 | Yuma | 3 | 2014 | ```py # 更改行的顺序(索引) df.reindex([4, 3, 2, 1, 0]) ``` | | county | reports | year | | --- | --- | --- | --- | | 4 | Yuma | 3 | 2014 | | 3 | Maricopa | 2 | 2014 | | 2 | Santa Cruz | 31 | 2013 | | 1 | Pima | 24 | 2012 | | 0 | Cochice | 4 | 2012 | ```py # 更改列的顺序(索引) columnsTitles = ['year', 'reports', 'county'] df.reindex(columns=columnsTitles) ``` | | year | reports | county | | --- | --- | --- | --- | | 0 | 2012 | 4 | Cochice | | 1 | 2012 | 24 | Pima | | 2 | 2013 | 31 | Santa Cruz | | 3 | 2014 | 2 | Maricopa | | 4 | 2014 | 3 | Yuma | ## 重命名列标题 来自 [StackOverflow](http://stackoverflow.com/users/3291077/rgalbo) 上的 [rgalbo](http://stackoverflow.com/questions/31328861/python-pandas-replacing-header-with-top-row)。 ```py # 导入所需模块 import pandas as pd # 创建列表的字典,作为值 raw_data = {'0': ['first_name', 'Molly', 'Tina', 'Jake', 'Amy'], '1': ['last_name', 'Jacobson', 'Ali', 'Milner', 'Cooze'], '2': ['age', 52, 36, 24, 73], '3': ['preTestScore', 24, 31, 2, 3]} # 创建数据帧 df = pd.DataFrame(raw_data) # 查看数据帧 df ``` | | 0 | 1 | 2 | 3 | | --- | --- | --- | --- | --- | | 0 | first_name | last_name | age | preTestScore | | 1 | Molly | Jacobson | 52 | 24 | | 2 | Tina | Ali | 36 | 31 | | 3 | Jake | Milner | 24 | 2 | | 4 | Amy | Cooze | 73 | 3 | ```py # 从数据集的第一行创建一个名为 header 的新变量 header = df.iloc[0] ''' 0 first_name 1 last_name 2 age 3 preTestScore Name: 0, dtype: object ''' # 将数据帧替换为不包含第一行的新数据帧 df = df[1:] # 使用标题变量重命名数据帧的列值 df.rename(columns = header) ``` | | first_name | last_name | age | preTestScore | | --- | --- | --- | --- | --- | | 1 | Molly | Jacobson | 52 | 24 | | --- | --- | --- | --- | --- | | 2 | Tina | Ali | 36 | 31 | | --- | --- | --- | --- | --- | | 3 | Jake | Milner | 24 | 2 | | --- | --- | --- | --- | --- | | 4 | Amy | Cooze | 73 | 3 | | --- | --- | --- | --- | --- | ## 重命名多个数据帧的列名 ```py # 导入模块 import pandas as pd # 设置 ipython 的最大行显示 pd.set_option('display.max_row', 1000) # 设置 ipython 的最大列宽 pd.set_option('display.max_columns', 50) # 创建示例数据帧 data = {'Commander': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'Date': ['2012, 02, 08', '2012, 02, 08', '2012, 02, 08', '2012, 02, 08', '2012, 02, 08'], 'Score': [4, 24, 31, 2, 3]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df ``` | | Commander | Date | Score | | --- | --- | --- | --- | | Cochice | Jason | 2012, 02, 08 | 4 | | Pima | Molly | 2012, 02, 08 | 24 | | Santa Cruz | Tina | 2012, 02, 08 | 31 | | Maricopa | Jake | 2012, 02, 08 | 2 | | Yuma | Amy | 2012, 02, 08 | 3 | ```py # 重命名列名 df.columns = ['Leader', 'Time', 'Score'] df ``` | | Leader | Time | Score | | --- | --- | --- | --- | | Cochice | Jason | 2012, 02, 08 | 4 | | Pima | Molly | 2012, 02, 08 | 24 | | Santa Cruz | Tina | 2012, 02, 08 | 31 | | Maricopa | Jake | 2012, 02, 08 | 2 | | Yuma | Amy | 2012, 02, 08 | 3 | ```py df.rename(columns={'Leader': 'Commander'}, inplace=True) df ``` | | Commander | Time | Score | | --- | --- | --- | --- | | Cochice | Jason | 2012, 02, 08 | 4 | | Pima | Molly | 2012, 02, 08 | 24 | | Santa Cruz | Tina | 2012, 02, 08 | 31 | | Maricopa | Jake | 2012, 02, 08 | 2 | | Yuma | Amy | 2012, 02, 08 | 3 | ## 替换值 ```py # 导入模块 import pandas as pd import numpy as np raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [-999, -999, -999, 2, 1], 'postTestScore': [2, 2, -999, 2, -999]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) df ``` | | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42 | -999 | 2 | | 1 | Molly | Jacobson | 52 | -999 | 2 | | 2 | Tina | Ali | 36 | -999 | -999 | | 3 | Jake | Milner | 24 | 2 | 2 | | 4 | Amy | Cooze | 73 | 1 | -999 | ```py # 将所有 -999 替换为 NAN df.replace(-999, np.nan) ``` | | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42 | NaN | 2.0 | | 1 | Molly | Jacobson | 52 | NaN | 2.0 | | 2 | Tina | Ali | 36 | NaN | NaN | | 3 | Jake | Milner | 24 | 2.0 | 2.0 | | 4 | Amy | Cooze | 73 | 1.0 | NaN | ## 将数据帧保存为 CSV ```py # 导入模块 import pandas as pd raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, 2, 3], 'postTestScore': [25, 94, 57, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) df ``` | | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42 | 4 | 25 | | 1 | Molly | Jacobson | 52 | 24 | 94 | | 2 | Tina | Ali | 36 | 31 | 57 | | 3 | Jake | Milner | 24 | 2 | 62 | | 4 | Amy | Cooze | 73 | 3 | 70 | 将名为`df`的数据帧保存为 csv。 ```py df.to_csv('example.csv') ``` ## 在列中搜索某个值 ```py # 导入模块 import pandas as pd raw_data = {'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'], 'age': [42, 42, 36, 24, 73], 'preTestScore': [4, 4, 31, 2, 3], 'postTestScore': [25, 25, 57, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) df ``` | | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42 | 4 | 25 | | 1 | Jason | Miller | 42 | 4 | 25 | | 2 | Tina | Ali | 36 | 31 | 57 | | 3 | Jake | Milner | 24 | 2 | 62 | | 4 | Amy | Cooze | 73 | 3 | 70 | ```py # 在列中寻找值在哪里 # 查看 postTestscore 大于 50 的地方 df['preTestScore'].where(df['postTestScore'] > 50) ''' 0 NaN 1 NaN 2 31.0 3 2.0 4 3.0 Name: preTestScore, dtype: float64 ''' ``` ## 选择包含特定值的行和列 ```py # 导入模块 import pandas as pd # 设置 ipython 的最大行显示 pd.set_option('display.max_row', 1000) # 设置 ipython 的最大列宽 pd.set_option('display.max_columns', 50) # 创建示例数据帧 data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df ``` | | name | reports | year | | --- | --- | --- | --- | | Cochice | Jason | 4 | 2012 | | Pima | Molly | 24 | 2012 | | Santa Cruz | Tina | 31 | 2013 | | Maricopa | Jake | 2 | 2014 | | Yuma | Amy | 3 | 2014 | ```py # 按照列值抓取行 value_list = ['Tina', 'Molly', 'Jason'] df[df.name.isin(value_list)] ``` | | name | reports | year | | --- | --- | --- | --- | | Cochice | Jason | 4 | 2012 | | Pima | Molly | 24 | 2012 | | Santa Cruz | Tina | 31 | 2013 | ```py # 获取列值不是某个值的行 df[~df.name.isin(value_list)] ``` | | name | reports | year | | --- | --- | --- | --- | | Maricopa | Jake | 2 | 2014 | | Yuma | Amy | 3 | 2014 | ## 选择具有特定值的行 ```py import pandas as pd # 创建示例数据帧 data = {'name': ['Jason', 'Molly'], 'country': [['Syria', 'Lebanon'],['Spain', 'Morocco']]} df = pd.DataFrame(data) df ``` | | country | name | | --- | --- | --- | | 0 | [Syria, Lebanon] | Jason | | 1 | [Spain, Morocco] | Molly | ```py df[df['country'].map(lambda country: 'Syria' in country)] ``` | | country | name | | --- | --- | --- | | 0 | [Syria, Lebanon] | Jason | ## 使用多个过滤器选择行 ```py import pandas as pd # 创建示例数据帧 data = {'name': ['A', 'B', 'C', 'D', 'E'], 'score': [1,2,3,4,5]} df = pd.DataFrame(data) df ``` | | name | score | | --- | --- | --- | | 0 | A | 1 | | 1 | B | 2 | | 2 | C | 3 | | 3 | D | 4 | | 4 | E | 5 | ```py # 选择数据帧的行,其中 df.score 大于 1 且小于 5 df[(df['score'] > 1) & (df['score'] < 5)] ``` | | name | score | | --- | --- | --- | | 1 | B | 2 | | 2 | C | 3 | | 3 | D | 4 | ## 根据条件选择数据帧的行 ```py # 导入模块 import pandas as pd import numpy as np # 创建数据帧 raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan], 'nationality': ['USA', 'USA', 'France', 'UK', 'UK'], 'age': [42, 52, 36, 24, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'nationality', 'age']) df ``` | | first_name | nationality | age | | --- | --- | --- | --- | | 0 | Jason | USA | 42 | | 1 | Molly | USA | 52 | | 2 | NaN | France | 36 | | 3 | NaN | UK | 24 | | 4 | NaN | UK | 70 | ```py # 方法 1:使用布尔变量 # 如果国籍是美国,则变量为 TRUE american = df['nationality'] == "USA" # 如果年龄大于 50,则变量为 TRUE elderly = df['age'] > 50 # 选择所有国籍为美国且年龄大于 50 的案例 df[american & elderly] ``` | | first_name | nationality | age | | --- | --- | --- | --- | | 1 | Molly | USA | 52 | ```py # 方法 2:使用变量属性 # 选择所有不缺少名字且国籍为美国的案例 df[df['first_name'].notnull() & (df['nationality'] == "USA")] ``` | | first_name | nationality | age | | --- | --- | --- | --- | | 0 | Jason | USA | 42 | | 1 | Molly | USA | 52 | ## 数据帧简单示例 ```py # 导入模块 import pandas as pd raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, 2, 3], 'postTestScore': [25, 94, 57, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) df ``` | | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | 42 | 4 | 25 | | 1 | Molly | Jacobson | 52 | 24 | 94 | | 2 | Tina | Ali | 36 | 31 | 57 | | 3 | Jake | Milner | 24 | 2 | 62 | | 4 | Amy | Cooze | 73 | 3 | 70 | ```py # 创建第二个数据帧 raw_data_2 = {'first_name': ['Sarah', 'Gueniva', 'Know', 'Sara', 'Cat'], 'last_name': ['Mornig', 'Jaker', 'Alom', 'Ormon', 'Koozer'], 'age': [53, 26, 72, 73, 24], 'preTestScore': [13, 52, 72, 26, 26], 'postTestScore': [82, 52, 56, 234, 254]} df_2 = pd.DataFrame(raw_data_2, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) df_2 ``` | | first_name | last_name | age | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | | 0 | Sarah | Mornig | 53 | 13 | 82 | | 1 | Gueniva | Jaker | 26 | 52 | 52 | | 2 | Know | Alom | 72 | 72 | 56 | | 3 | Sara | Ormon | 73 | 26 | 234 | | 4 | Cat | Koozer | 24 | 26 | 254 | ```py # 创建第三个数据帧 raw_data_3 = {'first_name': ['Sarah', 'Gueniva', 'Know', 'Sara', 'Cat'], 'last_name': ['Mornig', 'Jaker', 'Alom', 'Ormon', 'Koozer'], 'postTestScore_2': [82, 52, 56, 234, 254]} df_3 = pd.DataFrame(raw_data_3, columns = ['first_name', 'last_name', 'postTestScore_2']) df_3 ``` | | first_name | last_name | postTestScore_2 | | --- | --- | --- | --- | | 0 | Sarah | Mornig | 82 | | 1 | Gueniva | Jaker | 52 | | 2 | Know | Alom | 56 | | 3 | Sara | Ormon | 234 | | 4 | Cat | Koozer | 254 | ## 排序数据帧的行 ```py # 导入模块 import pandas as pd data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [1, 2, 1, 2, 3], 'coverage': [2, 2, 3, 3, 3]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df ``` | | coverage | name | reports | year | | --- | --- | --- | --- | --- | | Cochice | 2 | Jason | 1 | 2012 | | Pima | 2 | Molly | 2 | 2012 | | Santa Cruz | 3 | Tina | 1 | 2013 | | Maricopa | 3 | Jake | 2 | 2014 | | Yuma | 3 | Amy | 3 | 2014 | ```py # 按报告对数据框的行降序排序 df.sort_values(by='reports', ascending=0) ``` | | coverage | name | reports | year | | --- | --- | --- | --- | --- | | Yuma | 3 | Amy | 3 | 2014 | | Pima | 2 | Molly | 2 | 2012 | | Maricopa | 3 | Jake | 2 | 2014 | | Cochice | 2 | Jason | 1 | 2012 | | Santa Cruz | 3 | Tina | 1 | 2013 | ```py # 按 coverage 然后是报告对数据帧的行升序排序 df.sort_values(by=['coverage', 'reports']) ``` | | coverage | name | reports | year | | --- | --- | --- | --- | --- | | Cochice | 2 | Jason | 1 | 2012 | | Pima | 2 | Molly | 2 | 2012 | | Santa Cruz | 3 | Tina | 1 | 2013 | | Maricopa | 3 | Jake | 2 | 2014 | | Yuma | 3 | Amy | 3 | 2014 | ## 将经纬度坐标变量拆分为单独的变量 ```py import pandas as pd import numpy as np raw_data = {'geo': ['40.0024, -105.4102', '40.0068, -105.266', '39.9318, -105.2813', np.nan]} df = pd.DataFrame(raw_data, columns = ['geo']) df ``` | | geo | | --- | --- | | 0 | 40.0024, -105.4102 | | 1 | 40.0068, -105.266 | | 2 | 39.9318, -105.2813 | | 3 | NaN | | --- | --- | ```py # 为要放置的循环结果创建两个列表 lat = [] lon = [] # 对于变量中的每一行 for row in df['geo']: # Try to, try: # 用逗号分隔行,转换为浮点 # 并将逗号前的所有内容追加到 lat lat.append(row.split(',')[0]) # 用逗号分隔行,转换为浮点 # 并将逗号后的所有内容追加到 lon lon.append(row.split(',')[1]) # 但是如果你得到了错误 except: # 向 lat 添加缺失值 lat.append(np.NaN) # 向 lon 添加缺失值 lon.append(np.NaN) # 从 lat 和 lon 创建新的两列 df['latitude'] = lat df['longitude'] = lon df ``` | | geo | latitude | longitude | | --- | --- | --- | --- | | 0 | 40.0024, -105.4102 | 40.0024 | -105.4102 | | 1 | 40.0068, -105.266 | 40.0068 | -105.266 | | 2 | 39.9318, -105.2813 | 39.9318 | -105.2813 | | 3 | NaN | NaN | NaN | ## 数据流水线 ```py # 创建一些原始数据 raw_data = [1,2,3,4,5,6,7,8,9,10] # 定义产生 input+6 的生成器 def add_6(numbers): for x in numbers: output = x+6 yield output # 定义产生 input-2 的生成器 def subtract_2(numbers): for x in numbers: output = x-2 yield output # 定义产生 input*100 的生成器 def multiply_by_100(numbers): for x in numbers: output = x*100 yield output # 流水线的第一步 step1 = add_6(raw_data) # 流水线的第二步 step2 = subtract_2(step1) # 流水线的第三步 pipeline = multiply_by_100(step2) # 原始数据的第一个元素 next(pipeline) # 500 # 原始数据的第二个元素 next(pipeline) # 600 # 处理所有数据 for raw_data in pipeline: print(raw_data) ''' 700 800 900 1000 1100 1200 1300 1400 ''' ``` ## 数据帧中的字符串整理 ```py # 导入模块 import pandas as pd import numpy as np import re as re raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'email': ['[[email protected]](/cdn-cgi/l/email-protection)', '[[email protected]](/cdn-cgi/l/email-protection)', np.NAN, '[[email protected]](/cdn-cgi/l/email-protection)', '[[email protected]](/cdn-cgi/l/email-protection)'], 'preTestScore': [4, 24, 31, 2, 3], 'postTestScore': [25, 94, 57, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'email', 'preTestScore', 'postTestScore']) df ``` | | first_name | last_name | email | preTestScore | postTestScore | | --- | --- | --- | --- | --- | --- | | 0 | Jason | Miller | [[email protected]](/cdn-cgi/l/email-protection) | 4 | 25 | | 1 | Molly | Jacobson | [[email protected]](/cdn-cgi/l/email-protection) | 24 | 94 | | 2 | Tina | Ali | NaN | 31 | 57 | | 3 | Jake | Milner | [[email protected]](/cdn-cgi/l/email-protection) | 2 | 62 | | 4 | Amy | Cooze | [[email protected]](/cdn-cgi/l/email-protection) | 3 | 70 | ```py # 电子邮件列中的哪些字符串包含 'gmail' df['email'].str.contains('gmail') ''' 0 True 1 True 2 NaN 3 False 4 False Name: email, dtype: object ''' pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})' df['email'].str.findall(pattern, flags=re.IGNORECASE) ''' 0 [(jas203, gmail, com)] 1 [(momomolly, gmail, com)] 2 NaN 3 [(battler, milner, com)] 4 [(Ames1234, yahoo, com)] Name: email, dtype: object ''' matches = df['email'].str.match(pattern, flags=re.IGNORECASE) matches ''' /Users/chrisralbon/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning: In future versions of pandas, match will change to always return a bool indexer. if __name__ == '__main__': 0 (jas203, gmail, com) 1 (momomolly, gmail, com) 2 NaN 3 (battler, milner, com) 4 (Ames1234, yahoo, com) Name: email, dtype: object ''' matches.str[1] ''' 0 gmail 1 gmail 2 NaN 3 milner 4 yahoo Name: email, dtype: object ''' ``` ## 和 Pandas 一起使用列表推导式 ```py # 导入模块 import pandas as pd # 设置 ipython 的最大行显示 pd.set_option('display.max_row', 1000) # 设置 ipython 的最大列宽 pd.set_option('display.max_columns', 50) data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df ``` | | name | reports | year | | --- | --- | --- | --- | | Cochice | Jason | 4 | 2012 | | Pima | Molly | 24 | 2012 | | Santa Cruz | Tina | 31 | 2013 | | Maricopa | Jake | 2 | 2014 | | Yuma | Amy | 3 | 2014 | 作为循环的列表推导式。 ```py # 创建变量 next_year = [] # 对于 df.years 的每一行 for row in df['year']: # 为这一行添加 1 并将其附加到 next_year next_year.append(row + 1) # 创建 df.next_year df['next_year'] = next_year # 查看数据帧 df ``` | | name | reports | year | next_year | | --- | --- | --- | --- | --- | | Cochice | Jason | 4 | 2012 | 2013 | | Pima | Molly | 24 | 2012 | 2013 | | Santa Cruz | Tina | 31 | 2013 | 2014 | | Maricopa | Jake | 2 | 2014 | 2015 | | Yuma | Amy | 3 | 2014 | 2015 | 作为列表推导式。 ```py # 对于 df.year 中的每一行,从行中减去 1 df['previous_year'] = [row-1 for row in df['year']] df ``` | | name | reports | year | next_year | previous_year | | --- | --- | --- | --- | --- | --- | | Cochice | Jason | 4 | 2012 | 2013 | 2011 | | Pima | Molly | 24 | 2012 | 2013 | 2011 | | Santa Cruz | Tina | 31 | 2013 | 2014 | 2012 | | Maricopa | Jake | 2 | 2014 | 2015 | 2013 | | Yuma | Amy | 3 | 2014 | 2015 | 2013 | ## 使用 Seaborn 来可视化数据帧 ```py import pandas as pd %matplotlib inline import random import matplotlib.pyplot as plt import seaborn as sns df = pd.DataFrame() df['x'] = random.sample(range(1, 100), 25) df['y'] = random.sample(range(1, 100), 25) df.head() ``` | | x | y | | --- | --- | --- | | 0 | 18 | 25 | | 1 | 42 | 67 | | 2 | 52 | 77 | | 3 | 4 | 34 | | 4 | 14 | 69 | ```py # 散点图 sns.lmplot('x', 'y', data=df, fit_reg=False) # <seaborn.axisgrid.FacetGrid at 0x114563b00> ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_6_1.png) ```py # 密度图 sns.kdeplot(df.y) # <matplotlib.axes._subplots.AxesSubplot at 0x113ea2ef0> ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_8_1.png) ```py sns.kdeplot(df.y, df.x) # <matplotlib.axes._subplots.AxesSubplot at 0x113d7fef0> ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_9_1.png) ```py sns.distplot(df.x) # <matplotlib.axes._subplots.AxesSubplot at 0x114294160> ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_10_1.png) ```py # 直方图 plt.hist(df.x, alpha=.3) sns.rugplot(df.x); ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_12_0.png) ```py # 箱形图 sns.boxplot([df.y, df.x]) # <matplotlib.axes._subplots.AxesSubplot at 0x1142b8b38> ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_14_1.png) ```py # 提琴图 sns.violinplot([df.y, df.x]) # <matplotlib.axes._subplots.AxesSubplot at 0x114444a58> ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_16_1.png) ```py # 热力图 sns.heatmap([df.y, df.x], annot=True, fmt="d") # <matplotlib.axes._subplots.AxesSubplot at 0x114530c88> ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_18_1.png) ```py # 聚类图 sns.clustermap(df) # <seaborn.matrix.ClusterGrid at 0x116f313c8> ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_with_seaborn/pandas_with_seaborn_20_1.png) # Pandas 数据结构 ```py # 导入模块 import pandas as pd ``` ### 序列 101 序列是一维数组(类似 R 的向量)。 ```py # 创建 floodingReports 数量的序列 floodingReports = pd.Series([5, 6, 2, 9, 12]) floodingReports ''' 0 5 1 6 2 2 3 9 4 12 dtype: int64 ''' ``` 请注意,第一列数字(0 到 4)是索引。 ```py # 将县名设置为 floodingReports 序列的索引 floodingReports = pd.Series([5, 6, 2, 9, 12], index=['Cochise County', 'Pima County', 'Santa Cruz County', 'Maricopa County', 'Yuma County']) floodingReports ''' Cochise County 5 Pima County 6 Santa Cruz County 2 Maricopa County 9 Yuma County 12 dtype: int64 ''' floodingReports['Cochise County'] # 5 floodingReports[floodingReports > 6] ''' Maricopa County 9 Yuma County 12 dtype: int64 ''' ``` 从字典中创建 Pandas 序列。 注意:执行此操作时,字典的键将成为序列索引。 ```py # 创建字典 fireReports_dict = {'Cochise County': 12, 'Pima County': 342, 'Santa Cruz County': 13, 'Maricopa County': 42, 'Yuma County' : 52} # 将字典转换为 pd.Series,然后查看它 fireReports = pd.Series(fireReports_dict); fireReports ''' Cochise County 12 Maricopa County 42 Pima County 342 Santa Cruz County 13 Yuma County 52 dtype: int64 ''' fireReports.index = ["Cochice", "Pima", "Santa Cruz", "Maricopa", "Yuma"] fireReports ''' Cochice 12 Pima 42 Santa Cruz 342 Maricopa 13 Yuma 52 dtype: int64 ''' ``` ### 数据帧 101 数据帧就像 R 的数据帧。 ```py # 从等长列表或 NumPy 数组的字典中创建数据帧 data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3]} df = pd.DataFrame(data) df ``` | | county | reports | year | | --- | --- | --- | --- | | 0 | Cochice | 4 | 2012 | | 1 | Pima | 24 | 2012 | | 2 | Santa Cruz | 31 | 2013 | | 3 | Maricopa | 2 | 2014 | | 4 | Yuma | 3 | 2014 | ```py # 使用 columns 属性设置列的顺序 dfColumnOrdered = pd.DataFrame(data, columns=['county', 'year', 'reports']) dfColumnOrdered ``` | | county | year | reports | | --- | --- | --- | --- | | 0 | Cochice | 2012 | 4 | | 1 | Pima | 2012 | 24 | | 2 | Santa Cruz | 2013 | 31 | | 3 | Maricopa | 2014 | 2 | | 4 | Yuma | 2014 | 3 | ```py # 添加一列 dfColumnOrdered['newsCoverage'] = pd.Series([42.3, 92.1, 12.2, 39.3, 30.2]) dfColumnOrdered ``` | | county | year | reports | newsCoverage | | --- | --- | --- | --- | --- | | 0 | Cochice | 2012 | 4 | 42.3 | | 1 | Pima | 2012 | 24 | 92.1 | | 2 | Santa Cruz | 2013 | 31 | 12.2 | | 3 | Maricopa | 2014 | 2 | 39.3 | | 4 | Yuma | 2014 | 3 | 30.2 | ```py # 删除一列 del dfColumnOrdered['newsCoverage'] dfColumnOrdered ``` | | county | year | reports | | --- | --- | --- | --- | | 0 | Cochice | 2012 | 4 | | 1 | Pima | 2012 | 24 | | 2 | Santa Cruz | 2013 | 31 | | 3 | Maricopa | 2014 | 2 | | 4 | Yuma | 2014 | 3 | ```py # 转置数据帧 dfColumnOrdered.T ``` | | 0 | 1 | 2 | 3 | 4 | | --- | --- | --- | --- | --- | --- | | county | Cochice | Pima | Santa Cruz | Maricopa | Yuma | | year | 2012 | 2012 | 2013 | 2014 | 2014 | | reports | 4 | 24 | 31 | 2 | 3 | ## Pandas 时间序列基础 ```py # 导入模块 from datetime import datetime import pandas as pd %matplotlib inline import matplotlib.pyplot as pyplot data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'], 'battle_deaths': [34, 25, 26, 15, 15, 14, 26, 25, 62, 41]} df = pd.DataFrame(data, columns = ['date', 'battle_deaths']) print(df) ''' date battle_deaths 0 2014-05-01 18:47:05.069722 34 1 2014-05-01 18:47:05.119994 25 2 2014-05-02 18:47:05.178768 26 3 2014-05-02 18:47:05.230071 15 4 2014-05-02 18:47:05.230071 15 5 2014-05-02 18:47:05.280592 14 6 2014-05-03 18:47:05.332662 26 7 2014-05-03 18:47:05.385109 25 8 2014-05-04 18:47:05.436523 62 9 2014-05-04 18:47:05.486877 41 ''' df['date'] = pd.to_datetime(df['date']) df.index = df['date'] del df['date'] df ``` | | battle_deaths | | --- | --- | | date | | | 2014-05-01 18:47:05.069722 | 34 | | 2014-05-01 18:47:05.119994 | 25 | | 2014-05-02 18:47:05.178768 | 26 | | 2014-05-02 18:47:05.230071 | 15 | | 2014-05-02 18:47:05.230071 | 15 | | 2014-05-02 18:47:05.280592 | 14 | | 2014-05-03 18:47:05.332662 | 26 | | 2014-05-03 18:47:05.385109 | 25 | | 2014-05-04 18:47:05.436523 | 62 | | 2014-05-04 18:47:05.486877 | 41 | ```py # 查看 2014 年的所有观测 df['2014'] ``` | | battle_deaths | | --- | --- | | date | | | 2014-05-01 18:47:05.069722 | 34 | | 2014-05-01 18:47:05.119994 | 25 | | 2014-05-02 18:47:05.178768 | 26 | | 2014-05-02 18:47:05.230071 | 15 | | 2014-05-02 18:47:05.230071 | 15 | | 2014-05-02 18:47:05.280592 | 14 | | 2014-05-03 18:47:05.332662 | 26 | | 2014-05-03 18:47:05.385109 | 25 | | 2014-05-04 18:47:05.436523 | 62 | | 2014-05-04 18:47:05.486877 | 41 | ```py # 查看 2014 年 5 月的所有观测 df['2014-05'] ``` | | battle_deaths | | --- | --- | | date | | | 2014-05-01 18:47:05.069722 | 34 | | 2014-05-01 18:47:05.119994 | 25 | | 2014-05-02 18:47:05.178768 | 26 | | 2014-05-02 18:47:05.230071 | 15 | | 2014-05-02 18:47:05.230071 | 15 | | 2014-05-02 18:47:05.280592 | 14 | | 2014-05-03 18:47:05.332662 | 26 | | 2014-05-03 18:47:05.385109 | 25 | | 2014-05-04 18:47:05.436523 | 62 | | 2014-05-04 18:47:05.486877 | 41 | ```py # 查看 2014.5.3 的所有观测 df[datetime(2014, 5, 3):] ``` | | battle_deaths | | --- | --- | | date | | | 2014-05-03 18:47:05.332662 | 26 | | 2014-05-03 18:47:05.385109 | 25 | | 2014-05-04 18:47:05.436523 | 62 | | 2014-05-04 18:47:05.486877 | 41 | ### Observations between May 3rd and May 4th ```py # 查看 2014.5.3~4 的所有观测 df['5/3/2014':'5/4/2014'] ``` | | battle_deaths | | --- | --- | | date | | | 2014-05-03 18:47:05.332662 | 26 | | 2014-05-03 18:47:05.385109 | 25 | | 2014-05-04 18:47:05.436523 | 62 | | 2014-05-04 18:47:05.486877 | 41 | ```py # 截断 2014.5.2 之后的观测 df.truncate(after='5/3/2014') ``` | | battle_deaths | | --- | --- | | date | | | 2014-05-01 18:47:05.069722 | 34 | | 2014-05-01 18:47:05.119994 | 25 | | 2014-05-02 18:47:05.178768 | 26 | | 2014-05-02 18:47:05.230071 | 15 | | 2014-05-02 18:47:05.230071 | 15 | | 2014-05-02 18:47:05.280592 | 14 | ```py # 2014.5 的观测 df['5-2014'] ``` | | battle_deaths | | --- | --- | | date | | | 2014-05-01 18:47:05.069722 | 34 | | 2014-05-01 18:47:05.119994 | 25 | | 2014-05-02 18:47:05.178768 | 26 | | 2014-05-02 18:47:05.230071 | 15 | | 2014-05-02 18:47:05.230071 | 15 | | 2014-05-02 18:47:05.280592 | 14 | | 2014-05-03 18:47:05.332662 | 26 | | 2014-05-03 18:47:05.385109 | 25 | | 2014-05-04 18:47:05.436523 | 62 | | 2014-05-04 18:47:05.486877 | 41 | ```py # 计算每个时间戳的观测数 df.groupby(level=0).count() ``` | | battle_deaths | | --- | --- | | date | | | 2014-05-01 18:47:05.069722 | 1 | | 2014-05-01 18:47:05.119994 | 1 | | 2014-05-02 18:47:05.178768 | 1 | | 2014-05-02 18:47:05.230071 | 2 | | 2014-05-02 18:47:05.280592 | 1 | | 2014-05-03 18:47:05.332662 | 1 | | 2014-05-03 18:47:05.385109 | 1 | | 2014-05-04 18:47:05.436523 | 1 | | 2014-05-04 18:47:05.486877 | 1 | ```py # 每天的 battle_deaths 均值 df.resample('D').mean() ``` | | battle_deaths | | --- | --- | | date | | | 2014-05-01 | 29.5 | | 2014-05-02 | 17.5 | | 2014-05-03 | 25.5 | | 2014-05-04 | 51.5 | ```py # 每天的 battle_deaths 总数 df.resample('D').sum() ``` | | battle_deaths | | --- | --- | | date | | | 2014-05-01 | 59 | | 2014-05-02 | 70 | | 2014-05-03 | 51 | | 2014-05-04 | 103 | ```py # 绘制每天的总死亡人数 df.resample('D').sum().plot() # <matplotlib.axes._subplots.AxesSubplot at 0x11187a940> ``` ![png](https://chrisalbon.com/python/data_wrangling/pandas_time_series_basics/pandas_time_series_basics_29_1.png)