2.1 索引选取 · soton_数据分析

[TOC] ***** **索引选取:** 根据某种条件筛选出数据的子集，类似sql那样整数列表的切片时前闭后开 pandas中标签名称切片是前后都闭合的超出索引值范围和标签名不存在会报错 ***** 标签指的是索引列中的值索引选取两种方式： 1.基于loc(根据索引的标签名选取) 2.基于iloc(根据行和列的位置，用整数选取) ***** loc ![](https://img.kancloud.cn/da/6a/da6acb2132ae136c712ca5495bc7bd15_1280x447.png) 没必要使用callable function来进行索引选取 ***** iloc ![](https://img.kancloud.cn/93/ee/93eece3aafdf5b94085af89a2b970fbf_1010x317.png) ***** ### 2.1.1. 基于label.loc **Series操作** ``` #随机生成6个数字，并生成series和索引列 s1 = pd.Series(np.random.randn(6), index=list('abcdef')) ``` ![](https://img.kancloud.cn/dd/ee/ddee452c845548220c87b0f512a8675f_202x193.png) ***** 根据单个标签名选取数据 ``` s1.loc['c'] # 结果 0.7140279186233623 ``` ***** 根据标签列表选取数据 ``` s1.loc[['a','e','f']] ``` ![](https://img.kancloud.cn/e5/72/e572641fbf69dde38ef0e4920d5430dc_160x92.png) ***** 根据标签切片选取数据，按照生成时的标签顺序进行切片 ``` s1.loc['d':'f'] ``` ![](https://img.kancloud.cn/ea/52/ea520d78084daa1010d0270f2f117961_152x97.png) ***** 判断s1中的每个元素是否大于0，返回一个布尔数组 ![](https://img.kancloud.cn/15/3f/153fe7cb123ee8060f7308dd9d749dfb_131x197.png) 根据布尔数组选取数据 ![](https://img.kancloud.cn/6a/29/6a29d401de85392d70dcacf610aff7bf_168x157.png) **DataFrame操作** DataFrame是二维数据，可以操控index与column。DataFrame的行索引是整数 ***** 根据行和列选取数据 ![](https://img.kancloud.cn/0f/d0/0fd06245749ddd913615bc1769f4dcd1_297x125.png) ***** 根据列表 ![](https://img.kancloud.cn/f9/33/f9331e5f51fd2038b2e27285836ddfc7_363x210.png) ***** 根据切片 ![](https://img.kancloud.cn/c3/df/c3df634f925ec7ca262beba7d9811b0e_360x270.png) ***** 根据布尔数组选取有关height行的数据 df.loc[df['height'] >= 200,['Player','height']] ![](https://img.kancloud.cn/cf/0d/cf0d2e9aaa26c1cc4fe6f134bfd321da_318x344.png) 下列代码生成的是一个series ![](https://img.kancloud.cn/db/8d/db8daa81a52f23c0bdd4696693736a0d_245x269.png) ***** ### 2.1.2. 基于位置.iloc 用索引的位置下标选取数据 ![](https://img.kancloud.cn/34/b5/34b5242ad9af80a5a75cd5e19d0d6bcf_389x560.png) 传统切片通过位置下标整数进行切片 ![](https://img.kancloud.cn/06/fa/06fae5659b5cd71784b9d78fb7ab2760_214x209.png) ***** 根据列表选取行 ![](https://img.kancloud.cn/90/0b/900b4472addb1ef6a42ec3bf6cb83c8e_677x193.png) 根据位置下表整数切片选取行 ![](https://img.kancloud.cn/6a/17/6a17d3942839d7386f3e41e47ab18873_659x203.png) ***** 用Player列作为索引列 ![](https://img.kancloud.cn/9e/ba/9ebab083ee7fff946fe7f19dc9f6c042_694x352.png) ***** 用iloc进行选取行或列，只能用整数位置下标例子：切片选取行，列表选取列，逗号分割的是行和列的参数 ![](https://img.kancloud.cn/75/e8/75e8299c2a04c283ba246c5a9b00aa98_235x429.png) ### 2.1.3. 随机选取数据使用sample()方法对数据进行列或者行的随机选取，默认是对行进行选取。函数接受一个参数用来指定返回的数量或者百分比 ``` 随机抽取10行 df.sample(10) ``` ![](https://img.kancloud.cn/63/f4/63f4552ed83d8e7705da020e0413de3a_789x385.png) ***** ``` 随机返回总数量的百分之一数据 df.sample(frac=0.01) ``` ![](https://img.kancloud.cn/eb/d2/ebd2dcf9ff7ebbca3813d4fdc43c4811_887x442.png) ***** **通过控制axis对列进行抽样** .head()默认显示前五行 ``` 随机抽取三列，并显示前五行 df.sample(n = 3,axis = 1).head() ``` ![](https://img.kancloud.cn/aa/db/aadbeb4d24226aedff59399e4ee4ab3a_362x206.png) ***** sample有一个参数,random_state是随机数种子，决定是否返回固定的随机数据 df.sample(n = 3,axis = 1,random_state=10).head() ![](https://img.kancloud.cn/21/66/2166d5e61e432436e37a74d06e53bf98_300x206.png) ***** ### 2.1.4. 使用isin() 该函数回返回一个布尔型向量，根据Series里面的值是否在给定的列表中。通过这个条件筛选出多行数据！ ``` # 过滤 "Chicago","New York" s = df['birth_city'] s.isin(["Chicago","New York"]) ``` ![](https://img.kancloud.cn/da/df/dadffee8b64561eaf3d8358187d35fa1_266x228.png) dataFrame默认按行选取 ![](https://img.kancloud.cn/f2/7d/f27d42f87b101ba7d45a002dda5c8c24_364x278.png) ***** ![](https://img.kancloud.cn/53/4d/534d313717771c475b5c5a202a686af9_790x290.png) ***** **对多列进行布尔值选取** 对于DataFrame我们可以使用dict进行处理，dict的key就是对应的column name. 我们经常与all()或者any函数组合进行数据过滤选取 * all 指定axis上的元素全部为True * any 指定axis上的元素至少一个为True ![](https://img.kancloud.cn/6a/05/6a054fe5de920b6d770cda8dac5344b2_859x538.png) ***** 选出全为true的列 ![](https://img.kancloud.cn/64/1b/641b1c594e247b1f6737fe141c0cb632_480x77.png) ***** axis= 1是筛选行，axis=0是筛选列一行中有一个true，这行被标记为true，通过loc返回行 ![](https://img.kancloud.cn/47/66/4766bfcb196c65ed8ecf714b1b951ade_223x277.png) ![](https://img.kancloud.cn/9c/11/9c113f8064c5225a6255001148bdecff_804x302.png) ### 2.1.5. 数据过滤基于loc的强大功能，我们可以对数据做很多复杂的操作。第一个就是实现数据的过滤，类似于SQL里面的where功能选取出height >= 180 ,weight >= 80的运动员数据。 ***** 选出符合条件的数据行 ![](https://img.kancloud.cn/5d/7c/5d7c1f29f5b874a065a06f061e893573_1081x286.png) ***** * 如果height >= 180, weight >=80, 值为 “high" * 如果height=170, weight=70 值为 ”msize" * 其余的值为 "small" ``` #1 新建一个flag列，将符合条件的行的flag列填入"high"值 df.loc[(df['height'] >=180) & (df['weight'] >=80),"flag"] = "high" ``` ![](https://img.kancloud.cn/dc/5f/dc5f7dd89f118628501fc5365e829c15_949x243.png) ***** ``` #2 新建一个flag列，将符合条件的行的flag列填入"msize"值 df.loc[((df['height'] <=180) & (df['height']>=170)) & ((df['weight'] <=80) & (df['weight'] >=70)),"flag"] = "msize" ``` ***** ``` #3 其余的值为 "small" ~对条件1和2进行否定，不满足条件一或2的。新建一个flag列，将符合条件的行的flag列填入"small"值 df.loc[~(((df['height'] >=180) & (df['weight'] >=80)) |(((df['height'] <=180) & (df['height']>=170))&((df['weight'] <=80) & (df['weight'] >=70)))),"flag"] = "small" ``` ***** 对flag列各值的频数进行统计 ![](https://img.kancloud.cn/d5/65/d5655184be2d7c3e25996356649b8ba5_278x137.png) ***** 对各行的height进行条件判断，满足条件，判定该行为true ![](https://img.kancloud.cn/c5/0c/c50c77b5864633e988f302347e82a512_194x282.png) ***** ### 2.1.6. query()方法使用表达式进行数据筛选.类似sql中的where表达式 ![](https://img.kancloud.cn/41/fc/41fc4dd935033d4fecbf7170c696aca8_1006x311.png) ![](https://img.kancloud.cn/d6/2a/d62a01a25d1615450e5b8d189579f099_970x262.png) **注意** query里面不可以引用变量 ![](https://img.kancloud.cn/44/9c/449c7f5c1649e7377f7006e866a7653e_878x396.png) ### 2.1.7. 索引设置 set_index()方法可以将一列或者多列设置为索引 ``` #keys设置索引列，drop保留作为索引列的数据，append是否保留原来的位置索引，inplace修改原数据集 df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False) ``` ***** 把Player和collage列作为索引列 ![](https://img.kancloud.cn/ab/f8/abf8cbca28162f4db031f55af1b3d92f_729x329.png) ***** 将索引列放回数据框中，并且设置简单的整数索引 ``` df1.reset_index(level=None, drop=False, inplace=False, co_level=0, col_fill='') ``` ![](https://img.kancloud.cn/f2/b3/f2b370f2cb84b8440c029f5177310a6d_994x321.png) ### 2.1.8. where方法 * 通过布尔类型的数组选取数据仅仅返回数据的子集 * where()函数能够确保返回的结果和原数据集的结构一样 ***** 保留符合条件的元素和索引 ![](https://img.kancloud.cn/4a/cd/4acd967abfc95c919be5e85009457c11_323x179.png) ***** 保留原来数据集的结构 ![](https://img.kancloud.cn/91/4c/914c840a4d35a53db3aa413beb3b0b4c_144x239.png) ### 2.1.9. 重复数据 duplicate * duplicated 返回一个和行数相等的布尔数组，表明某一行是否是重复的,true是重复值 * drop_duplicates 删除重复行 ***** **通过keep参数来控制行的取舍** * 除第一个值外，剩下重复的值都认为是重复值 keep='first' (default): mark / drop duplicates except for the first occurrence. * 除最后一个值外，剩下重复的值都认为是重复值 keep='last': mark / drop duplicates except for the last occurrence. * 标记所有的值为重复值 keep=False: mark / drop all duplicates. ***** ![](https://img.kancloud.cn/40/5e/405efc776e1f04f0fa603e502fcdf540_688x426.png) ***** ``` #判断是否有两行是重复的 df2.duplicated() ``` ![](https://img.kancloud.cn/f5/53/f55365e770d85ff5df540c4da8332150_134x168.png) ***** ``` #判断a列是否有重复值 df2.duplicated('a',keep = False) ``` ![](https://img.kancloud.cn/91/23/912360baa7d65a18c433cb10bcf6af40_150x175.png) ![](https://img.kancloud.cn/66/68/6668e5eb1578fda72546a88cfc15a1ca_355x214.png) ![](https://img.kancloud.cn/5b/25/5b2550a8179c04b1c82885647fa5729d_317x208.png) ***** 删去重复行 ![](https://img.kancloud.cn/bd/be/bdbe065c0e44605dc8db3b964753f54f_253x321.png) ***** 删去a b两列都一样的行。第二行和第四行相等，默认删去第四行 ![](https://img.kancloud.cn/3c/9d/3c9deb5959b14d203ccd8b95783565a9_362x298.png) ***** keep = 'last'，删去最后一个重复的值之前的值 ![](https://img.kancloud.cn/f5/4c/f54c0ed60c897ca3787713185d023fbd_472x316.png) ### 2.1.10. MultiIndex 层次索引可以允许我们操作更加复杂的数据 ``` arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] 根据嵌套列表创建两列索引，第一列索引叫first,第二列索引叫second index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second']) ``` 第一列索引有四个值，第二列索引有两个值。第一列索引分为四组，每组有两个相同的值。第二列索引也分为四组，每组有两个不同的值 ![](https://img.kancloud.cn/85/6f/856f8792c288004dc46471f63c94fb1b_642x116.png) ***** 把数组变为元组 ![](https://img.kancloud.cn/6f/06/6f0644f051209f9a3c7541a551449152_267x238.png) ***** ``` #根据元组创建两层索引 index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second']) #得到第一层索引的值 index.get_level_values(0) ``` ![](https://img.kancloud.cn/51/47/5147bebbf83af9be356deaa9ba336934_821x44.png) ***** ``` #用随机数创建series值，用建好的多层索引作为series的索引 pd.Series(np.random.randn(8), index=index) ``` ![](https://img.kancloud.cn/9d/af/9dafa7a1346800a97bc92fd87e9b5ddc_260x217.png) ***** **索引选取** ``` #用随机数创建8行4列的数据集。用两层索引index作为数据集的索引。list('ABCD')作为数据集的列 df = pd.DataFrame(np.random.randn(8, 4), index=index,columns = list('ABCD')) ``` ![](https://img.kancloud.cn/1d/a3/1da36028bec0be3b1ceb0ad802f8f51e_432x384.png) ***** ``` #用多层索引，类似找书的第几章第几节。章是第一层索引，节是第二层索引 #loc根据索引标签名选取数据各层索引标签名要放到一个元组里 df.loc[('bar', 'two')] ``` ![](https://img.kancloud.cn/35/b8/35b8b2435b15fbe33004b48885142b20_299x107.png) ***** 选指定多层索引的行与指定列交叉的数据 ![](https://img.kancloud.cn/9b/ff/9bff5de74cf6166479e326d7b646b96c_272x80.png) ***** 多个多层索引放入列表中选取行 ![](https://img.kancloud.cn/fc/3e/fc3e44b72dee26071f66573e7258bcff_446x194.png) ***** 选中指定的第一层索引bar下的所有行 ![](https://img.kancloud.cn/72/58/7258ab97f296212c54864def864772e7_410x139.png) ***** 对于多层索引，不可以跳过前面层的索引，用后面层的索引选择行。如：想选所有第二层索引为'one'的行，报错 ![](https://img.kancloud.cn/13/b0/13b01a07db3c85e3dc02fefa62ac947b_462x128.png) ***** **可以使用切片(slicers)对多重索引进行操作** * 你可以使用任意的列表，元祖，布尔型作为Indexer * 可以使用sclie(None)表达在某个level上选取全部的内容，不需要对全部的level进行指定，它们会被隐式的推导为slice(None) * 所有的axis必须都被指定，意味着index和column上都要被显式的指明 **正确的方式** ~~~python 选取第一层索引A1和第二层索引A3 选取所有的列 df.loc[(slice('A1', 'A3'), ...), :] ~~~ **错误的方式X** 没有选择列 ~~~python df.loc[(slice('A1', 'A3'), ...)] ~~~ ***** ![](https://img.kancloud.cn/7f/80/7f808e0aee43fd87aa360b083c5866a0_431x390.png) ***** ``` #slice(None)选择第一层全部索引,'one'选择第二层带'one'的索引，：选择全部的列 df.loc[(slice(None),'one'),:] ``` ![](https://img.kancloud.cn/eb/99/eb995eb8b60fa94bb463cc7587e05c06_424x212.png) ***** ``` # IndexSlice是一种更接近自然语法的用法，可以替换slice # 生成IndexSlice idx = pd.IndexSlice # idx:选择第一层全部索引。'one' : 第二层'one'索引。: 选择全部的列。 df.loc[idx[:,'one'],:] ``` ![](https://img.kancloud.cn/6e/fa/6efa1b050d553a9c567b2ca74c673c82_427x231.png) ***** ![](https://img.kancloud.cn/d5/7f/d57fb5042be9f32248f0f0888c337142_321x253.png) ***** **函数xs()可以让我们在指定level的索引上进行数据选取** level=1 指定为第二层索引，第一层为level=0。选第二层带 'one'索引的数据行 ![](https://img.kancloud.cn/53/4f/534fe3d5bcf2f0b530e72759d6612295_382x259.png) ![](https://img.kancloud.cn/5b/95/5b9532e8d3c5222f2afae874066795dc_438x276.png) ![](https://img.kancloud.cn/68/24/6824bb2a9404c002fae28e86b0cfda5f_444x178.png) ![](https://img.kancloud.cn/ee/c6/eec6662ea9a2c51e26f4a4814bc69aa5_443x202.png) **索引排序** 先排第一层索引，下层索引再组内排序 ![](https://img.kancloud.cn/51/de/51de38ceff6464d1672040b4247e10a3_450x390.png) ***** 数据行按第二层索引排序 ![](https://img.kancloud.cn/9a/e0/9ae0d503ce90504a837ac8ad121a453f_432x398.png)