# DataFrame
> DataFrame是一个表格型的数据结构,它含有一组有序的列,每列可以是不同的值类型(数值,字符串,布尔值)。DataFrame既有行索引也有列索引,它可以被看做由`Series`组成的字典。
* * *
## 由等长列表或`NumPy`数组组成的字典 构建`DataFrame`
~~~
from pandas import Series, DataFrame
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
//
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
~~~
如果指定了序列列,则`DataFrame`的列就会按照指定顺序进行排列:
~~~
from pandas import Series, DataFrame
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data, columns=['year', 'state', 'pop'])
print(frame)
//
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
~~~
* * *
## 通过类似字典标记的方式或属性的方式,可以将`DataFrame`的列获取为一个`Series`
~~~
from pandas import Series, DataFrame
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])
print(frame2['state'])
//
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
print(frame2.year)
//
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
~~~
返回的`Series`拥有原`DataFrame`相同的索引。
## 行也可以通过位置或名称的方式进行获取,比如使用索引字段`ix`
~~~
from pandas import Series, DataFrame
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])
print(frame2.ix['three'])
//
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
~~~
## 列可以通过赋值的方式进行修改,例如给那个空的debt列赋上一个标量值或一组值
~~~
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])
frame2['debt'] = 16.5
print(frame2)
//
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
frame2['debt'] = np.arange(5.)
print(frame2)
//
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2002 Nevada 2.9 4.0
~~~
## 将列表或数值赋值给某个列时,其长度必须根`DataFrame`的长度相匹配。如果赋值的是一个`Series`,就会精确匹配`DataFrame`的索引,所有的空位都会被填上缺失值
~~~
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)
//
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
~~~
## 为不存在的列赋值会创建出一个新列
~~~
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2['eastern'] = frame2['state'] == 'Ohio'
print(frame2)
//
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
~~~