Technical tags: # python - data analysis
Missing data occurs frequently in data analysis work. One of the goals of pandas is to make working with missing data as painless as possible; for example, all of the descriptive statistics on pandas objects exclude missing data by default.
The way missing data is represented in pandas is somewhat imperfect, but it works well for most users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value, and it can be detected easily:
import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
--------------------------------------------------------------------
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
-----------------------------------------------------------------
0 False
1 False
2 True
3 False
dtype: bool
string_data.describe()
------------------------------------------------------------------------------
count 3
unique 3
top artichoke
freq 1
dtype: object
In pandas, missing values are referred to as NA, which stands for not available. In statistics applications, NA data may be data that either does not exist or exists but was not observed (through problems with data collection, for example). When cleaning data for analysis, it is often worth analyzing the missing data itself to identify data collection problems or potential biases introduced by missing values.
The built-in Python None value is also treated as NA in object arrays:
string_data
-----------------------------------------------
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data[0] = None
string_data.isnull()
----------------------------------------------------------------------------
0 True
1 False
2 True
3 False
dtype: bool
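As a quick illustrative check (using a small made-up Series, not the data above), both np.nan and an assigned None are picked up by isnull:

```python
import numpy as np
import pandas as pd

# Illustrative: np.nan and None are both recognized as missing.
s = pd.Series([1.0, np.nan, 3.0])
s[0] = None  # None is coerced to NaN on assignment into a float Series
mask = s.isnull()
```

Here mask is True for positions 0 and 1 and False for position 2.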
The table below lists some functions related to handling missing data.
There are several ways to filter out missing data. You can always do it by hand with pandas.isnull and boolean indexing, but dropna is often more practical. On a Series, it returns the Series with only the non-null data and index values:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data
---------------------------------------------------------------------------
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
data.dropna()
-----------------------------------------------------------------------
0 1.0
2 3.5
4 7.0
dtype: float64
This is equivalent to:
data[data.notnull()]
-----------------------------------------------------------------------
0 1.0
2 3.5
4 7.0
dtype: float64
With DataFrame objects, things are a bit more complex: you may want to drop rows or columns that are all NA, or only those containing any NAs:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data
-----------------------------------------------------------------------------------
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
------------------------------------------------------------------
dropna by default drops any row containing a missing value:
cleaned = data.dropna()
cleaned
------------------------------------------
0 1 2
0 1.0 6.5 3.0
Passing how='all' drops only the rows that are all NA:
data.dropna(how='all')
--------------------------------------
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
To drop columns in the same way, pass axis=1. Suppose we first add a column that is entirely NA:
data[4] = NA
data
data.dropna(axis=1, how='all')
-------------------------------------------------
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
Suppose you want to keep only rows containing at least a certain number of non-NA observations; you can indicate this with the thresh argument:
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df
df.dropna()
df.dropna(thresh=2)
--------------------------------------------------------------------
0 1 2
0 0.603531 NaN NaN
1 -0.664478 NaN NaN
2 1.574308 NaN -2.026062
3 -0.796888 NaN -0.550593
4 -0.146118 0.090815 1.564721
5 0.766400 0.971270 -0.779148
6 -1.802502 -1.139013 -1.289828
0 1 2
4 -0.146118 0.090815 1.564721
5 0.766400 0.971270 -0.779148
6 -1.802502 -1.139013 -1.289828
0 1 2
2 1.574308 NaN -2.026062
3 -0.796888 NaN -0.550593
4 -0.146118 0.090815 1.564721
5 0.766400 0.971270 -0.779148
6 -1.802502 -1.139013 -1.289828
Rather than filtering out missing data, you may want to fill in the holes. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:
df.fillna(0)
-----------------------------------------------------------
0 1 2
0 0.603531 0.000000 0.000000
1 -0.664478 0.000000 0.000000
2 1.574308 0.000000 -2.026062
3 -0.796888 0.000000 -0.550593
4 -0.146118 0.090815 1.564721
5 0.766400 0.971270 -0.779148
6 -1.802502 -1.139013 -1.289828
Calling fillna with a dict lets you use a different fill value for each column: df.fillna({1: 0.5, 2: 0})
fillna returns a new object by default, but you can modify the existing object in place by passing inplace=True: df.fillna(0, inplace=True)
df
----------------------------------------------------------------------
0 1 2
0 0.603531 0.000000 0.000000
1 -0.664478 0.000000 0.000000
2 1.574308 0.000000 -2.026062
3 -0.796888 0.000000 -0.550593
4 -0.146118 0.090815 1.564721
5 0.766400 0.971270 -0.779148
6 -1.802502 -1.139013 -1.289828
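The dict form described above can be sketched on a small hypothetical frame (the column labels 0-2 are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with NaNs in columns 1 and 2, illustrating the
# dict form of fillna: a different fill value per column.
df = pd.DataFrame({0: [1.0, 2.0], 1: [np.nan, 3.0], 2: [np.nan, np.nan]})
filled = df.fillna({1: 0.5, 2: 0})
```

Column 1's NaN becomes 0.5 and column 2's NaNs become 0, while column 0 is untouched.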
The same interpolation methods available for reindexing can be used with fillna:
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
df.fillna(method='ffill')
df.fillna(method='ffill', limit=2)
----------------------------------------------------------------------------
0 1 2
0 0.786754 -1.180053 2.165359
1 -0.802301 0.124125 -0.737388
2 0.285469 NaN 0.210286
3 0.647529 NaN 1.124584
4 -0.706682 NaN NaN
5 1.418289 NaN NaN
0 1 2
0 0.786754 -1.180053 2.165359
1 -0.802301 0.124125 -0.737388
2 0.285469 0.124125 0.210286
3 0.647529 0.124125 1.124584
4 -0.706682 0.124125 1.124584
5 1.418289 0.124125 1.124584
0 1 2
0 0.786754 -1.180053 2.165359
1 -0.802301 0.124125 -0.737388
2 0.285469 0.124125 0.210286
3 0.647529 0.124125 1.124584
4 -0.706682 NaN 1.124584
5 1.418289 NaN 1.124584
With a little creativity you can do many other things with fillna; for example, you might pass the mean or median value of a Series:
data = pd.Series([1., NA, 3.5, NA, 7])
data
data.fillna(data.mean())
-------------------------------------------------------------------
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
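The same idea extends from a Series to a DataFrame; a sketch, assuming a small made-up frame, where df.mean() is computed per column and fillna aligns the result on column labels:

```python
import numpy as np
import pandas as pd

# Illustrative frame: each column's NaN is filled with that column's mean.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 4.0]})
out = df.fillna(df.mean())
```

Column 'a' has mean 2.0 and column 'b' has mean 3.0, so those values fill the respective gaps.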
Duplicate rows may appear in a DataFrame for a number of reasons. Here is an example:
data = pd.DataFrame({
'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
data
data.duplicated()
data.drop_duplicates()
----------------------------------------------------------------
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
Both of these methods consider all of the columns by default; alternatively, you can specify a subset of columns to use for detecting duplicates. Suppose we had an additional column of values and wanted to filter duplicates based only on the k1 column:
data['v1'] = range(7)
data
data.drop_duplicates(['k1'])
----------------------------------------------------------------------
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
k1 k2 v1
0 one 1 0
1 two 1 1
duplicated and drop_duplicates keep the first observed value combination by default. Passing keep='last' keeps the last one instead:
data.drop_duplicates(['k1', 'k2'], keep='last')
----------------------------------------------------------------------
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6
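A related option worth knowing: duplicated also accepts keep=False, which marks every member of a duplicate group rather than all-but-the-first. A sketch using the same data as above:

```python
import pandas as pd

# keep=False flags every row that participates in any duplicate group.
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
all_dups = data.duplicated(['k1', 'k2'], keep=False)
```

Here only the last two rows form a duplicate pair, so both are marked True.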
data = pd.DataFrame({
'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
----------------------------------------------------------------------
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
The map method on a Series accepts a function or a dict-like object containing a mapping. Here we have a small problem: some of the meats are capitalized while others are not, so we first need to convert each value to lowercase using the str.lower Series method:
lowercased = data['food'].str.lower()
lowercased
data['animal'] = lowercased.map(meat_to_animal)
data
-----------------------------------------------------------------------------
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
data['food'].map(lambda x: meat_to_animal[x.lower()])
-----------------------------------------------------------
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
map is a convenient way to perform element-wise transformations and other data cleaning operations.
Filling in missing data with fillna can be seen as a special case of more general value replacement. As shown above, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so. Consider this Series:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
----------------------------------------------------------------
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
The -999 values might be sentinel values for missing data. To replace them with NA values that pandas understands, we can use replace, which produces a new Series (unless you pass inplace=True):
data.replace(-999, np.nan)
-----------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
To replace multiple values at once, pass a list and then the substitute value:
data.replace([-999, -1000], np.nan)
-----------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
To use a different replacement for each value, pass a list of substitutes:
data.replace([-999, -1000], [np.nan, 0])
-----------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
The argument passed can also be a dict:
data.replace({
-999: np.nan, -1000: 0})
-----------------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
-----------------------------------------------------------------------
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
transform = lambda x: x[:4].upper()
data.index.map(transform)
-------------------------------------------------------------------------
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
-------------------------------------------------------------------------
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:
data.rename(index=str.title, columns=str.upper)
str.title('this is a test')
-------------------------------------------------------------------------
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
'This Is A Test'
Notably, rename can be used in conjunction with a dict-like object, providing new values for a subset of the axis labels:
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
-------------------------------------------------------------------------
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
rename saves you from the chore of copying the DataFrame manually and assigning new values to its index and columns attributes. If you wish to modify a dataset in place, pass inplace=True:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
-------------------------------------------------------------------------
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
Continuous data is often discretized or otherwise split into "bins" for analysis. Suppose you have data about a group of people and want to divide them into discrete age groups:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
To do this, you can use pandas's cut function:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
-------------------------------------------------------------------------
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
pandas returns a special Categorical object. The output describes the bins computed by pandas.cut; you can treat it like an array of strings indicating the bin name. Internally it contains a categories array specifying the distinct category names, along with labels for the ages data in the codes attribute:
cats.codes
-------------------------------------------------------------------------
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
-------------------------------------------------------------------------
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
closed='right',
dtype='interval[int64]')
# pd.value_counts(cats) gives the bin counts for the result of pandas.cut.
pd.value_counts(cats)
-------------------------------------------------------------------------
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
You can change which side of each interval is closed by passing right=False:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
-------------------------------------------------------------------------
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
You can also pass your own bin names via the labels option:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
-------------------------------------------------------------------------
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
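A related convenience: passing labels=False to cut returns the integer bin codes directly rather than a Categorical. A sketch reusing the ages and bins from above:

```python
import pandas as pd

# labels=False yields the integer bin code for each value,
# matching the cats.codes array shown earlier.
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
codes = pd.cut(ages, bins, labels=False)
```

The result is the same integer array that cats.codes produced above.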
If you pass cut an integer number of bins instead of explicit bin edges, it computes equal-length bins based on the minimum and maximum values in the data; precision=2 limits the displayed precision to two decimal digits:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)
-------------------------------------------------------------------------
[(0.67, 0.9], (0.67, 0.9], (0.011, 0.23], (0.23, 0.45], (0.23, 0.45], ..., (0.45, 0.67], (0.011, 0.23], (0.23, 0.45], (0.011, 0.23], (0.67, 0.9]]
Length: 20
Categories (4, interval[float64]): [(0.011, 0.23] < (0.23, 0.45] < (0.45, 0.67] < (0.67, 0.9]]
qcut is a function closely related to cut that bins the data based on sample quantiles. Depending on the distribution of the data, cut will not usually give each bin the same number of data points. Since qcut uses sample quantiles instead, it produces bins of roughly equal size:
data = np.random.randn(1000) # Normally distributed
cats = pd.qcut(data, 4) # Cut into quartiles
cats
-------------------------------------------------------------------------
[(-3.627, -0.665], (-0.665, -0.0236], (-3.627, -0.665], (-3.627, -0.665], (0.595, 3.352], ..., (-3.627, -0.665], (0.595, 3.352], (-0.0236, 0.595], (-0.0236, 0.595], (-0.0236, 0.595]]
Length: 1000
Categories (4, interval[float64]): [(-3.627, -0.665] < (-0.665, -0.0236] < (-0.0236, 0.595] < (0.595, 3.352]]
pd.value_counts(cats)
-------------------------------------------------------------------------
(0.595, 3.352] 250
(-0.0236, 0.595] 250
(-0.665, -0.0236] 250
(-3.627, -0.665] 250
dtype: int64
Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
----------------------------------------------------------------------------------
[(-1.319, -0.0236], (-1.319, -0.0236], (-3.627, -1.319], (-1.319, -0.0236], (1.218, 3.352], ..., (-3.627, -1.319], (1.218, 3.352], (-0.0236, 1.218], (-0.0236, 1.218], (-0.0236, 1.218]]
Length: 1000
Categories (4, interval[float64]): [(-3.627, -1.319] < (-1.319, -0.0236] < (-0.0236, 1.218] < (1.218, 3.352]]
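The equal-bin-size property of qcut can be checked directly. This sketch uses a fixed seed (an arbitrary choice, for reproducibility) rather than the document's unseeded data:

```python
import numpy as np
import pandas as pd

# With 1000 points cut into quartiles, each bin holds exactly 250 values.
rng = np.random.RandomState(42)
data = rng.randn(1000)
cats = pd.qcut(data, 4)
counts = pd.Series(cats).value_counts()
```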
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
data = pd.DataFrame(np.random.randn(1000,4))
data.describe()
----------------------------------------------------------------------------------
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.000284 0.003834 0.052079 0.021614
std 1.010268 1.006346 1.007687 1.029508
min -3.261766 -3.762325 -3.544134 -3.124230
25% -0.741905 -0.643554 -0.613592 -0.651136
50% -0.019509 -0.039399 0.036878 0.094302
75% 0.695546 0.676507 0.737588 0.712619
max 3.026675 3.482009 3.390924 3.116602
# Suppose you want to find values in one of the columns exceeding 3 in absolute value:
col = data[2]
col[np.abs(col) > 3]
----------------------------------------------------------------------------------
156 3.174432
406 3.312773
708 -3.544134
788 3.390924
Name: 2, dtype: float64
To select all rows having a value exceeding 3 or -3 in any column, use the any method on a boolean DataFrame:
data[(np.abs(data) > 3).any(axis=1)]
----------------------------------------------------------------------------------
0 1 2 3
32 -0.464433 -3.762325 0.593019 -3.124230
156 0.771649 0.483199 3.174432 0.117988
257 1.056537 1.435697 -1.161737 3.095851
307 -3.056003 -1.987528 -0.628526 0.297810
406 0.912561 -0.073800 3.312773 -0.802398
526 3.026675 -0.463901 -0.486543 -1.458392
628 -1.211783 -1.272122 -1.051027 3.116602
708 -0.276584 -0.283202 -3.544134 -0.418665
713 -3.261766 -1.122042 0.869456 -0.035866
779 0.082870 3.482009 0.439070 -0.291009
788 -0.111096 -0.608896 3.390924 -0.688649
837 0.884890 -3.544678 -0.799940 -1.402440
883 -0.140538 -3.086551 -2.524166 -2.525502
901 -0.949264 3.354125 1.301728 0.192881
# np.sign returns the sign (+1, -1, or 0) of each value; here it is used to cap values outside the interval [-3, 3]:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
----------------------------------------------------------------------------------
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.000575 0.004392 0.051745 0.021525
std 1.009208 0.999045 1.003169 1.028508
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.741905 -0.643554 -0.613592 -0.651136
50% -0.019509 -0.039399 0.036878 0.094302
75% 0.695546 0.676507 0.737588 0.712619
max 3.000000 3.000000 3.000000 3.000000
np.sign(data).head()
----------------------------------------------------------------------------------
0 1 2 3
0 1.0 1.0 1.0 -1.0
1 1.0 1.0 -1.0 -1.0
2 1.0 1.0 1.0 1.0
3 -1.0 1.0 -1.0 -1.0
4 -1.0 1.0 -1.0 -1.0
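A simpler route to the same capping is the clip method, which bounds values to an interval directly. A sketch with seeded data (the seed is an assumption, for reproducibility):

```python
import numpy as np
import pandas as pd

# clip(-3, 3) bounds every value to [-3, 3], equivalent to the
# boolean-assignment-with-np.sign approach above.
rng = np.random.RandomState(0)
data = pd.DataFrame(rng.randn(1000, 4))
capped = data.clip(-3, 3)
```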
Permuting (randomly reordering) a Series or the rows of a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:
df = pd.DataFrame(np.arange(5 * 4).reshape((5,4)))
sampler = np.random.permutation(5)
sampler
----------------------------------------------------------------------------------
array([1, 0, 2, 3, 4])
That array can then be used in iloc-based indexing or the equivalent take function:
df
----------------------------------------------------------------------------------
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
df.take(sampler)
----------------------------------------------------------------------------------
0 1 2 3
1 4 5 6 7
0 0 1 2 3
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
To select a random subset without replacement, use the sample method on Series and DataFrame:
df.sample(n=3)
----------------------------------------------------------------------------------
0 1 2 3
2 8 9 10 11
4 16 17 18 19
1 4 5 6 7
To generate a sample with replacement (allowing repeat choices), pass replace=True to sample:
choices = pd.Series([5,7,-1,6,4])
draws = choices.sample(n=10, replace=True)
draws
----------------------------------------------------------------------------------
4 4
2 -1
1 7
0 5
0 5
3 6
3 6
3 6
1 7
4 4
dtype: int64
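A common related idiom: sample(frac=1) shuffles all rows, an alternative to permutation plus take. A sketch on the same 5x4 frame (random_state fixes the draw):

```python
import numpy as np
import pandas as pd

# frac=1 samples 100% of the rows without replacement — i.e. a shuffle.
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
shuffled = df.sample(frac=1, random_state=0)
```

The result contains every original row exactly once, in a new order.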
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix.
If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a get_dummies function for doing this, though devising one yourself is not difficult. Let's use an example DataFrame:
df = pd.DataFrame({
'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
df
----------------------------------------------------------------------------------
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b
pd.get_dummies(df['key'])
----------------------------------------------------------------------------------
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
In some cases, you may want to add a prefix to the columns of the indicator DataFrame so it can be merged with the other data. get_dummies has a prefix argument for this:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
----------------------------------------------------------------------------------
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
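A useful recipe for statistical applications combines get_dummies with a discretization function like cut, turning a continuous variable into bin-membership indicators. A sketch with seeded random data (the seed is chosen arbitrarily):

```python
import numpy as np
import pandas as pd

# Each row gets a 1 in exactly one column — the bin its value falls into.
rng = np.random.RandomState(12345)
values = rng.rand(10)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
dummies = pd.get_dummies(pd.cut(values, bins))
```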
If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let's look at the MovieLens 1M dataset:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('data/movie/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')
movies[:10]
----------------------------------------------------------------------------------
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children's
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
First, extract the list of unique genres in the dataset:
all_genres = []
for x in movies.genres:
all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
----------------------------------------------------------------------------------
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
Now, one way to proceed is to iterate through each movie and set the entries in each row of dummies to 1. To do this, we use dummies.columns to compute the column indices for each genre:
gen = movies.genres[0]
gen.split('|')
----------------------------------------------------------------------------------
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split('|'))
----------------------------------------------------------------------------------
array([0, 1, 2])
Then, using .iloc, set the corresponding values to 1:
for i, gen in enumerate(movies.genres):
indices = dummies.columns.get_indexer(gen.split('|'))
dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
----------------------------------------------------------------------------------
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
Genre_Animation 1
Genre_Children's 1
Genre_Comedy 1
Genre_Adventure 0
Genre_Fantasy 0
Genre_Romance 0
Genre_Drama 0
Genre_Action 0
Genre_Crime 0
Genre_Thriller 0
Genre_Horror 0
Genre_Sci-Fi 0
Genre_Documentary 0
Genre_War 0
Genre_Musical 0
Genre_Mystery 0
Genre_Film-Noir 0
Genre_Western 0
Name: 0, dtype: object
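For delimiter-separated membership strings like the genres column, pandas also offers a more direct route: Series.str.get_dummies. A sketch on a few genre strings mimicking the movies data:

```python
import pandas as pd

# str.get_dummies splits each string on the separator and builds
# one indicator column per distinct token.
genres = pd.Series(["Animation|Children's|Comedy",
                    "Adventure|Children's|Fantasy",
                    "Comedy|Romance"])
dummies = genres.str.get_dummies(sep='|')
```

This avoids the explicit loop over rows, though the loop version generalizes to cases where the membership logic is more complex.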
Python has long been a popular language for raw data manipulation, in part because of its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods; for more complex pattern matching and text manipulation, regular expressions may be needed. pandas adds to the mix by letting you apply string and regular expression operations concisely to whole arrays of data, while also handling the annoyance of missing values.
In many string munging and scripting applications, the built-in string methods are sufficient. For example, a comma-separated string can be broken into pieces with split:
val = 'a,b, guido'
val.split(',')
----------------------------------------------------------------------------------
['a', 'b', ' guido']
split is often combined with strip to trim whitespace (including line breaks):
pieces = [x.strip() for x in val.split(',')]
pieces
----------------------------------------------------------------------------------
['a', 'b', 'guido']
first, second, third = pieces
first + '::' + second + '::' + third
----------------------------------------------------------------------------------
'a::b::guido'
'::'.join(pieces)
----------------------------------------------------------------------------------
'a::b::guido'
To detect a substring, Python's in keyword is most convenient, though index and find can also be used:
'guido' in val
True
----------------------------------------------------------------------------------
val.index(',')
1
----------------------------------------------------------------------------------
Note the difference between find and index: index raises an exception if the string isn't found, whereas find returns -1:
val.find(':')
-1
val.index(':')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-105-2c016e7367ac> in <module>()
----> 1 val.index(':')
ValueError: substring not found
Relatedly, count returns the number of occurrences of a particular substring:
val.count(',')
2
replace substitutes occurrences of one pattern for another; it is also commonly used to delete patterns by passing an empty string:
val.replace(',', '::')
'a::b:: guido'
val.replace(',','')
'ab guido'
The table below lists Python's built-in string methods; most of these operations can also be performed with regular expressions.
Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python's built-in re module is responsible for applying regular expressions to strings.
The re module's functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related: a regex describes a pattern to locate in text, which can then be used for many purposes. Let's start with a simple example: suppose we want to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+:
import re
text = 'foo bar\t baz \tqux'
re.split(r'\s+', text)
----------------------------------------------------------------------------------
['foo', 'bar', 'baz', 'qux']
When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object:
regex = re.compile(r'\s+')
regex.split(text)
----------------------------------------------------------------------------------
['foo', 'bar', 'baz', 'qux']
If instead you want a list of all patterns matching the regex, use the findall method:
regex.findall(text)
----------------------------------------------------------------------------------
[' ', '\t ', ' \t']
Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.
match and search are closely related to findall. While findall returns all matches in a string, search returns only the first one, and match is stricter still: it matches only at the beginning of the string. As a less trivial example, consider a block of text and a regular expression capable of identifying most email addresses:
text = """Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
# 对text使用findall将得到一组电子邮件地址:
regex.findall(text)
----------------------------------------------------------------------------------------
['[email protected]', '[email protected]', '[email protected]', '[email protected]']
m = regex.search(text)
m
----------------------------------------------------------------------------------------
<_sre.SRE_Match object; span=(5, 20), match='[email protected]'>
text[m.start():m.end()]
----------------------------------------------------------------------------------------
'[email protected]'
regex.match returns None, because it only matches if the pattern occurs at the start of the string:
print(regex.match(text))
----------------------------------------------------------------------------------------
None
Relatedly, sub returns a new string with occurrences of the pattern replaced by the given string:
print(regex.sub('REDACTED', text))
----------------------------------------------------------------------------------------
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
Suppose you also wanted to segment each email address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
A match object produced by this modified regex returns a tuple of the pattern components with its groups method:
m = regex.match('wesm@bright.net')
m.groups()
----------------------------------------------------------------------------------------
('wesm', 'bright', 'net')
For a pattern with groups, findall returns a list of tuples:
regex.findall(text)
----------------------------------------------------------------------------------------
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
sub also has access to the groups in each match using special symbols like \1 and \2. The symbol \1 corresponds to the first matched group, \2 to the second, and so forth:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
----------------------------------------------------------------------------------------
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
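Named groups, a standard re feature, make group access self-documenting. A sketch adapting the email pattern above (the group names are illustrative choices):

```python
import re

# (?P<name>...) assigns a name to each group; groupdict() then
# returns the components as a dict instead of a positional tuple.
pattern = r'(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
parts = m.groupdict()
```

Named groups can also be referenced in sub replacement strings as \g<username> and so on.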
Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:
data = {
'Dave': '[email protected]', 'Steve': '[email protected]',
'Rob': '[email protected]', 'Wes': np.nan}
data = pd.Series(data)
data
-----------------------------------------------------------------------
Dave dave@google.com
Rob rob@gmail.com
Steve steve@gmail.com
Wes NaN
dtype: object
data.isnull()
-----------------------------------------------------------------------
Dave False
Rob False
Steve False
Wes True
dtype: bool
You can apply string and regular expression methods to each value (passing a lambda or other function) using data.map, but it will fail on NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series's str attribute; for example, we could check whether each email address contains "gmail" with str.contains:
data.str.contains('gmail')
-----------------------------------------------------------------------
Dave False
Rob True
Steve True
Wes NaN
dtype: object
Regular expressions can be used too, along with any re options like IGNORECASE:
pattern
-----------------------------------------------------------------------
'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
-----------------------------------------------------------------------
Dave [(dave, google, com)]
Rob [(rob, gmail, com)]
Steve [(steve, gmail, com)]
Wes NaN
dtype: object
There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
---------------------------------------------------------------------------
Dave True
Rob True
Steve True
Wes NaN
dtype: object
data.str.get(1)
---------------------------------------------------------------------------
Dave a
Rob o
Steve t
Wes NaN
dtype: object
data.str[0]
---------------------------------------------------------------------------
Dave d
Rob r
Steve s
Wes NaN
dtype: object
data.str[:5]
---------------------------------------------------------------------------
Dave dave@
Rob rob@g
Steve steve
Wes NaN
dtype: object
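Another vectorized method worth knowing here is str.extract, which returns the regex capture groups as DataFrame columns while propagating NaN for missing entries. A sketch on the same data and pattern:

```python
import re
import numpy as np
import pandas as pd

# Each capture group becomes a column (0, 1, 2); Wes's NaN stays NaN.
data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
extracted = data.str.extract(pattern, flags=re.IGNORECASE)
```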
Vectorized pandas string methods: the vectorized string operations introduced above are very well suited to cleaning up the messy data one finds in the real world. Below we walk through an example using a public recipe database aggregated from various websites. Our goal is to parse the recipe data into ingredient lists, so that we can quickly find a recipe based on the ingredients we have on hand.
The scripts used to fetch the data can be found at https://github.com/fictivekin/openrecipes, along with a link to the current version of the database.
The database is in JSON format, so let's try reading it with pd.read_json:
try:
recipes = pd.read_json('data/openrecipes.txt')
except ValueError as e:
print("ValueError:", e)
---------------------------------------------------------------------------
ValueError: Trailing data
This raises a ValueError complaining about "trailing data". The reason is that each line of the file is itself valid JSON, but the file as a whole is not. Let's check whether that is the case:
with open('data/openrecipes.txt') as f:
line = f.readline()
print(line)
---------------------------------------------------------------------------
{
"name": "Easter Leftover Sandwich", "ingredients": "12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\n3 Tablespoons Grainy Dijon Mustard\n Salt And Pepper, to taste\n Several Dashes Worcestershire Sauce\n Leftover Baked Ham, Sliced\n Kaiser Rolls Or Other Bread\n Extra Mayonnaise And Dijon, For Spreading\n Swiss Cheese Or Other Cheese Slices\n Thinly Sliced Red Onion\n Avocado Slices\n Sliced Tomatoes\n Lettuce, Spinach, Or Arugula", "url": "http://thepioneerwoman.com/cooking/2013/04/easter-leftover-sandwich/", "image": "http://static.thepioneerwoman.com/cooking/files/2013/03/leftoversandwich.jpg", "cookTime": "PT", "recipeYield": "8", "datePublished": "2013-04-01", "prepTime": "PT15M", "description": "Got leftover Easter eggs? Got leftover Easter ham? Got a hearty appetite? Good! You've come to the right place! I..."}
Evidently each line is a valid JSON object, so we need to string them together. One way to do this is to build a single string containing all the line-level JSON objects and then load everything with pd.read_json:
import pandas as pd
# read the file contents line by line
with open('data/openrecipes.txt', 'r') as f:
    # strip whitespace from each line
    data = (line.strip() for line in f)
    # join everything into a single JSON array literal
    data_json = "[{0}]".format(','.join(data))
# read the combined string as JSON
recipes = pd.read_json(data_json)
recipes.shape
---------------------------------------------------------------------------
(1042, 9)
recipes.iloc[0]
---------------------------------------------------------------------------
cookTime PT
datePublished 2013-04-01
description Got leftover Easter eggs? Got leftover East...
image http://static.thepioneerwoman.com/cooking/file...
ingredients 12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\...
name Easter Leftover Sandwich
prepTime PT15M
recipeYield 8
url http://thepioneerwoman.com/cooking/2013/04/eas...
Name: 0, dtype: object
There is a lot of information here, much of it in an inconsistent form, as is typical of data scraped from the web. Notably, the ingredient list is stored as a string, and we will have to carefully extract the information we're interested in. Let's take a closer look at this column:
recipes.ingredients.str.len().describe()
---------------------------------------------------------------------------
count 1042.000000
mean 358.645873
std 187.332133
min 22.000000
25% 246.250000
50% 338.000000
75% 440.000000
max 3160.000000
Name: ingredients, dtype: float64
The ingredient lists average about 360 characters, with a minimum of 22 and a maximum of over 3,000 characters!
Out of curiosity, let's see which recipe has the longest ingredient list:
recipes.name[np.argmax(recipes.ingredients.str.len())]
---------------------------------------------------------------------------
'A Nice Berry Pie'
We can do some more aggregate exploration; for example, let's see how many of the recipes are for breakfast food:
recipes.description.str.contains('[Bb]reakfast').sum()
---------------------------------------------------------------------------
11
Or how many of the recipes list cinnamon as an ingredient:
recipes.ingredients.str.contains('[Cc]innamon').sum()
---------------------------------------------------------------------------
79
We could even check whether any recipes misspell the ingredient as "cinamon":
recipes.ingredients.str.contains('[Cc]inamon').sum()
---------------------------------------------------------------------------
0
Now let's go a bit further and start working on a simple recipe recommendation system: given a list of ingredients, find recipes that use all of them. While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row. So we'll cheat a bit: we'll start with a list of common ingredients and simply search to see whether each one appears in each recipe's ingredient list. For simplicity, let's stick to herbs and spices:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley', 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
Now we can build a boolean DataFrame of True and False values indicating whether each ingredient appears in each recipe:
import time
import re
start = time.time()
spice_df = pd.DataFrame({spice: recipes.ingredients.str.contains(spice, flags=re.IGNORECASE)
                         for spice in spice_list})
print(time.time() - start)
#spice_df[spice_df.any(1)]
0.011888265609741211
Now, to find a recipe that uses pepper, salt, and sage, we can quickly compute this using the DataFrame's query() method:
selection = spice_df.query('pepper & salt & sage')
%timeit spice_df.query('pepper & salt & sage')
# len(selection)
1.57 ms ± 6.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit spice_df["pepper"] & spice_df["salt"] & spice_df["sage"]
391 µs ± 28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
recipes.iloc[selection.index][['name', 'ingredients']]
name ingredients
687 1 quart (4 cups) water\n1 bay leaf\n2 sage lea...
1014 Spiced Candied Walnuts Recipe Peanut or canola oil for deep-frying (hs note:...
recipes.head()
cookTime datePublished description image ingredients name prepTime recipeYield url
0 PT 2013-04-01 Got leftover Easter eggs? Got leftover East... http://static.thepioneerwoman.com/cooking/file... 12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\... Easter Leftover Sandwich PT15M 8 http://thepioneerwoman.com/cooking/2013/04/eas...
1 PT10M 2011-06-06 I finally have basil in my garden. Basil I can... http://static.thepioneerwoman.com/cooking/file... 3/4 cups Fresh Basil Leaves\n1/2 cup Grated Pa... Pasta with Pesto Cream Sauce PT6M 8 http://thepioneerwoman.com/cooking/2011/06/pas...
2 PT15M 2011-09-15 This was yummy. And easy. And pretty! And it t... http://static.thepioneerwoman.com/cooking/file... 2 whole Pork Tenderloins\n Salt And Pepper, to... Herb Roasted Pork Tenderloin with Preserves PT5M 12 http://thepioneerwoman.com/cooking/2011/09/her...
3 PT20M 2012-04-23 I made this for a late lunch Saturday, and it ... http://static.thepioneerwoman.com/cooking/file... 1 pound Penne\n4 whole Boneless, Skinless Chic... Chicken Florentine Pasta PT10M 10 http://thepioneerwoman.com/cooking/2012/04/chi...
4 PT 2011-06-13 Iced coffee is my life. When I wake up, often ... http://static.thepioneerwoman.com/cooking/file... 1 pound Ground Coffee (good, Rich Roast)\n8 qu... Perfect Iced Coffee PT8H 24 http://thepioneerwoman.com/cooking/2011/06/per...
As an alternative, we can build an inverted index mapping each spice to the set of recipe rows that contain it:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley', 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
---------------------------------------------------------------------------
# initialize the inverted index: one empty set per spice
dict_inverse_index = {spice: set() for spice in spice_list}
ingredients = recipes.ingredients.values
---------------------------------------------------------------------------
start = time.time()
# note: this is a plain (case-sensitive) substring test,
# unlike the case-insensitive str.contains version above
for recipe_index in range(len(ingredients)):
    for spice in spice_list:
        if spice in ingredients[recipe_index]:
            dict_inverse_index[spice].add(recipe_index)
print(time.time() - start)
0.005366086959838867
#selection = spice_df.query('pepper | salt | sage')
%timeit dict_inverse_index["pepper"] & dict_inverse_index["salt"] & dict_inverse_index["sage"]
2.95 µs ± 24.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)