Data Analysis with pandas: Common Data Processing (Part 4)
Common aggregation methods
| Method | Description |
| --- | --- |
| count | Number of non-NA values |
| describe | Summary statistics for each column |
| min, max | Minimum and maximum values |
| argmin, argmax | Integer index locations of the minimum and maximum values |
| idxmin, idxmax | Index labels of the minimum and maximum values |
| quantile | Sample quantile |
| sum, mean | Column-wise sum and mean |
| median | Median |
| mad | Mean absolute deviation from the mean |
| var, std | Variance and standard deviation |
| skew | Skewness (third moment) |
| kurt | Kurtosis (fourth moment) |
| cumsum | Cumulative sum |
| cummin, cummax | Cumulative minimum and cumulative maximum |
| cumprod | Cumulative product |
| diff | First-order difference |
| pct_change | Percent change |
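A minimal sketch of a few of these methods on made-up data (the values here are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10.0, np.nan, 30.0, 40.0]})

print(df.count())        # number of non-NA values per column
print(df.sum())          # column sums
print(df.mean())         # column means
print(df['b'].idxmax())  # index label of the maximum of column 'b'
print(df.cumsum())       # cumulative sums down each column
print(df.describe())     # summary statistics for each column
```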
1 Cleaning invalid data
```python
# isnull()/notnull(): element-wise check for NaN/None, returning booleans of the same shape
df[df.isnull()]
df[df.notnull()]

# dropna(): filter out missing data
# signature: df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
df.dropna()                  # drop every row that contains a NaN
df.dropna(axis=1, thresh=3)  # drop columns that have fewer than 3 non-NaN values
df.dropna(how='all')         # drop only the rows whose values are all NaN
# for a Series, s.dropna() and s[s.notnull()] have the same effect

# fillna(): fill in missing data
# method='ffill'/'bfill' chooses forward/backward fill;
# axis=0 fills down the columns, axis=1 fills across the rows
# signature: df.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
df.fillna({1: 0, 2: 0.5})    # fill NaN in the column labelled 1 with 0 and the column labelled 2 with 0.5
df.fillna(method='ffill')    # forward fill: propagate the previous value down each column
```

2 Using the drop function
Using drop: deleting rows and columns
```python
# frame is a DataFrame defined earlier
print(frame.drop(['a']))             # drop the row labelled 'a'
print(frame.drop(['Ohio'], axis=1))  # drop the column 'Ohio'
```

drop deletes rows by default; pass axis=1 to delete columns.
Using drop: the inplace parameter
With the drop method, the following three expressions are equivalent:
```python
# 1. assign the result back to the original name
DF = DF.drop('column_name', axis=1)
# 2. modify the object in place
DF.drop('column_name', axis=1, inplace=True)
# 3. drop several columns by position, in place
DF.drop(DF.columns[[0, 1, 3]], axis=1, inplace=True)
```

Note: methods that modify an object and return a new one usually accept an optional inplace parameter. If it is set to True (the default is False), the original object is replaced directly; that is, with inplace=True the data bound to the original name (cases 2 and 3) is changed in memory. With inplace=False the data bound to the original name is untouched, and the result must be assigned to a new name or back over the original one (case 1).
Data type conversion: astype()
```python
df['Name'] = df['Name'].astype(np.datetime64)
```

DataFrame.astype() converts the data type of an entire DataFrame or of a single column, and supports both Python and NumPy data types.
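A minimal sketch (with a made-up df) of a few typical conversions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '3'], 'b': [1.5, 2.5, 3.5]})

df['a'] = df['a'].astype(int)         # string column -> integers
df['b'] = df['b'].astype(np.float32)  # NumPy dtypes also work
df = df.astype(str)                   # convert the whole DataFrame at once
print(df.dtypes)
```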
3 pandas data processing methods
(1) 刪除重復數據
df.duplicated() returns a boolean Series in which duplicated rows are marked True.
df.drop_duplicates() 刪除重復元素即值為True的列行
Parameters
- subset : column label or sequence of labels, optional — restrict the duplicate check to these columns; by default all columns are used
- keep : {'first', 'last', False}, default 'first' — which duplicate to keep; 'first' keeps the first occurrence and removes the rest
- inplace : boolean, default False — whether to modify the original data directly or return a copy
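A minimal sketch (made-up data) showing the effect of subset and keep:

```python
import pandas as pd

df = pd.DataFrame({'color': ['white', 'white', 'red', 'red'],
                   'value': [1, 1, 2, 3]})

print(df.duplicated())       # 0 False, 1 True, 2 False, 3 False
print(df.drop_duplicates())  # drops row 1

# compare on the 'color' column only, keeping the last occurrence of each color
print(df.duplicated(subset=['color'], keep='last'))
print(df.drop_duplicates(subset=['color'], keep='last'))
```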
(2) Mapping
1 replace(): substitute values, replace({old_value: new_value})
```python
df = DataFrame({'item': ['ball', 'mug', 'pen'],
                'color': ['white', 'rosso', 'verde'],
                'price': [5.56, 4.20, 1.30]})
newcolors = {'rosso': 'red', 'verde': 'green'}
display(df, df.replace(newcolors))
# Output:
#    color  item  price
# 0  white  ball   5.56
# 1  rosso   mug   4.20
# 2  verde   pen   1.30
#
#    color  item  price
# 0  white  ball   5.56
# 1    red   mug   4.20
# 2  green   pen   1.30
```

2. replace is also frequently used to replace NaN values:

```python
df2 = DataFrame({'math': [100, 139, np.nan],
                 'English': [146, None, 119]},
                index=['張三', '李四', 'Tom'])
newvalues = {np.nan: 100}
display(df2, df2.replace(newvalues))
# Output:
#       English   math
# 張三     146.0  100.0
# 李四       NaN  139.0
# Tom      119.0    NaN
#
#       English   math
# 張三     146.0  100.0
# 李四     100.0  139.0
# Tom      119.0  100.0
```

2 The map() function: creating a new column
Built-in form: map(function, iterable). In pandas: Series.map(function or {key: value} mapping).
The function passed to map should return a single concrete value for each element, not an iterable.
```python
df3 = DataFrame({'color': ['red', 'green', 'blue'],
                 'project': ['math', 'english', 'chemistry']})
price = {'red': 5.56, 'green': 3.14, 'chemistry': 2.79}
df3['price'] = df3['color'].map(price)
display(df3)
# Output:
#    color    project  price
# 0    red       math   5.56
# 1  green    english   3.14
# 2   blue  chemistry    NaN
```

```python
df3 = DataFrame({'zs': [129, 130, 34], 'ls': [136, 98, 8]},
                index=['張三', '李四', '倩倩'])
display(df3)
display(df3['zs'].map({129: '你好', 130: '非常好', 34: '不錯'}))
display(df3['zs'].map({129: 120}))

def mapscore(score):
    if score < 90:
        return 'failed'
    elif score > 120:
        return 'excellent'
    else:
        return 'pass'

df3['status'] = df3['zs'].map(mapscore)  # fixed: the original referenced an undefined name 'ddd'
df3
# Output:
#        zs   ls
# 張三   129  136
# 李四   130   98
# 倩倩    34    8
#
# 張三      你好
# 李四     非常好
# 倩倩      不錯
# Name: zs, dtype: object
#
# 張三    120.0
# 李四      NaN
# 倩倩      NaN
# Name: zs, dtype: float64
#
#        ls   zs     status
# 張三   136  129  excellent
# 李四    98  130  excellent
# 倩倩     8   34     failed
```

3 The rename() function: replacing index labels, rename({old_label: new_label})
```python
df4 = DataFrame({'color': ['white', 'gray', 'purple', 'blue', 'green'],
                 'value': np.random.randint(10, size=5)})
new_index = {0: 'first', 1: 'two', 2: 'three', 3: 'four', 4: 'five'}
display(df4, df4.rename(new_index))
# Output:
#     color  value
# 0   white      2
# 1    gray      0
# 2  purple      9
# 3    blue      2
# 4   green      0
#
#         color  value
# first   white      2
# two      gray      0
# three  purple      9
# four     blue      2
# five    green      0
```

(3) Outlier detection and filtering
1 Use describe() to view the descriptive statistics of each column
```python
df = DataFrame(np.random.randint(10, size=10))
display(df.describe())
# Output:
#                0
# count  10.000000
# mean    5.900000
# std     2.685351
# min     1.000000
# 25%     6.000000
# 50%     7.000000
# 75%     7.750000
# max     8.000000
```

2 Use std() to get the standard deviation of each column of a DataFrame
```python
df.std()
# Output:
# 0    3.306559
# dtype: float64
```

3 Filter the DataFrame's elements according to each column's standard deviation.
With any(), a filtering condition is applied to each column; any() flags every row in which at least one value meets the condition, and those rows can then be filtered out (see the sketch below).
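A minimal sketch of this idea; the 3-standard-deviation threshold is an assumption for illustration, the original does not fix a particular rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 3))

# condition: the absolute value exceeds 3 standard deviations of its column
cond = (np.abs(df) > 3 * df.std()).any(axis=1)

outliers = df[cond]   # rows containing at least one outlier
cleaned = df[~cond]   # rows with no outliers
print(outliers.shape, cleaned.shape)
```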
(4) Sorting
Rows can be reordered with the take() function.
np.random.permutation() can be combined with take() to shuffle the rows randomly, as the sketch below shows.
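A minimal sketch (made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4))

order = np.random.permutation(5)  # a random permutation of the row positions
print(df.take(order))             # rows reordered according to 'order'
print(df.take(np.random.permutation(5)[:3]))  # a random sample of 3 rows
```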
(5) Grouping data
The groupby() function
```python
import pandas as pd

df = pd.DataFrame([{'col1': 'a', 'col2': 1, 'col3': 'aa'},
                   {'col1': 'b', 'col2': 2, 'col3': 'bb'},
                   {'col1': 'c', 'col2': 3, 'col3': 'cc'},
                   {'col1': 'a', 'col2': 44, 'col3': 'aa'}])
display(df)
# group by col1 and sum col2
display(df.groupby(by='col1').agg({'col2': sum}).reset_index())
# group by col1 and take the max and min of col2
display(df.groupby(by='col1').agg({'col2': ['max', 'min']}).reset_index())
# group by col1 and col3, then sum col2
display(df.groupby(by=['col1', 'col3']).agg({'col2': sum}).reset_index())
```

```python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from datetime import datetime

'''
Grouping with groupby
'''
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.arange(5),
                   'data2': np.arange(5)})
print(df)
#   key1 key2  data1  data2
# 0    a  one      0      0
# 1    a  two      1      1
# 2    b  one      2      2
# 3    b  two      3      3
# 4    a  one      4      4

'''
Computing on groups
'''
# group by key1 and take the mean of data1
grouped = df['data1'].groupby(df['key1'])
print(grouped.mean())
# a    1.666667
# b    2.500000

# group by key1 and key2 and take the mean of data1
groupedmean = df['data1'].groupby([df['key1'], df['key2']]).mean()
print(groupedmean)
# key1  key2
# a     one     2
#       two     1
# b     one     2
#       two     3

# unstack: pivot the inner index level into columns
print(groupedmean.unstack())
# key2  one  two
# key1
# a       2    1
# b       2    3

df['key1']  # selecting a single column gives a Series

# the groupby keys can also be Series or arrays
states = np.array(['Oh', 'Ca', 'Ca', 'Oh', 'Oh'])
years = np.array([2005, 2005, 2006, 2005, 2006])
print(df['data1'].groupby([states, years]).mean())
# Ca  2005    1.0
#     2006    2.0
# Oh  2005    1.5
#     2006    4.0

# group directly by a column name; non-numeric columns are automatically excluded
print(df.groupby('key1').mean())
#          data1     data2
# key1
# a     1.666667  1.666667
# b     2.500000  2.500000

# add key2 to the grouping
print(df.groupby(['key1', 'key2']).mean())
#            data1  data2
# key1 key2
# a    one       2      2
#      two       1      1
# b    one       2      2
#      two       3      3

# size() returns a Series with the number of rows in each group
print(df.groupby(['key1', 'key2']).size())
# key1  key2
# a     one     2
#       two     1
# b     one     1
#       two     1

'''
Iterating over groups
'''
# iterate over the groups of key1 (a and b)
for name, group in df.groupby('key1'):
    print(name)
    print(group)
# a
#   key1 key2  data1  data2
# 0    a  one      0      0
# 1    a  two      1      1
# 4    a  one      4      4
# b
#   key1 key2  data1  data2
# 2    b  one      2      2
# 3    b  two      3      3

# grouping by multiple keys
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)  # fixed: the original printed 'name' left over from the previous loop
    print(group)
#   key1 key2  data1  data2
# 0    a  one      0      0
# 4    a  one      4      4
#   key1 key2  data1  data2
# 1    a  two      1      1
#   key1 key2  data1  data2
# 2    b  one      2      2
#   key1 key2  data1  data2
# 3    b  two      3      3

'''
Selecting one column or a group of columns returns a grouped object
'''
# indexing a GroupBy object with a column name (or list of names) aggregates only those columns
print(df.groupby(df['key1'])['data1'])                   # data1 grouped by key1
print(df.groupby(['key1'])[['data1', 'data2']].mean())   # data1 and data2 grouped by key1
#          data1     data2
# key1
# a     1.666667  1.666667
# b     2.500000  2.500000

print(df.groupby(['key1', 'key2'])['data1'].mean())
# key1  key2
# a     one     2
#       two     1
# b     one     2
#       two     3

'''
Grouping with a function
'''
# to group by the length of the labels, simply pass the len function
# ('people' is assumed to be a DataFrame defined earlier whose column names
#  are 2- or 3-character strings)
print(people.groupby(len, axis=1).sum())
#       2     3
# a  30.0  20.0
# b  23.0  21.0
# c  26.0  22.0
# d  42.0  23.0
# e  46.0  24.0

# functions can also be mixed with arrays, dicts, lists and Series
key_list = ['one', 'one', 'one', 'two', 'two']
print(people.groupby([len, key_list], axis=1).min())
#      2          3
#    one   two   two
# a  0.0  15.0  20.0
# b  1.0  16.0  21.0
# c  2.0  17.0  22.0
# d  3.0  18.0  23.0
# e  4.0  19.0  24.0

'''
Grouping by index level
'''
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
print(hier_df)
# cty          US                            JP
# tenor         1         3         5         1         3
# 0     -1.507729  2.112678  0.841736 -0.158109 -0.645219
# 1      0.355262  0.765209 -0.287648  1.134998 -0.440188
# 2      1.049813  0.763482 -0.362013 -0.428725 -0.355601
# 3     -0.868420 -1.213398 -0.386798  0.137273  0.678293

# group by the 'cty' level of the column index
print(hier_df.groupby(level='cty', axis=1).count())
# cty  JP  US
# 0     2   3
# 1     2   3
# 2     2   3
# 3     2   3
```
(6) Advanced aggregation
1 pd.merge() can be used to attach the result of a grouped aggregation to every row of df.
```python
d1 = {'item': ['luobo', 'baicai', 'lajiao', 'donggua',
               'luobo', 'baicai', 'lajiao', 'donggua'],
      'color': ['white', 'white', 'red', 'green',
                'white', 'white', 'red', 'green'],
      'weight': np.random.randint(10, size=8),
      'price': np.random.randint(10, size=8)}
df = DataFrame(d1)

sums = df.groupby('color').sum().add_prefix('total_')
items = df.groupby('item')['price', 'weight'].sum()
means = items['price'] / items['weight']
means = DataFrame(means, columns=['means_price'])

df2 = pd.merge(df, sums, left_on='color', right_index=True)
df3 = pd.merge(df2, means, left_on='item', right_index=True)
display(df2, df3)
# Output:
#    color     item  price  weight
# 0  white    luobo      9       2
# 1  white   baicai      5       9
# 2    red   lajiao      5       8
# 3  green  donggua      1       1
# 4  white    luobo      7       4
# 5  white   baicai      8       0
# 6    red   lajiao      6       8
# 7  green  donggua      4       3
#
#        total_price  total_weight
# color
# green            5             4
# red             11            16
# white           29            15
#
#    color     item  price  weight  total_price  total_weight
# 0  white    luobo      9       2           29            15
# 1  white   baicai      5       9           29            15
# 4  white    luobo      7       4           29            15
# 5  white   baicai      8       0           29            15
# 2    red   lajiao      5       8           11            16
# 6    red   lajiao      6       8           11            16
# 3  green  donggua      1       1            5             4
# 7  green  donggua      4       3            5             4
```

2 The same result can be achieved with transform and apply
Using transform
```python
d1 = {'item': ['luobo', 'baicai', 'lajiao', 'donggua',
               'luobo', 'baicai', 'lajiao', 'donggua'],
      'color': ['white', 'white', 'red', 'green',
                'white', 'white', 'red', 'green'],
      'weight': np.random.randint(10, size=8),
      'price': np.random.randint(10, size=8)}
df = DataFrame(d1)

sum1 = df.groupby('color')['price', 'weight'].sum().add_prefix('total_')
sums2 = df.groupby('color')['price', 'weight'].transform(lambda x: x.sum()).add_prefix('total_')
sums3 = df.groupby('color')['price', 'weight'].transform(sum).add_prefix('total_')
display(sum, df, sum1, sums2, sums3)
# Output:
# <function sum>
#    color     item  price  weight
# 0  white    luobo      7       7
# 1  white   baicai      7       7
# 2    red   lajiao      2       7
# 3  green  donggua      6       6
# 4  white    luobo      1       2
# 5  white   baicai      3       6
# 6    red   lajiao      7       0
# 7  green  donggua      0       2
#
#        total_price  total_weight
# color
# green            6             8
# red              9             7
# white           18            22
#
#    total_price  total_weight
# 0           18            22
# 1           18            22
# 2            9             7
# 3            6             8
# 4           18            22
# 5           18            22
# 6            9             7
# 7            6             8
#
#    total_price  total_weight
# 0           18            22
# 1           18            22
# 2            9             7
# 3            6             8
# 4           18            22
# 5           18            22
# 6            9             7
# 7            6             8
```

Using apply
```python
def sum_price(x):
    return x.sum()

sums3 = df.groupby('color')['price', 'weight'].apply(lambda x: x.sum()).add_prefix('total_')
sums4 = df.groupby('color')['price', 'weight'].apply(sum_price).add_prefix('total_')
display(df, sums3, sums4)
# Output:
#    color     item  price  weight
# 0  white    luobo      4       4
# 1  white   baicai      0       3
# 2    red   lajiao      0       4
# 3  green  donggua      7       5
# 4  white    luobo      3       1
# 5  white   baicai      3       3
# 6    red   lajiao      0       6
# 7  green  donggua      0       7
#
#        total_price  total_weight
# color
# green            7            12
# red              0            10
# white           10            11
#
#        total_price  total_weight
# color
# green            7            12
# red              0            10
# white           10            11
```