

9. Advanced Processing with Pandas

Published: 2024/7/5

4.6 Advanced Processing: Handling Missing Values

Click the title to get the article's source code and notes.
Dataset: https://download.csdn.net/download/weixin_44827418/12548095

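Outline: advanced processing with Pandas
    Handling missing values
    Data discretization
    Merging
    Crosstabs and pivot tables
    Grouping and aggregation
    Case study

4.6 Advanced processing: handling missing values
    1) How to handle missing values
        Two approaches:
            1) Delete the samples that contain missing values
            2) Replace / impute
        4.6.1 How to handle NaN
            1) Check whether the data contains NaN
                pd.isnull(df)
                pd.notnull(df)
            2) Delete the samples that contain missing values
                df.dropna(inplace=False)
               Replace / impute
                df.fillna(value, inplace=False)
        4.6.2 Missing values that are not NaN but carry a default marker
            1) Replace "?" with np.nan
                df.replace(to_replace="?", value=np.nan)
            2) Then follow the steps for handling np.nan
    2) Worked examples of missing-value handling

4.7 Advanced processing: data discretization
    Label-encoded examples:
            gender  age            species  hair
        A   1       23         A   1
        B   2       30         B   2
        C   1       18         C   3
    The same data with one-hot encoding (dummy variables):
            male  female  age          dog  pig  mouse  hair
        A   1     0       23       A   1    0    0      2
        B   0     1       30       B   0    1    0      1
        C   1     0       18       C   0    0    1      1
    4.7.1 What data discretization is
        Raw height data: 165 174 160 180 159 163 192 184
    4.7.2 Why discretize
    4.7.3 How to discretize
        1) Binning
            Automatic binning: sr = pd.qcut(data, q)   (q is the number of groups)
            Custom binning:    sr = pd.cut(data, [])   (list of bin edges)
        2) Convert the binned result to one-hot encoding
            pd.get_dummies(sr, prefix=)

4.8 Advanced processing: merging
    numpy
        np.concatenate((a, b), axis=)
        Horizontal: np.hstack()
        Vertical:   np.vstack()
    1) Concatenate along an axis
        pd.concat([data1, data2], axis=1)
    2) Join on key columns with pd.merge
        pd.merge(left, right, how="inner", on=[key columns])

4.9 Advanced processing: crosstabs and pivot tables
    Used to find and explore the relationship between two variables
    4.9.1 What crosstabs and pivot tables are for
    4.9.2 Using crosstab
        pd.crosstab(value1, value2)
    4.9.3 pivot_table

4.10 Advanced processing: grouping and aggregation
    4.10.1 What grouping and aggregation are
    4.10.2 Grouping and aggregation API
        DataFrame
        Series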

4.6.1 How to handle NaN

    import pandas as pd

    movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
    movie

| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| 1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| 2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| 3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| 4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 996 | Secret in Their Eyes | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo... | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... | 2015 | 111 | 6.2 | 27585 | NaN | 45.0 |
| 996 | 997 | Hostel: Part II | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 | 17.54 | 46.0 |
| 997 | 998 | Step Up 2: The Streets | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 | 58.01 | 50.0 |
| 998 | 999 | Search Party | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
| 999 | 1000 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins... | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |

1000 rows × 12 columns

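    # 1. Check for NaN missing values; True marks a missing value
    movie.isnull()

| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | False | False | False | False | False | False | False | False | False | False | True | False |
| 996 | False | False | False | False | False | False | False | False | False | False | False | False |
| 997 | False | False | False | False | False | False | False | False | False | False | False | False |
| 998 | False | False | False | False | False | False | False | False | False | False | True | False |
| 999 | False | False | False | False | False | False | False | False | False | False | False | False |

1000 rows × 12 columns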

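    import numpy as np

    # any() returns True as soon as one value is True.
    # A result of True means the data contains missing values.
    np.any(movie.isnull())

    True

    # With notnull(), False marks a missing value
    pd.notnull(movie)

| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | True | True | True | True | True | True | True | True | True | True | True |
| 1 | True | True | True | True | True | True | True | True | True | True | True | True |
| 2 | True | True | True | True | True | True | True | True | True | True | True | True |
| 3 | True | True | True | True | True | True | True | True | True | True | True | True |
| 4 | True | True | True | True | True | True | True | True | True | True | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | True | True | True | True | True | True | True | True | True | True | False | True |
| 996 | True | True | True | True | True | True | True | True | True | True | True | True |
| 997 | True | True | True | True | True | True | True | True | True | True | True | True |
| 998 | True | True | True | True | True | True | True | True | True | True | False | True |
| 999 | True | True | True | True | True | True | True | True | True | True | True | True |

1000 rows × 12 columns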

    # all() returns False as soon as one value is False.
    # A result of False means the data contains missing values.
    np.all(pd.notnull(movie))

    False

    pd.isnull(movie).any()

    Rank                  False
    Title                 False
    Genre                 False
    Description           False
    Director              False
    Actors                False
    Year                  False
    Runtime (Minutes)     False
    Rating                False
    Votes                 False
    Revenue (Millions)     True
    Metascore              True
    dtype: bool

    pd.notnull(movie).all()

    Rank                   True
    Title                  True
    Genre                  True
    Description            True
    Director               True
    Actors                 True
    Year                   True
    Runtime (Minutes)      True
    Rating                 True
    Votes                  True
    Revenue (Millions)    False
    Metascore             False
    dtype: bool

    # Handling the missing values
    # Method 1: delete the samples that contain missing values
    movie_full = movie.dropna()
    movie_full.isnull().any()

    Rank                  False
    Title                 False
    Genre                 False
    Description           False
    Director              False
    Actors                False
    Year                  False
    Runtime (Minutes)     False
    Rating                False
    Votes                 False
    Revenue (Millions)    False
    Metascore             False
    dtype: bool

    # Method 2: replace
    movie.head()

(Output: the first five rows of the movie table shown earlier in this section.)

    movie["Revenue (Millions)"].mean()

    82.95637614678897

    # Columns that contain missing values (False in the notnull().all() result):
    # Revenue (Millions)    False
    # Metascore             False
    # Fill with the column mean and assign the result back to the original data.
    # Calling fillna(..., inplace=True) on a selected column is unreliable in
    # recent pandas versions, so plain assignment is used here.
    movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())
    movie["Revenue (Millions)"].isnull().any()

    False

    movie["Metascore"] = movie["Metascore"].fillna(movie["Metascore"].mean())
    movie["Metascore"].isnull().any()

    False

    movie.isnull().any()  # all missing values have been handled

    Rank                  False
    Title                 False
    Genre                 False
    Description           False
    Director              False
    Actors                False
    Year                  False
    Runtime (Minutes)     False
    Rating                False
    Votes                 False
    Revenue (Millions)    False
    Metascore             False
    dtype: bool
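The two approaches above (drop or fill) can be tried without the CSV file. The following is a minimal sketch on a small made-up DataFrame; the column names and values are hypothetical and only stand in for the movie table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movie table
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "revenue": [10.0, np.nan, 30.0, np.nan],
    "score": [50.0, 70.0, np.nan, 90.0],
})

# Number of missing values in each column
missing_per_column = df.isnull().sum()

# Approach 1: drop every row that contains at least one NaN
dropped = df.dropna()

# Approach 2: fill each numeric column with its own mean (mean() skips NaN)
numeric_cols = ["revenue", "score"]
filled = df.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping keeps only the complete rows, which can discard a lot of data, while mean filling keeps every row at the cost of flattening the column's variance. Which one fits depends on how many values are missing.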

Handling missing values that are not NaN but carry a default marker

    data = pd.read_csv("./datas/GBvideos.csv", encoding="GBK")
    data

[Output: a table of 1600 rows × 11 columns with the columns video_id, title, channel_title, category_id, tags, views, likes, dislikes, comment_total, thumbnail_link and date. In the last row (index 1599, video_id DuPXdnSWoLk, "The Child in Time: Trailer - BBC One") the dislikes column contains "?" instead of a number.]

    # 1. Replace "?" with np.nan
    new_data = data.replace(to_replace="?", value=np.nan)
    new_data

[Output: the same table of 1600 rows × 11 columns. In the last row (index 1599, video_id DuPXdnSWoLk) the dislikes column now shows NaN where it previously held "?".]

    new_data.isnull().any()  # the "?" in the dislikes column has been replaced by NaN

    video_id          False
    title             False
    channel_title     False
    category_id       False
    tags              False
    views             False
    likes             False
    dislikes           True
    comment_total     False
    thumbnail_link    False
    date              False
    dtype: bool

    new_data.dropna(inplace=True)
    new_data.isnull().any()

    video_id          False
    title             False
    channel_title     False
    category_id       False
    tags              False
    views             False
    likes             False
    dislikes          False
    comment_total     False
    thumbnail_link    False
    date              False
    dtype: bool
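One detail worth checking after replacing a marker: a column that contained "?" was parsed as text, so it may still need converting to numbers. The sketch below uses a tiny made-up CSV (hypothetical values, not the GBvideos file) to show the replace approach from the text next to the `na_values` parameter of `pd.read_csv`, which declares the marker while reading.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row CSV with "?" as the missing-value marker
csv_text = "video_id,likes,dislikes\na1,10,3\nb2,20,?\nc3,30,5\n"

# Approach from the text: read first, then replace the marker with NaN
raw = pd.read_csv(io.StringIO(csv_text))
replaced = raw.replace(to_replace="?", value=np.nan)
# "dislikes" was read as text because of "?", so convert it to numbers
replaced["dislikes"] = pd.to_numeric(replaced["dislikes"])

# Alternative: tell read_csv which marker means "missing"
direct = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(replaced.dtypes)
print(direct.dtypes)
```

Both routes should end with a numeric dislikes column holding one NaN, after which dropna or fillna applies as in section 4.6.1.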

4.7 Advanced Processing: Data Discretization

    import pandas as pd

    # Prepare the data
    data = pd.Series([165,174,160,180,159,163,192,184],
                     index=["No1:165","No2:174","No3:160","No4:180","No5:159","No6:163","No7:192","No8:184"])
    data

    No1:165    165
    No2:174    174
    No3:160    160
    No4:180    180
    No5:159    159
    No6:163    163
    No7:192    192
    No8:184    184
    dtype: int64

Automatic binning

    # 1. Binning
    # Automatic binning
    # qcut(data, number of groups)
    sr = pd.qcut(data, 3)
    sr

    No1:165      (163.667, 178.0]
    No2:174      (163.667, 178.0]
    No3:160    (158.999, 163.667]
    No4:180        (178.0, 192.0]
    No5:159    (158.999, 163.667]
    No6:163    (158.999, 163.667]
    No7:192        (178.0, 192.0]
    No8:184        (178.0, 192.0]
    dtype: category
    Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]

    # Check the group sizes
    sr.value_counts()

    (178.0, 192.0]        3
    (158.999, 163.667]    3
    (163.667, 178.0]      2
    dtype: int64

    type(sr)

    pandas.core.series.Series

    # 2. Convert the binned result to one-hot encoding
    # prefix sets the prefix of the column names
    pd.get_dummies(sr, prefix="height")

| | height_(158.999, 163.667] | height_(163.667, 178.0] | height_(178.0, 192.0] |
|---|---|---|---|
| No1:165 | 0 | 1 | 0 |
| No2:174 | 0 | 1 | 0 |
| No3:160 | 1 | 0 | 0 |
| No4:180 | 0 | 0 | 1 |
| No5:159 | 1 | 0 | 0 |
| No6:163 | 1 | 0 | 0 |
| No7:192 | 0 | 0 | 1 |
| No8:184 | 0 | 0 | 1 |
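To see where the interval boundaries in the qcut output come from, it can help to ask for the bin edges explicitly. This is a small sketch using the same eight heights; `retbins=True` is a documented option of `pd.qcut`, and the variable names are only illustrative.

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# q=3 asks for three groups of roughly equal size;
# retbins=True also returns the quantile-based bin edges
groups, edges = pd.qcut(heights, q=3, retbins=True)

# Group sizes in bin order rather than sorted by count
sizes = groups.value_counts(sort=False)

print(edges)
print(sizes)
```

The inner edges are the 1/3 and 2/3 quantiles of the data, which is why the groups end up with nearly equal counts (3, 2, 3) even though the intervals have different widths.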

Custom binning

    # Custom binning
    # pd.cut(data, list containing all the bin edges)
    sr = pd.cut(data, [150,165,180,195])
    sr

    No1:165    (150, 165]
    No2:174    (165, 180]
    No3:160    (150, 165]
    No4:180    (165, 180]
    No5:159    (150, 165]
    No6:163    (150, 165]
    No7:192    (180, 195]
    No8:184    (180, 195]
    dtype: category
    Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]

    sr.value_counts()

    (150, 165]    4
    (180, 195]    2
    (165, 180]    2
    dtype: int64

    pd.get_dummies(sr, prefix="身高")

| | 身高_(150, 165] | 身高_(165, 180] | 身高_(180, 195] |
|---|---|---|---|
| No1:165 | 1 | 0 | 0 |
| No2:174 | 0 | 1 | 0 |
| No3:160 | 1 | 0 | 0 |
| No4:180 | 0 | 1 | 0 |
| No5:159 | 1 | 0 | 0 |
| No6:163 | 1 | 0 | 0 |
| No7:192 | 0 | 0 | 1 |
| No8:184 | 0 | 0 | 1 |
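Interval names such as "(150, 165]" make awkward column names. As a hedged variation on the custom binning above, `pd.cut` accepts a `labels` argument; the label names below ("short", "medium", "tall") are made up for illustration. `dtype=int` is passed to `get_dummies` because recent pandas versions return booleans by default, whereas the output above shows 0/1.

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# Same bin edges as in the text; the labels are hypothetical names
groups = pd.cut(heights, bins=[150, 165, 180, 195],
                labels=["short", "medium", "tall"])

# One row per sample and one 0/1 column per bin
one_hot = pd.get_dummies(groups, prefix="height", dtype=int)

print(groups.value_counts(sort=False))
print(one_hot)
```

Because the bins are right-inclusive by default, 165 falls into the first bin and 180 into the second, matching the counts 4, 2, 2 shown above.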

4.8 Advanced Processing: Merging

4.8.1 Merging with pd.concat (concatenating along an axis)

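    data1 = np.arange(0,20,1).reshape(4,5)
    data1 = pd.DataFrame(data1)
    data1

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 | 9 |
| 2 | 10 | 11 | 12 | 13 | 14 |
| 3 | 15 | 16 | 17 | 18 | 19 |

    data2 = np.arange(100,120,1).reshape(4,5)
    data2 = pd.DataFrame(data2)
    data2

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 100 | 101 | 102 | 103 | 104 |
| 1 | 105 | 106 | 107 | 108 | 109 |
| 2 | 110 | 111 | 112 | 113 | 114 |
| 3 | 115 | 116 | 117 | 118 | 119 |

    # Concatenate data1 and data2 horizontally
    data_concat = pd.concat([data1,data2],axis=1)
    data_concat

| | 0 | 1 | 2 | 3 | 4 | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 | 100 | 101 | 102 | 103 | 104 |
| 1 | 5 | 6 | 7 | 8 | 9 | 105 | 106 | 107 | 108 | 109 |
| 2 | 10 | 11 | 12 | 13 | 14 | 110 | 111 | 112 | 113 | 114 |
| 3 | 15 | 16 | 17 | 18 | 19 | 115 | 116 | 117 | 118 | 119 |

    data2.T

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 100 | 105 | 110 | 115 |
| 1 | 101 | 106 | 111 | 116 |
| 2 | 102 | 107 | 112 | 117 |
| 3 | 103 | 108 | 113 | 118 |
| 4 | 104 | 109 | 114 | 119 |

    # Concatenate data1 and the transpose of data2 vertically.
    # data2.T has no column 4, so that column is NaN in its rows.
    data_concat1 = pd.concat([data1,data2.T],axis=0)
    data_concat1

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4.0 |
| 1 | 5 | 6 | 7 | 8 | 9.0 |
| 2 | 10 | 11 | 12 | 13 | 14.0 |
| 3 | 15 | 16 | 17 | 18 | 19.0 |
| 0 | 100 | 105 | 110 | 115 | NaN |
| 1 | 101 | 106 | 111 | 116 | NaN |
| 2 | 102 | 107 | 112 | 117 | NaN |
| 3 | 103 | 108 | 113 | 118 | NaN |
| 4 | 104 | 109 | 114 | 119 | NaN |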
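The NaN column and the repeated index labels in the vertical result both follow from how `pd.concat` aligns on labels. A minimal sketch with two small made-up frames (the names `top` and `bottom` are hypothetical) shows the `ignore_index` and `join` options that control this:

```python
import numpy as np
import pandas as pd

top = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])
bottom = pd.DataFrame(np.arange(100, 104).reshape(2, 2), columns=["a", "b"])

# Vertical concat aligns on column labels; "c" is missing in bottom, so it
# becomes NaN there. ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

# join="inner" keeps only the columns that both frames share
stacked_inner = pd.concat([top, bottom], axis=0, join="inner", ignore_index=True)

print(stacked)
print(stacked_inner)
```

With the default outer join no data is lost but gaps appear as NaN; the inner join avoids the gaps by dropping the unmatched column.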

4.8.2 Merging with pd.merge (joining on key columns)

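    left = pd.DataFrame({'key1':['K0','K0','K1','K2'],
                         'key2':['K0','K1','K0','K1'],
                         'A':['A0','A1','A2','A3'],
                         'B':['B0','B1','B2','B3']})
    left

| | key1 | key2 | A | B |
|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 |
| 1 | K0 | K1 | A1 | B1 |
| 2 | K1 | K0 | A2 | B2 |
| 3 | K2 | K1 | A3 | B3 |

    right = pd.DataFrame({'key1':['K0','K1','K1','K2'],
                          'key2':['K0','K0','K0','K0'],
                          'C':['Co','C1','C2','C3'],
                          'D':['DO','D1','D2','D3']})
    right

| | key1 | key2 | C | D |
|---|---|---|---|---|
| 0 | K0 | K0 | Co | DO |
| 1 | K1 | K0 | C1 | D1 |
| 2 | K1 | K0 | C2 | D2 |
| 3 | K2 | K0 | C3 | D3 |

    # The default is an inner join
    # inner keeps only the keys present in both tables
    result = pd.merge(left,right,on=['key1','key2'],how="inner")
    result

| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | Co | DO |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |

    # left: left join
    # Every key of the left table is kept; the left table drives the merge
    result_left = pd.merge(left,right,on=['key1','key2'],how="left")
    result_left

| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | Co | DO |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |

    # right: right join
    # Every key of the right table is kept; the right table drives the merge
    result_right = pd.merge(left,right,on=['key1','key2'],how="right")
    result_right

| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | Co | DO |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
| 3 | K2 | K0 | NaN | NaN | C3 | D3 |

    # outer: outer join
    # Every key of both tables is kept
    result_outer = pd.merge(left,right,on=['key1','key2'],how="outer")
    result_outer

| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | Co | DO |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
| 5 | K2 | K0 | NaN | NaN | C3 | D3 |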
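When comparing the four join types it is useful to know which table each output row came from. `pd.merge` has an `indicator` parameter for this; the sketch below reuses the key columns from the example above with simplified value columns.

```python
import pandas as pd

left = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                     "key2": ["K0", "K1", "K0", "K1"],
                     "A": ["A0", "A1", "A2", "A3"]})
right = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                      "key2": ["K0", "K0", "K0", "K0"],
                      "C": ["C0", "C1", "C2", "C3"]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
outer = pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)
counts = outer["_merge"].value_counts()

print(outer)
print(counts)
```

Rows marked "both" are exactly the inner-join result; adding the "left_only" rows gives the left join, and adding the "right_only" rows gives the right join.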

4.9 Advanced Processing: Crosstabs and Pivot Tables

  • Used to explore the relationship between two variables

4.9.2 Using crosstab

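    data = pd.read_excel("./datas/szfj_baoan.xls")
    data

| | district | roomnum | hall | AREA | C_floor | floor_num | school | subway | per_price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | baoan | 3 | 2 | 89.3 | middle | 31 | 0 | 0 | 7.0773 |
| 1 | baoan | 4 | 2 | 127.0 | high | 31 | 0 | 0 | 6.9291 |
| 2 | baoan | 1 | 1 | 28.0 | low | 39 | 0 | 0 | 3.9286 |
| 3 | baoan | 1 | 1 | 28.0 | middle | 30 | 0 | 0 | 3.3568 |
| 4 | baoan | 2 | 2 | 78.0 | middle | 8 | 1 | 1 | 5.0769 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1246 | baoan | 4 | 2 | 89.3 | low | 8 | 0 | 0 | 4.2553 |
| 1247 | baoan | 2 | 1 | 67.0 | middle | 30 | 0 | 0 | 3.8060 |
| 1248 | baoan | 2 | 2 | 67.4 | middle | 29 | 1 | 0 | 5.3412 |
| 1249 | baoan | 2 | 2 | 73.1 | low | 15 | 1 | 0 | 5.9508 |
| 1250 | baoan | 3 | 2 | 86.2 | middle | 32 | 0 | 1 | 4.5244 |

1251 rows × 9 columns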

    time = "2020-06-23"
    # pandas datetime type
    date = pd.to_datetime(time)
    date

    Timestamp('2020-06-23 00:00:00')

    type(date)

    pandas._libs.tslibs.timestamps.Timestamp

    date.year

    2020

    date.month

    6

    # weekday is a method, so it has to be called
    data["week"] = date.weekday()
    data.drop("week",axis=1,inplace=True)
    data

(Output: the same 1251 rows × 9 columns table as above, because the temporary week column is dropped again.)

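    data["feature"] = np.where(data["per_price"] > 5.0000,1,0)
    data

| | district | roomnum | hall | AREA | C_floor | floor_num | school | subway | per_price | feature |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | baoan | 3 | 2 | 89.3 | middle | 31 | 0 | 0 | 7.0773 | 1 |
| 1 | baoan | 4 | 2 | 127.0 | high | 31 | 0 | 0 | 6.9291 | 1 |
| 2 | baoan | 1 | 1 | 28.0 | low | 39 | 0 | 0 | 3.9286 | 0 |
| 3 | baoan | 1 | 1 | 28.0 | middle | 30 | 0 | 0 | 3.3568 | 0 |
| 4 | baoan | 2 | 2 | 78.0 | middle | 8 | 1 | 1 | 5.0769 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1246 | baoan | 4 | 2 | 89.3 | low | 8 | 0 | 0 | 4.2553 | 0 |
| 1247 | baoan | 2 | 1 | 67.0 | middle | 30 | 0 | 0 | 3.8060 | 0 |
| 1248 | baoan | 2 | 2 | 67.4 | middle | 29 | 1 | 0 | 5.3412 | 1 |
| 1249 | baoan | 2 | 2 | 73.1 | low | 15 | 1 | 0 | 5.9508 | 1 |
| 1250 | baoan | 3 | 2 | 86.2 | middle | 32 | 0 | 1 | 4.5244 | 0 |

1251 rows × 10 columns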

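    # Crosstab
    # Relationship between floor_num and whether the price per square meter
    # exceeds 50,000 (feature = 1 when per_price > 5.0, so per_price is
    # presumably given in units of 10,000)
    # The result holds, for each floor, the number of 0s and the number of 1s
    data0 = pd.crosstab(data["floor_num"],data["feature"])
    data0

| floor_num | feature = 0 | feature = 1 |
|---|---|---|
| 1 | 6 | 8 |
| 3 | 0 | 1 |
| 4 | 0 | 10 |
| 6 | 3 | 7 |
| 7 | 16 | 25 |
| 8 | 19 | 32 |
| 9 | 2 | 11 |
| 10 | 4 | 9 |
| 11 | 8 | 11 |
| 12 | 1 | 3 |
| 13 | 4 | 20 |
| 14 | 0 | 5 |
| 15 | 8 | 33 |
| 16 | 9 | 19 |
| 17 | 20 | 21 |
| 18 | 17 | 35 |
| 19 | 11 | 5 |
| 20 | 2 | 4 |
| 21 | 1 | 6 |
| 22 | 0 | 1 |
| 23 | 4 | 8 |
| 24 | 10 | 26 |
| 25 | 4 | 37 |
| 26 | 9 | 57 |
| 27 | 5 | 38 |
| 28 | 6 | 35 |
| 29 | 26 | 68 |
| 30 | 30 | 78 |
| 31 | 4 | 151 |
| 32 | 21 | 126 |
| 33 | 34 | 20 |
| 34 | 1 | 5 |
| 35 | 1 | 2 |
| 36 | 0 | 4 |
| 37 | 1 | 1 |
| 38 | 0 | 1 |
| 39 | 5 | 10 |
| 40 | 1 | 3 |
| 43 | 0 | 1 |
| 44 | 0 | 6 |
| 45 | 0 | 7 |
| 47 | 0 | 1 |
| 50 | 0 | 1 |
| 51 | 0 | 3 |
| 52 | 0 | 2 |
| 53 | 0 | 1 |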
    data0.sum(axis=1)  # row sums

    floor_num
    1      14
    3       1
    4      10
    6      10
    7      41
    8      51
    9      13
    10     13
    11     19
    12      4
    13     24
    14      5
    15     41
    16     28
    17     41
    18     52
    19     16
    20      6
    21      7
    22      1
    23     12
    24     36
    25     41
    26     66
    27     43
    28     41
    29     94
    30    108
    31    155
    32    147
    33     54
    34      6
    35      3
    36      4
    37      2
    38      1
    39     15
    40      4
    43      1
    44      6
    45      7
    47      1
    50      1
    51      3
    52      2
    53      1
    dtype: int64

    data0.div(data0.sum(axis=1),axis=0)  # divide each row by its row sum

| floor_num | feature = 0 | feature = 1 |
|---|---|---|
| 1 | 0.428571 | 0.571429 |
| 3 | 0.000000 | 1.000000 |
| 4 | 0.000000 | 1.000000 |
| 6 | 0.300000 | 0.700000 |
| 7 | 0.390244 | 0.609756 |
| 8 | 0.372549 | 0.627451 |
| 9 | 0.153846 | 0.846154 |
| 10 | 0.307692 | 0.692308 |
| 11 | 0.421053 | 0.578947 |
| 12 | 0.250000 | 0.750000 |
| 13 | 0.166667 | 0.833333 |
| 14 | 0.000000 | 1.000000 |
| 15 | 0.195122 | 0.804878 |
| 16 | 0.321429 | 0.678571 |
| 17 | 0.487805 | 0.512195 |
| 18 | 0.326923 | 0.673077 |
| 19 | 0.687500 | 0.312500 |
| 20 | 0.333333 | 0.666667 |
| 21 | 0.142857 | 0.857143 |
| 22 | 0.000000 | 1.000000 |
| 23 | 0.333333 | 0.666667 |
| 24 | 0.277778 | 0.722222 |
| 25 | 0.097561 | 0.902439 |
| 26 | 0.136364 | 0.863636 |
| 27 | 0.116279 | 0.883721 |
| 28 | 0.146341 | 0.853659 |
| 29 | 0.276596 | 0.723404 |
| 30 | 0.277778 | 0.722222 |
| 31 | 0.025806 | 0.974194 |
| 32 | 0.142857 | 0.857143 |
| 33 | 0.629630 | 0.370370 |
| 34 | 0.166667 | 0.833333 |
| 35 | 0.333333 | 0.666667 |
| 36 | 0.000000 | 1.000000 |
| 37 | 0.500000 | 0.500000 |
| 38 | 0.000000 | 1.000000 |
| 39 | 0.333333 | 0.666667 |
| 40 | 0.250000 | 0.750000 |
| 43 | 0.000000 | 1.000000 |
| 44 | 0.000000 | 1.000000 |
| 45 | 0.000000 | 1.000000 |
| 47 | 0.000000 | 1.000000 |
| 50 | 0.000000 | 1.000000 |
| 51 | 0.000000 | 1.000000 |
| 52 | 0.000000 | 1.000000 |
| 53 | 0.000000 | 1.000000 |
    data_percent = data0.div(data0.sum(axis=1),axis=0)
    data_percent

(Output: the same table of per-floor fractions as produced by data0.div(data0.sum(axis=1),axis=0) above.)

    # stacked=True stacks the two bars of each floor on top of each other
    data_percent.plot(kind="bar",stacked=True)

[Figure: stacked bar chart of data_percent, one bar per floor_num]
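The manual "divide each row by its sum" step can also be expressed through the `normalize` parameter of `pd.crosstab`. The sketch below uses a six-row made-up frame (hypothetical values, same column names as in the text) and checks that both routes agree.

```python
import pandas as pd

# Hypothetical miniature of the housing data
df = pd.DataFrame({
    "floor_num": [1, 1, 1, 2, 2, 3],
    "feature":   [0, 1, 1, 0, 0, 1],
})

# Counts per floor, then row-wise fractions as in the text
counts = pd.crosstab(df["floor_num"], df["feature"])
manual = counts.div(counts.sum(axis=1), axis=0)

# normalize="index" asks crosstab for row-wise fractions directly
normalized = pd.crosstab(df["floor_num"], df["feature"], normalize="index")

print(manual)
print(normalized)
```

Either table can be passed to `.plot(kind="bar", stacked=True)` as in the text; the explicit `div` version has the advantage of keeping the raw counts around for inspection.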


4.9.3 Using pivot_table

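    # With a pivot table the whole process becomes simpler.
    # The default aggregation is the mean, so the result is directly
    # the fraction of rows whose feature value is 1.
    data.pivot_table(["feature"],index=["floor_num"])

| floor_num | feature |
|---|---|
| 1 | 0.571429 |
| 3 | 1.000000 |
| 4 | 1.000000 |
| 6 | 0.700000 |
| ... | ... |
| 50 | 1.000000 |
| 51 | 1.000000 |
| 52 | 1.000000 |
| 53 | 1.000000 |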
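The reason the pivot table yields the share of 1s is that `pivot_table` aggregates with the mean by default, and the mean of a 0/1 column is the fraction of 1s. A small sketch on made-up data (hypothetical values) makes this explicit and also shows how `aggfunc` switches to a count:

```python
import pandas as pd

# Hypothetical miniature of the housing data
df = pd.DataFrame({
    "floor_num": [1, 1, 1, 2, 2, 3],
    "feature":   [0, 1, 1, 0, 0, 1],
})

# Default aggfunc is the mean: fraction of rows with feature == 1 per floor
share = df.pivot_table(values="feature", index="floor_num")

# The same grouping with an explicit aggregation: number of rows per floor
rows_per_floor = df.pivot_table(values="feature", index="floor_num", aggfunc="count")

print(share)
print(rows_per_floor)
```

Looking at the row counts next to the shares is a useful habit here, since a share of 1.0 on a floor with a single listing (as for several high floors in the text) says very little.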

4.10 Advanced Processing: Grouping and Aggregation

4.10.2 Grouping and aggregation API

    col = pd.DataFrame({'color':['white','red','green','red','green'],
                        'object':["pen","pencil","pencil","ashtray","pen"],
                        'price1':[4.56,4.20,1.30,0.56,2.75],
                        'price2':[4.75,4.12,1.68,0.75,3.15]})
    col

| | color | object | price1 | price2 |
|---|---|---|---|---|
| 0 | white | pen | 4.56 | 4.75 |
| 1 | red | pencil | 4.20 | 4.12 |
| 2 | green | pencil | 1.30 | 1.68 |
| 3 | red | ashtray | 0.56 | 0.75 |
| 4 | green | pen | 2.75 | 3.15 |

    # Group by color and aggregate price1
    # Grouping with the DataFrame method
    col.groupby(by="color")["price1"].max()

    color
    green    2.75
    red      4.20
    white    4.56
    Name: price1, dtype: float64

    # Grouping with the Series method
    # This returns a pandas SeriesGroupBy object; nothing is computed
    # until an aggregation is called on it
    col['price1'].groupby(col["color"])

    col['price1'].groupby(col["color"]).max()

    color
    green    2.75
    red      4.20
    white    4.56
    Name: price1, dtype: float64
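A single `groupby` can also compute several aggregates at once. The sketch below reuses the `col` data from above with pandas' named-aggregation syntax; the output column names (`max_price1`, `mean_price2`, `n_items`) are made up for illustration.

```python
import pandas as pd

col = pd.DataFrame({
    "color": ["white", "red", "green", "red", "green"],
    "object": ["pen", "pencil", "pencil", "ashtray", "pen"],
    "price1": [4.56, 4.20, 1.30, 0.56, 2.75],
    "price2": [4.75, 4.12, 1.68, 0.75, 3.15],
})

# Named aggregation: new_column=(source_column, aggregation)
summary = col.groupby("color").agg(
    max_price1=("price1", "max"),
    mean_price2=("price2", "mean"),
    n_items=("object", "count"),
)

print(summary)
```

The `max_price1` column reproduces the result of `col.groupby(by="color")["price1"].max()` shown above, with the other aggregates alongside it.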

4.11 Case Study

    # 1. Prepare the data
    movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
    movie

(Output: the same 1000 rows × 12 columns movie table as in section 4.6.1, with columns Rank, Title, Genre, Description, Director, Actors, Year, Runtime (Minutes), Rating, Votes, Revenue (Millions) and Metascore.)

    # Question 1: how do we get the average rating, the number of
    # directors and similar information from this movie data?
    movie["Rating"].mean()

    6.723200000000003

    movie["Director"]

    0                James Gunn
    1              Ridley Scott
    2        M. Night Shyamalan
    3      Christophe Lourdelet
    4                David Ayer
                   ...
    995               Billy Ray
    996                Eli Roth
    997              Jon M. Chu
    998          Scot Armstrong
    999        Barry Sonnenfeld
    Name: Director, Length: 1000, dtype: object

    # np.unique() removes duplicates, because one director may have directed several movies
    np.unique(movie["Director"])

    array(['Aamir Khan', 'Abdellatif Kechiche', 'Adam Leon', 'Adam McKay',
           'Adam Shankman', 'Adam Wingard', 'Afonso Poyart', 'Aisling Walsh',
           'Akan Satayev', 'Akiva Schaffer', 'Alan Taylor', 'Albert Hughes',
           'Alejandro Amenábar', 'Alejandro González Iñárritu',
           ...
           'Tomas Alfredson', 'Tony Gilroy', 'Tony Scott', 'Travis Knight',
           'Tyler Shields', 'Wally Pfister', 'Walt Dohrn', 'Walter Hill',
           'Warren Beatty', 'Werner Herzog', 'Wes Anderson', 'Wes Ball',
           'Wes Craven', 'Whit Stillman', 'Will Gluck', 'Will Slocombe',
           'William Brent Bell', 'William Oldroyd', 'Woody Allen',
           'Xavier Dolan', 'Yimou Zhang', 'Yorgos Lanthimos', 'Zack Snyder',
           'Zackary Adler'], dtype=object)

    # Number of directors
    np.unique(movie["Director"]).size

    644

    # Question 2: if we want to see how Rating and Runtime are distributed
    # in this movie data, how should the data be presented?
    movie["Rating"].plot(kind="hist",figsize=(20,8),fontsize=40)

[Figure: histogram of Rating drawn with the pandas plot method]

    import matplotlib.pyplot as plt

    # 1. Create the canvas
    plt.figure(figsize=(20,8),dpi=100)

    # 2. Draw the histogram with 20 bins
    plt.hist(movie["Rating"],20)

    # Adjust the ticks: 20 bins have 21 edges
    plt.xticks(np.linspace(movie["Rating"].min(),movie["Rating"].max(),21))

    # Add a grid
    plt.grid(linestyle="--",alpha=0.5)

    # 3. Show the figure
    plt.show()
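The tick positions in the matplotlib code are chosen so that they fall on the bin boundaries: 20 equal-width bins between the minimum and maximum have 21 edges. This can be checked numerically without drawing anything. The sketch uses only the ten ratings visible in the table excerpt, so the counts are not those of the full data set.

```python
import numpy as np

# The ten Rating values visible in the table excerpt (not the full column)
ratings = np.array([8.1, 7.0, 7.3, 7.2, 6.2, 6.2, 5.5, 6.2, 5.6, 5.3])

# np.histogram uses the same equal-width binning idea as plt.hist(data, 20)
counts, edges = np.histogram(ratings, bins=20)

# The tick positions used in the text
ticks = np.linspace(ratings.min(), ratings.max(), 21)

print(counts)
print(edges)
```

Because the edges and the ticks coincide, every bar in the histogram starts and ends exactly on a labelled tick, which makes the bin ranges readable.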

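    movie["Rating"]

    0      8.1
    1      7.0
    2      7.3
    3      7.2
    4      6.2
          ...
    995    6.2
    996    5.5
    997    6.2
    998    5.6
    999    5.3
    Name: Rating, Length: 1000, dtype: float64

    # Question 3: if we want to count the movies per genre in this data,
    # how should the data be processed?

    # First find out which genres exist
    movie_genre = [i.split(",") for i in movie["Genre"]]
    movie_genre

    [['Action', 'Adventure', 'Sci-Fi'],
     ['Adventure', 'Mystery', 'Sci-Fi'],
     ['Horror', 'Thriller'],
     ['Animation', 'Comedy', 'Family'],
     ['Action', 'Adventure', 'Fantasy'],
     ...
     ['Horror'],
     ['Drama', 'Music', 'Romance'],
     ['Adventure', 'Comedy'],
     ['Comedy', 'Family', 'Fantasy']]

    [j for i in movie_genre for j in i]

    ['Action','Adventure','Sci-Fi','Adventure','Mystery','Sci-Fi', ...
     'Animation','Action','Adventure','Action','Adventure','Drama',...]

    movie_class = np.unique([j for i in movie_genre for j in i])
    movie_class

    array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
           'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music',
           'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller',
           'War', 'Western'], dtype='<U9')

    len(movie_class)  # 20 genres

    20

    # Count how many movies each genre has
    # First create a DataFrame filled with zeros
    count = pd.DataFrame(np.zeros(shape=[1000,20],dtype="int32"),columns=movie_class)
    count.head()

| | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | History | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

    count.loc[0,movie_genre[0]]

    Action       0
    Adventure    0
    Sci-Fi       0
    Name: 0, dtype: int32

    movie_genre[0]

    ['Action', 'Adventure', 'Sci-Fi']

    # Fill in the table: set 1 for every genre a movie belongs to
    for i in range(1000):
        count.loc[i,movie_genre[i]] = 1
    count

| | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | History | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 996 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 997 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 998 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 999 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

1000 rows × 20 columns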

    # Column sums
    count.sum(axis=0)

    Action       303
    Adventure    259
    Animation     49
    Biography     81
    Comedy       279
    Crime        150
    Drama        513
    Family        51
    Fantasy      101
    History       29
    Horror       119
    Music         16
    Musical        5
    Mystery      106
    Romance      141
    Sci-Fi       120
    Sport         18
    Thriller     195
    War           13
    Western        7
    dtype: int64

    count.sum(axis=0).sort_values(ascending=False)

    Drama        513
    Action       303
    Comedy       279
    Adventure    259
    Thriller     195
    Crime        150
    Romance      141
    Sci-Fi       120
    Horror       119
    Mystery      106
    Fantasy      101
    Biography     81
    Family        51
    Animation     49
    History       29
    Sport         18
    Music         16
    War           13
    Western        7
    Musical        5
    dtype: int64

    count.sum(axis=0).sort_values(ascending=False).plot(kind="bar",fontsize=20,figsize=(20,9),colormap="cool")

[Figure: bar chart of the number of movies per genre, sorted in descending order]
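The zero table plus fill loop above is easy to follow, and pandas also offers a shortcut for comma-separated labels: `Series.str.get_dummies(sep=",")`. This is offered as an alternative, not as the article's method. The sketch uses the five Genre strings visible at the top of the movie table, so the totals differ from the full 1000-row result.

```python
import pandas as pd

# The first five Genre values from the movie table
genre = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Adventure,Mystery,Sci-Fi",
    "Horror,Thriller",
    "Animation,Comedy,Family",
    "Action,Adventure,Fantasy",
])

# Split on "," and build one 0/1 column per distinct genre
one_hot = genre.str.get_dummies(sep=",")

# Movies per genre, most frequent first
totals = one_hot.sum().sort_values(ascending=False)

print(one_hot)
print(totals)
```

On the full data this would replace the `np.unique`, `np.zeros` and loop steps with a single call, after which `sum` and `sort_values` work exactly as in the text.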
