當(dāng)前位置：首頁(yè) > 编程语言 > python >内容正文

python

Python 中的Pandas库

發(fā)布時(shí)間：2023/12/13 python 39 豆豆

生活随笔收集整理的這篇文章主要介紹了 Python 中的Pandas库小編覺(jué)得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

待總結(jié)

用scikit-learn和pandas學(xué)習(xí)線性回歸

用scikit-learn和pandas學(xué)習(xí)Ridge回歸

基于python的數(shù)據(jù)分析庫(kù)Pandas

pandas——Python 數(shù)據(jù)分析庫(kù)，包括數(shù)據(jù)框架（dataframes）等結(jié)構(gòu) http://pandas.pydata.org/

10 Minutes to Pandas：http://suo.im/4an6gY

待整理的部分
Data Analysis with Python and Pandas Tutorial Introduction

Numpy & Pandas

#numpy是序列化的矩陣或者序列 #pandas是字典形式的numpy，可給不同行列進(jìn)行重新命名 ——————————- **Pandas數(shù)據(jù)轉(zhuǎn)為 numpy數(shù)據(jù)** df_numpyMatrix = df.as_matrix() df_numpyMatrix=df.values ————————————————— ***Pandas 小抄*** ————————————————— ***1. Reading and Writing Data*** —— import pandas as pd #a. Reading a csv file df=pd.read_csv('Analysis.cav') #b. Writing content of data frame to csv file df.to_csv('werfer.csv') # c.Reading an Excel file df=pd.read_excel('sdfsdgsd.xlsx', 'sheeet1') #d. Writing content of data frame to Excel file df.to_excel('sddg.xlsx', sheet_name='sheet2') # pandas 導(dǎo)入導(dǎo)出，讀取和儲(chǔ)存# The pandas I/O API is a set of top level reader functions accessed like # pd.read_csv() that pandas object.# read_csv # excel files # read_excel # read_hdf # read_sql # read_json # read_msgpack(experimental) # read_html # read_gbq(experimental) # read_stata # read_sas # read_clipboard # read_pickle #自帶的亞索# The corresponding writer functions are object methods that are accessed like # df.to_csv# to_csv # to_excel # to_hdf # to_sql # to_json # to_msgpack # to_html # to_gbq # to_stata # to_clipboard # to_pickleimport pandas as pddata = pd.read_csv('student.csv') print(data)data.to_packle('student.pickle') ————————————————— ***2. Getting Preview of Dataframe*** #a.Looking at top n record df.head(5) #b.Looking at bottom n record df.tail(5) #c.View columns name df.columns ————————————————— ***3. Rename Columns of Data Frame*** #a. Rename method helps to rename column of data frame df2 = df.rename(columns={'old_columnname':'new_columnname'}) #This method will create a new data frame with new column name. #b.To rename the column of existing data frame, set inplace=True. df.rename(columns={'old_columnname':'new_columnname'}, inplace=True) ————————————————— ***4. Selecting Columns or Rows*** #a. Accessing sub data frames df[['column1','column2']] #b.Filtering Records df[df['column1']>10] df[(df['column1']>10) & df['column2']==30] df[(df['column1']>10) | df['column2']==30] # pandas 數(shù)據(jù)選擇import pandas as pd import numpy as npdates = pd.date_range('20170101',periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D']) print(df)print(df['A'],df.A)print(df[0:3],df['20170101':'20170104'])# select by label:loc print(df.loc['20170102'])print(df.loc[:,['A','B']])print(df.loc['20170102',['A','B']])#select by position:iloc print(df.iloc[3]) print(df.iloc[3,1]) print(df.iloc[1:3,1:3]) print(df.iloc[[1,3,5],1:3])#mixed selection:ix print(df.ix[:3,['A','C']])# Boolean indexing print(df) print(df[df.A>8]) ————————————————— ***5. Handing Missing Values*** This is an inevitale part of dealing wiht data. To overcom this hurdle, use dropna or fillna function #a. dropna: It is used to drop rows or columns having missing data df1.dropna() #b.fillna: It is used to fill missing values df2.fillna(value=5) # It replaces all missing values with 5 mean = df2['column1'].mean() df2['column1'].fillna(mean) # It replaces all missing values of column1 with mean of available values ————- from pandas import Series,DataFrame import pandas as pd ser = Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c']) ser ser.drop('c') ser .drop() 返回的是一個(gè)新對(duì)象，元對(duì)象不會(huì)被改變。 from pandas import Series,DataFrame import pandas as pd import numpy as npdf = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1], ... [np.nan, np.nan, np.nan, 5]], ... columns=list('ABCD'))df#Drop the columns where all elements are nan df.dropna(axis=1, how='all')A B D 0 NaN 2.0 0 1 3.0 4.0 1 2 NaN NaN 5#Drop the columns where any of the elements is nan>>> df.dropna(axis=1, how='any')D 0 0 1 1 2 5#Drop the rows where all of the elements are nan (there is no row to drop, so df #stays the same):>>> df.dropna(axis=0, how='all')A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN NaN NaN 5#Drop the rows where any of the elements are nan >>> df.dropna(axis=0, how='any') Empty DataFrame Columns: [A, B, C, D] Index: []#Keep only the rows with at least 2 non-na values:>>> df.dropna(thresh=2)A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1#Drop where all of the elements are nan, the default is the row, (there is no row to drop, so df #stays the same):>>> df.dropna(how='all')A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN NaN NaN 5#Drop where any of the elements are nan, default is the row >>> df.dropna( how='any') Empty DataFrame Columns: [A, B, C, D] Index: []dfnew = pd.DataFrame([[3435234, 2, 5666, 0], [3, 4, np.nan, 1],...: ... [np.nan, np.nan, np.nan, 5]],...: ... columns=list('ABCD'))dfnew.dropna() #默認(rèn)對(duì)row 進(jìn)行操作，去掉Na項(xiàng)A B C D 0 3435234 2 5666 0 # 處理丟失數(shù)據(jù) import numpy as np import pandas as pddates = pd.date_range('20170101',periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D']) print(df)df.iloc[0,1]=np.nan df.iloc[1,2]=np.nan print(df.dropna(axis=0,how='any'))#how={'any','all'} default is 'any' print(df.dropna(axis=1,how='all'))#填入數(shù)據(jù) print(df.fillna(value=0)) #打印缺失數(shù)據(jù) print(df.isnull()) #打印出缺失數(shù)據(jù)，當(dāng)數(shù)據(jù)比較大時(shí) print(np.any(df.isnull())==True) ————————————————— ***6. Creating New Columns*** New column is a function of existing columns df['NewColumn1'] = df['column2'] # Create a copy of existing column2 df['NewColumn2'] = df['column2'] + 10 # Add 10 to existing column2 then create a new one df['NewColumn3'] = df['column1'] + df['column2'] # Add elements of column1 and column2 then create new column import pandas as pd import numpy as nps = pd.Series([1,3,5,np.nan,55,2]) print(s)dates = pd.date_range('20160101',periods=6) print(dates)df = pd.DataFrame(np.random.random(6,4),index=dates,columns=['a','b','c','d']) print(df)df1 = pd.DataFrame(np.arange(12).reshape((3,4))) print(df1)df2 = pd.DataFrame({'A':1.,'B':pd.Timestamp('20170101'),'C':pd.Series(1,index=list(range(4)),dtype='float32'),'D':np.array([3]*4,dtype='int32'),'E':pd.Categorical(["test","train,"test","train"]),'F':'foo'})print(df2.dtypes) print(df2.columns) print(df2.values)print(df2.describe)print(df2.T)print(df2.sort_index(axis=1,ascending=False)) df2.sort_values(by='E') #添加空行 df['F'] = np.nan print(df)df['E']=pd.Series([1,2,3,4,5,6],index=pd.date_range('20170101',periods=6)) print(df) ————————————————— ***7. Aggregate*** a. Groupby: Groupby helps to perform three operations. i. Splitting the data into groups ii. Applying a function to each group individually iii. Combining the result into a data structure df.groupby('column1').sum() df.groupby(['column1','column2']).count() b. Pivot Table: It helps to generate data structure. It has three components index, columns and values(similar to excel) pd.pivot_table(df, values='column1',index=['column2','column3'],columns=['column4']) By default, it shows the sum of values column but you can change it using argument aggfunc pd.pivot_table(df, values='column1',index=['column2','column3'],columns=['column4'], aggfunc=len) It shows count c. Cross Tab: Cross Tab computes the simple cross tabulation of two factors pd.crosstab(df.column1, df.column2) ————————————————— ***8. Merging /Concatenating DataFrames*** a. Concatenating: It concatenate two or more data frames based on their columns pd.concat([df1, df2]) b. Merging: We can perform left, right and inner join also. pd.merge(df1,df2, on='column1',how='inner') pd.merge(df1,df2, on='column1',how='left') pd.merge(df1,df2, on='column1',how='right') pd.merge(df1,df2, on='column1',how='outer') # pandas 合并concatimport pandas as pd import numpy as np#concatenatingdf1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d']) df2 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d']) df3 = pd.DataFrame(np.ones((3,4))*2,columns=['a','b','c','d'])print(df1) print(df2) print(df3)result = pd.concat([df1,df2,df3],axis=0)#行合并 print(result) #result1 = pd.concat([df1,df2,df3],axis=1)#列合并 #print(result1)result = pd.concat([df1,df2,df3],axis=0,ignore_index=True)#行合并,忽略index print(result)#join,['inner','outer'] df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'],index=[1,2,3]) df2 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'],index=[2,3,4]) print(df1) print(df2)result2 = pd.concat([df1,df2],join='outer'，ignore_index=True)# 補(bǔ)充為na print(result2) result22 = pd.concat([df1,df2],join='outer')# 補(bǔ)充為na print(result22) result3 = pd.concat([df1,df2],join='inner'，ignore_index=True) # 裁剪掉 print(result3) result33 = pd.concat([df1,df2],join='inner') # 裁剪掉 print(result33)#join_axes df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'],index=[1,2,3]) df2 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'],index=[2,3,4]) res = pd.concat([df1,df2],axis=1,join_axes=[df1.index]) print(res)res1 = pd.concat([df1,df2],axis=1) print(res1)#append df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d']) df2 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d']) df3 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d']) res11 = df1.append(df2,ignore_index=True) print(res11) res12 = df1.append([df2,df3],ignore_index=True) print(res12)s1 = pd.Series([1,2,3,4],index=['a','b','c','d'])res13=df1.append(s1,ignore_index=True) print(res13)#pandas 合并mergeimport pandas as pd#merging two df by key/keys.(may be used in database) #simple example left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']}) right = pd.DataFrame({'key':['K0','K1','K2','K3'],'C':['C0','C1','C2','C3'],'D':['D0','D1','D2','D3']})print(lef) print(right) res14 = pd.merge(left,right,on='key') print(res14)#consider two keys left = pd.DataFrame({'key1':['K0','K0','K1','K2'],'key2':['K0','K1','K0','K1'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']}) right = pd.DataFrame({'key1':['K0','K1','K1','K2'],'key2':['K0','K0','K0','K0'],'C':['C0','C1','C2','C3'],'D':['D0','D1','D2','D3']}) print(left) print(right)res15 = pd.merge(left,right,on=['key1','key2']) print(res15) #how =['left','right','inner','outer'] res16 = pd.merge(left,right,on=['key1','key2'],how='inner') print(res16) ————————————————— ***9. Applying function to element, column or dataframe*** a. Map: It iterates over each element of a series df['column1'].map(lambda x: 10+x) #this will add 10 to each element of column1 df['column2'].map(lambda x:'AV'+x) # this will concatenate 'AV' at the beginning of each element of column2(column format is string) b. Apply: As the name suggests, applies a function along any axis of the DataFrame df[['column1','column2']].apply(sum) #It will returns the sum of all the values of column1 and column2 c. ApplyMap: This helps to apply a function to each element of dataframe func = lambda x: x+2 df.applymap(func) # it will add 2 to each element of dataframe(all columns of dataframe must be numeric type) ————————————————— ***10. Identify unique value*** Function unique helps to return unique values of a column df['Column1'].unique() ————————————————— ***11. Basic Stats*** Pandas helps to understand the data using basic statistical methods. a. describe: This returns the quick stats(count, mean, std, min, first quartile, median, third quartile, max) on suitable columns df.describe() b. covariance: It returns the co-variance between suitable columns df.cov() c.correlation: It returns the co-variance between suitable columns. df.corr() ——— 本文中的 Python-Pandas.ipynb格式見(jiàn)[CSDN下載](http://download.csdn.net/detail/jiandanjinxin/9826981)。 #https://python.freelycode.com/contribution/detail/333 #https://python.freelycode.com/contribution/detail/334 #http://www.datadependence.com/2016/05/scientific-python-pandas/#Python科學(xué)計(jì)算之Pandas #導(dǎo)入Pandas的標(biāo)準(zhǔn)方式 import pandas as pd # This is the standard #Pandas的數(shù)據(jù)類型 #Pandas基于兩種數(shù)據(jù)類型：series與dataframe。 #一個(gè)series是一個(gè)一維的數(shù)據(jù)類型，其中每一個(gè)元素都有一個(gè)標(biāo)簽。 #series類似于Numpy中元素帶標(biāo)簽的數(shù)組。其中，標(biāo)簽可以是數(shù)字或者字符串。 #一個(gè)dataframe是一個(gè)二維的表結(jié)構(gòu)。Pandas的dataframe可以存儲(chǔ)許多種不同的數(shù)據(jù)類型，并且每一個(gè)坐標(biāo)軸都有自己的標(biāo)簽。 #你可以把它想象成一個(gè)series的字典項(xiàng)。 #將數(shù)據(jù)導(dǎo)入Pandas,采用[英國(guó)政府?dāng)?shù)據(jù)中關(guān)于降雨量數(shù)據(jù)](https://data.gov.uk/dataset/average-temperature-and-rainfall-england-and-wales/resource/3fea0f7b-5304-4f11-a809-159f4558e7da) # Reading a csv into Pandas,從csv文件中讀取到了數(shù)據(jù)，并將他們存入了dataframe中 #header關(guān)鍵字告訴Pandas這些數(shù)據(jù)是否有列名，在哪里。如果沒(méi)有列名，你可以將其置為None。 df = pd.read_csv('uk_rain_2014.csv', header=0) #將你的數(shù)據(jù)準(zhǔn)備好以進(jìn)行挖掘和分析 #想要快速查看前x行數(shù)據(jù) #Getting first x rows df.head(5) Water YearRain (mm) Oct-SepOutflow (m3/s) Oct-SepRain (mm) Dec-FebOutflow (m3/s) Dec-FebRain (mm) Jun-AugOutflow (m3/s) Jun-Aug01234

1980/81	1182	5408	292	7248	174	2212
1981/82	1098	5112	257	7316	242	1936
1982/83	1156	5701	330	8567	124	1802
1983/84	993	4265	391	8905	141	1078
1984/85	1182	5364	217	5813	343	4313

#想要獲得最后x行的數(shù)據(jù) #Getting last x rows #Pandas不是從dataframe的結(jié)尾處開(kāi)始倒著輸出數(shù)據(jù)， #而是按照它們?cè)赿ataframe中固有的順序輸出給你。 df.tail(5) Water YearRain (mm) Oct-SepOutflow (m3/s) Oct-SepRain (mm) Dec-FebOutflow (m3/s) Dec-FebRain (mm) Jun-AugOutflow (m3/s) Jun-Aug2829303132

2008/09	1139	4941	268	6690	323	3189
2009/10	1103	4738	255	6435	244	1958
2010/11	1053	4521	265	6593	267	2885
2011/12	1285	5500	339	7630	379	5261
2012/13	1090	5329	350	9615	187	1797

df.columns Index([’Water Year’, ‘Rain (mm) Oct-Sep’, ‘Outflow (m3/s) Oct-Sep’, ‘Rain (mm) Dec-Feb’, ‘Outflow (m3/s) Dec-Feb’, ‘Rain (mm) Jun-Aug’, ‘Outflow (m3/s) Jun-Aug’], dtype=’object’) #Changing column labels. df.columns = ['water_year', 'rain_octsep', 'outflow_octsep', 'rain_decfeb', 'outflow_decfeb', 'rain_junaug', 'outflow_junaug'] df.head(5) water_yearrain_octsepoutflow_octseprain_decfeboutflow_decfebrain_junaugoutflow_junaug01234

1980/81	1182	5408	292	7248	174	2212
1981/82	1098	5112	257	7316	242	1936
1982/83	1156	5701	330	8567	124	1802
1983/84	993	4265	391	8905	141	1078
1984/85	1182	5364	217	5813	343	4313

#取數(shù)據(jù)的行數(shù)，即條目數(shù) #Finding out how many rows dataset has. len(df) 33 #數(shù)據(jù)的一些基本的統(tǒng)計(jì)信息 #Finding out basic statistical information on your dataset. pd.options.display.float_format = '{:,.3f}'.format #Limit output to 3 decimal places.計(jì)數(shù)，均值，標(biāo)準(zhǔn)方差 df.describe() rain_octsepoutflow_octseprain_decfeboutflow_decfebrain_junaugoutflow_junaugcountmeanstdmin25%50%75%max

33.000	33.000	33.000	33.000	33.000	33.000
1,129.000	5,019.182	325.364	7,926.545	237.485	2,439.758
101.900	658.588	69.995	1,692.800	66.168	1,025.914
856.000	3,479.000	206.000	4,578.000	103.000	1,078.000
1,053.000	4,506.000	268.000	6,690.000	193.000	1,797.000
1,139.000	5,112.000	309.000	7,630.000	229.000	2,142.000
1,182.000	5,497.000	360.000	8,905.000	280.000	2,959.000
1,387.000	6,391.000	484.000	11,486.000	379.000	5,261.000

#過(guò)濾 #提取一整列。可以直接使用列標(biāo)簽 #Getting a column by label df['rain_octsep'] 0 1182 1 1098 2 1156 3 993 4 1182 5 1027 6 1151 7 1210 8 976 9 1130 10 1022 11 1151 12 1130 13 1162 14 1110 15 856 16 1047 17 1169 18 1268 19 1204 20 1239 21 1185 22 1021 23 1165 24 1095 25 1046 26 1387 27 1225 28 1139 29 1103 30 1053 31 1285 32 1090 Name: rain_octsep, dtype: int64 #不使用空格和橫線等可以讓我們以訪問(wèn)類屬性相同的方法來(lái)訪問(wèn)列，即使用點(diǎn)運(yùn)算符 #Getting a column by label using. df.rain_octsep 0 1182 1 1098 2 1156 3 993 4 1182 5 1027 6 1151 7 1210 8 976 9 1130 10 1022 11 1151 12 1130 13 1162 14 1110 15 856 16 1047 17 1169 18 1268 19 1204 20 1239 21 1185 22 1021 23 1165 24 1095 25 1046 26 1387 27 1225 28 1139 29 1103 30 1053 31 1285 32 1090 Name: rain_octsep, dtype: int64 #boolean masking #Creating a series of booleans based on a conditional df.rain_octsep <1000 df['rain_octsep'] <1000 0 False 1 False 2 False 3 True 4 False 5 False 6 False 7 False 8 True 9 False 10 False 11 False 12 False 13 False 14 False 15 True 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False 30 False 31 False 32 False Name: rain_octsep, dtype: bool #使用多條條件表達(dá)式來(lái)進(jìn)行過(guò)濾 #Filtering by multiple conditionals #將返回rain_octsep小于1000并且outflow_octsep小于4000的那些條目。 df[(df.rain_octsep <1000) & (df.outflow_octsep <4000)] # Can't use the keyword 'and' water_yearrain_octsepoutflow_octseprain_decfeboutflow_decfebrain_junaugoutflow_junaug15

1995/96	856	3479	245	5515	172	1439

1995/96

856

3479

245

5515

172

1439

#數(shù)據(jù)中有字符串，也可以使用字符串方法來(lái)過(guò)濾數(shù)據(jù)。 #必須使用.str.[string method]，你不能直接在字符串上直接調(diào)用字符串方法。 #Filtering by string methods df[df.water_year.str.startswith('199')] #這一語(yǔ)句返回1990年代的所有條目 water_yearrain_octsepoutflow_octseprain_decfeboutflow_decfebrain_junaugoutflow_junaug10111213141516171819

1990/91	1022	4418	305	7120	216	1923
1991/92	1151	4506	246	5493	280	2118
1992/93	1130	5246	308	8751	219	2551
1993/94	1162	5583	422	10109	193	1638
1994/95	1110	5370	484	11486	103	1231
1995/96	856	3479	245	5515	172	1439
1996/97	1047	4019	258	5770	256	2102
1997/98	1169	4953	341	7747	285	3206
1998/99	1268	5824	360	8771	225	2240
1999/00	1204	5665	417	10021	197	2166

#索引 #如果行有數(shù)字索引，可以使用iloc引用他們 #Getting a row via a numerical index #iloc僅僅作用于數(shù)字索引。它將會(huì)返回該行的一個(gè)series。 df.iloc[30] water_year 2010/11 rain_octsep 1053 outflow_octsep 4521 rain_decfeb 265 outflow_decfeb 6593 rain_junaug 267 outflow_junaug 2885 Name: 30, dtype: object #可能在數(shù)據(jù)集里有年份的列，或者年代的列 #Setting a new index from an existing column df = df.set_index(['water_year']) df.head(5) rain_octsepoutflow_octseprain_decfeboutflow_decfebrain_junaugoutflow_junaugwater_year1980/811981/821982/831983/841984/85

1182	5408	292	7248	174	2212
1098	5112	257	7316	242	1936
1156	5701	330	8567	124	1802
993	4265	391	8905	141	1078
1182	5364	217	5813	343	4313

#在上面這個(gè)例子中，我們把我們的索引值全部設(shè)置為了字符串。這意味著我們不可以使用iloc索引這些列了。 #這種情況該如何？我們使用loc。 #Getting a row via a label-based index df.loc['2000/01'] #這里，loc和iloc一樣會(huì)返回你所索引的行數(shù)據(jù)的一個(gè)series。 #唯一的不同是此時(shí)你使用的是字符串標(biāo)簽進(jìn)行引用，而不是數(shù)字標(biāo)簽。 rain_octsep 1239 outflow_octsep 6092 rain_decfeb 328 outflow_decfeb 9347 rain_junaug 236 outflow_junaug 2142 Name: 2000/01, dtype: int64 #如果loc是字符串標(biāo)簽的索引方法，iloc是數(shù)字標(biāo)簽的索引方法，那什么是ix呢？ #事實(shí)上，ix是一個(gè)字符串標(biāo)簽的索引方法，但是它同樣支持?jǐn)?shù)字標(biāo)簽索引作為它的備選。 #Getting a row via a label-based or numerical index df.ix['1999/00'] # Label based with numerical index fallback * Not recommend #正如loc和iloc，上述代碼將返回一個(gè)series包含你所索引的行的數(shù)據(jù) rain_octsep 1204 outflow_octsep 5665 rain_decfeb 417 outflow_decfeb 10021 rain_junaug 197 outflow_junaug 2166 Name: 1999/00, dtype: int64 #既然ix可以完成loc和iloc二者的工作，為什么還需要它們呢? #最主要的原因是ix有一些輕微的不可預(yù)測(cè)性。還記得我說(shuō)數(shù)字標(biāo)簽索引是ix的備選嗎？ #數(shù)字標(biāo)簽可能會(huì)讓ix做出一些奇怪的事情，例如將一個(gè)數(shù)字解釋成一個(gè)位置。 #而loc和iloc則為你帶來(lái)了安全的、可預(yù)測(cè)的、內(nèi)心的寧?kù)o。 #然而必須指出的是，ix要比loc和iloc更快。 #調(diào)用sort_index來(lái)對(duì)dataframe實(shí)現(xiàn)排序 #inplace=True to apple the sorting in place #置了關(guān)鍵字參數(shù)’ascending’為False。這樣，我的數(shù)據(jù)會(huì)以降序排列 df.sort_index(ascending=False).head(5) rain_octsepoutflow_octseprain_decfeboutflow_decfebrain_junaugoutflow_junaugwater_year2012/132011/122010/112009/102008/09

1090	5329	350	9615	187	1797
1285	5500	339	7630	379	5261
1053	4521	265	6593	267	2885
1103	4738	255	6435	244	1958
1139	4941	268	6690	323	3189

#當(dāng)你為一列數(shù)據(jù)設(shè)置了一個(gè)索引時(shí)，它們將不再是數(shù)據(jù)本身了。 #如果你想把索引設(shè)置為原始數(shù)據(jù)的形式， #你可以使用和set_index相反的操作——reset_index。 #Returning an index to data #這將返回?cái)?shù)據(jù)原始的索引形式。 df = df.reset_index('water_year') df.head(5) water_yearrain_octsepoutflow_octseprain_decfeboutflow_decfebrain_junaugoutflow_junaug01234

1980/81	1182	5408	292	7248	174	2212
1981/82	1098	5112	257	7316	242	1936
1982/83	1156	5701	330	8567	124	1802
1983/84	993	4265	391	8905	141	1078
1984/85	1182	5364	217	5813	343	4313

#對(duì)數(shù)據(jù)集應(yīng)用函數(shù) #Applying a function to a column def base_year(year):base_year = year[:4]base_year = pd.to_datetime(base_year).yearreturn base_yeardf['year'] = df.water_year.apply(base_year) df.head(5) water_yearrain_octsepoutflow_octseprain_decfeboutflow_decfebrain_junaugoutflow_junaugyear01234

1980/81	1182	5408	292	7248	174	2212	1980
1981/82	1098	5112	257	7316	242	1936	1981
1982/83	1156	5701	330	8567	124	1802	1982
1983/84	993	4265	391	8905	141	1078	1983
1984/85	1182	5364	217	5813	343	4313	1984

#使用apply的方法，即如何對(duì)一列應(yīng)用一個(gè)函數(shù)。 #如果你想對(duì)整個(gè)數(shù)據(jù)集應(yīng)用某個(gè)函數(shù)，你可以使用dataset.applymap()。 #操作一個(gè)數(shù)據(jù)集結(jié)構(gòu) #另一件經(jīng)常會(huì)對(duì)dataframe所做的操作是為了讓它們呈現(xiàn)出一種更便于使用的形式而對(duì)它們進(jìn)行的重構(gòu)。 #Manipulating structure (groupby,unstack,pivot) #Groupby df.groupby(df.year // 10*10).max() water_yearrain_octsepoutflow_octseprain_decfeboutflow_decfebrain_junaugoutflow_junaugyearyear1980199020002010

1989/90	1210	5701	470	10520	343	4313	1989
1999/00	1268	5824	484	11486	285	3206	1999
2009/10	1387	6391	437	10926	357	5168	2009
2012/13	1285	5500	350	9615	379	5261	2012

#對(duì)多行進(jìn)行分組操作 #Grouping by multiple columns decade_rain = df.groupby([df.year // 10*10,df.rain_octsep // 1000*1000])[['outflow_octsep','outflow_decfeb','outflow_junaug']].mean() decade_rain outflow_octsepoutflow_decfeboutflow_junaugyearrain_octsep1980010001990010002000100020101000

4,297.500	7,685.000	1,259.000
5,289.625	7,933.000	2,572.250
3,479.000	5,515.000	1,439.000
5,064.889	8,363.111	2,130.556
5,030.800	7,812.100	2,685.900
5,116.667	7,946.000	3,314.333

#unstack操作的功能是將某一列前置成為列標(biāo)簽。 #Unstacking decade_rain.unstack(0) #它將標(biāo)識(shí)‘year’索引的第0列推起來(lái)，變?yōu)榱肆袠?biāo)簽。 outflow_octsepoutflow_decfeboutflow_junaugyear198019902000201019801990200020101980199020002010rain_octsep01000

4,297.500	3,479.000	nan	nan	7,685.000	5,515.000	nan	nan	1,259.000	1,439.000	nan	nan
5,289.625	5,064.889	5,030.800	5,116.667	7,933.000	8,363.111	7,812.100	7,946.000	2,572.250	2,130.556	2,685.900	3,314.333

#再附加一個(gè)unstack操作。這次我們對(duì)’rain_octsep’索引的第1列操作： #More unstacking decade_rain.unstack(1) outflow_octsepoutflow_decfeboutflow_junaugrain_octsep010000100001000year1980199020002010

4,297.500	5,289.625	7,685.000	7,933.000	1,259.000	2,572.250
3,479.000	5,064.889	5,515.000	8,363.111	1,439.000	2,130.556
nan	5,030.800	nan	7,812.100	nan	2,685.900
nan	5,116.667	nan	7,946.000	nan	3,314.333

#創(chuàng)造一個(gè)新的dataframe #Create a new dataframe containing entries which has rain_octsep values of #greater than 1250 high_rain = df[df.rain_octsep > 1250] high_rain water_yearrain_octsepoutflow_octseprain_decfeboutflow_decfebrain_junaugoutflow_junaugyear182631

1998/99	1268	5824	360	8771	225	2240	1998
2006/07	1387	6391	437	10926	357	5168	2006
2011/12	1285	5500	339	7630	379	5261	2011

#上述代碼為我們創(chuàng)建了如下的dataframe，我們將對(duì)它進(jìn)行pivot操作 #ivot實(shí)際上是在本文中我們已經(jīng)見(jiàn)過(guò)的操作的組合。 #首先，它設(shè)置了一個(gè)新的索引(set_index())，然后它對(duì)這個(gè)索引排序(sort_index())，最后它會(huì)進(jìn)行unstack操作。 #組合起來(lái)就是一個(gè)pivot操作。看看你能不能想想會(huì)發(fā)生什么： #Pivoting #does set_index,sort_index and unstack in a row high_rain.pivot('year', 'rain_octsep')[['outflow_octsep','outflow_decfeb','outflow_junaug']].fillna('') outflow_octsepoutflow_decfeboutflow_junaugrain_octsep126812851387126812851387126812851387year199820062011

5,824.000			8,771.000			2,240.000
		6,391.000			10,926.000			5,168.000
	5,500.000			7,630.000			5,261.000

#合并數(shù)據(jù)集 #有時(shí)候你有兩個(gè)單獨(dú)的數(shù)據(jù)集，它們直接互相關(guān)聯(lián)，而你想要比較它們的差異或者合并它們 #Merging two datasets together rain_jpn = pd.read_csv('jpn_rain.csv') rain_jpn.column = ['year', 'jpn_rainfall'] uk_jpn_rain = df.merge(rain_jpn, on = 'year') uk_jpn_rain.head(5) #可以看到，兩個(gè)數(shù)據(jù)集在年份這一類上已經(jīng)合并了。rain_jpn數(shù)據(jù)集僅僅包含年份以及降雨量。 #采用Pandas快速繪制圖表 #Using pandas to quickly plot graphs %matplotlib inline high_rain.plot(x='year', y='rain_octsep') <matplotlib.axes._subplots.AxesSubplot at 0x7f1214a5d748>

#存儲(chǔ)你的數(shù)據(jù)集 #Saving your data to a csv df.to_csv('high_rain.csv') # pandas plot import pandas as pd import numpy as np import matplotlib.pyplot as plt #plot data# Series data = pd.Series(np.random.randn(1000),index=np.arange(1000)) data = data.cumsum() data.plot() plt.show() plt.plot(x= , y = )#DataFrame data = pd.DataFrame(np.random.randn(1000,4),index=np.arange(1000),columns=list("ABCD")) data =data.cumsum() print(data.head()) data.plot() plt.show()#plot methods: #'bar','hist','box','area','scatter','hexbin','pie' data.plot.scatter(x='A',y='B',color='DarkBlue',label='Class1') data.plot.scatter(x='A',y='C',color='DarkGreen',lable='Class2',ax=ax) plt.show()

References

Python科學(xué)計(jì)算之Pandas（上）

Python科學(xué)計(jì)算之Pandas（下）

An Introduction to Scientific Python – Pandas

CheatSheet: Data Exploration using Pandas in Python

機(jī)器學(xué)習(xí)入門必備的13張小抄

numpy教程 pandas教程 Python數(shù)據(jù)科學(xué)計(jì)算簡(jiǎn)介

創(chuàng)作挑戰(zhàn)賽新人創(chuàng)作獎(jiǎng)勵(lì)來(lái)咯，堅(jiān)持創(chuàng)作打卡瓜分現(xiàn)金大獎(jiǎng)

總結(jié)

以上是生活随笔為你收集整理的Python 中的Pandas库的全部?jī)?nèi)容，希望文章能夠幫你解決所遇到的問(wèn)題。

如果覺(jué)得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

上一篇：光学镜头参数详解（EFL、TTL、BFL
下一篇： Python 中的绘图matplotli