The Pandas Library in Python
To be summarized:

- Learning linear regression with scikit-learn and pandas
- Learning Ridge regression with scikit-learn and pandas
- Pandas, a Python-based data analysis library
- pandas, the Python data analysis library, including structures such as DataFrames: http://pandas.pydata.org/
- 10 Minutes to Pandas: http://suo.im/4an6gY
To be organized:
Data Analysis with Python and Pandas Tutorial Introduction
Numpy & Pandas
NumPy gives you plain numeric arrays and matrices; pandas is like a dictionary-flavoured NumPy that lets you give rows and columns their own names.

**Converting pandas data to NumPy**

```python
df_numpyMatrix = df.values  # preferred; the older df.as_matrix() was deprecated and later removed
```

***Pandas Cheat Sheet***

***1. Reading and Writing Data***

```python
import pandas as pd

# a. Reading a csv file
df = pd.read_csv('Analysis.csv')
# b. Writing the contents of a data frame to a csv file
df.to_csv('werfer.csv')
# c. Reading an Excel file
df = pd.read_excel('sdfsdgsd.xlsx', 'sheet1')
# d. Writing the contents of a data frame to an Excel file
df.to_excel('sddg.xlsx', sheet_name='sheet2')
```

pandas import and export: the pandas I/O API is a set of top-level reader functions, accessed like `pd.read_csv()`, that return a pandas object: `read_csv`, `read_excel`, `read_hdf`, `read_sql`, `read_json`, `read_msgpack` (experimental), `read_html`, `read_gbq` (experimental), `read_stata`, `read_sas`, `read_clipboard`, `read_pickle` (Python's built-in serialization).

The corresponding writer functions are object methods, accessed like `df.to_csv`: `to_csv`, `to_excel`, `to_hdf`, `to_sql`, `to_json`, `to_msgpack`, `to_html`, `to_gbq`, `to_stata`, `to_clipboard`, `to_pickle`.

```python
import pandas as pd

data = pd.read_csv('student.csv')
print(data)
data.to_pickle('student.pickle')
```

***2. Getting a Preview of a DataFrame***

```python
# a. Look at the top n records
df.head(5)
# b. Look at the bottom n records
df.tail(5)
# c. View the column names
df.columns
```

***3. Renaming Columns of a DataFrame***

```python
# a. rename creates a new data frame with the new column name
df2 = df.rename(columns={'old_columnname': 'new_columnname'})
# b. To rename a column of the existing data frame in place, set inplace=True
df.rename(columns={'old_columnname': 'new_columnname'}, inplace=True)
```
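As a self-contained check of the reading, writing, and renaming calls above. An in-memory buffer stands in for a file on disk, and the frame and column names are invented for illustration:

```python
import io

import pandas as pd

# A small made-up frame standing in for real data
df = pd.DataFrame({'name': ['Ann', 'Bob'], 'score': [88, 92]})

# Write to an in-memory "file", then read it back
buf = io.StringIO()
df.to_csv(buf, index=False)   # index=False keeps the row index out of the CSV
buf.seek(0)
df2 = pd.read_csv(buf)

# rename returns a new frame unless inplace=True
df2 = df2.rename(columns={'score': 'final_score'})
print(list(df2.columns))      # ['name', 'final_score']
```

The same round trip works with a filename in place of the buffer.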
***4. Selecting Columns or Rows***

```python
# a. Accessing sub data frames
df[['column1', 'column2']]
# b. Filtering records (the comparisons must be parenthesised,
# because & and | bind more tightly than > and ==)
df[df['column1'] > 10]
df[(df['column1'] > 10) & (df['column2'] == 30)]
df[(df['column1'] > 10) | (df['column2'] == 30)]
```

pandas data selection:

```python
import pandas as pd
import numpy as np

dates = pd.date_range('20170101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['A', 'B', 'C', 'D'])
print(df)
print(df['A'], df.A)
print(df[0:3], df['20170101':'20170104'])

# select by label: loc
print(df.loc['20170102'])
print(df.loc[:, ['A', 'B']])
print(df.loc['20170102', ['A', 'B']])

# select by position: iloc
print(df.iloc[3])
print(df.iloc[3, 1])
print(df.iloc[1:3, 1:3])
print(df.iloc[[1, 3, 5], 1:3])

# mixed selection: df.ix has been removed from pandas; chain iloc/loc instead
print(df.iloc[:3].loc[:, ['A', 'C']])

# Boolean indexing
print(df)
print(df[df.A > 8])
```

***5. Handling Missing Values***

This is an inevitable part of dealing with data. To overcome this hurdle, use the `dropna` or `fillna` functions.

```python
# a. dropna: drop rows or columns that contain missing data
df1.dropna()
# b. fillna: fill in missing values
df2.fillna(value=5)          # replaces all missing values with 5
mean = df2['column1'].mean()
df2['column1'].fillna(mean)  # replaces all missing values of column1 with the mean of the available values
```

```python
from pandas import Series, DataFrame
import pandas as pd

ser = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
ser
ser.drop('c')
```

`.drop()` returns a new object; the original object is not changed.

```python
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df

# Drop the columns where all elements are NaN
df.dropna(axis=1, how='all')
#      A    B  D
# 0  NaN  2.0  0
# 1  3.0  4.0  1
# 2  NaN  NaN  5

# Drop the columns where any of the elements is NaN
df.dropna(axis=1, how='any')
#    D
# 0  0
# 1  1
# 2  5

# Drop the rows where all of the elements are NaN
# (there is no such row, so df stays the same)
df.dropna(axis=0, how='all')
#      A    B   C  D
# 0  NaN  2.0 NaN  0
# 1  3.0  4.0 NaN  1
# 2  NaN  NaN NaN  5

# Drop the rows where any of the elements is NaN
df.dropna(axis=0, how='any')
# Empty DataFrame
# Columns: [A, B, C, D]
# Index: []

# Keep only the rows with at least 2 non-NaN values
df.dropna(thresh=2)
#      A    B   C  D
# 0  NaN  2.0 NaN  0
# 1  3.0  4.0 NaN  1

# The default axis is the row, so how='all' drops nothing here:
df.dropna(how='all')
# while how='any' (the default) drops every row:
df.dropna(how='any')
# Empty DataFrame
# Columns: [A, B, C, D]
# Index: []

dfnew = pd.DataFrame([[3435234, 2, 5666, 0],
                      [3, 4, np.nan, 1],
                      [np.nan, np.nan, np.nan, 5]],
                     columns=list('ABCD'))
dfnew.dropna()  # by default operates on rows, dropping any that contain NaN
#          A  B       C  D
# 0  3435234  2  5666.0  0
```

Handling missing data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['A', 'B', 'C', 'D'])
print(df)
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan
print(df.dropna(axis=0, how='any'))  # how in {'any', 'all'}; default is 'any'
print(df.dropna(axis=1, how='all'))
# fill in missing data
print(df.fillna(value=0))
# flag missing data
print(df.isnull())
# detect whether any data at all is missing, useful when the frame is large
print(np.any(df.isnull()) == True)
```
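A minimal runnable sketch of `dropna` and `fillna`, including the fill-with-the-mean idiom from above (the frame and column names are invented here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0]})

dropped = df.dropna()          # drops every row containing a NaN; only the last row survives
filled = df.fillna(df.mean())  # fills each column's NaNs with that column's mean

print(dropped.shape)           # (1, 2)
print(filled['A'].tolist())    # [1.0, 2.0, 3.0], since the mean of 1.0 and 3.0 is 2.0
```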
***6. Creating New Columns***

A new column is a function of existing columns:

```python
df['NewColumn1'] = df['column2']                  # create a copy of existing column2
df['NewColumn2'] = df['column2'] + 10             # add 10 to existing column2, store as a new column
df['NewColumn3'] = df['column1'] + df['column2']  # element-wise sum of column1 and column2
```

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 3, 5, np.nan, 55, 2])
print(s)

dates = pd.date_range('20160101', periods=6)
df = pd.DataFrame(np.random.random((6, 4)), index=dates,
                  columns=['a', 'b', 'c', 'd'])
print(df)

df1 = pd.DataFrame(np.arange(12).reshape((3, 4)))
print(df1)

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20170101'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(['test', 'train', 'test', 'train']),
                    'F': 'foo'})
print(df2.dtypes)
print(df2.columns)
print(df2.values)
print(df2.describe())
print(df2.T)
print(df2.sort_index(axis=1, ascending=False))
df2.sort_values(by='E')

# add an empty column
df['F'] = np.nan
print(df)
# add a column aligned by index
df['E'] = pd.Series([1, 2, 3, 4, 5, 6],
                    index=pd.date_range('20160101', periods=6))
print(df)
```

***7. Aggregate***

a. groupby performs three operations:

  i. splitting the data into groups
  ii. applying a function to each group individually
  iii. combining the results into a data structure

```python
df.groupby('column1').sum()
df.groupby(['column1', 'column2']).count()
```

b. Pivot table: generates a data structure with three components, index, columns, and values (similar to Excel).

```python
pd.pivot_table(df, values='column1', index=['column2', 'column3'], columns=['column4'])
# By default it shows the sum of the values column; change this with aggfunc:
pd.pivot_table(df, values='column1', index=['column2', 'column3'], columns=['column4'],
               aggfunc=len)  # shows counts instead
```

c. Cross tab: computes a simple cross-tabulation of two factors.

```python
pd.crosstab(df.column1, df.column2)
```
***8. Merging / Concatenating DataFrames***

a. Concatenating: concatenates two or more data frames along their columns.

```python
pd.concat([df1, df2])
```

b. Merging: left, right, inner, and outer joins are all supported.

```python
pd.merge(df1, df2, on='column1', how='inner')
pd.merge(df1, df2, on='column1', how='left')
pd.merge(df1, df2, on='column1', how='right')
pd.merge(df1, df2, on='column1', how='outer')
```

pandas concat:

```python
import pandas as pd
import numpy as np

# concatenating
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])
print(df1)
print(df2)
print(df3)

result = pd.concat([df1, df2, df3], axis=0)  # concatenate rows
print(result)
# result1 = pd.concat([df1, df2, df3], axis=1)  # concatenate columns
# print(result1)
result = pd.concat([df1, df2, df3], axis=0, ignore_index=True)  # concatenate rows, ignoring the index
print(result)

# join in {'inner', 'outer'}
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'], index=[2, 3, 4])
print(df1)
print(df2)
result2 = pd.concat([df1, df2], join='outer', ignore_index=True)  # missing entries filled with NaN
print(result2)
result22 = pd.concat([df1, df2], join='outer')
print(result22)
result3 = pd.concat([df1, df2], join='inner', ignore_index=True)  # non-shared labels are dropped
print(result3)
result33 = pd.concat([df1, df2], join='inner')
print(result33)

# join_axes was removed in pandas 1.0; use .reindex instead
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'], index=[2, 3, 4])
res = pd.concat([df1, df2], axis=1).reindex(df1.index)
print(res)
res1 = pd.concat([df1, df2], axis=1)
print(res1)

# append (deprecated in newer pandas; prefer pd.concat)
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
res11 = df1.append(df2, ignore_index=True)
print(res11)
res12 = df1.append([df2, df3], ignore_index=True)
print(res12)
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
res13 = df1.append(s1, ignore_index=True)
print(res13)
```

pandas merge:

```python
import pandas as pd

# merging two data frames by key/keys, as in a database join
# simple example
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
res14 = pd.merge(left, right, on='key')
print(res14)

# consider two keys
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
res15 = pd.merge(left, right, on=['key1', 'key2'])
print(res15)
# how in {'left', 'right', 'inner', 'outer'}
res16 = pd.merge(left, right, on=['key1', 'key2'], how='inner')
print(res16)
```

***9. Applying a Function to an Element, Column, or DataFrame***

a. map iterates over each element of a Series:

```python
df['column1'].map(lambda x: 10 + x)    # adds 10 to each element of column1
df['column2'].map(lambda x: 'AV' + x)  # prepends 'AV' to each element of column2 (a string column)
```

b. apply, as the name suggests, applies a function along an axis of the DataFrame:

```python
df[['column1', 'column2']].apply(sum)  # returns the sums of all the values of column1 and column2
```

c. applymap applies a function to each element of the DataFrame:

```python
func = lambda x: x + 2
df.applymap(func)  # adds 2 to each element (all columns must be numeric)
```

***10. Identifying Unique Values***

The `unique` function returns the unique values of a column:

```python
df['Column1'].unique()
```
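A small runnable sketch tying the merge and element-wise `map` operations above together (keys and values invented):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': [1, 2, 3]})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'B': [10, 20, 30]})

inner = pd.merge(left, right, on='key', how='inner')  # only keys K1 and K2 appear in both
outer = pd.merge(left, right, on='key', how='outer')  # all keys K0..K3, with NaN where missing
print(len(inner), len(outer))                         # 2 4

# map applies a function to every element of a Series
inner['C'] = inner['A'].map(lambda x: x + 100)
print(inner['C'].tolist())                            # [102, 103]
```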
***11. Basic Stats***

Pandas helps to understand the data using basic statistical methods.

a. describe: returns quick stats (count, mean, std, min, first quartile, median, third quartile, max) on the numeric columns.

```python
df.describe()
```

b. covariance: returns the covariance between the numeric columns.

```python
df.cov()
```

c. correlation: returns the correlation between the numeric columns.

```python
df.corr()
```

A notebook version of these notes, Python-Pandas.ipynb, is available from [CSDN下载](http://download.csdn.net/detail/jiandanjinxin/9826981).

Sources:

- https://python.freelycode.com/contribution/detail/333
- https://python.freelycode.com/contribution/detail/334
- http://www.datadependence.com/2016/05/scientific-python-pandas/

***Scientific Python with Pandas***

```python
import pandas as pd  # the standard way to import pandas
```

Pandas data types: pandas is built on two data structures, Series and DataFrame.

- A Series is a one-dimensional data type in which every element carries a label. It is similar to a NumPy array whose elements are labelled; the labels can be numbers or strings.
- A DataFrame is a two-dimensional table structure. A pandas DataFrame can hold many different data types, and each axis has its own labels. You can think of it as a dictionary of Series.

To import data into pandas, we use the [UK government rainfall data](https://data.gov.uk/dataset/average-temperature-and-rainfall-england-and-wales/resource/3fea0f7b-5304-4f11-a809-159f4558e7da):

```python
# Reading a csv into pandas: the data is read from the CSV file and stored in a DataFrame.
# The header keyword tells pandas whether the data has column names, and where they are;
# if there are none, set header=None.
df = pd.read_csv('uk_rain_2014.csv', header=0)

# Get the data ready for exploration and analysis:
# quickly view the first x rows
df.head(5)
```
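Since the rainfall CSV is not bundled with these notes, here is the same trio of basic stats, `describe`, `cov`, and `corr`, on a tiny synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [2.0, 4.0, 6.0, 8.0]})

stats = df.describe()           # count, mean, std, min, quartiles, max per column
print(stats.loc['mean', 'x'])   # 2.5
print(df.cov().loc['x', 'y'])   # sample covariance of x and y
print(df.corr().loc['x', 'y'])  # 1.0, since y is a perfect linear function of x
```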
*(The output tables that followed lost their column headers when the page was scraped; they showed `df.head()` and `df.tail()` of the rainfall data, the statistics from `df.describe()`, rows sorted by rainfall, and groupby/unstack summaries, and are omitted here.)*
References

- Scientific Python with Pandas, part 1 (Python科学计算之Pandas 上)
- Scientific Python with Pandas, part 2 (Python科学计算之Pandas 下)
- An Introduction to Scientific Python – Pandas
- CheatSheet: Data Exploration using Pandas in Python
- 13 essential cheat sheets for getting started with machine learning (机器学习入门必备的13张小抄)
- NumPy tutorial, pandas tutorial: an introduction to scientific computing with Python
創(chuàng)作挑戰(zhàn)賽新人創(chuàng)作獎(jiǎng)勵(lì)來(lái)咯,堅(jiān)持創(chuàng)作打卡瓜分現(xiàn)金大獎(jiǎng)總結(jié)
以上是生活随笔為你收集整理的Python 中的Pandas库的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。
- 上一篇: 光学镜头参数详解(EFL、TTL、BFL
- 下一篇: Python 中的绘图matplotli