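Introduction to pandas
pandas is a data analysis library built on top of NumPy. It provides standard data structures, such as the Series and the DataFrame, together with a large set of functions and methods for working with large datasets quickly and conveniently. It is one of the main reasons Python is a powerful and efficient environment for data analysis.
pandas basics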
import pandas
food_info = pandas.read_csv("food_info.csv")
print(type(food_info))
print(food_info.dtypes)
print(food_info.head(3))
print(food_info.tail(4))
print(food_info.columns)
print(food_info.shape)
Open a file:
food_info = pandas.read_csv("food_info.csv")
Print its type:
print(type(food_info))
Print the data type of each column:
print(food_info.dtypes)
Print the first three rows and the last four rows:
print(food_info.head(3))
print(food_info.tail(4))
Print all the column names and the size of the data:
print(food_info.columns)
print(food_info.shape)
(8618 is the number of samples, i.e. rows, and 36 is the number of features, i.e. columns)
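The inspection calls above can be tried without the CSV file. The following is a minimal sketch on a made-up five-row table; the name toy_food and all the values are assumptions for illustration, and only the column names echo the original file.

```python
import pandas as pd

# Made-up stand-in for food_info.csv; the values are illustrative only.
toy_food = pd.DataFrame({
    "NDB_No": [1001, 1002, 1003, 1004, 1005],
    "Shrt_Desc": ["BUTTER", "CHEESE", "MILK", "EGG", "YOGURT"],
    "Water_(g)": [15.87, 42.41, 88.13, 76.15, 85.07],
})

print(type(toy_food))             # the DataFrame class
print(toy_food.dtypes)            # one dtype per column
print(toy_food.head(3))           # first three rows
print(toy_food.tail(4))           # last four rows
print(toy_food.columns.tolist())  # column names as a plain list
print(toy_food.shape)             # (number of rows, number of columns)
```

On this toy table, shape is a (rows, columns) tuple in the same order as the 8618 and 36 reported for the real file.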
Print the first row:
print(food_info.loc[0])
Slice a range of rows:
print(food_info.loc[3:6])
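One detail worth checking is that .loc slices by label and includes both endpoints, unlike ordinary Python slicing. A minimal sketch on a made-up eight-row table (the name toy and its values are assumptions for illustration), comparing .loc with the position-based .iloc:

```python
import pandas as pd

# Made-up table with the default integer index 0..7.
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60, 70, 80]})

by_label = toy.loc[3:6]      # label-based: rows labeled 3, 4, 5 and 6
by_position = toy.iloc[3:6]  # position-based: rows at positions 3, 4 and 5

print(by_label.index.tolist())
print(by_position.index.tolist())
```

So food_info.loc[3:6] would be expected to return four rows, not three, as long as the index is the default integer range.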
Select a single column by its name:
ndb_col = food_info["NDB_No"]
print(ndb_col)
Select several columns, again by name:
columns = ["Shrt_Desc", "Water_(g)"]
zinc_copper = food_info[columns]
print(zinc_copper)
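The two selections above return different types: a single name gives a Series, while a list of names gives a DataFrame. A minimal sketch on a made-up two-row table (toy and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up two-row table; only the column names echo the original file.
toy = pd.DataFrame({
    "NDB_No": [1001, 1002],
    "Shrt_Desc": ["BUTTER", "CHEESE"],
    "Water_(g)": [15.87, 42.41],
})

one_column = toy["NDB_No"]                       # a single name returns a Series
several_columns = toy[["Shrt_Desc", "Water_(g)"]]  # a list of names returns a DataFrame

print(type(one_column).__name__)
print(type(several_columns).__name__)
```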
Select specific columns (those measured in grams):
col_names = food_info.columns.tolist()
print(col_names)
gram_columns = []
for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head(3))
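A list first collects the names of the columns measured in grams, and the first three rows of those columns are then printed.
Select a column and apply arithmetic to every value in it:
print(food_info["Iron_(mg)"])
# Dividing a column by a number divides every value in the column
div_1000 = food_info["Iron_(mg)"] / 1000
print(div_1000)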
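Multiply two columns together and create a new column:
# Multiplying two columns multiplies them row by row
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
iron_grams = food_info["Iron_(mg)"] / 1000
print(food_info.shape)
# Assigning to a new name adds a column, so the second shape has one more column
food_info["Iron_(g)"] = iron_grams
print(food_info.shape)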
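The column arithmetic above works element by element. The following minimal sketch repeats the same steps on a made-up three-row table (toy and its values are assumptions for illustration) so the results can be checked by hand:

```python
import pandas as pd

# Made-up values chosen so the arithmetic is easy to verify.
toy = pd.DataFrame({
    "Iron_(mg)": [2.0, 500.0, 1000.0],
    "Water_(g)": [10.0, 20.0, 30.0],
    "Energ_Kcal": [1, 2, 3],
})

iron_grams = toy["Iron_(mg)"] / 1000                 # scalar applied to every row
water_energy = toy["Water_(g)"] * toy["Energ_Kcal"]  # row-by-row product

shape_before = toy.shape
toy["Iron_(g)"] = iron_grams   # adds a fourth column
shape_after = toy.shape

print(iron_grams.tolist())
print(water_energy.tolist())
print(shape_before, shape_after)
```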
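Sort the data in ascending and descending order:
# inplace=True reorders food_info itself; ascending order is the default
food_info.sort_values("Sodium_(mg)", inplace=True)
print(food_info["Sodium_(mg)"])
# ascending=False sorts from largest to smallest
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
print(food_info["Sodium_(mg)"])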
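Two behaviors of sort_values are easy to miss: without inplace=True it returns a new DataFrame and leaves the original untouched, and missing values are placed last in both directions by default. A minimal sketch on a made-up column (toy and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up column containing one missing value.
toy = pd.DataFrame({"Sodium_(mg)": [30.0, np.nan, 5.0, 120.0]})

ascending = toy.sort_values("Sodium_(mg)")                    # 5, 30, 120, NaN
descending = toy.sort_values("Sodium_(mg)", ascending=False)  # 120, 30, 5, NaN

print(ascending.index.tolist())
print(descending.index.tolist())
print(toy.index.tolist())  # original order is unchanged
```

The original row labels travel with the rows, which is why the index is no longer 0, 1, 2, ... after sorting.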
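NaN (not a number) values in a column. Print the rows labeled 0 through 10:
# titanic_survival is assumed to be a DataFrame already loaded from the Titanic passenger data; the loading step is not shown here
import pandas as pd
age = titanic_survival["Age"]
print(age.loc[0:10])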
Check whether each value is NaN:
# The result is a Series of True/False values, True where Age is missing
age_is_null = pd.isnull(age)
print(age_is_null)
Print the rows whose value is NaN:
# Indexing with a boolean Series keeps only the rows marked True
age_null_true = age[age_is_null]
print(age_null_true)
Count the rows that are NaN:
age_null_count = len(age_null_true)
print(age_null_count)
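The three steps above (build a mask, filter with it, count the result) can be followed on a made-up five-value column. In this sketch the Series age and its values are assumptions for illustration; summing the mask is shown as an equivalent way to count, since True counts as 1.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries.
age = pd.Series([22.0, np.nan, 35.0, np.nan, 4.0], name="Age")

age_is_null = pd.isnull(age)   # boolean mask, True where the value is missing
missing = age[age_is_null]     # only the missing entries, original labels kept

count_via_len = len(missing)
count_via_sum = int(age_is_null.sum())  # True counts as 1

print(missing.index.tolist())
print(count_via_len, count_via_sum)
```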
Computing the mean age directly, without handling NaN:
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print(mean_age)  # the result is nan if any Age value is missing
Computing the mean age after removing the NaN values:
# ~age_is_null inverts the mask, keeping only the rows where Age is present
good_ages = titanic_survival["Age"][~age_is_null]
print(good_ages)
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)
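The difference between the two calculations comes from how NaN behaves in arithmetic: any sum that includes NaN is itself NaN. A minimal sketch on a made-up three-value column (age and its values are assumptions for illustration):

```python
import math

import numpy as np
import pandas as pd

# Made-up ages with one missing entry.
age = pd.Series([20.0, np.nan, 40.0])

naive_mean = sum(age) / len(age)         # NaN propagates through the sum

good_ages = age[~pd.isnull(age)]         # keep only the present values
manual_mean = sum(good_ages) / len(good_ages)

print(naive_mean)
print(manual_mean)
```

Note that the filtered version also divides by the number of present values (2 here) rather than by the full length (3).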
pandas has a built-in method that skips NaN values when computing the mean:
correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)
The two results agree.
Compute the average fare for each passenger class:
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    # Keep only the rows for this class, then average their fares
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print(fares_by_class)
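The loop above filters the table once per class. As a sketch of an alternative that the original does not use, groupby can compute the same per-class means in one call. The table toy and its values are made up for illustration.

```python
import pandas as pd

# Made-up fares for three passenger classes.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 100.0, 20.0, 6.0, 8.0, 10.0],
})

# Loop version, following the same steps as the text above.
fares_by_class = {}
for this_class in [1, 2, 3]:
    pclass_rows = toy[toy["Pclass"] == this_class]
    fares_by_class[this_class] = pclass_rows["Fare"].mean()

# groupby version: split by class, take the mean fare of each group.
grouped = toy.groupby("Pclass")["Fare"].mean()

print(fares_by_class)
print(grouped.to_dict())
```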
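Summarize relationships between columns with pivot tables:
# Mean of Survived per class, i.e. the survival rate of each class
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
print(passenger_survival)
# aggfunc defaults to the mean, so this is the mean age per class
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
print(passenger_age)
# Total fare and total number of survivors per port of embarkation
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc="sum")
print(port_stats)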
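To see what a pivot table returns, here is a minimal sketch on a made-up five-passenger table (toy and its values are assumptions for illustration). It passes the aggregation as a string name, which pivot_table accepts in place of a function.

```python
import pandas as pd

# Made-up passengers; only the column names echo the Titanic data.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Survived": [1, 0, 1, 0, 0],
    "Fare": [80.0, 100.0, 20.0, 6.0, 8.0],
    "Embarked": ["S", "C", "S", "S", "C"],
})

# One row per class, holding the mean of Survived for that class.
survival_rate = toy.pivot_table(index="Pclass", values="Survived", aggfunc="mean")

# One row per port, holding the sums of Fare and Survived for that port.
port_totals = toy.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc="sum")

print(survival_rate)
print(port_totals)
```

The index argument decides the rows of the result, values decides which columns are aggregated, and aggfunc decides how.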
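Drop missing values:
# axis=1 drops every column that contains at least one missing value
drop_na_columns = titanic_survival.dropna(axis=1)
# axis=0 with subset drops every row that is missing "Age" or "Sex"
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])
print(new_titanic_survival)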
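The effect of axis and subset is easiest to see on a small table. A minimal sketch with made-up data (toy and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up table: "Age" and "Sex" each have one missing value, "Name" has none.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 35.0, 4.0],
    "Sex": ["male", "female", None, "female"],
})

no_missing_columns = toy.dropna(axis=1)                       # keeps only "Name"
complete_age_sex = toy.dropna(axis=0, subset=["Age", "Sex"])  # keeps rows 0 and 3

print(no_missing_columns.columns.tolist())
print(complete_age_sex.index.tolist())
```

Both calls return new DataFrames, so toy itself still has all four rows and three columns.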
Look up a value by row label and column name:
row_index_83_age = titanic_survival.loc[83, "Age"]
row_index_766_pclass = titanic_survival.loc[766, "Pclass"]
print(row_index_83_age)
print(row_index_766_pclass)
Sort by a column and reset the index:
new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)
# After sorting, the rows keep their original labels
print(new_titanic_survival[0:10])
# drop=True discards the old labels and renumbers the rows from 0
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.loc[0:10])
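Resetting the index matters because .loc looks rows up by label, and after a sort the labels no longer match the row positions. A minimal sketch on a made-up four-row table (toy and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up ages with the default index 0..3.
toy = pd.DataFrame({"Age": [22.0, 80.0, 35.0, 4.0]})

by_age = toy.sort_values("Age", ascending=False)  # labels follow their rows
reindexed = by_age.reset_index(drop=True)         # labels renumbered 0..3

print(by_age.index.tolist())
print(reindexed.index.tolist())
print(by_age.loc[0, "Age"], reindexed.loc[0, "Age"])
```

In the sorted table, label 0 still points at the original first row; only after reset_index does label 0 point at the oldest passenger.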
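Define a function that returns the 100th value of each column:
def hundredth_row(column):
    # .loc[99] looks up the label 99, which is the 100th row under the default index
    hundredth_item = column.loc[99]
    return hundredth_item
# apply passes each column to the function as a Series
hundredth_values = titanic_survival.apply(hundredth_row)
print(hundredth_values)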
Define a function that counts the missing values in each column:
def null_count(column):
    # Keep only the missing entries of the column and count them
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)
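By default, apply calls the function once per column and collects the results into a Series indexed by column name. A minimal sketch on a made-up table (toy and its values are assumptions for illustration), with the built-in isnull().sum() shown as a cross-check that is not part of the original:

```python
import numpy as np
import pandas as pd

# Made-up table: two missing ages, one missing fare, no missing classes.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Pclass": [3, 1, 3, 2],
})


def null_count(column):
    """Return the number of missing values in one column (a Series)."""
    return len(column[pd.isnull(column)])


via_apply = toy.apply(null_count)    # one call per column
via_builtin = toy.isnull().sum()     # same counts without a custom function

print(via_apply.to_dict())
print(via_builtin.to_dict())
```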
Define a function that converts each row's class number into a label:
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"
# axis=1 calls the function once per row instead of once per column
classes = titanic_survival.apply(which_class, axis=1)
print(classes)
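Define a function that checks whether a passenger is a minor:
def is_minor(row):
    # A missing Age compares as False here, so it is reported as not a minor
    if row["Age"] < 18:
        return True
    else:
        return False
minors = titanic_survival.apply(is_minor, axis=1)
print(minors)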
Define a function that returns a label based on age:
def generate_age_label(row):
    age = row["Age"]
    # Handle a missing age first, because NaN < 18 is False
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"
age_labels = titanic_survival.apply(generate_age_label, axis=1)
print(age_labels)
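The explicit NaN check in the labeling function matters because comparing NaN with a number is always False, so a bare age < 18 test would silently treat missing ages as adults. A minimal sketch on a made-up column (toy and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry.
toy = pd.DataFrame({"Age": [4.0, 30.0, np.nan, 17.0]})


def generate_age_label(row):
    """Label one row as 'unknown', 'minor' or 'adult' based on its Age."""
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    return "adult"


labels = toy.apply(generate_age_label, axis=1)  # one call per row
bare_comparison = toy["Age"] < 18               # NaN compares as False

print(labels.tolist())
print(bare_comparison.tolist())
```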