The pandas category data type

Published: 2023/12/18

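In day-to-day work with pandas you will often run into categorical data. It is usually displayed as strings: colors (red, green, blue), sizes (large, medium, small), geographic information (country, province), and so on. Such data causes all sorts of problems during processing. Both pandas and scikit-learn can convert categorical data into a suitable numeric format, and this post walks through that conversion, i.e. the encoding process, using the two packages.
The data comes from the UCI Machine Learning Repository. The dataset contains many categorical columns, and the linked repository page describes what each column means.
We start by importing the packages we need.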

import numpy as np
import pandas as pd

# Define the column names.
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read the CSV file with pandas, treating "?" as a missing value (NaN).
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?")
df.head()

	symboling	normalized_losses	make	fuel_type	aspiration	num_doors	body_style	drive_wheels	engine_location	wheel_base	...	engine_size	fuel_system	bore	stroke	compression_ratio	horsepower	peak_rpm	city_mpg	highway_mpg	price
0	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111.0	5000.0	21	27	13495.0
1	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111.0	5000.0	21	27	16500.0
2	1	NaN	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	152	mpfi	2.68	3.47	9.0	154.0	5000.0	19	26	16500.0
3	2	164.0	audi	gas	std	four	sedan	fwd	front	99.8	...	109	mpfi	3.19	3.40	10.0	102.0	5500.0	24	30	13950.0
4	2	164.0	audi	gas	std	four	sedan	4wd	front	99.4	...	136	mpfi	3.19	3.40	8.0	115.0	5500.0	18	22	17450.0

5 rows × 26 columns

df.dtypes

symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

# If only the categorical data matters, there is no need to carry every column around.
# Select the object columns and run the rest of the analysis on those.
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()

	make	fuel_type	aspiration	num_doors	body_style	drive_wheels	engine_location	engine_type	num_cylinders	fuel_system
0	alfa-romero	gas	std	two	convertible	rwd	front	dohc	four	mpfi
1	alfa-romero	gas	std	two	convertible	rwd	front	dohc	four	mpfi
2	alfa-romero	gas	std	two	hatchback	rwd	front	ohcv	six	mpfi
3	audi	gas	std	four	sedan	fwd	front	ohc	four	mpfi
4	audi	gas	std	four	sedan	4wd	front	ohc	five	mpfi

# Before going any further, check for missing values.
# any(axis=1) looks across the columns of each row, so this selects every row with at least one null.
obj_df[obj_df.isnull().any(axis=1)]

	make	fuel_type	aspiration	num_doors	body_style	drive_wheels	engine_location	engine_type	num_cylinders	fuel_system
27	dodge	gas	turbo	NaN	sedan	fwd	front	ohc	four	mpfi
63	mazda	diesel	std	NaN	sedan	fwd	front	ohc	four	idi

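# There are many ways to handle missing values: depending on the project, either fill them in or drop the affected rows.
# Here the missing values are filled with the mode (most frequent value) of the column.
obj_df.num_doors.value_counts()

four    114
two      89
Name: num_doors, dtype: int64

obj_df = obj_df.fillna({"num_doors": "four"})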
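The fill value above is typed in by hand after reading the value_counts output. As a minimal sketch, the mode can also be computed directly; the `toy` frame and its values below are made up for illustration and only stand in for obj_df.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for obj_df (values invented for illustration).
toy = pd.DataFrame({"num_doors": ["four", "two", "four", np.nan, "four", np.nan]})

# Series.mode() returns the most frequent value(s) and ignores NaN by default.
most_common = toy["num_doors"].mode()[0]

# Fill only the num_doors column; any other columns would be left untouched.
filled = toy.fillna({"num_doors": most_common})

print(most_common)
print(filled["num_doors"].isnull().sum())
```

Computing the mode this way avoids a stale hard-coded value if the data changes, at the cost of hiding the choice from the reader of the code.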

Once the missing values are handled, the categorical data can be encoded in several ways:

Find and replace
Label encoding
One hot encoding
Custom binary encoding
scikit-learn
Advanced approaches

# The documentation for pandas replace is extensive: it takes many parameters and is correspondingly powerful.
# Here replace is given a dictionary of mappings, which makes it easy to clean exactly the columns that need it:
cleanup_nums = {"num_doors": {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                                  "two": 2, "twelve": 12, "three": 3}}
obj_df.replace(cleanup_nums, inplace=True)
obj_df.head()

	make	fuel_type	aspiration	num_doors	body_style	drive_wheels	engine_location	engine_type	num_cylinders	fuel_system
0	alfa-romero	gas	std	2	convertible	rwd	front	dohc	4	mpfi
1	alfa-romero	gas	std	2	convertible	rwd	front	dohc	4	mpfi
2	alfa-romero	gas	std	2	hatchback	rwd	front	ohcv	6	mpfi
3	audi	gas	std	4	sedan	fwd	front	ohc	4	mpfi
4	audi	gas	std	4	sedan	4wd	front	ohc	5	mpfi
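The same kind of word-to-number mapping can also be written with Series.map, one column at a time. This is an alternative sketch, not the replace call used in this post; the `toy` frame is made up for illustration. One difference worth knowing: with map, any value missing from the dictionary becomes NaN, whereas replace leaves unmatched values as they are.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration).
toy = pd.DataFrame({
    "num_doors": ["two", "four", "four"],
    "num_cylinders": ["four", "six", "five"],
})

# Same nested-dictionary layout as cleanup_nums: column name -> {old value: new value}.
cleanup_nums = {
    "num_doors": {"four": 4, "two": 2},
    "num_cylinders": {"four": 4, "six": 6, "five": 5},
}

# Apply each mapping to its own column.
for column, mapping in cleanup_nums.items():
    toy[column] = toy[column].map(mapping)

print(toy)
```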

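Label encoding converts a set of unordered values, which cannot be compared by size, into numbers.

For example, the body_style column holds several distinct values, and this method maps them as follows:
convertible -> 0
hardtop -> 1
hatchback -> 2
sedan -> 3
wagon -> 4

This is much like encoding with a cipher. The comparison amuses me; it reminds me of a line from a film I once saw: the two of them were as close as a pair of thieves.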

# With the pandas category dtype, these codes are easy to obtain.
obj_df["body_style"] = obj_df["body_style"].astype("category")
obj_df.dtypes

make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
dtype: object

# Store the codes in a new column.
# This keeps the data tidy and makes later analysis and cleanup easier.
obj_df["body_style_code"] = obj_df["body_style"].cat.codes
obj_df.head()

	make	fuel_type	aspiration	num_doors	body_style	drive_wheels	engine_location	engine_type	num_cylinders	fuel_system	body_style_code
0	alfa-romero	gas	std	2	convertible	rwd	front	dohc	4	mpfi	0
1	alfa-romero	gas	std	2	convertible	rwd	front	dohc	4	mpfi	0
2	alfa-romero	gas	std	2	hatchback	rwd	front	ohcv	6	mpfi	2
3	audi	gas	std	4	sedan	fwd	front	ohc	4	mpfi	3
4	audi	gas	std	4	sedan	4wd	front	ohc	5	mpfi	3
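To see which label each code stands for, the categories can be read back from the column. The sketch below uses a made-up `toy` frame with the five body styles listed earlier; it relies on pandas sorting inferred string categories alphabetically, so code i corresponds to categories[i].

```python
import pandas as pd

# Hypothetical toy data containing the five body styles (order invented for illustration).
toy = pd.DataFrame({"body_style": ["sedan", "convertible", "wagon", "hatchback", "hardtop"]})

# Convert to the category dtype and store the integer codes in a new column.
toy["body_style"] = toy["body_style"].astype("category")
toy["body_style_code"] = toy["body_style"].cat.codes

# Build a code -> label lookup from the (alphabetically sorted) categories.
code_to_label = dict(enumerate(toy["body_style"].cat.categories))

print(code_to_label)
print(toy)
```

Keeping such a lookup next to the encoded column makes it possible to translate the numbers back into labels later.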

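One hot encoding

Label encoding turns wagon into 4 and convertible into 0, which suggests an ordering between the values that does not exist and can be misleading. One hot encoding avoids this by converting each value of the feature into its own column holding 0 or 1. This increases the number of columns, but removes the false sense of magnitude that label encoding introduces.
pandas provides the get_dummies function, which converts the values of the chosen columns into 0/1 indicator columns.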

# The resulting DataFrame contains three new columns:
# drive_wheels_4wd
# drive_wheels_fwd
# drive_wheels_rwd
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()

	make	fuel_type	aspiration	num_doors	body_style	engine_location	engine_type	num_cylinders	fuel_system	body_style_code	drive_wheels_4wd	drive_wheels_fwd	drive_wheels_rwd
0	alfa-romero	gas	std	2	convertible	front	dohc	4	mpfi	0	0	0	1
1	alfa-romero	gas	std	2	convertible	front	dohc	4	mpfi	0	0	0	1
2	alfa-romero	gas	std	2	hatchback	front	ohcv	6	mpfi	2	0	0	1
3	audi	gas	std	4	sedan	front	ohc	4	mpfi	3	0	1	0
4	audi	gas	std	4	sedan	front	ohc	5	mpfi	3	1	0	0

# get_dummies is powerful because it can process several categorical columns at once, with a prefix chosen for each.
# The resulting DataFrame still contains all of the other columns.
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()

	make	fuel_type	aspiration	num_doors	engine_location	engine_type	num_cylinders	fuel_system	body_style_code	body_convertible	body_hardtop	body_hatchback	body_sedan	body_wagon	drive_4wd	drive_fwd	drive_rwd
0	alfa-romero	gas	std	2	front	dohc	4	mpfi	0	1	0	0	0	0	0	0	1
1	alfa-romero	gas	std	2	front	dohc	4	mpfi	0	1	0	0	0	0	0	0	1
2	alfa-romero	gas	std	2	front	ohcv	6	mpfi	2	0	0	1	0	0	0	0	1
3	audi	gas	std	4	front	ohc	4	mpfi	3	0	0	0	1	0	0	1	0
4	audi	gas	std	4	front	ohc	5	mpfi	3	0	0	0	1	0	1	0	0
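The outputs above show the indicator columns as 0 and 1. Newer pandas releases return boolean columns from get_dummies by default, so the sketch below passes dtype=int to ask for integers explicitly. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration).
toy = pd.DataFrame({
    "make": ["alfa-romero", "audi", "audi"],
    "drive_wheels": ["rwd", "fwd", "4wd"],
})

# One indicator column per distinct drive_wheels value, named with the "drive" prefix.
# dtype=int requests 0/1 integers regardless of the library's default dtype.
dummies = pd.get_dummies(toy, columns=["drive_wheels"], prefix=["drive"], dtype=int)

print(dummies)
```

Each row has exactly one 1 among the drive_ columns, which is the defining property of one hot encoding.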

Custom 0/1 encoding

Sometimes, depending on business needs, label encoding and one hot encoding are combined to binarize a column.

obj_df["engine_type"].value_counts()

ohc      148
ohcf      15
ohcv      13
dohc      12
l         12
rotor      4
dohcv      1
Name: engine_type, dtype: int64

# Sometimes all we need to know is whether engine_type is an OHC design, in which case the column can be binarized.
# This also shows how domain knowledge helps solve a problem in the most effective way.
obj_df["engine_type_code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)
obj_df[["make", "engine_type", "engine_type_code"]].head()

	make	engine_type	engine_type_code
0	alfa-romero	dohc	1
1	alfa-romero	dohc	1
2	alfa-romero	ohcv	1
3	audi	ohc	1
4	audi	ohc	1
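The first five rows all contain "ohc", so the 0 case never shows up in the output above. The sketch below uses a made-up `toy` frame that includes non-matching engine types and a missing value. Passing na=False to str.contains is an addition not in the original code; it makes the handling of missing values explicit by treating them as non-matches.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three OHC variants, two other types, one missing value.
toy = pd.DataFrame({"engine_type": ["dohc", "ohcv", "l", "rotor", "ohc", np.nan]})

# str.contains does a substring match, so "dohc" and "ohcv" both count as OHC.
# na=False (an assumption added here) makes missing values count as "no match".
is_ohc = toy["engine_type"].str.contains("ohc", na=False)

# Turn the boolean mask into a 1/0 column.
toy["engine_type_code"] = np.where(is_ohc, 1, 0)

print(toy)
```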

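Data transformation in scikit-learn

The sklearn.preprocessing module provides many convenient data transformers as well as missing-value handling (Imputer at the time of writing; newer scikit-learn releases provide SimpleImputer in sklearn.impute instead). From it you can import LabelEncoder and LabelBinarizer, along with tools for scaling to the [0, 1] range (min-max scaling), normalization with Normalizer (L1 or L2, not used very often), standardization, non-linear transformations, and polynomial feature generation (PolynomialFeatures). Most of these bring each feature into the same range or distribution.
Link to the official documentation of the sklearn.preprocessing module
Official documentation of the category_encoders package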
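As a minimal sketch of the two encoders named above, the example below applies LabelEncoder and LabelBinarizer to a made-up `toy` column of body styles. Note that the scikit-learn documentation describes LabelEncoder as intended for target labels rather than input features, so using it on a feature column here is only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Hypothetical toy data containing the five body styles (order invented for illustration).
toy = pd.DataFrame({"body_style": ["sedan", "convertible", "wagon", "hatchback", "hardtop"]})

# LabelEncoder assigns one integer per class; classes_ holds the sorted labels.
encoder = LabelEncoder()
codes = encoder.fit_transform(toy["body_style"])

# LabelBinarizer produces one 0/1 column per class (for more than two classes).
binarizer = LabelBinarizer()
one_hot = binarizer.fit_transform(toy["body_style"])
one_hot_df = pd.DataFrame(one_hot, columns=binarizer.classes_)

print(list(encoder.classes_))
print(codes)
print(one_hot_df)
```

With this toy input, the integer codes agree with the ones obtained from the pandas category dtype, because both sort the labels alphabetically.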

That covers the basics of data preprocessing and categorical encoding.

posted on 2018-08-02 15:53 多一点
