pandas category data type
- When working with pandas you often run into categorical data, usually displayed as strings: colors (red, green, blue), sizes (large, medium, small), geographic information (country, province), and so on. Such data causes all sorts of processing problems. Both pandas and scikit-learn can convert categorical data into a suitable numeric format, and this post walks through that conversion, i.e. the encoding process, with both packages.
- The data comes from the UCI Machine Learning Repository. The dataset contains many categorical columns; the meaning of each column can be looked up via the link.
- Start by importing the required packages.
import numpy as np
import pandas as pd

# Define the names of the data columns
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]
# Read the CSV file with pandas, treating "?" as NaN (missing value)
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?")
df.head()
| | symboling | normalized_losses | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
| 1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
| 2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
| 3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
| 4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |
5 rows × 26 columns
df.dtypes
symboling int64
normalized_losses float64
make object
fuel_type object
aspiration object
num_doors object
body_style object
drive_wheels object
engine_location object
wheel_base float64
length float64
width float64
height float64
curb_weight int64
engine_type object
num_cylinders object
engine_size int64
fuel_system object
bore float64
stroke float64
compression_ratio float64
horsepower float64
peak_rpm float64
city_mpg int64
highway_mpg int64
price float64
dtype: object
# If we only care about the categorical columns, there is no need to carry the whole dataset around; just pull out the object-typed columns and continue the analysis with those
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
| 1 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
| 2 | alfa-romero | gas | std | two | hatchback | rwd | front | ohcv | six | mpfi |
| 3 | audi | gas | std | four | sedan | fwd | front | ohc | four | mpfi |
| 4 | audi | gas | std | four | sedan | 4wd | front | ohc | five | mpfi |
# Before moving on, check for missing values; any(axis=1) looks across the columns of each row, so this selects every row that has at least one NaN
obj_df[obj_df.isnull().any(axis=1)]
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system |
|---|---|---|---|---|---|---|---|---|---|---|
| 27 | dodge | gas | turbo | NaN | sedan | fwd | front | ohc | four | mpfi |
| 63 | mazda | diesel | std | NaN | sedan | fwd | front | ohc | four | idi |
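# There are many ways to handle missing values: depending on the project, either fill them in or drop the sample. Here the missing values are filled with the column's mode (its most frequent value).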
obj_df.num_doors.value_counts()
four 114
two 89
Name: num_doors, dtype: int64
obj_df = obj_df.fillna({"num_doors": "four"})
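The fill above hard-codes "four" after reading the counts by eye. As a minimal sketch, assuming a small toy frame rather than the real dataset, the mode can also be looked up programmatically; the names `toy`, `most_common` and `filled` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the num_doors column (values are illustrative only).
toy = pd.DataFrame({"num_doors": ["four", "two", "four", np.nan, "four", np.nan]})

# Rows with at least one missing value; any(axis=1) reduces across columns.
rows_with_nan = toy[toy.isnull().any(axis=1)]

# mode() ignores NaN and returns a Series of the most frequent values;
# take the first entry in case of ties.
most_common = toy["num_doors"].mode()[0]

# Passing a dict fills only the named column and leaves others untouched.
filled = toy.fillna({"num_doors": most_common})
print(most_common, filled["num_doors"].isnull().sum())
```

This keeps the fill value in sync with the data if the counts change, at the cost of hiding the choice from the reader.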
Once the missing values are handled, there are several ways to encode categorical data:
- Find and replace
- Label encoding
- One hot encoding
- Custom binary encoding
- sklearn
- Advanced approaches
# The documentation for pandas' replace is extensive: it accepts many parameters and is correspondingly powerful
# Here we use replace with a mapping dictionary, which makes it convenient to clean up the columns that need it, for example:
cleanup_nums = {
    "num_doors": {"four": 4, "two": 2},
    "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                      "two": 2, "twelve": 12, "three": 3},
}
obj_df.replace(cleanup_nums, inplace=True)
obj_df.head()
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi |
| 1 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi |
| 2 | alfa-romero | gas | std | 2 | hatchback | rwd | front | ohcv | 6 | mpfi |
| 3 | audi | gas | std | 4 | sedan | fwd | front | ohc | 4 | mpfi |
| 4 | audi | gas | std | 4 | sedan | 4wd | front | ohc | 5 | mpfi |
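One thing to watch with the dictionary approach is a value that is missing from the mapping: replace leaves it unchanged, so a stray string can survive unnoticed. The sketch below, on a made-up toy frame, uses Series.map instead, which yields NaN for unmapped values and so makes them easy to count. The names `toy`, `word_to_num` and `converted` are illustrative, not from the original post.

```python
import pandas as pd

# Toy data standing in for the two word-valued columns.
toy = pd.DataFrame({
    "num_doors": ["four", "two", "four"],
    "num_cylinders": ["four", "six", "twelve"],
})

# A single shared mapping; the real columns would need every word they contain.
word_to_num = {"two": 2, "four": 4, "six": 6, "twelve": 12}

converted = toy.copy()
for col in ["num_doors", "num_cylinders"]:
    # map() returns NaN for any value that is not a key in the dictionary.
    converted[col] = toy[col].map(word_to_num)

# Zero means every value was covered by the mapping.
unmapped = int(converted.isnull().sum().sum())
print(converted)
print("unmapped values:", unmapped)
```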
Label encoding converts a set of unordered values, which have no notion of larger or smaller, into numbers
- For example, the body_style column contains several distinct values, which can be converted with this method:
- convertible > 0
- hardtop > 1
- hatchback > 2
- sedan > 3
- wagon > 4
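This works much like a cipher. The analogy is an amusing one; it calls to mind a line from a movie: "the two of them were as close as a pair of thieves."
# With the pandas category dtype, this encoding is easy to obtain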
obj_df["body_style"]=obj_df["body_style"].astype("category")
obj_df.dtypes
make object
fuel_type object
aspiration object
num_doors int64
body_style category
drive_wheels object
engine_location object
engine_type object
num_cylinders int64
fuel_system object
dtype: object
# We can store the corresponding codes in a new column
# Keeping the codes next to the original values makes later analysis and cleanup easier
obj_df["body_style_code"] = obj_df["body_style"].cat.codes
obj_df.head()
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system | body_style_code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi | 0 |
| 1 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi | 0 |
| 2 | alfa-romero | gas | std | 2 | hatchback | rwd | front | ohcv | 6 | mpfi | 2 |
| 3 | audi | gas | std | 4 | sedan | fwd | front | ohc | 4 | mpfi | 3 |
| 4 | audi | gas | std | 4 | sedan | 4wd | front | ohc | 5 | mpfi | 3 |
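The codes are only useful if they can be traced back to their labels. The following sketch, on a made-up Series of body styles, recovers the code-to-label mapping from the category dtype. By default the categories are sorted, which is why the numbering matches the list shown earlier; missing values, if any, would receive the code -1.

```python
import pandas as pd

# Toy values mirroring the body_style column.
styles = pd.Series(["convertible", "hatchback", "sedan", "wagon", "hardtop", "sedan"])
cat = styles.astype("category")

# Each code is the position of the value in cat.cat.categories.
code_to_label = dict(enumerate(cat.cat.categories))
codes = cat.cat.codes.tolist()

print(code_to_label)
print(codes)
```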
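One hot encoding
- Label encoding turns wagon into 4 and convertible into 0, which suggests an ordering (is wagon "greater" than convertible?) and can be misleading. One hot encoding instead turns each category value into its own 0/1 column. This increases the number of columns, but removes the spurious magnitude comparison that label encoding introduces.
- pandas provides the get_dummies method, which converts the values of the chosen columns into 0/1 indicator columns.
# The resulting DataFrame contains three newly generated columns:
# drive_wheels_4wd
# drive_wheels_fwd
# drive_wheels_rwd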
pd.get_dummies(obj_df,columns=["drive_wheels"]).head()
| | make | fuel_type | aspiration | num_doors | body_style | engine_location | engine_type | num_cylinders | fuel_system | body_style_code | drive_wheels_4wd | drive_wheels_fwd | drive_wheels_rwd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | convertible | front | dohc | 4 | mpfi | 0 | 0 | 0 | 1 |
| 1 | alfa-romero | gas | std | 2 | convertible | front | dohc | 4 | mpfi | 0 | 0 | 0 | 1 |
| 2 | alfa-romero | gas | std | 2 | hatchback | front | ohcv | 6 | mpfi | 2 | 0 | 0 | 1 |
| 3 | audi | gas | std | 4 | sedan | front | ohc | 4 | mpfi | 3 | 0 | 1 | 0 |
| 4 | audi | gas | std | 4 | sedan | front | ohc | 5 | mpfi | 3 | 1 | 0 | 0 |
# What makes this method powerful is that it can handle several categorical columns at once, with a separate prefix for each
# The resulting DataFrame still contains all of the other columns
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()
| | make | fuel_type | aspiration | num_doors | engine_location | engine_type | num_cylinders | fuel_system | body_style_code | body_convertible | body_hardtop | body_hatchback | body_sedan | body_wagon | drive_4wd | drive_fwd | drive_rwd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | front | dohc | 4 | mpfi | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | alfa-romero | gas | std | 2 | front | dohc | 4 | mpfi | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | alfa-romero | gas | std | 2 | front | ohcv | 6 | mpfi | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | audi | gas | std | 4 | front | ohc | 4 | mpfi | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | audi | gas | std | 4 | front | ohc | 5 | mpfi | 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
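Two get_dummies options are worth knowing when reproducing the tables above. Recent pandas versions return boolean indicator columns by default, so passing dtype=int is one way to get the 0/1 integers shown here; and drop_first=True drops one level per column, which some linear models prefer. The sketch below uses a made-up three-row frame, and the names `toy`, `dummies` and `reduced` are illustrative.

```python
import pandas as pd

# Toy frame with one categorical column to expand.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "audi"],
    "drive_wheels": ["fwd", "rwd", "4wd"],
})

# dtype=int requests 0/1 integers instead of booleans.
dummies = pd.get_dummies(toy, columns=["drive_wheels"], prefix="drive", dtype=int)

# drop_first=True removes the first level (drive_4wd), leaving k-1 columns.
reduced = pd.get_dummies(toy, columns=["drive_wheels"], prefix="drive",
                         dtype=int, drop_first=True)

print(dummies)
print(reduced.columns.tolist())
```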
Custom binary (0/1) encoding
- Sometimes business requirements call for a binary encoding that combines the ideas of label encoding and one hot encoding.
obj_df["engine_type"].value_counts()
ohc 148
ohcf 15
ohcv 13
dohc 12
l 12
rotor 4
dohcv 1
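Name: engine_type, dtype: int64
# Sometimes we only want to know whether engine_type is an ohc (overhead cam) design; a binary flag on that column does the job
# This also shows how domain knowledge can solve a problem in the most effective way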
obj_df["engine_type_code"] = np.where(obj_df["engine_type"].str.contains("ohc"),1,0)
obj_df[["make","engine_type","engine_type_code"]].head()
| | make | engine_type | engine_type_code |
|---|---|---|---|
| 0 | alfa-romero | dohc | 1 |
| 1 | alfa-romero | dohc | 1 |
| 2 | alfa-romero | ohcv | 1 |
| 3 | audi | ohc | 1 |
| 4 | audi | ohc | 1 |
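The first five rows all happen to be ohc variants, so the table above never shows a 0. A small sketch on made-up values shows both outcomes, and also what happens with a missing value: passing na=False to str.contains treats NaN as "no match" rather than propagating it. The names `engines`, `is_ohc` and `codes` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy engine types, including non-ohc designs and one missing value.
engines = pd.Series(["dohc", "ohcv", "l", "rotor", "ohc", np.nan])

# Substring match; na=False maps missing values to False.
is_ohc = engines.str.contains("ohc", na=False)

# 1 where the engine type contains "ohc", 0 otherwise.
codes = np.where(is_ohc, 1, 0)
print(codes.tolist())
```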
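Data transformation in scikit-learn
- The sklearn.preprocessing module provides many convenient transformers, and LabelEncoder and LabelBinarizer can be imported directly from it. It also covers min-max scaling to the [0, 1] range (MinMaxScaler), Normalizer (L1/L2 normalization, used less often), standardization (StandardScaler), non-linear transformations, and polynomial feature generation (PolynomialFeatures), all of which bring features onto a comparable range or distribution. Missing-value imputation (Imputer in older releases) now lives in sklearn.impute as SimpleImputer.
- Official documentation for the sklearn.preprocessing module
- Official documentation for the category_encoders package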
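As a rough illustration of the two encoders named above, the sketch below applies LabelEncoder and LabelBinarizer to a made-up list of body styles. It assumes scikit-learn is installed. Note that scikit-learn documents both classes as intended for target labels; for feature columns, OrdinalEncoder and OneHotEncoder from the same module are the usual counterparts.

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Toy values mirroring the body_style column.
body_styles = ["convertible", "hatchback", "sedan", "wagon", "sedan"]

# LabelEncoder assigns integer codes following the sorted class order.
encoder = LabelEncoder()
labels = encoder.fit_transform(body_styles)

# inverse_transform maps codes back to the original strings.
decoded = encoder.inverse_transform(labels)

# LabelBinarizer produces one 0/1 column per class (one hot style).
binarizer = LabelBinarizer()
one_hot = binarizer.fit_transform(body_styles)

print(labels.tolist())
print(list(binarizer.classes_))
print(one_hot.shape)
```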
That covers the basics of data preprocessing and categorical encoding.