Numeric binning and one-hot encoding

Published: 2024/1/23

Contents

- The pandas approach
- One-hot encoding multiple labels
- The sklearn approach

This part covers one-hot encoding of numeric features only. For one-hot encoding of text, see the previous part.

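Numeric values can be one-hot encoded with pandas.cut() followed by get_dummies(), or with sklearn's OneHotEncoder.
In addition, sklearn's preprocessing.KBinsDiscretizer and Binarizer classes can also be used for numeric binning.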

The pandas approach

The basic idea is to bin the values with cut() first, then pass the bins to get_dummies() to obtain the one-hot values. API:

https://pandas.pydata.org/docs/reference/api/pandas.cut.html?highlight=cut#pandas.cut

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

First we bin the data. Here we specify the bin edges explicitly; you can also simply ask for N equal-width bins, among other options. See the cut() API for details.

    import numpy as np
    import pandas as pd

    lst = np.arange(0, 100, 3)
    print(lst)
    lst_bins = pd.cut(lst, [-1, 10, 50, 100])
    print(lst_bins)

Output:

    [ 0  3  6  9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69
     72 75 78 81 84 87 90 93 96 99]
    [(-1, 10], (-1, 10], (-1, 10], (-1, 10], (10, 50], ..., (50, 100], (50, 100], (50, 100], (50, 100], (50, 100]]
    Length: 34
    Categories (3, interval[int64]): [(-1, 10] < (10, 50] < (50, 100]]
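The first edge above is -1 rather than 0 because cut() builds right-closed intervals by default, so a bin starting at 0 would not contain the value 0. The following is a minimal sketch of that behaviour on a small made-up array (the `values` array is an illustration, not data from the article), together with the `include_lowest` and `right` parameters of cut() that change it.

```python
import numpy as np
import pandas as pd

values = np.array([0, 3, 10, 11, 99])

# Intervals are right-closed by default, so an edge at 0 excludes the value 0.
plain = pd.cut(values, [0, 10, 50, 100])
print(plain.codes)  # a code of -1 marks a value that fell outside every bin

# include_lowest=True also closes the first interval on the left.
lowest = pd.cut(values, [0, 10, 50, 100], include_lowest=True)
print(lowest.codes)

# right=False switches to left-closed intervals [0, 10), [10, 50), [50, 100).
left_closed = pd.cut(values, [0, 10, 50, 100], right=False)
print(left_closed.codes)
```

Which variant fits depends on whether the boundary values (here 10 and 50) should belong to the lower or the upper bin.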

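Let's look at how many values fall into each interval:

    # pd.value_counts is deprecated in recent pandas releases;
    # pd.Series(lst_bins).value_counts() is the equivalent call.
    print(pd.value_counts(lst_bins))

Output:

    (50, 100]    17
    (10, 50]     13
    (-1, 10]      4
    dtype: int64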

The interval notation is not very readable, so we can attach labels to the bins:

    lst_bins = pd.cut(lst, [-1, 10, 50, 100], labels=['1', '2', '3'])
    print(lst_bins)

Output:

    ['1', '1', '1', '1', '2', ..., '3', '3', '3', '3', '3']
    Length: 34
    Categories (3, object): ['1' < '2' < '3']

We can also simply split the data into N equal-width bins:

    lst_bins2 = pd.cut(lst, 5, labels=['1', '2', '3', '4', '5'])
    print(lst_bins2)
    print(pd.value_counts(lst_bins2))

Output:

    ['1', '1', '1', '1', '1', ..., '5', '5', '5', '5', '5']
    Length: 34
    Categories (5, object): ['1' < '2' < '3' < '4' < '5']
    5    7
    4    7
    2    7
    1    7
    3    6
    dtype: int64
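To see why the N-bin split gives bins of unequal size (bin 3 holds 6 values, the others 7), it can help to inspect the edges that cut() computed. This is a small sketch using the `retbins=True` parameter of cut(); the variable names `bins`, `edges`, and `counts` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

lst = np.arange(0, 100, 3)

# retbins=True returns the computed bin edges alongside the binned values.
bins, edges = pd.cut(lst, 5, retbins=True)
print(edges)

# Count the values per bin, listed in bin order rather than by frequency.
counts = pd.Series(bins).value_counts().sort_index()
print(counts)
```

The edges are equally spaced over the range of the data, so the bins have equal width, not equal counts. The lowest edge is pushed slightly below the minimum so that the smallest value is included.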

Once we have the bins, we can one-hot encode them. get_dummies() accepts array-like, Series, or DataFrame input; here we first put the bins into a DataFrame column.

    df = pd.DataFrame()
    df['score'] = lst_bins
    print(df)
    # Newer pandas versions return True/False here by default;
    # pass dtype=int to get_dummies to reproduce the 0/1 output below.
    df_onehot = pd.get_dummies(df['score'])
    print(df_onehot)

Output:

       score
    0      1
    1      1
    2      1
    3      1
    4      2
    5      2
    6      2
    7      2
    8      2
    9      2
    10     2
    11     2
    12     2
    13     2
    14     2
    15     2
    16     2
    17     3
    18     3
    19     3
    20     3
    21     3
    22     3
    23     3
    24     3
    25     3
    26     3
    27     3
    28     3
    29     3
    30     3
    31     3
    32     3
    33     3
        1  2  3
    0   1  0  0
    1   1  0  0
    2   1  0  0
    3   1  0  0
    4   0  1  0
    5   0  1  0
    6   0  1  0
    7   0  1  0
    8   0  1  0
    9   0  1  0
    10  0  1  0
    11  0  1  0
    12  0  1  0
    13  0  1  0
    14  0  1  0
    15  0  1  0
    16  0  1  0
    17  0  0  1
    18  0  0  1
    19  0  0  1
    20  0  0  1
    21  0  0  1
    22  0  0  1
    23  0  0  1
    24  0  0  1
    25  0  0  1
    26  0  0  1
    27  0  0  1
    28  0  0  1
    29  0  0  1
    30  0  0  1
    31  0  0  1
    32  0  0  1
    33  0  0  1

Complete code:

    import numpy as np
    import pandas as pd

    lst = np.arange(0, 100, 3)
    lst_bins = pd.cut(lst, [-1, 10, 50, 100], labels=['1', '2', '3'])
    df = pd.DataFrame()
    df['score'] = lst_bins
    df_onehot = pd.get_dummies(df['score'])
    print(df_onehot)

Output:

        1  2  3
    0   1  0  0
    1   1  0  0
    2   1  0  0
    3   1  0  0
    4   0  1  0
    5   0  1  0
    6   0  1  0
    7   0  1  0
    8   0  1  0
    9   0  1  0
    10  0  1  0
    11  0  1  0
    12  0  1  0
    13  0  1  0
    14  0  1  0
    15  0  1  0
    16  0  1  0
    17  0  0  1
    18  0  0  1
    19  0  0  1
    20  0  0  1
    21  0  0  1
    22  0  0  1
    23  0  0  1
    24  0  0  1
    25  0  0  1
    26  0  0  1
    27  0  0  1
    28  0  0  1
    29  0  0  1
    30  0  0  1
    31  0  0  1
    32  0  0  1
    33  0  0  1
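The cut-then-encode steps can also be wrapped in a small helper. The sketch below is one possible arrangement, not the article's own code: the function name `bin_and_one_hot` and the `score` prefix are hypothetical, and `dtype=int` is passed so that the result holds 0/1 values regardless of the pandas version.

```python
import numpy as np
import pandas as pd


def bin_and_one_hot(values, edges, labels, prefix):
    """Bin numeric values with pd.cut, then one-hot encode the bins."""
    binned = pd.Series(pd.cut(values, edges, labels=labels))
    # prefix gives the columns readable names such as score_1, score_2, ...
    return pd.get_dummies(binned, prefix=prefix, dtype=int)


lst = np.arange(0, 100, 3)
df_onehot = bin_and_one_hot(lst, [-1, 10, 50, 100], ['1', '2', '3'], prefix='score')
print(df_onehot.head())
# Column sums give the number of values that fell into each bin.
print(df_onehot.sum())
```

Each row contains exactly one 1, and the column sums match the per-bin counts from value_counts (4, 13, and 17).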

If df has several columns (bin first as shown above, if needed):

    df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c']})
    # Get the one-hot encoding of column B
    one_hot = pd.get_dummies(df['B'])
    # Drop column B as it is now encoded
    df = df.drop('B', axis=1)
    # Join the encoded columns back onto df
    df = df.join(one_hot)
    print(df)

Output:

       A  a  b  c
    0  a  0  1  0
    1  b  1  0  0
    2  a  0  0  1
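As an alternative to the drop-and-join steps, get_dummies() can encode selected columns of a DataFrame in a single call through its `columns` parameter. A minimal sketch on the same two-column frame follows; `dtype=int` is an addition here to keep the 0/1 output on newer pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c']})

# Encode only column B; column A is passed through unchanged.
encoded = pd.get_dummies(df, columns=['B'], dtype=int)
print(encoded)
```

The new columns are prefixed with the source column name (B_a, B_b, B_c), which avoids name clashes when several columns share the same category values.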

One-hot encoding multiple labels

Series.str.get_dummies() can one-hot encode every label in a delimited string directly (the default separator is '|'):

    pd.Series(['a|b', 'a', 'a|c']).str.get_dummies()

    df = pd.DataFrame({'f': ['a,b', 'a', 'a,c']})
    df['f'].str.get_dummies(",")

Output:

       a  b  c
    0  1  1  0
    1  1  0  0
    2  1  0  1

If the dataset is large, you can use MultiLabelBinarizer:

https://stackoverflow.com/questions/63544536/convert-pd-get-dummies-result-to-df-str-get-dummies

    from sklearn.preprocessing import MultiLabelBinarizer

    mlb = MultiLabelBinarizer(sparse_output=True)
    output = pd.DataFrame.sparse.from_spmatrix(
        mlb.fit_transform(df['f'].str.split(',')),
        columns=mlb.classes_)
    print(output)

Output:

       a  b  c
    0  1  1  0
    1  1  0  0
    2  1  0  1

Here is a complete example. We one-hot encode the following data: a column is 1 if the row has that label and 0 if it does not:
label,features
1,80801|898509
0,80801|898509|59834
1,80801|898509|48983

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer

    sample_dir = '/home/ljhn1829/jupyter/ljh/data/onehot_sample.csv'
    df_sample_all = pd.read_csv(sample_dir)

    mlb = MultiLabelBinarizer(sparse_output=True)
    output = pd.DataFrame.sparse.from_spmatrix(
        mlb.fit_transform(df_sample_all['features'].str.split('|')),
        columns=mlb.classes_)
    df_sample_onehot_all = pd.DataFrame()
    df_sample_onehot_all['label'] = df_sample_all['label']
    print(df_sample_onehot_all)

    df_sample_onehot_all = pd.concat([df_sample_onehot_all, output], axis=1)
    print(df_sample_onehot_all)

Output:

       label
    0      1
    1      0
    2      1
       label  48983  59834  80801  898509
    0      1      0      0      1       1
    1      0      0      1      1       1
    2      1      1      0      1       1
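The example above reads from a local CSV path, so it cannot be run as-is elsewhere. Below is a self-contained sketch of the same steps that feeds the three sample rows through io.StringIO instead of a file. It uses the dense output of MultiLabelBinarizer for simplicity, which is an assumption that only suits small data; the names `csv_text`, `df_sample`, `encoded`, and `result` are illustrative.

```python
import io

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# The sample rows from the text, inlined so that no file is needed.
csv_text = """label,features
1,80801|898509
0,80801|898509|59834
1,80801|898509|48983
"""
df_sample = pd.read_csv(io.StringIO(csv_text))

# Split each features string into a list of labels, then binarize.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df_sample['features'].str.split('|')),
    columns=mlb.classes_,
    index=df_sample.index,
)

# Put the label column next to the one-hot columns.
result = pd.concat([df_sample[['label']], encoded], axis=1)
print(result)
```

The column names are strings sorted lexicographically, which is why 80801 comes before 898509.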

The sklearn approach

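For numeric values that represent categories, one-hot encoding works exactly as it does for the text categories covered earlier.

For continuous values, you first need to bin them, either with cut() as shown above or with sklearn's preprocessing.KBinsDiscretizer and Binarizer classes. cut() is usually sufficient.

    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder()
    enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    # The categories found for each feature:
    print(enc.categories_)
    # One-hot encoding:
    print(enc.transform([[0, 1, 1]]).toarray())

Output:

    [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]
    [[1. 0. 0. 1. 0. 0. 1. 0. 0.]]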
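The text mentions KBinsDiscretizer and Binarizer without showing them, so here is a rough sketch of both on the same 0 to 99 data used in the pandas section. The choice of 3 uniform bins and a threshold of 50 is an assumption made for illustration. With `encode='onehot-dense'`, KBinsDiscretizer bins and one-hot encodes in one step.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

# sklearn transformers expect a 2-D array: one row per sample, one column per feature.
X = np.arange(0, 100, 3, dtype=float).reshape(-1, 1)

# Three equal-width bins, returned directly as dense one-hot columns.
kbd = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')
X_onehot = kbd.fit_transform(X)
print(kbd.bin_edges_[0])  # the learned bin edges for the single feature
print(X_onehot[:3])

# Binarizer is a two-bin special case: 1 where the value is above the threshold, else 0.
binarizer = Binarizer(threshold=50)
X_binary = binarizer.fit_transform(X)
print(X_binary.ravel())
```

Unlike cut(), KBinsDiscretizer learns its edges in fit() and can reuse them in transform() on new data, which is convenient inside an sklearn pipeline.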
