
8.Using Categorical Data with One Hot Encoding

Published: 2023/12/10 41 豆豆


This tutorial is part of the machine learning series. In this step, you will learn what "categorical" variables are and the most common way to handle this type of data.

Introduction

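Categorical data is data that takes only a limited number of values.

For example, if people responded to a survey about which brand of car they own, the result would be categorical (because the answers would be things like Honda, Toyota, Ford, None, and so on). The answers fall into a fixed set of categories.

If you try to plug these variables into most machine learning models in Python without "encoding" them first, you will get an error. Here we show the most popular method for encoding categorical variables.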
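To make that failure concrete, here is a minimal sketch, assuming scikit-learn is installed. The small survey table, its column names, and the target values are made up for illustration and do not come from the tutorial's dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical survey data: "brand" is categorical text, "age" is numeric.
survey = pd.DataFrame({
    "brand": ["Honda", "Toyota", "Ford", "None"],
    "age": [34, 51, 27, 45],
})
miles_driven = [12000, 8000, 15000, 0]  # made-up target values

try:
    RandomForestRegressor(n_estimators=10, random_state=0).fit(survey, miles_driven)
    fit_failed = False
except ValueError as err:
    # The model cannot convert the text in "brand" to numbers.
    fit_failed = True
    print("Model refused the raw data:", err)
```

The model raises a ValueError because it cannot turn the brand names into numbers, which is the problem that encoding solves.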

One-Hot Encoding : The Standard Approach for Categorical Data

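One-hot encoding is the most widespread approach, and it works very well unless your categorical variable takes on a large number of values. As a rule of thumb, you generally won't use it for variables with more than 15 distinct values; it can also be a poor choice in some cases with fewer values, though that varies.

One-hot encoding creates new (binary) columns that indicate the presence of each possible value from the original data. Let's work through an example.

The values in the original data are Red, Yellow, and Green. We create a separate column for each possible value. Wherever the original value was Red, we put a 1 in the Red column, and likewise for the other colors.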
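The color example can be sketched in a few lines of pandas. This is an illustrative toy, not part of the housing data: the DataFrame and the column name "Color" are assumptions made for the demonstration.

```python
import pandas as pd

# Toy data using the three colors from the example above.
colors = pd.DataFrame({"Color": ["Red", "Yellow", "Green", "Red"]})

# One new binary column per distinct value.
# dtype=int yields 0/1 instead of booleans.
one_hot = pd.get_dummies(colors, columns=["Color"], dtype=int)
print(one_hot)
```

Each row ends up with exactly one 1, in the column that matches its original color.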

Example

我們?cè)诖a中看到這個(gè)。我們將跳過基本數(shù)據(jù)設(shè)置代碼，因此您可以從擁有train_predictors，test_predictors 的DataFrames位置開始。該數(shù)據(jù)包含住房特征。您將使用它們來預(yù)測(cè)房屋價(jià)格，房屋價(jià)格存儲(chǔ)在稱為目標(biāo)的系列中。

[1]

# Read the data
import pandas as pd
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')

# Drop houses where the target is missing
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)

target = train_data.SalePrice

# Since missing values isn't the focus of this tutorial, we use the simplest
# possible approach, which drops these columns.
# For more detail (and a better approach) to missing values, see
# https://www.kaggle.com/dansbecker/handling-missing-values
cols_with_missing = [col for col in train_data.columns
                     if train_data[col].isnull().any()]
candidate_train_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis=1)

# "cardinality" means the number of unique values in a column.
# We use it as our only way to select categorical columns here. This is convenient, though
# a little arbitrary.
low_cardinality_cols = [cname for cname in candidate_train_predictors.columns
                        if candidate_train_predictors[cname].nunique() < 10 and
                        candidate_train_predictors[cname].dtype == "object"]
numeric_cols = [cname for cname in candidate_train_predictors.columns
                if candidate_train_predictors[cname].dtype in ['int64', 'float64']]
my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]

Pandas assigns a data type (called a dtype) to each column or Series. Let's look at a random sample of dtypes from our prediction data:

[2]

train_predictors.dtypes.sample(10)

Heating          object
CentralAir       object
Foundation       object
Condition1       object
YrSold            int64
PavedDrive       object
RoofMatl         object
PoolArea          int64
EnclosedPorch     int64
KitchenAbvGr      int64
dtype: object

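The object dtype indicates that a column holds text (in theory it could hold other things, but that is unimportant for our purposes). It is most common to one-hot encode these "object" columns, since they can't be plugged directly into most models. Pandas offers a convenient function called get_dummies to produce one-hot encodings. Call it like this: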

[3]

one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)

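Alternatively, you could have dropped the categoricals. To see how the approaches compare, we can calculate the mean absolute error of models built with two alternative sets of predictors:

1. One-hot encoded categoricals as well as numeric predictors
2. Numeric predictors only, where we drop the categoricals

One-hot encoding usually helps, but it varies on a case-by-case basis. In this case, there doesn't appear to be any meaningful benefit from using the one-hot encoded variables.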

[4]

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
    # multiply by -1 to get a positive MAE score, since sklearn's convention
    # is to return the negated value
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50),
                                X, y,
                                scoring='neg_mean_absolute_error').mean()

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])

mae_without_categoricals = get_mae(predictors_without_categoricals, target)

mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))

Mean Absolute Error when Dropping Categoricals: 18350
Mean Absolute Error with One-Hot Encoding: 18023

Applying to Multiple Files

So far, you have one-hot encoded your training data. What about when you have multiple files (for example a test dataset, or some other data that you'd like to make predictions for)? Scikit-learn is sensitive to the ordering of columns, so if the training dataset and test dataset get misaligned, your results will be nonsense. This can happen if a categorical has a different number of values in the training data than in the test data.

Use the align command to ensure the test data is encoded in the same manner as the training data:

[5]

one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                    join='left',
                                                                    axis=1)

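The align command makes sure the columns show up in the same order in both datasets (it uses column names to identify which columns line up in each dataset). The argument join='left' specifies that we do the equivalent of SQL's left join. That means that if a column shows up in one dataset and not the other, we keep exactly the columns from our training data. The argument join='inner' would do what SQL databases call an inner join, keeping only the columns that show up in both datasets. That is also a sensible choice.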
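The difference between the two join modes is easier to see on a tiny example. The sketch below uses made-up miniature datasets (the column names and values are assumptions, not the housing data) in which one color appears only in training and another only in the test set. It also passes fill_value=0, an optional argument of align that is not used in the tutorial code above, so that columns missing from the test data are filled with 0 rather than NaN.

```python
import pandas as pd

# Hypothetical miniature datasets: "Purple" appears only in training,
# "Blue" only in the test data.
train = pd.DataFrame({"Color": ["Red", "Green", "Purple"], "Size": [1, 2, 3]})
test = pd.DataFrame({"Color": ["Green", "Blue", "Red"], "Size": [4, 5, 6]})

train_encoded = pd.get_dummies(train, dtype=int)
test_encoded = pd.get_dummies(test, dtype=int)

# join="left": keep exactly the training columns, in training order.
# Columns the test data lacks are created and filled with 0.
left_train, left_test = train_encoded.align(
    test_encoded, join="left", axis=1, fill_value=0
)

# join="inner": keep only the columns present in both datasets.
inner_train, inner_test = train_encoded.align(test_encoded, join="inner", axis=1)

print(list(left_test.columns))
print(list(inner_test.columns))
```

With the left join, the test frame gains a Color_Purple column and loses Color_Blue; with the inner join, both frames drop Color_Purple and Color_Blue.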

Conclusion

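The world is filled with categorical data. You will be a much more effective data scientist if you know how to use it. These resources will be useful as you start doing more sophisticated work with categorical data.

- Pipelines: Deploying models into production-ready systems is a topic unto itself. While one-hot encoding is still a great approach, your code needs to be built in an especially robust way. Scikit-learn pipelines are a great tool for this. Scikit-learn offers a class for one-hot encoding (OneHotEncoder) that can be added to a Pipeline. Older releases could not handle text or object values, which is a common use case; since version 0.20 the encoder accepts string categories directly.

- Applications to text for deep learning: Keras and TensorFlow have functionality for one-hot encoding, which is useful for working with text.

- Categoricals with many values: Scikit-learn's FeatureHasher uses the hashing trick to store high-dimensional data. This will add some complexity to your modeling code.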
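As a rough illustration of the pipeline idea, here is a minimal sketch, assuming scikit-learn 0.20 or later (where OneHotEncoder accepts string categories). The data, column names, and prices are made up for the example, and the step names "onehot", "preprocess", and "forest" are arbitrary labels chosen here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical housing-style data; names and values are illustrative only.
X_train = pd.DataFrame({
    "Heating": ["GasA", "GasW", "GasA", "Wall"],
    "LotArea": [8450, 9600, 11250, 9550],
})
y_train = [208500, 181500, 223500, 140000]
X_new = pd.DataFrame({"Heating": ["GasA", "Floor"], "LotArea": [9000, 10000]})

preprocess = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes categories unseen during fit
        # (such as "Floor") as all zeros instead of raising an error.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Heating"]),
    ],
    remainder="passthrough",  # numeric columns pass through unchanged
)

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])

model.fit(X_train, y_train)
predictions = model.predict(X_new)
print(predictions)
```

Because the encoder is fitted inside the pipeline, the same column layout is applied automatically to any new data, so no separate align step is needed in this design.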

Your Turn

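Use one-hot encoding to allow categoricals in your course project. Then add some categorical columns to your X data. If you choose the right variables, your model will improve quite a bit. Once you are done, return to Learning Machine Learning, where you can continue improving your model.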
