Feature Engineering 2. Categorical Encodings
Contents
- 1. Count Encoding
- 2. Target Encoding
- 3. CatBoost Encoding
Notes based on https://www.kaggle.com/learn/feature-engineering
Previous: Feature Engineering 1. Baseline Model
Next: Feature Engineering 3. Feature Generation
Intermediate Machine Learning already introduced Label Encoding and One-Hot Encoding. This part covers count encoding, target encoding, and CatBoost encoding.
The previous part used LabelEncoder() and reached Validation AUC score: 0.7467
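    # Label encoding
    from sklearn.preprocessing import LabelEncoder

    cat_features = ['category', 'currency', 'country']
    encoder = LabelEncoder()
    # Fit a fresh integer mapping on each categorical column of ks
    encoded = ks[cat_features].apply(encoder.fit_transform)

1. Count Encoding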
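- Count encoding replaces each categorical value with the number of times it appears.
  For example, if CN appears 100 times in a feature, every CN is replaced with the number 100.
- With category_encoders.CountEncoder(), the final score is Validation AUC score: 0.7486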
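To make the replacement rule concrete, here is a minimal pure-Python sketch of what count encoding computes. It is not the category_encoders implementation; the helper name `count_encode` and the toy data are made up for illustration.

```python
from collections import Counter

def count_encode(train_values, values):
    """Replace each category with how often it appears in train_values."""
    counts = Counter(train_values)  # category -> number of occurrences
    # Categories never seen in train_values get a count of 0.
    return [counts.get(v, 0) for v in values]

countries = ["CN", "US", "CN", "GB", "CN", "US"]  # made-up toy data
print(count_encode(countries, countries))  # [3, 2, 3, 1, 3, 2]
```

Counting on one list and encoding another keeps the door open for learning the counts from the training rows and reusing them on other splits.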
2. Target Encoding
- With category_encoders.TargetEncoder(), the final score is Validation AUC score: 0.7491
Target encoding replaces a categorical value with the average value of the target for that value of the feature.
For example, given the country value "CA", you'd calculate the average outcome (the label column) over all the rows with country == 'CA', around 0.28, and replace "CA" with that average.
This is often blended with the target probability over the entire dataset to reduce the variance of values with few occurrences. A mean computed from only a handful of rows is noisy, and the blend pulls it toward the overall average.
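One way to picture the blending is a simple additive smoothing, sketched below. This is an illustrative assumption, not necessarily the formula category_encoders.TargetEncoder uses; the function name `smoothed_target_means`, the `weight` parameter, and the toy data are all hypothetical.

```python
from collections import defaultdict

def smoothed_target_means(categories, targets, weight=10.0):
    """Map each category to a blend of its own target mean and the global mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # A category with n rows weights its own mean by n and the global mean by `weight`.
    return {
        cat: (sums[cat] + weight * global_mean) / (counts[cat] + weight)
        for cat in counts
    }

# Made-up toy data: "CA" has 4 rows, "NZ" has only 1.
countries = ["CA", "CA", "CA", "CA", "NZ"]
outcomes = [1, 0, 0, 0, 1]
encoding = smoothed_target_means(countries, outcomes, weight=2.0)
```

Here the raw means are 0.25 for "CA" and 1.0 for "NZ", while the global mean is 0.4. The single-row category "NZ" moves much further toward 0.4 than "CA" does, which is the variance reduction the text describes.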
This technique uses the targets to create new features, so including the validation or test data when fitting the target encodings would be a form of target leakage.
Instead, you should learn the target encodings from the training dataset only and apply them to the other datasets.
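The fit-on-train, apply-elsewhere pattern can be sketched in plain Python as follows. The helper names `fit_target_means` and `apply_target_means` are hypothetical, the data is made up, and falling back to the global training mean for unseen categories is one reasonable choice rather than the only one.

```python
def fit_target_means(categories, targets):
    """Learn category -> mean target from training rows only."""
    totals = {}
    for cat, y in zip(categories, targets):
        s, n = totals.get(cat, (0.0, 0))
        totals[cat] = (s + y, n + 1)
    means = {cat: s / n for cat, (s, n) in totals.items()}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

def apply_target_means(categories, means, global_mean):
    """Encode any split with the training means; unseen categories get the global mean."""
    return [means.get(cat, global_mean) for cat in categories]

train_country = ["CA", "CA", "US", "US"]
train_outcome = [1, 0, 1, 1]
valid_country = ["US", "CA", "MX"]  # "MX" never appears in training

means, global_mean = fit_target_means(train_country, train_outcome)
valid_encoded = apply_target_means(valid_country, means, global_mean)
print(valid_encoded)  # [1.0, 0.5, 0.75]
```

The validation outcomes are never passed to either function, so nothing about the validation labels can leak into the encoded feature.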
3. CatBoost Encoding
- With category_encoders.CatBoostEncoder(), the final score is Validation AUC score: 0.7492
This is similar to target encoding in that it's based on the target probability for a given value.
However, with CatBoost, for each row the target probability is calculated only from the rows before it.
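The "only from the rows before it" rule can be sketched as a running statistic. This is a simplified illustration, not the CatBoost or category_encoders implementation: the function name `ordered_target_encode`, the use of the global mean as a prior, and the `prior_weight` parameter are assumptions made here so that the first row of each category has something to fall back on.

```python
def ordered_target_encode(categories, targets, prior_weight=1.0):
    """Encode each row from the targets of earlier rows with the same category."""
    prior = sum(targets) / len(targets)  # assumed prior: global target mean
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Only rows seen so far contribute; the current row's own target is excluded.
        encoded.append((s + prior_weight * prior) / (n + prior_weight))
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

countries = ["CA", "CA", "US", "CA"]  # made-up toy data
outcomes = [1, 0, 1, 0]
print(ordered_target_encode(countries, outcomes))  # [0.5, 0.75, 0.5, 0.5]
```

Because the result depends on row order, the same category can receive different encoded values on different rows, unlike plain target encoding.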