Label Encoder and OneHot Encoder in Python
Machine learning algorithms understand numbers, not text. Hence, all the “text” columns must be converted into “numerical” columns before the algorithm can use them.
This is the story of transforming labels, categorical values, or text values into numbers. In simple words,
Encoding is the process of transforming words into numbers
In Python, OneHot Encoding and Label Encoding are two methods for encoding categorical columns into numerical columns. Both are part of one of the most commonly used Python libraries: Scikit-Learn
But wait, you don’t want to import Scikit-Learn in your notebook?
No problem at all. Pandas comes to your help.
Let us dive into this story of converting categorical variables into numerical ones so that an ML algorithm can understand them.
Categorical Data
Most datasets contain multiple columns holding numerical as well as categorical values.
Image by Author: Datatypes of categorical column
Categorical variables represent types of data that may be divided into groups. A categorical variable has a limited and usually fixed number of possible values, called categories. Variables like gender, social class, blood type, and country code are examples of categorical data.
However, this data can be processed by a machine learning algorithm only after it has been encoded into numerical values.
Let us consider the example below to understand encoding in a simple way.
import pandas as pd
# Example data: five countries, their continents, and a one-letter code
countries = ["Germany","India","UK","Egypt","Iran"]
continents = ["Europe","Asia","Europe","Africa","Asia"]
code = ["G","I","U","E","N"]
d = {"country": countries, "continent": continents, "code": code}
df = pd.DataFrame(d)
Image by Author: Example DataFrame
Convert the data type of the column “code” from object to category:
df['code'] = df.code.astype('category')
Image by Author: Datatypes of all columns
With this example, let us understand the encoding process.
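Before moving on, the dtype conversion can be checked with a short sketch. It rebuilds the example DataFrame so that it runs on its own. The names code_categories and code_integers are illustrative only and are not part of the original example. The .cat accessor is standard pandas and exposes the categories and the integer codes pandas stores for them.

```python
import pandas as pd

# Rebuild the example DataFrame so this sketch is self-contained
df = pd.DataFrame({
    "country": ["Germany", "India", "UK", "Egypt", "Iran"],
    "continent": ["Europe", "Asia", "Europe", "Africa", "Asia"],
    "code": ["G", "I", "U", "E", "N"],
})
df["code"] = df["code"].astype("category")

# Show the data type of every column; "code" is now category
print(df.dtypes)

# Illustrative names: the sorted categories and the integer code per row
code_categories = list(df["code"].cat.categories)
code_integers = list(df["code"].cat.codes)
print(code_categories)
print(code_integers)
```

A category column already carries integer codes internally, which is a first hint of what label encoding does.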
Label Encoding in Python
Label encoding is a simple and straightforward approach. It converts each value in a categorical column into a numerical value. Each value in a categorical column is called a label.
Label encoding: Assign a unique integer to each label based on alphabetical order
Let me show you how label encoding works in Python with the same example,
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["labeled_continent"] = le.fit_transform(df["continent"])
df
The labels in the column continent are converted into numbers and stored in the new column labeled_continent.
The output will be:
Image by Author: Label Encoding in Python
In simpler words, the labels are arranged in alphabetical order and a unique index, starting from 0, is assigned to each label.
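The label-to-integer assignment can be inspected directly. The sketch below assumes Scikit-Learn is installed and uses the fitted encoder's classes_ attribute, which holds the sorted labels. The variable names mapping and decoded are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

continents = pd.Series(["Europe", "Asia", "Europe", "Africa", "Asia"])

le = LabelEncoder()
encoded = le.fit_transform(continents)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {str(label): index for index, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform converts the integers back to the original labels
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted encoder around is useful because inverse_transform lets you translate model outputs back into readable labels.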
Image by Author: Understand Label Encoding in Python
All looks good? Worked well?
Here is where the problem with label encoding comes in. It uses a sequence of numbers, which introduces a comparison between the labels. In the above example, the labels in the column continent do not have an order or rank. But after label encoding, these labels are ordered alphabetically. Because of these numbers, a machine learning model can interpret this ordering as Europe > Asia > Africa
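To make the ordering problem concrete, here is a small sketch, assuming Scikit-Learn is installed. It shows that comparisons and arithmetic are valid on the encoded integers even though they mean nothing for continents. The variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Africa", "Asia", "Europe"])
africa, asia, europe = le.transform(["Africa", "Asia", "Europe"])

# The codes can be compared, although continents have no natural rank
europe_greater_than_asia = bool(europe > asia)

# The codes can be averaged: the "midpoint" of Africa and Europe is Asia,
# which is an artifact of alphabetical order, not a fact about the data
midpoint_is_asia = bool((africa + europe) / 2 == asia)

print(europe_greater_than_asia, midpoint_is_asia)
```

A model that works with numeric magnitudes, such as a linear model, may pick up on these artificial relationships.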
To overcome this ordering problem with Label Encoding, OneHot Encoding comes into the picture.
OneHot Encoding in Python
In OneHot encoding, a binary column is created for each label in a column. Each label is transformed into a new column, or new feature, and assigned a value of 1 (Hot) or 0 (Cold).
Let me first show you an example to illustrate the above statement,
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
df3 = pd.DataFrame(ohe.fit_transform(df[["continent"]]).toarray())
df_new = pd.concat([df, df3], axis=1)
df_new
Image by Author: OneHot Encoding in Python
In this scenario, the last three columns are the result of OneHot Encoding. The labels Africa, Asia, and Europe are represented by the columns 0, 1, and 2 respectively. OneHot encoding transforms these labels into columns. Hence, looking at the last 3 columns, we have 3 labels → 3 columns
OneHot Encoding: In a single row only one Label is Hot
In a particular row, only one label has a value of 1 and all other labels have a value of 0. Before feeding such an encoded dataset into a machine learning model, a few more transformations can be applied, as described in the OneHot Encoding documentation.
Have a quick look at this article to know more options for merging 2 DataFrames
pandas.get_dummies() in Python
OneHot encoding can be implemented in a simpler way and without importing Scikit-Learn.
Yes! Pandas is your friend here. The simple function pandas.get_dummies() quickly transforms all the labels from a specified column into individual binary columns
df2 = pd.get_dummies(df[["continent"]])
df_new = pd.concat([df, df2], axis=1)
df_new
Image by Author: Pandas dummy variables
The last 3 columns of the above DataFrame are the same as those observed in OneHot Encoding.
pandas.get_dummies() generates dummy variables for each label in the column continent. Hence, continent_Africa, continent_Asia, and continent_Europe are the dummy binary variables for the labels Africa, Asia, and Europe respectively.
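One detail worth knowing: recent pandas versions (2.0 and later) return boolean True/False dummy columns by default rather than 1/0. The sketch below passes dtype=int, an existing parameter of pandas.get_dummies(), to get integer columns like those described above. The variable name dummies is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"continent": ["Europe", "Asia", "Europe", "Africa", "Asia"]})

# dtype=int requests 0/1 integers instead of the boolean default
dummies = pd.get_dummies(df[["continent"]], dtype=int)
df_new = pd.concat([df, dummies], axis=1)
print(df_new)
```

The column names are built from the source column name and the label, so no extra renaming step is needed.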
Through my story,
I walked you through the methods of converting categorical variables into numerical variables. Each method has its own strengths and limitations, hence it is important to understand all of them. Depending on your dataset and the machine learning model you want to implement, you can choose any of the above three encoding methods in Python.
Here are a few resources which can help you with this topic:
Label Encoding in Python
OneHot Encoding in Python
Get Dummy variables using Pandas
Liked my way of storytelling?
Here is an interesting fun & learn activity for you to create your own dataset. Have a look.
Thank you for your time!
I am always open to suggestions and new opportunities. Feel free to add your feedback and connect with me on LinkedIn.
Translated from: https://towardsdatascience.com/label-encoder-and-onehot-encoder-in-python-83d32288b592