Label Encoder and OneHot Encoder in Python

Published: 2024/3/26

Machine learning algorithms work with numbers, not text. Hence, every text column must be converted into a numerical column before the algorithm can use it.

This is the story of transforming labels (categorical or text values) into numerical values. In simple words:

Encoding is the process of transforming words into numbers.

In Python, OneHot encoding and label encoding are two methods for encoding categorical columns as numerical columns. Both are part of one of the most commonly used Python libraries: Scikit-Learn.

But wait, you don't want to import Scikit-Learn in your notebook?

No problem at all. Pandas comes to your help.

Let us dive into this story of converting categorical variables into numerical ones so that an ML algorithm can understand them.

Categorical Data

A dataset typically contains multiple columns, some holding numerical values and others holding categorical values.

Image by Author: Datatypes of categorical column

A categorical variable represents data that can be divided into groups. It has a limited, and usually fixed, number of possible values called categories. Variables such as gender, social class, blood type, and country code are examples of categorical data.

However, most machine learning algorithms can process this data only after it has been encoded into numerical values.

Let us consider the example below to understand encoding in a simple way.

import pandas as pd

# Build a small example DataFrame with three text columns
countries = ["Germany", "India", "UK", "Egypt", "Iran"]
continents = ["Europe", "Asia", "Europe", "Africa", "Asia"]
code = ["G", "I", "U", "E", "N"]
d = {"country": countries, "continent": continents, "code": code}
df = pd.DataFrame(d)

Image by Author: Example DataFrame

Convert the data type of the column "code" from object to category:

# Store the column as a pandas categorical instead of plain text
df['code'] = df.code.astype('category')

Image by Author: Datatypes of all columns
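
To see what the conversion changes, the following minimal sketch rebuilds the example DataFrame and inspects the categorical column. It relies only on the pandas `.cat` accessor; the variable names `categories` and `codes` are chosen here for illustration and are not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Germany", "India", "UK", "Egypt", "Iran"],
    "continent": ["Europe", "Asia", "Europe", "Africa", "Asia"],
    "code": ["G", "I", "U", "E", "N"],
})
df["code"] = df["code"].astype("category")

# Only "code" is categorical now; the other columns keep their original dtype.
print(df.dtypes)

# A categorical column stores its distinct values once ...
categories = list(df["code"].cat.categories)
# ... and one integer code per row that points into that list.
codes = df["code"].cat.codes.tolist()
print(categories)
print(codes)
```

The integer codes are already a simple numerical representation of the column, obtained without importing Scikit-Learn.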

With this example, let us understand the encoding process.

Label Encoding in Python

Label encoding is a simple and straightforward approach. It converts each value in a categorical column into a numerical value. Each value in a categorical column is called a label.

Label encoding: assign a unique integer to each label, based on alphabetical order.

Let me show you how label encoding works in Python, using the same example as above:

from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the continent labels and store the integer codes
le = LabelEncoder()
df["labeled_continent"] = le.fit_transform(df["continent"])
df

The labels in the column continent are converted into numbers and stored in the new column labeled_continent.

The output will be:

Image by Author: Label Encoding in Python

In simpler words, the labels are arranged in alphabetical order and each label is assigned a unique index, starting from 0.

Image by Author: Understand Label Encoding in Python
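
The mapping between labels and integers can be read back from the fitted encoder. This is a small sketch using the continent values from the example; `classes_` and `inverse_transform` are part of Scikit-Learn's `LabelEncoder`, while the `mapping` dictionary is only built here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

continents = ["Europe", "Asia", "Europe", "Africa", "Asia"]

le = LabelEncoder()
encoded = le.fit_transform(continents)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original labels from the integer codes.
decoded = le.inverse_transform(encoded).tolist()
print(decoded)
```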

All looks good? Worked well?

This is where the problem with label encoding comes in. It uses a sequence of numbers, which introduces a comparison between the labels. In the example above, the labels in the column continent have no order or rank. After label encoding, however, they are numbered in alphabetical order. Because of these numbers, a machine learning model can interpret the ordering as Europe > Asia > Africa.

To overcome this ordering problem with Label Encoding, OneHot Encoding comes into the picture.
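
The difference can be made concrete with a small numerical sketch. It assumes that a model compares values by distance, which is only one way a model might pick up the ordering; the helper functions `label_distance` and `one_hot_distance` are hypothetical names written for this illustration.

```python
import numpy as np

# Label-encoded values in alphabetical order: Africa=0, Asia=1, Europe=2.
label_codes = {"Africa": 0, "Asia": 1, "Europe": 2}

# One binary vector per label, with a single 1 in the label's own position.
one_hot = {
    "Africa": np.array([1, 0, 0]),
    "Asia": np.array([0, 1, 0]),
    "Europe": np.array([0, 0, 1]),
}

def label_distance(a, b):
    """Absolute difference between two label codes."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between two one-hot vectors."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

# With label codes, Africa looks twice as far from Europe as from Asia.
print(label_distance("Africa", "Asia"), label_distance("Africa", "Europe"))

# With one-hot vectors, every pair of labels is equally far apart.
print(one_hot_distance("Africa", "Asia"), one_hot_distance("Africa", "Europe"))
```

The label codes suggest a ranking that the continents do not have, whereas the binary vectors treat all three labels alike.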

OneHot Encoding in Python

In OneHot encoding, a binary column is created for each label in a column. Each label becomes a new column (a new feature) that holds the value 1 (hot) or 0 (cold).

Let me first show you an example to make the above statement clear:

from sklearn.preprocessing import OneHotEncoder

# Encode the continent column; fit_transform returns a sparse matrix by default
ohe = OneHotEncoder()
df3 = pd.DataFrame(ohe.fit_transform(df[["continent"]]).toarray())
df_new = pd.concat([df, df3], axis=1)
df_new

Image by Author: OneHot Encoding in Python

In this scenario, the last three columns are the result of OneHot encoding. The labels Africa, Asia, and Europe correspond to the columns named 0, 1, and 2 respectively. OneHot encoding transforms the labels into columns; hence, looking at the last 3 columns, we have 3 labels → 3 columns.

OneHot Encoding: in a single row, only one label is hot.

In any given row, exactly one label has the value 1 and all other labels have the value 0. Before feeding such an encoded dataset into a machine learning model, a few more transformations can be applied, as described in the OneHot Encoding documentation.

Have a quick look at this article to learn about more options for merging two DataFrames.

pandas.get_dummies() in Python

OneHot encoding can be implemented in a simpler way, without importing Scikit-Learn.

Yes! Pandas is your friend here. The simple function pandas.get_dummies() quickly transforms all the labels from a specified column into individual binary columns.

# Create one binary column per continent label and append them to df
df2 = pd.get_dummies(df[["continent"]])
df_new = pd.concat([df, df2], axis=1)
df_new

Image by Author: Pandas dummy variables

The last 3 columns of the DataFrame above are the same as those observed with OneHot encoding.

pandas.get_dummies() generates a dummy variable for each label in the column continent. Hence, continent_Africa, continent_Asia, and continent_Europe are the dummy binary variables for the labels Africa, Asia, and Europe respectively.
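
As a variation, the function can also encode a column of the full DataFrame in one step. This sketch uses the `columns` and `dtype` parameters of `pandas.get_dummies()`; the name `df_dummies` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Germany", "India", "UK", "Egypt", "Iran"],
    "continent": ["Europe", "Asia", "Europe", "Africa", "Asia"],
})

# columns= encodes the named column and replaces it in the result;
# dtype=int requests 0/1 integers (recent pandas versions default to booleans).
df_dummies = pd.get_dummies(df, columns=["continent"], dtype=int)
print(df_dummies)
```

With this form no separate pd.concat() call is needed, at the cost of losing the original continent column in the result.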

Through my story,

I walked you through the methods of converting categorical variables into numerical variables. Each method has its own advantages and limitations, so it is important to understand all of them. Depending on your dataset and the machine learning model you want to implement, you can choose any of the three encoding methods above in Python.

Here are a few resources that can help you with this topic:

Label Encoding in Python

OneHot Encoding in Python

Get Dummy variables using Pandas

Liked my way of storytelling?

Here is an interesting fun-and-learn activity for you to create your own dataset. Have a look.

Thank you for your time!

I am always open to suggestions and new opportunities. Feel free to add your feedback and connect with me on LinkedIn.

Translated from: https://towardsdatascience.com/label-encoder-and-onehot-encoder-in-python-83d32288b592