Self-Organizing Maps with fast.ai, Step 4: Handling Unsupervised Data with a Fast.ai DataBunch
This is the fourth part of the Self-Organizing Maps with fast.ai article series.
All the code has been published in this repository and this PyPi library.
Overview: Self-Organizing Maps with Fast.ai
Step 1: Implementing a SOM with PyTorch
Step 2: Training the SOM Module with a Fast.ai Learner
Step 3: Updating SOM hyperparameters with Fast.ai Callbacks
Overview
Many datasets come in tabular form. For this reason, Fast.ai has a handy Tabular subclass for its DataBunch that can natively perform categorization and handle continuous and categorical features.
In this article, we will use TabularDataBunch as a starting point to load our data, and then build a conversion function to move it into our UnsupervisedDataBunch.
The main features we’re going to re-implement are:
- Normalization
- Categorical feature encoding
- Export of the SOM codebook into a Pandas DataFrame
Normalization
We will use a separate Normalizer class to perform per-feature normalization. Let’s define a base class:
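A minimal sketch of such a base class (the method names are assumptions based on how the normalizer is used later in this article):

```python
import torch

class Normalizer:
    """Base class for per-feature normalization.

    A minimal sketch: subclasses decide how statistics are computed
    and how values are mapped back to the original range."""

    def normalize(self, data: torch.Tensor) -> torch.Tensor:
        "Fit statistics on `data` and return its normalized version."
        raise NotImplementedError

    def denormalize(self, data: torch.Tensor) -> torch.Tensor:
        "Map normalized values back into the original data range."
        raise NotImplementedError
```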
While comparing different normalizers, I found out that normalizing by mean and standard deviation helps a lot with SOM convergence, so we’ll extend our Normalizer class into a VarianceNormalizer:
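A standalone sketch of what the VarianceNormalizer might look like (written without the base class so the snippet is self-contained; the small epsilon guards against zero-variance features):

```python
import torch

class VarianceNormalizer:
    """Normalizes each feature to zero mean and unit variance,
    storing the statistics so values can be denormalized later."""

    def normalize(self, data: torch.Tensor) -> torch.Tensor:
        # Per-feature statistics, computed over the batch dimension
        self.mean = data.mean(dim=0)
        self.std = data.std(dim=0) + 1e-8  # avoid division by zero
        return (data - self.mean) / self.std

    def denormalize(self, data: torch.Tensor) -> torch.Tensor:
        # Inverse transform: bring values back to the original range
        return data * self.std + self.mean
```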
Note that we also implemented the denormalization function. Since we are normalizing our data, the trained SOM codebook will contain normalized data points: we will need denormalization in order to retrieve values in the initial data range.
Let’s add the normalizer to our UnsupervisedDataBunch:
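A simplified, fastai-free sketch of the idea (the real class wraps fastai DataLoaders, so the constructor signature here is an assumption):

```python
import torch

class VarianceNormalizer:
    "Per-feature mean/std normalization, repeated here so the sketch is self-contained."
    def normalize(self, data):
        self.mean, self.std = data.mean(dim=0), data.std(dim=0) + 1e-8
        return (data - self.mean) / self.std

    def normalize_by(self, data):
        "Normalize `data` with statistics already fitted on the training set."
        return (data - self.mean) / self.std

    def denormalize(self, data):
        return data * self.std + self.mean

class UnsupervisedDataBunch:
    """Sketch of a databunch-like container for unlabeled data
    with an optional normalizer."""

    def __init__(self, train, valid=None, normalizer=None, bs=64):
        self.bs, self.normalizer = bs, normalizer
        if normalizer is not None:
            # Fit on the training set; reuse the same statistics for validation
            train = normalizer.normalize(train)
            valid = normalizer.normalize_by(valid) if valid is not None else None
        self.train_ds, self.valid_ds = train, valid

    def denormalize(self, data):
        "Bring values (e.g. the trained SOM codebook) back to the input range."
        return data if self.normalizer is None else self.normalizer.denormalize(data)
```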
Handling categorical features
Another important preprocessing step for Self-Organizing Maps is transforming categorical features into continuous ones; this can be done either by One-Hot encoding the features or by using embeddings. Since One-Hot encoding is the easiest to implement, we'll start with that, although embeddings perform better.
Both methods require a mixed distance function to compare actual continuous features and converted categoricals independently, but we will skip this step for simplicity’s sake. If you’re interested in how a mixed distance function can be implemented, feel free to have a look at the code on Github.
As we did for normalizers, we will start by defining a base CatEncoder class:
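A minimal sketch (the method names follow how the encoder is described below):

```python
import torch

class CatEncoder:
    """Base class for categorical feature encoders.

    A minimal sketch: subclasses define the actual encoding."""

    def fit(self, data: torch.Tensor) -> None:
        "Learn category counts for each feature from the training set."
        raise NotImplementedError

    def make_continuous(self, data: torch.Tensor) -> torch.Tensor:
        "Encode categorical codes into a continuous representation."
        raise NotImplementedError

    def make_categorical(self, data: torch.Tensor) -> torch.Tensor:
        "Decode a continuous representation back into categorical codes."
        raise NotImplementedError
```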
And subclass it into an OneHotCatEncoder:
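A self-contained sketch of such an encoder (the real implementation lives in the repository; the details here are assumptions):

```python
import torch
import torch.nn.functional as F

class OneHotCatEncoder:
    """One-hot encoder for a batch of categorical columns."""

    def fit(self, data: torch.Tensor) -> None:
        # Number of categories per feature, learned from the training set
        self.n_categories = [int(data[:, i].max().item()) + 1
                             for i in range(data.shape[1])]

    def make_continuous(self, data: torch.Tensor) -> torch.Tensor:
        # One-hot each column with its own category count, then concatenate
        return torch.cat([F.one_hot(data[:, i], num_classes=n).float()
                          for i, n in enumerate(self.n_categories)], dim=1)

    def make_categorical(self, data: torch.Tensor) -> torch.Tensor:
        # Invert the encoding by taking the argmax inside each feature's slice
        out, offset = [], 0
        for n in self.n_categories:
            out.append(data[:, offset:offset + n].argmax(dim=1))
            offset += n
        return torch.stack(out, dim=1)
```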
All we're doing here is using torch.nn.functional.one_hot to perform one-hot encoding of our input variables: we store the number of categories for each feature in the training set during fit, and then use this information to encode with make_continuous and decode with make_categorical.
Importing Pandas DataFrames
One feature we might want for our UnsupervisedDataBunch is the ability to be created from a Pandas DataFrame. As mentioned in the overview above, we will leverage Fast.ai's TabularDataBunch to do the data loading and preprocessing for us, then import the data into our own databunch.
A TabularDataBunch is usually created as follows:
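With the fastai v1 data block API, that typically looks something like this (df, dep_var, cat_names and cont_names are placeholders for your own dataframe and column lists):

```python
from fastai.tabular import TabularList, FillMissing, Categorify

procs = [FillMissing, Categorify]  # fill missing values, ordinal-encode categories

data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_rand_pct(valid_pct=0.2)
        .label_from_df(cols=dep_var)
        .databunch(bs=64))
```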
Creation of a TabularDataBunch from a Pandas DataFrame

The code above does the following:
- Load the dataframe into a TabularList
- Split the list into training and validation sets
- Fill in missing values
- Turn all categorical features into ordinal categories
Now we can write a conversion function to transform the TabularDataBunch into an UnsupervisedDataBunch. This is where things get hacky: we need to retrieve categorical and continuous features separately, process categorical features using OneHotCatEncoder and then concatenate everything into a single Tensor.
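Stripped of the fastai plumbing, the core of that conversion (one-hot the categorical codes, then concatenate them with the continuous columns) can be sketched as follows; note that here the category counts are computed on the fly rather than fitted on the training set:

```python
import torch
import torch.nn.functional as F

def merge_cats_and_conts(cats: torch.Tensor, conts: torch.Tensor) -> torch.Tensor:
    """One-hot encode categorical codes and concatenate them with the
    continuous features into a single tensor (a simplified sketch of the
    TabularDataBunch -> UnsupervisedDataBunch conversion)."""
    encoded = [F.one_hot(cats[:, i], num_classes=int(cats[:, i].max()) + 1).float()
               for i in range(cats.shape[1])]
    return torch.cat(encoded + [conts.float()], dim=1)
```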
TabularDataBunch -> UnsupervisedDataBunch conversion

Since the TabularDataBunch can have a target variable, we are going to add the optional train_y and valid_y arguments to our UnsupervisedDataBunch:
UnsupervisedDataBunch with optional targets

We can now convert any TabularDataBunch by simply using the extension function:
The next step is testing everything we’ve done so far on an actual dataset.
Training on the House Prices dataset
I chose the House Prices dataset, since it is well known and has a good number of categorical features that we can use to test our data workflow. You can find the dataset on Kaggle, among other places.
Let’s start from a CSV file containing the training set and go from there:
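A condensed sketch of the end-to-end flow (SomLearner and to_unsupervised_databunch come from the previous articles and the repository; the column selection and hyperparameters here are illustrative guesses, not the exact values used):

```python
import pandas as pd
from fastai.tabular import TabularList, FillMissing, Categorify

df = pd.read_csv('train.csv')
cat_names = df.select_dtypes('object').columns.tolist()
cont_names = [c for c in df.columns if c not in cat_names + ['Id', 'SalePrice']]

data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names,
                            procs=[FillMissing, Categorify])
        .split_by_rand_pct(valid_pct=0.2)
        .label_from_df(cols='SalePrice')
        .databunch(bs=64))

uns_data = data.to_unsupervised_databunch(bs=256)  # conversion from this article
learn = SomLearner(uns_data, map_size=(10, 10))    # from the previous articles
learn.fit(100)
learn.recorder.plot_losses()
```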
Pretty neat, right? In just about 40 lines of code we got ourselves a trained Self-Organizing Map 😊
Here’s the loss plot:
Loss plot on the House Prices dataset

Creating a DataFrame from the SOM codebook
One of the best things about Self-Organizing Maps is the ability to run another model's predictions (a model trained on the same dataset) over the codebook elements, and then plot the predicted values or classes for each item on the map.
To do so, we could write a codebook_to_df function inside our SomLearner:
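A simplified sketch of such a function, working directly on the codebook tensor (the article's version also denormalizes values and decodes one-hot features back to categories):

```python
import torch
import pandas as pd

def codebook_to_df(codebook: torch.Tensor, feature_names) -> pd.DataFrame:
    """Flatten a (rows, cols, n_features) SOM codebook into a DataFrame
    with one row per map cell."""
    rows, cols, n_features = codebook.shape
    flat = codebook.view(rows * cols, n_features)
    return pd.DataFrame(flat.numpy(), columns=list(feature_names))
```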
Creating a DataFrame from SOM codebook elements

Now we need a model to use for prediction. Let's use this Kaggle submission of House Prices regression with a Fast.ai Tabular learner as a starting point:
Training a tabular model

Running regression on the SOM codebook
Now that we have a trained regressor, let’s generate the DataFrame of the SOM codebook and use it as a test set:
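A sketch of that step (method and variable names are assumptions): export the codebook as a DataFrame, run the tabular model over each row, then reshape the predictions back to the map grid.

```python
import numpy as np
import matplotlib.pyplot as plt

codebook_df = learn.codebook_to_df()
preds = np.array([float(tab_learn.predict(row)[1])
                  for _, row in codebook_df.iterrows()])

map_rows, map_cols = 10, 10  # the SOM grid size used during training
plt.imshow(preds.reshape(map_rows, map_cols))
plt.colorbar()
plt.show()
```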
Regression on the SOM codebook

Now we can use plt.imshow() on the predictions to visualize the house price distribution over the SOM 😊
Tabular model predictions on the SOM codebook

This is cool, right? In the next article we will complete our SOM toolkit by adding a whole set of visualization and interpretation utilities, basing our API on the Fast.ai ClassificationInterpretation class.
Note: the library code for UnsupervisedDataBunch has since been rewritten to use Fast.ai's TabularDataBunch with additional transforms. This article builds the DataBunch from scratch, and was left untouched for easier understanding.
Translated from: https://medium.com/kirey-group/self-organizing-maps-with-fast-ai-step-4-handling-unsupervised-data-with-fast-ai-databunch-39faaa334c9b