
Self-Organizing Maps with fast.ai, Step 4: Handling Unsupervised Data with a Fast.ai DataBunch

This is the fourth part of the Self-Organizing Maps with fast.ai article series.


All the code has been published in this repository and this PyPi library.


  • Overview: Self-Organizing Maps with Fast.ai

  • Step 1: Implementing a SOM with PyTorch

  • Step 2: Training the SOM Module with a Fast.ai Learner

  • Step 3: Updating SOM hyperparameters with Fast.ai Callbacks

Overview

Many datasets come in tabular form. For this reason, Fast.ai has a handy Tabular subclass for its DataBunch that can natively perform categorization and handle continuous and categorical features.


In this article, we will use TabularDataBunch as a starting point to load our data, and then build a conversion function to move it into our UnsupervisedDataBunch.


The main features we’re going to re-implement are:


  • Normalization

  • Categorical feature encoding

  • Export of the SOM codebook into a Pandas DataFrame

Normalization

We will use a separate Normalizer class to perform per-feature normalization. Let’s define a base class:

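The code embed is missing from this copy of the article. A minimal sketch of what such a base class might look like (the method names mirror the ones used later in the article, but the exact signatures are assumptions, not the library’s real API):

```python
class Normalizer:
    "Base class for per-feature normalization strategies (hypothetical sketch)."

    def normalize(self, data):
        "Fit on `data` and return its normalized version."
        raise NotImplementedError

    def normalize_by(self, reference, data):
        "Normalize `data` using statistics computed from `reference`."
        raise NotImplementedError

    def denormalize(self, data):
        "Map normalized `data` back into the original value range."
        raise NotImplementedError
```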

While comparing different normalizers, I found out that normalizing by mean and standard deviation helps a lot with SOM convergence, so we’ll extend our Normalizer class into a VarianceNormalizer:

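Again the embed is not shown; here is one plausible implementation, written as a standalone class for clarity (in the article it extends the Normalizer base class; the eps term is my addition, guarding against constant features):

```python
import torch

class VarianceNormalizer:
    "Normalizes each feature by its mean and standard deviation (sketch)."

    def __init__(self, eps: float = 1e-8):
        self.mean, self.std = None, None
        self.eps = eps  # avoids division by zero on constant features

    def normalize(self, data: torch.Tensor) -> torch.Tensor:
        "Store per-feature statistics, then scale `data` to zero mean and unit variance."
        self.mean = data.mean(dim=0)
        self.std = data.std(dim=0) + self.eps
        return (data - self.mean) / self.std

    def denormalize(self, data: torch.Tensor) -> torch.Tensor:
        "Map normalized data (e.g. trained SOM codebook weights) back to the original range."
        return data * self.std + self.mean
```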

Note that we also implemented the denormalization function. Since we are normalizing our data, the trained SOM codebook will contain normalized data points: we will need denormalization in order to retrieve values in the initial data range.


Let’s add the normalizer to our UnsupervisedDataBunch:

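The original snippet is not included here. Below is a hedged, fastai-free sketch of the idea: normalize the training set, reuse the same statistics for the validation set, and keep the normalizer around so the codebook can later be denormalized. The constructor signature is an assumption:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class UnsupervisedDataBunch:
    "Minimal stand-in for the article's databunch (the real version subclasses fastai's DataBunch)."

    def __init__(self, train: torch.Tensor, valid: torch.Tensor, bs: int = 64, normalizer=None):
        self.normalizer = normalizer
        if normalizer is not None:
            # Fit on the training set, then reuse the same statistics for validation;
            # assumes the normalizer exposes `normalize()` and fitted `mean` / `std`.
            train = normalizer.normalize(train)
            valid = (valid - normalizer.mean) / normalizer.std
        self.train_dl = DataLoader(TensorDataset(train), batch_size=bs, shuffle=True)
        self.valid_dl = DataLoader(TensorDataset(valid), batch_size=bs)
```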

Handling categorical features

Another important preprocessing step for Self-Organizing Maps is transforming categorical features into continuous ones; this can be done either by one-hot encoding the features or by using embeddings. Since one-hot encoding is the easiest to implement, we’ll start with that, although embeddings perform better.


Both methods require a mixed distance function to compare actual continuous features and converted categoricals independently, but we will skip this step for simplicity’s sake. If you’re interested in how a mixed distance function can be implemented, feel free to have a look at the code on Github.


As we did for normalizers, we will start by defining a base CatEncoder class:

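The embedded gist is missing; a minimal base class consistent with the make_continuous / make_categorical names used below might look like this (signatures are assumptions):

```python
class CatEncoder:
    "Base class for categorical feature encoders (hypothetical sketch)."

    def fit(self, cats):
        "Learn the number of categories of each feature from the training set."
        raise NotImplementedError

    def make_continuous(self, cats):
        "Encode categorical values into continuous features."
        raise NotImplementedError

    def make_categorical(self, t):
        "Decode continuous features back into categorical values."
        raise NotImplementedError
```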

And subclass it into a OneHotCatEncoder:

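Here is one way the encoder described below could be implemented (a standalone sketch; in the article it subclasses CatEncoder):

```python
import torch
import torch.nn.functional as F

class OneHotCatEncoder:
    "One-hot encoder for ordinal-encoded categorical features (sketch)."

    def fit(self, cats: torch.Tensor) -> None:
        "Store the number of categories of each feature in the training set."
        self.n_categories = [int(cats[:, i].max().item()) + 1 for i in range(cats.shape[1])]

    def make_continuous(self, cats: torch.Tensor) -> torch.Tensor:
        "Concatenate the one-hot encoding of each feature into a single tensor."
        return torch.cat(
            [F.one_hot(cats[:, i], num_classes=n).float() for i, n in enumerate(self.n_categories)],
            dim=1,
        )

    def make_categorical(self, t: torch.Tensor) -> torch.Tensor:
        "Recover ordinal codes by taking the argmax inside each one-hot block."
        out, start = [], 0
        for n in self.n_categories:
            out.append(t[:, start:start + n].argmax(dim=1))
            start += n
        return torch.stack(out, dim=1)
```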

All we’re doing here is using torch.nn.functional.one_hot to one-hot encode our input variables: we store the number of categories of each feature in the training set during fit, then use that information to encode with make_continuous and decode with make_categorical.


Importing Pandas DataFrames

One feature we might want for our UnsupervisedDataBunch is the ability to be created from a Pandas DataFrame. As mentioned in the overview above, we will leverage Fast.ai’s TabularDataBunch to do the data loading and preprocessing for us, then import the data into our own databunch.


A TabularDataBunch is usually created as follows:

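The embedded snippet is not shown in this copy; a typical fastai v1 data block pipeline for this looks roughly like the following (df, cat_names, cont_names and dep_var are assumed to be defined beforehand):

```python
from fastai.tabular import *

procs = [FillMissing, Categorify]  # fill missing values, ordinal-encode categoricals

data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_rand_pct(valid_pct=0.2)   # train / validation split
        .label_from_df(cols=dep_var)
        .databunch(bs=64))
```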

Creation of a TabularDataBunch from a Pandas DataFrame

The code above does the following:


  • Load the dataframe into a TabularList

  • Split the list into training and validation sets

  • Fill in missing values

  • Turn all categorical features into ordinal categories

Now we can write a conversion function to transform the TabularDataBunch into an UnsupervisedDataBunch. This is where things get hacky: we need to retrieve categorical and continuous features separately, process categorical features using OneHotCatEncoder and then concatenate everything into a single Tensor.

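The conversion gist is missing here; an approximate version is sketched below. The way categorical codes and continuous values are pulled out of fastai’s tabular datasets (each item exposing .data as a [cats, conts] pair) is an assumption and may differ from the library’s actual internals:

```python
import torch

def _ds_to_tensor(ds, encoder) -> torch.Tensor:
    "Stack a tabular dataset into a single tensor of one-hot cats + conts (sketch)."
    cats = torch.stack([item.data[0] for item in ds.x])
    conts = torch.stack([item.data[1] for item in ds.x])
    return torch.cat([encoder.make_continuous(cats), conts.float()], dim=1)

def to_unsupervised_databunch(data, bs: int = 64):
    "Convert a fastai TabularDataBunch into an UnsupervisedDataBunch (sketch)."
    encoder = OneHotCatEncoder()
    encoder.fit(torch.stack([item.data[0] for item in data.train_ds.x]))
    train = _ds_to_tensor(data.train_ds, encoder)
    valid = _ds_to_tensor(data.valid_ds, encoder)
    return UnsupervisedDataBunch(train, valid, bs=bs, normalizer=VarianceNormalizer())
```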

TabularDataBunch -> UnsupervisedDataBunch conversion

Since the TabularDataBunch can have a target variable, we are going to add the optional train_y and valid_y arguments to our UnsupervisedDataBunch:

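The gist is missing; a standalone sketch of the constructor with optional targets (the article’s version subclasses fastai’s DataBunch, so this is illustrative only):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class UnsupervisedDataBunch:
    "Sketch of the databunch with optional targets (the real version subclasses fastai's DataBunch)."

    def __init__(self, train, valid, bs=64, train_y=None, valid_y=None, normalizer=None):
        self.normalizer = normalizer  # normalization handled as in the section above
        # With targets available (e.g. from a labelled TabularDataBunch), store
        # them next to the inputs; otherwise build input-only datasets.
        train_ds = TensorDataset(train, train_y) if train_y is not None else TensorDataset(train)
        valid_ds = TensorDataset(valid, valid_y) if valid_y is not None else TensorDataset(valid)
        self.train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)
        self.valid_dl = DataLoader(valid_ds, batch_size=bs)
```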

UnsupervisedDataBunch with optional targets

We can now convert any TabularDataBunch by simply using the extension function:

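A short usage sketch, attaching the conversion as a method in fastai’s extension-function style (the batch size is arbitrary):

```python
# Monkey-patch the conversion onto TabularDataBunch, fastai-style.
TabularDataBunch.to_unsupervised_databunch = to_unsupervised_databunch

udb = data.to_unsupervised_databunch(bs=256)
```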

The next step is testing everything we’ve done so far on an actual dataset.


Training on the House Prices dataset

I chose the House Prices dataset, since it is well-known and it has a good number of categorical features that we can use to test our data workflow. You can find the dataset on Kaggle, among other places.


Let’s start from a CSV file containing the training set and go from there:

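The training gist is not included in this copy; an end-to-end sketch under the same assumptions as above (the SomLearner constructor arguments come from the earlier articles in the series and may not match the library exactly):

```python
import pandas as pd
from fastai.tabular import *

df = pd.read_csv('train.csv')
dep_var = 'SalePrice'
cat_names = df.select_dtypes('object').columns.tolist()
cont_names = [c for c in df.columns if c not in cat_names + [dep_var]]

data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names,
                            procs=[FillMissing, Categorify])
        .split_by_rand_pct(valid_pct=0.2)
        .label_from_df(cols=dep_var)
        .databunch(bs=128))

# Convert to unsupervised data and train the SOM built in the previous steps.
udb = data.to_unsupervised_databunch(bs=256)
learn = SomLearner(udb, size=(10, 10))
learn.fit(100)
```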

Pretty neat, right? In just about 40 lines of code we got ourselves a trained Self-Organizing Map 😊


Here’s the loss plot:


Loss plot on the House Prices dataset

Creating a DataFrame from the SOM codebook

One of the best things about Self-Organizing Maps is the ability to run predictions of another model (trained on the same dataset) over the codebook elements, and then plot prediction values / classes for each item on the map.


To do so, we could write a codebook_to_df function inside our SomLearner:

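The gist is missing here; the sketch below shows the general shape of such a function. Every attribute on the learner (the model’s weights layout, the stored encoder and normalizer, the feature names) is an assumption:

```python
import pandas as pd

def codebook_to_df(learn) -> pd.DataFrame:
    "Build a DataFrame with one row per SOM unit from the trained codebook (sketch)."
    w = learn.model.weights.view(-1, learn.model.weights.shape[-1])  # (rows*cols, n_features)
    n_onehot = sum(learn.encoder.n_categories)             # width of the one-hot block
    cats = learn.encoder.make_categorical(w[:, :n_onehot])
    # Assumes continuous columns were normalized separately from the one-hot block.
    conts = learn.normalizer.denormalize(w[:, n_onehot:])
    df = pd.DataFrame(conts.detach().numpy(), columns=learn.cont_names)
    for i, name in enumerate(learn.cat_names):
        df[name] = cats[:, i].numpy()
    return df
```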

Creating a DataFrame from SOM codebook elements

Now we need a model to use for prediction. Let’s use this Kaggle submission of House Prices regression with a Fast.ai Tabular learner as a starter:

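The training snippet is not shown; a rough fastai v1 equivalent of that kernel’s setup (layer sizes and epoch count are arbitrary choices):

```python
from fastai.tabular import *

# Regression learner on the same DataFrame, labelled with log(SalePrice).
reg_data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names,
                                procs=[FillMissing, Categorify, Normalize])
            .split_by_rand_pct(valid_pct=0.2)
            .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
            .databunch(bs=128))

reg_learn = tabular_learner(reg_data, layers=[200, 100])
reg_learn.fit_one_cycle(10)
```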

Training a tabular model

Running regression on the SOM codebook

Now that we have a trained regressor, let’s generate the DataFrame of the SOM codebook and use it as a test set:

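The gist is missing; one hedged way to do it is row-by-row prediction over the codebook DataFrame (slow but simple; the indexing into learn.predict’s return value is an assumption):

```python
import torch

codebook_df = codebook_to_df(learn)  # the SOM learner from the previous sections

# Predict a (log) price for every SOM unit.
preds = torch.tensor([reg_learn.predict(row)[1].item()
                      for _, row in codebook_df.iterrows()])
preds = preds.view(10, 10)  # reshape to the map grid size
```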

Regression on the SOM codebook

Now we can use plt.imshow() on predictions to get a visualization of house price distribution over the SOM 😊

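For instance (random data stands in for the actual predictions so the snippet runs on its own):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, for running as a script
import matplotlib.pyplot as plt
import numpy as np

# `preds` would be the regressor's output reshaped to the map grid;
# random values are used here purely for illustration.
preds = np.random.rand(10, 10)
img = plt.imshow(preds, cmap="viridis")
plt.colorbar(img)
plt.title("Predicted sale price per SOM unit")
```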

Tabular model predictions on the SOM codebook

This is cool, right? In the next article we will complete our SOM toolkit by adding a whole set of visualization and interpretation utilities, basing our API on the Fast.ai ClassificationInterpretation class.


Note: the library code for UnsupervisedDataBunch has been rewritten by using Fast.ai TabularDataBunch with additional transforms. This article builds the DataBunch from scratch, and it was left untouched for easier understanding.


Originally published at: https://medium.com/kirey-group/self-organizing-maps-with-fast-ai-step-4-handling-unsupervised-data-with-fast-ai-databunch-39faaa334c9b
