
Abstractive Summarization for Data Augmentation



A Creative Solution to Imbalanced Class Distribution

Imbalanced class distribution is a common problem in Machine Learning. I was recently confronted with this issue when training a sentiment classification model. Certain categories were far more prevalent than others, and the predictive quality of the model suffered. The first technique I used to address this was random under-sampling, wherein I randomly sampled a subset of rows from each category up to a ceiling threshold. I selected a ceiling that reasonably balanced the three largest classes. Although a small improvement was observed, the model was still far from optimal.

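For readers who want to try this first step themselves, here is a minimal sketch of random under-sampling with a ceiling threshold (the toy DataFrame and column names are illustrative, not my actual dataset):

```python
import pandas as pd

def undersample(df: pd.DataFrame, label_col: str, ceiling: int, seed: int = 0) -> pd.DataFrame:
    # Keep at most `ceiling` randomly chosen rows per category
    parts = [
        group.sample(min(len(group), ceiling), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts).reset_index(drop=True)

# Toy data: class "a" (4 rows) is capped at 2, class "b" (2 rows) is untouched
df = pd.DataFrame({"text": list("uvwxyz"), "label": ["a"] * 4 + ["b"] * 2})
balanced = undersample(df, "label", ceiling=2)
```

The ceiling only trims the over-represented classes; it does nothing for the under-represented ones, which is exactly the gap discussed next.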

I needed a way to deal with the under-represented classes. I could not rely on traditional techniques used in multi-class classification such as sample and class weighting, as I was working with a multi-label dataset. It became evident that I would need to leverage oversampling in this situation.


A technique such as SMOTE (Synthetic Minority Over-sampling Technique) can be effective for oversampling, although the problem again becomes a bit more difficult with multi-label datasets. MLSMOTE (Multi-Label Synthetic Minority Over-sampling Technique) has been proposed [1], but the high dimensional nature of the numerical vectors created from text can sometimes make other forms of data augmentation more appealing.

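To make the idea concrete, here is a bare-bones sketch of the interpolation at the heart of SMOTE (single-label, nearest neighbor only; the real algorithm samples among k neighbors, and MLSMOTE adds label-set handling on top):

```python
import numpy as np

def smote_like(X: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    # For each synthetic point: pick a random minority sample, find its
    # nearest neighbor, and interpolate between the two
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf          # exclude the point itself
        j = int(np.argmin(dists))  # nearest neighbor
        gap = rng.random()         # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

# Three minority samples in 2-D; generate five synthetic ones
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = smote_like(X_min, n_new=5, seed=42)
```

With high-dimensional text vectors, though, these interpolated points quickly become less meaningful, which is what pushed me toward text-level augmentation instead.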

Photo by Christian Wagner on Unsplash

Transformers to the Rescue!

If you decided to read this article, it is safe to assume that you are aware of the latest advances in Natural Language Processing bequeathed by the mighty Transformers. The exceptional developers at Hugging Face in particular have opened the door to this world through their open source contributions. One of their more recent releases implements a breakthrough in Transfer Learning called the Text-to-Text Transfer Transformer, or T5 model, originally presented by Raffel et al. in their paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [2].


T5 allows us to execute various NLP tasks by specifying prefixes to the input text. In my case, I was interested in Abstractive Summarization, so I made use of the summarize prefix.


Text-to-Text Transfer Transformer [2]
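Under the hood the task prefix is nothing more exotic than plain text prepended to the input string. A trivial illustration (the prefixes shown are among those documented for T5):

```python
def t5_input(task_prefix: str, text: str) -> str:
    # T5 selects its task from a plain-text prefix on the input string
    return f"{task_prefix}: {text}"

summarize_input = t5_input("summarize", "studies have shown that owning a dog is good for you")
translate_input = t5_input("translate English to German", "That is good.")
```

The actual tokenization and generation call appears later in this article.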

Abstractive Summarization

Abstractive Summarization, put simply, is a technique by which a chunk of text is fed to an NLP model and a novel summary of that text is returned. This should not be confused with Extractive Summarization, where sentences are embedded and a clustering algorithm is executed to find those closest to the clusters' centroids; that is, existing sentences are returned rather than new ones. Abstractive Summarization seemed particularly appealing as a Data Augmentation technique because of its ability to generate novel yet realistic sentences of text.

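For contrast, here is a toy sketch of the extractive approach just described, using made-up 2-D vectors in place of real sentence embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

def extractive_indices(sentence_vecs: np.ndarray, n_clusters: int) -> list:
    # Cluster the sentence embeddings, then pick the existing sentence
    # closest to each cluster centroid
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(sentence_vecs)
    picked = {
        int(np.argmin(np.linalg.norm(sentence_vecs - c, axis=1)))
        for c in km.cluster_centers_
    }
    return sorted(picked)

# Two tight groups of "sentences": a summary should take one from each
vecs = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
summary_idx = extractive_indices(vecs, n_clusters=2)
```

Because it can only return sentences that already exist, this approach adds no new text to the dataset, whereas abstractive models do.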

Algorithm

Here are the steps I took to use Abstractive Summarization for Data Augmentation, including code segments illustrating the solution.


I first needed to determine how many rows each under-represented class required. The number of rows to add for each feature is thus calculated with a ceiling threshold, and we refer to these as the append_counts. Features with counts above the ceiling are not appended. In particular, if a given feature has 1000 rows and the ceiling is 100, its append count will be 0. The following methods trivially achieve this in the situation where features have been one-hot encoded:


def get_feature_counts(self, df):
    shape_array = {}
    for feature in self.features:
        shape_array[feature] = df[feature].sum()
    return shape_array

def get_append_counts(self, df):
    append_counts = {}
    feature_counts = self.get_feature_counts(df)
    for feature in self.features:
        if feature_counts[feature] >= self.threshold:
            count = 0
        else:
            count = self.threshold - feature_counts[feature]
        append_counts[feature] = count
    return append_counts
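The same logic can be exercised standalone with a free function and a toy one-hot encoded DataFrame (the data and column names here are illustrative):

```python
import pandas as pd

def get_append_counts(df: pd.DataFrame, features: list, threshold: int) -> dict:
    # Rows to append per one-hot feature: zero if already at or above the ceiling
    return {f: max(0, threshold - int(df[f].sum())) for f in features}

# "books" has 3 rows and exceeds the ceiling of 2; "music" needs 1 more row
df = pd.DataFrame({"books": [1, 1, 1, 0], "music": [0, 0, 0, 1]})
counts = get_append_counts(df, ["books", "music"], threshold=2)
```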

For each feature, a loop runs from the current append index up to that index plus the append count specified for the given feature. This append_index variable, along with a tasks array, is introduced to allow for multiprocessing, which we will discuss shortly.


from functools import partial

counts = self.get_append_counts(self.df)

# Create append dataframe with length of all rows to be appended
self.df_append = pd.DataFrame(
    index=np.arange(sum(counts.values())),
    columns=self.df.columns
)

# Creating array of tasks for multiprocessing
tasks = []

# Set all feature values to 0
for feature in self.features:
    self.df_append[feature] = 0

for feature in self.features:
    num_to_append = counts[feature]
    for num in range(
        self.append_index,
        self.append_index + num_to_append
    ):
        # Defer the call with partial so it can run later
        # in its own process
        tasks.append(
            partial(self.process_abstractive_summarization, feature, num)
        )
    # Updating index for insertion into shared appended dataframe
    # to preserve indexing for multiprocessing
    self.append_index += num_to_append

An Abstractive Summarization is calculated for a subset of specified size, sampled from all rows that uniquely have the given feature, and is added to the append DataFrame with its respective feature one-hot encoded.


df_feature = self.df[
    (self.df[feature] == 1) &
    (self.df[self.features].sum(axis=1) == 1)
]
df_sample = df_feature.sample(self.num_samples, replace=True)
text_to_summarize = ' '.join(
    df_sample[:self.num_samples]['review_text']
)
new_text = self.get_abstractive_summarization(text_to_summarize)
self.df_append.at[num, 'text'] = new_text
self.df_append.at[num, feature] = 1

The Abstractive Summarization itself is generated in the following way:


t5_prepared_text = "summarize: " + text_to_summarize

# Tokenize the prefixed text and move it to the model's device
tokenized_text = self.tokenizer.encode(
    t5_prepared_text,
    return_tensors=self.return_tensors
).to(self.device)

summary_ids = self.model.generate(
    tokenized_text,
    num_beams=self.num_beams,
    no_repeat_ngram_size=self.no_repeat_ngram_size,
    min_length=self.min_length,
    max_length=self.max_length,
    early_stopping=self.early_stopping
)
output = self.tokenizer.decode(
    summary_ids[0],
    skip_special_tokens=self.skip_special_tokens
)

In initial tests the summarization calls to the T5 model were extremely time-consuming, reaching up to 25 seconds even on a GCP instance with an NVIDIA Tesla P100. Clearly this needed to be addressed to make this a feasible solution for data augmentation.


Photo by Brad Neathery on Unsplash

Multiprocessing

I introduced a multiprocessing option, whereby the calls to Abstractive Summarization are stored in a tasks array that is later passed to a sub-routine running the calls in parallel using the multiprocessing library. This resulted in a dramatic decrease in runtime. I must thank David Foster for his succinct Stack Overflow contribution [3]!


running_tasks = [Process(target=task) for task in tasks]
for running_task in running_tasks:
    running_task.start()
for running_task in running_tasks:
    running_task.join()

Simplified Solution

To make things easier for everybody I packaged this into a library called absum. Installation is possible through pip: pip install absum. One can also download it directly from the repository.


Running the code on your own dataset is then simply a matter of importing the library’s Augmentor class and running its abs_sum_augment method as follows:


import pandas as pd
from absum import Augmentor

csv = 'path_to_csv'
df = pd.read_csv(csv)
augmentor = Augmentor(df)
df_augmented = augmentor.abs_sum_augment()
df_augmented.to_csv(
    csv.replace('.csv', '-augmented.csv'),
    encoding='utf-8',
    index=False
)

absum uses the Hugging Face T5 model by default, but is designed in a modular way to allow you to use any pre-trained or out-of-the-box Transformer models capable of Abstractive Summarization. It is format agnostic, expecting only a DataFrame containing text and one-hot encoded features. If additional columns are present that you do not wish to be considered, you have the option to pass in specific one-hot encoded features as a comma-separated string to the features parameter.


Also of special note are the min_length and max_length parameters, which determine the size of the resulting summarizations. One trick I found useful is to find the average character count of the text data you’re working with and start with something a bit lower for the minimum length while slightly padding it for the maximum. All available parameters are detailed in the documentation.

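That trick might be sketched as follows (note that T5's min_length and max_length count tokens rather than characters, so this average-character heuristic is only a rough starting point):

```python
def summary_length_bounds(texts, lower=0.9, upper=1.1):
    # Center the summary length bounds on the average length of the
    # source texts: a bit lower for the minimum, a bit padded for the maximum
    avg = sum(len(t) for t in texts) / len(texts)
    return int(avg * lower), int(avg * upper)

texts = ["a" * 80, "b" * 120]  # average length: 100 characters
min_length, max_length = summary_length_bounds(texts)
```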

Feel free to add any suggestions for improvement in the comments, or better yet in a PR. Happy coding!


Translated from: https://towardsdatascience.com/abstractive-summarization-for-data-augmentation-1423d8ec079e
