

Notes on "reStructured Pre-training"

Published: 2023/12/29

Notes on reStructured Pre-training

These notes record the parts of the paper I consider most important, together with my own understanding; please point out any mistakes directly. Due to formatting limitations, I strongly recommend reading the complete version with figures on the Notion page. Thanks!

Abstract

In such a paradigm, the role of data will be re-emphasized, and model
pre-training and fine-tuning of downstream tasks are viewed as a process of data storing and accessing.

a good storage mechanism should not only have the ability to cache a large amount of data but also consider the ease of access.
We achieve this by pre-training models over restructured data that consist of a variety of valuable information instead of raw data after overcoming several engineering challenges.

💡 How exactly is the data restructured?

Hypothesis of NLP technique evolution

[Figure: the paper's hypothesis of NLP technique evolution; screenshot not preserved.]

1 Introduction

We argue that the ultimate goal of data storage is to better serve human life, and how data is accessed is as important as how it is stored. However, there are often differences in the way that data is stored and accessed.

The authors argue that the ultimate goal of storing data is to better serve human life, so how data is accessed is just as important as how it is stored.

Although prompting methods have narrowed the difference between data storage and access, it does not fundamentally eliminate the gap, as the way models store data in the pre-training stage is not transparent to diverse downstream tasks.

Although prompting methods have narrowed the difference between data storage and access, they do not fundamentally eliminate the gap, because the way models store data during pre-training is not transparent to the diverse downstream tasks.

In other words, a downstream task does not know which method (i.e., which prompts) can best retrieve the desired data from the pre-trained model.

For example, in sentiment classification, to predict the sentiment of a sentence with the help of a pre-trained model, we must choose a question format the model is familiar with; yet the system designer does not know which format the model prefers, because the distribution and structure of the pre-training data are not interpretable. The figure below illustrates this example:

[Figure: a sentiment-classification example showing that the designer does not know which prompt format the model prefers; screenshot not preserved.]

Methodologically, we present a new way to look at data that contains various types of information, which could be regarded as pre-training signals that can instruct models for parameter optimization. We structurally represent data in the unit of signals and claim that a good PLM should mark various signals during pre-training in a way that expected information could be accessed efficiently by downstream tasks.

The authors treat the different kinds of information contained in data as pre-training signals that instruct the model's parameter optimization, and structurally represent data in units of signals.

A good PLM should mark the various kinds of signals during pre-training so that downstream tasks can efficiently access the data they need.

This is just like storing data in a database: we first structure it and put it into structured tables, so that we can later retrieve exactly the data we want through a structured language such as SQL.
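The database analogy can be made concrete with a small sketch (purely illustrative; the paper does not literally store signals in SQL): each signal is a typed (input, output) record, and a downstream task queries only the signal type it needs.

```python
import sqlite3

# Illustrative sketch of the storage/access analogy: signals as typed records
# in a structured table. The signal types and examples here are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (signal_type TEXT, input TEXT, output TEXT)")
conn.executemany(
    "INSERT INTO signals VALUES (?, ?, ?)",
    [
        ("sentiment", "I really enjoyed this movie.", "positive"),
        ("entity", "Mozart was born in Salzburg.", "Mozart: PERSON"),
        ("sentiment", "The plot made no sense.", "negative"),
    ],
)

# A downstream task "accesses" exactly the kind of information it needs.
rows = conn.execute(
    "SELECT input, output FROM signals WHERE signal_type = ?", ("sentiment",)
).fetchall()
print(rows)
# [('I really enjoyed this movie.', 'positive'), ('The plot made no sense.', 'negative')]
```

The point of the analogy is the precise retrieval: plain-text pre-training offers no equivalent of the `WHERE signal_type = ...` clause, which is exactly the gap restructuring tries to close.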

Moreover, we argue that valuable signals are rich and exist everywhere from the data in the world instead of simply existing in the supervised datasets that are manually curated

Valuable signals are abundant and exist everywhere in the world's data, rather than only in manually curated supervised datasets.

and what we need to do is to (a) identify them, (b) restructure them in a unified language, (c) integrate and store them into the pre-trained language model. We call this learning paradigm reStructured Pre-training.

What we need to do is:

  • identify them
  • restructure them in a unified language
  • integrate and store them into the pre-trained model

The authors call this learning paradigm reStructured Pre-training.

A good PLM should have a clear picture of the composition of the various signals in the data to provide accurate information for downstream tasks according to their different needs.

That is, a good PLM should clearly understand how the different kinds of signals in the data are composed, so that it can provide accurate information according to the varying needs of downstream tasks.

2 reStructured Pre-training

2.1 Paradigm Shift in Modern NLP

[Figure: the paradigm shift in modern NLP; screenshot not preserved.]

2.2 reStructured Pre-training

Unlike existing paradigms that mainly focus on model-centric design, we think more from the data perspective to maximize the utility of the already available data.

In other words, the focus is on maximizing the utility of data that already exists.

Specifically, we take a data storing & accessing view where the pre-training stage is considered as a data storing process while downstream task training based on pre-trained models is regarded as data accessing process from pre-trained models, and claim that a good data storage mechanism should make the stored data more accessible.

The authors take a data storing & accessing view: the pre-training stage is regarded as a data storing process, while downstream task training is regarded as a data accessing process.

A good data storage mechanism should make the stored data easier to access.

To achieve this goal, we look at data as an object that consists of diverse signals and argue that a good pre-trained model should (1) cover as many types of signals as possible and (2) provide precise access mechanisms for these signals when required by downstream tasks. i.e., a shift from pre-training over plain texts to pre-training over structured signals. In general, there are three steps within this new paradigm.

To achieve this goal, the authors view data as an object composed of diverse signals and argue that a good pre-trained model should:

  • cover as many types of signals as possible
  • provide precise access mechanisms for the signals required by downstream tasks (i.e., shift from pre-training over plain text to pre-training over structured signals)

Overall, the new paradigm has three steps:

  • reStructure
  • Pre-train
  • Fine-tune

reStructure: Since existing signals come in many different formats, they must be restructured into a unified format for model pre-training.

Pre-train: Once all training data have been restructured into a unified format, choose a pre-training architecture and train on the structured data.

Fine-tune: After pre-training, the model can be further fine-tuned with structured labeled data; another common scenario is to apply it directly to downstream tasks, usually via zero-shot prompting.
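The reStructure step can be sketched as rewriting every signal type into one unified text-to-text format; the template wording below is hypothetical, not taken from the paper.

```python
# Hypothetical sketch of the reStructure step: heterogeneous signals are
# rewritten into a single (source text, target text) format for pre-training.
def restructure(signal_type: str, x: str, y: str) -> tuple[str, str]:
    templates = {
        "sentiment": "What is the sentiment of the following text? {x}",
        "entity": "List the named entities in the following text. {x}",
    }
    return templates[signal_type].format(x=x), y

src, tgt = restructure("sentiment", "I really enjoyed this movie.", "positive")
print(src)  # What is the sentiment of the following text? I really enjoyed this movie.
print(tgt)  # positive
```

Because every signal already carries a question-like prompt at pre-training time, zero-shot prompting at deployment amounts to asking in a format the model has already seen.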

2.3 Evolutionary Process of Engineering Cycles

[Figure: the evolutionary process of engineering cycles; screenshot not preserved.]

The core driving force behind machine learning techniques:

the iteration of technology always moves along the direction that system developers can design a better and more general system by doing fewer things.

In other words, technology iterates toward systems that are better and more general while requiring less work from their developers.

2.4 Design Considerations

  • Signal Definition: As the first step of restructured learning, we need to know which signals naturally exist in the world and are collectible and accessible.
  • Data Mine Identification: A data mine is a collection of data containing many types of signals; once signal definition is done, we look for suitable data mines.
  • Signal Extraction: How to effectively extract signals from data mines is also important.
  • Signal Restructuring: This step concerns how to represent all types of signals in a unified format and narrow the gap between data storage and retrieval.
  • Pre-training and Tuning: This step concerns which pre-training architecture to use so that all structured data can be represented effectively.
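As a toy illustration of the Signal Extraction step (a hypothetical sketch, not the paper's actual pipeline), Wikipedia-style `[[...]]` markup is a "data mine" in which links naturally mark entity mentions:

```python
import re

# Hypothetical sketch of signal extraction: mine (plain text, entity) signals
# from Wikipedia-style markup, where [[...]] links mark entity mentions.
def extract_entity_signals(text: str) -> list[tuple[str, str]]:
    plain = re.sub(r"\[\[([^\]]+)\]\]", r"\1", text)  # strip the link markup
    entities = re.findall(r"\[\[([^\]]+)\]\]", text)
    return [(plain, entity) for entity in entities]

mine = "[[Mozart]] was born in [[Salzburg]]."
print(extract_entity_signals(mine))
# [('Mozart was born in Salzburg.', 'Mozart'), ('Mozart was born in Salzburg.', 'Salzburg')]
```

Signals mined this way cost nothing to annotate, which is the appeal of looking beyond manually curated supervised datasets.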

3 reStructuring Engineering
