

Notes on "reStructured Pre-training"

Published: 2023/12/29

Notes on reStructured Pre-training

These notes record the parts of the paper I consider most important, together with my own understanding; please point out any mistakes directly. Due to formatting limitations, I strongly recommend reading the complete version with figures on the Notion page. Thanks!

Abstract

In such a paradigm, the role of data will be re-emphasized, and model
pre-training and fine-tuning of downstream tasks are viewed as a process of data storing and accessing.

a good storage mechanism should not only have the ability to cache a large amount of data but also consider the ease of access.
We achieve this by pre-training models over restructured data that consist of a variety of valuable information instead of raw data after overcoming several engineering challenges.

💡 How exactly is the data restructured?

Hypothesis of NLP technique evolution

[Figure: the paper's hypothesis of NLP technique evolution; screenshot not preserved.]

1 Introduction

We argue that the ultimate goal of data storage is to better serve human life, and how data is accessed is as important as how it is stored. However, there are often differences in the way that data is stored and accessed.

The authors argue that the ultimate goal of storing data is to better serve human life, so how data is accessed is just as important as how it is stored.

Although prompting methods have narrowed the difference between data storage and access, it does not fundamentally eliminate the gap, as the way models store data in the pre-training stage is not transparent to diverse downstream tasks.

Although prompting methods have narrowed the difference between data storage and access, they do not fundamentally eliminate the gap, because the way models store data during pre-training is not transparent to the diverse downstream tasks.

In other words, a downstream task does not know which method (i.e., which prompts) can best retrieve the desired data from the pre-trained model.

For example, in sentiment classification, to predict the sentiment of a sentence with the help of a pre-trained model, we must choose a question format the model is familiar with; yet the system designer does not know which format the model prefers, because the distribution and structure of the pre-training data are not interpretable. The figure below illustrates this example:

[Figure: a sentiment-classification example showing that the designer does not know which prompt format the model prefers; screenshot not preserved.]

Methodologically, we present a new way to look at data that contains various types of information, which could be regarded as pre-training signals that can instruct models for parameter optimization. We structurally represent data in the unit of signals and claim that a good PLM should mark various signals during pre-training in a way that expected information could be accessed efficiently by downstream tasks.

The authors treat the different kinds of information contained in data as pre-training signals that instruct the model's parameter optimization, and structurally represent data in units of signals.

A good PLM should mark the various kinds of signals during pre-training so that downstream tasks can efficiently access the data they need.

This is just like storing data in a database: we first structure it and put it into structured tables, so that we can later retrieve exactly the data we want through a structured language such as SQL.
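The database analogy can be made concrete with a small sketch (purely illustrative; the paper does not literally store signals in SQL): each signal is a typed (input, output) record, and a downstream task queries only the signal type it needs.

```python
import sqlite3

# Illustrative sketch of the storage/access analogy: signals as typed records
# in a structured table. The signal types and examples here are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (signal_type TEXT, input TEXT, output TEXT)")
conn.executemany(
    "INSERT INTO signals VALUES (?, ?, ?)",
    [
        ("sentiment", "I really enjoyed this movie.", "positive"),
        ("entity", "Mozart was born in Salzburg.", "Mozart: PERSON"),
        ("sentiment", "The plot made no sense.", "negative"),
    ],
)

# A downstream task "accesses" exactly the kind of information it needs.
rows = conn.execute(
    "SELECT input, output FROM signals WHERE signal_type = ?", ("sentiment",)
).fetchall()
print(rows)
# [('I really enjoyed this movie.', 'positive'), ('The plot made no sense.', 'negative')]
```

The point of the analogy is the precise retrieval: plain-text pre-training offers no equivalent of the `WHERE signal_type = ...` clause, which is exactly the gap restructuring tries to close.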

Moreover, we argue that valuable signals are rich and exist everywhere from the data in the world instead of simply existing in the supervised datasets that are manually curated

Valuable signals are abundant and exist everywhere in the world's data, rather than only in manually curated supervised datasets.

and what we need to do is to (a) identify them, (b) restructure them in a unified language, (c) integrate and store them into the pre-trained language model. We call this learning paradigm reStructured Pre-training.

What we need to do is:

  • identify them
  • restructure them in a unified language
  • integrate and store them into the pre-trained model

The authors call this learning paradigm reStructured Pre-training.

A good PLM should have a clear picture of the composition of the various signals in the data to provide accurate information for downstream tasks according to their different needs.

That is, a good PLM should clearly understand how the different kinds of signals in the data are composed, so that it can provide accurate information according to the varying needs of downstream tasks.

2 reStructured Pre-training

2.1 Paradigm Shift in Modern NLP

[Figure: the paradigm shift in modern NLP; screenshot not preserved.]

2.2 reStructured Pre-training

Unlike existing paradigms that mainly focus on model-centric design, we think more from the data perspective to maximize the utility of the already available data.

In other words, the focus is on maximizing the utility of data that already exists.

Specifically, we take a data storing & accessing view where the pre-training stage is considered as a data storing process while downstream task training based on pre-trained models is regarded as data accessing process from pre-trained models, and claim that a good data storage mechanism should make the stored data more accessible.

The authors take a data storing & accessing view: the pre-training stage is regarded as a data storing process, while downstream task training is regarded as a data accessing process.

A good data storage mechanism should make the stored data easier to access.

To achieve this goal, we look at data as an object that consists of diverse signals and argue that a good pre-trained model should (1) cover as many types of signals as possible and (2) provide precise access mechanisms for these signals when required by downstream tasks. i.e., a shift from pre-training over plain texts to pre-training over structured signals. In general, there are three steps within this new paradigm.

To achieve this goal, the authors view data as an object composed of diverse signals and argue that a good pre-trained model should:

  • cover as many types of signals as possible
  • provide precise access mechanisms for the signals required by downstream tasks (i.e., shift from pre-training over plain text to pre-training over structured signals)

Overall, the new paradigm has three steps:

  • reStructure
  • Pre-train
  • Fine-tune

reStructure: Since existing signals come in many different formats, they must be restructured into a unified format for model pre-training.

Pre-train: Once all training data have been restructured into a unified format, choose a pre-training architecture and train on the structured data.

Fine-tune: After pre-training, the model can be further fine-tuned with structured labeled data; another common scenario is to apply it directly to downstream tasks, usually via zero-shot prompting.
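The reStructure step can be sketched as rewriting every signal type into one unified text-to-text format; the template wording below is hypothetical, not taken from the paper.

```python
# Hypothetical sketch of the reStructure step: heterogeneous signals are
# rewritten into a single (source text, target text) format for pre-training.
def restructure(signal_type: str, x: str, y: str) -> tuple[str, str]:
    templates = {
        "sentiment": "What is the sentiment of the following text? {x}",
        "entity": "List the named entities in the following text. {x}",
    }
    return templates[signal_type].format(x=x), y

src, tgt = restructure("sentiment", "I really enjoyed this movie.", "positive")
print(src)  # What is the sentiment of the following text? I really enjoyed this movie.
print(tgt)  # positive
```

Because every signal already carries a question-like prompt at pre-training time, zero-shot prompting at deployment amounts to asking in a format the model has already seen.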

2.3 Evolutionary Process of Engineering Cycles

[Figure: the evolutionary process of engineering cycles; screenshot not preserved.]

The core driving force behind machine learning techniques:

the iteration of technology always moves along the direction that system developers can design a better and more general system by doing fewer things.

In other words, technology iterates toward systems that are better and more general while requiring less work from their developers.

2.4 Design Considerations

  • Signal Definition: As the first step of restructured learning, we need to know which signals naturally exist in the world and are collectible and accessible.
  • Data Mine Identification: A data mine is a collection of data containing many types of signals; once signal definition is done, we look for suitable data mines.
  • Signal Extraction: How to effectively extract signals from data mines is also important.
  • Signal Restructuring: This step concerns how to represent all types of signals in a unified format and narrow the gap between data storage and retrieval.
  • Pre-training and Tuning: This step concerns which pre-training architecture to use so that all structured data can be represented effectively.
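As a toy illustration of the Signal Extraction step (a hypothetical sketch, not the paper's actual pipeline), Wikipedia-style `[[...]]` markup is a "data mine" in which links naturally mark entity mentions:

```python
import re

# Hypothetical sketch of signal extraction: mine (plain text, entity) signals
# from Wikipedia-style markup, where [[...]] links mark entity mentions.
def extract_entity_signals(text: str) -> list[tuple[str, str]]:
    plain = re.sub(r"\[\[([^\]]+)\]\]", r"\1", text)  # strip the link markup
    entities = re.findall(r"\[\[([^\]]+)\]\]", text)
    return [(plain, entity) for entity in entities]

mine = "[[Mozart]] was born in [[Salzburg]]."
print(extract_entity_signals(mine))
# [('Mozart was born in Salzburg.', 'Mozart'), ('Mozart was born in Salzburg.', 'Salzburg')]
```

Signals mined this way cost nothing to annotate, which is the appeal of looking beyond manually curated supervised datasets.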

3 reStructuring Engineering
