“Everything Can Be Seq2Seq” | A Faithful, Hand-Written Translation of the T5 Paper
《Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer》
Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
Keywords: transfer learning, natural language processing, multi-task learning, attention-based models, deep learning
Transfer learning, in which a model is first pre-trained on a data-rich task and then fine-tuned for a downstream task, has become a powerful technique in natural language processing. The effectiveness of transfer learning has given rise to a diversity of approaches, methods, and practices. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. We systematically compare pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors across dozens of language understanding tasks. By combining the insights from our exploration with scale and the new “Colossal Clean Crawled Corpus” (C4), we obtain state-of-the-art results on many benchmarks, including summarization, question answering, text classification, and more. To facilitate progress in transfer learning for NLP, we release our data set, pre-trained models, and code.
章節(jié)1 介紹 /?Introduction
Training a machine learning model to perform natural language processing (NLP) tasks often requires that the model can process text in a way that is amenable to downstream learning. This can be loosely viewed as developing general-purpose knowledge that allows the model to “understand” text. This knowledge can range from low-level (e.g. the spelling or meaning of words) to high-level (e.g. that a tuba is too large to fit in most backpacks). In modern machine learning practice, providing this knowledge is rarely done explicitly; instead, it is often learned as part of an auxiliary task. For example, a historically common approach is to use word vectors (Mikolov et al., 2013b,a; Pennington et al., 2014) to map word identities to a continuous representation where, ideally, similar words map to similar vectors. These vectors are often learned through an objective that, for example, encourages co-occurring words to be positioned nearby in the continuous space (Mikolov et al., 2013b).
Training a machine learning model to perform natural language processing tasks often requires that the model can process text in a way that suits downstream learning. This can be loosely viewed as having the model acquire general-purpose knowledge that lets it “understand” text. Such knowledge can range from low-level (e.g. the spelling or meaning of words) to high-level (e.g. that a tuba, a large brass instrument, is too big to fit into most backpacks). In modern machine learning practice, this knowledge is rarely provided explicitly; instead, it is usually learned as part of an auxiliary task. For example, a historically common approach is to use word vectors (Mikolov et al., 2013b,a; Pennington et al., 2014) to map word identities to a continuous representation where, ideally, similar words map to similar vectors. These word vectors are typically learned through an objective that, for example, encourages co-occurring words to be positioned near each other in the continuous space (in word2vec, words that appear close together in text end up with embeddings that are close in vector space) (Mikolov et al., 2013b).
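To make the co-occurrence objective concrete, here is a minimal sketch of a word2vec-style skip-gram model trained with negative sampling in plain NumPy. The toy corpus, embedding size, window, learning rate, and single-negative-sample update are illustrative simplifications of my own, not settings from the paper or from Mikolov et al.; a real implementation would also sample negatives from a frequency-based distribution and avoid sampling the true context word.

```python
import numpy as np

# Toy corpus; a real setup would use a large unlabeled text corpus.
corpus = [["transfer", "learning", "helps", "nlp", "models"],
          ["pretrain", "on", "unlabeled", "text", "then", "finetune"],
          ["nlp", "models", "learn", "from", "unlabeled", "text"]]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

dim, window, lr, epochs = 16, 2, 0.05, 300              # illustrative hyperparameters
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))    # word ("input") vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))   # context ("output") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(epochs):
    for sent in corpus:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i == j:
                    continue
                wi, pos = idx[word], idx[sent[j]]
                neg = int(rng.integers(len(vocab)))  # one random negative sample
                # Push co-occurring pairs together (label 1) and random pairs apart (label 0).
                for ctx, label in ((pos, 1.0), (neg, 0.0)):
                    score = sigmoid(W_in[wi] @ W_out[ctx])
                    grad = score - label
                    g_out = grad * W_in[wi]
                    g_in = grad * W_out[ctx]
                    W_out[ctx] -= lr * g_out
                    W_in[wi] -= lr * g_in

# With a realistic corpus, words that co-occur end up with nearby vectors.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(W_in[idx["unlabeled"]], W_in[idx["text"]]))
```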
Recently, it has become increasingly common to pre-train the entire model on a data-rich task. Ideally, this pre-training causes the model to develop general-purpose abilities and knowledge that can then be transferred to downstream tasks. In applications of transfer learning to computer vision (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski et al., 2014), pre-training is typically done via supervised learning on a large labeled data set like ImageNet (Russakovsky et al., 2015; Deng et al., 2009). In contrast, modern techniques for transfer learning in NLP often pre-train using unsupervised learning on unlabeled data. This approach has recently been used to obtain state-of-the-art results in many of the most common NLP benchmarks (Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019; Liu et al., 2019c; Lan et al., 2019). Beyond its empirical strength, unsupervised pre-training for NLP is particularly attractive because unlabeled text data is available en masse thanks to the Internet—for example, the Common Crawl project produces about 20TB of text data extracted from web pages each month. This is a natural fit for neural networks, which have been shown to exhibit remarkable scalability, i.e. it is often possible to achieve better performance simply by training a larger model on a larger data set (Hestness et al., 2017; Shazeer et al., 2017; Jozefowicz et al., 2016; Mahajan et al., 2018; Radford et al., 2019; Shazeer et al., 2018; Huang et al., 2018b; Keskar et al., 2019a).
Recently, it has become increasingly common to pre-train an entire model on a data-rich task. Ideally, such pre-training lets the model develop general-purpose abilities and knowledge that can then be transferred to downstream tasks. In applications of transfer learning to computer vision (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski et al., 2014), pre-training is typically done via supervised learning on a large labeled data set such as ImageNet (Russakovsky et al., 2015; Deng et al., 2009). In contrast, the transfer learning techniques now used in NLP usually pre-train with unsupervised learning on unlabeled data. This approach has recently produced state-of-the-art results on many of the most common NLP benchmarks (Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019; Liu et al., 2019c; Lan et al., 2019). Beyond its empirical strength, unsupervised pre-training is especially attractive for NLP because unlabeled text is available in huge quantities thanks to the Internet; for example, the Common Crawl project extracts about 20TB of text data from web pages every month. This is a natural fit for neural networks, which have shown remarkable scalability: it is often possible to get better performance simply by training a larger model on a larger data set (Hestness et al., 2017; Shazeer et al., 2017; Jozefowicz et al., 2016; Mahajan et al., 2018; Radford et al., 2019; Shazeer et al., 2018; Huang et al., 2018b; Keskar et al., 2019a).
This synergy has resulted in a great deal of recent work developing transfer learning methodology for NLP, which has produced a wide landscape of pre-training objectives (Howard and Ruder, 2018; Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019), unlabeled data sets (Yang et al., 2019; Liu et al., 2019c; Zellers et al., 2019), benchmarks (Wang et al., 2019b, 2018; Conneau and Kiela, 2018), fine-tuning methods (Howard and Ruder, 2018; Houlsby et al., 2019; Peters et al., 2019), and more. The rapid rate of progress and diversity of techniques in this burgeoning field can make it difficult to compare different algorithms, tease apart the effects of new contributions, and understand the space of existing methods for transfer learning. Motivated by a need for more rigorous understanding, we leverage a unified approach to transfer learning that allows us to systematically study different approaches and push the current limits of the field.
This synergy (a 1+1>2 effect) has led to a great deal of recent work on transfer learning methodology for NLP, producing a wide landscape of pre-training objectives (Howard and Ruder, 2018; Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019), unlabeled data sets (Yang et al., 2019; Liu et al., 2019c; Zellers et al., 2019), benchmarks (Wang et al., 2019b, 2018; Conneau and Kiela, 2018), fine-tuning methods (Howard and Ruder, 2018; Houlsby et al., 2019; Peters et al., 2019), and more. The rapid pace of progress and the diversity of techniques in this fast-moving field can make it difficult to compare different algorithms, tease apart the effects of new contributions, and understand the overall landscape of existing transfer learning methods. Motivated by the need for a more rigorous understanding, we use a unified approach to transfer learning that allows us to systematically study different methods and push the current limits of the field.
The basic idea underlying our work is to treat every text processing problem as a “text-to-text” problem, i.e. taking text as input and producing new text as output. This approach is inspired by previous unifying frameworks for NLP tasks, including casting all text problems as question answering (McCann et al., 2018), language modeling (Radford et al., 2019), or span extraction Keskar et al. (2019b) tasks. Crucially, the text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task we consider. We leverage this flexibility by evaluating performance on a wide variety of English-based NLP problems, including question answering, document summarization, and sentiment classification, to name a few. With this unified approach, we can compare the effectiveness of different transfer learning objectives, unlabeled data sets, and other factors, while exploring the limits of transfer learning for NLP by scaling up models and data sets beyond what has previously been considered.
The basic idea underlying our work is to treat every text processing problem as a “text-to-text” problem, i.e. taking text as input and producing new text as output (everything can be Seq2Seq). This approach is inspired by previous unifying frameworks for NLP tasks, including casting all text problems as question answering (McCann et al., 2018), language modeling (Radford et al., 2019), or span extraction (Keskar et al., 2019b). Crucially, the text-to-text framework lets us apply the same model, objective, training procedure, and decoding process directly to every task we consider. We evaluate this on a wide variety of English-based NLP problems, including question answering, document summarization, and sentiment classification, among others. With this unified approach, we can compare the effectiveness of different transfer learning objectives, unlabeled data sets, and other factors, while exploring the limits of transfer learning for NLP by scaling models and data sets beyond what has previously been considered.
Figure 1: A diagram of our text-to-text framework. Every task we consider—including translation, question answering, and classification—is cast as feeding our model text as input and training it to generate some target text. This allows us to use the same model, loss function, hyperparameters, etc. across our diverse set of tasks. It also provides a standard testbed for the methods included in our empirical survey. “T5” refers to our model, which we dub the “Text-to-Text Transfer Transformer”.
圖1:我們的文本到文本框架圖。我們考慮的每個(gè)任務(wù)(包括翻譯,問題解答和分類)都將文本作為輸入喂入我們的模型,并對(duì)其進(jìn)行訓(xùn)練來生成一些目標(biāo)文本。這使我們可以在各種任務(wù)中使用相同的模型,損失函數(shù),超參數(shù)等。它還為我們調(diào)研中的方法提供了標(biāo)準(zhǔn)的測試方法。“Text-to-Text Transfer Transformer”是指我們的模型,我們將其稱為“T5”。
We emphasize that our goal is not to propose new methods but instead to provide a comprehensive perspective on where the field stands. As such, our work primarily comprises a survey, exploration, and empirical comparison of existing techniques. We also explore the limits of current approaches by scaling up the insights from our systematic study (training models up to 11 billion parameters) to obtain state-of-the-art results in many of the tasks we consider. In order to perform experiments at this scale, we introduce the “Colossal Clean Crawled Corpus” (C4), a data set consisting of hundreds of gigabytes of clean English text scraped from the web. Recognizing that the main utility of transfer learning is the possibility of leveraging pre-trained models in data-scarce settings, we release our code, data sets, and pre-trained models.
We emphasize that our goal is not to propose new methods, but rather to provide a comprehensive perspective on where the field stands. Accordingly, our work mainly consists of a survey, exploration, and empirical comparison of existing techniques. We also explore the limits of current approaches by scaling up the insights from our systematic study (training models with up to 11 billion parameters), obtaining state-of-the-art results on many of the tasks we consider. To run experiments at this scale, we introduce the “Colossal Clean Crawled Corpus” (C4), a data set consisting of hundreds of gigabytes of clean English text scraped from the web. Recognizing that the main utility of transfer learning is enabling the use of pre-trained models in data-scarce settings, we release our code, data sets, and pre-trained models.
The remainder of the paper is structured as follows: In the following section, we discuss our base model and its implementation, our procedure for formulating every text processing problem as a text-to-text task, and the suite of tasks we consider. In Section 3, we present a large set of experiments that explore the field of transfer learning for NLP. At the end of the section (Section 3.7), we combine insights from our systematic study to obtain state-of-the-art results on a wide variety of benchmarks. Finally, we provide a summary of our results and wrap up with a look towards the future in Section 4.
The rest of the paper is structured as follows: in the next section, we discuss our base model and its implementation, our procedure for formulating every text processing problem as a text-to-text task, and the suite of tasks we consider. In Section 3, we present a large set of experiments exploring the field of transfer learning for NLP. At the end of that section (Section 3.7), we combine the insights from our systematic study to obtain state-of-the-art results on a wide variety of benchmarks. Finally, we summarize our results and look toward the future in Section 4.