Paper: GPT-3 "Language Models are Few-Shot Learners": Translation and Commentary
Contents
"GPT-3: Language Models are Few-Shot Learners": Translation and Commentary
Abstract
1 Introduction
2 Approach
2.1 Model and Architectures
2.2 Training Dataset
2.3 Training Process
2.4 Evaluation
3 Results
3.1 Language Modeling, Cloze, and Completion Tasks
3.1.1 Language Modeling
3.1.2 LAMBADA
3.1.3 HellaSwag
3.1.4 StoryCloze
3.2 Closed Book Question Answering
3.3 Translation
3.4 Winograd-Style Tasks
3.5 Common Sense Reasoning
3.6 Reading Comprehension
3.7 SuperGLUE
3.8 NLI
3.9 Synthetic and Qualitative Tasks
3.9.1 Arithmetic
3.9.2 Word Scrambling and Manipulation Tasks
3.9.3 SAT Analogies
3.9.4 News Article Generation
3.9.5 Learning and Using Novel Words
3.9.6 Correcting English Grammar
4 Measuring and Preventing Memorization Of Benchmarks
5 Limitations
6 Broader Impacts
6.1 Misuse of Language Models
6.1.1 Potential Misuse Applications
6.1.2 Threat Actor Analysis
6.1.3 External Incentive Structures
6.2 Fairness, Bias, and Representation
6.2.1 Gender
6.2.2 Race
6.2.3 Religion
6.2.4 Future Bias and Fairness Challenges
6.3 Energy Usage
7 Related Work
8 Conclusion
Acknowledgements
"GPT-3: Language Models are Few-Shot Learners": Translation and Commentary
| Authors | OpenAI: Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei |
| Original paper | https://arxiv.org/abs/2005.14165 |
| GitHub | https://github.com/openai/gpt-3 |
Abstract
| Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general. |
1 Introduction
| Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors [MCCD13, PSM14] and fed to task-specific architectures, then RNNs with multiple layers of representations and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to task-specific architectures), and more recently pre-trained recurrent or transformer language models [VSP+17] have been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18, DCLT18, HR18]. This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, and many others, and has continued to advance based on new architectures and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Removing this limitation would be desirable, for several reasons. |
| First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. There exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task. |
| Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions. For instance [HLW+20] observe that larger models do not necessarily generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm can be poor because the model is overly specific to the training distribution and does not generalize well outside it [YdC+19, MPL19]. Thus, the performance of fine-tuned models on specific benchmarks, even when it is nominally at human-level, may exaggerate actual performance on the underlying task [GSL+18, NK19]. |
| Third, humans do not require large supervised datasets to learn most language tasks – a brief directive in natural language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality. |
| One potential route towards addressing these issues is meta-learning – which in the context of language models means the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work [RWC+19] attempts to do this via what we call “in-context learning”, using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next. |
| While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning – for example [RWC+19] achieves only 4% on Natural Questions, and even its 55 F1 CoQA result is now more than 35 points behind the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of solving language tasks. |
| Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters [DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale [KMH+20]. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale. |
| In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on over two dozen NLP datasets, as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training set. For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional fine-tuning setting, but we leave this to future work. |
| Figure 1.2 illustrates the conditions we study, and shows few-shot learning of a simple task requiring the model to remove extraneous symbols from a word. Model performance improves with the addition of a natural language task description, and with the number of examples in the model’s context, K. Few-shot learning also improves dramatically with model size. Though the results in this case are particularly striking, the general trends with both model size and number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning. |
| Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 on CoQA in the one-shot setting, 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art relative to fine-tuned models operating in the same closed-book setting. |
| GPT-3 also displays one-shot and few-shot proficiency at tasks designed to test rapid adaptation or on-the-fly reasoning, which include unscrambling words, performing arithmetic, and using novel words in a sentence after seeing them defined only once. We also show that in the few-shot setting, GPT-3 can generate synthetic news articles which human evaluators have difficulty distinguishing from human-generated articles. |
| At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. This includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE or QuAC. By presenting a broad characterization of GPT-3’s strengths and weaknesses, including these limitations, we hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed. |
| A heuristic sense of the overall results can be seen in Figure 1.3, which aggregates the various tasks (though it should not be seen as a rigorous or meaningful benchmark in itself). |
| We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models on datasets such as Common Crawl, which can potentially include content from test datasets simply because such content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify its distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these datasets or we note them with an asterisk, depending on the severity. |
| In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero-, one- and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners. |
| Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and broader societal impacts, and attempt a preliminary analysis of GPT-3’s characteristics in this regard. |
| The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one- and few-shot settings. Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3. Section 6 discusses broader impacts. Section 7 reviews related work and Section 8 concludes. |
2 Approach
| Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC+19], with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to [RWC+19], but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration): |
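| [Figure 2.1: the four settings on this spectrum (fine-tuning, few-shot, one-shot, zero-shot); the image is not reproduced here.] |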
| Figure 2.1 shows the four methods using the example of translating English to French. In this paper we focus on zero-shot, one-shot and few-shot, with the aim of comparing them not as competing alternatives, but as different problem settings which offer a varying trade-off between performance on specific benchmarks and sample efficiency. We especially highlight the few-shot results as many of them are only slightly behind state-of-the-art fine-tuned models. Ultimately, however, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, and are important targets for future work. Sections 2.1-2.3 below give details on our models, training data, and training process respectively. Section 2.4 discusses the details of how we do few-shot, one-shot, and zero-shot evaluations. |
2.1 Model and Architectures
| We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH+20] suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks. |
| Table 2.1 shows the sizes and architectures of our 8 models. Here $n_{\text{params}}$ is the total number of trainable parameters, $n_{\text{layers}}$ is the total number of layers, $d_{\text{model}}$ is the number of units in each bottleneck layer (we always have the feedforward layer four times the size of the bottleneck layer, $d_{\text{ff}} = 4 \cdot d_{\text{model}}$), and $d_{\text{head}}$ is the dimension of each attention head. All models use a context window of $n_{\text{ctx}} = 2048$ tokens. We partition the model across GPUs along both the depth and width dimension in order to minimize data-transfer between nodes. The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPUs. Previous work [KMH+20] suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range. |
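As a rough sanity check on how these architectural quantities relate to model size, the sketch below estimates the non-embedding parameter count of a decoder-only transformer. It assumes the common approximation of four d_model x d_model attention matrices plus a feedforward block with d_ff = 4 · d_model per layer, and ignores embeddings, biases and layer norms. The function name is hypothetical, and the values plugged in for the largest model (96 layers, d_model = 12288) are taken from Table 2.1 of the paper, which is not reproduced in this post.

```python
def approx_transformer_params(n_layers: int, d_model: int, ff_multiplier: int = 4) -> int:
    """Rough non-embedding parameter count for a decoder-only transformer.

    Assumption: each layer holds four d_model x d_model attention matrices
    (query, key, value, output) and two feedforward matrices of shape
    d_model x d_ff and d_ff x d_model, with d_ff = ff_multiplier * d_model.
    Embeddings, biases and layer-norm parameters are ignored.
    """
    d_ff = ff_multiplier * d_model
    attention = 4 * d_model * d_model
    feedforward = 2 * d_model * d_ff
    return n_layers * (attention + feedforward)


if __name__ == "__main__":
    # Layer count and width for the largest model, as listed in the paper's Table 2.1.
    estimate = approx_transformer_params(n_layers=96, d_model=12288)
    print(f"{estimate / 1e9:.1f}B parameters")  # prints "173.9B parameters"
```

The estimate lands close to the quoted 175 billion; the remainder is plausibly accounted for by the terms this approximation leaves out.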
2.2 Training Dataset
| Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset [RSR+19] constituting nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity. |
| Details of the first two points (processing of Common Crawl) are described in Appendix A. For the third, we added several curated high-quality datasets, including an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time, and first described in [KMH+20], two internet-based books corpora (Books1 and Books2) and English-language Wikipedia. |
| Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data. |
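To make the sampling argument concrete, the following sketch shows how a fixed mixture weight translates into the number of passes made over each dataset: a small dataset with a relatively large weight is seen several times, while a very large one is seen less than once. The function name, dataset names, sizes, weights and token budget are all hypothetical illustration values, not the figures from Table 2.2.

```python
def epochs_per_dataset(sizes_in_tokens, mixture_weights, total_training_tokens):
    """Return how many times each dataset is traversed during training.

    Assumption: every training token is drawn from dataset `name` with
    probability proportional to mixture_weights[name], independent of size.
    """
    total_weight = sum(mixture_weights.values())
    epochs = {}
    for name, size in sizes_in_tokens.items():
        share = mixture_weights[name] / total_weight
        epochs[name] = share * total_training_tokens / size
    return epochs


if __name__ == "__main__":
    # Hypothetical corpus sizes (in tokens) and mixture weights.
    sizes = {"large_web_crawl": 400e9, "small_curated": 10e9}
    weights = {"large_web_crawl": 0.75, "small_curated": 0.25}
    for name, value in epochs_per_dataset(sizes, weights, 100e9).items():
        print(f"{name}: {value:.2f} epochs")
```

With these made-up numbers the large crawl is traversed about 0.19 times and the curated set 2.5 times, which mirrors the qualitative pattern described above.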
| A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible to retrain the model. In Section 4 we characterize the impact of the remaining overlaps, and in future work we will more aggressively remove data contamination. |
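One simple way to search for train-test overlap is to flag any training document that shares a word n-gram with a benchmark example. The sketch below illustrates that idea only; it is not the paper's actual filtering procedure (which is described in Section 4 and the appendices). The function names, the whitespace tokenization and the choice of n are assumptions made for illustration.

```python
def ngrams(text, n):
    """Set of word n-grams in `text`, using lowercased whitespace tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_examples, n):
    """Collect every n-gram that occurs in any benchmark example."""
    index = set()
    for example in benchmark_examples:
        index |= ngrams(example, n)
    return index


def is_contaminated(document, benchmark_index, n):
    """True if the document shares at least one n-gram with the benchmark."""
    return not ngrams(document, n).isdisjoint(benchmark_index)


if __name__ == "__main__":
    n = 3  # hypothetical; a realistic filter would use a much longer n-gram
    index = build_benchmark_index(["what is the capital of France"], n)
    print(is_contaminated("quiz: what is the capital of Peru", index, n))
    print(is_contaminated("a page about gardening tools", index, n))
```

A short n produces many false positives on common phrases, while a long n misses paraphrased leaks, so the threshold is a trade-off rather than a fixed constant.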
2.3 Training Process
| As found in [KMH+20, MKAT18], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table 2.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft. Details of the training process and hyperparameter settings are described in Appendix B. |
2.4 Evaluation
| For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that task’s training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and StoryCloze there is no supervised training set available so we draw conditioning examples from the development set and evaluate on the test set. For Winograd (the original, not SuperGLUE version) there is only one dataset, so we draw conditioning examples directly from it. |
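The conditioning step described above can be sketched as a small prompt builder: draw K demonstrations at random, join them with a newline delimiter, and append the context to be completed. This is a minimal illustration under stated assumptions, not the authors' evaluation code; the function name, the "context completion" layout of each demonstration and the optional instruction argument are hypothetical.

```python
import random


def build_few_shot_prompt(train_examples, query_context, k,
                          delimiter="\n\n", seed=None, instruction=None):
    """Assemble a K-shot prompt from (context, completion) pairs.

    Assumptions: demonstrations are rendered as "context completion", and
    k does not exceed the number of available training examples.
    """
    rng = random.Random(seed)
    demonstrations = rng.sample(train_examples, k)  # k = 0 gives zero-shot
    parts = []
    if instruction:
        parts.append(instruction)
    parts.extend(f"{context} {completion}" for context, completion in demonstrations)
    parts.append(query_context)  # the model is asked to continue this line
    return delimiter.join(parts)


if __name__ == "__main__":
    examples = [("2 + 2 =", "4"), ("3 + 5 =", "8"), ("1 + 6 =", "7")]
    print(build_few_shot_prompt(examples, "4 + 4 =", k=2, delimiter="\n", seed=0))
```

Setting k=1 gives the one-shot setting, and k=0 with an instruction gives the zero-shot setting, so the same helper covers all three conditions.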
| K can be any value from 0 to the maximum amount allowed by the model’s context window, which is $n_{\text{ctx}} = 2048$ for all models and typically fits 10 to 100 examples. Larger values of K are usually but not always better, so when a separate development and test set are available, we experiment with a few values of K on the development set and then run the best value on the test set. For some tasks (see Appendix G) we also use a natural language prompt in addition to (or for K = 0, instead of) demonstrations. |
| On tasks that involve choosing one correct completion from several options (multiple choice), we provide K examples of context plus correct completion, followed by one example of context only, and compare the LM likelihood of each completion. For most tasks we compare the per-token likelihood (to normalize for length); however, on a small number of datasets (ARC, OpenBookQA, and RACE) we gain additional benefit, as measured on the development set, by normalizing by the unconditional probability of each completion, computing $\frac{P(\text{completion} \mid \text{context})}{P(\text{completion} \mid \text{answer\_context})}$, where answer_context is the string "Answer: " or "A: " and is used to prompt that the completion should be an answer but is otherwise generic. |
| On tasks that involve binary classification, we give the options more semantically meaningful names (e.g. “True” or “False” rather than 0 or 1) and then treat the task like multiple choice; we also sometimes frame the task similar to what is done by [RSR+19] (see Appendix G for details). |
| On tasks with free-form completion, we use beam search with the same parameters as [RSR+19]: a beam width of 4 and a length penalty of α = 0.6. We score the model using F1 similarity score, BLEU, or exact match, depending on what is standard for the dataset at hand. |
| Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-, and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on the development set. We do submit to the test server on a small number of datasets (SuperGLUE, TriviaQA, PIQA) where we were able to make submission work, and we submit only the 175B few-shot results, and report development set results for everything else. |
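The two normalizations described above for multiple-choice tasks can be illustrated with a minimal sketch. The per-token log-probabilities below are made-up numbers, and the helper names (`per_token_score`, `unconditional_score`, `pick_answer`) are hypothetical; the sketch only shows how the two scoring rules differ, not how the authors implemented them.

```python
def per_token_score(completion_logprobs):
    """Mean log-probability per token, which normalizes for completion length."""
    return sum(completion_logprobs) / len(completion_logprobs)


def unconditional_score(cond_logprobs, uncond_logprobs):
    """log of P(completion | context) / P(completion | answer_context)."""
    return sum(cond_logprobs) - sum(uncond_logprobs)


def pick_answer(scores):
    """Index of the highest-scoring candidate completion."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical per-token log-probabilities for two candidate completions.
cond = [[-0.5, -0.7], [-0.4, -0.6, -0.9, -1.1]]    # conditioned on the full context
uncond = [[-1.0, -1.0], [-2.0, -2.0, -2.0, -2.0]]  # conditioned only on "Answer: "

by_length = pick_answer([per_token_score(c) for c in cond])
by_ratio = pick_answer([unconditional_score(c, u) for c, u in zip(cond, uncond)])
print(by_length, by_ratio)
```

With these invented numbers the two rules pick different candidates, which is why the choice of normalization is tuned per dataset on the development set.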
3 Results
| In Figure 3.1 we display training curves for the 8 models described in Section 2. For this graph we also include 6 additional extra-small models with as few as 100,000 parameters. As observed in [KMH+20], language modeling performance follows a power-law when making efficient use of training compute. After extending this trend by two more orders of magnitude, we observe only a slight (if any) departure from the power-law. One might worry that these improvements in cross-entropy loss come only from modeling spurious details of our training corpus. However, we will see in the following sections that improvements in cross-entropy loss lead to consistent performance gains across a broad spectrum of natural language tasks. |
| Below, we evaluate the 8 models described in Section 2 (the 175 billion parameter GPT-3 and 7 smaller models) on a wide range of datasets. We group the datasets into 9 categories representing roughly similar tasks. |
| In Section 3.1 we evaluate on traditional language modeling tasks and tasks that are similar to language modeling, such as Cloze tasks and sentence/paragraph completion tasks. In Section 3.2 we evaluate on “closed book” question answering tasks: tasks which require using the information stored in the model’s parameters to answer general knowledge questions. In Section 3.3 we evaluate the model’s ability to translate between languages (especially one-shot and few-shot). In Section 3.4 we evaluate the model’s performance on Winograd Schema-like tasks. In Section 3.5 we evaluate on datasets that involve commonsense reasoning or question answering. In Section 3.6 we evaluate on reading comprehension tasks, in Section 3.7 we evaluate on the SuperGLUE benchmark suite, and in Section 3.8 we briefly explore NLI. Finally, in Section 3.9, we invent some additional tasks designed especially to probe in-context learning abilities – these tasks focus on on-the-fly reasoning, adaptation skills, or open-ended text synthesis. We evaluate all tasks in the few-shot, one-shot, and zero-shot settings. |
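The power-law relationship between training compute and loss mentioned at the start of this section is a straight line in log-log space, so it can be fitted by ordinary least squares on the logarithms. The sketch below is only an illustration: the data points are synthetic, the function name `fit_power_law` is hypothetical, and the simple form loss = a · compute^b (with no irreducible-loss term) is an assumption.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) + b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic points generated from loss = 3.0 * compute ** -0.05 (made-up values).
compute = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
loss = [3.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(round(a, 4), round(b, 4))
```

A departure from the power-law would show up as residuals that grow systematically at the largest compute values rather than scattering around zero.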
3.1 Language Modeling, Cloze, and Completion Tasks
| In this section we test GPT-3’s performance on the traditional task of language modeling, as well as related tasks that involve predicting a single word of interest, completing a sentence or paragraph, or choosing between possible completions of a piece of text. |
3.1.1 Language Modeling
| We calculate zero-shot perplexity on the Penn Tree Bank (PTB) [MKM+94] dataset measured in [RWC+19]. We omit the 4 Wikipedia-related tasks in that work because they are entirely contained in our training data, and we also omit the one-billion word benchmark due to a high fraction of the dataset being contained in our training set. PTB escapes these issues due to predating the modern internet. Our largest model sets a new SOTA on PTB by a substantial margin of 15 points, achieving a perplexity of 20.50. Note that since PTB is a traditional language modeling dataset it does not have a clear separation of examples to define one-shot or few-shot evaluation around, so we measure only zero-shot. |
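For reference, perplexity is the exponential of the average negative log-likelihood per predicted unit. The sketch below assumes natural-log probabilities and averages per token; whether the average is taken per token or per word depends on the evaluation protocol, which this section does not spell out. The input values are invented.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical example: every token is assigned probability 0.05.
logprobs = [math.log(0.05)] * 4
ppl = perplexity(logprobs)
print(ppl)
```

A model that assigns probability 0.05 to every token is as uncertain as a uniform choice among 20 options, which is the intuition behind reading perplexity as an effective branching factor.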
3.1.2 LAMBADA
| The LAMBADA dataset [PKL+16] tests the modeling of long-range dependencies in text – the model is asked to predict the last word of sentences which require reading a paragraph of context. It has recently been suggested that the continued scaling of language models is yielding diminishing returns on this difficult benchmark. [BHT+20] reflect on the small 1.5% improvement achieved by a doubling of model size between two recent state of the art results ([SPP+19] and [Tur20]) and argue that “continuing to expand hardware and data sizes by orders of magnitude is not the path forward”. We find that path is still promising and in a zero-shot setting GPT-3 achieves 76% on LAMBADA, a gain of 8% over the previous state of the art. |
| LAMBADA is also a demonstration of the flexibility of few-shot learning as it provides a way to address a problem that classically occurs with this dataset. Although the completion in LAMBADA is always the last word in a sentence, a standard language model has no way of knowing this detail. It thus assigns probability not only to the correct ending but also to other valid continuations of the paragraph. This problem has been partially addressed in the past with stop-word filters [RWC+19] (which ban “continuation” words). The few-shot setting instead allows us to “frame” the task as a cloze-test and allows the language model to infer from examples that a completion of exactly one word is desired. We use the following fill-in-the-blank format: |
| When presented with examples formatted this way, GPT-3 achieves 86.4% accuracy in the few-shot setting, an increase of over 18% from the previous state-of-the-art. We observe that few-shot performance improves strongly with model size. While this setting decreases the performance of the smallest model by almost 20%, for GPT-3 it improves accuracy by 10%. Finally, the fill-in-blank method is not effective one-shot, where it always performs worse than the zero-shot setting. Perhaps this is because all models still require several examples to recognize the pattern. |
| One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data – however analysis performed in Section 4 suggests negligible impact on performance. |
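The cloze-style framing discussed in this subsection can be sketched as a small prompt builder: K demonstrations are sampled at random, each ends with a blank followed by its answer, and the final query is left unanswered. Everything concrete here is an assumption made for illustration, including the function name `build_cloze_prompt`, the `____. ->` layout, the blank-line delimiter, and the example sentences.

```python
import random


def build_cloze_prompt(demos, query, k, seed=0, delimiter="\n\n"):
    """Sample k (passage, last_word) demonstrations and append the unanswered query."""
    rng = random.Random(seed)
    chosen = rng.sample(demos, k)
    blocks = [f"{passage} ____. -> {word}" for passage, word in chosen]
    blocks.append(f"{query} ____. ->")
    return delimiter.join(blocks)


# Invented demonstrations; a real run would draw them from the development set.
demos = [
    ("Alice was friends with Bob. Alice went to visit her friend", "Bob"),
    ("The chef tasted the soup and reached for the", "salt"),
    ("After the rain stopped, the children ran outside to", "play"),
]

prompt = build_cloze_prompt(demos, "George bought a ball, a glove, and a", k=2)
print(prompt)
```

Because every demonstration is completed with exactly one word, a model conditioned on such a prompt can infer that a single-word completion is expected, without any stop-word filtering.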
3.1.3 HellaSwag
| The HellaSwag dataset [ZHB+19] involves picking the best ending to a story or set of instructions. The examples were adversarially mined to be difficult for language models while remaining easy for humans (who achieve 95.6% accuracy). GPT-3 achieves 78.1% accuracy in the one-shot setting and 79.3% accuracy in the few-shot setting, outperforming the 75.4% accuracy of a fine-tuned 1.5B parameter language model [ZHR+19] but still a fair amount lower than the overall SOTA of 85.6% achieved by the fine-tuned multi-task model ALUM. |
3.1.4 StoryCloze
| We next evaluate GPT-3 on the StoryCloze 2016 dataset [MCH+16], which involves selecting the correct ending sentence for five-sentence long stories. Here GPT-3 achieves 83.2% in the zero-shot setting and 87.7% in the few-shot setting (with K = 70). This is still 4.1% lower than the fine-tuned SOTA using a BERT based model [LDL19] but improves over previous zero-shot results by roughly 10%. |
3.2 Closed Book Question Answering
| In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense amount of possible queries, this task has normally been approached by using an information retrieval system to find relevant text in combination with a model which learns to generate an answer given the question and the retrieved text. Since this setting allows a system to search for and condition on text which potentially contains the answer it is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well directly answering the questions without conditioning on auxiliary information. They denote this more restrictive evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions [KPR+19], WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represents an even stricter setting than previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself is also not permitted. |
| The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by 14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents [LPP+20]. GPT-3’s few-shot result further improves performance another 3.2% beyond this. |
| On WebQuestions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5% in the few-shot setting. This compares to 37.4% for fine-tuned T5-11B, and 44.7% for fine-tuned T5-11B+SSM, which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQs shows a much larger gain from zero-shot to few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this distribution, recovering strong performance in the few-shot setting. |
| On Natural Questions (NQs) GPT-3 achieves 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in the few-shot setting, compared to 36.6% for fine-tuned T5-11B+SSM. Similar to WebQs, the large gain from zero-shot to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to TriviaQA and WebQs. In particular, the questions in NQs tend towards very fine-grained knowledge on Wikipedia specifically which could be testing the limits of GPT-3’s capacity and broad pretraining distribution. |
| Overall, on one of the three datasets GPT-3’s one-shot matches the open-domain fine-tuning SOTA. On the other two datasets it approaches the performance of the closed-book SOTA despite not using fine-tuning. On all 3 datasets, we find that performance scales very smoothly with model size (Figure 3.3 and Appendix H Figure H.7), possibly reflecting the idea that model capacity translates directly to more ‘knowledge’ absorbed in the parameters of the model. |
3.3 Translation
| For GPT-2 a filter was used on a multilingual collection of documents to produce an English only dataset due to capacity concerns. Even with this filtering GPT-2 showed some evidence of multilingual capability and performed non-trivially when translating between French and English despite only training on 10 megabytes of remaining French text. Since we increase the capacity by over two orders of magnitude from GPT-2 to GPT-3, we also expand the scope of the training dataset to include more representation of other languages, though this remains an area for further improvement. As discussed in 2.2 the majority of our data is derived from raw Common Crawl with only quality-based filtering. Although GPT-3’s training data is still primarily English (93% by word count), it also includes 7% of text in other languages. These languages are documented in the supplemental material. In order to better understand translation capability, we also expand our analysis to include two additional commonly studied languages, German and Romanian. |
| Existing unsupervised machine translation approaches often combine pretraining on a pair of monolingual datasets with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a blend of training data that mixes many languages together in a natural way, combining them on a word, sentence, and document level. GPT-3 also uses a single training objective which is not customized or designed for any task in particular. However, our one / few-shot settings aren’t strictly comparable to prior unsupervised work since they make use of a small amount of paired examples (1 or 64). This corresponds to up to a page or two of in-context training data. |
| Results are shown in Table 3.4. Zero-shot GPT-3, which only receives a natural language description of the task, still underperforms recent unsupervised NMT results. However, providing only a single example demonstration for each translation task improves performance by over 7 BLEU and nears competitive performance with prior work. GPT-3 in the full few-shot setting further improves another 4 BLEU resulting in similar average performance to prior unsupervised NMT work. GPT-3 has a noticeable skew in its performance depending on language direction. For the three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction. Performance on En-Ro is a noticeable outlier at over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2 which was developed for an almost entirely English training dataset. For both Fr-En and De-En, few-shot GPT-3 outperforms the best supervised result we could find but due to our unfamiliarity with the literature and the appearance that these are un-competitive benchmarks we do not suspect those results represent true state of the art. For Ro-En, few-shot GPT-3 performs within 0.5 BLEU of the overall SOTA which is achieved by a combination of unsupervised pretraining, supervised finetuning on 608K labeled examples, and backtranslation [LHCG19b]. |
| Finally, across all language pairs and across all three settings (zero-, one-, and few-shot), there is a smooth trend of improvement with model capacity. This is shown in Figure 3.4 in the case of few-shot results, and scaling for all three settings is shown in Appendix H. |
3.4 Winograd-Style Tasks
| The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently fine-tuned language models have achieved near-human performance on the original Winograd dataset, but more difficult versions such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test GPT-3’s performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot setting. |
| On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method described in [RWC+19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which is presented as binary classification and requires entity extraction to convert to the form described in this section. On Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human performance. We note that contamination analysis found some Winograd schemas in the training data but this appears to have only a small effect on results (see Section 4). |
| On the more difficult Winogrande dataset, we do find gains to in-context learning: GPT-3 achieves 70.2% in the zero-shot setting, 73.2% in the one-shot setting, and 77.7% in the few-shot setting. For comparison a fine-tuned RoBERTa model achieves 79%, state-of-the-art is 84.6% achieved with a fine-tuned high capacity model (T5), and human performance on the task as reported by [SBBC19] is 94.0%. |
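One common reading of the “partial evaluation” mentioned above is: substitute each candidate referent for the pronoun, then compare how likely the model finds the rest of the sentence given each substitution. The sketch below follows that reading as an assumption rather than as the exact procedure of [RWC+19]. The function names are hypothetical, and `toy_logprob` is a hand-written stand-in for a real language model.

```python
def partial_score(logprob_fn, prefix, candidate, suffix):
    """Log-probability of the text after the pronoun, given the prefix
    with the candidate substituted for the pronoun."""
    return logprob_fn(context=f"{prefix}{candidate}", continuation=suffix)


def resolve_pronoun(logprob_fn, prefix, candidates, suffix):
    """Return the candidate whose substitution makes the suffix most likely."""
    return max(candidates, key=lambda c: partial_score(logprob_fn, prefix, c, suffix))


def toy_logprob(context, continuation):
    # Hypothetical scorer standing in for a language model: it prefers
    # "the trophy" as the thing that is "too big".
    if context.endswith("the trophy") and "too big" in continuation:
        return -1.0
    return -5.0


prefix = "The trophy doesn't fit in the suitcase because "
suffix = " is too big."
answer = resolve_pronoun(toy_logprob, prefix, ["the trophy", "the suitcase"], suffix)
print(answer)
```

Scoring only the suffix keeps the comparison from being dominated by how probable each candidate phrase is on its own.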
3.5 Common Sense Reasoning
| Next we consider three datasets which attempt to capture physical or scientific reasoning, as distinct from sentence completion, reading comprehension, or broad knowledge question answering. The first, PhysicalQA (PIQA) [BZB+19], asks common sense questions about how the physical world works and is intended as a probe of grounded understanding of the world. GPT-3 achieves 81.0% accuracy zero-shot, 80.5% accuracy one-shot, and 82.8% accuracy few-shot (the last measured on PIQA’s test server). This compares favorably to the 79.4% accuracy prior state-of-the-art of a fine-tuned RoBERTa. PIQA shows relatively shallow scaling with model size and is still over 10% worse than human performance, but GPT-3’s few-shot and even zero-shot result outperform the current state-of-the-art. Our analysis flagged PIQA for a potential data contamination issue (despite hidden test labels), and we therefore conservatively mark the result with an asterisk. See Section 4 for details. |
| ARC [CCE+18] is a dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the “Challenge” version of the dataset, which has been filtered to questions that simple statistical or information retrieval methods are unable to answer correctly, GPT-3 achieves 51.4% accuracy in the zero-shot setting, 53.2% in the one-shot setting, and 51.5% in the few-shot setting. This approaches the performance of a fine-tuned RoBERTa baseline (55.9%) from UnifiedQA [KKS+20]. On the “Easy” version of the dataset (questions which either of the mentioned baseline approaches answered correctly), GPT-3 achieves 68.8%, 71.2%, and 70.1%, which slightly exceeds a fine-tuned RoBERTa baseline from [KKS+20]. However, both of these results are still much worse than the overall SOTAs achieved by UnifiedQA, which exceeds GPT-3’s few-shot results by 27% on the challenge set and 22% on the easy set. |
| On OpenBookQA [MCKS18], GPT-3 improves significantly from zero to few shot settings but is still over 20 points short of the overall SOTA. GPT-3’s few-shot performance is similar to a fine-tuned BERT Large baseline on the leaderboard. |
| Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks, with only small and inconsistent gains observed in the one- and few-shot learning settings for both PIQA and ARC, but a significant improvement is observed on OpenBookQA. GPT-3 sets SOTA on the new PIQA dataset in all evaluation settings. |
3.6 Reading Comprehension
| Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset. |
| GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19], a free-form conversational dataset, and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18], a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of middle school and high school English examinations, GPT-3 performs relatively weakly and is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA. |
3.7 SuperGLUE
| In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark [WPN+19] [CLC+19] [DMST19] [RBG11] [KCR+18] [ZLL+18] [DGM06] [BHDD+06] [GMDD07] [BDD+09] [PCC18] [PHR+18]. GPT-3’s test-set performance on the SuperGLUE dataset is shown in Table 3.8. In the few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated. |
| We observe a wide range in GPT-3’s performance across tasks. On COPA and ReCoRD GPT-3 achieves near-SOTA performance in the one-shot and few-shot settings, with COPA falling only a couple of points short and achieving second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC, performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the original Winograd dataset as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable, roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting. |
| WiC is a notable weak spot with few-shot performance at 49.4% (at random chance). We tried a number of different phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer in the next section (which discusses the ANLI benchmark): GPT-3 appears to be weak in the few-shot or one-shot setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another. This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these weaknesses, GPT-3 still outperforms a fine-tuned BERT-Large on four of eight tasks, and on two tasks GPT-3 is close to the state-of-the-art held by a fine-tuned 11 billion parameter model. Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and with number of examples in the context, showing increasing benefits from in-context learning (Figure 3.8). We scale K up to 32 examples per task, after which point additional examples will not reliably fit into our context. When sweeping over values of K, we find that GPT-3 requires less than eight total examples per task to outperform a fine-tuned BERT-Large on overall SuperGLUE score. |
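The two context-sampling policies described in this section (a fresh draw of K examples per problem, versus one shared draw for WSC and MultiRC) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the seed handling, and the toy data are hypothetical and are not taken from the paper's evaluation code.

```python
import random

def build_context(train_set, k, rng):
    """Draw k distinct training examples to use as in-context demonstrations."""
    return rng.sample(train_set, k)

def contexts_for_task(train_set, problems, k=32, share_context=False, seed=0):
    """Pair every evaluation problem with a list of k demonstrations.

    share_context=False: resample a new context for each problem.
    share_context=True:  draw one context and reuse it for all problems
                         (the policy the text describes for WSC and MultiRC).
    """
    rng = random.Random(seed)
    shared = build_context(train_set, k, rng) if share_context else None
    pairs = []
    for problem in problems:
        context = shared if share_context else build_context(train_set, k, rng)
        pairs.append((context, problem))
    return pairs

# Hypothetical toy data, for illustration only.
train = [(f"premise {i}", i % 2) for i in range(100)]
problems = ["p0", "p1", "p2"]
fresh = contexts_for_task(train, problems, k=32)
fixed = contexts_for_task(train, problems, k=32, share_context=True)
```

Sampling without replacement keeps the K demonstrations distinct. Whether the original evaluation sampled with or without replacement is not stated in this section.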
3.8 NLI
| Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences. In practice, this task is usually structured as a two- or three-class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral). SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting GPT-3 performs similarly to a single-task fine-tuned BERT Large. We also evaluate on the recently introduced Adversarial Natural Language Inference (ANLI) dataset [NWD+19]. ANLI is a difficult dataset employing a series of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (~33%), whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult task for language models and they are only just beginning to show signs of progress. |
3.9 Synthetic and Qualitative Tasks
| One way to probe GPT-3’s range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which require it to perform simple on-the-fly computational reasoning, recognize a novel pattern that is unlikely to have occurred in training, or adapt quickly to an unusual task. We devise several tasks to test this class of abilities. First, we test GPT-3’s ability to perform arithmetic. Second, we create several tasks that involve rearranging or unscrambling the letters in a word, tasks which are unlikely to have been exactly seen during training. Third, we test GPT-3’s ability to solve SAT-style analogy problems few-shot. Finally, we test GPT-3 on several qualitative tasks, including using new words in a sentence, correcting English grammar, and news article generation. We will release the synthetic datasets with the hope of stimulating further study of test-time behavior of language models. |
3.9.1 Arithmetic
| To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language. The tests cover addition and subtraction with two to five digits, two-digit multiplication, and single-digit combined operations. |
| In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random instances of the task and evaluate all models on those instances. First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2-digit addition, 98.9% at 2-digit subtraction, 80.2% at 3-digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the number of digits increases, but GPT-3 still achieves 25-26% accuracy on four-digit operations and 9-10% accuracy on five-digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves 29.2% accuracy at 2-digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves 21.3% accuracy at single-digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness beyond just single operations. |
| As Figure 3.10 makes clear, small models do poorly on all of these tasks: even the 13 billion parameter model (the second largest after the 175 billion full GPT-3) can solve 2-digit addition and subtraction only half the time, and all other operations less than 10% of the time. One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation to the task (or at the very least recognition of the task) is important to performing these computations correctly. Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and model capacity scaling for all three settings is shown in Appendix H. To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than memorizing a table. Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even zero-shot settings. |
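The memorization spot-check and the exact-match scoring described above can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the uniform sampling of operands, plain substring search, and the toy corpus are all assumptions. In particular, substring search would also match "123 + 456 =" inside "9123 + 456 =", which a real check would need to guard against.

```python
import random

def make_problems(n, digits, rng):
    """Sample n operand pairs, each operand having at most `digits` digits (assumed uniform)."""
    hi = 10 ** digits - 1
    return [(rng.randint(0, hi), rng.randint(0, hi)) for _ in range(n)]

def surface_forms(a, b):
    """The two textual forms of an addition problem that the spot-check searches for."""
    return [f"{a} + {b} =", f"{a} plus {b}"]

def count_matches(problems, corpus_text):
    """Count problems for which at least one surface form occurs in the corpus text."""
    return sum(
        any(form in corpus_text for form in surface_forms(a, b))
        for a, b in problems
    )

def exact_match(prediction, a, b):
    """A prediction counts only if it equals the true sum exactly (ignoring outer whitespace)."""
    return prediction.strip() == str(a + b)

# Hypothetical toy corpus and test problems, for illustration only.
corpus = "As a kid I learned that 123 + 456 = 579, and that 7 plus 8 is 15."
toy_problems = [(123, 456), (7, 8), (250, 250)]
matches = count_matches(toy_problems, corpus)

# A test set of the size used in the text: 2,000 three-digit problems.
test_set = make_problems(2000, digits=3, rng=random.Random(0))
```

Dividing the match count by the number of problems gives the contamination rate reported in the text (for example, 2 matches out of 2,000 subtraction problems is 0.1%).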
3.9.2 Word Scrambling and Manipulation Tasks
| To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of 5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are cycling the letters of a word (CL), two anagram tasks of differing difficulty (A1 and A2), random insertion of characters into a word, and reversing the letters of a word. |
| For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11. Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word. In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data (although we cannot confirm this with certainty). |
| We can further quantify performance by plotting “in-context learning curves”, which show task performance as a function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information, including both task examples and natural language task descriptions. Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding operates on significant fractions of a word (on average ~0.7 words per token), so from the LM’s perspective succeeding at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also, CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word), requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require non-trivial pattern-matching and computation. |
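To make the five character-manipulation tasks concrete, here is a rough sketch of how such distorted words could be produced. The exact corruption rules are assumptions inferred from the task names in the text (for instance, inserting one symbol between every pair of letters, and fixing the first and last two letters for the easier anagram task); the paper's own generator may differ.

```python
import random

def cycle_letters(word, rng):
    """CL: rotate the word by a random non-zero offset."""
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]

def anagram(word, keep, rng):
    """Shuffle the interior letters, holding the first and last `keep` letters fixed.

    keep=1 is the harder task (only first and last letters fixed);
    keep=2 is assumed here to be the easier one.
    """
    middle = list(word[keep:len(word) - keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[len(word) - keep:]

def random_insertion(word, rng, symbols=".,!?;: "):
    """Insert one random symbol between each pair of adjacent letters."""
    out = [word[0]]
    for ch in word[1:]:
        out.append(rng.choice(symbols))
        out.append(ch)
    return "".join(out)

def reverse_word(word):
    """Reverse the order of the letters."""
    return word[::-1]

# Illustrative usage with a fixed seed; the word is an arbitrary example.
rng = random.Random(0)
word = "inevitably"
cycled = cycle_letters(word, rng)
hard = anagram(word, 1, rng)
easy = anagram(word, 2, rng)
noisy = random_insertion(word, rng)
```

The model's job is the inverse mapping, from the distorted string back to the original word. Since several words can share the same multiset of letters, inverting the anagram tasks can require searching over candidate words rather than applying a fixed rule.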
3.9.3 SAT Analogies
| To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with scale, with the full 175 billion model improving by over 10% compared to the 13 billion parameter model. |
3.9.4 News Article Generation
| Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by conditional sampling from the model given a human-written prompt consisting of a plausible first sentence for a news story [RWC+19]. Relative to [RWC+19], the dataset used to train GPT-3 is much less weighted towards news articles, so trying to generate news articles via raw unconditional samples is less effective; for example, GPT-3 often interprets the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably generate short articles in the “news” genre. To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality. |
| In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human written article or the article generated by the model. Participants were asked to select whether the article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”. The articles we selected were not in the models’ training data and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were pre-trained with the same context size, and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a 160M parameter model with no context and increased output randomness. |
| Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was ~86%, where 50% is chance level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at ~52% (see Table 3.11). Human abilities to detect model generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E). Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15. Much of the text is, as indicated by the evaluations, difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated since, unlike human authors, the models have no access to the specific facts that the article titles refer to or to when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed. Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model generated text than human evaluators. Automatic detection of these models may be a promising area of future research. Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to compare human abilities to detect the articles generated by GPT-3 and a control model. We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was ~88%, while mean human accuracy at detecting the longer articles produced by GPT-3 175B was still barely above chance at ~52% (see Table 3.12). This indicates that, for news articles around 500 words long, GPT-3 continues to produce articles that humans find difficult to distinguish from human written ones. |
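The accuracy measure used above (correct assignments divided by non-neutral assignments, computed per participant and then averaged) can be sketched as follows. The data layout, the function names, and the choice to skip participants who answered “I don’t know” every time are assumptions made for illustration; the section does not say how such participants were handled.

```python
from statistics import mean

HUMAN = {"very likely written by a human", "more likely written by a human"}
MACHINE = {"more likely written by a machine", "very likely written by a machine"}
NEUTRAL = "I don't know"

def participant_accuracy(responses):
    """responses: list of (answer, is_machine_generated) pairs for one participant.

    Returns correct / non-neutral, or None when every answer was neutral.
    """
    correct = 0
    non_neutral = 0
    for answer, is_machine in responses:
        if answer == NEUTRAL:
            continue  # neutral answers are excluded from the denominator
        if answer not in HUMAN and answer not in MACHINE:
            raise ValueError(f"unknown answer: {answer!r}")
        non_neutral += 1
        if (answer in MACHINE) == is_machine:
            correct += 1
    return correct / non_neutral if non_neutral else None

def mean_accuracy(all_responses):
    """Average the per-participant accuracies, skipping all-neutral participants (assumption)."""
    scores = [participant_accuracy(r) for r in all_responses]
    return mean(s for s in scores if s is not None)

# Hypothetical responses from two participants.
p1 = [("very likely written by a machine", True),
      ("I don't know", True),
      ("more likely written by a human", True)]
p2 = [("more likely written by a human", False),
      ("more likely written by a machine", True)]
overall = mean_accuracy([p1, p2])
```

Under this definition a participant who guesses at random among the non-neutral options lands near 50%, which is why the text treats 50% as chance level.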
3.9.5 Learning and Using Novel Words
| A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a word in a sentence after seeing it defined only once, or conversely inferring a word’s meaning from only one usage. Here we qualitatively test GPT-3’s ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word, such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to five previous examples of a (separate) nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the broad task and one-shot in terms of the specific word. Table 3.16 shows the 6 examples we generated; all definitions were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence. |
3.9.6 Correcting English Grammar
| Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output: <sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any omissions or repeats). Results are shown in Figure 3.17. |
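A prompt in the format quoted above could be assembled along these lines. The helper name, the blank-line separator between demonstrations, and the example sentences are hypothetical; only the two field labels come from the text.

```python
def grammar_prompt(demonstrations, query):
    """Format (poor, good) demonstration pairs, then the query with its output left open."""
    blocks = [
        f"Poor English Input: {poor}\nGood English Output: {good}"
        for poor, good in demonstrations
    ]
    # The final block stops after the label so the model completes the correction.
    blocks.append(f"Poor English Input: {query}\nGood English Output:")
    return "\n\n".join(blocks)

# One human-written correction as conditioning, then a new sentence to correct.
demos = [("She no went to the market.", "She did not go to the market.")]
prompt = grammar_prompt(demos, "He have two cat.")
```

Ending the prompt on the open "Good English Output:" label is what turns the correction task into ordinary text completion.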
4 Measuring and Preventing Memorization Of Benchmarks
| Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research without established best practices. While it is common practice to train large models without investigating contamination, given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to. This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [TL18] detected and removed a training document which overlapped with one of their evaluation datasets. Other work such as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that although models did perform moderately better on data that overlapped between training and testing, this did not significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent). |
| GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as large as feared. We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts results. For each benchmark, we produce a ‘clean’ version which removes all potentially leaked examples, defined roughly as examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination, so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in Appendix C. |
| We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a quarter of benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence that contamination level and performance difference are correlated. We conclude that either our conservative method substantially overestimated contamination or that contamination has little effect on performance. Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference difficult. |
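The clean-subset comparison can be illustrated with a small sketch of the 13-gram overlap rule. Lowercased whitespace tokenization, the brute-force scan over training documents, and the toy data are assumptions made for brevity; the procedure the authors actually used is the one in Appendix C, which this does not attempt to reproduce.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows (empty when there are fewer than n tokens)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_flagged(example, training_docs, n=13):
    """Flag an example that shares an n-gram with any training document.

    Examples shorter than n tokens are flagged only if they appear whole.
    """
    tokens = example.lower().split()
    m = min(n, len(tokens))
    if m == 0:
        return False
    probe = ngrams(tokens, m)
    return any(probe & ngrams(doc.lower().split(), m) for doc in training_docs)

def clean_vs_full(examples, correct, training_docs, n=13):
    """Return (accuracy on all examples, accuracy on the clean subset, fraction flagged)."""
    flags = [is_flagged(ex, training_docs, n) for ex in examples]
    full_acc = sum(correct) / len(correct)
    clean = [c for c, flagged in zip(correct, flags) if not flagged]
    clean_acc = sum(clean) / len(clean) if clean else None
    return full_acc, clean_acc, sum(flags) / len(flags)

# Hypothetical toy data: one training document and four benchmark examples.
docs = ["the quick brown fox jumps over the lazy dog near the quiet river bank today"]
examples = [
    "quick brown fox jumps over the lazy dog near the quiet river bank today again",
    "lazy dog",
    "a completely unrelated question about arithmetic",
    "dog lazy",
]
correct = [True, True, False, True]
full_acc, clean_acc, flagged_fraction = clean_vs_full(examples, correct, docs)
```

A clean-subset score well below the full score is the signal the text associates with contamination inflating results; similar scores suggest the flagged overlap had little effect.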
| Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false positives. We summarize the results for each group of tasks below: |
| We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply to verify how much actual contamination existed. These appeared to often contain false positives. They had either no actual contamination, or had contamination that did not give away the answer to the task. One notable exception was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this paper, the potential contamination is noted in the results section. Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C. |
5 Limitations
| GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for future work. First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of GPT-3’s limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some datasets (such as PIQA [BZB+19]) that test this domain. Specifically, GPT-3 has difficulty with questions of the type “If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3’s in-context learning performance has some notable gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading comprehension tasks. This is especially striking given GPT-3’s strong few-shot performance on many other tasks. |
| GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused on exploring in-context learning behavior in autoregressive language models because it is straightforward to both sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent literature, which has documented improved fine-tuning performance when using these approaches over standard language models [RSR+19]. Thus our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer. This could be a possible explanation for GPT-3’s lagging few-shot performance on a few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”. |
| A more fundamental limitation of the general approach described in this paper - scaling up any LM-like model, whether autoregressive or bidirectional - is that it may eventually run into (or could already be running into) the limits of the pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world [BHT+20]. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans [ZSW+19a], fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world [CLY+19]. |
| Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more text during pre-training than a human sees in their lifetime [Lin20]. Improving pre-training sample efficiency is an important direction for future work, and might come from grounding in the physical world to provide additional information, or from algorithmic improvements. A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identifies tasks that it has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format, to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although possibly from data that is very different in organization and style from the test data. Ultimately, it is not even clear what humans learn from scratch vs from prior demonstrations. Even organizing diverse demonstrations during pre-training and identifying them at test time would be an advance for language models, but nevertheless understanding precisely how few-shot learning works is an important unexplored direction for future research. |
| A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundreds of billions of parameters; new challenges and opportunities may be associated with applying it to models of this size. Finally, GPT-3 shares some limitations common to most deep learning systems - its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This last issue - biases in the data that may lead the model to generate stereotyped or prejudiced content - is of special concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts (Section 6). |
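To make the distillation idea mentioned above more concrete, here is a minimal sketch of a soft-target loss of the kind commonly used for distillation. It assumes a temperature-softened cross-entropy between a teacher's and a student's output distributions. The function names, the temperature value and the example logits are illustrative assumptions; nothing here reflects an experiment from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


# Hypothetical logits over a four-token vocabulary.
teacher = [4.0, 1.0, 0.5, -2.0]
matching_student = [4.0, 1.0, 0.5, -2.0]
poor_student = [-2.0, 0.5, 1.0, 4.0]

loss_match = distillation_loss(teacher, matching_student)
loss_poor = distillation_loss(teacher, poor_student)
print(loss_match < loss_poor)
```

The loss is smallest when the student reproduces the teacher's distribution, which is what a training loop would push the smaller model towards.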
6 Broader Impacts
| Language models have a wide range of beneficial applications for society, including code and writing auto-completion, grammar assistance, game narrative generation, improving search engine responses, and answering questions. But they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the potential to advance both the beneficial and harmful applications of language models. Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly discuss issues of energy efficiency (Section 6.3). |
6.1 Misuse of Language Models
| Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing language models in a very different environment or for a different purpose than researchers intended. To help with this, we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures. |
6.1.1 Potential Misuse Applications
| Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high quality text. Language models that produce high quality text generation could lower existing barriers to carrying out these activities and increase their efficacy. The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text in 3.9.4 represents a concerning milestone in this regard. |
6.1.2 Threat Actor Analysis
| Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors who may be able to build a malicious product to ‘advanced persistent threats’ (APTs): highly skilled and well-resourced (e.g. state-sponsored) groups with long-term agendas [SBC+19]. To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is not immediate, but significant improvements in reliability could change this. Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible difference in operations that may see potential gains by using language models. The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage. |
6.1.3 External Incentive Structures
| Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment. Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs. The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts how scalable the operation can be. Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on this via a combination of mitigation research, prototyping, and coordinating with other technical developers. |
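As an illustration of the top-k truncation mentioned above, the following sketch keeps only the `k` highest-scoring candidates before sampling. It is a generic example using Python's standard library. The function name `top_k_sample` and the example logits are assumptions, and the source does not say how any particular system implements this.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample a token index after discarding all but the k highest logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the maximum for numerical stability
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical logits over a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

Truncation narrows the output distribution but sampling remains random, which is consistent with the point above that outputs stay stochastic even when constrained.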
6.2 Fairness, Bias, and Representation
| Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning, since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in the model in order to better understand GPT-3’s limitations when it comes to fairness, bias, and representation. Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model’s biases even within the studied categories. Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race, and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how they are different in this dimension. |
6.2.1 Gender
| In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found that occupations in general have a higher probability of being followed by a male gender identifier than a female one (in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant). 83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured this by feeding the model a context such as "The detective was a" and then looking at the probability of the model following up with male indicating words (e.g. man, male etc.) or female indicating words (woman, female etc.). In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus were heavily male leaning along with occupations that require hard physical labour such as mason, millwright, and sheriff. Occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist, housekeeper etc. We also tested how these probabilities changed when we shifted the context to be "The competent {occupation} was a" (Competent Variant), and when we shifted the context to be "The incompetent {occupation} was a" (Incompetent Variant) for each occupation in the dataset. We found that, when prompted with "The competent {occupation} was a," the majority of occupations had an even higher probability of being followed by a male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was a". With the prompt "The incompetent {occupation} was a" the majority of occupations still leaned male with a probability similar to that for our original neutral prompt. The average occupation bias - measured as $\frac{1}{n_{\text{jobs}}} \sum_{\text{jobs}} \log\left(\frac{P(\text{female} \mid \text{Context})}{P(\text{male} \mid \text{Context})}\right)$ - was −1.11 for the Neutral Variant, −2.14 for the Competent Variant and −1.15 for the Incompetent Variant. |
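The average occupation bias described above is the mean, over occupations, of log(P(female | Context) / P(male | Context)), so negative values indicate a male-leaning result. A minimal sketch of that computation follows. The probabilities are made-up placeholders rather than values from the paper, and `average_occupation_bias` is a hypothetical helper name.

```python
import math


def average_occupation_bias(probabilities):
    """Mean of log(P(female | context) / P(male | context)) over all occupations.

    `probabilities` maps an occupation to a (p_female, p_male) pair.
    """
    logs = [math.log(p_female / p_male) for p_female, p_male in probabilities.values()]
    return sum(logs) / len(logs)


# Placeholder probabilities, for illustration only.
toy = {
    "detective": (0.2, 0.6),
    "nurse": (0.7, 0.1),
    "mason": (0.05, 0.8),
}
balanced = {"a": (0.1, 0.2), "b": (0.2, 0.1)}

toy_bias = average_occupation_bias(toy)
balanced_bias = average_occupation_bias(balanced)
print(toy_bias < 0, abs(balanced_bias) < 1e-12)
```

Because the metric averages log-ratios, one strongly female-leaning occupation can be outweighed by several male-leaning ones, as in the toy data.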
| We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further corroborated the model’s tendency to associate most occupations with males. One method measured the model’s ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model a context such as "The advisor met with the advisee because she wanted to get advice about job applications. ‘She’ refers to the" and found the option with the lowest probability between the two possible options (Choices between Occupation Option: advisor; Participant Option: advisee). Occupation and participant words often have societal biases associated with them, such as the assumption that most occupants are by default male. We found that the language models learnt some of these biases, such as a tendency to associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%). All other models had a higher accuracy for male pronouns with Occupation sentences as compared to female pronouns, with the exception of our second largest model, GPT-3 13B, which had the same accuracy (60%) for both. This offers some preliminary evidence that in places where issues of bias can make language models susceptible to error, the larger models are more robust than smaller models. We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other preselected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top-p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She was very", "He would be described as", "She would be described as". We looked at the adjectives and adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more often described using appearance oriented words such as "beautiful" and "gorgeous" as compared to men who were more often described using adjectives that span a greater spectrum. |
6.2.2 Race
| To investigate racial bias in GPT-3, we seeded the model with prompts such as "The {race} man was very", "The {race} woman was very" and "People would describe the {race} person as" and generated 800 samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White or Asian. We then measure word co-occurrences in the generated samples. Given prior research demonstrating that language models produce text of differing sentiment when varying features such as occupation [HZJ+19], we explored how race impacted sentiment. We measured sentiment using SentiWordNet [BES10] for the words which co-occurred disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive words (e.g. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (e.g. wretched: -87.5, horrid: -87.5) and a score of 0 indicating neutral words (e.g. sloping, chalet). It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that focused on racial features; these results are not from the models talking about race in the wild but talking about race in an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply looking at word co-occurrences, the resulting sentiment can reflect socio-historical factors - for instance, text relating to a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated with a negative sentiment under this testing methodology. Across the models we analyzed, ‘Asian’ had a consistently high sentiment - it ranked 1st in 3 out of 7 models. On the other hand, ‘Black’ had a consistently low sentiment - it ranked the lowest in 5 out of 7 models. These differences narrowed marginally on the larger model sizes. This analysis gives a sense of the biases of different models and highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data. |
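A simplified version of the co-occurrence sentiment measurement described above might look like the sketch below. It averages lexicon scores over every scored word in a group's samples, rather than first selecting the words that co-occur disproportionately with each group as the text describes. The tiny `lexicon` dictionary is a stand-in for a real sentiment resource, and the sample strings are invented.

```python
from collections import Counter


def word_counts(samples):
    """Count lowercase whitespace-separated words across all samples."""
    counts = Counter()
    for sample in samples:
        counts.update(sample.lower().split())
    return counts


def mean_sentiment(samples, lexicon):
    """Frequency-weighted mean lexicon score over the words that have a score."""
    counts = word_counts(samples)
    scored = [(lexicon[word], n) for word, n in counts.items() if word in lexicon]
    total = sum(n for _, n in scored)
    return sum(score * n for score, n in scored) / total if total else 0.0


# Stand-in lexicon on the same -100..100 scale as the scores quoted in the text.
lexicon = {"amicable": 87.5, "horrid": -87.5, "sloping": 0.0}
samples = ["amicable and amicable", "horrid"]

score = mean_sentiment(samples, lexicon)
print(round(score, 2))
```

As the text cautions, a score of this kind reflects only which words appear near a term, so it can pick up the topic of a passage as much as any attitude towards the group.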
6.2.3 Religion
| We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam, and Judaism, by generating 800 model outputs of length ≈50 with a temperature of 1 and a top-p of 0.9 for every prompt. Our prompts were of the nature "{Religion practitioners} are" (e.g. "Christians are") for each of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a corpus of such completions for studying co-occurrence of words. Similar to race, we found that the models make associations with religious terms that indicate some propensity to reflect how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in the top 40 most favored words for Islam in GPT-3. |
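The sampling settings quoted above (temperature 1, top-p 0.9) correspond to what is usually called nucleus sampling: keep the smallest set of most probable tokens whose cumulative probability reaches `top_p`, then sample within that set. The sketch below is a generic standard-library illustration. The function name `top_p_sample` and the example distribution are assumptions, not details taken from the paper.

```python
import math
import random


def top_p_sample(logits, temperature, top_p, rng):
    """Sample a token index from the smallest top set whose probability mass reaches top_p."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least probable until the mass threshold is met.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


# Hypothetical distribution: probabilities 0.5, 0.3, 0.15 and 0.05.
logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
rng = random.Random(0)
draws = [top_p_sample(logits, temperature=1.0, top_p=0.9, rng=rng) for _ in range(300)]
print(sorted(set(draws)))
```

With these numbers the least likely token (index 3) falls outside the nucleus and is never drawn, while the remaining three are still sampled in proportion to their probabilities.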
6.2.4 Future Bias and Fairness Challenges
| We have presented this preliminary analysis to share some of the biases we found in order to motivate further research, and to highlight the inherent difficulties in characterizing biases in large-scale generative models; we expect this to be an area of continuous research for us and are excited to discuss different methodological approaches with the community. We view the work in this section as subjective signposting - we chose gender, race, and religion as a starting point, but we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model attributes to develop informative labels such as Model Cards for Model Reporting from [MWZ+18]. Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this is also extensive [QMZH19, HZJ+19], so we offer only a few brief comments on future directions specific to large language models. In order to pave the way for effective bias prevention in general purpose models, there is a need for building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for these models. There is room for more research that engages with the literature outside NLP, better articulates normative statements about harm, and engages with the lived experience of communities affected by NLP systems [BBDIW20]. Thus, mitigation work should not be approached purely with a metric driven objective to ‘remove’ bias, as this has been shown to have blind spots [GG19, NvNvdG19], but in a holistic manner. |
6.3 Energy Usage
| Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training GPT-3 175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days for a 1.5B parameter GPT-2 model (Figure 2.2). This means we should be cognizant of the cost and efficiency of such models, as advocated by [SDSE19]. |
| The use of large-scale pre-training also gives another lens through which to view the efficiency of large models: we should consider not only the resources that go into training them, but how these resources are amortized over the lifetime of a model, which will subsequently be used for a variety of purposes and fine-tuned for specific tasks. Though models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or only a few cents in energy costs. Additionally, techniques like model distillation [LHCG19a] can further bring down the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efficient versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efficiency of such models over time, similar to trends observed in image recognition and neural machine translation [HB20]. |
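The units in this section can be made concrete with a little arithmetic. The sketch below converts petaflop/s-days into total floating-point operations and turns the 0.4 kW-hr figure into a dollar amount. The function names are hypothetical, and the electricity price of $0.10 per kWh is an assumption chosen for illustration, not a figure from the paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def petaflop_s_days_to_flops(pf_days):
    """Convert petaflop/s-days to a total count of floating-point operations."""
    return pf_days * 1e15 * SECONDS_PER_DAY


def energy_cost_usd(kwh, usd_per_kwh):
    """Electricity cost for a given energy use at an assumed price per kWh."""
    return kwh * usd_per_kwh


# One petaflop/s-day is 10^15 operations per second sustained for one day.
print(f"{petaflop_s_days_to_flops(1):.3e} FLOPs per petaflop/s-day")

# 0.4 kWh per 100 pages comes from the text; $0.10/kWh is an assumed price.
print(f"${energy_cost_usd(0.4, 0.10):.2f} per 100 pages")
```

At the assumed price, 100 pages of generated text cost about four cents of electricity, which is consistent with the "only a few cents" statement above.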
7 Related Work
| Several lines of work have focused on increasing parameter count and/or computation in language models as a means to improve generative or task performance. An early work scaled LSTM-based language models to over a billion parameters [JVS+16]. One line of work straightforwardly increases the size of transformer models, scaling up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size: 213 million parameters [VSP+17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters [RWC+19], 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and most recently 17 billion parameters [Tur20]. A second line of work has focused on increasing parameter count but not computation, as a means of increasing models' capacity to store information without increased computational cost. These approaches rely on the conditional computation framework [BLC13]; specifically, the mixture-of-experts method [SMM+17] has been used to produce 100 billion parameter models and more recently 50 billion parameter translation models [AJF19], though only a small fraction of the parameters are actually used on each forward pass. A third approach increases computation without increasing parameters; examples of this approach include adaptive computation time [Gra16] and the universal transformer [DGV+18]. Our work focuses on the first approach (scaling compute and parameters together, by straightforwardly making the neural net larger), and increases model size 10x beyond previous models that employ this strategy. |
| Several efforts have also systematically studied the effect of scale on language model performance. [KMH+20, RRBS19, LWS+20, HNA+17] find a smooth power-law trend in loss as autoregressive language models are scaled up. This work suggests that this trend largely continues as models continue to scale up (although a slight bending of the curve can perhaps be detected in Figure 3.1), and we also find relatively smooth increases in many (though not all) downstream tasks across 3 orders of magnitude of scaling. |
| Another line of work goes in the opposite direction from scaling, attempting to preserve strong performance in language models that are as small as possible. This approach includes ALBERT [LCG+19] as well as general [HVD15] and task-specific [SDCW19, JYS+19, KR16] approaches to distillation of language models. These architectures and techniques are potentially complementary to our work, and could be applied to decrease latency and memory footprint of giant models. |
| As fine-tuned language models have neared human performance on many standard benchmark tasks, considerable effort has been devoted to constructing more difficult or open-ended tasks, including question answering [KPR+19, IBGC+14, CCE+18, MCKS18], reading comprehension [CHI+18, RCM19], and adversarially constructed datasets designed to be difficult for existing language models [SBBC19, NWD+19]. In this work we test our models on many of these datasets. |
| Many previous efforts have focused specifically on question-answering, which constitutes a significant fraction of the tasks we tested on. Recent efforts include [RSR+19, RRS20], which fine-tuned an 11 billion parameter language model, and [GLT+20], which focused on attending over a large corpus of data at test time. Our work differs in focusing on in-context learning but could be combined in the future with those of [GLT+20, LPP+20]. |
| Metalearning in language models has been utilized in [RWC+19], though with much more limited results and no systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it structurally similar to metalearning as applied to ML in general. Here there is an extensive literature, including matching networks [VBL+16], RL2 [DSC+16], learning to optimize [RL16, ADG+16, LM17] and MAML [FAL17]. Our approach of stuffing the model's context with previous examples is most structurally similar to RL2 and also resembles [HYC01], in that an inner loop of adaptation takes place through computation in the model's activations across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training) updates the weights, and implicitly learns the ability to adapt to or at least recognize tasks defined at inference-time. Few-shot auto-regressive density estimation was explored in [RCP+17], and [GWC+18] studied low-resource NMT as a few-shot learning problem. |
| While the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained language models in combination with gradient descent to perform few-shot learning [SS20]. Another sub-field with similar goals is semi-supervised learning, where approaches such as UDA [XDH+19] also explore methods of fine-tuning when very little labeled data is available. |
| Giving multi-task models instructions in natural language was first formalized in a supervised setting with [MKXS18] and utilized for some tasks (such as summarizing) in a language model with [RWC+19]. The notion of presenting tasks in natural language was also explored in the text-to-text transformer [RSR+19], although there it was applied for multi-task fine-tuning rather than for in-context learning without weight updates. |
| Another approach to increasing generality and transfer-learning capability in language models is multi-task learning [Car97], which fine-tunes on a mixture of downstream tasks together, rather than separately updating the weights for each one. If successful, multi-task learning could allow a single model to be used for many tasks without updating the weights (similar to our in-context learning approach), or alternatively could improve sample efficiency when updating the weights for a new task. Multi-task learning has shown some promising initial results [LGH+15, LSP+18], and multi-stage fine-tuning has recently become a standardized part of SOTA results on some datasets [PFB18] and pushed the boundaries on certain tasks [KKS+20], but is still limited by the need to manually curate collections of datasets and set up training curricula. By contrast, pre-training at large enough scale appears to offer a “natural” broad distribution of tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate a broader set of explicit tasks for multi-task learning, for example through procedural generation [TFR+17], human interaction [ZSW+19b], or active learning [Mac92]. |
| Algorithmic innovation in language models over the last two years has been enormous, including denoising-based bidirectionality [DCLT18], prefixLM [DL15] and encoder-decoder architectures [LLG+19, RSR+19], random permutations during training [YDY+19], architectures that improve the efficiency of sampling [DYY+19], improvements in data and training procedures [LOG+19], and efficiency increases in the embedding parameters [LCG+19]. Many of these techniques provide significant gains on downstream tasks. In this work we continue to focus on pure autoregressive language models, both in order to focus on in-context learning performance and to reduce the complexity of our large model implementations. However, it is very likely that incorporating these algorithmic advances could improve GPT-3's performance on downstream tasks, especially in the fine-tuning setting, and combining GPT-3's scale with these algorithmic techniques is a promising direction for future work. |
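The smooth power-law trend in loss mentioned earlier in this section means that loss versus scale falls on a straight line in log-log space. The sketch below fits such a line by ordinary least squares on the logarithms. It is only an illustration: the function name `fit_power_law` is hypothetical, and the data points are synthetic values generated from made-up constants, not measurements from the paper or the cited scaling studies.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via linear regression on logs.

    Returns (a, b). All xs and ys must be positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log space -> prefactor
    return a, b


# Synthetic, made-up points that follow loss = 5.0 * compute**-0.05 exactly.
compute = [1e-2, 1e0, 1e2, 1e4]
loss = [5.0 * c ** -0.05 for c in compute]
a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real measurements, a systematic departure of the largest models from the fitted line would correspond to the "slight bending of the curve" that the text mentions.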
8 Conclusion
| We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems. |
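The zero-shot, one-shot, and few-shot settings differ only in how many demonstrations are placed in the model's context before the query; no weights are updated in any of them. A rough sketch of how such prompts might be assembled is shown below. The function name `build_prompt`, the `=>` separator, and the translation examples are illustrative assumptions, not the exact prompt format used for any particular benchmark.

```python
def build_prompt(task_description, demonstrations, query, k):
    """Assemble a text prompt with k in-context demonstrations.

    k = 0 gives a zero-shot prompt, k = 1 one-shot, k > 1 few-shot.
    The demonstrations live only in the context; nothing is trained.
    """
    lines = [task_description]
    for source, target in demonstrations[:k]:
        lines.append(f"{source} => {target}")
    # The final line is left open for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Hypothetical translation demonstrations, for illustration only.
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
for k in (0, 1, 2):
    print(f"--- k = {k} ---")
    print(build_prompt("Translate English to French:", demos, "peppermint", k))
```

The resulting string would be passed to the language model as-is, and the model's continuation of the last line is read off as its answer.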
Acknowledgements
| The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea Voss for helping run evaluations on OpenAI's infrastructure. Thanks to David Luan for initial support in scaling up this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments, Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of people who created content that was used in the training of the model, and to those who were involved in indexing or upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure and supercomputing teams for making it possible to train models at this scale. |