Paper: GPT-3 "Language Models are Few-Shot Learners", Translation and Commentary

Published: 2025/3/21

Abstract

1 Introduction

2 Approach

2.1 Model and Architectures

2.2 Training Dataset

2.3 Training Process

2.4 Evaluation

3 Results

3.1 Language Modeling, Cloze, and Completion Tasks

3.1.1 Language Modeling

3.1.2 LAMBADA

3.1.3 HellaSwag

3.1.4 StoryCloze

3.2 Closed Book Question Answering

3.3 Translation

3.4 Winograd-Style Tasks

3.5 Common Sense Reasoning

3.6 Reading Comprehension

3.7 SuperGLUE

3.8 NLI

3.9 Synthetic and Qualitative Tasks

3.9.1 Arithmetic

3.9.2 Word Scrambling and Manipulation Tasks

3.9.3 SAT Analogies

3.9.4 News Article Generation

3.9.5 Learning and Using Novel Words

3.9.6 Correcting English Grammar

4 Measuring and Preventing Memorization Of Benchmarks

5 Limitations

6 Broader Impacts

6.1 Misuse of Language Models

6.1.1 Potential Misuse Applications

6.1.2 Threat Actor Analysis

6.1.3 External Incentive Structures

6.2 Fairness, Bias, and Representation

6.2.1 Gender

6.2.2 Race

6.2.3 Religion

6.2.4 Future Bias and Fairness Challenges

6.3 Energy Usage

7 Related Work

8 Conclusion

Acknowledgements

GPT-3: "Language Models are Few-Shot Learners", Translation and Commentary

Authors

OpenAI

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Original paper

https://arxiv.org/abs/2005.14165

GitHub

https://github.com/openai/gpt-3

Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

1 Introduction

Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors [MCCD13, PSM14] and fed to task-specific architectures, then RNNs with multiple layers of representations and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to task-specific architectures), and more recently pre-trained recurrent or transformer language models [VSP+17] have been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18, DCLT18, HR18]. This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, and many others, and has continued to advance based on new architectures and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Removing this limitation would be desirable, for several reasons.
First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. There exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task.
Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions. For instance [HLW+20] observe that larger models do not necessarily generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm can be poor because the model is overly specific to the training distribution and does not generalize well outside it [YdC+19, MPL19]. Thus, the performance of fine-tuned models on specific benchmarks, even when it is nominally at human-level, may exaggerate actual performance on the underlying task [GSL+18, NK19].
Third, humans do not require large supervised datasets to learn most language tasks – a brief directive in natural language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.
One potential route towards addressing these issues is meta-learning – which in the context of language models means the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work [RWC+19] attempts to do this via what we call “in-context learning”, using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next.
While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning – for example [RWC+19] achieves only 4% on Natural Questions, and even its 55 F1 CoQa result is now more than 35 points behind the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of solving language tasks.
Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters [DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale [KMH+20]. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.
In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on over two dozen NLP datasets, as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training set. For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional fine-tuning setting, but we leave this to future work.
Figure 1.2 illustrates the conditions we study, and shows few-shot learning of a simple task requiring the model to remove extraneous symbols from a word. Model performance improves with the addition of a natural language task description, and with the number of examples in the model’s context, K. Few-shot learning also improves dramatically with model size. Though the results in this case are particularly striking, the general trends with both model size and number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.
Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 on CoQA in the one-shot setting, 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art relative to fine-tuned models operating in the same closed-book setting.
GPT-3 also displays one-shot and few-shot proficiency at tasks designed to test rapid adaption or on-the-fly reasoning, which include unscrambling words, performing arithmetic, and using novel words in a sentence after seeing them defined only once. We also show that in the few-shot setting, GPT-3 can generate synthetic news articles which human evaluators have difficulty distinguishing from human-generated articles.
At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. This includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE or QuAC. By presenting a broad characterization of GPT-3’s strengths and weaknesses, including these limitations, we hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.
A heuristic sense of the overall results can be seen in Figure 1.3, which aggregates the various tasks (though it should not be seen as a rigorous or meaningful benchmark in itself).
We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models on datasets such as Common Crawl, which can potentially include content from test datasets simply because such content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify its distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these datasets or we note them with an asterisk, depending on the severity.
In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.
Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and broader societal impacts, and attempt a preliminary analysis of GPT-3’s characteristics in this regard.
The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one- and few-shot settings. Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3. Section 6 discusses broader impacts. Section 7 reviews related work and Section 8 concludes.

2 Approach

Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC+19], with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to [RWC+19], but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration):

Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the training data [GSL+18, NK19], potentially resulting in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle and this is a promising direction for future work.
Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few demonstrations of the task at inference time as conditioning [RWC+19], but no weight updates are allowed. As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving K examples of context and completion, and then one final example of context, with the model expected to provide the completion. We typically set K in the range of 10 to 100 as this is how many examples can fit in the model’s context window (nctx = 2048). The main advantages of few-shot are a major reduction in the need for task-specific data and reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL+16] – both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task.
One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 1. The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. For example, when asking humans to generate a dataset on a human worker service (for example Mechanical Turk), it is common to give one demonstration of the task. By contrast it is sometimes difficult to communicate the content or format of a task if no examples are given.
Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given a natural language instruction describing the task. This method provides maximum convenience, potential for robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of pre-training data), but is also the most challenging setting. In some cases it may even be difficult for humans to understand the format of the task without prior examples, so this setting is in some cases “unfairly hard”. For example, if someone is asked to “make a table of world records for the 200m dash”, this request can be ambiguous, as it may not be clear exactly what format the table should have or what should be included (and even with careful clarification, understanding precisely what is desired can be difficult). Nevertheless, for at least some settings zero-shot is closest to how humans perform tasks – for example, in the translation example in Figure 2.1, a human would likely know what to do from just the text instruction.
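
The three in-context settings above differ only in how many demonstrations are placed in the prompt. A minimal sketch of that idea follows; the `build_prompt` helper, the `context => completion` layout, and the word pairs are illustrative assumptions, not a format specified in the text.

```python
def build_prompt(instruction, demonstrations, query, k):
    """Assemble an in-context prompt with k demonstrations (no weight updates).

    k = 0 corresponds to zero-shot, k = 1 to one-shot, k > 1 to few-shot.
    The "context => completion" layout is a hypothetical format.
    """
    if k > len(demonstrations):
        raise ValueError("not enough demonstrations for the requested k")
    lines = [instruction]
    for context, completion in demonstrations[:k]:
        lines.append(f"{context} => {completion}")
    # The final example carries only the context; the model supplies the completion.
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Illustrative English-to-French pairs (assumed data).
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
instruction = "Translate English to French:"
for k in (0, 1, 2):
    print(f"--- k = {k} ---")
    print(build_prompt(instruction, demos, "peppermint", k))
```

In all three cases the model's weights stay fixed; only the conditioning text changes.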

Figure 2.1 shows the four methods using the example of translating English to French. In this paper we focus on zero-shot, one-shot and few-shot, with the aim of comparing them not as competing alternatives, but as different problem settings which offer a varying trade-off between performance on specific benchmarks and sample efficiency. We especially highlight the few-shot results as many of them are only slightly behind state-of-the-art fine-tuned models. Ultimately, however, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, and are important targets for future work.

Sections 2.1-2.3 below give details on our models, training data, and training process respectively. Section 2.4 discusses the details of how we do few-shot, one-shot, and zero-shot evaluations.

2.1 Model and Architectures

We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH+20] suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks.
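
The power-law hypothesis mentioned above can be checked by fitting a straight line in log-log space. The sketch below does this with ordinary least squares; the prefactor, the exponent, and the resulting loss values are synthetic assumptions used only to exercise the fit, not measurements from the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) + b * log(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic losses generated from loss = 10 * N^(-0.08) (assumed values).
sizes = [1.25e8, 1.3e9, 1.3e10, 1.75e11]
losses = [10.0 * n ** -0.08 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(f"prefactor = {a:.3f}, exponent = {b:.3f}")
```

With real validation losses, a near-linear relationship on log-log axes would support the hypothesis, and systematic curvature would count against it.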

Table 2.1 shows the sizes and architectures of our 8 models. Here nparams is the total number of trainable parameters, nlayers is the total number of layers, dmodel is the number of units in each bottleneck layer (we always have the feedforward layer four times the size of the bottleneck layer, dff = 4 × dmodel), and dhead is the dimension of each attention head. All models use a context window of nctx = 2048 tokens. We partition the model across GPUs along both the depth and width dimension in order to minimize data-transfer between nodes. The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPUs. Previous work [KMH+20] suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.
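
As a rough sanity check on how these quantities relate, the sketch below estimates the parameter count from the number of layers and the bottleneck width. It assumes the common approximation of 12 × dmodel² weights per layer (attention plus feed-forward), ignoring embeddings, biases and layer norms. The values 96 layers and dmodel = 12288 are the largest configuration listed in the paper's Table 2.1, which is not reproduced here.

```python
def feedforward_width(d_model):
    """Feed-forward layer is four times the bottleneck width (dff = 4 x dmodel)."""
    return 4 * d_model


def approx_params(n_layers, d_model):
    """Rough non-embedding parameter count for a decoder-only transformer.

    Per layer: 4 * d_model^2 attention weights (query, key, value, output
    projections) plus 2 * d_model * d_ff feed-forward weights.
    Biases, layer norms and embeddings are ignored (assumption).
    """
    d_ff = feedforward_width(d_model)
    per_layer = 4 * d_model * d_model + 2 * d_model * d_ff
    return n_layers * per_layer


n_layers, d_model = 96, 12288
print("d_ff =", feedforward_width(d_model))
print(f"approx. parameters = {approx_params(n_layers, d_model) / 1e9:.1f} billion")
```

The estimate lands close to the quoted 175 billion; the gap is roughly what the ignored embedding and bias terms would account for.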

2.2 Training Dataset

Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset [RSR+19] constituting nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.

Details of the first two points (processing of Common Crawl) are described in Appendix A. For the third, we added several curated high-quality datasets, including an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time, and first described in [KMH+20], two internet-based books corpora (Books1 and Books2) and English-language Wikipedia.
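
To make the idea of document-level fuzzy deduplication concrete, here is a small sketch based on Jaccard similarity over word shingles. This is a stand-in technique chosen for brevity, not the procedure from Appendix A; the shingle size and the 0.6 threshold are assumptions.

```python
def shingles(text, n=3):
    """Set of overlapping word n-grams for a document."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_dedup(documents, threshold=0.6, n=3):
    """Keep a document only if it is not too similar to any document kept so far."""
    kept, kept_shingles = [], []
    for doc in documents:
        s = shingles(doc, n)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "language models are trained on large text corpora",
]
for doc in fuzzy_dedup(docs):
    print(doc)
```

The pairwise comparison is quadratic in the number of documents, so a web-scale corpus would need an approximate method such as MinHash with locality-sensitive hashing.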

Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data.
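
The relationship between mixture weight and the number of passes over a dataset is simple arithmetic. The sketch below uses hypothetical dataset sizes, weights, and training budget (not the values in Table 2.2) to show how a large corpus can be seen less than once while a small curated corpus is seen several times.

```python
def epochs_seen(dataset_tokens, mixture_weight, total_training_tokens):
    """Number of passes over a dataset when it supplies a fixed share of training tokens."""
    return mixture_weight * total_training_tokens / dataset_tokens


# Hypothetical sizes (billions of tokens) and sampling weights.
mixture = {
    "large_web_crawl": {"tokens": 400.0, "weight": 0.8},
    "curated_corpus": {"tokens": 20.0, "weight": 0.2},
}
total_training_tokens = 300.0  # hypothetical budget, billions of tokens

for name, cfg in mixture.items():
    passes = epochs_seen(cfg["tokens"], cfg["weight"], total_training_tokens)
    print(f"{name}: {passes:.2f} passes")
```

Raising the weight of the small curated corpus is what produces repeated passes over it, which is the small amount of overfitting the text accepts in exchange for higher-quality data.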

A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible to retrain the model. In Section 4 we characterize the impact of the remaining overlaps, and in future work we will more aggressively remove data contamination.
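
One simple way to look for overlap between training data and benchmark examples is to flag any example that shares a long word n-gram with a training document. The sketch below illustrates that idea only; the n-gram length, the lowercasing, and the whitespace tokenization are assumptions, and the paper's own tooling is the subject of Section 4.

```python
def ngrams(text, n):
    """Set of word n-grams in a text (empty if the text has fewer than n words)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_contaminated(benchmark_examples, training_documents, n=8):
    """Return benchmark examples sharing at least one word n-gram with the training data."""
    seen = set()
    for doc in training_documents:
        seen |= ngrams(doc, n)
    return [ex for ex in benchmark_examples if ngrams(ex, n) & seen]


training = [
    "a forum post quoting the question: what is the capital city of france and why",
    "unrelated text about cooking pasta at home tonight",
]
benchmark = [
    "what is the capital city of france",
    "who wrote the novel moby dick in 1851",
]
# A short n-gram length is used here only because the toy examples are short.
print(find_contaminated(benchmark, training, n=5))
```

Flagged examples can then be removed from the training data or, if training has already happened, reported separately so their effect on scores can be measured.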

2.3 Training Process

As found in [KMH+20, MKAT18], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table 2.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPU’s on part of a high-bandwidth cluster provided by Microsoft. Details of the training process and hyperparameter settings are described in Appendix B.

正如在[KMH+20, MKAT18]中發現的，較大的模型通常可以使用較大的批大小，但需要較小的學習速度。我們在訓練期間測量梯度噪聲尺度，并使用它來指導我們批量大小的選擇[MKAT18]。表2.1顯示了我們使用的參數設置。為了訓練更大的模型而不耗盡內存，我們在每個矩陣乘法中混合使用模型并行性和跨網絡層的模型并行性。所有的模型都是在微軟提供的高帶寬集群的V100 GPU上進行訓練的。詳細的訓練過程和超參數設置在附錄B中描述。

2.4 Evaluation 評估

For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that task’s training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and Storycloze there is no supervised training set available so we draw conditioning examples from the development set and evaluate on the test set. For Winograd (the original, not SuperGLUE version) there is only one dataset, so we draw conditioning examples directly from it.

對于少彈學習，我們從任務的訓練集中隨機抽取K個樣本作為條件，根據任務的不同用1或2個新行分隔，以此來評估評估集中的每個樣本。對于LAMBADA和Storycloze，沒有可用的監督訓練集，所以我們從開發集中提取條件設置示例，并在測試集上進行評估。對于Winograd(原始的，不是超級膠水版本)，只有一個數據集，所以我們直接從它提取條件設置示例。
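The few-shot conditioning described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `build_few_shot_prompt`, the "Q:/A:" formatting, and the toy arithmetic examples are hypothetical and are not taken from the paper's evaluation code.

```python
import random


def build_few_shot_prompt(train_examples, query_context, k, delimiter="\n\n", seed=None):
    """Draw k (context, completion) pairs at random and prepend them to the query.

    The delimiter would be one or two newlines depending on the task.
    """
    rng = random.Random(seed)
    k = min(k, len(train_examples))
    demonstrations = rng.sample(train_examples, k)
    parts = [f"{context} {completion}" for context, completion in demonstrations]
    parts.append(query_context)  # the final example supplies the context only
    return delimiter.join(parts)


# Hypothetical training set for a toy task.
train = [
    ("Q: What is 2 + 2?\nA:", "4"),
    ("Q: What is 3 + 5?\nA:", "8"),
    ("Q: What is 7 + 1?\nA:", "8"),
]

prompt = build_few_shot_prompt(train, "Q: What is 4 + 4?\nA:", k=2, seed=0)
print(prompt)
```

Setting `k=0` returns the query context alone, which corresponds to the zero-shot case.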

K can be any value from 0 to the maximum amount allowed by the model’s context window, which is nctx = 2048 for all models and typically fits 10 to 100 examples. Larger values of K are usually but not always better, so when a separate development and test set are available, we experiment with a few values of K on the development set and then run the best value on the test set. For some tasks (see Appendix G) we also use a natural language prompt in addition to (or for K = 0, instead of) demonstrations.

On tasks that involve choosing one correct completion from several options (multiple choice), we provide K examples of context plus correct completion, followed by one example of context only, and compare the LM likelihood of each completion. For most tasks we compare the per-token likelihood (to normalize for length), however on a small number of datasets (ARC, OpenBookQA, and RACE) we gain additional benefit as measured on the development set by normalizing by the unconditional probability of each completion, by computing $\frac{P(\text{completion} \mid \text{context})}{P(\text{completion} \mid \text{answer\_context})}$, where answer_context is the string "Answer: " or "A: " and is used to prompt that the completion should be an answer but is otherwise generic.

K可以是0到模型上下文窗口允許的最大數量之間的任何值，即nctx = 2048，適用于所有模型，通常適合10到100個示例。更大的K值通常但不總是更好的,所以當一組獨立的開發和測試是可用的,我們嘗試幾值K的開發設置,然后運行測試集上的最佳值。對于某些任務(參見附錄G)我們也使用自然語言提示除了(或K = 0,而不是)示威活動。

對于涉及從多個選項(多項選擇)中選擇一個正確完成的任務，我們提供了K個上下文示例加上正確完成，然后只提供一個上下文示例，并比較每個完成的LM可能性。對于大多數任務我們比較每個令牌的可能性(規范化長度),然而在少量的數據集(弧、OpenBookQA和比賽)我們獲得更多利益衡量發展設定的正常化的無條件概率每完成,通過計算P(完成|上下文)(完成|回答上下文),在回答上下文字符串“回答:”或“:”和用于提示完成應該答案但否則通用。
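The two normalization schemes for multiple-choice scoring can be compared with a small sketch. It assumes that per-token log-probabilities for each candidate completion are already available from some language model; the function names and the numbers below are hypothetical and serve only to show how the ranking can change.

```python
def per_token_score(token_logprobs):
    """Average log-probability per token, which normalizes for completion length."""
    return sum(token_logprobs) / len(token_logprobs)


def unconditional_normalized_score(logprobs_given_context, logprobs_given_answer_context):
    """log P(completion | context) - log P(completion | answer_context)."""
    return sum(logprobs_given_context) - sum(logprobs_given_answer_context)


def pick_best(scores):
    """Return the option with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical per-token log-probabilities for two candidate completions.
given_context = {"short": [-2.0], "long": [-1.0, -1.0, -1.0]}
given_answer_context = {"short": [-0.5], "long": [-3.0, -3.0, -3.0]}

raw = {name: sum(lp) for name, lp in given_context.items()}
length_norm = {name: per_token_score(lp) for name, lp in given_context.items()}
uncond_norm = {
    name: unconditional_normalized_score(given_context[name], given_answer_context[name])
    for name in given_context
}

print(pick_best(raw), pick_best(length_norm), pick_best(uncond_norm))
```

In this toy case the raw likelihood favors the shorter completion simply because it has fewer tokens, while both normalized scores favor the longer one.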

On tasks that involve binary classification, we give the options more semantically meaningful names (e.g. “True” or “False” rather than 0 or 1) and then treat the task like multiple choice; we also sometimes frame the task similar to what is done by [RSR+19] (see Appendix G for details).

On tasks with free-form completion, we use beam search with the same parameters as [RSR+19]: a beam width of 4 and a length penalty of α = 0.6. We score the model using F1 similarity score, BLEU, or exact match, depending on what is standard for the dataset at hand.

Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-, and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on the development set. We do submit to the test server on a small number of datasets (SuperGLUE, TriviaQA, PIQA) where we were able to make submission work, and we submit only the 200B few-shot results, and report development set results for everything else.

對于涉及二分類的任務，我們給選項以語義上更有意義的名稱(例如“真”或“假”，而不是0或1)，然后把任務當作多項選擇;我們有時也會類似于[RSR+19]所完成的任務(詳見附錄G)。

對于自由形式完成的任務，我們使用與[RSR+19]相同的參數進行波束搜索:波束寬度為4，長度罰值為radial = 0.6。我們使用F1相似度評分、BLEU或精確匹配來給模型評分，這取決于手頭數據集的標準。

對于每個模型的大小和學習設置(0 -，1 -，和小樣本)，最終的結果會在測試集上公布。當測試集是私有的,我們的模型往往是太大,以適應在測試服務器上,所以我們報告的結果發展。我們提交到測試服務器上少量的數據集(超強力膠水,TriviaQA PiQa)我們能夠提交工作,我們只有200 b few-shot提交結果,并報告發展為一切設置結果。
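Two of the free-form scoring metrics mentioned above, exact match and F1, can be sketched as follows. This is a simplified illustration: the normalization here (lowercasing and whitespace tokenization) is an assumption, and official benchmark scripts typically also strip punctuation and articles.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the strings agree after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall (bag-of-words overlap)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris ", "paris"))
print(token_f1("the cat sat", "the cat"))
```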

3 Results 結果

In Figure 3.1 we display training curves for the 8 models described in Section 2. For this graph we also include 6 additional extra-small models with as few as 100,000 parameters. As observed in [KMH+20], language modeling performance follows a power-law when making efficient use of training compute. After extending this trend by two more orders of magnitude, we observe only a slight (if any) departure from the power-law. One might worry that these improvements in cross-entropy loss come only from modeling spurious details of our training corpus. However, we will see in the following sections that improvements in cross-entropy loss lead to consistent performance gains across a broad spectrum of natural language tasks.

Below, we evaluate the 8 models described in Section 2 (the 175 billion parameter GPT-3 and 7 smaller models) on a wide range of datasets. We group the datasets into 9 categories representing roughly similar tasks.

在圖3.1中，我們展示了第2節中描述的8個模型的訓練曲線。在這個圖中，我們還包括了6個額外的超小型模型，這些模型只有100,000個參數。正如在[KMH+20]中觀察到的，在高效使用訓練計算時，語言建模性能遵循冪律。在將這一趨勢擴展兩個數量級之后，我們只觀察到與冪律有輕微的背離。人們可能會擔心這些交叉熵損失的改進僅僅來自于我們訓練語料庫的虛假細節建模。然而，在接下來的章節中，我們將看到交叉熵損失的改進可以在廣泛的自然語言任務中帶來一致的性能提升。 

下面，我們在廣泛的數據集上評估第2節中描述的8個模型(1750億參數GPT-3和7個較小的模型)。我們將數據集分成9個類別，這些類別代表大致相似的任務。 
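The power-law relationship between training compute and loss mentioned above is usually checked by fitting a straight line in log-log space. The sketch below is a minimal illustration on synthetic data; the coefficients and compute values are made up, and `fit_power_law` is a hypothetical helper rather than anything from the paper.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by ordinary least squares in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points generated from a known power law (a = 5.0, b = 0.05).
compute = [10.0 ** k for k in range(1, 7)]
loss = [5.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A departure from the power law would show up as systematic residuals between the measured losses and the fitted line at the largest compute values.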

In Section 3.1 we evaluate on traditional language modeling tasks and tasks that are similar to language modeling, such as Cloze tasks and sentence/paragraph completion tasks. In Section 3.2 we evaluate on “closed book” question answering tasks: tasks which require using the information stored in the model’s parameters to answer general knowledge questions. In Section 3.3 we evaluate the model’s ability to translate between languages (especially one-shot and few-shot). In Section 3.4 we evaluate the model’s performance on Winograd Schema-like tasks. In Section 3.5 we evaluate on datasets that involve commonsense reasoning or question answering. In Section 3.6 we evaluate on reading comprehension tasks, in Section 3.7 we evaluate on the SuperGLUE benchmark suite, and in 3.8 we briefly explore NLI. Finally, in Section 3.9, we invent some additional tasks designed especially to probe in-context learning abilities – these tasks focus on on-the-fly reasoning, adaptation skills, or open-ended text synthesis. We evaluate all tasks in the few-shot, one-shot, and zero-shot settings.

在3.1節中，我們評估了傳統的語言建模任務和類似于語言建模的任務，如完形填空任務和句子/段落完成任務。在第3.2節中，我們對“閉卷”問題回答任務進行評估，即需要使用模型參數中存儲的信息來回答一般知識問題的任務。在第3.3節中，我們評估了模型在不同語言之間的翻譯能力(特別是一次翻譯和少次翻譯)。在第3.4節中，我們評估了該模型在Winograd類模式任務上的性能。在第3.5節中，我們對涉及常識推理或問題回答的數據集進行評估。在第3.6節中，我們評估了閱讀理解任務;在第3.7節中，我們評估了SuperGLUE基準套件;在3.8節中，我們簡要探討了NLI。最后，在3.9節中，我們特別設計了一些額外的任務來探究上下文中的學習能力——這些任務側重于即時推理、適應技巧或開放式的文本合成。我們在“少拍”、“一次拍”和“零拍”設置中評估所有的任務。

3.1 Language Modeling, Cloze, and Completion Tasks 語言建模、完形填空和完成任務

In this section we test GPT-3’s performance on the traditional task of language modeling, as well as related tasks that involve predicting a single word of interest, completing a sentence or paragraph, or choosing between possible completions of a piece of text.

在本節中，我們將測試GPT-3在傳統的語言建模任務以及相關任務上的性能，這些任務包括預測感興趣的單個單詞、完成句子或段落，或在可能完成的一段文本之間進行選擇。

3.1.1 Language Modeling 語言建模

We calculate zero-shot perplexity on the Penn Tree Bank (PTB) [MKM+94] dataset measured in [RWC+19]. We omit the 4 Wikipedia-related tasks in that work because they are entirely contained in our training data, and we also omit the one-billion word benchmark due to a high fraction of the dataset being contained in our training set. PTB escapes these issues due to predating the modern internet. Our largest model sets a new SOTA on PTB by a substantial margin of 15 points, achieving a perplexity of 20.50. Note that since PTB is a traditional language modeling dataset it does not have a clear separation of examples to define one-shot or few-shot evaluation around, so we measure only zero-shot.

我們計算了在[RWC+19]測量的佩恩樹岸(PTB) [MKM+94]數據集上的零射擊perplexity。我們省略了4 Wikipedia-related任務的工作,因為他們是完全包含在我們的訓練數據,我們也省略十億字的基準由于高分數被包含在我們的訓練集的數據集。肺結核逃脫這些問題由于比現代互聯網。我們最大的模型在PTB上設置了一個新的SOTA，顯著領先15個點，達到20.50的困惑。注意，由于PTB是一個傳統的語言建模數據集，它沒有一個清晰的示例分離來定義一次或少次評估，因此我們只測量零次評估。
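Perplexity, the metric reported above, is the exponential of the average negative log-likelihood per predicted unit. The sketch below assumes natural-log token probabilities are already available; whether the average is taken per token or per word depends on the benchmark convention, which this sketch does not settle.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 1/20 to every token has perplexity 20.
uniform_logprobs = [math.log(1 / 20)] * 10
print(perplexity(uniform_logprobs))
```

Intuitively, a perplexity of about 20 means the model is on average as uncertain as if it were choosing uniformly among roughly 20 options at each step.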

3.1.2 LAMBADA 數據集

The LAMBADA dataset [PKL+16] tests the modeling of long-range dependencies in text – the model is asked to predict the last word of sentences which require reading a paragraph of context. It has recently been suggested that the continued scaling of language models is yielding diminishing returns on this difficult benchmark. [BHT+20] reflect on the small 1.5% improvement achieved by a doubling of model size between two recent state of the art results ([SPP+19] and [Tur20]) and argue that “continuing to expand hardware and data sizes by orders of magnitude is not the path forward”. We find that path is still promising and in a zero-shot setting GPT-3 achieves 76% on LAMBADA, a gain of 8% over the previous state of the art.

LAMBADA is also a demonstration of the flexibility of few-shot learning as it provides a way to address a problem that classically occurs with this dataset. Although the completion in LAMBADA is always the last word in a sentence, a standard language model has no way of knowing this detail. It thus assigns probability not only to the correct ending but also to other valid continuations of the paragraph. This problem has been partially addressed in the past with stop-word filters [RWC+19] (which ban “continuation” words). The few-shot setting instead allows us to “frame” the task as a cloze-test and allows the language model to infer from examples that a completion of exactly one word is desired. We use the following fill-in-the-blank format:

LAMBADA數據集[PKL+16]測試文本中遠程依賴的建模——模型被要求預測需要閱讀一段上下文的句子的最后一個單詞。最近有研究表明，語言模型的不斷擴大在這個困難的基準上產生的收益正在減少。[BHT+20]反思了在兩個最新的研究結果([SPP+19]和[Tur20])之間，模型尺寸增加了一倍，僅提高了1.5%，并認為“繼續以數量級擴展硬件和數據尺寸并不是前進的道路”。我們發現這條道路仍然很有希望，在零桿的情況下，LAMBADA的GPT-3實現了76%，比之前的技術水平提高了8%。
LAMBADA還演示了小樣本學習的靈活性，因為它提供了一種方法來解決這個數據集通常出現的問題。盡管LAMBADA中的完成總是一個句子的最后一個單詞，但是標準語言模型無法知道這個細節。因此，它不僅將概率分配給正確的結尾，也分配給其他有效的段落延續。這個問題已經部分解決了在過去的停止字過濾器[RWC+19](禁止“延續”字)。相反，few-shot設置允許我們將任務“設置”為一個cloze測試，并允許語言模型從示例中推斷出需要完成的恰好是一個單詞。我們使用以下填空格式:
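A cloze-style framing of this kind can be sketched in a few lines. The exact layout is an assumption made for illustration: the blank marker, the arrow separator, the helper name `lambada_cloze_prompt`, and the example sentences are placeholders rather than a verbatim copy of the evaluation prompt.

```python
def lambada_cloze_prompt(demonstrations, passage, blank="____", arrow="→"):
    """Format solved demonstrations plus one unsolved passage as fill-in-the-blank lines."""
    lines = [f"{text} {blank}. {arrow} {answer}" for text, answer in demonstrations]
    lines.append(f"{passage} {blank}. {arrow}")  # the model should continue with one word
    return "\n".join(lines)


# Illustrative demonstration and query; each text omits its final word.
demonstrations = [
    ("Alice was friends with Bob. Alice went to visit her friend", "Bob"),
]
query = "George bought some baseball equipment, a ball, a glove, and a"

prompt = lambada_cloze_prompt(demonstrations, query)
print(prompt)
```

Because every demonstration ends with a single word after the arrow, the model can infer from the pattern that exactly one word is expected for the final line.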

When presented with examples formatted this way, GPT-3 achieves 86.4% accuracy in the few-shot setting, an increase of over 18% from the previous state-of-the-art. We observe that few-shot performance improves strongly with model size. While this setting decreases the performance of the smallest model by almost 20%, for GPT-3 it improves accuracy by 10%. Finally, the fill-in-blank method is not effective one-shot, where it always performs worse than the zero-shot setting. Perhaps this is because all models still require several examples to recognize the pattern.

One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data – however analysis performed in Section 4 suggests negligible impact on performance.

當以這種方式呈現樣例時，GPT-3在小樣本設置中達到了86.4%的精度，比之前的最先進水平提高了18%以上。我們觀察到，隨著模型尺寸的增大，小樣本性能有了很大的提高。雖然這個設置將最小模型的性能降低了近20%，但對于GPT-3，它將精度提高了10%。最后，空白填充法并不是一種有效的一次性方法，它的效果總是比零填充法差。這可能是因為所有模型仍然需要幾個示例來識別模式。
需要注意的一點是，對測試集污染的分析發現，LAMBADA數據集中的少數似乎出現在我們的訓練數據中——然而，在第4節中執行的分析表明，對性能的影響可以忽略不計。

3.1.3 HellaSwag 數據集

The HellaSwag dataset [ZHB+19] involves picking the best ending to a story or set of instructions. The examples were adversarially mined to be difficult for language models while remaining easy for humans (who achieve 95.6% accuracy). GPT-3 achieves 78.1% accuracy in the one-shot setting and 79.3% accuracy in the few-shot setting, outperforming the 75.4% accuracy of a fine-tuned 1.5B parameter language model [ZHR+19] but still a fair amount lower than the overall SOTA of 85.6% achieved by the fine-tuned multi-task model ALUM.

HellaSwag數據集[ZHB+19]涉及到為一個故事或一組指令選擇最好的結局。這些例子對語言模型來說很難挖掘，而對人類來說卻很容易(達到95.6%的準確率)。GPT-3在單小樣本設置中達到78.1%的準確率，在小樣本設置中達到79.3%的準確率，超過了1.5B參數語言模型[ZHR+19]的75.4%的準確率，但仍低于多任務模型模型85.6%的整體SOTA。

3.1.4 StoryCloze 數據集

We next evaluate GPT-3 on the StoryCloze 2016 dataset [MCH+16], which involves selecting the correct ending sentence for five-sentence long stories. Here GPT-3 achieves 83.2% in the zero-shot setting and 87.7% in the few-shot setting (with K = 70). This is still 4.1% lower than the fine-tuned SOTA using a BERT based model [LDL19] but improves over previous zero-shot results by roughly 10%.

接下來，我們對StoryCloze 2016數據集[MCH+16]上的GPT-3進行評估，包括為五句話長的故事選擇正確的結尾句。在這里，GPT-3在零樣本設置中達到83.2%，在小樣本設置(K = 70)中達到87.7%。這仍然比使用基于BERT模型[LDL19]進行微調的SOTA低4.1%，但比之前的零射擊結果提高了約10%。

3.2 Closed Book Question Answering 閉卷回答任務

In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense amount of possible queries, this task has normally been approached by using an information retrieval system to find relevant text in combination with a model which learns to generate an answer given the question and the retrieved text. Since this setting allows a system to search for and condition on text which potentially contains the answer it is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well directly answering the questions without conditioning on auxiliary information. They denote this more restrictive evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions [KPR+19], WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represent an even stricter setting than previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself is also not permitted.

在本節中，我們將測量GPT-3回答有關廣泛事實知識的問題的能力。由于可能的查詢量巨大，這個任務通常是通過使用信息檢索系統查找相關文本，并結合學習生成給定問題和檢索文本的答案的模型來完成的。由于該設置允許系統搜索并對可能包含答案的文本進行條件設置，因此稱為“open-book”。[RRS20]最近證明，一個大型語言模型可以在不依賴輔助信息的情況下直接回答問題，表現得令人驚訝地好。他們將這種更嚴格的評估設置稱為“閉卷”。他們的工作表明，更高容量的模型可以表現得更好，我們用GPT-3測試了這一假設。我們在[RRS20]中的3個數據集上評估GPT-3: Natural Questions [KPR+19]、WebQuestions [BCFL13]和TriviaQA [JCWZ17]，使用相同的分割。注意，除了所有的結果都在閉卷設置中之外，我們使用的少樣本、一次小樣本和零小樣本的評估代表了比以前的閉卷QA工作更嚴格的設置:除了不允許外部內容外，也不允許對Q&A數據集本身進行微調。

The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by 14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents [LPP+20]. GPT-3’s few-shot result further improves performance another 3.2% beyond this.
On WebQuestions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5% in the few-shot setting. This compares to 37.4% for fine-tuned T5-11B, and 44.7% for fine-tuned T5-11B+SSM, which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQs shows a much larger gain from zero-shot to few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this distribution, recovering strong performance in the few-shot setting.

GPT-3結果如表3.3所示。在TriviaQA上，我們在小樣本設置中達到了64.3%，在一小樣本設置中達到了68.0%，在小樣本設置中達到了71.2%。zero-shot result的表現已經比經過微調的T5-11B高出14.2%，而且在培訓前的問答時間跨度預測也比T5-11B高出3.8%。一次測試的結果提高了3.7%，與開放域QA系統的SOTA相匹配，該系統不僅進行了優化，而且利用了一種學習過的檢索機制，對包含21M文檔的15.3個參數密集向量索引進行檢索[LPP+20]。此外，GPT-3的少拍效果進一步提高了性能3.2%。
在網絡問題(WebQs)中，GPT-3在零桿設置中達到14.4%，在單桿設置中達到25.3%，在少桿設置中達到41.5%。相比之下，使用q&a特定的培訓前程序的優化T5-11B和優化T5-11B+SSM的比例分別為37.4%和44.7%。GPT-3在小樣本設置接近最先進的表現，微調模型。值得注意的是，與TriviaQA相比，WebQS從零桿到少桿的增益要大得多(事實上，WebQS的零桿和單桿性能都很差)，這可能表明WebQS的問題和/或它們的回答風格在GPT-3中是不分布的。然而，GPT-3似乎能夠適應這種分布，在少炮點的環境中恢復了良好的性能。

On Natural Questions (NQs) GPT-3 achieves 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in the few-shot setting, compared to 36.6% for fine-tuned T5-11B+SSM. Similar to WebQs, the large gain from zero-shot to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to TriviaQA and WebQs. In particular, the questions in NQs tend towards very fine-grained knowledge on Wikipedia specifically which could be testing the limits of GPT-3’s capacity and broad pretraining distribution.

Overall, on one of the three datasets GPT-3’s one-shot matches the open-domain fine-tuning SOTA. On the other two datasets it approaches the performance of the closed-book SOTA despite not using fine-tuning. On all 3 datasets, we find that performance scales very smoothly with model size (Figure 3.3 and Appendix H Figure H.7), possibly reflecting the idea that model capacity translates directly to more ‘knowledge’ absorbed in the parameters of the model.

在自然問題(NQs)中，GPT-3在零桿設置中達到了14.6%，在單桿設置中達到了23.0%，在少桿設置中達到了29.9%，而在經過微調的T5 11B+SSM中達到了36.6%。與WebQS類似，從零桿到少桿的巨大增益可能意味著分布的轉移，這也可能解釋了與TriviaQA和WebQS相比競爭力較差的原因。特別是，NQs的問題傾向于維基百科上非常精細的知識，可以測試GPT-3的能力極限和廣泛的培訓前分布。 

總的來說，在三個數據集中的一個上，GPT-3的一次性匹配了開放域微調SOTA。在另外兩個數據集上，盡管沒有使用微調，它的性能接近封閉的SOTA。在所有3個數據集上，我們發現性能與模型大小的關系非常順利(圖3.3和附錄H圖H.7)，可能反映了模型容量直接轉化為更多吸收在模型參數中的“知識”的想法。

3.3 Translation 翻譯任務

For GPT-2 a filter was used on a multilingual collection of documents to produce an English only dataset due to capacity concerns. Even with this filtering GPT-2 showed some evidence of multilingual capability and performed non-trivially when translating between French and English despite only training on 10 megabytes of remaining French text. Since we increase the capacity by over two orders of magnitude from GPT-2 to GPT-3, we also expand the scope of the training dataset to include more representation of other languages, though this remains an area for further improvement. As discussed in 2.2 the majority of our data is derived from raw Common Crawl with only quality-based filtering. Although GPT-3’s training data is still primarily English (93% by word count), it also includes 7% of text in other languages. These languages are documented in the supplemental material. In order to better understand translation capability, we also expand our analysis to include two additional commonly studied languages, German and Romanian.
Existing unsupervised machine translation approaches often combine pretraining on a pair of monolingual datasets with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a blend of training data that mixes many languages together in a natural way, combining them on a word, sentence, and document level. GPT-3 also uses a single training objective which is not customized or designed for any task in particular. However, our one / few-shot settings aren’t strictly comparable to prior unsupervised work since they make use of a small amount of paired examples (1 or 64). This corresponds to up to a page or two of in-context training data. Results are shown in Table 3.4. Zero-shot GPT-3, which only receives a natural language description of the task, still underperforms recent unsupervised NMT results. However, providing only a single example demonstration for each translation task improves performance by over 7 BLEU and nears competitive performance with prior work. GPT-3 in the full few-shot setting further improves another 4 BLEU resulting in similar average performance to prior unsupervised NMT work. GPT-3 has a noticeable skew in its performance depending on language direction. For the three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction. Performance on En-Ro is a noticeable outlier at over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2 which was developed for an almost entirely English training dataset. For both Fr-En and De-En, few shot GPT-3 outperforms the best supervised result we could find but due to our unfamiliarity with the literature and the appearance that these are un-competitive benchmarks we do not suspect those results represent true state of the art. For Ro-En, few shot GPT-3 performs within 0.5 BLEU of the overall SOTA which is achieved by a combination of unsupervised pretraining, supervised finetuning on 608K labeled examples, and backtranslation [LHCG19b].

對于GPT-2，由于容量問題，在多語言文檔集合上使用了一個過濾器來生成僅使用英語的數據集。即使使用了這種過濾，GPT-2也顯示出了多語言能力，并且在法語和英語之間進行翻譯時執行得非常出色，盡管僅對10兆字節的剩余法語文本進行了培訓。由于我們將GPT-2到GPT-3的容量增加了兩個數量級，因此我們還擴展了訓練數據集的范圍，以包括更多其他語言的表示，盡管這仍是一個有待進一步改進的領域。正如2.2中所討論的那樣，我們的大部分數據來自于原始的普通抓取，只使用基于質量的過濾。盡管GPT-3的訓練數據仍然主要是英語(93%的單詞計數)，但它也包括了7%的其他語言的文本。這些語言被記錄在補充材料中。為了更好地理解翻譯能力，我們還擴展了我們的分析，包括另外兩種常用的語言，德語和羅馬尼亞語。 
現有的無監督機器翻譯方法通常結合對單語數據集的預訓練和反向翻譯[SHB15]，以一種可控的方式連接兩種語言。相比之下，GPT-3從混合的訓練數據中學習，這些數據以自然的方式將多種語言混合在一起，在單詞、句子和文檔級別上將它們組合在一起。GPT-3也使用單一的訓練目標，它不是為任何特定任務定制或設計的。然而，我們的單樣本/小樣本設置并不能嚴格地與之前的無監督工作相比，因為它們使用了少量成對的例子(1或64個)。這相當于一頁或兩頁上下文內訓練數據。結果如表3.4所示。Zero-shot GPT-3，它只接收任務的自然語言描述，仍然表現不佳，最近的非監督NMT結果。然而，僅為每個翻譯任務提供一個示例演示，就可以提高7個藍度以上的翻譯性能，接近與之前工作的競爭性能。GPT-3在全小樣本設置中進一步提高了另外4個藍度，使得平均性能與之前的無監督NMT工作相似。根據語言方向的不同，GPT-3在性能上有明顯的偏差。在研究的三種輸入語言中，GPT-3在翻譯成英語時顯著優于之前的無監督的NMT工作，但在翻譯成英語時表現不佳。在enro上的性能是一個明顯的異常值，比之前的無監督的NMT工作差10藍度以上。這可能是一個弱點，因為重用了GPT-2的字節級BPE標記器，它是為一個幾乎完全是英語的訓練數據集開發的。對于Fr-En和De-En，很少有shot GPT-3優于我們所能找到的最佳監督結果，但由于我們不熟悉文獻和這些是非競爭性基準的外觀，我們不懷疑這些結果代表了真正的藝術狀態。對于roen來說，很少有shot GPT-3能在整體SOTA的0.5 BLEU范圍內完成，這是通過結合無監督的預訓練、對608K標記示例的監督微調和反向翻譯來實現的[LHCG19b]。
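Since the translation results above are reported in BLEU, a simplified sentence-level version of the metric may help make the numbers concrete. This is only a sketch under stated assumptions: it uses whitespace tokenization, a single reference, and no smoothing, and it returns a value between 0 and 1, whereas published scores are corpus-level, produced by standard tools, and usually quoted on a 0 to 100 scale.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(clipped / total)
    # Penalize candidates that are shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the cat sat on the mat"
print(sentence_bleu("the cat sat on the mat", reference))
print(sentence_bleu("the cat sat on the", reference))
```

The second call shows the brevity penalty at work: every n-gram in the shortened candidate appears in the reference, yet the score drops because the candidate is shorter than the reference.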

Finally, across all language pairs and across all three settings (zero-, one-, and few-shot), there is a smooth trend of improvement with model capacity. This is shown in Figure 3.4 in the case of few-shot results, and scaling for all three settings is shown in Appendix H.

最后，通過所有語言對和所有三種設置(零-、一-和少-shot)，模型容量有一個平穩的提高趨勢。圖3.4中顯示的是較少拍攝的結果，附錄H中顯示了所有三種設置的縮放情況。

3.4 Winograd-Style Tasks 任務

The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently fine-tuned language models have achieved near-human performance on the original Winograd dataset, but more difficult versions such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test GPT-3’s performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot setting.
On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method described in [RWC+19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which is presented as binary classification and requires entity extraction to convert to the form described in this section. On Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human performance. We note that contamination analysis found some Winograd schemas in the training data but this appears to have only a small effect on results (see Section 4).

Winograd Schemas Challenge [LDM12]是NLP中的一項經典任務，當一個代詞在語法上有歧義，但在語義上對人來說沒有歧義時，該任務涉及確定該代詞指的是哪個詞。最近，經過微調的語言模型在原始Winograd數據集上取得了接近人類的性能，但是更困難的版本，比如反向挖掘的Winogrande數據集[SBBC19]，仍然顯著落后于人類的性能。我們測試了GPT-3在Winograd和Winogrande上的性能，通常是在零桿、一桿和少桿設置下。 
在Winograd上，我們使用[RWC+19]中描述的相同的“部分求值”方法，在原始的273個Winograd模式集上測試GPT-3。請注意，此設置與SuperGLUE基準中的WSC任務略有不同，后者以二進制分類的形式表示，需要提取實體來轉換為本節中描述的形式。Winograd的GPT-3在零桿、一桿和少桿設置中取得了88.3%、89.7%和88.6%的成績，沒有顯示出明確的上下文學習，但在所有情況下都取得了較好的成績，僅比最先進的和估計的人類性能低幾個點。我們注意到，污染分析在訓練數據中發現了一些Winograd模式，但這似乎只對結果有很小的影響(見第4節)。

On the more difficult Winogrande dataset, we do find gains to in-context learning: GPT-3 achieves 70.2% in the zero-shot setting, 73.2% in the one-shot setting, and 77.7% in the few-shot setting. For comparison a fine-tuned RoBERTa model achieves 79%, state-of-the-art is 84.6% achieved with a fine-tuned high capacity model (T5), and human performance on the task as reported by [SBBC19] is 94.0%.

在更困難的Winogrande數據集上，我們確實發現了上下文學習的進步:GPT-3在零樣本設置中實現了70.2%，在單樣本設置中實現了73.2%，在少小樣本設置中實現了77.7%。相比之下，經過微調的RoBERTA模型實現了79%，使用經過微調的高容量模型(T5)，最先進的實現了84.6%，而根據[SBBC19]報告的人類在該任務上的性能是94.0%。

3.5 Common Sense Reasoning 常識推理任務

Next we consider three datasets which attempt to capture physical or scientific reasoning, as distinct from sentence completion, reading comprehension, or broad knowledge question answering. The first, PhysicalQA (PIQA) [BZB+19], asks common sense questions about how the physical world works and is intended as a probe of grounded understanding of the world. GPT-3 achieves 81.0% accuracy zero-shot, 80.5% accuracy one-shot, and 82.8% accuracy few-shot (the last measured on PIQA’s test server). This compares favorably to the 79.4% accuracy prior state-of-the-art of a fine-tuned RoBERTa. PIQA shows relatively shallow scaling with model size and is still over 10% worse than human performance, but GPT-3’s few-shot and even zero-shot result outperform the current state-of-the-art. Our analysis flagged PIQA for a potential data contamination issue (despite hidden test labels), and we therefore conservatively mark the result with an asterisk. See Section 4 for details.

接下來，我們考慮三個試圖捕捉物理或科學推理的數據集，作為區別于句子完成，閱讀理解，或廣義知識問題回答。第一個是PhysicalQA (PIQA) [BZB+19]，它提出了關于物質世界如何運作的常識問題，旨在探索對世界的基礎理解。GPT-3的零桿精度為81.0%，單桿精度為80.5%，少桿精度為82.8%(最后一次在PIQA的測試服務器上測量)。這比較有利的79.4%的精度之前的先進先進的一個微調羅伯塔。PIQA在模型尺寸上顯示出相對較淺的縮放效果，仍然比人類的表現差10%以上，但GPT-3的少射甚至零射的結果比目前最先進的技術要好。我們的分析將PIQA標記為潛在的數據污染問題(盡管隱藏了測試標簽)，因此我們用星號保守地標記了結果。詳見第4節。 

ARC [CCE+18] is a dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the “Challenge” version of the dataset which has been filtered to questions which simple statistical or information retrieval methods are unable to correctly answer, GPT-3 achieves 51.4% accuracy in the zero-shot setting, 53.2% in the one-shot setting, and 51.5% in the few-shot setting. This is approaching the performance of a fine-tuned RoBERTa baseline (55.9%) from UnifiedQA [KKS+20]. On the “Easy” version of the dataset (questions which either of the mentioned baseline approaches answered correctly), GPT-3 achieves 68.8%, 71.2%, and 70.1% which slightly exceeds a fine-tuned RoBERTa baseline from [KKS+20]. However, both of these results are still much worse than the overall SOTAs achieved by the UnifiedQA which exceeds GPT-3’s few-shot results by 27% on the challenge set and 22% on the easy set.

On OpenBookQA [MCKS18], GPT-3 improves significantly from zero to few shot settings but is still over 20 points short of the overall SOTA. GPT-3’s few-shot performance is similar to a fine-tuned BERT Large baseline on the leaderboard.
Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks, with only small and inconsistent gains observed in the one and few-shot learning settings for both PIQA and ARC, but a significant improvement is observed on OpenBookQA. GPT-3 sets SOTA on the new PIQA dataset in all evaluation settings.

ARC [CCE+18]是一個多選題數據集，收集自3至9年級的科學考試。在對簡單統計或信息檢索方法無法正確回答的問題進行篩選后的數據集“挑戰”版本上，GPT-3在零炮設置、一次炮設置和少炮設置的準確率分別達到51.4%、53.2%和51.5%。這接近于UnifiedQA [KKS+20]的RoBERTa基線(55.9%)的性能。在數據集的“簡單”版本中(上述兩種基線方法都回答正確的問題)，GPT-3實現了68.8%、71.2%和70.1%，這略微超過了來自[KKS+20]的RoBERTa的優化基線。然而，這兩個結果仍然比UnifiedQA取得的總體SOTAs差得多，后者在挑戰集上比GPT-3的少桿結果高出27%，在簡單集上高出22%。 
在OpenBookQA [MCKS18]上，GPT-3從零樣本到小樣本設置有顯著提高，但仍比整體SOTA少20分。GPT-3的少樣本性能類似于一個微調的伯特大基線在排行榜上。 
總的來說，使用GPT-3的上下文學習在常識推理任務中表現出混合的結果，在PIQA和ARC的單樣本和小樣本學習設置中，只觀察到小的和不一致的收獲，但在OpenBookQA中觀察到顯著的改善。GPT-3在所有評估設置中對新的PIQA數據集設置SOTA。

3.6 Reading Comprehension 閱讀理解任務

Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.

接下來我們對GPT-3進行閱讀理解任務的評估。在對話框和單一問題設置中，我們使用了一套5個數據集，包括抽象的、多項選擇和基于跨度的回答格式。我們觀察到GPT-3在這些數據集上的性能差異很大，這表明不同的回答格式具有不同的能力。一般來說，我們觀察到GPT-3與初始基線和使用上下文表示對每個各自數據集進行訓練的早期結果相同。?

GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19], a free-form conversational dataset, and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18], a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of middle school and high school English examinations, GPT-3 performs relatively weakly and is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA.

GPT-3在CoQA [RCM19]自由形式會話數據集上表現最好(在人類基線的3個點內)，在QuAC [CHI+18]數據集上表現最差(低于ELMo基線13 F1)，該數據集需要建模結構化對話行為和師生交互的回答范圍選擇。下降(DWD + 19]數據集測試離散推理和計算能力在閱讀理解中,GPT-3在few-shot環境優于原始論文的BERT基線調整但仍遠低于人類的性能和先進的方法增強神經網絡與符號系統(RLL + 19)。在陣容2.0 [RJL18]上，GPT-3展示了它的少桿學習能力，與零桿設置相比提高了近10桿(69.8桿)。這使得它稍微優于原始論文中最好的微調結果。在RACE [LXL+17](一個針對初中和高中英語考試的多項選擇數據集)上，GPT-3的表現相對較弱，僅與最早使用上下文表示的研究相比具有競爭力，仍落后于SOTA 45%。

3.7 SuperGLUE 對比

In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark [WPN+19] [CLC+19] [DMST19] [RBG11] [KCR+18] [ZLL+18] [DGM06] [BHDD+06] [GMDD07] [BDD+09] [PCC18] [PHR+18]. GPT-3’s test-set performance on the SuperGLUE dataset is shown in Table 3.8. In the few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated.
We observe a wide range in GPT-3’s performance across tasks. On COPA and ReCoRD GPT-3 achieves near-SOTA performance in the one-shot and few-shot settings, with COPA falling only a couple points short and achieving second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC, performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the original Winograd dataset as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable, roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting.

為了更好地聚合NLP任務的結果，并與BERT和RoBERTa等流行模型進行更系統的比較，我們還在標準化數據集上對GPT-3進行了評價，即SuperGLUE基準[WPN+19] [WPN+19] [CLC+19] [DMST19] [RBG11] [KCR+18] [ZLL+18] [DGM06] [BHDD+06] [GMDD07] [BDD+09] [PCC18] [PHR+18]。GPT-3在SuperGLUE數據集上的測試集性能如表3.8所示。在小樣本設置中，我們對所有任務使用了32個示例，從訓練集中隨機采樣。對于除了WSC和MultiRC之外的所有任務，我們采樣了一組新的示例用于每個問題的上下文。對于WSC和MultiRC，我們使用同一組從訓練集中隨機抽取的例子作為我們評估的所有問題的上下文。?

我們觀察到GPT-3在不同任務中的表現差異很大。在COPA和記錄GPT-3實現近sota的表現在一次樣本和小樣本設置，與COPA只下降了幾個點，并在排行榜上取得第二名，第一名是由微調110億參數模型(T5)。在WSC上，性能仍然相對較強，在小樣本設置中達到80.1%(請注意，如3.4節所述，gpot -3在原始Winograd數據集上達到88.6%)。在BoolQ、MultiRC和RTE上，性能是合理的，大致與經過微調的BERT-Large匹配。在CB上，我們看到生命跡象的比例為75.6%。
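
The sampling protocol described above (32 in-context examples per task, redrawn per problem except for WSC and MultiRC) can be sketched roughly as follows. This is an illustrative sketch only: the function name `build_contexts`, the `shared` flag, and the use of a seeded `random.Random` are assumptions made here, not details taken from the paper.

```python
import random

def build_contexts(train_set, problems, k=32, shared=False, seed=0):
    """Return one list of k in-context examples per evaluation problem.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    rng = random.Random(seed)
    if shared:
        # WSC / MultiRC style: one random draw reused for every problem.
        context = rng.sample(train_set, k)
        return [context for _ in problems]
    # Default: a fresh random draw (without replacement) for each problem.
    return [rng.sample(train_set, k) for _ in problems]
```

With `shared=False` every problem sees a different context, which averages out the effect of any single lucky or unlucky draw; with `shared=True` all problems are conditioned on the same examples.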

WiC is a notable weak spot with few-shot performance at 49.4% (at random chance). We tried a number of different phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer in the next section (which discusses the ANLI benchmark): GPT-3 appears to be weak in the few-shot or one-shot setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another. This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these weaknesses, GPT-3 still outperforms a fine-tuned BERT-Large on four of eight tasks, and on two tasks GPT-3 is close to the state of the art held by a fine-tuned 11 billion parameter model.

Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and with number of examples in the context, showing increasing benefits from in-context learning (Figure 3.8). We scale K up to 32 examples per task, after which point additional examples will not reliably fit into our context. When sweeping over values of K, we find that GPT-3 requires less than eight total examples per task to outperform a fine-tuned BERT-Large on overall SuperGLUE score.

WiC是一個值得注意的弱點，它的命中率為49.4%(隨機)。我們為WiC嘗試了許多不同的短語和公式(包括確定一個單詞在兩個句子中是否具有相同的意思)，但沒有一個能夠取得很好的效果。這暗示了一個現象,在下一節將變得更清楚(討論ANLI基準)——GPT-3似乎弱few-shot或一次性設置的一些任務,涉及比較兩個句子或片段,例如一個詞是否用同樣的方式在兩個句子,一個句子是否解釋另一個,或者一個句子是否意味著另一個。這也可以解釋RTE和CB的分數相對較低的原因，它們也采用這種格式。盡管存在這些弱點，GPT-3仍然在8個任務中的4個任務上優于經過微調的伯特-大公司，而在兩個任務上，GPT-3通過一個經過微調的110億參數模型已經接近最先進水平。

最后，我們注意到，隨著模型大小和上下文中的示例數量的增加，少量注射的SuperGLUE得分穩步提高，顯示了上下文內學習的好處越來越大(圖3.8)。我們將K擴展到每個任務32個示例，超過這一點，額外的示例將不可靠地適合我們的上下文。當掃過K的值時，我們發現GPT-3每個任務總共需要少于8個示例，才能在總體超級膠水得分上超過經過微調的伯特-大。

3.8 NLI 自然語言推理任務

Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences. In practice, this task is usually structured as a two or three class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral). SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting GPT-3 performs similarly to a single-task fine-tuned BERT Large. We also evaluate on the recently introduced Adversarial Natural Language Inference (ANLI) dataset [NWD+19]. ANLI is a difficult dataset employing a series of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (~33%), whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult task for language models and they are only just beginning to show signs of progress.

自然語言推理(NLI) [Fyo00]關注理解兩個句子之間關系的能力。在實踐中，這個任務通常被構造成兩個或三個類的分類問題，其中模型分類第二個句子在邏輯上是否與第一個句子相符合，是否與第一個句子相矛盾，或者可能是正確的(中立的)。SuperGLUE包括一個NLI數據集RTE，它計算任務的二進制版本。在RTE上，只有最大版本的GPT-3在任何評估設置上的表現都令人信服地優于random(56%)，但在小樣本設置中，GPT-3的表現類似于單任務優化的BERT Large。我們還評估了最近引入的對抗式自然語言推斷(ANLI)數據集[NWD+19]。ANLI是一個復雜的數據集，它在三輪(R1、R2和R3)中使用一系列逆向挖掘的自然語言推理問題。與RTE類似，我們所有小于GPT-3的模型在ANLI上的表現幾乎完全是隨機的，即使是在很少投籃的設置中(約33%)，而GPT-3本身在第3輪顯示出生命跡象。ANLI R3的結果突出顯示在圖3.9和全部結果輪可以在附錄h .這些結果RTE和ANLI NLI基礎仍然是一個非常困難的任務表明語言模型和他們才剛剛開始顯示出進步的跡象。

3.9 Synthetic and Qualitative Tasks 綜合和定性任務

One way to probe GPT-3’s range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which require it to perform simple on-the-fly computational reasoning, recognize a novel pattern that is unlikely to have occurred in training, or adapt quickly to an unusual task. We devise several tasks to test this class of abilities. First, we test GPT-3’s ability to perform arithmetic. Second, we create several tasks that involve rearranging or unscrambling the letters in a word, tasks which are unlikely to have been exactly seen during training. Third, we test GPT-3’s ability to solve SAT-style analogy problems few-shot. Finally, we test GPT-3 on several qualitative tasks, including using new words in a sentence, correcting English grammar, and news article generation. We will release the synthetic datasets with the hope of stimulating further study of test-time behavior of language models.

要想了解GPT-3在“少拍”(或“零拍”和“一次拍”)環境下的能力范圍，一種方法是讓它執行一些任務，這些任務要求它執行簡單的即時計算推理，識別訓練中不太可能出現的新模式，或者快速適應不尋常的任務。我們設計了幾個任務來測試這類能力。首先，我們測試GPT-3執行算術的能力。其次，我們創建了幾個任務，這些任務包括重新排列或整理單詞中的字母，這些任務不太可能在訓練過程中被準確地看到。第三，我們測試了GPT-3解決衛星式類比問題的能力。最后，我們對GPT-3進行了幾個定性測試，包括在句子中使用新單詞、修改英語語法和生成新聞文章。我們將發布合成數據集，希望能促進對語言模型測試時行為的進一步研究。?

3.9.1 Arithmetic 算術

To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language:

2 digit addition (2D+) – The model is asked to add two integers sampled uniformly from [0, 100), phrased in the form of a question, e.g. “Q: What is 48 plus 76? A: 124.”
2 digit subtraction (2D-) – The model is asked to subtract two integers sampled uniformly from [0, 100); the answer may be negative. Example: “Q: What is 34 minus 53? A: -19”.
3 digit addition (3D+) – Same as 2 digit addition, except numbers are uniformly sampled from [0, 1000).
3 digit subtraction (3D-) – Same as 2 digit subtraction, except numbers are uniformly sampled from [0, 1000).
4 digit addition (4D+) – Same as 3 digit addition, except uniformly sampled from [0, 10000).
4 digit subtraction (4D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 10000).
5 digit addition (5D+) – Same as 3 digit addition, except uniformly sampled from [0, 100000).
5 digit subtraction (5D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 100000).
2 digit multiplication (2Dx) – The model is asked to multiply two integers sampled uniformly from [0, 100), e.g. “Q: What is 24 times 42? A: 1008”.
One-digit composite (1DC) – The model is asked to perform a composite operation on three 1 digit numbers, with parentheses around the last two. For example, “Q: What is 6+(4*8)? A: 38”. The three 1 digit numbers are selected uniformly on [0, 10) and the operations are selected uniformly from {+,-,*}.

為了測試GPT-3在沒有特定任務訓練的情況下執行簡單算術運算的能力，我們開發了一個包含10個測試的小電池，其中包括用自然語言問GPT-3一個簡單的算術問題:

2位加法(2D+)——模型被要求將從[0,100均勻采樣的兩個整數相加，以問題的形式表達，例如:“Q: 48加76等于多少?”答:124。”
2位減法(2D-)——要求模型從[0,100]均勻采樣的兩個整數進行減法;答案可能是否定的。例子:“問:34減53等于多少?”答:-19”。?
3位加法(3D+) -與2位加法相同，只是數字均勻地從[0,1000]取樣。
3位減法(3D-) -與2位減法相同，只是數字均勻地從[0,1000]采樣。
4位加法(4D+) -與3位加法相同，只是均勻采樣于[0,10000]。
4位減法(4D-) -與3位減法相同，只是均勻采樣于[0,10000]。
5位加法(5D+) -與3位加法相同，除了均勻采樣于[0,100000]。
5位減法(5D-) -與3位減法相同，除了均勻采樣[0,100000]。
2位乘法(2Dx)——模型要求將從[0,100均勻采樣的兩個整數相乘)，例如:“Q: 24乘以42等于多少?”答:1008”。
一位數合成(1DC)——要求模型對三個1位數執行合成操作，最后兩個用括號括起來。例如，“Q: 6+(4*8)是多少?”答:38”。在[0,10)上一致選擇三個1位數字，在{+，-，*}中一致選擇操作。
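
The task definitions above can be turned into a small problem generator. The following is a minimal sketch, assuming the question wording shown in the examples; the helper names (`make_problem`, `make_composite`, `exact_match_accuracy`) are hypothetical and the paper's actual generation code is not shown in the text.

```python
import random

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
WORDS = {"+": "plus", "-": "minus", "*": "times"}

def make_problem(digits, op, rng):
    """n-digit task: both operands drawn uniformly from [0, 10**digits)."""
    a, b = rng.randrange(10 ** digits), rng.randrange(10 ** digits)
    return f"Q: What is {a} {WORDS[op]} {b}? A:", str(OPS[op](a, b))

def make_composite(rng):
    """1DC: three one-digit numbers with parentheses around the last two."""
    a, b, c = (rng.randrange(10) for _ in range(3))
    op1, op2 = rng.choice("+-*"), rng.choice("+-*")
    answer = OPS[op1](a, OPS[op2](b, c))
    return f"Q: What is {a}{op1}({b}{op2}{c})? A:", str(answer)

def exact_match_accuracy(predictions, answers):
    """Fraction of predictions equal to the reference string exactly."""
    hits = sum(p.strip() == a for p, a in zip(predictions, answers))
    return hits / len(answers)
```

Scoring by exact string match mirrors the requirement that the model generate the correct answer exactly; a partially correct number earns no credit.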

In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random instances of the task and evaluate all models on those instances.
First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition, 98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on five digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves 29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves 21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness beyond just single operations.

在所有的10個任務中，模型必須準確地生成正確的答案。對于每個任務，我們生成一個包含2000個任務隨機實例的數據集，并對這些實例上的所有模型進行評估。首先，我們在小樣本設置中評估GPT-3，其結果如圖3.10所示。在加減法方面，GPT-3在數字較少的情況下表現出較強的熟練度，2位加法的準確率為100%，2位減法的準確率為98.9%，3位加法的準確率為80.2%，3位減法的準確率為94.2%。隨著數字數目的增加，性能會下降，但是GPT-3在四位數操作上仍能達到25-26%的精度，在五位數操作上仍能達到9-10%的精度，這表明至少有一些能力概括為更大數目的數字。GPT-3在2位乘法上也達到了29.2%的精度，這是一個特別的計算密集型操作。最后，GPT-3在個位數聯合操作(例如，9*(7+5))時達到了21.3%的準確率，這表明GPT-3在單個操作之外還有一定的穩健性。 

As Figure 3.10 makes clear, small models do poorly on all of these tasks: even the 13 billion parameter model (the second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all other operations less than 10% of the time.

One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation to the task (or at the very least recognition of the task) is important to performing these computations correctly. Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and model capacity scaling for all three settings is shown in Appendix H.

To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than memorizing a table.
Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even zero-shot settings.

圖3.10表明,小模型在所有這些任務做差,甚至130億年的參數模型(1750億年之后的第二大完整的GPT-3)可以解決2位數的加法和減法只有一半的時間,和所有其他操作的時間不到10%。一次射擊和零射擊的性能相對于少射擊的性能有所下降，這表明適應任務(或至少識別任務)對正確執行這些計算很重要。盡管如此，單次射擊的性能仍然相當強大，甚至全GPT-3的零射擊性能也顯著優于所有小型模型的少次射擊學習。表3.9顯示了完整GPT-3的所有三個設置，附錄H顯示了所有這三個設置的模型容量伸縮。
為了抽查模型是否只是簡單地記憶特定的算術問題，我們取測試集中的三位數算術問題，并在訓練數據中以“<NUM1> + <NUM2> =”和“<NUM1> plus <NUM2>”的形式搜索它們。在2000道加法題中，我們發現只有17道匹配(0.8%)，而在2000道減法題中，我們發現只有2道匹配(0.1%)，這表明只有一小部分正確答案能夠被記住。此外，對錯誤答案的檢查發現，該模型經常會犯錯誤，比如沒有進位“1”，這表明它實際上是在嘗試執行相關的計算，而不是記憶一個表。總的來說，GPT-3在小樣本、單樣本甚至零樣本設置中，對中等復雜的算術都表現出了相當的熟練度。
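
The memorization spot-check above amounts to a string search over the training data. A rough sketch, assuming the training data can be treated as one in-memory string (a simplification that would not hold at the scale described); the function name `count_memorization_matches` and its parameters are hypothetical.

```python
def count_memorization_matches(problems, corpus, symbol="+", word="plus"):
    """Count problems whose operands appear in the corpus in either surface form.

    problems: iterable of (num1, num2) pairs; corpus: a single string.
    Note: plain substring search can over-count (for example "12 + 34 ="
    also matches inside "112 + 34 ="); this sketch does not guard against that.
    """
    matches = 0
    for a, b in problems:
        forms = (f"{a} {symbol} {b} =", f"{a} {word} {b}")
        if any(form in corpus for form in forms):
            matches += 1
    return matches
```

For reference, 17 matches out of 2,000 is 0.85% and 2 out of 2,000 is 0.1%, in line with the rounded figures reported above.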

3.9.2 Word Scrambling and Manipulation Tasks 拼字和操作任務

To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of 5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are:

Cycle letters in word (CL) – The model is given a word with its letters cycled, then the “=” symbol, and is expected to generate the original word. For example, it might be given “lyinevitab” and should output “inevitably”.
Anagrams of all but first and last characters (A1) – The model is given a word where every letter except the first and last have been scrambled randomly, and must output the original word. Example: criroptuon = corruption.
Anagrams of all but first and last 2 characters (A2) – The model is given a word where every letter except the first 2 and last 2 have been scrambled randomly, and must recover the original word. Example: opoepnnt → opponent.
Random insertion in word (RI) – A random punctuation or space character is inserted between each letter of a word, and the model must output the original word. Example: s.u!c/c!e.s s i/o/n = succession.
Reversed words (RW) – The model is given a word spelled backwards, and must output the original word. Example: stcejbo → objects.

為了測試GPT-3從幾個例子中學習新的符號操作的能力，我們設計了一個包含5個“字符操作”任務的小電池。每個任務都包括給模型一個被打亂、添加或刪除字符組合而扭曲的單詞，并要求它恢復原來的單詞。這5項任務是:

單詞(CL)中的循環字母——給模型一個單詞，它的字母是循環的，然后是“=”符號，并期望生成原始單詞。例如，它可能被賦予“lyinevitab”，而應該輸出“不可避免”。
除了第一個和最后一個字符以外的所有字符的字謎(A1)——模型被給定一個單詞，其中除了第一個和最后一個字符以外的每個字母都被隨機打亂，并且必須輸出原始單詞。例如:criroptuon =腐敗。
除了第一個和最后兩個字符以外的所有字符的字謎(A2)——模型給出一個單詞，其中除了前兩個和后兩個字符以外的每個字母都被隨機打亂，并且必須恢復原來的單詞。例:opoepnnt→對手。
單詞中的隨機插入(RI)——在單詞的每個字母之間插入隨機的標點或空格字符，模型必須輸出原始單詞。例子:s.u ! c / c e。ssi /o/n =連續。
反向單詞(RW)——給模型一個反向拼寫的單詞，并且必須輸出原始單詞。示例:stcejbo→對象。
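
The five distortions listed above are simple string transformations. The sketch below shows one plausible way to implement them; the function names, the rotation parameter `k`, and the filler character set are assumptions for illustration, since the text does not specify how the distorted words were produced.

```python
import random
import string

def cycle_letters(word, k):
    """CL: rotate the word so its last k letters move to the front."""
    k %= len(word)
    return word[-k:] + word[:-k] if k else word

def anagram_inner(word, keep, rng):
    """A1 (keep=1) / A2 (keep=2): shuffle all but the first and last `keep` letters."""
    middle = list(word[keep:len(word) - keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[len(word) - keep:]

def random_insertion(word, rng, fillers=string.punctuation + " "):
    """RI: put one random punctuation or space character between adjacent letters."""
    out = [word[0]]
    for ch in word[1:]:
        out += [rng.choice(fillers), ch]
    return "".join(out)

def reverse_word(word):
    """RW: spell the word backwards."""
    return word[::-1]
```

For instance, `cycle_letters("inevitably", 2)` reproduces the “lyinevitab” example and `reverse_word("objects")` gives “stcejbo”.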

For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11. Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word.
In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data (although we cannot confirm this with certainty).

對于每個任務，我們生成10,000個示例，我們選擇這些示例作為最常見的10,000個單詞，以長度大于4個字符和小于15個字符的[Nor09]來衡量。小樣本結果如圖3.11所示。任務性能隨著模型大小的變化而平穩增長，完整的GPT-3模型在刪除隨機插入時達到66.9%，循環字母達到38.6%，在較簡單的字謎任務中達到40.2%，在較困難的字謎任務(只保留第一個和最后一個字母)中達到15.1%。沒有一個模型能將字母倒轉成一個單詞。 
在單樣本設置中，性能明顯較差(下降一半或更多)，而在零樣本設置中，模型很少能執行任何任務(表3.10)。這表明，模型確實在測試時學習了這些任務，因為模型不能零失誤地執行它們，而且它們的人工特性使它們不太可能出現在訓練前的數據中(盡管我們不能確定地證實這一點)。

We can further quantify performance by plotting “in-context learning curves”, which show task performance as a function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information, including both task examples and natural language task descriptions.

Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding operates on significant fractions of a word (on average ~0.7 words per token), so from the LM’s perspective succeeding at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also, CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word), requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require non-trivial pattern-matching and computation.

我們可以通過繪制“上下文內學習曲線”來進一步量化績效，該曲線將任務績效顯示為上下文內例子數量的函數。我們在圖1.2中展示了用于符號插入任務的上下文內學習曲線。我們可以看到，更大的模型能夠越來越有效地使用上下文信息，包括任務示例和自然語言任務描述。
最后,值得補充的是,解決這些任務需要字符級操作,而我們的BPE編碼作用于重要的分數一個詞(平均0.7～字令牌),所以從LM的角度成功在這些任務不僅包括操縱BPE令牌但理解和剖析他們的子結構。另外，CL、A1和A2不是雙射的(也就是說，被解置的單詞不是被解置單詞的確定性函數)，需要模型執行一些搜索來找到正確的解置。因此，所涉及的技能似乎需要非平凡的模式匹配和計算。

3.9.3 SAT Analogies 類比

To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with scale, with the full 175 billion model improving by over 10% compared to the 13 billion parameter model.

為了在另一個任務中測試GPT-3，這個任務相對于文本的典型分布有些不尋常，我們收集了一組374個“SAT類比”問題[TLBS03]。類推題是2005年前SAT大學入學考試的一個部分的多項選擇題。一個典型的例子是“大膽之于大膽，正如(A)偽善之于偽善，(b)匿名之于身份，(c)懊悔之于惡行，(d)有害之于結果，(e)易受誘惑之于結果。”要求學生從五組單詞中選出與原單詞有相同關系的單詞;在這個例子中，答案是“假裝虔誠就是虛偽”。在這項任務中，GPT-3在少發、一發和零發中得分分別為65.2%、59.1%和53.7%，而大學申請者的平均得分為57% [TL05](隨機猜測的得分為20%)。如圖3.12所示，結果隨著規模的增加而提高，全1750億模型比130億參數模型提高了10%以上。

3.9.4 News Article Generation 新聞文章生成

Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by conditional sampling from the model given a human-written prompt consisting of a plausible first sentence for a news story [RWC+19]. Relative to [RWC+19], the dataset used to train GPT-3 is much less weighted towards news articles, so trying to generate news articles via raw unconditional samples is less effective; for example GPT-3 often interprets the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably generate short articles in the “news” genre.
To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality.

之前在生成語言模型上的工作定性地測試了他們生成合成“新聞文章”的能力，方法是有條件地從模型中取樣，并給出一個由一個新聞故事的可信的第一句話組成的人類書面提示。相對于數據集(RWC + 19),用于火車GPT-3偏重于新聞文章要少得多,因此試圖產生新聞文章通過原始無條件的樣品更有效——例如GPT-3經常解釋提出的第一句話“新聞文章”的一條微博,然后文章合成反應或后續消息。為了解決這個問題，我們使用了GPT-3的少樣本學習能力，在模型的上下文中提供了之前的三篇新聞文章來約束它。有了提議的下一篇文章的標題和副標題，該模型能夠可靠地生成“新聞”類型的短文章。?

為了衡量GPT-3生成新聞文章的質量(我們認為這很可能與有條件的樣本生成質量總體上相關)，我們決定衡量人類區分GPT-3生成的文章與真實文章的能力。Kreps等人[KMB20]和Zellers等人[ZHR+19]也進行了類似的工作。生成語言模型被訓練來匹配人類生成的內容的分布，所以人類區分這兩者的能力是質量的一個潛在的重要衡量標準

In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human written article or the article generated by the model. Participants were asked to select whether the article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”.
The articles we selected were not in the models’ training data and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were pre-trained with the same context size and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a 160M parameter model with no context and increased output randomness.

為了考察人類檢測模型生成的文本的能力，我們從newser.com網站上任意選擇了25篇文章的標題和副標題(平均長度:215個單詞)。然后，我們根據四種語言模型生成這些標題和字幕的完整版本，大小從1.25米到175B (GPT-3)參數不等(平均長度:200個單詞)。對于每個模型，我們向大約80名來自美國的參與者展示了一個測試，其中包含這些真實的標題和副標題，然后是人工撰寫的文章或由模型4生成的文章。參與者被要求選擇文章是“很可能是人類寫的”，“更可能是人類寫的”，“我不知道”，“更可能是機器寫的”，還是“很可能是機器寫的”。

我們選擇的文章不在模型的訓練數據中，并且模型的輸出被編程地格式化和選擇，以防止人類的“挑選”。所有模型都使用相同的上下文來設置輸出條件，并使用相同的上下文大小進行預訓練，每個模型都使用相同的文章標題和副標題作為提示。然而，我們也進行了一項實驗，以控制參與者的努力和注意力，這些人遵循同樣的格式，但包含了有意的不良模型生成的文章。這是通過從一個“控制模型”生成文章來實現的:一個沒有上下文且增加了輸出隨機性的160M參數模型。

Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was ~86% where 50% is chance level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at ~52% (see Table 3.11). Human abilities to detect model generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).
Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15. Much of the text is, as indicated by the evaluations, difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated since, unlike human authors, the models have no access to the specific facts that the article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed.
Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model generated text than human evaluators. Automatic detection of these models may be a promising area of future research.

Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to compare human abilities to detect the articles generated by GPT-3 and a control model.
We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was ~88%, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely above chance at ~52% (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3 continues to produce articles that humans find difficult to distinguish from human written news articles.

在檢測出被模型生成的故意差的文章時，人類的平均準確率(每個參與者的正確任務與非中立任務的比率)為86%，其中50%是隨機水平的表現。相比之下，人類檢測175B參數模型產生的物品的平均準確率僅為52%(見表3.11)。5人類檢測模型生成的文本的能力似乎隨著模型大小的增加而減少:模型大小似乎有機會準確性的趨勢，人類對GPT-3的檢測接近于機會。盡管隨著模型尺寸的增加，參與者會在每個輸出上花費更多的時間(見附錄E)，但這是真的。

圖3.14和圖3.15給出了GPT-3合成產品的示例。7如評估所示，大部分文本對人類來說很難從真實的人類內容中區分出來。事實不準確可能是一篇文章是模型生成的標志，因為與人類作者不同，模型無法訪問文章標題所引用的具體事實或文章的寫作時間。其他的指標包括重復，不符合邏輯，和不尋常的措辭，盡管這些通常是足夠微妙的，他們沒有被注意到。?

Ippolito等人[IDCBE19]在語言模型檢測方面的相關工作表明，自動鑒別器如G R O V E R [ZHR+19]和GLTR [GSR19]在檢測模型生成的文本方面可能比人類評價器更成功。這些模型的自動檢測可能是未來研究的一個有前景的領域。

Ippolito等人[IDCBE19]也注意到，隨著人們觀察到更多的標記，人類檢測模型生成的文本的準確性也會提高。做一個初步調查好人類是如何檢測時間的新聞文章由GPT-3 175 b,我們選擇了12項世界新聞文章來自路透社平均長度為569個單詞和生成完成的這些文章GPT-3平均長度為498個單詞(298字的時間比我們最初的實驗)。按照上述方法，我們進行了兩個實驗，每個實驗都有大約80名美國參與者，以比較人類檢測GPT-3和一個對照模型生成的文章的能力。?

我們發現，人類在檢測控制組故意制造的較長文章時的平均準確率為~ 88%，而在檢測GPT-3 175B制造的較長文章時的平均準確率為~ 52%(見表3.12)。這表明，對于長度在500字左右的新聞文章，GPT-3繼續生成人類難以區分的文章。
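
The accuracy metric used in this section (correct assignments divided by non-neutral assignments, per participant) can be written out as a short sketch. The data layout, the constant names, and the choice to average the per-participant ratios with equal weight are assumptions made here for illustration; the text does not spell out the aggregation.

```python
NEUTRAL = "I don't know"
MACHINE = {"very likely written by a machine", "more likely written by a machine"}

def participant_accuracy(responses):
    """responses: list of (answer, is_machine_written) pairs for one participant."""
    scored = [(ans, truth) for ans, truth in responses if ans != NEUTRAL]
    if not scored:
        return None  # only neutral answers: the ratio is undefined
    # An answer is correct when its machine/human leaning matches the truth.
    correct = sum((ans in MACHINE) == truth for ans, truth in scored)
    return correct / len(scored)

def mean_human_accuracy(all_responses):
    """Average the per-participant ratios (assumed equal weighting)."""
    accs = [a for a in map(participant_accuracy, all_responses) if a is not None]
    return sum(accs) / len(accs)
```

Because neutral answers are excluded from the denominator, a participant who always answers “I don’t know” contributes nothing, and 50% corresponds to guessing on the non-neutral answers.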

3.9.5 Learning and Using Novel Words 學習和使用新單詞

A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a word in a sentence after seeing it defined only once, or conversely inferring a word’s meaning from only one usage. Here we qualitatively test GPT-3’s ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word, such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to five previous examples of a (separate) nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the broad task and one-shot in terms of the specific word. Table 3.16 shows the 6 examples we generated; all definitions were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.

發展語言學[CB78]研究的一個任務是學習和利用新單詞的能力，例如在一個句子中只看到一個單詞的定義一次就使用它，或者從一個用法反過來推斷一個單詞的意思。在這里，我們定性地測試GPT-3完成前一項任務的能力。具體來說，我們給GPT-3一個不存在的單詞的定義，比如“Gigamuru”，然后讓它在一個句子中使用它。我們提供了一個(單獨的)不存在的單詞在句子中被定義和使用的1到5個例子，所以就寬泛任務的前面例子而言，任務是很少的，而就具體單詞而言，任務是一次性的。表3.16顯示了我們生成的6個示例;所有的定義都是人為生成的，第一個答案是人為生成的，作為條件反射，隨后的答案是GPT-3生成的。這些示例是在一次運行中連續生成的，我們沒有省略或重復嘗試任何提示。在所有的情況下，生成的句子似乎是一個正確的或至少似是而非的詞的使用。在最后一句話中，該模型為單詞“screeg”(即“screeghed”)生成了一個貌似合理的變位，盡管這個詞的使用有點尷尬(“screeghed at each other”)，盡管它在描述一場玩具劍戰的意義上似乎是可信的。總的來說，GPT-3在使用新單詞造句方面至少表現得很熟練。

3.9.6 Correcting English Grammar 修改英語語法

Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output: <sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any omissions or repeats). Results are shown in Figure 3.17.

另一項非常適合少量學習的任務是糾正英語語法。我們在fewshot設置中使用GPT-3測試這一點，給出如下提示:“糟糕的英語輸入:<句子>\n良好的英語輸出:<句子>”。我們給GPT-3一個人為的修正，然后讓它再修正5個(同樣沒有遺漏或重復)。結果如圖3.17所示。
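
The prompt format described above can be assembled mechanically from demonstration pairs. A minimal sketch, assuming demonstrations are separated by a blank line and that the final block leaves the output open for the model to complete; the function name `grammar_prompt` and the separator are hypothetical choices, not taken from the paper.

```python
def grammar_prompt(demonstrations, query):
    """Build a few-shot prompt from (poor, good) sentence pairs plus one query.

    The blank-line separator between blocks is an assumption of this sketch.
    """
    parts = [
        f"Poor English Input: {poor}\nGood English Output: {good}"
        for poor, good in demonstrations
    ]
    # The last block ends right after the label so the model supplies the fix.
    parts.append(f"Poor English Input: {query}\nGood English Output:")
    return "\n\n".join(parts)
```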

4 Measuring and Preventing Memorization Of Benchmarks 測量和防止記憶基準

Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research without established best practices. While it is common practice to train large models without investigating contamination, given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to.
This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [TL18] detected and removed a training document which overlapped with one of their evaluation datasets. Other work such as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that although models did perform moderately better on data that overlapped between training and testing, this did not significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).

由于我們的訓練數據集來自互聯網，所以我們的模型可能是在一些基準測試集上訓練的。從互聯網規模的數據集中準確地檢測測試污染是一個新的研究領域，沒有建立最佳實踐。雖然在訓練大型模型時不調查污染是常見的做法，但考慮到訓練前數據集規模的不斷擴大，我們相信這個問題正變得越來越重要。?

這種擔憂不僅僅是假設。最早在普通爬行數據上訓練語言模型的論文之一[TL18]檢測并刪除了一個與其中一個評估數據集重疊的訓練文檔。GPT-2 [RWC+19]等其他工作也進行了事后重疊分析。他們的研究相對令人鼓舞，發現盡管模型在訓練和測試重疊的數據上表現得稍微好一些，但這并不會對報告的結果產生顯著影響，因為有一小部分數據被污染了(通常只有幾個百分點)。

GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as large as feared.
We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts results.
For each benchmark, we produce a ‘clean’ version which removes all potentially leaked examples, defined roughly as examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination, so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in Appendix C.

GPT-3的運作方式有些不同。一方面，數據集和模型的大小大約比GPT-2使用的大兩個數量級，并且包括大量的常見爬行，增加了污染和記憶的可能性。另一方面，精確地說，由于數據量大，即使是GPT-3 175B，其訓練集也沒有過度擬合，這是相對于一個被刪除的驗證集而言的(圖4.1)。因此，我們預計污染可能是頻繁的，但其影響可能不會像擔心的那樣大。?

我們最初試圖通過主動搜索并試圖消除我們的訓練數據與本文中研究的所有基準的開發和測試集之間的任何重疊，來解決污染問題。不幸的是，一個錯誤只導致部分刪除了訓練數據中檢測到的所有重疊部分。由于培訓成本的原因，對模型進行再培訓是不可行的。為了解決這個問題，我們詳細研究剩余檢測到的重疊是如何影響結果的。?

對于每個基準測試，我們生成一個“干凈”版本，刪除所有可能泄露的示例，大致定義為與預訓練集中的任何內容有13克重疊的示例(或者與整個示例有重疊的示例，如果它小于13克)。我們的目標是非常保守地標記出任何可能被污染的東西，以便產生一個高度可靠的無污染子集。確切的程序在附錄C中有詳細說明。
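
The rough definition above (a 13-gram overlap, or a whole-example overlap for shorter examples) can be illustrated with a small sketch. This is not the procedure from Appendix C, which is not reproduced here: whitespace tokenization, lowercasing, and the function names are all assumptions, and recomputing n-grams for every training document on each call would be far too slow at real scale.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_flagged(example, train_docs, n=13):
    """True if the example shares an n-gram with any training document.

    Examples shorter than n tokens are matched as a whole instead.
    Tokenization by lowercased whitespace split is an assumption.
    """
    tokens = example.lower().split()
    size = min(n, len(tokens))
    if size == 0:
        return False
    probe = ngrams(tokens, size)
    return any(probe & ngrams(doc.lower().split(), size) for doc in train_docs)

def clean_subset(benchmark, train_docs, n=13):
    """Keep only the examples that were not flagged as potentially leaked."""
    return [ex for ex in benchmark if not is_flagged(ex, train_docs, n)]
```

Flagging on any single shared window is deliberately conservative: it will produce false positives, which is acceptable when the goal is a subset that is clean with high confidence.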

We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a quarter of benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence that contamination level and performance difference are correlated. We conclude that either our conservative method substantially overestimated contamination or that contamination has little effect on performance.
Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference difficult.

然后我們在這些干凈的基準上評估GPT-3，并與原始分數進行比較。如果清潔子集上的分數與整個數據集上的分數相似，這表明即使存在污染，也不會對報告的結果產生顯著的影響。如果清潔組的分數較低，這表明污染可能使結果膨脹。結果如圖4.2所示。盡管潛在的污染通常很高(四分之一的基準測試得分超過50%)，但在大多數情況下，性能變化只是微不足道的，而且我們沒有看到污染水平和性能差異相關的證據。我們得出的結論是，要么我們的保守方法大大高估了污染，要么污染對性能的影響很小。?

下面，我們將更詳細地回顧一些特定的情況，其中(1)模型在清理后的版本上表現明顯較差，或(2)潛在的污染非常高，這使得測量性能差異非常困難。
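The clean-versus-full comparison described above amounts to a small calculation. The sketch below is illustrative only: it assumes per-example correctness flags and contamination flags are available as parallel lists, and the function name `contamination_report` is hypothetical.

```python
def contamination_report(correct, flagged):
    """Compare accuracy on the full benchmark with accuracy on the clean subset.

    correct: list of bools, True where the model answered correctly.
    flagged: list of bools, True where the example was flagged as
             potentially contaminated.
    """
    if len(correct) != len(flagged) or not correct:
        raise ValueError("inputs must be non-empty and of equal length")

    clean = [c for c, f in zip(correct, flagged) if not f]
    full_acc = sum(correct) / len(correct)
    report = {
        "fraction_flagged": sum(flagged) / len(flagged),
        "full_accuracy": full_acc,
        "clean_accuracy": None,   # undefined if every example was flagged
        "relative_change": None,  # negative means the clean subset scores lower
    }
    if clean:
        clean_acc = sum(clean) / len(clean)
        report["clean_accuracy"] = clean_acc
        if full_acc > 0:
            report["relative_change"] = (clean_acc - full_acc) / full_acc
    return report
```

When nearly every example is flagged, the clean subset is too small for the difference to mean much, which is the second kind of case the text singles out for closer inspection.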

Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false positives. We summarize the results for each group of tasks below:

Reading Comprehension: Our initial analysis flagged >90% of task examples from QuAC, SQuAD2, and DROP as potentially contaminated, so large that even measuring the differential on a clean subset was difficult. Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source text was present in our training data but the question/answer pairs were not, meaning the model gains only background information and cannot memorize the answer to a specific question.
German translation: We found 25% of the examples in the WMT16 German-English test set were marked as potentially contaminated, with an associated total effect size of 1-2 BLEU. Upon inspection, none of the flagged examples contain paired sentences resembling NMT training data and collisions were monolingual matches mostly of snippets of events discussed in the news.
Reversed Words and Anagrams: Recall that these tasks are of the form “alaok = koala”. Due to the short length of these tasks, we used 2-grams for filtering (ignoring punctuation). After inspecting the flagged overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set, but rather palindromes or trivial unscramblings, e.g. “kayak = kayak”. The amount of overlap was small, but removing the trivial tasks led to an increase in difficulty and thus a spurious signal. Related to this, the symbol insertion task shows high overlap but no effect on performance – this is because that task involves removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to many spurious matches.
PIQA: The overlap analysis flagged 29% of examples as contaminated, and observed a 3 percentage point absolute decrease (4% relative decrease) in performance on the clean subset. Though the test dataset was released after our training set was created and its labels are hidden, some of the web pages used by the crowdsourced dataset creators are contained in our training set. We found a similar decrease in a 25x smaller model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential contamination.
Winograd: The overlap analysis flagged 45% of examples, and found a 2.6% decrease in performance on the clean subset. Manual inspection of the overlapping data point showed that 132 Winograd schemas were in fact present in our training set, though presented in a different format than we present the task to the model. Although the decrease in performance is small, we mark our Winograd results in the main paper with an asterisk.
Language modeling: We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the Children’s Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably extract a clean subset here, we do not report results on these datasets, even though we intended to when starting this work. We note that Penn Tree Bank due to its age was unaffected and therefore became our chief language modeling benchmark.

我們的分析為進一步的調查標記了六組基準:拼詞，閱讀理解(QuAC, SQuAD2, DROP)， PIQA, Winograd，語言建模任務(Wikitext任務，1BW)，以及德語到英語的翻譯。由于我們的重疊分析被設計成極其保守的，我們預計它會產生一些誤報。我們將每組任務的結果總結如下:

閱讀理解:我們最初的分析將QuAC、SQuAD2和DROP中90%的任務示例>標記為潛在污染，如此之大，甚至很難在干凈子集上測量差異。然而，經過人工檢查，我們發現，對于我們檢查的每一個重疊，在所有3個數據集中，我們的訓練數據中都有源文本，但是問題/答案對沒有，這意味著模型只獲得了背景信息，不能記住特定問題的答案。 
德語翻譯:我們發現，在WMT16德語-英語測試集中，25%的樣本被標記為潛在污染，相關總效應值為1-2藍色。經過檢查，沒有一個標記的例子包含類似NMT訓練數據的成對句子，碰撞是單語匹配，主要是新聞中討論的事件片段。 
顛倒單詞和字謎:回想一下這些任務的形式是“alaok = koala”。由于這些任務的長度較短，我們使用2克來進行過濾(忽略標點符號)。在檢查標記的重疊之后，我們發現它們并不是訓練集中真正的反向或解碼的典型實例，而是回文或普通的解碼。g " kayak = kayak "。重疊的數量很小，但是去除瑣碎的任務會增加難度，從而產生虛假信號。與此相關的是，符號插入任務顯示了高重疊，但對性能沒有影響——這是因為該任務涉及從單詞中刪除非字母字符，而重疊分析本身忽略了這些字符，從而導致許多虛假匹配。
PIQA:重疊分析將29%的示例標記為受污染的，并觀察到干凈子集的性能下降了3個百分點(相對下降4%)。雖然測試數據集創建發布我們的訓練集和它的標簽是隱藏的,使用的一些網頁的創造者眾包數據集都包含在我們的訓練集,我們也發現了相似的下降25 x模型和更少的記憶能力小,導致我們懷疑這種轉變可能是統計偏差而不是記憶;工人們模仿的例子可能更簡單。不幸的是，我們不能嚴格地證明這個假設。因此，我們用星號標記PIQA結果，表示這種潛在的污染。
Winograd:重疊分析標記了45%的示例，發現干凈子集的性能下降了2.6%。對重疊數據點的手動檢查表明，實際上有132個Winograd模式出現在我們的訓練集中，盡管它們的格式與我們向模型展示任務的格式不同。盡管性能下降很小，但我們在主論文中用星號標記了Winograd結果。
語言建模:我們發現用GPT-2測量的4個維基百科語言建模基準，加上兒童書籍測試數據，幾乎全部包含在我們的訓練數據中。因為我們不能可靠地提取一個干凈的子集，所以我們不報告這些數據集的結果，即使我們在開始這項工作時打算這樣做。我們注意到佩恩樹銀行由于其年齡未受影響，因此成為我們的主要語言建模基準。

We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply to verify how much actual contamination existed. These appeared to often contain false positives. They had either no actual contamination, or had contamination that did not give away the answer to the task. One notable exception was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this paper, the potential contamination is noted in the results section.
An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the same distribution as the original dataset. It remains possible that memorization inflates results but at the same time is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small models, which are unlikely to be memorizing.

Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C.

我們還檢查了污染程度高的數據集，但對性能的影響接近于零，只是為了驗證實際存在多少污染。這些報告似乎經常包含誤報。他們要么沒有受到實際的污染，要么受到的污染并沒有泄露任務的答案。一個值得注意的例外是LAMBADA，它看起來確實存在大量的污染，但對性能的影響非常小，干凈子集的得分在整個數據集的0.5%之內。而且，嚴格地說，我們的填空格式排除了最簡單的記憶形式。然而，由于我們在這篇論文中取得了很大的進展，潛在的污染在結果部分被指出。 
我們的污染分析的一個重要限制是，我們不能確定干凈子集是從與原始數據集相同的分布中提取的。記憶有可能使結果膨脹，但同時也被一些統計偏差精確地抵消了，從而使干凈子集變得更容易。然而，絕對的數字。

總的來說，我們已經盡了最大的努力來度量和記錄數據污染的影響，并根據嚴重程度來注意或直接刪除有問題的結果。在設計基準和培訓模式時，仍有許多工作要做，以解決該領域一般的這一重要而微妙的問題。有關我們的分析的更詳細的解釋，請讀者參閱附錄C。

5 Limitations 局限性

GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for future work.

First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of GPT-3’s limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some datasets (such as PIQA [BZB+19]) that test this domain. Specifically GPT-3 has difficulty with questions of the type “If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3’s in-context learning performance has some notable gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading comprehension tasks. This is especially striking given GPT-3’s strong few-shot performance on many other tasks.

GPT-3和我們對它的分析都有一些局限性。下面我們將對其中一些進行描述，并對未來的工作提出建議。
首先，盡管GPT-3在定量和定性方面有了很大的改進，特別是與它的直接前身GPT-2相比，它在文本合成和一些NLP任務方面仍有明顯的缺陷。在文本合成方面，盡管整體質量很高，但GPT-3樣本有時仍然在文檔層面上語義上重復，在足夠長的段落中開始失去連貫性，自相矛盾，偶爾還包含不符合邏輯的句子或段落。我們將發布500個未經管理的無條件樣本，以幫助更好地了解GPT-3在文本合成方面的局限性和優勢。在離散語言任務領域，我們非正式地注意到GPT-3似乎在“常識物理”方面有特殊的困難，盡管在一些測試該領域的數據集(如PIQA [BZB+19])上做得很好。具體來說，GPT-3很難回答“如果我把奶酪放進冰箱，它會融化嗎?”定量,GPT-3的語境學習表現有明顯的差距在我們套件的基準,如第三節所述,特別是它沒有比機會當評估一次性甚至few-shot一些“比較”的任務,如確定兩個詞使用同樣的方式在一個句子,或者如果一個句子意味著另一個(WIC和ANLI分別),以及閱讀理解任務的一個子集。考慮到GPT-3在許多其他任務上的出色的小樣本性能，這一點尤其引人注目。
GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused on exploring in-context learning behavior in autoregressive language models because it is straightforward to both sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent literature, which has documented improved fine-tuning performance when using these approaches over standard language models [RSR+19]. Thus our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer. This could be a possible explanation for GPT-3’s lagging few-shot performance on a few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”.
GPT-3在結構和算法上有一些限制，這可以解釋上面的一些問題。我們專注于探索自回歸語言模型中的上下文內學習行為，因為用這個模型類進行抽樣和計算可能性都很簡單。因此，我們的實驗不包括任何雙向架構或其他訓練目標，如去噪。這與最近的許多文獻有明顯的不同，后者記錄了在標準語言模型上使用這些方法可以提高調優性能[RSR+19]。因此，我們的設計決策的代價是，在經驗上受益于雙向性的任務上，可能會有更糟糕的性能。這可能包括填空任務，包括回顧和比較兩段內容的任務，或者要求重讀或仔細考慮一篇很長的文章，然后寫出非常簡短的答案的任務。這可能是一個可能的解釋為GPT-3滯后few-shot性能的一些任務,如WIC(包括比較詞的使用在兩個句子),ANLI(包括比較兩個句子是否意味著另一個),和一些閱讀理解任務(例如QuAC和種族)。基于過去的文獻，我們還推測，一個大型的雙向模型在微調方面會比GPT-3更強。在GPT-3的規模上制作一個雙向模型，以及/或嘗試使雙向模型在很少或零射擊學習中工作，是未來研究的一個有前途的方向，并且可以幫助實現“兩全其美”。 ?
A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world [BHT+20]. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans [ZSW+19a], fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world [CLY+19].

本文所描述的一般方法的一個更基本的限制是——擴展任何類似lm的模型，無論是自回歸的還是雙向的——它可能最終會(或可能已經)碰到培訓前目標的限制。我們目前的目標是平等地對每一個標記進行權重，并且缺乏一個概念，即哪些是最重要的，哪些是不那么重要的。[RRS20]演示定制對相關實體的預測的好處。此外，在自我監督的目標中，任務規范依賴于將所需的任務強制轉化為預測問題，然而最終，有用的語言系統(例如虛擬助手)可能被認為是采取目標導向的行動，而不僅僅是進行預測。最后，大型的預訓練語言模型并不基于其他經驗領域，如視頻或現實世界的物理互動，因此缺乏大量關于世界的上下文[BHT+20]。由于所有這些原因，純自監督預測的縮放可能會達到極限，使用不同的方法進行擴展可能是必要的。在這方面，未來有希望的方向可能包括從人類那里學習目標函數[ZSW+19a]，用強化學習進行微調，或添加額外的模式，如圖像，以提供接地和更好的世界模型[CLY+19]。
Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more text during pre-training than a human sees in their lifetime [Lin20]. Improving pre-training sample efficiency is an important direction for future work, and might come from grounding in the physical world to provide additional information, or from algorithmic improvements.

A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identifies tasks that it has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format, to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although possibly from data that is very different in organization and style than the test data. Ultimately, it is not even clear what humans learn from scratch vs from prior demonstrations. Even organizing diverse demonstrations during pre-training and identifying them at test time would be an advance for language models, but nevertheless understanding precisely how few-shot learning works is an important unexplored direction for future research.
語言模型普遍存在的另一個局限性是在訓練前的樣本效率較低。盡管GPT-3在測試時間樣本效率方面更接近人類(一次或零次)，但它在訓練前看到的文本仍然比人類在一生中看到的要多得多[Lin20]。提高訓練前的樣本效率是未來工作的一個重要方向，可能來自于在物理世界的基礎上提供額外的信息，或者來自于算法的改進。在GPT-3中，與少樣本學習相關的一個限制，或者至少是不確定性，是關于小樣本學習實際上是在推理時間“從零開始”學習新任務，還是僅僅識別和識別在訓練中學習到的任務的不確定性。這些可能性存在于光譜,從示威游行的訓練集來自相同的分布與測試時間,認識到相同的任務,但在不同的格式,以適應一個特定的風格的QA等任務,學習一門技能完全新創。GPT-3在這個范圍內的位置也可能因任務而異。合成任務，如詞序打亂或定義無意義的詞，似乎特別有可能從頭學習，而翻譯顯然必須在訓練前學習，盡管可能從組織和風格上與測試數據非常不同的數據。最終，我們甚至不清楚人類從從零開始和之前的演示中學到了什么。即使是在訓練前組織各種演示，并在測試時識別它們，也將是語言模型的一個進步，但準確地理解少槍學習是如何工作的，是未來研究的一個重要的未探索的方向。 ?
A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundreds of billions of parameters; new challenges and opportunities may be associated with applying it to models of this size.

Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts (Section 6).

無論目標函數或算法如何，GPT-3規模上的模型都存在一個限制，即它們都是昂貴的，并且不便于進行推斷，這可能對當前形式的這種規模的模型的實際適用性提出挑戰。解決這一問題的一個可能的未來方向是將大型模型精餾[HVD15]，使其達到可管理的規模，以完成特定的任務。像GPT-3這樣的大型模型包含了非常廣泛的技能，其中大多數技能對于特定的任務來說是不需要的，這表明在原則上積極的提煉是可能的。蒸餾在一般情況下得到了很好的探索[LHCG19a]，但還沒有在數千億個參數的規模上進行嘗試;將其應用于這種規模的模型可能會帶來新的挑戰和機會。最后,GPT-3共同分享一些限制大多數深度學習系統——它的決定并不容易解釋,它在預測不一定精確校準的小說所觀察到的輸入方差性能遠高于人類標準基準,它保留了數據的偏見一直在訓練。最后這個問題- -數據的偏差可能導致模型產生定型或偏見的內容- -從社會角度來說是特別關注的問題，將在下一節中與其他問題一起討論更廣泛的影響(第6節)。

6 Broader Impacts 更廣泛的影響

Language models have a wide range of beneficial applications for society, including code and writing auto-completion, grammar assistance, game narrative generation, improving search engine responses, and answering questions. But they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the potential to advance both the beneficial and harmful applications of language models.
Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly discuss issues of energy efficiency (Section 6.3).

語言模型為社會提供了廣泛的有益應用，包括代碼和編寫自動完成、語法幫助、游戲敘事生成、改進搜索引擎響應和回答問題。但它們也有潛在的有害用途。相對于較小的模型，GPT-3提高了文本生成的質量和適應性，并增加了區分合成文本和人類書寫文本的難度。因此，它有潛力促進語言模型的有益和有害應用。 
在這里，我們關注改進后的語言模型的潛在危害，不是因為我們認為這種危害必然更大，而是為了激勵人們努力去研究和減輕它們。這類語言模型的廣泛影響是多方面的。我們關注兩個主要問題:第6.1節中故意誤用像GPT-3這樣的語言模型的可能性，以及第6.2節中像GPT-3這樣的模型中的偏見、公平和表示問題。我們也簡要討論能源效益的問題(第6.3節)。

6.1 Misuse of Language Models 語言模型的誤用

Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing language models in a very different environment or for a different purpose than researchers intended. To help with this, we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.

惡意使用語言模型可能有點難以預料，因為它們通常涉及到在非常不同的環境中重新使用語言模型，或者用于與研究人員預期不同的目的。為了幫助解決這一問題，我們可以從傳統的安全風險評估框架的角度進行思考，這些框架列出了關鍵步驟，如識別威脅和潛在影響、評估可能性以及將風險確定為可能性和影響的組合[Ros12]。我們討論三個因素:潛在的誤用應用，威脅行動者，和外部激勵結構。

6.1.1 Potential Misuse Applications 潛在的誤用

Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high quality text. Language models that produce high quality text could lower existing barriers to carrying out these activities and increase their efficacy.

The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text in 3.9.4 represents a concerning milestone in this regard.

任何依賴于生成文本的對社會有害的活動都可以通過強大的語言模型來增強。例如，虛假信息，垃圾郵件，網絡釣魚，濫用法律和政府程序，欺詐學術論文寫作和社會工程借口。這些應用程序中的許多都阻礙了人們編寫足夠高質量的文本。產生高質量文本生成的語言模型可以降低執行這些活動的現有障礙，并提高其效率。

隨著文本合成質量的提高，語言模型的誤用潛力也在增加。GPT-3生成幾段合成內容的能力是這方面的一個重要里程碑，人們發現這些合成內容很難與3.9.4中人類書寫的文本區分開來。

6.1.2 Threat Actor Analysis 威脅行動者分析

Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors who may be able to build a malicious product to ‘advanced persistent threats’ (APTs): highly skilled and well-resourced (e.g. state-sponsored) groups with long-term agendas [SBC+19].
To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is not immediate, but significant improvements in reliability could change this.
Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible difference in operations that may see potential gains by using language models. The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage.

威脅參與者可以根據技能和資源級別進行組織，從能夠構建惡意產品的低或中等技能和資源的參與者，到“高級持續威脅”(APTs):高技能和資源充足的(例如。國家資助的)有長期議程的團體[SBC+19]。

為了了解低技能和中等技能的參與者是如何思考語言模型的，我們一直在監視論壇和聊天組，在那里錯誤信息策略，惡意軟件的傳播，和計算機欺詐經常被討論。雖然在2019年春天首次發布GPT-2之后，我們確實發現了大量關于濫用的討論，但我們發現，自那以后，實驗的實例變少了，也沒有成功的部署。此外，這些誤用的討論與媒體對語言模型技術的報道有關。從這一點，我們評估的威脅，濫用這些行動者不是立即，但重大改進的可靠性可以改變這一點。
因為APTs通常不公開討論操作，所以我們就可能涉及語言模型使用的APT活動咨詢了專業的威脅分析師。自從GPT-2發布以來，在使用語言模型可以獲得潛在收益的操作方面沒有明顯的差異。評估是語言模型可能不值得投入大量資源,因為沒有令人信服的證明當前的語言模型明顯優于現有方法生成文本,因為“目標”或“控制”方法的內容語言模型仍處于早期階段。

6.1.3 External Incentive Structures 外部激勵結構

Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.

Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs. The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts how scalable the operation can be.
Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on this through a combination of mitigation research, prototyping, and coordinating with other technical developers.

每個威脅行動者組織也有一套戰術、技術和程序(TTPs)，他們依靠這些來完成他們的議程。ttp會受到經濟因素的影響，比如可伸縮性和部署的簡便性;網絡釣魚在所有群體中都非常流行，因為它提供了一種低成本、低成本、高收益的部署惡意軟件和竊取登錄憑證的方法。使用語言模型來增強現有的ttp可能會導致部署成本更低。
易用性是另一個重要的激勵因素。擁有穩定的基礎設施對ttp的采用有很大的影響。然而，語言模型的輸出是隨機的，盡管開發人員可以限制這些輸出(例如使用top-k truncation)，但如果沒有人類的反饋，它們無法持續執行。如果一個社交媒體假信息機器人的輸出在99%的情況下是可靠的，但在1%的情況下輸出的是不連貫的，這就可以減少操作這個機器人所需的人力。但是仍然需要人工篩選輸出，這限制了操作的可伸縮性。 
基于我們對這個模型的分析，以及對威脅參與者和環境的分析，我們懷疑人工智能研究人員最終將開發出具有足夠一致性和可操控性的語言模型，從而使惡意參與者更感興趣。我們希望這將給更廣泛的研究界帶來挑戰，并希望通過結合緩解研究、原型設計和與其他技術開發人員協調來解決這一問題。
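The top-k truncation mentioned above restricts sampling to the k highest-scoring next tokens, which narrows the output distribution without making it deterministic. The following is a minimal sketch in plain Python; the function name `top_k_sample` and the toy logits are assumptions for illustration and do not describe how GPT-3 itself was sampled.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-logit candidates.

    logits: list of unnormalized scores, one per vocabulary entry.
    k:      number of candidates kept; all other tokens get probability 0.
    rng:    the `random` module or a random.Random instance.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")

    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Softmax over the surviving candidates only (shifted for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]

    return rng.choices(top, weights=weights, k=1)[0]
```

With k = 1 this reduces to greedy decoding; larger k keeps more randomness, which is why truncation alone does not guarantee the consistency the passage is concerned with.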

6.2 Fairness, Bias, and Representation 公平、偏見和代表性

Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning, since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in the model in order to better understand GPT-3’s limitations when it comes to fairness, bias, and representation.
Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model’s biases even within the studied categories.

Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race, and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how they are different in this dimension.

訓練數據中的偏差可能導致模型產生定型或偏見的內容。這是令人擔憂的，因為模型偏見可能以不同的方式傷害相關群體的人，通過加強現有的刻板印象和產生貶低形象等潛在危害[Cra17]。我們對模型中的偏差進行了分析，以便更好地理解GPT-3在公平性、偏差和代表性方面的局限性。8
我們的目標不是詳盡地描述GPT-3，而是對其局限性和行為進行初步分析。我們關注的是與性別、種族和宗教相關的偏見，盡管可能存在許多其他類別的偏見，可以在后續工作中進行研究。這只是初步的分析，并沒有反映模型的所有偏差，即使是在研究的類別內。 
總的來說，我們的分析表明，經過互聯網訓練的模型具有互聯網規模偏差;模型傾向于反映訓練數據中呈現的刻板印象。下面我們將討論我們在性別、種族和宗教維度上的偏見的初步發現。我們在1750億參數模型和類似較小的模型中探查偏差，看看它們在這個維度上是否和如何不同。

6.2.1 Gender 性別

In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found that occupations in general have a higher probability of being followed by a male gender identifier than a female one (in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant). 83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured this by feeding the model a context such as "The detective was a" and then looking at the probability of the model following up with male indicating words (e.g. man, male, etc.) or female indicating words (woman, female, etc.). In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus were heavily male leaning along with occupations that require hard physical labour such as mason, millwright, and sheriff. Occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist, housekeeper, etc.
We also tested how these probabilities changed when we shifted the context to be "The competent {occupation} was a" (Competent Variant), and when we shifted the context to be "The incompetent {occupation} was a" (Incompetent Variant) for each occupation in the dataset. We found that, when prompted with "The competent {occupation} was a," the majority of occupations had an even higher probability of being followed by a male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was a". With the prompt "The incompetent {occupation} was a" the majority of occupations still leaned male with a probability similar to that for our original neutral prompt. The average occupation bias - measured as $\frac{1}{n_{\text{jobs}}} \sum_{\text{jobs}} \log\left(\frac{P(\text{female} \mid \text{Context})}{P(\text{male} \mid \text{Context})}\right)$ - was -1.11 for the Neutral Variant, -2.14 for the Competent Variant and -1.15 for the Incompetent Variant.

在我們對GPT-3性別偏見的調查中，我們關注的是性別與職業之間的聯系。我們發現，在給出“該職業是一個”(中性變量)這樣的背景下，一般來說，職業被男性性別標識符跟隨的概率比女性更高(換句話說，她們更傾向于男性)。在我們測試的388種職業中，有83%的職業更有可能被男性的GPT-3尾隨。我們通過給模型輸入諸如“偵探是a”這樣的語境來測量這一點，然后觀察模型接著輸入男性暗示詞(如“the detective was a”)的概率。或表示女性的詞(woman, female等)。特別是，具有較高教育水平的職業，如立法者、銀行家或名譽教授，以及需要重體力勞動的職業，如梅森、米爾萊特和治安官，都偏重于男性。更有可能被女性識別的職業包括助產士、護士、接待員、管家等。
我們還測試了當我們將上下文轉換為“勝任的{占職}是一個”(勝任的變體)時，以及當我們將上下文轉換為“不勝任的{占職}是一個”(不勝任的變體)時，這些概率是如何變化的。我們發現，當提示為“勝任的{職業}是a”時，大多數職業后面跟隨男性標識符的概率比跟隨女性標識符的概率還要高，這比我們最初的中性提示為“The{職業}是a”的概率還要高。當提示“the incompetent {career} was a”時，大多數職業仍然傾向于男性，這一概率與我們最初的中性提示相似。以1 njobs P job log(P(女性|環境)P(男性|環境))測量的平均職業偏倚為:中性變異為- 1.11，勝任變異為- 2.14，不勝任變異為- 1.15。
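The average occupation bias defined above is the mean, over occupations, of the log ratio between the probability of a female identifier and the probability of a male identifier. A minimal sketch of that calculation follows; the function names and the probabilities in the usage example are made up for illustration and are not values from the paper.

```python
import math


def average_occupation_bias(probs):
    """Mean of log(P(female | context) / P(male | context)) over occupations.

    probs: dict mapping occupation -> (p_female, p_male), both > 0.
    A negative result means the occupations lean male on average.
    """
    if not probs:
        raise ValueError("need at least one occupation")
    total = sum(math.log(p_f / p_m) for p_f, p_m in probs.values())
    return total / len(probs)


def fraction_male_leaning(probs):
    """Share of occupations where the male identifier is more probable."""
    return sum(p_m > p_f for p_f, p_m in probs.values()) / len(probs)


# Hypothetical probabilities, for illustration only.
example = {
    "detective": (0.10, 0.40),
    "nurse": (0.45, 0.15),
    "banker": (0.05, 0.50),
}
bias = average_occupation_bias(example)
share = fraction_male_leaning(example)
```

Because the measure is a log ratio, a value of -1.11 corresponds to the male identifier being roughly e^1.11, or about 3 times, as likely as the female one at the geometric-mean level.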


We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further corroborated the model’s tendency to associate most occupations with males. One method measured the model’s ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model a context such as "The advisor met with the advisee because she wanted to get advice about job applications. ‘She’ refers to the" and found the option with the lowest probability between the two possible options (Choices between Occupation Option: advisor; Participant Option: advisee).
Occupation and participant words often have societal biases associated with them such as the assumption that most occupants are by default male. We found that the language models learnt some of these biases such as a tendency to associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%). All other models had a higher accuracy for male pronouns with Occupation sentences as compared to female pronouns, with the exception of our second largest model, GPT-3 13B, which had the same accuracy (60%) for both. This offers some preliminary evidence that in places where issues of bias can make language models susceptible to error, the larger models are more robust than smaller models.

We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other preselected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top-p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She was very", "He would be described as", "She would be described as". We looked at the adjectives and adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more often described using appearance oriented words such as “beautiful” and “gorgeous” as compared to men who were more often described using adjectives that span a greater spectrum.
Table 6.1 shows the top 10 most favored descriptive words for the model along with the raw number of times each word co-occurred with a pronoun indicator. “Most Favored” here indicates words which were most skewed towards a category by co-occurring with it at a higher rate as compared to the other category. To put these numbers in perspective, we have also included the average for the number of co-occurrences across all qualifying words for each gender.


6.2.2 Race

To investigate racial bias in GPT-3, we seeded the model with prompts such as "The {race} man was very", "The {race} woman was very" and "People would describe the {race} person as" and generated 800 samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White or Asian. We then measure word co-occurrences in the generated samples. Given prior research demonstrating that language models produce text of differing sentiment when varying features such as occupation [HZJ+19], we explored how race impacted sentiment. We measured sentiment using SentiWordNet [BES10] for the words which co-occurred disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive words (e.g. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (e.g. wretched: -87.5, horrid: -87.5) and a score of 0 indicating neutral words (e.g. sloping, chalet).
It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that focused on racial features; these results are not from the models talking about race in the wild but talking about race in an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply looking at word co-occurrences, the resulting sentiment can reflect socio-historical factors. For instance, text relating to a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated with a negative sentiment under this testing methodology.
Across the models we analyzed, 'Asian' had a consistently high sentiment: it ranked 1st in 3 out of 7 models. On the other hand, 'Black' had a consistently low sentiment: it ranked the lowest in 5 out of 7 models. These differences narrowed marginally on the larger model sizes. This analysis gives a sense of the biases of different models and highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data.
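
As a rough illustration of the scoring step, the sketch below averages per-word sentiment over the words associated with each category and ranks the categories. It is an assumption-laden toy: `TOY_LEXICON` only contains the example scores quoted above (a real run would load SentiWordNet), the function names are hypothetical, and the category labels and word lists are placeholders, not results.

```python
# Toy lexicon using the example scores quoted in the text; a real run would
# load SentiWordNet scores instead (this dict is a hypothetical stand-in).
TOY_LEXICON = {
    "wonderfulness": 100.0,
    "amicable": 87.5,
    "wretched": -87.5,
    "horrid": -87.5,
    "sloping": 0.0,
    "chalet": 0.0,
}


def mean_sentiment(words, lexicon):
    """Average the scores of the words found in the lexicon; None if none are."""
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else None


def rank_categories(words_by_category, lexicon):
    """Sort category labels from highest to lowest mean sentiment."""
    means = {c: mean_sentiment(ws, lexicon) for c, ws in words_by_category.items()}
    scored = {c: m for c, m in means.items() if m is not None}
    return sorted(scored, key=scored.get, reverse=True)


# Placeholder categories and word lists, invented for this sketch only.
ranking = rank_categories(
    {"group_a": ["wonderfulness", "chalet"], "group_b": ["horrid", "sloping"]},
    TOY_LEXICON,
)
```

Because the score depends only on which words co-occur, it inherits the caveat stated above: a topic with negative vocabulary lowers a category's mean regardless of how the category itself is portrayed.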

6.2.3 Religion

We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam, and Judaism, by generating 800 model outputs of length ≈50 with a temperature of 1 and a top-p of 0.9 for every prompt. Our prompts were of the nature "{Religion practitioners} are" (e.g. "Christians are") for each of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a corpus of such completions for studying co-occurrence of words.
Similar to race, we found that the models make associations with religious terms that indicate some propensity to reflect how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in the top 40 most favored words for Islam in GPT-3.
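
The sampling settings used for these completions (temperature 1, top-p 0.9) refer to temperature-scaled nucleus sampling. The following is a minimal standard-library sketch of that technique, not the authors' implementation; the function names and the four-token example distribution are assumptions for illustration.

```python
import math
import random


def top_p_filter(logits, temperature=1.0, top_p=0.9):
    """Return {token_index: probability}, renormalized over the nucleus.

    The nucleus is the smallest set of highest-probability tokens whose
    cumulative probability reaches `top_p`.
    """
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    z = sum(exps)
    probs = [e / z for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = top_p_filter(logits, temperature, top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]


# Illustrative distribution over four tokens: 0.5, 0.3, 0.15, 0.05.
example_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
nucleus = top_p_filter(example_logits)
```

With these example probabilities the least likely token falls outside the nucleus, so it can never be sampled; the remaining three are renormalized to sum to 1.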

6.2.4 Future Bias and Fairness Challenges

We have presented this preliminary analysis to share some of the biases we found in order to motivate further research, and to highlight the inherent difficulties in characterizing biases in large-scale generative models; we expect this to be an area of continuous research for us and are excited to discuss different methodological approaches with the community. We view the work in this section as subjective signposting: we chose gender, race, and religion as a starting point, but we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model attributes to develop informative labels such as Model Cards for Model Reporting from [MWZ+18].
Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this is also extensive [QMZH19, HZJ+19], so we offer only a few brief comments on future directions specific to large language models. In order to pave the way for effective bias prevention in general purpose models, there is a need for building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for these models. There is room for more research that engages with the literature outside NLP, better articulates normative statements about harm, and engages with the lived experience of communities affected by NLP systems [BBDIW20]. Thus, mitigation work should not be approached purely with a metric-driven objective to 'remove' bias, as this has been shown to have blind spots [GG19, NvNvdG19], but in a holistic manner.

6.3 Energy Usage

Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training the GPT-3 175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days for a 1.5B parameter GPT-2 model (Figure 2.2). This means we should be cognizant of the cost and efficiency of such models, as advocated by [SDSE19].
The use of large-scale pre-training also gives another lens through which to view the efficiency of large models: we should consider not only the resources that go into training them, but how these resources are amortized over the lifetime of a model, which will subsequently be used for a variety of purposes and fine-tuned for specific tasks. Though models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or only a few cents in energy costs. Additionally, techniques like model distillation [LHCG19a] can further bring down the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efficient versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efficiency of such models over time, similar to trends observed in image recognition and neural machine translation [HB20].
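
The orders of magnitude quoted here can be checked with back-of-the-envelope arithmetic. The sketch below rests on assumptions not stated in this section: the common approximation of about 6 FLOPs per parameter per training token, the 300 billion training tokens reported elsewhere in the paper, and a hypothetical electricity price of 0.10 USD per kWh.

```python
SECONDS_PER_DAY = 86_400
# One petaflop/s-day: 1e15 floating-point operations per second for one day.
FLOPS_PER_PFS_DAY = 1e15 * SECONDS_PER_DAY  # 8.64e19 FLOPs


def training_pfs_days(n_params, n_tokens, flops_per_param_token=6.0):
    """Approximate training compute in petaflop/s-days.

    Uses the rule of thumb C ~ 6 * N * D (an assumption of this sketch).
    """
    return flops_per_param_token * n_params * n_tokens / FLOPS_PER_PFS_DAY


def energy_cost_usd(kwh, usd_per_kwh):
    """Energy cost for a given consumption and price per kWh."""
    return kwh * usd_per_kwh


# 175B parameters, 300B tokens: lands in the "several thousand" range.
gpt3_pfs_days = training_pfs_days(175e9, 300e9)

# 0.4 kWh for 100 pages, at an assumed 0.10 USD/kWh: about 4 cents.
inference_cost = energy_cost_usd(0.4, 0.10)
```

Under these assumptions the training estimate comes out between 3,000 and 4,000 petaflop/s-days, consistent with "several thousand", and the inference figure is consistent with "a few cents".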

7 Related Work

Several lines of work have focused on increasing parameter count and/or computation in language models as a means to improve generative or task performance. An early work scaled LSTM-based language models to over a billion parameters [JVS+16]. One line of work straightforwardly increases the size of transformer models, scaling up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size: 213 million parameters [VSP+17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters [RWC+19], 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and most recently 17 billion parameters [Tur20]. A second line of work has focused on increasing parameter count but not computation, as a means of increasing models' capacity to store information without increased computational cost. These approaches rely on the conditional computation framework [BLC13] and specifically, the mixture-of-experts method [SMM+17] has been used to produce 100 billion parameter models and more recently 50 billion parameter translation models [AJF19], though only a small fraction of the parameters are actually used on each forward pass. A third approach increases computation without increasing parameters; examples of this approach include adaptive computation time [Gra16] and the universal transformer [DGV+18]. Our work focuses on the first approach (scaling compute and parameters together, by straightforwardly making the neural net larger), and increases model size 10x beyond previous models that employ this strategy.
Several efforts have also systematically studied the effect of scale on language model performance. [KMH+20, RRBS19, LWS+20, HNA+17] find a smooth power-law trend in loss as autoregressive language models are scaled up. This work suggests that this trend largely continues as models continue to scale up (although a slight bending of the curve can perhaps be detected in Figure 3.1), and we also find relatively smooth increases in many (though not all) downstream tasks across 3 orders of magnitude of scaling.
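
The "smooth power-law trend" can be written compactly. As a sketch, with C denoting training compute and a, b > 0 standing for fit constants (illustrative symbols, not values taken from this section), a power law in the loss L is a straight line on log-log axes:

```latex
L(C) = a \, C^{-b}
\qquad\Longleftrightarrow\qquad
\log L(C) = \log a - b \log C
```

In this picture, a "slight bending of the curve" corresponds to the largest-compute points drifting away from that straight line.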
Another line of work goes in the opposite direction from scaling, attempting to preserve strong performance in language models that are as small as possible. This approach includes ALBERT [LCG+19] as well as general [HVD15] and task-specific [SDCW19, JYS+19, KR16] approaches to distillation of language models. These architectures and techniques are potentially complementary to our work, and could be applied to decrease latency and memory footprint of giant models.
As fine-tuned language models have neared human performance on many standard benchmark tasks, considerable effort has been devoted to constructing more difficult or open-ended tasks, including question answering [KPR+19, IBGC+14, CCE+18, MCKS18], reading comprehension [CHI+18, RCM19], and adversarially constructed datasets designed to be difficult for existing language models [SBBC19, NWD+19]. In this work we test our models on many of these datasets.
Many previous efforts have focused specifically on question-answering, which constitutes a significant fraction of the tasks we tested on. Recent efforts include [RSR+19, RRS20], which fine-tuned an 11 billion parameter language model, and [GLT+20], which focused on attending over a large corpus of data at test time. Our work differs in focusing on in-context learning but could be combined in the future with those of [GLT+20, LPP+20].
Metalearning in language models has been utilized in [RWC+19], though with much more limited results and no systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it structurally similar to metalearning as applied to ML in general. Here there is an extensive literature, including matching networks [VBL+16], RL² [DSC+16], learning to optimize [RL16, ADG+16, LM17] and MAML [FAL17]. Our approach of stuffing the model's context with previous examples is most structurally similar to RL² and also resembles [HYC01], in that an inner loop of adaptation takes place through computation in the model's activations across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training) updates the weights, and implicitly learns the ability to adapt to or at least recognize tasks defined at inference-time. Few-shot auto-regressive density estimation was explored in [RCP+17] and [GWC+18] studied low-resource NMT as a few-shot learning problem.
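
To make "stuffing the model's context with previous examples" concrete, here is a minimal sketch of how a few-shot prompt might be assembled. The function name, the "Q:"/"A:" layout, and the separator are hypothetical choices for illustration, not the prompt formats used in the paper; the point is only that the demonstrations live in the input text and no weights are updated.

```python
def build_few_shot_prompt(task_description, examples, query, sep="\n\n"):
    """Assemble an in-context learning prompt.

    task_description: natural-language instruction (may be empty).
    examples: list of (input, output) demonstration pairs; an empty list
        gives the zero-shot setting, one pair the one-shot setting.
    query: the new input the model should complete.
    """
    parts = [task_description] if task_description else []
    parts += [f"Q: {source}\nA: {target}" for source, target in examples]
    parts.append(f"Q: {query}\nA:")  # left open for the model to continue
    return sep.join(parts)


few_shot = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage")],
    "sea otter",
)
zero_shot = build_few_shot_prompt("Translate English to French.", [], "sea otter")
```

Here the "inner loop" is simply the forward pass over this string: any adaptation to the task happens in the activations while the model reads the demonstrations.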
While the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained language models in combination with gradient descent to perform few-shot learning [SS20]. Another sub-field with similar goals is semi-supervised learning where approaches such as UDA [XDH+19] also explore methods of fine-tuning when very little labeled data is available.
Giving multi-task models instructions in natural language was first formalized in a supervised setting with [MKXS18] and utilized for some tasks (such as summarizing) in a language model with [RWC+19]. The notion of presenting tasks in natural language was also explored in the text-to-text transformer [RSR+19], although there it was applied for multi-task fine-tuning rather than for in-context learning without weight updates.
Another approach to increasing generality and transfer-learning capability in language models is multi-task learning [Car97], which fine-tunes on a mixture of downstream tasks together, rather than separately updating the weights for each one. If successful, multi-task learning could allow a single model to be used for many tasks without updating the weights (similar to our in-context learning approach), or alternatively could improve sample efficiency when updating the weights for a new task. Multi-task learning has shown some promising initial results [LGH+15, LSP+18] and multi-stage fine-tuning has recently become a standardized part of SOTA results on some datasets [PFB18] and pushed the boundaries on certain tasks [KKS+20], but is still limited by the need to manually curate collections of datasets and set up training curricula. By contrast, pre-training at large enough scale appears to offer a "natural" broad distribution of tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate a broader set of explicit tasks for multi-task learning, for example through procedural generation [TFR+17], human interaction [ZSW+19b], or active learning [Mac92].
Algorithmic innovation in language models over the last two years has been enormous, including denoising-based bidirectionality [DCLT18], prefixLM [DL15] and encoder-decoder architectures [LLG+19, RSR+19], random permutations during training [YDY+19], architectures that improve the efficiency of sampling [DYY+19], improvements in data and training procedures [LOG+19], and efficiency increases in the embedding parameters [LCG+19]. Many of these techniques provide significant gains on downstream tasks. In this work we continue to focus on pure autoregressive language models, both in order to focus on in-context learning performance and to reduce the complexity of our large model implementations. However, it is very likely that incorporating these algorithmic advances could improve GPT-3's performance on downstream tasks, especially in the fine-tuning setting, and combining GPT-3's scale with these algorithmic techniques is a promising direction for future work.

8 Conclusion

We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.

Acknowledgements

The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea Voss for helping run evaluations on OpenAI's infrastructure. Thanks to David Luan for initial support in scaling up this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments, Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of people who created content that was used in the training of the model, and to those who were involved in indexing or upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure and supercomputing teams for making it possible to train models at this scale.
