Paper: "NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion", translation and commentary

Posted: 2025/3/21

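Overview: In November 2021, Microsoft Research Asia, together with Peking University, open-sourced on GitHub a multimodal pre-trained model, NÜWA, which can perform text/sketch-to-image generation, image completion, text/sketch-to-video generation, and other tasks, making it remarkably capable.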

"NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion": translation and commentary

Abstract

1. Introduction

2. Related Works

2.1. Visual Auto-Regressive Models

2.2. Visual Sparse Self-Attention

3. Method

3.1. 3D Data Representation

3.2. 3D Nearby Self-Attention

3.3. 3D Encoder-Decoder

3.4. Training Objective

4. Experiments

4.1. Implementation Details

4.2. Comparison with state-of-the-art

4.3. Ablation Study

5. Conclusion

"NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion": translation and commentary

Link	https://arxiv.org/abs/2111.12417
GitHub	GitHub - microsoft/NUWA: A unified 3D Transformer Pipeline for visual synthesis
Authors	Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan
Publication date	24 November 2021

Figure 1. Examples of 8 typical visual generation and manipulation tasks supported by the NÜWA model.


Abstract

This paper presents a unified multimodal pre-trained model called NÜWA that can generate new or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. The project repository is https://github.com/microsoft/NUWA.


1. Introduction

Nowadays, the Web is becoming more visual than ever before, as images and videos have become the new information carriers and have been used in many practical applications. With this background, visual synthesis is becoming a more and more popular research topic, which aims to build models that can generate new or manipulate existing visual data (i.e., images and videos) for various visual scenarios.

Auto-regressive models [33, 39, 41, 45] play an important role in visual synthesis tasks, due to their explicit density modeling and stable training advantages compared with GANs [4, 30, 37, 47]. Earlier visual auto-regressive models, such as PixelCNN [39], PixelRNN [41], Image Transformer [28], iGPT [5], and Video Transformer [44], performed visual synthesis in a "pixel-by-pixel" manner. However, due to their high computational cost on high-dimensional visual data, such methods can be applied to low-resolution images or videos only and are hard to scale up.

Recently, with the rise of VQ-VAE [40] as a discrete visual tokenization approach, efficient and large-scale pre-training can be applied to visual synthesis tasks for images (e.g., DALL-E [33] and CogView [9]) and videos (e.g., GODIVA [45]). Although achieving great success, such solutions still have limitations: they treat images and videos separately and focus on generating either of them. This prevents the models from benefiting from both image and video data.


In this paper, we present NÜWA, a unified multimodal pre-trained model that aims to support visual synthesis tasks for both images and videos, and conduct experiments on 8 downstream visual synthesis tasks, as shown in Fig. 1. The main contributions of this work are three-fold:

We propose NÜWA, a general 3D transformer encoder-decoder framework, which covers language, image, and video at the same time for different visual synthesis tasks. It consists of an adaptive encoder that takes either text or visual sketch as input, and a decoder shared by 8 visual synthesis tasks.

We propose a 3D Nearby Attention (3DNA) mechanism in the framework to consider the locality characteristic for both spatial and temporal axes. 3DNA not only reduces computational complexity but also improves the visual quality of the generated results.

Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, NÜWA shows surprisingly good zero-shot capabilities not only on text-guided image manipulation, but also on text-guided video manipulation.


2. Related Works

2.1. Visual Auto-Regressive Models

The method proposed in this paper follows the line of visual synthesis research based on auto-regressive models. Earlier visual auto-regressive models [5, 28, 39, 41, 44] performed visual synthesis in a "pixel-by-pixel" manner. However, due to the high computational cost when modeling high-dimensional data, such methods can be applied to low-resolution images or videos only, and are hard to scale up.

Recently, VQ-VAE-based [40] visual auto-regressive models were proposed for visual synthesis tasks. By converting images into discrete visual tokens, such methods can conduct efficient and large-scale pre-training for text-to-image generation (e.g., DALL-E [33] and CogView [9]), text-to-video generation (e.g., GODIVA [45]), and video prediction (e.g., LVT [31] and VideoGPT [48]), with higher resolution of generated images or videos. However, none of these models was trained on images and videos together, although it is intuitive that these tasks can benefit from both types of visual data.


Compared to these works, NÜWA is a unified auto-regressive visual synthesis model that is pre-trained on visual data covering both images and videos and can support various downstream tasks. We also verify the effectiveness of different pre-training tasks in Sec. 4.3. Besides, VQ-GAN [11] instead of VQ-VAE is used in NÜWA for visual tokenization, which, based on our experiment, can lead to better generation quality.


2.2. Visual Sparse Self-Attention

How to deal with the quadratic complexity issue brought by self-attention is another challenge, especially for tasks like high-resolution image synthesis or video synthesis.

Similar to NLP, sparse attention mechanisms have been explored to alleviate this issue for visual synthesis. [31, 44] split the visual data into different parts (or blocks) and then performed block-wise sparse attention for the synthesis tasks. However, such methods dealt with different blocks separately and did not model their relationships. [15, 33, 45] proposed to use axial-wise sparse attention in visual synthesis tasks, which conducts sparse attention along the axes of visual data representations. This mechanism makes training very efficient and is friendly to large-scale pre-trained models like DALL-E [33], CogView [9], and GODIVA [45]. However, the quality of generated visual contents could be harmed due to the limited contexts used in self-attention. [6, 28, 32] proposed to use local-wise sparse attention in visual synthesis tasks, which allows the models to see more contexts. But these works were for images only.

Compared to these works, NÜWA proposes a 3D nearby attention that extends the local-wise sparse attention to cover both images and videos. We also verify that local-wise sparse attention is superior to axial-wise sparse attention for visual generation in Sec. 4.3.


3. Method

3.1. 3D Data Representation

To cover all texts, images, and videos or their sketches, we view all of them as tokens and define a unified 3D notation $X \in \mathbb{R}^{h \times w \times s \times d}$, where $h$ and $w$ denote the number of tokens in the spatial axis (height and width respectively), $s$ denotes the number of tokens in the temporal axis, and $d$ is the dimension of each token. In the following, we introduce how we get this unified representation for different modalities.

Texts are naturally discrete, and following Transformer [42], we use a lower-cased byte pair encoding (BPE) to tokenize and embed them into $\mathbb{R}^{1 \times 1 \times s \times d}$. We use placeholder 1 because the text has no spatial dimension.

Images are naturally continuous pixels. Given a raw image $I \in \mathbb{R}^{H \times W \times C}$ with height $H$, width $W$ and channel $C$, VQ-VAE [40] trains a learnable codebook to build a bridge between raw continuous pixels and discrete tokens, as denoted in Eq. (1)-(2):
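The equations referenced above are not reproduced in this text. As a rough illustration of the codebook idea only, the following minimal sketch shows the nearest-neighbour lookup that VQ-VAE-style tokenizers typically perform: each grid feature is replaced by the index of its closest codebook entry. The function name `quantize`, the variable names, and the toy sizes are assumptions made for this example, not the authors' code.

```python
import numpy as np

def quantize(features, codebook):
    """Map each grid feature to the index of its nearest codebook entry.

    features: (h, w, d_B) array, standing in for the encoder output E(I)
    codebook: (N, d_B) array, standing in for the codebook B
    Returns (tokens, quantized) with shapes (h, w) and (h, w, d_B).
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d)                        # (h*w, d_B)
    # Squared Euclidean distance between every feature and every code.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)                         # (h*w,)
    quantized = codebook[tokens]                          # look the codes back up
    return tokens.reshape(h, w), quantized.reshape(h, w, d)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))      # toy sizes, far smaller than in the paper
features = rng.normal(size=(3, 3, 4))    # toy stand-in for E(I)
tokens, quantized = quantize(features, codebook)
print(tokens.shape, quantized.shape)     # (3, 3) (3, 3, 4)
```

The integer grid `tokens` is what an auto-regressive model would predict; a decoder (not shown) would map `quantized` back to pixels.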


Videos can be viewed as a temporal extension of images, and recent works like VideoGPT [48] and VideoGen [51] extend convolutions in the VQ-VAE encoder from 2D to 3D and train a video-specific representation. However, this fails to share a common codebook for both images and videos. In this paper, we show that simply using 2D VQ-GAN to encode each frame of a video can also generate temporally consistent videos and at the same time benefit from both image and video data. The resulting representation is denoted as $\mathbb{R}^{h \times w \times s \times d}$, where $s$ denotes the number of frames.

For image sketches, we consider them as images with special channels. An image segmentation matrix in $\mathbb{R}^{H \times W}$, with each value representing the class of a pixel, can be viewed in a one-hot manner as $\mathbb{R}^{H \times W \times C}$, where $C$ is the number of segmentation classes. By training an additional VQ-GAN for image sketches, we finally get the embedded image representation $\mathbb{R}^{h \times w \times 1 \times d}$. Similarly, for video sketches, the representation is $\mathbb{R}^{h \times w \times s \times d}$.
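The one-hot view of a segmentation map described above can be illustrated in a few lines. This is a minimal sketch with a toy 2 x 2 map; the helper name `one_hot_sketch` is hypothetical.

```python
import numpy as np

def one_hot_sketch(seg, num_classes):
    """Turn an (H, W) integer class map into an (H, W, C) one-hot array."""
    return np.eye(num_classes, dtype=np.float32)[seg]

seg = np.array([[0, 2],
                [1, 2]])                 # toy segmentation map with C = 3 classes
sketch = one_hot_sketch(seg, num_classes=3)
print(sketch.shape)                      # (2, 2, 3)
```

Each pixel's class becomes a channel index, so the sketch can be fed to an image tokenizer in the same way as a multi-channel image.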


3.2. 3D Nearby Self-Attention

In this section, we define a unified 3D Nearby Self-Attention (3DNA) module based on the previous 3D data representations, supporting both self-attention and cross-attention. We first give the definition of 3DNA in Eq. (6), and introduce detailed implementation in Eq. (7)-(11):
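The equations themselves are missing from this text, so the following is only a sketch of the general idea of nearby attention, not the paper's exact formulation. It assumes that each query position is projected onto the grid of the condition, that keys and values are gathered from a centred window of width `e` per axis (clipped at the borders), and that scores are scaled by the square root of the input dimension. Causal masking, multiple heads, and batching are left out, and all names are illustrative.

```python
import numpy as np

def nearby_attention(X, C, extents, Wq, Wk, Wv):
    """Single-head nearby attention of target X over condition C (no causal mask).

    X: (h, w, s, d_in) query source; C: (h2, w2, s2, d_in) key/value source.
    extents: (e_h, e_w, e_s) window widths per axis (assumed odd and centred).
    """
    h, w, s, d_in = X.shape
    h2, w2, s2, _ = C.shape
    rh, rw, rs = (e // 2 for e in extents)
    out = np.zeros((h, w, s, Wv.shape[1]))
    for i in range(h):
        for j in range(w):
            for k in range(s):
                # Project the query position onto the grid of C.
                a, b, c = i * h2 // h, j * w2 // w, k * s2 // s
                # Gather the neighbourhood, clipped at the borders.
                N = C[max(a - rh, 0):a + rh + 1,
                      max(b - rw, 0):b + rw + 1,
                      max(c - rs, 0):c + rs + 1].reshape(-1, d_in)
                q = X[i, j, k] @ Wq                    # (d_k,)
                K, V = N @ Wk, N @ Wv                  # (n, d_k), (n, d_out)
                scores = K @ q / np.sqrt(d_in)
                weights = np.exp(scores - scores.max())
                weights /= weights.sum()               # softmax over the window
                out[i, j, k] = weights @ V
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4, 2, 8))
W = [rng.normal(size=(8, 8)) for _ in range(3)]
Y = nearby_attention(X, X, (3, 3, 3), *W)    # self-attention: C is X itself
print(Y.shape)                               # (4, 4, 2, 8)
```

Passing a different tensor as `C` gives the cross-attention case; the explicit loops favour readability over speed.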

3.3. 3D Encoder-Decoder

In this section, we introduce the 3D encoder-decoder built on 3DNA. To generate a target $Y \in \mathbb{R}^{h \times w \times s \times d^{out}}$ under the condition of $C \in \mathbb{R}^{h \times w \times s \times d^{in}}$, the positional encodings for both $Y$ and $C$ are updated by three different learnable vocabularies considering the height, width, and temporal axes, respectively, in Eq. (12)-(13):
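Since the equations are not reproduced here, the sketch below shows one plausible reading of "three learnable vocabularies": one embedding table per axis, added to every token by broadcasting. The function name and the random initialisation are assumptions for illustration; in a real model the tables would be trained parameters.

```python
import numpy as np

def add_positional_encoding(Y, P_h, P_w, P_s):
    """Add one embedding per axis to every token of Y.

    Y: (h, w, s, d); P_h: (h, d); P_w: (w, d); P_s: (s, d).
    """
    return (Y
            + P_h[:, None, None, :]     # depends on the height index only
            + P_w[None, :, None, :]     # depends on the width index only
            + P_s[None, None, :, :])    # depends on the temporal index only

rng = np.random.default_rng(0)
h, w, s, d = 3, 3, 2, 4
Y = rng.normal(size=(h, w, s, d))
P_h, P_w, P_s = (rng.normal(size=(n, d)) for n in (h, w, s))
Y_pos = add_positional_encoding(Y, P_h, P_w, P_s)
print(Y_pos.shape)                       # (3, 3, 2, 4)
```

Factorising the position into three small tables keeps the number of positional parameters at (h + w + s) x d instead of (h x w x s) x d.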

Then, the condition $C$ is fed into an encoder with a stack of $L$ 3DNA layers to model the self-attention interactions, with the $l$-th layer denoted in Eq. (14):

Similarly, the decoder is also a stack of $L$ 3DNA layers. The decoder calculates both the self-attention of generated results and the cross-attention between generated results and conditions. The $l$-th layer is denoted in Eq. (15).


3.4. Training Objective

We train our model on three tasks, Text-to-Image (T2I), Video Prediction (V2V) and Text-to-Video (T2V). The training objective for the three tasks is a cross-entropy loss, denoted as three parts in Eq. (16), respectively:


For T2I and T2V tasks, $C^{text}$ denotes text conditions. For the V2V task, since there is no text input, we instead get a constant 3D representation $c$ of the special word "None". $\theta$ denotes the model parameters.
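Eq. (16) itself is not reproduced in this text. One form consistent with the description, assuming a standard auto-regressive cross-entropy over target tokens $y_t$ (the token index $t$ and the summation limits are assumptions made here), would be:

```latex
\mathcal{L} =
  - \sum_{t=1}^{h \times w}
      \log p_{\theta}\left(y_t \mid y_{<t}, C^{text}\right)
  - \sum_{t=1}^{h \times w \times s}
      \log p_{\theta}\left(y_t \mid y_{<t}, c\right)
  - \sum_{t=1}^{h \times w \times s}
      \log p_{\theta}\left(y_t \mid y_{<t}, C^{text}\right)
```

Read left to right, the three terms would correspond to T2I (a single frame of $h \times w$ tokens), V2V (conditioned on the constant $c$), and T2V.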


4. Experiments

4.1. Implementation Details

Based on Sec. 3.4, we first pre-train NÜWA on three datasets: Conceptual Captions [22] for text-to-image (T2I) generation, which includes 2.9M text-image pairs; Moments in Time [26] for video prediction (V2V), which includes 727K videos; and the VATEX dataset [43] for text-to-video (T2V) generation, which includes 241K text-video pairs. In the following, we first introduce implementation details in Sec. 4.1, then compare NÜWA with state-of-the-art models in Sec. 4.2, and finally conduct ablation studies in Sec. 4.3 to study the impacts of different parts.


For the 3D representations of Sec. 3.1, we set the sizes for text, image, and video as follows. For text, the size of the 3D representation is 1 × 1 × 77 × 1280. For image, the size of the 3D representation is 21 × 21 × 1 × 1280. For video, the size of the 3D representation is 21 × 21 × 10 × 1280, where we sample 10 frames from a video at 2.5 fps. Although the default visual resolution is 336 × 336, we pre-train at different resolutions for a fair comparison with existing models. For the VQ-GAN model used for both images and videos, the size of the grid feature E(I) in Eq. (1) is 441 × 256, and the size of the codebook B is 12,288.
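The sizes above fit together arithmetically: a 336 × 336 frame mapped to a 21 × 21 grid implies a spatial downsampling factor of 16 per side, and 21 × 21 = 441 matches the first dimension of the grid feature. The small helper below reproduces the three shapes; the function name and the `downsample=16` default are assumptions inferred from these numbers.

```python
def representation_shape(modality, resolution=336, downsample=16, frames=10,
                         text_len=77, dim=1280):
    """Return the (h, w, s, d) shape of the unified 3D representation."""
    grid = resolution // downsample          # 336 // 16 = 21 tokens per side
    if modality == "text":
        return (1, 1, text_len, dim)         # no spatial extent
    if modality == "image":
        return (grid, grid, 1, dim)          # a single frame
    if modality == "video":
        return (grid, grid, frames, dim)
    raise ValueError(f"unknown modality: {modality}")

for m in ("text", "image", "video"):
    print(m, representation_shape(m))
# text (1, 1, 77, 1280)
# image (21, 21, 1, 1280)
# video (21, 21, 10, 1280)

h, w, _, _ = representation_shape("image")
print(h * w)                                 # 441
```

At 2.5 fps, the 10 sampled frames span roughly 4 seconds of video.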

Different sparse extents are used for different modalities in Sec. 3.2. For text, we set $(e^w, e^h, e^s) = (1, 1, \infty)$, where $\infty$ denotes that the full text is always used in attention. For image and image sketches, $(e^w, e^h, e^s) = (3, 3, 1)$. For video and video sketches, $(e^w, e^h, e^s) = (3, 3, 3)$.

We pre-train on 64 A100 GPUs for two weeks with the number of layers $L$ in Eq. (14) set to 24, an Adam [17] optimizer with a learning rate of 1e-3, a batch size of 128, and a warm-up over 5% of a total of 50M steps. The final pre-trained model has a total of 870M parameters.
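To make the warm-up figure concrete: 5% of 50M steps is 2.5M steps. The sketch below assumes a linear warm-up to the peak rate followed by a constant rate; the text does not state the warm-up shape or what happens afterwards, so both are assumptions, as are the function and parameter names.

```python
def learning_rate(step, peak_lr=1e-3, total_steps=50_000_000, warmup_percent=5):
    """Linear warm-up to peak_lr, then constant (the later schedule is not stated)."""
    warmup_steps = total_steps * warmup_percent // 100   # 2,500,000 steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

print(50_000_000 * 5 // 100)         # 2500000
print(learning_rate(2_500_000))      # 0.001
```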


4.2. Comparison with state-of-the-art

Text-to-Image (T2I) fine-tuning: We compare NÜWA on the MSCOCO [22] dataset quantitatively in Tab. 1 and qualitatively in Fig. 3. Following DALL-E [33], we use the k-blurred FID score (FID-k) and Inception Score (IS) [35] to evaluate quality and variety respectively, and following GODIVA [45], we use the CLIPSIM metric, which incorporates a CLIP [29] model to calculate the semantic similarity between the input text and the generated image. For a fair comparison, all the models use a resolution of 256 × 256. We generate 60 images for each text and select the best one by CLIP [29]. In Tab. 1, NÜWA significantly outperforms CogView [9] with an FID-0 of 12.9 and a CLIPSIM of 0.3429. Although XMC-GAN [50] reports a better FID score of 9.3, we find that NÜWA generates more realistic images when compared with the exact same samples in XMC-GAN's paper (see Fig. 3). Especially in the last example, the boy's face is clear and the balloons are correctly generated.
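The "generate 60 images and select the best one by CLIP" step can be sketched as a simple re-ranking by similarity. The example assumes the text and image embeddings have already been produced by some CLIP-style model (no model is loaded here, and random vectors stand in for real embeddings) and that cosine similarity is the score; `select_best` is a hypothetical helper.

```python
import numpy as np

def select_best(text_emb, image_embs):
    """Return the index and cosine similarity of the image closest to the text.

    text_emb: (d,) text embedding; image_embs: (n, d) candidate image embeddings.
    """
    t = text_emb / np.linalg.norm(text_emb)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = v @ t                              # cosine similarity per candidate
    best = int(sims.argmax())
    return best, float(sims[best])

rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)
image_embs = rng.normal(size=(60, 512))       # 60 candidates, as in the text
image_embs[17] = 3.0 * text_emb               # plant one perfectly aligned candidate
best, sim = select_best(text_emb, image_embs)
print(best)                                   # 17
```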


Text-to-Video (T2V) fine-tuning: We compare NÜWA on the Kinetics [16] dataset quantitatively in Tab. 2 and qualitatively in Fig. 4. Following TFGAN [2], we evaluate the visual quality on FID-img and FID-vid metrics and semantic consistency on the accuracy of the label of the generated video. As shown in Tab. 2, NÜWA achieves the best performance on all the above metrics. In Fig. 4, we also show the strong zero-shot ability for generating unseen text, such as "playing golf at swimming pool" or "running on the sea".

Video Prediction (V2V) fine-tuning: We compare NÜWA on the BAIR Robot Pushing [10] dataset quantitatively in Tab. 3. Cond. denotes the number of frames given to predict future frames. For a fair comparison, all the models use a resolution of 64 × 64. Although given only one frame as condition (Cond.), NÜWA still significantly pushes the state-of-the-art FVD [38] score from 94±2 to 86.9.


Sketch-to-Image (S2I) fine-tuning: We compare NÜWA on MSCOCO stuff [22] qualitatively in Fig. 5. NÜWA generates realistic buses of great variety compared with Taming-Transformers [11] and SPADE [27]. Even the reflection of the bus window is clearly visible.

Image Completion (I2I) zero-shot evaluation: We compare NÜWA in a zero-shot manner qualitatively in Fig. 6. Given the top half of the tower, compared with Taming-Transformers [11], NÜWA shows a richer imagination of what the lower half of the tower could look like, including buildings, lakes, flowers, grass, trees, mountains, etc.


Text-Guided Image Manipulation (TI2I) zero-shot evaluation: We compare NÜWA in a zero-shot manner qualitatively in Fig. 7. Compared with Paint By Word [3], NÜWA shows strong manipulation ability, generating high-quality text-consistent results while not changing other parts of the image. For example, in the third row, the blue firetruck generated by NÜWA is more realistic, while the buildings behind it show no change. This benefits from real-world visual patterns learned by multi-task pre-training on various visual tasks. Another advantage is the inference speed of NÜWA, which takes about 50 seconds to generate an image, while Paint By Word requires additional training during inference and takes about 300 seconds to converge.

Sketch-to-Video (S2V) fine-tuning and Text-Guided Video Manipulation (TV2V) zero-shot evaluation: As far as we know, open-domain S2V and TV2V are tasks first proposed in this paper. Since there are no existing models to compare against, we instead cover them in the Ablation Study in Sec. 4.3.


More detailed comparisons and samples, including human evaluations, are provided in the appendix.


Figure 5. Qualitative comparison with state-of-the-art models for Sketch-to-Image (S2I) task on MSCOCO stuff dataset.

Figure 6. Qualitative comparison with the state-of-the-art model for Image Completion (I2I) task in a zero-shot manner.

Figure 7. Qualitative comparison with state-of-the-art models for text-guided image manipulation (TI2I) in a zero-shot manner.

Figure 8. Reconstruction samples of VQ-GAN and VQ-GAN-Seg.


Table 4. Effectiveness of different VQ-VAE (VQ-GAN) settings.

Table 5. Effectiveness of multi-task pre-training for Text-to-Video (T2V) generation task on MSRVTT dataset.

Table 6. Effectiveness of 3D nearby attention for Sketch-to-Video (S2V) task on VSPW dataset.


4.3. Ablation Study

The upper part of Tab. 4 shows the effectiveness of different VQ-VAE (VQ-GAN) settings. We experiment on ImageNet [34] and OpenImages [19]. R denotes the raw resolution and D denotes the number of discrete tokens. The compression rate is denoted as Fx, where x is the quotient of $\sqrt{R}$ divided by $\sqrt{D}$. Comparing the first two rows in Tab. 4, VQ-GAN shows significantly better Fréchet Inception Distance (FID) [14] and Structural Similarity Index (SSIM) scores than VQ-VAE. Comparing Rows 2 and 3, we find that the number of discrete tokens, rather than the compression rate, is the key factor leading to higher visual quality. Although Row 2 and Row 4 have the same compression rate F16, they have different FID scores of 6.04 and 4.79. So what matters is not only how much we compress the original image, but also how many discrete tokens are used to represent an image. This is in line with cognitive logic: it is too ambiguous to represent human faces with just one token. In practice, we find that $16^2$ discrete tokens usually lead to poor performance, especially for human faces, and $32^2$ tokens show the best performance. However, more discrete tokens mean more computation, especially for videos. We finally use a trade-off version for our pre-training: $21^2$ tokens. By training on the Open Images dataset, we further improve the FID score of the $21^2$ version from 4.79 to 4.31.

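The compression rate Fx can be checked with a few lines of arithmetic. Only the 336 × 336 input with $21^2$ tokens comes from the text; the 256-pixel inputs below are illustrative assumptions used to show how a token grid of $16^2$ or $32^2$ changes x, and the function name is hypothetical.

```python
import math

def compression_rate(raw_side, token_side):
    """Return x in 'Fx', where x = sqrt(R) / sqrt(D) for square inputs.

    raw_side: image side in pixels (R = raw_side ** 2)
    token_side: token-grid side (D = token_side ** 2)
    """
    R, D = raw_side ** 2, token_side ** 2
    return math.isqrt(R) // math.isqrt(D)

print(compression_rate(336, 21))   # 16, i.e. F16 with 21^2 = 441 tokens
print(compression_rate(256, 16))   # 16, i.e. F16 with 16^2 = 256 tokens
print(compression_rate(256, 32))   # 8, i.e. F8 with 32^2 = 1024 tokens
```

Two settings can therefore share the same rate F16 while using different numbers of tokens, which is the distinction the paragraph above draws.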
The lower part of Tab. 4 shows the performance of VQ-GAN for sketches. VQ-GAN-Seg on MSCOCO [22] is trained for the Sketch-to-Image (S2I) task and VQ-GAN-Seg on VSPW [24] is trained for the Sketch-to-Video (S2V) task. Both backbones show good performance in Pixel Accuracy (PA) and Frequency Weighted Intersection over Union (FWIoU), which indicates a good quality of the 3D sketch representation used in our model. Fig. 8 also shows some reconstructed samples of 336×336 images and sketches.

Tab. 5 shows the effectiveness of multi-task pre-training for the Text-to-Video (T2V) generation task. We study a challenging dataset, MSR-VTT [46], with natural descriptions and real-world videos. Compared with training only on a single T2V task (Row 1), training on both T2V and T2I (Row 2) improves the CLIPSIM from 0.2314 to 0.2379. This is because T2I helps to build a connection between text and image, and is thus helpful for the semantic consistency of the T2V task. In contrast, training on both T2V and V2V (Row 3) improves the FVD score from 52.98 to 51.81. This is because V2V helps to learn a common unconditional video pattern, and is thus helpful for the visual quality of the T2V task. As the default setting of NÜWA, training on all three tasks achieves the best performance.

Tab. 6 shows the effectiveness of 3D nearby attention for the Sketch-to-Video (S2V) task on the VSPW [24] dataset. We study the S2V task because both the encoder and decoder of this task are fed with 3D video data. To evaluate the semantic consistency for S2V, we propose a new metric called Detected PA, which uses a semantic segmentation model [49] to segment each frame of the generated video and then calculates the pixel accuracy between the generated segments and the input video sketch. The default NÜWA setting in the last row, with both nearby encoder and nearby decoder, achieves the best FID-vid and Detected PA. The performance drops if either the encoder or the decoder is replaced by full attention, showing that focusing on nearby conditions and nearby generated results is better than simply considering all the information. We compare nearby-sparse and axial-sparse attention in two respects. Firstly, the computational complexity of nearby-sparse attention is $O((hws)(e^h e^w e^s))$ and that of axial-sparse attention is $O((hws)(h + w + s))$. For generating long videos (larger $s$), nearby-sparse attention will be more computationally efficient. Secondly, nearby-sparse attention has better performance than axial-sparse attention in visual generation tasks, because it attends to "nearby" locations containing interactions between both spatial and temporal axes, while axial-sparse attention handles different axes separately and only considers interactions on the same axis.

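The two complexity expressions can be compared numerically by counting query-key pairs. The sketch below plugs in the video sizes given in the implementation details (a 21 × 21 × 10 token grid with extents of 3 on each axis); constant factors and the feature dimension are ignored, the full-attention count is added for reference, and the function name is illustrative.

```python
def attention_costs(h, w, s, e_h, e_w, e_s):
    """Count query-key pairs for three attention patterns over h*w*s tokens."""
    n = h * w * s
    return {
        "full": n * n,                       # every token attends to every token
        "axial": n * (h + w + s),            # one pass along each axis
        "nearby": n * (e_h * e_w * e_s),     # a fixed local window per token
    }

costs = attention_costs(h=21, w=21, s=10, e_h=3, e_w=3, e_s=3)
print(costs)
# {'full': 19448100, 'axial': 229320, 'nearby': 119070}
```

Because the window size does not depend on $s$, the nearby count grows linearly with the number of frames, whereas the axial count picks up an extra factor of $(h + w + s)$.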
Fig. 9 shows a new task proposed in this paper, which we call "Text-Guided Video Manipulation (TV2V)". TV2V aims to change the future of a video starting from a selected frame guided by text. All samples start to change the future of the video from the second frame. The first row shows the original video frames, where a diver is swimming in the water. After feeding "The diver is swimming to the surface" into NÜWA's encoder and providing the first video frame, NÜWA successfully generates a video with the diver swimming to the surface in the second row. The third row shows another successful sample that lets the diver swim to the bottom. What if we want the diver to fly to the sky? The fourth row shows that NÜWA can do that as well, with the diver flying upward like a rocket.

5. Conclusion

In this paper, we present NÜWA as a unified pre-trained model that can generate new or manipulate existing images and videos for 8 visual synthesis tasks. Several contributions are made here, including (1) a general 3D encoder-decoder framework covering texts, images, and videos at the same time; (2) a nearby-sparse attention mechanism that considers the nearby characteristic of both spatial and temporal axes; (3) comprehensive experiments on 8 synthesis tasks. This is our first step towards building an AI platform to enable visual world creation and help content creators.

