
Paper: "First Order Motion Model for Image Animation" (Translation and Commentary)

Table of Contents

"First Order Motion Model for Image Animation": Translation and Commentary

Abstract

1 Introduction

2 Related work

3 Method

3.1 Local Affine Transformations for Approximate Motion Description

3.2 Occlusion-aware Image Generation

3.3 Training Losses

3.4 Testing Stage: Relative Motion Transfer

4 Experiments

5 Conclusions



(Still being updated…)

"First Order Motion Model for Image Animation": Translation and Commentary

Related paper

《First Order Motion Model for Image Animation》

https://papers.nips.cc/paper/8935-first-order-motion-model-for-image-animation

Authors: Aliaksandr Siarohin (DISI, University of Trento, aliaksandr.siarohin@unitn.it); Stéphane Lathuilière (DISI, University of Trento; LTCI, Télécom Paris, Institut Polytechnique de Paris, stephane.lathuilire@telecom-paris.fr); Sergey Tulyakov (Snap Inc., stulyakov@snap.com); Elisa Ricci (DISI, University of Trento; Fondazione Bruno Kessler, e.ricci@unitn.it); Nicu Sebe (DISI, University of Trento; Huawei Technologies Ireland, niculae.sebe@unitn.it)

GitHub: https://github.com/AliaksandrSiarohin/first-order-model


Abstract

Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video. Our framework addresses this problem without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g. faces, human bodies), our method can be applied to any object of this class. To achieve this, we decouple appearance and motion information using a self-supervised formulation. To support complex motions, we use a representation consisting of a set of learned keypoints along with their local affine transformations. A generator network models occlusions arising during target motions and combines the appearance extracted from the source image and the motion derived from the driving video. Our framework scores best on diverse benchmarks and on a variety of object categories.


1 Introduction

Generating videos by animating objects in still images has countless applications across areas of interest including movie production, photography, and e-commerce. More precisely, image animation refers to the task of automatically synthesizing videos by combining the appearance extracted from a source image with motion patterns derived from a driving video. For instance, a face image of a certain person can be animated following the facial expressions of another individual (see Fig. 1). In the literature, most methods tackle this problem by assuming strong priors on the object representation (e.g. 3D model) [4] and resorting to computer graphics techniques [6, 33]. These approaches can be referred to as object-specific methods, as they assume knowledge about the model of the specific object to animate.



Recently, deep generative models have emerged as effective techniques for image animation and video retargeting [2, 41, 3, 42, 27, 28, 37, 40, 31, 21]. In particular, Generative Adversarial Networks (GANs) [14] and Variational Auto-Encoders (VAEs) [20] have been used to transfer facial expressions [37] or motion patterns [3] between human subjects in videos. Nevertheless, these approaches usually rely on pre-trained models in order to extract object-specific representations such as keypoint locations. Unfortunately, these pre-trained models are built using costly ground-truth data annotations [2, 27, 31] and are not available in general for an arbitrary object category. To address this issue, Siarohin et al. [28] recently introduced Monkey-Net, the first object-agnostic deep model for image animation. Monkey-Net encodes motion information via keypoints learned in a self-supervised fashion. At test time, the source image is animated according to the corresponding keypoint trajectories estimated in the driving video. The major weakness of Monkey-Net is that it poorly models object appearance transformations in the keypoint neighborhoods, since it assumes a zeroth-order model (as we show in Sec. 3.1). This leads to poor generation quality in the case of large object pose changes (see Fig. 4). To tackle this issue,

  • we propose to use a set of self-learned keypoints together with local affine transformations to model complex motions. We therefore call our method a first-order motion model.
  • Second, we introduce an occlusion-aware generator, which adopts an occlusion mask, automatically estimated, to indicate the object parts that are not visible in the source image and that should be inferred from the context. This is especially needed when the driving video contains large motion patterns and occlusions are typical.
  • Third, we extend the equivariance loss commonly used for keypoint detector training [18, 44] to improve the estimation of local affine transformations.
  • Fourth, we experimentally show that our method significantly outperforms state-of-the-art image animation methods and can handle high-resolution datasets where other approaches generally fail.
  • Finally, we release a new high-resolution dataset, Tai-Chi-HD, which we believe could become a reference benchmark for evaluating frameworks for image animation and video generation.



2 Related work

Video Generation. Earlier works on deep video generation discussed how spatio-temporal neural networks could render video frames from noise vectors [36, 26]. More recently, several approaches tackled the problem of conditional video generation. For instance, Wang et al. [38] combine a recurrent neural network with a VAE in order to generate face videos. Considering a wider range of applications, Tulyakov et al. [34] introduced MoCoGAN, a recurrent architecture adversarially trained in order to synthesize videos from noise, categorical labels or static images. Another typical case of conditional generation is the problem of future frame prediction, in which the generated video is conditioned on the initial frame [12, 23, 30, 35, 44]. Note that in this task, realistic predictions can be obtained by simply warping the initial video frame [1, 12, 35]. Our approach is closely related to these previous works since we use a warping formulation to generate video sequences. However, in the case of image animation, the applied spatial deformations are not predicted but given by the driving video.



Image Animation. Traditional approaches for image animation and video re-targeting [6, 33, 13] were designed for specific domains such as faces [45, 42], human silhouettes [8, 37, 27] or gestures [31] and required a strong prior on the animated object. For example, in face animation, the method of Zollhofer et al. [45] produced realistic results at the expense of relying on a 3D morphable model of the face. In many applications, however, such models are not available. Image animation can also be treated as a translation problem from one visual domain to another. For instance, Wang et al. [37] transferred human motion using the image-to-image translation framework of Isola et al. [16]. Similarly, Bansal et al. [3] extended conditional GANs by incorporating spatio-temporal cues in order to improve video translation between two given domains. In order to animate a single person, such approaches require hours of videos of that person labelled with semantic information, and therefore have to be retrained for each individual. In contrast to these works, we rely neither on labels, nor on prior information about the animated objects, nor on specific training procedures for each object instance. Furthermore, our approach can be applied to any object within the same category (e.g., faces, human bodies, robot arms, etc.).


Several approaches were proposed that do not require priors about the object. X2Face [40] uses a dense motion field in order to generate the output video via image warping. Similarly to us, they employ a reference pose that is used to obtain a canonical representation of the object. In our formulation, we do not require an explicit reference pose, leading to significantly simpler optimization and improved image quality. Siarohin et al. [28] introduced Monkey-Net, a self-supervised framework for animating arbitrary objects by using sparse keypoint trajectories. In this work, we also employ sparse trajectories induced by self-supervised keypoints. However, we model object motion in the neighbourhood of each predicted keypoint by a local affine transformation. Additionally, we explicitly model occlusions in order to indicate to the generator network the image regions that can be generated by warping the source image and the occluded areas that need to be inpainted.


3 Method

We are interested in animating an object depicted in a source image S based on the motion of a similar object in a driving video D. Since direct supervision is not available (pairs of videos in which objects move similarly), we follow a self-supervised strategy inspired from Monkey-Net [28]. For training, we employ a large collection of video sequences containing objects of the same object category. Our model is trained to reconstruct the training videos by combining a single frame and a learned latent representation of the motion in the video. Observing frame pairs, each extracted from the same video, it learns to encode motion as a combination of motion-specific keypoint displacements and local affine transformations. At test time we apply our model to pairs composed of the source image and of each frame of the driving video and perform image animation of the source object.
An overview of our approach is presented in Fig. 2. Our framework is composed of two main modules: the motion estimation module and the image generation module. The purpose of the motion estimation module is to predict a dense motion field from a frame D ∈ R^(3×H×W) of dimension H × W of the driving video D to the source frame S ∈ R^(3×H×W). The dense motion field is later used to align the feature maps computed from S with the object pose in D. The motion field is modeled by a function TS←D : R^2 → R^2 that maps each pixel location in D with its corresponding location in S. TS←D is often referred to as backward optical flow. We employ backward optical flow, rather than forward optical flow, since back-warping can be implemented efficiently in a differentiable manner using bilinear sampling [17]. We assume there exists an abstract reference frame R. We independently estimate two transformations: from R to S (TS←R) and from R to D (TD←R). Note that unlike X2Face [40] the reference frame is an abstract concept that cancels out in our derivations later. Therefore it is never explicitly computed and cannot be visualized. This choice allows us to independently process D and S. This is desired since, at test time, the model receives pairs of the source image and driving frames sampled from a different video, which can be very different visually. Instead of directly predicting TD←R and TS←R, the motion estimator module proceeds in two steps.
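
To make the role of the abstract reference frame explicit: since R only appears as an intermediate frame, the two independently estimated transformations compose so that R cancels out. Written out (a reconstruction from the definitions above, assuming TD←R is invertible in the relevant neighbourhoods), this is:

T_{S\leftarrow D} \;=\; T_{S\leftarrow R}\circ T_{R\leftarrow D} \;=\; T_{S\leftarrow R}\circ T_{D\leftarrow R}^{-1}

In practice the inverse is never formed globally; only its first-order behaviour around each keypoint is needed, which is exactly what Sec. 3.1 derives.
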
In the first step, we approximate both transformations from sets of sparse trajectories, obtained by using keypoints learned in a self-supervised way. The locations of the keypoints in D and S are separately predicted by an encoder-decoder network. The keypoint representation acts as a bottleneck resulting in a compact motion representation. As shown by Siarohin et al. [28], such a sparse motion representation is well-suited for animation since, at test time, the keypoints of the source image can be moved using the keypoint trajectories in the driving video. We model motion in the neighbourhood of each keypoint using local affine transformations. Compared to using keypoint displacements only, the local affine transformations allow us to model a larger family of transformations. We use a Taylor expansion to represent TD←R by a set of keypoint locations and affine transformations. To this end, the keypoint detector network outputs keypoint locations as well as the parameters of each affine transformation.
During the second step, a dense motion network combines the local approximations to obtain the resulting dense motion field T̂S←D. Furthermore, in addition to the dense motion field, this network outputs an occlusion mask ÔS←D that indicates which image parts of D can be reconstructed by warping of the source image and which parts should be inpainted, i.e. inferred from the context.
Finally, the generation module renders an image of the source object moving as provided in the driving video. Here, we use a generator network G that warps the source image according to T̂S←D and inpaints the image parts that are occluded in the source image. In the following sections we detail each of these steps and the training procedure.
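
The keypoint predictor itself is not detailed in this overview. A common way to implement such a self-supervised detector, and one plausible reading of the encoder-decoder mentioned above, is to output one heatmap per keypoint and convert it to coordinates with a spatial soft-argmax, with a parallel head regressing the four entries of the local affine (Jacobian) per keypoint. The sketch below only illustrates that idea; the function and tensor names are assumptions, not the authors' exact architecture.

import torch
import torch.nn.functional as F

def heatmaps_to_keypoints(heatmaps):
    # heatmaps: (B, K, H, W) raw scores from the encoder-decoder, one channel per keypoint.
    B, K, H, W = heatmaps.shape
    probs = F.softmax(heatmaps.view(B, K, -1), dim=-1).view(B, K, H, W)
    # Coordinate grid in [-1, 1] (the same convention later reused for warping).
    ys = torch.linspace(-1.0, 1.0, H, device=heatmaps.device)
    xs = torch.linspace(-1.0, 1.0, W, device=heatmaps.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    # Soft-argmax: expected coordinate under each per-keypoint spatial distribution.
    kp_x = (probs * grid_x).sum(dim=(2, 3))
    kp_y = (probs * grid_y).sum(dim=(2, 3))
    # A separate convolutional head would additionally predict (B, K, 2, 2) Jacobians.
    return torch.stack([kp_x, kp_y], dim=-1)  # (B, K, 2)
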


3.1 Local Affine Transformations for Approximate Motion Description

The motion estimation module estimates the backward optical flow TS←D from a driving frame D to the source frame S. As discussed above, we propose to approximate TS←D by its first-order Taylor expansion in a neighborhood of the keypoint locations. In the rest of this section, we describe the motivation behind this choice and detail the proposed approximation of TS←D.
We assume there exists an abstract reference frame R. Therefore, estimating TS←D consists in estimating TS←R and TR←D. Furthermore, given a frame X, we estimate each transformation TX←R in the neighbourhood of the learned keypoints. Formally, given a transformation TX←R, we consider its first-order Taylor expansions in K keypoints p1, . . . , pK. Here, p1, . . . , pK denote the coordinates of the keypoints in the reference frame R. Note that, for the sake of simplicity, in the following the point locations in the reference pose space are all denoted by p, while the point locations in the X, S or D pose spaces are denoted by z. We obtain:

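The equations themselves did not survive this copy of the article. Reconstructed from the surrounding definitions (so an approximation of the paper's Eqs. (1)-(4), not a verbatim quote), the first-order expansion of TX←R around a keypoint pk is

T_{X\leftarrow R}(p) = T_{X\leftarrow R}(p_k) + \left(\frac{d}{dp}T_{X\leftarrow R}(p)\Big|_{p=p_k}\right)(p - p_k) + o(\lVert p - p_k\rVert),

and composing the expansions for S and D so that the reference frame R cancels out gives, near each keypoint,

T_{S\leftarrow D}(z) \approx T_{S\leftarrow R}(p_k) + J_k\bigl(z - T_{D\leftarrow R}(p_k)\bigr), \qquad J_k = \left(\frac{d}{dp}T_{S\leftarrow R}(p)\Big|_{p=p_k}\right)\left(\frac{d}{dp}T_{D\leftarrow R}(p)\Big|_{p=p_k}\right)^{-1}.

The keypoint detector therefore outputs, per keypoint and per frame, a location and a 2×2 Jacobian; Jk is what the rest of the text calls the local affine transformation.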


Combining Local Motions. We employ a convolutional network P to estimate T̂S←D from the set of Taylor approximations of TS←D(z) in the keypoints and the original source frame S. Importantly, since T̂S←D maps each pixel location in D with its corresponding location in S, the local patterns in T̂S←D, such as edges or texture, are pixel-to-pixel aligned with D but not with S. This misalignment issue makes the task harder for the network to predict T̂S←D from S. In order to provide inputs already roughly aligned with T̂S←D, we warp the source frame S according to the local transformations estimated in Eq. (4). Thus, we obtain K transformed images S^1, . . . , S^K that are each aligned with T̂S←D in the neighbourhood of a keypoint. Importantly, we also consider an additional image S^0 = S for the background.
For each keypoint pk we additionally compute heatmaps Hk indicating to the dense motion network where each transformation happens. Each Hk(z) is implemented as the difference of two heatmaps centered in TD←R(pk) and TS←R(pk):

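The heatmap formula is also missing here. From the description above it is the difference of two Gaussian-like bumps centered at the driving and source keypoint locations; a reconstruction (with σ a variance hyperparameter, assumed) is

H_k(z) = \exp\!\left(-\frac{\lVert T_{D\leftarrow R}(p_k) - z\rVert^2}{\sigma}\right) - \exp\!\left(-\frac{\lVert T_{S\leftarrow R}(p_k) - z\rVert^2}{\sigma}\right).

The dense motion network then predicts soft masks M_0, \dots, M_K (background plus one per keypoint) that blend the K local affine flows into the final field, roughly

\hat{T}_{S\leftarrow D}(z) \approx M_0\, z + \sum_{k=1}^{K} M_k\Bigl(T_{S\leftarrow R}(p_k) + J_k\bigl(z - T_{D\leftarrow R}(p_k)\bigr)\Bigr),

again reconstructed from the paper's description rather than quoted from it.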


3.2 Occlusion-aware Image Generation

As mentioned in Sec. 3, the source image S is not pixel-to-pixel aligned with the image to be generated D̂. In order to handle this misalignment, we use a feature warping strategy similar to [29, 28, 15]. More precisely, after two down-sampling convolutional blocks, we obtain a feature map ξ ∈ R^(H0×W0) of dimension H0 × W0. We then warp ξ according to T̂S←D. In the presence of occlusions in S, optical flow may not be sufficient to generate D̂. Indeed, the occluded parts in S cannot be recovered by image-warping and thus should be inpainted. Consequently, we introduce an occlusion map ÔS←D ∈ [0, 1]^(H0×W0) to mask out the feature map regions that should be inpainted. Thus, the occlusion mask diminishes the impact of the features corresponding to the occluded parts. The transformed feature map is written as:

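The transformed-feature formula is missing from this copy; in the paper it amounts to multiplying the back-warped features by the occlusion map, roughly ξ′ = ÔS←D ⊙ fw(ξ, T̂S←D), where fw denotes back-warping. A minimal PyTorch-style sketch of that operation follows; it assumes the dense flow has already been converted to a sampling grid in the [-1, 1] convention expected by grid_sample, and the tensor names are illustrative only.

import torch
import torch.nn.functional as F

def occlusion_aware_warp(features, flow_grid, occlusion_map):
    # features:      (B, C, H0, W0) feature map computed from the source image S.
    # flow_grid:     (B, H0, W0, 2) backward flow, as normalized sampling coordinates.
    # occlusion_map: (B, 1, H0, W0) values in [0, 1]; low values mark regions to be inpainted.
    warped = F.grid_sample(features, flow_grid, mode="bilinear",
                           padding_mode="border", align_corners=True)
    # Down-weighting occluded regions tells the decoder to hallucinate them from context.
    return occlusion_map * warped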


3.3 Training Losses

We train our system in an end-to-end fashion combining several losses. First, we use a reconstruction loss based on the perceptual loss of Johnson et al. [19], using the pre-trained VGG-19 network, as our main driving loss. The loss is based on the implementation of Wang et al. [37]. With the input driving frame D and the corresponding reconstructed frame D̂, the reconstruction loss is written as:

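The loss formula is missing here. As described, it is a VGG-19 perceptual loss, which can be reconstructed roughly as

L_{rec}(\hat{D}, D) = \sum_{i} \bigl| N_i(\hat{D}) - N_i(D) \bigr|,

where N_i(\cdot) denotes the i-th channel of a chosen pre-trained VGG-19 feature layer and the sum runs over the selected layers (and, in the pyramid variant used in the ablation, over several input resolutions). This is a reconstruction from the surrounding text, not the paper's exact equation.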

Imposing Equivariance Constraint. Our keypoint predictor does not require any keypoint annotations during training. This may lead to unstable performance. The equivariance constraint is one of the most important factors driving the discovery of unsupervised keypoints [18, 43]. It forces the model to predict consistent keypoints with respect to known geometric transformations. We use thin plate spline deformations as they were previously used in unsupervised keypoint detection [18, 43] and are similar to natural image deformations. Since our motion estimator does not only predict the keypoints, but also the Jacobians, we extend the well-known equivariance loss to additionally include constraints on the Jacobians.
We assume that an image X undergoes a known spatial deformation TX←Y. In this case TX←Y can be an affine transformation or a thin plate spline deformation. After this deformation we obtain a new image Y. Now, by applying our extended motion estimator to both images, we obtain a set of local approximations for TX←R and TY←R. The standard equivariance constraint writes as:

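Eqs. (11) and (12) are not reproduced in this copy. Based on the text, the constraint for the keypoints and its extension to the Jacobians should read approximately as

T_{X\leftarrow R}(p_k) \;\equiv\; T_{X\leftarrow Y}\circ T_{Y\leftarrow R}(p_k), \tag{11}

\frac{d}{dp}T_{X\leftarrow R}(p)\Big|_{p=p_k} \;\equiv\; \left(\frac{d}{dp}T_{X\leftarrow Y}(p)\Big|_{p=T_{Y\leftarrow R}(p_k)}\right)\left(\frac{d}{dp}T_{Y\leftarrow R}(p)\Big|_{p=p_k}\right), \tag{12}

i.e. estimating the keypoints and Jacobians on the deformed image Y and then applying the known deformation TX←Y must agree with estimating them directly on X (a reconstruction, not a verbatim quote of the paper).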

Note that the constraint Eq. (11) is strictly the same as the standard equivariance constraint for the keypoints [18, 43]. During training, we constrain every keypoint location using a simple L1 loss between the two sides of Eq. (11). However, implementing the second constraint from Eq. (12) with L1 would force the magnitude of the Jacobians to zero and would lead to numerical problems. To this end, we reformulate this constraint in the following way:


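The reformulated constraint (the paper's Eq. (13)) is also missing. The usual way to avoid the degenerate zero-Jacobian solution, and presumably what the paper does, is to move one Jacobian's inverse to the other side and penalize the deviation of the product from the identity with an L1 loss:

\left\| \mathbb{1} - \left(\frac{d}{dp}T_{X\leftarrow R}(p)\Big|_{p=p_k}\right)^{-1}\left(\frac{d}{dp}T_{X\leftarrow Y}(p)\Big|_{p=T_{Y\leftarrow R}(p_k)}\right)\left(\frac{d}{dp}T_{Y\leftarrow R}(p)\Big|_{p=p_k}\right) \right\|_{1},

where 1 is the 2×2 identity matrix. Since the target is now the identity rather than a raw Jacobian, the loss no longer drives the Jacobian magnitudes toward zero. This is a reconstruction and may differ in detail from the paper's Eq. (13).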


3.4 Testing Stage: Relative Motion Transfer

At this stage our goal is to animate an object in a source frame S1 using the driving video D1, . . . , DT. Each frame Dt is independently processed to obtain St. Rather than transferring the motion encoded in TS1←Dt(pk) to S1, we transfer the relative motion between D1 and Dt to S1. In other words, we apply a transformation TDt←D1(p) to the neighbourhood of each keypoint pk:

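Eq. (14) is not reproduced in this copy. In the publicly released first-order-model code, relative motion transfer amounts to shifting each source keypoint by the displacement of the corresponding driving keypoint with respect to the first driving frame, and composing the Jacobians accordingly. The sketch below paraphrases that logic; the dictionary layout and names are assumptions rather than the exact repository API.

import torch

def relative_kp_transfer(kp_source, kp_driving, kp_driving_initial):
    # Each argument: {"value": (B, K, 2) keypoint coordinates, "jacobian": (B, K, 2, 2)}.
    # Keypoints: move the source keypoints by the driving displacement D_t - D_1.
    value = kp_source["value"] + (kp_driving["value"] - kp_driving_initial["value"])
    # Jacobians: apply the relative driving deformation J_t J_1^{-1} on top of the source Jacobian.
    jac_rel = torch.matmul(kp_driving["jacobian"],
                           torch.inverse(kp_driving_initial["jacobian"]))
    jacobian = torch.matmul(jac_rel, kp_source["jacobian"])
    return {"value": value, "jacobian": jacobian}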


Detailed mathematical derivations are provided in the Sup. Mat. Intuitively, we transform the neighbourhood of each keypoint pk in S1 according to its local deformation in the driving video. Indeed, transferring relative motion over absolute coordinates allows transferring only the relevant motion patterns, while preserving the global object geometry. Conversely, when transferring absolute coordinates, as in X2Face [40], the generated frame inherits the object proportions of the driving video. It is important to note that one limitation of transferring relative motion is that we need to assume that the objects in S1 and D1 have similar poses (see [28]). Without an initial rough alignment, Eq. (14) may lead to absolute keypoint locations that are physically impossible for the object of interest.


4 Experiments

Datasets. We train and test our method on four different datasets containing various objects. Our model is capable of rendering videos of much higher resolution compared to [28] in all our experiments.

  • The VoxCeleb dataset [22] is a face dataset of 22496 videos, extracted from YouTube videos. For pre-processing, we extract an initial bounding box in the first video frame. We track this face until it is too far away from the initial position. Then, we crop the video frames using the smallest crop containing all the bounding boxes. The process is repeated until the end of the sequence. We filter out sequences that have resolution lower than 256 × 256 and the remaining videos are resized to 256 × 256 preserving the aspect ratio. It is important to note that, compared to X2Face [40], we obtain more natural videos where faces move freely within the bounding box. Overall, we obtain 12331 training videos and 444 test videos, with lengths varying from 64 to 1024 frames.
  • The UvA-Nemo dataset [9] is a facial analysis dataset that consists of 1240 videos. We apply the exact same pre-processing as for VoxCeleb. Each video starts with a neutral expression. Similar to Wang et al. [38], we use 1116 videos for training and 124 for evaluation.
  • The BAIR robot pushing dataset [10] contains videos collected by a Sawyer robotic arm pushing diverse objects over a table. It consists of 42880 training and 128 test videos. Each video is 30 frames long and has a 256 × 256 resolution.
  • Following Tulyakov et al. [34], we collected 280 tai-chi videos from YouTube. We use 252 videos for training and 28 for testing. Each video is split into short clips as described in the pre-processing of the VoxCeleb dataset. We retain only high-quality videos and resize all the clips to 256 × 256 pixels (instead of 64 × 64 pixels in [34]). Finally, we obtain 3049 and 285 video chunks for training and testing respectively, with video length varying from 128 to 1024 frames. This dataset is referred to as the Tai-Chi-HD dataset. The dataset will be made publicly available.

Evaluation Protocol. Evaluating the quality of image animation is not obvious, since ground truth animations are not available. We follow the evaluation protocol of Monkey-Net [28]. First, we quantitatively evaluate each method on the "proxy" task of video reconstruction. This task consists of reconstructing the input video from a representation in which appearance and motion are decoupled. In our case, we reconstruct the input video by combining the sparse motion representation in (2) of each frame and the first video frame. Second, we evaluate our model on image animation according to a user study. In all experiments we use K=10 as in [28]. Other implementation details are given in the Sup. Mat.


Metrics. To evaluate video reconstruction, we adopt the metrics proposed in Monkey-Net [28]; a small computation sketch follows the list below:
  • L1. We report the average L1 distance between the generated and the ground-truth videos.
  • Average Keypoint Distance (AKD). For the Tai-Chi-HD, VoxCeleb and Nemo datasets, we use 3rd-party pre-trained keypoint detectors in order to evaluate whether the motion of the input video is preserved. For the VoxCeleb and Nemo datasets we use the facial landmark detector of Bulat et al. [5]. For the Tai-Chi-HD dataset, we employ the human-pose estimator of Cao et al. [7]. These keypoints are independently computed for each frame. AKD is obtained by computing the average distance between the detected keypoints of the ground truth and of the generated video.
  • Missing Keypoint Rate (MKR). In the case of Tai-Chi-HD, the human-pose estimator returns an additional binary label for each keypoint indicating whether or not the keypoints were successfully detected. Therefore, we also report the MKR, defined as the percentage of keypoints that are detected in the ground truth frame but not in the generated one. This metric assesses the appearance quality of each generated frame.
  • Average Euclidean Distance (AED). Considering an externally trained image representation, we report the average euclidean distance between the ground truth and generated frame representations, similarly to Esser et al. [11]. We employ the feature embedding used in Monkey-Net [28].
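
As a concrete illustration of how the reconstruction metrics can be computed, here is a small NumPy sketch of L1, AKD and MKR. The array shapes and the convention that undetected keypoints are marked with NaN coordinates are assumptions made for the sketch; the third-party keypoint detectors themselves are not shown.

import numpy as np

def l1_metric(generated, ground_truth):
    # Both: (T, H, W, 3) videos with pixel values in [0, 1].
    return np.abs(generated - ground_truth).mean()

def akd_mkr(kp_generated, kp_ground_truth):
    # Both: (T, N, 2) detected keypoints per frame; NaN rows mark missed detections.
    detected = ~np.isnan(kp_generated).any(axis=-1)
    gt_detected = ~np.isnan(kp_ground_truth).any(axis=-1)
    both = detected & gt_detected
    # AKD: mean distance over keypoints detected in both the real and the generated video.
    akd = np.linalg.norm(kp_generated[both] - kp_ground_truth[both], axis=-1).mean()
    # MKR: fraction of keypoints found in the ground truth but missed in the generated frames.
    mkr = float((gt_detected & ~detected).sum()) / max(int(gt_detected.sum()), 1)
    return akd, mkr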

Ablation Study. We compare the following variants of our model. Baseline: the simplest model, trained without the occlusion mask (OS←D=1 in Eq. (8)) and without jacobians (Jk = 1 in Eq. (4)), supervised with Lrec at the highest resolution only; Pyr.: the pyramid loss is added to Baseline; Pyr.+OS←D: with respect to Pyr., we replace the generator network with the occlusion-aware network; Jac. w/o Eq. (12): our model with local affine transformations but without the equivariance constraint on the jacobians in Eq. (12); Full: the full model including the local affine transformations described in Sec. 3.1.
In Tab. 1, we report the quantitative ablation. First, the pyramid loss leads to better results according to all the metrics except AKD. Second, adding OS←D to the model consistently improves all the metrics with respect to Pyr. This illustrates the benefit of explicitly modeling occlusions. We found that without the equivariance constraint over the jacobians, Jk becomes unstable, which leads to poor motion estimations. Finally, our Full model further improves all the metrics. In particular, we note that, with respect to the Baseline model, the MKR of the full model is smaller by a factor of 2.75. This shows that our rich motion representation helps generate more realistic images. These results are confirmed by our qualitative evaluation in Fig. 3, where we compare the Baseline and the Full models. In these experiments, each frame D of the input video is reconstructed from its first frame (first column) and the estimated keypoint trajectories. We note that the Baseline model does not locate any keypoints in the arms area. Consequently, when the pose difference with the initial pose increases, the model cannot reconstruct the video (columns 3, 4 and 5). In contrast, the Full model learns to detect a keypoint on each arm, and therefore to more accurately reconstruct the input video even in the case of complex motion.


Comparison with the State of the Art. We now compare our method with the state of the art on the video reconstruction task as in [28]. To the best of our knowledge, X2Face [40] and Monkey-Net [28] are the only previous approaches for model-free image animation. Quantitative results are reported in Tab. 3. We observe that our approach consistently improves every single metric for each of the four different datasets. Even on the two face datasets, VoxCeleb and Nemo, our approach clearly outperforms X2Face, which was originally proposed for face generation. The better performance of our approach compared to X2Face is especially impressive since X2Face exploits a larger motion embedding (128 floats) than our approach (60 = K*(2+4) floats). Compared to Monkey-Net, which uses a motion representation with a similar dimension (50 = K*(2+3)), the advantages of our approach are clearly visible on the Tai-Chi-HD dataset that contains highly non-rigid objects (i.e. human bodies).
We now report a qualitative comparison for image animation. Generated sequences are reported in Fig. 4. The results are well in line with the quantitative evaluation in Tab. 3. Indeed, in both examples, X2Face and Monkey-Net are not able to correctly transfer the body motion in the driving video, instead warping the human body in the source image as a blob. Conversely, our approach is able to generate significantly better looking videos in which each body part is independently animated. This qualitative evaluation illustrates the potential of our rich motion description. We complete our evaluation with a user study. We ask users to select the most realistic image animation. Each question consists of the source image, the driving video, and the corresponding results of our method and a competing method. We require each question to be answered by 10 AMT workers. This evaluation is repeated on 50 different input pairs. Results are reported in Tab. 2. We observe that our method is clearly preferred over the competitor methods. Interestingly, the largest difference with the state of the art is obtained on Tai-Chi-HD: the most challenging dataset in our evaluation due to its rich motions.

5 Conclusions

We presented a novel approach for image animation based on keypoints and local affine transformations. Our novel mathematical formulation describes the motion field between two frames and is efficiently computed by deriving a first-order Taylor expansion approximation. In this way, motion is described as a set of keypoint displacements and local affine transformations. A generator network combines the appearance of the source image and the motion representation of the driving video. In addition, we proposed to explicitly model occlusions in order to indicate to the generator network which image parts should be inpainted. We evaluated the proposed method both quantitatively and qualitatively and showed that our approach clearly outperforms the state of the art on all the benchmarks.
