Paper: "First Order Motion Model for Image Animation" (Translation and Commentary)
Contents
"First Order Motion Model for Image Animation": Translation and Commentary
Abstract
1 Introduction
2 Related work
3 Method
3.1 Local Affine Transformations for Approximate Motion Description
3.2 Occlusion-aware Image Generation
3.3 Training Losses
3.4 Testing Stage: Relative Motion Transfer
4 Experiments
5 Conclusions
(Updating…)
"First Order Motion Model for Image Animation": Translation and Commentary
| Paper | "First Order Motion Model for Image Animation" (https://papers.nips.cc/paper/8935-first-order-motion-model-for-image-animation). Authors: Aliaksandr Siarohin (DISI, University of Trento, aliaksandr.siarohin@unitn.it); Stéphane Lathuilière (DISI, University of Trento; LTCI, Télécom Paris, Institut polytechnique de Paris, stephane.lathuilire@telecom-paris.fr); Sergey Tulyakov (Snap Inc., stulyakov@snap.com); Elisa Ricci (DISI, University of Trento; Fondazione Bruno Kessler, e.ricci@unitn.it); Nicu Sebe (DISI, University of Trento; Huawei Technologies Ireland, niculae.sebe@unitn.it) |
| GitHub | https://github.com/AliaksandrSiarohin/first-order-model |
Abstract
Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video. Our framework addresses this problem without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g. faces, human bodies), our method can be applied to any object of this class. To achieve this, we decouple appearance and motion information using a self-supervised formulation. To support complex motions, we use a representation consisting of a set of learned keypoints along with their local affine transformations. A generator network models occlusions arising during target motions and combines the appearance extracted from the source image and the motion derived from the driving video. Our framework scores best on diverse benchmarks and on a variety of object categories.
1 Introduction
Generating videos by animating objects in still images has countless applications across areas of interest including movie production, photography, and e-commerce. More precisely, image animation refers to the task of automatically synthesizing videos by combining the appearance extracted from a source image with motion patterns derived from a driving video. For instance, a face image of a certain person can be animated following the facial expressions of another individual (see Fig. 1). In the literature, most methods tackle this problem by assuming strong priors on the object representation (e.g. 3D model) [4] and resorting to computer graphics techniques [6, 33]. These approaches can be referred to as object-specific methods, as they assume knowledge about the model of the specific object to animate.
Recently, deep generative models have emerged as effective techniques for image animation and video retargeting [2, 41, 3, 42, 27, 28, 37, 40, 31, 21]. In particular, Generative Adversarial Networks (GANs) [14] and Variational Auto-Encoders (VAEs) [20] have been used to transfer facial expressions [37] or motion patterns [3] between human subjects in videos. Nevertheless, these approaches usually rely on pre-trained models in order to extract object-specific representations such as keypoint locations. Unfortunately, these pre-trained models are built using costly ground-truth data annotations [2, 27, 31] and are not available in general for an arbitrary object category. To address this issue, Siarohin et al. [28] recently introduced Monkey-Net, the first object-agnostic deep model for image animation. Monkey-Net encodes motion information via keypoints learned in a self-supervised fashion. At test time, the source image is animated according to the corresponding keypoint trajectories estimated in the driving video. The major weakness of Monkey-Net is that it poorly models object appearance transformations in the keypoint neighborhoods, since it assumes a zeroth-order motion model (as we show in Sec. 3.1). This leads to poor generation quality in the case of large object pose changes (see Fig. 4). To tackle this issue, we propose to use a set of self-learned keypoints together with local affine transformations to model complex motions, which is why we call our approach a first order motion model.
2 Related work
Video Generation. Earlier works on deep video generation discussed how spatio-temporal neural networks could render video frames from noise vectors [36, 26]. More recently, several approaches tackled the problem of conditional video generation. For instance, Wang et al. [38] combine a recurrent neural network with a VAE in order to generate face videos. Considering a wider range of applications, Tulyakov et al. [34] introduced MoCoGAN, a recurrent architecture adversarially trained in order to synthesize videos from noise, categorical labels or static images. Another typical case of conditional generation is the problem of future frame prediction, in which the generated video is conditioned on the initial frame [12, 23, 30, 35, 44]. Note that in this task, realistic predictions can be obtained by simply warping the initial video frame [1, 12, 35]. Our approach is closely related to these previous works since we use a warping formulation to generate video sequences. However, in the case of image animation, the applied spatial deformations are not predicted but given by the driving video.
Image Animation. Traditional approaches for image animation and video re-targeting [6, 33, 13] were designed for specific domains such as faces [45, 42], human silhouettes [8, 37, 27] or gestures [31] and required a strong prior on the animated object. For example, in face animation, the method of Zollhofer et al. [45] produced realistic results at the expense of relying on a 3D morphable model of the face. In many applications, however, such models are not available. Image animation can also be treated as a translation problem from one visual domain to another. For instance, Wang et al. [37] transferred human motion using the image-to-image translation framework of Isola et al. [16]. Similarly, Bansal et al. [3] extended conditional GANs by incorporating spatio-temporal cues in order to improve video translation between two given domains. In order to animate a single person, such approaches require hours of videos of that person labelled with semantic information, and therefore have to be retrained for each individual. In contrast to these works, we rely neither on labels, nor on prior information about the animated objects, nor on specific training procedures for each object instance. Furthermore, our approach can be applied to any object within the same category (e.g. faces, human bodies, robot arms, etc.).
Several approaches were proposed that do not require priors about the object. X2Face [40] uses a dense motion field in order to generate the output video via image warping. Similarly to us, they employ a reference pose that is used to obtain a canonical representation of the object. In our formulation, we do not require an explicit reference pose, leading to significantly simpler optimization and improved image quality. Siarohin et al. [28] introduced Monkey-Net, a self-supervised framework for animating arbitrary objects by using sparse keypoint trajectories. In this work, we also employ sparse trajectories induced by self-supervised keypoints. However, we model object motion in the neighbourhood of each predicted keypoint by a local affine transformation. Additionally, we explicitly model occlusions in order to indicate to the generator network the image regions that can be generated by warping the source image and the occluded areas that need to be inpainted.
3 Method
We are interested in animating an object depicted in a source image S based on the motion of a similar object in a driving video D. Since direct supervision is not available (pairs of videos in which objects move similarly), we follow a self-supervised strategy inspired by Monkey-Net [28]. For training, we employ a large collection of video sequences containing objects of the same object category. Our model is trained to reconstruct the training videos by combining a single frame and a learned latent representation of the motion in the video. Observing frame pairs, each extracted from the same video, it learns to encode motion as a combination of motion-specific keypoint displacements and local affine transformations. At test time, we apply our model to pairs composed of the source image and of each frame of the driving video and perform image animation of the source object.
An overview of our approach is presented in Fig. 2. Our framework is composed of two main modules: the motion estimation module and the image generation module. The purpose of the motion estimation module is to predict a dense motion field from a frame D ∈ R^{3×H×W} of dimension H×W of the driving video D to the source frame S ∈ R^{3×H×W}. The dense motion field is later used to align the feature maps computed from S with the object pose in D. The motion field is modeled by a function T_{S←D}: R^2 → R^2 that maps each pixel location in D to its corresponding location in S; T_{S←D} is often referred to as backward optical flow. We employ backward optical flow, rather than forward optical flow, since back-warping can be implemented efficiently in a differentiable manner using bilinear sampling [17]. We assume there exists an abstract reference frame R, and we independently estimate two transformations: from R to S (T_{S←R}) and from R to D (T_{D←R}). Note that, unlike in X2Face [40], the reference frame is an abstract concept that cancels out in our derivations later; therefore it is never explicitly computed and cannot be visualized. This choice allows us to independently process D and S. This is desired since, at test time, the model receives pairs of the source image and driving frames sampled from a different video, which can be very different visually. Instead of directly predicting T_{D←R} and T_{S←R}, the motion estimator module proceeds in two steps.
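The back-warping referred to above can be implemented with differentiable bilinear sampling. Below is a minimal PyTorch sketch; the tensor shapes, the normalised-coordinate convention and the identity-flow example are illustrative assumptions rather than the paper's actual code.

```python
import torch
import torch.nn.functional as F

def backward_warp(source, flow):
    """Warp `source` (B, C, H, W) with a dense backward flow such as T_{S<-D}.

    `flow` (B, H, W, 2) stores, for every pixel of the driving frame D, its
    corresponding (x, y) location in S, normalised to [-1, 1] as expected by
    grid_sample. Bilinear sampling keeps the operation differentiable.
    """
    return F.grid_sample(source, flow, mode='bilinear',
                         padding_mode='border', align_corners=True)

# Toy usage: an identity flow returns the source (up to interpolation error).
B, C, H, W = 1, 3, 64, 64
source = torch.rand(B, C, H, W)
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing='ij')
identity_flow = torch.stack([xs, ys], dim=-1).unsqueeze(0)  # (1, H, W, 2)
warped = backward_warp(source, identity_flow)
```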
In the first step, we approximate both transformations from sets of sparse trajectories, obtained by using keypoints learned in a self-supervised way. The locations of the keypoints in D and S are separately predicted by an encoder-decoder network. The keypoint representation acts as a bottleneck resulting in a compact motion representation. As shown by Siarohin et al. [28], such a sparse motion representation is well-suited for animation since, at test time, the keypoints of the source image can be moved using the keypoint trajectories in the driving video. We model motion in the neighbourhood of each keypoint using local affine transformations. Compared to using keypoint displacements only, the local affine transformations allow us to model a larger family of transformations. We use a Taylor expansion to represent T_{D←R} by a set of keypoint locations and affine transformations. To this end, the keypoint detector network outputs keypoint locations as well as the parameters of each affine transformation.

During the second step, a dense motion network combines the local approximations to obtain the resulting dense motion field T̂_{S←D}. Furthermore, in addition to the dense motion field, this network outputs an occlusion mask Ô_{S←D} that indicates which image parts of D can be reconstructed by warping of the source image and which parts should be inpainted, i.e. inferred from the context.

Finally, the generation module renders an image of the source object moving as provided in the driving video. Here, we use a generator network G that warps the source image according to T̂_{S←D} and inpaints the image parts that are occluded in the source image. In the following sections we detail each of these steps and the training procedure.
3.1 Local Affine Transformations for Approximate Motion Description
The motion estimation module estimates the backward optical flow T_{S←D} from a driving frame D to the source frame S. As discussed above, we propose to approximate T_{S←D} by its first order Taylor expansion in a neighborhood of the keypoint locations. In the rest of this section, we describe the motivation behind this choice and detail the proposed approximation of T_{S←D}.
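For reference, the first order expansion described here can be written as follows (a reconstruction from the description above; J_k is the local affine part that Eq. (4) refers to):

```latex
T_{S \leftarrow D}(z) \approx
    T_{S \leftarrow R}(p_k) + J_k \bigl(z - T_{D \leftarrow R}(p_k)\bigr),
\qquad
J_k = \left(\frac{d}{dp} T_{S \leftarrow R}(p)\Big|_{p=p_k}\right)
      \left(\frac{d}{dp} T_{D \leftarrow R}(p)\Big|_{p=p_k}\right)^{-1}
```

Near each keypoint p_k, the backward flow is thus a keypoint displacement plus a local affine correction; setting J_k = 1 recovers the zeroth order model used by Monkey-Net, which is exactly the Baseline variant discussed in the ablation of Sec. 4.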
Combining Local Motions. We employ a convolutional network P to estimate T̂_{S←D} from the set of Taylor approximations of T_{S←D}(z) in the keypoints and the original source frame S. Importantly, since T̂_{S←D} maps each pixel location in D to its corresponding location in S, the local patterns in T̂_{S←D}, such as edges or texture, are pixel-to-pixel aligned with D but not with S. This misalignment issue makes it harder for the network to predict T̂_{S←D} from S. In order to provide inputs already roughly aligned with T̂_{S←D}, we warp the source frame S according to the local transformations estimated in Eq. (4). Thus, we obtain K transformed images S^1, ..., S^K that are each aligned with T̂_{S←D} in the neighbourhood of a keypoint. Importantly, we also consider an additional image S^0 = S for the background. For each keypoint p_k, we additionally compute heatmaps H_k indicating to the dense motion network where each transformation happens; each H_k(z) is implemented as the difference of two heatmaps centered at T_{D←R}(p_k) and T_{S←R}(p_k).
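A rough PyTorch sketch of these two ingredients, the heatmap difference H_k and the mask-weighted combination of the K+1 local flows, is given below. The shapes, the Gaussian variance and the softmax-mask convention are assumptions for illustration; the dense motion network P that predicts the masks is omitted.

```python
import torch

def gaussian_heatmap(kp, H, W, sigma=0.01):
    """Isotropic Gaussians centred at keypoints `kp` (B, K, 2), coords in [-1, 1],
    evaluated on an H x W grid; returns (B, K, H, W)."""
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing='ij')
    grid = torch.stack([xs, ys], dim=-1)                        # (H, W, 2)
    diff = grid[None, None] - kp[:, :, None, None, :]           # (B, K, H, W, 2)
    return torch.exp(-(diff ** 2).sum(-1) / sigma)

def combine_local_motions(kp_src, kp_drv, jac_src, jac_drv, masks, H, W):
    """Dense flow T_hat_{S<-D} as a mask-weighted sum of the background (identity)
    motion and the K local affine motions described in the text.

    kp_*  : (B, K, 2)      keypoints T_{S<-R}(p_k) and T_{D<-R}(p_k)
    jac_* : (B, K, 2, 2)   local Jacobians of T_{S<-R} and T_{D<-R}
    masks : (B, K+1, H, W) softmax weights predicted by the dense motion network
    """
    B, K, _ = kp_src.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing='ij')
    z = torch.stack([xs, ys], dim=-1).view(1, 1, H, W, 2)       # pixel coordinates in D
    J = jac_src @ torch.inverse(jac_drv)                        # J_k of the expansion above
    local = kp_src.view(B, K, 1, 1, 2) + (
        J.view(B, K, 1, 1, 2, 2) @ (z - kp_drv.view(B, K, 1, 1, 2)).unsqueeze(-1)
    ).squeeze(-1)                                               # (B, K, H, W, 2)
    flows = torch.cat([z.expand(B, 1, H, W, 2), local], dim=1)  # background + K local flows
    return (masks.unsqueeze(-1) * flows).sum(dim=1)             # (B, H, W, 2)

# The heatmap input H_k mentioned above would be
#   gaussian_heatmap(kp_drv, H, W) - gaussian_heatmap(kp_src, H, W)
```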
3.2 Occlusion-aware Image Generation
As mentioned in Sec. 3, the source image S is not pixel-to-pixel aligned with the image to be generated D̂. In order to handle this misalignment, we use a feature warping strategy similar to [29, 28, 15]. More precisely, after two down-sampling convolutional blocks, we obtain a feature map ξ ∈ R^{H0×W0} of dimension H0 × W0. We then warp ξ according to T̂_{S←D}. In the presence of occlusions in S, optical flow may not be sufficient to generate D̂. Indeed, the occluded parts in S cannot be recovered by image warping and thus should be inpainted. Consequently, we introduce an occlusion map Ô_{S←D} ∈ [0, 1]^{H0×W0} to mask out the feature map regions that should be inpainted. Thus, the occlusion mask diminishes the impact of the features corresponding to the occluded parts. The transformed feature map is written as ξ′ = Ô_{S←D} ⊙ f_w(ξ, T̂_{S←D}), where f_w(·,·) denotes the back-warping operation and ⊙ the Hadamard product.
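A minimal sketch of this occlusion-aware warping of the encoder feature map is shown below; the layer at which the mask is applied and the exact shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def occlusion_aware_warp(feature, flow, occlusion):
    """Warp an encoder feature map and attenuate the regions to be inpainted.

    feature   : (B, C, H0, W0) feature map xi computed from the source image S
    flow      : (B, H0, W0, 2) dense backward flow T_hat_{S<-D}, coords in [-1, 1]
    occlusion : (B, 1, H0, W0) mask O_hat_{S<-D} in [0, 1]; low values mark parts
                that cannot be recovered by warping and must be inpainted
    """
    warped = F.grid_sample(feature, flow, mode='bilinear', align_corners=True)
    # xi' = O_hat ⊙ f_w(xi, T_hat); the decoder is left to inpaint the masked regions.
    return occlusion * warped
```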
3.3 Training Losses
We train our system in an end-to-end fashion combining several losses. First, we use a reconstruction loss based on the perceptual loss of Johnson et al. [19] with a pre-trained VGG-19 network as our main driving loss; the loss follows the implementation of Wang et al. [37]. With the input driving frame D and the corresponding reconstructed frame D̂, the reconstruction loss is written as L_rec(D̂, D) = Σ_i |N_i(D̂) − N_i(D)|, where N_i(·) denotes the i-th channel feature extracted from a specific VGG-19 layer.
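A hedged sketch of this perceptual loss is shown below; the choice of VGG-19 feature layers and the use of torchvision's pre-trained weights are assumptions, and the paper additionally evaluates the loss over a pyramid of input resolutions, which is omitted here.

```python
import torch
import torchvision

class PerceptualLoss(torch.nn.Module):
    """L_rec(D_hat, D) = sum_i | N_i(D_hat) - N_i(D) | over VGG-19 features N_i."""
    def __init__(self, layer_ids=(2, 7, 12, 21, 30)):  # illustrative layer cut-offs
        super().__init__()
        vgg = torchvision.models.vgg19(weights='IMAGENET1K_V1').features.eval()
        self.slices = torch.nn.ModuleList(
            [torch.nn.Sequential(*list(vgg.children())[:i]) for i in layer_ids])
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, generated, target):
        loss = 0.0
        for sl in self.slices:
            loss = loss + torch.abs(sl(generated) - sl(target)).mean()
        return loss

# usage: rec_loss = PerceptualLoss()(d_hat, d)   # images of shape (B, 3, H, W)
```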
Imposing Equivariance Constraint. Our keypoint predictor does not require any keypoint annotations during training, which may lead to unstable performance. The equivariance constraint is one of the most important factors driving the discovery of unsupervised keypoints [18, 43]. It forces the model to predict keypoints that are consistent with respect to known geometric transformations. We use thin plate spline deformations, as they were previously used in unsupervised keypoint detection [18, 43] and are similar to natural image deformations. Since our motion estimator predicts not only the keypoints but also the Jacobians, we extend the well-known equivariance loss to additionally include constraints on the Jacobians.
Note that the constraint in Eq. (11) is strictly the same as the standard equivariance constraint for the keypoints [18, 43]. During training, we constrain every keypoint location using a simple L1 loss between the two sides of Eq. (11). However, implementing the second constraint from Eq. (12) with an L1 loss would force the magnitude of the Jacobians to zero and would lead to numerical problems. To this end, we reformulate this constraint so that the composition of the predicted Jacobians with the known deformation is pushed towards the identity matrix (see the sketch below).
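The sketch below illustrates both constraints under a known random deformation; a generic warp function stands in for the thin plate spline used in the paper, and the keypoint-detector interface (dicts with 'value' and 'jacobian') is an assumption.

```python
import torch

def equivariance_losses(kp, kp_warped, warp_fn, warp_jacobian_fn):
    """Keypoint and Jacobian equivariance under a known deformation.

    kp, kp_warped : dicts with 'value' (B, K, 2) and 'jacobian' (B, K, 2, 2),
                    predicted on the original image and on its deformed copy.
    warp_fn(p)          : applies the known deformation to keypoint coordinates.
    warp_jacobian_fn(p) : its 2x2 Jacobian evaluated at p, shape (B, K, 2, 2).
    """
    # Keypoint constraint (Eq. (11)): keypoints detected on the deformed image,
    # mapped back through the deformation, should match the original keypoints.
    kp_loss = torch.abs(kp['value'] - warp_fn(kp_warped['value'])).mean()

    # Jacobian constraint (Eq. (12)-(13)): rather than matching magnitudes with L1
    # (which would push the Jacobians to zero), push the composed transformation
    # towards the identity matrix.
    composed = torch.inverse(kp['jacobian']) @ warp_jacobian_fn(kp_warped['value']) @ kp_warped['jacobian']
    eye = torch.eye(2, device=composed.device).view(1, 1, 2, 2)
    jac_loss = torch.abs(eye - composed).mean()
    return kp_loss, jac_loss
```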
3.4 Testing Stage: Relative Motion Transfer
At this stage our goal is to animate an object in a source frame S_1 using the driving video D_1, ..., D_T. Each frame D_t is independently processed to obtain S_t. Rather than transferring the motion encoded in T_{S_1←D_t}(p_k) to S_1, we transfer the relative motion between D_1 and D_t to S_1. In other words, we apply a transformation T_{D_t←D_1}(p) to the neighbourhood of each keypoint p_k (Eq. (14); see the sketch below).
Detailed mathematical derivations are provided in the supplementary material. Intuitively, we transform the neighbourhood of each keypoint p_k in S_1 according to its local deformation in the driving video. Indeed, transferring relative motion over absolute coordinates allows us to transfer only the relevant motion patterns while preserving the global object geometry. Conversely, when transferring absolute coordinates, as in X2Face [40], the generated frame inherits the object proportions of the driving video. It is important to note that one limitation of transferring relative motion is that we need to assume that the objects in S_1 and D_1 have similar poses (see [28]). Without an initial rough alignment, Eq. (14) may lead to absolute keypoint locations that are physically impossible for the object of interest.
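A sketch of this relative motion transfer, in the spirit of the keypoint normalization performed by the public implementation, is given below; the dict interface is an assumption, and the optional scale adaptation between source and driving objects is omitted.

```python
import torch

def transfer_relative_motion(kp_source, kp_driving, kp_driving_initial):
    """Move the source keypoints by the driving motion relative to the first
    driving frame D_1, instead of copying absolute driving keypoint positions.

    Each argument is a dict with 'value' (B, K, 2) and 'jacobian' (B, K, 2, 2).
    """
    kp_new = {}
    # Displace each source keypoint by (T_{D_t<-R}(p_k) - T_{D_1<-R}(p_k)).
    kp_new['value'] = kp_source['value'] + (kp_driving['value'] - kp_driving_initial['value'])
    # Compose the relative change of the local affine part with the source Jacobian.
    relative_jac = kp_driving['jacobian'] @ torch.inverse(kp_driving_initial['jacobian'])
    kp_new['jacobian'] = relative_jac @ kp_source['jacobian']
    return kp_new
```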
4 Experiments
Datasets. We train and test our method on four different datasets containing various objects. Our model is capable of rendering videos of much higher resolution compared to [28] in all our experiments.
Evaluation Protocol. Evaluating the quality of image animation is not straightforward, since ground-truth animations are not available. We follow the evaluation protocol of Monkey-Net [28]. First, we quantitatively evaluate each method on the "proxy" task of video reconstruction. This task consists of reconstructing the input video from a representation in which appearance and motion are decoupled. In our case, we reconstruct the input video by combining the sparse motion representation in Eq. (2) of each frame and the first video frame. Second, we evaluate our model on image animation according to a user study. In all experiments we use K=10 as in [28]. Other implementation details are given in the supplementary material.
Metrics. To evaluate video reconstruction, we adopt the metrics proposed in Monkey-Net [28]: the L1 reconstruction error, the average keypoint distance (AKD), the missing keypoint rate (MKR) and the average Euclidean distance (AED) between feature embeddings of the reconstructed and ground-truth frames.
Ablation Study. We compare the following variants of our model. Baseline: the simplest model, trained without the occlusion mask (O_{S←D} = 1 in Eq. (8)) and without Jacobians (J_k = 1 in Eq. (4)), supervised with L_rec at the highest resolution only; Pyr.: the pyramid loss is added to Baseline; Pyr.+O_{S←D}: with respect to Pyr., we replace the generator network with the occlusion-aware network; Jac. w/o Eq. (12): our model with local affine transformations but without the equivariance constraint on the Jacobians of Eq. (12); Full: the full model including the local affine transformations described in Sec. 3.1.

In Tab. 1, we report the quantitative ablation. First, the pyramid loss leads to better results according to all the metrics except AKD. Second, adding O_{S←D} to the model consistently improves all the metrics with respect to Pyr. This illustrates the benefit of explicitly modeling occlusions. We found that without the equivariance constraint over the Jacobians, J_k becomes unstable, which leads to poor motion estimation. Finally, our Full model further improves all the metrics. In particular, we note that, with respect to the Baseline model, the MKR of the Full model is smaller by a factor of 2.75. This shows that our rich motion representation helps generate more realistic images. These results are confirmed by our qualitative evaluation in Fig. 3, where we compare the Baseline and the Full models. In these experiments, each frame D of the input video is reconstructed from its first frame (first column) and the estimated keypoint trajectories. We note that the Baseline model does not locate any keypoints in the arms area. Consequently, when the pose difference with the initial pose increases, the model cannot reconstruct the video (columns 3, 4 and 5). In contrast, the Full model learns to detect a keypoint on each arm and therefore reconstructs the input video more accurately, even in the case of complex motion.
Comparison with the State of the Art. We now compare our method with the state of the art on the video reconstruction task as in [28]. To the best of our knowledge, X2Face [40] and Monkey-Net [28] are the only previous approaches for model-free image animation. Quantitative results are reported in Tab. 3. We observe that our approach consistently improves every single metric on each of the four different datasets. Even on the two face datasets, VoxCeleb and Nemo, our approach clearly outperforms X2Face, which was originally proposed for face generation. The better performance of our approach compared to X2Face is especially impressive since X2Face exploits a larger motion embedding (128 floats) than our approach (60 = K*(2+4) floats). Compared to Monkey-Net, which uses a motion representation with a similar dimension (50 = K*(2+3)), the advantages of our approach are clearly visible on the Tai-Chi-HD dataset that contains highly non-rigid objects (i.e. the human body).

We now report a qualitative comparison for image animation. Generated sequences are shown in Fig. 4. The results are well in line with the quantitative evaluation in Tab. 3. Indeed, in both examples, X2Face and Monkey-Net are not able to correctly transfer the body motion in the driving video, and instead warp the human body in the source image as a blob. Conversely, our approach is able to generate significantly better looking videos in which each body part is independently animated. This qualitative evaluation illustrates the potential of our rich motion description. We complete our evaluation with a user study. We ask users to select the most realistic image animation. Each question consists of the source image, the driving video, and the corresponding results of our method and of a competing method. We require each question to be answered by 10 AMT workers. This evaluation is repeated on 50 different input pairs. Results are reported in Tab. 2. We observe that our method is clearly preferred over the competitor methods. Interestingly, the largest difference with the state of the art is obtained on Tai-Chi-HD, the most challenging dataset in our evaluation due to its rich motions.
5 Conclusions
We presented a novel approach for image animation based on keypoints and local affine transformations. Our novel mathematical formulation describes the motion field between two frames and is efficiently computed by deriving a first order Taylor expansion approximation. In this way, motion is described as a set of keypoint displacements and local affine transformations. A generator network combines the appearance of the source image and the motion representation of the driving video. In addition, we proposed to explicitly model occlusions in order to indicate to the generator network which image parts should be inpainted. We evaluated the proposed method both quantitatively and qualitatively and showed that our approach clearly outperforms the state of the art on all the benchmarks.