
Paper: "How far are we from solving the 2D & 3D Face Alignment problem?" (Interpretation and Translation)


Contents

How far are we from solving the 2D & 3D Face Alignment problem?

Abstract

1. Introduction

2. Closely related work

3. Datasets

3.1. Training datasets

3.2. Test datasets

3.3. Metrics

4. Method

4.1. 2D and 3D Face Alignment Networks

4.2. 2D-to-3D Face Alignment Network

4.3. Training

5. 2D face alignment

6. Large Scale 3D Faces in-the-Wild dataset

7. 3D face alignment

8. Ablation studies

9. Conclusions

10. Acknowledgments



How far are we from solving the 2D & 3D Face Alignment problem?

Adrian Bulat and Georgios Tzimiropoulos Computer Vision Laboratory, The University of Nottingham Nottingham, United Kingdom
Project page: https://www.adrianbulat.com/face-alignment
Paper: https://arxiv.org/pdf/1703.07332.pdf

Abstract

This paper investigates how far a very deep neural network is from attaining close to saturating performance on existing 2D and 3D face alignment datasets. To this end, we make the following 5 contributions:

  • (a) We construct, for the first time, a very strong baseline by combining a state-of-the-art architecture for landmark localization with a state-of-the-art residual block, train it on a very large yet synthetically expanded 2D facial landmark dataset, and finally evaluate it on all other 2D facial landmark datasets.
  • (b) We create a guided-by-2D-landmarks network which converts 2D landmark annotations to 3D and unifies all existing datasets, leading to the creation of LS3D-W, the largest and most challenging 3D facial landmark dataset to date (~230,000 images).
  • (c) Following that, we train a neural network for 3D face alignment and evaluate it on the newly introduced LS3D-W.
  • (d) We further look into the effect of all "traditional" factors affecting face alignment performance, like large pose, initialization and resolution, and introduce a "new" one, namely the size of the network.
  • (e) We show that both 2D and 3D face alignment networks achieve performance of remarkable accuracy which is probably close to saturating the datasets used.

Training and testing code, as well as the dataset, can be downloaded from https://www.adrianbulat.com/face-alignment/


1. Introduction

With the advent of Deep Learning and the development of large annotated datasets, recent work has shown results of unprecedented accuracy even on the most challenging computer vision tasks. In this work, we focus on landmark localization, in particular on facial landmark localization, also known as face alignment, arguably one of the most heavily researched topics in computer vision over the last decades. Very recent work on landmark localization using Convolutional Neural Networks (CNNs) has pushed the boundaries in other domains like human pose estimation [39, 38, 24, 17, 27, 42, 23, 5], yet it remains unclear what has been achieved so far for the case of face alignment. The aim of this work is to address this gap in the literature. Historically, different techniques have been used for landmark localization depending on the task at hand. For example, work in human pose estimation, prior to the advent of neural networks, was primarily based on pictorial structures [12] and sophisticated extensions [44, 25, 36, 32, 26] due to their ability to model large appearance changes and accommodate a wide spectrum of human poses. Such methods, though, have not been shown capable of achieving the high degree of accuracy exhibited by cascaded regression methods for the task of face alignment [11, 8, 43, 50, 41]. On the other hand, the performance of cascaded regression methods is known to deteriorate in cases of inaccurate initialisation and of large (and unfamiliar) facial poses, when there is a significant number of self-occluded landmarks or large in-plane rotations.

Figure 1: The Face Alignment Network (FAN), constructed by stacking four HGs, in which all bottleneck blocks (depicted as rectangles) were replaced with the hierarchical, parallel and multi-scale block of [7].

More recently, fully Convolutional Neural Network architectures based on heatmap regression have revolutionized human pose estimation [39, 38, 24, 17, 27, 42, 23, 5], producing results of remarkable accuracy even for the most challenging datasets [1]. Thanks to their end-to-end training and little need for hand engineering, such methods can be readily applied to the problem of face alignment. Following this path, our main contribution is to construct and train such a powerful network for face alignment and investigate for the first time how far it is from attaining close to saturating performance on all existing 2D face alignment datasets and a newly introduced large-scale 3D dataset. More specifically, our contributions are:

  • 1. We construct, for the first time, a very strong baseline by combining a state-of-the-art architecture for landmark localization with a state-of-the-art residual block, and train it on a very large yet synthetically expanded 2D facial landmark dataset. Then, we evaluate it on all other 2D datasets (~230,000 images), investigating how far we are from solving 2D face alignment.
  • 2. In order to overcome the scarcity of 3D face alignment datasets, we further propose a guided-by-2D-landmarks CNN which converts 2D annotations to 3D, and use it to create LS3D-W, the largest and most challenging 3D facial landmark dataset to date (~230,000 images), obtained by unifying almost all existing datasets to date.
  • 3. Following that, we train a 3D face alignment network and then evaluate it on the newly introduced large-scale 3D facial landmark dataset, investigating how far we are from solving 3D face alignment.
  • 4. We further look into the effect of all "traditional" factors affecting face alignment performance, like large pose, initialization and resolution, and introduce a "new" one, namely the size of the network.
  • 5. We show that both 2D and 3D face alignment networks achieve performance of remarkable accuracy which is probably close to saturating the datasets used.


2. Closely related work

This Section reviews related work on face alignment and landmark localization. Datasets are described in detail in the next Section.

2D face alignment. Prior to the advent of Deep Learning, methods based on cascaded regression had emerged as the state-of-the-art in 2D face alignment; see for example [8, 43, 50, 41]. Such methods are now considered to have largely "solved" the 2D face alignment problem for faces with controlled pose variation, like the ones of LFPW [2], Helen [22] and 300-W [30].

We will keep the main result from these works, namely their performance on the frontal dataset of LFPW [2]. This performance will be used as a point of comparison for the methods described in this paper, under the assumption that a method achieving a similar error curve on a different dataset is close to saturating that dataset.


CNNs for face alignment. By no means are we the first to use CNNs for face alignment. The method of [35] uses a CNN cascade to regress the facial landmark locations. The work in [47] proposes multi-task learning for joint facial landmark localization and attribute classification. More recently, the method of [40] extends [43] within recurrent neural networks. All these methods have been mainly shown effective for the near-frontal faces of 300-W [30].

Recent works on large pose and 3D face alignment include [20, 50], which perform face alignment by fitting a 3D Morphable Model (3DMM) to a 2D facial image. The work in [20] proposes to fit a dense 3DMM using a cascade of CNNs. The approach of [50] fits a 3DMM in an iterative manner through a single CNN which is augmented by additional input channels (besides RGB) representing shape features at each iteration. More recent works that are closer to the methods presented in this paper are [4] and [6]. Nevertheless, [4] is evaluated on [20], which is a relatively small dataset (3,900 images for training and 1,200 for testing), and [6] on [19], which is of moderate size (16,200 images for training and 4,900 for testing), includes mainly images collected in the lab, and does not cover the full spectrum of facial poses. Hence, the results of [4] and [6] are not conclusive in regards to the main questions posed in our paper.

Landmark localization. A detailed review of state-of-the-art methods on landmark localization for human pose estimation is beyond the scope of this work; please see [39, 38, 24, 17, 27, 42, 23, 5]. For the needs of this work, we built a powerful CNN for 2D and 3D face alignment based on two components: (a) the state-of-the-art HourGlass (HG) network of [23], and (b) the hierarchical, parallel & multi-scale block recently proposed in [7]. In particular, we replaced the bottleneck block [15] used in [23] with the block proposed in [7].


Transferring landmark annotations. There are a few works that have attempted to unify facial alignment datasets by transferring landmark annotations, typically by exploiting common landmarks across datasets [49, 34, 46]. Such methods have primarily been shown to be successful when landmarks are transferred from more challenging to less challenging images; for example, in [49] the target dataset is LFW [16], and [34] provides annotations only for the relatively easy images of AFLW [21]. Hence, the community primarily relies on the unification performed manually by the 300-W challenge [29], which contains less than 5,000 near-frontal images annotated from a 2D perspective.

Using 300-W-LP [50] as a basis, this paper presents the first attempt to provide 3D annotations for all other datasets, namely AFLW-2000 [50] (2,000 images), 300-W test set [28] (600 images), 300-VW [33] (218,595 frames), and Menpo training set (9,000 images). To this end, we propose a guided-by-2D landmarks CNN which converts 2D annotations to 3D and unifies all aforementioned datasets.


3. Datasets

In this Section, we provide a description of how existing 2D and 3D datasets were used for training and testing for the purposes of our experiments. We note that the 3D annotations preserve correspondence across pose as opposed to the 2D ones and, in general, they should be preferred. We emphasize that the 3D annotations are actually the 2D projections of the 3D facial landmark coordinates but for simplicity we will just call them 3D. In the supplementary material, we present a method for extending these annotations to full 3D. Finally, we emphasize that we performed cross-database experiments only.


Table 1: Summary of the most popular face alignment datasets and their main characteristics.


3.1. Training datasets

For training and validation, we used 300-W-LP [50], a synthetically expanded version of 300-W [29]. 300-W-LP provides both 2D and 3D landmarks, allowing for training models and conducting experiments using both types of annotations. For some 2D experiments, we also used the original 300-W dataset [29], for fine-tuning only. This is because the 2D landmarks of 300-W-LP are not entirely compatible with the 2D landmarks of the test sets used in our experiments (i.e. the 300-W test set [28], 300-VW [33] and Menpo [45]), but the original annotations from 300-W are.

300-W. 300-W [29] is currently the most widely-used in-the-wild dataset for 2D face alignment. The dataset itself is a concatenation of a series of smaller datasets: LFPW [3], HELEN [22], AFW [51] and iBUG [30], where each image was re-annotated in a consistent manner using the 68-point 2D landmark configuration of Multi-PIE [13]. The dataset contains in total ~4,000 near-frontal facial images.

300W-LP-2D and 300W-LP-3D. 300-W-LP is a synthetically generated dataset obtained by rendering the faces of 300-W into larger poses, ranging from −90° to 90°, using the profiling method of [50]. The dataset contains 61,225 images, providing both 2D (300W-LP-2D) and 3D landmark annotations (300W-LP-3D).


3.2. Test datasets

This Section describes the test sets used for our 2D and 3D experiments. Observe that there is a large number of 2D datasets/annotations, which are however problematic for moderately large poses (2D landmarks lose correspondence), and that the only in-the-wild 3D test set is AFLW2000-3D [50]. We address this significant gap in 3D face alignment datasets in Section 6.


3.2.1 2D datasets

300-W test set. The 300-W test set consists of the 600 images used for the evaluation purposes of the 300-W Challenge [28]. The images are split in two categories: Indoor and Outdoor. All images were annotated with the same 68 2D landmarks as the ones used in the 300-W dataset.

300-VW. 300-VW [33] is a large-scale face tracking dataset, containing 114 videos and in total 218,595 frames. From the total of 114 videos, 64 are used for testing and 50 for training. The test videos are further separated into three categories (A, B, and C), with the last one being the most challenging. It is worth noting that some videos (especially from category C) contain very low resolution/poor quality faces. Due to the semi-automatic annotation approach (see [33] for more details), in some cases the annotations for these videos are not so accurate (see Fig. 3). Another source of annotation error is caused by facial pose, i.e. large poses are also not accurately annotated (see Fig. 3).

Menpo. Menpo is a recently introduced dataset [45] containing landmark annotations for about 9,000 faces from FDDB [18] and AFLW. Frontal faces were annotated in terms of 68 landmarks using the same annotation policy as the one of 300-W, but profile faces in terms of 39 different landmarks which are not in correspondence with the landmarks from the 68-point mark-up.


3.2.2 3D datasets

AFLW2000-3D. AFLW2000-3D [50] is a dataset constructed by re-annotating the first 2,000 images from AFLW [21] using 68 3D landmarks, in a manner consistent with the ones from 300W-LP-3D. The faces of this dataset contain large pose variations (yaw from −90° to 90°), with various expressions and illumination conditions. However, some annotations, especially for larger poses or occluded faces, are not so accurate (see Fig. 6).


3.3. Metrics

Traditionally, the metric used for face alignment is the point-to-point Euclidean distance normalized by the inter-ocular distance [10, 29, 33]. However, as noted in [51], this error metric is biased for profile faces, for which the inter-ocular distance can be very small. Hence, we normalize by the bounding box size. In particular, we used the Normalized Mean Error, defined as:

$\mathrm{NME} = \frac{1}{N}\sum_{k=1}^{N}\frac{\lVert x_k - y_k \rVert_2}{d}$

where x denotes the ground truth landmarks for a given face, y the corresponding prediction, and d is the square root of the ground truth bounding box area, computed as $d = \sqrt{w_{bbox} \cdot h_{bbox}}$. Although we conducted both 2D and 3D experiments, we opted to use the same bounding box definition for both experiments; in particular, we used the bounding box calculated from the 2D landmarks. This way, we can readily compare the accuracy achieved in 2D and 3D.
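For concreteness, a minimal NumPy sketch of this metric (our own illustration; the function name and array layout are assumptions, not the paper's released code):

```python
import numpy as np

def nme(pred, gt, bbox_w, bbox_h):
    """Normalized Mean Error for one face.

    pred, gt: (num_landmarks, 2) arrays of 2D landmark coordinates.
    bbox_w, bbox_h: width/height of the ground-truth bounding box.
    """
    d = np.sqrt(bbox_w * bbox_h)                   # normalization term d = sqrt(w * h)
    per_point = np.linalg.norm(pred - gt, axis=1)  # point-to-point Euclidean distances
    return per_point.mean() / d
```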


4. Method

This Section describes FAN, the network used for 2D and 3D face alignment. It also describes 2D-to-3D FAN, the network used for constructing the very large scale 3D face alignment dataset (LS3D-W) containing more than 230,000 3D landmark annotations.


4.1. 2D and 3D Face Alignment Networks

We coin the term Face Alignment Network (FAN) for the network used in our experiments. To our knowledge, it is the first time that such a powerful network is trained and evaluated for large-scale 2D/3D face alignment experiments.

We construct FAN based on one of the state-of-the-art architectures for human pose estimation, namely the HourGlass (HG) network of [23]. In particular, we used a stack of four HG networks (see Fig. 1). While [23] uses the bottleneck block of [14] as the main building block for the HG, we go one step further and replace the bottleneck block with the recently introduced hierarchical, parallel and multi-scale block of [7]. As shown in [7], this block outperforms the original bottleneck of [14] when the same number of network parameters is used. Finally, we used 300W-LP-2D and 300W-LP-3D to train 2D-FAN and 3D-FAN, respectively.
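As a rough illustration of this stacked design (a sketch under our own simplifications, not the authors' Torch7 implementation: the module names are ours, and a plain bottleneck residual stands in for the block of [7]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Residual(nn.Module):
    """Plain bottleneck residual block (stand-in for the block of [7])."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch // 2, 1)
        self.conv2 = nn.Conv2d(ch // 2, ch // 2, 3, padding=1)
        self.conv3 = nn.Conv2d(ch // 2, ch, 1)
    def forward(self, x):
        y = F.relu(self.conv1(x))
        y = F.relu(self.conv2(y))
        return x + self.conv3(y)

class Hourglass(nn.Module):
    """Recursive hourglass: downsample, recurse, upsample, add skip."""
    def __init__(self, depth, ch):
        super().__init__()
        self.skip = Residual(ch)
        self.down = Residual(ch)
        self.inner = Hourglass(depth - 1, ch) if depth > 1 else Residual(ch)
        self.up = Residual(ch)
    def forward(self, x):
        skip = self.skip(x)
        y = F.max_pool2d(x, 2)
        y = self.up(self.inner(self.down(y)))
        y = F.interpolate(y, scale_factor=2, mode='nearest')
        return y + skip

class FAN(nn.Module):
    """Stack of hourglasses, one heatmap head per stack (intermediate supervision)."""
    def __init__(self, num_stacks=4, ch=256, num_landmarks=68):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, ch, 7, stride=2, padding=3), nn.ReLU(),
            Residual(ch), nn.MaxPool2d(2), Residual(ch))
        self.hgs = nn.ModuleList(Hourglass(4, ch) for _ in range(num_stacks))
        self.heads = nn.ModuleList(nn.Conv2d(ch, num_landmarks, 1)
                                   for _ in range(num_stacks))
        self.merge = nn.ModuleList(nn.Conv2d(num_landmarks, ch, 1)
                                   for _ in range(num_stacks - 1))
    def forward(self, x):
        x = self.stem(x)                       # e.g. 256x256 input -> 64x64 features
        heatmaps = []
        for i, (hg, head) in enumerate(zip(self.hgs, self.heads)):
            y = hg(x)
            hm = head(y)                       # one 68-channel heatmap per stack
            heatmaps.append(hm)
            if i < len(self.merge):
                x = x + y + self.merge[i](hm)  # feed heatmaps back into next stack
        return heatmaps                        # supervise all outputs during training
```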

Figure 2: The 2D-to-3D-FAN network used for the creation of the LS3D-W dataset. The network takes as input the RGB image and the 2D landmarks and outputs the corresponding 2D projections of the 3D landmarks.


4.2. 2D-to-3D Face Alignment Network

Our aim is to create the very first very-large-scale dataset of 3D facial landmarks, for which annotations are scarce. To this end, we followed a guided approach in which a FAN for predicting 3D landmarks is guided by 2D landmarks. In particular, we created a 3D-FAN in which the input RGB channels have been augmented with 68 additional channels, one for each 2D landmark, each containing a 2D Gaussian with std = 1px centered at that landmark's location. We call this network 2D-to-3D FAN. Given the 2D facial landmarks for an image, 2D-to-3D FAN converts them to 3D. To train 2D-to-3D FAN, we used 300-W-LP, which provides both 2D and 3D annotations for the same image. We emphasize again that the 3D annotations are actually the 2D projections of the 3D coordinates, but for simplicity we call them 3D. Please see the supplementary material for extending these annotations to full 3D.
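A minimal sketch of how such an input tensor could be assembled (our own illustration based on the description above; the paper does not publish this exact routine):

```python
import numpy as np

def landmark_channels(landmarks, height, width, std=1.0):
    """Render one Gaussian heatmap channel per 2D landmark.

    landmarks: (68, 2) array of (x, y) pixel coordinates.
    Returns a (68, height, width) float array.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    channels = np.empty((len(landmarks), height, width), dtype=np.float32)
    for k, (x, y) in enumerate(landmarks):
        # 2D Gaussian with std = 1px centered at the landmark location
        channels[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * std ** 2))
    return channels

def build_input(rgb, landmarks):
    """Stack RGB (3, H, W) with the 68 landmark channels -> (71, H, W)."""
    _, h, w = rgb.shape
    return np.concatenate([rgb, landmark_channels(landmarks, h, w)], axis=0)
```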


4.3. Training

For all of our experiments, we independently trained three distinct networks: 2D-FAN, 3D-FAN, and 2D-to-3D-FAN. For the first two networks, we set the initial learning rate to 10^-4 and used a minibatch of 10. During the process, we dropped the learning rate to 10^-5 after 15 epochs and to 10^-6 after another 15, training for a total of 40 epochs. We also applied random augmentation: flipping, rotation (from −50° to 50°), color jittering, scale noise (from 0.8 to 1.2) and random occlusion. The 2D-to-3D-FAN model was trained by following a similar procedure, increasing the amount of augmentation even further: rotation (from −70° to 70°) and scale (from 0.7 to 1.3). Additionally, the learning rate was initially set to 10^-3. All networks were implemented in Torch7 [9] and trained using rmsprop [37].
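Assuming a PyTorch-style loop (the paper used Torch7 with rmsprop), the schedule above translates roughly to the following sketch; `FAN` refers to the earlier sketch and `sample_augmentation` is a hypothetical helper:

```python
import random
import torch

model = FAN(num_stacks=4)          # from the sketch in Section 4.1
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)  # rmsprop, lr = 10^-4

def sample_augmentation():
    """Random augmentation ranges quoted for 2D-FAN / 3D-FAN training."""
    return {
        "flip": random.random() < 0.5,
        "rotation_deg": random.uniform(-50.0, 50.0),
        "scale": random.uniform(0.8, 1.2),
        # color jittering and random occlusion would be applied here as well
    }

for epoch in range(40):            # 40 epochs total
    if epoch == 15:
        for g in optimizer.param_groups:
            g["lr"] = 1e-5         # drop to 10^-5 after 15 epochs
    elif epoch == 30:
        for g in optimizer.param_groups:
            g["lr"] = 1e-6         # and to 10^-6 after another 15
    # ... iterate over minibatches of 10, compute the heatmap loss over all
    # stacks (intermediate supervision), then call optimizer.step()
```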


5. 2D face alignment

This Section evaluates 2D-FAN (trained on 300-W-LP-2D) on the 300-W test set, 300-VW (both training and test sets), and Menpo (frontal subset). Overall, 2D-FAN is evaluated on more than 220,000 images. Prior to reporting our results, the following points need to be emphasized:

  • 1. 300-W-LP-2D contains a wide range of poses (yaw angles in [−90°, 90°]), yet it is still a synthetically generated dataset, as this wide spectrum of poses was produced by warping the nearly frontal images of the 300-W dataset. It is evident that this lack of real data largely increases the difficulty of the experiment.
  • 2. The 2D landmarks of 300-W-LP-2D that 2D-FAN was trained on are slightly different from the 2D landmarks of the 300-W test set, 300-VW and Menpo. To alleviate this, the 2D-FAN was further fine-tuned on the original 300-W training set for a few epochs. Although this seems to resolve the issue, this discrepancy obviously increases the difficulty of the experiment.
  • 3. We compare the performance of 2D-FAN on all the aforementioned datasets with that of an unconventional baseline: the performance of a recent state-of-the-art method, namely MDM [40], on the LFPW test set, initialized with the ground truth bounding boxes. We call this result MDM-on-LFPW. As there has been very little performance progress on the frontal dataset of LFPW over the past years, we assume that a state-of-the-art method like MDM (nearly) saturates it. Hence, we use the produced error curve to compare how well our method does on the much more challenging aforementioned test sets.

Figure 3: Fittings with the highest error from 300-VW (NME 6.8-7%). Red: ground truth. White: our predictions. In most cases, our predictions are more accurate than the ground truth.


The cumulative error curves for our 2D experiments on 300-VW, the 300-W test set and Menpo are shown in Fig. 8. We additionally report the performance of MDM on all datasets, initialized by ground truth bounding boxes; of ICCR, the state-of-the-art face tracker of [31], on 300-VW (the only tracking dataset); and of our unconventional baseline (called MDM-on-LFPW). Comparisons with a number of methods in terms of AUC are also provided in Table 2.
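The AUC reported here is the area under the cumulative error curve up to a 7% NME threshold; a small sketch of that computation (our own, assuming per-image NME values as input):

```python
import numpy as np

def auc_at_threshold(nmes, threshold=0.07, steps=1000):
    """Area under the cumulative error distribution, normalized to [0, 1].

    nmes: per-image Normalized Mean Errors (e.g. 0.035 for 3.5%).
    """
    nmes = np.asarray(nmes)
    xs = np.linspace(0.0, threshold, steps)
    ced = [(nmes <= x).mean() for x in xs]   # fraction of images below each error
    return np.trapz(ced, xs) / threshold     # normalize so a perfect method scores 1
```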

With the exception of Category C of 300-VW, it is evident that 2D-FAN achieves literally the same performance on all datasets, outperforming MDM and ICCR and, notably, matching the performance of MDM-on-LFPW. Out of 7,200 images (from Menpo and the 300-W test set), there are in total only 18 failure cases, which represent 0.25% of the images (we consider a fitting with NME > 7% a failure). After removing these cases, the 8 fittings with the highest error for each dataset are shown in Fig. 4.

Figure 4: Fittings with the highest error from 300-W test set (first row) and Menpo (second row) (NME 6.5-7%). Red: ground truth. White: our predictions. In most cases, our predictions are more accurate than the ground truth.


Regarding Category C of 300-VW, we found that the main reason for this performance drop is the quality of the annotations, which were obtained in a semi-automatic manner. After removing all failure cases (101 frames, representing 0.38% of the total number of frames), Fig. 3 shows the quality of our predictions vs. the ground truth landmarks for the 8 fittings with the highest error for this dataset. It is evident that in most cases our predictions are more accurate.

Conclusion: Given that 2D-FAN matches the performance of MDM-on-LFPW, we conclude that 2D-FAN achieves near-saturating performance on the above 2D datasets. Notably, this result was obtained by training 2D-FAN primarily on synthetic data, and despite a mismatch between training and testing landmark annotations.

6. Large Scale 3D Faces in-the-Wild dataset

Motivated by the scarcity of 3D face alignment annotations and the remarkable performance of 2D-FAN, we opted to create a large scale 3D face alignment dataset by converting all existing 2D face alignment annotations to 3D. To this end, we trained a 2D-to-3D FAN as described in Subsection 4.2 and guided it using the predictions of 2D-FAN, creating 3D landmarks for: 300-W test set, 300-VW (both training and all 3 testing datasets), Menpo (the whole dataset).

Evaluating 2D-to-3D FAN is difficult: the only available in-the-wild 3D face alignment dataset (not used for training) is AFLW2000-3D [50]. Hence, we applied our pipeline (consisting of applying 2D-FAN to produce the 2D landmarks and then 2D-to-3D FAN to convert them to 3D) on AFLW2000-3D and then calculated the error, shown in Fig. 5 (note that, for normalization purposes, 2D bounding box annotations are still used). The results show that there is a discrepancy between our 3D landmarks and the ones provided by [50]. After removing a few failure cases (19 in total, which represent 0.9% of the data), Fig. 6 shows the 8 images with the highest error between our 3D landmarks and the ones of [50]. It is evident that this discrepancy is mainly caused by the semi-automatic annotation pipeline of [50], which does not produce accurate landmarks, especially for images with difficult poses.
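Schematically, this annotation pipeline is a two-stage function (a sketch; the model handles below are hypothetical):

```python
def annotate_3d(image, fan_2d, fan_2d_to_3d):
    """Hypothetical pipeline: predict 2D landmarks, then lift them to 3D."""
    lm2d = fan_2d(image)                 # stage 1: 2D-FAN predicts 68 2D landmarks
    x = build_input(image, lm2d)         # RGB + 68 Gaussian channels (Section 4.2)
    return fan_2d_to_3d(x)               # stage 2: 2D-to-3D FAN outputs 3D landmarks
```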


Table 2: AUC (calculated for a threshold of 7%) on all major 2D face alignment datasets. MDM, CFSS and TCDCN were evaluated using ground truth bounding boxes and the openly available code.


Figure 5: NME on AFLW2000-3D between the original annotations of [50] and the ones generated by 2D-to-3D-FAN. The error is mainly introduced by the automatic annotation process of [50]. See Fig. 6 for visual examples.

Figure 6: Fittings with the highest error from AFLW2000-3D (NME 7-8%). Red: ground truth from [50]. White: predictions of 2D-to-3D-FAN. In most cases, our predictions are more accurate than the ground truth.

By additionally including AFLW2000-3D in the aforementioned datasets, overall ~230,000 images were annotated in terms of 3D landmarks, leading to the creation of the Large Scale 3D Faces in-the-Wild dataset (LS3D-W), the largest 3D face alignment dataset to date.

7. 3D face alignment

This Section evaluates 3D-FAN, trained on 300-W-LP-3D, on LS3D-W (described in the previous Section), i.e. on the 3D landmarks of the 300-W test set, 300-VW (both training and test sets), Menpo (the whole dataset) and AFLW2000-3D (re-annotated). Overall, 3D-FAN is evaluated on ~230,000 images. Note that, compared to the 2D experiments reported in Section 5, more images in large poses have been used, as our 3D experiments also include AFLW2000-3D and the profile images of Menpo (~2,000 more images in total). The results of our 3D face alignment experiments on the 300-W test set, 300-VW, Menpo and AFLW2000-3D are shown in Fig. 9. We additionally report the performance of the state-of-the-art method 3DDFA (trained on the same dataset as 3D-FAN) on all datasets.

Conclusion: 3D-FAN essentially produces the same accuracy on all datasets, largely outperforming 3DDFA. This accuracy is slightly increased compared to the one achieved by 2D-FAN, especially for the part of the error curve for which the error is less than 2%, something which is not surprising as now the training and testing datasets are annotated using the same mark-up.


8. Ablation studies

To further investigate the performance of 3D-FAN under challenging conditions, we firstly created a dataset of 7,200 images from LS3D-W so that there is an equal number of images for yaw angles in [0°, 30°], [30°, 60°] and [60°, 90°]. We call this dataset LS3D-W Balanced. Then, we conducted the following experiments:


Table 3: AUC (calculated for a threshold of 7%) on the LS3D-W Balanced for different yaw angles.

Table 4: AUC on the LS3D-W Balanced for different levels of initialization noise. The network was trained with a noise level of up to 20%.

Table 5: AUC on the LS3D-W Balanced for various network sizes. Between 12-24M parameters, performance remains almost the same.

Figure 7: AUC on the LS3D-W Balanced for different face resolutions. Up to 30px, performance remains high.


Performance across pose. We report the performance of 3D-FAN on LS3D-W Balanced for each pose separately, in terms of the Area Under the Curve (AUC) (calculated for a threshold of 7%), in Table 3. We observe only a slight degradation of performance for very large poses ([60°, 90°]). We believe that this is to some extent to be expected, as 3D-FAN was largely trained with synthetic data for these poses (300-W-LP-3D). This data was produced by warping frontal images (i.e. the ones of 300-W) to very large poses, which causes face distortion, especially for the face region close to the ears.

Conclusion: Facial pose is not a major issue for 3D-FAN.

Performance across resolution. We repeated the previous experiment but for different face resolutions (resolution is reduced relative to the face size defined by the tight bounding box) and report the performance of 3D-FAN in terms of AUC in Fig. 7. Note that we did not retrain 3D-FAN to specifically work at such low resolutions. We observe a significant performance drop for all poses only when the face size is as low as 30 pixels.

Conclusion: Resolution is not a major issue for 3D-FAN.

Performance across noisy initializations. For all results reported so far, we used 10% noise added to the ground truth bounding boxes. Note that 3D-FAN was trained with a noise level of 20%. Herein, we repeated the previous experiment but for different noise levels and report the performance of 3D-FAN in terms of AUC in Table 4. We observe only a small performance decrease for a noise level equal to 30%, which is greater than the level of noise that the network was trained with.

Conclusion: Initialization is not a major issue for 3D-FAN.
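Initialization noise of this kind is typically simulated by jittering the ground-truth box; a plausible sketch of such a perturbation (our interpretation, not the paper's exact protocol):

```python
import random

def jitter_bbox(x, y, w, h, noise=0.10):
    """Perturb a ground-truth box by up to `noise` of its size.

    Shifts the corner and rescales width/height by random factors in
    [-noise, +noise]; e.g. noise=0.10 for the 10% setting, 0.30 for 30%.
    """
    nx = x + random.uniform(-noise, noise) * w
    ny = y + random.uniform(-noise, noise) * h
    nw = w * (1.0 + random.uniform(-noise, noise))
    nh = h * (1.0 + random.uniform(-noise, noise))
    return nx, ny, nw, nh
```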


Performance across different network sizes. For all results reported so far, we used a very powerful 3D-FAN with 24M parameters. Herein, we repeated the previous experiment varying the number of network parameters and report the performance of 3D-FAN in terms of AUC in Table 5. The number of parameters is varied by firstly reducing the number of HG networks used from 4 to 1. Then, the number of parameters was dropped further by reducing the number of channels inside the building block. It is important to note that even the biggest network is able to run at 28-30 fps on a TitanX GPU, while the smallest one can reach 150 fps. We observe that down to 12M parameters there is only a small performance drop, and that the network's performance starts to drop significantly only when the number of parameters becomes as low as 6M.

Conclusion: There is a moderate performance drop vs. the number of parameters of 3D-FAN. We believe that this is an interesting direction for future work.

9. Conclusions

We constructed a state-of-the-art neural network for landmark localization, trained it for 2D and 3D face alignment, and evaluated it on hundreds of thousands of images. Our results show that our network nearly saturates these datasets, showing also remarkable resilience to pose, resolution, initialization, and even to the number of network parameters used. Although some very unfamiliar poses were not explored in these datasets, there is no reason to believe that, given sufficient data, the network does not have the learning capacity to accommodate them, too.


10. Acknowledgments

Adrian Bulat was funded by a PhD scholarship from the University of Nottingham. This work was supported in part by the EPSRC project EP/M02153X/1 Facial Deformable Models of Animals.
