PlaneNet (Piece-wise Planar Reconstruction from a Single RGB Image): a side-by-side English/Chinese reading of the paper
Paper: https://arxiv.org/pdf/1804.06278.pdf
Code: https://github.com/art-programmer/PlaneNet
PlaneNet: Piece-wise Planar Reconstruction from a Single RGB Image
Abstract
EN: This paper proposes a deep neural network (DNN) for piece-wise planar depthmap reconstruction from a single RGB image. While DNNs have brought remarkable progress to single-image depth prediction, piece-wise planar depthmap reconstruction requires a structured geometry representation, and has been a difficult task to master even for DNNs. The proposed end-to-end DNN learns to directly infer a set of plane parameters and corresponding plane segmentation masks from a single RGB image. We have generated more than 50,000 piece-wise planar depthmaps for training and testing from ScanNet, a large-scale RGBD video database. Our qualitative and quantitative evaluations demonstrate that the proposed approach outperforms baseline methods in terms of both plane segmentation and depth estimation accuracy. To the best of our knowledge, this paper presents the first end-to-end neural architecture for piece-wise planar reconstruction from a single RGB image. Code and data are available at https://github.com/art-programmer/PlaneNet.
CH: This paper proposes a deep neural network (DNN) for reconstructing a piece-wise planar depthmap from a single image. Although DNNs have made remarkable progress in single-image depth prediction, piece-wise planar depthmap reconstruction needs a structured geometric representation and remains hard even for DNNs. The proposed end-to-end DNN infers a set of plane parameters and the corresponding plane segmentation masks directly from a single RGB image. We generated more than 50,000 piece-wise planar depthmaps for training and testing from ScanNet, a large RGBD video dataset. Our qualitative and quantitative evaluations show that the proposed method beats the baseline methods in both plane segmentation and depth estimation accuracy. As far as we know, the end-to-end architecture proposed here is the first neural network for piece-wise planar reconstruction from a single RGB image. Code and data are on GitHub: https://github.com/art-programmer/PlaneNet
1. Introduction
EN: Human vision has a remarkable perceptual capability for understanding high-level scene structures. Looking at a typical indoor scene (for example, Figure 1), we can immediately parse the room into a few dominant planes (for example, the floor, walls, and ceiling), perceive the major surfaces of furniture, or recognize the horizontal surface of a tabletop. Piece-wise planar geometry understanding will be key to many applications in emerging areas such as robotics or augmented reality (AR). For example, a robot needs to detect the extent of the floor to plan a movement, or to segment a tabletop to place objects. In AR applications, planar surface detection is becoming a basic building block for placing virtual objects on a tabletop, replacing floor textures, or hanging artwork on walls for interior remodeling. A fundamental problem in computer vision is to develop a computational algorithm that masters a similar perceptual capability to enable such applications.
CH: Human vision has an extraordinary ability to perceive high-level scene structure. Looking at a typical indoor scene (such as Figure 1), we can at once break the room down into a few main planes (walls, floor, ceiling) and sense the main surfaces of the furniture and the surface of a horizontal tabletop. Geometric understanding of piece-wise planes plays a key role in many applications in emerging fields such as robotics and augmented reality (AR). For example, a robot needs to detect the extent of the floor it can move on, or to segment the tabletop when placing an object. In AR applications, placing virtual objects on a tabletop, changing the floor pattern, or hanging artwork on a wall for interior remodeling all rely on planar surface detection as a basic module. A fundamental problem in computer vision is to build a computational algorithm with a similar perceptual ability so that such applications become possible.
EN: With the proliferation of deep neural networks, single-image depthmap inference and room layout estimation have been active areas of research. However, to our surprise, little attention has been paid to piece-wise planar depthmap reconstruction, which mimics this remarkable human perception in a general form. The main challenge is that a piece-wise planar depthmap requires a structured geometric representation (i.e., a set of plane parameters and their segmentation masks). In particular, we know neither the number of planes to infer nor the order in which the planes are regressed in the output feature vector, which makes the task challenging even for deep neural networks.
CH: With the rise of deep neural networks, inferring depthmaps and room layouts from a single image has been an active research area. To our surprise, however, piece-wise planar depthmap reconstruction, which imitates this remarkable human perception in a general form, has received little attention. The main challenge is that a piece-wise planar depthmap needs a structured geometric representation (i.e., a set of plane parameters and their segmentation masks). In particular, we know neither the number of planes to infer nor the order of the planes in the output feature vector, which makes the task challenging even for deep neural networks.
EN: This paper proposes a novel deep neural architecture "PlaneNet" that learns to directly produce a set of plane parameters and probabilistic plane segmentation masks from a single RGB image. Following a recent work on point-set generation, we define a loss function that is agnostic to the order of planes. We further control the number of planes by allowing probabilistic plane segmentation masks to be all 0. The network also predicts a depthmap at non-planar surfaces, whose loss is defined through the probabilistic segmentation masks to allow back-propagation. We have generated more than 50,000 piece-wise planar depthmaps from ScanNet as ground-truth by fitting planes to 3D points and projecting them to images. Qualitative and quantitative evaluations show that our algorithm produces significantly better plane segmentation results than the current state-of-the-art. Furthermore, our depth prediction accuracy is on par with or even superior to the existing single-image depth inference techniques that are specifically trained for this task.
CH: This paper proposes a new deep neural architecture, "PlaneNet", which learns to produce a set of plane parameters and the corresponding plane segmentation masks directly from a single RGB image. Following a recent work on point-set generation, we define a loss function that does not depend on the order of the planes. We further control the number of planes by allowing the probabilistic plane segmentation masks to be 0. The network also predicts a depthmap for non-planar regions; its loss is defined through the probabilistic segmentation masks so that it can be back-propagated. By fitting planes to 3D points and projecting them into images, we generated more than 50,000 piece-wise planar depthmaps from ScanNet as ground truth. Qualitative and quantitative evaluations show that our plane segmentation results are significantly better than the current state of the art. Moreover, our depth prediction accuracy is on par with or even better than algorithms trained specifically for that task.
2. Related work
EN: Multi-view piece-wise planar reconstruction. Piece-wise planar depthmap reconstruction was once an active research topic in multi-view 3D reconstruction. The task is to infer a set of plane parameters and assign a plane ID to each pixel. Most existing methods first reconstruct precise 3D points, perform plane-fitting to generate plane hypotheses, then solve a global inference problem to reconstruct a piece-wise planar depthmap. Our approach learns to directly infer plane parameters and plane segmentations from a single RGB image.
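CH: Multi-view piece-wise planar reconstruction. Piece-wise planar depthmap reconstruction used to be an active research topic in multi-view 3D reconstruction. The task is to infer a set of plane parameters and assign a plane ID to every pixel. Most existing algorithms first reconstruct an accurate 3D point set, fit planes to generate plane hypotheses, and then solve a global inference problem to reconstruct a piece-wise planar depthmap. Our method learns to obtain the plane parameters and the corresponding plane segmentation masks directly from a single RGB image.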
EN: Learning based depth reconstruction. Saxena et al. pioneered a learning based approach for depthmap inference from a single image. With the surge of deep neural networks, numerous CNN based approaches have been proposed. However, most techniques simply produce an array of depth values (i.e., depthmap) without plane detection or segmentation. More recently, Wang et al. enforce planarity in depth (and surface normal) predictions by inferring pixels on planar surfaces. This is the closest work to ours. However, they only produce a binary segmentation mask (i.e., if a pixel is on a planar surface or not) without plane parameters or instance-level plane segmentation.
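CH: Learning-based depth reconstruction. Saxena et al. proposed a learning-based method for inferring a depthmap from a single image. With the rise of deep neural networks, many CNN-based methods have appeared. However, most of them simply produce an array of depth values (i.e., a depthmap) without plane detection or segmentation. Recently, Wang et al. enforced planarity in depth (and surface normal) prediction by inferring which pixels lie on planar surfaces. This is the method closest to ours. However, they only produce a binary segmentation mask (i.e., whether a pixel is on a plane or not), without plane parameters or instance-level plane segmentation.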
EN: Layout estimation. Room layout estimation also aims at predicting dominant planes in a scene (e.g., walls, floor, and ceiling). Most traditional approaches rely on image processing heuristics to estimate vanishing points of a scene, and aggregate low-level features by a global optimization procedure. Besides low-level features, high-level information has been utilized, such as human poses or semantics. Attempts have been made to go beyond room structure, and predict object geometry. However, the reliance on hand-crafted features makes those methods less robust, and the Manhattan World assumption limits their operating ranges. Recently, Lee et al. proposed an end-to-end deep neural network, RoomNet, which simultaneously classifies a room layout type and predicts corner locations. However, their framework is not applicable to general piece-wise planar scenes.
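CH: Room layout estimation. Room layout estimation also predicts the dominant planes of a scene (e.g., walls, floor, and ceiling). Most traditional algorithms rely on image-processing heuristics to estimate the vanishing points of the scene and aggregate low-level features with a global optimization procedure. Besides low-level features, high-level information such as human poses and semantics has also been used. There have been attempts to go beyond the room structure and predict the geometry of objects. However, hand-crafted features make these methods less robust, and the Manhattan World assumption limits their operating range. Recently, Lee et al. proposed RoomNet, an end-to-end deep neural network that classifies the room layout type and predicts corner locations at the same time. However, their framework does not apply to general piece-wise planar scenes.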
EN: Line analysis. Single-image 3D reconstruction of line drawings dates back to the 60s. The earliest attempt is probably Roberts' system, which inspired many follow-up works. In real images, extraction of line drawings is challenging. Statistical analysis of line directions, junctions, or image segments has been used to enable 3D reconstruction for architectural scenes or indoor panoramas. Attributed grammar was used to parse an image into a hierarchical graph for 3D reconstruction. However, these approaches require hand-crafted features, grammar specification, or algorithmic rules. Our approach is purely data-driven, harnessing the power of deep neural networks.
CH: Line analysis. 3D reconstruction from a single line drawing dates back to the 1960s. The earliest attempt was probably Roberts' system, which inspired much later work. In real images, extracting line drawings is quite challenging. Statistical analysis of line directions, junctions, and image segments has been used for 3D reconstruction of architectural scenes and indoor panoramas. Attributed grammar parses an image into a hierarchical graph for 3D reconstruction. However, these traditional algorithms need hand-crafted features, grammar specification, or algorithmic rules. Our method relies purely on data-driven deep neural networks.
3. PlaneNet
EN: We build our network on Dilated Residual Networks (DRNs) (see Figure 2), which are a flexible framework for both global tasks (e.g., image classification) and pixel-wise prediction tasks (e.g., semantic segmentation). Given the high-resolution final feature map from the DRN, we compose three output branches for the three prediction tasks.
CH: We build our network on Dilated Residual Networks (DRNs) (see Figure 2). A DRN is a flexible framework for both global tasks (e.g., image classification) and pixel-wise prediction tasks (e.g., semantic segmentation). On top of the high-resolution feature map that the DRN finally outputs, we provide three branches for the three prediction tasks.
EN: Plane parameters: For each scene, we predict a fixed number ($K$) of planar surfaces $S = \{S_1, \cdots, S_K\}$. Each surface $S_i$ is specified by the three plane parameters $P_i$ (i.e., encoding a normal and an offset). We use $D_i$ to denote a depth image, which can be inferred from the parameters $P_i$. The depth value calculation requires camera intrinsic parameters, which can be estimated via vanishing point analysis, for example. In our experiments, intrinsics are given for each image through the database information.
CH: Plane parameters: For each scene, the number of planes we predict, $S = \{S_1, \cdots, S_K\}$, is fixed at $K$. Each plane $S_i$ is specified by three plane parameters $P_i$ (i.e., encoding a normal and an offset). We use $D_i$ to denote the depth image, which can be computed from $P_i$. Computing depth values requires the camera intrinsics, which can be estimated through vanishing point analysis. In our experiments, however, the intrinsics come with each image in the dataset.
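To make the link between $P_i$ and $D_i$ concrete, here is a minimal NumPy sketch. It assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, and it assumes the parameterization described in Section 3.1, where $P_i$ is the point on the plane closest to the camera center (so the normal is $P_i / \Vert P_i \Vert$ and the offset is $\Vert P_i \Vert$). The function name and axis convention are illustrative, not taken from the PlaneNet code.

```python
import numpy as np

def plane_depthmap(plane_point, fx, fy, cx, cy, width, height):
    """Depth map of an infinite plane seen by a pinhole camera.

    plane_point is assumed to be the point on the plane closest to the
    camera center, so normal = p / |p| and offset = |p|.
    """
    p = np.asarray(plane_point, dtype=float)
    offset = np.linalg.norm(p)
    normal = p / offset
    # Back-project every pixel to a ray whose z component is 1.
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    rays = np.stack(
        [(u - cx) / fx, (v - cy) / fy, np.ones_like(u, dtype=float)], axis=-1
    )
    # A point z * ray lies on the plane when normal . (z * ray) = offset.
    denom = rays @ normal
    with np.errstate(divide="ignore", invalid="ignore"):
        depth = offset / denom
    # Rays parallel to the plane, or hitting it behind the camera, are invalid.
    depth[~(denom > 1e-8)] = np.nan
    return depth
```

For a fronto-parallel plane such as `plane_point = [0, 0, 2]`, every pixel gets depth 2, which is a quick sanity check on the convention.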
EN: Non-planar depthmap: We model non-planar structures and infer their geometry as a standard depthmap. With abuse of notation, we treat it as the $(K+1)^{th}$ surface and denote the depthmap as $D_{K+1}$. This does not explain planar surfaces.
CH: Non-planar depthmap: We model non-planar structures and infer their geometry as a standard depthmap. In terms of notation, we treat it as the $(K+1)^{th}$ surface and denote the corresponding depthmap as $D_{K+1}$. It is not used to explain planar surfaces.
EN: Segmentation masks: The last output is the probabilistic segmentation masks for the $K$ planes $(M_1, \cdots, M_K)$ and the non-planar depthmap $(M_{K+1})$.
CH: Segmentation masks: The last output is the segmentation masks for the $K$ planes $(M_1, \cdots, M_K)$ and for the non-planar depthmap $(M_{K+1})$.
EN: In summary, the network predicts 1) plane parameters $(P_1, \cdots, P_K)$, 2) a non-planar depthmap $(D_{K+1})$, and 3) probabilistic segmentation masks $(M_1, \cdots, M_{K+1})$. We now explain more details and loss functions for each task.
CH: To summarize, the network handles three tasks: 1) plane parameters $(P_1, \cdots, P_K)$, 2) a non-planar depthmap $(D_{K+1})$, and 3) probabilistic segmentation masks $(M_1, \cdots, M_{K+1})$. The details and loss function of each task are explained below.
3.1. Plane parameter branch
EN: The plane parameter branch starts with a global average pooling to reduce the feature map size to 1x1, followed by a fully connected layer to produce $K \times 3$ plane parameters. We know neither the number of planes nor their order in this prediction task. Following prior works, we predict a constant number ($K$) of planes, then allow some predictions to be invalid by letting the corresponding probabilistic segmentation masks be 0. Our ground-truth generation process (see Sect. 4) produces at most 10 planes for most examples, thus we set $K = 10$ in our experiments. We define an order-agnostic loss function based on the Chamfer distance metric for the regressed plane parameters:
$$L^P=\sum_{i=1}^{K^*}\min_{j\in[1,K]}\Vert P_i^*-P_j \Vert_2^2$$
The parameterization $P_i$ is given by the 3D coordinate of the point that is closest to the camera center on the plane. $P_i^*$ is the ground truth. $K^*$ is the number of ground-truth planes.
CH: The plane parameter branch starts with global average pooling, which reduces the feature map to 1x1, followed by a fully connected layer that produces $K \times 3$ plane parameters. In this prediction task we know neither the number of planes nor their order. Following earlier work, we predict a fixed number $K$ of planes and then invalidate some predictions by letting the corresponding probabilistic segmentation masks be 0. Our ground-truth generation (see Section 4) yields at most ten planes for most examples, so we set $K = 10$ in our experiments. Based on the Chamfer distance metric, we define an order-independent loss function for the regressed plane parameters:
$$L^P=\sum_{i=1}^{K^*}\min_{j\in[1,K]}\Vert P_i^*-P_j \Vert_2^2$$
The parameter $P_i$ is the 3D coordinate of the point on the plane that is closest to the camera center. $P_i^*$ is the ground truth. $K^*$ is the number of ground-truth planes.
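The loss $L^P$ above can be sketched in a few lines of NumPy. This is an illustrative forward computation only (no gradients), and the function name `plane_parameter_loss` is an assumption for this example rather than something from the PlaneNet repository.

```python
import numpy as np

def plane_parameter_loss(gt_planes, pred_planes):
    """Order-agnostic loss: each ground-truth plane picks its nearest prediction."""
    gt = np.asarray(gt_planes, dtype=float)      # shape (K*, 3)
    pred = np.asarray(pred_planes, dtype=float)  # shape (K, 3)
    # Pairwise squared Euclidean distances, shape (K*, K).
    sq_dist = ((gt[:, None, :] - pred[None, :, :]) ** 2).sum(axis=-1)
    # Minimum over predictions for every ground-truth plane, then sum.
    return sq_dist.min(axis=1).sum()
```

Because the minimum is taken over all $K$ predictions, permuting the rows of `pred_planes` leaves the value unchanged, and extra predictions that match no ground-truth plane are not penalized by this term.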
3.2. Plane segmentation branch
EN: The branch starts with a pyramid pooling module, followed by a convolutional layer to produce a $(K+1)$-channel likelihood map for planar and non-planar surfaces. We append a dense conditional random field (DCRF) module based on the fast inference algorithm proposed by Krahenbuhl and Koltun, and jointly train the DCRF module with the preceding layers as in Zheng et al. We set the number of mean-field iterations to 5 during training and to 10 during testing. For simplicity, the bandwidths of the bilateral filters are fixed. We use a standard softmax cross entropy loss to supervise the segmentation training:
$$L^M=\sum_{i=1}^{K+1}\sum_{p \in I}\left(\mathbf{1}(M^{*(p)}=i)\log(1-M_i^{(p)})\right)$$
The internal summation is over the image pixels ($I$), where $M_i^{(p)}$ denotes the probability of pixel $p$ belonging to the $i^{th}$ plane. $M^{*(p)}$ is the ground-truth plane-id for the pixel.
CH: This branch starts with a pyramid pooling module, followed by a convolutional layer that produces a $(K+1)$-channel likelihood map for planar and non-planar surfaces. We add a dense conditional random field (DCRF) module based on the fast inference algorithm of Krahenbuhl and Koltun, and train the DCRF module jointly with the preceding layers as in Zheng et al. We set the number of mean-field iterations to 5 during training and 10 during testing. For simplicity, the bandwidths of the bilateral filters are fixed. We use a standard softmax cross entropy loss to supervise the segmentation training:
$$L^M=\sum_{i=1}^{K+1}\sum_{p \in I}\left(\mathbf{1}(M^{*(p)}=i)\log(1-M_i^{(p)})\right)$$
The inner sum runs over the image pixels ($I$), where $M_i^{(p)}$ is the probability that pixel $p$ belongs to the $i^{th}$ plane. $M^{*(p)}$ is the ground-truth plane id of the pixel.
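As a rough illustration of the "standard softmax cross entropy" that the text names, the sketch below uses the conventional negative log-likelihood form, $-\log M_{i}^{(p)}$ for the ground-truth channel. That conventional form is an assumption on our part and differs in form from the expression printed above. Plane ids are 0-based here, the DCRF module is left out, and the function name is hypothetical.

```python
import numpy as np

def segmentation_cross_entropy(logits, gt_ids):
    """Softmax cross entropy summed over pixels.

    logits: (H, W, K+1) raw scores; gt_ids: (H, W) integer ids in [0, K].
    """
    logits = np.asarray(logits, dtype=float)
    # Subtract the per-pixel maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the ground-truth channel at every pixel.
    picked = np.take_along_axis(log_probs, np.asarray(gt_ids)[..., None], axis=-1)
    return -picked.sum()
```

With all-zero logits every channel has probability $1/(K+1)$, so each pixel contributes $\log(K+1)$ to the loss.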
3.3. Non-planar depth branch
EN: The branch shares the same pyramid pooling module, followed by a convolution layer to produce a 1-channel depthmap. Instead of defining a loss specifically for non-planar regions, we found that exploiting the entire ground-truth depthmap makes the overall training more effective. Specifically, we define the loss as the sum of squared depth differences between the ground-truth and either a predicted plane or a non-planar depthmap, weighted by probabilities:
$$L^D=\sum_{i=1}^{K+1}\sum_{p\in I}\left(M_i^{(p)}(D_i^{(p)}-D^{*(p)})^2\right)$$
$D_i^{(p)}$ denotes the depth value at pixel $p$, while $D^{*(p)}$ is the ground-truth depth value.
CH: This branch shares the pyramid pooling module with the plane segmentation branch, followed by a convolutional layer that produces a 1-channel depthmap. We found that training on the entire ground-truth depthmap is more effective than defining a loss only for non-planar regions. We therefore define the loss as the sum of squared depth differences between the ground truth and either a predicted plane or the non-planar depthmap, weighted by the probabilities:
$$L^D=\sum_{i=1}^{K+1}\sum_{p\in I}\left(M_i^{(p)}(D_i^{(p)}-D^{*(p)})^2\right)$$
$D_i^{(p)}$ is the depth value at pixel $p$, and $D^{*(p)}$ is the ground-truth depth value.
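A small NumPy sketch of $L^D$ may help show how the probabilities act as per-pixel weights. The array layout (surfaces stacked along the first axis) and the function name are assumptions made for this example.

```python
import numpy as np

def depth_loss(masks, depths, gt_depth):
    """Probability-weighted sum of squared depth errors.

    masks, depths: (K+1, H, W), the last slice being the non-planar surface.
    gt_depth: (H, W) ground-truth depthmap.
    """
    masks = np.asarray(masks, dtype=float)
    depths = np.asarray(depths, dtype=float)
    gt_depth = np.asarray(gt_depth, dtype=float)
    # Broadcast the ground truth against every surface's depth image.
    squared_error = (depths - gt_depth[None]) ** 2
    return (masks * squared_error).sum()
```

A surface whose mask is 0 at a pixel contributes nothing there, which is how invalid plane predictions drop out of this term.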
4. Datasets and implementation details
EN: We have generated 51,000 ground-truth piece-wise planar depthmaps (50,000 training and 1,000 testing) from ScanNet, a large-scale indoor RGB-D video database. A depthmap in a single RGB-D frame contains holes and the quality deteriorates at far distances. Our approach for ground-truth generation is to directly fit planes to a consolidated mesh and project them back to individual frames, while also exploiting the associated semantic annotations.
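CH: From ScanNet (a large indoor RGB-D video database) we generated 51,000 piece-wise planar depthmaps as ground truth (50,000 for training and 1,000 for testing). The depthmap of a single RGB-D frame contains holes, and its quality degrades at far distances. Our way of generating ground truth is to fit planes to the consolidated mesh and project them back to individual frames, while also using the associated semantic annotations.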
EN: Specifically, for each sub mesh-model of the same semantic label, we treat mesh-vertices as points and repeat extracting planes by RANSAC with replacement. The inlier distance threshold is 5 cm, and the process continues until 90% of the points are covered. We merge two (not necessarily adjacent) planes that span different semantic labels if the plane normal difference is below $20^\circ$, and if the larger plane fits the smaller one with the mean distance error below 5 cm. We project each triangle to individual frames if the three vertices are fitted by the same plane. After projecting all the triangles, we keep only the planes whose projected area is larger than 1% of an image. We discard entire frames if the ratio of pixels covered by the planes is below 50%. For training samples, we randomly choose 90% of the scenes from ScanNet, subsample every 10 frames, compute piece-wise planar depthmaps with the above procedure, then use the final random sampling to produce 50,000 examples. The same procedure generates 1,000 testing examples from the remaining 10% of the scenes.
CH: Specifically, for each sub mesh-model with the same semantic label, we treat the mesh vertices as points and repeatedly extract planes with RANSAC. The inlier distance threshold is 5 cm, and the process continues until 90% of the points are covered. We merge two (not necessarily adjacent) planes that span different semantic labels if their normals differ by less than $20^\circ$ and the larger plane fits the smaller one with a mean distance error below 5 cm. If the three vertices of a triangle are fitted by the same plane, we project the triangle to the individual frames. After projecting all triangles, we keep only the planes whose projected area exceeds 1% of the image. If the planes cover less than 50% of the pixels, we discard the whole frame. We randomly pick 90% of the ScanNet scenes, sample every tenth frame, generate piece-wise planar depthmaps with the procedure above, and then randomly sample 50,000 examples as the training set. The same procedure produces 1,000 test examples from the remaining 10% of the scenes.
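The repeated plane extraction can be sketched as sequential RANSAC. The version below is a simplified illustration under stated assumptions: it removes the inliers of each extracted plane before searching for the next one, uses a fixed number of iterations, and does not reproduce the merging, projection, or filtering steps. The function names, `iters`, and `seed` are hypothetical; only the 5 cm threshold and the 90% coverage target come from the text.

```python
import numpy as np

def fit_plane(p0, p1, p2):
    """Unit normal n and offset d (n . x = d) through three points, or None if collinear."""
    n = np.cross(p1 - p0, p2 - p0)
    norm = np.linalg.norm(n)
    if norm < 1e-9:
        return None
    n = n / norm
    return n, float(n @ p0)

def extract_planes(points, threshold=0.05, coverage=0.9, iters=200, seed=0):
    """Extract planes one by one until `coverage` of the points is explained."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    remaining = np.arange(len(points))
    planes = []
    while len(remaining) >= 3 and len(remaining) > (1 - coverage) * len(points):
        pts = points[remaining]
        best_plane, best_inliers = None, None
        for _ in range(iters):
            sample = pts[rng.choice(len(pts), size=3, replace=False)]
            plane = fit_plane(*sample)
            if plane is None:
                continue
            # Inliers are points within `threshold` meters of the candidate plane.
            inliers = np.abs(pts @ plane[0] - plane[1]) < threshold
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_plane, best_inliers = plane, inliers
        if best_plane is None:
            break
        planes.append(best_plane)
        remaining = remaining[~best_inliers]
    return planes
```

On a synthetic point cloud made of a floor and a wall, this loop is expected to return two planes, one per surface.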
EN: We have implemented PlaneNet using TensorFlow based on DeepLab. Our system is a 101-layer ResNet with Dilated Convolution, while we have followed a prior work and modified the first few layers to deal with the degridding issue. The final feature map of the DRN contains 2096 channels. We use the Adam optimizer with the initial learning rate set to 0.0003. The input image, the output plane segmentation masks, and the non-planar depthmap have a resolution of 256x192. We train our network for 50 epochs on the 50,000 training samples.
CH: We implemented PlaneNet in TensorFlow on top of DeepLab. Our network is a 101-layer ResNet with dilated convolution; following earlier work, we modified the first few layers to deal with the degridding issue. The final DRN feature map has 2096 channels. We train with the Adam optimizer and an initial learning rate of 0.0003. The input image, the output plane segmentation masks, and the non-planar depthmap all have a resolution of 256x192. We train the network for 50 epochs on the 50,000 training samples.
5. Experimental results
EN: Figure 3 shows our reconstruction results for a variety of scenes. Our end-to-end learning framework has successfully recovered piece-wise planar and semantically meaningful structures, such as a floor, walls, a tabletop, or a computer screen, from a single RGB image. We include more examples in the supplementary material. We now provide quantitative evaluations of the plane segmentation accuracy and the depth reconstruction accuracy against competing baselines, followed by more analyses of our results.
CH: Figure 3 shows plane reconstruction results for various scenes. Our end-to-end learning framework successfully reconstructs piece-wise planar and semantically meaningful structures, such as floors, walls, tabletops, or a computer screen, from a single RGB image. More examples are in the supplementary material. We then give a quantitative evaluation of plane segmentation and depthmap reconstruction, followed by further analysis of our results.
EN: Figure 3: Piece-wise planar depthmap reconstruction results by PlaneNet. From left to right: input image, plane segmentation, depthmap reconstruction, and 3D rendering of our depthmap. In the plane segmentation results, the black color shows non-planar surface regions.
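CH: Figure 3: Piece-wise planar depthmap reconstruction results of PlaneNet. From left to right: input image, plane segmentation result, depthmap reconstruction result, and 3D rendering of the depthmap. In the plane segmentation results, black marks non-planar surface regions.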
5.1. Plane segmentation accuracy
EN: Piece-wise planar reconstruction from a single RGB image is a challenging problem. While existing approaches have produced encouraging results, they are based on hand-crafted features and algorithmic designs, and may not match against big-data and deep neural network (DNN) based systems. Much better baselines would then be piece-wise planar depthmap reconstruction techniques from 3D points, where input 3D points are either given by the ground truth depthmaps or inferred by a state-of-the-art DNN-based system.
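CH: Piece-wise planar reconstruction from a single RGB image is a challenging problem. Existing methods have produced good results, but they are built on hand-crafted features and algorithm designs, and may be no match for systems based on big data and deep neural networks. Better baselines are piece-wise planar depthmap reconstruction techniques that start from 3D points, where the input 3D points come either from the ground-truth depthmap or from a state-of-the-art DNN-based system.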
EN: In particular, to infer depthmaps, we have used a variant of PlaneNet which only has the pixel-wise depthmap branch, while following Eigen et al. to change the loss. Table 1 shows that this network, PlaneNet (Depth rep.), outperforms the current top-performers on the NYU benchmark.
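CH: In particular, to infer depthmaps we used a variant of PlaneNet that keeps only the pixel-wise depthmap branch and changes the loss function following Eigen et al. Table 1 shows that this network, PlaneNet (Depth rep.), outperforms the current best networks on the NYU benchmark.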
EN: For piece-wise planar depthmap reconstruction, we have used the following three baselines from the literature.
“NYU-Toolbox” is a plane extraction algorithm from the official NYU toolbox that extracts plane hypotheses using RANSAC, and optimizes the plane segmentation via a Markov Random Field (MRF) optimization.
Manhattan World Stereo (MWS) is very similar to NYU-Toolbox except that MWS employs the Manhattan World assumption in extracting planes and exploits vanishing lines in the pairwise terms to improve results.
Piecewise Planar Stereo (PPS) relaxes the Manhattan World assumption of MWS, and uses vanishing lines to generate better plane proposals. Please see the supplementary document for more algorithmic details on the baselines.
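CH: To compare piece-wise planar depthmap reconstruction, we used three methods from the literature as baselines.
NYU-Toolbox is the plane extraction algorithm in the official NYU toolbox. It extracts plane hypotheses with RANSAC and then optimizes the plane segmentation with a Markov Random Field (MRF).
Manhattan World Stereo (MWS) is very similar to NYU-Toolbox, except that MWS uses the Manhattan World assumption when extracting planes and uses vanishing lines in the pairwise terms to improve the results.
Piecewise Planar Stereo (PPS) relaxes the Manhattan World assumption of MWS and uses vanishing lines to generate better plane proposals.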
EN: Figure 4 shows the evaluation results on two recall metrics. The first metric is the percentage of correctly predicted ground-truth planes. We consider a ground-truth plane being correctly predicted, if one of the inferred planes has 1) more than 0.5 Intersection over Union (IOU) score and 2) the mean depth difference over the overlapping region is less than a threshold. We vary this threshold from 0 to 0.6m with an increment of 0.05m to plot graphs. The second recall metric is simply the percentage of pixels that are in such overlapping regions where planes are correctly predicted. The figure shows that PlaneNet is significantly better than all the competing methods when inferred depthmaps are used. PlaneNet is even better than some competing methods that use ground-truth depthmaps. This demonstrates the effectiveness of our approach, learning to infer piece-wise planar structures from many examples.
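CH: Figure 4 shows the evaluation results on two recall metrics. The first metric is the percentage of ground-truth planes that are correctly predicted. A ground-truth plane counts as correctly predicted if some inferred plane 1) has an IOU score above 0.5 with it and 2) has a mean depth difference over the overlapping region below a threshold. We vary this threshold from 0 to 0.6m in steps of 0.05m to plot the curves. The second metric is the percentage of pixels that fall in the overlapping regions of correctly predicted planes. The figure shows that PlaneNet is better than the other methods when inferred depthmaps are used. This demonstrates the effectiveness of our method, which learns to infer piece-wise planar structures from many examples.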
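The first recall metric can be sketched as follows. This is an illustrative reading of the two criteria (IOU above 0.5 and mean depth difference over the overlap below a threshold), not the evaluation script used for Figure 4. The convention that `-1` marks non-planar pixels and the function name are assumptions.

```python
import numpy as np

def plane_recall(gt_seg, pred_seg, gt_depth, pred_depth, depth_threshold,
                 iou_threshold=0.5):
    """Fraction of ground-truth planes matched by some predicted plane.

    gt_seg, pred_seg: (H, W) integer plane ids, with -1 for non-planar pixels.
    """
    gt_ids = [i for i in np.unique(gt_seg) if i >= 0]
    pred_ids = [j for j in np.unique(pred_seg) if j >= 0]
    matched = 0
    for i in gt_ids:
        gt_mask = gt_seg == i
        for j in pred_ids:
            pred_mask = pred_seg == j
            overlap = gt_mask & pred_mask
            union = (gt_mask | pred_mask).sum()
            if overlap.sum() / union <= iou_threshold:
                continue
            # Mean absolute depth difference over the overlapping region.
            if np.abs(gt_depth[overlap] - pred_depth[overlap]).mean() < depth_threshold:
                matched += 1
                break
    return matched / len(gt_ids) if gt_ids else 0.0
```

Sweeping `depth_threshold` over `np.arange(0, 0.65, 0.05)` and plotting the returned values would give a curve of the kind described for Figure 4.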
EN: Figure 4: Plane segmentation accuracy against competing baselines that use 3D points as input. Either ground-truth depthmaps or inferred depthmaps (by a DNN-based system) are used as their inputs. PlaneNet outperforms all the other methods that use inferred depthmaps. Surprisingly, PlaneNet is even better than many other methods that use ground-truth depthmaps.
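CH: Figure 4: Plane segmentation accuracy compared with baselines that take 3D points as input. Either ground-truth depthmaps or depthmaps inferred by a DNN-based system are used as their input. PlaneNet outperforms all the other methods that use inferred depthmaps. Surprisingly, PlaneNet is even better than some methods that use ground-truth depthmaps.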
EN: Figure 5 shows qualitative comparisons against existing methods with inferred depthmaps. PlaneNet produces significantly better plane segmentation results, while existing methods often generate many redundant planes where depthmaps are noisy, and fail to capture precise boundaries where the intensity edges are weak.
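CH: Figure 5 shows a qualitative comparison with existing methods that use inferred depthmaps. PlaneNet produces better plane segmentation results, whereas existing methods often generate redundant planes where the depthmap is noisy and fail to capture precise plane boundaries.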
EN: Figure 5: Qualitative comparisons between PlaneNet and existing methods that use inferred depthmaps as the inputs. From left to right: an input image, plane segmentation results for existing methods, and PlaneNet, respectively, and the ground-truth.
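CH: Figure 5: Qualitative comparison between PlaneNet and existing methods that take inferred depthmaps as input. From left to right: column 1 is the input image, columns 2 to 4 are the plane segmentation results of existing methods, column 5 is the plane segmentation result of PlaneNet, and column 6 is the ground truth.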
5.2. Depth reconstruction accuracy
EN: While the capability to infer a plane segmentation mask and precise plane parameters is the key contribution of the work, it is also interesting to compare against depth prediction methods. This is to ensure that our structured depth prediction does not compromise per-pixel depth prediction accuracy. PlaneNet makes $(K+1)$ depth value predictions at each pixel. We pick the depth value with the maximum probability in the segmentation mask to define our depthmap.
CH: Although the key contribution of this work is predicting plane segmentation masks and precise plane parameters, it can also be compared with depth prediction methods. This makes sure that our structured depth prediction does not hurt per-pixel depth prediction accuracy. PlaneNet makes $(K+1)$ depth value predictions at every pixel. We take the depth value with the highest probability in the segmentation mask to define the depthmap.
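The final selection step is a per-pixel winner-takes-all over the $K+1$ candidates. A minimal NumPy sketch, with an assumed array layout of `(K+1, H, W)` and a hypothetical function name:

```python
import numpy as np

def compose_depthmap(masks, depths):
    """Pick, at every pixel, the depth of the surface with the highest probability.

    masks, depths: (K+1, H, W); returns an (H, W) depthmap.
    """
    masks = np.asarray(masks, dtype=float)
    depths = np.asarray(depths, dtype=float)
    winner = masks.argmax(axis=0)  # (H, W) index of the most likely surface
    return np.take_along_axis(depths, winner[None], axis=0)[0]
```

The plane depth images in `depths` would come from the plane parameters and camera intrinsics, while the last slice would be the non-planar depthmap.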
EN: Depth accuracies are evaluated on the NYUv2 dataset at 1) planar regions, 2) boundary regions, and 3) the entire image, against three competing baselines. Eigen-VGG is a convolutional architecture to predict both depths and surface normals. SURGE is a more recent depth inference network that optimizes planarity. FCRN is the current state-of-the-art single-image depth inference network.
CH: Depth accuracy is evaluated on the NYUv2 dataset at planar regions, boundary regions, and the entire image, against three baseline networks. Eigen-VGG is a convolutional architecture that predicts depth values and surface normals. SURGE is a more recent depth inference network that optimizes planarity. FCRN is currently the best single-image depth inference network.
EN: Depthmaps in NYUv2 are very noisy and ground-truth plane extraction does not work well. Thus, we fine-tune our network using only the depth loss. Note that the key factor in this training is that the network is trained to generate a depthmap through our piece-wise planar depthmap representation. To further verify the effects of this representation, we have also fine-tuned our network in the standard per-pixel depthmap representation by disabling the plane parameter and the plane segmentation branches. In this version, denoted as "PlaneNet (Depth rep.)", the entire depthmap is predicted in the $(K+1)^{th}$ depthmap $(D_{K+1})$.
CH: The depthmaps in NYUv2 are very noisy, and ground-truth plane extraction does not work well on them. We therefore fine-tune our network using only the depth loss. Note that the key point in this training is that the network is trained to generate a depthmap through our piece-wise planar depthmap representation. To further verify the effect of this representation, we also disabled the plane parameter and plane segmentation branches and fine-tuned only the per-pixel depthmap branch; this version is denoted PlaneNet (Depth rep.).
EN: Table 1 shows the depth prediction accuracy on various metrics introduced in the prior work. The left five metrics provide different error statistics such as relative difference (Rel) or root-mean-square error (RMSE) on the average per-pixel depth errors. The right three metrics provide the ratio of pixels for which the relative difference between the predicted and the ground-truth depths is below a threshold. The table demonstrates that PlaneNet outperforms the state-of-the-art single-image depth inference techniques. As observed in prior works, the planarity constraint makes differences in the depth prediction task, and the improvements are more significant when our piece-wise planar representation is enforced by our network.
CH: Table 1 shows the depth prediction accuracy on the metrics used in earlier work. The five on the left are different error statistics on the average per-pixel depth error, such as relative difference (Rel) and root-mean-square error (RMSE). The three on the right give the ratio of pixels whose relative difference between predicted and ground-truth depth is below a threshold. The table shows that PlaneNet outperforms the current state-of-the-art single-image depth inference methods. As observed in earlier work, the planarity constraint helps the depth prediction task, and the effect is even more pronounced when our network enforces the piece-wise planar representation.
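For readers unfamiliar with these metrics, here is a hedged sketch of three of them. The text names only Rel and RMSE explicitly; the thresholds $1.25$, $1.25^2$, $1.25^3$ used for the ratio metrics are the ones commonly used in the single-image depth literature and are an assumption here, as is the function name. Predicted depths are assumed to be positive.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Rel, RMSE, and threshold-accuracy ratios over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=float).ravel()
    gt = np.asarray(gt, dtype=float).ravel()
    valid = gt > 0  # skip holes in the ground-truth depthmap
    pred, gt = pred[valid], gt[valid]
    rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    # Symmetric ratio between prediction and ground truth, always >= 1.
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)]
    return {"rel": float(rel), "rmse": float(rmse), "delta": deltas}
```

Lower is better for `rel` and `rmse`, and higher is better for the `delta` ratios.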
5.3. Plane ordering consistency
EN: For piece-wise planar depthmap inference, the ordering ambiguity is a challenge. We found that PlaneNet automatically learns a consistent ordering without supervision; for example, the floor is always regressed as the second plane. In Figure 3, the colors in the plane segmentation results are defined by the order of the planes in the network output. While the ordering loses consistency for small objects or extreme camera angles, in most cases major common surfaces such as floors and walls have a consistent ordering.
CH: For piece-wise planar depthmap inference, the order of the planes is a challenge. We found that PlaneNet learns a consistent plane ordering on its own, without supervision; for example, the detected floor is always assigned to the second plane. In Figure 3, the colors of the plane segmentation results are determined by the order of the output planes. In general the ordering is consistent for large planes such as walls and floors, and only loses consistency on some small planes.
EN: We have exploited this property and implemented a simple room layout estimation algorithm. More specifically, we look at reconstruction examples and manually select the plane entries that correspond to the ceiling, the floor, and the left/middle/right walls. For each possible room layout configuration (for example, a configuration with the floor, the left wall, and the middle wall visible), we construct a 3D concave hull based on the plane parameters and project it back to the image to generate a room layout. We measure the score of a configuration by the number of pixels where the constructed room layout and the inferred plane segmentation (determined by winner-takes-all) agree. We pick the constructed room layout with the best score as our prediction. Figure 6 shows that our algorithm can generate reasonable room layout estimates even when the scene is cluttered and contains many occluding objects. Table 2 shows a quantitative evaluation on the NYUv2 303 dataset, where our method is comparable to the prior art designed specifically for this task.
CH: Using this property, we implemented a room layout estimation algorithm. Specifically, we manually select, from reconstruction examples, the plane entries that correspond to the ceiling, walls, and floor. For each possible room layout configuration, we build a 3D structure from the inferred plane parameters and project it back to the image to generate a room layout. We score a layout by the number of pixels where the constructed room layout agrees with the inferred plane segmentation. Finally, we pick the room layout with the best score as the prediction. Figure 6 shows that our algorithm produces reasonable room layouts even when the scene is complex and contains many occluding objects. Table 2 shows that on the NYUv2 303 dataset our method is comparable to methods designed specifically for this task.
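The scoring step of this layout search is simple enough to sketch. The code below assumes that each candidate layout has already been rendered into an `(H, W)` label image whose labels use the same indices as the network's plane entries. Building the 3D concave hull and projecting it is not shown, and both function names are hypothetical.

```python
import numpy as np

def layout_score(layout_labels, masks):
    """Number of pixels where a candidate layout agrees with the segmentation.

    layout_labels: (H, W) plane index per pixel for one candidate layout.
    masks: (K+1, H, W) probabilistic segmentation masks.
    """
    winner = np.asarray(masks).argmax(axis=0)  # winner-takes-all segmentation
    return int((winner == np.asarray(layout_labels)).sum())

def pick_layout(candidates, masks):
    """Return the candidate layout with the highest agreement score."""
    return max(candidates, key=lambda layout: layout_score(layout, masks))
```

Because occluding objects are labeled as other planes or as non-planar in the segmentation, they simply lower the score of every candidate by a similar amount, which may be why the agreement count remains usable in cluttered scenes.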
EN: Figure 6: Room layout estimations. We have exploited the ordering consistency in the predicted planes to infer room layouts.
CH: Figure 6: Room layout estimation. We use the ordering consistency of the predicted planes to predict room layouts.
EN: Table 2: Room layout estimations. Quantitative evaluations against the top-performers over the NYUv2 303 dataset.
CH: Table 2: Room layout estimation. Quantitative comparison with other algorithms on the NYUv2 303 dataset.
5.4. Failure modes
EN: While achieving promising results on most images, PlaneNet has some failure modes as shown in Fig. 7. In the first example, PlaneNet generates two nearly co-planar vertical surfaces in the low-light region below the sink. In the second example, it cannot distinguish a white object on the floor from a white wall. In the third example, it misses a column structure on a wall due to the presence of object clutter. While the capability to infer precise plane parameters is already super-human, there is a lot of room for improvement on the planar segmentation, especially in the absence of texture information or in the presence of clutter.
CH: Although it works well on many images, PlaneNet still has failure modes, as shown in Figure 7. In the first example, PlaneNet produces two nearly co-planar vertical surfaces in a low-light region. In the second example, it fails to separate a white object from a white wall. In the third example, it misses a column structure on the wall because of clutter. Although PlaneNet is already very good at inferring plane parameters, its plane segmentation accuracy leaves room for improvement, especially when texture is missing or clutter is present.
EN: Figure 7: Typical failure modes occur in the absence of enough image texture cues or in the presence of small objects and clutter.
CH: Figure 7: Failures occur when texture is lacking or when small objects cause occlusion.
6. Applications
EN: Structured geometry reconstruction is important for many applications in Augmented Reality. We demonstrate two image editing applications enabled by our piece-wise planar representation: texture insertion and replacement (see Fig. 8). We first extract Manhattan directions by using the predicted plane normals through a standard voting scheme. Given a piece-wise planar region, we define an axis of its UV coordinate by the Manhattan direction that is the most parallel to the plane, while the other axis is simply the cross product of the first axis and the plane normal. Given a UV coordinate, we insert a new texture by alpha-blending or completely replace a texture with a new one. Please see the supplementary material and the video for more AR application examples.
CH: Structured geometry reconstruction matters for many augmented reality applications. Using our piece-wise planar representation, we built two image editing applications: texture insertion and texture replacement (see Figure 8). We first extract the Manhattan directions from the predicted plane normals with a standard voting scheme. Given a piece-wise planar region, we define one axis of its UV coordinate as the Manhattan direction most parallel to the plane; the other axis is the cross product of the first axis and the plane normal. Given the UV coordinate, we insert a new texture by alpha-blending or replace the old texture completely. See the supplementary material and the video for more examples.
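The UV-axis construction and the blending step can be sketched with NumPy. This illustration assumes the three Manhattan directions have already been estimated (the voting scheme is not shown), and the function names are hypothetical.

```python
import numpy as np

def uv_axes(normal, manhattan_dirs):
    """Two in-plane axes for texture mapping on a plane with the given normal."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    dirs = np.asarray(manhattan_dirs, dtype=float)
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    # The direction most parallel to the plane is the one least aligned with its normal.
    u_axis = dirs[np.argmin(np.abs(dirs @ n))]
    # The second axis is the cross product of the first axis and the normal.
    # It has unit length only when u_axis lies exactly in the plane.
    v_axis = np.cross(u_axis, n)
    return u_axis, v_axis

def alpha_blend(image, texture, alpha):
    """Blend a texture over an image; alpha=1 replaces the original completely."""
    return alpha * np.asarray(texture, dtype=float) + (1 - alpha) * np.asarray(image, dtype=float)
```

For a floor with normal `(0, 0, 1)` and axis-aligned Manhattan directions, the first axis comes out as `(1, 0, 0)` and the second as `(0, -1, 0)`.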
EN: Figure 8: Texture editing applications. From top to bottom, an input image, a plane segmentation result, and an edited image.
CH: Figure 8: Image texture editing applications.
7. Conclusion and future work
EN: This paper proposes PlaneNet, the first deep neural architecture for piece-wise planar depthmap reconstruction from a single RGB image. PlaneNet learns to directly infer a set of plane parameters and their probabilistic segmentation masks. The proposed approach significantly outperforms competing baselines in the plane segmentation task. It also advances the state-of-the-art in the single image depth prediction task. An interesting future direction is to go beyond the depthmap framework and tackle structured geometry prediction problems in a full 3D space.
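CH: This paper proposes PlaneNet, the first deep neural network for reconstructing a piece-wise planar depthmap from a single image. PlaneNet directly infers plane parameters and their segmentation masks. The method clearly outperforms the competing methods on this task and also advances single-image depth prediction. An interesting future direction is to go beyond depthmaps and handle structured geometry prediction directly in 3D space.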
Personal website: 心安便是歸處
GitHub: oh,ss