Paper:《Multimodal Machine Learning: A Survey and Taxonomy,多模态机器学习:综述与分类》翻译与解读
目錄
《Multimodal Machine Learning: A Survey and Taxonomy》翻譯與解讀
Abstract
1 INTRODUCTION
2 Applications: a historical perspective 應用:歷史視角
3 Multimodal Representations多模態表示
3.1 Joint Representations 聯合表示
3.2 Coordinated Representations協調表示
3.3 Discussion討論
4 Translation翻譯
4.1 Example-based 基于實例
4.2 Generative approaches生成方法
4.3 Model evaluation and discussion模型評價與討論
5 Alignment對齊
5.1 Explicit alignment顯式對齊
5.2 Implicit alignment隱式對齊
5.3 Discussion討論
6 Fusion融合
6.1 Model-agnostic approaches與模型無關的方法
6.2 Model-based approaches基于模型的方法
6.3 Discussion討論
7 Co-learning共同學習
7.1 Parallel data并行數據
7.2 Non-parallel data非并行數據
7.3 Hybrid data混合數據
7.4 Discussion討論
8 Conclusion結論
《Multimodal Machine Learning: A Survey and Taxonomy》翻譯與解讀
作者:Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency
時間:2017年5月26日
地址:https://arxiv.org/abs/1705.09406
Abstract
| Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research. | 我們對世界的體驗是多模態的——我們看到物體(視覺),聽到聲音(聽覺),感覺到質地(觸覺),聞到氣味(嗅覺),品嘗味道(味覺)。模態是指事物發生或被體驗的方式,當一個研究問題包含多種模態時,它就被稱為多模態的。為了讓人工智能在理解我們周圍世界方面取得進展,它需要能夠同時解讀這些多模態信號。多模態機器學習旨在建立能夠處理和關聯來自多種模態的信息的模型。這是一個充滿活力的多學科領域,其重要性和潛力都在不斷增加。本文不關注具體的多模態應用,而是對多模態機器學習本身的最新進展進行綜述,並將它們納入一個統一的分類體系中呈現。我們超越了典型的早期融合與晚期融合的劃分,確定了多模態機器學習面臨的更廣泛的挑戰,即:表示、翻譯、對齊、融合和共同學習。這一新的分類體系將使研究人員更好地了解該領域的現狀,並確定未來的研究方向。 |
| Index Terms—Multimodal, machine learning, introductory, survey. | 索引術語-多模態,機器學習,入門,調查。 |
1 INTRODUCTION
| THE world surrounding us involves multiple modalities— we see objects, hear sounds, feel texture, smell odors, and so on. In general terms, a modality refers to the way in which something happens or is experienced. Most people associate the word modality with the sensory modalities which represent our primary channels of communication and sensation, such as vision or touch. A research problem or dataset is therefore characterized as multimodal when it includes multiple such modalities. In this paper we focus primarily, but not exclusively, on three modalities: natural language which can be both written or spoken; visual signals which are often represented with images or videos; and vocal signals which encode sounds and para-verbal information such as prosody and vocal expressions. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret and reason about multimodal messages. Multi- modal machine learning aims to build models that can process and relate information from multiple modalities. From early research on audio-visual speech recognition to the recent explosion of interest in language and vision models, multi- modal machine learning is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. | 我們周圍的世界包含多種形態——我們看到物體,聽到聲音,感覺到質地,聞到氣味,等等。一般來說,模態是指某事發生或被體驗的方式。大多數人將“情態”(后均譯為模態)一詞與代表我們溝通和感覺的主要渠道(如視覺或觸覺)的感官形式聯系在一起。因此,當一個研究問題或數據集包含多個這樣的模態時,它就被描述為多模態。在本文中,我們主要(但不完全)關注三種形式:可以書面或口頭的自然語言;通常用圖像或視頻表示的視覺信號;還有編碼聲音和似言語信息的聲音信號,如韻律和聲音表達。 為了讓人工智能在理解我們周圍的世界方面取得進展,它需要能夠解釋和推理關于多模態信息。多模態機器學習旨在建立能夠處理和關聯來自多種模態信息的模型。從早期的視聽語音識別研究到最近對語言和視覺模型的興趣激增,多模態機器學習是一個充滿活力的多學科領域,其重要性日益增加,具有非凡的潛力。 |
| The research field of Multimodal Machine Learning brings some unique challenges for computational researchers given the heterogeneity of the data. Learning from multimodal sources offers the possibility of capturing correspondences between modalities and gaining an in-depth understanding of natural phenomena. In this paper we identify and explore five core technical challenges (and related sub-challenges) surrounding multimodal machine learning. They are central to the multimodal setting and need to be tackled in order to progress the field. Our taxonomy goes beyond the typical early and late fusion split, and consists of the five following challenges: 1)、Representation A first fundamental challenge is learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities. The heterogeneity of multimodal data makes it challenging to construct such representations. For example, language is often symbolic while audio and visual modalities will be represented as signals. 2)、Translation A second challenge addresses how to translate (map) data from one modality to another. Not only is the data heterogeneous, but the relationship between modalities is often open-ended or subjective. For example, there exist a number of correct ways to describe an image and one perfect translation may not exist. 3)、Alignment A third challenge is to identify the direct relations between (sub)elements from two or more different modalities. For example, we may want to align the steps in a recipe to a video showing the dish being made. To tackle this challenge we need to measure similarity between different modalities and deal with possible long-range dependencies and ambiguities. 4)、Fusion A fourth challenge is to join information from two or more modalities to perform a prediction. For example, for audio-visual speech recognition, the visual description of the lip motion is fused with the speech signal to predict spoken words. The information coming from different modalities may have varying predictive power and noise topology, with possibly missing data in at least one of the modalities. 5)、Co-learning A fifth challenge is to transfer knowledge between modalities, their representation, and their predictive models. This is exemplified by algorithms of co-training, conceptual grounding, and zero shot learning. Co-learning explores how knowledge learned from one modality can help a computational model trained on a different modality. This challenge is particularly relevant when one of the modalities has limited resources (e.g., annotated data). | 考慮到數據的異質性,多模態機器學習的研究領域給計算研究人員帶來了一些獨特的挑戰。從多模態來源學習提供了捕獲模態之間對應關係、深入理解自然現象的可能性。在本文中,我們確定並探討了圍繞多模態機器學習的五個核心技術挑戰(以及相關的子挑戰)。它們是多模態場景的核心問題,需要加以解決以推動該領域的發展。我們的分類超越了典型的早期融合與晚期融合的劃分,包括以下五個挑戰: 1)、表示:第一個基本挑戰是學習如何以一種利用多種模態的互補性和冗餘性的方式來表示和總結多模態數據。多模態數據的異質性使得構造這樣的表示具有挑戰性。例如,語言通常是符號化的,而音頻和視覺模態則表示為信號。 2)、翻譯:第二個挑戰是如何將數據從一種模態轉換(映射)到另一種模態。不僅數據是異質的,而且模態之間的關係往往是開放的或主觀的。例如,描述一幅圖像存在許多正確的方式,而完美的翻譯可能並不存在。 3)、對齊:第三個挑戰是識別來自兩個或更多不同模態的(子)元素之間的直接關係。例如,我們可能想要將菜譜中的步驟與展示菜肴製作過程的視頻對齊。為了應對這一挑戰,我們需要衡量不同模態之間的相似性,並處理可能的長程依賴和歧義。 4)、融合:第四個挑戰是將來自兩個或更多模態的信息結合起來進行預測。例如,在視聽語音識別中,將嘴唇運動的視覺描述與語音信號融合在一起來預測口語單詞。來自不同模態的信息可能具有不同的預測能力和噪聲拓撲,並且至少在一種模態中可能存在數據缺失。 5)、共同學習:第五個挑戰是在不同模態及其表示和預測模型之間傳遞知識。協同訓練、概念接地(conceptual grounding)和零樣本學習等算法是其典型例子。共同學習探索從一種模態學到的知識如何幫助在另一種模態上訓練的計算模型。當其中一種模態的資源有限(例如,標註數據)時,這一挑戰尤其重要。 |
| Table 1: A summary of applications enabled by multimodal machine learning. For each application area we identify the core technical challenges that need to be addressed in order to tackle it. APPLICATIONS:REPRESENTATION、TRANSLATION、ALIGNMENT、FUSION、CO-LEARNING 1、Speech recognition and synthesis:Audio-visual speech recognition、(Visual) speech synthesis 2、Event detection:Action classification、Multimedia event detection 3、Emotion and affect:Recognition、Synthesis 4、Media description:Image description、Video description、Visual question-answering、Media summarization 5、Multimedia retrieval:Cross modal retrieval、Cross modal hashing | 表1:多模態機器學習支持的應用程序的總結。對于每個應用領域,我們確定了需要解決的核心技術挑戰。 應用:表示、翻譯、對齊、融合、共同學習 1、語音識別與合成:視聽語音識別、(視覺)語音合成 2、事件檢測:動作分類、多媒體事件檢測 3、情感與影響:識別、綜合 4、媒體描述:圖像描述、視頻描述、視覺問答、媒體摘要 5、多媒體檢索:交叉模態檢索,交叉模態哈希 |
| For each of these five challenges, we define taxonomic classes and sub-classes to help structure the recent work in this emerging research field of multimodal machine learning. We start with a discussion of main applications of multimodal machine learning (Section 2) followed by a discussion on the recent developments on all of the five core technical challenges facing multimodal machine learning: representation (Section 3), translation (Section 4), alignment (Section 5), fusion (Section 6), and co-learning (Section 7). We conclude with a discussion in Section 8. | 對於這五個挑戰中的每一個,我們都定義了分類類別和子類別,以幫助梳理多模態機器學習這一新興研究領域的最新工作。我們首先討論多模態機器學習的主要應用(第2節),然後討論多模態機器學習面臨的全部五個核心技術挑戰的最新進展:表示(第3節)、翻譯(第4節)、對齊(第5節)、融合(第6節)和共同學習(第7節)。最後,我們在第8節中以討論作結。 |
2 Applications: a historical perspective 應用:歷史視角
| Multimodal machine learning enables a wide range of applications: from audio-visual speech recognition to image captioning. In this section we present a brief history of multimodal applications, from its beginnings in audio-visual speech recognition to a recently renewed interest in language and vision applications. One of the earliest examples of multimodal research is audio-visual speech recognition (AVSR) [243]. It was motivated by the McGurk effect [138] — an interaction between hearing and vision during speech perception. When human subjects heard the syllable /ba-ba/ while watching the lips of a person saying /ga-ga/, they perceived a third sound: /da-da/. These results motivated many researchers from the speech community to extend their approaches with visual information. Given the prominence of hidden Markov models (HMMs) in the speech community at the time [95], it is without surprise that many of the early models for AVSR were based on various HMM extensions [24], [25]. While research into AVSR is not as common these days, it has seen renewed interest from the deep learning community [151]. While the original vision of AVSR was to improve speech recognition performance (e.g., word error rate) in all contexts, the experimental results showed that the main advantage of visual information was when the speech signal was noisy (i.e., low signal-to-noise ratio) [75], [151], [243]. In other words, the captured interactions between modalities were supplementary rather than complementary. The same information was captured in both, improving the robustness of the multimodal models but not improving the speech recognition performance in noiseless scenarios. | 多模態機器學習支撐了廣泛的應用:從視聽語音識別到圖像字幕。在本節中,我們將簡要回顧多模態應用的歷史,從它在視聽語音識別方面的起步,到最近在語言和視覺應用方面重新燃起的興趣。 多模態研究最早的例子之一是視聽語音識別(AVSR)[243]。它的動機來自McGurk效應[138]——言語感知過程中聽覺和視覺之間的交互作用。當受試者一邊觀察一個人說/ga-ga/時的嘴唇、一邊聽到/ba-ba/音節時,他們會感知到第三個聲音:/da-da/。這些結果促使語音研究領域的許多研究人員用視覺信息來擴展他們的方法。考慮到隱馬爾可夫模型(HMM)在當時語音研究領域中的主導地位[95],許多早期的AVSR模型都基於各種HMM擴展[24]、[25],這一點也不令人驚訝。雖然目前對AVSR的研究已不那麼常見,但深度學習社區對它重新燃起了興趣[151]。 雖然AVSR最初的願景是在所有場景下提高語音識別性能(例如,降低單詞錯誤率),但實驗結果表明,視覺信息的主要優勢體現在語音信號有噪聲(即低信噪比)時[75]、[151]、[243]。換句話說,所捕獲的模態間交互是補充性的(supplementary)而非互補性的(complementary):兩種模態捕獲了相同的信息,提高了多模態模型的魯棒性,但沒有提高無噪聲場景下的語音識別性能。 |
| A second important category of multimodal applications comes from the field of multimedia content indexing and retrieval [11], [188]. With the advance of personal comput-ers and the internet, the quantity of digitized multime-dia content has increased dramatically [2]. While earlier approaches for indexing and searching these multimedia videos were keyword-based [188], new research problems emerged when trying to search the visual and multimodal content directly. This led to new research topics in multi-media content analysis such as automatic shot-boundary detection [123] and video summarization [53]. These re-search projects were supported by the TrecVid initiative from the National Institute of Standards and Technologies which introduced many high-quality datasets, including the multimedia event detection (MED) tasks started in 2011 [1] | 第二個重要的多模態應用類別來自多媒體內容索引和檢索領域[11][188]。隨著個人電腦和互聯網的發展,數字化多媒體內容的數量急劇增加。雖然早期對這些多媒體視頻進行索引和搜索的方法是基于關鍵詞的[188],但當試圖直接搜索視覺和多模態內容時,出現了新的研究問題。這導致了多媒體內容分析的新研究課題,如自動鏡頭邊界檢測[123]和視頻摘要[53]。這些研究項目由國家標準和技術研究所的TrecVid計劃支持,該計劃引入了許多高質量的數據集,包括2011年[1]開始的多媒體事件檢測(MED)任務 |
| A third category of applications was established in the early 2000s around the emerging field of multimodal interaction with the goal of understanding human multi-modal behaviors during social interactions. One of the first landmark datasets collected in this field is the AMI Meet-ing Corpus which contains more than 100 hours of video recordings of meetings, all fully transcribed and annotated [33]. Another important dataset is the SEMAINE corpus which allowed to study interpersonal dynamics between speakers and listeners [139]. This dataset formed the basis of the first audio-visual emotion challenge (AVEC) orga-nized in 2011 [179]. The fields of emotion recognition and affective computing bloomed in the early 2010s thanks to strong technical advances in automatic face detection, facial landmark detection, and facial expression recognition [46]. The AVEC challenge continued annually afterward with the later instantiation including healthcare applications such as automatic assessment of depression and anxiety [208]. A great summary of recent progress in multimodal affect recognition was published by D’Mello et al. [50]. Their meta-analysis revealed that a majority of recent work on multimodal affect recognition show improvement when using more than one modality, but this improvement is reduced when recognizing naturally-occurring emotions. | 第三類應用是在21世紀初建立的,圍繞著新興的多模態互動領域,目的是理解社會互動中人類的多模態行為。在這個領域收集的第一個具有里程碑意義的數據集是AMI會議語料庫(AMI meetet Corpus),它包含了100多個小時的會議視頻記錄,全部都是完全轉錄和注釋的[33]。另一個重要的數據集是SEMAINE語料庫,它可以研究說話者和聽者之間的人際動力學[139]。該數據集構成了2011年組織的第一次視聽情緒挑戰(AVEC)的基礎[179]。由于自動人臉檢測、面部地標檢測和面部表情識別[46]技術的強大進步,情緒識別和情感計算領域在2010年代早期蓬勃發展。此后,AVEC挑戰每年都在繼續,后來的實例包括抑郁和焦慮的自動評估等醫療保健應用[208]。D’mello et al.[50]對多模態情感識別的最新進展進行了很好的總結。他們的薈萃分析顯示,最近關于多模態情感識別的大部分工作在使用多個模態時表現出改善,但當識別自然發生的情緒時,這種改善就會減少。 |
| Most recently, a new category of multimodal applica-tions emerged with an emphasis on language and vision: media description. One of the most representative applica-tions is image captioning where the task is to generate a text description of the input image [83]. This is motivated by the ability of such systems to help the visually impaired in their daily tasks [20]. The main challenges media description is evaluation: how to evaluate the quality of the predicted descriptions. The task of visual question-answering (VQA) was recently proposed to address some of the evaluation challenges [9], where the goal is to answer a specific ques-tion about the image. In order to bring some of the mentioned applications to the real world we need to address a number of tech-nical challenges facing multimodal machine learning. We summarize the relevant technical challenges for the above mentioned application areas in Table 1. One of the most im-portant challenges is multimodal representation, the focus of our next section. | 最近,一種新的多模態應用出現了,它強調語言和視覺:媒體描述。最具代表性的應用之一是圖像字幕,其任務是生成輸入圖像的文本描述[83]。這是由這些系統的能力來幫助視障人士在他們的日常任務[20]。媒體描述的主要挑戰是評估:如何評估預測描述的質量。最近提出的視覺回答任務(VQA)是為了解決[9]的一些評估挑戰,其目標是回答關于圖像的特定問題。 為了將上述一些應用應用到現實世界中,我們需要解決多模態機器學習所面臨的一系列技術挑戰。我們在表1中總結了上述應用領域的相關技術挑戰。最重要的挑戰之一是多模態表示,這是我們下一節的重點。 |
3 Multimodal Representations多模態表示
| Representing raw data in a format that a computational model can work with has always been a big challenge in machine learning. Following the work of Bengio et al. [18] we use the term feature and representation interchangeably, with each referring to a vector or tensor representation of an entity, be it an image, audio sample, individual word, or a sentence. A multimodal representation is a representation of data using information from multiple such entities. Repre-senting multiple modalities poses many difficulties: how to combine the data from heterogeneous sources; how to deal with different levels of noise; and how to deal with missing data. The ability to represent data in a meaningful way is crucial to multimodal problems, and forms the backbone of any model. Good representations are important for the performance of machine learning models, as evidenced behind the recent leaps in performance of speech recognition [79] and visual object classification [109] systems. Bengio et al. [18] identify a number of properties for good representations: smooth-ness, temporal and spatial coherence, sparsity, and natural clustering amongst others. Srivastava and Salakhutdinov [198] identify additional desirable properties for multi-modal representations: similarity in the representation space should reflect the similarity of the corresponding concepts, the representation should be easy to obtain even in the absence of some modalities, and finally, it should be possible to fill-in missing modalities given the observed ones. | 以一種計算模型可以使用的格式表示原始數據一直是機器學習的一大挑戰。在Bengio等人[18]的工作之后,我們交替使用術語“特征”和“表示”,每一個都指一個實體的向量或張量表示,無論是圖像、音頻樣本、單個單詞還是一個句子。多模態表示是使用來自多個此類實體的信息的數據表示。表示多種模態帶來了許多困難:如何組合來自不同來源的數據;如何處理不同程度的噪音;以及如何處理丟失的數據。以有意義的方式表示數據的能力對多模態問題至關重要,并構成任何模型的支柱。 良好的表示對機器學習模型的性能非常重要,這在語音識別[79]和視覺對象分類[109]系統最近的性能飛躍中得到了證明。Bengio等人的[18]為良好的表示識別了許多屬性:平滑性、時間和空間一致性、稀疏性和自然聚類。Srivastava和Salakhutdinov[198]確定了多模態表示的其他理想屬性:表示空間中的相似性應反映出相應概念的相似性,即使在沒有某些模態的情況下,表示也應易于獲得,最后,對于觀察到的模態,應能夠填充缺失的模態。 |
| The development of unimodal representations has been extensively studied [5], [18], [122]. In the past decade there has been a shift from hand-designed for specific applications to data-driven. For example, one of the most famous image descriptors in the early 2000s, the scale invariant feature transform (SIFT) was hand designed [127], but currently most visual descriptions are learned from data using neural architectures such as convolutional neural networks (CNN)[109]. Similarly, in the audio domain, acoustic features?such as Mel-frequency cepstral coefficients (MFCC) have been superseded by data-driven deep neural networks in speech recognition [79] and recurrent neural networks for para-linguistic analysis [207]. In natural language process-ing, the textual features initially relied on counting word occurrences in documents, but have been replaced data-driven word embeddings that exploit the word context [141]. While there has been a huge amount of work on unimodal representation, up until recently most multimodal representations involved simple concatenation of unimodal ones [50], but this has been rapidly changing. | 單模態表征的發展已被廣泛研究[5],[18],[122]。在過去的十年里,已經出現了從手工設計特定應用程序到數據驅動的轉變。例如,本世紀初最著名的圖像描述符之一,尺度不變特征變換(SIFT)是手工設計的[127],但目前大多數視覺描述都是使用卷積神經網絡(CNN)等神經體系結構從數據中學習的[109]。同樣,在音頻領域,如Mel-frequency倒譜系數(MFCC)等聲學特征已被語音識別中的數據驅動深度神經網絡[79]和輔助語言分析中的循環神經網絡[207]所取代。在自然語言處理中,文本特征最初依賴于計算文檔中的單詞出現次數,但已經取代了利用單詞上下文的數據驅動單詞嵌入[141]。盡管在單模態表示方面已經做了大量的工作,但直到最近,大多數多模態表示都涉及單模態表示[50]的簡單串聯,但這種情況正在迅速改變。 |
| To help understand the breadth of work, we propose two categories of multimodal representation: joint and coordinated. Joint representations combine the unimodal signals into the same representation space, while coordinated representations process unimodal signals separately, but enforce certain similarity constraints on them to bring them to what we term a coordinated space. An illustration of different multimodal representation types can be seen in Figure 1. Mathematically, the joint representation is expressed as: $x_m = f(x_1, \ldots, x_n)$, where the multimodal representation $x_m$ is computed using function $f$ (e.g., a deep neural network, restricted Boltzmann machine, or a recurrent neural network) that relies on unimodal representations $x_1, \ldots, x_n$. While coordinated representation is as follows: $f(x_1) \sim g(x_2)$, where each modality has a corresponding projection function ($f$ and $g$ above) that maps it into a coordinated multimodal space. While the projection into the multimodal space is independent for each modality, the resulting space is coordinated between them (indicated as $\sim$). Examples of such coordination include minimizing cosine distance [61], maximizing correlation [7], and enforcing a partial order [212] between the resulting spaces. | 為了幫助理解這方面工作的全貌,我們提出了兩類多模態表示:聯合表示和協調表示。聯合表示將單模態信號組合到同一個表示空間中,而協調表示分別處理單模態信號,但對它們施加某種相似性約束,使它們進入我們所說的協調空間。圖1展示了不同的多模態表示類型。 在數學上,聯合表示可表示為:$x_m = f(x_1, \ldots, x_n)$,其中多模態表示 $x_m$ 由依賴於各單模態表示 $x_1, \ldots, x_n$ 的函數 $f$(例如深度神經網絡、受限玻爾茲曼機或循環神經網絡)計算得到。而協調表示則為:$f(x_1) \sim g(x_2)$,其中每個模態都有一個相應的投影函數(上式中的 $f$ 和 $g$),將其映射到一個協調的多模態空間中。雖然每個模態到多模態空間的投影是相互獨立的,但最終得到的空間在它們之間是協調的(用 $\sim$ 表示)。這種協調的例子包括最小化余弦距離[61]、最大化相關性[7],以及在結果空間之間施加偏序[212]。 |
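To make the two formulations concrete before turning to Section 3.1, here is a minimal PyTorch sketch; the module names and feature dimensions are illustrative assumptions rather than anything prescribed by the paper. The joint model computes $x_m = f(x_1, \ldots, x_n)$ over concatenated unimodal features, while the coordinated model keeps separate projections $f$ and $g$ and couples them only through a similarity term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointRepresentation(nn.Module):
    """x_m = f(x_1, ..., x_n): project concatenated unimodal features into one shared space."""
    def __init__(self, dim_img=2048, dim_txt=300, dim_joint=512):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim_img + dim_txt, dim_joint), nn.ReLU(),
            nn.Linear(dim_joint, dim_joint))

    def forward(self, x_img, x_txt):
        return self.f(torch.cat([x_img, x_txt], dim=-1))   # single joint representation

class CoordinatedRepresentation(nn.Module):
    """f(x_1) ~ g(x_2): separate projections coupled by a similarity constraint."""
    def __init__(self, dim_img=2048, dim_txt=300, dim_coord=512):
        super().__init__()
        self.f = nn.Linear(dim_img, dim_coord)   # image projection
        self.g = nn.Linear(dim_txt, dim_coord)   # text projection

    def forward(self, x_img, x_txt):
        u, v = self.f(x_img), self.g(x_txt)
        # coordination term: e.g. maximize cosine similarity of paired samples
        coord_loss = 1 - F.cosine_similarity(u, v, dim=-1).mean()
        return u, v, coord_loss

x_img, x_txt = torch.randn(8, 2048), torch.randn(8, 300)
joint = JointRepresentation()(x_img, x_txt)               # (8, 512) joint vectors
u, v, loss = CoordinatedRepresentation()(x_img, x_txt)    # two coordinated spaces + constraint
```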
3.1 Joint Representations 聯合表示
| We start our discussion with joint representations that project unimodal representations together into a multimodal space (Equation 1). Joint representations are mostly (but not exclusively) used in tasks where multimodal data is present both during training and inference steps. The sim-plest example of a joint representation is a concatenation of individual modality features (also referred to as early fusion [50]). In this section we discuss more advanced methods for creating joint representations starting with neural net-works, followed by graphical models and recurrent neural networks (representative works can be seen in Table 2). Neural networks have become a very popular method for unimodal data representation [18]. They are used to repre-sent visual, acoustic, and textual data, and are increasingly used in the multimodal domain [151], [156], [217]. In this section we describe how neural networks can be used to construct a joint multimodal representation, how to train them, and what advantages they offer. | 我們從聯合表示開始討論,聯合表示將單模態表示一起投射到多模態空間中(方程1)。聯合表示通常(但不是唯一)用于在訓練和推理步驟中都存在多模態數據的任務中。聯合表示的最簡單的例子是單個形態特征的串聯(也稱為早期融合[50])。在本節中,我們將討論創建聯合表示的更高級方法,首先是神經網絡,然后是圖形模型和循環神經網絡(代表性作品見表2)。神經網絡已經成為單模態數據表示[18]的一種非常流行的方法。它們被用來表示視覺、聽覺和文本數據,并在多模態領域中越來越多地使用[151]、[156]、[217]。在本節中,我們將描述如何使用神經網絡來構建聯合多模態表示,如何訓練它們,以及它們提供了什么優勢。 |
| In general, neural networks are made up of successive building blocks of inner products followed by non-linear activation functions. In order to use a neural network as a way to represent data, it is first trained to perform a specific task (e.g., recognizing objects in images). Due to the multilayer nature of deep neural networks each successive layer is hypothesized to represent the data in a more abstract way [18], hence it is common to use the final or penultimate neural layers as a form of data representation. To construct a multimodal representation using neural networks each modality starts with several individual neural layers followed by a hidden layer that projects the modalities into a joint space [9], [145], [156], [227]. The joint multimodal representation is then passed through multiple hidden layers itself or used directly for prediction. Such models can be trained end-to-end — learning both to represent the data and to perform a particular task. This results in a close relationship between multimodal representation learning and multimodal fusion when using neural networks. | 一般來說,神經網絡由一連串的內積構建塊和非線性激活函數組成。為了將神經網絡用作表示數據的一種方式,首先要訓練它執行特定的任務(例如,識別圖像中的物體)。由於深度神經網絡的多層性質,假設每一層都以更抽象的方式表示數據[18],因此通常使用最後一層或倒數第二層神經層作為數據表示的一種形式。為了使用神經網絡構建多模態表示,每個模態先經過若干各自獨立的神經層,然後通過一個隱藏層將各模態投射到聯合空間中[9]、[145]、[156]、[227]。然後,聯合多模態表示本身再經過多個隱藏層,或直接用於預測。這樣的模型可以端到端地訓練——同時學習表示數據和執行特定的任務。這使得在使用神經網絡時,多模態表示學習和多模態融合之間存在密切的關係。 |
| Figure 1: Structure of joint and coordinated representations. Joint representations are projected to the same space using all of the modalities as input. Coordinated representations, on the other hand, exist in their own space, but are coordinated through a similarity (e.g. Euclidean distance) or structure constraint (e.g. partial order). | 圖1:聯合表示和協調表示的結構。聯合表示以所有模態作為輸入,被投影到同一個空間中。而協調表示存在於各自的空間中,但通過相似性約束(如歐幾里德距離)或結構約束(如偏序)進行協調。 |
| As neural networks require a lot of labeled training data, it is common to pre-train such representations using an autoencoder on unsupervised data [80]. The model pro-posed by Ngiam et al. [151] extended the idea of using autoencoders to the multimodal domain. They used stacked denoising autoencoders to represent each modality individ-ually and then fused them into a multimodal representation using another autoencoder layer. Similarly, Silberer and Lapata [184] proposed to use a multimodal autoencoder for the task of semantic concept grounding (see Section 7.2). In addition to using a reconstruction loss to train the representation they introduce a term into the loss function that uses the representation to predict object labels. It is also common to fine-tune the resulting representation on a particular task at hand as the representation constructed using an autoencoder is generic and not necessarily optimal for a specific task [217]. The major advantage of neural network based joint rep-resentations comes from their often superior performance and the ability to pre-train the representations in an unsu-pervised manner. The performance gain is, however, depen-dent on the amount of data available for training. One of the disadvantages comes from the model not being able to handle missing data naturally — although there are ways to alleviate this issue [151], [217]. Finally, deep networks are often difficult to train [69], but the field is making progress in better training techniques [196]. Probabilistic graphical models are another popular way to?construct representations through the use of latent random variables [18]. In this section we describe how probabilistic graphical models are used to represent unimodal and mul-timodal data. | 由于神經網絡需要大量標注的訓練數據,通常使用自動編碼器對非監督數據進行此類表示的預訓練[80]。Ngiam等人[151]提出的模型將使用自動編碼器的思想擴展到多模態域。他們使用堆疊降噪自動編碼器來單獨表示每個模態,然后使用另一個自動編碼器層將它們融合成一個多模態表示。類似地,Silberer和Lapata[184]提出使用多模態自動編碼器來完成語義概念扎根的任務(見章節7.2)。除了使用重構損失來訓練表示之外,他們還在損失函數中引入了一個術語,該術語使用表示來預測對象標簽。由于使用自動編碼器構造的表示是通用的,對于特定的任務不一定是最佳的,因此對當前特定任務的結果表示進行微調也是很常見的[217]。 基于神經網絡的聯合表示的主要優勢來自于它們通常卓越的性能,以及以無監督的方式對表示進行預訓練的能力。然而,性能增益取決于可供訓練的數據量。缺點之一是模型不能自然地處理缺失的數據——盡管有一些方法可以緩解這個問題[151],[217]。最后,深度網絡通常很難訓練[69],但該領域在更好的訓練技術方面正在取得進展[196]。 概率圖形模型是另一種通過使用潛在隨機變量[18]來構造表示的流行方法。在本節中,我們將描述如何使用概率圖形模型來表示單模態和多模態數據。 |
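Before turning to graphical models, here is a minimal sketch of the autoencoder-style joint representation and unsupervised pretraining described above (in the spirit of Ngiam et al. [151]); the layer sizes, modality dimensions, and the Gaussian corruption level are illustrative assumptions. Each modality is encoded separately, fused into a shared code, and the network is trained to reconstruct both inputs; a supervised term can later be added for task-specific fine-tuning.

```python
import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    def __init__(self, dim_audio=100, dim_video=300, dim_shared=128):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_audio, 256), nn.ReLU())
        self.enc_v = nn.Sequential(nn.Linear(dim_video, 256), nn.ReLU())
        self.fuse = nn.Linear(256 + 256, dim_shared)   # shared multimodal code
        self.dec_a = nn.Linear(dim_shared, dim_audio)  # reconstruct audio from the shared code
        self.dec_v = nn.Linear(dim_shared, dim_video)  # reconstruct video from the shared code

    def forward(self, x_a, x_v):
        # denoising variant: corrupt the inputs before encoding
        h = self.fuse(torch.cat([self.enc_a(x_a + 0.1 * torch.randn_like(x_a)),
                                 self.enc_v(x_v + 0.1 * torch.randn_like(x_v))], dim=-1))
        return self.dec_a(h), self.dec_v(h), h

model = MultimodalAutoencoder()
x_a, x_v = torch.randn(8, 100), torch.randn(8, 300)
rec_a, rec_v, code = model(x_a, x_v)
# unsupervised pretraining objective: reconstruct both modalities from the shared code
loss = nn.functional.mse_loss(rec_a, x_a) + nn.functional.mse_loss(rec_v, x_v)
```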
| The most popular approaches for graphical-model based representation are deep Boltzmann machines (DBM) [176], that stack restricted Boltzmann machines (RBM) [81] as building blocks. Similar to neural networks, each successive layer of a DBM is expected to represent the data at a higher level of abstraction. The appeal of DBMs comes from the fact that they do not need supervised data for training [176]. As they are graphical models the representation of data is probabilistic, however it is possible to convert them to a deterministic neural network — but this loses the generative aspect of the model [176]. Work by Srivastava and Salakhutdinov [197] introduced multimodal deep belief networks as a multimodal represen-tation. Kim et al. [104] used a deep belief network for each modality and then combined them into joint representation for audiovisual emotion recognition. Huang and Kingsbury [86] used a similar model for AVSR, and Wu et al. [225] for audio and skeleton joint based gesture recognition. Multimodal deep belief networks have been extended to multimodal DBMs by Srivastava and Salakhutdinov [198]. Multimodal DBMs are capable of learning joint represen-tations from multiple modalities by merging two or more undirected graphs using a binary layer of hidden units on top of them. They allow for the low level representations of each modality to influence each other after the joint training due to the undirected nature of the model. Ouyang et al. [156] explore the use of multimodal DBMs for the task of human pose estimation from multi-view data. They demonstrate that integrating the data at a later stage —after unimodal data underwent nonlinear transformations— was beneficial for the model. Similarly, Suk et al. [199] use multimodal DBM representation to perform Alzheimer’s disease classification from positron emission tomography and magnetic resonance imaging data. | 最流行的基于圖形模型的表示方法是深度玻爾茲曼機(DBM)[176],它將限制玻爾茲曼機(RBM)[81]堆疊為構建塊。與神經網絡類似,DBM的每一個后續層都被期望在更高的抽象級別上表示數據。DBMs的吸引力來自于這樣一個事實,即它們不需要監督數據進行訓練[176]。由于它們是圖形模型,數據的表示是概率的,但是可以將它們轉換為確定性神經網絡——但這失去了模型的生成方面[176]。 Srivastava和Salakhutdinov[197]的研究引入了多模態深度信念網絡作為多模態表征。Kim等人[104]對每個模態使用深度信念網絡,然后將它們組合成聯合表征,用于視聽情感識別。Huang和Kingsbury[86]在AVSR中使用了類似的模型,Wu等[225]在基于音頻和骨骼關節的手勢識別中使用了類似的模型。 Srivastava和Salakhutdinov將多模態深度信念網絡擴展到多模態DBMs[198]。多模態DBMs能夠通過在兩個或多個無向圖上使用隱藏單元的二進制層來合并它們,從而從多個模態中學習聯合表示。由于模型的無方向性,它們允許每個模態的低層次表示在聯合訓練后相互影響。 歐陽等人[156]探討了使用多模態DBMs完成從多視圖數據中估計人體姿態的任務。他們證明,在單模態數據經過非線性轉換后的后期階段對數據進行集成對模型是有益的。類似地,Suk等人[199]利用多模態DBM表示法,從正電子發射斷層掃描和磁共振成像數據中進行阿爾茨海默病分類。 |
| One of the big advantages of using multimodal DBMs for learning multimodal representations is their generative nature, which allows for an easy way to deal with missing data — even if a whole modality is missing, the model has a natural way to cope. It can also be used to generate samples of one modality in the presence of the other one, or?both modalities from the representation. Similar to autoen-coders the representation can be trained in an unsupervised manner enabling the use of unlabeled data. The major disadvantage of DBMs is the difficulty of training them —high computational cost, and the need to use approximate variational training methods [198]. Sequential Representation. So far we have discussed mod-els that can represent fixed length data, however, we often need to represent varying length sequences such as sen-tences, videos, or audio streams. In this section we describe models that can be used to represent such sequences. | 使用多模態DBMs學習多模態表示的一大優點是它們的生成特性,這允許使用一種簡單的方法來處理缺失的數據——即使整個模態都缺失了,模型也有一種自然的方法來處理。它還可以用于在存在另一種模態的情況下產生一種模態的樣本,或者從表示中產生兩種模態的樣本。與自動編碼器類似,表示可以以無監督的方式進行訓練,以便使用未標記的數據。DBMs的主要缺點是很難訓練它們——計算成本高,而且需要使用近似變分訓練方法[198]。 順序表示。到目前為止,我們已經討論了可以表示固定長度數據的模型,但是,我們經常需要表示不同長度的序列,例如句子、視頻或音頻流。在本節中,我們將描述可以用來表示這種序列的模型。 |
| Table 2: A summary of multimodal representation techniques. We identify three subtypes of joint representations (Section 3.1) and two subtypes of coordinated ones (Section 3.2). For modalities + indicates the modalities combined. | 表2:多模態表示技術的概述。我們確定了聯合表示的三種子類型(章節3.1)和協調表示的兩種子類型(章節3.2)。對于模態,+表示組合的模態。 |
| Recurrent neural networks (RNNs), and their variants such as long-short term memory (LSTMs) networks [82], have recently gained popularity due to their success in sequence modeling across various tasks [12], [213]. So far RNNs have mostly been used to represent unimodal se-quences of words, audio, or images, with most success in the language domain. Similar to traditional neural networks, the hidden state of an RNN can be seen as a representation of the data, i.e., the hidden state of RNN at timestep t can be seen as the summarization of the sequence up to that timestep. This is especially apparent in RNN encoder-decoder frameworks where the task of an encoder is to represent a sequence in the hidden state of an RNN in such a way that a decoder could reconstruct it [12]. The use of RNN representations has not been limited to the unimodal domain. An early use of constructing a multimodal representation using RNNs comes from work by Cosi et al. [43] on AVSR. They have also been used for representing audio-visual data for affect recognition [37],[152] and to represent multi-view data such as different visual cues for human behavior analysis [166]. | 循環神經網絡(rnn)及其變體,如長短期記憶(LSTMs)網絡[82],由于它們在不同任務的序列建模中取得了成功[12],[213],近年來越來越受歡迎。到目前為止,神經網絡主要用于表示單模態的單詞序列、音頻序列或圖像序列,在語言領域取得了很大的成功。與傳統的神經網絡類似,RNN的隱藏狀態可以看作是數據的一種表示,即RNN在時間步長t處的隱藏狀態可以看作是該時間步長的序列的匯總。這在RNN編碼器-解碼器框架中尤為明顯,在該框架中,編碼器的任務是表示RNN的隱藏狀態下的序列,以便解碼器可以將其重構為[12]。 RNN表示的使用并不局限于單模態域。Cosi等人在AVSR上的工作最早使用RNNs構造多模態表示。它們還被用于表示視聽數據,用于情感識別[37][152],并表示多視圖數據,如用于人類行為分析的不同視覺線索[166]。 |
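As a small illustration of treating an RNN's hidden state as a sequence-level representation, the sketch below encodes a batch of token sequences with an LSTM and returns the final hidden state as the summary vector; the vocabulary, embedding, and hidden sizes are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Summarize a variable-length sequence with the final LSTM hidden state."""
    def __init__(self, vocab_size=10000, dim_emb=128, dim_hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim_emb)
        self.lstm = nn.LSTM(dim_emb, dim_hidden, batch_first=True)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]                            # (batch, dim_hidden): the sequence representation

enc = SequenceEncoder()
rep = enc(torch.randint(0, 10000, (4, 12)))       # 4 sentences of 12 tokens -> 4 x 256 summary vectors
```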
3.2 Coordinated Representations協調表示
| An alternative to a joint multimodal representation is a coordinated representation. Instead of projecting the modalities together into a joint space, we learn separate representations for each modality but coordinate them through a constraint. We start our discussion with coordinated representations that enforce similarity between representations, moving on to coordinated representations that enforce more structure on the resulting space (representative works of different coordinated representations can be seen in Table 2). Similarity models minimize the distance between modalities in the coordinated space. For example such models encourage the representation of the word dog and an image of a dog to have a smaller distance between them than distance between the word dog and an image of a car [61]. One of the earliest examples of such a representation comes from the work by Weston et al. [221], [222] on the WSABIE (web scale annotation by image embedding) model, where a coordinated space was constructed for images and their annotations. WSABIE constructs a simple linear map from image and textual features such that corresponding annotation and image representation would have a higher inner product (smaller cosine distance) between them than non-corresponding ones. | 聯合多模態表示之外的另一種選擇是協調表示。我們不是將各模態一起投影到一個聯合空間中,而是為每個模態學習單獨的表示,再通過約束來協調它們。我們先討論在表示之間施加相似性約束的協調表示,然後討論在結果空間上施加更多結構約束的協調表示(不同協調表示的代表性工作見表2)。 相似性模型最小化協調空間中各模態之間的距離。例如,這樣的模型鼓勵單詞dog的表示與一隻狗的圖像之間的距離,小於單詞dog與一輛汽車的圖像之間的距離[61]。這種表示最早的例子之一來自Weston等人[221]、[222]在WSABIE(web scale annotation by image embedding)模型上的工作,其中為圖像及其標註構建了一個協調空間。WSABIE從圖像和文本特徵構造一個簡單的線性映射,使得相互對應的標註和圖像表示之間的內積比不對應的更高(余弦距離更小)。 |
| More recently, neural networks have become a popular way to construct coordinated representations, due to their ability to learn representations. Their advantage lies in the fact that they can jointly learn coordinated representations in an end-to-end manner. An example of such coordinated representation is DeViSE — a deep visual-semantic embedding [61]. DeViSE uses a similar inner product and ranking loss function to WSABIE but uses more complex image and word embeddings. Kiros et al. [105] extended this to sentence and image coordinated representation by using an LSTM model and a pairwise ranking loss to coordinate the feature space. Socher et al. [191] tackle the same task, but extend the language model to a dependency tree RNN to incorporate compositional semantics. A similar model was also proposed by Pan et al. [159], but using videos instead of images. Xu et al. [231] also constructed a coordinated space between videos and sentences using a subject, verb, object compositional language model and a deep video model. This representation was then used for the task of cross-modal retrieval and video description. While the above models enforced similarity between representations, structured coordinated space models go beyond that and enforce additional constraints between the modality representations. The type of structure enforced is often based on the application, with different constraints for hashing, cross-modal retrieval, and image captioning. | 最近,由於神經網絡具有學習表征的能力,它已成為構建協調表示的一種流行方式。其優勢在於能夠以端到端的方式聯合學習協調表示。這種協調表示的一個例子是DeViSE——一種深度視覺-語義嵌入[61]。DeViSE使用與WSABIE類似的內積和排序損失函數,但使用了更復雜的圖像和單詞嵌入。Kiros等人[105]通過使用LSTM模型和成對排序損失來協調特徵空間,將其擴展到句子和圖像的協調表示。Socher等人[191]處理了相同的任務,但將語言模型擴展為依存樹RNN,以融入組合語義。Pan等人[159]也提出了類似的模型,但使用的是視頻而不是圖像。Xu等人[231]也使用主語、動詞、賓語組合語言模型和深度視頻模型構建了視頻和句子之間的協調空間,然後將該表示用於跨模態檢索和視頻描述任務。 雖然上述模型強制表示之間的相似性,但結構化協調空間模型更進一步,在各模態表示之間施加額外的約束。所施加的結構類型通常取決於應用,哈希、跨模態檢索和圖像描述各有不同的約束。 |
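A minimal sketch of the similarity-based coordination just described (WSABIE/DeViSE-style) is shown below before moving on to structured spaces; the margin value, embedding dimension, and random inputs are placeholders. Paired image and sentence embeddings are pulled together while mismatched pairs inside the batch are pushed at least a margin away, in both retrieval directions.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (batch, dim) projections of matching image/sentence pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    scores = img_emb @ txt_emb.t()                 # cosine similarities of all pairs in the batch
    pos = scores.diag().unsqueeze(1)               # matching pairs sit on the diagonal
    # hinge on non-matching pairs, in both image-to-text and text-to-image directions
    cost_img = (margin - pos + scores).clamp(min=0)
    cost_txt = (margin - pos.t() + scores).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_img.masked_fill(mask, 0).mean() + cost_txt.masked_fill(mask, 0).mean()

loss = pairwise_ranking_loss(torch.randn(16, 512), torch.randn(16, 512))
```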
| Structured coordinated spaces are commonly used in cross-modal hashing — compression of high dimensional data into compact binary codes with similar binary codes for similar objects [218]. The idea of cross-modal hashing is to create such codes for cross-modal retrieval [27], [93],[113]. Hashing enforces certain constraints on the result-ing multimodal space: 1) it has to be an N-dimensional Hamming space — a binary representation with controllable number of bits; 2) the same object from different modalities has to have a similar hash code; 3) the space has to be similarity-preserving. Learning how to represent the data as a hash function attempts to enforce all of these three requirements [27], [113]. For example, Jiang and Li [92] introduced a method to learn such common binary space between sentence descriptions and corresponding images using end-to-end trainable deep learning techniques. While Cao et al. [32] extended the approach with a more complex LSTM sentence representation and introduced an outlier insensitive bit-wise margin loss and a relevance feedback based semantic similarity constraint. Similarly, Wang et al.[219] constructed a coordinated space in which images (and?sentences) with similar meanings are closer to each other. | 結構化協調空間通常用于高維數據的交叉模態哈希壓縮,將其壓縮為具有相似對象的相似二進制碼的緊湊二進制碼[218]。交叉模態哈希的思想是為交叉模態檢索[27],[93],[113]創建這樣的代碼。哈希對結果的多模態空間施加了一定的約束:1)它必須是一個n維的漢明空間——一個具有可控位數的二進制表示;2)來自不同模態的相同對象必須有相似的哈希碼;3)空間必須保持相似性。學習如何將數據表示為一個哈希函數,嘗試執行所有這三個要求[27],[113]。例如,Jiang和Li[92]介紹了一種方法,利用端到端可訓練的深度學習技術學習句子描述與相應圖像之間的公共二值空間。而Cao等人的[32]擴展了該方法,使用了更復雜的LSTM句子表示,并引入了離群值不敏感的位邊緣損失和基于關聯反饋的語義相似度約束。同樣,Wang等[219]構建了一個具有相似意義的圖像(和句子)更加接近的協調空間。 |
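A toy sketch of the cross-modal hashing idea follows; the projection matrices are random placeholders (in practice they are learned so that paired items receive similar codes), and the feature and code sizes are arbitrary. Real-valued features from each modality are projected and binarized into a shared Hamming space, and retrieval then compares Hamming distances.

```python
import numpy as np

rng = np.random.default_rng(0)
W_img = rng.standard_normal((2048, 64))   # learned in practice; random here purely for illustration
W_txt = rng.standard_normal((300, 64))

def hash_code(x, W):
    """Map features to a 64-bit binary code: sign of a linear projection."""
    return (x @ W > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.sum(a != b))

img_code = hash_code(rng.standard_normal(2048), W_img)
txt_code = hash_code(rng.standard_normal(300), W_txt)
print(hamming(img_code, txt_code))        # small distance => candidate cross-modal match
```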
| Another example of a structured coordinated representation comes from order-embeddings of images and language [212], [249]. The model proposed by Vendrov et al. [212] enforces a dissimilarity metric that is asymmetric and implements the notion of partial order in the multimodal space. The idea is to capture a partial order of the language and image representations — enforcing a hierarchy on the space; for example image of "a woman walking her dog" → text "woman walking her dog" → text "woman walking". A similar model using denotation graphs was also proposed by Young et al. [238] where denotation graphs are used to induce a partial ordering. Lastly, Zhang et al. present how exploiting structured representations of text and images can create concept taxonomies in an unsupervised manner [249]. A special case of a structured coordinated space is one based on canonical correlation analysis (CCA) [84]. CCA computes a linear projection which maximizes the correlation between two random variables (in our case modalities) and enforces orthogonality of the new space. CCA models have been used extensively for cross-modal retrieval [76], [106], [169] and audiovisual signal analysis [177], [187]. Extensions to CCA attempt to construct a correlation maximizing nonlinear projection [7], [116]. Kernel canonical correlation analysis (KCCA) [116] uses reproducing kernel Hilbert spaces for projection. However, as the approach is nonparametric it scales poorly with the size of the training set and has issues with very large real-world datasets. Deep canonical correlation analysis (DCCA) [7] was introduced as an alternative to KCCA and addresses the scalability issue; it was also shown to lead to better correlated representation space. Similar correspondence autoencoder [58] and deep correspondence RBMs [57] have also been proposed for cross-modal retrieval. CCA, KCCA, and DCCA are unsupervised techniques and only optimize the correlation over the representations, thus mostly capturing what is shared across the modalities. Deep canonically correlated autoencoders [220] also include an autoencoder based data reconstruction term. This encourages the representation to also capture modality specific information. Semantic correlation maximization method [248] also encourages semantic relevance, while retaining correlation maximization and orthogonality of the resulting space — this leads to a combination of CCA and cross-modal hashing techniques. | 結構化協調表示的另一個例子來自圖像和語言的順序嵌入(order-embedding)[212]、[249]。Vendrov等人提出的模型[212]使用一種非對稱的不相似性度量,在多模態空間中實現偏序的概念,其想法是捕捉語言表示和圖像表示之間的偏序——在空間上施加一個層次結構;例如,"一個女人遛狗"的圖像 → 文本"女人遛狗" → 文本"女人散步"。Young等人[238]也提出了一個類似的使用指稱圖(denotation graph)的模型,其中指稱圖被用來誘導偏序。最後,Zhang等人展示了如何利用文本和圖像的結構化表示,以無監督的方式構建概念分類體系[249]。 結構化協調空間的一種特殊情況是基於典型相關分析(CCA)的空間[84]。CCA計算一個線性投影,最大化兩個隨機變量(在本文中即兩種模態)之間的相關性,並保證新空間的正交性。CCA模型被廣泛用於跨模態檢索[76]、[106]、[169]和視聽信號分析[177]、[187]。對CCA的擴展嘗試構造最大化相關性的非線性投影[7]、[116]。核典型相關分析(KCCA)[116]使用再生核希爾伯特空間進行投影。然而,由於該方法是非參數的,它隨訓練集規模的擴展性很差,難以處理非常大的真實世界數據集。深度典型相關分析(DCCA)[7]作為KCCA的替代方案被提出,解決了可擴展性問題,並被證明能得到相關性更好的表示空間。類似的對應自動編碼器[58]和深度對應RBM[57]也被提出用於跨模態檢索。 CCA、KCCA和DCCA都是無監督技術,只優化表示之間的相關性,因此主要捕獲跨模態共享的信息。深度典型相關自動編碼器[220]還包括一個基於自動編碼器的數據重構項,這鼓勵表示同時捕獲模態特有的信息。語義相關性最大化方法[248]也鼓勵語義相關性,同時保留相關性最大化和結果空間的正交性——這相當於CCA與跨模態哈希技術的結合。 |
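For the CCA-based coordination described above, here is a small example using scikit-learn's linear CCA; the two synthetic "views" stand in for paired image and text features, and the dimensions and noise level are arbitrary. KCCA and DCCA would replace the linear maps with kernel or deep-network projections, but the objective — maximally correlated projections of the two modalities — is the same.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 500
latent = rng.standard_normal((n, 4))                      # shared structure behind the two "views"
X_img = latent @ rng.standard_normal((4, 128)) + 0.1 * rng.standard_normal((n, 128))
X_txt = latent @ rng.standard_normal((4, 64)) + 0.1 * rng.standard_normal((n, 64))

cca = CCA(n_components=4)
U, V = cca.fit_transform(X_img, X_txt)                    # coordinated, maximally correlated projections
print([round(float(np.corrcoef(U[:, k], V[:, k])[0, 1]), 3) for k in range(4)])  # near 1 for shared components
```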
3.3 Discussion討論
| In this section we identified two major types of multimodal representations — joint and coordinated. Joint representations project multimodal data into a common space and are best suited for situations when all of the modalities are present during inference. They have been extensively used for AVSR, affect, and multimodal gesture recognition. Coordinated representations, on the other hand, project each modality into a separate but coordinated space, making them suitable for applications where only one modality is present at test time, such as: multimodal retrieval and translation (Section 4), grounding (Section 7.2), and zero shot learning (Section 7.2). Finally, while joint representations have been used in situations to construct representations of more than two modalities, coordinated spaces have, so far, been mostly limited to two modalities. | 在本節中,我們確定了兩種主要類型的多模態表示——聯合表示和協調表示。聯合表示將多模態數據投射到一個公共空間中,最適合推理時所有模態都存在的情況,已被廣泛用於AVSR、情感識別和多模態手勢識別。而協調表示將每個模態投射到一個獨立但相互協調的空間中,適合測試時只有一種模態存在的應用,例如:多模態檢索和翻譯(第4節)、概念接地(第7.2節)和零樣本學習(第7.2節)。最後,雖然聯合表示已被用於構造兩種以上模態的表示,但到目前為止,協調空間大多局限於兩種模態。 |
| Table 3: Taxonomy of multimodal translation research. For each class and sub-class, we include example tasks with references. Our taxonomy also includes the directionality of the translation: unidirectional (→) and bidirectional (⇄). | 表3:多模態翻譯研究的分類。對於每個類及其子類,我們都包含帶有參考文獻的示例任務。我們的分類還包括翻譯的方向性:單向(→)和雙向(⇄)。 |
4 Translation翻譯
| A big part of multimodal machine learning is concerned with translating (mapping) from one modality to another. Given an entity in one modality the task is to generate the same entity in a different modality. For example given an image we might want to generate a sentence describing it or given a textual description generate an image matching it. Multimodal translation is a long studied problem, with early work in speech synthesis [88], visual speech generation [136] video description [107], and cross-modal retrieval [169]. More recently, multimodal translation has seen renewed interest due to combined efforts of the computer vision and natural language processing (NLP) communities [19] and recent availability of large multimodal datasets [38], [205]. A particularly popular problem is visual scene description, also known as image [214] and video captioning [213], which acts as a great test bed for a number of computer vision and NLP problems. To solve it, we not only need to fully understand the visual scene and to identify its salient parts, but also to produce grammatically correct and comprehensive yet concise sentences describing it. | 多模態機器學習的很大一部分是關于從一種模態到另一種模態的翻譯(映射)。給定一個以一種形態存在的實體,任務是在不同形態中生成相同的實體。例如,給定一幅圖像,我們可能想要生成一個描述它的句子,或者給定一個文本描述生成與之匹配的圖像。多模態翻譯是一個長期研究的問題,早期的工作包括語音合成[88]、視覺語音生成[136]、視頻描述[107]和跨模態檢索[169]。 最近,由于計算機視覺和自然語言處理(NLP)社區[19]和最近可用的大型多模態數據集[38]的共同努力,多模態翻譯又引起了人們的興趣[205]。一個特別流行的問題是視覺場景描述,也被稱為圖像[214]和視頻字幕[213],它是許多計算機視覺和NLP問題的一個很好的測試平臺。要解決這一問題,我們不僅需要充分理解視覺場景,識別視覺場景的突出部分,還需要生成語法正確、全面而簡潔的描述視覺場景的句子。 |
| While the approaches to multimodal translation are very broad and are often modality specific, they share a number of unifying factors. We categorize them into two types —example-based, and generative. Example-based models use a dictionary when translating between the modalities. Genera-tive models, on the other hand, construct a model that is able to produce a translation. This distinction is similar to the one between non-parametric and parametric machine learning approaches and is illustrated in Figure 2, with representative examples summarized in Table 3. Generative models are arguably more challenging to build as they require the ability to generate signals or sequences of symbols (e.g., sentences). This is difficult for any modality — visual, acoustic, or verbal, especially when temporally and structurally consistent sequences need to be generated. This led to many of the early multimodal transla-tion systems relying on example-based translation. However, this has been changing with the advent of deep learning models that are capable of generating images [171], [210], sounds [157], [209], and text [12]. | 盡管多模態翻譯的方法非常廣泛,而且往往是針對特定的模態,但它們有許多共同的因素。我們將它們分為兩種類型——基于實例的和生成的。在模態之間轉換時,基于實例的模型使用字典。另一方面,生成模型構建的是能夠生成翻譯的模型。這種區別類似于非參數機器學習方法和參數機器學習方法之間的區別,如圖2所示,表3總結了具有代表性的例子。 生成模型的構建更具挑戰性,因為它們需要生成信號或符號序列(如句子)的能力。這對于任何形式(視覺的、聽覺的或口頭的)都是困難的,特別是當需要生成時間和結構一致的序列時。這導致了許多早期的多模態翻譯系統依賴于實例翻譯。然而, 隨著能夠生成圖像[171]、[210]、聲音[157]、[209]和文本[12]的深度學習模型的出現,這種情況已經有所改變。 |
| Figure 2: Overview of example-based and generative multimodal translation. The former retrieves the best translation from a dictionary, while the latter first trains a translation model on the dictionary and then uses that model for translation. | 圖2:基於實例和生成式多模態翻譯的概述。前者從字典中檢索最佳的翻譯,而後者首先在字典上訓練翻譯模型,然後使用該模型進行翻譯。 |
4.1 Example-based 基于實例
| Example-based algorithms are restricted by their training data — dictionary (see Figure 2a). We identify two types of such algorithms: retrieval based, and combination based. Retrieval-based models directly use the retrieved translation without modifying it, while combination-based models rely on more complex rules to create translations based on a number of retrieved instances. Retrieval-based models are arguably the simplest form of multimodal translation. They rely on finding the closest sample in the dictionary and using that as the translated result. The retrieval can be done in unimodal space or inter-mediate semantic space. Given a source modality instance to be translated, uni-modal retrieval finds the closest instances in the dictionary in the space of the source — for example, visual feature space for images. Such approaches have been used for visual speech synthesis, by retrieving the closest matching visual example of the desired phoneme [26]. They have also been used in concatenative text-to-speech systems [88]. More recently, Ordonez et al. [155] used unimodal retrieval to generate image descriptions by using global image features to retrieve caption candidates [155]. Yagcioglu et al. [232] used a CNN-based image representation to retrieve visu-ally similar images using adaptive neighborhood selection. Devlin et al. [49] demonstrated that a simple k-nearest neighbor retrieval with consensus caption selection achieves competitive translation results when compared to more complex generative approaches. The advantage of such unimodal retrieval approaches is that they only require the representation of a single modality through which we are performing retrieval. However, they often require an extra processing step such as re-ranking of retrieved translations [135], [155], [232]. This indicates a major problem with this approach — similarity in unimodal space does not always imply a good translation. | 基于示例的算法受到訓練數據字典的限制(見圖2a)。我們確定了這類算法的兩種類型:基于檢索的和基于組合的。基于檢索的模型直接使用檢索到的翻譯而不修改它,而基于組合的模型則依賴于更復雜的規則來創建基于大量檢索到的實例的翻譯。 基于檢索的模型可以說是最簡單的多模態翻譯形式。他們依賴于在字典中找到最接近的樣本,并將其作為翻譯結果。檢索可以在單峰空間或中間語義空間進行。 給定要翻譯的源模態實例,單模態檢索在源空間(例如,圖像的視覺特征空間)中找到字典中最近的實例。這種方法已經被用于視覺語音合成,通過檢索最接近匹配的期望音素[26]的視覺示例。它們也被用于串聯文本-語音系統[88]。最近,Ordonez等人[155]使用單模態檢索,通過使用全局圖像特征檢索候選標題來生成圖像描述[155]。Yagcioglu等人[232]使用了一種基于cnn的圖像表示,使用自適應鄰域選擇來檢索視覺上相似的圖像。Devlin et al.[49]證明,與更復雜的生成方法相比,具有一致標題選擇的簡單k近鄰檢索可以獲得有競爭力的翻譯結果。這種單模態檢索方法的優點是,它們只需要表示我們執行檢索時所使用的單一模態。然而,它們通常需要額外的處理步驟,如對檢索到的翻譯進行重新排序[135]、[155]、[232]。這表明了這種方法的一個主要問題——單峰空間中的相似性并不總是意味著好的翻譯。 |
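A minimal sketch of the unimodal retrieval idea follows; feature extraction is stubbed out with random vectors (in practice one would use e.g. CNN image features), and the dictionary, captions, and neighbor count are invented for illustration. The "translation" is simply the caption attached to the nearest dictionary image in visual feature space, optionally followed by re-ranking or consensus selection as discussed above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
dict_feats = rng.standard_normal((1000, 2048))                    # visual features of dictionary images
dict_captions = [f"caption of image {i}" for i in range(1000)]    # their paired textual descriptions

index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(dict_feats)

def retrieve_caption(query_feat):
    """Return captions of the k closest dictionary images to the query image."""
    _, idx = index.kneighbors(query_feat.reshape(1, -1))
    return [dict_captions[i] for i in idx[0]]                     # re-rank or take a consensus afterwards

print(retrieve_caption(rng.standard_normal(2048)))
```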
| An alternative is to use an intermediate semantic space for similarity comparison during retrieval. An early ex-ample of a hand crafted semantic space is one used by?Farhadi et al. [56]. They map both sentences and images to a space of object, action, scene, retrieval of relevant caption to an image is then performed in that space. In contrast to hand-crafting a representation, Socher et al. [191] learn a coordinated representation of sentences and CNN visual features (see Section 3.2 for description of coordinated spaces). They use the model for both translating from text to images and from images to text. Similarly, Xu et al. [231] used a coordinated space of videos and their descriptions for cross-modal retrieval. Jiang and Li [93] and Cao et al. [32] use cross-modal hashing to perform multimodal translation from images to sentences and back, while Ho-dosh et al. [83] use a multimodal KCCA space for image-sentence retrieval. Instead of aligning images and sentences globally in a common space, Karpathy et al. [99] propose a multimodal similarity metric that internally aligns image fragments (visual objects) together with sentence fragments (dependency tree relations). | 另一種方法是在檢索過程中使用中間語義空間進行相似度比較。Farhadi等人使用的手工語義空間是一個早期的例子。它們將句子和圖像映射到對象、動作、場景的空間中,然后在該空間中檢索圖像的相關標題。與手工制作表征不同,Socher等人[191]學習句子和CNN視覺特征的協調表征(關于協調空間的描述,請參見章節3.2)。他們將該模型用于從文本到圖像和從圖像到文本的轉換。類似地,Xu等人[231]使用視頻及其描述的協調空間進行跨模態檢索。Jiang和Li[93]、Cao等使用跨模態哈希進行圖像到句子的多模態轉換,Ho-dosh等[83]使用多模態KCCA空間進行圖像-句子檢索。Karpathy等人[99]提出了一種多模態相似性度量方法,該方法將圖像片段(視覺對象)與句子片段(依賴樹關系)內部對齊,而不是將圖像和句子整體對齊到一個公共空間中。 |
| Retrieval approaches in semantic space tend to perform better than their unimodal counterparts as they are retriev-ing examples in a more meaningful space that reflects both modalities and that is often optimized for retrieval. Fur-thermore, they allow for bi-directional translation, which is not straightforward with unimodal methods. However, they require manual construction or learning of such a semantic space, which often relies on the existence of large training dictionaries (datasets of paired samples). Combination-based models take the retrieval based ap-proaches one step further. Instead of just retrieving exam-ples from the dictionary, they combine them in a meaningful way to construct a better translation. Combination based media description approaches are motivated by the fact that sentence descriptions of images share a common and simple structure that could be exploited. Most often the rules for combinations are hand crafted or based on heuristics. Kuznetsova et al. [114] first retrieve phrases that describe visually similar images and then combine them to generate novel descriptions of the query image by using Integer Linear Programming with a number of hand crafted rules. Gupta et al. [74] first find k images most similar to the source image, and then use the phrases extracted from their captions to generate a target sentence. Lebret et al. [119] use a CNN-based image representation to infer phrases that describe it. The predicted phrases are then combined using?a trigram constrained language model. A big problem facing example-based approaches for translation is that the model is the entire dictionary — mak-ing the model large and inference slow (although, optimiza-tions such as hashing alleviate this problem). Another issue facing example-based translation is that it is unrealistic to expect that a single comprehensive and accurate translation relevant to the source example will always exist in the dic-tionary — unless the task is simple or the dictionary is very large. This is partly addressed by combination models that are able to construct more complex structures. However, they are only able to perform translation in one direction, while semantic space retrieval-based models are able to perform it both ways. | 語義空間中的檢索方法往往比單模態檢索方法表現得更好,因為它們檢索的例子是在一個更有意義的空間中,反映了兩種模態,并且通常對檢索進行優化。此外,它們允許雙向翻譯,這與單峰方法不同。然而,它們需要手工構建或學習這樣的語義空間,而這往往依賴于大型訓練字典(成對樣本的數據集)的存在。 基于組合的模型將基于檢索的方法又向前推進了一步。它們不只是從字典中檢索示例,而是以一種有意義的方式將它們組合在一起,從而構建出更好的翻譯。基于組合的媒體描述方法是基于這樣一個事實,即圖像的句子描述具有共同的、簡單的結構,可以被利用。最常見的組合規則是手工制作的或基于啟發式。 Kuznetsova等人[114]首先檢索描述視覺上相似圖像的短語,然后通過使用帶有大量手工規則的整數線性規劃將它們組合起來,生成查詢圖像的新穎描述。Gupta等人[74]首先找到與源圖像最相似的k張圖像,然后使用從這些圖像的標題中提取的短語來生成目標句。Lebret等人[119]使用基于cnn的圖像表示來推斷描述圖像的短語。然后,使用三元組合約束語言模型將預測的短語組合在一起。 基于實例的翻譯方法面臨的一個大問題是,模型是整個字典——這使得模型變大,推理速度變慢(盡管,哈希等優化可以緩解這個問題)。基于示例的翻譯面臨的另一個問題是,期望與源示例相關的單個全面而準確的翻譯總是存在于詞典中是不現實的——除非任務很簡單或詞典非常大。這可以通過能夠構建更復雜結構的組合模型部分地解決。然而,它們只能在一個方向上進行翻譯,而基于語義空間檢索的模型可以以兩種方式進行翻譯。 |
4.2 Generative approaches生成方法
| Generative approaches to multimodal translation construct models that can perform multimodal translation given a unimodal source instance. It is a challenging problem as it requires the ability to both understand the source modality and to generate the target sequence or signal. As discussed in the following section, this also makes such methods much more difficult to evaluate, due to large space of possible correct answers. In this survey we focus on the generation of three modal-ities: language, vision, and sound. Language generation has been explored for a long time [170], with a lot of recent attention for tasks such as image and video description [19]. Speech and sound generation has also seen a lot of work with a number of historical [88] and modern approaches [157], [209]. Photo-realistic image generation has been less explored, and is still in early stages [132], [171], however, there have been a number of attempts at generating abstract scenes [253], computer graphics [45], and talking heads [6]. | 多模態翻譯的生成方法可以在給定單模態源實例的情況下構建能夠執行多模態翻譯的模型。這是一個具有挑戰性的問題,因為它要求既能理解源模態,又能生成目標序列或信號。正如下一節所討論的,這也使得這些方法更難評估,因為可能的正確答案空間很大。 在這個調查中,我們關注三種模態的生成:語言、視覺和聲音。語言生成已經被探索了很長一段時間[170],最近很多人關注的是圖像和視頻描述[19]等任務。語音和聲音生成也見證了許多歷史[88]和現代方法[157]、[209]的大量工作。真實感圖像生成的研究較少,仍處于早期階段[132],[171],然而,在生成抽象場景[253]、計算機圖形學[45]和會說話的頭[6]方面已經有了一些嘗試。 |
| We identify three broad categories of generative mod-els: grammar-based, encoder-decoder, and continuous generation models. Grammar based models simplify the task by re-stricting the target domain by using a grammar, e.g., by gen-erating restricted sentences based on a subject, object, verbtemplate. Encoder-decoder models first encode the source modality to a latent representation which is then used by a decoder to generate the target modality. Continuous gen-eration models generate the target modality continuously based on a stream of source modality inputs and are most suited for translating between temporal sequences — such as text-to-speech. Grammar-based models rely on a pre-defined grammar for generating a particular modality. They start by detecting high level concepts from the source modality, such as objects in images and actions from videos. These detections are then incorporated together with a generation procedure based on a pre-defined grammar to result in a target modality. Kojima et al. [107] proposed a system to describe human behavior in a video using the detected position of the person’s head and hands and rule based natural language generation that incorporates a hierarchy of concepts and actions. Barbu et al. [14] proposed a video description model that generates sentences of the form: who did what to whom and where and how they did it. The system was based on handcrafted object and event classifiers and used?a restricted grammar suitable for the task. Guadarrama et al.[73] predict subject, verb, object triplets describing a video using semantic hierarchies that use more general words in case of uncertainty. Together with a language model their approach allows for translation of verbs and nouns not seen in the dictionary. | 我們確定了生成模型的三大類:基于語法的、編碼器-解碼器和連續生成模型。基于語法的模型通過使用語法限制目標領域來簡化任務,例如,通過基于主語、賓語、動詞模板生成限制句。編碼器-解碼器模型首先將源模態編碼為一個潛在的表示,然后由解碼器使用它來生成目標模態。連續生成模型基于源模態輸入流連續地生成目標模態,最適合于時間序列之間的轉換——比如文本到語音。 基于語法的模型依賴于預定義的語法來生成特定的模態。他們首先從源模態檢測高級概念,如圖像中的對象和視頻中的動作。然后將這些檢測與基于預定義語法的生成過程合并在一起,以產生目標模態。 Kojima等人[107]提出了一種系統,利用檢測到的人的頭和手的位置,以及基于規則的自然語言生成(包含概念和行為的層次),來描述視頻中的人類行為。Barbu et al.[14]提出了一個視頻描述模型,該模型生成如下形式的句子:誰對誰做了什么,在哪里以及他們是如何做的。該系統基于手工制作的對象和事件分類器,并使用了適合該任務的限制性語法。guadarama等人[73]預測主語、動詞、賓語三連詞描述視頻,使用語義層次結構,在不確定的情況下使用更一般的詞匯。與語言模型一起,他們的方法允許翻譯字典中沒有的動詞和名詞。 |
| To describe images, Yao et al. [235] propose to use an and-or graph-based model together with domain-specific lexicalized grammar rules, targeted visual representation scheme, and a hierarchical knowledge ontology. Li et al.[121] first detect objects, visual attributes, and spatial re-lationships between objects. They then use an n-gram lan-guage model on the visually extracted phrases to generatesubject, preposition, object style sentences. Mitchell et al.[142] use a more sophisticated tree-based language model to generate syntactic trees instead of filling in templates, leading to more diverse descriptions. A majority of ap-proaches represent the whole image jointly as a bag of visual objects without capturing their spatial and semantic relationships. To address this, Elliott et al. [51] propose to explicitly model proximity relationships of objects for image description generation. Some grammar-based approaches rely on graphical models to generate the target modality. An example includes BabyTalk [112], which given an image generates object, preposition, object triplets, that are used together with a conditional random field to construct the sentences. Yang et al. [233] predict a set of noun, verb, scene, prepositioncandidates using visual features extracted from an image and combine them into a sentence using a statistical lan-guage model and hidden Markov model style inference. A similar approach has been proposed by Thomason et al. [204], where a factor graph model is used for video description of the form subject, verb, object, place. The factor model exploits language statistics to deal with noisy visual representations. Going the other way Zitnick et al.[253] propose to use conditional random fields to generate abstract visual scenes based on language triplets extracted from sentences. | 為了描述圖像,Yao等人[235]提出使用基于和或圖的模型,以及特定領域的詞匯化語法規則、有針對性的視覺表示方案和層次知識本體。Li等人[121]首先檢測對象、視覺屬性和對象之間的空間關系。然后,他們在視覺提取的短語上使用一個n-gram語言模型,生成主語、介詞、賓語式的句子。Mitchell等人[142]使用更復雜的基于樹的語言模型來生成語法樹,而不是填充模板,從而產生更多樣化的描述。大多數方法將整個圖像共同表示為一袋視覺對象,而沒有捕捉它們的空間和語義關系。為了解決這個問題,Elliott et al.[51]提出明確地建模物體的接近關系,以生成圖像描述。 一些基于語法的方法依賴于圖形模型來生成目標模態。一個例子包括BabyTalk[112],它給出一個圖像生成object,介詞,object三連詞,這些連詞與條件隨機場一起用來構造句子。Yang等人[233]利用從圖像中提取的視覺特征預測一組的名詞、動詞、場景、介詞候選人,并使用統計語言模型和隱馬爾可夫模型風格推理將它們組合成一個句子。Thomason等人也提出了類似的方法[204],其中一個因子圖模型用于subject, verb, object, place形式的視頻描述。因子模型利用語言統計來處理嘈雜的視覺表示。Zitnick等人[253]則提出利用條件隨機場從句子中提取語言三聯體,生成抽象視覺場景。 |
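To illustrate the grammar-based idea in its simplest form, here is a toy sketch; the detected triplet and the templates are invented for the example. Concepts produced by upstream detectors are slotted into a restricted subject-verb-object template, which yields a syntactically well-formed, if formulaic, description.

```python
# Toy grammar-based generation: fill a restricted template from detected concepts.
TEMPLATES = [
    "A {subject} is {verb} a {object} {place}.",
    "There is a {subject} {verb} a {object} {place}.",
]

def describe(detections, template_id=0):
    """detections: dict with 'subject', 'verb', 'object', 'place' from (hypothetical) detectors."""
    return TEMPLATES[template_id].format(**detections)

print(describe({"subject": "woman", "verb": "walking", "object": "dog", "place": "in the park"}))
# -> "A woman is walking a dog in the park."
```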
| An advantage of grammar-based methods is that they are more likely to generate syntactically (in case of language) or logically correct target instances as they use predefined templates and restricted grammars. However, this limits them to producing formulaic rather than creative translations. Furthermore, grammar-based methods rely on complex pipelines for concept detection, with each concept requiring a separate model and a separate training dataset. Encoder-decoder models based on end-to-end trained neural networks are currently some of the most popular techniques for multimodal translation. The main idea behind the model is to first encode a source modality into a vectorial representation and then to use a decoder module to generate the target modality, all this in a single pass pipeline. Although first used for machine translation [97], such models have been successfully used for image captioning [134], [214], and video description [174], [213]. So far, encoder-decoder models have been mostly used to generate text, but they can also be used to generate images [132], [171], and for continuous generation of speech and sound [157], [209]. The first step of the encoder-decoder model is to encode the source object; this is done in a modality-specific way. Popular models to encode acoustic signals include RNNs [35] and DBNs [79]. Most of the work on encoding words and sentences uses distributional semantics [141] and variants of RNNs [12]. Images are most often encoded using convolutional neural networks (CNN) [109], [185]. While learned CNN representations are common for encoding images, this is not the case for videos where hand-crafted features are still commonly used [174], [204]. While it is possible to use unimodal representations to encode the source modality, it has been shown that using a coordinated space (see Section 3.2) leads to better results [105], [159], [231]. | 基于語法的方法的一個優點是,因為它們使用預定義模板和受限制的語法,它們更有可能生成語法上(對于語言)或邏輯上正確的目標實例。然而,這限制了它們只能產生公式化而非創造性的翻譯。此外,基于語法的方法依賴于復雜的管道進行概念檢測,每個概念都需要一個單獨的模型和一個單獨的訓練數據集。基于端到端訓練神經網絡的編解碼模型是目前最流行的多模態翻譯技術之一。該模型背后的主要思想是,首先將源模態編碼為矢量表示,然后使用解碼器模塊生成目標模態,所有這些都在單次傳遞的流程中完成。雖然該模型最初用于機器翻譯[97],但已成功應用于圖像字幕[134]、[214]和視頻描述[174]、[213]。到目前為止,編碼器-解碼器模型大多用于生成文本,但它們也可以用于生成圖像[132]、[171],以及語音和聲音的連續生成[157]、[209]。 編碼器-解碼器模型的第一步是對源對象進行編碼,這是以特定于模態的方式完成的。常用的聲學信號編碼模型包括RNNs [35]和DBNs [79]。大多數關于單詞和句子編碼的研究使用了分布語義[141]和RNNs的變體[12]。圖像通常使用卷積神經網絡(CNN)進行編碼[109],[185]。雖然學習得到的CNN表示常用于圖像編碼,但視頻編碼仍普遍使用手工設計的特征[174],[204]。雖然可以使用單模態表示對源模態進行編碼,但已經證明使用協調空間(見3.2節)可以得到更好的結果[105]、[159]、[231]。 |
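To make the encoder-decoder idea above concrete, the sketch below (not from the survey; the toy CNN, layer sizes, and vocabulary size are illustrative assumptions) encodes an image into a single vector with a small CNN and feeds it to an LSTM decoder that emits caption-token logits, in the spirit of neural image captioning systems such as [214]:

```python
# A minimal encoder-decoder sketch: a CNN encodes an image into one vector,
# an LSTM decoder generates a caption conditioned on that vector.
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, images):                      # (B, 3, H, W)
        h = self.features(images).flatten(1)        # (B, 64)
        return self.fc(h)                           # (B, embed_dim)

class LSTMDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_vec, captions):
        # Prepend the image vector as the first "token" of the sequence.
        inputs = torch.cat([image_vec.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                     # logits over the vocabulary

# Toy forward pass with random data.
enc, dec = CNNEncoder(), LSTMDecoder(vocab_size=1000)
logits = dec(enc(torch.randn(2, 3, 64, 64)), torch.randint(0, 1000, (2, 12)))
print(logits.shape)   # torch.Size([2, 13, 1000])
```

In practice the encoder would be a pretrained network and decoding would use beam search, but the single-vector bottleneck shown here is exactly what the attention mechanisms discussed in Section 5.2 later relax.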
| Decoding is most often performed by an RNN or an LSTM using the encoded representation as the initial hidden state [54], [132], [214], [215]. A number of extensions have been proposed to traditional LSTM models to aid in the task of translation. A guide vector could be used to tightly couple the solutions in the image input [91]. Venugopalan et al.[213] demonstrate that it is beneficial to pre-train a decoder LSTM for image captioning before fine-tuning it to video description. Rohrbach et al. [174] explore the use of various LSTM architectures (single layer, multilayer, factored) and a number of training and regularization techniques for the task of video description. A problem facing translation generation using an RNN is that the model has to generate a description from a single vectorial representation of the image, sentence, or video. This becomes especially difficult when generating long sequences as these models tend to forget the initial input. This has been partly addressed by neural attention models (see Section 5.2) that allow the network to focus on certain parts of an image [230], sentence [12], or video [236] during generation. Generative attention-based RNNs have also been used for the task of generating images from sentences [132], while the results are still far from photo-realistic they show a lot of promise. More recently, a large amount of progress has been made in generating images using generative adversarial networks [71], which have been used as an alternative to RNNs for image generation from text [171]. | 解碼通常由RNN或LSTM執行,使用編碼表示作為初始隱藏狀態[54],[132],[214],[215]。人們對傳統的LSTM模型進行了大量的擴展,以幫助完成翻譯任務。一個引導向量可以用來緊耦合圖像輸入中的解[91]。Venugopalan等人[213]證明,在將解碼器LSTM微調為視頻描述之前,對圖像字幕進行預訓練是有益的。Rohrbach等人[174]探討了在視頻描述任務中使用各種LSTM架構(單層、多層、因子)和多種訓練和正則化技術。 使用RNN進行翻譯生成面臨的一個問題是,模型必須從圖像、句子或視頻的單個矢量表示生成描述。這在生成長序列時變得特別困難,因為這些模型往往會忘記最初的輸入。神經注意力模型已經部分解決了這一問題(見5.2節),神經注意力模型允許網絡在生成時聚焦于圖像[230]、句子[12]或視頻[236]的某些部分。 基于生成注意力的神經網絡也被用于從句子中生成圖像的任務[132],盡管其結果還遠遠不夠逼真,但它們顯示出了很大的希望。最近,在使用生成對抗網絡生成圖像方面取得了大量進展[71],生成對抗網絡已被用于替代rnn從文本生成圖像[171]。 |
| While neural network based encoder-decoder systems have been very successful they still face a number of issues. Devlin et al. [49] suggest that it is possible that the network is memorizing the training data rather than learning how to understand the visual scene and generate it. This is based on the observation that k-nearest neighbor models perform very similarly to those based on generation. Furthermore, such models often require large quantities of data for train-ing. Continuous generation models are intended for sequence translation and produce outputs at every timestep in an online manner. These models are useful when translating from a sequence to a sequence such as text to speech, speech to text, and video to text. A number of different techniques have been proposed for such modeling — graphical models, continuous encoder-decoder approaches, and various other regression or classification techniques. The extra difficulty that needs to be tackled by these models is the requirement of temporal consistency between modalities. A lot of early work on sequence to sequence transla-tion used graphical or latent variable models. Deena and Galata [47] proposed to use a shared Gaussian process latent?variable model for audio-based visual speech synthesis. The model creates a shared latent space between audio and vi-sual features that can be used to generate one space from the other, while enforcing temporal consistency of visual speech at different timesteps. Hidden Markov models (HMM) have also been used for visual speech generation [203] and text-to-speech [245] tasks. They have also been extended to use cluster adaptive training to allow for training on multiple speakers, languages, and emotions allowing for more con-trol when generating speech signal [244] or visual speech parameters [6]. | 雖然基于神經網絡的編碼器-解碼器系統已經非常成功,但它們仍然面臨一些問題。Devlin et al.[49]認為,網絡可能是在記憶訓練數據,而不是學習如何理解視覺場景并生成它。這是基于k近鄰模型與基于生成的模型非常相似的觀察得出的。此外,這種模型通常需要大量的數據進行訓練。 連續生成模型用于序列轉換,并以在線方式在每個時間步中產生輸出。這些模型在將一個序列轉換為另一個序列時非常有用,比如文本到語音、語音到文本和視頻到文本。為這種建模提出了許多不同的技術——圖形模型、連續編碼器-解碼器方法,以及各種其他回歸或分類技術。這些模型需要解決的額外困難是對模態之間時間一致性的要求。 許多早期的序列到序列轉換的工作使用圖形或潛在變量模型。Deena和Galata[47]提出了一種共享高斯過程潛變量模型用于基于音頻的可視語音合成。該模型在音頻和視覺特征之間創建了一個共享的潛在空間,可用于從另一個空間生成一個空間,同時在不同的時間步長強制實現視覺語音的時間一致性。隱馬爾可夫模型(HMM)也被用于視覺語音生成[203]和文本-語音轉換[245]任務。它們還被擴展到使用聚類自適應訓練,以允許對多種說話人、語言和情緒進行訓練,從而在產生語音信號[244]或視覺語音參數[6]時進行更多的控制。 |
| Encoder-decoder models have recently become popular for sequence to sequence modeling. Owens et al. [157] used an LSTM to generate sounds resulting from drumsticks based on video. While their model is capable of generat-ing sounds by predicting a cochleogram from CNN visual features, they found that retrieving a closest audio sample based on the predicted cochleogram led to best results. Di-rectly modeling the raw audio signal for speech and music generation has been proposed by van den Oord et al. [209]. The authors propose using hierarchical fully convolutional neural networks, which show a large improvement over previous state-of-the-art for the task of speech synthesis. RNNs have also been used for speech to text translation (speech recognition) [72]. More recently encoder-decoder based continuous approach was shown to be good at pre-dicting letters from a speech signal represented as a filter bank spectra [35] — allowing for more accurate recognition of rare and out of vocabulary words. Collobert et al. [42] demonstrate how to use a raw audio signal directly for speech recognition, eliminating the need for audio features. A lot of earlier work used graphical models for mul-timodal translation between continuous signals. However, these methods are being replaced by neural network encoder-decoder based techniques. Especially as they have recently been shown to be able to represent and generate complex visual and acoustic signals. | 編碼器-解碼器模型是近年來序列對序列建模的流行方法。Owens等人[157]使用LSTM來產生基于視頻的鼓槌的聲音。雖然他們的模型能夠通過預測CNN視覺特征的耳蝸圖來產生聲音,但他們發現,根據預測的耳蝸圖檢索最近的音頻樣本會帶來最好的結果。van den Oord等人提出直接對原始音頻信號建模以生成語音和音樂[209]。作者建議使用分層全卷積神經網絡,這表明在語音合成的任務中,比以前的最先進技術有了很大的改進。rnn也被用于語音到文本的翻譯(語音識別)[72]。最近,基于編碼器-解碼器的連續方法被證明能夠很好地從表示為濾波器組光譜[35]的語音信號中預測字母,從而能夠更準確地識別罕見的和詞匯之外的單詞。Collobert等人的[42]演示了如何直接使用原始音頻信號進行語音識別,消除了對音頻特征的需求。 許多早期的工作使用圖形模型來實現連續信號之間的多模態轉換。然而,這些方法正在被基于神經網絡的編碼器-解碼器技術所取代。特別是它們最近被證明能夠表示和產生復雜的視覺和聽覺信號。 |
4.3 Model evaluation and discussion模型評價與討論
| A major challenge facing multimodal translation methods is that they are very difficult to evaluate. While some tasks such as speech recognition have a single correct translation, tasks such as speech synthesis and media description do not. Sometimes, as in language translation, multiple answers are correct and deciding which translation is better is often subjective. Fortunately, there are a number of approximate automatic metrics that aid in model evaluation. Often the ideal way to evaluate a subjective task is through human judgment. That is by having a group of people evaluating each translation. This can be done on a Likert scale where each translation is evaluated on a certain dimension: naturalness and mean opinion score for speech synthesis [209], [244], realism for visual speech synthesis [6],[203], and grammatical and semantic correctness, relevance, order, and detail for media description [38], [112], [142],[213]. Another option is to perform preference studies where two (or more) translations are presented to the participant for preference comparison [203], [244]. However, while user studies will result in evaluation closest to human judgments they are time consuming and costly. Furthermore, they require care when constructing and conducting them to avoid fluency, age, gender and culture biases. | 多模態翻譯方法面臨的一個主要挑戰是它們很難評估。語音識別等任務只有一個正確的翻譯,而語音合成和媒體描述等任務則沒有。有時,就像在語言翻譯中,多重答案是正確的,決定哪個翻譯更好往往是主觀的。幸運的是,有許多有助于模型評估的近似自動指標。 評估主觀任務的理想方法通常是通過人的判斷。那就是讓一群人評估每一個翻譯。這可以通過李克特量表來完成,其中每一篇翻譯都在一個特定的維度上進行評估:語音合成的自然度和平均意見得分[209],[244],視覺語音合成的真實感[6],[203],以及媒體描述[38],[112],[142],[213]的語法和語義正確性、相關性、順序和細節。另一種選擇是進行偏好研究,將兩種(或更多)翻譯呈現給參與者進行偏好比較[203],[244]。然而,雖然用戶研究將導致最接近人類判斷的評估,但它們既耗時又昂貴。此外,在構建和指導這些活動時,需要小心謹慎,以避免流利性、年齡、性別和文化偏見。 |
| While human studies are a gold standard for evaluation, a number of automatic alternatives have been proposed for the task of media description: BLEU [160], ROUGE [124], Meteor [48], and CIDEr [211]. These metrics are directly taken from (or are based on) work in machine translation and compute a score that measures the similarity between the generated and ground truth text. However, the use of them has faced a lot of criticism. Elliott and Keller [52] showed that sentence-level unigram BLEU is only weakly correlated with human judgments. Huang et al. [87] demonstrated that the correlation between human judgments and BLEU and Meteor is very low for the visual storytelling task. Furthermore, the ordering of approaches based on human judgments did not match that of the ordering using automatic metrics on the MS COCO challenge [38] — with a large number of algorithms outperforming humans on all the metrics. Finally, the metrics only work well when the number of reference translations is high [211], which is often unavailable, especially for current video description datasets [205]. | 雖然人類研究是評估的黃金標準,但人們提出了許多媒體描述任務的自動替代方案:BLEU[160]、ROUGE[124]、Meteor[48]和CIDEr[211]。這些指標直接取自(或基于)機器翻譯領域的工作,計算一個衡量生成文本與真實參考文本之間相似度的分數。然而,它們的使用面臨著許多批評。Elliott和Keller[52]表明句子層面的unigram BLEU與人類判斷只有弱相關。Huang等[87]研究表明,在視覺講故事任務中,人類判斷與BLEU和Meteor之間的相關性非常低。此外,基于人類判斷的方法排序與在MS COCO挑戰[38]上使用自動度量的排序并不匹配——大量算法在所有度量上都優于人類。最后,只有在參考翻譯數量較多的情況下,這些指標才能很好地工作[211],而這樣的條件通常難以滿足,特別是對于當前的視頻描述數據集[205]。 |
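As a concrete illustration of what these n-gram overlap metrics compute, the snippet below (a toy example, not tied to any dataset in the survey) scores one hypothesis caption against two references with NLTK's sentence-level BLEU; the smoothing function is needed because short sentences often have zero higher-order n-gram matches:

```python
# Sentence-level BLEU between a generated caption and two references.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "on", "the", "beach"],
              ["a", "dog", "is", "running", "along", "the", "beach"]]
hypothesis = ["a", "dog", "runs", "along", "the", "beach"]

smooth = SmoothingFunction().method1
score = sentence_bleu(references, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),   # up to 4-grams
                      smoothing_function=smooth)
print(f"BLEU-4: {score:.3f}")
```

Scores like this only approximate human judgments, which is precisely the criticism raised above.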
| These criticisms have led to Hodosh et al. [83] proposing to use retrieval as a proxy for image captioning evaluation, which they argue better reflects human judgments. Instead of generating captions, a retrieval based system ranks the available captions based on their fit to the image, and is then evaluated by assessing if the correct captions are given a high rank. As a number of caption generation models are generative they can be used directly to assess the likelihood of a caption given an image and are being adapted by im-age captioning community [99], [105]. Such retrieval based evaluation metrics have also been adopted by the video captioning community [175]. Visual question-answering (VQA) [130] task was pro-posed partly due to the issues facing evaluation of image captioning. VQA is a task where given an image and a ques-tion about its content the system has to answer it. Evaluating such systems is easier due to the presence of a correct answer. However, it still faces issues such as ambiguity of certain questions and answers and question bias. We believe that addressing the evaluation issue will be crucial for further success of multimodal translation systems. This will allow not only for better comparison be-tween approaches, but also for better objectives to optimize. | 這些批評導致Hodosh等人[83]提出使用檢索作為圖像標題評價的代理,他們認為檢索可以更好地反映人類的判斷。基于檢索的系統不是生成標題,而是根據它們與圖像的契合度對可用的標題進行排序,然后通過評估正確的標題是否被給予較高的級別來進行評估。由于許多字幕生成模型是可生成的,它們可以直接用于評估給定圖像的字幕的可能性,并被圖像字幕社區改編[99],[105]。這種基于檢索的評價指標也被視頻字幕社區采用[175]。 視覺問答(Visual question-answer, VQA)[130]任務的提出,部分是由于圖像字幕評價面臨的問題。VQA是一個任務,在這個任務中,給定一個圖像和一個關于其內容的問題,系統必須回答它。由于存在正確的答案,評估這些系統更容易。然而,它仍然面臨一些問題,如某些問題和答案的模糊性和問題的偏見。 我們認為,解決評價問題將是多模態翻譯系統進一步成功的關鍵。這不僅可以更好地比較不同的方法,而且還可以優化更好的目標。 |
5 Alignment對齊
| We define multimodal alignment as finding relationships and correspondences between sub-components of instances from two or more modalities. For example, given an image and a caption we want to find the areas of the image cor-responding to the caption’s words or phrases [98]. Another example is, given a movie, aligning it to the script or the book chapters it was based on [252]. We categorize multimodal alignment into two types –implicit and explicit. In explicit alignment, we are explicitly interested in aligning sub-components between modalities,e.g., aligning recipe steps with the corresponding instructional video [131]. Implicit alignment is used as an interme-diate (often latent) step for another task, e.g., image retrieval?based on text description can include an alignment step between words and image regions [99]. An overview of such approaches can be seen in Table 4 and is presented in more detail in the following sections. | 我們將多模態對齊定義為尋找來自兩個或多個模態實例的子組件之間的關系和對應關系。例如,給定一張圖片和一個標題,我們想要找到圖片中與標題中的單詞或短語相對應的區域[98]。另一個例子是,給定一部電影,將其與劇本或書中的章節對齊[252]。 我們將多模態對齊分為兩種類型-隱式和顯式。在顯式對齊中,我們明確地對模態之間的子組件對齊感興趣。,將配方步驟與相應的教學視頻對齊[131]。隱式對齊是另一個任務的中間(通常是潛伏的)步驟,例如,基于文本描述的圖像檢索可以包括單詞和圖像區域之間的對齊步驟[99]。這些方法的概述見表4,并在下面幾節中給出更詳細的介紹。 |
| Table 4: Summary of our taxonomy for the multimodal alignment challenge. For each sub-class of our taxonomy, we include reference citations and modalities aligned. | 表4:我們對多模態對齊挑戰的分類總結。對于我們分類法的每一個子類,我們列出了參考文獻以及所對齊的模態。 |
5.1 Explicit alignment顯式對齊
| We categorize papers as performing explicit alignment if their main modeling objective is alignment between sub-components of instances from two or more modalities. A very important part of explicit alignment is the similarity metric. Most approaches rely on measuring similarity between sub-components in different modalities as a basic building block. These similarities can be defined manually or learned from data. We identify two types of algorithms that tackle explicit alignment — unsupervised and (weakly) supervised. The first type operates with no direct alignment labels (i.e., labeled correspondences) between instances from the different modalities. The second type has access to such (sometimes weak) labels. Unsupervised multimodal alignment tackles modality alignment without requiring any direct alignment labels. Most of the approaches are inspired from early work on alignment for statistical machine translation [28] and genome sequences [3], [111]. To make the task easier the approaches assume certain constraints on alignment, such as temporal ordering of sequences or the existence of a similarity metric between the modalities. | 如果論文的主要建模目標是對齊來自兩個或更多模態的實例的子組件,那么我們將其分類為執行顯式對齊。顯式對齊的一個非常重要的部分是相似性度量。大多數方法都依賴于度量不同模態的子組件之間的相似性作為基本構建塊。這些相似點可以手工定義,也可以從數據中學習。 我們確定了兩種處理顯式對齊的算法-無監督和(弱)監督。第一種類型在不同模態的實例之間沒有直接對齊標簽(即有標注的對應關系)。第二種類型可以訪問這樣的標簽(有時是弱標簽)。 無監督多模態對齊處理模態對齊,而不需要任何直接對齊標簽。大多數方法的靈感來自于早期對統計機器翻譯[28]和基因組序列[3],[111]的比對工作。為了使任務更容易,這些方法在對齊上假定了一定的約束,例如序列的時間順序或模態之間存在相似性度量。 |
| Dynamic time warping (DTW) [3], [111] is a dynamic programming approach that has been extensively used to align multi-view time series. DTW measures the similarity between two sequences and finds an optimal match between them by time warping (inserting frames). It requires the timesteps in the two sequences to be comparable and re-quires a similarity measure between them. DTW can be used directly for multimodal alignment by hand-crafting similar-ity metrics between modalities; for example Anguera et al.[8] use a manually defined similarity between graphemes and phonemes; and Tapaswi et al. [201] define a similarity between visual scenes and sentences based on appearance of same characters [201] to align TV shows and plot syn-opses. DTW-like dynamic programming approaches have also been used for multimodal alignment of text to speech [77] and video [202]. As the original DTW formulation requires a pre-defined similarity metric between modalities, it was extended using?canonical correlation analysis (CCA) to map the modali-ties to a coordinated space. This allows for both aligning (through DTW) and learning the mapping (through CCA) between different modality streams jointly and in an unsu-pervised manner [180], [250], [251]. While CCA based DTW models are able to find multimodal data alignment under a linear transformation, they are not able to model non-linear relationships. This has been addressed by the deep canonical time warping approach [206], which can be seen as a generalization of deep CCA and DTW. | 動態時間翹曲(DTW)[3],[111]是一種動態規劃方法,被廣泛用于對齊多視圖時間序列。DTW測量兩個序列之間的相似性,并通過時間翹曲(插入幀)找到它們之間的最優匹配。它要求兩個序列中的時間步具有可比性,并要求它們之間的相似性度量。DTW可以通過手工制作模態之間的相似性度量直接用于多模態校準;例如,Anguera等人使用手工定義的字素和音素之間的相似度;和Tapaswi等[201]根據相同角色的出現定義視覺場景和句子之間的相似性[201],以對齊電視節目和情節同步。類似dtw的動態規劃方法也被用于文本到語音[77]和視頻[202]的多模態對齊。 由于原始的DTW公式需要預定義的模態之間的相似性度量,因此使用典型相關分析(CCA)對其進行了擴展,以將模態映射到協調空間。這既允許(通過DTW)對齊,也允許(通過CCA)以非監督的方式共同學習不同模態流之間的映射[180]、[250]、[251]。雖然基于CCA的DTW模型能夠在線性變換下找到多模態數據對齊,但它們不能建模非線性關系。深度標準時間翹曲方法已經解決了這一問題[206],該方法可以看作是深度CCA和DTW的推廣。 |
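The core DTW recursion described above is compact enough to sketch directly; the version below is illustrative (Euclidean local distance, no band constraints, costs only rather than the full warping path):

```python
# Dynamic time warping: fill a table of cumulative costs and return the
# optimal monotonic alignment cost between two feature sequences.
import numpy as np

def dtw(x, y, dist=lambda a, b: np.linalg.norm(a - b)):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            # Allowed moves: match, or repeat a frame in either sequence (monotonic warping).
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# Two toy "modality" streams living in a comparable feature space.
a = np.array([[0.0], [1.0], [2.0], [3.0]])
b = np.array([[0.0], [0.9], [1.1], [2.2], [3.1]])
print(dtw(a, b))
```

The CCA-based extensions mentioned above replace the hand-crafted `dist` with a learned mapping into a coordinated space, so that the two streams become comparable before warping.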
| Various graphical models have also been popular for multimodal sequence alignment in an unsupervised man-ner. Early work by Yu and Ballard [239] used a generative graphical model to align visual objects in images with spoken words. A similar approach was taken by Cour et al.[44] to align movie shots and scenes to the corresponding screenplay. Malmaud et al. [131] used a factored HMM to align recipes to cooking videos, while Noulas et al. [154] used a dynamic Bayesian network to align speakers to videos. Naim et al. [147] matched sentences with corre-sponding video frames using a hierarchical HMM model to align sentences with frames and a modified IBM [28] algorithm for word and object alignment [15]. This model was then extended to use latent conditional random fields for alignments [146] and to incorporate verb alignment to actions in addition to nouns and objects [195]. Both DTW and graphical model approaches for align-ment allow for restrictions on alignment, e.g. temporal consistency, no large jumps in time, and monotonicity. While DTW extensions allow for learning both the similarity met-ric and alignment jointly, graphical model based approaches require expert knowledge for construction [44], [239]. Supervised alignment methods rely on labeled aligned in-stances. They are used to train similarity measures that are used for aligning modalities. | 各種圖形模型也流行于無監督方式的多模態序列比對。Yu和Ballard的早期工作[239]使用生成圖形模型,將圖像中的視覺對象與口語對齊。Cour et al.[44]采用了類似的方法,將電影鏡頭和場景與相應的劇本對齊。Malmaud等人[131]使用一種經過分解的HMM將食譜與烹飪視頻進行對齊,而Noulas等人[154]使用動態貝葉斯網絡將說話者與視頻進行對齊。Naim等人[147]使用分層HMM模型對句子和幀進行對齊,并使用改進的IBM[28]算法對單詞和對象進行對齊[15],將句子與相應的視頻幀進行匹配。隨后,該模型被擴展到使用潛在條件隨機場進行對齊[146],并將動詞對齊合并到動作中,除了名詞和對象之外[195]。 DTW和圖形模型的對齊方法都允許對對齊的限制,例如時間一致性、時間上沒有大的跳躍和單調性。雖然DTW擴展可以同時學習相似性度量和對齊,但基于圖形模型的方法需要專家知識來構建[44],[239]。監督對齊方法依賴于標記對齊的實例。它們被用來訓練用于對齊模態的相似性度量。 |
| A number of supervised sequence alignment techniques take inspiration from unsupervised ones. Bojanowski et al. [22], [23] proposed a method similar to canonical time warping, but have also extended it to take advantage of existing (weak) supervisory alignment data for model training. Plummer et al. [161] used CCA to find a coordinated space between image regions and phrases for alignment. Gebru et al. [65] trained a Gaussian mixture model and performed semi-supervised clustering together with an unsupervised latent-variable graphical model to align speakers in an audio channel with their locations in a video. Kong et al. [108] trained a Markov random field to align objects in 3D scenes to nouns and pronouns in text descriptions. Deep learning based approaches are becoming popular for explicit alignment (specifically for measuring similarity) due to very recent availability of aligned datasets in the language and vision communities [133], [161]. Zhu et al. [252] aligned books with their corresponding movies/scripts by training a CNN to measure similarities between scenes and text. Mao et al. [133] used an LSTM language model and a CNN visual one to evaluate the quality of a match between a referring expression and an object in an image. Yu et al. [242] extended this model to include relative appearance and context information that allows to better disambiguate between objects of the same type. Finally, Hu et al. [85] used an LSTM based scoring function to find similarities between image regions and their descriptions. | 許多監督序列比對技術的靈感來自于非監督序列比對技術。Bojanowski et al.[22],[23]提出了一種類似于典型時間規整(canonical time warping)的方法,但也對其進行了擴展,以利用現有的(弱)監督對齊數據進行模型訓練。Plummer等[161]利用CCA在圖像區域和短語之間找到一個協調的空間進行對齊。Gebru等人[65]訓練了一種高斯混合模型,并將半監督聚類與一種無監督的潛在變量圖形模型結合在一起,以將音頻通道中的說話人與其在視頻中的位置對齊。Kong等人[108]訓練了馬爾可夫隨機場來將3D場景中的物體與文本描述中的名詞和代詞對齊。 基于深度學習的方法在顯式對齊(特別是度量相似性)方面正變得流行起來,這是由于最近在語言和視覺社區中對齊數據集的可用性[133],[161]。Zhu等人[252]通過訓練CNN來衡量場景和文本之間的相似性,將書籍與相應的電影/腳本對齊。Mao等人[133]使用LSTM語言模型和CNN視覺模型來評估參考表達和圖像中物體匹配的質量。Yu等人[242]將該模型擴展到包含相對外觀和上下文信息,從而可以更好地消除同一類型對象之間的歧義。最后,Hu等[85]使用基于LSTM的評分函數來尋找圖像區域與其描述之間的相似點。 |
5.2 Implicit alignment隱式對齊
| In contrast to explicit alignment, implicit alignment is used as an intermediate (often latent) step for another task. This allows for better performance in a number of tasks including speech recognition, machine translation, media description, and visual question-answering. Such models do not explicitly align data and do not rely on supervised alignment examples, but learn how to latently align the data during model training. We identify two types of implicit alignment models: earlier work based on graphical models, and more modern neural network methods. Graphical models have seen some early work used to better align words between languages for machine translation [216] and alignment of speech phonemes with their transcriptions [186]. However, they require manual construction of a mapping between the modalities, for example a generative phone model that maps phonemes to acoustic features [186]. Constructing such models requires training data or human expertise to define them manually. Neural networks. Translation (Section 4) is an example of a modeling task that can often be improved if alignment is performed as a latent intermediate step. As we mentioned before, neural networks are popular ways to address this translation problem, using either an encoder-decoder model or through cross-modal retrieval. When translation is performed without implicit alignment, it ends up putting a lot of weight on the encoder module to be able to properly summarize the whole image, sentence or a video with a single vectorial representation. | 與顯式對齊相反,隱式對齊用作另一個任務的中間(通常是潛在的)步驟。這允許在許多任務中有更好的表現,包括語音識別、機器翻譯、媒體描述和視覺問題回答。這些模型不顯式地對齊數據,也不依賴于監督對齊示例,而是學習如何在模型訓練期間潛在地對齊數據。我們確定了兩種類型的隱式對齊模型:基于圖形模型的早期工作,以及更現代的神經網絡方法。 圖形化模型的一些早期工作已被用于機器翻譯中更好地對齊語言之間的單詞[216],以及對齊語音音素與其轉錄文本[186]。然而,它們需要人工構建模態之間的映射,例如將音素映射到聲學特征的生成式音素模型[186]。構建這樣的模型需要訓練數據或人類專業知識來手動定義它們。 神經網絡方面,翻譯(第4節)是建模任務的一個例子,如果將對齊作為一個潛在的中間步驟執行,那么通常可以改進該任務。正如我們前面提到的,神經網絡是解決這個翻譯問題的常用方法,可以使用編碼器-解碼器模型,也可以通過跨模態檢索。如果翻譯過程中不進行隱式對齊,就會給編碼器模塊帶來很大負擔,要求它用單個向量表示恰當地概括整個圖像、句子或視頻。 |
| A very popular way to address this is through attention [12], which allows the decoder to focus on sub-components of the source instance. This is in contrast with encoding all source sub-components together, as is performed in a conventional encoder-decoder model. An attention module will tell the decoder to look more at targeted sub-components of the source to be translated — areas of an image [230], words of a sentence [12], segments of an audio sequence [35], [39], frames and regions in a video [236], [241], and even parts of an instruction [140]. For example, in image captioning instead of encoding an entire image using a CNN, an attention mechanism will allow the decoder (typically an RNN) to focus on particular parts of the image when generating each successive word [230]. The attention module which learns what part of the image to focus on is typically a shallow neural network and is trained end-to-end together with a target task (e.g., translation). Attention models have also been successfully applied to question answering tasks, as they allow for aligning the words in a question with sub-components of an information source such as a piece of text [228], an image [62], or a video sequence [246]. This both allows for better performance in question answering and leads to better model interpretability [4]. In particular, different types of attention models have been proposed to address this problem, including hierarchical [128], stacked [234], and episodic memory attention [228]. | 解決這個問題的一種非常流行的方法是注意力機制[12],它允許解碼器關注源實例的子組件。這與在傳統的編碼器-解碼器模型中執行的將所有源子組件編碼在一起形成對比。注意力模塊會引導解碼器更多地關注待翻譯源中的特定子組件——圖像的區域[230]、句子中的單詞[12]、音頻序列的片段[35],[39]、視頻中的幀和區域[236],[241],甚至指令的某些部分[140]。例如,在圖像標題中,不是使用CNN對整個圖像進行編碼,而是一種注意機制,允許解碼器(通常是RNN)在生成每個連續單詞時聚焦于圖像的特定部分[230]。注意力模塊學習圖像的哪一部分需要關注,它通常是一個淺層神經網絡,并與目標任務(如翻譯)一起進行端到端訓練。 注意力模型也已成功應用于問答任務,因為它們允許將問題中的單詞與信息源的子組件(如一段文本[228]、一幅圖像[62]或一段視頻序列[246])對齊。這既允許更好的問題回答性能,也導致更好的模型可解釋性[4]。特別是,人們提出了不同類型的注意模型來解決這個問題,包括層次結構注意[128]、堆疊注意[234]和情景記憶注意[228]。 |
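A minimal sketch of such a soft attention module is given below (dimensions and names are assumptions; a real captioning model would attach this to an LSTM decoder state at every generation step): the decoder state scores each image region, the scores are softmax-normalized, and the context vector is the weighted sum of region features:

```python
# Soft attention over image regions, conditioned on the decoder hidden state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, region_dim, hidden_dim, attn_dim=128):
        super().__init__()
        self.w_region = nn.Linear(region_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (B, R, region_dim); hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.w_region(regions) +
                                  self.w_hidden(hidden).unsqueeze(1))).squeeze(-1)  # (B, R)
        alpha = F.softmax(e, dim=-1)                           # attention weights over regions
        context = (alpha.unsqueeze(-1) * regions).sum(dim=1)   # (B, region_dim)
        return context, alpha

attn = SoftAttention(region_dim=512, hidden_dim=256)
ctx, weights = attn(torch.randn(2, 49, 512), torch.randn(2, 256))
print(ctx.shape, weights.shape)   # torch.Size([2, 512]) torch.Size([2, 49])
```

The attention weights `alpha` are exactly the latent alignment: inspecting them shows which regions the model attended to for each generated word.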
| Another neural alternative for aligning images with cap-tions for cross-modal retrieval was proposed by Karpathy?et al. [98], [99]. Their proposed model aligns sentence frag-ments to image regions by using a dot product similarity measure between image region and word representations. While it does not use attention, it extracts a latent alignment between modalities through a similarity measure that is learned indirectly by training a retrieval model. | Karpathy等人提出了另一種神經方法,可用于對帶有標題的圖像進行交叉模態檢索[98],[99]。他們提出的模型通過使用圖像區域和單詞表示之間的點積相似度度量來將句子片段與圖像區域對齊。雖然它不使用注意力,但它通過通過訓練檢索模型間接學習的相似度度量來提取模態之間的潛在對齊。 |
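The alignment score in this family of retrieval models reduces to a very simple computation, sketched below with random stand-in embeddings: each word is matched to its most similar image region by dot product, and the image-sentence score is the sum over words of these best matches.

```python
# Karpathy-style image-sentence score from region and word embeddings
# that already live in a shared space (random stand-ins here).
import numpy as np

def image_sentence_score(region_feats, word_feats):
    # region_feats: (R, d), word_feats: (W, d)
    sim = word_feats @ region_feats.T     # (W, R) word-region similarities
    return sim.max(axis=1).sum()          # each word aligned to its best region

regions = np.random.randn(19, 300)        # e.g., detected image regions
words = np.random.randn(7, 300)           # e.g., embedded sentence fragments
print(image_sentence_score(regions, words))
```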
5.3 Discussion討論
| Multimodal alignment faces a number of difficulties: 1) there are few datasets with explicitly annotated alignments; 2) it is difficult to design similarity metrics between modalities; 3) there may exist multiple possible alignments and not all elements in one modality have correspondences in another. Earlier work on multimodal alignment focused on aligning multimodal sequences in an unsupervised manner using graphical models and dynamic programming techniques. It relied on hand-defined measures of similarity between the modalities or learnt them in an unsupervised manner. With the recent availability of labeled training data, supervised learning of similarities between modalities has become possible. However, unsupervised techniques of learning to jointly align and translate or fuse data have also become popular. | 多模態對齊面臨許多困難: 1)具有明確注釋對齊的數據集很少; 2) 難以設計模態之間的相似性度量; 3) 可能存在多種可能的對齊方式,并且并非一種模態中的所有元素在另一種模態中都有對應關系。 早期關于多模態對齊的工作側重于使用圖形模型和動態規劃技術以無監督方式對齊多模態序列。 它依靠手動定義的模態之間的相似性度量或以無監督的方式學習它們。 隨著最近標記訓練數據的可用性,對模態之間相似性的監督學習成為可能。 然而,學習聯合對齊和翻譯或融合數據的無監督技術也變得流行起來。 |
6 Fusion融合
| Multimodal fusion is one of the original topics in mul-timodal machine learning, with previous surveys empha-sizing early, late and hybrid fusion approaches [50], [247]. In technical terms, multimodal fusion is the concept of integrating information from multiple modalities with the goal of predicting an outcome measure: a class (e.g., happy vs. sad) through classification, or a continuous value (e.g., positivity of sentiment) through regression. It is one of the most researched aspects of multimodal machine learning with work dating to 25 years ago [243]. The interest in multimodal fusion arises from three main benefits it can provide. First, having access to multiple modalities that observe the same phenomenon may allow for more robust predictions. This has been especially ex-plored and exploited by the AVSR community [163]. Second, having access to multiple modalities might allow us to capture complementary information — something that is not visible in individual modalities on their own. Third, a multimodal system can still operate when one of the modalities is missing, for example recognizing emotions from the visual signal when the person is not speaking [50]. | 多模態融合是多模態機器學習中最原始的主題之一,以往的研究強調早期、晚期和混合融合方法[50],[247]。用技術術語來說,多模態融合是將來自多種模態的信息整合在一起的概念,目的是預測一個結果度量:通過分類得到一個類別(例如,快樂vs.悲傷),或者通過回歸得到一個連續值(例如,情緒的積極性)。這是多模態機器學習研究最多的方面之一,可追溯到25年前的工作[243]。 人們對多模態融合的興趣源于它能提供的三個主要好處。首先,使用觀察同一現象的多種模態可能會使預測更加準確。AVSR社區對此進行了特別的探索和利用[163]。其次,接觸多種模態可能會讓我們獲得互補信息——在單獨的模態中是看不到的信息。第三,當其中一種模態缺失時,多模態系統仍然可以運行,例如,當一個人不說話時,從視覺信號中識別情緒。 |
| Multimodal fusion has a very broad range of applications, including audio-visual speech recognition (AVSR) [163], multimodal emotion recognition [192], medical image analysis [89], and multimedia event detection [117]. There are a number of reviews on the subject [11], [163], [188], [247]. Most of them concentrate on multimodal fusion for a particular task, such as multimedia analysis, information retrieval or emotion recognition. In contrast, we concentrate on the machine learning approaches themselves and the technical challenges associated with these approaches. While some prior work used the term multimodal fusion to include all multimodal algorithms, in this survey paper we classify approaches in the fusion category when the multimodal integration is performed at the later prediction stages, with the goal of predicting outcome measures. In recent work, the line between multimodal representation and fusion has been blurred for models such as deep neural networks where representation learning is interlaced with classification or regression objectives. As we will describe in this section, this line is clearer for other approaches such as graphical models and kernel-based methods. We classify multimodal fusion into two main categories: model-agnostic approaches (Section 6.1) that are not directly dependent on a specific machine learning method; and model-based (Section 6.2) approaches that explicitly address fusion in their construction — such as kernel-based approaches, graphical models, and neural networks. An overview of such approaches can be seen in Table 5. | 多模態融合有非常廣泛的應用,包括視聽語音識別[163]、多模態情感識別[192]、醫學圖像分析[89]、多媒體事件檢測[117]。關于這一主題已有許多綜述[11],[163],[188],[247]。它們大多集中于針對特定任務的多模態融合,如多媒體分析、信息檢索或情感識別。相比之下,我們專注于機器學習方法本身以及與這些方法相關的技術挑戰。 雖然之前的一些工作使用術語多模態融合來包括所有的多模態算法,但在本調查論文中,當多模態集成在后期預測階段進行時,我們將方法歸類為融合類別,目的是預測結果度量。在最近的工作中,多模態表示和融合之間的界限已經模糊,例如在深度神經網絡中,表示學習與分類或回歸目標交織在一起。正如我們將在本節中描述的那樣,對于其他方法(如圖形模型和基于核的方法),這條界限則更為清晰。 我們將多模態融合分為兩大類:不直接依賴于特定機器學習方法的模型無關方法(章節6.1);和基于模型(第6.2節)的方法,這些方法在其構造中明確地處理融合——例如基于核的方法、圖形模型和神經網絡。這些方法的概述見表5。 |
| Table 5: A summary of our taxonomy of multimodal fusion approaches. OUT — output type (class — classification or reg — regression), TEMP — is temporal modeling possible. | 表5:我們對多模態融合方法的分類總結。OUT:輸出類型(class為分類,reg為回歸);TEMP:是否可進行時間建模。 |
6.1 Model-agnostic approaches與模型無關的方法
| Historically, the vast majority of multimodal fusion has been done using model-agnostic approaches [50]. Such ap-proaches can be split into early (i.e., feature-based), late (i.e., decision-based) and hybrid fusion [11]. Early fusion inte-grates features immediately after they are extracted (often by simply concatenating their representations). Late fusion on the other hand performs integration after each of the modalities has made a decision (e.g., classification or regres-sion). Finally, hybrid fusion combines outputs from early fusion and individual unimodal predictors. An advantage of model agnostic approaches is that they can be implemented using almost any unimodal classifiers or regressors. Early fusion could be seen as an initial attempt by mul-timodal researchers to perform multimodal representation learning — as it can learn to exploit the correlation and interactions between low level features of each modality. Furthermore it only requires the training of a single model, making the training pipeline easier compared to late and hybrid fusion. | 歷史上,絕大多數的多模態融合都是使用模型無關的方法[50]完成的。這樣的方法可以分為早期(即基于特征的)、后期(即基于決策的)和混合融合[11]。早期融合會在特征被提取后立即進行整合(通常是簡單地將它們的表示連接起來)。另一方面,晚期融合在每種模態做出決定(如分類或回歸)后進行整合。最后,混合融合結合早期融合和單個單模態預測的結果。模型無關方法的一個優點是,它們可以使用幾乎任何單模態分類器或回歸器來實現。 早期的融合可以被看作是多模態研究人員進行多模態表征學習的初步嘗試,因為它可以學習利用每個模態的低水平特征之間的相關性和相互作用。而且,它只需要對單個模型進行訓練,相比后期的混合融合更容易實現。 |
| In contrast, late fusion uses unimodal decision values and fuses them using a fusion mechanism such as averaging [181], voting schemes [144], weighting based on channel noise [163] and signal variance [53], or a learned model [68], [168]. It allows for the use of different models for each modality as different predictors can model each individual modality better, allowing for more flexibility. Furthermore, it makes it easier to make predictions when one or more of?the modalities is missing and even allows for training when no parallel data is available. However, late fusion ignores the low level interaction between the modalities. Hybrid fusion attempts to exploit the advantages of both of the above described methods in a common framework. It has been used successfully for multimodal speaker identifi-cation [226] and multimedia event detection (MED) [117]. | 相反,后期融合使用單模態決策值,并使用一種融合機制來融合它們,如平均[181]、投票方案[144]、基于信道噪聲[163]和信號方差[53]的加權或學習模型[68]、[168]。它允許為每個模態使用不同的模型,因為不同的預測器可以更好地為每個模態建模,從而具有更大的靈活性。此外,當一個或多個模態缺失時,它可以更容易地進行預測,甚至可以在沒有并行數據可用時進行訓練。然而,晚期融合忽略了模態之間低水平的相互作用。 混合融合嘗試在一個公共框架中利用上述兩種方法的優點。它已成功地用于多模態說話人識別[226]和多媒體事件檢測(MED)[117]。 |
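The model-agnostic recipes above can be sketched in a few lines; the toy example below (random features, logistic-regression classifiers, and equal late-fusion weights are all assumptions) contrasts early fusion by feature concatenation with late fusion by averaging per-modality probabilities:

```python
# Early fusion (feature concatenation) vs. late fusion (decision averaging)
# with off-the-shelf unimodal classifiers on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 20))     # toy acoustic features
X_visual = rng.normal(size=(200, 30))    # toy visual features
y = rng.integers(0, 2, size=200)

# Early fusion: one model over the concatenated features.
early = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_visual]), y)

# Late fusion: one model per modality, decisions merged afterwards.
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_visual, y)
late_proba = 0.5 * clf_a.predict_proba(X_audio[:5]) + 0.5 * clf_v.predict_proba(X_visual[:5])

print(early.predict(np.hstack([X_audio[:5], X_visual[:5]])))
print(late_proba.argmax(axis=1))
```

Note how the late-fusion branch still produces a prediction if one modality is dropped, which is the flexibility the text describes, while only the early-fusion branch can exploit low-level cross-modal interactions.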
6.2 Model-based approaches基于模型的方法
| While model-agnostic approaches are easy to implement using unimodal machine learning methods, they end up using techniques that are not designed to cope with mul-timodal data. In this section we describe three categories of approaches that are designed to perform multimodal fusion: kernel-based methods, graphical models, and neural networks. Multiple kernel learning (MKL) methods are an extension to kernel support vector machines (SVM) that allow for the use of different kernels for different modalities/views of the data [70]. As kernels can be seen as similarity functions be-tween data points, modality-specific kernels in MKL allows for better fusion of heterogeneous data. MKL approaches have been an especially popular method for fusing visual descriptors for object detection [31], [66] and only recently have been overtaken by deep learning methods for the task [109]. They have also seen use for multimodal affect recognition [36], [90], [182], mul-timodal sentiment analysis [162], and multimedia event detection (MED) [237]. Furthermore, McFee and Lanckriet [137] proposed to use MKL to perform musical artist simi-larity ranking from acoustic, semantic and social view data. Finally, Liu et al. [125] used MKL for multimodal fusion in Alzheimer’s disease classification. Their broad applicability demonstrates the strength of such approaches in various domains and across different modalities. | 雖然使用單模態機器學習方法很容易實現模型無關的方法,但它們最終使用的技術不是用來處理多模態數據的。在本節中,我們將描述用于執行多模態融合的三類方法:基于核的方法、圖形模型和神經網絡。 多核學習(MKL)方法是對核支持向量機(SVM)的一種擴展,它允許對數據的不同模態/視圖使用不同的核[70]。由于內核可以被視為數據點之間的相似函數,因此MKL中的特定于模態的內核可以更好地融合異構數據。 MKL方法是融合視覺描述符用于目標檢測[31]的一種特別流行的方法[66],直到最近才被用于任務的深度學習方法所取代[109]。它們也被用于多模態情感識別[36][90],[182],多模態情感分析[162],以及多媒體事件檢測(MED)[237]。此外,McFee和Lanckriet[137]提出使用MKL從聲學、語義和社會視圖數據中進行音樂藝術家相似度排序。最后,Liu等[125]將MKL用于阿爾茨海默病的多模態融合分類。它們廣泛的適用性表明了這些方法在不同領域和不同模態中的優勢。 |
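A simplified flavor of this idea is sketched below. Real MKL learns the kernel weights jointly with the SVM; here the weight `beta` is a fixed assumption, so the snippet only shows how modality-specific kernels can be combined and fed to a precomputed-kernel SVM:

```python
# Combining per-modality kernels and training an SVM on the combined kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

rng = np.random.default_rng(0)
X_vis, X_txt = rng.normal(size=(100, 50)), rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

beta = 0.6                                              # modality weight (assumed, not learned)
K = beta * rbf_kernel(X_vis) + (1 - beta) * linear_kernel(X_txt)

svm = SVC(kernel="precomputed").fit(K, y)
print(svm.predict(K[:5]))    # K[:5]: similarities of 5 samples to the training set
```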
| Besides flexibility in kernel selection, an advantage of MKL is the fact that the loss function is convex, allowing for model training using standard optimization packages and global optimum solutions [70]. Furthermore, MKL can be used to both perform regression and classification. One of the main disadvantages of MKL is the reliance on training data (support vectors) during test time, leading to slow inference and a large memory footprint. Graphical models are another family of popular methods for multimodal fusion. In this section we overview work done on multimodal fusion using shallow graphical models. A description of deep graphical models such as deep belief networks can be found in Section 3.1. The majority of graphical models can be classified into two main categories: generative — modeling joint probability; or discriminative — modeling conditional probability [200]. Some of the earliest approaches to use graphical models for multimodal fusion include generative models such as coupled [149] and factorial hidden Markov models [67] alongside dynamic Bayesian networks [64]. A more recently proposed multi-stream HMM method introduces dynamic weighting of modalities for AVSR [75]. Arguably, generative models lost popularity to discriminative ones such as conditional random fields (CRF) [115] which sacrifice the modeling of joint probability for predictive power. A CRF model was used to better segment images by combining visual and textual information of image description [60]. CRF models have been extended to model latent states using hidden conditional random fields [165] and have been applied to multimodal meeting segmentation [173]. Other multimodal uses of latent variable discriminative graphical models include multi-view hidden CRF [194] and latent variable models [193]. More recently Jiang et al. [93] have shown the benefits of multimodal hidden conditional random fields for the task of multimedia classification. While most graphical models are aimed at classification, CRF models have been extended to a continuous version for regression [164] and applied in multimodal settings [13] for audio visual emotion recognition. | 除了在核選擇上的靈活性外,MKL的一個優點是損失函數是凸的,允許使用標準優化包進行模型訓練并得到全局最優解[70]。此外,MKL可用于回歸和分類。MKL的主要缺點之一是在測試期間依賴于訓練數據(支持向量),導致推理緩慢和占用大量內存。 圖形模型是另一類流行的多模態融合方法。在本節中,我們將概述使用淺層圖形模型進行多模態融合的工作。深度圖形模型(如深度信念網絡)的描述可以在3.1節中找到。 大多數圖形模型可分為兩大類:生成式(建模聯合概率)或判別式(建模條件概率)[200]。最早將圖形模型用于多模態融合的一些方法包括生成模型,如耦合隱馬爾可夫模型[149]和階乘隱馬爾可夫模型[67]以及動態貝葉斯網絡[64]。最近提出的一種多流HMM方法提出了AVSR模態的動態加權[75]。 可以說,生成型模型被諸如條件隨機場(CRF)[115]這樣的判別型模型所取代,后者犧牲了對聯合概率的建模來提高預測能力。結合圖像描述[60]的視覺和文本信息,采用CRF模型對圖像進行更好的分割。CRF模型已被擴展到使用隱藏條件隨機場來模擬潛在狀態[165],并已被應用于多模態會議分割[173]。潛變量判別圖形模型的其他多模態應用包括多視圖隱CRF[194]和潛變量模型[193]。最近,Jiang等人[93]展示了多模態隱藏條件隨機場對多媒體分類任務的好處。雖然大多數圖形模型的目的是分類,但CRF模型已擴展到連續版本用于回歸[164],并應用于多模態設置[13]用于視聽情感識別。 |
| The benefit of graphical models is their ability to easily exploit spatial and temporal structure of the data, making them especially popular for temporal modeling tasks, such as AVSR and multimodal affect recognition. They also allow human expert knowledge to be built into the models, and often lead to interpretable models. Neural networks have been used extensively for the task of multimodal fusion [151]. The earliest examples of using neural networks for multi-modal fusion come from work on AVSR [163]. Nowadays they are being used to fuse information for visual and media question answering [63], [130], [229], gesture recognition [150], affect analysis [96], [153], and video description generation [94]. While the modalities used, architectures, and optimization techniques might differ, the general idea of fusing information in a joint hidden layer of a neural network remains the same. Neural networks have also been used for fusing temporal multimodal information through the use of RNNs and LSTMs. One of the earlier such applications used a bidirectional LSTM to perform audio-visual emotion classification [224]. More recently, Wöllmer et al. [223] used LSTM models for continuous multimodal emotion recognition, demonstrating its advantage over graphical models and SVMs. Similarly, Nicolaou et al. [152] used LSTMs for continuous emotion prediction. Their proposed method used an LSTM to fuse the results from modality-specific (audio and facial expression) LSTMs. | 圖形模型的優點是能夠輕松利用數據的空間和時間結構,這使得它們在時間建模任務(如AVSR和多模態情感識別)中特別受歡迎。它們還允許在模型中加入人類專家知識,并且通常能得到可解釋的模型。 神經網絡已被廣泛用于多模態融合的任務[151]。使用神經網絡進行多模態融合的最早例子來自于AVSR的研究[163]。如今,它們被用于融合信息,用于視覺和媒體問答[63]、[130]、[229]、手勢識別[150]、情感分析[96]、[153]和視頻描述生成[94]。雖然所使用的模態、架構和優化技術可能不同,但在神經網絡的聯合隱層中融合信息的一般思想是相同的。 神經網絡也通過使用RNN和LSTM來融合時間多模態信息。較早的此類應用之一使用雙向LSTM進行視聽情緒分類[224]。最近,Wöllmer等人[223]使用LSTM模型進行連續多模態情緒識別,證明了其優于圖形模型和支持向量機。同樣,Nicolaou等[152]使用LSTM進行連續情緒預測。他們提出的方法使用LSTM來融合來自特定模態(音頻和面部表情)LSTM的結果。 |
| Approaching modality fusion through recurrent neural networks has been used in various image captioning tasks, example models include: neural image captioning [214] where a CNN image representation is decoded using an LSTM language model, gLSTM [91] which incorporates the image data together with sentence decoding at every time step fusing the visual and sentence data in a joint repre-sentation. A more recent example is the multi-view LSTM (MV-LSTM) model proposed by Rajagopalan et al. [166]. MV-LSTM model allows for flexible fusion of modalities in the LSTM framework by explicitly modeling the modality-specific and cross-modality interactions over time. A big advantage of deep neural network approaches in data fusion is their capacity to learn from large amount of data. Secondly, recent neural architectures allow for end-to-end training of both the multimodal representation compo-nent and the fusion component. Finally, they show good performance when compared to non neural network based system and are able to learn complex decision boundaries that other approaches struggle with. The major disadvantage of neural network approaches?is their lack of interpretability. It is difficult to tell what the prediction relies on, and which modalities or features play an important role. Furthermore, neural networks require large training datasets to be successful. | 通過遞歸神經網絡實現模態融合已被用于各種圖像字幕任務,示例模型包括:神經圖像字幕[214],其中CNN圖像表示使用LSTM語言模型進行解碼,gLSTM[91]將圖像數據和每一步的句子解碼結合在一起,將視覺數據和句子數據融合在一個聯合表示中。最近的一個例子是Rajagopalan等人提出的多視圖LSTM (MV-LSTM)模型[166]。MV-LSTM模型通過顯式地建模隨時間變化的特定模態和跨模態交互,允許LSTM框架中模態的靈活融合。 深度神經網絡方法在數據融合中的一大優勢是能夠從大量數據中學習。其次,最近的神經體系結構允許端到端訓練多模態表示組件和融合組件。最后,與基于非神經網絡的系統相比,它們表現出了良好的性能,并且能夠學習其他方法難以處理的復雜決策邊界。 神經網絡方法的主要缺點是缺乏可解釋性。很難判斷預測的依據是什么,以及哪種模態或特征發揮了重要作用。此外,神經網絡需要大量的訓練數據集才能成功。 |
6.3 Discussion討論
| Multimodal fusion has been a widely researched topic with a large number of approaches proposed to tackle it, including model-agnostic methods, graphical models, multiple kernel learning, and various types of neural networks. Each approach has its own strengths and weaknesses, with some more suited for smaller datasets and others performing better in noisy environments. Most recently, neural networks have become a very popular way to tackle multimodal fusion; however, graphical models and multiple kernel learning are still being used, especially in tasks with limited training data or where model interpretability is important. | 多模態融合是一個被廣泛研究的課題,有大量的方法被提出來解決它,包括與模型無關的方法、圖形模型、多核學習和各種類型的神經網絡。每種方法都有自己的優點和缺點,一些方法更適合于較小的數據集,而另一些方法在嘈雜的環境中表現得更好。最近,神經網絡已經成為處理多模態融合的一種非常流行的方法,但圖形模型和多核學習仍在使用,特別是在訓練數據有限的任務或模型可解釋性很重要的地方。 |
| Despite these advances, multimodal fusion still faces the following challenges: 1) signals might not be temporally aligned (possibly a dense continuous signal and a sparse event); 2) it is difficult to build models that exploit supplementary and not only complementary information; 3) each modality might exhibit different types and different levels of noise at different points in time. | 盡管有這些進展,但多模態融合仍面臨以下挑戰: 1)信號可能沒有時間對齊(可能是密集的連續信號和稀疏事件); 2)很難建立不僅利用互補信息、還能利用補充性信息的模型; 3)各模態在不同時間點可能表現出不同類型和不同水平的噪聲。 |
7 Co-learning共同學習
| The final multimodal challenge in our taxonomy is co-learning — aiding the modeling of a (resource poor) modal-ity by exploiting knowledge from another (resource rich) modality. It is particularly relevant when one of the modali-ties has limited resources — lack of annotated data, noisy input, and unreliable labels. We call this challenge co-learning as most often the helper modality is used only during model training and is not used during test time. We identify three types of co-learning approaches based on their training resources: parallel, non-parallel, and hybrid. Parallel-data approaches require training datasets where the observations from one modality are directly linked to the ob-servations from other modalities. In other words, when the multimodal observations are from the same instances, such as in an audio-visual speech dataset where the video and speech samples are from the same speaker. In contrast, non-parallel data approaches do not require direct links between observations from different modalities. These approaches usually achieve co-learning by using overlap in terms of categories. For example, in zero shot learning when the con-ventional visual object recognition dataset is expanded with a second text-only dataset from Wikipedia to improve the generalization of visual object recognition. In the hybrid data setting the modalities are bridged through a shared modality or a dataset. An overview of methods in co-learning can be seen in Table 6 and summary of data parallelism in Figure 3. | 我們分類法中的最后一個多模態挑戰是共同學習——通過從另一個(資源豐富的)模態中獲取知識來幫助(資源貧乏的)模態建模。當其中一種模態的資源有限時——缺乏注釋的數據、嘈雜的輸入和不可靠的標簽——這一點尤其重要。我們稱這種挑戰為共同學習,因為大多數情況下,助手模態只在模型訓練中使用,而在測試期間不使用。我們根據他們的培訓資源確定了三種類型的共同學習方法:并行、非并行和混合。平行數據方法需要訓練數據集,其中一個模態的觀察結果與其他模態的觀察結果直接相連。換句話說,當多模態觀察來自相同的實例時,例如在一個視聽語音數據集中,視頻和語音樣本來自同一個說話者。相反,非平行數據方法不需要不同模態的觀察結果之間的直接聯系。這些方法通常通過使用類別上的重疊來實現共同學習。例如,在零鏡頭學習時,將傳統的視覺對象識別數據集擴展為維基百科的第二個純文本數據集,以提高視覺對象識別的泛化。在混合數據設置中,模態通過共享的模態或數據集進行連接。在表6中可以看到共同學習方法的概述,在圖3中可以看到數據并行性的總結。 |
7.1 Parallel data并行數據
| In parallel data co-learning both modalities share a set of in-stances — audio recordings with the corresponding videos, images and their sentence descriptions. This allows for two types of algorithms to exploit that data to better model the modalities: co-training and representation learning. Co-training is the process of creating more labeled training samples when we have few labeled samples in a multimodal problem [21]. The basic algorithm builds weak classifiers in each modality to bootstrap each other with labels for the unlabeled data. It has been shown to discover more training samples for web-page classification based on the web-page itself and hyper-links leading in the seminal work of Blum and Mitchell [21]. By definition this task requires parallel data as it relies on the overlap of multimodal samples. | 在并行數據協同學習中,兩種模態都共享一組實例—音頻記錄與相應的視頻、圖像及其句子描述。這就允許了兩種類型的算法來利用這些數據來更好地為模態建模:協同訓練和表示學習。 協同訓練是在多模態問題[21]中有少量標記樣本的情況下,創建更多標記訓練樣本的過程。基本算法在每個模態中構建弱分類器,對未標記的數據進行標簽引導。Blum和Mitchell[21]的開創性工作表明,基于網頁本身和超鏈接,可以發現更多的訓練樣本用于網頁分類。根據定義,該任務需要并行數據,因為它依賴于多模態樣本的重疊。 |
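A bare-bones version of this bootstrapping loop is sketched below with synthetic data; the confidence threshold, the naive Bayes base learners, and the number of rounds are illustrative assumptions rather than the settings of [21]:

```python
# Toy co-training loop: two "views" of the same instances pseudo-label
# unlabeled data for each other, growing the labeled pool.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(500, 10)), rng.normal(size=(500, 8))   # two parallel views
y = np.full(500, -1)                      # -1 marks an unlabeled instance
y[:20] = rng.integers(0, 2, size=20)      # a handful of labeled seed instances

for _ in range(5):
    for X in (X1, X2):                    # each view bootstraps labels for the other
        labeled = np.flatnonzero(y >= 0)
        unlabeled = np.flatnonzero(y < 0)
        if len(unlabeled) == 0:
            break
        clf = GaussianNB().fit(X[labeled], y[labeled])
        proba = clf.predict_proba(X[unlabeled])
        confident = unlabeled[proba.max(axis=1) > 0.95]   # only keep confident predictions
        y[confident] = clf.predict(X[confident])          # pseudo-label them

print((y >= 0).sum(), "instances labeled after co-training")
```

The overfitting risk mentioned below follows directly from this design: once a biased pseudo-label enters the pool, both views keep training on it.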
| Figure 3: Types of data parallelism used in co-learning: parallel — modalities are from the same dataset and there is a direct correspondence between instances; non-parallel — modalities are from different datasets and do not have overlapping instances, but overlap in general categories or concepts; hybrid — the instances or concepts are bridged by a third modality or a dataset. | 圖3:在共同學習中使用的數據并行類型:并行—模態來自相同的數據集,并且實例之間有直接對應關系;非平行—模態來自不同的數據集,沒有重疊的實例,但在一般類別或概念上有重疊;混合—實例或概念由第三種模態或數據集連接起來。 |
| Co-training has been used for statistical parsing [178] to build better visual detectors [120] and for audio-visual speech recognition [40]. It has also been extended to deal with disagreement between modalities, by filtering out unreliable samples [41]. While co-training is a powerful method for generating more labeled data, it can also lead to biased training samples resulting in overfitting. Transfer learning is another way to exploit co-learning with parallel data. Multimodal representation learning (Section 3.1) approaches such as multimodal deep Boltzmann ma-chines [198] and multimodal autoencoders [151] transfer information from representation of one modality to that of another. This not only leads to multimodal representations, but also to better unimodal ones, with only one modality being used during test time [151] . Moon et al. [143] show how to transfer information from a speech recognition neural network (based on audio) to a lip-reading one (based on images), leading to a better visual representation, and a model that can be used for lip-reading without need for audio information during test time. Similarly, Arora and Livescu [10] build better acoustic features using CCA on acoustic and articulatory (location of lips, tongue and jaw) data. They use articulatory data only during CCA construction and use only the resulting acoustic (unimodal) representation during test time. | 協同訓練已被用于統計解析[178],以構建更好的視覺檢測器[120],并用于視聽語音識別[40]。通過過濾掉不可靠的樣本[41],它還被擴展到處理不同模態之間的分歧。雖然協同訓練是一種生成更多標記數據的強大方法,但它也會導致有偏差的訓練樣本導致過擬合。遷移學習是利用并行數據進行共同學習的另一種方法。多模態表示學習(3.1節)方法,如多模態深度玻爾茲曼機[198]和多模態自編碼器[151],將信息從一個模態的表示傳遞到另一個模態的表示。這不僅導致了多模態表示,而且還導致了更好的單模態表示,在測試期間只使用了一個模態[151]。 Moon等人[143]展示了如何將信息從語音識別神經網絡(基于音頻)傳輸到唇讀神經網絡(基于圖像),從而獲得更好的視覺表示,以及在測試期間可以用于唇讀而不需要音頻信息的模型。類似地,Arora和Livescu[10]利用聲學和發音(嘴唇、舌頭和下巴的位置)數據的CCA構建了更好的聲學特征。他們僅在CCA構建期間使用發音數據,并在測試期間僅使用產生的聲學(單模態)表示。 |
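The Arora and Livescu style of parallel-data co-learning can be illustrated with scikit-learn's CCA: the helper modality (simulated articulatory features here) is needed only when fitting the projection, and only the acoustic projection is used afterwards. All data below is synthetic:

```python
# CCA-based transfer: fit on parallel acoustic/articulatory data,
# keep only the acoustic projection for test time.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(300, 40))                                 # MFCC-like features
articulatory = acoustic[:, :15] + 0.1 * rng.normal(size=(300, 15))    # correlated helper modality

cca = CCA(n_components=10).fit(acoustic, articulatory)   # needs parallel training data

acoustic_test = rng.normal(size=(5, 40))
projected = cca.transform(acoustic_test)                 # unimodal use at test time
print(projected.shape)                                   # (5, 10)
```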
7.2 Non-parallel data非并行數據
| Methods that rely on non-parallel data do not require the modalities to have shared instances, but only shared categories or concepts. Non-parallel co-learning approaches can help when learning representations, allow for better semantic concept understanding, and even perform unseen object recognition. Table 6: A summary of co-learning taxonomy, based on data parallelism. Parallel data — multiple modalities can see the same instance. Non-parallel data — unimodal instances are independent of each other. Hybrid data — the modalities are pivoted through a shared modality or dataset. | 依賴于非并行數據的方法不需要模態擁有共享的實例,而只需要共享的類別或概念。非平行的共同學習方法可以幫助學習表征,允許更好的語義概念理解,甚至執行未見過的對象識別。 表6:基于數據并行性的協同學習分類概述。并行數據-多種模態可以看到相同的實例。非并行數據——單模態實例彼此獨立。混合數據—模態通過共享的模態或數據集進行橋接。 |
| Transfer learning is also possible on non-parallel data and allows learning better representations through transferring information from a representation built using a data-rich or clean modality to a data-scarce or noisy modality. This type of transfer learning is often achieved by using coordinated multimodal representations (see Section 3.2). For example, Frome et al. [61] used text to improve visual representations for image classification by coordinating CNN visual features with word2vec textual ones [141] trained on separate large datasets. Visual representations trained in such a way result in more meaningful errors — mistaking objects for ones of similar category [61]. Mahasseni and Todorovic [129] demonstrated how to regularize a color video based LSTM using an autoencoder LSTM trained on 3D skeleton data by enforcing similarities between their hidden states. Such an approach is able to improve the original LSTM and lead to state-of-the-art performance in action recognition. Conceptual grounding refers to learning semantic meanings or concepts not purely based on language but also on additional modalities such as vision, sound, or even smell [16]. While the majority of concept learning approaches are purely language-based, representations of meaning in humans are not merely a product of our linguistic exposure, but are also grounded through our sensorimotor experience and perceptual system [17], [126]. Human semantic knowledge relies heavily on perceptual information [126] and many concepts are grounded in the perceptual system and are not purely symbolic [17]. This implies that learning semantic meaning purely from textual information might not be optimal, and motivates the use of visual or acoustic cues to ground our linguistic representations. | 在非并行數據上也可以進行遷移學習,通過將信息從使用數據豐富或干凈的模態構建的表示轉移到數據缺乏或有噪聲的模態,可以學習更好的表示。這種類型的遷移學習通常是通過使用協調的多模態表示來實現的(見第3.2節)。例如,Frome等人[61]通過將CNN視覺特征與在單獨的大數據集上訓練的word2vec文本特征[141]相協調,使用文本來改善圖像分類的視覺表示。以這種方式訓練的視覺表征會導致更有意義的錯誤——將物體誤認為類似類別的物體[61]。Mahasseni和Todorovic[129]演示了如何使用在3D骨架數據上訓練的自動編碼器LSTM來正則化基于LSTM的彩色視頻模型,方法是增強隱藏狀態之間的相似性。這種方法可以改進原有的LSTM,在動作識別方面達到最先進的性能。概念落地(conceptual grounding)是指學習語義的意義或概念時,不單純基于語言,也基于其他模態,如視覺、聲音,甚至嗅覺[16]。雖然大多數概念學習方法是純粹基于語言的,但人類意義的表征并不僅僅是語言接觸的產物,而且還基于我們的感覺運動經驗和感知系統[17],[126]。人類的語義知識嚴重依賴于感知信息[126],許多概念都建立在感知系統的基礎上,而不是純粹的符號[17]。這意味著,單純從文本信息中學習語義意義可能不是最理想的,這促使我們使用視覺或聽覺線索來落地我們的語言表征。 |
| Starting from work by Feng and Lapata [59], grounding is usually performed by finding a common latent space between the representations [59], [183] (in case of parallel datasets) or by learning unimodal representations separately and then concatenating them to lead to a multimodal one [29], [101], [172], [181] (in case of non-parallel data). Once a multimodal representation is constructed it can be used on purely linguistic tasks. Shutova et al. [181] and Bruni et al. [29] used grounded representations for better classification of metaphors and literal language. Such representations have also been useful for measuring conceptual similarity and relatedness — identifying how semantically or conceptually related two words are [30], [101], [183] or actions [172]. Furthermore, concepts can be grounded not only using visual signals, but also acoustic ones, leading to better performance especially on words with auditory associations [103], or even olfactory signals [102] for words with smell associations. Finally, there is a lot of overlap between multimodal alignment and conceptual grounding, as aligning visual scenes to their descriptions leads to better textual or visual representations [108], [161], [172], [240]. Conceptual grounding has been found to be an effective way to improve performance on a number of tasks. It also shows that language and vision (or audio) are complementary sources of information and combining them in multimodal models often improves performance. However, one has to be careful as grounding does not always lead to better performance [102], [103], and only makes sense when grounding has relevance for the task — such as grounding using images for visually-related concepts. | 從Feng和Lapata[59]的工作開始,概念落地通常通過在表示之間尋找共同的潛在空間來實現[59],[183](在并行數據集的情況下),或者分別學習單模態表示再將它們拼接成多模態表示[29],[101],[172],[181](在非并行數據的情況下)。一旦構建了多模態表示,它就可以用于純語言任務。Shutova等人[181]和Bruni等人[29]使用落地的表征來更好地分類隱喻和字面語言。這樣的表征在度量概念相似度和相關性方面也很有用——識別兩個詞[30]、[101]、[183]或動作[172]在語義上或概念上的關聯程度。此外,概念落地不僅可以利用視覺信號,也可以利用聽覺信號,這對具有聽覺聯想的詞尤其有效[103];對具有嗅覺聯想的詞,甚至可以利用嗅覺信號[102]。最后,多模態對齊和概念落地之間有很多重疊,因為將視覺場景與其描述對齊會帶來更好的文本或視覺表征[108]、[161]、[172]、[240]。 概念落地已被發現是提高許多任務性能的有效方法。它還表明,語言和視覺(或音頻)是互補的信息來源,在多模態模型中結合它們通常可以提高性能。然而必須注意,概念落地并不總能帶來更好的性能[102],[103],只有當落地與任務相關時才有意義,例如用圖像來落地與視覺相關的概念。 |
| Zero shot learning (ZSL) refers to recognizing a concept without having explicitly seen any examples of it. For example, classifying a cat in an image without ever having seen (labeled) images of cats. This is an important problem to address as in a number of tasks such as visual object classification: it is prohibitively expensive to provide training examples for every imaginable object of interest. There are two main types of ZSL — unimodal and multimodal. The unimodal ZSL looks at component parts or attributes of the object, such as phonemes to recognize an unheard word or visual attributes such as color, size, and shape to predict an unseen visual class [55]. The multimodal ZSL recognizes the objects in the primary modality through the help of the secondary one — in which the object has been seen. The multimodal version of ZSL is a problem facing non-parallel data by definition as the overlap of seen classes is different between the modalities. Socher et al. [190] map image features to a conceptual word space and are able to classify between seen and unseen concepts. The unseen concepts can then be assigned to a word that is close to the visual representation — this is enabled by the semantic space being trained on a separate dataset that has seen more concepts. Instead of learning a mapping from visual to concept space, Frome et al. [61] learn a coordinated multimodal representation between concepts and images that allows for ZSL. Palatucci et al. [158] perform prediction of words people are thinking of based on functional magnetic resonance images; they show how it is possible to predict unseen words through the use of an intermediate semantic space. Lazaridou et al. [118] present a fast mapping method for ZSL by mapping extracted visual feature vectors to text-based vectors through a neural network. | 零樣本學習(ZSL)指的是在沒有明確看到任何例子的情況下識別一個概念。例如,在從未見過(有標記的)貓的圖像的情況下,將圖像中的一只貓分類出來。這是一個需要解決的重要問題,因為在許多任務中(如視覺對象分類),為每一個可以想象到的感興趣的對象提供訓練示例的代價是極其高昂的。 ZSL主要有兩種類型——單模態和多模態。單模態ZSL查看對象的組件部分或屬性,如利用音素來識別未聽過的單詞,或利用視覺屬性(如顏色、大小和形狀)來預測不可見的視覺類[55]。多模態ZSL通過輔助模態的幫助識別主要模態的物體——在輔助模態中,物體已經被看到。根據定義,ZSL的多模態版本是面對非并行數據的一個問題,因為所看到的類的重疊在模態之間是不同的。 Socher等人[190]將圖像特征映射到概念詞空間,并能夠在可見和不可見概念之間進行分類。然后,看不見的概念可以分配給一個接近于視覺表示的單詞——這是通過在一個單獨的數據集上訓練語義空間實現的,該數據集已經看到了更多的概念。Frome等人[61]沒有學習從視覺空間到概念空間的映射,而是學習概念和圖像之間的協調多模態表示,從而實現ZSL。Palatucci等人[158]基于功能性磁共振圖像預測人們正在思考的單詞,他們展示了如何通過使用中間語義空間來預測未見的單詞。Lazaridou等人[118]提出了一種ZSL的快速映射方法,通過神經網絡將提取的視覺特征向量映射到基于文本的向量。 |
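A toy version of the mapping-based multimodal ZSL recipe is sketched below (random image features and word vectors stand in for real CNN features and word2vec embeddings, and "zebra" is a hypothetical unseen class): a linear map from image features to the word space is fit on seen classes, and an unseen class can then be predicted by nearest word vector.

```python
# Zero-shot recognition by mapping image features into a word-embedding space.
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)
word_vecs = {c: rng.normal(size=50) for c in ["dog", "cat", "horse", "zebra"]}  # "zebra" unseen

# Seen-class training pairs: image feature -> word vector of its label.
X_train = rng.normal(size=(90, 512))
y_train = np.stack([word_vecs[c] for c in rng.choice(["dog", "cat", "horse"], size=90)])
W, *_ = lstsq(X_train, y_train, rcond=None)              # linear map: features -> word space

def zero_shot_predict(image_feat, candidates=("dog", "cat", "horse", "zebra")):
    projected = image_feat @ W
    sims = {c: projected @ word_vecs[c] /
               (np.linalg.norm(projected) * np.linalg.norm(word_vecs[c]))
            for c in candidates}
    return max(sims, key=sims.get)       # may return "zebra" without any zebra training images

print(zero_shot_predict(rng.normal(size=512)))
```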
7.3 Hybrid data混合數據
| In the hybrid data setting, two non-parallel modalities are bridged by a shared modality or a dataset (see Figure 3c). The most notable example is the Bridge Correlational Neural Network [167], which uses a pivot modality to learn coordinated multimodal representations in the presence of non-parallel data. For example, in the case of multilingual image captioning, the image modality would always be paired with at least one caption in any language. Such methods have also been used to bridge languages that might not have parallel corpora but have access to a shared pivot language, such as for machine translation [148], [167] and document transliteration [100]. | 在混合數據設置中,兩個非并行模態由一個共享的模態或數據集連接起來(見圖3c)。最著名的例子是橋接相關神經網絡(Bridge Correlational Neural Network)[167],它使用一個樞軸模態來學習非并行數據下的協調多模態表示。例如,在多語言圖像字幕的情況下,圖像模態總是與任何語言的至少一個字幕配對。這些方法也被用于連接那些可能沒有并行語料庫但可以訪問共享樞軸語言的語言,如機器翻譯[148]、[167]和文檔音譯[100]。 |
| Instead of using a separate modality for bridging, some methods rely on existence of large datasets from a similar or related task to lead to better performance in a task that only contains limited annotated data. Socher and Fei-Fei [189] use the existence of large text corpora in order to guide image segmentation. While Hendricks et al. [78] use separately trained visual model and a language model to lead to a better image and video description system, for which only limited data is available. | 一些方法依賴于來自類似或相關任務的大型數據集,而不是使用單獨的模態進行橋接,從而在只包含有限注釋數據的任務中獲得更好的性能。Socher和feifei[189]利用存在的大型文本語料庫來指導圖像分割。而Hendricks等人[78]則分別使用訓練過的視覺模型和語言模型來得到更好的圖像和視頻描述系統,但數據有限。 |
7.4 Discussion討論
| Multimodal co-learning allows for one modality to influ-ence the training of another, exploiting the complementary information across modalities. It is important to note that co-learning is task independent and could be used to cre-ate better fusion, translation, and alignment models. This challenge is exemplified by algorithms such as co-training, multimodal representation learning, conceptual grounding, and zero shot learning (ZSL) and has found many applica-tions in visual classification, action recognition, audio-visual speech recognition, and semantic similarity estimation. | 多模態共同學習允許一種模態影響另一種模態的訓練,利用不同模態之間的互補信息。值得注意的是,共同學習是獨立于任務的,可以用來創建更好的融合、翻譯和對齊模型。協同訓練、多模態表示學習、概念基礎和零樣本學習(ZSL)等算法都體現了這一挑戰,并在視覺分類、動作識別、視聽語音識別和語義相似度估計等方面得到了許多應用。 |
8 Conclusion結論
| As part of this survey, we introduced a taxonomy of multi-modal machine learning: representation, translation, fusion, alignment, and co-learning. Some of them such as fusion have been studied for a long time, but more recent interest in representation and translation have led to a large number of new multimodal algorithms and exciting multimodal applications. We believe that our taxonomy will help to catalog future research papers and also better understand the remaining unresolved problems facing multimodal machine learning. | 作為調查的一部分,我們介紹了多模態機器學習的分類:表示、翻譯、對齊、融合和共同學習。 其中一些如融合已經被研究了很長時間,但最近對表示、翻譯的興趣導致了大量新的多模態算法和令人興奮的多模態應用。 我們相信我們的分類法將有助于對未來的研究論文進行分類,并更好地理解多模態機器學習面臨的剩余未解決問題。 |