A “Visual Turing Test” for Modern AI Systems
Visual Question Answering (VQA) is a fascinating research field at the intersection of computer vision and language understanding.
In this post we will look at existing datasets, examine potential approaches and applications, and present a prototype in which the user can choose images the algorithm has not seen before and ask questions about them.
What is VQA?
Visual Question Answering approaches are designed to handle the following tasks: Given an image and a natural language question about the image, the VQA model needs to provide an accurate natural language answer.
This is by nature a multi-disciplinary research problem. It consists of the following sub-tasks:
· Computer Vision (CV)
· Natural Language Processing (NLP)
· Knowledge Representation & Reasoning
That’s why some authors refer to Visual Question Answering as “Visual Turing Test” for modern AI systems.
This screenshot from my prototype illustrates how a VQA system works. Note that the user has chosen an image the algorithm has not seen during training and asks questions about it.
Prototype screenshot

Datasets
Most of the existing datasets contain triples made of an image, a question and its correct answer. Some publicly available datasets, on the other hand, provide extra information like image captions, image regions represented as bounding boxes, or multiple-choice candidate answers.
The available VQA datasets can be categorized based on three factors:
· type of images (natural, clip-art, synthetic)
· question–answer format (open-ended, multiple-choice)
· use of external knowledge
The following table shows an overview of the available datasets:
Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, Artificial Intelligence Review (2020)

For our prototype we make use of the VQA dataset with natural images and open-ended questions. It is one of the most popular datasets and is also used for the annual VQA competition. The dataset we use consists of 443,757 image-question pairs for training and 214,354 pairs for validation. It can be downloaded here.
Example of an annotated image-question pair

One special characteristic of the VQA dataset is that the annotations, i.e. the answers provided for a specific image-question pair, are not unique. The answers were collected via Amazon Mechanical Turk, and for each image-question pair ten answers are supplied, which may all be identical or may differ. The screenshot on the left shows an example.
Approaches & Architectures
The basic architecture as shown below consists of three main elements:
· Image feature extraction
· Question feature extraction
· Fusion model + classifier to merge the features
Source: https://arxiv.org/abs/1610.01465

Image feature extraction
Image feature extraction describes the method of transforming an image into a numerical vector to enable further computational processing.
Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, Artificial Intelligence Review (2020)

Convolutional neural networks (CNNs) have established themselves as the state-of-the-art approach. VQA architectures generally use already pre-trained CNN models by applying transfer learning. The chart shows an evaluation of the utilization rates of different architectures in several VQA research papers.
In the prototype we use the VGG16 architecture, which takes 224 × 224 pixel images as input and outputs a 4096-dimensional vector.
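The resizing and normalization implied by this 224 × 224 input can be sketched in plain NumPy. The prototype's exact preprocessing is not shown here, so treat the nearest-neighbour resize and the ImageNet channel means below as illustrative assumptions rather than the implementation:

```python
import numpy as np

# VGG16 expects 224 x 224 BGR images with the ImageNet channel means subtracted.
VGG_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def preprocess_for_vgg16(img):
    """Nearest-neighbour resize to 224 x 224, RGB -> BGR, subtract channel means."""
    h, w, _ = img.shape
    rows = np.arange(224) * h // 224   # source row for each target row
    cols = np.arange(224) * w // 224   # source column for each target column
    resized = img[rows][:, cols].astype(np.float32)
    return resized[..., ::-1] - VGG_MEAN_BGR   # RGB -> BGR, mean-centred
```

In the full pipeline, the preprocessed image would then be passed through a pre-trained VGG16 whose second fully-connected layer produces the 4096-dimensional feature vector.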
Question feature extraction
Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, Artificial Intelligence Review (2020)

To extract question features, multiple approaches have been developed, ranging from count-based methods like one-hot encoding and bag-of-words to text embedding methods like long short-term memory (LSTM) or gated recurrent units (GRU). The diagram below illustrates the utilization rate of these approaches in the research.
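As a minimal illustration of the count-based end of this spectrum, a bag-of-words encoder can be sketched as follows; the vocabulary and tokenization here are simplified assumptions, not the prototype's actual preprocessing:

```python
import numpy as np

def bag_of_words(question, vocab):
    """Count-based question vector with one slot per vocabulary word."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros(len(vocab), dtype=np.float32)
    for token in question.lower().split():
        token = token.strip("?.,!")        # drop trailing punctuation
        if token in index:
            counts[index[token]] += 1.0
    return counts
```

Unlike an LSTM, such a vector ignores word order, which is one reason embedding-based encoders dominate in the research.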
For our prototype we use the most popular approach, an LSTM fed with Word2Vec representations of the individual words. The LSTM model outputs a 512-dimensional vector.
Fusion model + classifier
To fuse the two feature vectors, several basic approaches exist, including point-wise multiplication or addition and concatenation. More advanced architectures use Canonical Correlation Analysis (CCA) or end-to-end models with a Multimodal Compact Bilinear Pooling (MCB) layer.
Coverage of questions by most frequent answers

In our prototype we use simple concatenation followed by a softmax classifier over the 1,000 most common answers. This approach is suitable as more than 95% of the questions contain at least one annotation that is covered by the 1,000 most common answers (see graph on the left).
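A toy sketch of this concatenation-plus-softmax step, with random vectors standing in for the actual VGG16 and LSTM outputs and an untrained random weight matrix in place of the learned classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

# Stand-ins for the two branch outputs (shapes as in the prototype).
image_features = rng.standard_normal(4096).astype(np.float32)    # VGG16 branch
question_features = rng.standard_normal(512).astype(np.float32)  # LSTM branch

fused = np.concatenate([image_features, question_features])      # 4608-d vector
W = (0.01 * rng.standard_normal((1000, fused.size))).astype(np.float32)
answer_probs = softmax(W @ fused)    # distribution over the 1,000 answers
predicted_answer_id = int(np.argmax(answer_probs))
```

In training, the classifier weights would of course be learned jointly with the rest of the network rather than drawn at random.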
More advanced approaches
In the recent past, more sophisticated architectures have been developed, with attention-based approaches being the most popular. Here, the idea is to set the focus of the algorithm on the most relevant parts of the input. For example, if the question is “What is the color of the ball?”, the region of the image containing the ball is more relevant than the others. Within the question, “color” and “ball” are more informative than the rest of the words.
The most common choice in VQA is to use spatial attention to generate region specific features to train the Convolutional Neural Network.
Two common methods to obtain spatial attention are to project a grid over the image and determine the relevance of each region based on the specific question, or to automatically generate bounding boxes in the image and use the question to determine the relevance of the features in each box.
The use of an attention-based approach goes beyond the scope of our prototype.
Evaluation
Due to the variety of datasets, it is not surprising that multiple approaches to evaluating the performance of the algorithms exist. In a multiple-choice setting there is just a single right answer for every question, so the assessment can easily be quantified by the mean accuracy over test questions. In an open-ended setting, though, several answers to a particular question could be correct due to synonyms and paraphrasing.
In such cases, metrics can be used that measure how much a predicted answer differs from the ground truth in terms of semantic meaning. The Wu-Palmer Similarity (WUPS) is one such example.
As the VQA dataset works with very short answers, a consensus metric defined as Accuracy_VQA = min(n/3, 1) is used, where n is the number of the ten annotated answers that match the predicted answer. In other words, 100% accuracy is achieved when the predicted answer matches at least 3 of the 10 annotated answers.
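This consensus metric is easy to state in code. The sketch below assumes simple lowercase string matching; the official evaluation additionally normalizes answers (punctuation, articles, number words) before comparing:

```python
def vqa_accuracy(predicted, annotated_answers):
    """Consensus VQA accuracy: min(n/3, 1), n = number of matching annotations."""
    norm = lambda s: s.strip().lower()
    n = sum(1 for answer in annotated_answers if norm(answer) == norm(predicted))
    return min(n / 3.0, 1.0)
```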
The diagram shows the accuracy as defined above for the different question types:
Evaluation results on validation set

Potential applications of VQA
VQA systems offer a vast number of potential applications. One of the most socially relevant and direct applications is helping blind and visually impaired users to communicate with pictures. Furthermore, VQA can be integrated into image retrieval systems, which can be used commercially on e-commerce sites to attract customers by returning more exact results for their search queries. Incorporating VQA may also increase the popularity of online educational services by allowing learners to interact with images. Another application of VQA is in the field of data analysis, where VQA can help the analyst summarize the available visual data.
Closing thoughts
VQA is a research field that requires the understanding of both text and vision. The current performance of the systems is still lagging behind human decisions, but since deep learning techniques are significantly improving both in Natural Language Processing and Computer Vision, we can reasonably expect VQA to achieve higher and higher accuracy. Progress will be further driven by contests like the VQA challenge hosted on visualqa.org.
If you would like to dive deeper into this topic, you can find the code of the prototype in my GitHub repo here. Any feedback on the approach or the code is highly appreciated.
Further recommended readings include:
· Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, Artificial Intelligence Review (2020)
· VQA: Visual Question Answering, Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh
Translated from: https://medium.com/@frank.merwerth/a-visual-turing-test-for-modern-ai-systems-de7530416e57