[Paper Notes] 2019 - Partners in Crime: Utilizing Arousal-Valence Relationship for Continuous Prediction of Valence in Movies
- Paper Overview
- Paper Content
- Abstract
- 1 Introduction
- 2 Dataset and Features
- 3 Prediction Model
Paper Overview
Original paper: Partners in Crime: Utilizing Arousal-Valence Relationship for Continuous Prediction of Valence in Movies1
Using the arousal-valence relationship for continuous prediction of valence in movies.
What follows are only my notes from reading the paper. My knowledge is limited, so corrections of any errors are welcome.
Paper Content
Abstract
- Background: "The arousal-valence model is often used in characterizing human emotions."
- "Arousal is defined as the intensity of emotion, while valence is defined as the polarity of emotion."
- Application: "Continuous prediction of valence in entertainment media such as movies is important for applications such as ad placement and personalized recommendations."
- Problem: "While arousal can be effectively predicted using audio-visual information in movies, valence is reported to be more difficult to predict as it also involves understanding the semantics of the movie."
- Rationale: "In this paper, for improving valence prediction, we utilize the insight from psychology that valence and arousal are interrelated."
- Method: "We use Long Short Term Memory networks (LSTMs) to model the temporal context in movies using standard audio features as input."
- "We incorporate arousal-valence interdependence in two ways:"
    - as a joint loss function to optimize the prediction network;
    - as a geometric constraint simulating the distribution of arousal-valence observed in psychology literature.
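The two couplings can be sketched as extra loss terms. Below is a minimal sketch, assuming a network that outputs one (arousal, valence) pair per time step; the parabola coefficient `c` and weight `lam` are hypothetical choices for illustration, not values from the paper:

```python
# Sketch of a joint arousal-valence loss (assumed form, not the paper's exact one):
# joint MSE over both dimensions, plus a penalty for predictions that fall below
# an assumed parabolic envelope a >= c * v^2 - 1 seen in arousal-valence scatter plots.

def joint_av_loss(pred, target, c=2.0, lam=0.1):
    """pred/target: lists of (arousal, valence) pairs in [-1, 1]."""
    n = len(pred)
    # joint loss: mean squared error summed over arousal and valence
    mse = sum((pa - ta) ** 2 + (pv - tv) ** 2
              for (pa, pv), (ta, tv) in zip(pred, target)) / n
    # geometric constraint: penalize points lying below the parabola a = c*v^2 - 1
    geo = sum(max(0.0, (c * pv ** 2 - 1.0) - pa) ** 2 for pa, pv in pred) / n
    return mse + lam * geo

# A point inside the assumed parabolic envelope incurs no geometric penalty:
loss_in = joint_av_loss([(0.9, 0.0)], [(0.9, 0.0)])    # high arousal, neutral valence
loss_out = joint_av_loss([(-1.0, 1.0)], [(-1.0, 1.0)])  # low arousal, extreme valence
```

The geometric term only activates for predictions outside the envelope, which matches the idea of constraining outputs to the region where annotations are actually observed.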
- "Using a joint arousal-valence model, we predict continuous valence for a dataset containing Academy Award winning movies."
- Results: "We report a significant improvement over the state-of-the-art results, with an improved Pearson correlation of 0.69 between the annotation and prediction using the joint model, as compared to a baseline prediction of 0.49 using an independent valence model."
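Pearson correlation, the metric reported above, measures linear agreement between the annotated and the predicted valence series. A stdlib-only sketch:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear, positive correlation
print(pearson([1, 2, 3, 4], [4, 3, 2, 1]))   # perfectly linear, negative correlation
```

A value of 0.69 between annotation and prediction therefore indicates a fairly strong linear agreement, compared with 0.49 for the independent valence baseline.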
1 Introduction
- Entertainment media such as movies evoke a range of emotions in viewers, and these emotions vary over time along two dimensions: intensity and polarity.
- Emotional changes are often tied to cinematographic techniques, e.g. music intensity, speech intensity, shot framing, composition, and character movements.
- Static factors such as color tones and ambient sound also influence the emotional polarity of a scene.
- Emotion prediction in movies has broad applications, for example:
    - ad placement: Yadati, K., Katti, H., Kankanhalli, M.: CAVVA: Computational affective video-in-video advertising. IEEE Transactions on Multimedia 16(1), 15–23 (2014)
    - content recommendation: Canini, L., Benini, S., Leonardi, R.: Affective recommendation of movies based on selected connotative features. IEEE Transactions on Circuits and Systems for Video Technology 23(4), 636–647 (2013)
    - content indexing: Zhang, S., Huang, Q., Jiang, S., Gao, W., Tian, Q.: Affective visualization and retrieval for music video. IEEE Transactions on Multimedia 12(6), 510–522 (2010)
- The paper maps movie-induced emotion into the arousal-valence space, where arousal represents emotional intensity and valence represents emotional polarity (positive, negative, or neutral). As shown in the figure below, the emotions evoked by different scenes map to corresponding positions in this 2D space and collectively exhibit a parabolic contour.
- The task is challenging because movies dynamically fuse auditory, visual, and textual (semantic) modalities. Related work includes:
    - kernel methods and deep learning to predict VA values for 30 short films: Baveye, Y., Chamaret, C., Dellandréa, E., Chen, L.: Affective video content analysis: A multidisciplinary insight. IEEE Transactions on Affective Computing (2017)
    - hand-crafted audio-visual features to predict VA values for 30-minute clips from 12 Academy Award winning movies: Malandrakis, N., Potamianos, A., Evangelopoulos, G., Zlatintsi, A.: A supervised approach to movie emotion tracking. In: ICASSP 2011, pp. 2376–2379. IEEE (2011)
    - a Mixture-of-Experts (MoE) model to improve the fusion of audio-visual models: Goyal, A., Kumar, N., Guha, T., Narayanan, S.S.: A multimodal mixture-of-experts model for dynamic emotion prediction in movies. In: ICASSP 2016, pp. 2822–2826. IEEE (2016)
    - Long Short Term Memory networks (LSTMs) to capture audio-visual context: Sivaprasad, S., Joshi, T., Agrawal, R., Pedanekar, N.: Multimodal continuous prediction of emotions in movies using long short-term memory networks. In: ICMR 2018, pp. 413–419. ACM (2018)
- Valence is observed to be harder to predict than arousal, because valence prediction requires more high-level semantic information. For example, a fight scene should carry a negative connotation, yet becomes positive if the protagonist wins; a bright garden scene should be positive, yet its dialogue may lean negative.
- All of the work above models valence and arousal separately. The following work instead proposes modeling them jointly, using an LSTM to predict VA values on 200 short videos of 5–30 s: Zhang, L., Zhang, J.: Synchronous prediction of arousal and valence using LSTM network for affective video content analysis. In: ICNC-FSKD 2017, pp. 727–732. IEEE (2017)
- This paper uses the COGNIMUSE dataset, since it regards emotion prediction in movies as necessary for practical applications. In this dataset the valence and arousal annotations are highly correlated (0.62), so the authors seek to exploit arousal information when predicting valence. If they can additionally exploit the insight from cognitive psychology that arousal-valence points usually lie within a parabolic region (figure (a) above), valence prediction may improve further.
2 Dataset and Features
- Dataset: COGNIMUSE, containing 30-minute clips from 12 Academy Award winning movies.
- Both valence and arousal annotations lie in [-1, 1] (valence = -1 is the most negative emotion and valence = +1 the most positive; arousal = -1 is the lowest emotional intensity and arousal = +1 the highest).
- Emotion labels are taken every 5 s, i.e. the annotations are downsampled so that each non-overlapping 5 s segment carries one label.
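The downsampling can be sketched as averaging the raw annotation stream inside each non-overlapping window; the 25 Hz annotation rate used here is an assumption for illustration, not a figure from the paper:

```python
def downsample(annotations, rate_hz=25, window_s=5):
    """Average an annotation stream into one label per non-overlapping window."""
    step = rate_hz * window_s          # samples per window (125 for 5 s at 25 Hz)
    return [sum(annotations[i:i + step]) / step
            for i in range(0, len(annotations) - step + 1, step)]

# 10 s of dummy valence annotations at the assumed 25 Hz rate
stream = [0.25] * 125 + [0.5] * 125
print(downsample(stream))              # -> [0.25, 0.5]
```

Any trailing samples that do not fill a complete window are dropped, which keeps every label aligned to a full 5 s segment.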
- Only audio features are used as model input, since prior work shows that audio features matter more for valence prediction.
- The extracted audio features include:
    - audio compressibility
    - harmonicity
    - Mel frequency spectral coefficients (MFCC), their derivatives, and statistics over them (e.g. min, max, mean)
    - chroma, its derivatives, and statistics over them (e.g. min, max, mean)
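The derivative-plus-statistics pattern for MFCC and chroma can be sketched as: take frame-to-frame differences as a simple derivative, then summarize each track over a segment with min/max/mean. The values below are dummies; real coefficients would come from an audio analysis library:

```python
def summarize(track):
    """min/max/mean statistics for one feature track over a segment."""
    return {"min": min(track), "max": max(track),
            "mean": sum(track) / len(track)}

def deltas(track):
    """First-order frame-to-frame differences (a simple 'derivative')."""
    return [b - a for a, b in zip(track, track[1:])]

mfcc_c0 = [1.0, 2.0, 4.0, 4.0]          # dummy frames of one MFCC coefficient
features = {**summarize(mfcc_c0),
            **{"d_" + k: v for k, v in summarize(deltas(mfcc_c0)).items()}}
print(features)
# -> {'min': 1.0, 'max': 4.0, 'mean': 2.75, 'd_min': 0.0, 'd_max': 2.0, 'd_mean': 1.0}
```

Applying this to every MFCC and chroma dimension yields one fixed-length feature vector per 5 s segment, matching the label granularity.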
- The input feature set is further reduced using the correlation-based feature selection proposed by Witten et al.
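A much-simplified sketch of correlation-based selection: rank features by absolute correlation with the target and keep the top k. Full CFS also penalizes correlation between the selected features themselves, so this is only illustrative:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_features(features, target, k=1):
    """Keep the k feature names most correlated (in absolute value) with the target."""
    ranked = sorted(features, key=lambda name: -abs(pearson(features[name], target)))
    return ranked[:k]

feats = {"useful": [1, 2, 3, 4], "noise": [0, 5, 1, 4]}
print(select_features(feats, target=[2, 4, 6, 8], k=1))   # -> ['useful']
```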
3 Prediction Model
Joshi T, Sivaprasad S, Pedanekar N. Partners in Crime: Utilizing Arousal-Valence Relationship for Continuous Prediction of Valence in Movies[C]//AffCon@AAAI. 2019.