VALSE2019 Summary (2): Human-Centered Visual Understanding

Published: 2024/8/26

2. Human-Centered Visual Understanding (Cewu Lu, SJTU)

2.1 Video-Based Temporal Modeling and Action Recognition (Limin Wang, NJU)

dataset

Two figures:
Note one distinction: trimmed vs. untrimmed videos

outline

action recognition
action temporal localization
action spatial detection
action spatial-temporal detection

opportunities and challenges

opportunities:
- videos provide huge and rich data for visual learning
- action is core in motion perception and has many applications in video understanding
challenges:
- complex dynamics and temporal variations
- action vocabulary is not well defined
- noisy and weak labels (dense labeling is expensive)
- high computational and memory cost

temporal structure: actions need to be decomposed (decomposition)

Commonly used deep networks

large-scale video classification with CNN (Fei-Fei Li, CVPR2014)
Two-Stream CNN for action recognition in videos (NIPS2014)
learning spatiotemporal features with 3D CNN (Du Tran, ICCV2015)
TDD (Limin Wang, CVPR2015)
Real-time action recognition with enhanced motion vector CNNs (CVPR2016)
Two Stream I3D (CVPR2017)
R(2+1)D (CVPR2018)
SlowFast Networks (Kaiming He, ICCV2019)

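Three of Limin Wang's own works

short-term -> middle-term -> long-term modeling, corresponding to the papers (ARTNet -> TSN -> UntrimmedNet)
For a more detailed understanding, read his slides directly and write up notes
By Limin Wang's own account, video action recognition/detection is of basically no help for my VAD work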
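
To make the middle-to-long-term idea concrete, here is a minimal sketch of segment-based sparse sampling in the spirit of TSN: the video is split into equal segments, one frame is drawn from each, and the per-segment class scores are averaged into a video-level prediction. The function names and the plain-list representation are assumptions made for illustration. This is not the authors' code, and the per-segment classifier is left out.

```python
import random


def sample_segment_indices(num_frames, num_segments, rng=None):
    """Split [0, num_frames) into equal segments and draw one frame index per segment."""
    rng = rng or random.Random(0)  # fixed seed by default so the sketch is reproducible
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))  # keep every segment non-empty
        indices.append(rng.randrange(start, end))
    return indices


def segment_consensus(per_segment_scores):
    """Average class scores over segments (a simple consensus function)."""
    n = len(per_segment_scores)
    num_classes = len(per_segment_scores[0])
    return [sum(s[c] for s in per_segment_scores) / n for c in range(num_classes)]
```

Because only one frame per segment is processed, the cost stays fixed regardless of video length while the samples still cover the whole video.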

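Some good Zhihu links I came across earlier: Action Recognition-1, Action Recognition-2, Temporal Action Detection-1, Temporal Action Detection-2, Temporal Action Detection-3, Temporal Action Detection-4

All slide images

2.2 Deep and Efficient Analysis and Understanding of Complex Videos (Yu Qiao, CAS)

An introduction to some empirical DL tricks

The open-set nature of face recognition (open-set is somewhat similar to novelty detection; see TODO)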

Center Loss (ECCV2016)

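Center loss gives each class a class center and minimizes the distance between every sample in the mini-batch and the center of its class, which shrinks intra-class distances.

Center loss builds on softmax loss: it maintains one center per training class in feature space and, during training, adds a constraint on the distance between each sample's embedded feature and its class center, balancing intra-class compactness with inter-class separation.

From: link-1, link-2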
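
The description above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: features are plain lists, the loss is averaged over the mini-batch, and `alpha` is an assumed center learning rate. In training, this term would be added to the softmax loss with a weighting factor.

```python
def center_loss(features, labels, centers):
    """Mean over the batch of 0.5 * squared distance between a feature and its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        total += 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return total / len(features)


def update_centers(features, labels, centers, alpha=0.5):
    """Move each class center toward the samples of that class in the mini-batch."""
    new_centers = [list(c) for c in centers]
    for j, c in enumerate(centers):
        members = [x for x, y in zip(features, labels) if y == j]
        if not members:
            continue  # classes absent from the batch keep their center
        for d in range(len(c)):
            # the 1 in the denominator damps the update when few samples are present
            delta = sum(c[d] - x[d] for x in members) / (1 + len(members))
            new_centers[j][d] = c[d] - alpha * delta
    return new_centers
```

Updating centers per mini-batch rather than over the whole training set keeps the cost low, at the price of noisier centers.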
Improvement on Center Loss (IJCV2019): replace class centers with projection directions
Losses designed around the large-margin idea:
- L-softmax (Liu, ICML 2016)
- A-softmax (Liu, CVPR2017)
- Additive Margin Softmax (ICLR 2018 workshop)
- CosFace (Wang, CVPR2018)
- ArcFace (CVPR2019)
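
The margin-based losses listed above share one idea: make the target class harder to satisfy by subtracting a margin from its logit before the softmax. The sketch below shows two variants on normalized features and weights: an additive cosine margin, s * (cos(theta) - m), as in Additive Margin Softmax and CosFace, and an additive angular margin, s * cos(theta + m), as in ArcFace. The function names and the default `s` and `m` values are illustrative assumptions, and the angular branch omits the special handling that full implementations apply when theta + m exceeds pi.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def margin_logits(feature, class_weights, target, s=30.0, m=0.35, kind="additive"):
    """Scaled cosine logits with a margin applied to the target class only."""
    f = _normalize(feature)
    logits = []
    for j, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(f, _normalize(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if j == target:
            if kind == "additive":    # cos(theta) - m
                cos = cos - m
            elif kind == "angular":   # cos(theta + m)
                cos = math.cos(math.acos(cos) + m)
            else:
                raise ValueError("kind must be 'additive' or 'angular'")
        logits.append(s * cos)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single sample."""
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[target]
```

With the margin in place, a sample that is already classified correctly still incurs a larger loss than it would without the margin, which pushes classes further apart in angle.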

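Range Loss: effectively handles the long-tail problem caused by imbalanced sample counts across classes

motivation: a few people (celebrities) have a large number of images, while most people have only a few. This long-tailed distribution motivates two things: (1) a theoretical analysis of how the long tail affects model performance; (2) a new loss designed to address it.
There was a figure here, and the Range Loss slides are missing.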

video action recognition

Pose attention mechanism RPAN (ICCV2017, Oral)
- combines the two tasks of action recognition and pose estimation
- uses pose changes to guide an RNN in modeling the dynamics of an action

One paper

Temporal Hallucinating for Action Recognition with Few Still Images (CVPR 2018)

Some figures

2.3 Understanding Emotions in Videos (Yanwei Fu, FDU)

Personal impression: this is a newly opened research direction; interesting and worth a look

applications

web video search
video recommendation system
avoid inappropriate advertisement

Tasks of Emotions in videos

Emotion recognition
emotion attribution
emotion-oriented summarization

Challenges

Sparsely expressed in videos
Diverse content and variable quality

Knowledge Transfer

Zero-shot emotion learning (accompanied by a figure)
- A multi-task neural approach for emotion attribution, classification and summarization (TMM)
- Frame-Transformed Emotion Classification Network (*CMR 2017)

Emotion-oriented summarization

Essentially amounts to selecting key frames and fusing frame information
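
The note above describes emotion-oriented summarization as key-frame selection plus frame-information fusion. A minimal sketch of that reading follows, assuming each frame already has an emotion score and a feature vector. The top-k selection rule, the mean fusion, and the name `summarize_by_emotion` are all hypothetical choices for illustration, not the method presented in the talk.

```python
def summarize_by_emotion(frame_scores, frame_features, k=3):
    """Pick the k frames with the highest emotion score and average their features."""
    k = min(k, len(frame_scores))
    ranked = sorted(range(len(frame_scores)), key=lambda i: frame_scores[i], reverse=True)
    keyframes = sorted(ranked[:k])  # restore temporal order for the summary
    dim = len(frame_features[0])
    fused = [sum(frame_features[i][d] for i in keyframes) / k for d in range(dim)]
    return keyframes, fused
```

Since emotion is sparsely expressed in videos, restricting fusion to high-scoring frames avoids diluting the signal with neutral content.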

Face emotion

Posture, Expression, Identity in faces

Some figures:

2.4 Exploring Structured Deep Learning Methods for Human-Centered Visual Recognition and Localization (Wanli Ouyang, University of Sydney)

outline

introduction
structured feature learning
back-bone model design
conclusion

introduction

object detection
human pose estimation
action recognition

structured feature learning

structure in neurons
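- motivation: in conventional networks, neurons in the same layer are not connected to each other, while adjacent layers are locally or fully connected, so information within a local region is not guaranteed to be preserved. This motivates giving the neurons in each layer structured information. Taking human pose estimation as the example, the talk analyzed the problems of approaches based on fully connected neural networks: modeling the distances between body joints requires large convolution kernels, and the relationships between some joints are unstable. A structured feature learning model for human pose estimation (Bidirectional Tree) was proposed.
- Bidirectional Tree
- corresponding papers
  - end-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation (CVPR2016)
  - structured feature learning for pose estimation (CVPR2016)
  - CRF-CNN: modeling structured information in human pose estimation (NIPS 2016)
  - learning deep structured multi-scale features using attention-gated CRFs for contour prediction (NIPS 2017)
- application of structured feature learning
  - one figure here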
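
As a rough illustration of the bidirectional tree idea, the sketch below passes messages between joint features along a kinematic tree in two directions: once from the leaves up to the root and once from the root down to the leaves, then sums the two results per joint. It is a deliberately simplified assumption: the paper passes information between feature maps with learned kernels, whereas here each message is just the neighbor's feature scaled by a fixed hypothetical `weight`.

```python
def bidirectional_tree_pass(features, parent, weight=0.5):
    """features: joint -> list of floats; parent: joint -> parent joint (None for the root)."""
    def depth(j):
        d = 0
        while parent[j] is not None:
            j = parent[j]
            d += 1
        return d

    order = sorted(features, key=depth)  # parents come before their children

    up = {j: list(f) for j, f in features.items()}
    for j in reversed(order):  # leaves -> root: each parent absorbs its children
        p = parent[j]
        if p is not None:
            up[p] = [a + weight * b for a, b in zip(up[p], up[j])]

    down = {j: list(f) for j, f in features.items()}
    for j in order:  # root -> leaves: each child absorbs its parent
        p = parent[j]
        if p is not None:
            down[j] = [a + weight * b for a, b in zip(down[j], down[p])]

    return {j: [u + d for u, d in zip(up[j], down[j])] for j in features}
```

After both passes every joint has received information from the rest of the tree without any single layer needing a kernel large enough to span distant joints.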

back-bone model design

Hourglass for classification (encoder-decoder structures such as UNet are generally used for image segmentation, not classification)
- Ideal: features with both high-level semantics and high resolution
- Reality: features with high-level semantics come at low resolution
- Why Hourglass for classification performs poorly: different tasks require features at different resolutions, which motivates FishNet
FishNet
- motivation: to combine the strengths of pixel-level, region-level and image-level tasks in one network, Wanli Ouyang proposed FishNet. Its advantages: gradients propagate better to shallow layers, and the extracted features carry rich low-level and high-level semantics while information at every level is preserved and refined.
- pros.
  - better gradient flow to shallow layers
  - features:
    - contain rich low-level and high-level semantics
    - are preserved and refined from each other (信息互相交流)
- code: https://github.com/kevin-ssy/FishNet
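
The goal of keeping both high resolution and high-level semantics can be illustrated with a tiny feature-fusion sketch: a low-resolution deep feature map is upsampled and concatenated channel-wise with a high-resolution shallow one, so nothing from either level is discarded. This is only a toy under stated assumptions (nested lists as feature maps, nearest-neighbor upsampling, hypothetical function names); the actual FishNet architecture is in the repository linked above.

```python
def upsample_nearest(fmap, factor=2):
    """fmap: list of channels, each an H x W nested list; repeat each pixel factor x factor times."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(factor)]
            rows.extend([list(wide) for _ in range(factor)])
        out.append(rows)
    return out


def fuse(shallow, deep, factor=2):
    """Concatenate high-resolution shallow channels with upsampled deep channels."""
    up = upsample_nearest(deep, factor)
    if len(up[0]) != len(shallow[0]) or len(up[0][0]) != len(shallow[0][0]):
        raise ValueError("spatial sizes do not match after upsampling")
    return shallow + up  # channel-wise concatenation keeps both sources intact
```

Concatenation, unlike addition, leaves the shallow features unmodified, which is one simple way to read the "preserved and refined" property noted above.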

conclusion

structured deep learning is (1) effective and (2) motivated by observation
end-to-end joint training bridges the gap between structure modeling and feature learning

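Some figures

2.5 Action Recognition and Understanding for Surveillance Video (Weiyao Lin, SJTU)

Tasks in action recognition

Trajectory-based behavior analysis
Action recognition for arbitrary videos (Limin Wang)
Action recognition for surveillance video

Key points in object detection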

Tiny DSOD (BMVC 2018)
Toward accurate one-stage object detection with AP-Loss (CVPR 2019)
kill two birds with one stone: boosting both object detection accuracy and speed with adaptive patch-of-interest composition (2017)

Several applications

3D object detection and pose estimation
Multi-object tracking
Real-time scene statistics based on object detection and tracking
Multi-camera tracking
- (correspondence structure Re-ID)
  - learning correspondence structure for person re-identification (TIP2017)
  - person re-identification with correspondence structure learning (ICCV 2015)
- (Group Re-ID)
  - Group re-identification: leveraging and integrating multi-grain information (MM2018)
- Vehicle-mounted cross-camera localization
- Unmanned supermarkets
- Re-ID of wild Amur tigers

Action recognition

Multi-scale features
- action recognition with coarse-to-fine deep feature integration and asynchronous fusion (AAAI 2018)
Spatio-temporal asynchronous association
- cross-stream selective networks for action recognition (CVPR workshop 2019)
Spatio-temporal action localization
- finding action tubes with an integrated sparse-to-dense framework (arxiv 2019)
Surveillance action recognition
- one figure here

Other applications

Real-time action/event detection
Trajectory-based behavior analysis: individual behavior analysis
- a tube-and-droplet-based approach for representing and analyzing motion trajectories (TPAMI 2017); not DL-based, not a fan
Trajectory-based behavior analysis: trajectory clustering and mining
- unsupervised trajectory clustering via adaptive multi-kernel-based shrinkage (ICCV 2015); fairly old, but it can serve as a base: look at the recent high-quality papers that cite it
Crowded-scene behavior analysis
- a diffusion and clustering-based approach for finding coherent motions and understanding crowd scenes (TIP 2016)
- finding coherent motions and semantic regions in crowd scenes: a diffusion and clustering approach (ECCV 2014)

Homepage: link-1

Reposted from: https://www.cnblogs.com/LS1314/p/10885080.html
