

Sleep Audio Segmentation and Recognition (Part 3)

Published: 2024/8/23

Paper 1: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

[Abstract] Audio pattern recognition is an important research topic in machine learning, and includes tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification, and sound event detection. Recently, neural networks have been applied to tackle audio pattern recognition problems. However, previous systems were built on specific datasets with limited durations. Recently, in computer vision and natural language processing, systems pretrained on large-scale datasets have generalized well to several tasks. However, there has been limited research on pretraining systems on large-scale datasets for audio pattern recognition. In this paper, we propose pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset. These PANNs are transferred to other audio-related tasks. We investigate the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks. We propose an architecture called Wavegram-Logmel-CNN using both the log-mel spectrogram and the waveform as input features. Our best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the previous best system's 0.392. We transfer PANNs to six audio pattern recognition tasks and demonstrate state-of-the-art performance in several of them.
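The log-mel front end mentioned in the abstract can be sketched in plain numpy. This is not the paper's exact implementation, only a minimal illustration of the feature pipeline (STFT power spectrum, triangular mel filterbank, log compression); the parameter values (32 kHz sampling rate, 1024-point window, 320-sample hop, 64 mel bands) are assumptions chosen to mirror a typical AudioSet-style setup:

```python
import numpy as np

def logmel(waveform, sr=32000, n_fft=1024, hop=320, n_mels=64):
    """Minimal log-mel spectrogram: windowed STFT power -> mel filterbank -> log."""
    # Frame the signal and apply a Hann window.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (frames, n_fft//2 + 1)

    # Triangular mel filterbank, equally spaced on the mel scale.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)

    mel = power @ fb.T                                  # (frames, n_mels)
    return np.log(mel + 1e-10)                          # floor avoids log(0)
```

The Wavegram branch of Wavegram-Logmel-CNN instead learns a comparable time-frequency representation directly from the raw waveform with strided convolutions, and the two representations are concatenated as CNN input.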


Paper 2: Towards Duration Robust Weakly Supervised Sound Event Detection

> [1]

Introduction

SOUND event detection (SED) research classifies and localizes particular audio events (e.g., dog barking, alarm ringing) within an audio clip, assigning each event a label along with a start point (onset) and an endpoint (offset). Label assignment is usually referred to as tagging, while onset/offset detection is referred to as localization. SED can be used for query-based sound retrieval [1], smart cities and homes [2], [3], as well as voice activity detection [4]. Unlike common classification tasks such as image or speaker recognition, a single audio clip might contain multiple different sound events (multi-output), sometimes occurring simultaneously (multi-label). In particular, the localization task escalates the difficulty within the scope of SED, since different sound events have various time lengths, and each occurrence is unique.

Two main approaches exist to train an effective localization model: fully supervised SED and weakly supervised SED (WSSED). Fully supervised approaches, which potentially perform better than weakly supervised ones, require manual time-stamp labeling. However, manual labeling is a significant hindrance to scaling to large datasets due to the expensive labor cost.
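The difference between the two supervision regimes can be made concrete with a toy label encoding (the event names and clip layout here are hypothetical, not from either paper):

```python
import numpy as np

EVENTS = ["dog_bark", "alarm", "speech"]   # hypothetical label set

# Strong (fully supervised) label: one row per frame, one column per event.
# A 5-frame clip where a dog barks in frames 1-2 and speech spans frames 3-4.
strong = np.zeros((5, len(EVENTS)), dtype=int)
strong[1:3, 0] = 1      # dog_bark: onset frame 1, offset frame 3
strong[3:5, 2] = 1      # speech:   onset frame 3, offset frame 5

# Weak (clip-level) label: all WSSED sees at training time --
# which events occur somewhere in the clip, with no timing at all.
weak = strong.max(axis=0)
```

A WSSED model thus trains only on vectors like `weak`, yet at inference must recover something like `strong`, i.e., the onset/offset boundaries.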
This paper primarily focuses on WSSED, which only has access to clip-level event labels during training yet is required to predict onsets and offsets at the inference stage. Challenges such as the Detection and Classification of Acoustic Scenes and Events (DCASE) exemplify the difficulties in training robust SED systems. DCASE challenge datasets are real-world recordings (e.g., audio with no quality control and lossy compression), and thus contain unknown noises and scenarios. Specifically, in each challenge since 2017, at least one task has been primarily concerned with WSSED. Most previous work focuses on providing single-target, task-specific solutions for WSSED at either the tagging, segment, or event level. Tagging-level solutions are often capable of localizing event boundaries, yet their temporal consistency is subpar to segment- and event-level methods. This was seen during the DCASE2017 challenge, where no single model could win both the tagging and localization subtasks. Solutions optimized for the segment level often utilize a fixed target time resolution (e.g., 1 Hz), inhibiting fine-scale (e.g., 50 Hz) localization performance. Lastly, successful event-level solutions require prior knowledge about each event's duration to obtain temporally consistent predictions. Previous work in [5] showed that successful models such as the DCASE2018 task 4 winner are biased towards predicting tags from long-duration clips, which might limit them from generalizing to different datasets (e.g., deploying the same model on a new dataset), since new datasets possibly contain short or unknown-duration events. In contrast, we aim to enhance WSSED performance, specifically in duration estimation regarding short, abrupt events, without a pre-estimation of each respective event's individual weight.

Related Work

Most current approaches within SED and WSSED utilize neural networks, in particular convolutional neural networks (CNN) [6], [7] and convolutional recurrent neural networks (CRNN) [4], [5]. CNN models generally excel at audio tagging [8], [9] and scale with data, yet fall behind CRNN approaches in onset and offset estimation [10]. Apart from different modeling methods, many recent works propose other approaches to the localization conundrum. A plethora of temporal pooling strategies has been proposed, aiming to summarize frame-level beliefs into a single clip-wise probability.
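As an illustration of such pooling strategies, the following sketch aggregates one event's per-frame probabilities into a clip-level probability. Max, mean, and linear-softmax pooling are common choices in the WSSED literature, though the exact set of strategies compared varies by paper:

```python
import numpy as np

def clip_probability(frame_probs, mode="linear_softmax"):
    """Aggregate per-frame event probabilities of shape (T,) into one clip probability.

    Three common temporal pooling strategies: max favors short events but
    backpropagates through a single frame per clip; mean dilutes short
    events across the whole clip; linear softmax weights each frame by its
    own probability, a compromise between the two extremes.
    """
    p = np.asarray(frame_probs, dtype=float)
    if mode == "max":
        return p.max()
    if mode == "mean":
        return p.mean()
    if mode == "linear_softmax":
        return (p * p).sum() / max(p.sum(), 1e-12)
    raise ValueError(f"unknown pooling mode: {mode}")
```

For a clip containing one brief high-probability event, max pooling reports it strongly, mean pooling dilutes it, and linear softmax lies in between, which is why the pooling choice interacts directly with event duration.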
Contribution

In our work, we modify and extend the framework of [5] further towards other datasets and aim to analyze the benefits and limits of duration-robust training. Our main goal with this work is to bridge the gap between real-world SED and research models, and to facilitate a common framework that works well at both the tagging and localization levels without utilizing dataset-specific knowledge.
Our contributions are:

A new, lightweight model architecture for WSSED using L4-norm temporal subsampling.
A novel thresholding technique named triple threshold, bridging the gap between tagging and localization performance.
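The triple threshold is detailed later in the paper; as a hedged sketch, one plausible decoder combines a clip-level (tagging) threshold with high/low hysteresis thresholds on the frame probabilities. All threshold values below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def triple_threshold(frame_probs, clip_prob, t_clip=0.5, t_hi=0.75, t_lo=0.2):
    """Hedged sketch of a triple-threshold decoder for one event class.

    1. Clip threshold: only events tagged at clip level (clip_prob > t_clip)
       may produce output at all, tying localization to tagging.
    2. High threshold: frames above t_hi seed candidate events.
    3. Low threshold: each seed is extended over the contiguous frames
       still above t_lo (hysteresis), yielding (onset, offset) frame pairs.
    """
    if clip_prob <= t_clip:
        return []
    p = np.asarray(frame_probs, dtype=float)
    active = p > t_lo
    events, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if p[start:i].max() > t_hi:      # keep only segments with a seed
                events.append((start, i))
            start = None
    if start is not None and p[start:].max() > t_hi:
        events.append((start, len(p)))
    return events
```

Because the clip threshold gates localization on the tagging decision, a clip that is not tagged with an event can never emit that event's boundaries, which is one sense in which such a technique ties the two performance levels together.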
Verification of our proposed approach across three publicly available datasets, without the requirement of manually optimizing dataset-specific hyperparameters.

References


[1]: Paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9335265
[2]: Source code: https://github.com/RicherMans/CDur
