The CHiME-6 Challenge Review

About the competition

Not so long ago, there was a challenge for speech separation and recognition called CHIME-6. CHIME competitions have been held since 2011. The main distinguishing feature of these competitions is that conversational speech recorded in everyday home environments on several devices simultaneously is used to train and evaluate participants’ solutions.

The data provided for the competition was recorded during the “home parties” of the four participants. These parties were recorded on 32 channels at the same time (six four-channel Microsoft Kinects in the room + two-channel microphones on each of the four speakers).

There were 20 parties, all of them lasting 2–3 hours. The organizers have chosen some of them for testing:

Table 1. The data have been split into training, development test, and evaluation test sets.

By the way, it was the same dataset that had been used in the previous CHIME-5 challenge. However, the organizers prepared several techniques to improve data quality (see the description of software baselines on GitHub, sections “Array synchronization” and “Speech enhancement”).

To find out more about the competition and data preparation, visit its GitHub page or read the overview on Arxiv.org.

This year there were two tracks:

Track 1 — multiple-array speech recognition;
Track 2 — multiple-array diarization and recognition.

And for each track, there were two separate ranking categories:

Ranking A — systems based on conventional acoustic modeling and official language modeling;

Ranking B — all other systems, including systems based on the end-to-end ASR baseline or systems whose lexicon and/or language model have been modified.

The organizers provided a baseline for participation, which includes a pipeline based on the Kaldi speech recognition toolkit.

The main criterion for evaluating participants was the speech recognition metric, word error rate (WER). For the second track, two additional metrics were used: diarization error rate (DER) and Jaccard error rate (JER), which evaluate the quality of the diarization:
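
The formulas for these metrics were shown as images in the original post; as a reference, the commonly used definitions are (my rendering, not taken from the challenge rules):

```latex
\mathrm{WER} = \frac{S + D + I}{N},
\qquad
\mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed}} + T_{\text{speaker error}}}{T_{\text{total speech}}}
```

Here S, D, and I are the numbers of substituted, deleted, and inserted words, N is the number of words in the reference, and the T terms are the durations of the corresponding diarization errors. JER is a related per-speaker measure based on the Jaccard index, averaged over speakers.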

In the tables below, you can see the competition results:

Table 2. Track 1, ranking A (left), ranking B (right)
Table 3. Track 2, ranking A (left), ranking B (right)

You can access the links to the papers on GitHub.

Here are a few curious things I'd like to discuss about the challenge: I don't want you to miss anything!

Challenge highlights

Research-oriented participation

In the previous CHiME-5 challenge, the Paderborn University team, who ranked 4th in Track 2, focused on a speech enhancement technique called Guided Source Separation (GSS). The results of the CHiME-5 competition demonstrated that the technique improves other teams' solutions as well.

This year, the organizers officially referred to GSS in the speech enhancement section and included this technique in the baseline of the first track.

That is why many participants, including all the front runners, used GSS or a similar modification inspired by this idea.

Here is how it works: you construct a Spatial Mixture Model (SMM), which determines time-frequency masks for the speakers (i.e., at what frequencies and at what times a given speaker was talking). Training is performed with the EM algorithm, using the speakers' time annotations as initialization.
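
A minimal sketch of that initialization step (a hypothetical numpy helper, not the official GSS code): per-frame speaker activity is simply broadcast over all frequencies, an extra noise class is added, and the EM iterations then refine these masks.

```python
import numpy as np

def init_masks_from_activity(activity, n_freq, noise_floor=1e-2):
    """Build initial time-frequency masks for the Spatial Mixture Model.

    activity: (n_speakers, n_frames) binary speaker activity taken from
              time annotations (Track 1) or ASR alignments (Track 2).
    Returns masks of shape (n_speakers + 1, n_freq, n_frames); the extra
    class models noise. Masks are normalized to sum to 1 in every TF bin.
    """
    n_spk, n_frames = activity.shape
    masks = np.tile(activity[:, None, :].astype(float), (1, n_freq, 1))
    noise = np.full((1, n_freq, n_frames), noise_floor)
    masks = np.concatenate([masks + noise_floor, noise], axis=0)
    masks /= masks.sum(axis=0, keepdims=True)
    return masks
```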

Picture 1. Spatial Mixture Model

Then, this block is integrated into the general speech enhancement pipeline. This pipeline includes the dereverberation technique (i.e., removing the echo effect from a signal, which occurs when sound is reflected from the walls) and beamforming (i.e., generating a single signal from several audio tracks).
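
To make the beamforming step concrete, here is a hedged numpy sketch of one common choice, a mask-based MVDR beamformer built from the SMM masks; the actual baseline has its own implementation, so treat this as an illustration of the idea only.

```python
import numpy as np

def mvdr_from_masks(stft, speech_mask, noise_mask, eps=1e-8):
    """Mask-based MVDR beamformer (illustrative sketch).

    stft:        (n_mics, n_freq, n_frames) complex multichannel STFT
    speech_mask: (n_freq, n_frames) target-speaker TF mask from the SMM
    noise_mask:  (n_freq, n_frames) mask of everything else
    Returns a single-channel enhanced STFT of shape (n_freq, n_frames).
    """
    n_mics, n_freq, n_frames = stft.shape
    out = np.zeros((n_freq, n_frames), dtype=complex)
    for f in range(n_freq):
        X = stft[:, f, :]                                  # (n_mics, n_frames)
        # Mask-weighted spatial covariance matrices of speech and noise.
        phi_s = (speech_mask[f] * X) @ X.conj().T / (speech_mask[f].sum() + eps)
        phi_n = (noise_mask[f] * X) @ X.conj().T / (noise_mask[f].sum() + eps)
        phi_n += eps * np.eye(n_mics)                      # regularization
        # Steering vector: principal eigenvector of the speech covariance.
        _, vecs = np.linalg.eigh(phi_s)
        d = vecs[:, -1]
        num = np.linalg.solve(phi_n, d)
        w = num / (d.conj() @ num + eps)                   # MVDR filter weights
        out[f] = w.conj() @ X
    return out
```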

Since there were no speaker annotations for the second track, speech recognition alignments (time stamps of spoken phrases) were used to initialize the EM algorithm for training SMM.

Picture 2. Speech enhancement scheme

You can read more about the technique and its implementation results in the Hitachi and Paderborn University team’s paper on Arxiv or take a look at their slides.

A solution of their own

The USTC team, who ranked 1st in Track 1 and 3rd in Track 2, improved the solution they presented at the CHIME-5 challenge. This is the same team that won the CHIME-5 contest. However, other participants did not use the techniques described in their solution.

Inspired by the idea of GSS, USTC implemented its modification of the speech enhancement algorithm — IBF-SS.

You can find out more in their paper.

Baseline improvement

There were several evident weaknesses in the baseline that the organizers did not bother to hide: for example, using only one audio channel for diarization, or the lack of language model rescoring for ranking B. In the baseline scripts, you can also find tips for improvement (for example, adding noise augmentation from the CHiME-6 data when building x-vectors).

The JHU team solution completely eliminates all these weaknesses: there are no brand-new, super-efficient techniques, but the participants went over all the problem areas of the baseline. In Track 2, they ranked 2nd.

They explored multi-array processing techniques at each stage of the pipeline, such as multi-array GSS for enhancement, posterior fusion for speech activity detection, PLDA score fusion for diarization, and lattice combination for ASR. The GSS technique was described above. A good enough fusion technique, according to the JHU research, is a simple max function.
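
As an illustration of how simple such fusion can be, here is a hedged sketch (function and parameter names are my own, not taken from the JHU recipe): per-array speech activity posteriors are combined frame-wise, here with the element-wise maximum mentioned above.

```python
import numpy as np

def fuse_array_posteriors(posteriors, method="max"):
    """Fuse frame-level speech activity posteriors from several arrays.

    posteriors: (n_arrays, n_frames) values in [0, 1], one row per Kinect array.
    method:     "max" (simple and effective per the JHU findings) or "mean".
    """
    posteriors = np.asarray(posteriors)
    if method == "max":
        return posteriors.max(axis=0)
    if method == "mean":
        return posteriors.mean(axis=0)
    raise ValueError(f"unknown fusion method: {method}")

# Usage sketch: threshold the fused posterior to obtain speech frames.
# fused = fuse_array_posteriors(per_array_posteriors)
# speech_frames = fused > 0.5
```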

In addition, they integrated techniques such as online multi-channel WPE dereverberation and VB-HMM based overlap assignment to deal with background noise and overlapping speakers, respectively.

A more detailed description of the JHU solution can be found in their paper.

Track 2 winners: interesting tricks

I'd like to highlight a few tricks used by the winners of the second track of the competition, the ITMO (STC) team:

WRN x-vectors

An x-vector is a speaker embedding, i.e., a vector that contains speaker information. Such vectors are used in speaker recognition tasks. WRN x-vectors are an improvement on x-vectors, obtained by using the WRN (Wide ResNet) architecture and some other tricks. They reduced DER so much that this technique alone would have been enough for the team to win the competition.

You can read more about WRN x-vectors in the paper by the ITMO team.

Cosine similarities and spectral clustering with automatic selection of the binarization threshold

By default, the Kaldi diarization pipeline includes extracting x-vectors from audio, calculating PLDA scores and clustering audio using agglomerative hierarchical clustering (AHC).
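
For orientation, here is a minimal sketch of that last clustering step, using scipy's hierarchical clustering as a stand-in for Kaldi's AHC (the threshold handling and the similarity-to-distance conversion are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ahc_from_scores(score_matrix, threshold):
    """Agglomerative clustering of segment embeddings from a pairwise
    similarity matrix (e.g. PLDA scores): higher score = same speaker.

    score_matrix: (n_segments, n_segments) symmetric similarity matrix.
    threshold:    similarity below which segments are not merged.
    Returns one speaker label per segment.
    """
    # Convert similarities into distances and cluster with average linkage.
    dist = score_matrix.max() - score_matrix
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=score_matrix.max() - threshold, criterion="distance")
```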

Now look at the PLDA component that is used to compute distances between speaker vectors. PLDA has a mathematical justification for calculating distances between i-vectors. Put simply, it relies on the fact that i-vectors contain information about both the speaker and the channel, and when clustering, the only thing that matters for us is the speaker information. It also works nicely for x-vectors.

However, using cosine similarity instead of PLDA scores and spectral clustering with automatic selection of a threshold instead of AHC allowed the ITMO team to make another significant improvement in diarization.
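
A simplified sketch of this idea is shown below (my own approximation: the referenced paper selects the threshold via a normalized eigengap criterion, while this version simply keeps the threshold that yields the largest eigengap of the graph Laplacian):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def cosine_affinity(embeddings):
    """Cosine similarity matrix between L2-normalized speaker embeddings."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return x @ x.T

def auto_threshold_spectral_clustering(affinity, max_speakers=4,
                                       thresholds=np.linspace(0.1, 0.9, 17)):
    """Spectral clustering with automatic selection of the binarization threshold."""
    n = affinity.shape[0]
    best = None
    for t in thresholds:
        a = (affinity > t).astype(float)           # binarize the affinity matrix
        np.fill_diagonal(a, 0.0)
        a = np.maximum(a, a.T)                     # keep the graph symmetric
        deg = a.sum(axis=1)
        if np.any(deg == 0):                       # threshold too aggressive
            continue
        d = 1.0 / np.sqrt(deg)
        lap = np.eye(n) - d[:, None] * a * d[None, :]   # normalized Laplacian
        eigvals, eigvecs = eigh(lap)
        gaps = np.diff(eigvals[: max_speakers + 1])
        k = int(np.argmax(gaps)) + 1               # estimated number of speakers
        if best is None or gaps.max() > best[0]:
            best = (gaps.max(), k, eigvecs)
    _, k, eigvecs = best
    # Standard spectral clustering: k-means on the first k eigenvectors.
    return KMeans(n_clusters=k, n_init=10).fit_predict(eigvecs[:, :k])
```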

To find out more about spectral clustering with automatic selection of the binarization threshold, read this paper on Arxiv.org.

TS-VAD

The authors focus specifically on this technique in their work. TS-VAD (target-speaker voice activity detection) is a novel approach that predicts the activity of each speaker in each time frame. The TS-VAD model uses acoustic features (e.g., MFCC) along with i-vectors for each speaker as inputs. Two versions of TS-VAD are provided: single-channel and multi-channel. The architecture of this model is presented below.

Picture 3. TS-VAD schemes

Note that this network is tailored for dialogs of four participants, which is actually stipulated by the requirements of this challenge.

The single-channel TS-VAD model was designed to predict speech probabilities for all speakers simultaneously. This model, with four output layers, was trained using a sum of binary cross-entropies as the loss function. The authors also found it essential to process each speaker with the same Speaker Detection (SD) 2-layer BLSTMP, and then combine the SD outputs for all speakers with one more BLSTMP layer.
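
Below is an illustrative PyTorch sketch of this single-channel architecture, reconstructed from the description above (the dimensions are assumptions, and plain bidirectional LSTMs stand in for the BLSTMP layers of the paper):

```python
import torch
import torch.nn as nn

class SingleChannelTSVAD(nn.Module):
    """Illustrative sketch of single-channel TS-VAD (not the authors' code).

    feats:    (batch, frames, feat_dim)      acoustic features, e.g. MFCC
    ivectors: (batch, n_speakers, ivec_dim)  one i-vector per target speaker
    returns:  (batch, frames, n_speakers)    per-frame speech probabilities
    """
    def __init__(self, feat_dim=40, ivec_dim=100, hidden=384, n_speakers=4):
        super().__init__()
        self.n_speakers = n_speakers
        # Shared Speaker Detection (SD) block: a 2-layer bidirectional LSTM.
        self.sd = nn.LSTM(feat_dim + ivec_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # One more bidirectional LSTM combines the SD outputs of all speakers.
        self.combine = nn.LSTM(2 * hidden * n_speakers, hidden,
                               batch_first=True, bidirectional=True)
        # Four output layers, one per speaker.
        self.heads = nn.ModuleList([nn.Linear(2 * hidden, 1)
                                    for _ in range(n_speakers)])

    def forward(self, feats, ivectors):
        sd_outs = []
        for s in range(self.n_speakers):
            # Append the target speaker's i-vector to every frame and run
            # the same SD block for each speaker.
            ivec = ivectors[:, s, :].unsqueeze(1).expand(-1, feats.size(1), -1)
            out, _ = self.sd(torch.cat([feats, ivec], dim=-1))
            sd_outs.append(out)
        combined, _ = self.combine(torch.cat(sd_outs, dim=-1))
        probs = [torch.sigmoid(head(combined)) for head in self.heads]
        return torch.cat(probs, dim=-1)

# Training sketch: the loss is the sum of per-speaker binary cross-entropies, e.g.
# loss = sum(F.binary_cross_entropy(pred[..., s], labels[..., s].float()) for s in range(4))
```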

The single-channel version of TS-VAD processes each channel separately. To process the separate Kinect channels jointly, the authors investigated the multi-channel TS-VAD model, which takes a combination of the SD block outputs of the single-channel TS-VAD model as input. All the SD vectors for each speaker are passed through a 1-d convolutional layer and then combined by means of a simple attention mechanism. The combined attention outputs for all speakers are passed through a single BLSTM layer and converted into a set of per-frame probabilities of each speaker's presence/absence.

Finally, to improve overall diarization performance, the ITMO team fused several single-channel and multi-channel TS-VAD models by computing a weighted average of their probability streams.
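
That final fusion step is essentially a weighted average over time-aligned probability streams; here is a minimal sketch (the weights are illustrative assumptions):

```python
import numpy as np

def fuse_probability_streams(streams, weights=None):
    """Weighted average of per-frame speaker probabilities from several TS-VAD models.

    streams: list of arrays, each (n_frames, n_speakers), aligned in time.
    weights: optional per-model weights; uniform if omitted.
    """
    streams = np.stack(streams)                    # (n_models, n_frames, n_speakers)
    if weights is None:
        weights = np.ones(len(streams)) / len(streams)
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return np.tensordot(weights, streams, axes=1)  # (n_frames, n_speakers)
```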

You can learn more about the TS-VAD model on Arxiv.org.

Conclusion

I hope this review has helped you get a better understanding of the CHiME-6 challenge. Maybe you’ll find the tips and tricks I mentioned useful if you decide to take part in the upcoming contests. Feel free to reach out in case I missed something!

Original article: https://towardsdatascience.com/the-chime-6-challenge-review-15f3cbf0062d
