IIUM Repository

Multi-attention bottleneck for gated convolutional encoder-decoder-based speech enhancement

Saleem, Nasir and Gunawan, Teddy Surya and Shafi, Muhammad and Bourouis, Sami and Trigui, Aymen (2023) Multi-attention bottleneck for gated convolutional encoder-decoder-based speech enhancement. IEEE Access, 11. pp. 114172-114186. E-ISSN 2169-3536

PDF (SCOPUS) - Supplemental Material (244kB)
PDF (Article) - Published Version (4MB) - Restricted to Repository staff only

Abstract

Convolutional encoder-decoder (CED) has emerged as a powerful architecture, particularly in speech enhancement (SE), which aims to improve the intelligibility and quality of noise-contaminated speech. This architecture leverages the strength of convolutional neural networks (CNNs) in capturing high-level features. Usually, CED architectures use a gated recurrent unit (GRU) or long short-term memory (LSTM) as a bottleneck to capture temporal dependencies, enabling an SE model to effectively learn the dynamics and long-term temporal dependencies of the speech signal. However, Transformer neural networks with self-attention capture long-term temporal dependencies more effectively. This study proposes a multi-attention bottleneck (MAB) composed of a self-attention Transformer powered by a time-frequency attention (TFA) module followed by a channel attention module (CAM) to focus on the important features. The proposed bottleneck (MAB) is integrated into a CED architecture, and the resulting model is named MAB-CED. MAB-CED uses an encoder-decoder structure comprising a shared encoder and two decoders, where one decoder is dedicated to spectral masking and the other to spectral mapping. Convolutional Gated Linear Units (ConvGLU) and Deconvolutional Gated Linear Units (DeconvGLU) are used to construct the encoder-decoder framework. The outputs of the two decoders are coupled by coherent averaging to synthesize the enhanced speech signal. The proposed speech enhancement method is evaluated on two databases, VoiceBank+DEMAND and LibriSpeech. The results show that the proposed method outperforms the benchmarks in terms of intelligibility and quality at various input SNRs: MAB-CED improves the average PESQ by 0.55 (22.85%) on VoiceBank+DEMAND and by 0.58 (23.79%) on LibriSpeech, and the average STOI by 9.63% (VoiceBank+DEMAND) and 9.78% (LibriSpeech) over the noisy mixtures.
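To make the abstract's main building blocks concrete, the following is a minimal PyTorch sketch of a gated convolutional (GLU) encoder block and a multi-attention bottleneck that chains Transformer self-attention with a squeeze-and-excitation style channel gate. It is an illustration under stated assumptions, not the authors' implementation: the layer sizes, the specific channel-gate design standing in for the CAM, and the omission of the TFA module, the DeconvGLU decoders, and the coherent averaging step are all choices made here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGLU(nn.Module):
    """Gated convolutional encoder block: a 2-D convolution whose output is
    split into a feature half and a gate half via a gated linear unit (GLU)."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3), stride=(1, 2)):
        super().__init__()
        # Produce 2*out_ch channels so F.glu can split them into value and gate.
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel, stride, padding=(1, 1))
        self.norm = nn.BatchNorm2d(out_ch)

    def forward(self, x):  # x: (batch, channels, time, freq)
        return self.norm(F.glu(self.conv(x), dim=1))

class MultiAttentionBottleneck(nn.Module):
    """Illustrative bottleneck: Transformer self-attention over the time axis,
    followed by a channel-attention gate (a stand-in for the paper's CAM)."""
    def __init__(self, channels, freq_bins, n_heads=4):
        super().__init__()
        d_model = channels * freq_bins  # must be divisible by n_heads
        self.self_attn = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        # Channel attention: global average pool -> two-layer sigmoid gate.
        self.channel_gate = nn.Sequential(
            nn.Linear(channels, channels // 2), nn.ReLU(),
            nn.Linear(channels // 2, channels), nn.Sigmoid())

    def forward(self, x):  # x: (batch, channels, time, freq)
        b, c, t, f = x.shape
        # Flatten channel/frequency into the model dimension; attend over time.
        y = self.self_attn(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        y = y.reshape(b, t, c, f).permute(0, 2, 1, 3)
        # Re-weight channels by their global importance.
        w = self.channel_gate(y.mean(dim=(2, 3)))  # (batch, channels)
        return y * w.unsqueeze(-1).unsqueeze(-1)

# Usage sketch: an encoder block feeding the bottleneck on a spectrogram-like input.
if __name__ == "__main__":
    spec = torch.randn(2, 1, 100, 128)          # (batch, ch, time, freq), hypothetical shape
    enc = ConvGLU(in_ch=1, out_ch=16)           # gated convolutional encoder stage
    bottleneck = MultiAttentionBottleneck(channels=16, freq_bins=64)
    out = bottleneck(enc(spec))
    print(out.shape)                            # torch.Size([2, 16, 100, 64])
```

In the full MAB-CED model described by the abstract, a stack of such gated encoder blocks would feed the bottleneck, whose output is then shared by the masking and mapping DeconvGLU decoders before coherent averaging; those parts are left out here to keep the sketch short.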

Item Type: Article (Journal)
Uncontrolled Keywords: Multi-attention, time-frequency attention, channel attention, transformer, speech enhancement, gated convolutional encoder-decoder
Subjects: T Technology > TK Electrical engineering. Electronics Nuclear engineering > TK7800 Electronics. Computer engineering. Computer hardware. Photoelectronic devices > TK7885 Computer engineering
Kulliyyahs/Centres/Divisions/Institutes: Kulliyyah of Engineering
Kulliyyah of Engineering > Department of Electrical and Computer Engineering
Depositing User: Prof. Dr. Teddy Surya Gunawan
Date Deposited: 15 Nov 2023 08:41
Last Modified: 15 Nov 2023 08:58
URI: http://irep.iium.edu.my/id/eprint/108080
