IIUM Repository

NSE-CATNet: deep neural speech enhancement using convolutional attention transformer network

Saleem, Nasir and Gunawan, Teddy Surya and Kartiwi, Mira and Nugroho, Bambang Setia and Wijayanto, Inung (2023) NSE-CATNet: deep neural speech enhancement using convolutional attention transformer network. IEEE Access, 11. pp. 66979-66994. E-ISSN 2169-3536

PDF (Journal) - Published Version (5MB): Restricted to Registered users only
PDF (Scopus) - Supplemental Material (240kB): available for download

Abstract

Speech enhancement (SE) is a critical component of many speech-processing applications. Recent research in this field focuses on effective ways to capture the long-term contextual dependencies of speech signals to improve performance. Deep convolutional networks (DCNs) using self-attention and the Transformer model have demonstrated competitive results in SE. Transformer models with convolutional layers can capture short- and long-term temporal sequences by leveraging multi-head self-attention, which allows the model to attend to the entire sequence. This study proposes a neural speech enhancement (NSE) model, named NSE-CATNet, that combines a convolutional encoder-decoder (CED) with a convolutional attention Transformer (CAT). To effectively process the time-frequency (T-F) distribution of spectral components in speech signals, a T-F attention module is incorporated into the convolutional Transformer model. This module enables the model to explicitly exploit position information and generate a two-dimensional attention map for the T-F speech distribution. The performance of the proposed SE model is evaluated with objective speech quality and intelligibility metrics on two datasets, the VoiceBank-DEMAND corpus and the LibriSpeech dataset. The experimental results indicate that the proposed model outperforms competitive baselines at -5 dB, 0 dB, and 5 dB SNRs, improving overall speech quality by 0.704 on VoiceBank-DEMAND and by 0.692 on LibriSpeech. Further, intelligibility on VoiceBank-DEMAND and LibriSpeech improves by 11.325% and 11.75%, respectively, over the noisy speech signals.
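The T-F attention idea described in the abstract can be illustrated with a short sketch. The module below is a hypothetical PyTorch implementation, not the authors' published code: it pools the feature map along the time and frequency axes separately, scores each axis with a 1-D convolution, and combines the two scores into a two-dimensional attention map that rescales the input T-F representation. The class name TFAttention, the mean pooling, the 1x1 convolution scoring, and the sigmoid gating are all assumptions made for illustration.

import torch
import torch.nn as nn

class TFAttention(nn.Module):
    """Hypothetical time-frequency (T-F) attention sketch.

    Pools the feature map along the time and frequency axes separately,
    scores each axis with a 1-D convolution, and combines the two scores
    by broadcasting into a 2-D attention map over the T-F plane.
    """

    def __init__(self, channels: int):
        super().__init__()
        # 1-D scoring convolutions for the frequency and time axes
        # (assumed kernel size of 1 for simplicity).
        self.freq_conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.time_conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        f_att = self.sigmoid(self.freq_conv(x.mean(dim=2)))  # (B, C, F)
        t_att = self.sigmoid(self.time_conv(x.mean(dim=3)))  # (B, C, T)
        # Broadcast product -> 2-D attention map over the T-F plane.
        att = t_att.unsqueeze(3) * f_att.unsqueeze(2)        # (B, C, T, F)
        return x * att

# Usage: rescale a batch of 16-channel spectrogram features.
tfa = TFAttention(channels=16)
features = torch.randn(4, 16, 100, 257)  # (batch, channels, frames, bins)
enhanced = tfa(features)
print(enhanced.shape)  # torch.Size([4, 16, 100, 257])

Factorizing the attention into separate time and frequency scores keeps the cost linear in T + F rather than T x F, which is one plausible reason such modules pair well with the multi-head self-attention layers mentioned in the abstract.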

Item Type: Article (Journal)
Additional Information: This is the work of Dr. Nasir Saleem, a postdoctoral researcher under Prof. Teddy and Prof. Mira, in collaboration with Telkom University lecturers (external international collaboration).
Uncontrolled Keywords: Neural speech enhancement, T-F attention, convolutional encoder-decoder, convolutional attention transformer, T-F masking.
Subjects: T Technology > TK Electrical engineering. Electronics Nuclear engineering > TK7800 Electronics. Computer engineering. Computer hardware. Photoelectronic devices > TK7885 Computer engineering
Kulliyyahs/Centres/Divisions/Institutes: Kulliyyah of Engineering
Kulliyyah of Engineering > Department of Electrical and Computer Engineering
Kulliyyah of Information and Communication Technology
Kulliyyah of Information and Communication Technology > Department of Information System
Depositing User: Prof. Dr. Teddy Surya Gunawan
Date Deposited: 17 Aug 2023 14:41
Last Modified: 17 Aug 2023 14:41
URI: http://irep.iium.edu.my/id/eprint/106019
