Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

Jian Hong Wang, Yen Ting Lai, Tzu Chiang Tai, Phuong Thi Le, Tuan Pham, Ze Yu Wang, Yung Hui Li, Jia Ching Wang, Pao Chi Chang

Research output: Contribution to journal › Article › peer-review


Recorded conversations often contain several people talking at once. Human listeners can filter out unwanted voices, but automatic speech recognition (ASR) systems struggle with overlapping speech, and their accuracy drops. Preprocessing mechanisms such as speech separation and target speaker extraction are therefore needed to isolate each person's speech. With the development of deep learning, the quality of separated speech has improved significantly. We focus on target speaker extraction, which comprises a primary system that extracts the speech and a secondary subsystem that delivers information about the target speaker. We chose a temporal convolutional network (TCN) architecture as the foundation of our speech extraction model. A TCN enables convolutional neural networks (CNNs) to handle time series modeling, and it can be constructed in various model lengths. Furthermore, we integrated attention enhancement into the secondary subsystem so that it provides the speech extraction model with comprehensive and effective target information, which improves the model's ability to estimate masks. With a more precise mask, the quality of the extracted target speech improves.
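As a rough illustration of how a TCN covers long time spans and how an estimated mask is applied, the following is a minimal sketch and not the authors' model. It assumes causal convolutions with dilations that double from block to block, a layout common in TCN-based separation models but not specified in the abstract. The function names `dilated_conv1d`, `receptive_field`, and `apply_mask` are hypothetical.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Causal dilated 1-D convolution (assumed layout, zero left-padding).

    Output has the same length as the input; sample t only sees
    signal[t], signal[t - dilation], signal[t - 2 * dilation], ...
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # positions before the start count as zeros
                acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, num_blocks, num_repeats):
    """Frames seen by one output of a stack of dilated convolutions.

    Assumes dilations 1, 2, 4, ..., 2 ** (num_blocks - 1), with the whole
    stack repeated num_repeats times. Changing num_blocks or num_repeats
    is one way to build the network at different model lengths.
    """
    per_repeat = sum((kernel_size - 1) * 2 ** b for b in range(num_blocks))
    return 1 + num_repeats * per_repeat


def apply_mask(mixture_frames, mask):
    """Element-wise masking: scale each mixture frame by its mask value."""
    return [m * x for m, x in zip(mask, mixture_frames)]
```

Under these assumptions, `receptive_field(3, 8, 3)` evaluates to 1531 frames, which shows how a few stacked repeats let a convolutional model capture long temporal context. In an actual extraction system the mask values would come from the network, conditioned on the target speaker information, rather than being supplied by hand.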

Original language: English
Article number: 307
Journal: Electronics (Switzerland)
Issue number: 2
State: Published - Jan 2024


  • automatic speech recognition (ASR)
  • convolutional neural network (CNN)
  • deep learning
  • target speaker extraction
  • temporal convolutional network (TCN)

