Speech Separation Using Augmented-Discrimination Learning on Squash-Norm Embedding Vector and Node Encoder

Ha Minh Tan, Kai Wen Liang, Yuan Shan Lee, Chung Ting Li, Yung-Hui Li, Jia Ching Wang

Research output: Contribution to journal › Article › peer-review

2 Scopus citations


Speech separation is employed in important applications such as automatic speech recognition, paralinguistics, hearing aids, and human-machine interaction. In recent years, deep neural networks have been widely used for speech and music separation, and breakthrough models based on embedding vectors, such as deep clustering, have been proposed. In this paper, we propose node encoder Squash-norm deep clustering (ESDC), an enhanced discriminative learning framework that combines a node encoder, the Squash-norm, and deep clustering (DC). First, a node encoder creates intermediate features. The node encoder is trained with a matrix factorization-based method for graph representations, yielding distinguishable intermediate features that play an important role in improving performance. These discriminative intermediate features are then used as input to the separation block. Finally, the decoder block constructs the estimation mask through clustering and reconstructs the estimated signal for each source. In particular, we apply a normalization function, the Squash-norm, to the input and output vectors to enhance the distinction between high-dimensional embedding vectors. This nonlinear function amplifies the differences among the input vectors, producing highly distinctive features whose scalar products reflect those differences. Applied to the output vectors, the Squash-norm likewise enhances their discrimination, improving the construction of the estimated mask by clustering. Overall, the proposed ESDC achieves gains of 1.27-2.09 dB SDR, 1.28-2.21 dB SDRi, and 1.3-2.44 dB SI-SNRi over the DC baseline separation performance across genders on the TSP and TIMIT datasets. For same-gender mixtures, our proposed ESDC achieves gains of 1.14-2.71 dB SDR, 0.99-2.74 dB SDRi, and 0.62-2.86 dB SI-SNRi over the DC baseline on the TIMIT dataset.
In all cases, the proposed ESDC model consistently maintains STOI and PESQ higher than the DC baselines on the TSP and TIMIT datasets.
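The squash-style normalization the abstract describes can be sketched briefly. The exact Squash-norm used by ESDC is not given here; the sketch below assumes the standard squash function from capsule networks, which shrinks short vectors toward zero and pushes long vectors toward unit length, amplifying differences between embedding vectors:

```python
import numpy as np

def squash(v, eps=1e-8):
    """Squash-style normalization (assumed form, as in capsule networks).

    Rescales v so its norm lies in [0, 1): short vectors are suppressed,
    long vectors approach unit length, sharpening the contrast between
    high-dimensional embedding vectors before clustering.
    """
    norm_sq = np.sum(v ** 2, axis=-1, keepdims=True)
    scale = norm_sq / (1.0 + norm_sq)          # in [0, 1)
    return scale * v / np.sqrt(norm_sq + eps)  # scaled unit direction

# Illustration on two hypothetical embedding vectors:
short = squash(np.array([0.1, 0.0]))   # norm ~0.0099, pushed toward 0
long_ = squash(np.array([10.0, 0.0]))  # norm ~0.990, pushed toward 1
```

Because the output norms are bounded, scalar products between squashed embeddings are dominated by direction rather than raw magnitude, which is consistent with the discrimination effect the abstract attributes to the Squash-norm.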

Original language: English
Pages (from-to): 102048-102063
Number of pages: 16
Journal: IEEE Access
State: Published - 2022


  • Speaker separation
  • deep clustering
  • monophonic source separation
  • speech enhancement
  • supervised speech separation
  • time frequency masking


