TY - JOUR
T1 - Speech Separation Using Augmented-Discrimination Learning on Squash-Norm Embedding Vector and Node Encoder
AU - Tan, Ha Minh
AU - Liang, Kai Wen
AU - Lee, Yuan Shan
AU - Li, Chung Ting
AU - Li, Yung Hui
AU - Wang, Jia Ching
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2022
Y1 - 2022
N2 - Speech separation has been employed in important applications such as automatic speech, paralinguistics, speech recognition, hearing aids, and human-machine interaction. In recent years, deep neural networks have been widely used for speech and music separation, and several successful models based on embedding vectors, such as deep clustering, have been proposed. In this paper, we propose node encoder Squash-norm deep clustering (ESDC), an enhanced discriminative learning framework that combines a node encoder, Squash-norm, and deep clustering (DC). First, a node encoder, developed through a matrix factorization-based learning method for graph representations, creates distinguishable intermediate features that play an important role in improving performance. These discriminative intermediate features are then used as input features for the separation block. Finally, the decoder block constructs the estimated mask through clustering and reconstructs the estimated signal for each source. In particular, we apply a normalization function, Squash-norm, to the input and output vectors to enhance the distinction between high-dimensional embedding vectors. This nonlinear function amplifies the differences in the input vectors, resulting in highly distinctive features, which are scalar products of the vectors. As with the input vector, Squash-norm also enhances the discrimination of the output vector, thereby improving the ability to construct an estimated mask by clustering the output vector. Overall, the proposed ESDC achieves gains of 1.27-2.09 dB SDR, 1.28-2.21 dB SDRi, and 1.3-2.44 dB SI-SNRi over the DC baseline across genders on the TSP and TIMIT datasets. In the same-gender case, ESDC achieves gains of 1.14-2.71 dB SDR, 0.99-2.74 dB SDRi, and 0.62-2.86 dB SI-SNRi over the DC baseline on the TIMIT dataset. In all cases, the proposed ESDC model consistently maintains higher STOI and PESQ than the DC baselines on the TSP and TIMIT datasets.
KW - Speaker separation
KW - deep clustering
KW - monophonic source separation
KW - speech enhancement
KW - supervised speech separation
KW - time frequency masking
UR - http://www.scopus.com/inward/record.url?scp=85134204046&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2022.3188712
DO - 10.1109/ACCESS.2022.3188712
M3 - Journal article
AN - SCOPUS:85134204046
SN - 2169-3536
VL - 10
SP - 102048
EP - 102063
JO - IEEE Access
JF - IEEE Access
ER -