Code-Switching Speech Synthesis Based on Self-Supervised Learning and Domain Adaptive Speaker Encoder

Yi Xing Lin, Cheng Hsun Pai, Phuong Thi Le, Bima Prihasto, Chien Ling Huang, Jia Ching Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Recently, end-to-end speech synthesis models based on deep learning have made great progress in speech quality and have gradually replaced traditional speech synthesis methods as the mainstream approach. However, it remains challenging for these methods to synthesize highly natural speech. To address this problem, we introduce self-supervised learning and frame-level domain adversarial training into a speaker encoder based on the speaker verification task, so that speaker vectors from different languages keep a consistent distribution in the speaker space and speech synthesis performance improves. In addition, we adopt a non-autoregressive speech synthesis model to address the unnatural speech rate that arises in cross-language speech synthesis. We first show that, on a mixed-language dataset combining LibriTTS and AISHELL3, the speaker encoder trained with self-supervised representations achieves a 4.968% absolute EER reduction on the speaker verification task compared with traditional MFCC features, indicating that self-supervised representations generalize better to domain-complex datasets. We then obtain MOS scores of 3.635 for speech naturalness and 3.675 for speaker similarity on the code-switching speech synthesis task. Our approach simplifies prior designs that rely on multiple monolingual encoders to model linguistic information, and adds frame-level domain adversarial training to optimize the speaker vectors in the speaker feature space, which facilitates the code-switching speech synthesis task.
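
For readers unfamiliar with the equal error rate (EER) reported above, the following is a minimal sketch of how the metric is commonly computed from speaker verification trial scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores are all assumptions made for illustration. An "absolute" EER reduction is the difference between two such values in percentage points.

```python
from typing import Sequence, Tuple


def equal_error_rate(
    genuine: Sequence[float], impostor: Sequence[float]
) -> Tuple[float, float]:
    """Return (eer, threshold) for similarity scores.

    Assumes higher scores mean "same speaker". This is a simple
    illustrative sweep, not an optimized or interpolated estimate.
    """
    if not genuine or not impostor:
        raise ValueError("need at least one genuine and one impostor score")

    best_gap, best_eer, best_thr = float("inf"), 1.0, 0.0
    # Every observed score is a candidate decision threshold.
    for thr in sorted(set(genuine) | set(impostor)):
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < thr for s in genuine) / len(genuine)
        # False acceptance: an impostor trial scored at or above it.
        far = sum(s >= thr for s in impostor) / len(impostor)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


# Toy scores (hypothetical, not from the paper).
eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

With these toy scores, one genuine trial and one impostor trial are misclassified at the selected threshold, so both error rates are 25%.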

Original language: English
Title of host publication: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728163277
State: Published - 2023
Event: 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 - Rhodes Island, Greece
Duration: 4 Jun 2023 to 10 Jun 2023

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2023-June
ISSN (Print): 1520-6149

Conference

Conference: 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Country/Territory: Greece
City: Rhodes Island
Period: 4/06/23 to 10/06/23

Keywords

  • Code Switching
  • Domain Adaptation
  • Self-Supervised Learning
  • Speech synthesis
