Abstract
In recent years, many deep learning (DL) models have been developed to improve the accuracy of speech emotion recognition (SER). However, because SER datasets are difficult and expensive to collect, they are generally small, so DL models are prone to overfitting and their performance is limited. In this paper, we introduce EMix, a novel data augmentation (DA) method for SER that is simple but effective. The method creates new data by mixing pairs of selected samples from the original data. The generated mixtures will be noisier or less ambiguous than their constituent samples. To verify the effectiveness of the proposed DA method, we develop a transformer-based network for the SER task and experiment on two public datasets, IEMOCAP and Crema-D. The experimental results demonstrate the superiority of EMix over other DA methods. In comparison with state-of-the-art methods, our approach shows competitive performance.
Original language | English |
---|---|
Journal | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
State | Published - 2023 |
Event | 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 - Rhodes Island, Greece; Duration: 4 Jun 2023 → 10 Jun 2023 |
Keywords
- data augmentation
- EMix
- speech emotion recognition