Automatic sound event recognition (SER) has recently attracted renewed interest. Although practical SER systems have many useful applications in everyday life, SER is challenging owing to the variations among sounds and noises in real-world environments. This paper presents a novel feature extraction and classification method for SER. An audio-visual descriptor, called the auditory-receptive-field binary pattern, is designed based on the spectrogram image feature, the cepstral features, and the human auditory receptive field model. The extracted features are then fed into a classifier to perform event classification. The proposed classifier, called the hierarchical-diving deep belief network, is a deep neural network system that hierarchically learns discriminative characteristics, from physical feature representations to abstract concepts. The performance of the proposed system was verified in several experiments under various conditions. On the RWCP dataset, the proposed system achieved a recognition rate of 99.27% for real-world sound data in 105 categories. The system is also robust under noisy conditions, achieving a recognition rate of 95.06% at a signal-to-noise ratio of 0 dB. On the TUT sound event dataset, the proposed system achieved error rates of 0.81 and 0.73 for sound event detection in home and residential area scenes. The experimental results reveal that the proposed system outperformed the other systems in this field.
|Number of pages||16|
|Journal||IEEE/ACM Transactions on Audio Speech and Language Processing|
|State||Published - Aug 2018|
- Auditory receptive fields binary patterns
- environmental sound
- hierarchical diving deep belief network
- spectrogram image feature