TY - JOUR
T1 - Sound Event Recognition Using Auditory-Receptive-Field Binary Pattern and Hierarchical-Diving Deep Belief Network
AU - Wang, Chien Yao
AU - Wang, Jia Ching
AU - Santoso, Andri
AU - Chiang, Chin Chin
AU - Wu, Chung Hsien
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2018/8
Y1 - 2018/8
N2 - Automatic sound event recognition (SER) has recently attracted renewed interest. Although practical SER system has many useful applications in everyday life, SER is challenging owing to the variations among sounds and noises in the real-world environment. This paper presents a novel feature extraction and classification method to solve the problem of SER. An audio-visual descriptor, called the auditory-receptive-field binary pattern, is designed based on the spectrogram image feature, the cepstral features, and the human auditory receptive field model. The extracted features are then fed into a classifier to perform event classification. The proposed classifier, called the hierarchical-diving deep belief network, is a deep neural network system that hierarchically learns the discriminative characteristics from physical feature representation to the abstract concept. The performance of our proposed system was verified using several experiments under various conditions. Using the RWCP dataset, the proposed system achieved a recognition rate of 99.27% for real-world sound data in 105 categories. Under noisy conditions, the developed system is very robust, with which it achieved 95.06% recognition rate with 0 dB signal-to-noise ratio. Using the TUT sound event dataset, the proposed system achieves error rates of 0.81 and 0.73 in sound event detection in home and residential area scenes. The experimental results reveal that the proposed system outperformed the other systems in this field.
AB - Automatic sound event recognition (SER) has recently attracted renewed interest. Although practical SER system has many useful applications in everyday life, SER is challenging owing to the variations among sounds and noises in the real-world environment. This paper presents a novel feature extraction and classification method to solve the problem of SER. An audio-visual descriptor, called the auditory-receptive-field binary pattern, is designed based on the spectrogram image feature, the cepstral features, and the human auditory receptive field model. The extracted features are then fed into a classifier to perform event classification. The proposed classifier, called the hierarchical-diving deep belief network, is a deep neural network system that hierarchically learns the discriminative characteristics from physical feature representation to the abstract concept. The performance of our proposed system was verified using several experiments under various conditions. Using the RWCP dataset, the proposed system achieved a recognition rate of 99.27% for real-world sound data in 105 categories. Under noisy conditions, the developed system is very robust, with which it achieved 95.06% recognition rate with 0 dB signal-to-noise ratio. Using the TUT sound event dataset, the proposed system achieves error rates of 0.81 and 0.73 in sound event detection in home and residential area scenes. The experimental results reveal that the proposed system outperformed the other systems in this field.
KW - Auditory receptive fields binary patterns
KW - environmental sound
KW - hierarchical diving deep belief network
KW - spectrogram image feature
UR - http://www.scopus.com/inward/record.url?scp=85042851663&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2017.2738443
DO - 10.1109/TASLP.2017.2738443
M3 - 期刊論文
AN - SCOPUS:85042851663
SN - 2329-9290
VL - 26
SP - 1336
EP - 1351
JO - IEEE/ACM Transactions on Audio Speech and Language Processing
JF - IEEE/ACM Transactions on Audio Speech and Language Processing
IS - 8
ER -