TY - JOUR
T1 - Human Action Recognition and Note Recognition
T2 - A Deep Learning Approach Using STA-GCN
AU - Enkhbat, Avirmed
AU - Shih, Timothy K.
AU - Cheewaprakobkit, Pimpa
N1 - Publisher Copyright:
© 2024 by the authors.
PY - 2024/4
Y1 - 2024/4
N2 - Human action recognition (HAR) is a growing area of machine learning with a wide range of applications. One challenging aspect of HAR is recognizing human actions during musical performance, a task further complicated by the need to recognize the musical notes being played. This paper proposes a deep learning-based method for simultaneous HAR and musical note recognition in music performances. We conducted experiments on performances of the Morin khuur, a traditional Mongolian instrument. The proposed method consists of two stages. In the first stage, we created a new dataset of Morin khuur performances, using motion capture systems and depth sensors to collect hand keypoints, instrument segmentation information, and detailed movement information. We then analyzed RGB images, depth images, and motion data to determine which type of data provides the most valuable features for recognizing actions and notes in music performances. The second stage utilizes a Spatial Temporal Attention Graph Convolutional Network (STA-GCN) to recognize musical notes as continuous gestures. The STA-GCN model is designed to learn the relationships between hand keypoints and instrument segmentation information, which are crucial for accurate recognition. Evaluation on our dataset demonstrates that our model outperforms the traditional ST-GCN model, achieving an accuracy of 81.4%.
AB - Human action recognition (HAR) is a growing area of machine learning with a wide range of applications. One challenging aspect of HAR is recognizing human actions during musical performance, a task further complicated by the need to recognize the musical notes being played. This paper proposes a deep learning-based method for simultaneous HAR and musical note recognition in music performances. We conducted experiments on performances of the Morin khuur, a traditional Mongolian instrument. The proposed method consists of two stages. In the first stage, we created a new dataset of Morin khuur performances, using motion capture systems and depth sensors to collect hand keypoints, instrument segmentation information, and detailed movement information. We then analyzed RGB images, depth images, and motion data to determine which type of data provides the most valuable features for recognizing actions and notes in music performances. The second stage utilizes a Spatial Temporal Attention Graph Convolutional Network (STA-GCN) to recognize musical notes as continuous gestures. The STA-GCN model is designed to learn the relationships between hand keypoints and instrument segmentation information, which are crucial for accurate recognition. Evaluation on our dataset demonstrates that our model outperforms the traditional ST-GCN model, achieving an accuracy of 81.4%.
KW - action recognition
KW - deep learning
KW - Morin khuur
KW - musical note recognition
KW - spatial temporal attention graph convolutional network (STA-GCN)
UR - http://www.scopus.com/inward/record.url?scp=85191391877&partnerID=8YFLogxK
U2 - 10.3390/s24082519
DO - 10.3390/s24082519
M3 - Journal article
C2 - 38676137
AN - SCOPUS:85191391877
SN - 1424-8220
VL - 24
JO - Sensors (Switzerland)
JF - Sensors (Switzerland)
IS - 8
M1 - 2519
ER -