TY - GEN
T1 - Fine-Tuning Vision Transformer for Arabic Sign Language Video Recognition on Augmented Small-Scale Dataset
AU - Gochoo, Munkhjargal
AU - Batnasan, Ganzorig
AU - Ahmed, Ahmed Abdelhadi
AU - Otgonbold, Munkh-Erdene
AU - Alnajjar, Fady
AU - Shih, Timothy K.
AU - Tan, Tan-Hsu
AU - Lai, Khin Wee
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - With the rise of AI, the recognition of Sign Language (SL) through sign-to-text has gained significance in the fields of computer vision and deep learning. However, only a few medium to large open datasets are available for this task, since collecting thousands of sign videos for words/phrases in different environments is time-consuming and tedious. Furthermore, there has been very little effort towards Arabic Sign Language Recognition (ArSLR). This paper presents the results of fine-tuning the Vision Transformer (ViT) model on a small-scale in-house ArSL dataset. The main goal is to attain satisfactory results using minimal computing power and a small dataset involving fewer than 10 individuals, with only one recording made for each sign in each environment. The dataset comprises 49 classes/signs, all performed with two hands and belonging to the Level I category in terms of popularity. To enhance the dataset, three types of augmentation were employed: translation, shear, and rotation. The ViT model, pre-trained on the Kinetics dataset, was trained on variants of the augmented dataset containing 2 to 40 augmented samples per original video, where the training set includes the original and augmented videos of 8 volunteers and the test set includes only the original videos of one held-out volunteer. Experimental results reveal that the combination of rotation and shear outperformed the other augmentations, achieving 93% accuracy on the dataset with 20 augmented samples per class per signer. We believe this study sheds light on SLR with small-scale datasets and on video/action recognition in general.
AB - With the rise of AI, the recognition of Sign Language (SL) through sign-to-text has gained significance in the fields of computer vision and deep learning. However, only a few medium to large open datasets are available for this task, since collecting thousands of sign videos for words/phrases in different environments is time-consuming and tedious. Furthermore, there has been very little effort towards Arabic Sign Language Recognition (ArSLR). This paper presents the results of fine-tuning the Vision Transformer (ViT) model on a small-scale in-house ArSL dataset. The main goal is to attain satisfactory results using minimal computing power and a small dataset involving fewer than 10 individuals, with only one recording made for each sign in each environment. The dataset comprises 49 classes/signs, all performed with two hands and belonging to the Level I category in terms of popularity. To enhance the dataset, three types of augmentation were employed: translation, shear, and rotation. The ViT model, pre-trained on the Kinetics dataset, was trained on variants of the augmented dataset containing 2 to 40 augmented samples per original video, where the training set includes the original and augmented videos of 8 volunteers and the test set includes only the original videos of one held-out volunteer. Experimental results reveal that the combination of rotation and shear outperformed the other augmentations, achieving 93% accuracy on the dataset with 20 augmented samples per class per signer. We believe this study sheds light on SLR with small-scale datasets and on video/action recognition in general.
KW - Arabic Sign Language
KW - Augmentation
KW - Deep Learning
KW - Small-scale dataset
KW - Vision Transformer
KW - ViT
UR - http://www.scopus.com/inward/record.url?scp=85187247949&partnerID=8YFLogxK
U2 - 10.1109/SMC53992.2023.10394501
DO - 10.1109/SMC53992.2023.10394501
M3 - Conference contribution
AN - SCOPUS:85187247949
T3 - Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
SP - 2880
EP - 2885
BT - 2023 IEEE International Conference on Systems, Man, and Cybernetics
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2023
Y2 - 1 October 2023 through 4 October 2023
ER -