TY - GEN
T1 - Deep Learning Based Vietnamese Diacritics Restoration
AU - Nga, Cao Hong
AU - Thinh, Nguyen Khai
AU - Chang, Pao Chi
AU - Wang, Jia Ching
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/12
Y1 - 2019/12
N2 - Diacritics are essential in diacritical languages because the meaning of a sentence can change depending on its diacritics. Writing without diacritics makes sentences ambiguous; however, people omit them for several reasons, such as fast typing, convenience, or texting on devices that do not support diacritics. As a result, such texts are very difficult to process in downstream natural language processing (NLP) tasks such as machine translation, sentiment analysis, or question answering. Diacritics restoration is therefore critical for further use or processing in NLP-related tasks. In this study, we propose a method that combines a convolutional neural network (CNN) and a bidirectional gated recurrent unit (Bi-GRU) to restore diacritics. In addition, we use residual blocks to address the vanishing gradient problem of recurrent neural networks. We applied the model to restoring diacritics in Vietnamese, which has the highest ratio of diacritics in words. The approach achieves a character accuracy of 98.63% and a word accuracy of 94.77%.
KW - convolutional neural network
KW - diacritics
KW - diacritics restoration
KW - neural networks
KW - recurrent neural network
UR - http://www.scopus.com/inward/record.url?scp=85078883758&partnerID=8YFLogxK
U2 - 10.1109/ISM46123.2019.00074
DO - 10.1109/ISM46123.2019.00074
M3 - Conference contribution
AN - SCOPUS:85078883758
T3 - Proceedings - 2019 IEEE International Symposium on Multimedia, ISM 2019
SP - 331
EP - 334
BT - Proceedings - 2019 IEEE International Symposium on Multimedia, ISM 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 21st IEEE International Symposium on Multimedia, ISM 2019
Y2 - 9 December 2019 through 11 December 2019
ER -