Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien

Sin En Lu, Bo Han Lu, Chao Yi Lu, Richard Tzong Han Tsai

Research output: Conference contribution, Conference paper, Peer-reviewed

3 Citations (Scopus)

Abstract

In natural language processing (NLP), code-mixing (CM) is a challenging task, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often suffer from scarce resources and lack an official writing system, which limits the development of dialect CM research. In this paper, we propose a method for constructing a Hokkien-Mandarin CM dataset that mitigates this limitation, overcomes the morphological issues within the Sino-Tibetan language family, and offers an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use the proposed dataset and transfer learning to train XLM (a cross-lingual language model) for translation tasks. To fit the code-mixing scenario, we adapt XLM slightly. We find that, by using linguistic knowledge, rules, and language tags, the model produces good results on CM data translation while maintaining monolingual translation quality.

Original language: English
Pages: 6316-6334
Number of pages: 19
Publication status: Published - 2022
Event: 2022 Findings of the Association for Computational Linguistics: EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: 7 Dec 2022 to 11 Dec 2022

Conference: 2022 Findings of the Association for Computational Linguistics: EMNLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 7 Dec 2022 to 11 Dec 2022
