BRCC and SentiBahasaRojak: The First Bahasa Rojak Corpus for Pretraining and Sentiment Analysis Dataset

Nanda Putri Romadhona, Sin En Lu, Bo Han Lu, Richard Tzong Han Tsai

Research output: Contribution to journal, Conference article, peer-reviewed

Abstract

Code-mixing refers to the mixed use of multiple languages. It is prevalent in multilingual societies and is also one of the most challenging natural language processing tasks. In this paper, we study Bahasa Rojak, a dialect popular in Malaysia that consists of English, Malay, and Chinese. Aiming to establish a model to deal with the code-mixing phenomena of Bahasa Rojak, we use data augmentation to automatically construct the first Bahasa Rojak corpus for pre-training language models, which we name the Bahasa Rojak Crawled Corpus (BRCC). We also develop a new pre-trained model called "Mixed XLM". The model can tag the language of the input token automatically to process code-mixing input. Finally, to test the effectiveness of the Mixed XLM model pre-trained on BRCC for social media scenarios where code-mixing is found frequently, we compile a new Bahasa Rojak sentiment analysis dataset, SentiBahasaRojak, with a Kappa value of 0.77.

Original language: English
Pages (from-to): 4418-4428
Number of pages: 11
Journal: Proceedings - International Conference on Computational Linguistics, COLING
Volume: 29
Issue number: 1
Publication status: Published - 2022
Event: 29th International Conference on Computational Linguistics, COLING 2022 - Gyeongju, Korea, Republic of
Duration: 12 Oct 2022 to 17 Oct 2022
