TY - JOUR
T1 - BRCC and SentiBahasaRojak
T2 - 29th International Conference on Computational Linguistics, COLING 2022
AU - Romadhona, Nanda Putri
AU - Lu, Sin En
AU - Lu, Bo Han
AU - Tsai, Richard Tzong Han
N1 - Publisher Copyright:
© 2022 Proceedings - International Conference on Computational Linguistics, COLING. All rights reserved.
PY - 2022
Y1 - 2022
N2 - Code-mixing refers to the mixed use of multiple languages. It is prevalent in multilingual societies and is also one of the most challenging natural language processing tasks. In this paper, we study Bahasa Rojak, a dialect popular in Malaysia that consists of English, Malay, and Chinese. Aiming to establish a model to deal with the code-mixing phenomena of Bahasa Rojak, we use data augmentation to automatically construct the first Bahasa Rojak corpus for pre-training language models, which we name the Bahasa Rojak Crawled Corpus (BRCC). We also develop a new pre-trained model called "Mixed XLM". The model can tag the language of the input token automatically to process code-mixing input. Finally, to test the effectiveness of the Mixed XLM model pre-trained on BRCC for social media scenarios where code-mixing is found frequently, we compile a new Bahasa Rojak sentiment analysis dataset, SentiBahasaRojak1, with a Kappa value of 0.77.
AB - Code-mixing refers to the mixed use of multiple languages. It is prevalent in multilingual societies and is also one of the most challenging natural language processing tasks. In this paper, we study Bahasa Rojak, a dialect popular in Malaysia that consists of English, Malay, and Chinese. Aiming to establish a model to deal with the code-mixing phenomena of Bahasa Rojak, we use data augmentation to automatically construct the first Bahasa Rojak corpus for pre-training language models, which we name the Bahasa Rojak Crawled Corpus (BRCC). We also develop a new pre-trained model called "Mixed XLM". The model can tag the language of the input token automatically to process code-mixing input. Finally, to test the effectiveness of the Mixed XLM model pre-trained on BRCC for social media scenarios where code-mixing is found frequently, we compile a new Bahasa Rojak sentiment analysis dataset, SentiBahasaRojak1, with a Kappa value of 0.77.
UR - http://www.scopus.com/inward/record.url?scp=85165726174&partnerID=8YFLogxK
M3 - 會議論文
AN - SCOPUS:85165726174
SN - 2951-2093
VL - 29
SP - 4418
EP - 4428
JO - Proceedings - International Conference on Computational Linguistics, COLING
JF - Proceedings - International Conference on Computational Linguistics, COLING
IS - 1
Y2 - 12 October 2022 through 17 October 2022
ER -