Contrastive approach towards text source classification based on top-bag-of-word similarity

Chu Ren Huang, Lung Hao Lee

Research output: Conference contribution, conference paper, peer-reviewed

32 citations (Scopus)

Abstract

This paper proposes a method to automatically classify texts from different varieties of the same language. We show that similarity measures are a robust tool for studying comparable corpora of language variations. We take LDC's Chinese Gigaword Corpus, composed of three varieties of Chinese from Mainland China, Singapore, and Taiwan, as the comparable corpora. Top-bag-of-word similarity measures reflect distances among the three varieties of the same language. A contrastive approach based on top-bag-of-word similarity was taken to solve the text source classification problem. Our results show that a contrastive approach, which uses similarity to rule out identity of source and arrives at the actual source by inference, is more robust than direct confirmation of source by similarity. We show that this approach is robust when applied to other texts.
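The abstract does not spell out the similarity measure or the elimination procedure, so the following Python fragment is only a minimal sketch under stated assumptions: a "top bag of words" is taken to be the set of the n most frequent tokens, similarity is taken to be Jaccard overlap between two such sets, and the contrastive step is reduced to excluding the least similar reference source. The function names (`top_bag`, `similarity`, `pairwise_similarity`, `rule_out`) and the toy token lists are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def top_bag(tokens, n):
    """Set of the n most frequent tokens (assumed reading of 'top bag of words')."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def similarity(bag_a, bag_b):
    """Jaccard overlap of two word sets (an assumed measure, not the paper's)."""
    union = bag_a | bag_b
    return len(bag_a & bag_b) / len(union) if union else 0.0


def pairwise_similarity(corpora, n):
    """Similarity between every pair of reference corpora."""
    bags = {name: top_bag(tokens, n) for name, tokens in corpora.items()}
    return {
        (a, b): similarity(bags[a], bags[b])
        for a, b in combinations(sorted(bags), 2)
    }


def rule_out(text_tokens, corpora, n):
    """Exclude the reference source least similar to the text.

    Returns the excluded source and the remaining candidates.
    """
    text_bag = top_bag(text_tokens, n)
    scores = {
        name: similarity(text_bag, top_bag(tokens, n))
        for name, tokens in corpora.items()
    }
    excluded = min(scores, key=scores.get)
    remaining = [name for name in scores if name != excluded]
    return excluded, remaining


# Toy, pre-segmented token lists standing in for the three varieties.
corpora = {
    "mainland": ["w1", "w1", "w2", "w2", "w3"],
    "singapore": ["w1", "w1", "w4", "w4", "w5"],
    "taiwan": ["w6", "w6", "w7", "w7", "w8"],
}
text = ["w1", "w2", "w2", "w1", "w9"]

print(pairwise_similarity(corpora, n=2))
print(rule_out(text, corpora, n=2))
```

Real Chinese text would first need word segmentation, which this sketch skips by starting from token lists. How the paper proceeds from the excluded source to the final inferred source is not described in the abstract, so the sketch stops at the exclusion step.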

Original language: English
Pages: 404-410
Number of pages: 7
Publication status: Published - 2008
Event: 22nd Pacific Asia Conference on Language, Information and Computation, PACLIC 22 - Cebu, Philippines
Duration: 20 Nov 2008 to 22 Nov 2008

Conference: 22nd Pacific Asia Conference on Language, Information and Computation, PACLIC 22
Country/Territory: Philippines
City: Cebu
Period: 20/11/08 to 22/11/08
