Exploiting unlabeled text to extract new words of different semantic transparency for Chinese word segmentation

Richard Tzong Han Tsai, Hsi Chuan Hung

Research output: Contribution to conference, Conference paper, peer-reviewed

Abstract

This paper exploits unlabeled text data to improve new word identification and Chinese word segmentation performance. Our contributions are twofold. First, for new words that lack semantic transparency, such as person names, location names, and transliterated names, we calculate association metrics of adjacent character segments on unlabeled data and encode this information as features. Second, we construct an internal dictionary by using an initial model to extract words from both the unlabeled training and test sets, which keeps dictionary coverage balanced between the two. Compared with a baseline model that uses only n-gram features, our approach increases new word recall by up to 6.0% and reduces segmentation errors by up to 32.3%. Our system achieves state-of-the-art performance on both the closed and open tasks of the 2006 SIGHAN bakeoff.
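The first contribution can be illustrated with a minimal sketch. The abstract does not say which association metrics are used, so pointwise mutual information (PMI) over adjacent single characters is assumed here purely for illustration. The function names `pmi_table` and `pmi_feature`, the bin thresholds, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def pmi_table(corpus):
    """Compute PMI for every adjacent character pair in unlabeled text.

    Assumption: PMI stands in for the paper's unspecified association metric.
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    table = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        table[pair] = math.log(p_xy / (p_x * p_y))
    return table


def pmi_feature(table, left, right, bins=(0.0, 1.0, 2.0)):
    """Discretise a PMI score into a categorical feature string.

    The bin edges are illustrative, not values from the paper.
    """
    score = table.get(left + right)
    if score is None:
        return "PMI=unseen"
    level = sum(score >= edge for edge in bins)
    return f"PMI={level}"


# Toy unlabeled corpus (hypothetical): a person name recurs in two contexts.
corpus = ["馬英九說", "馬英九到", "他說他到"]
table = pmi_table(corpus)
inside_name = pmi_feature(table, "馬", "英")    # pair inside the name
across_boundary = pmi_feature(table, "九", "說")  # pair across a word boundary
```

In this toy corpus the characters inside the recurring name are more strongly associated than the pair spanning the name boundary, which is the kind of signal a sequence labeler could use as a feature for words whose meaning is not predictable from their characters.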

Original language: English
Pages: 931-936
Number of pages: 6
Publication status: Published - 2008
Event: 3rd International Joint Conference on Natural Language Processing, IJCNLP 2008 - Hyderabad, India
Duration: 7 Jan 2008 to 12 Jan 2008

