Exploiting unlabeled text to extract new words of different semantic transparency for Chinese word segmentation

Richard Tzong Han Tsai, Hsi Chuan Hung

Research output: Contribution to conference › Paper › peer-review

Abstract

This paper exploits unlabeled text data to improve new word identification and Chinese word segmentation performance. Our contributions are twofold. First, for new words that lack semantic transparency, such as person, location, or transliteration names, we calculate association metrics of adjacent character segments on unlabeled data and encode this information as features. Second, we construct an internal dictionary by using an initial model to extract words from both the unlabeled training and test sets, which keeps dictionary coverage balanced across the two. Compared with a baseline model that uses only n-gram features, our approach increases new word recall by up to 6.0% and reduces segmentation errors by up to 32.3%. Our system achieves state-of-the-art performance on both the closed and open tasks of the 2006 SIGHAN bakeoff.
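The first contribution can be illustrated with a minimal sketch. The abstract does not say which association metrics are used, so pointwise mutual information (PMI) over adjacent single characters is an assumption here, as are the function names `pmi_table` and `pmi_feature`, the bin thresholds, and the toy corpus. The sketch counts characters and adjacent character pairs in raw text, scores each pair, and turns the score into a discrete feature string of the kind a sequence labeler could consume.

```python
import math
from collections import Counter


def pmi_table(corpus):
    """Score every adjacent character pair in unlabeled sentences with PMI.

    PMI is an assumed stand-in for the paper's unspecified association metrics.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    table = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        table[pair] = math.log(p_xy / (p_x * p_y))
    return table


def pmi_feature(table, left, right, bins=(0.0, 1.0, 2.0)):
    """Discretize the PMI of (left, right) into a feature string.

    The thresholds in `bins` are hypothetical and would need tuning.
    """
    score = table.get(left + right)
    if score is None:
        return "PMI=unseen"
    level = sum(score > b for b in bins)  # number of thresholds exceeded
    return f"PMI=bin{level}"


if __name__ == "__main__":
    # Toy unlabeled text: a person name recurs in different contexts.
    corpus = ["蔡英文說", "蔡英文來", "他說他來"]
    table = pmi_table(corpus)
    print(pmi_feature(table, "蔡", "英"))  # pair inside the name
    print(pmi_feature(table, "文", "說"))  # pair across the name boundary
```

In this toy corpus, the pair inside the recurring name scores higher than the pair that straddles its boundary, which is the signal such a feature is meant to expose for names with no internal semantic cues.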

Original language: English
Pages: 931-936
Number of pages: 6
State: Published - 2008
Event: 3rd International Joint Conference on Natural Language Processing, IJCNLP 2008 - Hyderabad, India
Duration: 7 Jan 2008 to 12 Jan 2008

Conference

Conference: 3rd International Joint Conference on Natural Language Processing, IJCNLP 2008
Country/Territory: India
City: Hyderabad
Period: 7/01/08 to 12/01/08
