Abstract
This paper exploits unlabeled text data to improve new word identification and Chinese word segmentation performance. Our contributions are twofold. First, for new words that lack semantic transparency, such as person names, location names, and transliterations, we calculate association metrics between adjacent character segments on unlabeled data and encode this information as features. Second, we construct an internal dictionary by using an initial model to extract words from both the unlabeled training and test sets, so that dictionary coverage is balanced across the two. Compared with a baseline model that uses only n-gram features, our approach increases new word recall by up to 6.0% and reduces segmentation errors by up to 32.3%. Our system achieves state-of-the-art performance on both the closed and open tasks of the 2006 SIGHAN bakeoff.
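To illustrate the first contribution, the following is a minimal sketch of one possible association metric computed over unlabeled text. The abstract does not name the metrics used, so pointwise mutual information (PMI) between adjacent characters is an assumption here, as are the function names `adjacent_pmi` and `pmi_feature` and the bucket thresholds.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Return the PMI of every adjacent character pair seen in the unlabeled lines.

    PMI is an assumed stand-in; the abstract only says "association metrics".
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in corpus:
        unigrams.update(line)                 # count single characters
        bigrams.update(zip(line, line[1:]))   # count adjacent character pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[a + b] = math.log2(p_ab / (p_a * p_b))
    return pmi


def pmi_feature(score, thresholds=(0.0, 2.0, 4.0)):
    """Bucket a PMI score into a discrete feature value.

    The thresholds are illustrative only and are not taken from the paper.
    """
    return sum(score >= t for t in thresholds)


if __name__ == "__main__":
    unlabeled = ["北京大学", "北京市", "大学生"]
    scores = adjacent_pmi(unlabeled)
    for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3), pmi_feature(score))
```

In this toy corpus, pairs that recur together, such as 北京, score higher than pairs that only straddle a word boundary, such as 京大. A segmenter could consume the bucketed value as one extra feature alongside its n-gram features.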
Original language | English |
---|---|
Pages | 931-936 |
Number of pages | 6 |
State | Published - 2008 |
Event | 3rd International Joint Conference on Natural Language Processing, IJCNLP 2008 - Hyderabad, India; Duration: 7 Jan 2008 → 12 Jan 2008 |
Conference
Conference | 3rd International Joint Conference on Natural Language Processing, IJCNLP 2008 |
---|---|
Country/Territory | India |
City | Hyderabad |
Period | 7/01/08 → 12/01/08 |