Abstract
This work presents several unsupervised feature selection methods based on frequent strings that help improve a conditional random fields (CRF) model for Chinese word segmentation (CWS). The features include character-based N-gram (CNG), Accessor Variety based string (AVS), and Term Contributed Frequency (TCF) with a specific manner of boundary overlapping. In the experiments, the baseline is the 6-tag scheme, a state-of-the-art labeling scheme for CRF-based CWS, and the data set is taken from the SIGHAN CWS bakeoff 2005. The experimental results show that all of these features improve the system's F1 measure (F) and Recall of Out-of-Vocabulary (ROOV). In particular, the feature collections that contain the AVS feature outperform other types of features in terms of F, whereas the feature collections containing TCB/TCF information have better ROOV.
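As a rough illustration of two ingredients named in the abstract, the sketch below labels a word with a 6-tag set and counts accessor variety for a candidate string. It is not the authors' implementation. The tag names (B, B2, B3, M, E, S), the function names `six_tag` and `accessor_variety`, and the choice to count each sentence boundary as a distinct accessor are assumptions drawn from common practice in the CWS literature, not from this record.

```python
def six_tag(word):
    """Label each character of one word with an assumed 6-tag set.

    Assumed tags: S for a single-character word; otherwise B, B2, B3 for the
    first three non-final characters, M for further interior characters,
    and E for the final character.
    """
    n = len(word)
    if n == 1:
        return ["S"]
    head = ["B", "B2", "B3"]
    # All characters except the last: B, B2, B3, then M for the rest.
    tags = [head[i] if i < 3 else "M" for i in range(n - 1)]
    return tags + ["E"]


def accessor_variety(corpus, s):
    """Return min(#distinct left neighbours, #distinct right neighbours) of s.

    corpus is a list of unsegmented sentences (plain strings).
    Assumption: a sentence start or end counts as a new, distinct accessor
    for every occurrence, so boundary markers carry the occurrence position.
    """
    left, right = set(), set()
    for sent_idx, sent in enumerate(corpus):
        start = sent.find(s)
        while start != -1:
            end = start + len(s)
            left.add(sent[start - 1] if start > 0 else ("<s>", sent_idx, start))
            right.add(sent[end] if end < len(sent) else ("</s>", sent_idx, start))
            # Continue from the next position so overlapping matches are seen.
            start = sent.find(s, start + 1)
    return min(len(left), len(right))


if __name__ == "__main__":
    print(six_tag("abcde"))
    print(accessor_variety(["我喜欢北京", "他去北京了", "北京很大"], "北京"))
```

A string with high accessor variety appears in many different contexts, which makes it a plausible word candidate; a feature of this kind could then be attached to each character alongside its tag when training the CRF.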
Original language | English |
---|---|
Pages | 109-122 |
Number of pages | 14 |
State | Published - 2011 |
Event | 23rd Conference on Computational Linguistics and Speech Processing, ROCLING 2011 - Taipei, Taiwan. Duration: 8 Sep 2011 → 9 Sep 2011 |
Conference
Conference | 23rd Conference on Computational Linguistics and Speech Processing, ROCLING 2011 |
---|---|
Country/Territory | Taiwan |
City | Taipei |
Period | 8/09/11 → 9/09/11 |
Keywords
- Conditional random fields
- Unsupervised feature selection
- Word segmentation