This work presents several unsupervised features based on frequent strings that help improve a conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based N-grams (CNG), Accessor Variety based strings (AVS), and Term Contributed Frequency (TCF) with a specific manner of boundary overlapping. In the experiments, the baseline is the 6-tag scheme, a state-of-the-art labeling scheme for CRF-based CWS, and the data set is taken from the SIGHAN CWS bakeoff 2005. The experimental results show that all of these features improve our system's F1 measure (F) and recall of out-of-vocabulary words (ROOV). In particular, feature collections that contain the AVS feature outperform other types of features in terms of F, whereas feature collections containing TCB/TCF information have better ROOV.
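To make the AVS idea concrete, the following is a minimal sketch of how an accessor variety score could be computed from raw text. It assumes the commonly used definition (the minimum of the number of distinct left-neighbor and right-neighbor characters of a string), which the abstract itself does not spell out. The function name `accessor_variety`, the `max_len` parameter, the boundary markers `<S>` and `</S>`, and the toy corpus are all hypothetical; this is not the authors' implementation.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Return {substring: AV} for all substrings of up to max_len characters.

    Assumption: AV(s) = min(#distinct left neighbors, #distinct right neighbors).
    Sentence boundaries are counted as one neighbor type each ("<S>", "</S>"),
    which is a simplification chosen for this sketch.
    """
    left = defaultdict(set)   # substring -> set of characters seen before it
    right = defaultdict(set)  # substring -> set of characters seen after it
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                left[s].add(sent[i - 1] if i > 0 else "<S>")
                right[s].add(sent[j] if j < n else "</S>")
    return {s: min(len(left[s]), len(right[s])) for s in left}


# Toy unlabeled corpus (hypothetical data, for illustration only).
corpus = ["我喜歡學習", "他喜歡學習中文", "我們學習中文"]
av = accessor_variety(corpus)
```

Strings that appear in more varied contexts receive a higher score, so such scores could be discretized and attached to each character as an extra CRF feature column. How the paper bins or combines these scores is not stated in the abstract.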
|Publication status||Published - 2011|
|Event||23rd Conference on Computational Linguistics and Speech Processing, ROCLING 2011 - Taipei, Taiwan|
Duration: 8 Sep 2011 → 9 Sep 2011