Abstract
This work presents several unsupervised feature selection approaches based on frequent strings that help improve a conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based N-gram (CNG), Accessor Variety based string (AVS), and Term Contributed Frequency (TCF) with a specific manner of boundary overlapping. In the experiments, the baseline is the 6-tag scheme, a state-of-the-art labeling scheme for CRF-based CWS, and the data set is taken from the SIGHAN CWS bakeoff 2005. The experimental results show that all of these features improve our system's F1 measure (F) and Recall of Out-of-Vocabulary words (ROOV). In particular, the feature collections that contain the AVS feature outperform other types of features in terms of F, whereas the feature collections containing TCB/TCF information have better ROOV.
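For readers unfamiliar with the 6-tag baseline named in the abstract, the following is a minimal sketch of one common formulation of such a scheme, in which each character of a segmented word receives a position tag. The tag names (`B`, `B2`, `B3`, `M`, `E`, `S`) and the helper names `six_tag_labels` and `label_sentence` are assumptions made for illustration; the abstract does not spell out the exact tag inventory, so this may differ from the labels used in the paper.

```python
def six_tag_labels(word):
    """Return one position tag per character of an already segmented word.

    Assumed tag set: S (single-character word), B / B2 / B3 (first, second,
    third character), M (any further middle character), E (last character).
    """
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["S"]
    # Up to three distinct leading tags, never covering the final character.
    head = ["B", "B2", "B3"][: n - 1]
    # Remaining characters between the leading tags and the final one.
    middle = ["M"] * (n - 1 - len(head))
    return head + middle + ["E"]


def label_sentence(words):
    """Flatten a segmented sentence into (character, tag) training pairs."""
    return [(ch, tag) for w in words for ch, tag in zip(w, six_tag_labels(w))]
```

A CRF trained on such pairs predicts a tag for every character, and word boundaries are recovered by cutting after each `E` or `S`.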
Original language | English |
---|---|
Pages | 109-122 |
Number of pages | 14 |
Publication status | Published - 2011 |
Event | 23rd Conference on Computational Linguistics and Speech Processing, ROCLING 2011 - Taipei, Taiwan. Duration: 8 Sep 2011 → 9 Sep 2011 |
Conference
Conference | 23rd Conference on Computational Linguistics and Speech Processing, ROCLING 2011 |
---|---|
Country/Territory | Taiwan |
City | Taipei |
Period | 8/09/11 → 9/09/11 |