Unsupervised overlapping feature selection for conditional random fields learning in Chinese word segmentation

Ting Hao Yang, Tian Jian Jiang, Chan Hung Kuo, Richard Tzong Han Tsai, Wen Lian Hsu

Research output: Contribution to conference › Paper › peer-review

5 Scopus citations

Abstract

This work presents several unsupervised feature selection methods, based on frequent strings, that help improve a conditional random fields (CRF) model for Chinese word segmentation (CWS). The features include character-based N-gram (CNG), Accessor Variety-based string (AVS), and Term Contributed Frequency (TCF), each applied with a specific manner of boundary overlapping. In the experiments, the baseline uses 6-tag, a state-of-the-art labeling scheme for CRF-based CWS, and the data set comes from the SIGHAN CWS bakeoff 2005. The results show that all of these features improve the system's F1 measure (F) and Recall of Out-of-Vocabulary (ROOV). In particular, feature collections that contain the AVS feature outperform other types of features in terms of F, whereas feature collections containing TCB/TCF information have better ROOV.
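As an illustration of the kind of frequent-string statistic the abstract refers to, the following is a minimal sketch of accessor variety, taken here as the smaller of the number of distinct left contexts and distinct right contexts of a string. It is not the authors' implementation: the function name `accessor_variety`, the `max_len` cutoff, and the choice to count every sentence boundary as its own distinct context are assumptions made for this example.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Return {substring: accessor variety} for substrings up to max_len characters.

    Assumption: each sentence start or end counts as a distinct context,
    so boundary markers are tagged with the sentence index.
    """
    left = defaultdict(set)   # distinct contexts seen immediately before a substring
    right = defaultdict(set)  # distinct contexts seen immediately after a substring
    for sent_id, sent in enumerate(sentences):
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                left[s].add(sent[i - 1] if i > 0 else ("<s>", sent_id))
                right[s].add(sent[j] if j < n else ("</s>", sent_id))
    # Accessor variety is the smaller of the two context counts.
    return {s: min(len(left[s]), len(right[s])) for s in left}


if __name__ == "__main__":
    av = accessor_variety(["abc", "xbc", "abd"])
    print(av["b"], av["a"])
```

A string with a high value appears in many different contexts, which is the intuition behind using it as a word-boundary cue; how such values are discretized into CRF features is not specified in the abstract and is left out of this sketch.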

Original language: English
Pages: 109-122
Number of pages: 14
State: Published - 2011
Event: 23rd Conference on Computational Linguistics and Speech Processing, ROCLING 2011 - Taipei, Taiwan
Duration: 8 Sep 2011 → 9 Sep 2011

Conference

Conference: 23rd Conference on Computational Linguistics and Speech Processing, ROCLING 2011
Country/Territory: Taiwan
City: Taipei
Period: 8/09/11 → 9/09/11

Keywords

  • Conditional random fields
  • Unsupervised feature selection
  • Word segmentation
