Chinese word segmentation with minimal linguistic knowledge: An improved conditional random fields coupled with character clustering and automatically discovered template matching

Richard Tzong Han Tsai, Hong Jie Dai, Hsieh Chuan Hung, Cheng Lung Sung, Min Yuh Day, Wen Lian Hsu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

This paper addresses three major problems in closed-task Chinese word segmentation (CWS): word overlap, tagging sentences interspersed with non-Chinese words, and long named entity (NE) identification. For the first, we use additional bigram features to approximate trigram and tetragram features. For the second, we first apply K-means clustering to identify non-Chinese characters, then employ a two-tagger architecture: one tagger for Chinese text and the other for non-Chinese text. For the third, we post-process our CWS output using automatically generated templates. Our results show, first, that additional bigrams can effectively identify more unknown words. Second, with our two-tagger method, segmentation performance on sentences containing non-Chinese words improves significantly when non-Chinese characters are sparse in the training corpus. Lastly, template-based post-processing also enhances identification of long NEs and long words. Using the corpora of the SIGHAN CWS closed task, our best system achieves F-scores of 0.956, 0.947, and 0.965 on the AS, HK, and MSR corpora respectively, compared to the best scores of 0.952, 0.943, and 0.964 in SIGHAN Bakeoff 2005. On AS, this performance is comparable to the best result (F=0.956) in the open task.
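The F-scores quoted in the abstract are word-level scores. As a minimal sketch of how such a score is commonly computed for segmentation output (not the official SIGHAN scoring script, whose details may differ), each word is mapped to its character span and the two span sets are compared. The function names `to_spans` and `segmentation_f_score` and the sample sentence are hypothetical, chosen only for illustration.

```python
def to_spans(words):
    """Map a list of words to the set of (start, end) character offsets they cover."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f_score(gold, predicted):
    """Return (precision, recall, F) for one segmented sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    gold_spans = to_spans(gold)
    pred_spans = to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Illustrative sentence (an assumption, not taken from the AS, HK, or MSR corpora).
gold = ["我們", "喜歡", "自然", "語言", "處理"]
predicted = ["我們", "喜歡", "自然語言", "處理"]
p, r, f = segmentation_f_score(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

In this example the predicted segmentation merges two gold words into one, so three of its four words are correct while three of the five gold words are recovered. A corpus-level score would accumulate the counts over all sentences before computing precision and recall.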

Original language: English
Title of host publication: Proceedings of the 2006 IEEE International Conference on Information Reuse and Integration, IRI-2006
Pages: 274-279
Number of pages: 6
Publication status: Published - 2006
Event: 2006 IEEE International Conference on Information Reuse and Integration, IRI-2006 - Waikoloa Village, HI, United States
Duration: 16 Sep 2006 to 18 Sep 2006

Publication series

Name: Proceedings of the 2006 IEEE International Conference on Information Reuse and Integration, IRI-2006

Conference

Conference: 2006 IEEE International Conference on Information Reuse and Integration, IRI-2006
Country/Territory: United States
City: Waikoloa Village, HI
Period: 16/09/06 to 18/09/06
