TY - GEN
T1 - Chinese word segmentation with minimal linguistic knowledge
AU - Tsai, Richard Tzong Han
AU - Dai, Hong Jie
AU - Hung, Hsieh Chuan
AU - Sung, Cheng Lung
AU - Day, Min Yuh
AU - Hsu, Wen Lian
PY - 2006
Y1 - 2006
N2 - This paper addresses three major problems in closed-task Chinese word segmentation (CWS): word overlap, tagging sentences interspersed with non-Chinese words, and long named entity (NE) identification. For the first, we use additional bigram features to approximate trigram and tetragram features. For the second, we first apply K-means clustering to identify non-Chinese characters, then employ a two-tagger architecture: one tagger for Chinese text and the other for non-Chinese text. For the third, we post-process our CWS output using automatically generated templates. Our results show, first, that the additional bigrams effectively identify more unknown words. Second, with our two-tagger method, segmentation performance on sentences containing non-Chinese words improves significantly when non-Chinese characters are sparse in the training corpus. Third, template-based post-processing improves identification of long NEs and long words. On the corpora of the SIGHAN CWS closed task, our best system achieves F-scores of 0.956, 0.947, and 0.965 on the AS, HK, and MSR corpora respectively, compared to the best contest scores of 0.952, 0.943, and 0.964 in SIGHAN Bakeoff 2005. On AS, this performance is comparable to the best result (F=0.956) in the open task.
AB - This paper addresses three major problems in closed-task Chinese word segmentation (CWS): word overlap, tagging sentences interspersed with non-Chinese words, and long named entity (NE) identification. For the first, we use additional bigram features to approximate trigram and tetragram features. For the second, we first apply K-means clustering to identify non-Chinese characters, then employ a two-tagger architecture: one tagger for Chinese text and the other for non-Chinese text. For the third, we post-process our CWS output using automatically generated templates. Our results show, first, that the additional bigrams effectively identify more unknown words. Second, with our two-tagger method, segmentation performance on sentences containing non-Chinese words improves significantly when non-Chinese characters are sparse in the training corpus. Third, template-based post-processing improves identification of long NEs and long words. On the corpora of the SIGHAN CWS closed task, our best system achieves F-scores of 0.956, 0.947, and 0.965 on the AS, HK, and MSR corpora respectively, compared to the best contest scores of 0.952, 0.943, and 0.964 in SIGHAN Bakeoff 2005. On AS, this performance is comparable to the best result (F=0.956) in the open task.
UR - http://www.scopus.com/inward/record.url?scp=34547442359&partnerID=8YFLogxK
U2 - 10.1109/IRI.2006.252425
DO - 10.1109/IRI.2006.252425
M3 - Conference contribution
AN - SCOPUS:34547442359
SN - 0780397886
SN - 9780780397880
T3 - Proceedings of the 2006 IEEE International Conference on Information Reuse and Integration, IRI-2006
SP - 274
EP - 279
BT - Proceedings of the 2006 IEEE International Conference on Information Reuse and Integration, IRI-2006
Y2 - 16 September 2006 through 18 September 2006
ER -