TY - JOUR
T1 - Chinese text segmentation
T2 - A hybrid approach using transductive learning and statistical association measures
AU - Tsai, Richard Tzong Han
PY - 2010/5
Y1 - 2010/5
N2 - Chinese text segmentation (CTS) is a fundamental step in building any Chinese or cross-language information retrieval system. This paper identifies and proposes solutions to two main challenges facing today's CTS systems: segmenting words longer than the context window and identifying words not derived from affixation or composition. Our methods exploit unlabeled data, making them scalable at little extra cost. To tackle the first problem, we use a transductive learning approach to automatically construct a dictionary, and then refine it by improving its test set coverage while reducing its over-fitting tendency. In addition, we incorporate frequency information to discriminate overlapping matching words. For the second problem, we employ statistical association measures non-parametrically through a natural but novel feature representation scheme. To demonstrate the generality of our approach, we verify our system on the most reputable CTS evaluation standard - the SIGHAN bakeoff, which contains datasets in both traditional and simplified Chinese. These datasets are provided by representative academic or industrial research institutes. The experimental results show that with only training data and unlabeled test data and with no external dictionaries, our approach effectively overcomes the above-mentioned problems and reduces segmentation errors by an average of 27.8% compared with the traditional approach. Notably, our approach improves the recall of new words, the most informative words, by 4.7% on average. Also, our approach outperforms the best SIGHAN CTS system, which requires many external resources. Additional analysis shows that our approach has the potential to gain accuracy as the test data increases.
KW - Association measure
KW - Chinese word segmentation
KW - Transductive learning
KW - Unlabeled data
UR - http://www.scopus.com/inward/record.url?scp=73249114922&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2009.10.004
DO - 10.1016/j.eswa.2009.10.004
M3 - Journal article
AN - SCOPUS:73249114922
SN - 0957-4174
VL - 37
SP - 3553
EP - 3560
JO - Expert Systems with Applications
JF - Expert Systems with Applications
IS - 5
ER -