Exploiting unlabeled internal data in conditional random fields to reduce word segmentation errors for Chinese texts

Richard Tzong Han Tsai, Hsi Chuan Hung, Hong Jie Dai, Wen Lian Hsu

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Text-to-speech (TTS) conversion has become widely used in recent years. Chinese TTS faces several unique difficulties, the most critical of which is the lack of word delimiters in written Chinese. Chinese word segmentation (CWS) must therefore be the first step in Chinese TTS. Unfortunately, because word boundaries in Chinese are ambiguous, even the best CWS systems make serious segmentation errors. The resulting misinterpretation of sentences causes TTS errors, which prevents wider use of TTS in applications such as automatic customer services or computer reader systems for the visually impaired. In this paper, we propose a novel method that exploits unlabeled internal data to reduce word segmentation errors without using external dictionaries. To demonstrate the generality of our method, we verify our system on the most widely recognized CWS evaluation, the SIGHAN bakeoff, which includes datasets in both traditional and simplified Chinese. These datasets are provided by four representative academic or industrial research institutes in Hong Kong, Taiwan, Mainland China, and the U.S. Our experimental results show that, with only internal data and unlabeled test data, our approach reduces segmentation errors by an average of 15% compared to the traditional approach. Moreover, our approach achieves performance comparable to that of the best CWS systems that use external resources. Further analysis shows that our method has the potential to become more accurate as the amount of test data increases.
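As an illustration of how CWS is typically posed for a conditional random field, the sketch below converts a segmented sentence into per-character labels and back. The abstract does not state which tag set the authors use, so the two-label B/I scheme and the helper names `words_to_tags` and `tags_to_words` are assumptions made for this example only, not the paper's implementation.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Label each character 'B' (begins a word) or 'I' (inside a word).

    Hypothetical helper: the B/I tag set is an assumption, not taken
    from the paper.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        tags.append("B")
        tags.extend(["I"] * (len(word) - 1))
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the word list from a character string and its B/I labels."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


if __name__ == "__main__":
    segmented = ["我", "喜歡", "台北"]
    tags = words_to_tags(segmented)
    print(tags)
    print(tags_to_words("".join(segmented), tags))
```

Under this framing, a sequence labeler such as a CRF predicts one tag per character, and a segmentation error corresponds to a misplaced word boundary in the decoded tag sequence.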

Original language: English
Title of host publication: International Speech Communication Association - 8th Annual Conference of the International Speech Communication Association, Interspeech 2007
Pages: 2944-2947
Number of pages: 4
Publication status: Published - 2007
Event: 8th Annual Conference of the International Speech Communication Association, Interspeech 2007 - Antwerp, Belgium
Duration: 27 Aug 2007 to 31 Aug 2007

Publication series

Name: International Speech Communication Association - 8th Annual Conference of the International Speech Communication Association, Interspeech 2007
4

Conference

Conference: 8th Annual Conference of the International Speech Communication Association, Interspeech 2007
Country/Territory: Belgium
City: Antwerp
Period: 27/08/07 to 31/08/07
