TY - GEN
T1 - A Tool for Building Web Named Entity Recognition Models Based on Search Results for Known Names
AU - Huang, Ya Yun
AU - Chang, Chia Hui
N1 - Publisher Copyright:
© Proceedings of the 27th Conference on Computational Linguistics and Speech Processing, ROCLING 2015.
PY - 2015/10/1
Y1 - 2015/10/1
AB - Named entity recognition (NER) is of vital importance in information extraction and natural language processing. Current NER models are trained mainly on journalistic documents such as news articles. Because they have not been trained on informal text, their performance drops on Web documents, which may lack sentence structure and contain colloquial expressions. As a result, state-of-the-art NER systems do not work well on Web documents, and users who want to recognize named entities in Web documents have to train a new model. Training a new model is labor intensive and time consuming. The preparatory work includes collecting a large set of training data, labeling named entities, selecting an appropriate segmentation method, unifying symbols, normalizing text, designing features, preparing dictionaries, and so on. Moreover, users need to repeat this work for each new language or entity type. In this research, we propose an NER model generation tool for effective Web entity extraction. The tool uses a semi-supervised learning approach that trains NER models via automatic labeling and tri-training, making use of unlabeled data and structured resources containing known named entities. Experiments confirm that the tool can be applied to different languages and to various types of named entities. For Chinese organization name extraction, the generated model achieves an F1 score of 86.1% on 38,692 sentences containing 16,241 distinct names, while the F1 scores for Japanese organization name extraction, English organization name extraction, Chinese location name extraction, Chinese address recognition and English address recognition reach 80.3%, 83.2%, 84.5%, 97.2% and 94.8%, respectively.
KW - Co-Training
KW - Named Entity Recognition
KW - Tri-Training
UR - http://www.scopus.com/inward/record.url?scp=85080411792&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85080411792
T3 - Proceedings of the 27th Conference on Computational Linguistics and Speech Processing, ROCLING 2015
SP - 148
EP - 163
BT - Proceedings of the 27th Conference on Computational Linguistics and Speech Processing, ROCLING 2015
A2 - Chen, Sin-Horng
A2 - Wang, Hsin-Min
A2 - Chien, Jen-Tzung
A2 - Kao, Hung-Yu
A2 - Chang, Wen-Whei
A2 - Wang, Yih-Ru
A2 - Wu, Shih-Hung
PB - The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)
T2 - 27th Conference on Computational Linguistics and Speech Processing, ROCLING 2015
Y2 - 1 October 2015 through 2 October 2015
ER -