TY - JOUR
T1 - Integrating linguistic knowledge into a conditional random field framework to identify biomedical named entities
AU - Tsai, Tzong Han
AU - Chou, Wen Chi
AU - Wu, Shih Hung
AU - Sung, Ting Yi
AU - Hsiang, Jieh
AU - Hsu, Wen Lian
N1 - Funding Information:
We would like to thank Sunita Sarawagi and Imran Mansuri for answering our questions about the CRF package. We are grateful for the support of the National Science Council under grant NSC94-2752-E-001-001.
PY - 2006/1
Y1 - 2006/1
AB - As new high-throughput technologies have created an explosion of biomedical literature, there arises a pressing need for automatic information extraction from the literature bank. To this end, biomedical named entity recognition (NER) from natural language text is indispensable. Current NER approaches are dictionary-based, rule-based, or machine-learning-based. Since there is no consolidated nomenclature for most biomedical NEs, any NER system relying on limited dictionaries or rules does not seem to perform satisfactorily. In this paper, we consider a machine learning model, the conditional random field (CRF), for the construction of our NER framework. CRF is a well-known model for solving other sequence tagging problems. In our framework, we do our best to utilize available resources including dictionaries, web corpora, and lexical analyzers, and represent them as linguistic features in the CRF model. In the experiment on the JNLPBA 2004 data, with minimal post-processing, our system achieves an F-score of 70.2%, which is better than most state-of-the-art systems. On the GENIA 3.02 corpus, our system achieves an F-score of 78.4% for protein names, which is 2.8% higher than the next-best system. In addition, we also examine the usefulness of each feature in our CRF model. Our experience could be valuable to other researchers working on machine-learning-based NER.
KW - Biomedical named entity recognition
KW - Conditional random fields
KW - Linguistic features
KW - Literature mining
UR - http://www.scopus.com/inward/record.url?scp=27844538955&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2005.09.072
DO - 10.1016/j.eswa.2005.09.072
M3 - Journal article
AN - SCOPUS:27844538955
VL - 30
SP - 117
EP - 128
JO - Expert Systems with Applications
JF - Expert Systems with Applications
SN - 0957-4174
IS - 1
ER -