Using contextual information to clarify cross-species gene normalization ambiguity

Research output: Contribution to journal › Article › peer-review

Abstract

The goal of Gene Normalization (GN) is to identify the unique database IDs of genes and proteins mentioned in biomedical literature. A major difficulty in GN comes from the ambiguity of gene names: the same gene name can refer to different database IDs depending on the species in question. In this paper, we introduce a method that exploits contextual information in an abstract, such as tissue type and chromosome location, to tackle this problem. Using this technique, we improved system performance (F-score) by 14.3% on the BioCreAtIvE-II GN task test set. We also evaluated our method on a full-text dataset with cross-species genes, where the experimental results show a promising performance (AUC) of 42.94%. Our experimental results also show that system performance was 12.24% higher with full text than with abstracts only.

Original language: English
Pages (from-to): 197-214
Number of pages: 18
Journal: International Journal of Software Engineering and Knowledge Engineering
Volume: 20
Issue number: 2
State: Published - Mar 2010

Keywords

  • Gene normalization
  • Natural language processing
  • Text mining
