Using contextual information to clarify cross-species gene normalization ambiguity

Research output: Contribution to journal › Article › peer-review


The goal of gene normalization (GN) is to map each gene or protein mentioned in the biomedical literature to its unique database ID. A major difficulty in GN is the ambiguity of gene names: the same name can refer to different database IDs depending on the species in question. In this paper, we introduce a method that exploits contextual information in an abstract, such as tissue type and chromosome location, to resolve this ambiguity. With this technique, system performance (F-score) on the BioCreAtIvE-II GN task test set improved by 14.3%. We also evaluated the method on a full-text dataset containing cross-species genes, where it achieved a promising performance (AUC) of 42.94%. The experiments further show that system performance was 12.24% higher with full text than with abstracts alone.
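As an illustration only, the idea of using surrounding context to choose among species-specific IDs for one gene name can be sketched as follows. This is not the method of the paper: the gene name `GENE1`, the IDs `DB:0001` and `DB:0002`, the cue words, and the function `normalize_mention` are all hypothetical, and plain cue-word overlap stands in for whatever contextual scoring the authors actually use.

```python
import re

# Hypothetical toy lexicon: one ambiguous gene name, two species-specific IDs.
# The cue words are illustrative assumptions, not taken from the paper.
LEXICON = {
    "GENE1": [
        {"id": "DB:0001", "species": "human", "cues": {"human", "patient", "patients"}},
        {"id": "DB:0002", "species": "mouse", "cues": {"mouse", "mice", "murine"}},
    ]
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def normalize_mention(mention, context, lexicon=LEXICON):
    """Return the candidate ID whose cue words overlap most with the context.

    Returns None when the mention is unknown, when no cue word appears,
    or when two candidates tie, i.e. when the context does not disambiguate.
    """
    candidates = lexicon.get(mention, [])
    if not candidates:
        return None
    tokens = tokenize(context)
    scored = [(len(tokens & c["cues"]), c["id"]) for c in candidates]
    best_score = max(score for score, _ in scored)
    winners = [cid for score, cid in scored if score == best_score]
    if best_score == 0 or len(winners) > 1:
        return None
    return winners[0]


if __name__ == "__main__":
    text = "GENE1 expression was reduced in murine liver tissue."
    print(normalize_mention("GENE1", text))
```

In this toy setting the word "murine" is the only cue present, so the mouse entry is selected; a context with no cue words, or with equally many cues for both species, is left unresolved rather than guessed.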

Original language: English
Pages (from-to): 197-214
Number of pages: 18
Journal: International Journal of Software Engineering and Knowledge Engineering
Issue number: 2
State: Published - Mar 2010


  • Gene normalization
  • Natural language processing
  • Text mining

