Background: This paper is concerned with the identification of biomedical abstracts related to protein-protein interactions. We propose a novel feature representation scheme, contextual-bag-of-words, to exploit protein name information.

Results: Our method outperforms well-known methods that use protein name information as additional features. We further improve performance by extracting reliable and informative instances from unlabeled and likely positive data to provide additional training data. We employ F-measure and the area under a receiver operating characteristic curve (AUC) to measure the classification and ranking abilities, respectively. Our final model achieves an F-measure of 80.34% and an AUC score of 88.06%, which are higher than those of the top-ranking system in BioCreAtIvE-II by 2.34% and 2.52%, respectively.

Conclusions: These results show the effectiveness of our contextual-bag-of-words scheme and suggest that our system could serve as an efficient preprocessing tool for modern PPI database curation.
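For readers unfamiliar with the two evaluation measures named in the abstract, the following is a minimal sketch of how F-measure (classification quality) and AUC (ranking quality) are commonly computed. It is not the authors' evaluation code: the function names `f_measure` and `auc`, the 0.5 decision threshold, and the toy labels and scores are all hypothetical and chosen only for illustration.

```python
def f_measure(y_true, y_pred):
    """Balanced F-measure: harmonic mean of precision and recall (labels are 0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: 1 = abstract is PPI-related, 0 = it is not.
y_true = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.2, 0.8]              # classifier confidence per abstract
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(f"F-measure: {f_measure(y_true, y_pred):.4f}")
print(f"AUC:       {auc(y_true, scores):.4f}")
```

The F-measure depends on the hard decisions obtained after thresholding, whereas the AUC depends only on how the scores order positive abstracts relative to negative ones, which is why the two numbers are reported separately.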
|Journal|CEUR Workshop Proceedings|
|State|Published - 2007|
|Event|2nd International Symposium on Languages in Biology and Medicine, LBM 2007 - Singapore, Singapore|
|Duration|6 Dec 2007 → 7 Dec 2007|