CiteSeerx: A scholarly big dataset

Cornelia Caragea, Jian Wu, Alina Ciobanu, Kyle Williams, Juan Fernández-Ramírez, Hung Hsuan Chen, Zhaohui Wu, Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

48 Scopus citations


The CiteSeerx digital library stores and indexes research articles in Computer Science and related fields. Although its main purpose is to make it easier for researchers to search for scientific information, CiteSeerx has proven to be a powerful resource in many data mining, machine learning and information retrieval applications that use rich metadata, e.g., titles, abstracts, authors, venues, and reference lists. The metadata extraction in CiteSeerx is done using automated techniques. Although fairly accurate, these techniques still result in noisy metadata. Since the performance of models trained on these data depends heavily on the quality of the data, we propose an approach to CiteSeerx metadata cleaning that incorporates information from an external data source. The result is a subset of CiteSeerx that is substantially cleaner than the entire set. Our goal is to make the new dataset available to the research community to facilitate future work in Information Retrieval.
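To make the idea of cleaning metadata against an external source concrete, here is a minimal sketch of title-based record linkage. It is an illustration only and not the authors' method: the abstract does not name the external source, the matching fields, or any similarity measure. The token-level Jaccard similarity, the 0.7 threshold, the function names (`normalize`, `jaccard`, `link_records`) and the toy records are all assumptions made for this example.

```python
import re


def normalize(title):
    """Lowercase a title and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_records(noisy, reference, threshold=0.7):
    """Match each noisy record to its most similar reference record.

    Returns a dict mapping noisy record id -> reference record, keeping only
    matches whose title similarity reaches the threshold. The threshold value
    is an assumption for illustration, not a figure from the paper.
    """
    ref_tokens = [(ref, normalize(ref["title"])) for ref in reference]
    linked = {}
    for rec in noisy:
        tokens = normalize(rec["title"])
        best_ref, best_score = None, 0.0
        for ref, rt in ref_tokens:
            score = jaccard(tokens, rt)
            if score > best_score:
                best_ref, best_score = ref, score
        if best_ref is not None and best_score >= threshold:
            linked[rec["id"]] = best_ref
    return linked


# Hypothetical toy data: automatically extracted (noisy) records and a
# cleaner external reference collection.
noisy_records = [
    {"id": "n1", "title": "CiteSeerx : A Scholarly Big Dataset 1"},
    {"id": "n2", "title": "Untitled technical report"},
]
reference_records = [
    {"title": "CiteSeerx: A scholarly big dataset", "year": 2014},
    {"title": "An unrelated paper on graph mining", "year": 2010},
]

# Only records that find a confident match survive into the cleaner subset.
clean_subset = link_records(noisy_records, reference_records)
```

In this sketch, records without a confident match are dropped rather than repaired, which mirrors the abstract's description of the result as a cleaner subset of the full collection. A practical pipeline would likely compare more fields than the title and would need blocking or indexing to avoid comparing every pair of records.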

Original language: English
Title of host publication: Advances in Information Retrieval - 36th European Conference on IR Research, ECIR 2014, Proceedings
Publisher: Springer Verlag
Number of pages: 12
ISBN (Print): 9783319060279
State: Published - 2014
Event: 36th European Conference on Information Retrieval, ECIR 2014 - Amsterdam, Netherlands
Duration: 13 Apr 2014 to 16 Apr 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8416 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 36th European Conference on Information Retrieval, ECIR 2014


  • Record Linkage
  • Scholarly Big Data

