IEPAD: Information extraction based on pattern discovery

Chia Hui Chang, Shao Chen Lui

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

307 Scopus citations

Abstract

Research in information extraction (IE) concerns the generation of wrappers that can extract particular information from semistructured Web documents. As in compiler generation, the extractor is actually a driver program that is accompanied by the generated extraction rule. Previous work in this field aims to learn extraction rules from users' training examples. In this paper, we propose IEPAD, a system that automatically discovers extraction rules from Web pages. The system can automatically identify record boundaries by repeated pattern mining and multiple sequence alignment. The discovery of repeated patterns is realized through a data structure called PAT trees. Additionally, repeated patterns are further extended by pattern alignment to comprehend all record instances. This new approach to IE involves neither human effort nor content-dependent heuristics. Experimental results show that the constructed extraction rules can achieve 97 percent extraction over fourteen popular search engines.
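The repeated-pattern idea in the abstract can be illustrated with a small sketch. It is not the IEPAD implementation: it assumes a page is encoded as a sequence of tag tokens, and it swaps the PAT tree for naive n-gram counting, which finds the same kind of repeats far less efficiently. The alignment step is omitted, and the names `tokenize` and `repeated_patterns` and the sample page are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(html):
    """Encode a page as tokens: one per tag, plus TEXT for each non-blank text run."""
    tokens = []
    for tag, text in re.findall(r"<\s*(/?\w+)[^>]*>|([^<]+)", html):
        if tag:
            tokens.append("<" + tag.upper() + ">")
        elif text.strip():
            tokens.append("TEXT")
    return tokens


def repeated_patterns(tokens, min_len=3, min_count=2):
    """Map each repeated token n-gram to its non-overlapping start positions.

    Assumption: plain n-gram counting stands in for the PAT tree here.
    """
    found = {}
    for n in range(min_len, len(tokens) // min_count + 1):
        positions = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            positions[tuple(tokens[i:i + n])].append(i)
        for pattern, starts in positions.items():
            kept = []
            for s in starts:
                # Skip occurrences that overlap the previously kept one.
                if not kept or s >= kept[-1] + n:
                    kept.append(s)
            if len(kept) >= min_count:
                found[pattern] = kept
    return found


# Hypothetical result page with three records.
page = ("<ul><li><b>A</b> one</li><li><b>B</b> two</li>"
        "<li><b>C</b> three</li></ul>")
tokens = tokenize(page)
patterns = repeated_patterns(tokens)
record = ("<LI>", "<B>", "TEXT", "</B>", "TEXT", "</LI>")
print(patterns[record])  # start position of each record in the token sequence
```

In this toy page, each start position of the `record` pattern marks a record boundary. Real pages contain records that differ slightly from one another, which is the case the abstract's alignment step addresses and this sketch does not.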

Original language: English
Title of host publication: Proceedings of the 10th International Conference on World Wide Web, WWW 2001
Publisher: Association for Computing Machinery, Inc
Pages: 681-688
Number of pages: 8
ISBN (Print): 1581133480, 9781581133486
State: Published - 1 Apr 2001
Event: 10th International Conference on World Wide Web, WWW 2001 - Hong Kong, Hong Kong
Duration: 1 May 2001 to 5 May 2001

Publication series

Name: Proceedings of the 10th International Conference on World Wide Web, WWW 2001

Conference

Conference: 10th International Conference on World Wide Web, WWW 2001
Country/Territory: Hong Kong
City: Hong Kong
Period: 1/05/01 to 5/05/01

Keywords

  • Extraction rule
  • Information extraction
  • Multiple string alignment
  • PAT tree
