Applying pattern mining to web information extraction

Chia Hui Chang, Shao Chen Lui, Yen Chin Wu

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

18 Scopus citations


Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction, such as WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to automatically generate extractors. However, this approach still requires human intervention to provide training examples. In this paper, we propose a novel approach to IE based on repeated pattern mining and multiple pattern alignment. Repeated patterns are discovered through a data structure called a PAT tree. In addition, incomplete patterns are further revised by pattern alignment so that they cover all pattern instances. This new approach to IE involves neither human effort nor content-dependent heuristics. Experimental results show that the constructed extraction rules can achieve 97 percent extraction over fourteen popular search engines.
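To make the repeated-pattern step concrete, the following is a minimal sketch and not the paper's implementation. It assumes a page has already been tokenized into a sequence of tag and text tokens, and it swaps in a naively sorted suffix list for the PAT tree; the function name `repeated_patterns` and its parameters are hypothetical. The pattern alignment step is not shown.

```python
from typing import Dict, List, Tuple


def repeated_patterns(
    tokens: List[str], min_len: int = 2, min_count: int = 2
) -> Dict[Tuple[str, ...], int]:
    """Return token n-grams (length >= min_len) that occur at least min_count times.

    Illustrative stand-in for PAT-tree mining: suffixes are sorted naively,
    and the common prefix of each adjacent pair yields a repeat candidate.
    """
    n = len(tokens)
    # Sort suffix start positions by the suffix each one denotes.
    order = sorted(range(n), key=lambda i: tokens[i:])

    def common_prefix_len(a: int, b: int) -> int:
        k = 0
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        return k

    # Adjacent suffixes in sorted order share the longest repeated prefixes.
    candidates = set()
    for a, b in zip(order, order[1:]):
        k = common_prefix_len(a, b)
        if k >= min_len:
            candidates.add(tuple(tokens[a:a + k]))

    # Count (possibly overlapping) occurrences of each candidate.
    counts: Dict[Tuple[str, ...], int] = {}
    for pat in candidates:
        m = len(pat)
        c = sum(1 for i in range(n - m + 1) if tuple(tokens[i:i + m]) == pat)
        if c >= min_count:
            counts[pat] = c
    return counts


# Hypothetical token stream: three result records with the same markup.
page = ["<li>", "<a>", "TEXT", "</a>", "</li>"] * 3
found = repeated_patterns(page, min_len=5)
record = tuple(page[:5])
print(found[record])  # number of times the five-token record pattern occurs
```

In this toy input the five-token record pattern is among the candidates, alongside longer repeats that span two records. A real system would still need to choose among such candidates and align incomplete patterns, which this sketch leaves out.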

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 5th Pacific-Asia Conference, PAKDD 2001, Proceedings
Editors: David Cheung, Graham J. Williams, Qing Li
Publisher: Springer Verlag
Number of pages: 12
ISBN (Print): 3540419101, 9783540419105
State: Published - 2001
Event: 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2001 - Kowloon, Hong Kong
Duration: 16 Apr 2001 to 18 Apr 2001

Publication series

Name: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science)
ISSN (Print): 0302-9743


Conference: 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2001
Country/Territory: Hong Kong


  • Information extraction
  • Multiple alignment
  • Pattern discovery
  • Semi-structured documents
  • Wrapper generation

