Efficient page-level data extraction via schema induction and verification

Chia Hui Chang, Tian Sheng Chen, Ming Chuan Chen, Jhung Li Ding

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

4 Scopus citations

Abstract

Page-level data extraction provides a complete solution for all kinds of information requirements; however, very little research focuses on this task because of the difficulty and complexity of the problem. On the other hand, previous page-level systems focus on how to achieve unsupervised data extraction and pay less attention to schema/wrapper generation and verification. In this paper, we emphasize the importance of schema verification for large-scale extraction tasks. Given a large number of web pages for data extraction, the system uses part of the input pages to train the schema without supervision, and then extracts data from the rest of the input pages through schema verification. To speed up processing, we use the leaf nodes of the DOM trees as the processing units and dynamically adjust the encoding for better alignment. The proposed system works better than other page-level extraction systems in terms of schema correctness and extraction efficiency. Overall, extraction is 2.7 times faster than state-of-the-art unsupervised approaches that extract data page by page without schema verification.
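
The two-phase idea in the abstract (induce a schema from part of the pages, then verify the remaining pages against it) can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes all training pages share exactly the same sequence of leaf tag paths, performs no alignment or dynamic encoding adjustment, and the names `LeafCollector`, `leaf_nodes`, `induce_schema`, and `verify_and_extract` are hypothetical helpers invented for illustration. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect (tag_path, text) pairs for the text leaves of a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.leaves = []  # (path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def leaf_nodes(html):
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


def induce_schema(training_pages):
    """Toy induction (assumption): a leaf is 'template' if its text is
    identical on every training page, otherwise it is a 'data' slot."""
    pages = [leaf_nodes(p) for p in training_pages]
    paths = [[path for path, _ in page] for page in pages]
    if any(p != paths[0] for p in paths):
        raise ValueError("toy sketch requires identical leaf paths")
    schema = []
    for i, path in enumerate(paths[0]):
        texts = {page[i][1] for page in pages}
        schema.append((path, "template" if len(texts) == 1 else "data"))
    return schema


def verify_and_extract(schema, html):
    """Return the data-slot values if the page matches the schema, else None."""
    leaves = leaf_nodes(html)
    if [p for p, _ in leaves] != [p for p, _ in schema]:
        return None  # verification failed; a real system would fall back
    return [text for (_, text), (_, kind) in zip(leaves, schema) if kind == "data"]


train = [
    "<html><body><h1>Title:</h1><p>Book A</p></body></html>",
    "<html><body><h1>Title:</h1><p>Book B</p></body></html>",
]
schema = induce_schema(train)
print(schema)
print(verify_and_extract(schema, "<html><body><h1>Title:</h1><p>Book C</p></body></html>"))
print(verify_and_extract(schema, "<html><body><div>Other</div></body></html>"))
```

In this sketch a page that passes verification is extracted with a single linear pass over its leaves. That is the intuition behind verifying against a schema instead of re-inducing structure on every page, although the paper's actual verification procedure is not described in the abstract.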

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 20th Pacific-Asia Conference, PAKDD 2016, Proceedings
Editors: James Bailey, Latifur Khan, Takashi Washio, Gillian Dobbie, Joshua Zhexue Huang, Ruili Wang
Publisher: Springer Verlag
Pages: 478-490
Number of pages: 13
ISBN (Print): 9783319317496
State: Published - 2016
Event: 20th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2016 - Auckland, New Zealand
Duration: 19 Apr 2016 to 22 Apr 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9652 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 20th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2016
Country/Territory: New Zealand
City: Auckland
Period: 19/04/16 to 22/04/16
