AFIS: Aligning detail-pages for full schema induction

Oviliani Yenty Yuliana, Chia Hui Chang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Web data extraction is an essential task for web data integration. Most research focuses on extracting data from list-pages by detecting the data-rich section and segmenting record boundaries. Detail-pages, however, contain all-inclusive product information on each page, so the number of data attributes that need to be aligned is much larger. In this paper, we formulate the data extraction problem as the alignment of leaf nodes from DOM trees and propose AFIS, Annotation-Free Induction of Full Schema for detail pages. AFIS applies Divide-and-Conquer and Longest Increasing Sequence (LIS) algorithms to mine landmarks from the input. Experiments show that AFIS outperforms RoadRunner, FivaTech and TEX (F1 0.990) on selected data. For full schema evaluation (all data), AFIS also achieves the highest average performance (F1 0.937) compared with TEX and RoadRunner.
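
The abstract only names the LIS step, so the following is a minimal sketch of how a longest-increasing-subsequence pass could select order-consistent landmarks between two pages. It is an illustration under stated assumptions, not the AFIS procedure itself: the helper names (`longest_increasing_subsequence`, `candidate_landmarks`), the representation of a page as a flat list of leaf-node texts, and the "occurs exactly once in both pages" criterion are all hypothetical choices made for this example.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence."""
    tails = []      # tails[k]: index of the smallest tail of a run of length k + 1
    tail_vals = []  # values at those indices, kept sorted for bisect
    prev = [-1] * len(values)
    for i, v in enumerate(values):
        k = bisect_left(tail_vals, v)
        if k == len(tails):
            tails.append(i)
            tail_vals.append(v)
        else:
            tails[k] = i
            tail_vals[k] = v
        prev[i] = tails[k - 1] if k > 0 else -1
    # Walk the predecessor links back from the end of the longest run.
    out = []
    i = tails[-1] if tails else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def candidate_landmarks(page_a, page_b):
    """Hypothetical landmark filter (an assumption, not the paper's definition).

    Keeps leaf texts that occur exactly once in both pages and whose
    relative order is the same in both, using LIS over positions in page_b.
    Returns (index_in_a, index_in_b, text) triples.
    """
    pos_b = {t: j for j, t in enumerate(page_b) if page_b.count(t) == 1}
    pairs = [(i, pos_b[t]) for i, t in enumerate(page_a)
             if page_a.count(t) == 1 and t in pos_b]
    keep = longest_increasing_subsequence([j for _, j in pairs])
    return [(pairs[k][0], pairs[k][1], page_a[pairs[k][0]]) for k in keep]


# Toy leaf-node sequences from two detail pages (made-up data).
page_a = ["Title:", "Foo", "Price:", "$3", "Author:", "Ann"]
page_b = ["Title:", "Bar", "Author:", "Bob", "Price:", "$5"]
landmarks = candidate_landmarks(page_a, page_b)
```

In this toy input, "Price:" and "Author:" appear in opposite orders on the two pages, so only one of them can survive alongside "Title:". The LIS pass keeps the largest set of shared labels whose order agrees across both pages.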

Original language: English
Title of host publication: TAAI 2016 - 2016 Conference on Technologies and Applications of Artificial Intelligence, Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 220-227
Number of pages: 8
ISBN (Electronic): 9781509057320
DOIs
State: Published - 16 Mar 2017
Event: 2016 Conference on Technologies and Applications of Artificial Intelligence, TAAI 2016 - Hsinchu, Taiwan
Duration: 25 Nov 2016 to 27 Nov 2016

Publication series

Name: TAAI 2016 - 2016 Conference on Technologies and Applications of Artificial Intelligence, Proceedings

Conference

Conference: 2016 Conference on Technologies and Applications of Artificial Intelligence, TAAI 2016
Country/Territory: Taiwan
City: Hsinchu
Period: 25/11/16 to 27/11/16

Keywords

  • detail-pages alignment
  • divide-conquer alignment
  • landmark equivalence class
  • semi-structured data
  • web data extraction
