TY - GEN
T1 - AFIS
T2 - 2016 Conference on Technologies and Applications of Artificial Intelligence, TAAI 2016
AU - Yuliana, Oviliani Yenty
AU - Chang, Chia Hui
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2017/3/16
Y1 - 2017/3/16
N2 - Web data extraction is an essential task for web data integration. Most research focuses on data extraction from list pages by detecting data-rich sections and segmenting record boundaries. However, detail pages contain all-inclusive product information on each page, so the number of data attributes that need to be aligned is much larger. In this paper, we formulate the data extraction problem as the alignment of leaf nodes from DOM trees. We propose AFIS, Annotation-Free Induction of Full Schema, for detail pages. AFIS applies divide-and-conquer and Longest Increasing Sequence (LIS) algorithms to mine landmarks from the input. The experiments show that AFIS outperforms RoadRunner, FivaTech and TEX (F1 0.990) in terms of selected data. For full schema evaluation (all data), AFIS also achieves the highest average performance (F1 0.937) compared with TEX and RoadRunner.
AB - Web data extraction is an essential task for web data integration. Most research focuses on data extraction from list pages by detecting data-rich sections and segmenting record boundaries. However, detail pages contain all-inclusive product information on each page, so the number of data attributes that need to be aligned is much larger. In this paper, we formulate the data extraction problem as the alignment of leaf nodes from DOM trees. We propose AFIS, Annotation-Free Induction of Full Schema, for detail pages. AFIS applies divide-and-conquer and Longest Increasing Sequence (LIS) algorithms to mine landmarks from the input. The experiments show that AFIS outperforms RoadRunner, FivaTech and TEX (F1 0.990) in terms of selected data. For full schema evaluation (all data), AFIS also achieves the highest average performance (F1 0.937) compared with TEX and RoadRunner.
KW - detail-pages alignment
KW - divide-conquer alignment
KW - landmark equivalence class
KW - semi-structured data
KW - web data extraction
UR - http://www.scopus.com/inward/record.url?scp=85017620948&partnerID=8YFLogxK
U2 - 10.1109/TAAI.2016.7880164
DO - 10.1109/TAAI.2016.7880164
M3 - Conference contribution
AN - SCOPUS:85017620948
T3 - TAAI 2016 - 2016 Conference on Technologies and Applications of Artificial Intelligence, Proceedings
SP - 220
EP - 227
BT - TAAI 2016 - 2016 Conference on Technologies and Applications of Artificial Intelligence, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 25 November 2016 through 27 November 2016
ER -