TY - JOUR
T1 - DCADE
T2 - divide and conquer alignment with dynamic encoding for full page data extraction
AU - Yuliana, Oviliani Yenty
AU - Chang, Chia Hui
N1 - Publisher Copyright: © 2019, Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2020/2/1
Y1 - 2020/2/1
N2 - In this paper, we consider the problem of full schema induction from either multiple list pages or singleton pages with the same template. Existing approaches do not work well for this problem because they use fixed abstraction schemes that are suitable for data-rich section detection but not for the small records and complex data found in other sections. We propose an unsupervised full-schema web data extraction method based on Divide-and-Conquer Alignment with Dynamic Encoding (DCADE for short). We define the Content Equivalence Class (CEC) and Typeset Equivalence Class (TEC) based on leaf node content. We then combine HTML attributes (i.e., id and class) in the paths for various levels of encoding, so that the proposed algorithm can align leaf nodes by exploring patterns at various levels from specific to general. We conducted experiments on 49 real-world websites used in TEX and ExAlg. The proposed DCADE achieved a 0.962 F1 measure for non-recordset data extraction (denoted by FD) and a 0.936 F1 measure for recordset data extraction (denoted by FS), outperforming other page-level web data extraction methods, namely DCA (FD = 0.660), TEX (FD = 0.454 and FS = 0.549), RoadRunner (FD = 0.396 and FS = 0.330), and UWIDE (FD = 0.260 and FS = 0.081).
AB - In this paper, we consider the problem of full schema induction from either multiple list pages or singleton pages with the same template. Existing approaches do not work well for this problem because they use fixed abstraction schemes that are suitable for data-rich section detection but not for the small records and complex data found in other sections. We propose an unsupervised full-schema web data extraction method based on Divide-and-Conquer Alignment with Dynamic Encoding (DCADE for short). We define the Content Equivalence Class (CEC) and Typeset Equivalence Class (TEC) based on leaf node content. We then combine HTML attributes (i.e., id and class) in the paths for various levels of encoding, so that the proposed algorithm can align leaf nodes by exploring patterns at various levels from specific to general. We conducted experiments on 49 real-world websites used in TEX and ExAlg. The proposed DCADE achieved a 0.962 F1 measure for non-recordset data extraction (denoted by FD) and a 0.936 F1 measure for recordset data extraction (denoted by FS), outperforming other page-level web data extraction methods, namely DCA (FD = 0.660), TEX (FD = 0.454 and FS = 0.549), RoadRunner (FD = 0.396 and FS = 0.330), and UWIDE (FD = 0.260 and FS = 0.081).
KW - Deep web data extraction
KW - Divide-conquer alignment
KW - Dynamic encoding
KW - Full-schema induction
KW - Multiple template pages
UR - http://www.scopus.com/inward/record.url?scp=85069650474&partnerID=8YFLogxK
U2 - 10.1007/s10489-019-01499-0
DO - 10.1007/s10489-019-01499-0
M3 - Journal article
AN - SCOPUS:85069650474
SN - 0924-669X
VL - 50
SP - 271
EP - 295
JO - Applied Intelligence
JF - Applied Intelligence
IS - 2
ER -