Page-level wrapper verification for unsupervised web data extraction

Chia Hui Chang, Yen Ling Lin, Kuan Chen Lin, Mohammed Kayed

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

4 Citations (Scopus)

Abstract

Unsupervised information extraction has been studied extensively over the past decade. However, little attention has been paid to maintaining the resulting wrappers. In this paper, we study the problem of wrapper construction and verification based on a given schema and template induced by an unsupervised page-level wrapper induction system. We model verification as a constraint satisfaction problem (CSP) that assigns labels to leaf nodes, subject to constraints specified by a finite state machine (FSM) constructed from the previously learned schema and template. If the CSP has no solution, i.e. no valid label sequence exists, the test page fails verification; otherwise, we rank all valid label sequences by measuring the fitness of each label sequence for extraction. We evaluate the FSM-based approach with XML validation in terms of false positive rate and false negative rate, and measure extraction performance through extraction accuracy. The experimental results show that the proposed method effectively filters invalid pages (zero false positive rate) and ranks the correct label sequence highest with 96.5% accuracy.
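The verification step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the FSM, the candidate labels per leaf node, and the fitness function (a plain sum of per-node scores) are all hypothetical stand-ins chosen only to show the idea. Each leaf node has a set of candidate labels; a label sequence is valid if the FSM accepts it; a page with no valid sequence fails verification; otherwise the valid sequences are ranked by fitness.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

# Hypothetical FSM: (state, label) -> next state. Not taken from the paper.
Transitions = Dict[Tuple[str, str], str]


def valid_label_sequences(
    candidates: Sequence[Set[str]],
    transitions: Transitions,
    start: str,
    accepting: Set[str],
) -> List[List[str]]:
    """Enumerate every label assignment to the leaf nodes that the FSM accepts."""
    results: List[List[str]] = []

    def extend(i: int, state: str, prefix: List[str]) -> None:
        if i == len(candidates):
            if state in accepting:
                results.append(list(prefix))
            return
        for label in sorted(candidates[i]):
            nxt = transitions.get((state, label))
            if nxt is not None:  # constraint satisfied, keep searching
                prefix.append(label)
                extend(i + 1, nxt, prefix)
                prefix.pop()

    extend(0, start, [])
    return results


def verify(
    candidates: Sequence[Set[str]],
    transitions: Transitions,
    start: str,
    accepting: Set[str],
    fitness: Callable[[List[str]], int],
) -> Optional[List[List[str]]]:
    """Return valid sequences ranked best-first, or None if the page fails."""
    sequences = valid_label_sequences(candidates, transitions, start, accepting)
    if not sequences:
        return None
    return sorted(sequences, key=fitness, reverse=True)


# Toy FSM (assumption): title, optional note, price, optional trailing notes.
TRANSITIONS: Transitions = {
    ("q0", "title"): "q1",
    ("q1", "note"): "q3",
    ("q1", "price"): "q2",
    ("q3", "price"): "q2",
    ("q2", "note"): "q2",
    ("q2", "title"): "q1",
}
ACCEPTING = {"q2"}

# Candidate labels for three leaf nodes, with made-up integer scores.
SCORES = [{"title": 10}, {"note": 2, "price": 8}, {"note": 7, "price": 3}]
CANDIDATES = [set(s) for s in SCORES]


def toy_fitness(sequence: List[str]) -> int:
    # Placeholder fitness: sum of per-node scores for the chosen labels.
    return sum(SCORES[i][label] for i, label in enumerate(sequence))


ranked = verify(CANDIDATES, TRANSITIONS, "q0", ACCEPTING, toy_fitness)
failed = verify([{"price"}, {"title"}], TRANSITIONS, "q0", ACCEPTING, toy_fitness)
```

In this toy setup, `ranked` holds the two sequences the FSM accepts with the higher-scoring one first, and `failed` is `None` because no label sequence starting with `price` is accepted. Exhaustive enumeration is used here only for clarity; it grows exponentially with the number of ambiguous leaf nodes.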

Original language: English
Title of host publication: Web Information Systems Engineering, WISE 2013 - 14th International Conference, Proceedings
Pages: 454-467
Number of pages: 14
Edition: PART 1
Publication status: Published - 2013
Event: 14th International Conference on Web Information Systems Engineering, WISE 2013 - Nanjing, China
Duration: 13 Oct 2013 to 15 Oct 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 1
Volume: 8180 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 14th International Conference on Web Information Systems Engineering, WISE 2013
Country/Territory: China
City: Nanjing
Period: 13/10/13 to 15/10/13
