TY - JOUR
T1 - Automatic information extraction from semi-structured Web pages by pattern discovery
AU - Chang, Chia Hui
AU - Hsu, Chun Nan
AU - Lui, Shao Cheng
N1 - Funding Information:
The research reported here was supported in part by the National Science Council of Taiwan under Grant No. 90-2213-E-008-042 and in part by DeepSpot Intelligent Systems, Taiwan under Contract No. 05A-880527-03C with Academia Sinica, Taiwan. We wish to thank the reviewers for their valuable comments.
PY - 2003/4
Y1 - 2003/4
N2 - The World Wide Web is now undeniably the richest and most dense source of information; yet, its structure makes it difficult to make use of that information in a systematic way. This paper proposes a pattern discovery approach to the rapid generation of information extractors that can extract structured data from semi-structured Web documents. Previous work in wrapper induction aims at learning extraction rules from user-labeled training examples, which, however, can be expensive in some practical applications. In this paper, we introduce IEPAD (an acronym for Information Extraction based on PAttern Discovery), a system that discovers extraction patterns from Web pages without user-labeled examples. IEPAD applies several pattern discovery techniques, including PAT-trees, multiple string alignments and pattern matching algorithms. Extractors generated by IEPAD can be generalized over unseen pages from the same Web data source. We empirically evaluate the performance of IEPAD on an information extraction task from 14 real Web data sources. Experimental results show that with the extraction rules discovered from a single page, IEPAD achieves 96% average retrieval rate, and with less than five example pages, IEPAD achieves 100% retrieval rate for 10 of the sample Web data sources.
AB - The World Wide Web is now undeniably the richest and most dense source of information; yet, its structure makes it difficult to make use of that information in a systematic way. This paper proposes a pattern discovery approach to the rapid generation of information extractors that can extract structured data from semi-structured Web documents. Previous work in wrapper induction aims at learning extraction rules from user-labeled training examples, which, however, can be expensive in some practical applications. In this paper, we introduce IEPAD (an acronym for Information Extraction based on PAttern Discovery), a system that discovers extraction patterns from Web pages without user-labeled examples. IEPAD applies several pattern discovery techniques, including PAT-trees, multiple string alignments and pattern matching algorithms. Extractors generated by IEPAD can be generalized over unseen pages from the same Web data source. We empirically evaluate the performance of IEPAD on an information extraction task from 14 real Web data sources. Experimental results show that with the extraction rules discovered from a single page, IEPAD achieves 96% average retrieval rate, and with less than five example pages, IEPAD achieves 100% retrieval rate for 10 of the sample Web data sources.
KW - Information extraction
KW - Multiple string alignment
KW - PAT trees
KW - Semi-structured data
KW - Wrapper generation
UR - http://www.scopus.com/inward/record.url?scp=0037375290&partnerID=8YFLogxK
U2 - 10.1016/S0167-9236(02)00100-8
DO - 10.1016/S0167-9236(02)00100-8
M3 - Journal article
AN - SCOPUS:0037375290
SN - 0167-9236
VL - 35
SP - 129
EP - 147
JO - Decision Support Systems
JF - Decision Support Systems
IS - 1
ER -