Applying pattern mining to web information extraction

Chia Hui Chang, Shao Chen Lui, Yen Chin Wu

Research output: Chapter in book/report/conference proceeding, conference contribution, peer-reviewed

18 citations (Scopus)

Abstract

Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work on wrapper induction, for example WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to generate extractors automatically. However, this approach still requires human intervention to provide training examples. In this paper, we propose a novel approach to IE based on repeated pattern mining and multiple pattern alignment. The discovery of repeated patterns is realized through a data structure called a PAT tree. In addition, incomplete patterns are further revised by pattern alignment to cover all pattern instances. This new approach to IE involves neither human effort nor content-dependent heuristics. Experimental results show that the constructed extraction rules can achieve 97 percent extraction over fourteen popular search engines.
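The repeated-pattern idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: a naive count of contiguous token n-grams stands in for the PAT tree, the pattern alignment step is omitted, and the function names `tag_tokens` and `repeated_patterns` as well as the tag-level token abstraction are assumptions made for illustration only.

```python
import re
from collections import Counter


def tag_tokens(html):
    """Abstract a page into a token sequence: tag names, plus 'TEXT' for text runs.

    Hypothetical helper; the encoding used in the paper may differ.
    """
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            # Keep only the tag name, e.g. "<a href='x'>" becomes "<a>".
            m = re.match(r"<\s*(/?\w+)", tag)
            if m:
                tokens.append("<" + m.group(1).lower() + ">")
        elif text.strip():
            tokens.append("TEXT")
    return tokens


def repeated_patterns(tokens, min_len=3, min_count=2):
    """Count contiguous token n-grams and keep those seen at least min_count times.

    Naive stand-in for PAT-tree-based discovery (assumption); overlapping
    occurrences are counted.
    """
    counts = Counter()
    n = len(tokens)
    for length in range(min_len, n):
        for i in range(n - length + 1):
            counts[tuple(tokens[i:i + length])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    page = (
        "<ul>"
        "<li><a href='x'>A</a></li>"
        "<li><a href='y'>B</a></li>"
        "<li><a href='z'>C</a></li>"
        "</ul>"
    )
    found = repeated_patterns(tag_tokens(page), min_len=5)
    for pattern, count in sorted(found.items(), key=lambda kv: -kv[1]):
        print(count, " ".join(pattern))
```

On a result list like the toy page above, the record template `<li> <a> TEXT </a> </li>` surfaces as a pattern that repeats once per listed item, which is the kind of regularity an extraction rule can be built from.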

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 5th Pacific-Asia Conference, PAKDD 2001, Proceedings
Editors: David Cheung, Graham J. Williams, Qing Li
Publisher: Springer Verlag
Pages: 4-15
Number of pages: 12
ISBN (Print): 3540419101, 9783540419105
Publication status: Published - 2001
Event: 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2001 - Kowloon, Hong Kong
Duration: 16 Apr 2001 to 18 Apr 2001

Publication series

Name: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science)
Volume: 2035
ISSN (Print): 0302-9743

Conference

Conference: 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2001
Country/Territory: Hong Kong
City: Kowloon
Period: 16/04/01 to 18/04/01
