A signal-representation-based parser to extract text-based information from the web

Mu Chun Su, Shao Jui Wang, Chen Ko Huang, Pa Chun Wang, Fu Hau Hsu, Shih Chieh Lin, Yi Zeng Hsieh

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Most of the dramatically increased amount of information available on the World Wide Web is provided via HTML and formatted for human browsing rather than for software programs. This situation calls for a tool that automatically extracts information from semistructured Web information sources, increasing the usefulness of value-added Web services. We present a signal-representation-based parser (SIRAP) that breaks Web pages up into logically coherent groups, for example groups of information related to a single entity. Templates for records with different tag structures are generated incrementally by a Histogram-Based Correlation Coefficient (HBCC) algorithm; records on a Web page are then detected efficiently by matching against the generated templates. Hundreds of Web pages from 17 state-of-the-art search engines were used to demonstrate the feasibility of our approach.

Keywords: information extraction, wrapper, parser, Web, template matching.
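
The abstract does not spell out how HBCC is computed, so the following is only a minimal sketch of the general idea of incremental template generation by histogram correlation, not the authors' implementation. It assumes that each record is reduced to a histogram of its opening HTML tag names and that two histograms are compared with the Pearson correlation coefficient. The function names (tag_histogram, correlation, assign_template) and the 0.8 threshold are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter

# Matches opening tags only; "</div>" is skipped because "/" follows "<".
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")


def tag_histogram(html):
    """Count how often each opening tag name occurs in an HTML fragment."""
    return Counter(name.lower() for name in TAG_RE.findall(html))


def correlation(h1, h2):
    """Pearson correlation of two tag histograms over the union of their tags."""
    if dict(h1) == dict(h2):
        return 1.0  # identical histograms (also covers the zero-variance case)
    tags = sorted(set(h1) | set(h2))
    x = [h1.get(t, 0) for t in tags]
    y = [h2.get(t, 0) for t in tags]
    n = len(tags)
    if n == 0:
        return 0.0
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined; treat as no match
    return cov / math.sqrt(var_x * var_y)


def assign_template(record_html, templates, threshold=0.8):
    """Return the index of the first matching template, adding one if none match.

    The threshold is an assumed value, not one taken from the paper.
    """
    hist = tag_histogram(record_html)
    for index, template in enumerate(templates):
        if correlation(hist, template) >= threshold:
            return index
    templates.append(hist)  # incremental step: an unseen tag structure
    return len(templates) - 1


rec_a = '<div><a href="x">T</a><span>s</span><span>u</span><br></div>'
rec_b = '<div><a href="y">U</a><span>v</span><span>w</span><br></div>'
rec_c = '<tr><td>1</td><td>2</td><td>3</td></tr>'

templates = []
labels = [assign_template(r, templates) for r in (rec_a, rec_b, rec_c)]
```

In this sketch rec_a and rec_b share a tag structure and fall under the same template, while rec_c has a different structure and triggers a new one. A real parser would also need to locate record boundaries on the page, which this example takes as given.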

Original language: English
Pages (from-to): 531-539
Number of pages: 9
Journal: Journal of Advanced Computational Intelligence and Intelligent Informatics
Volume: 14
Issue number: 5
Publication status: Published - Jul 2010
