A signal-representation-based parser to extract text-based information from the web

Mu Chun Su, Shao Jui Wang, Chen Ko Huang, Pa Chun Wang, Fu Hau Hsu, Shih Chieh Lin, Yi Zeng Hsieh

Research output: Contribution to journal › Article › peer-review

Abstract

Most of the dramatically increasing volume of information on the World Wide Web is provided as HTML and formatted for human browsing rather than for software programs. This situation calls for a tool that automatically extracts information from semistructured Web sources, increasing the usefulness of value-added Web services. We present a signal-representation-based parser (SIRAP) that breaks Web pages up into logically coherent groups, for example, groups of information related to a single entity. Templates for records with different tag structures are generated incrementally by a Histogram-Based Correlation Coefficient (HBCC) algorithm; records on a Web page are then detected efficiently by matching against the generated templates. Hundreds of Web pages from 17 state-of-the-art search engines were used to demonstrate the feasibility of our approach.

Keywords: information extraction, wrapper, parser, Web, template matching

Original language: English
Pages (from-to): 531-539
Number of pages: 9
Journal: Journal of Advanced Computational Intelligence and Intelligent Informatics
Volume: 14
Issue number: 5
State: Published - Jul 2010
