A template independent method for large online news content extraction

Yu Chieh Wu, Jie Chi Yang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Online news provides a convenient way for users to read the latest news. Building an online news corpus is important for many text mining and data mining tasks. Creating web news data requires constructing a set of HTML parsing rules to identify the content text. When a website changes its layout style, the parsing rules (the wrapper) must be reconstructed. In this paper, we address this issue and propose a news content recognition algorithm that is portable across languages and domains. Our method first scans the entire HTML document and detects a set of candidate blocks. Second, a weighted scoring function that combines stopword language models and HTML penalty functions ranks the importance of each candidate. We then check the score of the highest-ranked block against a predefined threshold value. To validate the approach, we conduct experiments using 533 online news HTML files from 24 websites. The empirical study shows that our method achieves a macro F-measure of approximately 95% in recognizing news content.
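
The pipeline described in the abstract (detect candidate blocks, score each one, then apply a threshold to the top-ranked block) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the scoring formula, so the stopword ratio stands in for the stopword language model and the link density stands in for the HTML penalty function. The names `STOPWORDS`, `BLOCK_TAGS`, `BlockCollector`, `score_block`, `extract_content`, and the default threshold of 0.5 are all assumptions made for illustration.

```python
import math
import re
from html.parser import HTMLParser

# Assumption: a tiny English stopword list standing in for the paper's
# stopword language model, whose details the abstract does not give.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that",
             "for", "on", "with", "as", "was", "by", "at", "it"}
# Assumption: which tags open a candidate block is our choice, not the paper's.
BLOCK_TAGS = {"div", "td", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects text and link-text length for each candidate block."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of blocks that are still open
        self.blocks = []       # completed candidate blocks
        self.in_link = 0
        self.in_skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.open_blocks.append({"text": [], "link_chars": 0})
        elif tag == "a":
            self.in_link += 1
        elif tag in SKIP_TAGS:
            self.in_skip += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.open_blocks:
            self.blocks.append(self.open_blocks.pop())
        elif tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in SKIP_TAGS and self.in_skip:
            self.in_skip -= 1

    def handle_data(self, data):
        if self.in_skip or not self.open_blocks:
            return
        block = self.open_blocks[-1]  # credit text to the innermost block
        block["text"].append(data)
        if self.in_link:
            block["link_chars"] += len(data.strip())


def score_block(block):
    """Hypothetical score: stopword ratio * log(length) * (1 - link density)."""
    text = " ".join(" ".join(block["text"]).split())
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0, text
    stop_ratio = sum(w in STOPWORDS for w in words) / len(words)
    link_density = min(1.0, block["link_chars"] / max(1, len(text)))
    return stop_ratio * math.log(1 + len(words)) * (1.0 - link_density), text


def extract_content(html, threshold=0.5):
    """Return the text of the best-scoring block, or None if below threshold."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    scored = [score_block(b) for b in parser.blocks]
    if not scored:
        return None
    best_score, best_text = max(scored, key=lambda pair: pair[0])
    return best_text if best_score >= threshold else None
```

Calling `extract_content(page_html)` on a page with a link-heavy navigation block and a prose-heavy story block should return the story text, because navigation text contains few stopwords and a high share of anchor text. A page with no block above the threshold yields `None`, which mirrors the threshold check mentioned in the abstract.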

Original language: English
Title of host publication: Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAI-AAI 2012
Pages: 254-257
Number of pages: 4
State: Published - 2012
Event: 1st IIAI International Conference on Advanced Applied Informatics, IIAI-AAI 2012 - Fukuoka, Japan
Duration: 20 Sep 2012 - 22 Sep 2012

Publication series

Name: Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAI-AAI 2012

Conference

Conference: 1st IIAI International Conference on Advanced Applied Informatics, IIAI-AAI 2012
Country/Territory: Japan
City: Fukuoka
Period: 20/09/12 - 22/09/12

Keywords

  • Content text recognition
  • Information extraction
  • Language model
  • Text corpus construction
  • Text mining
