Automatic extraction of blog post from diverse blog pages

Chia Hui Chang, Jhih Ming Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Blog post extraction is essential for research on the blogosphere. In this paper, we address the problem of extracting blog posts from diverse blog pages, with the aim of automatically and precisely locating each blog post. Most previous research has focused on extracting the main content from news pages, but the problem becomes more complex for blog pages. Our research combines maximum scoring subsequence with text-to-tag ratio to develop algorithms suited to blog pages. The first method we propose is PTR Scoring, which combines post-to-tag ratio with maximum scoring subsequence. The second method is CRF Scoring, which uses a Conditional Random Field to train a sequence labeling model and then applies maximum scoring subsequence to improve extraction accuracy. The experimental results show that CRF Scoring achieves the best F-measure, 91.9%, among the compared methods.
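The abstract names two building blocks, text-to-tag ratio and maximum scoring subsequence, without defining them. The following is a minimal sketch of how the two ideas can be combined in general, not the authors' PTR Scoring or CRF Scoring. It assumes a per-line ratio of visible text length to tag count, an arbitrary `threshold` that turns ratios into positive or negative scores, and a contiguous maximum-sum run (Kadane's algorithm) as the "maximum scoring subsequence". All function names and the threshold value are hypothetical.

```python
import re

# Crude tag matcher; an assumption made for illustration, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line):
    """Visible text length divided by the number of tags on the line (at least 1)."""
    tag_count = len(TAG_RE.findall(line))
    text = TAG_RE.sub("", line).strip()
    return len(text) / max(tag_count, 1)


def max_scoring_subsequence(scores):
    """Return (start, end, total) of the contiguous run with the largest sum.

    `end` is exclusive. If no score is positive, the empty run (0, 0, 0.0)
    is returned.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive prefix cannot help; restart the run here.
            cur_sum, cur_start = score, i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end, best_sum


def extract_main_block(html_lines, threshold=10.0):
    """Score each line as ratio minus threshold, then keep the best-scoring run."""
    scores = [text_to_tag_ratio(line) - threshold for line in html_lines]
    start, end, _ = max_scoring_subsequence(scores)
    return html_lines[start:end]
```

For example, given a page split into lines where navigation markup surrounds two text-heavy lines, `extract_main_block` would keep only the text-heavy run, because tag-dense lines receive negative scores and are excluded from the maximum-sum run.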

Original language: English
Title of host publication: Proceedings - 2012 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2012
Pages: 129-136
Number of pages: 8
State: Published - 2012
Event: 2012 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2012 - Macau, China
Duration: 4 Dec 2012 → 7 Dec 2012

Publication series

Name: Proceedings - 2012 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2012

Conference

Conference: 2012 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2012
Country/Territory: China
City: Macau
Period: 4/12/12 → 7/12/12

Keywords

  • blog post extraction
  • maximum sequence
  • sequence labeling
