Web text clustering with dynamic themes

Ping Ju Hung, Ping Yu Hsu, Ming Shien Cheng, Chih Hao Wen

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Data mining research has produced many techniques for filtering useful information out of large volumes of data, and document clustering is one of the most important. There are two approaches to document clustering: one clusters on document metadata, and the other clusters on document content. Most previous content-based approaches focused on document summaries (of single or multiple files) and on word-vector analysis, identifying a small number of important keywords with which to cluster the documents. In this study, we categorize hot commodities on the web and then name the categories, based on the web text (abstracts) of these commodities and the number of times they are accessed. First, we parse the Chinese web text of the hot-commodity documents and apply the hierarchical agglomerative clustering algorithm (Ward's method) to group words into themes according to their properties and to decide the number s of themes. Second, we adopt the Cross-Collection Mixture Model, which has been applied in Temporal Text Mining, together with the access counts (the degree to which users identify with the words) to collect dynamic themes, and then gather stable words by probability distribution to serve as the vectors for document clustering. Third, we estimate the parameters with the Expectation Maximization (EM) algorithm. Finally, we apply K-means with the extracted dynamic themes as the features for document clustering. This study proposes a novel approach to document clustering, and a series of experiments indicates that the algorithm is effective and can improve the accuracy of clustering results.
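The final step described in the abstract, K-means over documents represented by theme features, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy theme weights, the `kmeans` helper, and the farthest-first initialisation are not taken from the paper, and the theme extraction itself (Ward's method, the Cross-Collection Mixture Model, and EM) is not reproduced here.

```python
# Illustrative sketch only: the toy vectors and helper names are assumptions,
# not taken from the paper.
from math import dist


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means with deterministic farthest-first initialisation."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical documents, each described by its weight on three themes.
doc_theme_weights = [
    (0.90, 0.05, 0.05),
    (0.80, 0.10, 0.10),
    (0.10, 0.85, 0.05),
    (0.05, 0.90, 0.05),
    (0.10, 0.10, 0.80),
    (0.05, 0.15, 0.80),
]

labels, centroids = kmeans(doc_theme_weights, k=3)
print(labels)
```

In this toy input, documents that share a dominant theme end up with the same cluster label. In practice the feature vectors would come from the extracted dynamic themes rather than hand-written weights.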

Original language: English
Title of host publication: Web Information Systems and Mining - International Conference, WISM 2011, Proceedings
Pages: 122-130
Number of pages: 9
Edition: PART 2
DOIs
State: Published - 2011
Event: 2011 International Conference on Web Information Systems and Mining, WISM 2011 - Taiyuan, China
Duration: 24 Sep 2011 to 25 Sep 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 2
Volume: 6988 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 2011 International Conference on Web Information Systems and Mining, WISM 2011
Country/Territory: China
City: Taiyuan
Period: 24/09/11 to 25/09/11

Keywords

  • Documents Clustering
  • Extracting Theme
  • Temporal Text Mining
