Web text clustering with dynamic themes

Ping Ju Hung, Ping Yu Hsu, Ming Shien Cheng, Chih Hao Wen

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Data mining research has produced many techniques for filtering useful information out of vast amounts of data, and document clustering is one of the most important. There are two approaches to document clustering: clustering on document metadata and clustering on document content. Most previous content-based approaches focused on document summaries (of single or multiple files) and on word-vector analysis, identifying a small number of important keywords with which to cluster documents. In this study, we categorize hot commodities on the web and then name the resulting categories, based on the web text (abstracts) of these commodities and their access counts. First, we parse the Chinese web text of the hot-commodity documents and apply a hierarchical agglomerative clustering algorithm, the Ward method, to group words into themes according to their properties and to decide the number of themes. Second, we adopt the Cross Collection Mixture Model, which is used in Temporal Text Mining, together with the access counts (the degree to which users identify with the words) to collect dynamic themes, and then gather stable words by probability distribution to serve as the vectors for document clustering. Third, we estimate the parameters with the Expectation Maximization (EM) algorithm. Finally, we apply K-means with the extracted dynamic themes as the features for document clustering. This study proposes a novel approach to document clustering, and a series of experiments shows that the algorithm is effective and can improve the accuracy of clustering results.
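The final step described in the abstract, K-means over theme-based document features, can be illustrated with a minimal sketch. This is not the authors' implementation: the document-theme weights below are invented for illustration, the function name `kmeans` and the choice to seed centroids with the first k documents are assumptions, and the theme extraction steps (Ward clustering, the Cross Collection Mixture Model, EM) are not reproduced here.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iterations=100):
    """Plain Lloyd's K-means on lists of floats.

    Assumption: the first k points seed the centroids, which keeps the
    example deterministic; real uses would pick seeds more carefully.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical feature vectors: one row per document, one column per theme.
# Each value stands in for the weight of an extracted theme in that document.
docs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]

labels, centroids = kmeans(docs, k=3)
print(labels)  # [0, 1, 2, 0, 1, 2]
```

Using theme weights rather than raw word counts keeps the feature vectors short, which is the motivation the abstract gives for extracting themes before clustering.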

Original language: English
Title of host publication: Web Information Systems and Mining - International Conference, WISM 2011, Proceedings
Pages: 122-130
Number of pages: 9
Edition: PART 2
Publication status: Published - 2011
Event: 2011 International Conference on Web Information Systems and Mining, WISM 2011 - Taiyuan, China
Duration: 24 Sep 2011 to 25 Sep 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 2
Volume: 6988 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 2011 International Conference on Web Information Systems and Mining, WISM 2011
Country/Territory: China
City: Taiyuan
Period: 24/09/11 to 25/09/11
