Discovery and fusion of salient multi-modal features towards news story segmentation

Winston Hsu, Shih Fu Chang, Chih Wei Huang, Lyndon Kennedy, Ching Yung Lin, Giridharan Iyengar

Research output: Contribution to journal › Conference article › peer-review

49 Scopus citations

Abstract

In this paper, we present new results in news video story segmentation and classification in the context of the TRECVID 2003 video retrieval benchmark. We applied and extended the Maximum Entropy statistical model to effectively fuse diverse features from multiple levels and modalities, including visual, audio, and text. The features include motion, face, music/speech types, prosody, and high-level text segmentation information. The statistical fusion model is used to automatically discover relevant features contributing to the detection of story boundaries. One novel aspect of our method is the use of a feature wrapper to handle different types of features: asynchronous, discrete, continuous, and delta features. We also developed several novel features related to prosody. Using the large news video set from the TRECVID 2003 benchmark, we demonstrate satisfactory performance (F1 measures up to 0.76 on ABC news and 0.73 on CNN news), show how these multi-level, multi-modal features construct the probabilistic framework, and, more importantly, observe an interesting opportunity for further improvement.
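As a rough illustration of the two quantities the abstract refers to, the sketch below shows a two-class conditional exponential (maximum entropy) model over binary features and a plain F1 computation. It is a minimal sketch and not the authors' implementation: the feature names, the weights, and the per-candidate matching used for F1 are all assumptions made for illustration (benchmark scoring may match boundaries within a time tolerance instead).

```python
import math


def maxent_boundary_prob(features, weights):
    """P(boundary | x) for a two-class conditional exponential model.

    features: dict mapping a (hypothetical) binary feature name to 0 or 1
    weights:  dict mapping feature name to its (hypothetical) weight lambda
    """
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    # With features that fire only for the "boundary" class, the normalizer
    # is 1 + exp(score), so the model reduces to a logistic function.
    return 1.0 / (1.0 + math.exp(-score))


def f1_measure(predicted, actual):
    """F1 = 2PR / (P + R) over aligned boundary / non-boundary decisions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical candidate point with two active binary features.
candidate = {"anchor_face_present": 1, "long_pause_before": 1, "music_segment": 0}
lambdas = {"anchor_face_present": 1.2, "long_pause_before": 0.8, "music_segment": -0.5}
prob = maxent_boundary_prob(candidate, lambdas)
```

In this form, each weight indicates how strongly its feature pushes a candidate point toward being a story boundary, which is one way a fitted model can be read to see which features are salient.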

Original language: English
Pages (from-to): 244-258
Number of pages: 15
Journal: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 5307
State: Published - 2004
Event: Storage and Retrieval Methods and Applications for Multimedia 2004 - San Jose, CA, United States
Duration: 20 Jan 2004 → 22 Jan 2004

Keywords

  • Exponential model
  • Face detection
  • Maximum Entropy Model
  • Multi-modal fusion
  • Prosody
  • Story segmentation
  • TRECVID
