Tech-Talk-Sum: fine-tuning extractive summarization and enhancing BERT text contextualization for technological talk videos

Chalothon Chootong, Timothy K. Shih

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Automatic summarization is the task of condensing content into a shorter version while preserving its key information and meaning. In this paper, we introduce Tech-Talk-Sum, which combines BERT (Bidirectional Encoder Representations from Transformers) with an attention mechanism to summarize technological talk videos. First, we introduce technology talk datasets constructed from YouTube, covering both short and long talk videos. Second, we explore various sentence representations derived from BERT’s output and find that the top hidden layer is the best choice for our datasets. The BERT outputs are fed to a Bi-LSTM network to build local context vectors. In addition, we build a document encoder layer that leverages BERT and a self-attention mechanism to express the semantics of a video caption and to form the global context vector. Third, an undirected LSTM is added to bridge each sentence’s local and global contexts and predict its salience score. Finally, video summaries are generated from these scores. We train a single unified model on the long-talk video datasets and evaluate it with ROUGE. The experimental results demonstrate that our model generalizes and reaches the level of the baselines and state-of-the-art results for both long and short videos.
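
The last two steps described in the abstract, building a summary from per-sentence salience scores and evaluating it with ROUGE, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `select_summary` and `rouge_1`, the parameter `k`, the example sentences, and the score values are all hypothetical, and the metric shown is a simplified ROUGE-1 (unigram overlap on lowercased whitespace tokens, without stemming or stopword handling), not the official ROUGE toolkit.

```python
from collections import Counter


def select_summary(sentences, scores, k):
    """Return the k highest-scoring sentences in their original order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Rank sentence indices by salience score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top k, then restore document order so the summary reads naturally.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]


def rouge_1(candidate, reference):
    """Simplified ROUGE-1: unigram precision, recall, and F1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical caption sentences and salience scores (illustrative values only;
# in the paper the scores would come from the trained model).
sentences = [
    "welcome to the talk",
    "today we cover transformers",
    "please subscribe to the channel",
    "attention lets the model weigh every token",
]
scores = [0.10, 0.80, 0.05, 0.90]

summary = select_summary(sentences, scores, k=2)
precision, recall, f1 = rouge_1(
    " ".join(summary),
    "transformers use attention to weigh every token",  # hypothetical reference
)
```

Restoring document order after ranking is a common choice for extractive summaries, since the selected sentences then follow the flow of the original talk; whether the paper does the same is an assumption here.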

Original language: English
Pages (from-to): 31295-31312
Number of pages: 18
Journal: Multimedia Tools and Applications
Volume: 81
Issue number: 22
State: Published - Sep 2022

Keywords

  • Attention mechanism
  • BERT
  • Spoken summarization
  • Technological talk
  • Video summary
