Video captioning based on joint image-audio deep learning techniques

Chien Yao Wang, Pei Sin Liaw, Kai Wen Liang, Jai Ching Wang, Pao Chi Chang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

With advances in technology, deep learning has been widely used in various multimedia applications. Herein, we applied this technology to video captioning. The proposed system uses different neural networks to extract features from image, audio, and semantic signals. Image and audio features are concatenated before being fed into a long short-term memory (LSTM) network for initialization. The joint audio-image features, together with the semantic features, enable the network to achieve better performance. The bilingual evaluation understudy (BLEU) algorithm, an automatic metric that scores generated sentences against reference captions, was used for evaluation. We considered n-gram lengths of one to four words; all BLEU scores increased by more than 1%, the CIDEr-D score increased by 2.27%, and the METEOR and ROUGE-L scores increased by 0.2% and 0.7%, respectively. The improvement is highly significant.
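As a rough illustration of the fusion scheme described in the abstract, the following PyTorch snippet is a minimal sketch (not the authors' implementation) that concatenates an image feature vector and an audio feature vector and uses the result to initialize the hidden state of an LSTM caption decoder; all dimensions, module names, and the tanh projection are illustrative assumptions.

    # A minimal sketch, assuming pooled CNN image features and audio-network
    # embeddings have already been extracted; all sizes are illustrative.
    import torch
    import torch.nn as nn

    class JointAVCaptioner(nn.Module):
        def __init__(self, img_dim=2048, aud_dim=128, hidden=512,
                     vocab=10000, embed=300):
            super().__init__()
            # Project the concatenated image+audio vector to the LSTM hidden size.
            self.init_proj = nn.Linear(img_dim + aud_dim, hidden)
            self.embed = nn.Embedding(vocab, embed)
            self.lstm = nn.LSTM(embed, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, img_feat, aud_feat, captions):
            # img_feat: (B, img_dim), aud_feat: (B, aud_dim), captions: (B, T) word ids
            joint = torch.cat([img_feat, aud_feat], dim=1)       # joint audio-image feature
            h0 = torch.tanh(self.init_proj(joint)).unsqueeze(0)  # (1, B, hidden) initial state
            c0 = torch.zeros_like(h0)
            emb = self.embed(captions)                           # (B, T, embed) word embeddings
            out, _ = self.lstm(emb, (h0, c0))
            return self.out(out)                                 # (B, T, vocab) word logits

    # Example usage with random tensors standing in for real features.
    model = JointAVCaptioner()
    logits = model(torch.randn(4, 2048), torch.randn(4, 128),
                   torch.randint(0, 10000, (4, 12)))             # -> (4, 12, 10000)

For the metrics quoted above, BLEU over n-grams of length one to four, CIDEr-D, METEOR, and ROUGE-L are standard captioning metrics; the MS COCO caption evaluation toolkit is one commonly used implementation.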

Original language: English
Title of host publication: Proceedings - 2019 IEEE 9th International Conference on Consumer Electronics, ICCE-Berlin 2019
Editors: Gordan Velikic, Christian Gross
Publisher: IEEE Computer Society
Pages: 127-131
Number of pages: 5
ISBN (Electronic): 9781728127453
DOIs
State: Published - Sep 2019
Event: 9th IEEE International Conference on Consumer Electronics, ICCE-Berlin 2019 - Berlin, Germany
Duration: 8 Sep 2019 - 11 Sep 2019

Publication series

Name: IEEE International Conference on Consumer Electronics - Berlin, ICCE-Berlin
Volume: 2019-September
ISSN (Print): 2166-6814
ISSN (Electronic): 2166-6822

Conference

Conference: 9th IEEE International Conference on Consumer Electronics, ICCE-Berlin 2019
Country/Territory: Germany
City: Berlin
Period: 8/09/19 - 11/09/19

Keywords

  • Acoustic scene classification
  • Convolutional neural networks
  • Long short-term memory
  • Sound event detection
  • Video captioning
  • Word embedding
