An Interpretable Visual Attention Plug-in for Convolutions

Chih Yang Lin, Chia Lin Wu, Hui Fuang Ng, Timothy K. Shih

Research output: Contribution to journal › Article › peer-review



Raw images, which may contain many noisy background pixels, are typically used in convolutional neural network (CNN) training. This paper proposes a novel variance loss function based on a ground truth mask of the target object to enhance the visual attention of a CNN. The loss function regularizes the training process so that the feature maps in the later convolutional layer are focused more on target object areas and less on the background. Attention loss is computed directly from the feature maps, so no new parameters are added to the backbone network; therefore, no extra computational cost is added to the testing phase. The proposed attention model can be a plug-in for any pre-trained network architecture and can be used in conjunction with other attention models. Experimental results demonstrate that the proposed variance loss function improves classification accuracy by 2.22% over the baseline on the Stanford Dogs dataset, which is significantly higher than the improvements achieved by SENet (0.3%) and CBAM (1.14%). Our method also improves object detection accuracy by 2.5 mAP on the Pascal-VOC2007 dataset and store sign detection by 2.66 mAP over respective baseline models. Furthermore, the proposed loss function enhances the visualization and interpretability of a CNN.
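The abstract describes the loss only at a high level: it is computed directly from late-layer feature maps and a ground-truth object mask, with no new network parameters. As a minimal, purely illustrative sketch of that idea (the paper's actual variance formulation is not reproduced here, and the function and variable names are hypothetical), one could penalize feature-map activation on background pixels relative to activation inside the masked object region:

```python
def mask_attention_loss(feature_map, mask):
    """Return a scalar that is lower when activation concentrates on
    masked (object) pixels rather than background pixels.

    feature_map: 2-D list of floats (one channel of a late conv layer)
    mask:        2-D list of 0/1 ints (1 = target-object pixel)

    Illustrative stand-in only -- the paper's exact variance loss is
    not given in the abstract.
    """
    fg, bg = [], []
    for f_row, m_row in zip(feature_map, mask):
        for value, m in zip(f_row, m_row):
            (fg if m else bg).append(value)
    fg_mean = sum(fg) / len(fg) if fg else 0.0
    bg_mean = sum(bg) / len(bg) if bg else 0.0
    # Background activation raises the loss; object activation lowers it.
    return bg_mean - fg_mean


# A feature map focused on the object scores lower than a diffuse one.
obj_mask = [[0, 1], [0, 1]]
focused = [[0.1, 0.9], [0.1, 0.9]]   # activation concentrated on the object
diffuse = [[0.9, 0.9], [0.9, 0.9]]   # activation spread over background too
print(mask_attention_loss(focused, obj_mask))  # negative: attention on object
print(mask_attention_loss(diffuse, obj_mask))  # 0.0
```

Because such a term depends only on existing feature maps and the training mask, it adds a regularizer to the training objective without introducing parameters, which matches the abstract's claim of zero extra cost at test time.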

Original language: English
Article number: 9146869
Pages (from-to): 136992-137003
Number of pages: 12
Journal: IEEE Access
State: Published - 2020


Keywords:
  • Attention model
  • convolutional neural network
  • deep learning
  • object detection
  • variance loss


