TY - JOUR
T1 - Multi-view and multi-augmentation for self-supervised visual representation learning
AU - Tran, Van Nhiem
AU - Huang, Chi En
AU - Liu, Shen Hsuan
AU - Aslam, Muhammad Saqlain
AU - Yang, Kai Lin
AU - Li, Yung-Hui
AU - Wang, Jia Ching
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2024/1
Y1 - 2024/1
AB - In the real world, the appearance of identical objects depends on factors as varied as resolution, viewing angle, illumination, and perspective. This suggests that the data augmentation pipeline could benefit downstream tasks by exploring the overall data appearance in a self-supervised framework. Previous self-supervised learning methods that achieve outstanding performance rely heavily on data augmentations such as cropping and color distortion. However, most methods use a static data augmentation pipeline, limiting the amount of feature exploration. To generate representations that encompass scale-invariant, explicit information about various semantic features and are invariant to nuisance factors such as relative object location, brightness, and color distortion, we propose the Multi-View, Multi-Augmentation (MVMA) framework. MVMA consists of multiple augmentation pipelines, with each pipeline comprising an assortment of augmentation policies. By refining the baseline self-supervised framework with modified loss objectives, MVMA investigates a broader range of image appearances and enhances the exploration of image features through diverse data augmentation techniques. Transferring the resulting representations, learned with convolutional networks (ConvNets), to downstream tasks yields significant improvements over the state-of-the-art DINO across a wide range of vision and classification tasks: +4.1% and +8.8% top-1 accuracy on the ImageNet dataset with linear evaluation and a k-NN classifier, respectively. Moreover, MVMA achieves a significant improvement of +5% AP50 and +7% AP50m on COCO object detection and segmentation.
KW - Data augmentation policies
KW - Metric learning
KW - Multi-augmentation
KW - Nuisance factors
KW - Scale-invariant representation learning
KW - SSL augmentation pipelines
UR - http://www.scopus.com/inward/record.url?scp=85179703145&partnerID=8YFLogxK
U2 - 10.1007/s10489-023-05163-6
DO - 10.1007/s10489-023-05163-6
M3 - Journal article
AN - SCOPUS:85179703145
SN - 0924-669X
VL - 54
SP - 629
EP - 656
JO - Applied Intelligence
JF - Applied Intelligence
IS - 1
ER -