TY - CONF
T1 - Evolving Deep Architectures: A New Blend of CNNs and Transformers Without Pre-training Dependencies
AU - Kiiskilä, Manu
AU - Kiiskilä, Padmasheela
PY - 2024
N2 - Modeling in computer vision is gradually moving from Convolutional Neural Networks (CNNs) to Vision Transformers, owing to the strength of self-attention mechanisms in capturing global dependencies within the data. Although Vision Transformers have been shown to surpass CNNs in performance while requiring less computational power, their need for pre-training on large-scale datasets can become burdensome. Using pre-trained models has critical limitations, including limited flexibility to adjust network structures and domain mismatch between source and target domains. To address this, a new architecture blending CNNs and Transformers is proposed. SegFormer, with its four transformer blocks, is used as an example: the first two transformer blocks are replaced with two CNN modules and the network is trained from scratch. Experiments on the MS COCO dataset show a clear improvement in accuracy with the C-C-T-T architecture over the T-T-T-T architecture when trained from scratch on limited data. The proposed architecture, which modifies the SegFormer Transformer with two convolutional modules, achieves a pixel accuracy of 0.6956 on MS COCO.
DO - 10.1007/978-3-031-66694-0_10
M3 - Conference contribution
T3 - Communications in Computer and Information Science
SP - 163
EP - 175
BT - Deep Learning Theory and Applications
A2 - Fred, Ana
A2 - Hadjali, Allel
A2 - Gusikhin, Oleg
A2 - Sansone, Carlo
PB - Springer
T2 - International Conference on Deep Learning Theory and Applications
Y2 - 10 July 2024 through 11 July 2024
ER -