Evolving Deep Architectures: A New Blend of CNNs and Transformers Without Pre-training Dependencies

Manu Kiiskilä, Padmasheela Kiiskilä

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

Modeling in computer vision is slowly shifting from Convolutional Neural Networks (CNNs) to Vision Transformers, owing to the strength of self-attention mechanisms in capturing global dependencies within the data. Although Vision Transformers have been shown to surpass CNNs in performance while requiring less computational power, their need for pre-training on large-scale datasets can become burdensome. Using pre-trained models has critical limitations, including limited flexibility to adjust the network structure and domain mismatch between source and target data. To address this, a new architecture blending CNNs and Transformers is proposed. SegFormer, with its four transformer blocks, is used as an example: the first two transformer blocks are replaced with two CNN modules, and the network is trained from scratch. Experiments on the MS COCO dataset show a clear improvement in accuracy with the C-C-T-T architecture over the T-T-T-T architecture when trained from scratch on limited data. The proposed architecture, which modifies the SegFormer Transformer with two convolutional modules, achieves a pixel accuracy of 0.6956 on MS COCO.
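
The C-C-T-T idea described in the abstract can be illustrated with a minimal PyTorch sketch: two convolutional downsampling stages followed by two transformer stages, trained from scratch. This is a hypothetical reconstruction, not the authors' implementation; the channel widths, stage depths, and the use of standard PyTorch attention blocks (in place of SegFormer's efficient self-attention and overlapping patch embeddings) are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class ConvStage(nn.Module):
    """A convolutional downsampling stage (stand-in for a replaced
    SegFormer transformer stage; illustrative, not the paper's module)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class TransformerStage(nn.Module):
    """Strided patch merging followed by standard transformer encoder
    blocks (SegFormer itself uses efficient self-attention instead)."""

    def __init__(self, in_ch, out_ch, depth=2, heads=4):
        super().__init__()
        self.merge = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        layer = nn.TransformerEncoderLayer(
            d_model=out_ch, nhead=heads,
            dim_feedforward=out_ch * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        x = self.merge(x)                      # (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class CCTTBackbone(nn.Module):
    """C-C-T-T hybrid: conv stages 1-2, transformer stages 3-4.
    Channel widths are assumed values, not taken from the paper."""

    def __init__(self, channels=(32, 64, 160, 256)):
        super().__init__()
        self.stage1 = ConvStage(3, channels[0])
        self.stage2 = ConvStage(channels[0], channels[1])
        self.stage3 = TransformerStage(channels[1], channels[2])
        self.stage4 = TransformerStage(channels[2], channels[3])

    def forward(self, x):
        feats = []
        for stage in (self.stage1, self.stage2, self.stage3, self.stage4):
            x = stage(x)
            feats.append(x)
        return feats  # multi-scale features for a segmentation decoder


if __name__ == "__main__":
    model = CCTTBackbone()
    outs = model(torch.randn(1, 3, 224, 224))
    print([tuple(f.shape) for f in outs])  # strides 2, 4, 8, 16
```

The design intuition matches the abstract: early convolutional stages learn local features efficiently without large-scale pre-training, while the later transformer stages capture global dependencies on the reduced-resolution feature maps.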
Original language: English
Title of host publication: Deep Learning Theory and Applications
Subtitle of host publication: 5th International Conference, DeLTA 2024, Dijon, France, July 10–11, 2024, Proceedings, Part I
Editors: Ana Fred, Allel Hadjali, Oleg Gusikhin, Carlo Sansone
Publisher: Springer
Pages: 163-175
ISBN (Electronic): 978-3-031-66694-0
DOIs
Publication status: Published - 2024
Publication type: A4 Article in conference proceedings
Event: International Conference on Deep Learning Theory and Applications - Dijon, France
Duration: 10 Jul 2024 – 11 Jul 2024

Publication series

Name: Communications in Computer and Information Science
Volume: 2171 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: International Conference on Deep Learning Theory and Applications
Country/Territory: France
City: Dijon
Period: 10/07/24 – 11/07/24

Publication forum classification

  • Publication forum level 1
