
Vision transformer

The Vision Transformer (ViT) is a pioneering architecture that adapts the transformer model, originally designed for natural language processing, to computer vision tasks such as image classification by dividing input images into fixed-size patches and treating them as sequences of tokens, enabling self-attention without convolutional operations. Introduced in October 2020 by researchers at Google, including Alexey Dosovitskiy, Lucas Beyer, and colleagues, ViT demonstrated that pure transformer-based models can achieve state-of-the-art results on large-scale benchmarks such as ImageNet when pre-trained on massive datasets exceeding hundreds of millions of images.

At its core, ViT processes an image by splitting it into non-overlapping square patches (typically 16×16 pixels), flattening and linearly projecting each patch into a vector embedding, adding learnable positional embeddings to preserve spatial information, and feeding the resulting sequence, together with a special classification token, through a stack of transformer encoder layers comprising multi-head self-attention and MLP blocks, culminating in a simple MLP head for output prediction. This design leverages the transformer's ability to model long-range dependencies across the entire image, contrasting with the local receptive fields of traditional convolutional neural networks (CNNs).

ViT's key strengths include its scalability with model size and data volume, where larger variants (e.g., ViT-L/16 and ViT-H/14) outperform CNNs like EfficientNet after JFT-300M pre-training followed by ImageNet fine-tuning, achieving up to 88.55% top-1 accuracy, and its flexibility for transfer learning on downstream tasks. However, it exhibits limitations such as high data hunger, underperforming CNNs on smaller datasets like ImageNet-1k without extensive pre-training, and substantial computational demands during training, often requiring JFT-300M-scale corpora. Since its debut, ViT has profoundly influenced computer vision research, amassing over 49,000 citations and spawning variants like the Swin Transformer for hierarchical processing and DeiT for data-efficient training, while enabling advancements in object detection (e.g., DETR), semantic segmentation, and efficient edge deployment through optimizations like model compression. By 2025, hybrid CNN-transformer architectures and self-supervised pre-training strategies have further extended ViT's applicability, solidifying transformers as a dominant architecture alongside or beyond CNNs in visual modeling.

Introduction

Definition and Core Principles

The Vision Transformer (ViT) is a deep learning model that adapts the transformer architecture, originally designed for natural language processing, to computer vision tasks by treating images as sequences of fixed-size patches rather than continuous pixel arrays. This approach enables the model to process visual data directly through mechanisms suited for sequential inputs, achieving competitive performance on image classification and related benchmarks when pretrained on large datasets. At its core, ViT operates on the principle of sequence-to-sequence processing, where the entire image is tokenized into a linear sequence of patch embeddings that are analyzed holistically via self-attention layers. Unlike traditional convolutional neural networks (CNNs), which rely on built-in inductive biases such as spatial locality and translation equivariance to efficiently capture hierarchical features, ViT eschews these assumptions in favor of learning global interdependencies purely from data-driven attention patterns. This design emphasizes the transformer's ability to model long-range dependencies across the image, potentially offering greater flexibility for tasks requiring broad contextual understanding, though it demands substantial computational resources and large-scale data for effective training.

The high-level workflow of ViT begins with dividing the input image into non-overlapping patches, which are then linearly projected into high-dimensional embeddings to form a sequence of tokens, often augmented with a learnable class token for global aggregation. These tokens are processed through multiple transformer encoder layers, each applying multi-head self-attention followed by feed-forward networks, to refine representations that capture patch-wise relationships. The output from the final layer, typically the class token's representation, is passed through a multilayer perceptron (MLP) head to yield task-specific predictions, such as classification logits.

The initial patch embedding step is mathematically expressed as

\mathbf{z}^0 = \left[ \mathbf{x}_{\text{class}}; \mathbf{x}_p^1 E_p; \mathbf{x}_p^2 E_p; \dots; \mathbf{x}_p^N E_p \right] + E_{\text{pos}}

where \mathbf{x}_p^i is the flattened vector of the i-th image patch, E_p \in \mathbb{R}^{(P^2 \cdot C) \times D} is the linear projection matrix (with P^2 denoting the patch area and C the number of input channels), \mathbf{x}_{\text{class}} is the learnable class token, and E_{\text{pos}} supplies positional encodings to preserve spatial information.
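For concreteness, the shape bookkeeping implied by this equation can be traced with a few lines of Python; the hyperparameters below (224×224 RGB input, P = 16, D = 768) are commonly cited ViT-Base values and are assumptions for illustration, not requirements of the architecture.

```python
# Minimal sketch (not a reference implementation): shape bookkeeping for the
# patch-embedding equation above, assuming typical ViT-Base hyperparameters.
H, W, C = 224, 224, 3   # input image height, width, channels
P, D = 16, 768          # patch size and embedding dimension (assumed values)

N = (H // P) * (W // P)             # number of patches: 14 * 14 = 196
patch_dim = P * P * C               # flattened patch length: 16 * 16 * 3 = 768

# Shapes of the quantities in z^0 = [x_class; x_p^1 E_p; ...; x_p^N E_p] + E_pos
x_p_shape   = (N, patch_dim)        # flattened patches
E_p_shape   = (patch_dim, D)        # linear projection matrix E_p
z0_shape    = (N + 1, D)            # sequence after prepending the class token
E_pos_shape = (N + 1, D)            # learnable positional embeddings

print(N, patch_dim, z0_shape)       # 196 768 (197, 768)
```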

Motivation from NLP to Vision

The success of transformer architectures in natural language processing (NLP) stemmed from their ability to model long-range dependencies through self-attention mechanisms, which allow tokens to interact globally regardless of distance in the sequence. This capability enabled transformers to scale effectively with increasing data and compute, achieving breakthroughs in tasks like machine translation and language modeling, as demonstrated by models handling billions of parameters. In contrast, convolutional neural networks (CNNs), the dominant paradigm in computer vision, relied on fixed local receptive fields and inductive biases such as translation equivariance and locality, which facilitated efficient feature extraction but constrained global reasoning across an entire image.

These CNN limitations became particularly evident as vision models scaled: while deeper architectures improved performance, they encountered diminishing returns due to the challenges of propagating information over large receptive fields without explicit global interactions. The fixed hierarchical structure of CNNs also promoted reliance on handcrafted features and augmentations for generalization, hindering fully end-to-end learning on diverse datasets.

Inspired by the scalability of NLP transformers, researchers sought to adapt the architecture to vision by treating images as sequences of patches, leveraging self-attention to capture holistic context and achieve better performance with abundant training data. Key motivations included the potential for improved scaling behavior, where transformers could benefit more from massive datasets and compute than CNNs, as well as the promise of unified models across modalities without domain-specific priors. Early experiments validated this approach, showing that vision transformers achieved competitive accuracy on large-scale benchmarks like ImageNet when pretrained on extensive image corpora, rivaling or surpassing CNNs in efficiency and transferability. This marked a conceptual shift from localized feature hierarchies in CNNs to sequence-based modeling of visual tokens, fostering architectures that integrate global dependencies natively and paving the way for multimodal applications.

Historical Development

Origins and Inception

The transformer architecture originated in the field of natural language processing (NLP) with the seminal 2017 paper "Attention Is All You Need" by Ashish Vaswani and colleagues at Google and the University of Toronto. This work introduced a novel model that relied entirely on attention mechanisms to process sequential data, dispensing with the recurrent and convolutional layers that were prevalent in prior NLP systems. Designed primarily for machine translation tasks, such as English-to-German translation, the Transformer demonstrated superior performance by parallelizing computations and capturing long-range dependencies more effectively than recurrent neural networks. Its success in NLP quickly inspired explorations into applying similar principles to other domains, including computer vision.

Prior to the direct adaptation of transformers for image classification, early efforts bridged NLP and computer vision through specialized tasks. A notable example is the 2020 paper "End-to-End Object Detection with Transformers" (DETR) by Nicolas Carion and co-authors from Facebook AI Research. DETR employed a transformer encoder-decoder architecture to perform object detection in a set prediction framework, eliminating the need for hand-crafted components like non-maximum suppression or anchor boxes common in convolutional neural network (CNN)-based detectors. By treating object queries as learnable embeddings and processing image features via self-attention, DETR marked an initial foray into transformer-based vision models, achieving competitive results on benchmarks like COCO while highlighting the potential for end-to-end learning in visual tasks.

The inception of the Vision Transformer (ViT) occurred in 2020 with the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy and colleagues at Google. This work proposed ViT as a pure transformer model for image classification, directly applying the transformer encoder to vision without incorporating convolutional inductive biases, by dividing input images into fixed-size patches treated as "words" in a sequence. When pre-trained on large datasets, ViT models matched or exceeded state-of-the-art CNNs, such as EfficientNet, on ImageNet when scaled appropriately. A key challenge addressed in the ViT proposal was the model's data efficiency compared to CNNs, which benefit from strong inductive biases like translation equivariance. To achieve competitive results, ViT required extensive pretraining on massive datasets, exemplified by Google's internal JFT-300M dataset comprising 300 million images and 18,000 classes. Pretraining on JFT-300M enabled ViT to learn robust visual representations transferable to downstream tasks, underscoring the importance of scale for transformer success in vision.

Key Milestones and Evolution

Following the introduction of the original Vision Transformer (ViT) in 2020, significant advancements starting in late 2020 and 2021 addressed key limitations such as data efficiency and computational scalability. The Data-efficient Image Transformers (DeiT) framework, proposed in December 2020 by researchers at Facebook AI Research, introduced a knowledge distillation approach using a teacher-student strategy with attention-based distillation, enabling competitive performance on ImageNet-1K without requiring massive external datasets like JFT-300M. DeiT models achieved top-1 accuracy of 81.8% on ImageNet using only 300 epochs of training on a single node, demonstrating that ViTs could be trained effectively on standard hardware and smaller data regimes. Concurrently, the Swin Transformer, developed in March 2021 by Microsoft Research Asia, incorporated hierarchical feature processing through shifted window-based self-attention, reducing quadratic complexity to linear in image resolution and improving suitability for dense prediction tasks. This design allowed Swin models to outperform prior ViTs on benchmarks like COCO object detection, with a Swin-L variant reaching 58.7 box AP.

In 2021, self-supervised learning paradigms further propelled ViT evolution by mitigating reliance on labeled data. The Masked Autoencoder (MAE) method, from Facebook AI Research and submitted in November 2021, adapted BERT-style masking to images by randomly masking 75% of patches during pretraining and reconstructing them via an asymmetric encoder-decoder architecture, achieving 87.8% top-1 accuracy on ImageNet with a ViT-H/14 model pretrained on ImageNet-1K alone. MAE highlighted the scalability of masked reconstruction for ViTs, showing strong transfer to downstream tasks, such as 47.2 mask AP on COCO instance segmentation using ViT-L. Complementing this, the DINO framework, also from Facebook AI Research and submitted in April 2021, employed self-distillation without labels by training a student network to predict a teacher's momentum-encoded outputs, revealing emergent properties such as self-attention maps resembling object segmentations in ViTs. DINO enabled ViT-Small models to reach 78.3% top-1 accuracy on ImageNet via self-supervision, underscoring ViTs' ability to learn semantically rich representations without explicit supervision.

ViT adoption surged post-2021, with seamless integration into open-source libraries like Hugging Face Transformers, which hosted pretrained models such as ViT-Base, facilitating rapid experimentation and deployment across research and industry. Benchmarks increasingly demonstrated ViTs surpassing CNNs on ImageNet when pretrained on large-scale data; for instance, ViT-H/14 achieved 88.55% top-1 accuracy on ImageNet after pretraining on JFT-300M, outperforming EfficientNet baselines by leveraging global self-attention for better generalization. In 2023, the Segment Anything Model by Meta AI utilized a ViT-based image encoder for promptable segmentation, enabling zero-shot generalization to new tasks and marking a milestone in foundation models for vision. More recently, ViT variants have been scaled to 113 billion parameters for applications like weather prediction, further extending their impact.

Overall, ViT evolution from 2020 to 2023 trended toward efficiency, transitioning from data-hungry global models to variants with localized attention mechanisms and self-supervised pretraining, enabling broader applicability in resource-constrained settings. These developments, driven by contributions from industrial and academic research groups, reduced training costs by up to 10x compared to early ViTs while maintaining or exceeding performance on standard benchmarks.

Architecture

Input Processing and Patch Embedding

The input processing stage of the Vision Transformer (ViT) begins by dividing a raw input image x \in \mathbb{R}^{H \times W \times C}, where H and W denote the height and width, and C is the number of color channels (typically 3 for RGB), into a grid of non-overlapping patches. Each patch has a fixed size of P \times P pixels, resulting in N = \frac{HW}{P^2} patches that are extracted in a non-overlapping manner, similar to tokenizing a sentence in natural language processing. This patching mechanism transforms the 2D image structure into a 1D sequence of fixed-length tokens, enabling the application of transformer architectures designed for sequential data. In practice, images are resized to a standard square resolution, such as 224 × 224 pixels, to maintain consistent aspect ratios and ensure uniform patch counts across inputs.

Following extraction, each patch is flattened into a vector of dimension P^2 \cdot C, forming a matrix x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}. These flattened patches are then linearly projected into a fixed embedding dimension D (commonly 768 for base models) using a trainable projection matrix E \in \mathbb{R}^{(P^2 \cdot C) \times D}, yielding patch embeddings x_p E. This projection layer, which includes a bias term, serves as a simple yet effective tokenizer that maps the high-dimensional patch representations into the transformer's input space while preserving essential visual features. A common choice for patch size is P = 16, which balances sequence length and representational granularity; for a 224 × 224 input, this produces N = 196 patches. Smaller patch sizes increase N, leading to longer sequences and higher computational costs due to the transformer's quadratic scaling with sequence length, whereas larger patches reduce spatial resolution but lower overhead.

To facilitate global image representation and retain spatial order, a learnable class token x_{\text{class}} \in \mathbb{R}^{1 \times D} is prepended to the sequence of patch embeddings, forming z_0 = [x_{\text{class}}; x_p E] \in \mathbb{R}^{(N+1) \times D}. This [CLS] token, inspired by BERT's usage in NLP, aggregates information across the entire image during subsequent processing, with its final representation used for downstream tasks like classification. Positional embeddings E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D} are then added element-wise to z_0, encoding the order of patches since transformers lack inherent positional awareness. ViT employs learnable 1D positional embeddings rather than the fixed sinusoidal encodings of the original Transformer; experiments showed no significant benefit from 2D-structured positional encodings that explicitly model patch coordinates. This embedding strategy ensures the model captures both local patch content and global spatial relationships efficiently.
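A minimal PyTorch-style sketch of this patch embedding stage, assuming ViT-Base hyperparameters, is shown below; the module name and defaults are illustrative, not the reference implementation. A strided convolution with kernel size P and stride P is equivalent to flattening each patch and applying a shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative patch embedding: split the image into P x P patches,
    project each to dimension D, prepend a [CLS] token, and add learnable
    1D positional embeddings (hyperparameters are assumed ViT-Base values)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Strided conv == flatten-then-linear-project applied to every patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # (B, N + 1, D)
        return x + self.pos_embed               # add positional embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                             # torch.Size([2, 197, 768])
```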

Transformer Encoder and Attention Mechanism

The Vision Transformer (ViT) processes the sequence of patch embeddings through a stack of L identical encoder layers, where L is a configurable hyperparameter such as 12 for the base model. Each layer consists of a multi-head self-attention (MSA) sub-layer followed by a multilayer perceptron (MLP) sub-layer, with residual connections around both sub-layers and layer normalization applied before each. This structure enables the model to capture global dependencies across image patches without relying on convolutional operations.

The core of the encoder is the self-attention mechanism, which computes representations by attending to all input patches simultaneously. In scaled dot-product attention, queries Q, keys K, and values V are linear projections of the input sequence X \in \mathbb{R}^{N \times D}, where N is the number of patches and D is the embedding dimension:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Here, d_k is the dimension of the keys, and the scaling factor \sqrt{d_k} keeps the dot products from growing so large that the softmax saturates and gradients vanish. This formulation, adapted from the original Transformer, allows each patch to interact with every other patch, modeling long-range interactions essential for vision tasks.

Multi-head attention extends this by performing the attention operation in parallel across h heads, each with independent projections for Q, K, and V of dimension d_k = D/h. The outputs from all heads are concatenated and linearly projected back to dimension D:

\text{MSA}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O

where \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V). This design enables the model to attend to information from different representation subspaces jointly, improving expressiveness; in ViT, h is typically 12 for the base model.

Following the MSA sub-layer, the MLP sub-layer applies a two-layer feed-forward network with a GELU activation for non-linear transformations:

\text{MLP}(X) = \text{GELU}(X W_1 + b_1) W_2 + b_2

The intermediate dimension is set to approximately four times D (e.g., 3072 for D = 768), expanding and then contracting the representations to enhance feature diversity. This component, positioned after layer normalization in the pre-norm formulation, contributes to the model's capacity for complex pattern learning.

Layer normalization in ViT uses the pre-norm variant, applying it before the MSA and MLP sub-layers to stabilize training and mitigate issues like gradient vanishing, differing from the post-norm approach of the original Transformer. Residual connections add the input to the sub-layer output, formulated as X + \text{Sublayer}(\text{LN}(X)), ensuring smooth information flow through the deep stack. These elements collectively form a robust encoder that scales effectively with model depth and width.
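The pre-norm encoder layer described above can be sketched as follows; this is an illustrative approximation that reuses PyTorch's nn.MultiheadAttention and omits dropout and stochastic depth, not the original implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sketch of one pre-norm ViT encoder layer: LayerNorm -> multi-head
    self-attention -> residual add, then LayerNorm -> GELU MLP -> residual add.
    Dimensions follow the ViT-Base description above (D=768, h=12, MLP=3072)."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):                       # x: (B, N + 1, D)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)        # queries, keys, values from the same sequence
        x = x + attn_out                        # residual connection around MSA
        x = x + self.mlp(self.norm2(x))         # residual connection around MLP
        return x

x = torch.randn(2, 197, 768)
print(EncoderBlock()(x).shape)                  # torch.Size([2, 197, 768])
```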

Output Layer and Classification Head

The output layer of the Vision Transformer (ViT) processes the representations produced by the encoder to generate predictions for downstream tasks, primarily image classification. In the standard ViT architecture, a learnable class token, denoted [\text{CLS}], is prepended to the sequence of patch embeddings at the input stage. This token interacts with the patch tokens through the self-attention mechanism across multiple encoder layers, aggregating global image information. The final representation of the class token from the last encoder layer, z_L^0, serves as the image-level feature and is fed into a classification head consisting of a simple linear layer followed by a softmax to produce class probabilities:

\mathbf{y} = \text{softmax} \left( \mathbf{W} z_L^0 \right)

where \mathbf{W} is a learnable weight matrix mapping the embedding dimension to the number of classes K.

Alternatives to class token aggregation exist for certain tasks or to improve flexibility. For instance, global average pooling (GAP) can be applied over the patch tokens from the final encoder layer, averaging their representations to obtain a fixed-size image feature before passing it to the classification head. This approach has been shown to achieve comparable performance to the class token method on image classification benchmarks when paired with optimized learning rates, such as 3 \times 10^{-4} versus 8 \times 10^{-4} for the class token. GAP is particularly useful in distillation or regression scenarios where the class token's inductive bias toward classification may be less desirable.

During fine-tuning, the head is often adapted to suit specific tasks by replacing the original linear layer with task-specific modules while keeping the pretrained encoder frozen or lightly tuned. For example, in object detection, additional heads such as region proposal networks or detection-specific decoders can be attached to the encoder outputs to predict bounding boxes and class labels, enabling end-to-end training on datasets like COCO. This modular design allows ViT to serve as a versatile backbone for various tasks beyond classification.

The primary training objective for supervised classification in ViT is the cross-entropy loss applied to the output logits, encouraging the model to minimize prediction errors on labeled data. For pretraining, objectives such as masked patch modeling or contrastive losses can be employed to learn robust representations from unlabeled data, with task-specific fine-tuning using cross-entropy thereafter; detailed strategies for these pretraining methods are discussed in the variants section.
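A hedged sketch of the output head, showing both the [CLS]-token and global-average-pooling options discussed above; the module and argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Illustrative ViT output head: take either the final [CLS] token or the
    global average of the patch tokens, then apply a linear layer to get logits.
    A softmax / cross-entropy loss would be applied outside this module."""
    def __init__(self, dim=768, num_classes=1000, pool="cls"):
        super().__init__()
        self.pool = pool                        # "cls" or "gap"
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, z_L):                     # z_L: (B, N + 1, D), CLS token at index 0
        if self.pool == "cls":
            feat = z_L[:, 0]                    # z_L^0, the class token representation
        else:
            feat = z_L[:, 1:].mean(dim=1)       # global average pooling over patch tokens
        return self.fc(feat)                    # logits of shape (B, num_classes)

logits = ClassificationHead(pool="gap")(torch.randn(2, 197, 768))
print(logits.shape)                             # torch.Size([2, 1000])
```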

Variants and Improvements

Self-Supervised and Pretraining Variants

Self-supervised learning variants of the Vision Transformer (ViT) address the limitations of supervised pretraining, which requires large labeled datasets like ImageNet, by leveraging unlabeled data through pretext tasks inspired by NLP techniques. These methods enable scalable representation learning by masking parts of the input image and training the model to reconstruct or predict the masked content, fostering robust feature extraction without explicit labels. By pretraining on vast unlabeled corpora, such as millions of images from diverse sources, ViTs can then be fine-tuned efficiently on smaller labeled datasets for downstream tasks.

One seminal approach is Bidirectional Encoder representation from Image Transformers (BEiT), which adapts BERT-style masked modeling to vision by discretizing image patches into visual tokens using a dVAE tokenizer. In BEiT, random patches are masked (typically around 40% of the input), and the model predicts the discrete tokens of the masked regions based on context from visible patches, processed through the ViT encoder. This self-supervised pretraining encourages the model to learn semantic representations of image structures, bridging the gap between continuous pixel data and discrete token prediction.

Another influential method is self-distillation with no labels (DINO), which employs a self-distillation framework using teacher-student networks to learn visual representations without negative samples or explicit reconstruction. The student network, a standard ViT, is trained to match the softened output distribution (via a temperature-scaled softmax) of the teacher network on the same input, while the teacher is updated as an exponential moving average of the student to maintain stability. Centering the teacher's distribution further prevents representation collapse, allowing features to emerge with properties like self-attention maps resembling object segmentation masks, enhancing the interpretability of the learned features.

Masked Autoencoders (MAE) introduce a reconstruction-based pretext task with high-ratio masking, where 75-90% of patches are randomly removed, and the model reconstructs the pixel values of the masked regions. MAE employs an asymmetric encoder-decoder architecture: a lightweight decoder reconstructs the image from the encoder's latent representations of visible patches (processed by the ViT backbone), promoting efficiency by avoiding processing masked tokens during encoding. This design scales effectively to large models and datasets, as the high masking ratio forces the encoder to capture high-level semantics from sparse visible context.

These self-supervised pretraining strategies yield significant benefits for ViT deployment, particularly in transfer learning, where models pretrained on large unlabeled image collections (e.g., over 100 million images) outperform those relying solely on supervised pretraining when fine-tuned on ImageNet-1k for classification. By reducing dependence on costly annotations, they enable better generalization to downstream tasks like segmentation and detection, with pretrained representations capturing richer, more transferable visual hierarchies.
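The heart of MAE-style pretraining is the random masking of patch tokens. The toy function below sketches one plausible way to implement it, assuming a 75% mask ratio; the bookkeeping differs in detail from the authors' released code.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Toy sketch of MAE-style random masking (not the authors' code): keep a
    random subset of patch tokens and return the kept tokens plus a binary mask
    recording which positions the decoder would later have to reconstruct."""
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                       # one random score per token
    ids_shuffle = noise.argsort(dim=1)             # ascending: lowest scores are kept
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)                        # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return visible, mask, ids_shuffle

visible, mask, _ = random_masking(torch.randn(2, 196, 768))
print(visible.shape, mask.sum(dim=1))              # (2, 49, 768), 147 masked per image
```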

Hierarchical and Efficient Variants

To address the quadratic complexity of standard self-attention in Vision Transformers (ViTs), which scales as O(N^2) with sequence length N, several variants introduce hierarchical structures and efficiency optimizations that enable better scalability for dense prediction tasks while preserving representational power.

The Swin Transformer, introduced in 2021, builds a hierarchical representation through a multi-stage design in which input patches are progressively merged to form larger tokens, reducing spatial resolution across four stages similar to convolutional backbones. It employs window-based multi-head self-attention (W-MSA) within non-overlapping local windows to enforce locality, alternating with shifted windows in subsequent blocks to model cross-window connections, resulting in linear complexity O(N) relative to image size. This significantly lowers computational cost; for instance, Swin-Tiny requires only 4.5 GFLOPs, comparable to DeiT-Small, while achieving superior performance on ImageNet classification (81.3% top-1 accuracy) and downstream tasks like object detection and semantic segmentation. Similarly, the Pyramid Vision Transformer (PVT), proposed in 2021, constructs a pyramid-like feature hierarchy by progressively shrinking spatial dimensions through patch embedding and spatial-reduction attention, which subsamples keys and values to cut attention computation by a factor of four per stage. This enables efficient dense prediction without convolutions, with PVT-Small using just 3.8 GFLOPs and attaining 79.3% accuracy, outperforming prior ViT models in semantic segmentation on ADE20K (44.0% mIoU). PVT v2 further refines this with overlapping patch embeddings and convolutional feed-forward layers, boosting efficiency and accuracy.

Pooling-based improvements address information loss during downsampling in hierarchical ViTs by replacing abrupt token reduction with gradual aggregation. The Pooling-based Vision Transformer (PiT), for example, integrates a pooling layer that downsamples spatial dimensions while expanding channels, preserving fine-grained details better than plain linear projections; a small PiT variant achieves 77.1% accuracy at 0.87 GFLOPs, demonstrating improved generalization over vanilla ViTs. Overlapping or adaptive pooling variants further mitigate downsampling artifacts, enhancing feature expressiveness in early stages.

For mobile deployment, MobileViT (2021) hybridizes transformers with lightweight convolutions, using inverted residual blocks to process local features before applying self-attention on unfolded feature patches, yielding models with under 2 million parameters in the smallest configuration. MobileViT-S delivers 78.4% accuracy with approximately 2 GFLOPs, a substantial improvement over comparable CNNs like MobileNetV3-Large (75.2% at 0.22 GFLOPs), making it suitable for edge devices while retaining global modeling benefits.
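The window partitioning that underlies Swin-style local attention can be sketched as follows; this illustrative helper only shows the splitting of a token grid into non-overlapping windows (so attention cost grows linearly with the number of tokens) and omits the shifted-window step and relative position biases.

```python
import torch

def window_partition(x, window_size=7):
    """Illustrative Swin-style window partitioning: split an (H, W) grid of
    tokens into non-overlapping window_size x window_size windows so that
    self-attention can be computed independently inside each window."""
    B, H, W, D = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, D)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, D)
    return windows                                  # (B * num_windows, window_size**2, D)

x = torch.randn(2, 56, 56, 96)                      # assumed stage-1 feature map size
print(window_partition(x).shape)                    # torch.Size([128, 49, 96])
```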

Specialized and Multimodal Variants

One prominent specialized variant of the Vision Transformer (ViT) is TimeSformer, introduced in 2021, which adapts the architecture for video understanding by applying self-attention across both spatial and temporal dimensions without relying on convolutions. TimeSformer processes videos as sequences of frame patches, employing a divided space-time attention strategy that alternates between spatial attention within frames and temporal attention across frames, reducing complexity from O(T^2 S^2) to O(T^2 S + T S^2), where T is the number of frames and S is the number of spatial patches. This factorization enables efficient modeling of spatiotemporal dependencies, achieving state-of-the-art performance on benchmarks like Kinetics-400 with 78.0% top-1 accuracy using a base model, surpassing prior CNN-based methods while maintaining scalability.

Another generative adaptation is ViT-VQGAN, developed in 2021, which integrates ViT into a vector-quantized GAN (VQGAN) framework to enhance high-resolution image synthesis. By replacing convolutional components with ViT encoders and decoders, ViT-VQGAN learns discrete image tokens in a two-stage process: first quantizing images into a compact discrete codebook via a ViT-based vector-quantization layer, then autoregressively modeling these tokens with a Transformer for generation. This approach improves sample quality and efficiency over the original VQGAN, yielding better FID scores (e.g., an FID of 4.17 at 256×256 resolution) and faster training due to ViT's global attention, making it suitable for tasks like image inpainting and super-resolution.

In more recent specialized developments, 3D-VisTA (2023) extends ViT principles to 3D vision-language alignment by pre-training a transformer on 3D point clouds paired with textual descriptions. The model processes scenes as sequences of point patches, using cross-modal attention to align spatial features with text embeddings, enabling transfer to downstream tasks like captioning and retrieval. On the ScanRefer benchmark, 3D-VisTA achieves 52.1% accuracy in referring expression grounding, outperforming prior 3D vision-language models by leveraging ViT's patch-based tokenization for geometric data. Additionally, ongoing robustness enhancements for ViTs against adversarial attacks have focused on architectural modifications and training strategies, as surveyed in 2024-2025 literature, including adversarial training with momentum and regularization to mitigate vulnerabilities in domains like traffic sign recognition. These improvements have boosted robust accuracy under PGD attacks in specialized domains.

Multimodal variants often hybridize ViT with language models, as seen in CLIP-ViT architectures, which use ViT as the image encoder in contrastive learning frameworks for vision-language tasks. In the original CLIP setup (2021), ViT processes images into patch embeddings that are projected into a joint embedding space with text features from a Transformer text encoder, enabling zero-shot classification via cosine similarity. This hybrid has been widely adopted, powering applications like open-vocabulary detection and achieving 76.2% zero-shot accuracy on ImageNet, matching a fully supervised ResNet-50 baseline without task-specific training. Further integrations with diffusion models for generative AI, such as DiffiT (2023), replace convolutional U-Net backbones with ViT-based architectures in denoising diffusion probabilistic models (DDPMs), generating images through iterative denoising. DiffiT leverages ViT's self-attention for global context in the diffusion process, attaining an FID score of 1.95 and demonstrating superior sample diversity and quality over convolutional diffusion models while scaling efficiently to larger resolutions.
As of 2025, recent advances include efficiency-oriented hybrid models like EfficientViT and multimodal extensions such as LLaVA, which pairs a CLIP ViT image encoder with a large language model, enhancing efficiency and cross-modal capabilities and further solidifying ViT's role in real-world applications.

Comparison with Convolutional Neural Networks

Architectural and Computational Differences

The Vision Transformer (ViT) fundamentally differs from convolutional neural networks (CNNs) in its core architecture, employing global self-attention across sequences of image patches rather than local convolutional filters that process neighborhoods of pixels. This global attention allows ViT to capture long-range dependencies in a single pass, contrasting with CNNs' hierarchical feature extraction through stacked local operations. Unlike CNNs, which inherently encode translation equivariance via weight sharing in convolutions, ViT lacks built-in spatial invariance and relies on learnable positional embeddings to inject patch locations, requiring the model to learn spatial relationships explicitly.

CNNs incorporate strong inductive biases such as locality (assuming relevant features are nearby) and shift-equivariance, which reduce the amount of data needed to learn these properties. In contrast, ViTs possess weaker inductive biases, treating images as unordered sets of patches and learning spatial hierarchies, translation invariance, and locality solely from training data, which grants greater flexibility but demands larger datasets and more parameters to achieve comparable generalization.

Computationally, ViT's self-attention layers exhibit quadratic complexity, O(N²), with respect to the number of patches N, owing to pairwise interactions in the attention matrix, whereas CNNs scale linearly, O(N), as fixed-size kernels slide over the input. This scaling leads to higher memory demands during training and inference, particularly for high-resolution inputs, because the attention mechanism stores and computes interactions across all patch pairs, often requiring several times the GPU memory of equivalent CNNs like ResNet-50.

Hybrid approaches, such as ConvNeXt, bridge these paradigms by modernizing CNN architectures with Transformer-inspired design choices, including larger kernels, fewer activation functions, and inverted bottlenecks, while remaining fully convolutional, thereby enhancing efficiency and performance without adopting explicit self-attention. These models demonstrate that incorporating elements of the Transformer's scalability into convolutional frameworks can yield competitive results with reduced computational overhead compared to pure ViTs.
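A back-of-the-envelope comparison of how these costs grow with input resolution is sketched below; the FLOP counts are simplified (constants and projection costs are ignored), and the convolution channel widths are arbitrary placeholders rather than values from any specific model.

```python
def attention_flops(num_tokens, dim):
    """Rough cost of one global self-attention layer: forming the N x N score
    matrix and the weighted sum each take about N^2 * D multiply-accumulates."""
    return 2 * num_tokens ** 2 * dim

def conv_flops(num_positions, kernel=3, c_in=256, c_out=256):
    """Rough cost of one k x k convolution applied at every spatial position:
    grows linearly with the number of positions."""
    return num_positions * kernel ** 2 * c_in * c_out

base = (224 // 16) ** 2                       # 196 tokens at 224x224 with 16x16 patches
for res in (224, 448, 896):
    n = (res // 16) ** 2
    print(res, n,
          attention_flops(n, 768) / attention_flops(base, 768),   # ~quadratic in N
          conv_flops(n) / conv_flops(base))                       # linear in N
# Doubling the resolution quadruples N, multiplying attention cost by ~16
# but convolution cost by only ~4.
```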

Performance and Efficiency Benchmarks

The original Vision Transformer (ViT) models, when pre-trained on the large JFT-300M dataset, achieved 88.55% top-1 accuracy on ImageNet for the ViT-H/14 variant, surpassing contemporary convolutional neural networks (CNNs) like ResNet that plateau around 80-82% without extensive pretraining. In contrast, the Data-efficient Image Transformer (DeiT-S), trained solely on ImageNet-1K without external data, reached 81.2% top-1 accuracy with distillation, closely matching EfficientNet-B3's 81.6% while using comparable parameters but higher throughput (936 images/second versus 732).

Scaling studies demonstrate that ViT performance follows a power-law relationship with model size and data volume, where error rates decrease as compute scales, enabling larger variants like ViT-H/14 to attain 88.55% top-1 accuracy on ImageNet and ViT-G/14 to reach 90.45% with billions of parameters and extensive pretraining. This scaling advantage allows ViTs to exceed 90% accuracy on ImageNet with sufficient resources, a threshold CNNs like ResNet struggle to surpass without modifications. Meanwhile, modern CNNs such as ConvNeXt have narrowed the gap in supervised settings through transformer-inspired designs, achieving comparable or superior results to base ViTs on ImageNet (e.g., ConvNeXt-Large at 87.8% versus ViT-L/16 at 85-88%) while maintaining CNN efficiency.

Efficiency benchmarks highlight trade-offs: ViTs typically require 2-4× more floating-point operations (FLOPs) than equivalent ResNets; for instance, ViT-B/16 demands about 17.6 GFLOPs compared to ResNet-50's 4 GFLOPs, leading to 1.5-3× longer inference times on standard GPUs despite similar parameter counts. However, ViTs excel on large-scale tasks due to their global attention, outperforming CNNs in data-rich scenarios, though this incurs higher latency on resource-limited devices. Recent 2025 surveys on edge deployment address these challenges through techniques like pruning and quantization, reducing ViT models by 50-80% in size while preserving over 90% of accuracy for mobile deployment.

In 2024-2025 benchmarks for remote sensing scene interpretation on datasets like NWPU-RESISC45, ViT-based models outperformed CNNs by 2-10% in accuracy when trained on large-scale data, leveraging long-range dependencies for better holistic understanding. Federated learning evaluations further underscore ViT robustness; the EFTViT framework achieved up to 28% higher classification accuracy than baseline methods across distributed datasets, with 2.8× reduced compute and enhanced privacy preservation on heterogeneous clients.
| Model | Pretraining Data | ImageNet Top-1 Accuracy (%) | GFLOPs (Inference) | Resolution |
| --- | --- | --- | --- | --- |
| ViT-H/14 | JFT-300M | 88.55 | ~630 | 384×384 |
| DeiT-S (distilled) | ImageNet-1K only | 81.2 | ~4.6 | 224×224 |
| EfficientNet-B3 (+ augmentation) | ImageNet-1K only | 81.6 | ~1.8 | 300×300 |
| ConvNeXt-Large | ImageNet-21k | 87.8 | ~101 | 384×384 |
| ResNet-50 | ImageNet-1K only | 77.4 | 4.1 | 224×224 |

Applications

Static Image Tasks

Vision transformers (ViTs) have been widely applied to image classification tasks, where images are divided into patches and processed through transformer encoders to produce class predictions via a classification head. The original ViT model, when fine-tuned on datasets like ImageNet-1K, demonstrates competitive performance with convolutional neural networks (CNNs), achieving top-1 accuracy of 88.55% for the ViT-H/14 variant pre-trained on JFT-300M and fine-tuned on ImageNet. On the COCO dataset, ViT backbones enable high-accuracy object detection in downstream tasks, particularly when integrated into larger frameworks. Ensembles of ViT models have pushed state-of-the-art results, with combinations of multiple ViT variants attaining over 90% top-1 accuracy on ImageNet, surpassing single models like EfficientNet.

In semantic segmentation, ViTs facilitate pixel-level predictions by encoding global contextual information across image patches. SegFormer employs a hierarchical ViT encoder to generate multi-scale feature maps, combined with a lightweight multilayer perceptron (MLP) decoder for efficient segmentation without positional encodings or complex post-processing. This design achieves state-of-the-art mean intersection over union (mIoU) scores, such as 51.8% mIoU (multi-scale) on the ADE20K dataset for the SegFormer-B5 model, while maintaining lower computational costs compared to prior transformer-based segmentors. The hierarchical structure in SegFormer allows for progressive feature resolution, enabling precise boundary delineation in static scenes like urban environments or natural images.

For object detection, ViTs serve as robust backbones in end-to-end frameworks that treat detection as a set prediction problem. DETR integrates a transformer encoder-decoder to directly output bounding boxes and labels, eliminating the need for hand-crafted components like non-maximum suppression, and achieves 42 average precision (AP) on the COCO dataset with a ResNet-50 backbone, with ViT variants further enhancing performance through better global reasoning. Deformable DETR extends this by introducing deformable attention modules, which focus on a sparse set of sampling points to improve convergence and efficiency, resulting in 46.9 AP on COCO val after 50 epochs, ten times fewer training epochs than DETR, while excelling in small-object detection through adaptive sampling. These ViT-based detectors leverage the transformer's ability to model long-range dependencies, making them suitable for dense object scenes in static images.

ViTs have been deployed in medical imaging for tasks like tumor detection, where their attention mechanisms capture subtle global patterns in scans. For instance, hybrid ViT-CNN ensembles applied to brain MRI datasets achieve over 98% accuracy in classifying gliomas and meningiomas, outperforming standalone CNNs by integrating patch-level and holistic features for early diagnosis. In autonomous driving, ViTs enhance scene understanding by processing static camera feeds to segment roads, vehicles, and pedestrians. Vision transformers integrated into perception pipelines, such as those using Swin Transformer backbones, improve semantic segmentation on datasets like Cityscapes, enabling reliable environmental parsing for safe navigation in urban settings.

Video and Sequential Tasks

Vision transformers, originally designed for static images, have been adapted for video and sequential tasks by extending their self-attention mechanisms to the temporal dimension, allowing the capture of spatio-temporal dynamics without relying on convolutional operations. This adaptation treats videos as sequences of spatial patches over time, enabling end-to-end learning of both spatial and temporal relationships.

In video classification, TimeSformer employs divided space-time attention on frame patches, where attention is applied separately over the spatial dimension within each frame and over the temporal dimension across frames, achieving competitive results on benchmarks like Kinetics-400. The Video Swin Transformer extends the hierarchical Swin design to videos by using shifted window-based attention over 3D space-time volumes, processing tubelet embeddings to model local spatio-temporal interactions efficiently and attaining state-of-the-art accuracy on datasets such as Something-Something-v2.

For action recognition, ViViT factorizes the transformer encoder into spatial and temporal components, applying attention separately to image patches and to temporal sequences of patches, which scales effectively on the Kinetics dataset with top-1 accuracies exceeding 80% for larger models and longer clips. This factorization reduces computational overhead while preserving the transformer's ability to model long-range dependencies in video actions.

Vision transformers also support sequential modeling in domains like time-series imagery, where they process ordered image sequences to detect temporal evolution. In satellite imagery for change detection, adaptations of ViT analyze multi-temporal inputs, such as Sentinel-2 time series, by attending to patch sequences across dates to identify urban changes with high precision, often outperforming CNN-based methods in capturing subtle temporal variations.

To address the high token counts in videos, efficiency techniques like tubelet embedding are commonly used, wherein non-overlapping spatio-temporal cubes (e.g., 2×16×16 pixels over time) are extracted and linearly projected into token embeddings, reducing sequence length by up to 90% compared to frame-by-frame patching while maintaining representational power.
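Tubelet embedding is straightforward to sketch with a strided 3D convolution; the tubelet size and clip dimensions below are illustrative assumptions, not values prescribed by any particular video model.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Illustrative tubelet embedding for video ViTs: a 3D convolution whose
    kernel and stride equal the tubelet size (t, P, P) extracts non-overlapping
    spatio-temporal cubes and linearly projects each to a token embedding."""
    def __init__(self, in_chans=3, embed_dim=768, tubelet=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_chans, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                       # video: (B, C, T, H, W)
        x = self.proj(video)                        # (B, D, T/t, H/P, W/P)
        return x.flatten(2).transpose(1, 2)         # (B, num_tokens, D)

clip = torch.randn(1, 3, 16, 224, 224)              # 16 frames at 224x224 (assumed clip size)
print(TubeletEmbedding()(clip).shape)                # torch.Size([1, 1568, 768])
```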

Emerging and Multimodal Applications

Vision Transformers (ViTs) are increasingly applied in emerging domains beyond conventional imaging, including 3D vision, generative modeling, and multimodal fusion, leveraging their attention mechanisms for global contextual understanding. These extensions highlight ViTs' adaptability to complex data structures and cross-modal interactions, driving innovations in fields like autonomous systems and content creation.

In 3D vision tasks, ViTs process representations such as point clouds or voxels to enable applications like object detection in sparse environments. The Pointformer architecture, a dedicated transformer backbone for 3D point clouds, employs local and global attention to extract robust features, achieving 77.06% average precision for moderate car detection on the KITTI test split, outperforming prior methods. Building on this, Point Transformer V3 refines the design for simplicity and efficiency, achieving state-of-the-art results on ScanNet and nuScenes benchmarks while reducing computational overhead.

Generative applications integrate ViTs into diffusion models for high-fidelity image synthesis and forgery detection. The Diffusion Transformer (DiT) substitutes the convolutional U-Net backbone with a pure Transformer architecture, enabling scalable training that generates images at resolutions up to 512×512 with improved FID scores compared to hybrid models. In text-to-image systems like DALL-E 2, ViT-based CLIP encoders align textual prompts with visual latents in the generation pipeline, facilitating coherent generation from diverse descriptions. For deepfake detection, ViTs excel by capturing inconsistencies in global patterns; standalone ViT models, for instance, attain over 95% accuracy on the DeepFake Detection Challenge dataset, surpassing baselines in generalization.

Multimodal advancements fuse ViTs with language and embedding spaces for integrated reasoning. BLIP-2 utilizes a frozen ViT as its visual backbone, paired with a querying transformer (Q-Former), to bootstrap vision-language pre-training efficiently, yielding top performance on zero-shot image-text retrieval tasks like COCO with recall@1 exceeding 80%. In robotics, the Visual Navigation Transformer (ViNT) processes egocentric visual inputs via ViT for goal-conditioned navigation, demonstrating zero-shot generalization across simulated and real-world robots with success rates above 70% in unseen environments.

By 2025, ViT deployments emphasize efficiency and trustworthiness, with edge-optimized variants enabling real-time inference on resource-constrained devices through techniques like pruning and quantization. Federated learning frameworks, such as EFTViT, support privacy-preserving ViT training by masking images during distributed updates, ideal for sensitive applications including AR/VR where data locality prevents central aggregation. Concurrently, explainable AI integrations, such as interpretability-aware ViTs, provide causal visualizations to elucidate decision processes, enhancing trust in high-stakes deployments.

Challenges and Future Directions

Current Limitations

Vision Transformers (ViTs) are characterized by a pronounced data hunger, performing poorly on small or medium-sized datasets without extensive pretraining due to their lack of vision-specific inductive biases such as translation equivariance and locality. When trained from scratch on ImageNet, which contains approximately 1.3 million images, the base ViT model achieves only 77.9% top-1 accuracy, comparable to a ResNet-50 at roughly 76.1% but trailing optimized CNNs that exceed 80%. This gap closes only with pretraining on massive datasets; for instance, larger ViT variants attain up to 88.6% accuracy after pretraining on the 300-million-image JFT-300M corpus. Subsequent ViT implementations continue this reliance, often drawing from web-scale datasets like LAION-5B, a collection of 5.85 billion CLIP-filtered image-text pairs, to enable effective learning of visual representations.

The computational demands of ViTs pose substantial challenges, particularly arising from the quadratic complexity of the self-attention mechanism, which scales as O(N²) with the number of patches N derived from the input resolution. For a standard 224×224 image divided into 16×16 patches, this results in approximately 18 billion floating-point operations (FLOPs) per forward pass for a base ViT, far exceeding lightweight CNNs like MobileNet at under 1 billion FLOPs. This scaling limits ViTs' applicability to high-resolution inputs, as increasing the patch count sharply raises memory and time requirements, leading to high inference latency, often considerably slower than CNNs, on edge devices with constrained resources.

ViTs exhibit notable robustness gaps, including heightened vulnerability to adversarial perturbations and distribution shifts, stemming from their reduced emphasis on the low-level feature biases inherent in CNNs. Patch-wise adversarial attacks, such as those in the Patch-Fool framework, demonstrate that ViTs can be misled with minimal localized noise, achieving success rates comparable to or higher than those against CNNs under similar conditions, thus undermining claims of inherent superiority in robustness. Moreover, without built-in priors for local texture or shape, ViTs can show vulnerabilities under out-of-distribution shifts, such as common corruptions in datasets like ImageNet-C; however, pre-trained ViT variants often achieve competitive or superior mean corruption error (mCE) rates compared to CNNs.

Regarding interpretability, attention maps in ViTs provide valuable insights into global inter-patch relationships but fall short of the intuitive, hierarchical visualizations offered by CNNs, such as activation maps revealing edge or texture detectors in early layers. This absence of structured low-level representations complicates the dissection of ViT decision processes, often necessitating advanced post-hoc methods like attention rollout or gradient-based attribution to approximate CNN-style explanations. As a result, ViTs' black-box nature hinders trust and debugging in safety-critical applications compared to the more transparent hierarchies in CNNs.

Recent advances in Vision Transformers have focused on compression techniques to enable deployment on resource-constrained edge devices. Pruning methods, both structured and unstructured, reduce model parameters while maintaining performance, with surveys highlighting up to 50% parameter reduction without significant accuracy loss on benchmarks.
Quantization approaches, including post-training quantization and quantization-aware training, have lowered precision from 32-bit floating point to 8-bit integers or lower, achieving energy savings of up to 4× on mobile hardware for tasks like image classification. A 2025 comprehensive survey on edge ViTs categorizes these techniques, emphasizing hardware-aware optimizations that integrate pruning and quantization for inference on edge devices. Hybrid CNN-ViT models, such as those combining ConvNeXt blocks with transformer layers, leverage convolutional locality for efficiency, demonstrating 20-30% faster inference than pure ViTs on recognition tasks while preserving state-of-the-art accuracy.

Improvements in robustness have addressed ViT vulnerabilities through adversarial training integrations and studies of internal representations. Adversarial training variants, like Robustness Tokens, enhance resistance to white-box attacks by injecting specialized tokens during fine-tuning, improving robust accuracy by 5-10% on ImageNet against PGD attacks compared to baseline ViTs. Concept emergence studies reveal that ViTs develop increasingly complex representations across layers, with early layers capturing low-level features like edges and colors while deeper layers abstract concepts such as object parts, correlating with improved robustness. These findings, derived from probing large pretrained ViTs, underscore layered representational complexity as a key to robustness, with interventions like register-based adaptations boosting out-of-distribution performance by 2-4% on ImageNet variants.

Future trends in ViTs emphasize multimodal unification with large language models (LLMs), sustainable scaling, and ethical deployment considerations. Multimodal architectures integrate ViT encoders with LLMs via unified token spaces, enabling tasks like visual question answering and image captioning in a single model; for instance, patch-as-decodable-token paradigms allow direct generation of visual and textual outputs, advancing towards generalist vision-language-action models. Sustainable AI efforts promote efficient scaling laws, where parameter-efficient fine-tuning and sparse activation reduce carbon footprints by optimizing compute allocation, with surveys noting that scaling ViTs to billions of parameters via mixture-of-experts maintains accuracy at lower cost than dense models. Ethical considerations in ViT deployment highlight fairness and bias issues, such as bias amplification in vision-language tasks, prompting calls for auditing frameworks and diverse datasets to mitigate societal risks in sensitive applications.

Open research areas include enhancing small-data learning and extending ViTs to 3D and 4D spaces for robotics and autonomous systems. Few-shot adaptation techniques, like dynamic-static prompting, enable ViTs to learn new classes with minimal examples by synergizing with pretrained knowledge, achieving 5-15% gains over standard fine-tuning on miniImageNet. For 3D and 4D extensions, transformer-based world models predict dynamic scenes using spatiotemporal attention, supporting robotics applications; surveys on embodied AI underscore autoregressive 4D representations for world modeling, with applications in autonomous driving showing improved prediction accuracy in cluttered environments.
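As a concrete illustration of the post-training quantization techniques discussed above, the hedged sketch below applies PyTorch's dynamic quantization to the linear layers of a small stand-in feed-forward block; a real deployment would quantize a full pretrained ViT and evaluate accuracy afterwards, and the stand-in module here is a placeholder, not an actual ViT.

```python
import torch
import torch.nn as nn

# Stand-in for the MLP portion of a ViT encoder layer; Linear layers dominate
# a ViT's parameter count and are the usual targets of dynamic quantization.
encoder = nn.Sequential(
    nn.LayerNorm(768),
    nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768),
)

# Convert the float32 Linear weights to int8; activations are quantized
# dynamically at runtime, so no calibration dataset is required.
quantized = torch.ao.quantization.quantize_dynamic(
    encoder, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 197, 768)
print(quantized(x).shape)        # same output shape, substantially smaller weights
```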
