Residual neural network

A residual neural network (ResNet) is a deep learning architecture that employs skip connections, also known as shortcut connections, to enable the effective training of networks with hundreds or thousands of layers by learning residual functions relative to the input rather than direct mappings from input to output. These connections allow the addition of the layer input directly to the output of the layer's transformations, formulated as H(x) = F(x) + x, where x is the input, F(x) is the residual mapping learned by the layer, and H(x) is the desired output mapping. This design mitigates the vanishing gradient problem and the degradation issue, where deeper plain networks experience diminished accuracy despite increased capacity. ResNets were introduced in the 2015 paper "Deep Residual Learning for Image Recognition" by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft Research, with the work presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2016. The architecture builds on prior convolutional neural networks such as VGG but innovates by stacking residual blocks—typically consisting of convolutional layers followed by batch normalization and ReLU activation—connected via identity shortcuts when dimensions match or projection shortcuts otherwise. Variants include ResNet-34 (34 layers), ResNet-50 (50 layers using bottleneck blocks for efficiency), up to ResNet-152, demonstrating that residual learning permits scaling to extreme depths without performance saturation. On the ImageNet dataset, a benchmark for large-scale image classification, ResNet models set new records; for instance, the 34-layer ResNet achieved a 7.40% top-5 error rate on the validation set, the 50-layer version 6.71%, the 152-layer model 5.71%, and an ensemble of models reached 3.57% top-5 error on the test set, outperforming all prior entries and winning first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 classification task. These results validated the framework's ability to train networks of up to 1202 layers on smaller datasets such as CIFAR-10, where a 110-layer ResNet reduced the test error to 6.43%. The introduction of ResNets profoundly influenced deep learning by establishing residual connections as a core building block for modern architectures, enabling deeper models in tasks such as object detection and semantic segmentation, and beyond into natural language processing and generative models. The original paper has garnered over 300,000 citations as of 2025, underscoring its enduring impact on network design and training practices.

Overview

Definition and core concept

A residual neural network (ResNet) is a deep neural network architecture designed to facilitate the training of networks with hundreds or thousands of layers by incorporating residual connections, also known as skip connections. These connections allow the input to a block of layers to be added directly to the output, enabling the network to learn residual mappings relative to the input rather than complete transformations from scratch. Introduced in 2015, ResNets have become a foundational architecture in computer vision and beyond, powering advancements in image recognition and other tasks. The core concept of residual learning reformulates the layers of a network such that, instead of directly approximating a desired underlying mapping H(x), each layer fits a residual function F(x) = H(x) - x. The resulting approximation is then H(x) = F(x) + x, where x is the input to the layer and F(x) is typically computed by a stack of nonlinear layers. This shortcut connection can be implemented as an identity mapping with minimal computational overhead when dimensions match, or via a linear projection otherwise. By making it easier for the network to learn identity functions—where deeper layers effectively pass the input through unchanged—residual learning mitigates the vanishing gradient problem and the degradation issue observed in very deep plain networks, where adding more layers leads to increased training error. In practice, this is expressed for a residual block as: \begin{align} y &= \mathcal{F}(x, \{W_i\}) + x, \\ x_{\ell+1} &= \mathcal{F}(x_\ell, \{W_i\}) + x_\ell, \end{align} where \mathcal{F} represents the residual function (e.g., convolutions and activations), \{W_i\} are its weights, and \ell indexes the block. This design not only stabilizes training but also empirically demonstrates superior performance; for instance, an ensemble of ResNets, with the deepest member having 152 layers, achieved a top-5 test error rate of 3.57% on the ImageNet dataset, surpassing previous state-of-the-art models at the time.
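
This additive structure translates directly into code. The following is a minimal PyTorch sketch of a residual block whose residual function is two 3×3 convolutions with batch normalization; the class name and tensor sizes are illustrative rather than taken from any particular reference implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleResidualBlock(nn.Module):
        # Illustrative residual block: y = ReLU(F(x) + x), with F(x) built from two 3x3 convolutions.
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
            return F.relu(residual + x)  # H(x) = F(x) + x, followed by the nonlinearity

    x = torch.randn(1, 64, 56, 56)            # a batch with 64 feature channels
    y = SimpleResidualBlock(64)(x)
    print(y.shape)                            # same shape as the input: torch.Size([1, 64, 56, 56])

Because the block preserves the shape of its input, many such blocks can be chained without any glue code, which is the property that makes very deep stacks practical.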

Historical motivation and benefits

The pursuit of deeper neural networks in the early 2010s was driven by the observation that increased depth could enhance representational power and accuracy in tasks like image recognition, as demonstrated by architectures such as AlexNet (8 layers, 15.3% top-5 error on ImageNet in 2012) and subsequent models like VGGNet (19 layers) and GoogLeNet (22 layers, achieving 6.67% top-5 error in ILSVRC 2014). However, training networks beyond approximately 30 layers encountered significant obstacles, including the vanishing gradient problem, where gradients diminish exponentially during backpropagation, impeding updates to early-layer parameters. This issue, first highlighted in the context of recurrent networks, extended to feedforward architectures, limiting the effective depth of models despite advances like batch normalization for stabilizing training. A key empirical challenge was the degradation problem, observed in plain convolutional networks without skip connections: as depth increased from 20 to 56 layers, training error rose despite sufficient model capacity, indicating optimization difficulties rather than overfitting. This counterintuitive phenomenon suggested that deeper networks struggled to learn identity mappings effectively, even though a deeper model should in principle be able to match a shallower one by leaving its extra layers as identities. Building on prior work like Highway Networks, which introduced learnable gating mechanisms to enable gradient flow across hundreds of layers by allowing information to bypass nonlinear transformations, residual networks simplified this approach. Highway Networks demonstrated that gated skip connections could train feedforward nets of 100 layers, outperforming shallower counterparts on tasks like character recognition, but required additional gating parameters. The residual learning framework addressed these issues by reformulating network layers to learn residual functions \mathcal{F}(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}, with skip connections adding the input \mathbf{x} directly to the output, facilitating identity mappings when residuals are near zero. This design eases optimization, as residual blocks default to identity functions if their residual branches remain near zero, avoiding the degradation observed in plain nets and promoting smoother gradient propagation. The benefits were immediately evident in experiments: an ensemble of ResNets achieved a top-5 test error of 3.57% on ImageNet—outperforming the prior best by over 3 percentage points—while the 34-layer ResNet outperformed the 18-layer ResNet by 2.8% in top-1 error. These results established residual connections as a foundational design element, enabling networks deeper than 1000 layers in later variants and influencing architectures across computer vision and beyond.

Mathematical foundations

Residual connections

Residual connections, also referred to as skip connections, form the core mechanism of residual neural networks by enabling direct addition of the input to the output of a neural network layer or block. This design allows the network to learn residual functions rather than the full underlying mappings, facilitating the training of deeper architectures. Specifically, for an input x, a residual block computes the residual function F(x), typically through a series of convolutional layers, and adds it to the input via H(x) = F(x) + x, where H(x) represents the desired mapping. The motivation for residual connections arises from the degradation problem observed in deep plain networks, where increasing depth beyond a certain point leads to higher training and test errors even though the deeper model has sufficient capacity. By reformulating layers to learn F(x) = H(x) - x, the network can more easily approximate identity mappings when the residual is near zero, which is hypothesized to be simpler than optimizing stacked nonlinear layers directly. This approach addresses optimization difficulties in very deep networks, as identity shortcuts serve as a strong baseline that subsequent layers can refine. Key benefits include improved gradient flow during backpropagation, where the shortcut path allows gradients to propagate directly without being diminished by repeated multiplications through deep layers, mitigating the vanishing gradient problem. Empirically, this enables training of networks with hundreds of layers; for instance, a 152-layer residual network achieved a 4.49% top-5 validation error on the ImageNet dataset, surpassing shallower models like VGG while still having lower computational complexity than VGG-16/19. Residual connections also enhance representational power by combining features from multiple depths, contributing to better generalization in tasks requiring hierarchical feature extraction.
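
The gradient-flow benefit can be illustrated numerically. The sketch below compares the input gradient of a deep stack of deliberately small-weight layers with and without an additive skip; the use of fully connected layers, the depth, and the initialization scale are stand-in assumptions chosen only to make the vanishing effect visible, not part of any ResNet recipe.

    import torch
    import torch.nn as nn

    depth, dim = 50, 16
    layers = [nn.Linear(dim, dim) for _ in range(depth)]
    for layer in layers:                      # small weights provoke vanishing gradients in the plain stack
        nn.init.normal_(layer.weight, std=0.01)
        nn.init.zeros_(layer.bias)

    def input_gradient_norm(use_skip):
        x = torch.randn(1, dim, requires_grad=True)
        h = x
        for layer in layers:
            h = torch.tanh(layer(h)) + (h if use_skip else 0)   # skip connection adds the block input
        h.sum().backward()
        return x.grad.norm().item()

    print("plain stack:   ", input_gradient_norm(use_skip=False))  # collapses toward zero
    print("residual stack:", input_gradient_norm(use_skip=True))   # stays on the order of one

With the skip connections, each layer's Jacobian is approximately the identity plus a small perturbation, so the product of Jacobians across all fifty layers remains well conditioned.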

Dimension matching and projections

In residual neural networks, the core operation within a residual block involves adding the input \mathbf{x} to the output of the residual function \mathcal{F}(\mathbf{x}, \{W_i\}), yielding \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}. This shortcut connection assumes that the dimensions of \mathbf{x} and \mathcal{F}(\mathbf{x}) match, allowing direct element-wise addition without additional parameters or computational overhead. However, dimension mismatches arise when the number of channels (or the spatial size of the feature maps) changes across layers, such as during downsampling or transitions to blocks with more filters. In such cases, a projection shortcut is employed to align dimensions, formulated as \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s \mathbf{x}, where W_s is a linear projection, typically implemented via a 1×1 convolution (strided when the spatial resolution is reduced) to match dimensions while adding few parameters. This projection ensures compatibility for element-wise addition and introduces minimal extra computation compared to the main path. Because the projection replaces the parameter-free identity path, keeping the shortcut as close to an identity mapping as possible helps maintain stable training dynamics. Projection shortcuts were empirically shown to work well in deeper architectures such as ResNet-50 and ResNet-101, where dimension changes occur at stage transitions. Without proper dimension matching, the shortcut would fail to propagate information effectively, exacerbating degradation issues in plain networks. In practice, projections are used sparingly—only when dimensions differ—to keep the model lightweight; in the original ResNet architectures they appear only at the handful of stage transitions, roughly 10-20% of the shortcuts depending on depth. The original paper also evaluated parameter-free zero-padding shortcuts for dimension increases, but linear projections remain the standard when dimensions change, offering slightly better accuracy at modest extra cost.
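
A projection shortcut at a stage transition might look as follows in PyTorch; the class name, the stride of two, and the batch normalization placed after the 1×1 convolution are illustrative choices (the latter mirrors common reference implementations) rather than requirements of the formulation above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DownsamplingResidualBlock(nn.Module):
        # Illustrative block for a stage transition: channel count doubles, spatial size halves.
        def __init__(self, in_channels, out_channels, stride=2):
            super().__init__()
            self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(out_channels)
            self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(out_channels)
            # Projection shortcut W_s: a strided 1x1 convolution aligns both the channel
            # count and the spatial resolution so that element-wise addition is possible.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

        def forward(self, x):
            residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
            return F.relu(residual + self.shortcut(x))   # y = F(x) + W_s x

    x = torch.randn(1, 64, 56, 56)
    y = DownsamplingResidualBlock(64, 128)(x)
    print(y.shape)   # torch.Size([1, 128, 28, 28])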

Forward and backward propagation

In residual neural networks, forward propagation through a residual block computes the output as the sum of the input and the residual function applied to the input. Formally, for the l-th block, the output H_l(x_l) is given by H_l(x_l) = F_l(x_l, \{W_l^i\}) + x_l, where x_l is the input from the previous layer, F_l is the residual function (typically comprising multiple convolutional layers with nonlinearities), and \{W_l^i\} are the learned weights for those layers. This shortcut connection, which is an identity mapping when dimensions match, allows the input to bypass the residual function directly, enabling the network to effectively learn perturbations around the identity mapping. The residual function F_l is often implemented as a stack of layers, such as two convolutional layers with batch normalization and ReLU activations: F_l(x_l) = \mathrm{BN}(W_2\, \sigma(\mathrm{BN}(W_1 x_l))), where \sigma denotes ReLU, BN is batch normalization, and W_1, W_2 are weight matrices; a final ReLU is applied after the addition. During forward propagation, this addition ensures smooth signal flow, as the network can default to an identity mapping if the residual learns near-zero values, facilitating training of very deep architectures without immediate degradation in performance. Backward propagation in ResNets leverages the same shortcut to mitigate the vanishing gradient problem. The gradient of the loss L with respect to the input x_l is \frac{\partial L}{\partial x_l} = \frac{\partial L}{\partial H_l} \left( \frac{\partial F_l}{\partial x_l} + I \right), where I is the identity matrix. This formulation adds the identity to the Jacobian \frac{\partial F_l}{\partial x_l}, providing a constant path for gradients to flow backward unchanged, even if \frac{\partial F_l}{\partial x_l} approaches zero. As a result, information and gradients propagate effectively across many layers, enabling end-to-end training via stochastic gradient descent without the exponential decay seen in plain networks. This design choice is particularly beneficial for deep networks, as it allows the backward signal to reach early layers with minimal attenuation, promoting stable optimization and convergence to better minima. Analyses confirm that identity shortcuts preserve signal propagation in both forward and backward directions, contrasting with plain networks where gradients often vanish.
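
Assuming pure identity shortcuts with no nonlinearity after the addition (the pre-activation form analyzed in He et al.'s follow-up work), the block relation x_{l+1} = x_l + F_l(x_l) can be unrolled between any two depths l < m, which makes the direct gradient path explicit:

\begin{align}
x_m &= x_l + \sum_{i=l}^{m-1} F_i(x_i, \{W_i\}), \\
\frac{\partial L}{\partial x_l} &= \frac{\partial L}{\partial x_m} \frac{\partial x_m}{\partial x_l} = \frac{\partial L}{\partial x_m} \left( I + \frac{\partial}{\partial x_l} \sum_{i=l}^{m-1} F_i(x_i, \{W_i\}) \right).
\end{align}

The additive identity term ensures that \frac{\partial L}{\partial x_m} reaches x_l unattenuated, regardless of how small the accumulated Jacobians of the residual branches become.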

Residual block designs

Basic residual block

The basic residual block forms the core component of shallower residual networks, such as the 34-layer ResNet architecture, enabling the training of deeper models by mitigating the vanishing gradient problem through shortcut connections. Introduced in the seminal work on deep residual learning, this block processes an input feature map x by applying a sequence of two 3×3 convolutional layers, each followed by batch normalization (BN) and rectified linear unit (ReLU) activation, except for the final activation, which occurs after the residual addition. The mathematical formulation of the block is given by y = F(x, \{W_i\}) + x, where F(x, \{W_i\}) represents the residual mapping implemented via the stacked convolutions, and the shortcut connection adds the input x directly to the output of the second convolution (after its BN). Specifically, the residual function F for the basic block is structured as: \begin{align} z_1 &= \sigma \left( \mathrm{BN} \left( W_1 x \right) \right), \\ z_2 &= \mathrm{BN} \left( W_2 z_1 \right), \\ y &= z_2 + x, \end{align} followed by y = \sigma(y), with \sigma denoting the ReLU function, W_1 and W_2 as the weights of the convolutional kernels (typically maintaining the same number of channels within a stage, e.g., 64 filters), and batch normalization applied post-convolution to normalize activations and stabilize training. This design assumes dimension compatibility between x and F(x); when channels differ across stages, a 1×1 projection convolution is used in the shortcut for matching, though the basic block itself operates on equal-dimensional feature maps. In practice, these blocks are stacked to form stages of the network, with the stem using a 7×7 convolution and max-pooling for initial downsampling, after which basic residual blocks process features at progressively reduced spatial resolutions (e.g., 56×56 down to 7×7). The simplicity of this two-layer configuration contrasts with more complex variants, allowing ResNet-34 to achieve a top-5 error of 7.40% on ImageNet validation (10-crop testing, option C shortcuts) while converging faster than plain networks of similar depth. While competitive with prior models like VGG (7.1% top-5 error), deeper ResNets using bottleneck residual blocks enable substantial improvements, such as ResNet-152's 5.71% top-5 error, a reduction of over 1.4 percentage points compared to VGG, with ensembles achieving even greater gains.
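
The stage structure described above can be inspected directly in the torchvision reference implementation, assuming that library is available; the snippet below only prints the stage layout of ResNet-34 (3, 4, 6, and 3 basic blocks at 64 to 512 channels) rather than defining the model by hand.

    import torch
    from torchvision.models import resnet34

    # Build the torchvision reference ResNet-34 (weights=None constructs the architecture
    # without downloading pretrained parameters) and inspect its four stages of basic blocks.
    model = resnet34(weights=None)
    for name in ["layer1", "layer2", "layer3", "layer4"]:
        stage = getattr(model, name)
        print(name, "->", len(stage), "basic blocks at", stage[0].conv1.out_channels, "channels")

    x = torch.randn(1, 3, 224, 224)
    print(model(x).shape)   # torch.Size([1, 1000]) class logits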

Bottleneck residual block

The bottleneck residual block is a specialized variant of the residual block introduced to address the increased computational demands of very deep networks in ResNet architectures. It replaces the basic residual block, which uses two 3×3 convolutional layers, with a more efficient three-layer structure that reduces the number of parameters and floating-point operations while preserving representational capacity. This design is utilized in deeper ResNet variants, such as ResNet-50, ResNet-101, and ResNet-152, where stacking numerous basic blocks would become prohibitively expensive. The core structure of the bottleneck block consists of a sequence of three layers: a 1×1 convolution that compresses the input feature map's channel dimension (typically to one-quarter of the original), followed by a 3×3 convolution operating on this reduced representation, and concluding with a 1×1 convolution that expands the channels back to the original dimension. A residual connection adds the input directly to the output of this sequence, provided the dimensions match; otherwise, a 1×1 projection convolution is applied to align them. This "bottleneck" principle, drawing on dimensionality-reduction techniques, minimizes intermediate computations by performing the expensive 3×3 convolution on fewer channels, making it suitable for scaling to hundreds of layers. Batch normalization is applied after each convolution, in a post-activation structure (conv-BN-ReLU for the first two layers, then conv-BN, the addition, and a final ReLU). Mathematically, given an input \mathbf{x} with C channels, the residual \mathcal{F}(\mathbf{x}) in the bottleneck block is formulated as: \mathcal{F}(\mathbf{x}) = \mathcal{W}_3 \left( \sigma \left( \mathcal{W}_2 \left( \sigma \left( \mathcal{W}_1 (\mathbf{x}) \right) \right) \right) \right), where \mathcal{W}_1 is a 1×1 convolution reducing to C/4 channels, \mathcal{W}_2 is a 3×3 convolution maintaining C/4 channels, \mathcal{W}_3 is a 1×1 convolution expanding back to C channels, and \sigma denotes the activation function (typically ReLU); batch normalization is applied after each \mathcal{W}_i. The block output is then \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x} for identity shortcuts, or \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathcal{W}_s (\mathbf{x}) for projection shortcuts \mathcal{W}_s. This design significantly lowers complexity compared to equivalent basic blocks; for instance, in a 256-channel setting, a bottleneck block requires approximately 94% fewer parameters than a basic block of the same width while enabling deeper stacking. Empirical results in the original ResNet work demonstrate that bottleneck blocks facilitate training of 152-layer networks with lower error rates on ImageNet than shallower models, without degradation; an ensemble built around such models reached a 3.57% top-5 error. The approach has been widely adopted in subsequent architectures for efficient deep learning.
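
A minimal PyTorch sketch of the bottleneck pattern is shown below; the class name and the fixed reduction factor of four are illustrative assumptions, and the final print roughly confirms the parameter savings quoted above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BottleneckBlock(nn.Module):
        # Illustrative bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, plus an identity shortcut.
        def __init__(self, channels, reduction=4):
            super().__init__()
            mid = channels // reduction                                  # e.g. 256 -> 64
            self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)         # compress channels
            self.bn1 = nn.BatchNorm2d(mid)
            self.conv2 = nn.Conv2d(mid, mid, 3, padding=1, bias=False)   # cheap 3x3 on few channels
            self.bn2 = nn.BatchNorm2d(mid)
            self.conv3 = nn.Conv2d(mid, channels, 1, bias=False)         # expand back
            self.bn3 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = F.relu(self.bn2(self.conv2(out)))
            out = self.bn3(self.conv3(out))
            return F.relu(out + x)   # identity shortcut; a 1x1 projection would replace x if dims changed

    block = BottleneckBlock(256)
    # ~70,000 parameters here, versus roughly 1.2 million for a basic block at the same 256-channel width.
    print(sum(p.numel() for p in block.parameters()))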

Pre-activation and other variants

In the original residual block design introduced in the foundational ResNet paper, the sequence within each block follows a post-activation structure: a convolution is applied first, followed by batch normalization (BN) and ReLU activation, then another convolution and another BN, and then the addition of the shortcut connection, with a final ReLU after the addition. This configuration, while effective, can distort the identity mapping in very deep networks, as the nonlinearities and normalizations applied after the shortcut addition impede direct signal propagation. To address this limitation and facilitate cleaner identity mappings, He et al. proposed the pre-activation variant in their follow-up work, where BN and ReLU are placed before the convolutional weight layers. In this design, each residual block begins with BN followed by ReLU on the input (or the previous block's output), then applies the first convolution, followed by another BN-ReLU-convolution sequence, and finally adds the shortcut connection before passing the result to the next block. This rearrangement ensures that the shortcut path bypasses the nonlinearities and normalizations in a way that preserves the input signal more effectively during forward and backward propagation, reducing optimization difficulties in networks exceeding 1000 layers. The pre-activation approach applies to both basic and bottleneck blocks, with the latter incorporating 1×1 convolutions for channel reduction and expansion around the core 3×3 convolution. Empirical evaluations demonstrated substantial improvements with pre-activation blocks. On the CIFAR-10 dataset, a 1001-layer pre-activation ResNet achieved a test error of 4.62%, outperforming shallower original ResNets and enabling successful training of ultra-deep models that previously suffered from degradation. Similarly, on ImageNet, a 200-layer pre-activation ResNet yielded a top-1 error of 20.7%, a 1.1% absolute improvement over the baseline post-activation ResNet-200's 21.8%. These gains stem from better gradient flow, as the pre-activation structure keeps activations normalized at each block's input, mitigating vanishing gradients in deep stacks. Beyond pre-activation, the same follow-up work examined other shortcut variants, such as constant scaling, gating, dropout, and 1×1 convolutions on the shortcut path, and found that keeping the shortcut a strict identity—with no BN or other transformation applied to it—best preserves signal propagation. The "full pre-activation" design, which applies BN-ReLU only before each convolution and omits any activation after the addition, approximates pure identity shortcuts most closely and performed best among the variants studied. These modifications prioritize unimpeded signal propagation over aggressive nonlinear transformations, influencing subsequent architectures like Wide ResNets, which widen channels while adopting pre-activation blocks.
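
A pre-activation basic block differs from the post-activation sketches given earlier only in the ordering of operations; the following minimal PyTorch version (with illustrative names) keeps the shortcut path a pure identity by applying BN and ReLU before each convolution and nothing after the addition.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PreActBasicBlock(nn.Module):
        # Illustrative pre-activation block: BN-ReLU-conv, BN-ReLU-conv, then a pure addition.
        def __init__(self, channels):
            super().__init__()
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

        def forward(self, x):
            out = self.conv1(F.relu(self.bn1(x)))    # BN and ReLU precede the weight layer
            out = self.conv2(F.relu(self.bn2(out)))
            return out + x                           # no activation after the addition:
                                                     # the shortcut path stays a pure identity

    x = torch.randn(1, 64, 32, 32)
    print(PreActBasicBlock(64)(x).shape)   # same shape as the input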

Applications

Computer vision tasks

Residual neural networks, commonly known as ResNets, have become a foundational backbone in numerous computer vision tasks due to their ability to train very deep architectures without degradation in performance. In image classification, ResNets were initially developed and evaluated on large-scale datasets like ImageNet, where an ensemble of ResNet models (the deepest with 152 layers) achieved a top-5 error rate of 3.57%, outperforming prior state-of-the-art models such as VGG and GoogLeNet by significant margins and securing first place in the ILSVRC 2015 classification challenge. This success stems from residual connections that facilitate gradient flow, enabling the network to learn identity mappings and capture intricate hierarchical features essential for distinguishing thousands of object categories. Subsequent variants, such as ResNet-50, have been widely adopted for their balance of depth and computational efficiency, often serving as pretrained encoders in transfer learning scenarios for domain-specific classification tasks such as fine-grained recognition. Beyond classification, ResNets excel as feature extractors in object detection pipelines, where they replace shallower backbones to enhance localization and classification of multiple instances within images. For instance, integrating a ResNet-101 backbone into the Faster R-CNN framework improved mean average precision (mAP@0.5) on the COCO dataset from 41.5% (using VGG-16) to 48.4%, demonstrating the residual architecture's capacity to produce richer, more discriminative feature maps for region proposal and bounding box regression. This approach has influenced modern detectors like Mask R-CNN and its variants, where ResNet backbones contribute to strong performance in applications such as autonomous driving by enabling the detection of objects at various scales and under occlusion. The residual design's efficiency allows for deeper networks without excessive parameter overhead, making it suitable for deployment in resource-constrained environments. In semantic segmentation, ResNets provide robust feature hierarchies that support pixel-wise labeling of scenes, a critical task for applications such as scene understanding. The DeepLabv3 model, employing an atrous ResNet-101 backbone with dilated convolutions, achieved a mean intersection-over-union (mIoU) of 82.1% on the PASCAL VOC 2012 benchmark, surpassing earlier atrous spatial pyramid pooling methods by leveraging residual skips to preserve spatial detail and contextual information across layers. Similarly, the Pyramid Scene Parsing Network (PSPNet) uses ResNet-50 or deeper variants to fuse multi-scale features via pyramid pooling, attaining 78.4% mIoU on Cityscapes, which highlights ResNets' role in capturing global and local semantics for dense prediction tasks. These integrations underscore ResNets' versatility in segmentation, where residual connections mitigate information loss in downsampling paths, enabling accurate boundary delineation in complex scenes.
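
In practice, using a ResNet as a pretrained backbone often amounts to loading a torchvision model and discarding its classification head, as in the sketch below; the specific weight enum and the random tensor standing in for a real image are assumptions made for illustration.

    import torch
    from torchvision.models import resnet50, ResNet50_Weights

    # Use an ImageNet-pretrained ResNet-50 from torchvision as a frozen feature extractor
    # (requires torchvision >= 0.13 and a network connection to fetch the weights).
    weights = ResNet50_Weights.IMAGENET1K_V2
    model = resnet50(weights=weights).eval()
    backbone = torch.nn.Sequential(*list(model.children())[:-1])   # drop the final classification layer

    preprocess = weights.transforms()          # resizing and normalization matching the weights
    image = torch.rand(3, 224, 224)            # stand-in for a real image tensor in [0, 1]
    with torch.no_grad():
        features = backbone(preprocess(image).unsqueeze(0))
    print(features.flatten(1).shape)           # torch.Size([1, 2048]) pooled feature vector

The resulting 2048-dimensional features can then feed a small task-specific head for classification, detection, or segmentation.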

Extensions to other domains

Residual connections, originally developed for convolutional neural networks in computer vision, have been extended to natural language processing (NLP) through their integration into Transformer architectures, enabling the training of deeper models for sequence transduction tasks such as machine translation. In the Transformer model, residual connections are applied around each sub-layer (self-attention and feed-forward), where the output of a sub-layer is added to its input via a skip connection, followed by layer normalization; this formulation allows gradients to flow directly during backpropagation, mitigating vanishing gradient issues in deep stacks of layers. This design has become foundational in NLP, powering models like BERT and GPT, which achieve state-of-the-art performance on benchmarks including GLUE (up to an 80.5% average score on the suite for BERT) by stacking multiple Transformer blocks with residuals to capture long-range dependencies in text. In speech recognition, residual networks have been adapted to process audio spectrograms as 2D inputs, similar to images, improving automatic speech recognition (ASR) systems by enabling deeper convolutional architectures without degradation. The Residual Convolutional CTC Network (RCNN-CTC) incorporates residual blocks within a deep CNN framework, using Connectionist Temporal Classification (CTC) for end-to-end training on sequential audio data; this allows the model to exploit both temporal and spectral structures in speech, resulting in a single-system word error rate (WER) of 4.5% on the WSJ dataset, improving on prior baselines. System combinations of such residual-based models further yield relative WER reductions of 14.91% on WSJ dev93 and 6.52% on the Tencent Chat dataset, demonstrating their robustness to diverse acoustic conditions and vocabulary sizes. Residual learning has also been applied to reinforcement learning (RL), where it facilitates the optimization of deep policies and value functions in both model-free and model-based settings by revisiting residual algorithms to address optimization challenges in high-dimensional action spaces. The Deep Residual Reinforcement Learning framework introduces bidirectional target networks alongside residual updates, stabilizing training and improving sample efficiency; in continuous control tasks from the DeepMind Control Suite, it outperforms vanilla DDPG baselines by enabling deeper networks to learn complex dynamics. Beyond these, residual connections extend to time series forecasting by combining them with recurrent units, as in the Residual Recurrent Neural Network (R2N2) model, which first fits a linear model and then uses recurrent networks to model the nonlinear residual errors, outperforming standalone RNNs on real-world multivariate datasets. In recommender systems, residual blocks enhance matrix factorization by learning hierarchical user-item interactions, with ResNetMF reporting improvements in recommendation accuracy and RMSE on datasets like MovieLens over traditional matrix factorization. These adaptations highlight the versatility of residual mechanisms in handling sequential and structured data across domains. Recent extensions include integration into vision transformers, such as the Swin Transformer, which employ residual-like designs for hierarchical feature extraction in large-scale vision tasks as of 2023.
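
A Transformer-style residual sub-layer can be sketched as a thin wrapper, shown below with a feed-forward sub-layer for simplicity (self-attention sub-layers are wrapped the same way); the post-norm ordering follows the original Transformer description above, and all names and sizes are illustrative.

    import torch
    import torch.nn as nn

    class ResidualSubLayer(nn.Module):
        # Transformer-style wrapper: output = LayerNorm(x + dropout(sublayer(x))), the post-norm form.
        def __init__(self, sublayer, d_model, dropout=0.1):
            super().__init__()
            self.sublayer = sublayer
            self.norm = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            return self.norm(x + self.dropout(self.sublayer(x)))   # residual skip around the sub-layer

    d_model = 512
    feed_forward = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
    block = ResidualSubLayer(feed_forward, d_model)
    tokens = torch.randn(2, 10, d_model)       # (batch, sequence length, model dimension)
    print(block(tokens).shape)                 # shape preserved: torch.Size([2, 10, 512])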

Development and impact

Pre-ResNet research

The resurgence of interest in deep convolutional neural networks (CNNs) for computer vision tasks began with AlexNet in 2012, an eight-layer architecture that dramatically improved performance on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by achieving a top-5 error rate of 15.3%. This success was enabled by innovations such as rectified linear unit (ReLU) activations to mitigate vanishing gradients, overlapping pooling to reduce artifacts, dropout regularization to prevent overfitting, and extensive data augmentation for robustness. AlexNet demonstrated that deeper networks could outperform shallower ones when trained on large datasets with GPUs, reigniting research into scaling neural network depth beyond the limitations of earlier hand-crafted features like SIFT or HOG. Building on this foundation, subsequent work focused on systematically increasing depth while maintaining architectural simplicity. The VGG networks, developed in 2014, employed stacks of small 3×3 convolutional filters in uniform configurations reaching up to 19 layers, showing that greater depth with consistent filter sizes enhances representational power for semantic feature extraction. On ImageNet, VGG-16 and VGG-19 achieved top-5 error rates of around 7.3% and 7.0%, respectively, outperforming AlexNet by leveraging depth to learn more abstract features, though at the cost of higher parameter counts (around 138 million for VGG-16) and computational demands. This approach emphasized the principle that deeper networks generalize better when optimized properly, influencing transfer learning practices where VGG features serve as robust baselines. Parallel efforts explored efficient deepening through modular designs. GoogLeNet, introduced in 2014 as the first Inception architecture, scaled to 22 layers by incorporating inception modules that apply convolutions at multiple scales (1×1, 3×3, 5×5) within parallel paths, followed by concatenation, to capture diverse features while reducing parameters via 1×1 convolutions. This network won ILSVRC 2014 with a top-5 error of 6.7%, using only about 7 million parameters compared to VGG's scale, and included auxiliary classifiers at intermediate layers to combat gradient vanishing during training. Inception's emphasis on computational efficiency and multi-branch processing marked a shift toward balancing depth and width for practical deployment. To facilitate training of such increasingly deep models, supporting techniques emerged. Batch normalization, proposed in early 2015, normalizes activations across mini-batches to stabilize learning by reducing internal covariate shift, enabling higher learning rates and less careful initialization. Applied to Inception variants, it accelerated convergence and improved accuracy, becoming a standard component of deep CNNs. Complementing this, Highway Networks from mid-2015 introduced LSTM-inspired gating units to create "information highways" that allow gradients to flow directly across layers, enabling the training of feedforward networks with 50 to over 100 layers on tasks like character recognition, where plain deep nets failed due to optimization bottlenecks. These gates adaptively weighted layer transformations versus skip connections, with experiments demonstrating successful optimization of models up to 900 layers deep. Despite these innovations, attempts to extend plain architectures like VGG beyond 20-30 layers revealed a persistent degradation problem: as depth increased, training accuracy saturated and then declined, even before overfitting could occur, indicating fundamental optimization challenges rather than representational limitations.
This issue, observed in empirical experiments on CIFAR-10 and ImageNet, highlighted the need for mechanisms to preserve information propagation in ultra-deep networks, setting the stage for residual learning paradigms.

The degradation problem and ResNet introduction

In deep neural networks, a key challenge emerged as researchers attempted to scale architectures to greater depths: the degradation problem. This phenomenon, observed in plain feedforward networks without skip connections, manifests as a saturation followed by a decline in accuracy as the number of layers increases beyond a certain threshold, typically around 20-30 layers for datasets like CIFAR and ImageNet. Notably, this degradation affects both training and test errors, indicating it is not primarily due to overfitting but rather to optimization difficulties that prevent deep stacks of nonlinear layers from learning even identity mappings. Experiments in the seminal work demonstrated that a 56-layer plain network achieved higher errors than a 20-layer counterpart on CIFAR-10, underscoring the counterintuitive barrier to deeper learning despite the theoretical capacity for more complex representations. To mitigate this issue, the Residual Network (ResNet) framework was introduced in 2015 by Kaiming He and colleagues, revolutionizing the training of very deep networks. Rather than forcing each stack of layers to directly approximate the desired underlying mapping H(\mathbf{x}), ResNet reformulates the learning objective around residual functions: \mathcal{F}(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}, such that H(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}. This is implemented via shortcut connections (or skip connections) that bypass one or more layers, adding the input directly to the output of the residual block, enabling the network to default to identity mappings when residuals approach zero—a scenario that is easier to optimize than direct nonlinear mappings. The residual block typically consists of convolutional layers followed by batch normalization and nonlinearity, with the skip connection ensuring gradient flow during training. This design not only alleviates the degradation problem but also facilitates convergence in networks exceeding 100 layers. The introduction of ResNet marked a pivotal advancement, enabling unprecedented network depths and superior performance on large-scale benchmarks. On the ImageNet dataset, an ensemble of ResNets with depths up to 152 layers achieved a top-5 error rate of 3.57% in the ILSVRC 2015 classification challenge, surpassing previous state-of-the-art models like VGGNet by a significant margin and winning multiple tracks including detection and localization. This success stemmed from the residual learning paradigm's ability to maintain accuracy gains with depth, as evidenced by CIFAR-10 experiments in which residual networks of over 1000 layers could still be optimized while plain networks plateaued far earlier, although the 1202-layer model generalized slightly worse than the 110-layer one, suggesting overfitting rather than optimization failure. ResNet's framework has since become foundational, influencing subsequent architectures by demonstrating that depth, when properly managed, enhances representational power without the pitfalls of degradation.

Subsequent innovations and influence

Following the success of ResNet, researchers developed several architectures that extended the residual learning paradigm to address limitations in feature propagation, parameter efficiency, and representational capacity. One prominent innovation was the Dense Convolutional Network (DenseNet), introduced in 2017, which connects each layer directly to every subsequent layer in a feed-forward manner. This dense connectivity fosters stronger feature reuse, alleviates vanishing gradients more effectively than skip connections alone, and reduces the number of parameters compared to equivalently performing ResNets, achieving superior performance on datasets like CIFAR and ImageNet with fewer computations. Building on ResNet's block structure, the ResNeXt architecture, also proposed in 2017, introduced the notion of "cardinality" as an additional dimension of scaling alongside depth and width. By employing parallel grouped convolutions within residual blocks—akin to ensembling multiple paths—ResNeXt enhances model capacity without proportionally increasing complexity, outperforming ResNet-50 on ImageNet classification (e.g., 77.8% top-1 accuracy versus 76.3%) and object detection tasks on COCO. Similarly, Squeeze-and-Excitation Networks (SENets) in 2017 augmented residual blocks with lightweight channel attention mechanisms, adaptively recalibrating feature maps to emphasize informative channels, which boosted accuracy by about 2% when integrated into ResNet backbones with negligible added parameters. Subsequent advancements focused on holistic scaling and modernization of convolutional designs. EfficientNet, presented in 2019, unified the scaling of network depth, width, and input resolution via a compound coefficient determined by a small grid search, yielding models that achieve state-of-the-art accuracy (e.g., 84.4% top-1 for EfficientNet-B7) at up to 10x lower computational cost than prior ResNet-style variants. More recently, ConvNeXt in 2022 revisited pure convolutional networks by incorporating transformer-inspired modifications—such as larger kernels, layer normalization, and inverted bottlenecks—into a ResNet-like framework, rivaling Vision Transformers in accuracy (e.g., 87.8% top-1 on ImageNet-1K) while maintaining convolutional efficiency and simpler training. The influence of ResNet extends far beyond convolutional networks, establishing residual connections as a foundational motif in deep learning architectures. These skip connections, which enable direct gradient flow and mitigate degradation in very deep models, have been widely adopted as building blocks in subsequent CNNs and have inspired designs in sequential models, including transformers, where they stabilize training of stacked layers and address vanishing gradients in non-convolutional settings. ResNet backbones remain ubiquitous in applications like object detection (e.g., in Faster R-CNN variants) and segmentation (e.g., DeepLab), powering advancements in fields such as autonomous driving, with over 300,000 citations to the original paper underscoring its seminal impact.
