Residual neural network

A residual neural network (ResNet) is a deep learning architecture that employs skip connections, also known as shortcut connections, to enable the effective training of networks with hundreds or thousands of layers by learning residual functions relative to the input rather than direct mappings from input to output. These connections allow the addition of the layer input directly to the output of the layer's transformations, formulated as H(x) = F(x) + x, where x is the input, F(x) is the residual mapping learned by the layer, and H(x) is the desired output mapping. This design mitigates the vanishing gradient problem and the degradation issue, where deeper plain networks experience diminished accuracy despite increased capacity. ResNets were introduced in the 2015 paper "Deep Residual Learning for Image Recognition" by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft Research, with the work presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2016. The architecture builds on prior convolutional neural networks such as VGG but innovates by stacking residual blocks—typically consisting of convolutional layers followed by batch normalization and ReLU activation—connected via identity shortcuts when dimensions match or projection shortcuts otherwise. Variants include ResNet-34 (34 layers), ResNet-50 (50 layers using bottleneck blocks for efficiency), up to ResNet-152, demonstrating that residual learning permits scaling to extreme depths without performance saturation. On the ImageNet dataset, a benchmark for large-scale image classification, ResNet models set new records; for instance, the 34-layer ResNet achieved a 7.40% top-5 error rate on the validation set, the 50-layer version 6.71%, the 152-layer model 5.71%, and an ensemble of models reached 3.57% top-5 error on the test set, outperforming all prior entries and winning first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 classification task. These results validated the framework's ability to train networks of up to 1202 layers on smaller datasets such as CIFAR-10, where a 110-layer ResNet reduced the test error to 6.43%. The introduction of ResNets profoundly influenced deep learning by establishing residual connections as a core building block for modern architectures, enabling deeper models in tasks such as object detection and semantic segmentation, and beyond into natural language processing and generative models. The original paper has garnered over 300,000 citations as of 2025, underscoring its enduring impact on network design and training practices.

Overview

Definition and core concept

A residual neural network (ResNet) is a deep neural network architecture designed to facilitate the training of networks with hundreds or thousands of layers by incorporating residual connections, also known as skip connections. These connections allow the input to a block of layers to be added directly to the output, enabling the network to learn residual mappings relative to the input rather than complete transformations from scratch. Introduced in 2015, ResNets have become a foundational architecture in computer vision and beyond, powering advancements in image recognition and other tasks. The core concept of residual learning reformulates the layers of a network such that, instead of directly approximating a desired underlying mapping H(x), each layer fits a residual function F(x) = H(x) - x. The resulting approximation is then H(x) = F(x) + x, where x is the input to the layer and F(x) is typically computed by a stack of nonlinear layers. This shortcut connection can be implemented as an identity mapping with minimal computational overhead when dimensions match, or via a linear projection otherwise. By making it easier for the network to learn identity functions—where deeper layers effectively pass the input through unchanged—residual learning mitigates the vanishing gradient problem and the degradation issue observed in very deep plain networks, where adding more layers leads to increased training error. In practice, this is expressed for a residual block as: \begin{align} y &= \mathcal{F}(x, \{W_i\}) + x, \\ x_{\ell+1} &= \mathcal{F}(x_\ell, \{W_i\}) + x_\ell, \end{align} where \mathcal{F} represents the residual function (e.g., convolutions and activations), \{W_i\} are its weights, and \ell indexes the block. This design not only stabilizes training but also empirically demonstrates superior performance; for instance, an ensemble of ResNets, with the deepest member having 152 layers, achieved a top-5 test error rate of 3.57% on the ImageNet dataset, surpassing previous state-of-the-art models at the time.
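
This additive structure translates directly into code. The following is a minimal PyTorch sketch of a residual block whose residual function is two 3×3 convolutions with batch normalization; the class name and tensor sizes are illustrative rather than taken from any particular reference implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleResidualBlock(nn.Module):
        # Illustrative residual block: y = ReLU(F(x) + x), with F(x) built from two 3x3 convolutions.
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
            return F.relu(residual + x)  # H(x) = F(x) + x, followed by the nonlinearity

    x = torch.randn(1, 64, 56, 56)            # a batch with 64 feature channels
    y = SimpleResidualBlock(64)(x)
    print(y.shape)                            # same shape as the input: torch.Size([1, 64, 56, 56])

Because the block preserves the shape of its input, many such blocks can be chained without any glue code, which is the property that makes very deep stacks practical.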

Historical motivation and benefits

The pursuit of deeper neural networks in the early 2010s was driven by the observation that increased depth could enhance representational power and accuracy in tasks like image recognition, as demonstrated by architectures such as AlexNet (8 layers, 15.3% top-5 error on ImageNet in 2012) and subsequent models like VGGNet (19 layers) and GoogLeNet (22 layers, achieving 6.67% top-5 error in ILSVRC 2014). However, training networks beyond approximately 30 layers encountered significant obstacles, including the vanishing gradient problem, where gradients diminish exponentially during backpropagation, impeding updates to early-layer parameters. This issue, first highlighted in the context of recurrent networks, extended to feedforward architectures, limiting the effective depth of models despite advances like batch normalization for stabilizing training. A key empirical challenge was the degradation problem, observed in plain convolutional networks without skip connections: as depth increased from 20 to 56 layers, training error rose despite sufficient model capacity, indicating optimization difficulties rather than overfitting. This counterintuitive phenomenon suggested that deeper networks struggled to learn identity mappings effectively, even though a deeper model should in principle be able to match a shallower one by leaving its extra layers as identities. Building on prior work like Highway Networks, which introduced learnable gating mechanisms to enable gradient flow across hundreds of layers by allowing information to bypass nonlinear transformations, residual networks simplified this approach. Highway Networks demonstrated that gated skip connections could train feedforward nets of 100 layers, outperforming shallower counterparts on tasks like character recognition, but required additional gating parameters. The residual learning framework addressed these issues by reformulating network layers to learn residual functions \mathcal{F}(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}, with skip connections adding the input \mathbf{x} directly to the output, facilitating identity mappings when residuals are near zero. This design eases optimization, as residual blocks default to identity functions if their residual branches remain near zero, avoiding the degradation observed in plain nets and promoting smoother gradient propagation. The benefits were immediately evident in experiments: an ensemble of ResNets achieved a top-5 test error of 3.57% on ImageNet—outperforming the prior best by over 3 percentage points—while the 34-layer ResNet outperformed the 18-layer ResNet by 2.8% in top-1 error. These results established residual connections as a foundational design element, enabling networks deeper than 1000 layers in later variants and influencing architectures across computer vision and beyond.

Mathematical foundations

Residual connections

Residual connections, also referred to as skip connections, form the core mechanism of residual neural networks by enabling direct addition of the input to the output of a neural network layer or block. This design allows the network to learn residual functions rather than the full underlying mappings, facilitating the training of deeper architectures. Specifically, for an input x, a residual block computes the residual function F(x), typically through a series of convolutional layers, and adds it to the input via H(x) = F(x) + x, where H(x) represents the desired mapping. The motivation for residual connections arises from the degradation problem observed in deep plain networks, where increasing depth beyond a certain point leads to higher training and test errors even though the deeper model has sufficient capacity. By reformulating layers to learn F(x) = H(x) - x, the network can more easily approximate identity mappings when the residual is near zero, which is hypothesized to be simpler than optimizing stacked nonlinear layers directly. This approach addresses optimization difficulties in very deep networks, as identity shortcuts serve as a strong baseline that subsequent layers can refine. Key benefits include improved gradient flow during backpropagation, where the shortcut path allows gradients to propagate directly without being diminished by repeated multiplications through deep layers, mitigating the vanishing gradient problem. Empirically, this enables training of networks with hundreds of layers; for instance, a 152-layer residual network achieved a 4.49% top-5 validation error on the ImageNet dataset, surpassing shallower models like VGG while still having lower computational complexity than VGG-16/19. Residual connections also enhance representational power by combining features from multiple depths, contributing to better generalization in tasks requiring hierarchical feature extraction.
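
The gradient-flow benefit can be illustrated numerically. The sketch below compares the input gradient of a deep stack of deliberately small-weight layers with and without an additive skip; the use of fully connected layers, the depth, and the initialization scale are stand-in assumptions chosen only to make the vanishing effect visible, not part of any ResNet recipe.

    import torch
    import torch.nn as nn

    depth, dim = 50, 16
    layers = [nn.Linear(dim, dim) for _ in range(depth)]
    for layer in layers:                      # small weights provoke vanishing gradients in the plain stack
        nn.init.normal_(layer.weight, std=0.01)
        nn.init.zeros_(layer.bias)

    def input_gradient_norm(use_skip):
        x = torch.randn(1, dim, requires_grad=True)
        h = x
        for layer in layers:
            h = torch.tanh(layer(h)) + (h if use_skip else 0)   # skip connection adds the block input
        h.sum().backward()
        return x.grad.norm().item()

    print("plain stack:   ", input_gradient_norm(use_skip=False))  # collapses toward zero
    print("residual stack:", input_gradient_norm(use_skip=True))   # stays on the order of one

With the skip connections, each layer's Jacobian is approximately the identity plus a small perturbation, so the product of Jacobians across all fifty layers remains well conditioned.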

Dimension matching and projections

In residual neural networks, the core operation within a residual block involves adding the input \mathbf{x} to the output of the residual function \mathcal{F}(\mathbf{x}, \{W_i\}), yielding \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}. This shortcut connection assumes that the dimensions of \mathbf{x} and \mathcal{F}(\mathbf{x}) match, allowing direct element-wise addition without additional parameters or computational overhead. However, dimension mismatches arise when the number of channels (or the spatial size of the feature maps) changes across layers, such as during downsampling or transitions to blocks with more filters. In such cases, a projection shortcut is employed to align dimensions, formulated as \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s \mathbf{x}, where W_s is a linear projection, typically implemented via a 1×1 convolution (strided when the spatial resolution is reduced) to match dimensions while adding few parameters. This projection ensures compatibility for element-wise addition and introduces minimal extra computation compared to the main path. Because the projection replaces the parameter-free identity path, keeping the shortcut as close to an identity mapping as possible helps maintain stable training dynamics. Projection shortcuts were empirically shown to work well in deeper architectures such as ResNet-50 and ResNet-101, where dimension changes occur at stage transitions. Without proper dimension matching, the shortcut would fail to propagate information effectively, exacerbating degradation issues in plain networks. In practice, projections are used sparingly—only when dimensions differ—to keep the model lightweight; in the original ResNet architectures they appear only at the handful of stage transitions, roughly 10-20% of the shortcuts depending on depth. The original paper also evaluated parameter-free zero-padding shortcuts for dimension increases, but linear projections remain the standard when dimensions change, offering slightly better accuracy at modest extra cost.
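
A projection shortcut at a stage transition might look as follows in PyTorch; the class name, the stride of two, and the batch normalization placed after the 1×1 convolution are illustrative choices (the latter mirrors common reference implementations) rather than requirements of the formulation above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DownsamplingResidualBlock(nn.Module):
        # Illustrative block for a stage transition: channel count doubles, spatial size halves.
        def __init__(self, in_channels, out_channels, stride=2):
            super().__init__()
            self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(out_channels)
            self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(out_channels)
            # Projection shortcut W_s: a strided 1x1 convolution aligns both the channel
            # count and the spatial resolution so that element-wise addition is possible.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

        def forward(self, x):
            residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
            return F.relu(residual + self.shortcut(x))   # y = F(x) + W_s x

    x = torch.randn(1, 64, 56, 56)
    y = DownsamplingResidualBlock(64, 128)(x)
    print(y.shape)   # torch.Size([1, 128, 28, 28])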

Forward and backward propagation

In residual neural networks, forward propagation through a residual block computes the output as the sum of the input and the residual function applied to the input. Formally, for the l-th block, the output H_l(x_l) is given by H_l(x_l) = F_l(x_l, \{W_l^i\}) + x_l, where x_l is the input from the previous layer, F_l is the residual function (typically comprising multiple convolutional layers with nonlinearities), and \{W_l^i\} are the learned weights for those layers. This shortcut connection, which is an identity mapping when dimensions match, allows the input to bypass the residual function directly, enabling the network to effectively learn perturbations around the identity mapping. The residual function F_l is often implemented as a stack of layers, such as two convolutional layers with batch normalization and ReLU activations: F_l(x_l) = \mathrm{BN}(W_2\, \sigma(\mathrm{BN}(W_1 x_l))), where \sigma denotes ReLU, BN is batch normalization, and W_1, W_2 are weight matrices; a final ReLU is applied after the addition. During forward propagation, this addition ensures smooth signal flow, as the network can default to an identity mapping if the residual learns near-zero values, facilitating training of very deep architectures without immediate degradation in performance. Backward propagation in ResNets leverages the same shortcut to mitigate the vanishing gradient problem. The gradient of the loss L with respect to the input x_l is \frac{\partial L}{\partial x_l} = \frac{\partial L}{\partial H_l} \left( \frac{\partial F_l}{\partial x_l} + I \right), where I is the identity matrix. This formulation adds the identity to the Jacobian \frac{\partial F_l}{\partial x_l}, providing a constant path for gradients to flow backward unchanged, even if \frac{\partial F_l}{\partial x_l} approaches zero. As a result, information and gradients propagate effectively across many layers, enabling end-to-end training via stochastic gradient descent without the exponential decay seen in plain networks. This design choice is particularly beneficial for deep networks, as it allows the backward signal to reach early layers with minimal attenuation, promoting stable optimization and convergence to better minima. Analyses confirm that identity shortcuts preserve signal propagation in both forward and backward directions, contrasting with plain networks where gradients often vanish.
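
Assuming pure identity shortcuts with no nonlinearity after the addition (the pre-activation form analyzed in He et al.'s follow-up work), the block relation x_{l+1} = x_l + F_l(x_l) can be unrolled between any two depths l < m, which makes the direct gradient path explicit:

\begin{align}
x_m &= x_l + \sum_{i=l}^{m-1} F_i(x_i, \{W_i\}), \\
\frac{\partial L}{\partial x_l} &= \frac{\partial L}{\partial x_m} \frac{\partial x_m}{\partial x_l} = \frac{\partial L}{\partial x_m} \left( I + \frac{\partial}{\partial x_l} \sum_{i=l}^{m-1} F_i(x_i, \{W_i\}) \right).
\end{align}

The additive identity term ensures that \frac{\partial L}{\partial x_m} reaches x_l unattenuated, regardless of how small the accumulated Jacobians of the residual branches become.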

Residual block designs

Basic residual block

The basic residual block forms the core component of shallower residual networks, such as the 34-layer ResNet architecture, enabling the training of deeper models by mitigating the vanishing gradient problem through shortcut connections. Introduced in the seminal work on deep residual learning, this block processes an input feature map x by applying a sequence of two 3×3 convolutional layers, each followed by batch normalization (BN) and rectified linear unit (ReLU) activation, except for the final activation, which occurs after the residual addition. The mathematical formulation of the block is given by y = F(x, \{W_i\}) + x, where F(x, \{W_i\}) represents the residual mapping implemented via the stacked convolutions, and the shortcut connection adds the input x directly to the output of the second convolution (after its BN). Specifically, the residual function F for the basic block is structured as: \begin{align} z_1 &= \sigma \left( \mathrm{BN} \left( W_1 x \right) \right), \\ z_2 &= \mathrm{BN} \left( W_2 z_1 \right), \\ y &= z_2 + x, \end{align} followed by y = \sigma(y), with \sigma denoting the ReLU function, W_1 and W_2 as the weights of the convolutional kernels (typically maintaining the same number of channels within a stage, e.g., 64 filters), and batch normalization applied post-convolution to normalize activations and stabilize training. This design assumes dimension compatibility between x and F(x); when channels differ across stages, a 1×1 projection convolution is used in the shortcut for matching, though the basic block itself operates on equal-dimensional feature maps. In practice, these blocks are stacked to form stages of the network, with the stem using a 7×7 convolution and max-pooling for initial downsampling, after which basic residual blocks process features at progressively reduced spatial resolutions (e.g., 56×56 down to 7×7). The simplicity of this two-layer configuration contrasts with more complex variants, allowing ResNet-34 to achieve a top-5 error of 7.40% on ImageNet validation (10-crop testing, option C shortcuts) while converging faster than plain networks of similar depth. While competitive with prior models like VGG (7.1% top-5 error), deeper ResNets using bottleneck residual blocks enable substantial improvements, such as ResNet-152's 5.71% top-5 error, a reduction of over 1.4 percentage points compared to VGG, with ensembles achieving even greater gains.
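
The stage structure described above can be inspected directly in the torchvision reference implementation, assuming that library is available; the snippet below only prints the stage layout of ResNet-34 (3, 4, 6, and 3 basic blocks at 64 to 512 channels) rather than defining the model by hand.

    import torch
    from torchvision.models import resnet34

    # Build the torchvision reference ResNet-34 (weights=None constructs the architecture
    # without downloading pretrained parameters) and inspect its four stages of basic blocks.
    model = resnet34(weights=None)
    for name in ["layer1", "layer2", "layer3", "layer4"]:
        stage = getattr(model, name)
        print(name, "->", len(stage), "basic blocks at", stage[0].conv1.out_channels, "channels")

    x = torch.randn(1, 3, 224, 224)
    print(model(x).shape)   # torch.Size([1, 1000]) class logits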

Bottleneck residual block

The bottleneck residual block is a specialized variant of the residual block introduced to address the increased computational demands of very deep networks in ResNet architectures. It replaces the basic residual block, which uses two 3×3 convolutional layers, with a more efficient three-layer structure that reduces the number of parameters and floating-point operations while preserving representational capacity. This design is utilized in deeper ResNet variants, such as ResNet-50, ResNet-101, and ResNet-152, where stacking numerous basic blocks would become prohibitively expensive. The core structure of the bottleneck block consists of a sequence of three layers: a 1×1 convolution that compresses the input feature map's channel dimension (typically to one-quarter of the original), followed by a 3×3 convolution operating on this reduced representation, and concluding with a 1×1 convolution that expands the channels back to the original dimension. A residual connection adds the input directly to the output of this sequence, provided the dimensions match; otherwise, a 1×1 projection convolution is applied to align them. This "bottleneck" principle, drawing on dimensionality-reduction techniques, minimizes intermediate computations by performing the expensive 3×3 convolution on fewer channels, making it suitable for scaling to hundreds of layers. Batch normalization is applied after each convolution, in a post-activation structure (conv-BN-ReLU for the first two layers, then conv-BN, the addition, and a final ReLU). Mathematically, given an input \mathbf{x} with C channels, the residual \mathcal{F}(\mathbf{x}) in the bottleneck block is formulated as: \mathcal{F}(\mathbf{x}) = \mathcal{W}_3 \left( \sigma \left( \mathcal{W}_2 \left( \sigma \left( \mathcal{W}_1 (\mathbf{x}) \right) \right) \right) \right), where \mathcal{W}_1 is a 1×1 convolution reducing to C/4 channels, \mathcal{W}_2 is a 3×3 convolution maintaining C/4 channels, \mathcal{W}_3 is a 1×1 convolution expanding back to C channels, and \sigma denotes the activation function (typically ReLU); batch normalization is applied after each \mathcal{W}_i. The block output is then \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x} for identity shortcuts, or \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathcal{W}_s (\mathbf{x}) for projection shortcuts \mathcal{W}_s. This design significantly lowers complexity compared to equivalent basic blocks; for instance, in a 256-channel setting, a bottleneck block requires approximately 94% fewer parameters than a basic block of the same width while enabling deeper stacking. Empirical results in the original ResNet work demonstrate that bottleneck blocks facilitate training of 152-layer networks with lower error rates on ImageNet than shallower models, without degradation; an ensemble built around such models reached a 3.57% top-5 error. The approach has been widely adopted in subsequent architectures for efficient deep learning.
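
A minimal PyTorch sketch of the bottleneck pattern is shown below; the class name and the fixed reduction factor of four are illustrative assumptions, and the final print roughly confirms the parameter savings quoted above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BottleneckBlock(nn.Module):
        # Illustrative bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, plus an identity shortcut.
        def __init__(self, channels, reduction=4):
            super().__init__()
            mid = channels // reduction                                  # e.g. 256 -> 64
            self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)         # compress channels
            self.bn1 = nn.BatchNorm2d(mid)
            self.conv2 = nn.Conv2d(mid, mid, 3, padding=1, bias=False)   # cheap 3x3 on few channels
            self.bn2 = nn.BatchNorm2d(mid)
            self.conv3 = nn.Conv2d(mid, channels, 1, bias=False)         # expand back
            self.bn3 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = F.relu(self.bn2(self.conv2(out)))
            out = self.bn3(self.conv3(out))
            return F.relu(out + x)   # identity shortcut; a 1x1 projection would replace x if dims changed

    block = BottleneckBlock(256)
    # ~70,000 parameters here, versus roughly 1.2 million for a basic block at the same 256-channel width.
    print(sum(p.numel() for p in block.parameters()))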

Pre-activation and other variants

In the original residual block design introduced in the foundational ResNet paper, the sequence within each block follows a post-activation structure: a convolution is applied first, followed by batch normalization (BN) and ReLU activation, then another convolution and another BN, and then the addition of the shortcut connection, with a final ReLU after the addition. This configuration, while effective, can distort the identity mapping in very deep networks, as the nonlinearities and normalizations applied after the shortcut addition impede direct signal propagation. To address this limitation and facilitate cleaner identity mappings, He et al. proposed the pre-activation variant in their follow-up work, where BN and ReLU are placed before the convolutional weight layers. In this design, each residual block begins with BN followed by ReLU on the input (or the previous block's output), then applies the first convolution, followed by another BN-ReLU-convolution sequence, and finally adds the shortcut connection before passing the result to the next block. This rearrangement ensures that the shortcut path bypasses the nonlinearities and normalizations in a way that preserves the input signal more effectively during forward and backward propagation, reducing optimization difficulties in networks exceeding 1000 layers. The pre-activation approach applies to both basic and bottleneck blocks, with the latter incorporating 1×1 convolutions for channel reduction and expansion around the core 3×3 convolution. Empirical evaluations demonstrated substantial improvements with pre-activation blocks. On the CIFAR-10 dataset, a 1001-layer pre-activation ResNet achieved a test error of 4.62%, outperforming shallower original ResNets and enabling successful training of ultra-deep models that previously suffered from degradation. Similarly, on ImageNet, a 200-layer pre-activation ResNet yielded a top-1 error of 20.7%, a 1.1% absolute improvement over the baseline post-activation ResNet-200's 21.8%. These gains stem from better gradient flow, as the pre-activation structure keeps activations normalized at each block's input, mitigating vanishing gradients in deep stacks. Beyond pre-activation, the same follow-up work examined other shortcut variants, such as constant scaling, gating, dropout, and 1×1 convolutions on the shortcut path, and found that keeping the shortcut a strict identity—with no BN or other transformation applied to it—best preserves signal propagation. The "full pre-activation" design, which applies BN-ReLU only before each convolution and omits any activation after the addition, approximates pure identity shortcuts most closely and performed best among the variants studied. These modifications prioritize unimpeded signal propagation over aggressive nonlinear transformations, influencing subsequent architectures like Wide ResNets, which widen channels while adopting pre-activation blocks.
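
A pre-activation basic block differs from the post-activation sketches given earlier only in the ordering of operations; the following minimal PyTorch version (with illustrative names) keeps the shortcut path a pure identity by applying BN and ReLU before each convolution and nothing after the addition.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PreActBasicBlock(nn.Module):
        # Illustrative pre-activation block: BN-ReLU-conv, BN-ReLU-conv, then a pure addition.
        def __init__(self, channels):
            super().__init__()
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

        def forward(self, x):
            out = self.conv1(F.relu(self.bn1(x)))    # BN and ReLU precede the weight layer
            out = self.conv2(F.relu(self.bn2(out)))
            return out + x                           # no activation after the addition:
                                                     # the shortcut path stays a pure identity

    x = torch.randn(1, 64, 32, 32)
    print(PreActBasicBlock(64)(x).shape)   # same shape as the input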

Applications

Computer vision tasks

Residual neural networks, commonly known as ResNets, have become a foundational backbone in numerous computer vision tasks due to their ability to train very deep architectures without degradation in performance. In image classification, ResNets were initially developed and evaluated on large-scale datasets like ImageNet, where an ensemble of ResNet models (the deepest with 152 layers) achieved a top-5 error rate of 3.57%, outperforming prior state-of-the-art models such as VGG and GoogLeNet by significant margins and securing first place in the ILSVRC 2015 classification challenge. This success stems from residual connections that facilitate gradient flow, enabling the network to learn identity mappings and capture intricate hierarchical features essential for distinguishing thousands of object categories. Subsequent variants, such as ResNet-50, have been widely adopted for their balance of depth and computational efficiency, often serving as pretrained encoders in transfer learning scenarios for domain-specific classification tasks such as fine-grained recognition. Beyond classification, ResNets excel as feature extractors in object detection pipelines, where they replace shallower backbones to enhance localization and classification of multiple instances within images. For instance, integrating a ResNet-101 backbone into the Faster R-CNN framework improved mean average precision (mAP@0.5) on the COCO dataset from 41.5% (using VGG-16) to 48.4%, demonstrating the residual architecture's capacity to produce richer, more discriminative feature maps for region proposal and bounding box regression. This approach has influenced modern detectors like Mask R-CNN and its variants, where ResNet backbones contribute to strong performance in applications such as autonomous driving by enabling the detection of objects at various scales and under occlusion. The residual design's efficiency allows for deeper networks without excessive parameter overhead, making it suitable for deployment in resource-constrained environments. In semantic segmentation, ResNets provide robust feature hierarchies that support pixel-wise labeling of scenes, a critical task for applications such as scene understanding. The DeepLabv3 model, employing an atrous ResNet-101 backbone with dilated convolutions, achieved a mean intersection-over-union (mIoU) of 82.1% on the PASCAL VOC 2012 benchmark, surpassing earlier atrous spatial pyramid pooling methods by leveraging residual skips to preserve spatial detail and contextual information across layers. Similarly, the Pyramid Scene Parsing Network (PSPNet) uses ResNet-50 or deeper variants to fuse multi-scale features via pyramid pooling, attaining 78.4% mIoU on Cityscapes, which highlights ResNets' role in capturing global and local semantics for dense prediction tasks. These integrations underscore ResNets' versatility in segmentation, where residual connections mitigate information loss in downsampling paths, enabling accurate boundary delineation in complex scenes.
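
In practice, using a ResNet as a pretrained backbone often amounts to loading a torchvision model and discarding its classification head, as in the sketch below; the specific weight enum and the random tensor standing in for a real image are assumptions made for illustration.

    import torch
    from torchvision.models import resnet50, ResNet50_Weights

    # Use an ImageNet-pretrained ResNet-50 from torchvision as a frozen feature extractor
    # (requires torchvision >= 0.13 and a network connection to fetch the weights).
    weights = ResNet50_Weights.IMAGENET1K_V2
    model = resnet50(weights=weights).eval()
    backbone = torch.nn.Sequential(*list(model.children())[:-1])   # drop the final classification layer

    preprocess = weights.transforms()          # resizing and normalization matching the weights
    image = torch.rand(3, 224, 224)            # stand-in for a real image tensor in [0, 1]
    with torch.no_grad():
        features = backbone(preprocess(image).unsqueeze(0))
    print(features.flatten(1).shape)           # torch.Size([1, 2048]) pooled feature vector

The resulting 2048-dimensional features can then feed a small task-specific head for classification, detection, or segmentation.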

Extensions to other domains

Residual connections, originally developed for convolutional neural networks in computer vision, have been extended to natural language processing (NLP) through their integration into Transformer architectures, enabling the training of deeper models for sequence transduction tasks such as machine translation. In the Transformer model, residual connections are applied around each sub-layer (self-attention and feed-forward), where the output of a sub-layer is added to its input via a skip connection, followed by layer normalization; this formulation allows gradients to flow directly during backpropagation, mitigating vanishing gradient issues in deep stacks of layers. This design has become foundational in NLP, powering models like BERT and GPT, which achieve state-of-the-art performance on benchmarks including GLUE (up to an 80.5% average score on the suite for BERT) by stacking multiple Transformer blocks with residuals to capture long-range dependencies in text. In speech recognition, residual networks have been adapted to process audio spectrograms as 2D inputs, similar to images, improving automatic speech recognition (ASR) systems by enabling deeper convolutional architectures without degradation. The Residual Convolutional CTC Network (RCNN-CTC) incorporates residual blocks within a deep CNN framework, using Connectionist Temporal Classification (CTC) for end-to-end training on sequential audio data; this allows the model to exploit both temporal and spectral structures in speech, resulting in a single-system word error rate (WER) of 4.5% on the WSJ dataset, improving on prior baselines. System combinations of such residual-based models further yield relative WER reductions of 14.91% on WSJ dev93 and 6.52% on the Tencent Chat dataset, demonstrating their robustness to diverse acoustic conditions and vocabulary sizes. Residual learning has also been applied to reinforcement learning (RL), where it facilitates the optimization of deep policies and value functions in both model-free and model-based settings by revisiting residual algorithms to address optimization challenges in high-dimensional action spaces. The Deep Residual Reinforcement Learning framework introduces bidirectional target networks alongside residual updates, stabilizing training and improving sample efficiency; in continuous control tasks from the DeepMind Control Suite, it outperforms vanilla DDPG baselines by enabling deeper networks to learn complex dynamics. Beyond these, residual connections extend to time series forecasting by combining them with recurrent units, as in the Residual Recurrent Neural Network (R2N2) model, which first fits a linear model and then uses recurrent networks to model the nonlinear residual errors, outperforming standalone RNNs on real-world multivariate datasets. In recommender systems, residual blocks enhance matrix factorization by learning hierarchical user-item interactions, with ResNetMF reporting improvements in recommendation accuracy and RMSE on datasets like MovieLens over traditional matrix factorization. These adaptations highlight the versatility of residual mechanisms in handling sequential and structured data across domains. Recent extensions include integration into vision transformers, such as the Swin Transformer, which employ residual-like designs for hierarchical feature extraction in large-scale vision tasks as of 2023.
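
A Transformer-style residual sub-layer can be sketched as a thin wrapper, shown below with a feed-forward sub-layer for simplicity (self-attention sub-layers are wrapped the same way); the post-norm ordering follows the original Transformer description above, and all names and sizes are illustrative.

    import torch
    import torch.nn as nn

    class ResidualSubLayer(nn.Module):
        # Transformer-style wrapper: output = LayerNorm(x + dropout(sublayer(x))), the post-norm form.
        def __init__(self, sublayer, d_model, dropout=0.1):
            super().__init__()
            self.sublayer = sublayer
            self.norm = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            return self.norm(x + self.dropout(self.sublayer(x)))   # residual skip around the sub-layer

    d_model = 512
    feed_forward = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
    block = ResidualSubLayer(feed_forward, d_model)
    tokens = torch.randn(2, 10, d_model)       # (batch, sequence length, model dimension)
    print(block(tokens).shape)                 # shape preserved: torch.Size([2, 10, 512])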

Development and impact

Pre-ResNet research

The resurgence of interest in deep convolutional neural networks (CNNs) for computer vision tasks began with AlexNet in 2012, an eight-layer architecture that dramatically improved performance on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by achieving a top-5 error rate of 15.3%. This success was enabled by innovations such as rectified linear unit (ReLU) activations to mitigate vanishing gradients, overlapping pooling to reduce artifacts, dropout regularization to prevent overfitting, and extensive data augmentation for robustness. AlexNet demonstrated that deeper networks could outperform shallower ones when trained on large datasets with GPUs, reigniting research into scaling neural network depth beyond the limitations of earlier hand-crafted features like SIFT or HOG. Building on this foundation, subsequent work focused on systematically increasing depth while maintaining architectural simplicity. The VGG networks, developed in 2014, employed stacks of small 3×3 convolutional filters in uniform configurations reaching up to 19 layers, showing that greater depth with consistent filter sizes enhances representational power for semantic feature extraction. On ImageNet, VGG-16 and VGG-19 achieved top-5 error rates of around 7.3% and 7.0%, respectively, outperforming AlexNet by leveraging depth to learn more abstract features, though at the cost of higher parameter counts (around 138 million for VGG-16) and computational demands. This approach emphasized the principle that deeper networks generalize better when optimized properly, influencing transfer learning practices where VGG features serve as robust baselines. Parallel efforts explored efficient deepening through modular designs. GoogLeNet, introduced in 2014 as the first Inception architecture, scaled to 22 layers by incorporating inception modules that apply convolutions at multiple scales (1×1, 3×3, 5×5) within parallel paths, followed by concatenation, to capture diverse features while reducing parameters via 1×1 convolutions. This network won ILSVRC 2014 with a top-5 error of 6.7%, using only about 7 million parameters compared to VGG's scale, and included auxiliary classifiers at intermediate layers to combat gradient vanishing during training. Inception's emphasis on computational efficiency and multi-branch processing marked a shift toward balancing depth and width for practical deployment. To facilitate training of such increasingly deep models, supporting techniques emerged. Batch normalization, proposed in early 2015, normalizes activations across mini-batches to stabilize learning by reducing internal covariate shift, enabling higher learning rates and less careful initialization. Applied to Inception variants, it accelerated convergence and improved accuracy, becoming a standard component of deep CNNs. Complementing this, Highway Networks from mid-2015 introduced LSTM-inspired gating units to create "information highways" that allow gradients to flow directly across layers, enabling the training of feedforward networks with 50 to over 100 layers on tasks like character recognition, where plain deep nets failed due to optimization bottlenecks. These gates adaptively weighted layer transformations versus skip connections, with experiments demonstrating successful optimization of models up to 900 layers deep. Despite these innovations, attempts to extend plain architectures like VGG beyond 20-30 layers revealed a persistent degradation problem: as depth increased, training accuracy saturated and then declined, even before overfitting could occur, indicating fundamental optimization challenges rather than representational limitations.
This issue, observed in empirical experiments on CIFAR-10 and ImageNet, highlighted the need for mechanisms to preserve information propagation in ultra-deep networks, setting the stage for residual learning paradigms.

The degradation problem and ResNet introduction

In deep neural networks, a key challenge emerged as researchers attempted to scale architectures to greater depths: the degradation problem. This phenomenon, observed in plain feedforward networks without skip connections, manifests as a saturation followed by a decline in accuracy as the number of layers increases beyond a certain threshold, typically around 20-30 layers for datasets like CIFAR and ImageNet. Notably, this degradation affects both training and test errors, indicating it is not primarily due to overfitting but rather to optimization difficulties that prevent deep stacks of nonlinear layers from learning even identity mappings. Experiments in the seminal work demonstrated that a 56-layer plain network achieved higher errors than a 20-layer counterpart on CIFAR-10, underscoring the counterintuitive barrier to deeper learning despite the theoretical capacity for more complex representations. To mitigate this issue, the Residual Network (ResNet) framework was introduced in 2015 by Kaiming He and colleagues, revolutionizing the training of very deep networks. Rather than forcing each stack of layers to directly approximate the desired underlying mapping H(\mathbf{x}), ResNet reformulates the learning objective around residual functions: \mathcal{F}(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}, such that H(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}. This is implemented via shortcut connections (or skip connections) that bypass one or more layers, adding the input directly to the output of the residual block, enabling the network to default to identity mappings when residuals approach zero—a scenario that is easier to optimize than direct nonlinear mappings. The residual block typically consists of convolutional layers followed by batch normalization and nonlinearity, with the skip connection ensuring gradient flow during training. This design not only alleviates the degradation problem but also facilitates convergence in networks exceeding 100 layers. The introduction of ResNet marked a pivotal advancement, enabling unprecedented network depths and superior performance on large-scale benchmarks. On the ImageNet dataset, an ensemble of ResNets with depths up to 152 layers achieved a top-5 error rate of 3.57% in the ILSVRC 2015 classification challenge, surpassing previous state-of-the-art models like VGGNet by a significant margin and winning multiple tracks including detection and localization. This success stemmed from the residual learning paradigm's ability to maintain accuracy gains with depth, as evidenced by CIFAR-10 experiments in which residual networks of over 1000 layers could still be optimized while plain networks plateaued far earlier, although the 1202-layer model generalized slightly worse than the 110-layer one, suggesting overfitting rather than optimization failure. ResNet's framework has since become foundational, influencing subsequent architectures by demonstrating that depth, when properly managed, enhances representational power without the pitfalls of degradation.

Subsequent innovations and influence

Following the success of ResNet, researchers developed several architectures that extended the residual learning paradigm to address limitations in feature propagation, parameter efficiency, and representational capacity. One prominent innovation was the Dense Convolutional Network (DenseNet), introduced in 2017, which connects each layer directly to every subsequent layer in a feed-forward manner. This dense connectivity fosters stronger feature reuse, alleviates vanishing gradients more effectively than skip connections alone, and reduces the number of parameters compared to equivalently performing ResNets, achieving superior performance on datasets like CIFAR and ImageNet with fewer computations. Building on ResNet's block structure, the ResNeXt architecture, also proposed in 2017, introduced the notion of "cardinality" as an additional dimension of scaling alongside depth and width. By employing parallel grouped convolutions within residual blocks—akin to ensembling multiple paths—ResNeXt enhances model capacity without proportionally increasing complexity, outperforming ResNet-50 on ImageNet classification (e.g., 77.8% top-1 accuracy versus 76.3%) and object detection tasks on COCO. Similarly, Squeeze-and-Excitation Networks (SENets) in 2017 augmented residual blocks with lightweight channel attention mechanisms, adaptively recalibrating feature maps to emphasize informative channels, which boosted accuracy by about 2% when integrated into ResNet backbones with negligible added parameters. Subsequent advancements focused on holistic scaling and modernization of convolutional designs. EfficientNet, presented in 2019, unified the scaling of network depth, width, and input resolution via a compound coefficient determined by a small grid search, yielding models that achieve state-of-the-art accuracy (e.g., 84.4% top-1 for EfficientNet-B7) at up to 10x lower computational cost than prior ResNet-style variants. More recently, ConvNeXt in 2022 revisited pure convolutional networks by incorporating transformer-inspired modifications—such as larger kernels, layer normalization, and inverted bottlenecks—into a ResNet-like framework, rivaling Vision Transformers in accuracy (e.g., 87.8% top-1 on ImageNet-1K) while maintaining convolutional efficiency and simpler training. The influence of ResNet extends far beyond convolutional networks, establishing residual connections as a foundational motif in deep learning architectures. These skip connections, which enable direct gradient flow and mitigate degradation in very deep models, have been widely adopted as building blocks in subsequent CNNs and have inspired designs in sequential models, including transformers, where they stabilize training of stacked layers and address vanishing gradients in non-convolutional settings. ResNet backbones remain ubiquitous in applications like object detection (e.g., in Faster R-CNN variants) and segmentation (e.g., DeepLab), powering advancements in fields such as autonomous driving, with over 300,000 citations to the original paper underscoring its seminal impact.
