Highway network
A highway network is a deep neural network architecture that enables the training of very deep feedforward networks with hundreds of layers by incorporating skip connections and gating mechanisms to regulate the flow of information and address the vanishing gradient problem.[1] Inspired by long short-term memory (LSTM) units, highway networks use transform and carry gates to allow unimpeded information flow across multiple layers, often referred to as "information highways."[2] Introduced in 2015 by R. K. Srivastava, Klaus Greff, and Jürgen Schmidhuber, the architecture was the first to successfully optimize networks with up to 900 layers, surpassing previous limits of around 20-30 layers due to optimization difficulties.[1] The core layer computation is given by

y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C),

where H is a non-linear transformation, T is the transform gate, and C is the carry gate (often C = 1 - T), both typically sigmoid-activated.[1] Highway networks paved the way for subsequent deep learning advancements, such as residual networks (ResNets), and have been applied in tasks including image classification, speech recognition, and sequence labeling.[3]
Background
Vanishing Gradient Problem
The vanishing gradient problem arises during backpropagation in deep feedforward neural networks: gradients with respect to the weights in early layers diminish exponentially as they are propagated backward through successive layers, leading to inefficient or stalled parameter updates and poor convergence.[4] This degradation occurs because the gradient signal weakens with depth, making it difficult for deeper architectures to learn meaningful representations from the input data.[4] Mathematically, in a standard feedforward network, the gradient of the loss L with respect to a weight in layer l is proportional to the product of the derivatives of the activation functions from layer l+1 up to the output layer, multiplied by the upstream gradient.[4] For activations like the sigmoid function, whose derivative is bounded between 0 and 0.25 and is often much smaller away from the transition region, this repeated multiplication results in exponential decay of the gradient magnitude.[4] Similarly, for the hyperbolic tangent (tanh) activation, the derivative satisfies \left| \frac{d}{dx} \tanh(x) \right| \leq 1, with the maximum value of 1 achieved only at x = 0, so gradients typically shrink over multiple layers unless inputs remain precisely centered.[4]

Historically, this problem limited the effective depth of feedforward networks to around 5-10 layers before 2015, as deeper configurations suffered from rapid performance degradation during training, even with careful design.[4] For instance, experiments with sigmoid-activated networks showed that beyond a few hidden layers, the top layers saturated, halting learning across the entire model, an issue that persisted despite advances in other areas.[4] While initialization methods, such as those scaling weights to preserve variance in activations and gradients, mitigated some signal decay, they proved insufficient for reliably training networks substantially deeper than 10 layers prior to architectural innovations in the mid-2010s.[4]

The problem is particularly acute with saturating activations like the sigmoid, whose outputs approach 0 or 1, yielding derivatives near zero and effectively blocking gradient flow to preceding layers, thereby preventing weight updates in early network components.[4] This saturation not only slows optimization but also amplifies the exponential decay, leaving deep networks prone to underfitting or trivial solutions in which early layers learn near-identity mappings.[4]
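The exponential decay described above can be illustrated numerically. The following sketch is an illustrative example rather than an experiment from the cited sources: it multiplies the sigmoid derivatives encountered along a chain of units, one per layer, and shows how the resulting gradient factor shrinks with depth (the depth and pre-activation scale are arbitrary choices).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    depth = 50      # illustrative depth of the chain
    grad = 1.0      # upstream gradient arriving at the output layer

    for layer in range(depth):
        z = rng.normal(scale=2.0)     # pre-activation of one unit in this layer
        s = sigmoid(z)
        grad *= s * (1.0 - s)         # sigmoid derivative, at most 0.25
        # weight factors are omitted here; in practice they rarely offset the decay

    print(f"gradient factor after {depth} layers: {grad:.3e}")

Running this prints a factor that is tens of orders of magnitude below one, which is why early layers in such a chain receive essentially no learning signal.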
Motivation for Gated Architectures
Deeper neural networks have demonstrated superior representational power for complex tasks such as image classification, where increased depth correlates with significant performance gains.[1] However, traditional feedforward networks struggle to scale beyond shallow depths due to optimization challenges, including the vanishing gradient problem, which impedes effective training as layers multiply.[1] To address these limitations in feedforward settings, researchers drew inspiration from recurrent neural networks (RNNs), particularly the Long Short-Term Memory (LSTM) architecture, whose gating mechanisms (forget, input, and output gates) selectively control information flow and mitigate vanishing gradients over long sequences.[5][1] These gates let the network decide dynamically whether to retain or update information, creating pathways along which gradients propagate without substantial attenuation; the same idea, adapted to non-recurrent feedforward layers, yields analogous "highways" for direct information and gradient flow.[1]

The Highway network, introduced in a 2015 paper by Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber, marked the first successful application of such gating to feedforward networks, enabling the training of architectures with more than 100 layers using standard stochastic gradient descent.[1] Unlike plain networks, in which each layer fully transforms its input, Highway networks incorporate transform and carry gates that allow each layer to choose, per unit, between applying a nonlinear transformation and copying the input forward, thereby preserving gradient magnitudes across depth.[1] This gating approach not only circumvents optimization barriers but also promotes training dynamics that are largely independent of depth, positioning Highway networks as a pivotal innovation in scaling feedforward architectures.[1]
Architecture
Highway Layer Design
The highway layer serves as the core building block of highway networks, enabling the construction of very deep architectures by facilitating unimpeded information propagation across layers through parallel transformation and carry pathways. Introduced in the seminal work on highway networks, each layer processes an input vector x to yield an output y, balancing the introduction of new representations with the preservation of original features. This design draws inspiration from gated recurrent units like LSTMs but applies the idea to feedforward structures, allowing gradients to flow effectively without saturation.[1]

At its heart, the layer comprises two primary components: a non-linear transformation path H(x, W_H), which computes a weighted and activated version of the input to generate novel information, and a carry path that directly forwards the input x to retain existing representations. The transformation H typically involves a linear projection W_H x + b_H followed by a non-linear activation function, such as ReLU, to introduce non-linearity and decide which new features to add. Meanwhile, the carry operation ensures unchanged passage of input elements, promoting layer-skipping behavior akin to shortcut connections. These paths are combined via element-wise operations, weighted by dedicated gating mechanisms to dynamically allocate influence.[1][6]

The gating system employs a transform gate T(x, W_T) = \sigma(W_T x + b_T), where \sigma denotes the sigmoid activation, to modulate the contribution of the transformed path and determine the proportion of new information to incorporate. Complementing this, a carry gate C(x) controls the direct passthrough, regulating what remains unaltered from the input. In the original formulation, these gates are coupled such that C(x) = 1 - T(x), reducing the parameter count by relying on a single sigmoid output to inversely weight the paths; this simplification enables efficient dynamic routing while ensuring the gates sum to unity for each input dimension.[1][6]

The layer's output is computed as

y = H(x, W_H) \odot T(x, W_T) + x \odot (1 - T(x, W_T)),

where \odot represents element-wise multiplication, effectively creating a dimension-wise convex combination of the transformed and carried signals. This equation allows the network to adaptively route information: when T(x) approaches 1, the layer emphasizes transformation; when it nears 0, it prioritizes carrying, thus mitigating gradient vanishing by providing a constant, identity-like shortcut.[1] Conceptually, the structure resembles a block diagram in which the input x splits into parallel branches: the transformation arm applies H and gates it with T, while the carry arm bypasses processing and is gated with 1 - T, before the modulated results are added element-wise to produce y for the subsequent layer.
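As a concrete reading of the equation above, the following NumPy sketch computes one highway layer; the dimensionality, random weights, and the choice of ReLU for H are illustrative assumptions rather than settings from the original paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def highway_layer(x, W_H, b_H, W_T, b_T):
        """One highway layer: y = H(x) * T(x) + x * (1 - T(x)), element-wise."""
        H = np.maximum(0.0, W_H @ x + b_H)   # transform path with ReLU non-linearity
        T = sigmoid(W_T @ x + b_T)           # transform gate in (0, 1)
        return H * T + x * (1.0 - T)         # coupled carry gate is 1 - T

    # Illustrative usage with a 4-dimensional input.
    rng = np.random.default_rng(0)
    d = 4
    x = rng.standard_normal(d)
    W_H, b_H = rng.standard_normal((d, d)), np.zeros(d)
    W_T, b_T = rng.standard_normal((d, d)), np.full(d, -3.0)  # negative bias favors carrying
    print(highway_layer(x, W_H, b_H, W_T, b_T))

With the strongly negative gate bias chosen here, the printed output stays close to x, illustrating the carry-dominated regime at initialization.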
Gating Mechanism
The gating mechanism in Highway layers consists of sigmoid-activated linear transformations that regulate the flow of information. Specifically, the transform gate is defined as T(\mathbf{x}) = \sigma(\mathbf{W}_T \mathbf{x} + \mathbf{b}_T), where \sigma denotes the sigmoid function, producing values in the range [0, 1]. This gate weights the contribution of the non-linear transformation H(\mathbf{x}) against the carry component, which is simply the input \mathbf{x} scaled by 1 - T(\mathbf{x}).[1]

These gates operate as soft switches, dynamically controlling whether the layer emphasizes transformation or direct propagation. When T(\mathbf{x}) \approx 1, the layer applies a strong non-linear modification to the input, enabling feature extraction; when T(\mathbf{x}) \approx 0, the input is predominantly copied forward, forming an "information highway" that bypasses the transformation block. This adaptive weighting allows the network to balance depth with effective learning across many layers.[1]

In terms of gradient preservation, the gating mechanism facilitates unimpeded backpropagation by leveraging the carry path as an identity-like shortcut. During training, gradients through this path approximate 1, providing a direct route from output to input layers and preventing dilution from successive multiplications by small activation derivatives, such as those of sigmoids. This design directly addresses gradient vanishing, enabling stable training of networks with hundreds of layers using standard stochastic gradient descent.[1]

Empirical analysis of trained Highway networks shows that gates adapt layer-specifically: transformation dominates in early layers, where most output changes occur (e.g., within the first ≈10 layers for MNIST and ≈30 for CIFAR-100), while later layers increasingly rely on the highway to propagate information with minimal alteration, as evidenced by sparse gate activations and "stripe-like" patterns in block outputs.[1]
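The gradient-preservation claim can be checked directly with automatic differentiation. The short PyTorch sketch below is an illustrative check, not part of the cited work: it pushes the transform gate toward zero with a strongly negative bias and confirms that the gradient flowing back through the layer is close to one per input dimension.

    import torch

    d = 8
    x = torch.randn(d, requires_grad=True)

    # Illustrative parameters: random weights and a strongly negative gate bias,
    # so that T(x) is close to 0 and the layer mostly carries its input.
    W_H, b_H = torch.randn(d, d), torch.zeros(d)
    W_T, b_T = torch.randn(d, d), torch.full((d,), -10.0)

    H = torch.relu(W_H @ x + b_H)       # transform path
    T = torch.sigmoid(W_T @ x + b_T)    # transform gate, near 0 here
    y = H * T + x * (1.0 - T)           # highway layer output

    y.sum().backward()
    print(x.grad)   # entries close to 1: the carry path passes gradients through almost unchanged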
Mathematical Formulation
Forward Propagation
In Highway networks, the forward propagation through a single layer computes the output y as a weighted combination of a transformed input and the original input, allowing information to flow directly across layers. The core equation is given by

y = H(x, \mathbf{W}_H) \odot T(x, \mathbf{W}_T) + x \odot (1 - T(x, \mathbf{W}_T)),
where x is the input vector, \odot denotes element-wise multiplication, H(x, \mathbf{W}_H) represents the non-linear transformation (typically an affine projection followed by a non-linearity such as ReLU, i.e., H(x, \mathbf{W}_H) = \phi(\mathbf{W}_H x + \mathbf{b}_H) with \phi the chosen non-linearity), and T(x, \mathbf{W}_T) = \sigma(\mathbf{W}_T x + \mathbf{b}_T) is the transform gate using the sigmoid activation \sigma(z) = \frac{1}{1 + e^{-z}}.[1] This formulation enables the network to learn whether to transform the input (when the gate approaches 1) or carry it unchanged (when the gate approaches 0).[1]

The gating mechanism in the forward pass regulates the flow of information, with the transform gate T controlling the contribution of the transformed path and the implicit carry gate 1 - T handling the skip connection. In general, the carry gate can be parameterized separately as C(x, \mathbf{W}_C) = \sigma(\mathbf{W}_C x + \mathbf{b}_C), yielding y = H(x, \mathbf{W}_H) \odot T(x, \mathbf{W}_T) + x \odot C(x, \mathbf{W}_C), though the original design ties C = 1 - T to reduce parameters and encourage balanced flow.[1]

For multi-layer Highway networks with L layers, the forward propagation applies this operation iteratively: let y_0 = x be the initial input; then for each layer l = 1 to L,
y_l = H_l(y_{l-1}, \mathbf{W}_{H_l}) \odot T_l(y_{l-1}, \mathbf{W}_{T_l}) + y_{l-1} \odot (1 - T_l(y_{l-1}, \mathbf{W}_{T_l})),
with the final output y_L serving as the network's result.[1] This stacked structure assumes matching input and output dimensions across layers (e.g., via appropriate weight matrix sizes or projection layers) to facilitate the element-wise addition and direct skip connections.[1] In the common special case where the carry gate is 1 - T, the equation simplifies to emphasize the gating's role in blending paths, often with the gate input z(x) = \mathbf{W}_T x + \mathbf{b}_T computed as a linear projection before the sigmoid. This design promotes stable gradient flow during training by allowing identity mappings when the gates are low.[1]
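The layer-by-layer recursion above can be sketched as a simple loop; the depth, width, weight scaling, and gate bias below are illustrative assumptions, and the code is a minimal NumPy reading of the equations rather than a reference implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def highway_forward(x, params):
        """Apply L highway layers; params is a list of (W_H, b_H, W_T, b_T) tuples."""
        y = x                                      # y_0 = x
        for W_H, b_H, W_T, b_T in params:
            H = np.maximum(0.0, W_H @ y + b_H)     # H_l(y_{l-1})
            T = sigmoid(W_T @ y + b_T)             # T_l(y_{l-1})
            y = H * T + y * (1.0 - T)              # y_l, dimensions unchanged
        return y                                   # y_L

    # Illustrative usage: 20 layers of width 16 with negative gate biases.
    rng = np.random.default_rng(0)
    d, L = 16, 20
    params = [(rng.standard_normal((d, d)) / np.sqrt(d), np.zeros(d),
               rng.standard_normal((d, d)) / np.sqrt(d), np.full(d, -2.0))
              for _ in range(L)]
    print(highway_forward(rng.standard_normal(d), params))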
Parameterization and Initialization
In Highway networks, each layer's parameters are defined to support the gating mechanism that enables information flow across depths. For a layer operating on d-dimensional inputs and outputs, the core parameters include the weight matrix \mathbf{W}_H (size d \times d) and bias \mathbf{b}_H (size d) for the transformation H, and the weight matrix \mathbf{W}_T (size d \times d) and bias \mathbf{b}_T (size d) for the transform gate T. If the carry operation is uncoupled from the transform gate, an additional weight matrix \mathbf{W}_C (size d \times d) and bias \mathbf{b}_C (size d) are introduced, increasing the gating overhead. These parameters facilitate the forward propagation in which the output combines transformed and carried inputs, as described in the layer design.[1]

Initialization strategies are crucial for stable training, particularly to mitigate variance issues in deep stacks. The weight matrices \mathbf{W}_H and \mathbf{W}_T are initialized using the scheme proposed by He et al. (2015), which draws values from a distribution scaled to preserve variance through the network, preventing signal attenuation or amplification early in training. The transform gate bias \mathbf{b}_T is set to a negative value, such as -3 (or sometimes between -1 and -10), to initially suppress the transform gate (favoring a carry of ≈1) and promote gradual activation of transformations as optimization proceeds, drawing on practices from gated recurrent units like LSTMs.[1]

For networks exceeding 100 layers, initialization may incorporate depth-aware adjustments to maintain gradient flow, such as integrating layer normalization alongside standard weight initialization to counteract potential explosion or vanishing in very deep configurations. This ensures the gating mechanisms can route information effectively without extensive hyperparameter retuning. Computationally, each Highway layer incurs O(d²) operations, akin to a dense layer, but with roughly twice the parameter count due to the dual linear transformations for transformation and gating.[1]
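A PyTorch sketch of this initialization, under the assumption of a single coupled-gate layer of width dim; the specific bias value and the use of kaiming_normal_ are illustrative choices consistent with the description above rather than code from the cited paper.

    import torch.nn as nn

    dim = 64           # illustrative layer width
    gate_bias = -3.0   # negative transform-gate bias: start with carry close to 1

    plain = nn.Linear(dim, dim)   # W_H, b_H for the transformation H
    gate = nn.Linear(dim, dim)    # W_T, b_T for the transform gate T

    # He (2015) initialization for the transformation weights (ReLU path).
    nn.init.kaiming_normal_(plain.weight, nonlinearity='relu')
    nn.init.zeros_(plain.bias)

    # Gate weights use the same family of scaling; the bias is pushed negative so
    # that sigmoid(W_T x + b_T) starts near 0 and the layer initially carries its input.
    nn.init.kaiming_normal_(gate.weight, nonlinearity='sigmoid')
    nn.init.constant_(gate.bias, gate_bias)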
Training and Implementation
Optimization Techniques
Training Highway networks relies on standard backpropagation, with gradients propagated through the gating mechanisms via the chain rule. In a Highway layer, the output y is computed as y = H(x) \odot T(x) + x \odot C(x) with C(x) = 1 - T(x), and the Jacobian \frac{\partial y}{\partial x} contains a diagonal term \operatorname{diag}(C(x)) contributed by the carry path, plus contributions from the nonlinear transformation (its Jacobian scaled by T(x)) and from the derivatives of the gates themselves. This structure helps preserve gradient magnitude across many layers by allowing unimpeded flow along the carry path when the transform gates are closed (T(x) \approx 0), mitigating the vanishing gradient problem inherent in plain deep networks.[1]

Common optimizers for Highway networks include stochastic gradient descent (SGD) with momentum, as employed in the original implementation, which enables effective training of networks with up to 900 layers without the specialized initialization schemes required for ungated deep networks. The gating mechanism reduces sensitivity to initialization compared to plain networks, so standard schemes such as He et al. (2015) for the weights, together with negative biases for the transform gates, suffice.[1]

To facilitate convergence, hyperparameters such as learning rate decay schedules are tuned via random search. In implementations using the 2018 Highway Network Block variant, batch sizes of 128 and training for 200 epochs on CIFAR-10 (with up to 32 layers) have been used, giving robust optimization across varying depths. In the original experiments, networks with up to 900 layers were trained for up to 80 epochs on CIFAR-100.[1][7]
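A typical optimizer setup of this kind in PyTorch might look like the sketch below; the model is a stand-in, and the learning rate, momentum, decay milestones, and dummy objective are placeholder values rather than the settings found by random search in the cited experiments.

    import torch

    # Placeholder model standing in for a stack of highway layers (see the sketches above).
    model = torch.nn.Linear(32, 32)
    data = torch.randn(128, 32)        # illustrative batch of size 128

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # Step-wise learning-rate decay; milestones and factor are illustrative.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

    for epoch in range(200):                 # e.g., 200 epochs as reported in [7]
        optimizer.zero_grad()
        loss = model(data).pow(2).mean()     # dummy objective for illustration only
        loss.backward()
        optimizer.step()
        scheduler.step()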
Practical Considerations
Highway networks can be readily implemented in deep learning frameworks by defining custom layers that encapsulate the gating mechanism, allowing seamless integration into larger architectures. A basic highway block, for instance, computes the output as follows:

y = H(x, W_H) * T(x, W_T) + x * (1 - T(x, W_T)),

where H applies a non-linear transformation (e.g., ReLU), T is the sigmoid-activated transform gate, and the carry behavior is implicitly defined by 1 - T.[1] The inclusion of gating units adds parameters for the transform (and optionally carry) functions, roughly doubling the parameter count per layer compared to standard feedforward layers and thereby increasing memory usage during training and inference.[1]

Experiments demonstrate successful optimization of highway networks up to 900 layers deep using stochastic gradient descent with momentum, without encountering the vanishing gradient issues that plague plain networks. In practice, depths exceeding 100 layers remain feasible but require careful weight and bias initialization to maintain stable training dynamics.[6][1]

To debug training issues, monitor transform gate activations across layers: they should display selective sparsity and tend toward saturation near 0 or 1 to enable effective information routing. Suboptimal behavior often stems from poor initialization, which can be addressed by setting negative biases (e.g., -1 to -3) on the gate's linear layers to encourage initial carry dominance.[1]
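As a concrete illustration of such a custom layer, the following PyTorch module is a minimal sketch rather than a reference implementation: it applies the coupled-gate formula above and records the mean transform-gate activation of each layer so that gate behavior can be monitored as suggested; the class name, width, and bias value are assumptions made for the example.

    import torch
    import torch.nn as nn

    class HighwayLayer(nn.Module):
        """Minimal highway layer with coupled gates: y = H(x)*T(x) + x*(1 - T(x))."""

        def __init__(self, dim, gate_bias=-1.0):
            super().__init__()
            self.plain = nn.Linear(dim, dim)               # W_H, b_H
            self.gate = nn.Linear(dim, dim)                # W_T, b_T
            nn.init.constant_(self.gate.bias, gate_bias)   # favor carrying at first
            self.last_gate_mean = None                     # for monitoring gate activations

        def forward(self, x):
            h = torch.relu(self.plain(x))                  # transform path H(x)
            t = torch.sigmoid(self.gate(x))                # transform gate T(x)
            self.last_gate_mean = t.mean().item()          # record for debugging
            return h * t + x * (1.0 - t)                   # carry gate is 1 - T(x)

    # Illustrative usage: a 10-layer stack applied to a random batch.
    layers = nn.Sequential(*[HighwayLayer(64) for _ in range(10)])
    _ = layers(torch.randn(8, 64))
    print([round(layer.last_gate_mean, 3) for layer in layers])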