
Backpropagation through time

Backpropagation through time (BPTT) is a gradient-based algorithm for training recurrent neural networks (RNNs) that extends standard backpropagation to sequential data by unfolding the recurrent structure across time steps, thereby computing gradients of the loss function with respect to weights through temporal dependencies. Developed in the late 1980s and early 1990s as part of broader efforts to apply dynamic programming and gradient descent to neural architectures for time-varying systems, BPTT was formally detailed by Paul J. Werbos in 1990, building on his earlier foundational work on backpropagation for multilayer networks dating back to the 1970s. The technique addressed the need to train networks capable of processing and learning from temporal sequences, such as time series data, speech, and language tasks, where traditional backpropagation falls short because it does not handle recurrent connections. In practice, BPTT operates by creating an unrolled version of the RNN, replicating the network's layers for each time step in the input sequence to form a deep feedforward equivalent; errors are then propagated backward from the output through this temporal chain, allowing weight updates that account for influences across multiple time lags. This unfolding enables efficient computation of partial derivatives, making BPTT computationally feasible for moderate sequence lengths, though full unrolling can become resource-intensive for long sequences. Despite its effectiveness, BPTT is prone to the vanishing gradient problem, where gradients diminish exponentially over long time dependencies and hinder the learning of prolonged patterns, as analyzed in early theoretical work on RNN training dynamics. To mitigate this, innovations like long short-term memory (LSTM) units, introduced in 1997, incorporate gating mechanisms that regulate information flow and preserve gradients over extended periods, significantly enhancing BPTT's applicability in modern architectures. Today, BPTT remains a standard method for training RNN variants in sequential modeling, underpinning advances in fields from speech recognition to generative sequence modeling.

Background

Recurrent Neural Networks

Recurrent neural networks (RNNs) are a class of artificial neural networks designed to process sequential data by incorporating feedback loops that enable the persistence of information across multiple time steps. Unlike feedforward neural networks, which handle inputs in a strictly unidirectional manner with no memory of prior inputs, RNNs use recurrent connections to maintain a hidden state that captures dependencies in sequences of varying lengths. This architecture allows RNNs to model temporal dynamics, making them suitable for tasks where context from previous elements influences current and future outputs. The core of an RNN lies in its hidden state, which is updated at each time step t according to the equation: \mathbf{h}_t = f(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h) where \mathbf{h}_t is the hidden state at time t, \mathbf{x}_t is the input at time t, \mathbf{W}_{hh} and \mathbf{W}_{xh} are weight matrices for hidden-to-hidden and input-to-hidden connections, respectively, \mathbf{b}_h is a bias vector, and f is a nonlinear activation function such as the hyperbolic tangent (tanh). This formulation, introduced in early recurrent models, enables the network to integrate information from past inputs into the current computation. RNNs find applications in sequence modeling tasks, including time series prediction, where they forecast future values from historical patterns, and natural language processing, such as predicting the next word in a sentence by leveraging syntactic and semantic context. The initial hidden state \mathbf{h}_0 is typically initialized to a zero vector so the sequence starts without prior bias. Training RNNs requires adaptations of backpropagation to account for their temporal structure.
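
To make the recurrence concrete, the minimal sketch below applies the hidden-state update equation step by step over a short toy sequence; the dimensions, random weights, and input data are illustrative assumptions rather than values from any particular model.
python
# A minimal sketch of the RNN hidden-state update (toy sizes, random weights):
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 3, 5
W_hh = rng.normal(0, 0.1, (n_h, n_h))    # hidden-to-hidden weights
W_xh = rng.normal(0, 0.1, (n_h, n_in))   # input-to-hidden weights
b_h = np.zeros(n_h)                      # hidden bias

def step(h_prev, x_t):
    """One application of h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

h = np.zeros(n_h)                        # h_0 initialized to zeros
for x_t in rng.normal(size=(4, n_in)):   # a toy sequence of length 4
    h = step(h, x_t)                     # the hidden state carries context forward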

Backpropagation

Backpropagation is a gradient-computation algorithm that efficiently computes the derivatives of the loss function with respect to the weights in a multilayer network by applying the chain rule of calculus. This method enables the training of deep networks by propagating errors backward through the layers, allowing parameters to be adjusted to minimize prediction errors. In the forward pass, input data is processed layer by layer to compute activations. Starting from the input layer, each subsequent layer applies a linear transformation followed by a nonlinear activation function, such as the sigmoid or ReLU, to produce its output. The final layer's output is then compared to the target using a loss function, typically the mean squared error for regression tasks, defined as E = \frac{1}{2} \sum (y - \hat{y})^2, where y is the target and \hat{y} is the prediction. The backward pass computes the error gradients starting from the output layer and propagating them to earlier layers. The error term \delta^l for layer l is calculated as \delta^l = (W^{l+1})^T \delta^{l+1} \odot f'(z^l), where W^{l+1} is the weight matrix to the next layer, \delta^{l+1} is the error term from the subsequent layer, \odot denotes the Hadamard (element-wise) product, and f'(z^l) is the derivative of the activation function evaluated at the pre-activation z^l. These gradients are then used to update the network parameters via the rule \theta \leftarrow \theta - \alpha \frac{\partial E}{\partial \theta}, where \theta represents the weights and biases, \alpha is the learning rate, and \frac{\partial E}{\partial \theta} is the computed gradient. The algorithm was popularized in its modern form by David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams in their 1986 paper, which demonstrated its effectiveness for training multilayer networks and sparked widespread adoption in neural network research. This foundational technique for feedforward networks serves as the basis for extensions like backpropagation through time in recurrent neural networks, where gradients are computed across sequential time steps.
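
The sketch below works through one forward and backward pass for a single-hidden-layer network with tanh hidden units, a linear output layer, and the squared-error loss from above; the dimensions, random data, and learning rate are illustrative assumptions.
python
# A minimal one-step backpropagation sketch (toy dimensions, random data):
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                    # input vector
y = rng.normal(size=2)                    # target vector
W1 = rng.normal(0, 0.5, (4, 3))           # input -> hidden weights
W2 = rng.normal(0, 0.5, (2, 4))           # hidden -> output weights
alpha = 0.1                               # learning rate

# Forward pass: linear transformation followed by a nonlinearity
z1 = W1 @ x
a1 = np.tanh(z1)
y_hat = W2 @ a1                           # linear output layer
E = 0.5 * np.sum((y - y_hat) ** 2)        # E = 1/2 * sum (y - y_hat)^2

# Backward pass: delta^l = (W^{l+1})^T delta^{l+1} ⊙ f'(z^l)
delta2 = y_hat - y                        # output-layer error term
delta1 = (W2.T @ delta2) * (1 - a1 ** 2)  # tanh'(z) = 1 - tanh(z)^2

# Update: theta <- theta - alpha * dE/dtheta
W2 -= alpha * np.outer(delta2, a1)
W1 -= alpha * np.outer(delta1, x)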

The BPTT Algorithm

Network Unfolding

Network unfolding, also known as unrolling, is a key conceptual step in backpropagation through time (BPTT) that transforms a recurrent neural network (RNN) into an equivalent feedforward network for gradient computation. This process replicates the RNN's recurrent layer across each time step of the input sequence, creating a deep, static structure in which weights are shared among all replicas. Introduced by Werbos as an extension of backpropagation to dynamic systems, unfolding represents the temporal dependencies in RNNs as layered connections in a feedforward architecture, enabling the application of standard backpropagation algorithms. Consider an input sequence \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T, where T is the sequence length. Unfolding replicates the RNN computation for each time step t = 1 to T, producing hidden states \mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_T and outputs \mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_T. Specifically, the hidden state at time t is computed as \mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}; \theta), where f is the RNN transition function and \theta represents the shared parameters; this forms a chain of dependencies, with \mathbf{h}_t feeding into the next layer's input and the recurrent loop mimicked by forward connections between replicated layers. The unrolled structure treats the entire sequence as a single, deep feedforward pass, preserving the temporal flow while allowing efficient derivative computation. The primary purpose of unfolding is to let the RNN be treated as a static feedforward network during training, facilitating the backward propagation of errors across all time steps. The loss function for the sequence is typically defined as the average over time steps, E = \frac{1}{T} \sum_{t=1}^T E_t, where E_t is the loss at time t (e.g., the squared error between \mathbf{y}_t and the target). This formulation aggregates per-step errors, allowing gradients with respect to \theta to be computed by propagating derivatives backward through the unrolled chain while accounting for the shared weights' influence across time.
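
The sketch below makes the unrolling explicit: one shared transition function f(\mathbf{x}_t, \mathbf{h}_{t-1}; \theta) is applied once per time step, and the per-step losses are averaged into the sequence loss E. The squared-error per-step loss and all dimensions are illustrative assumptions.
python
# A conceptual sketch of unfolding with shared parameters theta (toy sizes):
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_h = 5, 3, 4
theta = {"W_xh": rng.normal(0, 0.1, (n_h, n_in)),
         "W_hh": rng.normal(0, 0.1, (n_h, n_h))}

def f(x_t, h_prev, theta):
    """The shared transition function, replicated once per time step."""
    return np.tanh(theta["W_xh"] @ x_t + theta["W_hh"] @ h_prev)

xs = rng.normal(size=(T, n_in))          # input sequence x_1 ... x_T
targets = rng.normal(size=(T, n_h))      # toy per-step targets

# Unrolled forward pass: T replicas of the same layer, chained through h
hs = [np.zeros(n_h)]                     # h_0
for t in range(T):
    hs.append(f(xs[t], hs[-1], theta))

# Sequence loss: E = (1/T) * sum_t E_t, aggregated over the unrolled chain
E = np.mean([0.5 * np.sum((hs[t + 1] - targets[t]) ** 2) for t in range(T)])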

Forward Propagation

In the forward propagation phase of backpropagation through time, the recurrent neural network processes an input sequence step by step, computing hidden states and outputs sequentially across time steps in the unrolled network structure. This computation relies on the shared parameters that define the RNN's recurrent dynamics, allowing the model to capture temporal dependencies efficiently. The process starts with an initial hidden state, typically set to a zero vector \mathbf{h}_0 = \mathbf{0}, and proceeds for each time step t = 1 to T, where T is the sequence length. At each time step t, the hidden state \mathbf{h}_t is updated using the current input \mathbf{x}_t and the previous hidden state \mathbf{h}_{t-1}: \mathbf{h}_t = f(\mathbf{W}_{xh} \mathbf{x}_t + \mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{b}_h) Here, f is a nonlinear activation function, such as the hyperbolic tangent \tanh, \mathbf{W}_{xh} is the input-to-hidden weight matrix, \mathbf{W}_{hh} is the hidden-to-hidden weight matrix, and \mathbf{b}_h is the hidden bias vector. The output \mathbf{y}_t is then computed from the current hidden state: \mathbf{y}_t = g(\mathbf{W}_{hy} \mathbf{h}_t + \mathbf{b}_y) where g is an output activation function, often the softmax for classification tasks to produce probability distributions, \mathbf{W}_{hy} is the hidden-to-output weight matrix, and \mathbf{b}_y is the output bias vector. This step-by-step recurrence propagates information forward through time, generating predictions or representations for each position in the sequence. Once outputs are computed for all time steps, the losses are aggregated to form the total objective. At each t, a per-step loss \mathcal{L}_t is calculated between \mathbf{y}_t and the target \hat{\mathbf{y}}_t, commonly using cross-entropy for probabilistic outputs or squared error for regression. The overall loss is then the sum (or average) over all steps: \mathcal{L} = \sum_{t=1}^T \mathcal{L}_t This aggregation ensures the model accounts for errors across the entire sequence, balancing contributions from early and late time steps. The use of shared weights—\mathbf{W}_{xh}, \mathbf{W}_{hh}, and \mathbf{W}_{hy} remain identical for every t—promotes parameter efficiency: the network maintains a constant number of learnable parameters irrespective of the unfolding depth T, enabling it to generalize to sequences of varying lengths without proportional increases in model size. A representative example is a simple RNN trained on character-level sequence data for next-character prediction, such as modeling text generation. For an input sequence like "he", encoded as one-hot vectors \mathbf{x}_1 for 'h' and \mathbf{x}_2 for 'e', the forward pass computes \mathbf{h}_1 from \mathbf{x}_1 and \mathbf{h}_0, yielding \mathbf{y}_1 (probabilities over the alphabet, with loss against the target 'e'), then \mathbf{h}_2 from \mathbf{x}_2 and \mathbf{h}_1, yielding \mathbf{y}_2 (loss against the next character, say 'l' in "hel"). The total loss sums these per-step errors, guiding the model's learning of sequential patterns like letter transitions.
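
A toy version of that character-level example appears below: the forward pass runs over the input "he" with one-hot encodings and accumulates the per-step cross-entropy losses. The 26-letter alphabet, hidden size, and random weights are illustrative assumptions.
python
# Forward propagation for next-character prediction on "he" -> "el" (toy sizes):
import numpy as np

rng = np.random.default_rng(0)
V, n_h = 26, 8                                   # alphabet size, hidden size
W_xh = rng.normal(0, 0.1, (n_h, V))
W_hh = rng.normal(0, 0.1, (n_h, n_h))
W_hy = rng.normal(0, 0.1, (V, n_h))
b_h, b_y = np.zeros(n_h), np.zeros(V)

def one_hot(c):
    v = np.zeros(V)
    v[ord(c) - ord('a')] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

inputs, targets = "he", "el"                     # predict the next character of "hel"
h = np.zeros(n_h)                                # h_0 = 0
total_loss = 0.0
for c_in, c_tgt in zip(inputs, targets):
    h = np.tanh(W_xh @ one_hot(c_in) + W_hh @ h + b_h)  # hidden-state update
    y = softmax(W_hy @ h + b_y)                         # distribution over the alphabet
    total_loss += -np.log(y[ord(c_tgt) - ord('a')])     # per-step cross-entropy L_t
# total_loss = sum_t L_t, the aggregated sequence objective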

Backward Propagation

The backward propagation phase of BPTT computes the gradients of the loss with respect to the shared parameters by propagating errors backward through the unrolled network, starting from the last time step and recursing to the first. This phase uses the chain rule to account for dependencies across time steps, accumulating contributions from all relevant outputs to each parameter. Unlike standard backpropagation in feedforward networks, the temporal unfolding introduces long-range dependencies in the gradient paths, where the gradient for a parameter at early time steps includes products of derivatives over multiple steps. The computation begins at t = T. First, the gradient of the loss with respect to the output at each time step t is calculated as \frac{\partial \mathcal{L}}{\partial \mathbf{y}_t}, which depends on the specific loss function (e.g., the derivative of cross-entropy with respect to softmax outputs). The gradient with respect to the hidden state at the final step is then: \frac{\partial \mathcal{L}}{\partial \mathbf{h}_T} = \mathbf{W}_{hy}^\top \frac{\partial \mathcal{L}}{\partial \mathbf{y}_T} \odot f'(\mathbf{z}_T) where \mathbf{z}_T = \mathbf{W}_{xh} \mathbf{x}_T + \mathbf{W}_{hh} \mathbf{h}_{T-1} + \mathbf{b}_h is the pre-activation, f' is the derivative of the activation function (e.g., 1 - \tanh^2(\mathbf{z}_T) for tanh), and \odot denotes element-wise multiplication. (Because the factor f'(\mathbf{z}_t) is folded in, these quantities are, strictly speaking, gradients with respect to the pre-activations \mathbf{z}_t; this convention keeps the weight-gradient formulas below compact.) For earlier time steps t = T-1 down to 1, the hidden state gradient includes contributions from the immediate output and from future time steps via the recurrent connection: \frac{\partial \mathcal{L}}{\partial \mathbf{h}_t} = \mathbf{W}_{hy}^\top \frac{\partial \mathcal{L}}{\partial \mathbf{y}_t} \odot f'(\mathbf{z}_t) + \mathbf{W}_{hh}^\top \frac{\partial \mathcal{L}}{\partial \mathbf{h}_{t+1}} \odot f'(\mathbf{z}_t) This recursion propagates the influence of later errors back through the sequence, enabling the model to learn long-term dependencies. Once the hidden state gradients are obtained for all t, the parameter gradients are aggregated by summing over time steps, leveraging the shared weights. For the hidden-to-output weights: \frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hy}} = \sum_{t=1}^T \frac{\partial \mathcal{L}}{\partial \mathbf{y}_t} \mathbf{h}_t^\top For the input-to-hidden weights: \frac{\partial \mathcal{L}}{\partial \mathbf{W}_{xh}} = \sum_{t=1}^T \frac{\partial \mathcal{L}}{\partial \mathbf{h}_t} \mathbf{x}_t^\top And for the hidden-to-hidden weights, which capture the recurrent dynamics: \frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^T \frac{\partial \mathcal{L}}{\partial \mathbf{h}_t} \mathbf{h}_{t-1}^\top Biases follow similar summations. These gradients are then used to update the parameters via gradient descent or variants like Adam. The full backward pass thus efficiently computes how changes in weights affect the loss across the entire sequence, though it can suffer from vanishing or exploding gradients for long T due to repeated multiplications in the chain rule.
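
The recursion and the summed weight gradients can be written out directly; the sketch below does so for a linear-output RNN with tanh hidden units and per-step squared-error losses. All sizes and data are illustrative assumptions, and bias terms are omitted for brevity.
python
# A minimal BPTT backward pass matching the equations above (toy sizes):
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_h, n_out = 6, 3, 4, 2
W_xh = rng.normal(0, 0.1, (n_h, n_in))
W_hh = rng.normal(0, 0.1, (n_h, n_h))
W_hy = rng.normal(0, 0.1, (n_out, n_h))
xs = rng.normal(size=(T, n_in))
tgts = rng.normal(size=(T, n_out))

# Forward pass: store every hidden state for reuse in the backward pass
hs = [np.zeros(n_h)]                             # hs[0] = h_0
for t in range(T):
    hs.append(np.tanh(W_xh @ xs[t] + W_hh @ hs[-1]))

# Backward pass: dL/dh_t combines the output term and the recurrent term
dW_xh, dW_hh, dW_hy = map(np.zeros_like, (W_xh, W_hh, W_hy))
dh_next = np.zeros(n_h)                          # recurrent gradient from t+1
for t in reversed(range(T)):
    h, h_prev = hs[t + 1], hs[t]
    dy = (W_hy @ h) - tgts[t]                    # dL_t/dy_t for squared error
    dW_hy += np.outer(dy, h)                     # sum_t (dL/dy_t) h_t^T
    dz = (W_hy.T @ dy + dh_next) * (1 - h ** 2)  # fold in tanh'(z_t)
    dW_xh += np.outer(dz, xs[t])                 # sum_t (dL/dh_t) x_t^T
    dW_hh += np.outer(dz, h_prev)                # sum_t (dL/dh_t) h_{t-1}^T
    dh_next = W_hh.T @ dz                        # W_hh^T carries error back in time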

Implementation Details

Pseudocode

The pseudocode below outlines a basic implementation of truncated backpropagation through time (BPTT) for training a simple recurrent neural network (RNN) on a single sequence of length T. The network uses input-to-hidden weights W_{xh}, hidden-to-hidden weights W_{hh}, and hidden-to-output weights W_{hy}, with tanh activations for hidden states and softmax for outputs. Hidden states and outputs are stored during the forward pass, and gradients are computed via the chain rule in the backward pass, truncated to the last k steps to reduce memory and computation. Initialization sets the initial hidden state h_0 = 0 and uses a learning rate \alpha. This approach follows the foundational algorithm described by Werbos.
pseudocode
# Initialize parameters
W_xh ← random_matrix(hidden_dim, input_dim)   # Input-to-hidden weights
W_hh ← random_matrix(hidden_dim, hidden_dim)  # Hidden-to-hidden weights
W_hy ← random_matrix(output_dim, hidden_dim)  # Hidden-to-output weights
α ← learning_rate  # e.g., 0.01

for epoch = 1 to num_epochs:
    for each sequence (x_1, ..., x_T; targets_1, ..., targets_T):
        # Forward pass: store states for all T steps
        h_list[0] ← zeros(hidden_dim)  # h_0 = 0
        y_list ← empty list
        total_loss ← 0

        for t = 1 to T:
            h_list[t] ← tanh(W_xh * x_t + W_hh * h_list[t-1])
            y_list[t] ← softmax(W_hy * h_list[t])
            total_loss += cross_entropy(y_list[t], targets_t)

        # Backward pass: truncated to the last k steps (k ≤ T)
        ∇W_xh ← zeros(hidden_dim, input_dim)
        ∇W_hh ← zeros(hidden_dim, hidden_dim)
        ∇W_hy ← zeros(output_dim, hidden_dim)

        δ_next ← zeros(hidden_dim)  # Gradient arriving from time step t+1

        for t = T downto max(1, T - k + 1):
            # Combined softmax + cross-entropy derivative w.r.t. output pre-activations
            dL_dy ← y_list[t] - targets_t
            ∇W_hy += outer(dL_dy, h_list[t])

            # Total gradient at h_t: output branch plus recurrent branch,
            # passed through the tanh derivative: tanh'(z) = 1 - h ⊙ h
            δ ← (W_hy^T * dL_dy + δ_next) ⊙ (1 - h_list[t] ⊙ h_list[t])

            # Gradients for recurrent and input weights (shared across time)
            ∇W_hh += outer(δ, h_list[t-1])  # h_list[0] = h_0 = 0
            ∇W_xh += outer(δ, x_t)

            # Propagate the gradient backward in time for the next iteration
            δ_next ← W_hh^T * δ

        # Update weights (simple SGD)
        W_xh ← W_xh - α * ∇W_xh / T
        W_hh ← W_hh - α * ∇W_hh / T
        W_hy ← W_hy - α * ∇W_hy / T
This pseudocode assumes a single sequence without batching for clarity; in practice, gradients can be averaged over multiple sequences in a batch. Truncation at k steps approximates full BPTT by ignoring dependencies beyond k time steps, a common choice for long sequences that bounds memory and computation at the cost of a biased gradient estimate for longer-range dependencies.

Truncation Methods

Truncated backpropagation through time (TBPTT), also denoted BPTT(k), addresses the scalability challenges of full BPTT by restricting the unrolling and gradient computation to only the most recent k time steps of the sequence. In this method, the network is unfolded backward from the current time step for a fixed depth k, approximating the true gradient by neglecting contributions from time steps earlier than t - k. This truncation introduces a bias in the gradient estimate but significantly reduces resource demands, making it suitable for training recurrent neural networks (RNNs) on extended sequences. The approach was formalized as an efficient heuristic for continually operating networks, where exact gradients over long histories are often impractical. The primary rationale for TBPTT stems from the prohibitive scaling of full BPTT, which requires unrolling the entire sequence of length T and leads to O(T n^2) time complexity and O(T n) memory usage, where n is the hidden state dimension, due to the expanded computational graph. For the long sequences common in applications like language modeling or speech recognition, this means excessive memory consumption for intermediate states and linear growth in backward-pass operations. By limiting the unrolling to k << T, TBPTT reduces memory usage to O(k n) and time complexity to O(k n^2) per update, allowing practical training without storing the full sequence of states. This truncation is particularly effective when gradients from distant past steps decay rapidly, such as in networks approaching fixed points. As an alternative to truncation, real-time recurrent learning (RTRL) enables online gradient computation without any unrolling, updating weights incrementally at each time step using the full recurrent structure. Introduced for fully recurrent networks, RTRL maintains derivatives of the hidden states with respect to all parameters, but tracking these interactions across all units incurs a much higher per-step cost (O(n^3) storage and up to O(n^4) operations in the fully connected case), making it less scalable than TBPTT for moderate to large n. Despite its exactness for arbitrarily long horizons, RTRL is rarely used in practice owing to this computational expense. TBPTT's impact lies in its ability to balance gradient-approximation quality with efficiency, facilitating effective learning on tasks with temporal dependencies spanning hundreds of steps. Empirical evaluations on benchmark sequence-learning problems report that truncated BPTT with k = 9 achieves over 80% success rates while running 28 times faster than RTRL, and variants like BPTT(16; 8) offer further speedups exceeding 50 times with comparable accuracy. This trade-off has made TBPTT a standard choice for training RNNs in resource-constrained settings, though k must be tuned to avoid excessive bias in gradient estimates for very long dependencies.
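
The sketch below illustrates the TBPTT pattern: a long sequence is processed in chunks of k steps, the hidden state is carried across chunk boundaries, but gradients stop at each boundary. All sizes, data, and the learning rate are illustrative assumptions, and biases are omitted.
python
# A minimal truncated-BPTT training loop (toy sizes, random data):
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out, T, k = 4, 8, 4, 100, 10
W_xh = rng.normal(0, 0.1, (n_h, n_in))
W_hh = rng.normal(0, 0.1, (n_h, n_h))
W_hy = rng.normal(0, 0.1, (n_out, n_h))
alpha = 0.01

xs = rng.normal(size=(T, n_in))                  # toy input sequence
ts = np.eye(n_out)[rng.integers(0, n_out, T)]    # toy one-hot targets

h = np.zeros(n_h)
for start in range(0, T, k):                     # one chunk = one parameter update
    chunk = range(start, min(start + k, T))
    hs, ys = {start - 1: h}, {}
    for t in chunk:                              # forward within the chunk
        hs[t] = np.tanh(W_xh @ xs[t] + W_hh @ hs[t - 1])
        e = np.exp(W_hy @ hs[t])
        ys[t] = e / e.sum()                      # softmax output
    dW_xh, dW_hh, dW_hy = map(np.zeros_like, (W_xh, W_hh, W_hy))
    dh_next = np.zeros(n_h)
    for t in reversed(chunk):                    # backward, stopping at the chunk start
        dy = ys[t] - ts[t]                       # softmax + cross-entropy gradient
        dW_hy += np.outer(dy, hs[t])
        dz = (W_hy.T @ dy + dh_next) * (1 - hs[t] ** 2)
        dW_xh += np.outer(dz, xs[t])
        dW_hh += np.outer(dz, hs[t - 1])
        dh_next = W_hh.T @ dz
    for W, dW in ((W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy)):
        W -= alpha * dW / len(chunk)             # in-place SGD update
    h = hs[max(chunk)]                           # carry the state, not the gradients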

Advantages and Limitations

Advantages

BPTT extends standard backpropagation to recurrent neural networks by unrolling the network across time steps, leveraging shared weights to update parameters efficiently throughout the sequence while reusing stored intermediate activations to avoid redundant computations. This design makes BPTT significantly faster than general-purpose optimization techniques like finite-difference gradient approximation, which requires multiple forward passes per gradient estimate, or evolutionary methods such as genetic algorithms, which lack the structured efficiency of gradient descent. The algorithm's gradient-based framework scales effectively to complex sequence modeling tasks, where it outperforms non-gradient alternatives by enabling precise optimization over extended temporal dependencies without the heavy search costs typical of population-based approaches. For instance, BPTT facilitates training on long sequences through techniques like truncation, balancing computational cost with the ability to capture intricate patterns in data such as text or speech. By applying the chain rule iteratively through the unrolled network, BPTT computes exact gradients for weight updates, offering superior precision compared to the approximations used in genetic algorithms or finite-difference methods. This exactness supports reliable convergence in nonlinear dynamic systems, where approximate techniques often settle on suboptimal solutions due to noise or inefficiency. Empirically, BPTT underpinned early successes of RNNs in speech recognition, enabling models to learn temporal transitions for continuous utterance processing.

Disadvantages

One of the primary challenges in training recurrent neural networks (RNNs) with backpropagation through time (BPTT) is the vanishing and exploding gradients problem, which arises from the repeated multiplication of recurrent weight matrices across many time steps. In BPTT, gradients are computed by chaining partial derivatives through the unrolled network, producing products involving powers of the recurrent weight matrix W_{hh}; if the spectral radius of W_{hh} is less than 1, these products cause gradients to decay exponentially toward zero, hindering the learning of long-term dependencies, while a spectral radius greater than 1 can cause gradients to explode, leading to numerical instability. BPTT is also computationally expensive and memory-intensive for long sequences: in standard implementations, time and space complexity are both linear in T, because all intermediate states must be stored and backpropagated through, exacerbating resource demands in applications with extended temporal horizons. The error surfaces of RNNs trained via BPTT often exhibit chaotic behavior and numerous local optima, characterized by high-curvature walls and spurious valleys that trap optimization in suboptimal solutions, complicating convergence to effective weights. Furthermore, BPTT is highly sensitive to weight initialization and hyperparameters such as the learning rate; poor choices can amplify gradient pathologies or produce divergent training dynamics, necessitating careful tuning for reliable performance. Truncation methods partially mitigate these computational burdens but introduce bias.
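
A small numerical experiment makes the spectral-radius argument concrete; the orthogonal-matrix construction and the radii below are illustrative choices, and the activation-function derivative is ignored for simplicity.
python
# Repeated backward multiplication by W_hh^T with spectral radius 0.9 vs 1.1:
import numpy as np

rng = np.random.default_rng(0)
n, steps = 8, 50
Q = np.linalg.qr(rng.normal(size=(n, n)))[0]   # random orthogonal matrix

for rho in (0.9, 1.1):                         # spectral radius < 1 and > 1
    W_hh = rho * Q                             # all singular values equal rho
    g = np.ones(n)                             # a unit-scale error signal
    for _ in range(steps):
        g = W_hh.T @ g                         # one step of the backward chain
    print(f"rho = {rho}: |gradient| after {steps} steps = {np.linalg.norm(g):.3e}")

# rho = 0.9 shrinks the norm by a factor of 0.9 per step (vanishing);
# rho = 1.1 grows it by a factor of 1.1 per step (exploding).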

Historical Context

Origins

Backpropagation through time (BPTT) originated as an extension of the standard backpropagation algorithm to dynamic systems and recurrent neural networks (RNNs), enabling efficient gradient computation across temporal sequences. The concept was first introduced by Paul Werbos in his 1974 PhD thesis, where he proposed applying dynamic programming and gradient-based optimization to behavioral and economic models, laying the groundwork for propagating errors through time in nonlinear dynamic systems. Werbos's early work emphasized the need for exact derivative calculations in feedback systems, building directly on the reverse-mode differentiation principles that underpin backpropagation. The popularization of backpropagation came in 1986 with the work of David Rumelhart, Geoffrey Hinton, and Ronald Williams, whose efforts helped apply the method to neural networks for sequential tasks, such as the Jordan network proposed in Michael Jordan's 1986 work, which incorporated recurrent connections from outputs to hidden layers to model serial order in behavior. The explicit description of applying backpropagation to RNNs by unfolding the network into a time-expanded structure, allowing error signals to propagate backward through time steps, was developed in subsequent works addressing key challenges in training early RNNs, including the propagation of gradients over multiple time steps to learn temporal dependencies without manual intervention. BPTT gained formal structure and widespread recognition through Werbos's 1990 publication, "Backpropagation Through Time: What It Does and How to Do It," which provided a comprehensive review, algorithmic details, and applications to dynamic systems. This paper clarified the method's mechanics for optimizing RNNs like the Elman simple recurrent network, introduced that same year, by treating time as an additional dimension for error propagation. Werbos's contribution synthesized prior hints into a robust framework, emphasizing its utility for fault diagnosis, control systems, and prediction in time-varying data.

Key Developments

In the 1990s, significant extensions to backpropagation through time (BPTT) addressed longstanding challenges in recurrent neural networks (RNNs), particularly the vanishing gradient problem that hindered learning long-term dependencies. This issue was formally identified by Sepp Hochreiter in his 1991 diploma thesis, which analyzed why gradients diminish during training of recurrent nets. A pivotal advancement was the introduction of long short-term memory (LSTM) units by Sepp Hochreiter and Jürgen Schmidhuber in 1997, which incorporated gating mechanisms to regulate information flow and mitigate gradient vanishing during BPTT. These gates—input, forget, and output—enable selective retention of relevant signals over extended sequences, allowing LSTMs to outperform vanilla RNNs on tasks requiring memory of prior timesteps when trained via BPTT. BPTT's principles also informed strategies for managing exploding gradients, another issue in RNN training. In 2013, Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio analyzed gradient dynamics in RNNs and proposed gradient clipping as a simple yet effective remedy, capping gradient norms during BPTT to prevent instability without altering the underlying optimization. This technique has become standard practice, providing a foundational approach to gradient control that extends beyond RNNs to broader deep learning methods. While BPTT itself is not directly applied in non-recurrent architectures like transformers, its emphasis on efficient gradient computation through sequential unrolling influenced stability techniques in parallelizable models. Computational advancements further enhanced BPTT's practicality. The adoption of GPU acceleration in the early 2010s sped up the matrix operations central to RNN unrolling and gradient propagation, with libraries like NVIDIA's cuDNN enabling faster training of BPTT-based models on large-scale sequences. Complementing this, the release of TensorFlow in 2015 by Google introduced automatic differentiation capabilities that streamlined BPTT implementation for RNNs and their variants, reducing manual derivation effort and facilitating widespread adoption in research and industry. Today, while BPTT remains theoretically central to understanding gradient flow in sequential models, it has been largely superseded in practice by architectures such as transformers, which avoid recurrence for better parallelization. Nonetheless, BPTT continues to underpin training in specialized RNN applications, such as time-series forecasting and other sequence tasks where long-term dependencies persist.
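
As a concrete illustration of the clipping remedy, the sketch below rescales a set of gradient arrays when their combined L2 norm exceeds a threshold; the function name and threshold value are illustrative, not drawn from any particular library.
python
# A minimal global-norm gradient-clipping sketch (hypothetical helper):
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients jointly if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Usage: clip the BPTT gradients before the parameter update, e.g.
# dW_xh, dW_hh, dW_hy = clip_by_global_norm([dW_xh, dW_hh, dW_hy])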
