
Learning rule

A learning rule in artificial neural networks is an adaptive algorithm or mathematical procedure that modifies the weights of connections between neurons based on input patterns, error signals, or activity correlations, enabling the network to learn and perform tasks such as classification or pattern recognition. These rules form the core mechanism for training: they iteratively minimize the difference between desired and actual outputs or extract inherent structure from the data. Learning rules originated from biological inspirations, particularly synaptic plasticity observed in the brain, and have evolved into diverse categories including supervised, unsupervised, and reinforcement-based approaches.

The foundational Hebbian learning rule, introduced by Donald Hebb in his 1949 book The Organization of Behavior, proposes that synaptic efficacy strengthens when pre- and post-synaptic neurons activate simultaneously, a principle often summarized as "neurons that fire together, wire together." This unsupervised rule supports associative memory and feature extraction but lacks error feedback, limiting its use in complex tasks. In supervised learning, rules like the delta rule (also called the Widrow-Hoff rule) adjust weights proportionally to the error between target and network outputs, enabling gradient-based optimization and forming the basis for backpropagation in multilayer networks. Other variants, such as the perceptron learning rule, handle binary classification by updating weights only for misclassified inputs, though they converge only for linearly separable data. Unsupervised rules, like competitive learning, promote winner-take-all dynamics for clustering, while reinforcement learning rules incorporate reward signals for sequential decision-making. Modern advancements integrate these rules with deep learning architectures, enhancing efficiency across a broad range of application domains.

Fundamentals

Definition and Purpose

A learning rule is an algorithm or procedure that specifies how to update the parameters of a model, such as the weights in a neural network, based on input data, observed errors, or detected patterns, in order to enhance performance on a specific task. These rules form the core mechanism by which artificial neural networks and other adaptive systems change during training, drawing inspiration from biological processes like synaptic plasticity. The primary purpose of a learning rule is to enable models to improve through experience, typically by minimizing a loss function that quantifies discrepancies between predictions and desired outcomes, adapting to underlying data distributions, or uncovering hidden structure in the input. In supervised learning, rules adjust parameters using labeled examples to align outputs with targets; in unsupervised learning, they identify patterns without explicit guidance; and in reinforcement learning, they refine behaviors based on reward signals from the environment. This adaptability underpins applications across pattern recognition, prediction, and control.

Key characteristics of learning rules include their iterative nature, with parameters updated incrementally across multiple passes over the data, and their reliance on a learning rate parameter η, which scales the magnitude of each update to balance speed and precision. Effective rules exhibit convergence properties, gradually reducing errors until a stable minimum is reached, while adhering to stability conditions that prevent oscillations or divergence, such as bounding weight changes. For instance, in a basic scenario, a weight connecting two neurons might be strengthened if the input through that connection contributes to a correct output, intuitively mimicking how associations form in learning.

Mathematical Foundations

Learning rules in machine learning typically involve iterative updates to model parameters, often denoted as weights \mathbf{w}, using a general form \Delta \mathbf{w} = \eta \cdot f(\text{error}, \text{input}), where \eta > 0 is the learning rate controlling the step size, and f is a rule-specific function that determines the direction and magnitude of the update based on the observed error and input data. This framework accommodates both gradient-based and non-gradient adjustments, with \mathbf{w} represented as a vector in multi-dimensional spaces for complex models.

A foundational optimization method underlying many learning rules is gradient descent, which minimizes a loss function L(\mathbf{w}) through iterative updates given by \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t), where \nabla L(\mathbf{w}_t) is the gradient of the loss with respect to \mathbf{w} at iteration t. In the stochastic gradient descent variant, prevalent in large-scale machine learning, the gradient is approximated using individual data points or mini-batches, enabling efficient computation at the expense of added variance.

Common error metrics quantify the discrepancy between predicted outputs \hat{y} and true targets y. For regression tasks, the mean squared error (MSE) is widely used: L = \frac{1}{2} \sum (y - \hat{y})^2, which penalizes larger errors quadratically and derives from the method of least squares originally developed for parameter estimation in astronomical observations. In classification, the cross-entropy loss measures divergence between true probability distributions and model predictions, often expressed for binary cases as L = - [y \log \hat{y} + (1-y) \log (1-\hat{y})], stemming from maximum likelihood estimation in logistic models.

Convergence and stability of these updates depend on several factors, including the shape of the loss landscape L(\mathbf{w}), which may contain local minima that trap the optimization in non-convex settings. The learning rate \eta must be selected judiciously, typically decreasing over iterations, to balance rapid progress against oscillation or divergence, with theoretical guarantees often requiring \sum \eta_t = \infty and \sum \eta_t^2 < \infty. Batch updates compute the exact gradient over the full dataset for smoother trajectories but higher computational cost, whereas stochastic updates leverage noisy estimates for faster practical convergence in high-dimensional problems. These principles form the basis for applying learning rules in supervised contexts like the perceptron and in unsupervised alternatives like Hebbian learning, where f may deviate from pure gradients.
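As an illustration of these update equations, the following minimal Python sketch (the data, model, and decay schedule are illustrative assumptions, not drawn from a specific source) applies stochastic gradient descent to a linear model under the half-squared error, so each step takes the general form \Delta \mathbf{w} = \eta \cdot (\text{error}) \cdot (\text{input}):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 200 examples with 3 features
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)    # noisy linear targets

w = np.zeros(3)                                # parameters to learn
for t in range(1, 2001):
    i = rng.integers(len(X))                   # stochastic: one example per update
    error = y[i] - X[i] @ w                    # (y - y_hat) under L = 0.5 * (y - y_hat)^2
    eta = 1.0 / t                              # decaying rate: sum(eta) = inf, sum(eta^2) < inf
    w += eta * error * X[i]                    # negative-gradient step, Delta w = eta * error * x
print(w)                                       # approaches w_true as updates accumulate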

Historical Development

Early Concepts

The concept of learning rules in neural systems traces its roots to early 20th-century investigations into biological mechanisms of adaptation and association. Ivan Pavlov's work on classical conditioning, detailed in his 1927 book Conditioned Reflexes, demonstrated how animals form associations between neutral stimuli and innate responses, such as dogs salivating to a bell after repeated pairing with food presentation. This associative learning analogy laid foundational groundwork for understanding how environmental signals could modify behavioral outputs, influencing later theories of synaptic change without computational implementation.

A pivotal biological insight emerged in 1949 with Donald Hebb's postulate in The Organization of Behavior, proposing that "when an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." Often paraphrased as "cells that fire together wire together," this idea established synaptic plasticity as a core mechanism for learning, emphasizing coincident neural activity as the driver of strengthened connections in the brain. Hebb's framework provided a theoretical basis for how neural assemblies could adapt through experience, directly inspiring the Hebbian learning rule as an outgrowth of these biological principles.

Early theoretical models further bridged biology and computation by abstracting neural function. In 1943, Warren McCulloch and Walter Pitts introduced a simplified neuron model in their paper "A Logical Calculus of the Ideas Immanent in Nervous Activity," portraying neurons as binary threshold devices that compute logical functions based on excitatory and inhibitory inputs. Although this model lacked adaptive learning capabilities, it set the stage for adaptive systems by demonstrating how networks of such units could perform complex computations, akin to propositional logic.

The emerging field of cybernetics also contributed conceptual foundations for learning through feedback mechanisms. Norbert Wiener's 1948 book Cybernetics: Or Control and Communication in the Animal and the Machine explored feedback loops as essential for self-regulating systems in both machines and organisms, linking control theory to adaptive processes. Wiener's ideas highlighted how negative feedback could stabilize and adjust system behavior in response to perturbations, providing an analogy for learning rules that would later enable computational realizations like the perceptron.

Mid-20th Century Advances

The Dartmouth Summer Research Project on Artificial Intelligence in 1956 marked the formal inception of AI as a field, bringing together pioneers such as John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon to explore machine learning and neural networks, laying the groundwork for subsequent computational learning rules. The conference's emphasis on automata that could learn from data directly influenced early neural models.

In 1958, Frank Rosenblatt introduced the perceptron, a single-layer neural network capable of binary classification through supervised learning, and notably implemented it as hardware called the Mark I Perceptron at the Cornell Aeronautical Laboratory, demonstrating practical pattern recognition for tasks like image classification. Building on this, Bernard Widrow and Marcian Hoff developed the delta rule in 1960, also known as the least mean squares (LMS) algorithm, which extended learning to continuous-valued outputs in linear units, enabling applications in adaptive filtering and signal processing akin to linear regression. This rule adjusted weights proportionally to the error between predicted and actual outputs, providing a more general framework for training multi-input adaptive linear neurons (ADALINEs). These advances in the late 1950s and early 1960s fueled optimism in neural computing, with hardware prototypes showing real-world viability in areas like speech recognition and control systems.

However, in 1969, Marvin Minsky and Seymour Papert published Perceptrons, a rigorous mathematical analysis revealing fundamental limitations of single-layer networks, such as their inability to solve non-linearly separable problems like the XOR function without multilayer architectures. Their critique, which emphasized the absence of effective learning procedures for hidden layers, contributed significantly to disillusionment in the field and the onset of the first AI winter, curtailing funding and research into neural networks for over a decade.

The resurgence began with Paul Werbos's 1974 PhD thesis, which first formalized the backpropagation algorithm for training multilayer networks by propagating errors backward through layers using dynamic programming techniques. This was popularized in 1986 by David Rumelhart, Geoffrey Hinton, and Ronald Williams, whose Nature paper demonstrated backpropagation's efficacy in learning internal representations for complex tasks, such as encoding and decoding patterns in hidden layers, reigniting interest in connectionist models. During this era, competitive learning rules also emerged, allowing networks to self-organize through winner-take-all mechanisms for clustering. These developments established the core principles of supervised neural learning, forming the bedrock for modern deep learning architectures.

Supervised Learning Rules

Perceptron Learning Rule

The perceptron learning rule is a foundational supervised learning algorithm for binary classification in single-layer neural networks, designed to iteratively adjust synaptic weights based on classification errors so as to separate linearly separable data using a threshold activation function. Introduced by Frank Rosenblatt in 1958, it processes input vectors through weighted summation followed by a step function that outputs 1 if the sum exceeds a threshold (typically 0) and 0 otherwise, enabling the network to learn decision boundaries for two-class problems. The rule operates in an error-correction manner, updating weights only when a misclassification occurs, making it suitable for pattern recognition tasks where training examples consist of input features paired with binary targets.

The core update mechanism is given by the formula: \Delta w_i = \eta (t - y) x_i where w_i is the weight for input x_i, \eta is the learning rate (often set to 1 for simplicity), t is the target output (0 or 1), and y is the perceptron's binary output. A bias term can be incorporated as w_0 with a constant input x_0 = 1, allowing the threshold to be adjusted dynamically. This update increases weights for inputs that would correct a false negative and decreases them for a false positive, effectively shifting the decision hyperplane toward correctly classifying the erroneous example.

Rosenblatt proved the perceptron convergence theorem, establishing that if the data is linearly separable, meaning a hyperplane exists that perfectly divides the classes, the algorithm will converge to a solution in a finite number of steps, bounded by factors such as the number of examples and the margin of separability. Specifically, for normalized inputs and a separating hyperplane with margin \gamma, convergence occurs in at most \frac{R^2}{\gamma^2} updates, where R is the maximum input norm, guaranteeing error-free classification after repeated presentations of the training set. Geometrically, each update rotates and translates the decision boundary to reduce misclassifications, enlarging the margin until no errors remain.

Despite its guarantees, the perceptron learning rule has key limitations: it converges only for linearly separable data and fails to learn functions requiring non-linear boundaries, such as the XOR function, where inputs (0,0) and (1,1) map to one class while (0,1) and (1,0) map to another, as no single hyperplane can separate these points. This vulnerability was rigorously analyzed by Minsky and Papert in their 1969 book Perceptrons, highlighting that single-layer perceptrons lack the representational power for certain logical operations. The algorithm can be implemented as in the following Python routine:
import numpy as np

def train_perceptron(X, T, eta=1.0, max_epochs=100):
    # X is an (m, n) array of input vectors; T holds the binary targets (0 or 1).
    n = X.shape[1]
    w = np.zeros(n)                     # weights w_1..w_n (zero or small random initialization)
    w0 = 0.0                            # bias term w_0 with constant input x_0 = 1
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, T):
            net = np.dot(w, x) + w0     # weighted sum plus bias
            y = 1 if net >= 0 else 0    # threshold (step) activation
            if y != t:
                w += eta * (t - y) * x  # error-correction update on misclassification
                w0 += eta * (t - y)     # bias update
                errors += 1
        if errors == 0:                 # converged: no misclassifications remain
            break
    return w, w0
This procedure ensures iterative refinement until the perceptron correctly classifies all training instances, forming the basis for later extensions like the delta rule for continuous outputs.
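Continuing from the routine above, a brief usage sketch on the logical AND function, chosen here as an assumed, linearly separable example:

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs for the logical AND function
T = np.array([0, 0, 0, 1])                        # AND is linearly separable, so training converges
w, w0 = train_perceptron(X, T)
print(w, w0)                                      # learned weights and bias define a separating hyperplane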

Delta Learning Rule

The delta learning rule, also known as the Widrow-Hoff rule, is a supervised learning algorithm designed for training single-layer neural networks with linear activation functions to perform regression tasks by minimizing the mean squared error (MSE) between predicted outputs and real-valued targets. Introduced in the context of adaptive linear elements (ADALINEs), it employs a gradient-based update mechanism to iteratively adjust synaptic weights based on the error signal.

At its core, the rule implements stochastic gradient descent on the instantaneous MSE loss for linear models, where the output y is computed as the dot product y = \mathbf{w} \cdot \mathbf{x}, with \mathbf{w} as the weight vector and \mathbf{x} as the input vector. The weight update is given by: \Delta \mathbf{w} = \eta (t - y) \mathbf{x} where \eta > 0 is the learning rate, t is the target value, and (t - y) serves as the error term driving the adjustment. This formulation arises from the negative gradient of the half-squared error E = \frac{1}{2}(t - y)^2, ensuring convergence toward weights that minimize the expected MSE under mild conditions on the input data.

In applications, the delta rule underpins the least mean squares (LMS) algorithm for adaptive filtering in signal processing, enabling real-time adjustment of filter coefficients to track changing environments, such as in noise cancellation and echo suppression systems developed since the 1960s. Compared to the perceptron learning rule, it offers the key advantage of accommodating continuous, non-binary outputs through direct least mean squares minimization, rather than relying on threshold-based classification. As a stochastic gradient descent variant, it supports online learning by processing examples sequentially, making it computationally efficient for streaming data with low memory requirements. The delta rule thus extends the perceptron learning rule to handle real-valued targets while maintaining a similar update structure for linear models, and it serves as the foundational update mechanism generalized in the backpropagation algorithm for multi-layer networks.
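A minimal sketch of one online training pass with the delta rule is shown below; the function name, learning rate, and data layout are illustrative assumptions, and the targets are assumed to be real-valued:

import numpy as np

def delta_rule_epoch(w, X, T, eta=0.01):
    # One online pass: w is the weight vector, X the (m, n) inputs, T the real-valued targets.
    for x, t in zip(X, T):
        y = np.dot(w, x)               # linear activation (no threshold)
        w = w + eta * (t - y) * x      # Delta w = eta * (t - y) * x, a step on E = 0.5 * (t - y)^2
    return w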

Backpropagation Algorithm

The backpropagation algorithm is a cornerstone of supervised learning for multi-layer neural networks, enabling efficient computation of gradients to minimize a loss function by propagating errors backward through the network layers. It extends the delta rule, which applies to output layers, to handle hidden layers in non-linear networks via the chain rule of calculus. Formally introduced by Rumelhart, Hinton, and Williams in 1986, the algorithm performs iterative weight adjustments based on the error between predicted and target outputs.

The process begins with the forward pass, where inputs are propagated layer by layer to generate predictions. For a neuron j in a layer, the net input z_j is calculated as z_j = \sum_i w_{ji} x_i + b_j, where w_{ji} are weights, x_i are inputs from the previous layer, and b_j is the bias; the activation is then y_j = f(z_j), with f typically a non-linear function such as the sigmoid, whose derivative is bounded.

In the backward pass, errors are computed starting from the output layer and propagated in reverse. For the output layer, the error term (delta) is \delta_o = (t - y_o) \odot f'(z_o), where t is the target vector, y_o the output, z_o the pre-activation, f' the derivative of the activation function, and \odot denotes element-wise multiplication. For a hidden layer, the error term is \delta_h = (W_{\text{next}}^T \delta_{\text{next}}) \odot f'(z_h), where W_{\text{next}} is the weight matrix to the next layer. Weights are updated as \Delta W = \eta \delta_h x_h^T, with \eta as the learning rate and x_h the input to the layer; biases are updated similarly using \delta.

The derivation of these updates relies on applying the chain rule to the loss E, typically E = \frac{1}{2} \sum_k (t_k - y_k)^2, to find partial derivatives with respect to the weights. Specifically, \frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial z_j} \frac{\partial z_j}{\partial w_{ji}} = -\delta_j x_i, where \delta_j = -\frac{\partial E}{\partial z_j} captures the propagated error sensitivity, computed recursively from output to input via the chain rule across layers. This yields the update \Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = \eta \, \delta_j x_i, consistent with the expressions above.

Variants of the algorithm include batch gradient descent, where gradients are accumulated and averaged over the entire training dataset before a single weight update per epoch, as originally described, and mini-batch processing, which updates weights after subsets of roughly 32–256 examples to balance computational efficiency against variance in the gradient estimates. Momentum, also introduced in the seminal work, augments updates to accelerate convergence: \Delta w(t) = -\eta \frac{\partial E}{\partial w(t)} + \alpha \Delta w(t-1), with a typical \alpha = 0.9 damping oscillations along shallow directions in the loss landscape.

Despite its effectiveness, backpropagation faces challenges such as the vanishing gradient problem, where repeated multiplication by derivatives less than 1 (e.g., the sigmoid's f' < 0.25) causes gradients to diminish exponentially in deep networks, impeding learning in early layers; this was first analyzed in recurrent contexts but extends to feedforward architectures. Additionally, the computational complexity is O(n^2) per layer for fully connected networks with n neurons, arising from the quadratic cost of matrix-vector multiplications in both passes, scaling poorly for large models without optimizations like sparsity.
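As a concrete illustration, the following minimal sketch performs one forward and backward pass for a single hidden layer, assuming sigmoid activations and the half-squared error; the function and variable names are illustrative, not the published formulation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W1, b1, W2, b2, x, t, eta=0.1):
    # Forward pass: hidden layer then output layer.
    z1 = W1 @ x + b1
    h = sigmoid(z1)
    z2 = W2 @ h + b2
    y = sigmoid(z2)
    # Backward pass: delta_o = (t - y) * f'(z2), delta_h = (W2^T delta_o) * f'(z1),
    # using f'(z) = f(z) * (1 - f(z)) for the sigmoid.
    delta_o = (t - y) * y * (1 - y)
    delta_h = (W2.T @ delta_o) * h * (1 - h)
    # Weight and bias updates: Delta W = eta * delta * input^T (descent on E = 0.5 * ||t - y||^2).
    W2 += eta * np.outer(delta_o, h)
    b2 += eta * delta_o
    W1 += eta * np.outer(delta_h, x)
    b1 += eta * delta_h
    return W1, b1, W2, b2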

Unsupervised Learning Rules

Hebbian Learning Rule

The Hebbian learning rule, a foundational principle in unsupervised learning, posits that the strength of synaptic connections between neurons increases when the pre-synaptic and post-synaptic neurons are active simultaneously, thereby strengthening associations based on correlated activity. This mechanism, often summarized as "neurons that fire together, wire together," serves as a biological model for associative learning and has been formalized in computational neuroscience to enable networks to adapt without external supervision.

Mathematically, the basic Hebbian update rule for the synaptic weight w_{ij} between pre-synaptic neuron i and post-synaptic neuron j is given by \Delta w_{ij} = \eta \, x_i \, y_j, where \eta > 0 is the learning rate, x_i is the activity of the pre-synaptic neuron, and y_j is the activity of the post-synaptic neuron, typically computed as y_j = f\left( \sum_k w_{jk} x_k \right) with f as an activation function. This rule reinforces pathways that contribute to joint activation, facilitating the storage of input patterns as stable attractors in recurrent networks.

A key challenge with the basic rule is its tendency toward unbounded weight growth, as repeated co-activations continuously amplify weights without stabilization, potentially leading to instability in network dynamics. Solutions such as weight decay, which subtracts a term proportional to the current weight (e.g., \Delta w_{ij} \leftarrow \Delta w_{ij} - \gamma w_{ij} with decay rate \gamma > 0), mitigate this by introducing gradual forgetting and bounding growth over time.

In applications, the Hebbian rule underpins pattern storage in Hopfield networks, where weights are set as outer products of stored patterns to retrieve complete representations from partial cues via associative recall. Biologically, it aligns with long-term potentiation (LTP), a persistent synaptic strengthening observed in hippocampal slices following high-frequency stimulation, supporting its role in memory formation. Variants of the rule address limitations like growth instability; for instance, Oja's normalized Hebbian rule modifies the update to \Delta w_{ij} = \eta y_j (x_i - y_j w_{ij}), which constrains the weight vector to unit length, promoting principal component extraction while maintaining stability without altering the core principle.
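A minimal sketch of the basic Hebbian update with weight decay for a single linear post-synaptic neuron follows; the data, learning rate, and decay constant are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))          # pre-synaptic activity vectors x
w = 0.01 * rng.normal(size=4)          # synaptic weights onto one post-synaptic neuron
eta, gamma = 0.01, 0.01                # learning rate and decay rate
for x in X:
    y = w @ x                          # post-synaptic activity (linear activation)
    w += eta * x * y                   # Hebbian term: co-activity strengthens the connection
    w -= gamma * w                     # weight decay bounds the otherwise unbounded growth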

Competitive Learning

Competitive learning is an unsupervised learning paradigm in neural networks in which a set of output neurons, or competing units, vie to respond to input patterns, enabling the discovery of features or clusters in data without labeled supervision. In this process, neurons adjust their weights to become specialized representatives of input subspaces, promoting efficient data representation through rivalry and selective activation. This approach contrasts with non-competitive methods like Hebbian learning by incorporating inhibition among neurons to enforce sparsity and distinctiveness in representations.

The core mechanism involves selecting the best-matching unit (BMU), or winner, for each input vector \mathbf{x} by computing the Euclidean distance to each neuron's weight vector \mathbf{w}_i and choosing the unit k that minimizes ||\mathbf{x} - \mathbf{w}_k||. The winner's weights are then updated to move closer to the input, while other neurons remain unchanged or experience inhibitory effects to prevent overgeneralization. The weight update for the winning neuron k and input component j is given by \Delta w_{kj} = \eta (x_j - w_{kj}), where \eta is the learning rate, pulling the weight toward the input and effectively performing a form of vector quantization. Lateral inhibition, often implemented via negative connections between neurons, further enhances competition by suppressing non-winning units, fostering sparse activations that mimic biological selectivity.

A prominent topological variant is the self-organizing map (SOM), introduced by Teuvo Kohonen, which preserves spatial relationships among neurons arranged on a low-dimensional grid. In SOMs, not only the BMU but also its neighbors update their weights, modulated by a neighborhood function that decays over time. The update becomes \Delta \mathbf{w}_i = \eta(t) h_{ki}(t) (\mathbf{x} - \mathbf{w}_i), where h_{ki}(t) = \exp\left( -\frac{||\mathbf{r}_i - \mathbf{r}_k||^2}{2\sigma(t)^2} \right) is the Gaussian neighborhood kernel, with \mathbf{r}_i and \mathbf{r}_k denoting the neurons' grid positions and \sigma(t) the shrinking neighborhood radius. This enables topology-preserving mappings of high-dimensional data onto maps where similar inputs activate nearby neurons, facilitating exploratory data analysis.

Competitive learning finds applications in vector quantization for data compression, where codebook vectors approximate input distributions, and in clustering tasks that group similar patterns without predefined categories. The sparsity induced by competition also supports feature extraction, such as isolating dominant components in sensory data. Convergence in competitive learning relies on decaying the learning rate \eta(t) and, in topological models like SOMs, shrinking the neighborhood radius \sigma(t) to transition from global ordering to local fine-tuning, ensuring stable map formation and preventing oscillations. This schedule, often linear or exponential, balances exploration and exploitation, leading to topologically ordered representations in SOMs after sufficient iterations.
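A minimal winner-take-all sketch of the basic competitive update is shown below; the number of units, the synthetic data, and the fixed learning rate are illustrative assumptions (an SOM would additionally update neighbors weighted by the kernel h_{ki}):

import numpy as np

def competitive_epoch(W, X, eta=0.1):
    # W is a (k, d) matrix of k competing weight vectors; X is an (m, d) set of inputs.
    for x in X:
        k = np.argmin(np.linalg.norm(W - x, axis=1))  # best-matching unit (the winner)
        W[k] += eta * (x - W[k])                      # only the winner moves toward the input
    return W

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2.0, 0.3, size=(100, 2)),  # two clusters to be discovered
               rng.normal(+2.0, 0.3, size=(100, 2))])
W = rng.normal(size=(2, 2))                            # two competing units
for _ in range(20):
    W = competitive_epoch(W, X)
print(W)                                               # weight vectors approach the cluster centers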

Extensions and Modern Variants

Reinforcement Learning Rules

Reinforcement learning rules facilitate the optimization of policies or value functions in environments where agents interact sequentially to maximize cumulative rewards, rather than relying on explicit supervisory signals. At their core, these rules employ temporal difference (TD) errors to drive updates, where the TD error quantifies the difference between a current prediction of future reward and an updated estimate based on new observations, enabling learning from experience without complete knowledge of the environment's dynamics. This approach, formalized in the late 1980s, contrasts sharply with supervised learning by accommodating sparse and delayed rewards, where feedback arrives infrequently or after multiple steps, and by necessitating a trade-off between exploration of novel actions and exploitation of known rewarding ones to avoid suboptimal policies.

A foundational update rule in this paradigm is the REINFORCE algorithm, which adjusts policy parameters to ascend the gradient of expected reinforcement directly from sampled trajectories. The update takes the form \Delta w = \eta \rho \nabla_w \log \pi(a|s; w), where w are the policy parameters, \eta is the learning rate, \rho represents the received reward signal (often the return of the episode), \pi(a|s; w) is the parameterized probability of taking action a in state s, and the gradient points toward policies yielding higher rewards. This Monte Carlo-style method estimates gradients unbiasedly but with high variance, making it suitable for direct policy search in stochastic settings.

Complementing this, value-based rules like Q-learning update action-value estimates to approximate optimal behavior off-policy, using the TD error for bootstrapping: \Delta Q(s,a) = \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right], where \alpha is the learning rate, r the immediate reward, \gamma the discount factor, s' the next state, and the max operator selects the best subsequent action to propagate value estimates backward. These updates converge to optimal values under tabular representations in finite Markov decision processes, providing a model-free way to learn control policies. Actor-critic architectures build on these rules by separating policy parameterization (the actor) from value estimation (the critic), where the critic's TD-based value approximations inform low-variance policy gradients for the actor, enhancing stability in high-dimensional or continuous spaces.

For credit assignment in delayed-reward scenarios, eligibility traces augment TD updates by maintaining decaying records of recent state visits, allowing errors to propagate temporally and blending one-step and multi-step learning for faster convergence, as in TD(\lambda) methods. Such traces, implemented as accumulators updated via e_t = \gamma \lambda e_{t-1} + \nabla v(s_t), credit past states in proportion to their recency and relevance, significantly improving performance in tasks like random walks or problems with long horizons.

The development of these rules traces to pioneering efforts in the 1980s, with Richard S. Sutton's introduction of TD learning in 1988 marking a key advancement in learning to predict from interaction, followed by extensions to function approximation in the 1990s that enabled scalable applications in complex domains. Modern variants, such as proximal policy optimization (PPO), introduced in 2017, further improve stability and sample efficiency in deep reinforcement learning tasks.
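A minimal tabular Q-learning sketch follows; the environment interface (reset, step, sample_action) and the epsilon-greedy exploration scheme are illustrative assumptions rather than part of any particular library:

import numpy as np

def q_learning_episode(Q, env, alpha=0.1, gamma=0.99, eps=0.1):
    # Q is an (n_states, n_actions) table; env is assumed to expose reset(), step(a),
    # and sample_action() returning a random legal action.
    s = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: explore with probability eps, otherwise exploit current estimates.
        a = env.sample_action() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        target = r + gamma * np.max(Q[s_next]) * (not done)   # one-step bootstrapped TD target
        Q[s, a] += alpha * (target - Q[s, a])                 # update scaled by the TD error
        s = s_next
    return Q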

Adaptive and Online Rules

Adaptive learning rates address the limitations of fixed-rate rules in unsupervised learning by dynamically adjusting synaptic weights to maintain stability and prevent issues like weight explosion or vanishing updates. A seminal example is Oja's rule, introduced in 1982, which modifies the classical Hebbian update to normalize weights and extract the principal component of the input data. The rule is defined as: \Delta w_{ij} = \eta (x_i y_j - y_j^2 w_{ij}) where \eta is the learning rate, x_i is the input, y_j is the output, and the subtractive term keeps the norm of the weight vector bounded near 1, promoting convergence to the dominant eigenvector of the input covariance matrix.

Online variants extend these ideas to streaming environments, using stochastic approximations to update weights incrementally without requiring storage of the full dataset. Covariance-based rules, such as the Bienenstock-Cooper-Munro (BCM) theory from 1982, incorporate a sliding modification threshold to balance potentiation and depression, enabling stable feature selectivity in neuron models. The BCM update is given by: \Delta w = \eta \, x \, y (y - \theta) where \theta is an activity-dependent threshold that shifts with recent postsynaptic activity, allowing synapses to strengthen when the response exceeds the threshold and weaken when it falls below, thus resolving issues like over-saturation.

Hybrid approaches combine these unsupervised mechanisms with external signals for enhanced adaptability, such as reward-modulated Hebbian learning, which scales weight changes by a global reward signal to mimic dopamine's role in biological reinforcement. In such reward-signaling models, this modulation enables networks to reorganize connections in tasks requiring sustained activity, as demonstrated in simulations where reward prediction errors drive selective synaptic potentiation without full supervised training. Integration with gradient-based training has also enabled online learning in deep networks by adapting rates per layer.

In modern applications, these adaptive and online rules are crucial for continual learning paradigms, where models must incorporate new information without overwriting existing knowledge, directly tackling the stability-plasticity dilemma first formalized by Grossberg. By dynamically tuning thresholds and rates, rules like BCM variants mitigate catastrophic forgetting in benchmarks such as permuted MNIST. Second-order methods further refine this by approximating the Hessian to capture curvature information, enabling faster convergence in non-stationary environments; for instance, Hessian-free optimization uses conjugate gradients to avoid explicit Hessian computation in deep network training.
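A minimal sketch contrasting Oja's rule with a BCM-style update for a single linear unit is shown below; the input statistics, learning rate, and the sliding-threshold time constant tau are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 3)) * np.array([2.0, 1.0, 0.3])   # inputs with one dominant direction
w_oja = 0.1 * rng.normal(size=3)
w_bcm = w_oja.copy()
eta, tau, theta = 0.005, 100.0, 1.0
for x in X:
    y = w_oja @ x
    w_oja += eta * y * (x - y * w_oja)        # Oja: Hebbian term plus normalizing decay, ||w|| stays near 1
    y_b = w_bcm @ x
    w_bcm += eta * x * y_b * (y_b - theta)    # BCM: potentiation when y exceeds theta, depression below it
    theta += (y_b**2 - theta) / tau           # sliding threshold tracks recent postsynaptic activity
print(np.linalg.norm(w_oja), w_oja)           # Oja's weights align with the dominant input direction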
