
Neural network

A neural network, also known as an artificial neural network (ANN), is a computational model inspired by the structure and function of biological neural networks in the brain, consisting of interconnected nodes called artificial neurons that process signals through weighted connections and activation functions to learn patterns from data. These networks are organized into layers—typically an input layer receiving raw data, one or more hidden layers performing intermediate computations, and an output layer producing results—and operate by propagating signals forward while adjusting synaptic weights during training to minimize prediction errors, often via gradient-based methods like backpropagation. This architecture enables neural networks to approximate complex functions and handle tasks such as classification, regression, and sequence modeling with high accuracy after exposure to large datasets.

The origins of neural networks trace back to 1943, when Warren S. McCulloch and Walter Pitts introduced the first mathematical model of an artificial neuron as a binary threshold device, demonstrating that networks of such units could perform any logical computation and simulate finite-state machines. Building on this, Frank Rosenblatt in 1958 developed the perceptron, a single-layer network capable of linear classification through learning rules that adjust weights based on input-output discrepancies. However, early limitations, such as the inability of single-layer networks to solve nonlinear problems like the XOR function, led to a period of reduced interest known as the AI winter in the late 1960s and 1970s. The field revived in the 1980s with the popularization of multi-layer perceptrons (MLPs) and the backpropagation algorithm, which enabled efficient training of deep networks by propagating errors backward through layers to update weights using gradient descent. Key advancements included the 1979 invention of the Neocognitron by Kunihiko Fukushima, a hierarchical convolutional network precursor for visual pattern recognition with shared weights and subsampling to reduce computational demands. In the 1990s, recurrent neural networks (RNNs) emerged to handle sequential data, though challenges like vanishing gradients hindered deep training until solutions like long short-term memory (LSTM) units in 1997 allowed effective learning over long time dependencies.

The modern era of deep learning, characterized by neural networks with many layers (deep credit assignment paths), began around 2006 with unsupervised pre-training techniques like deep belief networks and autoencoders, which initialized weights to facilitate subsequent supervised fine-tuning. Breakthroughs accelerated in the 2010s through computational advances like graphics processing units (GPUs), enabling convolutional neural networks (CNNs) to dominate image recognition tasks, as evidenced by AlexNet's 2012 ImageNet victory with an 84.7% top-5 accuracy. Today, neural networks underpin diverse applications, including natural language processing with transformers, autonomous driving, medical diagnosis, and generative modeling, continually evolving through innovations in architectures, optimization, and scalability.

Biological Foundations

Neuron Structure and Function

A biological neuron, the fundamental unit of the nervous system, consists of several key anatomical components that enable it to receive, process, and transmit electrical signals. The soma, or cell body, serves as the central hub containing the nucleus, organelles, and metabolic machinery necessary for protein synthesis and cellular maintenance. Extending from the soma are dendrites, branched structures that receive incoming signals from other neurons, increasing the neuron's surface area for synaptic inputs. The axon is a long, slender projection that conducts electrical impulses away from the soma toward other cells, often branching at its end into terminal boutons. Surrounding many axons is the myelin sheath, a lipid-rich insulating layer formed by glial cells (oligodendrocytes in the central nervous system and Schwann cells in the peripheral nervous system), which accelerates signal propagation by enabling saltatory conduction.

Neurons generate and propagate signals through specialized ion channels embedded in their plasma membrane, which regulate the flow of ions such as sodium (Na⁺), potassium (K⁺), calcium (Ca²⁺), and chloride (Cl⁻). Voltage-gated ion channels open or close in response to changes in membrane potential, allowing selective ion movement that underlies electrical signaling. The action potential, a rapid and transient reversal of the membrane potential, is the primary mechanism for signal generation and long-distance propagation along the axon. It begins when a stimulus depolarizes the membrane to a threshold (typically around -55 mV), triggering the opening of voltage-gated Na⁺ channels, which causes a rapid influx of Na⁺ ions and further depolarization to approximately +40 mV. This is followed by Na⁺ channel inactivation and opening of voltage-gated K⁺ channels, leading to K⁺ efflux, repolarization, and a brief hyperpolarization before returning to rest; the entire process lasts about 1-2 milliseconds and propagates without decrement due to the regenerative nature of channel activation. In myelinated axons, action potentials "jump" between nodes of Ranvier (gaps in the myelin), enhancing conduction speed up to 150 m/s.

At rest, the neuron's membrane maintains a resting potential of approximately -70 mV, primarily due to the unequal distribution of ions across the membrane and the selective permeability dominated by K⁺ leak channels. The sodium-potassium pump (Na⁺/K⁺-ATPase) actively transports three Na⁺ ions out and two K⁺ ions in per cycle, countering passive leaks to sustain ion gradients. Depolarization occurs when excitatory inputs increase Na⁺ permeability, shifting the membrane potential toward the Na⁺ equilibrium potential (around +60 mV). The equilibrium potential for each ion, representing the voltage at which its electrochemical gradient is zero, is described by the Nernst equation:

E_X = \frac{RT}{zF} \ln \left( \frac{[X]_o}{[X]_i} \right)

where E_X is the equilibrium potential, R is the gas constant, T is temperature in Kelvin, z is the ion's valence, F is Faraday's constant, and [X]_o and [X]_i are the extracellular and intracellular concentrations, respectively. For K⁺, with higher intracellular concentration (~140 mM vs. ~5 mM extracellular), E_K is about -90 mV, contributing to the resting potential; deviations from these equilibria drive the action potential dynamics.

Neurotransmitters play a crucial role in basic neuronal signaling by mediating communication between neurons at synapses, where they are released from the presynaptic axon terminal into the synaptic cleft upon Ca²⁺ influx triggered by an action potential.
These chemical messengers, such as glutamate (excitatory) or GABA (inhibitory), bind to receptors on the postsynaptic membrane, often opening ligand-gated ion channels that alter the membrane potential—depolarizing for excitation or hyperpolarizing for inhibition—thus integrating signals across the neuron. This electrochemical signaling in biological neurons has inspired the design of artificial neurons in computational models, which mimic signal integration and transmission.
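To make the Nernst relation concrete, the short Python sketch below evaluates the equation for K⁺ using the concentrations quoted above (~140 mM intracellular, ~5 mM extracellular); the function name and the choice of body temperature (310 K) are illustrative assumptions rather than part of the original text.

```python
import math

def nernst_potential(conc_out_mM, conc_in_mM, valence=1, temperature_K=310.0):
    """Equilibrium (Nernst) potential in millivolts: E_X = (RT / zF) * ln([X]_o / [X]_i)."""
    R = 8.314      # gas constant, J/(mol*K)
    F = 96485.0    # Faraday's constant, C/mol
    e_volts = (R * temperature_K) / (valence * F) * math.log(conc_out_mM / conc_in_mM)
    return e_volts * 1000.0  # volts -> millivolts

# Potassium: ~5 mM outside, ~140 mM inside, giving roughly -89 mV (close to the -90 mV cited above)
print(round(nernst_potential(5.0, 140.0), 1))
```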

Synaptic Transmission and Plasticity

Synapses serve as the junctions between neurons, enabling communication and adaptation in the nervous system. There are two primary types: chemical and electrical. Chemical synapses, which predominate in the mammalian brain, involve the release of neurotransmitters from the presynaptic neuron to influence the postsynaptic neuron across a narrow synaptic cleft of 20-50 nm. In contrast, electrical synapses use gap junctions to allow direct bidirectional flow of ions and small molecules, facilitating rapid synchronization but occurring less frequently, often in specialized tissues such as cardiac muscle or in certain neuronal circuits. Chemical synapses are unidirectional and support complex integration, making them central to higher functions such as learning and memory.

The process of synaptic transmission at chemical synapses begins when an action potential arrives at the presynaptic terminal, depolarizing the membrane and opening voltage-dependent calcium channels. This influx of calcium ions triggers the fusion of synaptic vesicles with the presynaptic membrane via SNARE proteins, releasing neurotransmitters—such as acetylcholine or glutamate—into the synaptic cleft through exocytosis. The neurotransmitters then diffuse rapidly across the cleft in microseconds and bind to specific receptors on the postsynaptic membrane, which can be ligand-gated ion channels for fast responses or G-protein-coupled receptors for slower modulation. This binding induces a postsynaptic response, such as depolarization (excitatory) or hyperpolarization (inhibitory), potentially leading to an action potential if the integrated signals reach threshold; the entire process incurs a synaptic delay of approximately 0.5-1.0 ms.

Synaptic plasticity refers to the ability of these connections to strengthen or weaken over time, underpinning learning and memory. A foundational principle is the Hebbian learning rule, proposed by Donald Hebb in 1949, which posits that when the presynaptic neuron repeatedly excites the postsynaptic neuron—"neurons that fire together wire together"—the synaptic efficacy increases. This is mechanistically realized through long-term potentiation (LTP), an enduring strengthening of synapses often induced by high-frequency stimulation, involving NMDA receptor activation and subsequent insertion of AMPA receptors to enhance postsynaptic sensitivity; LTP was first demonstrated in the rabbit hippocampus in 1973. Conversely, long-term depression (LTD) weakens synapses through low-frequency stimulation, reducing AMPA receptor presence and promoting forgetting or refinement of connections.

Neuroplasticity, driven by these synaptic changes, plays a critical role in memory formation and learning. For instance, LTP in the hippocampus contributes to memory consolidation, as seen in studies of associative learning where repeated co-activation strengthens engrams—persistent neural traces of experiences. In addition, LTD facilitates recovery from injury by pruning inefficient connections, such as in stroke rehabilitation where synaptic remodeling supports functional reorganization. This biological plasticity provides a conceptual foundation for weight adjustments in artificial neural networks during training.

Fundamentals of Artificial Neural Networks

Basic Components and Architecture

Artificial neural networks are composed of interconnected processing units known as artificial neurons, inspired by the structure of biological neurons but simplified for computational purposes. The foundational model of an artificial neuron was introduced by McCulloch and Pitts in 1943, where a neuron receives inputs from other neurons through excitatory or inhibitory connections, sums the excitatory inputs, and fires an all-or-none output if the sum exceeds a fixed threshold, assuming no inhibition is active. This model treated neural activity as propositional logic, with inputs as logical propositions and the output as a logical function of those inputs. Building on this, Frank Rosenblatt's perceptron in 1958 extended the concept to handle continuous inputs and modifiable connections, defining an artificial neuron that receives multiple input signals x_i, each weighted by a connection strength w_i, adds a bias term b, and computes a linear combination z = \sum w_i x_i + b before applying a threshold function to produce a binary output. In perceptron-based models, the weights w_i represent the strength and sign of synaptic-like connections, allowing the neuron to emphasize or suppress specific inputs, while the bias b shifts the activation threshold independently of the inputs. This summation mechanism enables the neuron to perform a weighted linear combination, mimicking how biological neurons integrate signals from dendrites before propagating via the axon. Modern artificial neurons retain this structure: multiple inputs, adjustable weights, an optional bias, and a summation step, though the output processing is handled separately to introduce nonlinearity. These units form the basic building block for more complex architectures, where networks of such neurons can approximate arbitrary functions given sufficient capacity.

Neural networks organize artificial neurons into layers to process information hierarchically: an input layer receives raw data features, one or more hidden layers perform intermediate computations, and an output layer produces the final predictions or classifications. The input layer typically has as many neurons as the dimensionality of the input data, directly passing values to the subsequent layer without processing. Hidden layers, first systematically explored in multilayer perceptrons, transform representations through interconnected neurons, enabling the network to learn hierarchical features. The output layer's neuron count matches the number of desired outputs, such as classes in classification tasks. Within layers, connections can be fully connected, where every neuron in one layer links to every neuron in the next, maximizing expressiveness but increasing computational cost, or sparse, where only a subset of connections exist to reduce parameters and mimic biological connectivity. Fully connected layers, as in early perceptrons, ensure dense interactions but scale poorly with layer size, leading to O(n^2) parameters for n neurons per layer. Sparse connections, common in convolutional or recurrent networks, limit links to relevant subsets, lowering memory use while preserving representational power, as demonstrated in analyses of recurrent architectures where sparsity maintains performance with fewer parameters.

Conceptually, a neural network can be represented as a graph, with nodes corresponding to neurons and directed edges to weighted connections carrying signals unidirectionally from inputs to outputs. This graph, often acyclic in feedforward networks, defines the flow of computation: inputs enter input nodes, propagate along weighted edges through hidden nodes, and exit via output nodes. Such a representation highlights the network topology, where edge weights encode learned parameters, and node degrees reflect connectivity—full in dense graphs or partial in sparse ones.
This view facilitates analysis of network properties like depth (number of layers) and width (neurons per layer). A simple architecture with two layers illustrates these components: an input layer with three neurons (for features x_1, x_2, x_3) connects fully to a hidden layer of two neurons, each computing a weighted sum plus bias; these hidden neurons then connect to a single output neuron for prediction. In graph terms, this forms a bipartite structure with six edges from input to hidden and two from hidden to output, all weights adjustable during learning. This minimal setup can solve linearly separable problems, scaling to deeper networks for complex tasks.
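As a minimal sketch of the 3-2-1 architecture just described, the following Python code wires up the six input-to-hidden edges and two hidden-to-output edges and runs one forward pass; the weight values and inputs are hypothetical placeholders, and the nonlinear activation functions covered in the next section are omitted for brevity.

```python
import numpy as np

def forward_3_2_1(x, W1, b1, W2, b2):
    """Forward pass of the two-layer example: 3 inputs -> 2 hidden units -> 1 output,
    each unit computing a weighted sum of its inputs plus a bias."""
    h = W1 @ x + b1      # hidden layer: shape (2,)
    y = W2 @ h + b2      # output layer: shape (1,)
    return y

x  = np.array([1.0, 0.5, -0.2])            # features x1, x2, x3
W1 = np.array([[0.2, -0.1, 0.4],
               [0.7,  0.3, -0.5]])         # 2 x 3 = 6 input-to-hidden weights
b1 = np.array([0.1, -0.2])
W2 = np.array([[0.6, -0.8]])               # 1 x 2 = 2 hidden-to-output weights
b2 = np.array([0.05])

print(forward_3_2_1(x, W1, b1, W2, b2))
```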

Activation Functions and Nonlinearity

In artificial neural networks, activation functions determine the output of a neuron given an input, introducing nonlinearity that is essential for the network's expressive power. Without nonlinearity, a multi-layer network would collapse to a linear transformation, limiting its ability to model complex, non-linear relationships in data. The universal approximation theorem establishes that networks with nonlinear activations, such as sigmoidal functions, can approximate any continuous function on a compact subset of \mathbb{R}^n to arbitrary accuracy, provided the network has sufficient width or depth.

Historically, early neural models employed step functions, which output a binary value (0 or 1) based on a threshold, mimicking idealized neuron firing but lacking differentiability, which hindered gradient-based optimization. This shifted in the 1980s to smooth, differentiable activations like the sigmoid to enable backpropagation, allowing networks to learn via gradient descent. The sigmoid function, \sigma(x) = \frac{1}{1 + e^{-x}}, maps inputs to (0, 1) and is infinitely differentiable, facilitating training but prone to saturation where gradients approach zero for large positive or negative inputs, leading to the vanishing gradient problem in deep networks. To address the sigmoid's bias toward positive outputs and slower convergence, the hyperbolic tangent (tanh) function emerged as an alternative, defined as \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, outputting values in (-1, 1) and centering data around zero for better gradient flow. Like the sigmoid, tanh is infinitely differentiable but still suffers from vanishing gradients, though less severely due to its symmetric range, making it suitable for hidden layers in earlier applications.

The rectified linear unit (ReLU), \text{ReLU}(x) = \max(0, x), marked a significant advancement in the early 2010s, offering computational efficiency as a simple thresholding operation and avoiding vanishing gradients by providing a constant gradient of 1 for positive inputs, which accelerates convergence in deep networks. However, ReLU is not differentiable at x=0 (though subgradients are used in practice) and can cause "dying" neurons where negative inputs yield zero output and gradients, stalling learning. To mitigate this, variants like Leaky ReLU were introduced, defined as \text{Leaky ReLU}(x) = \max(0, x) + \alpha \min(0, x) with a small \alpha > 0 (often 0.01), allowing a gentle slope for negative inputs to maintain neuron activity.
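The activation functions discussed above can be written in a few lines of Python; this sketch simply evaluates each one on a small range of inputs so their output ranges can be compared (the sample inputs and α = 0.01 are illustrative choices).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # maps to (0, 1); saturates for large |x|

def tanh(x):
    return np.tanh(x)                           # maps to (-1, 1); zero-centered

def relu(x):
    return np.maximum(0.0, x)                   # 0 for x < 0, identity for x >= 0

def leaky_relu(x, alpha=0.01):
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)   # small slope for x < 0

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, np.round(f(z), 3))
```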

Mathematical and Computational Principles

Forward Propagation

Forward propagation, also known as the forward pass, is the computational process in an artificial neural network where input data flows through the network layers to produce an output prediction, simulating the unidirectional signal transmission in biological neurons. This mechanism forms the core of inference in neural networks, enabling the model to map inputs to outputs without involving weight updates.

At the level of a single neuron, or unit, forward propagation begins with the computation of a weighted sum of inputs. Given input features \mathbf{x} = [x_1, x_2, \dots, x_n] and corresponding weights \mathbf{w} = [w_1, w_2, \dots, w_n], along with a bias term b, the pre-activation value z is calculated as:

z = \sum_{i=1}^{n} w_i x_i + b

This z represents the net input to the neuron. The neuron then applies an activation function f to introduce nonlinearity, yielding the output a = f(z). Common activations include step functions in early models or sigmoid and ReLU in modern ones, ensuring the network can model complex patterns.

For efficiency in multi-layer networks with multiple neurons per layer, forward propagation is vectorized using matrix operations. Consider a layer with m neurons receiving input from n previous units, represented by input \mathbf{x} \in \mathbb{R}^n, weight matrix \mathbf{W} \in \mathbb{R}^{m \times n}, and bias \mathbf{b} \in \mathbb{R}^m. The pre-activation vector \mathbf{z} for the layer is:

\mathbf{z} = \mathbf{W} \mathbf{x} + \mathbf{b}

The activations \mathbf{a} are then \mathbf{a} = f(\mathbf{z}), applied element-wise. This process repeats layer by layer, with the output of one layer serving as input to the next, culminating in the network's final output. Such matrix formulations allow parallel computation on hardware like GPUs, scaling to large networks.

To illustrate, consider a simple single-layer network with two inputs \mathbf{x} = [0.5, 0.3]^T, weight matrix \mathbf{W} = \begin{bmatrix} 0.1 & 0.2 \\ 0.4 & 0.5 \end{bmatrix}, and bias \mathbf{b} = [0.1, 0.2]^T, using a ReLU activation f(z) = \max(0, z). The pre-activation is \mathbf{z} = \mathbf{W} \mathbf{x} + \mathbf{b} = \begin{bmatrix} 0.1 \cdot 0.5 + 0.2 \cdot 0.3 + 0.1 \\ 0.4 \cdot 0.5 + 0.5 \cdot 0.3 + 0.2 \end{bmatrix} = \begin{bmatrix} 0.21 \\ 0.55 \end{bmatrix}. Applying ReLU gives \mathbf{a} = \begin{bmatrix} 0.21 \\ 0.55 \end{bmatrix}. This output could feed into subsequent layers for deeper processing.

In predictive tasks, forward propagation transforms raw input data into actionable outputs, such as class probabilities in classification. For instance, the final layer's pre-activations are often passed through a softmax function to produce a probability distribution:

p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}

for class k. This enables the network to generate interpretable predictions, like identifying an image as a "cat" with 85% confidence, based on the propagated features.
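The worked example above can be reproduced directly in NumPy; the sketch below uses exactly the weights, bias, and input from the text and adds a numerically stable softmax of the kind mentioned for classification outputs.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def layer_forward(x, W, b, activation=relu):
    """One layer of forward propagation: z = W x + b, then an element-wise activation."""
    return activation(W @ x + b)

x = np.array([0.5, 0.3])
W = np.array([[0.1, 0.2],
              [0.4, 0.5]])
b = np.array([0.1, 0.2])

a = layer_forward(x, W, b)
print(a)                      # [0.21 0.55], matching the hand computation above

def softmax(z):
    """Numerically stable softmax: p_k = exp(z_k) / sum_j exp(z_j)."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(np.round(softmax(a), 3))
```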

Backpropagation and Optimization

Backpropagation is the cornerstone algorithm for computing gradients in neural networks, enabling efficient training by propagating errors backward through the network layers using the chain rule of calculus. This process begins after the forward propagation computes predictions from inputs, allowing the calculation of partial derivatives of the loss with respect to each weight and bias. Introduced in its modern form for multilayer networks, backpropagation revolutionized training by making it feasible to optimize deep architectures.

The choice of loss function is crucial, as it quantifies the discrepancy between predicted outputs \hat{y} and true targets y. For regression problems, the mean squared error (MSE) is widely used, formulated as L = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2, where n is the number of samples; this measures the average squared difference, emphasizing larger errors quadratically. In classification tasks, the cross-entropy loss is preferred, particularly with softmax outputs, as it penalizes confident wrong predictions more severely and aligns with probabilistic interpretations of model outputs.

The backpropagation algorithm derives the error term \delta^l for layer l recursively as \delta^l = \left( (W^{l+1})^T \delta^{l+1} \right) \odot f'(z^l), where W^{l+1} are the weights to the next layer, \delta^{l+1} is the error from the subsequent layer, f' is the derivative of the activation function, and \odot denotes element-wise multiplication; this applies the chain rule layer by layer from output to input. The gradients for weights are then \frac{\partial L}{\partial W^l} = \delta^l (a^{l-1})^T, where a^{l-1} is the activation from the previous layer, allowing precise updates proportional to the contribution of each parameter.

Optimization proceeds by updating parameters via gradient descent on the loss: W \leftarrow W - \eta \frac{\partial L}{\partial W}, where \eta is the learning rate. Batch gradient descent uses the full dataset for each update, providing stable but computationally expensive gradients. Stochastic gradient descent (SGD) approximates the gradient with single examples or mini-batches, introducing noise that helps escape poor local solutions but can oscillate. Advanced variants like Adam combine momentum (to accelerate in relevant directions) with adaptive per-parameter learning rates, typically initializing \eta = 0.001, and have become a default for many applications due to faster convergence.

Despite these advances, challenges persist, including the risk of settling in local minima where gradients vanish, though empirical and theoretical work suggests such poor minima are rare in overparameterized networks, with most local optima yielding similar performance to global ones. To mitigate issues like slow progress in flat regions or divergence from high curvatures, learning rate scheduling reduces \eta over time—e.g., exponentially decaying it every few epochs—balancing early exploration with later fine-tuning.
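To illustrate the delta recursion and gradient-descent update above, the following sketch backpropagates through a tiny one-hidden-layer network with sigmoid activations and a squared-error loss, then checks one analytic gradient against a finite-difference approximation; the layer sizes, random seed, and learning rate are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # one training example with 3 features
y = np.array([0.7])                             # scalar regression target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer of 4 units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # single output unit

def forward(W1, b1, W2, b2):
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a1, a2

def loss(W1, b1, W2, b2):
    _, a2 = forward(W1, b1, W2, b2)
    return 0.5 * np.sum((a2 - y) ** 2)

# Backpropagation: the delta recursion from the text, with f'(z) = f(z)(1 - f(z)) for sigmoid.
a1, a2 = forward(W1, b1, W2, b2)
d2 = (a2 - y) * a2 * (1 - a2)          # output-layer error term
d1 = (W2.T @ d2) * a1 * (1 - a1)       # hidden-layer error term
grad_W2 = np.outer(d2, a1)             # dL/dW2 = delta^2 (a^1)^T
grad_W1 = np.outer(d1, x)              # dL/dW1 = delta^1 (a^0)^T

# Sanity check one weight gradient against a finite-difference approximation.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (loss(W1p, b1, W2, b2) - loss(W1, b1, W2, b2)) / eps
print(np.isclose(grad_W1[0, 0], numeric, atol=1e-5))   # expected: True

# One gradient-descent step: W <- W - eta * dL/dW
eta = 0.1
W2 -= eta * grad_W2; b2 -= eta * d2
W1 -= eta * grad_W1; b1 -= eta * d1
```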

Types of Neural Networks

Feedforward and Multilayer Perceptrons

Feedforward neural networks represent the foundational architecture in artificial neural networks, in which information flows unidirectionally from input to output layers without cycles or loops. The simplest form is the single-layer perceptron, introduced by Frank Rosenblatt in 1958 as a binary classifier capable of learning linear decision boundaries through supervised training on labeled data. In this model, each input feature connects to a single output neuron via weighted connections, with the output computed as the sign of the weighted sum of inputs plus a bias term:

y = \operatorname{sign}\left( \sum_{i=1}^{n} w_i x_i + b \right)

where w_i are weights, x_i are inputs, and b is the bias; training adjusts weights using the perceptron learning rule to minimize errors for linearly separable patterns. However, the single-layer perceptron is limited to problems where classes can be separated by a hyperplane, failing on nonlinearly separable tasks such as the XOR problem, which requires distinguishing patterns like (0,0) → 0, (0,1) → 1, (1,0) → 1, and (1,1) → 0. This limitation, rigorously analyzed by Marvin Minsky and Seymour Papert in their 1969 book Perceptrons, demonstrated that single-layer networks cannot compute certain simple functions without additional structure, contributing to early skepticism about neural network scalability.

To address these shortcomings, multilayer perceptrons (MLPs) extend the architecture by incorporating one or more hidden layers between the input and output layers, enabling the modeling of nonlinear relationships through layered compositions of linear transformations and nonlinear activation functions. Hidden layers transform inputs into higher-dimensional representations, allowing MLPs to solve problems like XOR by creating nonlinear decision boundaries; for instance, a single-hidden-layer MLP can approximate the XOR function by mapping inputs to intermediate features that separate the classes. This capability arises from the network's depth, where each layer applies a weighted sum followed by a nonlinearity, such as the sigmoid \sigma(z) = \frac{1}{1 + e^{-z}}, propagating information forward to produce outputs suitable for tasks beyond binary classification, including regression and multiclass problems.

A key theoretical justification for MLPs is the universal approximation theorem, which states that a network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of \mathbb{R}^n to arbitrary accuracy, provided the activation function is nonlinear (e.g., sigmoid). First proved by George Cybenko in 1989 for sigmoidal activations, the theorem was generalized by Kurt Hornik in 1991 to show that standard multilayer networks with almost any continuous squashing activation are universal approximators, establishing their expressive power for representing complex mappings without requiring infinite parameters. This result underscores why MLPs serve as a baseline for many machine learning applications, though practical approximation depends on sufficient hidden units and appropriate training.

Prior to the development of backpropagation, training MLPs posed significant challenges, particularly the credit assignment problem of determining how errors at the output should adjust weights in earlier layers without direct error signals. Minsky and Papert highlighted in 1969 that while single-layer perceptrons converge for linearly separable data, multilayer versions lacked an efficient algorithm to propagate blame through multiple layers, leading to slow or infeasible optimization via methods like random search or gradient-free techniques.
This difficulty stalled progress on deep networks until backpropagation provided a scalable solution for error propagation.
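To show why one hidden layer suffices for XOR, the sketch below hard-codes a small MLP with threshold units whose weights are one illustrative (not trained, not unique) solution: one hidden unit detects "at least one input is on," the other detects "both are on," and the output combines them.

```python
import numpy as np

def step(z):
    """Heaviside threshold, as in the original perceptron."""
    return (np.asarray(z) >= 0).astype(float)

def xor_mlp(x1, x2):
    x = np.array([x1, x2], dtype=float)
    # Hidden unit 1 fires if x1 + x2 >= 0.5 (at least one input on);
    # hidden unit 2 fires if x1 + x2 >= 1.5 (both inputs on).
    h = step(np.array([[1.0, 1.0], [1.0, 1.0]]) @ x + np.array([-0.5, -1.5]))
    # Output: "at least one" AND NOT "both".
    return step(np.array([1.0, -2.0]) @ h - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_mlp(a, b)))   # prints 0, 1, 1, 0
```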

Recurrent and Convolutional Networks

Recurrent neural networks (RNNs) extend feedforward architectures to handle sequential data by incorporating loops that allow information to persist across time steps. In an RNN, the hidden state at time t, denoted \mathbf{h}_t, is computed as \mathbf{h}_t = f(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t), where f is a nonlinear activation function, \mathbf{W}_{hh} and \mathbf{W}_{xh} are weight matrices, and \mathbf{x}_t is the input at time t. This formulation enables the network to maintain a memory of previous inputs, making it suitable for tasks involving temporal dependencies, such as predicting the next word in a sentence. However, standard RNNs suffer from vanishing or exploding gradients during backpropagation through time, which hinders learning over long sequences.

To address these limitations, long short-term memory (LSTM) units were introduced as a variant of RNNs. LSTMs incorporate a cell state and three gates—forget, input, and output—to regulate the flow of information and mitigate gradient issues. The forget gate determines what to discard from the cell state, the input gate decides what new information to store, and the output gate controls what parts of the cell state to expose as the hidden state. This gating mechanism allows LSTMs to learn long-range dependencies more effectively than vanilla RNNs, as demonstrated in tasks requiring retention of information over extended time lags.

Convolutional neural networks (CNNs) are designed primarily for grid-like data, such as images, by applying shared weights through convolution operations to detect local patterns. A key component is the kernel, or filter, a small weight matrix that slides over the input to produce feature maps capturing edges, textures, or other motifs. Pooling layers, often max or average pooling, follow convolutions to reduce spatial dimensions while preserving important features, with the stride parameter controlling the step size of the convolution or pooling window to manage output size and computational efficiency. These elements promote translation invariance, enabling the network to recognize patterns regardless of their position in the input.

RNNs and LSTMs find applications in natural language processing, such as language modeling and machine translation, where sequential context is essential. CNNs excel in computer vision tasks like object detection and image classification, leveraging their ability to extract hierarchical features from visual data.
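A minimal NumPy sketch of the vanilla RNN recurrence h_t = f(W_{hh} h_{t-1} + W_{xh} x_t) described above, using tanh as the nonlinearity; the sequence length, dimensions, and random weights are illustrative, and bias terms are omitted for brevity.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, h0=None):
    """Run a vanilla RNN over a sequence, carrying the hidden state forward in time."""
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    states = []
    for x_t in xs:                                  # one step per element of the sequence
        h = np.tanh(W_hh @ h + W_xh @ x_t)          # h_t depends on h_{t-1} and x_t
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(1)
sequence = rng.normal(size=(5, 3))                  # 5 time steps, 3 input features each
W_hh = rng.normal(scale=0.5, size=(4, 4))           # hidden-to-hidden weights (4 hidden units)
W_xh = rng.normal(scale=0.5, size=(4, 3))           # input-to-hidden weights
print(rnn_forward(sequence, W_hh, W_xh).shape)      # (5, 4): one hidden state per step
```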

Historical Development

Early Inspirations and Milestones (1940s–1980s)

The concept of neural networks drew early inspiration from biological neurons, which process signals through interconnected networks in the brain, laying the groundwork for computational models that mimic these structures. In 1943, Warren McCulloch and Walter Pitts introduced the first mathematical model of a neuron, known as the McCulloch-Pitts neuron, which represented neural activity as a logical unit capable of performing operations like AND, OR, and NOT through weighted sums exceeding a threshold. This model demonstrated that networks of such units could compute any logical function, establishing a foundation for viewing the brain as a computational device equivalent to a Turing machine. Although simplistic and assuming synchronous firing without learning, it shifted focus toward abstracting neural computation into discrete logic, influencing subsequent research.

Building on this, Frank Rosenblatt developed the perceptron in 1958 as a single-layer neural network for pattern recognition, implemented initially in hardware to simulate adaptive learning. The perceptron adjusted weights via a supervised learning rule, updating each weight w_i as w_i \leftarrow w_i + \eta (y - \hat{y}) x_i, where \eta is the learning rate, y is the target output, \hat{y} is the predicted output, and x_i are inputs, enabling it to classify linearly separable patterns like distinguishing shapes. Early hardware versions, such as the Mark I Perceptron, successfully learned to recognize simple visual patterns from sensor inputs, sparking optimism about machine learning and leading to funding for larger systems. However, the model was limited to linear decision boundaries, restricting its ability to handle complex, non-separable problems like the XOR function.

The perceptron's limitations were rigorously exposed in 1969 by Marvin Minsky and Seymour Papert in their book Perceptrons, which proved mathematically that single-layer networks could not solve non-linearly separable tasks, such as parity problems, due to their inability to approximate functions requiring hidden layers. This critique highlighted the computational constraints of perceptrons, including sensitivity to input scaling and poor generalization beyond training data, dampening enthusiasm and contributing to the first "AI winter" by redirecting research away from connectionist approaches toward symbolic AI. Despite later revisions acknowledging multi-layer potential, the initial analysis effectively stalled neural network progress for over a decade.

A notable development in the late 1970s was Kunihiko Fukushima's invention of the Neocognitron in 1979, a hierarchical, multi-layer artificial neural network designed for visual pattern recognition. Inspired by the visual cortex, it featured alternating layers of S-cells (simple) for feature detection and C-cells (complex) for positional invariance, using shared weights and subsampling—concepts that foreshadowed modern convolutional neural networks (CNNs). Although trained without backpropagation, the Neocognitron demonstrated robustness to shifts and distortions in inputs, influencing subsequent work in computer vision.

The field began to revive in the 1980s with advancements like John Hopfield's 1982 model of a recurrent network for associative memory, treating networks as dynamical systems that store and retrieve patterns through energy minimization. In the Hopfield network, states s_i = \pm 1 evolve according to local rules, converging to stable attractors representing stored memories, with the system's energy defined as E = -\frac{1}{2} \sum_{i,j} s_i s_j w_{ij}, where w_{ij} are symmetric weights derived from Hebbian learning on pattern pairs.
This framework allowed error-tolerant recall of incomplete inputs, modeling phenomena like content-addressable memory, and bridged neural computation with statistical physics, inspiring further work in optimization and spin-glass analogies. Further revival came with the popularization of the backpropagation algorithm in 1986, detailed in a seminal paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. This method enabled efficient training of multi-layer perceptrons (MLPs) by propagating errors backward through layers using the chain rule, overcoming the limitations of single-layer networks and sparking renewed interest in deep architectures during the late 1980s. Although funding cuts led to a second AI winter in the early 1990s, backpropagation laid the foundation for subsequent advances in neural network training.
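For illustration of Rosenblatt's update rule quoted above, the sketch below trains a perceptron on the linearly separable AND function; the learning rate, epoch count, and toy dataset are arbitrary choices, and on a non-separable task like XOR the same loop would never converge.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=20):
    """Perceptron rule: w_i <- w_i + eta * (y - y_hat) * x_i, with the bias updated the same way."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1.0 if xi @ w + b >= 0 else 0.0
            w += eta * (yi - y_hat) * xi
            b += eta * (yi - y_hat)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1], dtype=float)          # AND is linearly separable
w, b = train_perceptron(X, y_and)
print([int(xi @ w + b >= 0) for xi in X])            # [0, 0, 0, 1]
```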

Revivals and Modern Advances (1990s–Present)

The 1990s saw the emergence of recurrent neural networks (RNNs) for handling sequential data, such as speech and text. However, standard RNNs suffered from vanishing gradients during backpropagation through time, limiting their ability to learn long-term dependencies. This challenge was addressed in 1997 with the introduction of long short-term memory (LSTM) units by Sepp Hochreiter and Jürgen Schmidhuber, which incorporate gating mechanisms—input, forget, and output gates—to selectively remember or forget information over extended sequences. LSTMs enabled effective training of deep recurrent architectures and became foundational for applications like speech recognition and machine translation.

A major breakthrough came in 2006 with Geoffrey Hinton's introduction of deep belief networks (DBNs), which combined restricted Boltzmann machines—undirected graphical models trained layer by layer—to form a generative model that could initialize deep feedforward networks for supervised tasks. DBNs addressed the challenge of training deep architectures by using unsupervised pre-training to learn hierarchical feature representations, followed by fine-tuning via backpropagation, achieving state-of-the-art results on tasks like digit recognition with significantly reduced error rates compared to prior shallow models. This work, alongside advances in autoencoders—neural networks that learn compressed representations by reconstructing inputs through bottleneck layers—revitalized interest in deep architectures and paved the way for scalable deep learning.

The deep learning era exploded in 2012 with AlexNet, a convolutional neural network (CNN) developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, which dramatically outperformed competitors in the ImageNet Large Scale Visual Recognition Challenge by reducing the top-5 error rate from 26.2% to 15.3% using eight layers, ReLU activations, and GPU acceleration for training on over a million images. AlexNet's success highlighted the power of depth in feature extraction for computer vision, sparking widespread adoption of CNNs and the broader deep learning boom, with subsequent models like VGG and ResNet building on its principles to push performance further.

In 2017, Ashish Vaswani and colleagues introduced the Transformer architecture in their paper "Attention Is All You Need," replacing recurrent layers with self-attention mechanisms to process sequences in parallel, achieving superior performance on machine translation tasks like English-to-German with a BLEU score of 28.4, surpassing previous convolutional and recurrent models. The core innovation was the scaled dot-product attention, computed as:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

where Q, K, and V are query, key, and value matrices derived from input embeddings, and d_k is the dimensionality of the keys, whose square root scales the logits to prevent vanishing gradients from softmax saturation. Transformers revolutionized natural language processing (NLP) and extended to vision and beyond, forming the backbone of large language models (LLMs) such as OpenAI's GPT series; for instance, GPT-3 (2020) scaled to 175 billion parameters for few-shot learning, while GPT-4 (2023) and GPT-5 (2025) integrated multimodal capabilities and advanced reasoning, enabling applications from code generation to scientific discovery with unprecedented coherence.

The 2020s saw further advances in generative modeling with diffusion models, which iteratively add and remove noise to learn data distributions, culminating in Stable Diffusion (2022), a latent diffusion model that generates high-resolution images from text prompts by operating in a compressed latent space, achieving FID scores competitive with GANs while offering greater stability and editability.
This integration of diffusion processes with Transformer-based encoders has fueled creative AI tools, and their synergy with LLMs—such as in multimodal systems combining text and image generation—continues to expand neural networks' scope into real-world deployment by 2025.
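A minimal NumPy sketch of the scaled dot-product attention equation above; the matrix shapes and random inputs are purely illustrative, and multi-head projections, masking, and batching are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))    # 6 key positions
V = rng.normal(size=(6, 16))   # one 16-dimensional value per key
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```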

Applications and Advancements

Supervised and Unsupervised Learning

Supervised learning in neural networks involves training models on datasets where each input is paired with a corresponding output label, enabling the network to learn mappings from inputs to desired outputs. This paradigm is foundational for tasks requiring predictive accuracy, such as regression and classification. In regression, neural networks approximate continuous functions; for instance, multilayer perceptrons have been applied to predict house prices based on features like location and size, achieving mean absolute errors as low as 10-15% of the median price in benchmark datasets. In classification, convolutional neural networks (CNNs) excel at image recognition, as demonstrated by LeNet-5 on the MNIST dataset of handwritten digits, where it attained an error rate of 0.95% through end-to-end training on labeled examples. Neural networks in supervised learning are typically trained using backpropagation to minimize a loss function based on labeled data.

Unsupervised learning, by contrast, operates on unlabeled data to uncover inherent structures without explicit guidance, making it suitable for exploratory analysis in neural networks. Autoencoders, a key architecture for this paradigm, consist of an encoder that compresses inputs into a lower-dimensional latent space and a decoder that reconstructs the original input, facilitating tasks like clustering by grouping similar latent vectors. For example, deep autoencoders have been used to cluster high-dimensional data such as gene expression profiles, revealing biologically meaningful subgroups with silhouette scores exceeding 0.6. Dimensionality reduction via neural networks mimics principal component analysis (PCA) but captures nonlinear manifolds; Hinton and Salakhutdinov's deep autoencoder approach reduced 784-dimensional MNIST images to 30 dimensions, achieving a test reconstruction error of 0.0075 compared to PCA's 0.0108, outperforming linear PCA in reconstruction quality.

Semi-supervised learning bridges these paradigms by leveraging a small set of labeled examples alongside abundant unlabeled data, often through hybrid techniques like self-training, where a model initially trained on labels generates pseudo-labels for unlabeled samples, iteratively refining predictions. This method has been shown to improve accuracy in semi-supervised settings with limited labels, as seen in applications where scarce annotations are iteratively expanded. In self-training for neural networks, confidence thresholds ensure reliable pseudo-labels, mitigating error propagation.

Evaluation metrics for these paradigms differ to reflect their objectives. In classification, accuracy measures the proportion of correct predictions, while the F1-score, the harmonic mean of precision and recall, balances false positives and negatives, particularly in imbalanced datasets; for MNIST classification, top CNNs achieve F1-scores near 0.99. For unsupervised learning, reconstruction error—typically the mean squared error between input and output—quantifies how well autoencoders capture data fidelity, with lower values (e.g., below 0.05 for normalized MNIST) indicating effective learning of representations.
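The evaluation metrics mentioned above are simple to compute; this sketch implements accuracy, binary F1 (the harmonic mean of precision and recall), and mean-squared reconstruction error on tiny made-up arrays that stand in for real predictions and autoencoder outputs.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def f1_score(y_true, y_pred):
    """Binary F1 = 2 * precision * recall / (precision + recall)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def reconstruction_error(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return float(np.mean((x - x_hat) ** 2))

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(accuracy(y_true, y_pred), round(f1_score(y_true, y_pred), 3))

x, x_hat = np.array([0.2, 0.8, 0.5]), np.array([0.25, 0.75, 0.55])
print(reconstruction_error(x, x_hat))
```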

Generative Models and Emerging Uses

Generative models based on neural networks enable the creation of new data samples that resemble training distributions, distinct from predictive tasks by focusing on synthesis rather than classification or regression. These models leverage architectures like autoencoders and adversarial training to learn latent representations and generate outputs such as images, text, or molecules. Key approaches include generative adversarial networks (GANs), variational autoencoders (VAEs), and diffusion models, each addressing challenges in sampling quality, stability, and scalability.

Generative adversarial networks, introduced in 2014, consist of two competing neural networks: a generator that produces samples from random noise and a discriminator that distinguishes real data from generated samples. The training involves a min-max game where the generator minimizes the discriminator's ability to detect fakes, formalized by the value function

V(G,D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

with the discriminator maximizing V(G,D) and the generator minimizing it. This adversarial setup has enabled high-fidelity image synthesis, though it often suffers from mode collapse and training instability.

Variational autoencoders extend autoencoder architectures by incorporating probabilistic latent spaces, where an encoder maps inputs to approximate posterior distributions and a decoder reconstructs data from latent samples. Training optimizes the evidence lower bound (ELBO), balancing reconstruction loss with a Kullback-Leibler (KL) divergence term that regularizes the latent distribution toward a prior, typically a standard Gaussian:

\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z)).

This framework facilitates controllable generation and interpolation in latent spaces, applied in tasks like image synthesis and representation learning.

Diffusion models, gaining prominence since 2020, model generation as a reverse process of gradually adding noise (the forward diffusion) and then denoising to recover structured samples. The forward process transforms data x_0 over T steps to nearly pure noise via q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I), while a neural network learns the reverse p_\theta(x_{t-1} | x_t) to iteratively denoise from pure noise. This approach excels in image synthesis, producing diverse, high-resolution outputs with stable training compared to GANs.

Emerging applications of neural networks extend generative capabilities into real-world domains. In autonomous driving, systems like Tesla's Full Self-Driving (FSD) employ end-to-end neural networks trained on vast driving datasets to predict trajectories and control vehicles, processing camera inputs for perception and planning without explicit rule-based modules. In drug discovery, AlphaFold's deep learning models predict protein structures with atomic accuracy, integrating with generative networks to design novel ligands and accelerate therapeutic development by simulating molecular interactions. Multimodal AI, exemplified by DALL·E, uses transformer-based neural networks to generate images from textual descriptions, bridging language and vision for creative applications like art and design. As of 2025, diffusion models have advanced to video generation, as exemplified by OpenAI's Sora, which creates realistic videos from text prompts.
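To make the forward (noising) process concrete, the sketch below applies q(x_t | x_{t-1}) = N(sqrt(1 - β_t) x_{t-1}, β_t I) step by step to a toy signal; the variance schedule and signal are illustrative, and the learned reverse (denoising) network that a real diffusion model would train is not shown.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Gradually corrupt x0 with Gaussian noise according to the variance schedule betas."""
    x = x0.copy()
    trajectory = [x.copy()]
    for beta_t in betas:
        noise = rng.normal(size=x.shape)
        x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * noise   # one forward step
        trajectory.append(x.copy())
    return trajectory

rng = np.random.default_rng(0)
x0 = np.linspace(-1.0, 1.0, 8)               # stand-in for an image or signal
betas = np.linspace(1e-4, 0.2, 50)           # illustrative noise schedule over T = 50 steps
traj = forward_diffusion(x0, betas, rng)
# The structured signal drifts toward (approximately) standard Gaussian noise.
print(round(float(np.std(traj[0])), 2), round(float(np.std(traj[-1])), 2))
```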

Challenges and Ethical Considerations

Limitations in Training and Interpretability

Neural networks, particularly deep architectures, are prone to overfitting, where models memorize training data rather than generalizing to unseen examples, leading to poor performance on new data. This issue arises due to the high capacity of networks with many parameters, which can capture noise in finite datasets. To mitigate overfitting, regularization techniques are employed, such as L2 regularization, which adds a penalty term \lambda \| \mathbf{w} \|^2 to the loss function, where \mathbf{w} represents the weights and \lambda controls the strength of the penalty; this encourages smaller weights and smoother functions. Another prominent method is dropout, introduced as a regularization approach that randomly deactivates a fraction of neurons during training, preventing co-adaptation of features and approximating an ensemble of thinner networks.

Training large neural networks faces significant challenges, primarily stemming from their immense computational demands and voracious appetite for data. Modern models, such as those used in computer vision and natural language processing, require specialized hardware like graphics processing units (GPUs) or tensor processing units (TPUs) to handle the matrix operations involved in forward and backward passes efficiently; for instance, the breakthrough AlexNet model relied on GPUs to make training feasible within days rather than years. TPUs, designed specifically for tensor computations, further accelerate training by optimizing for the parallelism in neural network operations, reducing training time and energy costs for large-scale models. Additionally, performance improvements follow empirical scaling laws, where loss decreases as a power law with increasing dataset size, model parameters, and compute, implying that state-of-the-art results demand exponentially more data—often billions of examples—to achieve meaningful gains.

A core limitation of neural networks is their black-box nature, where the internal representations and decision processes remain opaque, hindering trust and accountability in critical applications. This interpretability gap stems from the distributed, non-linear computations across millions of parameters, making it difficult to trace how inputs lead to outputs. To address this, post-hoc techniques like saliency maps have been developed, which compute gradients of the output with respect to input features to highlight regions most influential to predictions, providing visual insights into model focus for tasks like image classification. Similarly, Local Interpretable Model-agnostic Explanations (LIME) approximates the model's behavior locally around a specific instance by fitting a simple, interpretable surrogate model to perturbed samples, offering feature-level explanations that are faithful to the original prediction without requiring model modifications.

Neural networks exhibit striking vulnerabilities to adversarial examples, where imperceptibly small perturbations to inputs can cause misclassifications with high confidence, undermining reliability in safety-sensitive domains. These attacks exploit the linear nature of deep classifiers in high-dimensional spaces, allowing crafted noise to shift decisions across class boundaries. A seminal method, the Fast Gradient Sign Method (FGSM), generates such perturbations efficiently by taking the sign of the gradient of the loss with respect to the input, scaled by a small epsilon, to maximize the loss for that example. Despite defenses like adversarial training, which incorporates perturbed examples into the training set, these vulnerabilities persist across architectures, highlighting an ongoing challenge in robustifying models against such exploits.
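The FGSM perturbation itself is a one-liner once the input gradient is available; in the sketch below the gradient is a hypothetical placeholder array standing in for what backpropagation to the input of a trained model would return, and ε = 0.01 is an illustrative budget.

```python
import numpy as np

def fgsm_perturb(x, grad_loss_wrt_x, epsilon=0.01):
    """Fast Gradient Sign Method: x_adv = x + epsilon * sign(dL/dx)."""
    return x + epsilon * np.sign(grad_loss_wrt_x)

x = np.array([0.2, 0.5, 0.9])            # e.g., normalized pixel intensities
grad = np.array([-0.03, 0.11, -0.07])    # hypothetical gradient of the loss w.r.t. the input
x_adv = fgsm_perturb(x, grad)
print(x_adv)   # each feature shifted by at most epsilon, yet a model's prediction may flip
```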

Bias, Fairness, and Societal Impacts

Neural networks, like other machine learning systems, can amplify societal biases present in training data, leading to discriminatory outcomes in applications such as hiring, lending, and criminal justice. Bias arises from multiple sources, including skewed datasets that underrepresent certain demographic groups, algorithmic designs that inadvertently favor majority classes, and deployment contexts where fairness metrics conflict. For instance, a comprehensive survey identifies historical, representation, and measurement biases as key contributors, noting that real-world applications often exhibit disparate error rates across protected attributes like race and gender. These issues persist because neural networks learn patterns from data without inherent ethical constraints, potentially perpetuating systemic inequalities if not addressed through techniques like adversarial debiasing or fairness-aware training.

A prominent example is in facial recognition systems, where convolutional neural networks trained on imbalanced datasets show higher error rates for darker-skinned and female faces. In the seminal Gender Shades study, researchers audited three commercial systems and found error rates up to 34.7% for darker-skinned females, compared to 0.8% for lighter-skinned males, highlighting intersectional disparities. Similarly, Amazon's experimental recruiting tool, powered by neural networks analyzing resumes, downgraded candidates with words associated with women (e.g., "women's chess club") because it was trained predominantly on male-dominated historical data from 2014-2015, leading to its abandonment in 2017. Such cases underscore the need for diverse datasets and auditing, as surveys emphasize that without intervention, neural networks can exacerbate gender and racial inequities in high-stakes decisions.

Beyond bias, neural networks pose broader societal impacts, including job displacement through automation of routine tasks in sectors like manufacturing and customer service. According to the World Economic Forum's Future of Jobs Report 2025, AI and automation are projected to displace 92 million jobs globally by 2030, while creating 170 million new ones, resulting in a net increase of 78 million jobs. Privacy concerns also intensify, as neural networks facilitate large-scale surveillance; for example, facial recognition models in video surveillance process vast amounts of personal data, raising risks of unauthorized tracking and data breaches without robust regulations. Additionally, the environmental footprint of training large neural networks contributes to climate challenges. Training a single large transformer-based model can emit approximately 626,000 pounds of CO₂, roughly five times the lifetime emissions of an average car, due to the energy-intensive computations on GPUs. This carbon cost has prompted calls for energy-efficient architectures and policy measures, such as carbon-aware scheduling, to mitigate the growing ecological burden of scaling neural networks.

In response to these challenges, regulations such as the European Union's AI Act, which entered into force in August 2024 and applies prohibitions from February 2025 and obligations for high-risk systems from August 2026, impose requirements for transparency, risk management, and human oversight on neural network-based systems to mitigate biases, privacy risks, and other societal harms. Overall, these impacts highlight the urgency of integrating ethical frameworks into neural network development to balance innovation with societal well-being.

    Jun 5, 2019 · In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of ...