Activation function

In artificial neural networks, an activation function is a mathematical operation applied to the weighted sum of inputs at a neuron, transforming it into an output that introduces non-linearity, thereby enabling the network to model complex, non-linear relationships in data. These functions are essential components of neural architectures, as without them, multi-layer networks would reduce to simple linear models incapable of capturing intricate patterns.

The concept of activation functions traces its origins to early models of biological neurons, notably the 1943 McCulloch-Pitts neuron, which employed a binary threshold function as its activation to simulate logical operations like AND and OR gates. This threshold-based approach laid the foundation for later learning machines and inspired the 1958 perceptron by Frank Rosenblatt, which also utilized a step function but faced limitations in handling non-linearly separable problems, as highlighted in Minsky and Papert's 1969 critique. The resurgence of neural networks in the 1980s, driven by the backpropagation algorithm introduced by Rumelhart, Hinton, and Williams in 1986, popularized smooth, differentiable activation functions such as the sigmoid, which maps inputs to a range between 0 and 1 and facilitates gradient-based learning.

Common activation functions include the sigmoid function, defined as \sigma(x) = \frac{1}{1 + e^{-x}}, valued for its probabilistic interpretation in binary classification but prone to vanishing gradients during training; the hyperbolic tangent (tanh), \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, which centers outputs around zero for better convergence compared to sigmoid; and the rectified linear unit (ReLU), f(x) = \max(0, x), introduced prominently in 2010 by Nair and Hinton, which accelerates training by avoiding vanishing gradients and promoting sparsity, though it can suffer from "dying ReLU" issues where neurons output zero indefinitely. Variants like Leaky ReLU and parametric ReLU address these limitations by allowing small negative slopes. In modern deep learning, activation functions are pivotal for performance, with choices influencing training stability, generalization, and computational efficiency; for instance, ReLU and its derivatives dominate convolutional neural networks due to their simplicity and empirical success in large-scale image recognition tasks. Ongoing research continues to explore novel functions, such as swish (f(x) = x \cdot \sigma(\beta x)) and GELU, to further mitigate issues like gradient saturation and enhance expressivity in architectures like transformers.

Fundamentals

Definition and Purpose

In neural networks, an activation function is defined as a non-linear mathematical mapping applied element-wise to the output of a linear transformation within each layer, transforming input values into output values that introduce non-linearity into the model. This mapping, commonly denoted in its general form as f(\mathbf{x}), where \mathbf{x} represents the input vector or scalar, produces a corresponding output that can be scalar or vector-valued, enabling the network to process and propagate information non-linearly. During forward propagation, the activation function follows the computation of a pre-activation in each neuron, where the input is first transformed via a weighted sum plus bias—typically z = \mathbf{w}^T \mathbf{x} + b—and then passed through the activation to yield the neuron's final output a = f(z). This sequential application across layers allows the network to build hierarchical representations from raw inputs.

The primary purpose of activation functions is to enable neural networks to approximate arbitrary non-linear functions and model complex relationships in data that exceed linear separability, as without non-linearity, multi-layer networks would collapse to a single linear transformation. For instance, they permit the solution of problems like the XOR problem, which a single-layer perceptron cannot handle due to its inherent linearity. By introducing these non-linearities, activation functions underpin the universal approximation capabilities of neural networks, allowing them to capture intricate patterns in diverse applications.
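
As a concrete illustration of this forward pass, the following NumPy sketch applies an affine transformation followed by an element-wise non-linearity; ReLU is assumed here purely for illustration, and the weights are hypothetical random values rather than trained parameters.

```python
import numpy as np

def relu(z):
    """Element-wise non-linearity applied after the affine step."""
    return np.maximum(0.0, z)

def dense_layer(x, W, b):
    """One layer: weighted sum plus bias z = W x + b, then a = f(z)."""
    z = W @ x + b        # affine pre-activation
    return relu(z)       # element-wise activation

# Hypothetical two-layer network on a 3-dimensional input.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

hidden = dense_layer(x, W1, b1)      # first hidden representation
output = dense_layer(hidden, W2, b2)
print(hidden, output)
```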

Historical Development

The origins of activation functions trace back to foundational work on computational models of biological neurons in the 1940s. In 1943, Warren McCulloch and Walter Pitts introduced a simplified model of a biological neuron that employed a threshold-based activation to mimic binary firing behavior, enabling the representation of logical operations through networks of such units. This model laid the groundwork for artificial neural networks by demonstrating how non-linear thresholds could simulate complex propositional logic in nervous activity, though it lacked learning mechanisms.

The mid-20th century saw further advancements with the development of learning-capable systems that incorporated threshold-based activation functions. Frank Rosenblatt's perceptron, described in 1958, utilized a step activation with a fixed threshold for binary classification tasks, allowing the network to adapt weights based on input-output patterns in supervised learning. Building on this, Bernard Widrow and Ted Hoff introduced the ADALINE in the early 1960s, which employed a linear activation followed by a threshold for output decisions but emphasized adaptive linear combinations for error minimization in adaptive filtering systems. These innovations marked a shift toward trainable models, yet they were limited to single-layer architectures and struggled with non-linearly separable problems, contributing to early enthusiasm followed by setbacks.

The 1970s and 1980s brought periods known as AI winters, during which reduced funding and skepticism—exacerbated by critiques like Marvin Minsky and Seymour Papert's 1969 analysis of perceptron limitations—stifled research, including explorations of activation functions. A revival occurred in 1986 with David Rumelhart, Geoffrey Hinton, and Ronald Williams' popularization of backpropagation, which required differentiable activation functions such as the logistic sigmoid to propagate errors through multi-layer networks, enabling the training of deeper architectures. This breakthrough addressed prior limitations but highlighted issues like vanishing gradients in deep setups.

The deep learning boom accelerated after 2006, driven by Hinton's introduction of deep belief networks, which revived interest in scalable training and prompted innovations in activation functions to mitigate gradient problems. A pivotal milestone came in 2010 when Vinod Nair and Geoffrey Hinton proposed the rectified linear unit (ReLU), a simple piecewise-linear function that accelerated convergence and alleviated vanishing gradients in deep networks by allowing sparse activation and better gradient flow. This shift, amid surging computational power and data availability, transformed activation functions from niche tools into core components of modern neural architectures.

Common Activation Functions

Binary Step Function

The binary step function, also known as the Heaviside step function or threshold function, is the most basic activation function in artificial neural networks, producing a binary output of 0 for inputs below a specified threshold—typically 0—and 1 for inputs at or above it. This design directly emulates the all-or-none response of biological neurons, where a neuron either fires (outputs 1) or remains inactive (outputs 0) based on whether the summed excitatory and inhibitory inputs exceed a firing threshold. Mathematically, the function is defined as:

f(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}

In the McCulloch-Pitts neuron model, introduced in 1943, this binary activation enabled networks of such units to simulate logical operations like AND, OR, and NOT gates, laying the groundwork for computational models of the brain by treating neural activity as propositional logic. The function was central to Frank Rosenblatt's perceptron in 1958, a single-layer network for binary classification that adjusted weights via a perceptron learning rule to classify inputs into two categories, such as separating linearly separable patterns in two dimensions.

Its primary advantages lie in computational simplicity, requiring only a single comparison with no complex arithmetic, which made it feasible for early hardware implementations, and in its interpretability as a clear threshold for binary decisions. However, the function's discontinuity at the threshold renders it non-differentiable there and gives it a zero derivative everywhere else, preventing the use of backpropagation for training in multilayer networks and limiting its applicability to simple, linearly separable problems. Later developments, such as the sigmoid function, addressed this by offering a continuous, differentiable alternative to the step function.
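
A minimal sketch of the step function and a hand-wired McCulloch-Pitts-style AND gate follows; the weights and threshold are illustrative choices rather than values taken from the original paper.

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Heaviside-style step: 1 if the input reaches the threshold, else 0."""
    return np.where(x >= threshold, 1, 0)

# A McCulloch-Pitts-style AND gate: the unit fires only when both inputs are active.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
weights = np.array([1.0, 1.0])   # hypothetical, hand-chosen weights
threshold = 2.0                  # both inputs must contribute to reach the threshold

for x in inputs:
    fired = binary_step(weights @ x, threshold)
    print(x, "->", fired)
```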

Sigmoid Function

The sigmoid function, also known as the logistic sigmoid, is a smooth, S-shaped activation function that maps any real-valued input to the open interval (0,1), making it suitable for representing probabilities or normalized outputs in neural networks. It is mathematically defined by the equation \sigma(x) = \frac{1}{1 + e^{-x}}, where e is the base of the natural logarithm, ensuring the output approaches 1 as x becomes large and positive, and 0 as x becomes large and negative. This function derives from the logistic function, originally developed in statistics to model growth processes and binary outcomes, such as in logistic regression, where it serves as the inverse of the logit transformation to bound predictions between 0 and 1. In the context of neural networks, it was adapted as an activation for artificial neurons to introduce non-linearity while remaining differentiable, facilitating gradient-based learning algorithms like backpropagation, as introduced in seminal work on multi-layer networks.

Key properties of the sigmoid include its symmetry around the point (0, 0.5), where \sigma(0) = 0.5, and saturation at the extremes: the derivative \sigma'(x) = \sigma(x)(1 - \sigma(x)) peaks at 0.25 when x = 0 but approaches zero for large |x|, leading to regions where gradients vanish during training. These characteristics make it continuous and infinitely differentiable everywhere, though the saturation can hinder learning in deep networks by causing vanishing gradients. Historically, the sigmoid was widely used in the output layers of shallow neural networks for binary classification tasks, where its probabilistic output directly corresponds to class probabilities without needing additional transformations. Prior to the widespread adoption of rectified linear units in the 2010s, it also served as a common activation in hidden layers of early multi-layer perceptrons, enabling the modeling of complex decision boundaries through composition of non-linear transformations.
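
The definition and derivative above can be checked numerically with a short sketch; the sample input values are arbitrary and chosen only to make the saturation at the extremes visible.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative sigma'(x) = sigma(x) * (1 - sigma(x)); peaks at 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    # For |x| large, sigma saturates near 0 or 1 and the gradient nearly vanishes.
    print(f"x={x:+6.1f}  sigma={sigmoid(x):.4f}  sigma'={sigmoid_grad(x):.6f}")
```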

Hyperbolic Tangent

The hyperbolic tangent activation function, commonly denoted as \tanh, serves as a smooth, S-shaped nonlinearity in neural networks, transforming input values x into outputs bounded within the open interval (-1, 1). This bounded range ensures that activations remain controlled, preventing explosive growth during forward propagation. Unlike unbounded functions, \tanh introduces non-linearity while maintaining differentiability everywhere, making it suitable for gradient-based optimization. The function is defined mathematically as \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} or, equivalently, using hyperbolic functions, \tanh(x) = \frac{\sinh(x)}{\cosh(x)}, where \sinh(x) = \frac{e^x - e^{-x}}{2} and \cosh(x) = \frac{e^x + e^{-x}}{2}. Its derivative is \tanh'(x) = 1 - \tanh(x)^2, which facilitates efficient backpropagation. \tanh resembles a scaled and shifted version of the logistic sigmoid function \sigma(x) = \frac{1}{1 + e^{-x}}, related by the identity \tanh(x) = 2\sigma(2x) - 1. This connection highlights how \tanh can be derived from the sigmoid, inheriting similar saturation behavior near the asymptotes but with output symmetric around zero.

A key advantage of \tanh over the sigmoid lies in its zero-centered output, which has a mean near zero for symmetric inputs, thereby reducing the bias shift in weights of downstream layers and promoting more stable gradient flow during training. This zero-centering often leads to convergence in fewer epochs compared to the sigmoid's strictly positive outputs, enhancing optimization in multi-layer networks. However, like the sigmoid, \tanh suffers from vanishing gradients for large |x|, where the derivative approaches zero, potentially slowing learning in deep architectures.

In historical context, \tanh gained prominence in recurrent neural networks for handling sequential data with bounded states. It was notably adopted in the long short-term memory (LSTM) units proposed by Hochreiter and Schmidhuber in 1997, where \tanh activates the candidate cell state to squash values into (-1, 1), aiding in the preservation of long-term dependencies without unbounded growth. This choice complemented the sigmoid gates in LSTMs, enabling effective training on tasks requiring memory over extended time lags. LSTMs with \tanh have since become a cornerstone in sequence modeling, influencing architectures like gated recurrent units.
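
The following sketch verifies the identity \tanh(x) = 2\sigma(2x) - 1 and the derivative 1 - \tanh(x)^2 numerically; the grid of inputs is arbitrary and serves only to show the saturation of the derivative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)

tanh_direct = np.tanh(x)
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0   # tanh(x) = 2*sigma(2x) - 1
grad = 1.0 - np.tanh(x) ** 2                        # tanh'(x) = 1 - tanh(x)^2

print(np.allclose(tanh_direct, tanh_from_sigmoid))  # True: the identity holds
print(grad)                                         # approaches 0 for large |x|
```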

Rectified Linear Unit

The rectified linear unit (ReLU) is a piecewise linear activation function defined as f(x) = \max(0, x), which outputs the input directly if it is positive and zero otherwise, thereby introducing sparsity in activations by nullifying negative values. This function was introduced in 2010 by Vinod Nair and Geoffrey Hinton to improve the training of restricted Boltzmann machines, where it demonstrated faster convergence compared to saturating activations by preserving relative intensities across layers. ReLU gained widespread adoption following its use in the AlexNet architecture, which achieved breakthrough performance on the ImageNet Large Scale Visual Recognition Challenge in 2012, marking a pivotal advancement in deep convolutional neural networks.

One key advantage of ReLU is its ability to mitigate the vanishing gradient problem, as the gradient is 1 for positive inputs and 0 for negative inputs, enabling effective backpropagation through deep networks without the saturation issues common in sigmoid or tanh functions. Additionally, ReLU is computationally efficient, requiring only a simple thresholding operation without expensive exponentials or divisions, which accelerates training in large-scale models. The sparsity induced by zeroing negative inputs further reduces parameter redundancy and can enhance generalization in sparse representations.

To address the drawback of "dying ReLU," where neurons can become permanently inactive for all inputs during training, variants have been developed to allow small gradients for negative values. Leaky ReLU modifies the function to f(x) = \max(\alpha x, x), where \alpha is a small positive constant (typically 0.01), permitting a small gradient flow for negative inputs to prevent neuron death while retaining ReLU's efficiency; it was proposed in 2013 for improving neural network acoustic models in speech recognition. Another variant, the exponential linear unit (ELU), is defined as f(x) = x if x > 0 and f(x) = \alpha (e^x - 1) otherwise, typically with \alpha = 1, which pushes mean activations closer to zero for faster learning and reduced bias shift; ELU was introduced in 2015 to accelerate convergence in deep networks.
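
A compact sketch of ReLU and the two variants described above is shown below; the \alpha values follow the typical defaults mentioned in the text and are not the only possible choices.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """f(x) = max(alpha * x, x): a small slope alpha for negative inputs."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """f(x) = x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("ReLU      :", relu(x))
print("Leaky ReLU:", leaky_relu(x))
print("ELU       :", elu(x))
```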

Properties and Characteristics

Differentiability and Continuity

Activation functions in neural networks must generally be differentiable to facilitate training via gradient-based optimization methods such as gradient descent, which rely on computing gradients to update parameters. This differentiability allows the application of the chain rule during backpropagation, enabling efficient propagation of error signals through the network layers. For functions like the rectified linear unit (ReLU), which is non-differentiable at the origin (where the two linear pieces of f(x) = \max(0, x) meet), subgradients are employed; the subgradient at x = 0 is conventionally set to 0 or any value in [0, 1] to handle this point during optimization. Most common activation functions are continuous everywhere, ensuring smooth mappings from inputs to outputs, with the notable exception of the binary step function, which introduces a discontinuity at the threshold (typically 0). Continuity is a prerequisite for differentiability, as non-continuous functions cannot have derivatives at points of discontinuity, limiting their utility in gradient-based learning.

Certain activations, such as the sigmoid and hyperbolic tangent, can lead to vanishing gradients because of saturation regions at large input magnitudes, where derivatives approach zero and impede learning in deep networks. For the sigmoid function \sigma(x) = \frac{1}{1 + e^{-x}}, the derivative is \sigma'(x) = \sigma(x)(1 - \sigma(x)), which has a maximum value of 0.25 and diminishes rapidly for large |x|. Similarly, the hyperbolic tangent \tanh(x) has a derivative \operatorname{sech}^2(x) = 1 - \tanh^2(x), which is less than 1 for |x| > 0 and approaches 0 as |x| increases, exacerbating gradient flow issues in deeper architectures. For ReLU, the derivative is piecewise defined as f'(x) = 1 if x > 0 and 0 otherwise, avoiding saturation but introducing the non-differentiability at zero. These properties directly influence optimization dynamics: smooth, differentiable activations support stable gradient propagation via the chain rule, while issues like vanishing gradients necessitate alternatives like ReLU to maintain effective training in deep models.
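
These derivative formulas can be compared directly; in the sketch below, the value assigned to the ReLU subgradient at zero is one conventional choice, as discussed above, and the input grid is arbitrary.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)              # maximum 0.25, vanishes for large |x|

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # sech^2(x), vanishes for large |x|

def relu_grad(x, subgradient_at_zero=0.0):
    """Piecewise derivative of ReLU; the value at x = 0 is a conventional choice."""
    return np.where(x > 0, 1.0, np.where(x < 0, 0.0, subgradient_at_zero))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("sigmoid':", sigmoid_grad(x))
print("tanh'   :", tanh_grad(x))
print("ReLU'   :", relu_grad(x))
```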

Non-linearity Requirements

Activation functions in neural networks must introduce non-linearity to prevent the collapse of multi-layer architectures into equivalent single-layer linear models. If the activation function f is linear, such as f(z) = kz + c, then composing multiple layers results in a single affine transformation: for inputs x, the output of two layers becomes f(W_2\, f(W_1 x + b_1) + b_2) = k^2 W_2 W_1 x + \text{const}, where the constant term absorbs the scaled biases, rendering deeper networks no more expressive than a shallow linear regressor. This limitation confines the network to modeling only linear relationships, severely restricting its ability to capture complex data patterns.

Non-linear activation functions overcome this by enabling the network to approximate arbitrary continuous functions on compact subsets of \mathbb{R}^n, as established by the universal approximation theorem. Originally proven for sigmoidal activations, this theorem demonstrates that a single hidden layer with sufficiently many neurons can approximate any continuous function on a compact subset of \mathbb{R}^n to arbitrary accuracy, with extensions applying to other non-linear functions like ReLU under certain conditions. Without non-linearity, networks fail to solve problems requiring non-linear decision boundaries, such as the XOR problem, which single-layer perceptrons cannot classify due to its non-linear separability, as shown in early analyses of perceptron limitations.

The key criterion for non-linearity is that the activation must not preserve linearity when composed with affine transformations; specifically, f(Wz + b) should not reduce to an affine function of z for all W, b, z. Piecewise linear or smoothly curved forms, like those in ReLU or the sigmoid, satisfy this by introducing bends or saturations that allow layered compositions to generate non-linear manifolds. Biologically, this mirrors the threshold-based firing of neurons, where inputs are integrated until exceeding a firing potential, as modeled in the foundational McCulloch-Pitts neuron, which uses a step function to simulate all-or-nothing spikes only above a threshold.
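
The collapse of stacked linear layers into a single affine map can be demonstrated numerically; the sketch below uses hypothetical random weights and tanh as one example of a non-linearity that breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def two_layer_linear(x):
    """Two layers with the identity 'activation' f(z) = z."""
    return W2 @ (W1 @ x + b1) + b2

# The same map collapses to one affine transformation W x + b.
W = W2 @ W1
b = W2 @ b1 + b2

x = rng.normal(size=2)
print(np.allclose(two_layer_linear(x), W @ x + b))          # True: no extra expressivity

# With a non-linearity (here tanh) between layers, the equivalence no longer holds.
print(np.allclose(W2 @ np.tanh(W1 @ x + b1) + b2, W @ x + b))  # generally False
```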

Specialized Variants

Radial Basis Functions

Radial basis functions (RBFs) serve as activation functions in neural networks, characterized by their dependence solely on the radial distance from a specified center point. Formally, an RBF is defined as f(\mathbf{x}) = \phi(\|\mathbf{x} - \mathbf{c}\|), where \mathbf{x} is the input vector, \mathbf{c} is the center vector, \|\cdot\| denotes the Euclidean norm, and \phi is a univariate function that operates on the distance r = \|\mathbf{x} - \mathbf{c}\|. This structure ensures radial symmetry, making the activation invariant to rotations around the center. The Gaussian function is the most prevalent form of RBF, expressed as \phi(r) = \exp\left( -\frac{r^2}{2\sigma^2} \right), where \sigma > 0 is a width parameter that determines the function's spread and thus the extent of its localized response. This produces a smooth, bell-shaped curve peaking at r = 0 with value 1 and approaching 0 as r increases. RBFs exhibit infinite support, being non-zero for all finite r, yet their rapid decay beyond a few multiples of \sigma results in effectively localized peaks, ideal for capturing regional features in data. Additionally, their form ensures translation invariance: shifting \mathbf{c} merely relocates the peak without altering its shape or height.

In radial basis function networks, these activations form the hidden layer, where the output is a weighted sum of multiple RBFs centered at selected points, enabling approximation of continuous functions on compact sets. This architecture, introduced for multivariable functional interpolation, excels in tasks requiring precise fitting to scattered data points, such as interpolation in high dimensions. The localized nature of RBFs facilitates efficient learning via methods like orthogonal least squares, avoiding the vanishing gradient issues that can saturate sigmoidal activations during training. Furthermore, the Gaussian RBF extends to kernel methods, notably as the radial basis function kernel in support vector machines, where it implicitly maps inputs to a high-dimensional feature space for non-linear separation.
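
A small sketch of a Gaussian RBF hidden layer follows; the centers, output weights, and width are hypothetical values chosen only to illustrate the weighted-sum structure of an RBF network.

```python
import numpy as np

def gaussian_rbf(x, center, sigma=1.0):
    """phi(r) = exp(-r^2 / (2 sigma^2)) with r = ||x - c||."""
    r = np.linalg.norm(x - center)
    return np.exp(-r**2 / (2.0 * sigma**2))

# Hidden layer of an RBF network: weighted sum of RBFs at fixed centers.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])  # hypothetical centers
weights = np.array([0.5, -1.2, 0.8])                        # hypothetical output weights

def rbf_network(x, centers, weights, sigma=1.0):
    activations = np.array([gaussian_rbf(x, c, sigma) for c in centers])
    return weights @ activations

print(rbf_network(np.array([0.2, 0.1]), centers, weights))
```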

Swish and Parametric Functions

Swish is a self-gated activation function defined as f(x) = x \cdot \sigma(\beta x), where \sigma is the sigmoid function and \beta is a fixed or learnable parameter that allows the function to adapt during training. Introduced by Ramachandran et al. in 2017, Swish generalizes the ReLU by incorporating a smooth gating mechanism, enabling non-monotonic behavior that can enhance performance in deep neural networks. Other parametric activation functions include the Parametric Rectified Linear Unit (PReLU), proposed by He et al. in 2015, which extends ReLU with a learnable slope parameter for negative inputs, formulated as f(x) = \max(0, x) + a \min(0, x), where a is trainable. Similarly, the Gaussian Error Linear Unit (GELU), developed by Hendrycks and Gimpel in 2016, is defined as f(x) = x \Phi(x), where \Phi(x) is the cumulative distribution function of the standard Gaussian, providing a probabilistic interpretation that smooths transitions near zero. These learnable activations offer advantages over fixed functions by avoiding abrupt zeros in the negative regime, which can mitigate dying-neuron issues, and by permitting the network to optimize the activation's shape for specific tasks. They have found widespread use in advanced architectures, such as Swish and PReLU in convolutional neural networks for image recognition, and GELU in transformer models like BERT for natural language processing.
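
The three functions can be written directly from their definitions; the sketch below fixes \beta = 1 for Swish and a = 0.25 for PReLU as illustrative parameter values, since in practice both may be learned during training.

```python
import numpy as np
from scipy.special import erf

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    """f(x) = x * sigma(beta * x); beta may be fixed or learned."""
    return x * sigmoid(beta * x)

def prelu(x, a=0.25):
    """f(x) = max(0, x) + a * min(0, x); a is the learnable negative slope."""
    return np.maximum(0.0, x) + a * np.minimum(0.0, x)

def gelu(x):
    """f(x) = x * Phi(x), with Phi the standard Gaussian CDF."""
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("Swish:", swish(x))
print("PReLU:", prelu(x))
print("GELU :", gelu(x))
```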

Comparison and Applications

Performance Evaluation

Performance evaluation of activation functions relies on key metrics that quantify their influence on training efficiency, stability, and model performance. Convergence speed is a primary metric, often assessed by the number of epochs needed to achieve a specified accuracy on benchmark datasets; faster convergence indicates more effective learning dynamics. Gradient variance measures the fluctuation in backpropagated gradients across layers, where excessive variance can lead to unstable optimization, and normalized variance provides a more reliable indicator of training stability than raw variance. Sparsity evaluates the percentage of zero-valued activations, promoting computational efficiency and potentially enhancing generalization by inducing feature sparsity in the network.

Empirical benchmarks highlight these metrics in practice. On the MNIST dataset, rectified linear unit (ReLU) activations generally enable faster convergence and higher accuracy compared to sigmoid, often reaching over 98% accuracy more quickly due to reduced vanishing gradient issues. Similar trends appear on CIFAR-10, where ReLU-based convolutional neural networks (CNNs) demonstrate superior training speed and accuracy compared to tanh-based models in image classification tasks. For more advanced functions, Swish has shown marginal improvements over ReLU on the ImageNet dataset, boosting top-1 classification accuracy by 0.9% in Mobile NASNet-A architectures while maintaining comparable convergence rates.

Computational factors further inform evaluation. ReLU incurs minimal computational overhead with simple thresholding, whereas sigmoid and tanh demand more operations involving exponentials, leading to higher overall training and inference costs. Memory usage is also lower for sparse activations like those from ReLU, as zero values reduce storage needs during forward passes. Frameworks such as TensorFlow and PyTorch facilitate ablation studies by allowing seamless substitution of activation functions within identical architectures, enabling direct measurement of metrics like epochs-to-accuracy and gradient statistics. Current trends underscore ReLU variants' dominance in CNNs for vision tasks owing to their speed and sparsity benefits, while hyperbolic tangent (tanh) remains prevalent in recurrent neural networks (RNNs) for sequential modeling, where its zero-centered output aids gradient propagation over time steps.
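
As a toy illustration of two of these metrics (activation sparsity and local gradient statistics), the sketch below evaluates ReLU and tanh on simulated, normally distributed pre-activations; it is not a substitute for a full training benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)  # simulated pre-activations, standard normal

# Activation sparsity: fraction of exactly-zero outputs.
relu_out = np.maximum(0.0, z)
tanh_out = np.tanh(z)
print("ReLU sparsity:", np.mean(relu_out == 0.0))   # roughly 0.5 for symmetric inputs
print("tanh sparsity:", np.mean(tanh_out == 0.0))   # essentially 0

# Local gradient statistics: mean and variance of f'(z) across units.
relu_grad = (z > 0).astype(float)
tanh_grad = 1.0 - np.tanh(z) ** 2
print("ReLU grad mean/var:", relu_grad.mean(), relu_grad.var())
print("tanh grad mean/var:", tanh_grad.mean(), tanh_grad.var())
```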

Selection Criteria

The selection of an activation function in neural networks depends primarily on the nature of the task, as different functions are suited to specific output requirements. For binary classification tasks, the sigmoid function is commonly applied in the output layer to produce probabilities between 0 and 1, enabling direct interpretation as class likelihoods. In multi-class classification, softmax (a multi-class generalization of the sigmoid) is preferred for output layers to generate normalized probabilities across classes. For hidden layers in feedforward networks, the rectified linear unit (ReLU) is a standard choice due to its ability to introduce non-linearity without saturating gradients during backpropagation, facilitating efficient training in deep architectures.

Architectural considerations further guide the choice, particularly in recurrent neural networks (RNNs), where bounded activations like tanh are favored to mitigate exploding gradients by constraining signal propagation over time steps. In contrast, unbounded functions such as ReLU are well-suited for convolutional neural networks (CNNs), supporting deeper layers without vanishing signals and promoting sparsity in feature representations. Practical constraints, including computational resources and training stability, also influence decisions. ReLU's simple thresholding operation—outputting the input if positive and zero otherwise—ensures low computational overhead, making it ideal for deployment on resource-limited edge devices where efficiency is paramount. To enhance stability, saturating functions like the sigmoid should be avoided in deep hidden layers, as they can lead to vanishing gradients that hinder learning in deep networks.

Heuristics provide practical starting points for practitioners: begin with ReLU for most hidden layers due to its robustness and speed, then experiment with alternatives like Swish if performance stagnates or dying neurons occur, as its smooth, non-monotonic shape can improve generalization in complex models. Additionally, consider the data distribution; zero-centered activations such as tanh are beneficial when inputs are symmetrically distributed around zero, as they prevent bias shifts in subsequent layers and accelerate convergence. Emerging trends point toward automated methods for selection, with AutoML techniques enabling the search for task-specific activation functions through reinforcement learning or evolutionary algorithms, potentially yielding optimized variants beyond manual choices.
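
These heuristics can be summarized as a simple rule-of-thumb lookup; the sketch below is a hypothetical helper rather than a prescriptive API, and its suggestions should always be validated empirically on the task at hand.

```python
def suggest_activation(layer_role, architecture="feedforward"):
    """Rule-of-thumb starting points distilled from the heuristics above."""
    if layer_role == "output_binary":
        return "sigmoid"      # probabilities in (0, 1) for binary classification
    if layer_role == "output_multiclass":
        return "softmax"      # normalized probabilities across classes
    if layer_role == "hidden":
        if architecture == "rnn":
            return "tanh"     # bounded, zero-centered activation for recurrence
        return "relu"         # default for CNN / feedforward hidden layers
    return "relu"

print(suggest_activation("hidden", "cnn"))   # relu
print(suggest_activation("hidden", "rnn"))   # tanh
print(suggest_activation("output_binary"))   # sigmoid
```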

Advanced Topics

Quantum Activation Functions

Quantum activation functions refer to non-linear mappings implemented within quantum circuits to enable expressive power in quantum neural networks (QNNs), typically through measurement-based protocols or variational quantum circuits that introduce non-linearity without violating quantum constraints. Unlike classical activations, these functions operate on quantum states, leveraging superposition and entanglement to process information in a high-dimensional Hilbert space, where the output is often obtained via partial measurements or post-selection to approximate non-linear behaviors. This approach addresses the inherent linearity of unitary quantum operations by incorporating probabilistic elements, such as projective measurements, to mimic classical non-linearities while preserving quantum coherence where possible.

Prominent examples include the quantum sigmoid, realized through encoding of classical inputs into quantum states followed by variational circuits that approximate the sigmoid curve via trainable parameters, and extensions of classical activations to quantum kernels, which allow non-linear feature mappings in quantum support vector machines by embedding data into high-dimensional quantum Hilbert spaces. Another set of examples comprises QReLU and m-QReLU, quantum analogs of the rectified linear unit designed for classification tasks; QReLU applies a quantum gate conditioned on the input to enforce rectification, while m-QReLU incorporates measurement outcomes to adaptively threshold activations in multi-qubit settings. Additionally, Quantum Splines (QSplines) and Generalized Hybrid Quantum Splines (GHQSplines) use variational quantum circuits to piecewise approximate arbitrary non-linear functions, enabling trainable quantum gates to serve as activation layers in QNNs.

A primary challenge in implementing these functions stems from the no-cloning theorem, which prohibits duplicating unknown quantum states for parallel classical-like non-linear processing, necessitating partial measurements that introduce noise and decoherence risks. To mitigate this, techniques like ancillary qubits and controlled measurements are employed, though they can limit scalability on noisy intermediate-scale quantum (NISQ) devices. In applications, quantum activation functions enhance quantum machine learning (QML) models for tasks such as optimization and classification, potentially offering exponential speedups in high-dimensional data processing compared to classical counterparts. For example, quantum-inspired activations like QReLU have been applied in classical convolutional neural networks for medical diagnostics, such as detecting COVID-19 from medical images and Parkinson's disease from spiral drawings, where they improved accuracy, precision, recall, and F1-scores compared to traditional ReLU variants. Recent research since 2018 has focused on hybrid quantum-classical frameworks, with seminal works exploring trainable quantum gates for end-to-end QNN training and kernel-based methods that leverage quantum activations for provable advantages in specific problems. More recent developments as of 2025 include quantum variational activation functions (QVAFs), which leverage data re-uploading in variational circuits for improved approximation in quantum neural architectures, and optimized quantum circuits for activation functions targeting fault-tolerant quantum devices.

Periodic Activation Functions

Periodic activation functions are a class of activation mechanisms in neural networks designed to process periodic or cyclical data by applying trigonometric or periodic mappings that generate repeating outputs, enabling the network to capture inherent periodicities without artificial discontinuities. A representative example is f(x) = \sqrt{2} \sin(x), which generates a smooth, oscillating output, facilitating the representation of repeating patterns in data. This approach contrasts with traditional activations like ReLU by inherently embedding periodicity, which aids in modeling domains where inputs wrap around, such as angular measurements or seasonal cycles.

Key properties of periodic activation functions include their continuity across periodic boundaries and their ability to preserve structure in cyclical representations, making them suitable for data exhibiting repetition, such as calendar-based timestamps or directional angles. Unlike linear or saturating activations, they avoid abrupt jumps at cycle edges—e.g., treating 0° and 360° as equivalent—thus reducing optimization issues for circular variables. These functions also promote better generalization in tasks with inherent periodicity, as the periodic structure supports translation-invariant representations and higher uncertainty for out-of-distribution inputs in Bayesian neural networks, enhancing the network's inductive bias toward periodicity.

In practice, periodic activations find application in recurrent neural networks (RNNs) for temporal modeling, where they help model time-series with seasonal components, such as daily or yearly cycles in financial or environmental datasets, by avoiding discontinuities that plague standard activations in circular domains. For instance, periodic activations have been integrated into models for time-series forecasting, where variants like the periodic ReLU—featuring repeated piecewise-linear segments—improve handling of oscillating signals while maintaining computational efficiency. Triangular wave functions, another periodic example, approximate linear rises and falls within each period, offering differentiable alternatives for tasks requiring precise periodicity capture in neural architectures.

The development of periodic activation functions gained prominence in the early 2020s, driven by the need to handle specialized tasks involving implicit neural representations of signals and shapes with periodic structures. Seminal work in this period introduced periodic mechanisms to enable neural networks to learn high-frequency details in cyclical data and induce global stationarity in Bayesian neural networks, outperforming conventional activations in fitting periodic targets and improving robustness. Subsequent advancements extended this to time-series domains, emphasizing their role in stabilizing training for models processing seasonal or angular inputs.
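
A brief sketch of the sine-based activation and a triangular-wave alternative illustrates the wrap-around property (0° and 360° mapping to the same value); the triangular parameterization below is one possible choice, not a canonical definition.

```python
import numpy as np

def periodic_sine(x):
    """f(x) = sqrt(2) * sin(x): a smooth, repeating output."""
    return np.sqrt(2.0) * np.sin(x)

def triangular_wave(x, period=2.0 * np.pi):
    """Piecewise-linear periodic alternative with linear segments in each period."""
    phase = np.mod(x, period) / period            # position in [0, 1) within the cycle
    return 2.0 * np.abs(2.0 * phase - 1.0) - 1.0  # triangle wave in [-1, 1]

# Angles of 0 and 2*pi (i.e. 0 deg and 360 deg) map to the same activation value.
print(np.allclose(periodic_sine(0.0), periodic_sine(2.0 * np.pi)))      # True
print(np.allclose(triangular_wave(0.0), triangular_wave(2.0 * np.pi)))  # True
```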
