Symbolic regression
Symbolic regression is a type of regression analysis that searches the space of mathematical expressions to identify the model that best fits a given dataset, discovering interpretable formulas that describe relationships between variables without assuming a predefined functional form.[1][2] Unlike traditional regression methods, which optimize parameters within a fixed equation structure, symbolic regression simultaneously evolves both the structure and parameters of the expression, often producing concise symbolic representations such as polynomials or nonlinear functions.[3] This approach emphasizes interpretability and generalizability, making it particularly valuable for scientific discovery where understanding underlying mechanisms is crucial.[2]
The technique originated in the late 1980s and early 1990s through the work of John Koza, who integrated it with genetic programming—a subset of evolutionary algorithms inspired by natural selection—to automatically generate and refine computer programs representing mathematical functions.[4] In genetic programming for symbolic regression, candidate expressions are represented as tree structures, with operations like crossover (combining subtrees) and mutation (altering nodes) driving the evolutionary search toward expressions that minimize error on training data.[1] Early applications focused on benchmark problems, such as fitting synthetic functions, but the method has since expanded due to its ability to uncover novel relationships in real-world data.[3]
Symbolic regression has found broad applications across scientific domains, including physics, where it has rediscovered laws like Kepler's third law from planetary data, and materials science, where it has been used to derive constitutive equations.[5] In astrophysics, it has been used to identify scaling relations in galaxy properties and models for exoplanet transit spectroscopy.[1] More recent advances incorporate deep learning techniques, such as transformer-based models and reinforcement learning, to improve efficiency and handle larger datasets, addressing the inherent computational challenges of the NP-hard search space.[2] Despite these developments, symbolic regression remains computationally intensive, often requiring specialized hardware or approximations to scale effectively.[3]
Introduction
Definition and Objectives
Symbolic regression (SR) is an automated machine learning technique that searches for both the structure and parameters of mathematical models to fit given input-output data pairs, unlike traditional parametric regression methods that presuppose a fixed functional form such as linear or polynomial equations.[4][6] In SR, the goal is to discover symbolic expressions—combinations of mathematical operators and variables—that describe underlying relationships in the data without prior assumptions about the model's form.[7] This process typically employs evolutionary algorithms, such as genetic programming, to explore vast spaces of possible expressions.[4]
The primary objectives of symbolic regression are to identify interpretable and parsimonious expressions that generalize well to unseen data, minimize prediction error, and facilitate scientific discovery beyond mere forecasting.[6] Interpretability arises from producing human-readable formulas, such as y = x^2 + \sin(x), which reveal causal or physical relationships, while parsimony favors simpler models to avoid overfitting and enhance generalization.[8] Error minimization is commonly achieved through fitness functions like the mean squared error (MSE), defined as \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where y_i are observed targets, \hat{y}_i are predicted values, and n is the number of data points; this measures the average squared deviation to quantify model accuracy.[4] Overall, SR promotes insight into data-generating processes, particularly in fields like physics and engineering where exact equations are sought.[6]
At its core, symbolic regression involves input data consisting of independent variables X and dependent targets Y, a search space of operators (e.g., addition +, subtraction -, multiplication \times, division /, sine \sin, exponential \exp) combined with terminals (variables and constants), and fitness functions to evaluate expression quality.[4] The search space defines the building blocks for constructing expressions, often represented as tree structures in genetic programming implementations.[6] Originating from evolutionary computation paradigms in the late 1980s, symbolic regression has evolved into a key tool for model discovery.[4]
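To make the fitness evaluation concrete, the following minimal sketch scores a few hand-written candidate expressions by their MSE on a toy dataset; the data, candidate set, and helper names are illustrative rather than taken from any particular SR system, which would generate and refine such candidates automatically.

```python
import numpy as np

# Toy dataset: the hidden relationship is y = x**2 + sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=100)
y = X**2 + np.sin(X)

def mse(y_true, y_pred):
    """Mean squared error, the fitness measure defined above."""
    return np.mean((y_true - y_pred) ** 2)

# Candidate expressions built from operators {+, *, sin, exp} and the terminal x.
candidates = {
    "x**2 + sin(x)": lambda x: x**2 + np.sin(x),
    "x**2":          lambda x: x**2,
    "exp(x)":        lambda x: np.exp(x),
}

for formula, f in candidates.items():
    print(f"{formula:15s} MSE = {mse(y, f(X)):.4f}")
```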
Historical Background
The 1980s marked the emergence of symbolic regression through evolutionary computation, rooted in John Holland's genetic algorithms described in his 1975 book Adaptation in Natural and Artificial Systems, which introduced adaptation mechanisms for optimizing complex structures. John Koza extended this framework in the late 1980s, developing genetic programming (GP) to evolve executable representations of programs, with symbolic regression as a primary application for fitting mathematical models to data.[4] Koza's seminal 1992 book Genetic Programming: On the Programming of Computers by Means of Natural Selection popularized GP for symbolic problems, demonstrating its ability to discover nonlinear expressions like Boolean functions and time-series models. His contributions included influential patents, such as U.S. Patent 4,935,877 (1990) for nonlinear genetic algorithms in problem-solving.
In the 2000s, symbolic regression integrated with broader machine learning paradigms, enabling applications to empirical sciences through tools like the Eureqa software (2009), which applied GP to distill laws from noisy experimental data. Koza's annual Humies awards, launched in 2004, highlighted high-impact GP results, including symbolic regression achievements competitive with human designs.[9] The 2010s brought a shift toward scalability for big data, with methods addressing computational challenges in high-dimensional spaces; post-2015 innovations in deep symbolic regression combined neural guidance with evolutionary search to recover expressions from complex datasets.[10][11] In the 2020s, further advances as of 2025 have incorporated transformer-based models and large language models to enhance efficiency and interpretability, alongside new benchmarks for evaluating SR methods.[2][1]
Comparison with Traditional Regression
Key Differences
Symbolic regression fundamentally differs from classical regression techniques in its approach to model formulation: it simultaneously evolves both the structure and parameters of mathematical expressions, often represented as tree-based structures such as x^2 + \sin(y), without presupposing a specific functional form. In contrast, classical methods like ordinary least squares (OLS) regression assume a fixed model structure, such as the linear form y = \beta_0 + \beta_1 x, and optimize only the numerical parameters within that predefined framework.[6] This structural flexibility in symbolic regression stems from its roots in evolutionary computation, pioneered by Koza's genetic programming paradigm.
The optimization process in symbolic regression involves a global search across an expansive space of possible expressions, typically employing heuristic exploration to avoid local minima and discover novel functional relationships. Classical regression, however, relies on local optimization techniques, such as least squares minimization in OLS, which efficiently fit parameters but can fail to identify the underlying true model if the assumed form is incorrect.[6] This exploratory nature allows symbolic regression to uncover complex, nonlinear dependencies that classical approaches might overlook without extensive model specification.
A key output distinction is that symbolic regression yields explicit, human-readable symbolic equations that provide mechanistic insights into the data-generating process, whereas classical regression produces opaque numerical coefficients or black-box models that prioritize predictive accuracy over interpretability. For instance, symbolic regression has rediscovered nonlinear physical laws, such as Kepler's third law relating planetary orbital periods to semi-major axes, from raw astronomical data where linear regression would fail without a prior nonlinear form assumption.[12]
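The distinction can be illustrated with a short sketch (the data and candidate forms are invented for illustration): ordinary least squares fits the coefficients of a single fixed structure, while a symbolic-regression-style search also ranks competing structures, here by fitting a single scale constant to each candidate form.

```python
import numpy as np

# Data generated by a nonlinear law, y = 2 / x**2 (an inverse-square form).
rng = np.random.default_rng(1)
x = rng.uniform(0.5, 5.0, size=200)
y = 2.0 / x**2

# Classical regression: the structure y = b0 + b1*x is fixed in advance and only
# the coefficients are optimized (ordinary least squares).
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print("fixed linear form, MSE:", np.mean((A @ beta - y) ** 2))

# Symbolic-regression-style search: rank alternative structures, fitting the
# best scale constant c for each candidate form y = c * f(x).
structures = {"c*x": lambda v: v, "c/x": lambda v: 1.0 / v, "c/x**2": lambda v: 1.0 / v**2}
for name, f in structures.items():
    basis = f(x)
    c = np.sum(basis * y) / np.sum(basis**2)   # closed-form least-squares scale
    print(name, "MSE:", np.mean((c * basis - y) ** 2))
```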
Advantages and Limitations
Symbolic regression offers significant advantages in scientific discovery due to its ability to produce highly interpretable mathematical expressions that reveal underlying mechanisms in data. Unlike black-box models such as neural networks, it generates explicit formulas, such as the Coulomb force law F = k \frac{q_1 q_2}{r^2}, which can be directly understood and verified by domain experts.[13] This interpretability facilitates hypothesis generation and validation in fields like physics, where symbolic regression has rediscovered conservation laws, including those governing simple harmonic oscillators and chaotic double pendula, from raw experimental data without prior assumptions about the system's form.[14] Another key benefit is automatic feature engineering, as the method evolves both the structure and coefficients of expressions, potentially uncovering nonlinear interactions that manual feature selection might overlook. For instance, in dynamical systems like the nonlinear pendulum, techniques integrated with symbolic regression can derive reduced representations such as \ddot{z} = -0.99 \sin z.[13] Furthermore, symbolic regression is robust to unknown functional forms relative to traditional parametric methods, which rely on predefined equations; it explores a broad space of compositions from a function library, enabling discovery even when the true model deviates from standard assumptions.
Despite these strengths, symbolic regression faces notable limitations, primarily its computational expense stemming from an NP-hard search space that grows exponentially with expression complexity.[15] This makes it resource-intensive for large datasets or intricate models, often requiring significantly more time and hardware than faster parametric alternatives. Additionally, without proper controls, it risks overfitting by favoring overly complex expressions that memorize noise rather than generalize, a challenge addressed through mechanisms like Pareto fronts that trade off accuracy and simplicity. Symbolic regression is also sensitive to data noise, as perturbations can lead to spurious terms in evolved expressions, particularly in low-signal environments. To mitigate these drawbacks, practitioners rely on multi-objective optimization, simultaneously minimizing fitting error and expression length or complexity—analogous to the complexity penalties in criteria like the Akaike Information Criterion (AIC). Methods that use Pareto fronts rank solutions to select parsimonious models that balance fidelity and interpretability, enhancing generalization in noisy or sparse data scenarios.[13]
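The accuracy–complexity trade-off described above can be sketched with a simple Pareto-front filter; the candidate expressions, node counts, and error values below are invented for illustration.

```python
def pareto_front(models):
    """Keep models not dominated in (error, complexity); lower is better for both."""
    front = []
    for m in models:
        dominated = any(
            o["error"] <= m["error"] and o["complexity"] <= m["complexity"]
            and (o["error"] < m["error"] or o["complexity"] < m["complexity"])
            for o in models
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda m: m["complexity"])

# Hypothetical candidates; complexity counts nodes in the expression tree.
candidates = [
    {"expr": "x",                         "complexity": 1,  "error": 0.90},
    {"expr": "sin(x)",                    "complexity": 2,  "error": 0.95},  # dominated by "x"
    {"expr": "x**2",                      "complexity": 3,  "error": 0.30},
    {"expr": "x**2 + sin(x)",             "complexity": 6,  "error": 0.02},
    {"expr": "x**2 + sin(x) + 0.01*x**3", "complexity": 12, "error": 0.019},
]
for m in pareto_front(candidates):
    print(m["expr"], m["error"], m["complexity"])
```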
Core Methods
Genetic Programming Approaches
Genetic programming (GP) represents a foundational approach to symbolic regression, evolving populations of computer programs represented as tree structures to discover mathematical expressions that best fit given data. In this paradigm, each individual in the population is an expression tree where internal nodes denote functions or operators, and leaf nodes represent variables or constants. The evolutionary process begins with the random initialization of a population of such trees, typically using methods like the ramped half-and-half technique to ensure diversity in tree sizes and shapes.
Selection, crossover, and mutation operators drive the evolution across generations. Tournament selection is commonly employed to choose parents based on fitness, favoring individuals that minimize the error between the evolved expression and target data points, often measured by mean squared error or a similar regression metric adapted for symbolic forms. Crossover swaps subtrees between two parent trees to generate offspring, while mutation replaces a randomly selected subtree with a new one, introducing variation while preserving viable structures. These operations are tailored for symbolic regression by defining a function set including arithmetic operators (e.g., +, -, *, /) and sometimes transcendental functions (e.g., sin, exp) to allow discovery of nonlinear and complex expressions.
A key adaptation in GP for symbolic regression is the use of fitness functions that penalize both approximation error and expression complexity to combat code bloat, where populations tend to grow unnecessarily large over generations. Parsimony pressure, a common technique, incorporates a penalty proportional to tree size into the fitness evaluation, such as adding a small constant times the number of nodes to the error term, thereby favoring simpler models without sacrificing accuracy. This helps maintain computational efficiency and interpretability in the evolved expressions.[16]
John Koza's standard GP algorithm, introduced in 1992, established the canonical framework for these methods, demonstrating their efficacy on symbolic regression tasks like fitting quartic polynomials or trigonometric functions through iterative evolution over hundreds of generations. Variants like strongly typed GP extend this by enforcing type constraints during tree construction and genetic operations, ensuring semantically valid expressions—for instance, restricting addition to numeric types only—which reduces invalid offspring and accelerates convergence on well-typed regression problems.[17]
The tree representation can be formalized as a directed acyclic graph, but in practice it is a binary or n-ary tree whose root evaluates to the full expression. For example, the expression x + \sin(y) is represented by a tree with + at the root, the variable x as one child, and the subtree \sin(y) as the other. Mutation might replace the \sin(y) subtree with, say, e^z, yielding x + e^z, altering the functional form while adhering to the type system if strongly typed.
Early implementations facilitated practical adoption of GP for symbolic regression. Lil-GP, developed in the mid-1990s, provided an efficient C-based library for tree-based GP, supporting features like ephemeral random constants for numerical terminals and applications to benchmark regression problems such as quadratic formula discovery.[18]
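The evolutionary loop described in this section can be condensed into a small, self-contained sketch. For brevity it uses mutation-only variation and truncation selection on nested-list expression trees, rather than Koza-style subtree crossover and tournament selection, and the operator set, parsimony coefficient, and toy target are illustrative assumptions.

```python
import random, math, copy

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "sin": lambda a: math.sin(a)}
TERMINALS = ["x", 1.0]

def random_tree(depth=3):
    """Grow a random expression tree: internal nodes are operators, leaves are terminals."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    op = random.choice(list(OPS))
    arity = 1 if op == "sin" else 2
    return [op] + [random_tree(depth - 1) for _ in range(arity)]

def evaluate(node, x):
    if node == "x":
        return x
    if isinstance(node, float):
        return node
    return OPS[node[0]](*(evaluate(child, x) for child in node[1:]))

def size(node):
    return 1 if not isinstance(node, list) else 1 + sum(size(c) for c in node[1:])

def fitness(tree, data, parsimony=0.01):
    """Mean squared error plus a parsimony-pressure penalty proportional to tree size."""
    err = sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)
    return err + parsimony * size(tree)

def mutate(tree):
    """Replace a randomly chosen subtree with a freshly grown one."""
    if not isinstance(tree, list) or random.random() < 0.2:
        return random_tree(2)
    new = copy.deepcopy(tree)
    i = random.randrange(1, len(new))
    new[i] = mutate(new[i])
    return new

# Toy target y = x**2 + sin(x); evolve with truncation selection and mutation only
# (subtree crossover between two parents would swap subtrees in the same way).
data = [(x / 10.0, (x / 10.0) ** 2 + math.sin(x / 10.0)) for x in range(-30, 31)]
population = [random_tree() for _ in range(200)]
for generation in range(30):
    population.sort(key=lambda t: fitness(t, data))
    survivors = population[:50]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(150)]

best = min(population, key=lambda t: fitness(t, data))
print(best, fitness(best, data))
```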
Other Evolutionary Techniques
Beyond standard genetic programming, which typically employs tree-based representations for evolving mathematical expressions, other evolutionary techniques adapt alternative encodings and search mechanisms to address challenges in symbolic regression, such as computational efficiency and the generation of valid expressions. Linear genetic programming (LGP) represents programs as sequences of instructions or machine-code-like operations, contrasting with tree structures by enabling direct compilation and faster execution during fitness evaluation.[19] This linear encoding facilitates efficient handling of large datasets, where LGP has demonstrated superior performance and simpler solutions compared to traditional genetic programming in symbolic regression tasks.[19] For instance, LGP's instruction-based approach reduces overhead in interpreting complex trees, leading to efficiency gains in evolving regression models on extensive data.[20]
Grammatical evolution (GE) employs a linear genome that maps to valid expressions via a Backus-Naur Form (BNF) grammar, constraining the search space to produce only syntactically correct mathematical formulas and thereby minimizing invalid outputs (a minimal sketch of this genotype-to-phenotype mapping appears at the end of this section).[21] Introduced by Ryan, Collins, and O'Neill in 1998, GE decouples the genotype from the phenotype, allowing flexible grammar definitions to enforce domain-specific constraints in symbolic regression.[21] This method enhances the reliability of evolved expressions by avoiding the need for repair mechanisms common in unconstrained evolutionary searches.[22]
Estimation of distribution algorithms (EDAs) replace traditional crossover and mutation with probabilistic models that estimate the distribution of promising solutions and sample new individuals accordingly, guiding the search more explicitly in symbolic regression.[23] In this paradigm, EDAs build statistical models—such as multivariate normal distributions—from selected individuals to generate offspring, which can improve convergence on complex expression spaces by capturing dependencies among variables.[23] Applications in symbolic regression, as explored in works combining EDAs with grammar-guided evolution, demonstrate their utility in reducing parameter tuning while effectively sampling valid mathematical structures.[24]
Particle swarm optimization (PSO) adaptations in symbolic regression often focus on continuous optimization of numerical coefficients within evolved expression structures, treating parameters as particle positions in a search space updated via velocity and social influences.[25] This swarm-based approach complements structure discovery by iteratively refining constants post-evolution, leveraging global and local best-known solutions to minimize fitting errors efficiently.[25] Hybrid integrations, such as PSO-tuned genetic programming models, have been applied to enhance prediction accuracy in regression scenarios by optimizing coefficients after initial expression generation.[26]
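The grammatical evolution mapping referenced above can be sketched as follows; the grammar, the integer genome, and the wrapping limit are illustrative assumptions rather than the settings of any published GE system.

```python
# Production rules in (simplified) BNF: each nonterminal maps to a list of alternatives.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>":   [["+"], ["-"], ["*"]],
    "<var>":  [["x"], ["y"], ["1.0"]],
}

def decode(genome, start="<expr>", max_wraps=2):
    """Expand the start symbol left to right, choosing productions codon-modulo-rules."""
    output, stack, i = [], [start], 0
    budget = len(genome) * (max_wraps + 1)      # allow the genome to be reused (wrapped)
    while stack and i < budget:
        symbol = stack.pop(0)
        if symbol not in GRAMMAR:               # terminal symbol: emit it
            output.append(symbol)
            continue
        rules = GRAMMAR[symbol]
        choice = rules[genome[i % len(genome)] % len(rules)]
        i += 1
        stack = list(choice) + stack            # expand the leftmost nonterminal first
    return " ".join(output)

genome = [2, 3, 12, 5, 9, 2, 11, 4]             # linear integer genome (codons)
print(decode(genome))                           # -> "x * 1.0" for this genome
```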
Advanced Methods
Machine Learning Integrations
Symbolic regression has increasingly integrated machine learning techniques to enhance the efficiency and accuracy of expression discovery, addressing the combinatorial explosion inherent in traditional search methods. Neural networks, in particular, guide the exploration of expression spaces by proposing candidate structures or priors, reducing reliance on purely random or evolutionary sampling.[11]
One prominent approach is neural-guided symbolic regression, where recurrent neural networks (RNNs), such as gated recurrent units (GRUs), generate sequences of mathematical operations to form expressions. In the Deep Symbolic Regression (DSR) framework, an RNN emits a distribution over tractable mathematical expressions, trained via reinforcement learning with risk-seeking policy gradients to prioritize high-reward, low-complexity solutions. This method samples expressions autoregressively, with the RNN conditioned on input features and previously generated tokens, enabling recovery of ground-truth equations from noisy data in benchmarks like the Feynman equations. A softmax layer is applied to the RNN's final hidden state to obtain probabilities over production rules, such as operator selection: p(o_t | o_{<t}, x) = \text{softmax}(W_o \cdot h_t + b_o), where o_t is the token at step t, h_t is the hidden state, and W_o, b_o are learned parameters, allowing prioritized selection of operators like addition or multiplication based on learned priors.[27]
Complementing this, the AI Feynman method employs neural networks for initial fitting of complex functions, followed by recursive symbolic simplification informed by physics-inspired priors like dimensional analysis. A feedforward neural network approximates the target function from data, after which symbolic regression refines subproblems by substituting learned constants and reducing dimensionality, successfully solving all 100 equations in the Feynman benchmark, many of which had stumped earlier genetic programming approaches.[28]
Hybrid models combine symbolic regression with gradient boosting machines to leverage the latter's strength in capturing non-linear patterns for initial approximations, followed by symbolic refinement for interpretability. For instance, interpretable variants of eXtreme Gradient Boosting (XGBoost) preprocess data to identify key interactions, guiding the symbolic search toward parsimonious expressions in tasks like materials modeling. Tools evolving from Eureqa, such as PySR, incorporate gradient boosting-inspired library tuning and regularization to accelerate convergence on real-world datasets.[29][30]
Reinforcement learning variants further advance operator sequence exploration, treating expression building as a sequential decision process. RL agents, optimized via policy gradients, learn to construct expressions by rewarding fits that balance accuracy and complexity, as in extensions of DSR where deep policies navigate vertical (multi-output) regression spaces. These approaches, prominent in 2020s research, use actor-critic methods to sample operator sequences, outperforming vanilla genetic programming on large-scale symbolic benchmarks by focusing search on promising substructures.[31][32]
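The autoregressive sampling loop shared by DSR-style and reinforcement-learning approaches can be sketched as follows; the token library, the arity bookkeeping, and the stand-in random "policy" are illustrative assumptions, with a trained RNN supplying the logits in a real system.

```python
import numpy as np

# Token library: operators with their arities, plus terminal symbols.
LIBRARY = {"add": 2, "mul": 2, "sin": 1, "x": 0, "const": 0}
TOKENS = list(LIBRARY)
rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_expression(policy_logits, max_len=12):
    """Sample a prefix-notation expression token by token.

    policy_logits(prefix) stands in for the RNN: it maps the tokens generated so
    far to unnormalized scores over the library (here it is simply random).
    """
    tokens, open_slots = [], 1              # number of unfilled operator arguments
    while open_slots > 0 and len(tokens) < max_len:
        probs = softmax(policy_logits(tokens))
        tok = str(rng.choice(TOKENS, p=probs))
        tokens.append(tok)
        open_slots += LIBRARY[tok] - 1      # an arity-k token fills one slot, opens k
    return tokens                           # a full system would also handle truncation

expr = sample_expression(lambda prefix: rng.normal(size=len(TOKENS)))
print(expr)   # e.g. ['add', 'x', 'sin', 'const'], meaning add(x, sin(const))
```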
Recent Innovations
Recent innovations in symbolic regression have increasingly incorporated human expertise and domain-specific knowledge to enhance interpretability and accuracy, particularly through interactive and physics-informed approaches. A notable advancement is interactive symbolic regression, which enables human-in-the-loop co-design where users provide feedback to refine mathematical expressions iteratively. In a 2025 study published in Nature Communications, researchers introduced the Symbolic Q-network, an offline reinforcement learning framework that allows human experts to collaborate with the algorithm in real-time discovery tasks, modifying expressions based on discrepancies between data and predictions to achieve more tailored models for complex systems. This mechanism has demonstrated improved convergence in scenarios requiring rapid adaptation, such as engineering design optimization.[1]
Physics-informed symbolic regression has emerged as a post-2023 paradigm that embeds domain constraints, like conservation laws, directly into the fitness function to guide expression evolution toward physically plausible solutions (a schematic constraint-penalized fitness is sketched at the end of this section). For instance, a 2023 framework, Φ-SO, leverages deep learning to recover analytical expressions from physics data while enforcing unit constraints, outperforming traditional methods on noisy datasets from astrophysical simulations. Building on this, a 2024 integration of symbolic regression with physics-informed neural networks (PINNs) has been applied to nonlinear dynamics, such as fluid flow equations, where the hybrid approach extracts interpretable terms while satisfying governing partial differential equations. More recent work in 2025, including StruSR, uses structure-aware PINNs to incorporate prior knowledge of equation forms, enhancing scalability for high-dimensional problems like soil constitutive modeling.[33][34][35] These advances prioritize dimensional consistency and constraint satisfaction, reducing the search space and improving generalization in scientific applications.
Scalable deep symbolic regression has gained traction with transformer-based models designed for large-scale expression generation, addressing the limitations of exhaustive search in high-dimensional spaces. The 2024 SymFormer architecture, an end-to-end transformer, simultaneously predicts symbols and constants from data, enabling efficient handling of diverse datasets without predefined grammars and showing competitive performance on benchmarks like Nguyen and Keane functions. A comprehensive 2025 ACM Computing Surveys review highlights how these models, trained via supervised learning on synthetic expressions, facilitate zero-shot generalization to unseen problems, with applications in dynamical systems modeling. Such methods extend beyond classical evolutionary techniques by leveraging attention mechanisms for semantic similarity, though empirical comparisons in a 2024 arXiv preprint indicate that traditional genetic programming implementations, like Operon, often outperform newer transformer-based approaches in terms of exact recovery rates on standard tasks.[36][2][37]
Benchmarking efforts have also evolved, with the SRBench framework receiving significant updates in 2024 and 2025 to incorporate emerging algorithms and datasets. These revisions expanded the evaluated methods to over 25 systems, including recent deep learning hybrids, and introduced new metrics for assessing generalization across noise levels and extrapolation, fostering standardized comparisons in the community.
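As a schematic of the constraint-penalized fitness referenced above, the following sketch adds an energy-conservation penalty (for an idealized frictionless pendulum) to the usual data misfit; the pendulum setup, weighting, and function names are illustrative assumptions, not taken from Φ-SO or the other cited frameworks. In a full SR system, theta_hat would come from evaluating a candidate symbolic expression on the time grid.

```python
import numpy as np

def constrained_fitness(theta_hat, t, theta_obs, length=1.0, g=9.81, weight=10.0):
    """Data misfit plus a penalty for violating energy conservation (illustrative)."""
    data_error = np.mean((theta_hat - theta_obs) ** 2)
    # For an ideal frictionless pendulum, E = 0.5*(L*dtheta/dt)^2 + g*L*(1 - cos(theta))
    # is constant in time, so its variance along the candidate trajectory is the penalty.
    dtheta = np.gradient(theta_hat, t)
    energy = 0.5 * (length * dtheta) ** 2 + g * length * (1.0 - np.cos(theta_hat))
    return data_error + weight * np.var(energy)

# Toy comparison: a near-physical candidate scores better than one whose energy grows.
t = np.linspace(0.0, 10.0, 500)
theta_obs = 0.1 * np.cos(np.sqrt(9.81) * t)
good = 0.1 * np.cos(np.sqrt(9.81) * t + 0.01)
bad = 0.1 * np.cos(np.sqrt(9.81) * t) * np.exp(0.2 * t)
print(constrained_fitness(good, t, theta_obs), constrained_fitness(bad, t, theta_obs))
```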
Explorations into quantum-inspired techniques remain nascent but show promise in preliminary 2024 studies for accelerating searches in combinatorial spaces.
Evaluation and Benchmarking
Standard Benchmarks
SRBench represents the most comprehensive and widely adopted benchmark suite for symbolic regression, introduced in 2021 by La Cava et al. as an open-source framework to evaluate modern methods against state-of-the-art machine learning approaches.[38] It encompasses 252 regression problems drawn from the Penn Machine Learning Benchmarks (PMLB), blending synthetic datasets—such as noisy polynomials and differential equation systems—and real-world data without known ground-truth expressions.[38] Synthetic examples include the Feynman equations from the Feynman Lectures on Physics, which test the recovery of known physical laws, and Strogatz ordinary differential equations modeling chaotic systems like the van der Pol oscillator.[38] The suite emphasizes reproducibility, with controlled experimental setups that compare expression accuracy and interpretability across diverse problem scales and noise conditions.[38]
To address varying difficulty levels, SRBench divides synthetic problems into easy and hard tracks based on noise: the easy track uses noise-free data, while the hard track incorporates Gaussian white noise at levels of 0.001, 0.01, and 0.1 relative to the signal's root mean square.[38] Evaluation protocols adopt a multi-objective approach, balancing predictive accuracy against expression simplicity to mimic real-world trade-offs in model selection.[38] Key metrics include the exact match rate, which measures the percentage of cases where the discovered expression symbolically matches the ground truth after simplification; normalized error, often computed as the mean squared error scaled by the variance of the target values to enable cross-dataset comparisons; and expression complexity, quantified by the number of nodes (operators, variables, and constants) in the simplified parse tree.[38] These metrics facilitate Pareto-optimal analysis, where methods are scored on fronts plotting low error against low complexity.[38]
Earlier benchmarks laid foundational groundwork for low-dimensional problems, such as those introduced by Schmidt and Lipson in 2009, which focused on distilling free-form equations from experimental data in physics and chemistry using small-scale, interpretable datasets. For large-scale evaluation, the Feynman Symbolic Regression Database (FSRD), developed by Udrescu and Tegmark in 2019, provides 100 high-dimensional physics-inspired equations derived from the Feynman Lectures, challenging methods on sparse, noisy data with up to thousands of terms.[39] It was expanded in AI Feynman 2.0 around 2020 to include multi-level fitting capabilities.[39] Both suites employ similar core metrics—exact match rate for ground-truth recovery, normalized error for fit quality, and node-based complexity for parsimony—to ensure consistent assessment.[39]
Recent updates to SRBench, including extensions in 2024 via SRBench++ with domain-specific datasets and further refinements in 2025 toward a next-generation benchmark that doubles the number of evaluated methods, incorporates improved metrics, visualizations, and analysis of trade-offs including energy consumption, while maintaining the core dataset of 252 problems with curated selections for specific tracks.[40][41] Additionally, LLM-SRBench, introduced in 2025, provides 239 challenging problems across physics, chemistry, biology, and engineering domains to assess large language model-based scientific equation discovery.[42] These evolutions maintain multi-objective protocols and prioritize verifiable ground truths for rigorous testing.
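The core metrics shared by these suites can be illustrated with a few helper functions; these are simplified stand-ins for the benchmarks' actual implementations (the symbol names and the use of SymPy here are assumptions for illustration).

```python
import numpy as np
import sympy as sp

def normalized_error(y_true, y_pred):
    """Mean squared error scaled by the variance of the targets."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def node_complexity(expr_str):
    """Node count (operators, variables, constants) of the simplified expression tree."""
    expr = sp.simplify(sp.sympify(expr_str))
    return len(list(sp.preorder_traversal(expr)))

def exact_match(found, truth):
    """Does the discovered expression simplify to the ground-truth expression?"""
    return sp.simplify(sp.sympify(found) - sp.sympify(truth)) == 0

print(node_complexity("x**2 + sin(x)"))     # 6 nodes: Add, Pow, x, 2, sin, x
print(exact_match("2*x + x", "3*x"))        # True
print(normalized_error([1, 2, 3], [1.1, 2.0, 2.9]))
```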
Such benchmarks are occasionally adapted for competitions, like the GECCO Symbolic Regression Challenge, to standardize comparisons across evolving algorithms.[38]
Competitions and Challenges
The SRBench Competition of 2022, held at the Genetic and Evolutionary Computation Conference (GECCO) in Boston, Massachusetts, represented a major effort to evaluate and advance symbolic regression algorithms through standardized benchmarking.[43] It featured two primary tracks: a synthetic track assessing rediscovery of known equations, feature selection, resistance to local optima, extrapolation, and noise sensitivity on clean and noisy data; and a real-world track focused on 14-day forecasts of COVID-19 cases, hospitalizations, and deaths in New York State, emphasizing accuracy, model simplicity, and trustworthiness.[43] Thirteen algorithms were submitted, with nine qualifying after validation; QLattice, developed by Abzu AI, won the synthetic track with a score of 6.23, while Unified Deep Symbolic Regression (uDSR), from a team at Lawrence Livermore National Laboratory led by Brenden Petersen, claimed victory in the real-world track with a score of 5.75.[43][44] The Operon framework, a genetic programming-based method with local search optimization, also demonstrated strong performance in the synthetic track, balancing accuracy and interpretability.[45][46]
Beyond SRBench, symbolic regression has been featured in ongoing workshops and challenges that foster innovation. The Genetic Programming Theory and Practice (GPTP) workshops, held annually since 2003 initially at the University of Michigan and later at Michigan State University, provide a venue for discussing theoretical and practical advances in genetic programming, including symbolic regression applications, through presentations and collaborations among researchers.[47][48] The AI Feynman dataset from the 2019 paper by Udrescu and Tegmark, consisting of 100 equations from the Feynman Lectures on Physics, has inspired subsequent benchmarks and methods for recovering exact equations from noisy data, with expansions like the 2022 SRSD benchmark using 120 recreated datasets.[28][49] This initiative highlighted the potential of hybrid neural-symbolic approaches for scientific equation discovery.[28]
These events address core challenges in symbolic regression, such as scalability to large datasets and ensuring interpretability of discovered expressions. In the 2022 SRBench synthetic track, top performers achieved low-error recovery of ground-truth equations, underscoring progress in handling noisy data and extrapolation.[43] The real-world track emphasized domain adaptation, with winners producing simple, trustworthy models for epidemiological forecasting.[44] Follow-up competitions, including the 2023 SRBench edition at GECCO with dedicated performance and interpretability tracks, and the 2024 Symbolic Regression Workshop (SymReg) at GECCO in Melbourne, Australia, incorporated recent innovations like deep learning integrations to tackle evolving benchmarks.[50][51] The SymReg workshop continued at GECCO 2025, featuring updates to SRBench benchmarking.[52]
Applications
Scientific Discovery
Symbolic regression has emerged as a powerful tool for scientific discovery by automatically inferring interpretable mathematical models from observational or simulated data, enabling the rediscovery of known physical laws and the formulation of novel hypotheses without relying on preconceived functional forms. This data-to-equation pipeline typically involves fitting candidate expressions to data while prioritizing parsimony—favoring simpler models that generalize well—to generate testable scientific hypotheses. By balancing complexity and fidelity, symbolic regression facilitates hypothesis generation in domains where traditional modeling is hindered by incomplete knowledge or high-dimensional data.[53]
In physics, symbolic regression has successfully rediscovered fundamental equations from simulated datasets, such as Newton's law of universal gravitation, F = -G \frac{M_1 M_2}{r^2}, derived from solar system ephemeris data spanning decades.[54] The AI Feynman project, a physics-inspired symbolic regression method, achieved a 100% success rate in solving 100 equations from the Feynman Lectures on Physics, including gravitational laws and fluid dynamics principles, by combining neural network approximations with techniques like dimensional analysis and recursion to simplify expressions. This approach has also approximated solutions to complex partial differential equations, such as those in the Navier-Stokes framework for fluid flow, from simulation data, demonstrating its utility in validating and extending theoretical models.[28][39]
In biology and chemistry, symbolic regression infers models of complex dynamics, such as reaction kinetics in bioprocesses and gene regulatory networks. For instance, it has automated the discovery of analytical kinetic rate models for metabolic pathways without predefined structures, revealing stoichiometric dependencies and rate laws from experimental time-series data. In gene regulation, methods like LogicGep use symbolic regression to infer Boolean functions describing network interactions, accurately reconstructing regulatory relations in synthetic and biological datasets. A notable application in materials science involved deriving explicit equations for alloy properties, such as hardness and fatigue strength, from compositional and processing data, aiding the design of high-entropy alloys with targeted performance.[55][56][57]
Recent advances highlight symbolic regression's role in interpretable discovery within power systems, where 2024 reviews emphasize its application to deriving parsimonious models for dynamics like inverter stability and grid synchronization from measurement data. These works underscore the method's ability to produce transparent equations—such as sparse representations of nonlinear oscillations—with high accuracy (e.g., R^2 > 0.999) and low computational overhead, fostering hypothesis-driven insights into system behavior under uncertainty. Overall, such developments reinforce symbolic regression as a cornerstone for automated scientific inference across disciplines. In 2025, extensions of AI Feynman have been applied to rediscover astronomical relations, such as the Lunar Equation of the Centre from ephemeris data.[10][58]
Engineering and Real-World Uses
Symbolic regression finds practical applications in engineering domains where interpretable models are essential for optimization and deployment in real-world systems. In control systems, particularly within aerospace, it has been used to derive empirical transfer functions for nonlinear plants, enabling the design of controllers that outperform traditional methods in stability and performance. For example, a 2021 study introduced an epigenetic linear genetic programming approach to symbolic regression for controller design, applied to systems like inverted pendulums—relevant to aerospace attitude control—achieving robust closed-loop performance without extensive simulations.[59] NASA has leveraged symbolic regression for prognostics in aerospace components, such as predicting remaining useful life in turbofan engines from sensor data, using genetic programming to evolve predictive equations that improved fault detection accuracy over baseline models.[60] Additionally, symbolic regression models aerodynamic deviations between ground wind tunnel tests and flight data for aerospace vehicles, yielding concise equations that enhance prediction accuracy, with over 80% improvement compared to baseline models in some cases.[61]
In finance and the energy sector, symbolic regression aids in modeling complex dynamics for forecasting and optimization. For stock dynamics, it recovers nonlinear equations from financial datasets, such as asset pricing models for bonds and equities.[62][63] In renewable energy, a 2025 review highlights its integration with deep learning for solar power output forecasting, where symbolic regression derived parsimonious models from photovoltaic data, outperforming standalone neural networks in accuracy for grid integration and economic dispatch.[64] For wind power, symbolic regression has modeled farm behaviors under low-voltage conditions, correcting simulation errors and supporting stability analysis in power grids with high renewable penetration.[64]
In manufacturing, physics-informed symbolic regression has been applied to tool wear prediction and remaining useful life estimation in machining processes, integrating recursive modeling to handle dynamic sensor inputs and achieve lower prediction errors than black-box alternatives.[65] For instance, symbolic regression has been used to develop soft sensors that predict product quality metrics with mean absolute percentage errors around 8-13%, identifying key sensor relationships for maintenance scheduling.[66]
Notable case studies from the 2010s demonstrate symbolic regression's impact on engineering optimization. Eureqa, a genetic programming-based tool, was used to derive high-correlation models for small-scale Magnus wind turbine power coefficients, processing blade element momentum theory data to optimize performance equations with improved accuracy over empirical fits.[67] However, deploying symbolic regression in industrial big data environments reveals scalability challenges, as traditional methods struggle with high-dimensional datasets due to computational complexity, prompting recent advances in unbiased search algorithms to handle larger problem instances efficiently.
Software and Implementation
Open-Source Tools
One prominent open-source library for symbolic regression is gplearn, a Python package that implements genetic programming tailored for discovering mathematical expressions from data. It provides a scikit-learn-compatible interface, allowing seamless integration into machine learning pipelines via methods like fit and predict, and supports the definition of custom functions and operators to extend the search space beyond standard arithmetic operations.[68]
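A minimal usage sketch in the style of gplearn's documented scikit-learn interface is shown below; the dataset, hyperparameter values, and the custom protected-log operator are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor
from gplearn.functions import make_function

# Custom operator added to the search space (protected so it never returns NaN/inf).
def _plog(x):
    return np.log(np.abs(x) + 1e-6)
plog = make_function(function=_plog, name="plog", arity=1)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1])

est = SymbolicRegressor(
    population_size=1000,
    generations=20,
    function_set=("add", "sub", "mul", "div", "sin", plog),
    parsimony_coefficient=0.001,   # penalizes bloated programs
    random_state=0,
)
est.fit(X, y)
print(est._program)       # best evolved expression
print(est.predict(X[:5]))
```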
PySR is another widely used open-source framework, available in both Python and Julia, designed for high-performance symbolic regression with an emphasis on speed and interpretability in scientific applications. Introduced in 2020, it employs regularized evolution and simulated annealing to efficiently explore expression spaces, and features differentiable outputs compatible with machine learning libraries such as JAX and PyTorch for tasks like symbolic distillation of neural networks; the library remains actively updated, with enhancements through 2025.[69][70]
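A comparable sketch of typical PySR usage follows; parameter names reflect recent PySR releases, and the dataset and settings are illustrative, so the current documentation should be consulted for details.

```python
import numpy as np
from pysr import PySRRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1])

model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["sin", "exp"],
    model_selection="best",   # choose from the accuracy/complexity Pareto front
)
model.fit(X, y)
print(model.sympy())          # best discovered expression as a SymPy object
```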
DEAP (Distributed Evolutionary Algorithms in Python) serves as a versatile open-source evolutionary computation framework that can be adapted for symbolic regression through its genetic programming primitives. It includes built-in examples for symbolic regression problems, enabling users to evolve expression trees using arbitrary data distributions and fitness functions, though it requires custom setup for full symbolic regression workflows.[71][72]
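The custom setup that DEAP requires is illustrated by the condensed sketch below, modeled on DEAP's documented symbolic regression example; the primitive set, toy target, and evolutionary parameters are illustrative assumptions.

```python
import operator, math, random
from deap import algorithms, base, creator, gp, tools

# Primitive set over one input variable: binary arithmetic plus sine.
pset = gp.PrimitiveSet("MAIN", 1)
pset.addPrimitive(operator.add, 2)
pset.addPrimitive(operator.sub, 2)
pset.addPrimitive(operator.mul, 2)
pset.addPrimitive(math.sin, 1)
pset.renameArguments(ARG0="x")

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=1, max_=3)
toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.expr)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("compile", gp.compile, pset=pset)

samples = [i / 10.0 for i in range(-30, 31)]
targets = [x ** 2 + math.sin(x) for x in samples]

def eval_mse(individual):
    func = toolbox.compile(expr=individual)
    try:
        err = sum((func(x) - t) ** 2 for x, t in zip(samples, targets)) / len(samples)
    except (OverflowError, ValueError):
        err = float("inf")   # numerically invalid programs are heavily penalized
    return (err,)

toolbox.register("evaluate", eval_mse)
toolbox.register("select", tools.selTournament, tournsize=3)
toolbox.register("mate", gp.cxOnePoint)
toolbox.register("expr_mut", gp.genFull, min_=0, max_=2)
toolbox.register("mutate", gp.mutUniform, expr=toolbox.expr_mut, pset=pset)
toolbox.decorate("mate", gp.staticLimit(key=operator.attrgetter("height"), max_value=10))
toolbox.decorate("mutate", gp.staticLimit(key=operator.attrgetter("height"), max_value=10))

random.seed(0)
pop = toolbox.population(n=300)
pop, _ = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=20, verbose=False)
print(tools.selBest(pop, 1)[0])   # best evolved expression tree
```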
Operon is an efficient C++ framework focused on large-scale genetic programming for symbolic regression, utilizing a linear tree encoding to generate interpretable mathematical models from regression targets. It achieved strong performance in the 2022 GECCO Symbolic Regression Competition, ranking highly across synthetic and real-world benchmarks, and offers Python bindings for broader accessibility.[73][74][75]
In contrast to these developer-oriented open-source options, commercial software like Eureqa provides proprietary graphical interfaces for similar tasks.