Symbolic regression
Symbolic regression is a type of regression analysis that searches the space of mathematical expressions to identify the model that best fits a given dataset, discovering interpretable formulas that describe relationships between variables without assuming a predefined functional form.[1][2] Unlike traditional regression methods, which optimize parameters within a fixed equation structure, symbolic regression simultaneously evolves both the structure and parameters of the expression, often producing concise symbolic representations such as polynomials or nonlinear functions.[3] This approach emphasizes interpretability and generalizability, making it particularly valuable for scientific discovery where understanding underlying mechanisms is crucial.[2]
The technique originated in the late 1980s and early 1990s through the work of John Koza, who integrated it with genetic programming—a subset of evolutionary algorithms inspired by natural selection—to automatically generate and refine computer programs representing mathematical functions.[4] In genetic programming for symbolic regression, candidate expressions are represented as tree structures, with operations like crossover (combining subtrees) and mutation (altering nodes) driving the evolutionary search toward expressions that minimize error on training data.[1] Early applications focused on benchmark problems, such as fitting synthetic functions, but the method has since expanded due to its ability to uncover novel relationships in real-world data.[3]
Symbolic regression has found broad applications across scientific domains, including physics, where it has rediscovered laws like Kepler's third law from planetary data, and materials science, where it has been used to derive constitutive equations.[5] In astrophysics, it has been used to identify scaling relations in galaxy properties and models for exoplanet transit spectroscopy.[1] More recent advances incorporate deep learning techniques, such as transformer-based models and reinforcement learning, to improve efficiency and handle larger datasets, addressing the inherent computational challenges of the NP-hard search space.[2] Despite these developments, symbolic regression remains computationally intensive, often requiring specialized hardware or approximations to scale effectively.[3]
Introduction
Definition and Objectives
Symbolic regression (SR) is an automated machine learning technique that searches for both the structure and parameters of mathematical models to fit given input-output data pairs, unlike traditional parametric regression methods that presuppose a fixed functional form such as linear or polynomial equations.[4][6] In SR, the goal is to discover symbolic expressions—combinations of mathematical operators and variables—that describe underlying relationships in the data without prior assumptions about the model's form.[7] This process typically employs evolutionary algorithms, such as genetic programming, to explore vast spaces of possible expressions.[4]
The primary objectives of symbolic regression are to identify interpretable and parsimonious expressions that generalize well to unseen data, minimize prediction error, and facilitate scientific discovery beyond mere forecasting.[6] Interpretability arises from producing human-readable formulas, such as y = x^2 + \sin(x), which reveal causal or physical relationships, while parsimony favors simpler models to avoid overfitting and enhance generalization.[8] Error minimization is commonly achieved through fitness functions like the mean squared error (MSE), defined as \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where y_i are observed targets, \hat{y}_i are predicted values, and n is the number of data points; this measures the average squared deviation to quantify model accuracy.[4] Overall, SR promotes insight into data-generating processes, particularly in fields like physics and engineering where exact equations are sought.[6]
At its core, symbolic regression involves input data consisting of independent variables X and dependent targets Y, a search space of operators (e.g., addition +, subtraction -, multiplication \times, division /, sine \sin, exponential \exp) combined with terminals (variables and constants), and fitness functions to evaluate expression quality.[4] The search space defines the building blocks for constructing expressions, often represented as tree structures in genetic programming implementations.[6] Originating from evolutionary computation paradigms in the late 1980s, symbolic regression has evolved into a key tool for model discovery.[4]
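To make the fitness evaluation concrete, the following minimal sketch scores a few hand-written candidate expressions by their MSE on a toy dataset; the data, candidate set, and helper names are illustrative rather than taken from any particular SR system, which would generate and refine such candidates automatically.

```python
import numpy as np

# Toy dataset: the hidden relationship is y = x**2 + sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=100)
y = X**2 + np.sin(X)

def mse(y_true, y_pred):
    """Mean squared error, the fitness measure defined above."""
    return np.mean((y_true - y_pred) ** 2)

# Candidate expressions built from operators {+, *, sin, exp} and the terminal x.
candidates = {
    "x**2 + sin(x)": lambda x: x**2 + np.sin(x),
    "x**2":          lambda x: x**2,
    "exp(x)":        lambda x: np.exp(x),
}

for formula, f in candidates.items():
    print(f"{formula:15s} MSE = {mse(y, f(X)):.4f}")
```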
Historical Background
The 1980s marked the emergence of symbolic regression through evolutionary computation, rooted in John Holland's genetic algorithms described in his 1975 book Adaptation in Natural and Artificial Systems, which introduced adaptation mechanisms for optimizing complex structures. John Koza extended this framework in the late 1980s, developing genetic programming (GP) to evolve executable representations of programs, with symbolic regression as a primary application for fitting mathematical models to data.[4] Koza's seminal 1992 book Genetic Programming: On the Programming of Computers by Means of Natural Selection popularized GP for symbolic problems, demonstrating its ability to discover nonlinear expressions like Boolean functions and time-series models. His contributions included influential patents, such as U.S. Patent 4,935,877 (1990) for nonlinear genetic algorithms in problem-solving.
In the 2000s, symbolic regression integrated with broader machine learning paradigms, enabling applications to empirical sciences through tools like the Eureqa software (2009), which applied GP to distill laws from noisy experimental data. Koza's annual Humies awards, launched in 2004, highlighted high-impact GP results, including symbolic regression achievements competitive with human designs.[9] The 2010s brought a shift toward scalability for big data, with methods addressing computational challenges in high-dimensional spaces; post-2015 innovations in deep symbolic regression combined neural guidance with evolutionary search to recover expressions from complex datasets.[10][11] In the 2020s, further advances as of 2025 have incorporated transformer-based models and large language models to enhance efficiency and interpretability, alongside new benchmarks for evaluating SR methods.[2][1]
Comparison with Traditional Regression
Key Differences
Symbolic regression fundamentally differs from classical regression techniques in its approach to model formulation: it simultaneously evolves both the structure and parameters of mathematical expressions, often represented as tree-based structures such as x^2 + \sin(y), without presupposing a specific functional form. In contrast, classical methods like ordinary least squares (OLS) regression assume a fixed model structure, such as the linear form y = \beta_0 + \beta_1 x, and optimize only the numerical parameters within that predefined framework.[6] This structural flexibility in symbolic regression stems from its roots in evolutionary computation, pioneered by Koza's genetic programming paradigm.
The optimization process in symbolic regression involves a global search across an expansive space of possible expressions, typically employing heuristic exploration to avoid local minima and discover novel functional relationships. Classical regression, however, relies on local optimization techniques, such as least squares minimization in OLS, which efficiently fit parameters but can fail to identify the underlying true model if the assumed form is incorrect.[6] This exploratory nature allows symbolic regression to uncover complex, nonlinear dependencies that classical approaches might overlook without extensive model specification.
A key output distinction is that symbolic regression yields explicit, human-readable symbolic equations that provide mechanistic insights into the data-generating process, whereas classical regression produces opaque numerical coefficients or black-box models that prioritize predictive accuracy over interpretability. For instance, symbolic regression has rediscovered nonlinear physical laws, such as Kepler's third law relating planetary orbital periods to semi-major axes, from raw astronomical data where linear regression would fail without a prior nonlinear form assumption.[12]
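The distinction can be illustrated with a short sketch (the data and candidate forms are invented for illustration): ordinary least squares fits the coefficients of a single fixed structure, while a symbolic-regression-style search also ranks competing structures, here by fitting a single scale constant to each candidate form.

```python
import numpy as np

# Data generated by a nonlinear law, y = 2 / x**2 (an inverse-square form).
rng = np.random.default_rng(1)
x = rng.uniform(0.5, 5.0, size=200)
y = 2.0 / x**2

# Classical regression: the structure y = b0 + b1*x is fixed in advance and only
# the coefficients are optimized (ordinary least squares).
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print("fixed linear form, MSE:", np.mean((A @ beta - y) ** 2))

# Symbolic-regression-style search: rank alternative structures, fitting the
# best scale constant c for each candidate form y = c * f(x).
structures = {"c*x": lambda v: v, "c/x": lambda v: 1.0 / v, "c/x**2": lambda v: 1.0 / v**2}
for name, f in structures.items():
    basis = f(x)
    c = np.sum(basis * y) / np.sum(basis**2)   # closed-form least-squares scale
    print(name, "MSE:", np.mean((c * basis - y) ** 2))
```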
Advantages and Limitations
Symbolic regression offers significant advantages in scientific discovery due to its ability to produce highly interpretable mathematical expressions that reveal underlying mechanisms in data. Unlike black-box models such as neural networks, it generates explicit formulas, such as the Coulomb force law F = k \frac{q_1 q_2}{r^2}, which can be directly understood and verified by domain experts.[13] This interpretability facilitates hypothesis generation and validation in fields like physics, where symbolic regression has rediscovered conservation laws, including those governing simple harmonic oscillators and chaotic double pendula, from raw experimental data without prior assumptions about the system's form.[14] Another key benefit is automatic feature engineering, as the method evolves both the structure and coefficients of expressions, potentially uncovering nonlinear interactions that manual feature selection might overlook. For instance, in dynamical systems like the nonlinear pendulum, techniques integrated with symbolic regression can derive reduced representations such as \ddot{z} = -0.99 \sin z.[13] Furthermore, symbolic regression is robust to unknown functional forms relative to traditional parametric methods, which rely on predefined equations; it explores a broad space of compositions from a function library, enabling discovery even when the true model deviates from standard assumptions.
Despite these strengths, symbolic regression faces notable limitations, primarily its computational expense stemming from an NP-hard search space that grows exponentially with expression complexity.[15] This makes it resource-intensive for large datasets or intricate models, often requiring significantly more time and hardware than faster parametric alternatives. Additionally, without proper controls, it risks overfitting by favoring overly complex expressions that memorize noise rather than generalize, a challenge addressed through mechanisms like Pareto fronts that trade off accuracy and simplicity. Symbolic regression is also sensitive to data noise, as perturbations can lead to spurious terms in evolved expressions, particularly in low-signal environments. To mitigate these drawbacks, practitioners rely on multi-objective optimization, simultaneously minimizing fitting error and expression length or complexity—analogous to the complexity penalties in criteria like the Akaike Information Criterion (AIC). Methods that use Pareto fronts rank solutions to select parsimonious models that balance fidelity and interpretability, enhancing generalization in noisy or sparse data scenarios.[13]
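The accuracy–complexity trade-off described above can be sketched with a simple Pareto-front filter; the candidate expressions, node counts, and error values below are invented for illustration.

```python
def pareto_front(models):
    """Keep models not dominated in (error, complexity); lower is better for both."""
    front = []
    for m in models:
        dominated = any(
            o["error"] <= m["error"] and o["complexity"] <= m["complexity"]
            and (o["error"] < m["error"] or o["complexity"] < m["complexity"])
            for o in models
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda m: m["complexity"])

# Hypothetical candidates; complexity counts nodes in the expression tree.
candidates = [
    {"expr": "x",                         "complexity": 1,  "error": 0.90},
    {"expr": "sin(x)",                    "complexity": 2,  "error": 0.95},  # dominated by "x"
    {"expr": "x**2",                      "complexity": 3,  "error": 0.30},
    {"expr": "x**2 + sin(x)",             "complexity": 6,  "error": 0.02},
    {"expr": "x**2 + sin(x) + 0.01*x**3", "complexity": 12, "error": 0.019},
]
for m in pareto_front(candidates):
    print(m["expr"], m["error"], m["complexity"])
```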
Core Methods
Genetic Programming Approaches
Genetic programming (GP) represents a foundational approach to symbolic regression, evolving populations of computer programs represented as tree structures to discover mathematical expressions that best fit given data. In this paradigm, each individual in the population is an expression tree where internal nodes denote functions or operators, and leaf nodes represent variables or constants. The evolutionary process begins with the random initialization of a population of such trees, typically using methods like the ramped half-and-half technique to ensure diversity in tree sizes and shapes.
Selection, crossover, and mutation operators drive the evolution across generations. Tournament selection is commonly employed to choose parents based on fitness, favoring individuals that minimize the error between the evolved expression and target data points, often measured by mean squared error or a similar regression metric adapted for symbolic forms. Crossover swaps subtrees between two parent trees to generate offspring, while mutation replaces a randomly selected subtree with a new one, introducing variation while preserving viable structures. These operations are tailored for symbolic regression by defining a function set including arithmetic operators (e.g., +, -, *, /) and sometimes transcendental functions (e.g., sin, exp) to allow discovery of nonlinear and complex expressions.
A key adaptation in GP for symbolic regression is the use of fitness functions that penalize both approximation error and expression complexity to combat code bloat, where populations tend to grow unnecessarily large over generations. Parsimony pressure, a common technique, incorporates a penalty proportional to tree size into the fitness evaluation, such as adding a small constant times the number of nodes to the error term, thereby favoring simpler models without sacrificing accuracy. This helps maintain computational efficiency and interpretability in the evolved expressions.[16]
John Koza's standard GP algorithm, introduced in 1992, established the canonical framework for these methods, demonstrating their efficacy on symbolic regression tasks like fitting quartic polynomials or trigonometric functions through iterative evolution over hundreds of generations. Variants like strongly typed GP extend this by enforcing type constraints during tree construction and genetic operations, ensuring semantically valid expressions—for instance, restricting addition to numeric types only—which reduces invalid offspring and accelerates convergence on well-typed regression problems.[17]
The tree representation can be formalized as a directed acyclic graph, but in practice it is a binary or n-ary tree whose root evaluates to the full expression. For example, the expression x + \sin(y) is represented by a tree with + at the root, the variable x as one child, and the subtree \sin(y) as the other. Mutation might replace the \sin(y) subtree with, say, e^z, yielding x + e^z, altering the functional form while adhering to the type system if strongly typed.
Early implementations facilitated practical adoption of GP for symbolic regression. Lil-GP, developed in the mid-1990s, provided an efficient C-based library for tree-based GP, supporting features like ephemeral random constants for numerical terminals and applications to benchmark regression problems such as quadratic formula discovery.[18]
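The evolutionary loop described in this section can be condensed into a small, self-contained sketch. For brevity it uses mutation-only variation and truncation selection on nested-list expression trees, rather than Koza-style subtree crossover and tournament selection, and the operator set, parsimony coefficient, and toy target are illustrative assumptions.

```python
import random, math, copy

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "sin": lambda a: math.sin(a)}
TERMINALS = ["x", 1.0]

def random_tree(depth=3):
    """Grow a random expression tree: internal nodes are operators, leaves are terminals."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    op = random.choice(list(OPS))
    arity = 1 if op == "sin" else 2
    return [op] + [random_tree(depth - 1) for _ in range(arity)]

def evaluate(node, x):
    if node == "x":
        return x
    if isinstance(node, float):
        return node
    return OPS[node[0]](*(evaluate(child, x) for child in node[1:]))

def size(node):
    return 1 if not isinstance(node, list) else 1 + sum(size(c) for c in node[1:])

def fitness(tree, data, parsimony=0.01):
    """Mean squared error plus a parsimony-pressure penalty proportional to tree size."""
    err = sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)
    return err + parsimony * size(tree)

def mutate(tree):
    """Replace a randomly chosen subtree with a freshly grown one."""
    if not isinstance(tree, list) or random.random() < 0.2:
        return random_tree(2)
    new = copy.deepcopy(tree)
    i = random.randrange(1, len(new))
    new[i] = mutate(new[i])
    return new

# Toy target y = x**2 + sin(x); evolve with truncation selection and mutation only
# (subtree crossover between two parents would swap subtrees in the same way).
data = [(x / 10.0, (x / 10.0) ** 2 + math.sin(x / 10.0)) for x in range(-30, 31)]
population = [random_tree() for _ in range(200)]
for generation in range(30):
    population.sort(key=lambda t: fitness(t, data))
    survivors = population[:50]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(150)]

best = min(population, key=lambda t: fitness(t, data))
print(best, fitness(best, data))
```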
Other Evolutionary Techniques
Beyond standard genetic programming, which typically employs tree-based representations for evolving mathematical expressions, other evolutionary techniques adapt alternative encodings and search mechanisms to address challenges in symbolic regression, such as computational efficiency and the generation of valid expressions. Linear genetic programming (LGP) represents programs as sequences of instructions or machine-code-like operations, contrasting with tree structures by enabling direct compilation and faster execution during fitness evaluation.[19] This linear encoding facilitates efficient handling of large datasets, where LGP has demonstrated superior performance and simpler solutions compared to traditional genetic programming in symbolic regression tasks.[19] For instance, LGP's instruction-based approach reduces overhead in interpreting complex trees, leading to efficiency gains in evolving regression models on extensive data.[20]
Grammatical evolution (GE) employs a linear genome that maps to valid expressions via a Backus-Naur Form (BNF) grammar, constraining the search space to produce only syntactically correct mathematical formulas and thereby minimizing invalid outputs (a minimal sketch of this genotype-to-phenotype mapping appears at the end of this section).[21] Introduced by Ryan, Collins, and O'Neill in 1998, GE decouples the genotype from the phenotype, allowing flexible grammar definitions to enforce domain-specific constraints in symbolic regression.[21] This method enhances the reliability of evolved expressions by avoiding the need for repair mechanisms common in unconstrained evolutionary searches.[22]
Estimation of distribution algorithms (EDAs) replace traditional crossover and mutation with probabilistic models that estimate the distribution of promising solutions and sample new individuals accordingly, guiding the search more explicitly in symbolic regression.[23] In this paradigm, EDAs build statistical models—such as multivariate normal distributions—from selected individuals to generate offspring, which can improve convergence on complex expression spaces by capturing dependencies among variables.[23] Applications in symbolic regression, as explored in works combining EDAs with grammar-guided evolution, demonstrate their utility in reducing parameter tuning while effectively sampling valid mathematical structures.[24]
Particle swarm optimization (PSO) adaptations in symbolic regression often focus on continuous optimization of numerical coefficients within evolved expression structures, treating parameters as particle positions in a search space updated via velocity and social influences.[25] This swarm-based approach complements structure discovery by iteratively refining constants post-evolution, leveraging global and local best-known solutions to minimize fitting errors efficiently.[25] Hybrid integrations, such as PSO-tuned genetic programming models, have been applied to enhance prediction accuracy in regression scenarios by optimizing coefficients after initial expression generation.[26]
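The grammatical evolution mapping referenced above can be sketched as follows; the grammar, the integer genome, and the wrapping limit are illustrative assumptions rather than the settings of any published GE system.

```python
# Production rules in (simplified) BNF: each nonterminal maps to a list of alternatives.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>":   [["+"], ["-"], ["*"]],
    "<var>":  [["x"], ["y"], ["1.0"]],
}

def decode(genome, start="<expr>", max_wraps=2):
    """Expand the start symbol left to right, choosing productions codon-modulo-rules."""
    output, stack, i = [], [start], 0
    budget = len(genome) * (max_wraps + 1)      # allow the genome to be reused (wrapped)
    while stack and i < budget:
        symbol = stack.pop(0)
        if symbol not in GRAMMAR:               # terminal symbol: emit it
            output.append(symbol)
            continue
        rules = GRAMMAR[symbol]
        choice = rules[genome[i % len(genome)] % len(rules)]
        i += 1
        stack = list(choice) + stack            # expand the leftmost nonterminal first
    return " ".join(output)

genome = [2, 3, 12, 5, 9, 2, 11, 4]             # linear integer genome (codons)
print(decode(genome))                           # -> "x * 1.0" for this genome
```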
Advanced Methods
Machine Learning Integrations
Symbolic regression has increasingly integrated machine learning techniques to enhance the efficiency and accuracy of expression discovery, addressing the combinatorial explosion inherent in traditional search methods. Neural networks, in particular, guide the exploration of expression spaces by proposing candidate structures or priors, reducing reliance on purely random or evolutionary sampling.[11]
One prominent approach is neural-guided symbolic regression, where recurrent neural networks (RNNs), such as gated recurrent units (GRUs), generate sequences of mathematical operations to form expressions. In the Deep Symbolic Regression (DSR) framework, an RNN emits a distribution over tractable mathematical expressions, trained via reinforcement learning with risk-seeking policy gradients to prioritize high-reward, low-complexity solutions. This method samples expressions autoregressively, with the RNN conditioned on input features and previously generated tokens, enabling recovery of ground-truth equations from noisy data in benchmarks like the Feynman equations. A softmax layer is applied to the RNN's final hidden state to obtain probabilities over production rules, such as operator selection: p(o_t | o_{<t}, x) = \text{softmax}(W_o \cdot h_t + b_o), where o_t is the token at step t, h_t is the hidden state, and W_o, b_o are learned parameters, allowing prioritized selection of operators like addition or multiplication based on learned priors.[27]
Complementing this, the AI Feynman method employs neural networks for initial fitting of complex functions, followed by recursive symbolic simplification informed by physics-inspired priors like dimensional analysis. A feedforward neural network approximates the target function from data, after which symbolic regression refines subproblems by substituting learned constants and reducing dimensionality, successfully solving all 100 equations in the Feynman benchmark, many of which had stumped earlier genetic programming approaches.[28]
Hybrid models combine symbolic regression with gradient boosting machines to leverage the latter's strength in capturing non-linear patterns for initial approximations, followed by symbolic refinement for interpretability. For instance, interpretable variants of eXtreme Gradient Boosting (XGBoost) preprocess data to identify key interactions, guiding the symbolic search toward parsimonious expressions in tasks like materials modeling. Tools evolving from Eureqa, such as PySR, incorporate gradient boosting-inspired library tuning and regularization to accelerate convergence on real-world datasets.[29][30]
Reinforcement learning variants further advance operator sequence exploration, treating expression building as a sequential decision process. RL agents, optimized via policy gradients, learn to construct expressions by rewarding fits that balance accuracy and complexity, as in extensions of DSR where deep policies navigate vertical (multi-output) regression spaces. These approaches, prominent in 2020s research, use actor-critic methods to sample operator sequences, outperforming vanilla genetic programming on large-scale symbolic benchmarks by focusing search on promising substructures.[31][32]
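The autoregressive sampling loop shared by DSR-style and reinforcement-learning approaches can be sketched as follows; the token library, the arity bookkeeping, and the stand-in random "policy" are illustrative assumptions, with a trained RNN supplying the logits in a real system.

```python
import numpy as np

# Token library: operators with their arities, plus terminal symbols.
LIBRARY = {"add": 2, "mul": 2, "sin": 1, "x": 0, "const": 0}
TOKENS = list(LIBRARY)
rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_expression(policy_logits, max_len=12):
    """Sample a prefix-notation expression token by token.

    policy_logits(prefix) stands in for the RNN: it maps the tokens generated so
    far to unnormalized scores over the library (here it is simply random).
    """
    tokens, open_slots = [], 1              # number of unfilled operator arguments
    while open_slots > 0 and len(tokens) < max_len:
        probs = softmax(policy_logits(tokens))
        tok = str(rng.choice(TOKENS, p=probs))
        tokens.append(tok)
        open_slots += LIBRARY[tok] - 1      # an arity-k token fills one slot, opens k
    return tokens                           # a full system would also handle truncation

expr = sample_expression(lambda prefix: rng.normal(size=len(TOKENS)))
print(expr)   # e.g. ['add', 'x', 'sin', 'const'], meaning add(x, sin(const))
```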
Recent Innovations
Recent innovations in symbolic regression have increasingly incorporated human expertise and domain-specific knowledge to enhance interpretability and accuracy, particularly through interactive and physics-informed approaches. A notable advancement is interactive symbolic regression, which enables human-in-the-loop co-design where users provide feedback to refine mathematical expressions iteratively. In a 2025 study published in Nature Communications, researchers introduced the Symbolic Q-network, an offline reinforcement learning framework that allows human experts to collaborate with the algorithm in real-time discovery tasks, modifying expressions based on discrepancies between data and predictions to achieve more tailored models for complex systems. This mechanism has demonstrated improved convergence in scenarios requiring rapid adaptation, such as engineering design optimization.[1]
Physics-informed symbolic regression has emerged as a post-2023 paradigm that embeds domain constraints, like conservation laws, directly into the fitness function to guide expression evolution toward physically plausible solutions (a schematic constraint-penalized fitness is sketched at the end of this section). For instance, a 2023 framework, Φ-SO, leverages deep learning to recover analytical expressions from physics data while enforcing unit constraints, outperforming traditional methods on noisy datasets from astrophysical simulations. Building on this, a 2024 integration of symbolic regression with physics-informed neural networks (PINNs) has been applied to nonlinear dynamics, such as fluid flow equations, where the hybrid approach extracts interpretable terms while satisfying governing partial differential equations. More recent work in 2025, including StruSR, uses structure-aware PINNs to incorporate prior knowledge of equation forms, enhancing scalability for high-dimensional problems like soil constitutive modeling.[33][34][35] These advances prioritize dimensional consistency and constraint satisfaction, reducing the search space and improving generalization in scientific applications.
Scalable deep symbolic regression has gained traction with transformer-based models designed for large-scale expression generation, addressing the limitations of exhaustive search in high-dimensional spaces. The 2024 SymFormer architecture, an end-to-end transformer, simultaneously predicts symbols and constants from data, enabling efficient handling of diverse datasets without predefined grammars and showing competitive performance on benchmarks like Nguyen and Keane functions. A comprehensive 2025 ACM Computing Surveys review highlights how these models, trained via supervised learning on synthetic expressions, facilitate zero-shot generalization to unseen problems, with applications in dynamical systems modeling. Such methods extend beyond classical evolutionary techniques by leveraging attention mechanisms for semantic similarity, though empirical comparisons in a 2024 arXiv preprint indicate that traditional genetic programming implementations, like Operon, often outperform newer transformer-based approaches in terms of exact recovery rates on standard tasks.[36][2][37]
Benchmarking efforts have also evolved, with the SRBench framework receiving significant updates in 2024 and 2025 to incorporate emerging algorithms and datasets. These revisions expanded the evaluated methods to over 25 systems, including recent deep learning hybrids, and introduced new metrics for assessing generalization across noise levels and extrapolation, fostering standardized comparisons in the community.
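As a schematic of the constraint-penalized fitness referenced above, the following sketch adds an energy-conservation penalty (for an idealized frictionless pendulum) to the usual data misfit; the pendulum setup, weighting, and function names are illustrative assumptions, not taken from Φ-SO or the other cited frameworks. In a full SR system, theta_hat would come from evaluating a candidate symbolic expression on the time grid.

```python
import numpy as np

def constrained_fitness(theta_hat, t, theta_obs, length=1.0, g=9.81, weight=10.0):
    """Data misfit plus a penalty for violating energy conservation (illustrative)."""
    data_error = np.mean((theta_hat - theta_obs) ** 2)
    # For an ideal frictionless pendulum, E = 0.5*(L*dtheta/dt)^2 + g*L*(1 - cos(theta))
    # is constant in time, so its variance along the candidate trajectory is the penalty.
    dtheta = np.gradient(theta_hat, t)
    energy = 0.5 * (length * dtheta) ** 2 + g * length * (1.0 - np.cos(theta_hat))
    return data_error + weight * np.var(energy)

# Toy comparison: a near-physical candidate scores better than one whose energy grows.
t = np.linspace(0.0, 10.0, 500)
theta_obs = 0.1 * np.cos(np.sqrt(9.81) * t)
good = 0.1 * np.cos(np.sqrt(9.81) * t + 0.01)
bad = 0.1 * np.cos(np.sqrt(9.81) * t) * np.exp(0.2 * t)
print(constrained_fitness(good, t, theta_obs), constrained_fitness(bad, t, theta_obs))
```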
Explorations into quantum-inspired techniques remain nascent but show promise in preliminary 2024 studies for accelerating searches in combinatorial spaces.
Evaluation and Benchmarking
Standard Benchmarks
SRBench represents the most comprehensive and widely adopted benchmark suite for symbolic regression, introduced in 2021 by La Cava et al. as an open-source framework to evaluate modern methods against state-of-the-art machine learning approaches.[38] It encompasses 252 regression problems drawn from the Penn Machine Learning Benchmarks (PMLB), blending synthetic datasets—such as noisy polynomials and differential equation systems—and real-world data without known ground-truth expressions.[38] Synthetic examples include the Feynman equations from the Feynman Lectures on Physics, which test the recovery of known physical laws, and Strogatz ordinary differential equations modeling chaotic systems like the van der Pol oscillator.[38] The suite emphasizes reproducibility, with controlled experimental setups that compare expression accuracy and interpretability across diverse problem scales and noise conditions.[38]
To address varying difficulty levels, SRBench divides synthetic problems into easy and hard tracks based on noise: the easy track uses noise-free data, while the hard track incorporates Gaussian white noise at levels of 0.001, 0.01, and 0.1 relative to the signal's root mean square.[38] Evaluation protocols adopt a multi-objective approach, balancing predictive accuracy against expression simplicity to mimic real-world trade-offs in model selection.[38] Key metrics include the exact match rate, which measures the percentage of cases where the discovered expression symbolically matches the ground truth after simplification; normalized error, often computed as the mean squared error scaled by the variance of the target values to enable cross-dataset comparisons; and expression complexity, quantified by the number of nodes (operators, variables, and constants) in the simplified parse tree.[38] These metrics facilitate Pareto-optimal analysis, where methods are scored on fronts plotting low error against low complexity.[38]
Earlier benchmarks laid foundational groundwork for low-dimensional problems, such as those introduced by Schmidt and Lipson in 2009, which focused on distilling free-form equations from experimental data in physics and chemistry using small-scale, interpretable datasets. For large-scale evaluation, the Feynman Symbolic Regression Database (FSRD), developed by Udrescu and Tegmark in 2019, provides 100 high-dimensional physics-inspired equations derived from the Feynman Lectures, challenging methods on sparse, noisy data with up to thousands of terms.[39] It was expanded in AI Feynman 2.0 around 2020 to include multi-level fitting capabilities.[39] Both suites employ similar core metrics—exact match rate for ground-truth recovery, normalized error for fit quality, and node-based complexity for parsimony—to ensure consistent assessment.[39]
Recent updates to SRBench, including extensions in 2024 via SRBench++ with domain-specific datasets and further refinements in 2025 toward a next-generation benchmark that doubles the number of evaluated methods, incorporates improved metrics, visualizations, and analysis of trade-offs including energy consumption, while maintaining the core dataset of 252 problems with curated selections for specific tracks.[40][41] Additionally, LLM-SRBench, introduced in 2025, provides 239 challenging problems across physics, chemistry, biology, and engineering domains to assess large language model-based scientific equation discovery.[42] These evolutions maintain multi-objective protocols and prioritize verifiable ground truths for rigorous testing.
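The core metrics shared by these suites can be illustrated with a few helper functions; these are simplified stand-ins for the benchmarks' actual implementations (the symbol names and the use of SymPy here are assumptions for illustration).

```python
import numpy as np
import sympy as sp

def normalized_error(y_true, y_pred):
    """Mean squared error scaled by the variance of the targets."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def node_complexity(expr_str):
    """Node count (operators, variables, constants) of the simplified expression tree."""
    expr = sp.simplify(sp.sympify(expr_str))
    return len(list(sp.preorder_traversal(expr)))

def exact_match(found, truth):
    """Does the discovered expression simplify to the ground-truth expression?"""
    return sp.simplify(sp.sympify(found) - sp.sympify(truth)) == 0

print(node_complexity("x**2 + sin(x)"))     # 6 nodes: Add, Pow, x, 2, sin, x
print(exact_match("2*x + x", "3*x"))        # True
print(normalized_error([1, 2, 3], [1.1, 2.0, 2.9]))
```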
Such benchmarks are occasionally adapted for competitions, like the GECCO Symbolic Regression Challenge, to standardize comparisons across evolving algorithms.[38]
Competitions and Challenges
The SRBench Competition of 2022, held at the Genetic and Evolutionary Computation Conference (GECCO) in Boston, Massachusetts, represented a major effort to evaluate and advance symbolic regression algorithms through standardized benchmarking.[43] It featured two primary tracks: a synthetic track assessing rediscovery of known equations, feature selection, resistance to local optima, extrapolation, and noise sensitivity on clean and noisy data; and a real-world track focused on 14-day forecasts of COVID-19 cases, hospitalizations, and deaths in New York State, emphasizing accuracy, model simplicity, and trustworthiness.[43] Thirteen algorithms were submitted, with nine qualifying after validation; QLattice, developed by Abzu AI, won the synthetic track with a score of 6.23, while Unified Deep Symbolic Regression (uDSR), from a team at Lawrence Livermore National Laboratory led by Brenden Petersen, claimed victory in the real-world track with a score of 5.75.[43][44] The Operon framework, a genetic programming-based method with local search optimization, also demonstrated strong performance in the synthetic track, balancing accuracy and interpretability.[45][46]
Beyond SRBench, symbolic regression has been featured in ongoing workshops and challenges that foster innovation. The Genetic Programming Theory and Practice (GPTP) workshops, held annually since 2003 initially at the University of Michigan and later at Michigan State University, provide a venue for discussing theoretical and practical advances in genetic programming, including symbolic regression applications, through presentations and collaborations among researchers.[47][48] The AI Feynman dataset from the 2019 paper by Udrescu and Tegmark, consisting of 100 equations from the Feynman Lectures on Physics, has inspired subsequent benchmarks and methods for recovering exact equations from noisy data, with expansions like the 2022 SRSD benchmark using 120 recreated datasets.[28][49] This initiative highlighted the potential of hybrid neural-symbolic approaches for scientific equation discovery.[28]
These events address core challenges in symbolic regression, such as scalability to large datasets and ensuring interpretability of discovered expressions. In the 2022 SRBench synthetic track, top performers achieved low-error recovery of ground-truth equations, underscoring progress in handling noisy data and extrapolation.[43] The real-world track emphasized domain adaptation, with winners producing simple, trustworthy models for epidemiological forecasting.[44] Follow-up competitions, including the 2023 SRBench edition at GECCO with dedicated performance and interpretability tracks, and the 2024 Symbolic Regression Workshop (SymReg) at GECCO in Melbourne, Australia, incorporated recent innovations like deep learning integrations to tackle evolving benchmarks.[50][51] The SymReg workshop continued at GECCO 2025, featuring updates to SRBench benchmarking.[52]
Applications
Scientific Discovery
Symbolic regression has emerged as a powerful tool for scientific discovery by automatically inferring interpretable mathematical models from observational or simulated data, enabling the rediscovery of known physical laws and the formulation of novel hypotheses without relying on preconceived functional forms. This data-to-equation pipeline typically involves fitting candidate expressions to data while prioritizing parsimony—favoring simpler models that generalize well—to generate testable scientific hypotheses. By balancing complexity and fidelity, symbolic regression facilitates hypothesis generation in domains where traditional modeling is hindered by incomplete knowledge or high-dimensional data.[53]
In physics, symbolic regression has successfully rediscovered fundamental equations from simulated datasets, such as Newton's law of universal gravitation, F = -G \frac{M_1 M_2}{r^2}, derived from solar system ephemeris data spanning decades.[54] The AI Feynman project, a physics-inspired symbolic regression method, achieved a 100% success rate in solving 100 equations from the Feynman Lectures on Physics, including gravitational laws and fluid dynamics principles, by combining neural network approximations with techniques like dimensional analysis and recursion to simplify expressions. This approach has also approximated solutions to complex partial differential equations, such as those in the Navier-Stokes framework for fluid flow, from simulation data, demonstrating its utility in validating and extending theoretical models.[28][39]
In biology and chemistry, symbolic regression infers models of complex dynamics, such as reaction kinetics in bioprocesses and gene regulatory networks. For instance, it has automated the discovery of analytical kinetic rate models for metabolic pathways without predefined structures, revealing stoichiometric dependencies and rate laws from experimental time-series data. In gene regulation, methods like LogicGep use symbolic regression to infer Boolean functions describing network interactions, accurately reconstructing regulatory relations in synthetic and biological datasets. A notable application in materials science involved deriving explicit equations for alloy properties, such as hardness and fatigue strength, from compositional and processing data, aiding the design of high-entropy alloys with targeted performance.[55][56][57]
Recent advances highlight symbolic regression's role in interpretable discovery within power systems, where 2024 reviews emphasize its application to deriving parsimonious models for dynamics like inverter stability and grid synchronization from measurement data. These works underscore the method's ability to produce transparent equations—such as sparse representations of nonlinear oscillations—with high accuracy (e.g., R^2 > 0.999) and low computational overhead, fostering hypothesis-driven insights into system behavior under uncertainty. Overall, such developments reinforce symbolic regression as a cornerstone for automated scientific inference across disciplines. In 2025, extensions of AI Feynman have been applied to rediscover astronomical relations, such as the Lunar Equation of the Centre from ephemeris data.[10][58]
Engineering and Real-World Uses
Symbolic regression finds practical applications in engineering domains where interpretable models are essential for optimization and deployment in real-world systems. In control systems, particularly within aerospace, it has been used to derive empirical transfer functions for nonlinear plants, enabling the design of controllers that outperform traditional methods in stability and performance. For example, a 2021 study introduced an epigenetic linear genetic programming approach to symbolic regression for controller design, applied to systems like inverted pendulums—relevant to aerospace attitude control—achieving robust closed-loop performance without extensive simulations.[59] NASA has leveraged symbolic regression for prognostics in aerospace components, such as predicting remaining useful life in turbofan engines from sensor data, using genetic programming to evolve predictive equations that improved fault detection accuracy over baseline models.[60] Additionally, symbolic regression models aerodynamic deviations between ground wind tunnel tests and flight data for aerospace vehicles, yielding concise equations that enhance prediction accuracy, with over 80% improvement compared to baseline models in some cases.[61]
In finance and the energy sector, symbolic regression aids in modeling complex dynamics for forecasting and optimization. For stock dynamics, it recovers nonlinear equations from financial datasets, such as asset pricing models for bonds and equities.[62][63] In renewable energy, a 2025 review highlights its integration with deep learning for solar power output forecasting, where symbolic regression derived parsimonious models from photovoltaic data, outperforming standalone neural networks in accuracy for grid integration and economic dispatch.[64] For wind power, symbolic regression has modeled farm behaviors under low-voltage conditions, correcting simulation errors and supporting stability analysis in power grids with high renewable penetration.[64]
In manufacturing, physics-informed symbolic regression has been applied to tool wear prediction and remaining useful life estimation in machining processes, integrating recursive modeling to handle dynamic sensor inputs and achieve lower prediction errors than black-box alternatives.[65] For instance, symbolic regression has been used to develop soft sensors that predict product quality metrics with mean absolute percentage errors around 8-13%, identifying key sensor relationships for maintenance scheduling.[66]
Notable case studies from the 2010s demonstrate symbolic regression's impact on engineering optimization. Eureqa, a genetic programming-based tool, was used to derive high-correlation models for small-scale Magnus wind turbine power coefficients, processing blade element momentum theory data to optimize performance equations with improved accuracy over empirical fits.[67] However, deploying symbolic regression in industrial big data environments reveals scalability challenges, as traditional methods struggle with high-dimensional datasets due to computational complexity, prompting recent advances in unbiased search algorithms to handle larger problem instances efficiently.
Software and Implementation
Open-Source Tools
One prominent open-source library for symbolic regression is gplearn, a Python package that implements genetic programming tailored for discovering mathematical expressions from data. It provides a scikit-learn-compatible interface, allowing seamless integration into machine learning pipelines via methods like fit and predict, and supports the definition of custom functions and operators to extend the search space beyond standard arithmetic operations.[68]
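A minimal usage sketch in the style of gplearn's documented scikit-learn interface is shown below; the dataset, hyperparameter values, and the custom protected-log operator are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor
from gplearn.functions import make_function

# Custom operator added to the search space (protected so it never returns NaN/inf).
def _plog(x):
    return np.log(np.abs(x) + 1e-6)
plog = make_function(function=_plog, name="plog", arity=1)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1])

est = SymbolicRegressor(
    population_size=1000,
    generations=20,
    function_set=("add", "sub", "mul", "div", "sin", plog),
    parsimony_coefficient=0.001,   # penalizes bloated programs
    random_state=0,
)
est.fit(X, y)
print(est._program)       # best evolved expression
print(est.predict(X[:5]))
```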
PySR is another widely used open-source framework, available in both Python and Julia, designed for high-performance symbolic regression with an emphasis on speed and interpretability in scientific applications. Introduced in 2020, it employs regularized evolution and simulated annealing to efficiently explore expression spaces, and features differentiable outputs compatible with machine learning libraries such as JAX and PyTorch for tasks like symbolic distillation of neural networks; the library remains actively updated, with enhancements through 2025.[69][70]
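A comparable sketch of typical PySR usage follows; parameter names reflect recent PySR releases, and the dataset and settings are illustrative, so the current documentation should be consulted for details.

```python
import numpy as np
from pysr import PySRRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1])

model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["sin", "exp"],
    model_selection="best",   # choose from the accuracy/complexity Pareto front
)
model.fit(X, y)
print(model.sympy())          # best discovered expression as a SymPy object
```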
DEAP (Distributed Evolutionary Algorithms in Python) serves as a versatile open-source evolutionary computation framework that can be adapted for symbolic regression through its genetic programming primitives. It includes built-in examples for symbolic regression problems, enabling users to evolve expression trees using arbitrary data distributions and fitness functions, though it requires custom setup for full symbolic regression workflows.[71][72]
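The custom setup that DEAP requires is illustrated by the condensed sketch below, modeled on DEAP's documented symbolic regression example; the primitive set, toy target, and evolutionary parameters are illustrative assumptions.

```python
import operator, math, random
from deap import algorithms, base, creator, gp, tools

# Primitive set over one input variable: binary arithmetic plus sine.
pset = gp.PrimitiveSet("MAIN", 1)
pset.addPrimitive(operator.add, 2)
pset.addPrimitive(operator.sub, 2)
pset.addPrimitive(operator.mul, 2)
pset.addPrimitive(math.sin, 1)
pset.renameArguments(ARG0="x")

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=1, max_=3)
toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.expr)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("compile", gp.compile, pset=pset)

samples = [i / 10.0 for i in range(-30, 31)]
targets = [x ** 2 + math.sin(x) for x in samples]

def eval_mse(individual):
    func = toolbox.compile(expr=individual)
    try:
        err = sum((func(x) - t) ** 2 for x, t in zip(samples, targets)) / len(samples)
    except (OverflowError, ValueError):
        err = float("inf")   # numerically invalid programs are heavily penalized
    return (err,)

toolbox.register("evaluate", eval_mse)
toolbox.register("select", tools.selTournament, tournsize=3)
toolbox.register("mate", gp.cxOnePoint)
toolbox.register("expr_mut", gp.genFull, min_=0, max_=2)
toolbox.register("mutate", gp.mutUniform, expr=toolbox.expr_mut, pset=pset)
toolbox.decorate("mate", gp.staticLimit(key=operator.attrgetter("height"), max_value=10))
toolbox.decorate("mutate", gp.staticLimit(key=operator.attrgetter("height"), max_value=10))

random.seed(0)
pop = toolbox.population(n=300)
pop, _ = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=20, verbose=False)
print(tools.selBest(pop, 1)[0])   # best evolved expression tree
```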
Operon is an efficient C++ framework focused on large-scale genetic programming for symbolic regression, utilizing a linear tree encoding to generate interpretable mathematical models from regression targets. It achieved strong performance in the 2022 GECCO Symbolic Regression Competition, ranking highly across synthetic and real-world benchmarks, and offers Python bindings for broader accessibility.[73][74][75]
In contrast to these developer-oriented open-source options, commercial software like Eureqa provides proprietary graphical interfaces for similar tasks.