Stata
Stata is a general-purpose statistical software package developed by StataCorp LLC, providing integrated tools for data management, statistical analysis, visualization, and automated reporting across various platforms including Windows, macOS, and Unix.[1] First released in January 1985 as version 1.0, developed primarily by William Gould with assistance from Sean Becketti, it originated as a regression-focused tool in California before the company relocated to Texas in 1993 and the software evolved into a comprehensive data science platform.[2][3]
Over its four decades of development, Stata has emphasized reproducibility, speed, and ease of use, supporting both command-line and graphical user interfaces to accommodate users ranging from beginners to advanced researchers.[4] The software's latest major release, version 19 in April 2025, introduced enhancements in machine learning, Bayesian analysis, and multilingual support, building on continuous updates that ensure compatibility with modern computing needs.[2] StataCorp maintains extensive documentation, validation against benchmarks like those in NIST tests, and a publishing division for user-contributed resources, fostering a robust ecosystem for statistical computing.[3]
Widely adopted in academia and research, Stata is particularly prominent in fields such as economics, biomedicine, epidemiology, sociology, and political science, where it facilitates complex data manipulation, regression modeling, survival analysis, and publication-quality graphics.[5][6] Its syntax-driven approach allows for programmable workflows, while menu-based options enable intuitive exploration, making it suitable for teaching, policy analysis, and empirical studies across disciplines.[4] Unlike some open-source alternatives, Stata's proprietary development allows tightly integrated optimizations for large datasets and the long-panel data common in longitudinal research.[7]
Overview
Description and Purpose
Stata is a proprietary statistical software package developed by StataCorp LLC for general-purpose statistical analysis, simulation, regression, and graphics.[3] It serves as an integrated tool for data science tasks, encompassing data manipulation, visualization, statistical modeling, and automated reporting in a single environment.[4]
The primary purposes of Stata include facilitating data manipulation, econometric modeling, survey data analysis, and general statistical computing, with a particular emphasis on reproducibility and user-friendliness for fields such as social sciences, economics, and biostatistics.[8][9] Its design philosophy centers on a command-driven interface that supports scripting through do-files and log files, enabling precise replication of analyses across sessions and platforms.[10] This approach prioritizes efficiency in handling large datasets by storing data in memory as a structured "data rectangle," allowing rapid processing while maintaining consistency in syntax and operations.[10]
Stata's name derives from "statistics" and "data," reflecting its core focus on statistical computing with datasets, as a syllabic abbreviation coined by its creator to evoke an Italian sound.[11]
Key Features
Stata provides an integrated environment that supports the full data analysis workflow, including data import and export in various formats such as CSV, Excel, and SQL databases, data manipulation through commands for merging datasets, reshaping from wide to long formats, and generating derived variables.[4] This seamless integration extends to statistical testing, encompassing procedures like t-tests for means comparison, ANOVA for group differences, and confidence interval estimation, all accessible via intuitive syntax.[12] Additionally, Stata excels in producing publication-quality graphics, such as scatterplots, histograms, and advanced plots like ROC curves, which can be customized and exported in formats including PDF and SVG for direct use in academic papers or reports.[4]
A distinctive aspect of Stata is its emphasis on reproducibility and extensibility through do-file scripting, which allows users to record sequences of commands in text files that can be executed repeatedly to ensure consistent results across sessions or collaborators.[13] Complementing this, ado-file extensions enable the creation and distribution of custom commands as reusable programs, fostering a vast ecosystem of user-contributed tools available via the software's package manager.[14] Stata also offers robust built-in support for advanced econometric techniques, including panel data models via the xt suite of commands for fixed and random effects estimation, and instrumental variables regression through xtivreg for addressing endogeneity in longitudinal settings.[15]
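As a minimal sketch of this extensibility, the following ado-style program defines a hypothetical command (here called meanci, a name chosen purely for illustration) that wraps the built-in ci means command and returns its results; saved as meanci.ado on the ado-path, it would behave like a built-in command.

```stata
*! meanci.ado -- illustrative user-written command wrapping -ci means-
program define meanci, rclass
    version 19
    syntax varname [if] [in] [, Level(cilevel)]
    quietly ci means `varlist' `if' `in', level(`level')
    display as text "Mean of `varlist': " as result %9.3f r(mean)
    display as text "`level'% CI: [" as result %9.3f r(lb) as text ", " as result %9.3f r(ub) as text "]"
    return scalar mean = r(mean)
    return scalar lb   = r(lb)
    return scalar ub   = r(ub)
end
```

With a dataset loaded, for example the auto data shipped with Stata (sysuse auto), typing meanci mpg, level(90) would display the mean of mpg and its 90% confidence interval and leave the results in r().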
Recent enhancements in Stata 19, released in April 2025, have expanded its machine learning capabilities through integration with H2O for random forests and other ensemble methods for classification and regression tasks, alongside tools for conditional average treatment effects (CATE) and high-dimensional fixed effects (HDFE).[16] These updates also include advanced Bayesian analysis tools, such as Bayesian quantile regression and variable selection for linear models, enabling probabilistic inference in complex scenarios.[16] For efficiency, Stata employs in-memory processing with built-in data compression, allowing higher editions like Stata/MP to handle datasets comprising up to 20 billion observations on modern hardware, making it suitable for large-scale empirical research.[17]
Stata further supports usability through a graphical user interface option for point-and-click operations alongside its command-line interface.[4]
History
Origins and Early Development
Stata originated in the early 1980s amid the rise of personal computing, when William Gould and economist Finis Welch co-founded Computing Resource Center (CRC) in 1982 in Santa Monica, California. Initially focused on providing computing resources, CRC shifted toward software development as desktop computers became more accessible. Development of Stata began in January 1984, driven by the need for an affordable, user-friendly statistical package tailored to econometricians and social scientists frustrated with the high costs and complexity of mainframe-based tools like SAS and SPSS. Gould, leveraging the newly available Lattice C compiler for PCs, designed Stata to emphasize simplicity, extensibility, and a centralized command grammar inspired by systems like Wylbur, Unix, and CMS, allowing users to work with summary datasets efficiently.[18]
The first version of Stata was crafted primarily by Gould, with assistance from Sean Becketti in refining the design. It was announced at the American Economic Association meeting in Dallas in late 1984 and officially released in January 1985 for MS-DOS, featuring around 44 commands centered on basic regression analysis, summary statistics, and data management. This initial release targeted academic and professional users seeking a lightweight alternative to proprietary software, running on early IBM PCs and emphasizing ease of use for non-programmers while supporting custom extensions. The name "Stata" was coined by Gould as a blend of "stat" from statistics and "data," intended to evoke a fresh, non-acronymic identity that rhymed with "data" for memorability; early users sometimes mispronounced it as "STAT-A" due to associations with other tools like STAT/X.[2][11]
By the early 1990s, as Stata gained traction among researchers, CRC transitioned to a dedicated software focus. In 1993, the company was incorporated as Stata Corporation (later StataCorp LP) and relocated its headquarters to College Station, Texas, near Texas A&M University, where Finis Welch held a professorship and many early contributors were affiliated. This move marked Stata's evolution from a niche academic tool to a commercial enterprise, enabling expanded development while maintaining its roots in accessible statistical computing.[18][19][20]
Major Releases and Evolution
Stata has maintained a consistent schedule of biennial major releases since its inception in January 1985 with version 1.0, which was designed for DOS-based IBM PCs and focused on basic data management, descriptive statistics, and regression analysis using 44 core commands.[2][10] Subsequent versions evolved incrementally, with major updates approximately every two years, supplemented by free point releases (e.g., 16.1) that added features without requiring a full upgrade.[2] This rhythm accelerated in the 1980s with frequent minor updates but stabilized post-2000, allowing Stata to incorporate user feedback and technological advancements systematically.[10]
Early releases emphasized core statistical capabilities on limited hardware; for instance, Stata 2.0 (June 1988) introduced graphics, string variables, and Kaplan-Meier survival analysis, while Stata 3.0 (March 1992) expanded to logit, probit, heteroskedasticity-robust standard errors, and epidemiological tools like epitab.[10] By the mid-1990s, Stata shifted to cross-platform support, with version 4.0 (January 1995) marking the first Windows edition, followed by Unix and Macintosh compatibility, enabling broader accessibility beyond DOS.[10] Version 5.0 (September 1996) enhanced modeling commands, and Stata 6.0 (January 1999) added web-aware features for data import and updates.[10] Stata 7.0 (December 2000) advanced panel data and time-series tools, including the introduction of SMCL (Stata Markup and Control Language) for formatted output display.[21][10]
The 2000s brought significant interface and performance innovations: Stata 8.0 (January 2003) overhauled the user interface with a graphical dialog system and a new graphics engine supporting advanced plotting and time-series tools like VAR and SVAR.[10] Stata 9.0 (April 2005) introduced the Mata matrix programming language and xtmixed for multilevel mixed-effects models, enabling analysis of clustered data such as longitudinal studies.[10] Version 10.0 (June 2007) launched Stata/MP, leveraging multiprocessing on multicore systems for faster computations, alongside Graph Editor for interactive plotting, xtmelogit for binary multilevel outcomes, and millisecond-precision time-series support.[10] Stata 11.0 (July 2009) added factor variables for flexible model specification, multiple imputation for missing data, generalized method of moments (gmm), and unit-root tests for panels.[10] Stata 12.0 (July 2011) integrated structural equation modeling (SEM) with a dedicated suite, plus multilevel generalized linear models and advanced time-series like MGARCH.[10]
Later versions addressed modern data challenges: Stata 13.0 (June 2013) supported long strings (up to 2 billion characters), treatment-effects estimation (teffects), and unified multilevel commands under the me prefix.[10] Stata 14.0 (April 2015) introduced Bayesian analysis via bayesmh for Markov chain Monte Carlo estimation.[10] Version 15.0 (June 2017) extended regression models (e.g., for choice-based samples), latent class analysis, and automated reporting to Word and PDF with embedded results.[22] Stata 16.0 (June 2019) enabled multiple datasets in memory simultaneously, lasso and elastic net for model selection and prediction, meta-analysis tools, and initial Python integration via PyStata for bidirectional interoperability, with R connectivity expanded in subsequent updates.[23][24][25]
Stata 17.0 (April 2021) revamped table creation for flexible summaries, enhanced Bayesian panel models, and improved PyStata with Jupyter Notebook support.[26] Version 18.0 (April 2023) added heterogeneous treatment effects for difference-in-differences designs, local average treatment effects, and faster panel-data estimation like xtgls.[27] The most recent major release, Stata 19.0 (April 2025), incorporates machine learning and Bayesian enhancements such as Bayesian variable selection for linear models, Bayesian quantile regression, and predictive modeling tools including cross-validation and coefficient paths, alongside the StataNow continuous-release option, which delivers new features to subscribers between major releases.[2][16] These updates reflect Stata's adaptation to computational trends, prioritizing speed via multiprocessing, interoperability with languages like Python and R since version 16, and scalability for large datasets through features like multiple frames and cloud compatibility.[25][16]
Company and Organizational Growth
StataCorp LLC, the developer of the Stata statistical software package first released in 1985, has operated as a privately held company since its relocation and renaming in 1993. Headquartered in College Station, Texas, the organization emphasizes long-term stability through its focus on high-quality statistical tools for researchers, maintaining a lean structure that supports consistent innovation without public market pressures.[3]
Over the decades, StataCorp has experienced steady organizational expansion, growing from a small team in its early years to roughly 100-130 employees by the 2010s. This scaling reflects the company's increasing prominence in the statistical software sector, where it sustains operations through a dedicated workforce focused on software development, support, and user resources. Annual revenue estimates for StataCorp place it in the $10-100 million range as of 2025, underscoring its established market position without aggressive commercialization.[28][29][30]
To serve its global user base, StataCorp relies on a network of authorized international resellers and distributors rather than establishing its own overseas offices. For instance, Timberlake Consultants Ltd handles distribution in the UK, Ireland, France, Spain, Portugal, the Middle East, North Africa, Brazil, and Poland, enabling localized sales, training, and support. This distributor model has facilitated broader accessibility while keeping the core organization centralized in the United States.[31][32]
In the competitive statistical software industry, StataCorp positions Stata as a reliable alternative to open-source tools like R and Python, as well as established proprietary platforms such as SAS and SPSS. The company particularly emphasizes markets in academia, government, and non-profits, where it offers tailored licensing to promote adoption, including student discounts, Prof+ plans for qualified professionals, volume purchase reductions, and specialized options for educational institutions and public sector entities. These strategies have helped StataCorp cultivate loyalty among research-oriented users, differentiating it through ease of use and integrated functionality.[33][34][35]
As of 2025, StataCorp continues to invest in its core offerings, exemplified by the April release of Stata 19, which builds on four decades of refinements to meet evolving analytical demands in data science and econometrics.[3]
Technical Architecture
User Interface Options
Stata provides multiple user interface options to accommodate different workflows, from interactive command execution to visual data management. The primary interface is the command-line, accessed via the dot prompt (.), where users enter commands directly for immediate execution, such as . summarize mpg weight to generate descriptive statistics.[36] This mode supports interactive analysis and is essential for scripting through do-files, which are plain text files containing sequences of Stata commands (e.g., do myanalysis.do) for batch processing, automation, and reproducible research.[36] Do-files can be nested up to 64 levels and are recommended to begin with a version command to ensure compatibility across Stata releases.[36]
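A minimal do-file skeleton, with illustrative file names, might look like the following; the version statement at the top and the nested do call reflect the conventions described above.

```stata
* main.do -- a minimal do-file skeleton; file names are illustrative
version 19                          // interpret commands under Stata 19 rules for reproducibility
clear all
log using session_log, replace      // record commands and results
do prepare_data.do                  // do-files may call other do-files (nesting up to 64 levels)
summarize
log close
```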
For users preferring point-and-click interactions, Stata offers a graphical user interface (GUI), introduced in Stata 8 in 2003, which includes intuitive menus, dialogs, and toolbars for accessing data management, statistical analysis, and graphing features without writing code.[37] The GUI organizes functions into top-level menus like Data, Graphics, and Statistics, with associated dialog boxes that generate underlying commands for transparency and customization.[38] Key components include the Variables Manager, which allows editing of variable names, labels, and properties through a tabular view, and supports operations like renaming or recoding via dropdown menus.[36]
Stata's interface variants enhance workflow organization and data handling. The Project Manager, integrated into the GUI, enables users to bundle related do-files, datasets, logs, and other resources into a single project file (.stpr) for easy navigation and sharing, ideal for complex analyses involving multiple files.[39] The Data Editor provides a spreadsheet-like environment for viewing, entering, and editing data in memory, with modes for browsing (read-only) or editing, and features like cell tooltips for truncated text and pinnable rows/columns for focused inspection.[40] Accessed via Data > Data Editor, it updates in real-time as commands execute, facilitating interactive data exploration.[38]
Accessibility and usability are supported across interfaces through keyboard shortcuts (e.g., F1 for help, Page Up/Down for command history recall, Tab for auto-completion of variable names), customizable function keys, and resizable windows.[36] Users can tailor toolbars and layouts via preferences, and the Do-file Editor includes syntax highlighting and error checking for efficient scripting.[38] Stata maintains cross-platform consistency on Windows, macOS, and Unix/Linux, with uniform command syntax and file handling (e.g., forward slashes for paths on non-Windows systems), ensuring seamless transitions between operating systems.[36]
Data Structure and Management
Stata organizes data in memory as a rectangular table consisting of observations (rows) and variables (columns), where each cell contains a numeric or string value. This flat-file structure forms the core of Stata's dataset, which is loaded into memory upon import and serves as the primary workspace for analysis. Observations represent individual units, such as respondents or time periods, while variables denote attributes like age or income. Prior to Stata 16, only a single dataset could be active in memory at a time, requiring users to load and unload files sequentially.[41][42]
Introduced in Stata 16, the frames feature enables multiple datasets to reside simultaneously in memory, each stored within its own frame for independent manipulation. This allows users to reference and operate across frames using commands like frame to switch contexts or frame post to transfer data between them, facilitating complex workflows such as merging subsets without overwriting the primary dataset. Frames maintain the same observation-variable structure but enhance flexibility for handling related data collections, such as linking survey waves or auxiliary files.[43]
To optimize memory usage, Stata employs efficient storage types, including byte, int, long, float, and double for numerics, with packed formats for strings that store repeated substrings compactly. The compress command automatically converts variables to the smallest possible format without loss of precision—for instance, recoding integers within -127 to 100 as byte (1 byte per value) or trimming long strings to shorter str# types if patterns allow—potentially reducing dataset size by factors of 2 to 10 depending on data characteristics. This is particularly useful for large datasets, as it minimizes RAM requirements and speeds up operations.[44]
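The effect of compress can be seen on a dataset shipped with Stata; the deliberately wasteful age_copy variable below is created only for illustration.

```stata
* Shrinking storage types with compress (nlsw88 ships with Stata)
sysuse nlsw88, clear
describe, short                         // size in memory before compressing
generate double age_copy = age          // deliberately wasteful 8-byte storage type
compress                                // demotes age_copy to byte; its values fit in -127..100
describe age age_copy                   // both variables now stored as byte
```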
Data management in Stata relies on a suite of commands for creating, modifying, combining, and restructuring datasets. The generate command creates new variables based on expressions, such as deriving income categories from raw earnings; replace updates existing values conditionally, enabling data cleaning like handling outliers. For integration, merge combines datasets on common keys (e.g., ID variables) in one-to-one, one-to-many, or many-to-one modes, while reshape transforms data between wide (multiple variables per time point) and long (one row per observation-time pair) formats to suit analysis needs. Support for longitudinal and panel data is provided by xtset, which declares panel structure by specifying panel and time variables, enabling commands like xtreg to account for clustering without manual restructuring.[45][46]
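A self-contained sketch of these commands follows, using the nlswork panel extract from StataCorp's example-data server; the cleaning rule and the merged person-level summary file are purely illustrative.

```stata
* Derive, clean, merge, reshape, and declare a panel using the nlswork extract
webuse nlswork, clear
xtset idcode year                            // declare person-by-year panel structure
generate wage = exp(ln_wage)                 // new variable from an expression
replace wage = . if wage > 500               // illustrative (arbitrary) cleaning rule

* Build a person-level summary file and merge it back in (self-contained merge demo)
preserve
collapse (mean) mean_wage = wage, by(idcode)
tempfile persons
save `persons'
restore
merge m:1 idcode using `persons', nogenerate

* Reshape a small extract from long to wide and back
keep idcode year ln_wage
reshape wide ln_wage, i(idcode) j(year)
reshape long ln_wage, i(idcode) j(year)
```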
Stata's strengths include scalability for large flat-file datasets, with the Basic Edition (Stata/BE) and Standard Edition (Stata/SE) supporting up to approximately 2.1 billion observations, limited primarily by available memory rather than software constraints. However, it lacks native relational database functionality, such as built-in querying or joins across normalized tables; instead, users import data from relational sources like SQL databases via ODBC or JDBC interfaces for processing within Stata's flat structure.[47][48]
Stata's native file format is the binary .dta file, which stores datasets along with associated metadata such as variable labels, value labels, and notes. This format has evolved across versions, with compatibility spanning from version 4 to the current version 19, though older versions may impose limits on features like extended label lengths when reading newer files.[49] The save command outputs data in .dta format by default, ensuring preservation of these elements for seamless reloading via the use command.[50]
Stata provides robust support for importing and exporting common data formats to facilitate interoperability with other software. Comma-separated values (CSV) files and other delimited text files can be handled using import delimited and export delimited, which support automatic delimiter detection and selective row or column specification.[51] Microsoft Excel files in .xls and .xlsx formats are supported through import excel and export excel, allowing direct reading and writing of worksheets while handling multiple sheets if needed.[52] For legacy statistical software, Stata 16 and later versions include import sas for SAS .sas7bdat files and import spss for IBM SPSS .sav files, preserving variable attributes where possible.[53] Free-format delimited text files are imported with import delimited, which supersedes the older insheet command, while fixed-format text files are read with infix or with infile and a data dictionary.[54]
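As an illustrative round trip, the following exports the auto dataset shipped with Stata to CSV and Excel and reads it back; the output file names are arbitrary, and the closing comment shows the analogous import sas and import spss syntax for files that would have to exist separately.

```stata
* Round trip between formats using the auto data; output file names are arbitrary
sysuse auto, clear
export delimited using "auto_copy.csv", replace
export excel using "auto_copy.xlsx", firstrow(variables) replace
import delimited "auto_copy.csv", clear varnames(1)
import excel using "auto_copy.xlsx", firstrow clear
save "auto_copy.dta", replace                        // native .dta keeps labels and notes
* Legacy formats (Stata 16+): import sas using "file.sas7bdat"; import spss using "file.sav"
```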
Specialized compatibility extends to database connectivity and scripting integrations. Stata supports Open Database Connectivity (ODBC) via the odbc command, enabling import, export, and SQL queries from sources like Microsoft SQL Server, Oracle, MySQL, and others, provided the appropriate drivers are installed.[55] Similarly, Java Database Connectivity (JDBC) is available through the jdbc command for cross-platform access to databases including Oracle, SQL Server, Amazon Redshift, and Snowflake.[56] For integration with other languages, Stata offers official Python support starting in version 16 via the python command, allowing embedded Python code execution and data exchange within do-files.[57] User-contributed tools like rsource enable similar R integration by executing R scripts from within Stata, though this requires R installation.[58]
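A hedged sketch of both mechanisms follows: the ODBC data-source name and SQL query are placeholders that would need to match an actually configured database, and the embedded Python block uses the sfi module that ships with Stata 16 and later.

```stata
* ODBC pull (DSN and SQL are placeholders) followed by an embedded Python block
odbc load, exec("SELECT id, wage FROM payroll") dsn("CompanyDB") clear

python:
from sfi import Data                      # Stata Function Interface, included since Stata 16
print("observations now in memory:", Data.getObsTotal())
end
```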
In Stata 19, released in April 2025, data management enhancements include frame handling, label operations, and support for importing Parquet files using the import parquet command.[16] Existing XML support via xmluse and xmlsave remains available for importing and exporting datasets in extensible markup language format. JSON handling is not natively supported for direct import or export, relying instead on user-contributed packages like jsonio, and there is no built-in compatibility for NoSQL databases.[59] Post-import, data can be manipulated using Stata's internal structures, as detailed in the data management section.[54]
Core Functionality
Stata provides a suite of built-in procedures for descriptive statistics, enabling users to compute measures such as means, standard deviations, variances, skewness, kurtosis, medians, percentiles, and interquartile ranges via the summarize command.[12] The tabulate command generates one- or two-way frequency tables, including row and column percentages, and supports options for summary statistics like means and standard deviations across categories.[60] For hypothesis testing, Stata includes commands like ttest for comparing means, where the t-statistic is calculated as t = \frac{\bar{x}_1 - \bar{x}_2}{SE}, with SE denoting the standard error of the difference, and supporting one-sample, two-sample, and paired tests under assumptions of normality or via robust variants.[61] Additionally, tabulate with the chi2 option performs Pearson's chi-squared test for independence in two-way tables, assessing whether observed frequencies differ significantly from expected values under the null hypothesis of no association.[62]
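These descriptive and testing commands can be illustrated on the auto dataset shipped with Stata:

```stata
* Descriptive statistics and basic tests on the auto data
sysuse auto, clear
summarize mpg weight, detail          // means, SDs, percentiles, skewness, kurtosis
tabulate foreign rep78, row chi2      // two-way table with row percentages and Pearson chi-squared
ttest mpg, by(foreign)                // two-sample t test comparing means across groups
```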
In econometrics, Stata's core regression tools begin with ordinary least squares (OLS) estimation using regress, which fits the linear model Y = X\beta + \epsilon, where Y is the response vector, X the design matrix, \beta the parameter vector, and \epsilon the error term, providing coefficient estimates, standard errors, t-statistics, and R-squared values.[63] For binary outcomes, logit and probit implement logistic and probit regression, respectively, modeling the probability of success via the cumulative distribution function of the logistic or normal distribution, with maximum likelihood estimation for parameters.[63] Instrumental variables and generalized method of moments (GMM) are handled by ivregress, supporting two-stage least squares (2SLS), limited-information maximum likelihood (LIML), and GMM estimators to address endogeneity, where instruments are specified to identify causal effects.[64] Time-series analysis includes ARIMA modeling via arima, which estimates autoregressive integrated moving average processes, allowing for differencing to achieve stationarity and forecasting with dynamic predictions.[65]
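A brief sketch of these estimators follows, using example datasets shipped with Stata or fetched from StataCorp's server; the instruments in the ivregress call are chosen purely for illustration, not for substantive validity.

```stata
* Core regression and time-series estimators
sysuse auto, clear
regress price mpg weight i.foreign                        // OLS with a factor variable
logit foreign price mpg                                   // logistic regression for a binary outcome
ivregress 2sls price (mpg = weight length), vce(robust)   // 2SLS; instruments are illustrative only

webuse lutkepohl2, clear          // quarterly macro data from Stata's time-series examples
tsset qtr                         // declare the time variable (harmless if already declared)
arima dln_inv, arima(1,0,1)       // ARMA(1,1) model for log-differenced investment
```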
Advanced statistical capabilities encompass survival analysis with stcox, fitting Cox proportional hazards models to estimate hazard ratios under the assumption of proportional hazards, using partial likelihood maximization for time-to-event data with censoring.[66] Multilevel modeling is supported by mixed, which estimates linear mixed-effects models incorporating fixed and random effects for hierarchical or clustered data, such as y_{ij} = X_{ij}\beta + Z_{ij}b_i + \epsilon_{ij}, where b_i are random effects for group i.[67] Machine learning tools include the lasso command for penalized regression with L1 regularization to promote sparsity, and in Stata 19, integration with H2O for ensemble methods like random forests and gradient boosting machines, featuring cross-validation for hyperparameter tuning and prediction.[68][69]
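The following sketch exercises each of these model families on example datasets available from StataCorp (drugtr, pig, and nlsw88); the covariate choices are illustrative rather than substantive.

```stata
* Survival analysis: Cox model on a drug-trial dataset from Stata's examples
webuse drugtr, clear
stset studytime, failure(died)            // declare time-to-event data with censoring
stcox drug age                            // hazard ratios for treatment and age

* Multilevel model: random intercept for each pig in a growth study
webuse pig, clear
mixed weight week || id:

* Penalized regression: lasso with cross-validated penalty selection
sysuse nlsw88, clear
lasso linear wage age ttl_exp tenure i.race i.married i.collgrad, selection(cv) rseed(12345)
```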
A distinctive feature of Stata's estimation procedures is the extensive post-estimation toolkit, allowing users to compute marginal effects and predicted values with margins, which evaluates responses at specified covariate levels, such as average marginal effects (AMEs), and supports contrasts via ANOVA-style tests.[70] The test command performs Wald tests for linear hypotheses on coefficients, including joint significance and equality constraints. Robust standard errors, adjustable via the vce(robust) option in commands like regress, account for heteroskedasticity by using sandwich estimators, enhancing inference validity without assuming homoscedasticity.[63] These tools facilitate seamless extension of model diagnostics and interpretation, with results amenable to visualization as covered in graphics capabilities.[8]
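A short post-estimation sequence on the auto data illustrates these tools:

```stata
* Post-estimation: robust standard errors, Wald tests, and marginal effects
sysuse auto, clear
regress price mpg weight i.foreign, vce(robust)   // heteroskedasticity-robust (sandwich) SEs
test mpg weight                                   // joint Wald test that both coefficients equal zero
margins, dydx(mpg weight)                         // average marginal effects of the continuous regressors
margins foreign                                   // adjusted predictions for domestic vs. foreign cars
```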
Graphics and Output Capabilities
Stata provides a wide array of graph types for visualizing data, including histograms for displaying distributions, scatterplots for exploring relationships between variables, box plots for summarizing data variability, ROC curves for evaluating diagnostic test performance, and heatmaps for representing matrix data through color gradients.[71]
Customization options enable users to tailor visualizations extensively, such as using the twoway command for overlaying multiple series like lines and scatters on a single plot, or graph combine to arrange multiple graphs into panels for comparative analysis. Additional refinements include specifying colors via palette options, adjusting axis labels and titles for clarity, and configuring legends to identify plot elements effectively.[72]
Output capabilities support flexible handling of results and visualizations, with SMCL (Stata Markup and Control Language) used for logging sessions and formatting command outputs in log files via commands like log using. Graphs can be exported to various formats including PDF, EPS, and PNG using graph export, preserving publication quality. For dynamic documents, the dyndoc command integrates Stata results and graphs into Markdown-based HTML or Word files, facilitating reproducible reports.[73][74][75]
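For example, an overlaid twoway plot and a histogram can be combined into panels and exported; the sketch below uses the auto data shipped with Stata, and the output file name is arbitrary.

```stata
* Overlay, combine, and export graphs using the auto data
sysuse auto, clear
twoway (scatter price mpg) (lfit price mpg), ///
    legend(order(1 "Observed" 2 "Linear fit")) name(g1, replace)
histogram mpg, name(g2, replace)
graph combine g1 g2, cols(2) title("Price vs. mileage and mileage distribution")
graph export "auto_panels.pdf", replace           // publication-quality PDF output
```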
In Stata 19, released in 2025, graphics enhancements include a new twoway heatmap plottype for creating color-coded grids from bivariate data, alongside improved bar plots with built-in confidence intervals and integration with reporting tools for seamless HTML exports.[71][76]
Programming and Extensibility
Stata's programming ecosystem centers on its ado programming language, which enables users to automate tasks and create custom commands. Do-files serve as simple scripts consisting of sequences of Stata commands stored in plain text files, executable via the do command for reproducible workflows in interactive sessions or batch processing. Ado-files build on this foundation by defining reusable commands that integrate seamlessly with Stata's syntax, allowing users to encapsulate complex operations into callable functions. These can be developed locally or shared through the Statistical Software Components (SSC) repository, where installation occurs via ssc install packagename, facilitating easy access to community extensions.[77][78]
For intensive numerical tasks, Stata incorporates Mata, a compiled matrix programming language introduced in version 9 in April 2005, optimized for efficient linear algebra and data manipulation akin to MATLAB. Mata operates interactively, within do-files, or as callable functions from ado-programs, supporting operations like matrix inversion (A = inv(B)) and advanced simulations with just-in-time compilation for speed. Extensibility is further enhanced by thousands of user-written packages available on SSC, C/C++ plugin interfaces for integrating low-level compiled code, and built-in version control via the version prefix to ensure cross-release compatibility in scripts and commands.[2][79][80][81][82]
Mata's advanced capabilities include object-oriented class programming, enabling structured extensions with classes, methods, inheritance, constructors, and destructors for modular code design. Error handling across Stata programming relies on the capture prefix, which suppresses error messages from commands and sets the _rc return code for conditional logic, often paired with local or global macros to store and manipulate dynamic values like variable lists or loop counters. These features collectively allow for sophisticated, maintainable extensions tailored to econometric and statistical applications.[83][84]
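A compact sketch ties these pieces together: a local macro holds a variable name, capture guards a check that sets _rc, and a Mata block recovers coefficients and standard errors from the last estimation; it uses the auto data shipped with Stata.

```stata
* Macros, capture, and a Mata block working with the last estimation's results
sysuse auto, clear
local depvar price                         // local macro holding a variable name
capture confirm variable `depvar'          // capture suppresses any error and sets _rc
display "return code from confirm: " _rc   // 0 means the variable exists
quietly regress `depvar' mpg weight

mata:
b  = st_matrix("e(b)")                     // coefficient row vector from the last estimation
V  = st_matrix("e(V)")                     // variance-covariance matrix
se = sqrt(diagonal(V))                     // standard errors via Mata matrix functions
b', se                                     // show coefficients alongside their standard errors
end
```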
As of November 2025, updates to Stata 19 have added features including import for Parquet files, causal mediation analysis with multiple mediators, and a Mata quantile function, further expanding core capabilities.[85]
Products and Licensing
Editions and Versions
Stata offers four primary editions tailored to different user needs and computational scales: Stata/MP, Stata/SE, Stata/BE, and Numerics by Stata. Each edition provides the full suite of Stata's statistical, data management, and graphics capabilities but differs in performance optimization, dataset size limits, and deployment focus.[17][86]
Stata/MP is the multicore-optimized edition designed for high-performance computing on modern hardware, supporting up to 64 processors and handling the largest datasets, with up to 120,000 variables and a theoretical maximum of roughly 1.1 trillion observations (tens of billions in practice), limited chiefly by available system memory. It excels in parallel processing for commands like regressions and simulations, making it suitable for large-scale analyses in research and industry. Stata/SE serves as the standard edition for single-processor systems, accommodating up to 32,767 variables, 10,998 variables in statistical models, and up to 2.1 billion observations, ideal for most professional workflows involving substantial but not extreme datasets. Stata/BE, the basic edition (formerly Stata/IC), is optimized for smaller-scale work with limits of 2,048 variables, 798 in models, and 2.1 billion observations, commonly used in teaching environments or with modest datasets. Numerics by Stata focuses on scientific computing and embedded applications, integrating Stata's engine into custom software, web apps, or automated systems via APIs like OLE automation, JDBC/ODBC, and Mata matrix programming, without the interactive interface of other editions.[86][87]
| Edition | Max Variables | Max in Models | Max Observations | Processors | Target Use Case |
|---|---|---|---|---|---|
| Stata/MP | 120,000 | 65,532 | 1+ trillion* | Up to 64 | Large-scale simulations, big data |
| Stata/SE | 32,767 | 10,998 | 2.1 billion | 1 | Standard professional analysis |
| Stata/BE | 2,048 | 798 | 2.1 billion | 1 | Teaching, small datasets |
| Numerics | Varies by integration | Varies by integration | Varies by integration | Varies | Embedded/scientific apps |
*Memory-dependent; requires substantial RAM (e.g., 1 TB+ for terabyte-scale data).[86][88]
Versioning in Stata follows a major release model, with perpetual licenses providing all updates within a major version (e.g., Stata 19 includes patches and enhancements until the next major release like Stata 20) and cross-platform binaries that run identically on Windows, macOS, and Linux under a single license. Hardware requirements start at a minimum of 1 GB RAM and 4 GB disk space for Stata/BE, scaling to 4 GB RAM minimum for Stata/MP, though practical use with large datasets demands significantly more—up to supercomputing levels with terabytes of RAM for Stata/MP. As of 2025, Stata supports ARM architecture natively, including Apple Silicon Macs since Stata 17, enabling efficient deployment on diverse hardware like M-series processors.[34][89][90]
Users select editions based on workload: Stata/MP for intensive, parallelized tasks like complex simulations on multicore systems; Stata/SE for balanced, single-threaded professional use; Stata/BE for educational or lightweight applications with limited data; and Numerics for programmatic integration in scientific or automated environments. Pricing tiers for these editions are detailed separately, but all share Stata's core reliability and reproducibility.[17][87]
Pricing Models and Availability
Stata offers both perpetual and annual licensing options for single-user installations, with the latter primarily through the StataNow subscription model that includes continuous updates and new releases during the term. Perpetual licenses do not expire but require separate annual or multiyear maintenance purchases to access updates beyond the initial year included with the license.[34] Network and site licenses are available for institutions, allowing concurrent use by multiple users at a single location or organization-wide access, respectively; these can also be annual or perpetual, with site licenses often customized for departments or cloud integration.[34]
Pricing varies by edition (Stata/BE for smaller datasets, Stata/SE for mid-sized, and Stata/MP for multicore processing), user type, and license term, with educational and student rates offering substantial discounts—typically 40-50% off business pricing for qualified academic users affiliated with degree-granting institutions. For business single-user annual licenses (StataNow), prices start at $925 for Stata/SE, $1,085 for Stata/MP (2-core), and $1,195 for Stata/MP (4-core), with higher-core versions available upon request. Educational single-user annual licenses are lower, starting at $360 for Stata/BE, $510 for Stata/SE, $690 for Stata/MP (2-core), and $840 for Stata/MP (4-core). The Prof+ Plan provides even deeper discounts for faculty and staff, with annual rates of $160 for Stata/BE, $250 for Stata/SE, $360 for Stata/MP (2-core), and $510 for Stata/MP (4-core). Perpetual licenses, while still offered, are generally more expensive upfront; for example, an academic Stata/MP (2-core) perpetual license costs $1,554 plus $675 annual maintenance thereafter, making annual subscriptions more cost-effective over multiple years.[91][92][93][94] Student options include short-term licenses, such as a 6-month Stata/BE for around $48, or free 6-month access for class use at accredited institutions.[95][96]
| Edition | Business Annual (USD) | Educational Annual (USD) | Prof+ Plan Annual (USD) |
|---|---|---|---|
| Stata/BE | Not listed (contact for quote) | 360 | 160 |
| Stata/SE | 925 | 510 | 250 |
| Stata/MP (2-core) | 1,085 | 690 | 360 |
| Stata/MP (4-core) | 1,195 | 840 | 510 |
Availability is primarily through direct purchase from the StataCorp website for U.S., Canada, and international customers, with electronic delivery for downloads; authorized resellers and distributors handle sales in other regions and provide local support. There is no open-source version of Stata, as it remains a proprietary commercial product.[97][98]
Licenses are non-transferable to other users and cannot be resold, though single-user licenses may be installed on multiple compatible machines (Windows, macOS, Unix) for the same authorized user. Volume discounts apply for bulk purchases of multiple single-user or network licenses, reducing per-unit costs for enterprises and institutions; quotes for these are available upon request.[99][34][98]
Community and Resources
Stata's user community encompasses hundreds of thousands of individuals worldwide, including students, academics, researchers, analysts, and data scientists who have relied on the software for over four decades.[100] The user base is particularly concentrated in academia and research institutions, where Stata serves as a primary tool for empirical analysis across various disciplines.[101]
The demographics of Stata users skew heavily toward quantitative researchers and policymakers in fields such as economics, social sciences, biostatistics, epidemiology, public health, and sociology.[101] Economics stands out as the dominant domain, with a majority of prominent economists utilizing Stata for statistical analysis and econometric modeling, reflecting its entrenched role in academic economics departments.[102] In recent years, adoption has expanded into data science applications, including machine learning workflows, as users leverage Stata's evolving capabilities for broader analytical tasks.[4]
Community engagement is fostered through longstanding events and forums that promote knowledge sharing and collaboration. The annual Stata Conferences, organized by StataCorp since 2001, bring together users for presentations on advanced techniques, with regional variants like the UK Stata Conference marking its 31st edition in 2025, indicating origins in the mid-1990s.[103][104] Complementing these are user groups worldwide and the Statalist forum, established in 1994 as an independent mailing list and now a vibrant web-based platform hosting extensive discussions on statistical methods and Stata implementation.[105]
User contributions significantly enhance Stata's ecosystem, with the Statistical Software Components (SSC) archive serving as a repository for community-developed extensions. By 2020, the SSC hosted over 2,800 packages, covering specialized tools for econometrics, graphics, and data management, allowing users to extend core functionality without altering official software; the archive has continued to grow since then.[106] Collaborative initiatives, such as Stata's NetCourses—self-paced online training programs spanning topics from introductory analysis to programming—further support skill-building and peer interaction among researchers.[107]
Support, Documentation, and Integrations
Stata provides extensive documentation to support users at all levels, including over 19,000 pages across more than 20 PDF manuals covering topics from base commands to specialized functions like graphics and data management.[108] These manuals, such as the [U] User's Guide, offer detailed explanations of Stata basics, elements of syntax, and practical advice, and are accessible directly from within the software via hyperlinks in help files.[36] Additionally, the built-in help command delivers context-sensitive online assistance for commands, functions, and options, allowing users to quickly reference syntax and examples without leaving the interface.[109] Complementing these resources, official video tutorials on YouTube—over 350 short videos narrated by Stata staff—cover specific topics from installation to advanced analyses, enabling visual learning for diverse workflows.[110]
Official support for Stata is integrated into software licenses for registered users, featuring prompt email-based technical assistance in which queries are routed to specialists for accurate resolutions.[111] This service addresses installation, usage, and troubleshooting issues, ensuring users receive courteous and expert guidance.[112] Stata validates its software against benchmarks such as statistical tests from NIST, with public certification results available.[113] For structured training, NetCourses provide online options such as self-paced NetCourseNow sessions with dedicated instructors, starting at $125 as of 2025.[114]
In 2025, Stata emphasizes modern integrations to enhance interoperability, including the PyStata Python package that enables seamless use of Stata within Jupyter notebooks via magic commands and interactive functions.[115] Users can execute Python code directly from Stata using the python prefix, facilitating hybrid workflows for data manipulation and analysis, while community tools like rcall allow similar calls to R for specialized tasks.[25] Cloud deployment is supported on platforms such as AWS and Azure, where users run Stata on virtual machines for scalable computing without local installation.[116] To address emerging needs in machine learning workflows, Stata 19 introduces guides and commands for predictive analytics, including H2O-based ensemble decision trees for gradient boosting and random forests, bridging traditional statistics with AI-driven modeling.[69] Although no official ChatGPT plugin exists, users commonly leverage general AI tools like ChatGPT for generating and debugging Stata code, supplementing official resources.[117]
Usage Examples
Basic Command Syntax
Stata commands follow a consistent syntax structure of the form command [varlist] [if] [in] [, options], where command specifies the action, varlist optionally lists variables, if restricts observations to those meeting a condition, in limits to a range of observations, and options modify behavior.[118] For instance, the describe command lists dataset variables and their properties without arguments, as in describe, while summarize varname computes means and standard deviations for specified variables.[119]
Basic data management begins with loading datasets using use filename, which reads Stata-format .dta files into memory. Variable creation employs generate newvar = expression, such as generate income_squared = income^2 to compute derived values.[120] Simple linear regression is performed with regress y x, estimating coefficients for dependent variable y on predictor x.[119]
Do-files, saved with .do extension, contain sequences of commands for reproducibility and automation, executed via the do command or Do-file Editor.[13] Output logging records sessions using log using filename, capturing results and commands in text or SMCL format for later review.[73]
For assistance, the help command displays documentation, as in help summarize; typing help alone provides general guidance.[121] Output control uses set more off to suppress pauses during lengthy displays, allowing continuous scrolling.[122]
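The elements above can be combined into a short, runnable session on the auto dataset shipped with Stata; the log file name is arbitrary.

```stata
* The basic building blocks from this section, run on the auto data
sysuse auto, clear                      // load an example .dta file into memory
log using basics, replace               // start recording the session
describe                                // variables and their properties
summarize mpg weight if foreign, detail // a varlist with an -if- restriction and an option
generate gpm = 1/mpg                    // create a derived variable
regress price gpm weight in 1/50        // an -in- range restriction on an estimation command
log close
```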
Advanced Application Example
A practical advanced application of Stata involves analyzing panel survey data on labor union membership among U.S. workers, drawn from a CSV file containing repeated observations over years for individuals, with variables such as age, education grade, urban/rural status, southern residence, and year. This workflow integrates data import, missing value handling, machine learning-based variable selection, panel data setup, random-effects logistic regression to model unionization probability, computation of marginal effects, visualization, and export, showcasing Stata's capabilities for comprehensive econometric analysis as of version 19.[123][124][125]
The process begins with importing the CSV data using import delimited, which reads the file into memory while specifying delimiters and variable types for efficiency with large surveys. Missing values coded as 99 (common in survey datasets to flag non-responses) are then recoded to standard missing (.) via mvdecode across relevant variables, ensuring clean data for modeling without biasing estimates. To handle high-dimensional predictors—such as numerous demographic interactions—lasso logit performs lasso-penalized variable selection, shrinking irrelevant coefficients to zero and identifying key predictors like grade and south interactions, which is particularly relevant in 2025 for scalable analysis of big survey data with integrated ML tools.[126]
Next, the dataset is declared as a panel using xtset idcode year, which identifies the individual and time variables so that subsequent xt commands recognize the longitudinal structure. A random-effects logistic model is fitted with xtlogit on the selected variables, estimating odds ratios for union membership while accounting for unobserved heterogeneity across individuals. Marginal effects are computed post-estimation with margins to interpret average changes in probability, followed by marginsplot for visualization, and the graph is exported to PDF via graph export for reporting. This sequence leverages Stata's do-file system for reproducible workflows.
```stata
* Full do-file: advanced panel survey analysis of union membership
clear all
set more off

* Step 1: Import CSV survey data (file name is illustrative)
import delimited "union_survey.csv", clear varnames(1) case(preserve)

* Step 2: Recode survey missing-value codes (assume 99 flags refusals/non-applicable)
mvdecode _all, mv(99)

* Step 3: Lasso-penalized logit for variable selection, tuned by cross-validation
*         (ttl_exp, wage, and region are assumed to exist in the imported CSV)
lasso logit union c.age i.grade not_smsa south##c.year i.region ttl_exp wage, ///
    selection(cv) rseed(12345)
lassocoef    // list selected variables, e.g., grade, south, and the south-year interaction

* Step 4: Declare the panel structure
xtset idcode year

* Step 5: Random-effects logit on the selected variables
xtlogit union age grade not_smsa south##c.year, re

* Step 6: Marginal effects (probabilities with the random effect set to zero) and plot
margins south, at(year=(70 79 88)) predict(pu0)
marginsplot, recast(line) title("Predicted Probability of Union Membership by Year and Region")

* Step 7: Export the graph
graph export "union_margins.pdf", replace
```
In the xtlogit output from this workflow (adapted from the canonical union dataset with 26,200 observations across 4,434 individuals), the model shows strong fit (Wald χ²(6) = 227.46, p < 0.001), with education (grade coefficient = 0.087, p < 0.001) increasing union odds by about 9% per grade level, non-metropolitan residence reducing odds (coefficient = -0.251, p = 0.002), and southern location strongly decreasing odds (coefficient = -2.839, p < 0.001), though the negative effect attenuates over time (interaction coefficient = 0.024, p = 0.003). The random-effects parameter ρ = 0.636 (p < 0.001) confirms substantial unobserved individual variation, justifying the panel approach; lasso selection pruned redundant regional dummies, yielding a parsimonious specification with odds ratios such as exp(0.087) ≈ 1.091 for grade. The marginsplot shows predicted probabilities of union membership around 0.15 for southern workers versus roughly 0.25 elsewhere, with the gap narrowing over the sample years, aiding intuitive interpretation of policy impacts in labor economics.[124]