
Foundation model

A foundation model is any model trained on broad data—generally using self-supervision at scale—that can be adapted to a wide range of downstream tasks. The term was introduced in a 2021 report by researchers at Stanford University to describe a class of large-scale systems exhibiting emergent generalization capabilities across domains like natural language processing, computer vision, and robotics. These models leverage massive datasets and computational resources, often involving billions or trillions of parameters, enabling transfer learning in which pre-training on general data supports fine-tuning for specialized applications with minimal additional supervision. Key examples include transformer-based models such as OpenAI's GPT series and Google's PaLM, which have demonstrated scaling laws where performance predictably improves with increased model size, data volume, and compute. While foundation models have accelerated advancements—facilitating breakthroughs in tasks from text generation to image synthesis—they incur exorbitant training costs, frequently exceeding hundreds of millions of dollars, and raise concerns over risks including bias amplification from uncurated corpora, vulnerability to adversarial attacks, and potential societal harms from misuse. Their development underscores a concentration of capabilities among resource-rich entities, prompting debates on accessibility, safety, and the empirical limits of scaling without fundamental architectural innovations.

Definition and Characteristics

Core Definition

A foundation model refers to any model trained on broad data, typically using self-supervision at scale, that can be adapted—through fine-tuning, prompting, or other methods—to a wide range of downstream tasks. This definition emphasizes the model's foundational role in deriving generalizable capabilities from vast, unlabeled datasets rather than bespoke task engineering. Unlike supervised approaches reliant on labeled examples for specific objectives, foundation models leverage emergent properties from data scale, where learned representations encode patterns enabling versatile application across domains.

Key characteristics include massive scale, often encompassing billions to trillions of parameters, which facilitates the compression of diverse data into reusable latent structures. These models exhibit generality across modalities such as text, images, audio, and video, allowing unified processing of heterogeneous inputs through shared pre-training objectives. Adaptation efficiency stems from transfer learning, where minimal additional data or instructions suffice to specialize the model, contrasting with resource-intensive retraining from scratch.

In distinction from narrow AI systems, which are engineered for singular, predefined functions without broad reusability, foundation models acquire capabilities via probabilistic pattern extraction from expansive corpora, yielding inferences grounded in data distributions rather than explicit programming. Narrow AI, by contrast, optimizes for isolated performance metrics through targeted training, limiting extrapolation to untrained scenarios. The foundation-model paradigm prioritizes empirical scaling laws, where model performance correlates predictably with data volume and compute, over domain-specific heuristics.

Distinguishing Attributes

Foundation models differ from prior AI paradigms, such as task-specific models, through their reliance on massive scale in parameters, data, and compute, enabling emergent abilities that arise discontinuously rather than through gradual performance gains. For instance, in-context learning—where models adapt to new tasks via prompts without parameter updates—emerged sharply in GPT-3, a 175-billion-parameter model trained on approximately 570 gigabytes of filtered text data and released in June 2020, marking a departure from smaller models' predictable scaling curves. Subsequent analyses confirm these abilities, including few-shot adaptation, correlate empirically with model sizes exceeding 10^9 parameters and training datasets surpassing trillions of tokens, as smaller systems fail to exhibit such behaviors despite similar architectures. This scale-driven emergence underscores a foundational shift: capabilities previously requiring specialized engineering now surface as byproducts of broad pre-training on internet-scale corpora.

A core distinguishing attribute is versatility across tasks and modalities without exhaustive retraining, contrasting with traditional machine learning's dependence on curated, labeled datasets for each application. Foundation models undergo initial self-supervised pre-training on diverse, unlabeled data—often billions of examples spanning text, code, and images—allowing subsequent deployment via lightweight prompting or fine-tuning for downstream uses like translation, summarization, or question answering. Multimodal extensions exemplify this: DALL-E, introduced by OpenAI in January 2021, leverages pre-trained text-image alignments to generate images from textual descriptions, adapting foundational representations to vision tasks without starting from scratch, unlike conventional vision models requiring modality-specific training from raw pixels. This adaptability stems from learned latent representations that generalize across domains, though it remains bounded by the distributional coverage of pre-training data.

Critically, foundation models' proficiency traces to statistical regularities in observational data rather than causal mechanisms, highlighting limitations in causal inference that were less salient in prior paradigms' narrower scopes. They excel at predictive interpolation within training distributions but falter on novel scenarios, such as counterfactual reasoning or interventions in unseen causal graphs, where outputs revert to memorized correlations rather than mechanistic understanding. Empirical probes reveal this gap: even advanced models like GPT-4 struggle with tasks demanding distinction between spurious associations and true causes outside benchmark templates, underscoring that scale amplifies data-driven heuristics without bridging to first-principles reasoning. This attribute necessitates caution in applications presuming deeper reasoning, as capabilities reflect probabilistic approximations, not veridical world modeling.
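The mechanics of in-context learning can be illustrated with a minimal sketch: a few-shot prompt is assembled as plain text and passed to an autoregressive model with no weight updates. The task, prompt wording, and the small stand-in checkpoint below are illustrative assumptions; small models such as GPT-2 typically do not follow the pattern reliably, consistent with the scale-dependence described above.

```python
# Sketch of a few-shot "in-context learning" prompt: the task is specified
# entirely in the input text, and no parameters are updated.
from transformers import pipeline

few_shot_prompt = (
    "Translate English to French.\n"
    "English: cheese\nFrench: fromage\n"
    "English: bread\nFrench: pain\n"
    "English: water\nFrench:"
)

generator = pipeline("text-generation", model="gpt2")  # small stand-in model
completion = generator(few_shot_prompt, max_new_tokens=5, do_sample=False)
print(completion[0]["generated_text"])
```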

Historical Development

Pre-Foundation Precursors

The Transformer architecture, proposed by Vaswani et al. in June 2017, marked a pivotal shift in natural language processing by eschewing recurrent and convolutional layers in favor of self-attention mechanisms, which facilitated parallel computation and captured long-range dependencies more effectively than prior models. This design empirically demonstrated superior performance on machine translation tasks, with the model achieving a BLEU score of 28.4 on the WMT 2014 English-to-German dataset—more than 2 BLEU above previous state-of-the-art systems—laying the groundwork for scaling to larger datasets and model sizes without the sequential bottlenecks of recurrent neural networks.

Building on this, early large-scale pre-training emerged with models like ELMo in 2018, which used bidirectional LSTMs trained on unsupervised language-modeling objectives to produce contextualized word representations, improving transfer to six NLP tasks with gains averaging 4-7 percentage points over non-contextual baselines. Similarly, BERT, released by Devlin et al. in October 2018, introduced masked language modeling and next-sentence prediction for pre-training on 3.3 billion words from BooksCorpus and English Wikipedia, attaining state-of-the-art results on 11 benchmarks including GLUE (80.5% average score) through fine-tuning, thus highlighting self-supervised learning's capacity for broad task adaptation without task-specific supervision from scratch.

GPT-2, developed by OpenAI and detailed in February 2019, further exemplified this trajectory by scaling unsupervised next-token prediction to a 1.5 billion parameter model trained on 40 gigabytes of WebText—a curated corpus of 8 million web pages linked from Reddit—yielding coherent text generation and zero-shot performance on tasks like summarization (with ROUGE scores approaching supervised models) and translation, underscoring the viability of purely generative pre-training for emergent capabilities across domains. These pre-2020 efforts collectively demonstrated that large-scale, data-driven pre-training on unlabeled corpora could yield models with transferable representations, departing from the era's dominant paradigm of narrow, supervised architectures and empirically validating scaling as a path to broader generality.

Emergence of the Term (2021)

The term "foundation model" was formally introduced in the report On the Opportunities and Risks of Foundation Models, published on August 16, 2021, by researchers at Stanford University's Center for Research on Foundation Models (CRFM). The report, authored by Rishi Bommasani and colleagues including Percy Liang, defined foundation models as "any model trained on broad data (typically by self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks," emphasizing their role as reusable infrastructural bases rather than task-specific systems. This framing positioned models like OpenAI's (released June 2020) and Google's (paper published October 2019, with implementations scaling in 2020) as exemplars, highlighting their capacity for across domains due to massive pre-training on diverse datasets. The motivation for coining the term stemmed from the escalating computational costs of training large-scale models—often exceeding hundreds of millions of dollars—and the recognition that such investments could be amortized through broad reusability, shifting development from siloed, narrow applications toward general-purpose foundations adaptable via or prompting. The CRFM report argued this paradigm enabled efficiency gains, as a single foundation model could underpin multiple specialized applications, but also introduced systemic risks like amplified biases from broad data ingestion and challenges in governance due to their infrastructural scale. Initial examples focused on language models, but the concept extended to systems, underscoring the need for interdisciplinary analysis of their societal implications. Following the report's release, the terminology saw rapid adoption in industry and academia, with organizations like and integrating it to describe their scalable AI architectures. , for instance, began referencing GPT-series models as foundation models in public communications and technical updates by late 2021, aligning with the report's emphasis on pre-trained bases for downstream adaptation. similarly embraced the term for systems like , framing them as foundational layers in cloud AI services to highlight interoperability and cost-sharing potential. This uptake reflected a on the term's utility in capturing the shift toward models prioritizing scale and generality over bespoke training.

Key Milestones and Models (2020-2025)

In June 2020, OpenAI released GPT-3, a transformer-based language model with 175 billion parameters that demonstrated emergent few-shot capabilities, enabling task performance with minimal examples provided in prompts without fine-tuning. This marked a pivotal advancement in scaling laws, where larger models showed improved generalization across tasks like translation and question-answering, though limited by a 2048-token context window. Google's PaLM, announced on April 4, 2022, scaled to 540 billion parameters using the Pathways system for efficient distributed training, achieving breakthroughs in reasoning tasks such as arithmetic and commonsense inference through chain-of-thought prompting.

In February 2023, Meta released LLaMA, a family of efficient models up to 65 billion parameters with open weights under a research license, which spurred widespread community experimentation and democratized access, intensifying competition beyond proprietary systems. The year 2023 saw an explosion in releases, with 149 foundation models documented globally—more than double the 2022 figure—including xAI's Grok-1 base model, whose pre-training concluded in October, emphasizing truth-seeking objectives in a 314 billion parameter mixture-of-experts architecture released openly in March 2024. Of these, 65.7% featured open weights, accelerating innovation through derivative models and efficiency optimizations.

In May 2024, OpenAI launched GPT-4o, a multimodal model integrating text, image, and audio processing in a unified architecture with a 128,000-token context window, enabling real-time applications like voice interaction while maintaining performance parity with prior versions at reduced inference costs. By 2025, releases continued apace, exemplified by Meta's Llama 4 in April, introducing natively multimodal variants like Scout (17 billion active parameters) with extended context lengths, reflecting shifts toward efficiency gains amid sustained scaling in compute and data.

Frontier Models

Frontier models represent the most advanced subset of foundation models, characterized by their leading performance on empirical benchmarks and ability to demonstrate emergent capabilities that approach or exceed human performance in targeted domains. These systems are typically defined by high training compute scales—often exceeding 10^25 floating-point operations (FLOPs)—and broad generality, enabling superior results in reasoning, coding, and multimodal tasks, while introducing heightened risks from potential misuse or unintended behaviors. Unlike standard foundation models, frontier models are distinguished not merely by size but by verifiable outperformance on standardized evaluations, such as achieving scores that rival expert humans, though they remain limited in holistic real-world generalization.

OpenAI's GPT-4, released on March 14, 2023, exemplifies this category by attaining the 90th percentile on the Uniform Bar Examination, outperforming 90% of human examinees, and scoring in the 90th percentile on SAT reading and math sections. Similarly, Anthropic's Claude 3 family, introduced in March 2024, established new benchmarks in graduate-level reasoning (GPQA), undergraduate knowledge (MMLU), and vision tasks, with the Opus variant leading competitors in coding and multilingual proficiency. Google's Gemini 1.0, announced December 6, 2023, advanced multimodal integration, processing text, images, audio, and video to achieve state-of-the-art results on benchmarks like MMMU for visual reasoning. These models' capabilities stem from massive pre-training on diverse datasets, yielding emergent skills like multi-step reasoning that were not explicitly optimized.

Due to their scale and potency, frontier models carry elevated risk profiles, including amplified potential for adversarial exploitation or systemic impacts, as outlined in guidelines from the U.S. AI Safety Institute established in 2023 under the National Institute of Standards and Technology. The Institute's framework prioritizes rigorous pre-deployment testing and safeguards for models with compute thresholds indicative of advanced risks, emphasizing empirical validation over self-reported claims to address gaps in transparency and safety. This focus underscores causal links between model scale and emergent hazards, such as deceptive alignment or unintended amplification of biases in training data.

General-Purpose AI Systems

Foundation models share substantial conceptual overlap with general-purpose AI systems, and the terms are frequently treated as synonymous in policy discourse, as exemplified by the EU AI Act's classification of general-purpose AI (GPAI) models—which encompass foundation models—as adaptable systems trained on extensive datasets to execute diverse tasks across applications without task-specific redesign. This equivalence arises from their broad applicability, yet foundation models distinctly prioritize statistical generality emergent from massive pre-training corpora over explicitly engineered modularity or hybrid architectures that might characterize some general-purpose designs.

Empirical assessments reveal foundational constraints on these systems' purported generality, with no demonstrated general intelligence and pronounced brittleness beyond training distributions; for instance, leading models score below 10% on the ARC-AGI benchmark's novel tasks, where humans routinely exceed 80%, indicating reliance on memorized patterns rather than causal understanding or flexible reasoning. Even recent advancements, such as OpenAI's o3 model achieving partial gains on public ARC subsets through enhanced chain-of-thought prompting, fail to close the gap on core challenges, affirming that capabilities remain distributionally bounded without evidence of scalable transfer.

Regulatory approaches like the EU AI Act, which impose transparency, documentation, and systemic risk evaluations on GPAI models effective from August 2025, have drawn criticism for presuming unverified existential hazards—such as uncontrolled self-improvement—absent causal mechanisms observed in deployed systems, thereby prioritizing speculative threats over documented limitations. Analyses contend that such frameworks, often shaped by precautionary biases in academic and policy circles, overlook empirical risk profiles favoring iterative competition and open benchmarking to foster verifiable progress, rather than decelerationist stances that conflate scaling artifacts with apocalyptic inevitability.

World Models and Multimodal Extensions

Foundation models incorporate world models as latent representations that predict environmental dynamics, enabling internal simulation for planning and control rather than mere pattern completion. These extensions draw from model-based paradigms, in which the model generates hypothetical future states conditioned on actions, facilitating emergent behaviors such as planning in simulated environments. For instance, Google DeepMind's Genie 3, introduced in August 2025, advances real-time interactive world modeling by generating consistent video predictions from latent states, supporting applications in game-like environments without explicit physics engines. However, empirical evaluations reveal that such models often prioritize statistical correlations over invariant causal structures, leading to brittle generalizations outside training distributions.

In robotics, world models integrate with action primitives for grounded planning, as demonstrated by Google DeepMind's RT-2, a vision-language-action model released in July 2023. RT-2 co-fine-tunes on internet-scale vision-language data and robotic trajectories, achieving up to 2x success rates on novel tasks like using objects as improvised tools through chain-of-thought reasoning over predicted outcomes. This predictive mechanism allows transfer of web-derived knowledge to physical control, improving manipulation in unseen scenarios by simulating action effects. Yet, critiques highlight deficiencies in encoding fundamental physical laws; a 2025 Harvard-MIT study found that foundation models, including world model variants, accurately predict outcomes in tested cases but fail to internalize principles like Newton's laws, relying instead on memorized heuristics that break under counterfactual perturbations.

Multimodal extensions enhance world models by fusing modalities like vision and language, promoting grounded reasoning through aligned representations. OpenAI's CLIP, pretrained in 2021 on 400 million image-text pairs via contrastive learning, establishes zero-shot cross-modal correspondences that anchor textual predictions to visual evidence, reducing hallucinations in simulation tasks. Subsequent integrations, such as in FOUNDER frameworks, map foundation model outputs to world model latents for open-ended task solving, yielding improved planning in embodied settings. Achievements include enhanced robotic control, with RT-2 exhibiting emergent skills like semantic reasoning about object affordances. Nonetheless, these systems inherit data biases from curated corpora, amplifying representational skews—e.g., underrepresentation of diverse physical interactions—that propagate to predictions, as biases in training data lead to skewed outcome distributions rather than veridical simulations. True adherence to physical invariance remains elusive, with models critiqued for simulating superficial dynamics without underlying causal realism.
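As a concrete illustration of the cross-modal alignment CLIP learns, the sketch below scores a single image against candidate captions in the shared embedding space; the checkpoint name, image path, and candidate labels are illustrative assumptions rather than a prescribed workflow.

```python
# Sketch of CLIP-style zero-shot classification: an image is scored against
# candidate captions using embeddings learned contrastively during pre-training.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image (assumed path)
candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=candidate_labels, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarities -> probabilities
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```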

Technical Architecture

Core Architectures and Parameters

The transformer architecture, introduced in the 2017 paper "Attention Is All You Need," underpins the majority of foundation models through its self-attention mechanisms, which compute dependencies between sequence elements in parallel, eliminating the sequential processing constraints of recurrent neural networks like LSTMs. This design consists of stacked encoder and decoder layers, though many modern foundation models, such as those in the GPT series, employ decoder-only variants optimized for autoregressive generation. Self-attention enables efficient handling of long-range dependencies via scaled dot-product attention, formulated as \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V, where Q, K, and V represent query, key, and value projections, and d_k is the dimension of the keys.

Parameter counts in these models have scaled dramatically, reflecting empirical gains in capacity for generalization; for instance, BERT-base features approximately 110 million parameters across 12 layers with a hidden size of 768, while GPT-3 reaches 175 billion parameters organized into 96 layers with a model dimension of 12,288. Larger configurations project toward trillions of parameters, though active utilization varies; dense models activate all parameters per step, incurring high computational costs proportional to total size.

To mitigate inference expenses while expanding effective capacity, sparse Mixture-of-Experts (MoE) variants route inputs to specialized sub-networks, activating only a fraction of parameters per token. In Mixtral 8x7B, released in December 2023, this yields 46.7 billion total parameters but activates roughly 12.9 billion—comparable in cost to a dense 13B model—via a router selecting two out of eight experts per layer, achieving performance comparable to larger dense models like Llama 2 70B on benchmarks while reducing active compute. Such sparsity leverages conditional computation, where expert specialization emerges during training, empirically lowering latency without proportional parameter growth.

Beyond transformer dominance in foundation models, diffusion-based architectures serve as core designs for generative tasks in image synthesis, iteratively refining noise-added data through a reverse Markov process to produce samples. Stable Diffusion, a latent diffusion model released in August 2022, exemplifies this with a U-Net backbone conditioned on text embeddings, enabling high-fidelity image synthesis from broad pre-training on captioned datasets, though it diverges from autoregressive paradigms by prioritizing probabilistic denoising over token prediction. These variants highlight architecture-specific adaptations to modality, with transformers excelling in sequential data and diffusion models in continuous generation, both validated by downstream empirical utility.
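A minimal sketch of the scaled dot-product attention formula above, written in NumPy with toy dimensions chosen purely for illustration:

```python
# Minimal NumPy implementation of
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)      # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the key positions
    return weights @ V                                   # weighted sum of value vectors

# Toy example: a sequence of 4 tokens with key/query/value dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```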

Pre-Training Objectives

Pre-training objectives for foundation models consist of self-supervised tasks that enable models to learn predictive patterns from vast unlabeled datasets, effectively compressing distributional regularities into parameters without task-specific labels. These objectives exploit sequential dependencies within data or alignments across modalities, fostering emergent generalization through iterative prediction errors minimized via gradient descent.

In autoregressive architectures predominant in large language models (LLMs), the core objective is next-token prediction, where the model forecasts the subsequent token in a sequence conditioned on all preceding tokens. This unidirectional, causal approach—trained by maximizing the likelihood of observed sequences—empirically induces semantic and syntactic comprehension, as evidenced by zero-shot performance on downstream tasks scaling with model size and data exposure. For instance, GPT-3, pretrained on 300 billion tokens using this objective, demonstrated few-shot learning capabilities across diverse benchmarks, attributing gains to internalized probabilistic structures rather than rote memorization.

Masked language modeling serves as a bidirectional alternative in encoder-based models, such as BERT, which randomly occludes 15% of input tokens and trains the model to reconstruct them from surrounding context. Introduced in 2018, this objective captures holistic sequence representations by jointly modeling left and right contexts, outperforming unidirectional methods on tasks requiring deep comprehension, like natural language inference, with BERT achieving state-of-the-art results on GLUE benchmarks through pretraining on 3.3 billion words of BooksCorpus and English Wikipedia.

For multimodal foundation models integrating vision and language, contrastive learning aligns representations by maximizing similarity between paired inputs (e.g., image-caption) while minimizing it for unpaired negatives within batches. CLIP, released by OpenAI in 2021, exemplifies this via joint training on 400 million image-text pairs, yielding zero-shot transfer to image classification by computing cosine similarities in a shared embedding space, which empirically robustifies against distribution shifts compared to supervised baselines. These objectives collectively drive foundation models toward latent data compression, with empirical scaling laws confirming predictive power as a proxy for intelligence-like capabilities.
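The loss computation behind next-token prediction can be sketched in a few lines of PyTorch: logits and targets are shifted by one position and compared with cross-entropy. The tiny model and dimensions below are illustrative assumptions (a real language model would use causally masked self-attention rather than a plain embedding-plus-head).

```python
# Sketch of the autoregressive next-token objective: positions 0..T-2 are
# scored against targets 1..T-1 with cross-entropy.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 16, 4

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.head(self.embed(tokens))   # (batch, seq_len, vocab_size)

tokens = torch.randint(0, vocab_size, (batch, seq_len))
logits = TinyLM()(tokens)

loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for each prefix position
    tokens[:, 1:].reshape(-1),               # the actual next tokens
)
print(float(loss))  # roughly log(vocab_size) for an untrained model
```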

Data, Training, and Scaling

Data Requirements and Curation

Foundation models necessitate immense volumes of training data, typically measured in trillions of tokens, equivalent to terabytes or petabytes of raw text after initial processing. For instance, Meta's LLaMA models were trained on datasets derived from filtered snapshots of Common Crawl, a public web archive exceeding 400 terabytes uncompressed per monthly crawl, yielding effective corpora in the range of several terabytes post-curation. Similarly, OpenAI's GPT-3 utilized approximately 570 gigabytes of filtered text data, drawn from diverse web sources, though total raw inputs spanned far larger scales before deduplication and cleaning.

Primary data sources include web scrapes via archives like Common Crawl, which capture billions of pages for broad coverage of internet content; digitized books from corpora such as Project Gutenberg; and code from public repositories in datasets like The Stack for StarCoder models. These are supplemented by academic publications, Wikipedia extracts, and dialogue corpora to enhance factual density and syntactic variety, though web data dominates due to its scale and recency. Legal challenges have arisen, notably The New York Times' December 27, 2023, lawsuit against OpenAI and Microsoft, alleging unauthorized ingestion of copyrighted articles into training datasets, highlighting tensions over fair use in scraping protected content.

Curation pipelines address raw data's noise through sequential steps: heuristic filtering to excise low-value content like boilerplate or overly short passages; deduplication via exact hashing or approximate nearest-neighbor methods to eliminate redundancies, which can comprise up to 50% of unprocessed crawls; and quality scoring using classifiers trained on high-quality reference corpora or linguistic heuristics to prioritize coherent, informative text. These steps mitigate evident biases, such as overrepresentation of English-language or urban-centric viewpoints inherent in web distributions, yet residual skews persist, as curation cannot fully counteract the internet's empirical composition, which empirical audits show tilts toward certain ideological clusters in media-heavy subsets. Model-based filtering, employing smaller pretrained networks to rank passages, further refines inputs but introduces computational overhead and potential circularity if reliant on prior model generations.

Debates on quality versus quantity underscore that while unfiltered scale alone yields diminishing returns due to noise amplification, curated large datasets empirically outperform smaller pristine ones in generalization, as validated by benchmarks where filtered web derivatives enable emergent capabilities absent in toy corpora. However, escalating reliance on synthetic data—generated by prior models to augment shortages—risks "model collapse," a degenerative process in which recursive training on model outputs erodes output diversity and fidelity, as demonstrated in 2024 experiments showing rapid loss of rare events and convergence to bland distributions after a few iterations. Real-world sourcing thus remains paramount, with curation balancing empirical breadth against verifiable quality amid ongoing disputes over access to high-fidelity public data.
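A hedged sketch of the simplest curation steps named above—heuristic length filtering followed by exact deduplication via content hashing. The threshold and toy corpus are illustrative assumptions, not any lab's actual pipeline, and production systems add fuzzy deduplication and quality classifiers on top.

```python
# Minimal curation pass: drop very short passages, then deduplicate by hash.
import hashlib

raw_documents = [
    "Short.",
    "A reasonably long paragraph of web text that passes the length filter...",
    "A reasonably long paragraph of web text that passes the length filter...",  # duplicate
]

def curate(docs, min_chars=40):
    seen_hashes = set()
    kept = []
    for doc in docs:
        if len(doc) < min_chars:          # heuristic filter: boilerplate / too short
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:          # exact deduplication
            continue
        seen_hashes.add(digest)
        kept.append(doc)
    return kept

print(curate(raw_documents))  # only one document survives filtering and dedup
```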

Training Processes and Compute Demands

Training foundation models relies on frameworks that parallelize workloads across clusters of accelerators to manage the immense computational requirements. Common approaches include data parallelism, tensor parallelism, and pipeline parallelism, implemented in libraries such as PyTorch Distributed or Hugging Face Accelerate, which enable scaling to thousands of devices. These methods partition model parameters, gradients, or computations to mitigate memory constraints and accelerate convergence on hardware like NVIDIA GPUs or Google TPUs.

Compute demands for pre-training reach extraordinary scales, exemplified by estimates for GPT-4 at approximately 2.15 × 10^{25} floating-point operations (FLOPs), necessitating around 25,000 A100 GPUs running for 90 to 100 days. GPUs predominate due to their flexibility and ecosystem support via CUDA, while TPUs offer advantages in matrix multiplications for compatible workloads, as seen in Google's PaLM and Gemini models. Such runs demand specialized infrastructure, including high-bandwidth interconnects like NVLink and InfiniBand, to synchronize updates across nodes. By 2025, training frontier models faces escalating energy costs and hardware constraints, with power requirements doubling annually and projected to exceed multiple gigawatts per run, rivaling large power plants. Global AI data centers may require an additional 10 gigawatts in 2025 alone, straining grids and amplifying electricity expenses amid competition for advanced accelerators.

Efficiency innovations, such as mixed-precision training with FP16 or BF16 formats, reduce memory use and energy per operation while preserving accuracy. Post-training optimizations like quantization compress weights from 32-bit to lower precisions (e.g., INT8), slashing memory footprint and inference cost without retraining the full model, though they are applied after core training to maintain accuracy. Pruning eliminates redundant parameters, potentially reducing model size by 90% in structured approaches, but studies indicate quantization often yields superior compression-performance trade-offs for deployment. These techniques address compute bottlenecks by enabling deployment on resource-constrained devices, distinct from during-training efficiencies like gradient checkpointing.
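A back-of-the-envelope way to reason about these compute demands uses the common approximation that training cost is about 6 FLOPs per parameter per token (C ≈ 6·N·D). The GPU throughput and utilization figures below are illustrative assumptions, not measurements of any particular cluster.

```python
# Rough training-compute estimate using C ≈ 6 * N * D FLOPs.
N = 70e9            # parameters (e.g., a 70B-parameter model)
D = 1.4e12          # training tokens
C = 6 * N * D       # total training FLOPs
print(f"Training compute: {C:.2e} FLOPs")   # ~5.9e23 FLOPs

gpus = 2048                 # assumed cluster size
flops_per_gpu = 312e12      # assumed peak BF16 throughput of an A100-class GPU
utilization = 0.4           # assumed fraction of peak actually sustained
seconds = C / (gpus * flops_per_gpu * utilization)
print(f"Estimated wall-clock time: {seconds / 86400:.1f} days")
```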

Scaling Laws: Empirical Foundations

Scaling laws in foundation models refer to empirical relationships observed between training resources—such as compute, model parameters, and dataset size—and model performance, typically measured by cross-entropy loss on next-token prediction tasks. In their 2020 study, Kaplan et al. analyzed over 400 language models trained on datasets up to 300 billion tokens and found that loss scales as power-law functions: approximately L(N) \propto N^{-\alpha} with \alpha \approx 0.076 for model size N, L(D) \propto D^{-\beta} with \beta \approx 0.103 for dataset size D, and similar predictability for compute C. These relations held across six orders of magnitude in scale, enabling predictions of loss reduction from resource increases under transformer architectures and standard pre-training objectives, rather than theoretical derivations.

Subsequent work refined these findings by emphasizing compute-optimal regimes. Hoffmann et al. (2022) demonstrated that Kaplan's laws implied unbalanced scaling in prior models like Gopher, which prioritized parameters over data; instead, optimal allocation for a fixed compute budget requires scaling parameters and tokens roughly equally, at approximately 20 tokens per parameter. Training the 70-billion-parameter Chinchilla model on 1.4 trillion tokens validated this, achieving lower loss and superior downstream performance compared to larger models like Gopher (280 billion parameters on 300 billion tokens) with equivalent compute. This empirical shift highlighted data's underappreciated role, guiding efficient resource use without assuming indefinite parameter growth.

The predictability of these laws extended through models released up to 2024, correlating with observed losses in subsequent frontier systems and informing investments in compute-heavy training. However, 2025 analyses indicate emerging diminishing marginal returns on specific metrics, such as benchmark saturation or task-specific gains, prompting labs to pursue improvements beyond pure scaling. Critiques note that these laws capture resource-driven correlations under static methods but overlook algorithmic advances—like improved optimizers or architectures—that have accelerated progress faster than scaling alone in recent years, favoring iterative experimentation over rigid extrapolation to untested regimes. Such empirical foundations predict performance within observed bounds but do not causally guarantee breakthroughs like general intelligence absent verified innovations.
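The compute-optimal allocation implied by Hoffmann et al. can be worked out directly: combining C ≈ 6·N·D with the roughly 20-tokens-per-parameter rule fixes both N and D for a given budget. The budget value below is chosen to roughly reproduce the Chinchilla configuration and is otherwise an illustrative assumption.

```python
# Compute-optimal allocation under C = 6*N*D and D ≈ 20*N (Chinchilla heuristic).
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    # N = sqrt(C / (6 * tokens_per_param)), D = tokens_per_param * N
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

C = 5.76e23  # budget roughly matching the Chinchilla training run
N, D = chinchilla_optimal(C)
print(f"~{N / 1e9:.0f}B parameters, ~{D / 1e12:.1f}T tokens")  # ~69B, ~1.4T
```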

Adaptation and Evaluation

Fine-Tuning and Task Adaptation

Fine-tuning adapts pre-trained foundation models to downstream tasks by updating a subset or all of the model's parameters using task-specific datasets, leveraging the broad knowledge encoded during pre-training to achieve performance superior to training from scratch. This process exploits empirical transfer effects, where the model's generalized representations enable rapid adaptation with far less data and compute than initial training, often yielding accuracies competitive with or exceeding those of task-specific models built from random initialization.

Zero-shot prompting represents a parameter-free method, instructing the model via prompts to perform tasks without any task-specific examples or weight updates, relying solely on emergent capabilities from pre-training. For instance, models like GPT-3 have demonstrated zero-shot classification on text benchmarks by directly querying the model to categorize inputs, achieving results that approximate supervised baselines in some domains due to in-context generalization.

Parameter-efficient fine-tuning (PEFT) techniques further enhance adaptation by minimizing the number of trainable parameters, addressing the computational infeasibility of full fine-tuning for billion-parameter models. LoRA (Low-Rank Adaptation), introduced in 2021, exemplifies this by freezing the pre-trained weights and injecting trainable low-rank decomposition matrices into transformer layers, reducing trainable parameters by up to 10,000 times compared to full fine-tuning while matching downstream performance on language tasks. Other PEFT variants, such as adapters, insert small feed-forward networks into transformer layers, collectively enabling adaptation on consumer hardware without degrading the model's core capabilities. An empirical demonstration of such adaptation involves fine-tuning the 6-billion-parameter GPT-J model, released in 2021, for code generation tasks; for example, targeted fine-tuning on secure coding datasets has produced models generating non-vulnerable C and C++ code at rates of 70.4% and 64.5%, respectively, outperforming base pre-trained outputs through domain-specific weight adjustments. These methods underscore the efficiency of foundation models, where pre-training's vast data exposure provides a robust initialization that accelerates convergence on specialized objectives.

Despite these advantages, fine-tuning risks catastrophic forgetting, wherein the model degrades on pre-training capabilities during adaptation to new domains or tasks, as observed in continual instruction tuning where domain-specific updates overwrite general factual recall. This phenomenon arises from interference in shared parameter spaces, particularly acute in full fine-tuning but mitigated somewhat by PEFT approaches like LoRA, though persistent in severe domain shifts without techniques such as elastic weight consolidation. Empirical studies confirm forgetting rates exceeding 50% on held-out probes post-fine-tuning, highlighting the need for regularization to preserve transferability.
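A minimal sketch of the LoRA idea described above: the pre-trained linear weight is frozen and only a low-rank update B·A is trained. Dimensions, rank, and scaling below are illustrative assumptions, not a drop-in replacement for a production PEFT library.

```python
# LoRA-style adapter: frozen base weight plus a trainable rank-r update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze pre-trained weights
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable {trainable} of {total} parameters")  # only the low-rank factors train
```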

Benchmarks, Metrics, and Limitations

Foundation models are commonly evaluated using standardized benchmarks designed to assess natural language processing (NLP) capabilities, such as the General Language Understanding Evaluation (GLUE), introduced in 2018 as a collection of nine diverse tasks including sentiment analysis and natural language inference. SuperGLUE, released in 2019, extends GLUE with eight more challenging tasks to better test advanced language understanding, incorporating elements like coreference resolution and multi-sentence reading comprehension under stricter conditions. These benchmarks aggregate scores across tasks to gauge overall performance, with top models achieving near-human or superhuman results on SuperGLUE by 2023, such as scores exceeding 90% on multiple subtasks.

The Beyond the Imitation Game Benchmark (BIG-bench), launched in 2022, expands evaluation to over 200 tasks probing emergent abilities—unexpected capabilities arising at scale, like multi-step arithmetic or theory-of-mind reasoning—that smaller models lack but larger ones exhibit. However, analyses have revealed that such emergences often stem from metric choice rather than genuine phase transitions in capability; for instance, reformulating BIG-bench metrics continuously (e.g., using Brier scores) eliminates apparent discontinuities, suggesting many "emergent" jumps are artifacts of non-linear evaluation scales. By 2025, critiques highlight how saturation on these benchmarks—driven by repeated high scores—obscures stagnation in true generalization, with foundation models overfitting to familiar patterns rather than demonstrating robust causal understanding.

A major flaw in these standard metrics is data contamination, where benchmark test sets inadvertently appear in pre-training corpora, inflating scores without reflecting learned generalization; exposures in 2023 identified contamination in datasets like those underlying GLUE and BIG-bench subsets, affecting models trained on vast web-scraped data up to trillions of tokens. This issue persists, as 2024-2025 surveys estimate contamination rates exceeding 10-20% in popular benchmarks, leading to unreliable progress signals and misleading claims of AGI proximity, since models memorize rather than reason through novel instances. Such flaws prioritize superficial accuracy over causal validity, masking gaps in handling distribution shifts or counterfactuals.

To address these limitations, benchmarks emphasizing abstraction and generalization, such as the Abstraction and Reasoning Corpus (ARC), test core reasoning via few-shot grid-based puzzles requiring pattern induction and application to unseen scenarios—tasks humans solve at 80-90% accuracy but where foundation models score below 50% as of 2025, often near 0% on private ARC-AGI variants without test leakage. ARC evaluation reveals worldview inconsistencies, like failure to consistently apply causal rules across compositions, highlighting how standard metrics overlook deficiencies in flexible generalization despite scale. These alternatives underscore the need for evaluations resistant to memorization, prioritizing verifiable causal mechanisms over saturated, contamination-prone scores.
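The metric-choice argument can be illustrated with a small synthetic simulation: if per-token accuracy improves smoothly with scale, an exact-match metric over a multi-token answer can still appear to jump discontinuously. The numbers below are synthetic assumptions for illustration, not benchmark data.

```python
# Synthetic illustration: smooth per-token improvement vs. a discontinuous-looking
# exact-match curve (all k answer tokens must be correct, probability p**k).
scales = [1e8, 1e9, 1e10, 1e11, 1e12]                 # notional parameter counts
per_token_accuracy = [0.50, 0.70, 0.85, 0.95, 0.99]   # smooth improvement with scale
k = 20                                                 # answer length in tokens

for n, p in zip(scales, per_token_accuracy):
    exact_match = p ** k
    print(f"{n:.0e} params: per-token {p:.2f}, exact-match {exact_match:.4f}")
# Exact match stays near zero until per-token accuracy is high, then rises
# abruptly, mimicking an apparent "emergent" jump under a discontinuous metric.
```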

Ecosystem and Strategies

Supply Chain and Infrastructure

The hardware supply chain for foundation models centers on specialized accelerators, with NVIDIA commanding over 90% of the data center GPU market essential for AI training as of mid-2025. These GPUs, such as the H100 and Blackwell series, rely heavily on advanced fabrication by TSMC, which produces the majority of high-performance AI chips at its Taiwan facilities using nodes like 4nm and below. This dependency exposes vulnerabilities to geopolitical tensions in the Taiwan Strait, where disruptions could halt global AI hardware production, yet it has incentivized diversification efforts like TSMC's Arizona plant, which began yielding initial Blackwell wafers in October 2025.

U.S. export controls, initiated on October 7, 2022, and expanded through 2023 and 2024, restrict shipments of advanced semiconductors and manufacturing equipment to China, aiming to limit military and frontier AI applications while carving a bifurcated market. These measures have slashed NVIDIA's China revenue share from 20-25% pre-2022 to near zero by 2025, prompting Chinese firms to develop alternatives like Huawei's Ascend series, though at reduced performance. Such controls underscore supply-chain fragilities but drive innovation incentives, including U.S. investments in domestic fabs under the CHIPS Act, fostering competition beyond NVIDIA's near-monopoly.

On the software side, open ecosystems mitigate proprietary bottlenecks by facilitating model distribution and integration; Hugging Face serves as a central hub hosting millions of pre-trained models, datasets, and tools, enabling collaborative development akin to GitHub for open-source software. By 2025, it supports over 50,000 organizations in sharing transformer-based architectures, reducing duplication and promoting ecosystem-wide efficiencies without reliance on closed vendors.

Infrastructure bottlenecks loom large, particularly in energy, where AI data centers may require an additional 10 gigawatts of U.S. capacity in 2025 alone, equivalent to Utah's total power capacity, straining grids amid projections of data centers doubling electricity use by 2030. Rare earth elements, critical for magnets in cooling systems and server components, face supply constraints dominated by China, which controls roughly 70% of mining and 90% of refining; export licensing tightened in 2025 exacerbates risks, though U.S. stockpiles buffer short-term disruptions and spur domestic sourcing incentives. These pressures highlight the need for resilient, market-driven scaling over insulated policies to sustain foundation model advancement.

Model Release Approaches: Open vs. Closed

Foundation models are released either as open-weight models, where weights and parameters are publicly available for inspection, modification, and redistribution under permissive licenses, or as closed-source models, where access is restricted to proprietary APIs or limited previews, preserving commercial control but limiting external scrutiny. Open releases facilitate community-driven improvements, while closed approaches prioritize controlled deployment to mitigate misuse risks.

Prominent open-weight releases include Meta's LLaMA 2, launched on July 18, 2023, with model weights available under a custom license permitting commercial use for entities below certain scale thresholds, enabling widespread fine-tuning and derivative development. Similarly, xAI open-sourced the 314-billion-parameter Grok-1 model weights and architecture on March 17, 2024, under the Apache 2.0 license, allowing unrestricted use and modification without releasing training data. These approaches accelerate innovation through collective iteration, as developers worldwide can build upon shared foundations, reducing redundant compute costs and fostering rapid enhancements in performance and safety via distributed vulnerability detection. For instance, open models support customized applications, such as retraining for defense-specific tasks, enhancing resilience through diverse implementations over reliance on single vendors.

In contrast, closed models like OpenAI's GPT-4, released in March 2023 via API access only, withhold weights to prevent replication and potential weaponization, but this opacity hinders independent verification of safety alignments or emergent risks. Claims of robust safeguards in closed systems remain untestable externally, potentially concealing flaws in bias mitigation or robustness that community auditing in open models could expose. Such proprietary strategies may curb immediate misuse by adversaries lacking replication capabilities, yet they concentrate power in few firms, slowing broader progress and raising dependency concerns for downstream users unable to inspect or adapt core components.

Debates center on trade-offs, where open releases promote U.S. leadership by diversifying domestic capabilities and countering foreign monopolies through inclusive innovation, whereas closed models risk eroding competitive edges if proprietary advantages prove brittle against open alternatives. While closed approaches theoretically limit proliferation to state actors, experience suggests open scrutiny strengthens overall security by enabling proactive fixes, outweighing marginal benefits in a field advancing via shared knowledge. Proponents of openness argue it aligns with historical open-source software precedents, where public code review has fortified systems against threats more effectively than isolation.

Capabilities and Applications

Demonstrated Capabilities

Foundation models excel in machine translation, achieving evaluation scores that match or surpass those of professional translators for high-resource language pairs, as demonstrated by systems evaluated in WMT competitions where neural models outperform baselines in aggregate quality. In summarization tasks, these models attain ROUGE scores indicating strong n-gram overlap with human-written references, with empirical correlations to human judgments of informativeness and coherence exceeding 0.8 in controlled studies.

Multimodal foundation models extend these competencies to video generation, as exemplified by OpenAI's Sora, released in February 2024, which produces videos up to one minute in length from text prompts while maintaining high visual fidelity, temporal consistency, and adherence to descriptive instructions in reproducible demonstrations. The follow-up Sora 2, launched in September 2025, further improves physical realism, simulation accuracy, and user controllability in generated content.

In robotics, foundation models enable generalized planning from natural-language instructions, supporting manipulation, navigation, and grasping tasks with verified success rates in real-world benchmarks; for instance, vision-language-action models achieve over 90% task completion in diverse scenarios through pre-training on large-scale datasets. These outcomes stem from scalable pre-training that allows zero-shot adaptation to novel environments, as confirmed in peer-reviewed evaluations of embodied agents.

Such capabilities are underpinned by reproducible benchmark results where foundation models surpass human experts on specialized tasks, including behavior prediction with accuracies exceeding domain specialists by 10-20 percentage points on held-out datasets. By mid-2025, these verified performances have driven widespread integration, with generative AI tools built on foundation models used monthly by 1 in 8 workers globally, per survey data from enterprise and consumer adoption tracking.

Real-World Deployments and Achievements

GitHub Copilot, deployed in technical preview in June 2021 by GitHub in collaboration with OpenAI, represents an early enterprise-scale application of foundation model adaptations for coding assistance. Built on the Codex model—a descendant of GPT-3 trained on vast code repositories—Copilot suggests code completions and functions in real-time within integrated development environments. A randomized controlled experiment with 95 professional developers found that Copilot users completed repository-level programming tasks 55.8% faster than a no-Copilot baseline, with equivalent or superior solution correctness, as measured by passed unit tests. Subsequent enterprise evaluations, including collaborations with Accenture, reported consistent speedups in task completion and reduced manual effort, enabling developers to focus on higher-level architecture over boilerplate implementation.

In healthcare and biotechnology, DeepMind's AlphaFold system, advanced in its second iteration released in July 2021, achieved breakthrough performance in protein structure prediction by leveraging attention-based architectures trained on protein sequence and structural data. AlphaFold 2 topped the Critical Assessment of Structure Prediction (CASP14) competition with median global distance test scores rivaling experimental accuracy for many targets, enabling reliable predictions without physical crystallization. The model's open-source release and associated database, launched in July 2021, have provided computed structures for over 200 million protein entries across known organisms, accelerating drug target identification and enzyme engineering; for instance, it has informed variant pathogenicity assessments in rare diseases and protein design pipelines. This work earned its primary developers, Demis Hassabis and John Jumper, the 2024 Nobel Prize in Chemistry, shared with David Baker for computational protein design.

These deployments underscore empirical productivity gains in pattern-heavy domains like coding and structural biology, where foundation models reduce iterative trial-and-error. However, real-world integrations reveal boundaries in causal reasoning tasks, such as predicting intervention outcomes in dynamic systems, where models' correlational training yields unreliable extrapolations absent human validation; studies emphasize that human oversight mitigates risks in deployment pipelines. Enterprise adopters report acceptance rates of 20-30% for generated outputs, necessitating review to ensure logical correctness beyond statistical plausibility.

Criticisms and Technical Limitations

Architectural and Performance Shortfalls

Foundation models, particularly autoregressive transformer-based architectures, demonstrate persistent hallucinations, wherein they produce confident but verifiably false outputs, especially under out-of-distribution conditions where training data patterns do not align with query demands. This brittleness arises from probabilistic next-token prediction, which favors fluency over factual fidelity, leading to fabricated details in responses. For example, in evaluations of GPT-4 released in March 2023, the model exhibited factual inaccuracies and reasoning lapses, such as inventing non-existent references or events, despite safeguards. Empirical assessments, including those extracting propositional claims from model-generated biographies, confirmed hallucination rates where models deviated from ground-truth facts in up to 20-30% of verifiable instances.

The absence of domain-specific inductive biases in foundation models further exacerbates architectural shortfalls, as these systems rely on scale rather than embedded priors for reasoning about causal structures or physical invariants. Without hardcoded assumptions like conservation laws or temporal ordering, models violate basic physics in simulation tasks; for instance, they may generate trajectories ignoring momentum preservation or energy constraints when prompted for hypothetical scenarios beyond memorized examples. This stems from the transformer architecture's emphasis on pattern matching over mechanistic understanding, resulting in outputs that contradict first-principles derivations even after fine-tuning. Such deficiencies highlight a core limitation: performance gains from parameter scaling do not substitute for explicit biases tailored to real-world causality.

Efficiency constraints compound these issues, with inference latency scaling poorly due to the quadratic complexity of self-attention mechanisms in models exceeding billions of parameters. Generating a single response from a large foundation model like GPT-4 can require 1-10 seconds on high-end GPUs, rendering such models unsuitable for deployments where sub-millisecond latencies are mandated, such as autonomous driving or real-time sensor processing. Hardware limitations on resource-constrained devices amplify this, as full model loading demands gigabytes of memory and sustained power draw, often exceeding available thermal and computational budgets without aggressive compression that degrades accuracy.

Empirical Critiques of Hype and Benchmarks

Critics of foundation model progress have highlighted benchmark saturation as a key indicator of overstated capabilities. By late 2024, leading models such as OpenAI's o1 and GPT-4o achieved MMLU scores exceeding 90%, nearing the benchmark's effective ceiling at human expert performance levels (typically 90-95% for specialists across subjects), which obscures further differentiation and masks persistent gaps in general intelligence. This saturation arises because MMLU primarily evaluates knowledge recall and simple inference, allowing high scores through data-driven pattern matching rather than robust reasoning, prompting the development of harder variants like MMLU-Pro to expose these limitations.

Scaling skepticism has intensified in 2025 analyses, which argue that continued compute and data increases yield diminishing returns on misleading metrics that conflate narrow task proficiency with broad intelligence. Publications like those from AIGuys contend that AGI-oriented benchmarks overstate progress by rewarding statistical interpolation over causal understanding, with economic constraints—such as data scarcity and energy costs—preventing sustained scaling without architectural innovations. Empirical evidence supports this, as performance plateaus emerge despite exponential resource growth; for instance, models trained on trillions of tokens show no proportional gains in novel problem-solving, indicating reliance on memorization rather than generalization to unseen domains.

Benchmarks targeting core reasoning, such as the Abstraction and Reasoning Corpus (ARC), reveal foundational shortcomings, with foundation models failing to exceed 40-50% accuracy even after extensive fine-tuning and test-time compute. ARC tasks require few-shot abstraction from grid-based patterns, where humans achieve over 80% success, but LLMs exhibit brittleness, mistaking superficial correlations for underlying rules and lacking systematic generalization—suggesting that apparent emergent reasoning is largely an artifact of scale-induced memorization. These failures underscore that hype around benchmark leaderboards often ignores holdout evaluations designed to probe true generalization, prioritizing verifiable data over narrative-driven optimism.

Risks, Controversies, and Debates

Safety, Alignment, and Misuse Risks

Reinforcement learning from human feedback (RLHF), applied by OpenAI in January 2022 to align foundation models like InstructGPT, trains reward models on human preferences to steer outputs away from harmful or unhelpful responses, significantly reducing overt toxicity in deployments such as ChatGPT. Despite these gains, RLHF exhibits fundamental limitations, including reward hacking where models exploit superficial patterns in training data rather than internalizing intent, and vulnerability to adversarial jailbreaks that bypass safeguards via role-playing prompts or encoded inputs. Evaluations of production-scale models in 2023–2025 reveal jailbreak success rates exceeding 50% in some scenarios, even after safety fine-tuning, underscoring RLHF's incomplete robustness against determined circumvention.

Misuse risks from foundation models encompass generating deepfakes for disinformation, scams, or targeted harassment, with multimodal capabilities enabling realistic synthetic audio, video, and text that amplify deception at scale; for instance, non-consensual deepfake imagery has targeted over 4,000 public figures using accessible tools derived from open image models. Open-weight models heighten proliferation concerns by allowing malicious adaptation without oversight, yet they also empower defensive innovations, such as community-built detectors and red-teaming datasets that outpace proprietary restrictions in adaptability. Empirical instances of misuse, including voice-cloning scams defrauding individuals of thousands of dollars, demonstrate tangible harms but remain sporadic relative to model usage volumes exceeding trillions of tokens processed annually.

Despite alarmist projections, real-world catastrophic outcomes from foundation model misalignment or misuse have not materialized at scale; analyses of over 499 reported generative AI incidents through mid-2025 identify predominantly localized harms like misinformation bursts or ethical lapses, with no verified existential escalations amid deployments serving hundreds of millions of users. This empirical track record—contrasting hype-driven forecasts of imminent doom—highlights precautionary approaches' tendency to overprioritize speculative tail risks, often sidelining iterative market mechanisms where developer updates, user feedback loops, and competitive pressures refine safeguards faster than centralized mandates. Such dynamics suggest that while vigilance against jailbreaks and misuse persists, overreliance on halting development undervalues evidence of self-correcting equilibria in deployed systems.
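The reward-modeling step at the heart of RLHF is commonly trained with a pairwise (Bradley-Terry-style) preference loss: the reward assigned to the human-preferred response should exceed that of the rejected one. The tiny reward model and random stand-in features below are illustrative assumptions, not any lab's actual setup.

```python
# Pairwise preference loss for a toy RLHF reward model.
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

# Stand-ins for pooled hidden states of (prompt + response) pairs.
chosen_features = torch.randn(8, 128)     # human-preferred responses
rejected_features = torch.randn(8, 128)   # dispreferred responses

r_chosen = reward_model(chosen_features).squeeze(-1)
r_rejected = reward_model(rejected_features).squeeze(-1)

# -log sigmoid(r_chosen - r_rejected): minimized when preferred responses score higher.
loss = -nn.functional.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(float(loss))
```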

Policy and Regulatory Controversies

The development and deployment of foundation models have sparked intense debates over the appropriate balance between regulatory oversight to mitigate potential harms and minimal intervention to preserve innovation and global competitiveness. Proponents of light-touch regulation argue that overly stringent rules risk ceding leadership to less-regulated jurisdictions like China, where state-backed AI advancement proceeds unchecked, while advocates for controls emphasize preemptive measures against unproven systemic risks. These tensions manifest in divergent approaches across jurisdictions, with critics of heavy regulation highlighting evidence of slowed technological progress in regulated sectors, such as past EU data protection rules that disadvantaged European firms relative to U.S. counterparts.

The European Union's AI Act, finalized in 2024 and entering phased enforcement from August 2025, imposes specific obligations on providers of general-purpose AI models—encompassing foundation models trained with substantial compute resources, such as those exceeding thresholds for systemic-risk classification. These include requirements for technical documentation, usage instructions, copyright compliance, and, for high-impact models, risk assessments, transparency reporting, and adherence to forthcoming codes of practice developed by the EU AI Office. Systemic-risk models, identified via criteria like training compute over 10^25 FLOPs or high-impact capabilities, face additional scrutiny, including model evaluations and incident reporting to the EU AI Office.

Critics contend that the Act's classifications and compliance burdens, such as mandatory disclosures and evaluations, could hinder U.S. and European firms' competitiveness, potentially diverting investment and talent to unregulated markets and eroding Western competitiveness against China, where no equivalent constraints apply. For instance, vague implementation guidelines have been faulted for creating strategic uncertainty, prompting firms to relocate operations outside the EU, as evidenced by prior regulatory precedents like GDPR's uneven impact on European technology firms. Supporters, often aligned with precautionary frameworks, view these measures as essential for addressing opacity in model training and deployment, though empirical data on foundation model incidents remains limited, with most reported risks stemming from misuse rather than inherent model flaws.

In the United States, Executive Order 14110, issued on October 30, 2023, directed agencies to address risks from dual-use foundation models—defined as those trained via self-supervision on broad data with at least 10^26 FLOPs of compute—through testing, red-teaming, and reporting requirements for developers exceeding compute thresholds. It also advanced export controls on AI-enabling hardware to curb proliferation to adversaries, balancing innovation promotion via federal infrastructure investments against safeguards like bias mitigation guidelines. Debates surrounding the order pit calls for open model weights and reduced controls—favoring broad access to accelerate domestic innovation—against restrictions to prevent adversarial gains, with proponents of the latter arguing that empirical underestimation of diffusion risks justifies controls, though evidence shows open models like early LLaMA variants enabling rapid global replication without proportional harm. The order's subsequent partial rescission under the incoming administration in 2025 underscored accelerationist preferences for deregulation, prioritizing compute scaling over preemptive limits.
Broader ideological divides frame these policies: "doomer" perspectives, rooted in rationalist and effective altruism circles, advocate stringent pauses or caps on foundation model scaling due to speculative existential threats, as seen in petitions for development halts signed by prominent researchers and technology executives. In contrast, accelerationists, including the effective accelerationism (e/acc) movement, counter that such fears overestimate unproven risks while ignoring historical patterns where rapid technology adoption—such as the internet or biotech—yielded net benefits under lighter-touch regulation, urging empirical focus on verifiable misuse over hypothetical doomsdays. This clash highlights a core tension: left-leaning precautionary approaches in institutions like the European Commission versus market-driven realism emphasizing underappreciated upsides in capability gains.

National Security Implications

The United States holds a commanding lead in foundation model development, propelled by private-sector initiatives that outpace state-directed efforts elsewhere, with U.S. entities releasing 40 notable AI models in 2024 versus China's 15. Ventures like xAI, established in July 2023 to advance AI through competitive innovation, underscore this edge in scaling large models via market-driven resources rather than centralized planning. U.S. export controls on advanced chips, implemented since 2022 to curb Chinese access to cutting-edge compute, have nonetheless accelerated China's domestic semiconductor and model-building capabilities, shrinking the performance differential between top U.S. and Chinese systems from 9.3% in 2024 to under 2% by early 2025.

Open-source foundation models offer strategic advantages for U.S. national security by diversifying supply chains and mitigating risks from single-vendor dependencies, enabling the Department of Defense to fine-tune models for specialized applications without reliance on proprietary providers. A CSIS assessment emphasizes that open architectures counter closed-system monopolies, which present exploitable chokepoints in wartime scenarios, while fostering rapid iteration among allies and contractors. This bolsters resilience against adversarial disruption, as distributed development reduces the impact of targeted attacks on any single provider.

Proliferation risks from open models, including potential adaptation by state actors such as China for offensive cyber tools or autonomous systems, must be weighed against verification benefits: public weights permit auditing for backdoors and vulnerabilities, unlike black-box closed models that obscure hidden flaws. This transparency enables preemptive hardening, such as community-driven safeguards, enhancing U.S. deterrence in an era where Chinese developers have surged in open-model releases, surpassing U.S. counterparts in certain benchmarks by October 2025. Prioritizing open ecosystems thus sustains American primacy by embedding defensive scrutiny into global AI flows, rather than ceding ground to opaque rivals.

Societal and Economic Impacts

Productivity and Innovation Benefits

Foundation models have demonstrated measurable productivity enhancements in software development, where empirical studies indicate developers can complete coding tasks up to twice as fast when utilizing generative AI tools built on these models. A randomized controlled trial involving experienced open-source developers, however, found that participants expected early-2025 AI tools to raise their productivity by approximately 24%, while measured task completion times in that setting did not improve, highlighting a gap between perceived and realized gains on complex, familiar codebases. Where gains do materialize, they stem from the models' ability to automate routine subtasks like boilerplate writing, allowing human developers to focus on higher-level architecture and problem-solving, thereby accelerating iteration cycles in development workflows.

Enterprise adoption of foundation models has surged, with generative AI usage among businesses rising from 33% in 2023 to 71% in 2024, reflecting rapid integration into operational processes. IBM reported achieving $4.5 billion in productivity gains through AI and automation initiatives by August 2025, attributing these to scalable deployments of model-based systems that optimize resource allocation and decision-making across its operations. Such transformations enable firms to reallocate human effort from repetitive analysis to strategic tasks, yielding compounded efficiency in sectors where AI-driven forecasting and process automation have reduced operational latencies by 20-50% in targeted applications.

By providing pre-trained, adaptable architectures, foundation models lower technical and financial barriers to AI development, enabling non-experts and smaller entities to innovate without building models from scratch. This fosters broader exploration of downstream applications, as evidenced by platforms like Hugging Face, where less experienced developers leverage fine-tuned foundation models to prototype novel tools, expanding the ecosystem of AI-powered products beyond elite research labs. Consequently, these models accelerate market-driven innovation by empowering startups and independent creators to deploy specialized applications—such as custom analytics apps or automated design tools—hastening the translation of ideas into viable solutions and outpacing resource-constrained, equity-prioritizing alternatives.

Labor and Economic Disruptions

Foundation models, powering generative AI systems, have automated routine cognitive tasks such as data entry, basic document drafting, and content summarization, leading to targeted job displacements in administrative and clerical roles. A 2024 World Economic Forum analysis projected that AI-driven automation could displace up to 85 million jobs globally by 2025, primarily in repetitive areas like clerical administration and customer support. Empirical studies on U.S. firms adopting AI technologies, including those leveraging foundation models, indicate employment declines of 2-5% in exposed occupations, particularly for entry-level workers in information processing.

However, these shifts have been accompanied by net job creation in AI-related development, operations, and complementary human roles requiring oversight and creativity. The same report forecasts the simultaneous emergence of 97 million new positions, yielding a net gain of 12 million jobs, driven by demand for AI specialists, data annotators, and system integrators. U.S. data from 2023-2025 shows AI job postings growing by over 20% annually, concentrated in states like California (15.7% of total AI postings in 2024), offsetting losses in automatable tasks. MIT Sloan research on firm-level AI adoption confirms that while routine tasks diminish, overall employment rises due to productivity gains enabling expansion and hiring in non-routine areas.

Economic benefits from foundation models disproportionately accrue to skilled workers, exacerbating short-term inequality, as evidenced by wage premiums for AI proficiency. OECD analysis of U.S. data reveals that workers with AI skills command a 20-25% wage premium, concentrated in high-skill sectors like technology and finance where models augment rather than replace expertise. A 2025 sectoral study finds AI integration raises wages by 10-15% in complementary industries but suppresses them in labor-intensive ones with high substitution rates. This skill bias aligns with economic models showing foundation models widening the premium for high-skilled labor, though upskilling programs have mitigated gaps for mid-tier workers in adopting firms.

Alarmist predictions of mass unemployment from foundation models overlook historical patterns of technological adaptation, where fears of permanent displacement proved unfounded. During the Great Depression-era mechanization wave, U.S. unemployment peaked at roughly 25% amid automation anxieties, yet post-World War II employment surged as new industries absorbed labor. Similarly, computerization from the 1980s onward displaced routine clerical jobs but generated more positions in programming and IT support, with OECD countries experiencing rising employment-to-population ratios despite productivity doublings. Recent Yale Budget Lab metrics post-ChatGPT (a foundation model derivative) show no broad U.S. labor market disruption 33 months after release, underscoring market-driven reallocation over systemic collapse. The empirical evidence favors the labor-creating effects of technological change, as task-specific automation prompts specialization rather than wholesale job elimination.

Geopolitical and Competitive Dynamics

The development of foundation models has intensified geopolitical competition, particularly between the United States and China, where these technologies are recognized as dual-use capabilities with applications in both civilian innovation and national defense. In 2024, U.S.-based institutions produced 40 notable models, maintaining a lead in high-performance foundation models, though Chinese developers narrowed the performance gap through rapid iteration and access to global open-source resources. By January 2025, China's DeepSeek model challenged U.S. dominance narratives by demonstrating competitive capabilities in large language models, underscoring the pace of Beijing's advancements despite successive U.S. export controls on advanced semiconductors, most recently expanded in January 2025.

U.S. private-sector firms, including Meta and xAI, have released open-weight foundation models like the Llama series, which democratize access and accelerate global innovation but inadvertently empower rivals by providing foundational architectures that Chinese developers can fine-tune. China's vibrant open-source ecosystem has leveraged these releases, enabling firms like DeepSeek and Alibaba to produce models that, by October 2025, surpassed U.S. counterparts in certain open-model rankings of capability and adoption. This diffusion highlights a tradeoff: while open releases foster redundancy in development paths, they reduce barriers for state-directed actors in authoritarian regimes to adapt models for strategic purposes, potentially amplifying dual-use risks without equivalent safeguards.

In 2025, international governance efforts intensified, with the OECD reinforcing its 2019 AI Principles for trustworthy systems amid calls for "global red lines" on unacceptable risks, yet multilateral frameworks have yielded to bilateral arrangements for agility. The EU AI Act, effective from 2024 with phased enforcement, imposes risk-based obligations that critics argue could stifle European innovation through compliance burdens like conformity assessments and fines up to 7% of global revenue, potentially ceding competitive ground to less-regulated actors. Bilateral pacts, such as those emerging in 2024-2025 between the U.S. and allies like the United Kingdom and Japan, prioritize targeted cooperation on safety and compute sharing over broad slowdowns, reflecting a pragmatic shift in a fragmented landscape.

Proponents of competitive dynamics contend that rivalry among decentralized actors generates safety benefits through redundant, diverse approaches to alignment and robustness, contrasting with centralized control that risks single points of failure or capture by unaccountable entities. This perspective posits that U.S. private-sector agility, unhindered by over-regulation, sustains leads in foundational capabilities, while excessive multilateral constraints could enable authoritarian regimes to outpace democracies through state mobilization, as evidenced by China's open-source momentum.

References

  1. On the Opportunities and Risks of Foundation Models - arXiv (Aug 16, 2021).
  2. On the Opportunities and Risks of Foundation Models.
  3. [PDF] On the Opportunities and Risks of Foundation Models.
  4. What are Foundation Models? - Aisera.
  5. What Are Foundation Models? | IBM.
  6. What is a foundation model? - Ada Lovelace Institute (Jul 17, 2023).
  7. What are Foundation Models? - DataCamp (Aug 15, 2023).
  8. [2005.14165] Language Models are Few-Shot Learners - arXiv (May 28, 2020).
  9. [PDF] Can Foundation Models Talk Causality? - OpenReview.
  10. [1706.03762] Attention Is All You Need - arXiv (Jun 12, 2017).
  11. What are foundation models? | Google Cloud.
  12. Foundation models: 2022's AI paradigm shift - VentureBeat (Sep 13, 2022).
  13. Pathways Language Model (PaLM): Scaling to 540 Billion ... (Apr 4, 2022).
  14. AI Index: State of AI in 13 Charts | Stanford HAI (Apr 15, 2024).
  15. The Llama 4 herd: The beginning of a new era of natively ... (Apr 5, 2025).
  16. Frontier AI regulation: Managing emerging risks to public safety (Jul 6, 2023).
  17. Frontier AI: capabilities and risks – discussion paper - GOV.UK (Apr 28, 2025).
  18. GPT-4 - OpenAI (Mar 14, 2023).
  19. OpenAI announces GPT-4, says beats 90% of humans on SAT - CNBC (Mar 14, 2023).
  20. Introducing the next generation of Claude - Anthropic.
  21. [PDF] The United States Artificial Intelligence Safety Institute: Vision ... (May 21, 2024).
  22. Artificial Intelligence Safety Institute Consortium (AISIC) | NIST.
  23. General-Purpose AI Models in the AI Act – Questions & Answers (Jul 10, 2025).
  24. [PDF] General Purpose AI and the AI Act.
  25. What Are Generative AI, Large Language Models, and Foundation ... (May 12, 2023).
  26. ARC-AGI-2 A New Challenge for Frontier AI Reasoning Systems (May 20, 2025).
  27. OpenAI o3 Breakthrough High Score on ARC-AGI-Pub (Dec 20, 2024).
  28. AI Models Struggle with New ARC-AGI-2 Benchmark ... - Medium (Mar 25, 2025).
  29. EU AI Act News: Rules on General-Purpose AI Start ... - Mayer Brown (Aug 1, 2025).
  30. General-Purpose Artificial Intelligence (GPAI) Models and ... - RAND (Aug 8, 2024).
  31. General-purpose AI regulation and the European Union AI Act (Aug 1, 2024).
  32. The Failed Strategy of Artificial Intelligence Doomers (Jan 31, 2025).
  33. Genie 3: A new frontier for world models - Google DeepMind (Aug 5, 2025).
  34. RT-2: New model translates vision and language into action (Jul 28, 2023).
  35. [2307.15818] RT-2: Vision-Language-Action Models Transfer Web ...
  36. Harvard and MIT Study: AI Models Are Not Ready to Make Scientific ... (Jul 15, 2025).
  37. Foundation models are going multimodal - Twelve Labs (Mar 31, 2023).
  38. [2507.12496] FOUNDER: Grounding Foundation Models in World ... (Jul 15, 2025).
  39. [2401.04088] Mixtral of Experts - arXiv (Jan 8, 2024).
  40. Mixtral of experts - Mistral AI (Dec 11, 2023).
  41. Large language model data pipelines and Common Crawl (WARC ... (Jun 3, 2023).
  42. Training Data for the Price of a Sandwich - Mozilla Foundation (Feb 6, 2024).
  43. How to Ensure Sufficient Data for AI Foundation Models (Jan 8, 2024).
  44. Open-Sourced Training Datasets for Large Language Models (LLMs).
  45. Datasets used for training LLM's: All types of data used to create ... (Aug 21, 2025).
  46. The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted ... (Dec 27, 2023).
  47. Mastering LLM Techniques: Text Data Processing - NVIDIA Developer (Nov 13, 2024).
  48. In search of the next generation of training sets for language models (Jun 17, 2024).
  49. Curating Non-English Datasets for LLM Training with NVIDIA NeMo ... (Jul 10, 2024).
  50. [2411.15821] Is Training Data Quality or Quantity More Impactful to ... (Nov 24, 2024).
  51. AI models collapse when trained on recursively generated data (Jul 24, 2024).
  52. [2404.01413] Is Model Collapse Inevitable? Breaking the Curse of ... (Apr 1, 2024).
  53. Distributed Training: Guide for Data Scientists.
  54. Training and Serving System of Foundation Models - arXiv.
  55. GPT-4 Details Revealed - by Patrick McGuinness (Jul 12, 2023).
  56. How Do GPUs and TPUs Differ in Training Large Transformer ... (Aug 25, 2025).
  57. How much power will frontier AI training demand in 2030? - Epoch AI (Aug 11, 2025).
  58. [PDF] AI's Power Requirements Under Exponential Growth - RAND.
  59. Deep Learning Model Optimization Methods - Neptune.ai.
  60. [2307.02973] Pruning vs Quantization: Which is Better? - arXiv (Jul 6, 2023).
  61. [2001.08361] Scaling Laws for Neural Language Models - arXiv (Jan 23, 2020).
  62. Training Compute-Optimal Large Language Models - arXiv (Mar 29, 2022).
  63. The Race to Efficiency: A New Perspective on AI Scaling Laws - arXiv (Jan 4, 2025).
  64. Current AI scaling laws are showing diminishing returns, forcing AI ... (Nov 20, 2024).
  65. Algorithmic Improvement Is Probably Faster Than Scaling Now (Jun 5, 2023).
  66. What if A.I. Doesn't Get Much Better Than This? | The New Yorker (Aug 12, 2025).
  67. [2501.13787] Parameter-Efficient Fine-Tuning for Foundation Models (Jan 23, 2025).
  68. What is parameter-efficient fine-tuning (PEFT)? - IBM.
  69. Zero-Shot Prompting - Prompt Engineering Guide.
  70. What is zero-shot prompting? - IBM.
  71. PEFT - Hugging Face.
  72. LoRA: Low-Rank Adaptation of Large Language Models - arXiv (Jun 17, 2021).
  73. A Survey on Parameter-Efficient Fine-Tuning for Foundation Models ... (Apr 29, 2025).
  74. Fine Tuning Large Language Model for Secure Code Generation.
  75. [2308.08747] An Empirical Study of Catastrophic Forgetting in Large ... (Aug 17, 2023).
  76. What is Catastrophic Forgetting? - IBM.
  77. GLUE Benchmark.
  78. SuperGLUE: A Stickier Benchmark for General-Purpose Language ... (May 2, 2019).
  79. SuperGLUE Benchmark.
  80. [PDF] arXiv:2206.07682v2 [cs.CL] 26 Oct 2022.
  81. Are Emergent Abilities of Large Language Models a Mirage? - arXiv (Apr 28, 2023).
  82. Benchmarking is Broken - Don't Let AI be its Own Judge - arXiv (Oct 15, 2025).
  83. NLP Evaluation in trouble: On the Need to Measure LLM Data ...
  84. A Survey on Data Contamination for Large Language Models - arXiv (Jun 5, 2025).
  85. Benchmark Data Contamination of Large Language Models: A Survey (Jun 6, 2024).
  86. The ARC Benchmark: Evaluating LLMs' Reasoning Abilities.
  87. System 2 Reasoning for Human-AI Alignment: Generality and ... - arXiv (Aug 13, 2025).
  88. Nvidia dominates GPU shipments with 94% share - Tom's Hardware (Sep 3, 2025).
  89. Nvidia and TSMC produce the first Blackwell wafer made in the U.S. (Oct 18, 2025).
  90. Exclusive: Nvidia and TSMC unveil first Blackwell chip wafer ... - Axios (Oct 17, 2025).
  91. Overly Stringent Export Controls Chip Away at American AI Leadership (May 5, 2025).
  92. Jensen says Nvidia's China AI GPU market share has plummeted ...
  93. U.S. Export Controls and China: Advanced Semiconductors (Sep 19, 2025).
  94. What is Hugging Face? - IBM.
  95. AI's Power Requirements Under Exponential Growth - RAND (Jan 28, 2025).
  96. [96]
  97. Securing America's Critical Minerals Supply (Oct 8, 2025).
  98. Surveying the Future of U.S. Open Foundation Model Policy - CSIS (Mar 21, 2024).
  99. Open-Source AI is a National Security Imperative - Third Way (Jan 30, 2025).
  100. Open Release of Grok-1 - xAI (Mar 17, 2024).
  101. Defense Priorities in the Open-Source AI Debate - CSIS (Aug 19, 2024).
  102. The Murky State of Frontier AI Transparency (Jan 16, 2025).
  103. With Open Source Artificial Intelligence, Don't Forget the Lessons of ... (Jul 29, 2024).
  104. Can machine translation match human expertise? Quantifying the ... (Jul 25, 2025).
  105. Sora: Creating video from text - OpenAI.
  106. Sora 2 is here | OpenAI (Sep 30, 2025).
  107. Foundation Models in Robotics: Applications, Challenges, and the ...
  108. Foundation Model Driven Robotics: A Comprehensive Review - arXiv (Jul 14, 2025).
  109. Large language models surpass human experts in predicting ... (Nov 27, 2024).
  110. [PDF] State of Foundation Models - 2025 (Innovation Endeavors).
  111. [2302.06590] The Impact of AI on Developer Productivity - arXiv (Feb 13, 2023).
  112. Research: Quantifying GitHub Copilot's impact in the enterprise with ... (May 13, 2024).
  113. Method of the Year 2021: Protein structure prediction - Nature (Jan 11, 2022).
  114. AlphaFold Protein Structure Database.
  115. Quantifying GitHub Copilot's impact on developer productivity and ... (Sep 7, 2022).
  116. Why language models hallucinate | OpenAI (Sep 5, 2025).
  117. Detecting hallucinations in large language models using semantic ... (Jun 19, 2024).
  118. [PDF] Understanding Inductive Bias in the Era of Large-Scale Pretraining ...
  119. Edge-First Language Model Inference: Models, Metrics, and Tradeoffs (May 22, 2025).
  120. A survey of edge efficient LLMs and techniques - ScienceDirect.
  121. The Sequence Opinion #485: What's Wrong With AI Benchmarks (Feb 6, 2025).
  122. MMLU-Pro: A More Robust and Challenging Multi-Task Language ... (Jun 3, 2024).
  123. MMLU-Pro: A More Robust and Challenging Multi-Task Language ...
  124. [124]
  125. The big AI story right now: Pure scaling has failed to produce AGI (Feb 19, 2025).
  126. [126]
  127. Martin Henz - ARCprize: How LLMs fail on ARC benchmark - LinkedIn (Jul 3, 2024).
  128. Emergent Abilities in Large Language Models: A Survey - arXiv (Feb 28, 2025).
  129. ARC-AGI-1: Abstract Reasoning Benchmark - Emergent Mind (Sep 16, 2025).
  130. Open Problems and Fundamental Limitations of Reinforcement ... (Jul 27, 2023).
  131. [PDF] Open Problems and Fundamental Limitations of Reinforcement ...
  132. [PDF] Weak-to-Strong Jailbreaking on Large Language Models - arXiv.
  133. Semantic Jailbreaks and RLHF Limitations in LLMs (Aug 2, 2025).
  134. Human performance in detecting deepfakes: A systematic review ...
  135. [PDF] On the Societal Impact of Open Foundation Models - arXiv (Feb 27, 2024).
  136. [PDF] Open-Sourcing Highly Capable Foundation Models - arXiv (Sep 29, 2023).
  137. AI deception: A survey of examples, risks, and potential solutions.
  138. A Closer Look at the Existing Risks of Generative AI - arXiv (May 28, 2025).
  139. Ten Ways the Precautionary Principle Undermines Progress in ... (Feb 4, 2019).
  140. The Precautionary Principle, Safety Regulation, and AI: This Time, It ... (Sep 4, 2024).
  141. What drives the divide in transatlantic AI strategy? - Atlantic Council (Sep 29, 2025).
  142. How Europe's AI Act could affect innovation and competitiveness (Jul 4, 2024).
  143. High-level summary of the AI Act | EU Artificial Intelligence Act.
  144. European Commission publishes guidelines on obligations for ... (Jul 24, 2025).
  145. EU AI Act Criticized for Granting US Tech Firms Excessive Influence (Jul 6, 2025).
  146. Safe, Secure, and Trustworthy Development and Use of Artificial ... (Nov 1, 2023).
  147. Executive Order on the Safe, Secure, and Trustworthy Development ... (Oct 30, 2023).
  148. Dual-Use Foundation Models with Widely Available Model Weights ... (Jul 30, 2024).
  149. [PDF] America's AI Action Plan - The White House (Jul 10, 2025).
  150. AI Doomers Versus AI Accelerationists Locked In Battle For Future ... (Feb 18, 2025).
  151. Effective accelerationism, doomers, decels, and how to flaunt your AI ... (Nov 20, 2023).
  152. Two warring visions of AI - Prospect Magazine (Jan 16, 2024).
  153. The 2025 AI Index Report | Stanford HAI.
  154. The US Is Winning the AI Race - But for How Long? - Project Syndicate (Sep 24, 2025).
  155. How will AI influence US-China relations in the next 5 years? (Jun 18, 2025).
  156. China now leads the U.S. in this key part of the AI race (Oct 13, 2025).
  157. Unleash developer productivity with generative AI - McKinsey (Jun 27, 2023).
  158. Measuring the Impact of Early-2025 AI on Experienced ... - METR (Jul 10, 2025).
  159. The reality of AI-Assisted software engineering productivity (Aug 16, 2025).
  160. AI Adoption Statistics in 2025 - Netguru (Sep 4, 2025).
  161. Enterprise transformation and extreme productivity with AI | IBM (Aug 26, 2025).
  162. AI in Action 2024 Report - IBM.
  163. Foundation Models Explained: Why Every Startup and Enterprise ... (Jun 13, 2025).
  164. Foundation Models and AI Innovation: Evidence from the ... (Oct 17, 2025).
  165. 5 Top Benefits of Foundation Models - Deepchecks (Jul 22, 2025).
  166. AI and Labor Markets: What We Know and Don't Know (Oct 14, 2025).
  167. [PDF] CHAPTER 4: Economy - Stanford HAI (Oct 17, 2024).
  168. How artificial intelligence impacts the US labor market | MIT Sloan (Oct 9, 2025).
  169. Artificial intelligence, job quality and inclusiveness - OECD (Jul 11, 2023).
  170. [PDF] AI Adoption and Wage Growth in U.S. Industries: A Sectoral Analysis (Aug 27, 2025).
  171. Artificial intelligence and the skill premium: A numerical analysis of ...
  172. History Repeats: The Longstanding Fear of Technology Replacing ... (Jul 1, 2024).
  173. The impact of AI on the labour market: is this time different? - OECD.AI.
  174. Evaluating the Impact of AI on the Labor Market - Yale Budget Lab (Oct 1, 2025).
  175. The fear of technology-driven unemployment and its empirical base (Jun 10, 2022).
  176. The Fed - The State of AI Competition in Advanced Economies (Oct 6, 2025).
  177. [PDF] Winning the Defining Contest: The US-China Artificial Intelligence ... (Jul 7, 2025).
  178. Meta changes course on open-source AI as China pushes ahead ... (Aug 3, 2025).
  179. China's drive toward self-reliance in artificial intelligence: from chips ... (Jul 22, 2025).
  180. Export Controls on Open-Source Models Will Not Win the AI Race (Feb 25, 2025).
  181. AI principles - OECD.
  182. AI governance through global red lines can help prevent ... (Sep 22, 2025).
  183. AI Act | Shaping Europe's digital future - European Union.
  184. The Annual AI Governance Report 2025: Steering the Future of AI.
  185. With AI, we need both competition and safety - Brookings Institution (Jul 8, 2024).