References
[1] On the Opportunities and Risks of Foundation Models - arXiv, Aug 16, 2021.
[2] On the Opportunities and Risks of Foundation Models.
[3] [PDF] On the Opportunities and Risks of Foundation Models.
[4] What are Foundation Models? - Aisera.
[5] What Are Foundation Models? - IBM.
[6] What is a foundation model? - Ada Lovelace Institute, Jul 17, 2023.
[7] What are Foundation Models? - DataCamp, Aug 15, 2023.
[8] Language Models are Few-Shot Learners - arXiv:2005.14165, May 28, 2020.
[9] [PDF] Can Foundation Models Talk Causality? - OpenReview.
[10] Attention Is All You Need - arXiv:1706.03762, Jun 12, 2017.
[11] What are foundation models? - Google Cloud.
[12] Foundation models: 2022's AI paradigm shift - VentureBeat, Sep 13, 2022.
[13] Pathways Language Model (PaLM): Scaling to 540 Billion ..., Apr 4, 2022.
[14] AI Index: State of AI in 13 Charts - Stanford HAI, Apr 15, 2024.
[15] The Llama 4 herd: The beginning of a new era of natively ..., Apr 5, 2025.
[16] Frontier AI regulation: Managing emerging risks to public safety, Jul 6, 2023.
[17] Frontier AI: capabilities and risks – discussion paper - GOV.UK, Apr 28, 2025.
[18] GPT-4 - OpenAI, Mar 14, 2023.
[19] OpenAI announces GPT-4, says beats 90% of humans on SAT - CNBC, Mar 14, 2023.
[20] Introducing the next generation of Claude - Anthropic.
[21] [PDF] The United States Artificial Intelligence Safety Institute: Vision ..., May 21, 2024.
[22] Artificial Intelligence Safety Institute Consortium (AISIC) - NIST.
[23] General-Purpose AI Models in the AI Act – Questions & Answers, Jul 10, 2025.
[24] [PDF] General Purpose AI and the AI Act.
[25] What Are Generative AI, Large Language Models, and Foundation ..., May 12, 2023.
[26] ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems, May 20, 2025.
[27] OpenAI o3 Breakthrough High Score on ARC-AGI-Pub, Dec 20, 2024.
[28] AI Models Struggle with New ARC-AGI-2 Benchmark ... - Medium, Mar 25, 2025.
[29] EU AI Act News: Rules on General-Purpose AI Start ... - Mayer Brown, Aug 1, 2025.
[30] General-Purpose Artificial Intelligence (GPAI) Models and ... - RAND, Aug 8, 2024.
[31] General-purpose AI regulation and the European Union AI Act, Aug 1, 2024.
[32] The Failed Strategy of Artificial Intelligence Doomers, Jan 31, 2025.
[33] Genie 3: A new frontier for world models - Google DeepMind, Aug 5, 2025.
[34] RT-2: New model translates vision and language into action, Jul 28, 2023.
[35] RT-2: Vision-Language-Action Models Transfer Web ... - arXiv:2307.15818.
[36] Harvard and MIT Study: AI Models Are Not Ready to Make Scientific ..., Jul 15, 2025.
[37] Foundation models are going multimodal - Twelve Labs, Mar 31, 2023.
[38] FOUNDER: Grounding Foundation Models in World ... - arXiv:2507.12496, Jul 15, 2025.
[39] Mixtral of Experts - arXiv:2401.04088, Jan 8, 2024.
[40] Mixtral of experts - Mistral AI, Dec 11, 2023.
[41] Large language model data pipelines and Common Crawl (WARC ..., Jun 3, 2023.
[42] Training Data for the Price of a Sandwich - Mozilla Foundation, Feb 6, 2024.
[43] How to Ensure Sufficient Data for AI Foundation Models, Jan 8, 2024.
[44] Open-Sourced Training Datasets for Large Language Models (LLMs).
[45] Datasets used for training LLM's: All types of data used to create ..., Aug 21, 2025.
[46] The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted ..., Dec 27, 2023.
[47] Mastering LLM Techniques: Text Data Processing - NVIDIA Developer, Nov 13, 2024.
[48] In search of the next generation of training sets for language models, Jun 17, 2024.
[49] Curating Non-English Datasets for LLM Training with NVIDIA NeMo ..., Jul 10, 2024.
[50] Is Training Data Quality or Quantity More Impactful to ... - arXiv:2411.15821, Nov 24, 2024.
[51] AI models collapse when trained on recursively generated data, Jul 24, 2024.
[52] Is Model Collapse Inevitable? Breaking the Curse of ... - arXiv:2404.01413, Apr 1, 2024.
[53] Distributed Training: Guide for Data Scientists.
[54] Training and Serving System of Foundation Models - arXiv.
[55] GPT-4 Details Revealed - by Patrick McGuinness, Jul 12, 2023.
[56] How Do GPUs and TPUs Differ in Training Large Transformer ..., Aug 25, 2025.
[57] How much power will frontier AI training demand in 2030? - Epoch AI, Aug 11, 2025.
[58] [PDF] AI's Power Requirements Under Exponential Growth - RAND.
[59] Deep Learning Model Optimization Methods - Neptune.ai.
[60] Pruning vs Quantization: Which is Better? - arXiv:2307.02973, Jul 6, 2023.
[61] Scaling Laws for Neural Language Models - arXiv:2001.08361, Jan 23, 2020.
[62] Training Compute-Optimal Large Language Models - arXiv, Mar 29, 2022.
[63] The Race to Efficiency: A New Perspective on AI Scaling Laws - arXiv, Jan 4, 2025.
[64] Current AI scaling laws are showing diminishing returns, forcing AI ..., Nov 20, 2024.
[65] Algorithmic Improvement Is Probably Faster Than Scaling Now - LessWrong, Jun 5, 2023.
[66] What if A.I. Doesn't Get Much Better Than This? - The New Yorker, Aug 12, 2025.
[67] Parameter-Efficient Fine-Tuning for Foundation Models - arXiv:2501.13787, Jan 23, 2025.
[68] What is parameter-efficient fine-tuning (PEFT)? - IBM.
[69] Zero-Shot Prompting - Prompt Engineering Guide.
[70] What is zero-shot prompting? - IBM.
[71] PEFT - Hugging Face.
[72] LoRA: Low-Rank Adaptation of Large Language Models - arXiv, Jun 17, 2021.
[73] A Survey on Parameter-Efficient Fine-Tuning for Foundation Models ..., Apr 29, 2025.
[74] Fine Tuning Large Language Model for Secure Code Generation.
[75] An Empirical Study of Catastrophic Forgetting in Large ... - arXiv:2308.08747, Aug 17, 2023.
[76] What is Catastrophic Forgetting? - IBM.
[77] GLUE Benchmark.
[78] SuperGLUE: A Stickier Benchmark for General-Purpose Language ..., May 2, 2019.
[79] SuperGLUE Benchmark.
[80] [PDF] arXiv:2206.07682v2 [cs.CL], Oct 26, 2022.
[81] Are Emergent Abilities of Large Language Models a Mirage? - arXiv, Apr 28, 2023.
[82] Benchmarking is Broken - Don't Let AI be its Own Judge - arXiv, Oct 15, 2025.
[83] NLP Evaluation in trouble: On the Need to Measure LLM Data ...
[84] A Survey on Data Contamination for Large Language Models - arXiv, Jun 5, 2025.
[85] Benchmark Data Contamination of Large Language Models: A Survey, Jun 6, 2024.
[86] The ARC Benchmark: Evaluating LLMs' Reasoning Abilities.
[87] System 2 Reasoning for Human-AI Alignment: Generality and ... - arXiv, Aug 13, 2025.
[88] Nvidia dominates GPU shipments with 94% share - Tom's Hardware, Sep 3, 2025.
[89] Nvidia and TSMC produce the first Blackwell wafer made in the U.S., Oct 18, 2025.
[90] Exclusive: Nvidia and TSMC unveil first Blackwell chip wafer ... - Axios, Oct 17, 2025.
[91] Overly Stringent Export Controls Chip Away at American AI Leadership, May 5, 2025.
[92] Jensen says Nvidia's China AI GPU market share has plummeted ...
[93] U.S. Export Controls and China: Advanced Semiconductors, Sep 19, 2025.
[94] What is Hugging Face? - IBM.
[95] AI's Power Requirements Under Exponential Growth - RAND, Jan 28, 2025.
[96]
[97] Securing America's Critical Minerals Supply, Oct 8, 2025.
[98] Surveying the Future of U.S. Open Foundation Model Policy - CSIS, Mar 21, 2024.
[99] Open-Source AI is a National Security Imperative - Third Way, Jan 30, 2025.
[100] Open Release of Grok-1 - xAI, Mar 17, 2024.
[101] Defense Priorities in the Open-Source AI Debate - CSIS, Aug 19, 2024.
[102] The Murky State of Frontier AI Transparency, Jan 16, 2025.
[103] With Open Source Artificial Intelligence, Don't Forget the Lessons of ..., Jul 29, 2024.
[104] Can machine translation match human expertise? Quantifying the ..., Jul 25, 2025.
[105] Sora: Creating video from text - OpenAI, Feb 15, 2025.
[106] Sora 2 is here - OpenAI, Sep 30, 2025.
[107] Foundation Models in Robotics: Applications, Challenges, and the ...
[108] Foundation Model Driven Robotics: A Comprehensive Review - arXiv, Jul 14, 2025.
[109] Large language models surpass human experts in predicting ..., Nov 27, 2024.
[110] [PDF] State of Foundation Models - 2025 (Innovation Endeavors).
[111] The Impact of AI on Developer Productivity - arXiv:2302.06590, Feb 13, 2023.
[112] Research: Quantifying GitHub Copilot's impact in the enterprise with ..., May 13, 2024.
[113] Method of the Year 2021: Protein structure prediction - Nature, Jan 11, 2022.
[114] AlphaFold Protein Structure Database.
[115] Quantifying GitHub Copilot's impact on developer productivity and ..., Sep 7, 2022.
[116] Why language models hallucinate - OpenAI, Sep 5, 2025.
[117] Detecting hallucinations in large language models using semantic ..., Jun 19, 2024.
[118] [PDF] Understanding Inductive Bias in the Era of Large-Scale Pretraining ...
[119] Edge-First Language Model Inference: Models, Metrics, and Tradeoffs, May 22, 2025.
[120] A survey of edge efficient LLMs and techniques - ScienceDirect.
[121] The Sequence Opinion #485: What's Wrong With AI Benchmarks, Feb 6, 2025.
[122] MMLU-Pro: A More Robust and Challenging Multi-Task Language ..., Jun 3, 2024.
[123] MMLU-Pro: A More Robust and Challenging Multi-Task Language ...
[124]
[125] The big AI story right now: Pure scaling has failed to produce AGI, Feb 19, 2025.
[126]
[127] Martin Henz - ARCprize: How LLMs fail on ARC benchmark - LinkedIn, Jul 3, 2024.
[128] Emergent Abilities in Large Language Models: A Survey - arXiv, Feb 28, 2025.
[129] ARC-AGI-1: Abstract Reasoning Benchmark - Emergent Mind, Sep 16, 2025.
[130] Open Problems and Fundamental Limitations of Reinforcement ..., Jul 27, 2023.
[131] [PDF] Open Problems and Fundamental Limitations of Reinforcement ...
[132] [PDF] Weak-to-Strong Jailbreaking on Large Language Models - arXiv.
[133] Semantic Jailbreaks and RLHF Limitations in LLMs, Aug 2, 2025.
[134] Human performance in detecting deepfakes: A systematic review ...
-
[135]
[PDF] On the Societal Impact of Open Foundation Models - arXivFeb 27, 2024 · Open foundation models have benefits like innovation and distributed power, but also risks such as misuse, biosecurity, and cybersecurity ...
-
[136]
[PDF] Open-Sourcing Highly Capable Foundation Models - arXivSep 29, 2023 · Open-sourcing AI models offers benefits like oversight and progress, but also risks such as misuse and potential for dangerous AI diffusion.
-
[137]
AI deception: A survey of examples, risks, and potential solutionsThis paper argues that a range of current AI systems have learned how to deceive humans. We define deception as the systematic inducement of false beliefs.
-
[138]
A Closer Look at the Existing Risks of Generative AI - arXivMay 28, 2025 · Through a systematic analysis of 499 publicly reported incidents, we describe what harms are reported, how they arose, and who they impact. We ...Missing: rates | Show results with:rates
-
[139]
Ten Ways the Precautionary Principle Undermines Progress in ...Feb 4, 2019 · If policymakers apply the “precautionary principle” to AI, which says it's better to be safe than sorry, they will limit innovation and discourage adoption.
-
[140]
The Precautionary Principle, Safety Regulation, and AI: This Time, It ... Sep 4, 2024 · The PP has long been important in managing risks associated with technological innovations that have no explicit scientific knowledge.
-
[141]
What drives the divide in transatlantic AI strategy? - Atlantic Council. Sep 29, 2025 · The US and EU share AI ambitions but diverge on regulation, risking a fractured Western front. Nowhere is this tension sharper than in ...
-
[142]
How Europe's AI Act could affect innovation and competitiveness. Jul 4, 2024 · We caught up with ESCP's Philip Meissner to assess the impact of the EU AI Act on the broader political and economic landscape.
-
[143]
High-level summary of the AI Act | EU Artificial Intelligence Act. On 18 July 2025, the European Commission published draft Guidelines clarifying key provisions of the EU AI Act applicable to General Purpose AI (GPAI) models.
-
[144]
European Commission publishes guidelines on obligations for ... Jul 24, 2025 · The Guidelines clarify that providers placing GPAI models on the market before August 2, 2025 have until August 2, 2027 to comply with their ...
-
[145]
EU AI Act Criticized for Granting US Tech Firms Excessive Influence. Jul 6, 2025 · Strategic uncertainty from the AI Act's vague implementation affects EU research, potentially diverting talent and investment overseas.
-
[146]
Safe, Secure, and Trustworthy Development and Use of Artificial ... Nov 1, 2023 · It is the policy of my Administration to advance and govern the development and use of AI in accordance with eight guiding principles and priorities.
-
[147]
Executive Order on the Safe, Secure, and Trustworthy Development ... Oct 30, 2023 · It is the policy of my Administration to advance and govern the development and use of AI in accordance with eight guiding principles and priorities.
-
[148]
Dual-Use Foundation Models with Widely Available Model Weights ... Jul 30, 2024 · In October 2023 President Biden signed the Executive Order (EO) on “Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence ...
-
[149]
[PDF] America's AI Action Plan - The White House. Jul 10, 2025 · President Trump has already taken multiple steps toward this goal, including rescinding Biden Executive Order 14110 on AI that foreshadowed an ...
-
[150]
AI Doomers Versus AI Accelerationists Locked In Battle For Future ... Feb 18, 2025 · AI is advancing rapidly. AI doomers say we must stop and think. AI accelerationists say full speed ahead. Here is a head-to-head comparison.
-
[151]
Effective accelerationism, doomers, decels, and how to flaunt your AI ... Nov 20, 2023 · For the AI doomers, the decels are akin to attempts to build a third American political party; ineffective at best and insidious at worst.
-
[152]
Two warring visions of AI - Prospect Magazine. Jan 16, 2024 · The power struggle between “doomers” and “accelerationists” will define the way this world-changing technology evolves.
-
[153]
The 2025 AI Index Report | Stanford HAI. In 2024, U.S.-based institutions produced 40 notable AI models, significantly outpacing China's 15 and Europe's three. While the U.S. maintains its lead in ...
-
[154]
The US Is Winning the AI Race - But for How Long? - Project Syndicate. Sep 24, 2025 · On January 20, DeepSeek unveiled its hyper-efficient R1 model, making it clear that US sanctions would not hold back China's AI ambitions.
-
[155]
How will AI influence US-China relations in the next 5 years? Jun 18, 2025 · Already, the performance gap between the best Chinese and U.S. AI models had shrunk from 9.3% in 2024 to 1.7% in February. This will be the new ...
-
[156]
China now leads the U.S. in this key part of the AI race. Oct 13, 2025 · China's open AI models are now more powerful and popular than those released by American rivals, a shift with implications for the future of ...
-
[157]
Unleash developer productivity with generative AI - McKinsey. Jun 27, 2023 · A McKinsey study shows that software developers can complete coding tasks up to twice as fast with generative AI. Four actions can maximize productivity and ...
-
[158]
Measuring the Impact of Early-2025 AI on Experienced ... - METR. Jul 10, 2025 · We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers.
-
[159]
The reality of AI-Assisted software engineering productivity. Aug 16, 2025 · Voege concedes that “AI helps with boilerplate” and routine coding, estimating perhaps a 20–50% speed-up on certain sub-tasks for many engineers ...
-
[160]
AI Adoption Statistics in 2025 - Netguru. Sep 4, 2025 · Generative AI usage specifically has jumped from 33% in 2023 to 71% in 2024, showing how quickly businesses have gained confidence in these ...
-
[161]
Enterprise transformation and extreme productivity with AI | IBM. Aug 26, 2025 · How AI and automation enabled USD 4.5 billion in productivity gains across IBM.
-
[162]
AI in Action 2024 Report - IBM. The AI in Action report brings real-world data to the AI conversation to explore a simple question: during this critical period, what lessons can be learned ...
-
[163]
Foundation Models Explained: Why Every Startup and Enterprise ... Jun 13, 2025 · This scalability makes foundation models ideal for platformization: one model, many services. Also covers emergent abilities and compositional reasoning.
-
[164]
(PDF) Foundation Models and AI Innovation: Evidence from the ... Oct 17, 2025 · Less experienced developers engage in broader exploration across various downstream applications, benefiting from lower entry barriers.
-
[165]
5 Top Benefits of Foundation Models - Deepchecks. Jul 22, 2025 · This article will explore the top five benefits of foundation models that enable developers to harness their powerful features across a variety of applications.
-
[166]
AI and Labor Markets: What We Know and Don't Know. Oct 14, 2025 · Hosseini and Lichtinger (2025) find employment declines for young workers in US firms that adopt AI technologies, with adoption measured by the ...
-
[167]
[PDF] CHAPTER 4: Economy - Stanford HAI. Oct 17, 2024 · In 2024, 15.7% of all AI job postings in the United States were for jobs based in California, followed by Texas (8.8%) and New York (5.8%).
-
[168]
How artificial intelligence impacts the US labor market | MIT Sloan. Oct 9, 2025 · What you'll learn: AI adoption leads to increased company growth in revenue, profits, employment, and profitability.
-
[169]
Artificial intelligence, job quality and inclusiveness - OECD. Jul 11, 2023 · The available evidence for workers with AI skills in the United States suggests that they receive a significant wage premium (Alekseeva et al.).
-
[170]
[PDF] AI Adoption and Wage Growth in U.S. Industries: A Sectoral Analysis. Aug 27, 2025 · AI raises wages in tech/finance sectors where it complements humans, but keeps wages down in labor-intensive sectors where it replaces them.
-
[171]
Artificial intelligence and the skill premium: A numerical analysis of ... The results show that AI widens the skill premium by substituting low-skilled labor with industrial robots and performing high-skilled labor tasks.
-
[172]
History Repeats: The Longstanding Fear of Technology Replacing ... Jul 1, 2024 · With the unemployment rate around 20%, many feared that advancements in technology were eliminating jobs faster than they could be created.
-
[173]
The impact of AI on the labour market: is this time different? - OECD.AI. Despite fears, the technological progress experienced in recent decades has not led to mass unemployment. In fact, employment in OECD countries has risen.
-
[174]
Evaluating the Impact of AI on the Labor Market - Yale Budget Lab. Oct 1, 2025 · Overall, our metrics indicate that the broader labor market has not experienced a discernible disruption since ChatGPT's release 33 months ago ...
-
[175]
The fear of technology-driven unemployment and its empirical base. Jun 10, 2022 · This column suggests that the empirical support for the labour-creating effects of technological change dominates that for labour-replacement.
-
[176]
The Fed - The State of AI Competition in Advanced Economies. Oct 6, 2025 · Cites “China's AI Models Are Closing the Gap – but America's Real Advantage Lies Elsewhere,” RAND Corporation, May 2, 2025.
-
[177]
[PDF] Winning the Defining Contest: The US-China Artificial Intelligence ... Jul 7, 2025 · Narratives of US artificial intelligence (AI) leadership collapsed in January 2025 after the announcement of DeepSeek's innovative large ...
-
[178]
Meta changes course on open-source AI as China pushes ahead ... Aug 3, 2025 · In fact, China has probably found the path to “surpass the US in AI” thanks to the momentum in the country's vibrant open-source AI ecosystem ...
-
[179]
China's drive toward self-reliance in artificial intelligence: from chips ... Jul 22, 2025 · Ample private funding and access to global open-source models have allowed for rapid Chinese progress. Although China lagged in the first years ...
-
[180]
Export Controls on Open-Source Models Will Not Win the AI Race. Feb 25, 2025 · One emphasizes geopolitical risk and global power dynamics, with a focus on Chinese misuse of U.S. open-source AI. The other is rooted in ...
-
[181]
AI principles - OECD. The OECD AI Principles promote use of AI that is innovative and trustworthy and that respects human rights and democratic values. Adopted in May 2019, they set ...
-
[182]
AI governance through global red lines can help prevent ... Sep 22, 2025 · Global AI red lines are one way to set enforceable limits on dangerous AI uses and behaviours to prevent unacceptable risks and build trust.
-
[183]
AI Act | Shaping Europe's digital future - European Union. The AI Act (Regulation (EU) 2024/1689) is the first-ever legal framework on AI, which addresses the risks of AI and positions Europe to play a leading role globally.
-
[184]
The Annual AI Governance Report 2025: Steering the Future of AI. Bilateral Agreements for AI Safety and Innovation: Bilateral agreements are quickly becoming the responsive layer of AI safety governance. In April 2024 ...
-
[185]
With AI, we need both competition and safety - Brookings Institution. Jul 8, 2024 · AI regulation must promote safety and protect competition through industry-government cooperation and enforceable standards.