Fact-checked by Grok 2 weeks ago

Large language model

A large language model (LLM) is a type of artificial neural network, typically based on the transformer architecture, pretrained on massive datasets of text to model language probabilities through next-token prediction, allowing it to generate coherent responses, translate languages, summarize content, and execute other natural language processing tasks.^[1] These models, often containing billions to trillions of parameters, rely on self-supervised learning objectives such as masked language modeling or causal language modeling to capture statistical patterns in data without explicit task supervision during initial training.^[2] Unlike earlier recurrent neural networks, transformers process entire sequences in parallel using attention mechanisms, which weigh the relevance of different input tokens to each other, enabling efficient scaling to unprecedented model sizes.^[3] The development of LLMs has been driven by empirical scaling laws, which demonstrate that performance on diverse benchmarks improves predictably with increases in model parameters, training data volume, and computational resources, often following power-law relationships.^[4] Pioneering models like GPT-3 showcased few-shot learning capabilities, where minimal examples suffice for adaptation to new tasks, highlighting the versatility arising from broad pretraining.^[2] However, this scaling has revealed emergent abilities—sudden improvements in capabilities like arithmetic reasoning or multi-step problem-solving that appear only above certain size thresholds—attributed not to fundamental comprehension but to the models' ability to interpolate complex patterns from training distributions.^[5] Despite achievements in surpassing human performance on specific standardized tests, LLMs exhibit persistent limitations including factual inaccuracies (hallucinations), sensitivity to prompt phrasing, and reproduction of biases encoded in training corpora, which stem from their reliance on correlative rather than causal understanding of language.^[1] Training such models demands immense computational power, with costs escalating exponentially; for instance, frontier models require clusters of thousands of GPUs and petabytes of data, raising concerns over energy consumption and accessibility dominated by a few large organizations.^[2] Fine-tuning and reinforcement learning from human feedback have mitigated some issues, enhancing alignment with user intentions, yet underlying architectural brittleness persists, underscoring that LLMs simulate intelligence through probabilistic approximation rather than genuine reasoning.^[1]

Fundamentals

Definition and Scope

A large language model (LLM) is a deep neural network architecture trained using self-supervised learning on extensive text corpora to approximate the conditional probability distribution of subsequent tokens given preceding ones in a sequence, fundamentally operating as an autoregressive probabilistic predictor.^[6] This core mechanism relies on maximizing the likelihood of next-token predictions across vast datasets, typically comprising internet-scale volumes of unstructured text, without explicit task supervision during pre-training.^[7] LLMs are distinguished by their immense scale, conventionally defined as encompassing billions or more trainable parameters, which enables the encoding of high-dimensional statistical regularities in language.^[8] For example, OpenAI's GPT-3, introduced in 2020, utilizes 175 billion parameters to perform such predictions.^[9] The scope of LLMs is narrowly delimited to autoregressive transformer-based models optimized for sequential text generation and comprehension, excluding smaller-scale language models, rule-based systems, or AI frameworks primarily trained on non-textual modalities such as images or structured data.^[10] This focus arises from empirical observations that emergent generalization—manifesting as zero-shot or few-shot performance on diverse tasks—stems from unsupervised scaling of compute and data in the next-token prediction paradigm, rather than domain-specific fine-tuning or hybrid architectures.^[11] Models below the billion-parameter threshold generally fail to exhibit these properties at comparable levels, underscoring the causal role of parameter count and training compute in achieving human-like linguistic proficiency.^[12] Probabilistic outputs from LLMs take the form of softmax distributions over vocabulary tokens, reflecting uncertainty in predictions derived from learned embeddings and attention mechanisms, which collectively parameterize the model's internal representation of syntactic, semantic, and contextual dependencies.^[13] This statistical foundation prioritizes predictive accuracy on held-out sequences as the primary optimization objective, with downstream applications emerging as corollaries of the pre-trained model's latent capabilities rather than inherent design intents.^[14]

Key Characteristics

Large language models feature high-dimensional parameter spaces, often comprising billions to trillions of parameters, which enable the encoding of complex statistical dependencies from extensive training data.^[15] This architectural scale underpins emergent behaviors, including in-context learning, wherein models perform few-shot adaptation to unseen tasks by conditioning predictions on a handful of input examples without altering internal weights. Performance in LLMs follows empirical scaling laws, with cross-entropy loss decreasing as a power-law function of model parameters N, dataset size D, and compute C: approximately L(N, D) \approx \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_0, where empirical fits yield \alpha \approx 0.34, \beta \approx 0.28, A \approx 406.4, B \approx 410.7, and L_0 \approx 1.69.^[17] These laws, validated across model scales from millions to hundreds of billions of parameters, predict that loss reduction—and consequent capability gains—arises predictably from resource scaling, guiding efficient allocation in development.^[17] LLMs predominantly utilize transformer architectures, which employ self-attention for parallelizable sequence processing, facilitating context windows extending to millions of tokens in advanced 2025 deployments, such as the 1 million token capacity of Gemini 2.5 Pro.^[18] This design supports handling prolonged inputs while balancing computational demands through mechanisms like multi-head attention. Fundamentally, LLMs operate via autoregressive next-token prediction, sampling outputs from learned conditional distributions p(x_t | x_{<t}) that compress training data patterns into probabilistic approximations of language, rather than executing intentional reasoning or causal inference.^[19] This stochastic mechanism excels at replicating surface-level fluency but falters on tasks demanding extrapolation beyond distributional statistics, highlighting their role as sophisticated statistical engines over agents of understanding.^[19]

Distinction from Traditional Models

Traditional language models, including n-gram models and early recurrent neural networks (RNNs) like long short-term memory (LSTM) variants, primarily depended on statistical frequency counts or sequential processing of limited contexts, often augmented by hand-engineered linguistic features such as stemming, part-of-speech tags, or syntactic parses to handle specific tasks.^[20]^[21] These approaches were constrained by sparse data representations and computational limits, leading to brittle generalization beyond narrow domains and high perplexity scores—typically 150–200 for n-grams and around 120 for RNNs on benchmarks like WikiText.^[22] Large language models (LLMs), built on transformer architectures, diverge fundamentally by enabling fully end-to-end differentiable training across the entire system, from token embeddings to output predictions, without reliance on predefined features or modular pipelines.^[1] This allows LLMs to autonomously derive hierarchical representations from vast, unprocessed corpora via gradient descent, shifting causality from explicit rule imposition to implicit pattern extraction at scale. Pre-transformer neural models rarely exceeded hundreds of millions of parameters due to training inefficiencies and hardware constraints, whereas LLMs routinely scale beyond 100 billion—exemplified by GPT-3's 175 billion parameters—yielding qualitative improvements in long-range coherence and task adaptability.^[23]^[24] Empirically, this scaling manifests in orders-of-magnitude reductions in perplexity, with LLMs achieving scores around 20.5 on held-out data where traditional models falter, reflecting enhanced predictive utility rather than semantic "understanding."^[22] Such gains arise not from architectural novelty alone but from the interplay of parameter count, data volume, and compute, enabling emergent behaviors like few-shot learning that elude smaller, feature-dependent systems.^[1] Mainstream academic surveys, while comprehensive, may underemphasize these scaling laws' primacy due to institutional preferences for interpretable, smaller models, yet raw benchmark data consistently validates the empirical edge of massive, data-driven predictors.^[1]

Historical Development

Pre-Transformer Foundations

The foundations of neural language modeling prior to the transformer architecture were laid in the early 2000s with feedforward neural networks applied to probabilistic next-word prediction. In 2003, Bengio et al. introduced a neural probabilistic language model that used a multilayer perceptron to estimate word probabilities conditioned on preceding words, incorporating distributed representations to capture semantic similarities more effectively than traditional n-gram models; experiments on small corpora demonstrated perplexity reductions of up to 20-30% relative to smoothed n-grams, though computational costs limited vocabulary sizes to around 10,000-20,000 words.^[25] This approach established the viability of neural networks for language modeling but struggled with scalability due to the curse of dimensionality in sparse word representations. Advancements in word embeddings further enabled denser, continuous representations of vocabulary items. Mikolov et al.'s Word2Vec framework, published in 2013, employed skip-gram and continuous bag-of-words architectures to train unsupervised vector embeddings from large unannotated corpora, achieving vectors of 100-300 dimensions that encoded syntactic and semantic relations—such as vector arithmetic approximating analogies (e.g., "king" - "man" + "woman" ≈ "queen").^[26] These embeddings, trained efficiently via negative sampling on billions of words, improved downstream performance in tasks like part-of-speech tagging and named entity recognition by 5-10% over one-hot encodings, but they remained static and context-independent, requiring integration with sequential models for full language modeling.^[27] Recurrent neural networks (RNNs), particularly long short-term memory (LSTM) variants, addressed sequential dependencies in language data. Introduced by Hochreiter and Schmidhuber in 1997, LSTMs mitigated the vanishing gradient problem in vanilla RNNs—where gradients diminish exponentially during backpropagation through time, impeding learning of dependencies beyond 5-10 timesteps—via gating mechanisms that regulate information flow. LSTM-based language models, often with 1-2 layers and hidden sizes of 200-650, achieved modest empirical results on benchmarks like the Penn Treebank dataset (approximately 1 million training tokens): for instance, Zaremba et al. reported a test perplexity of 82.7 in 2014 using recurrent dropout regularization, while variants like pointer sentinel LSTMs reached around 70-75 by 2016. These models powered early successes in neural machine translation, such as the sequence-to-sequence framework of Sutskever et al. in 2014, which used LSTM encoders and decoders to attain BLEU scores competitive with phrase-based systems on WMT English-French data.^[28] However, fundamental limitations constrained pre-transformer neural language models to small scales. Even with LSTMs, residual vanishing gradients persisted for long-range dependencies exceeding 20-50 tokens, as evidenced by analyses showing perplexity degradation when context windows were extended. Sequential processing in RNNs precluded parallelization across time steps during training, resulting in O(T) time complexity per sequence of length T and restricting practical model sizes to millions of parameters on hardware of the era, far below the billions trainable post-2017. These bottlenecks established empirical baselines—perplexities rarely below 60 on held-out data—highlighting the need for architectures enabling efficient scaling and parallel computation.

Transformer Introduction and Early Scaling (2017-2020)

The Transformer architecture was introduced in the paper "Attention Is All You Need" by Ashish Vaswani and colleagues at Google Brain and other institutions, published on arXiv on June 12, 2017.^[3] This model dispensed with recurrent and convolutional layers in favor of stacked self-attention and point-wise feed-forward networks, enabling full parallelization of sequence processing during training. The self-attention mechanism computes dependencies between all positions in a sequence simultaneously, weighted by relevance, which addressed the sequential bottlenecks of prior recurrent models and yielded an approximately eightfold reduction in training time for translation tasks compared to optimized recurrent baselines. Initial experiments on the WMT 2014 English-to-German dataset, using 4.5 million sentence pairs for training, demonstrated the base Transformer's efficacy, achieving a BLEU score of 27.3 after training for 12 hours on eight GPUs, outperforming previous architectures like the GNMT convolutional model.^[3] Building on this foundation, early adaptations validated the scalability of Transformer-based models for natural language tasks. In October 2018, Jacob Devlin and colleagues at Google released BERT (Bidirectional Encoder Representations from Transformers), which applied masked language modeling for pre-training on 3.3 billion words from BooksCorpus and English Wikipedia.^[29] The BERT-large model, with 24 layers, 16 attention heads per layer, and 340 million parameters, set new state-of-the-art results on eleven NLP tasks, including GLUE (average score of 80.5) and SQuAD (F1 score of 93.2), by leveraging bidirectional context unavailable in unidirectional models like GPT precursors. This bidirectional scaling experiment underscored attention's ability to capture nuanced linguistic representations when model depth and width were increased. OpenAI advanced unidirectional generative scaling with GPT-2 in February 2019, detailed in "Language Models are Unsupervised Multitask Learners." The largest variant featured 1.5 billion parameters across 48 layers and was pre-trained on 40 gigabytes of WebText data filtered from outbound Reddit links.^[30] It exhibited zero-shot transfer, surpassing fine-tuned baselines on seven of eight tested language modeling datasets without task-specific adaptation, thus empirically confirming that larger Transformer decoders could internalize broad capabilities from raw text corpora. These models' successes, from 65 million parameters in the original Transformer base to GPT-2's scale, established attention mechanisms as a viable path for efficient compute utilization in language modeling. GPT-3, unveiled by Tom B. Brown and OpenAI researchers in May 2020 via arXiv preprint, represented an empirical milestone in scaling, with 175 billion parameters trained on approximately 570 gigabytes of filtered Common Crawl data plus other sources.^[24] The model demonstrated few-shot learning, where performance on tasks like arithmetic reasoning and question answering improved predictably with additional in-context examples, hinting at emergent abilities from sheer size. Its API beta launch on June 11, 2020, facilitated external validation, showing that downstream accuracy scaled logarithmically with parameter count across held-out benchmarks, without proportional increases in fine-tuning needs. This period's experiments collectively affirmed the Transformer's parallelizable design as foundational for subsequent parameter and data expansions.^[24]

Explosive Growth and GPT Era (2021-2023)

In 2021, OpenAI expanded access to GPT-3 via its API, enabling over 300 applications to integrate capabilities such as search, conversation, and text completion, marking a shift toward broader commercial deployment of large-scale models.^[31] This followed the model's initial beta launch in June 2020, with the removal of a waiting list in November 2021 allowing unrestricted use for developers.^[32] Empirical scaling laws, as outlined in Kaplan et al.'s January 2020 analysis, provided theoretical justification for this expansion, demonstrating that language model performance on cross-entropy loss follows power-law relationships with increases in model size, dataset size, and compute, encouraging investments in larger architectures.^[17] The November 30, 2022, release of ChatGPT, built on a fine-tuned version of GPT-3.5, catalyzed widespread public adoption, reaching 100 million monthly active users by January 2023—the fastest growth for any consumer application at the time.^[33] This surge fueled hype around large language models, with global private investment in generative AI rising to approximately $25 billion in 2023, predominantly directed toward closed-source systems dominated by firms like OpenAI.^[34] OpenAI's approach emphasized centralized control, limiting transparency on training data and processes, which drew criticism for opacity and potential risks in unverifiable outputs.^[35] GPT-4's release on March 14, 2023, extended this era of proprietary scaling, incorporating multimodal capabilities while maintaining closed access, further entrenching reliance on high-compute infrastructure with inference costs prohibitive for widespread non-commercial use.^[36] Early outputs from these models revealed biases, including tendencies toward toxic content and misinformation, attributable to training data imbalances rather than deliberate design, prompting concerns over unmitigated societal impacts.^[37] Training such systems also incurred substantial environmental costs, with GPT-3's development emitting over 550 tons of CO2 equivalent.^[38] Amid this centralization, nascent open-source efforts emerged as alternatives, including EleutherAI's GPT-J in 2021, a 6-billion-parameter model trained on public data, and the BigScience workshop's BLOOM in July 2022, a 176-billion-parameter multilingual model released under permissive licensing to foster transparency and accessibility.^[39] These initiatives, though smaller in scale, highlighted growing interest in democratizing access despite resource constraints compared to proprietary giants.

Open Competition and Maturation (2024-2025)

In 2024 and 2025, the development of large language models shifted toward greater open competition, with multiple organizations releasing advanced architectures that narrowed performance disparities between proprietary and open-weight systems. This period marked a maturation phase, characterized by accelerated iteration through open-source contributions, which democratized access to high-capability models and diminished risks of monopolistic control by a few dominant firms. Open-weight releases, such as those from Meta and Mistral AI, enabled widespread fine-tuning and deployment, fostering innovation across industries while proprietary models from Anthropic, Google, and xAI continued to set benchmarks in specialized domains.^[40]^[41] Key releases included Meta's Llama 4 family on April 5, 2025, featuring natively multimodal variants like Llama 4 Scout and Maverick capable of processing text and images, building on Llama 3's 2024 advancements in parameter scale and efficiency.^[42]^[43] Mistral AI advanced mixture-of-experts (MoE) architectures, with models like Mixtral leveraging sparse activation of specialized sub-networks to achieve high performance at lower inference costs compared to dense transformers.^[44] xAI's Grok series progressed with Grok-3 in February 2025 and Grok-4 in July 2025, prioritizing maximal truth-seeking through reduced alignment biases and emphasis on empirical reasoning over censored outputs.^[45]^[46] Anthropic's Claude Sonnet 4.5, released September 29, 2025, demonstrated dominance in coding tasks, outperforming prior models in agentic workflows and code generation benchmarks.^[47]^[48] Google's Gemini 2.5 Pro, launched March 25, 2025, integrated enhanced multimodal reasoning, supporting complex tasks involving text, images, and adaptive thinking.^[18] Open-weight models increasingly closed the gap with closed-source counterparts, as evidenced by leaderboards like LMSYS Chatbot Arena, where systems such as DeepSeek-R1 (January 2025) and Alibaba's Qwen 2.5 Max achieved scores rivaling or exceeding GPT-4 variants in reasoning and math benchmarks.^[49]^[40] The 2025 AI Index reported the performance differential between open- and closed-weight models shrinking to 1.7% on key metrics, driven by efficient scaling in open ecosystems.^[41] Models like Qwen 2.5 Max scored 89.4 on Arena-Hard, surpassing GPT-4 in problem-solving, while DeepSeek distillations outperformed GPT-4o in specific math evaluations like AIME 2024.^[50] This parity stemmed from community-driven optimizations, including MoE implementations that activated only subsets of parameters per token, enabling larger effective scales without proportional compute demands.^[51] The proliferation of open-source models reduced dependency on proprietary APIs, spurring faster global iteration and customization for domain-specific applications.^[52] By mid-2025, this competition had lowered barriers to entry, with open initiatives mitigating earlier concerns over centralized control while sustaining pressure on closed developers to innovate. Multimodal extensions, as in Llama 4 and Gemini 2.5 Pro, further matured capabilities, integrating vision and text processing to approach unified intelligence paradigms.^[42]^[18] Overall, these developments underscored a resilient ecosystem, where empirical benchmarks rather than vendor claims guided progress.^[53]

Data Preparation

Primary Dataset Sources

The primary datasets for training large language models consist predominantly of uncurated web text scraped from the internet, supplemented by digitized books and academic sources, yielding corpora on the order of trillions of tokens.^[54] The Common Crawl, maintained by a nonprofit organization since 2007, serves as the foundational source, archiving monthly snapshots of billions of web pages totaling petabytes of raw data, with subsets filtered for LLM pre-training.^[55] For instance, OpenAI's GPT-3 derived approximately 82% of its 300 billion training tokens from Common Crawl data, combined with smaller contributions from book corpora and Wikipedia. More recent efforts, such as NVIDIA's Nemotron-CC, process Common Crawl into 6.3 trillion deduplicated English tokens, while Together AI's RedPajama-Data-v2 extracts 30 trillion filtered tokens from 84 Common Crawl dumps.^[56]^[57] The Colossal Clean Crawled Corpus (C4), derived from the April 2019 Common Crawl snapshot, exemplifies a processed subset, originally encompassing 1.4 trillion tokens after heuristic cleaning to remove duplicates, boilerplate, and low-quality content, primarily for models like Google's T5.^[58] BooksCorpus, comprising 985 million words from around 11,000 unpublished ebooks scraped from platforms like Smashwords, has been a recurrent source for narrative coherence in models including early GPT variants, though analyses reveal inconsistencies such as incomplete texts and genre imbalances favoring amateur fiction.^[59] Estimates for GPT-4 suggest training on 13 to 16 trillion tokens, largely from expanded web crawls reflecting these sources' scale.^[60]^[61] These corpora, while vast and diverse, inherit the internet's compositional skews, as Common Crawl samples only accessible public content without curation, leading to overrepresentation of English-language material from dominant online ecosystems.^[62] Web crawler restrictions exacerbate potential biases: neutral outlets impose blocks at rates up to 58%, compared to lower rates for hyperpartisan sites, skewing datasets toward polarized or lower-quality content that evades robots.txt protocols.^[63] This structure mirrors broader internet dynamics, where content production is disproportionately influenced by media and academic institutions exhibiting systemic left-wing ideological tilts, as evidenced by content analyses of news and scholarly outputs, thereby embedding non-neutral priors in unfiltered training data.^[63]^[64]

Preprocessing and Tokenization

Large language models require preprocessing of raw text into discrete tokens to enable efficient numerical representation and model input. Tokenization addresses the vocabulary explosion inherent in word-level approaches, where the number of unique words in corpora can exceed millions, by employing subword units that decompose text into frequent character sequences.^[65] This subword strategy balances vocabulary size against out-of-vocabulary (OOV) occurrences, allowing models to handle unseen words by breaking them into known subcomponents.^[66] Byte-pair encoding (BPE), a prevalent subword method, iteratively merges the most frequent adjacent byte or character pairs from a training corpus to build a compact vocabulary. In GPT-2, for instance, BPE yields a vocabulary of 50,257 tokens, comprising 256 base byte tokens, a special end-of-text marker, and merges derived from the corpus.^[65] SentencePiece, an unsupervised tokenizer often implementing BPE or unigram models, extends this to language-independent processing by treating text as raw Unicode without explicit word boundaries, facilitating multilingual applications.^[67] These methods achieve empirical compression for common phrases—reducing token counts and thus context length—but introduce trade-offs: larger vocabularies enhance coverage at the cost of increased embedding parameters and logit computation, while smaller ones risk fragmentation of rare terms into longer sequences, elevating effective sequence lengths and training demands.^[68]^[66] Subword tokenization mitigates OOV issues compared to whole-word methods but incurs inefficiencies for rare or morphologically complex words, which decompose into multiple subwords, amplifying token overhead and potential misalignment with semantic units.^[69] Byte-level BPE variants address Unicode handling biases by operating on 256-byte bases, ensuring no inherent OOV for any character sequence, though training corpora skewed toward English can still propagate imbalances in subword frequencies for low-resource languages.^[70] Recent research explores adaptive tokenizers that dynamically adjust vocabulary construction to corpus-specific distributions, aiming to optimize multilingual efficiency by reducing token bloat in diverse scripts.^[71]

Quality Control and Cleaning

Deduplication techniques, such as MinHash-based locality-sensitive hashing (LSH), are employed to identify and remove exact or near-duplicate documents from large-scale training corpora, addressing redundancy that can inflate memorization risks and computational costs.^[72] These methods approximate Jaccard similarity between document sets via probabilistic hashing, enabling efficient processing at trillion-token scales without exhaustive pairwise comparisons.^[73] Filtering pipelines further excise personally identifiable information (PII) using rule-based extractors and classifiers, alongside toxicity detection via fine-tuned models or APIs that flag harmful content based on severity thresholds.^[74] For instance, OpenAI deploys in-house toxicity classifiers during data curation to exclude outputs exceeding predefined harm metrics, preventing ingestion of profane or violent material.^[74] Empirical studies demonstrate that rigorous deduplication enhances model performance by mitigating overfitting to repeated sequences, yielding perplexity reductions of up to 10% on held-out evaluations while curbing exact memorization rates.^[75] In controlled pre-training experiments, applying document-level deduplication followed by embedding-based diversification—selecting diverse subsets from de-duplicated pools—accelerated convergence by 20% in compute efficiency, with consistent perplexity gains across validation domains despite varying improvements by data type.^[72] These interventions also bolster generalization, as models trained on filtered corpora exhibit lower regurgitation of training artifacts and superior downstream adaptation, evidenced by reduced sensitivity to spurious correlations in benchmark tasks.^[75] However, aggressive filtering raises concerns of dataset homogenization, where broad excision of edge-case content—such as minority viewpoints or unconventional phrasing—may entrench biases from overrepresented sources, diminishing the model's exposure to linguistic or topical diversity. Researchers note that such pruning, while reducing noise, can inadvertently amplify uniformity in training distributions, potentially constraining creative output variance and reinforcing echo chambers inherent to dominant web corpora. Balancing these trade-offs requires empirical validation per corpus, as over-reliance on heuristic filters without diversity audits risks suboptimal robustness to real-world input variability.^[76]

Role of Synthetic Data

Synthetic data in large language model training refers to artificially generated text outputs produced by existing models to supplement limited high-quality human-curated datasets, particularly for tasks requiring specialized reasoning or instruction-following capabilities.^[77] This approach addresses data scarcity by leveraging stronger proprietary models, such as GPT-4, to create diverse examples through methods like self-distillation, where a teacher model generates step-by-step explanations or responses to prompts, which are then used to train smaller student models.^[77] For instance, Microsoft's Orca model, released in June 2023, employed synthetic data derived from GPT-4's explanation traces on complex tasks, enabling a 13-billion-parameter model to achieve performance comparable to much larger models like Llama 2 70B on reasoning benchmarks such as BIG-Bench Hard.^[77] Empirical evidence demonstrates that synthetic data can enhance model capabilities in targeted domains, with Orca variants showing gains of up to 20-30% on instruction-following and commonsense reasoning evaluations relative to baselines trained solely on human data. However, recursive use of synthetic data introduces risks, including error compounding, where biases or inaccuracies from the generating model propagate and amplify in subsequent training cycles, potentially leading to reduced output diversity.^[78] A prominent concern is "model collapse," a degenerative process observed in generative models trained iteratively on their own outputs, resulting in homogenized predictions that lose fidelity to real-world distributions, as demonstrated in experiments where language models trained on recursively generated data exhibited irreversible quality degradation after several generations.^[78] By 2025, iterative synthetic data loops have emerged as a cost-efficient strategy in open-source LLMs, involving cycles of generation, filtering, and retraining to bootstrap improvements without relying on vast proprietary datasets.^[79] Models like Alibaba's Qwen series incorporate such loops, using self-generated data for progressive refinement in multilingual and coding tasks, achieving competitive benchmarks while minimizing compute expenses compared to pure human-data scaling.^[80] These techniques prioritize quality filtering—such as perplexity-based rejection of low-diversity samples—to mitigate collapse risks, though empirical validation remains ongoing, with studies indicating that uncurated loops can still erode long-tail knowledge representation.^[79]^[81]

Model Architecture

Transformer Core

The Transformer architecture, introduced in 2017, relies on a core structure of stacked encoder and decoder layers, each comprising a multi-head self-attention sub-layer followed by a position-wise feed-forward network, with residual connections and layer normalization applied around each sub-layer.^[3] This design replaces recurrent layers with parallelizable attention mechanisms, enabling efficient processing of input sequences.^[3] To account for the sequential order absent in attention's permutation-invariant nature, positional encodings are added to the input embeddings at the bottom of both encoder and decoder stacks; these are fixed sinusoidal functions of the position index, computed as PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{model}}) and PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{model}}), where pos is the position and i indexes the dimension.^[3] The original model uses 6 layers each for encoder and decoder, with a feed-forward inner dimension of 2048 and model dimension of 512.^[3] In large language models optimized for autoregressive generation, decoder-only variants predominate, as seen in the GPT series starting from GPT-1 in 2018, which omit the encoder and adapt the decoder stack solely for next-token prediction, enhancing efficiency for open-ended text generation tasks.^[24] These models employ causal masking in the self-attention layers, applying a lower-triangular mask to attention weights to ensure that each position attends only to preceding positions, thereby preventing information leakage from future tokens during training and enabling unidirectional sequence modeling.^[3]^[24] This masking enforces the autoregressive property, where the probability of a token depends solely on prior context, aligning with the objective of maximizing likelihood over sequential data.^[24]

Attention and Context Handling

Multi-head attention in transformer-based large language models enables the aggregation of information from different input sequence positions by computing scaled dot-product attention across multiple parallel subspaces. Each head independently projects the query, key, and value matrices derived from token embeddings into lower-dimensional spaces, computes attention weights as the softmax of query-key similarities, and produces outputs as weighted sums of values; these head-specific outputs are concatenated and linearly transformed to match the model's dimension. This mechanism allows the model to capture diverse relational patterns, such as syntactic and semantic dependencies, more effectively than single-head attention.^[3] The computational core of self-attention involves forming an n × n attention matrix for a sequence of length n, leading to quadratic time and space complexity O(n²d), where d is the embedding dimension, as every token attends to all others via pairwise similarity computations. This scaling imposes practical limits on context length, as memory and inference time grow rapidly with n; for instance, doubling n quadruples the attention computation. Theoretical analyses confirm that exact self-attention requires at least quadratic time in the worst case, absent violations of complexity-theoretic assumptions like the Strong Exponential Time Hypothesis.^[82]^[3] To mitigate quadratic costs during autoregressive inference, key-value (KV) caching stores the key and value projections for all prior tokens across attention layers, enabling reuse when generating new tokens; this shifts incremental generation to linear time O(n d) per token after the initial sequence, avoiding redundant recomputation of past states. KV caches, however, consume substantial memory proportional to context length, layers, heads, and precision, often dominating GPU requirements for long sequences and prompting optimizations like quantization or eviction strategies.^[83] Context windows, defined by maximum supported sequence length, have expanded from 2048 tokens in GPT-3 (2020) to over 1 million in models like Gemini 2.5 (2025), facilitated by hardware advances and architectural tweaks such as rotary positional embeddings for extrapolation beyond training lengths. Sparse attention variants address quadratic bottlenecks by restricting attention to local windows, global tokens, or learned patterns, achieving near-linear complexity O(n log n) or O(n) while approximating full attention; examples include sliding window locality in models like Longformer and clustered routing in Routing Transformer.^[24]^[84] Empirically, longer contexts enhance performance on tasks requiring integration of distant information, such as multi-document question answering or code comprehension, with benchmarks showing gains like 10-20% accuracy improvements in retrieval from extended inputs after targeted fine-tuning. However, unoptimized long contexts often exhibit "lost in the middle" effects, where models under-attend to central tokens, and performance can degrade beyond trained lengths without interventions, underscoring the trade-off with elevated compute demands—e.g., a 1M-token inference may require terabytes of KV cache memory.^[85]

Parameter Scaling and Mixture of Experts

Parameter counts in large language models have increased substantially since the late 2010s, from 1.5 billion in GPT-2 (2019) to 175 billion in GPT-3 (2020), enabling improvements in predictive accuracy and emergent capabilities.^[86] This growth aligns with empirical scaling laws, where cross-entropy loss L decreases as a power law with parameter count N, approximately L(N) \propto N^{-\alpha} with \alpha \approx 0.095, as derived from experiments across model sizes up to 50 billion parameters.^[17] Larger parameter counts, when paired with sufficient training data and compute, yield predictable gains in performance metrics like perplexity and downstream task accuracy, validating continued investment in scale despite diminishing returns per additional parameter.^[17] To extend scaling beyond dense architectures' compute limits, mixture-of-experts (MoE) designs activate only a sparse subset of parameters per input token via a routing mechanism, achieving effective capacity equivalent to much larger dense models at lower active compute cost. In MoE layers, a gating network selects the top-k experts (typically k=2 out of 8) for each token, forwarding inputs to those sub-networks while idling others, which reduces inference FLOPs by up to 75% compared to fully dense equivalents with similar total parameters.^[87] For instance, xAI's Grok-1 (released November 2023) employs an 8-expert MoE with 314 billion total parameters but activates roughly 78 billion per forward pass, demonstrating sparse scaling's viability for frontier models.^[88] Similarly, Mistral AI's Mixtral 8x7B (December 2023) totals 46.7 billion parameters with 12.9 billion active, outperforming dense models like Llama 2 70B on benchmarks such as MMLU and HellaSwag while using fewer resources during operation.^[87] MoE architectures preserve scaling law benefits by increasing total parameter count without proportional active parameter growth, as routing enables specialized experts to handle diverse input patterns efficiently; empirical evaluations confirm that MoE models follow similar power-law trends in loss reduction with effective compute as dense counterparts.^[89] Post-training quantization further supports deployment of these scaled models by compressing weights to lower precision, such as 4-bit integers via GPTQ, which approximates optimal quantization per layer using second-order error minimization and incurs negligible accuracy degradation (typically <1-2% on zero-shot tasks).^[90] GPTQ processes weights row-by-row to bound output perturbation, allowing models with hundreds of billions of parameters to run on hardware with limited memory, though it requires calibration data for best results.^[91]

Efficiency and Variant Architectures

Knowledge distillation techniques enable the compression of large language models into smaller variants by training a "student" model to replicate the outputs or internal representations of a larger "teacher" model, thereby reducing inference latency and memory requirements while preserving much of the original performance. Empirical evaluations demonstrate that distilled LLMs can achieve perplexity scores within 1-5% of their teachers on language modeling benchmarks, often with 5-10x fewer parameters and corresponding reductions in throughput time. For example, methods like task-based feature distillation have shown consistent gains in classification and instruction-following tasks, with student models attaining 90-95% of teacher accuracy under constrained compute.^[92] State-space models (SSMs), such as Mamba introduced in December 2023, serve as efficiency-focused alternatives to the quadratic-complexity attention mechanism in Transformers by employing linear-time recurrent state updates for sequence modeling. Mamba architectures demonstrate up to 5x higher inference throughput than equivalently sized Transformers on long-context tasks, attributed to constant memory usage independent of sequence length, without relying on key-value caching. In language modeling evaluations, Mamba-3B outperforms Transformer baselines of similar scale in downstream perplexity while enabling faster hardware utilization, though Transformers retain advantages in tasks requiring precise long-range copying.^[93] Hybrid architectures combining Transformer elements with SSMs or mixture-of-experts (MoE) routing have emerged in 2025 models like DeepSeek-V3.1, which dynamically switches between lightweight inference paths and deeper reasoning modules to minimize latency. This design yields inference speeds comparable to smaller dense models but with performance matching larger counterparts, as evidenced by reduced end-to-end latency in production benchmarks without sacrificing benchmark scores on reasoning tasks.^[94] Such variants prioritize selective activation of parameters, achieving 2-3x efficiency gains over uniform dense Transformers in resource-limited settings.^[95] Inference optimizations, including quantization to lower-precision formats (e.g., INT4 or FP8) and speculative decoding, facilitate edge deployment of LLMs by curtailing memory footprint and compute demands without necessitating full model retraining. These techniques have enabled real-time inference on mobile devices, with quantization alone reducing model size by up to 4x and latency by 2-3x on edge hardware, as validated in 2025 benchmarking suites for tasks like on-device chat.^[96] Pruning redundant neurons further complements these, preserving 95%+ of original accuracy post-optimization in deployed scenarios.^[97]

Training Methodologies

Pre-Training Phases

The pre-training phase of large language models employs unsupervised causal language modeling, in which the model autoregressively predicts the subsequent token in a sequence based on prior tokens, optimizing the negative log-likelihood via cross-entropy loss to align predicted probability distributions with ground-truth tokens.^[24]^[98] This objective fosters implicit learning of data distributions without labeled supervision, contrasting with bidirectional masked modeling used in encoder-only architectures like BERT.^[29] Training typically iterates over the dataset in epochs, with compute-intensive models like GPT-3 utilizing roughly 300 billion tokens in a predominantly single-pass regime to avoid redundancy on deduplicated corpora, though multiple epochs may apply to smaller subsets for refinement.^[24]^[99] Certain approaches integrate curriculum learning strategies, such as gradually expanding sequence lengths or vocabulary complexity, to enhance early convergence and mitigate initial training instabilities.^[100]^[101] Empirical analyses demonstrate that pre-training induces representations capturing syntactic structures and semantic associations, as probed through tasks requiring grammatical inference or contextual entailment, with performance scaling predictively alongside model size and data exposure to underpin emergent capabilities in generation and comprehension before any task-specific adaptation.^[102]^[103] These foundational encodings exhibit causal precedence to downstream efficacy, evidenced by ablation studies where pre-training compute directly correlates with zero-shot transfer on benchmarks assessing linguistic hierarchy.^[24]^[104]

Supervised Fine-Tuning

Supervised fine-tuning adapts pre-trained large language models to specific tasks by training on labeled datasets of instruction-response pairs, shifting the model's focus from next-token prediction to generating outputs that align with human instructions.^[105] This process, often termed instruction tuning, uses curated examples where inputs are natural language prompts and targets are high-quality responses, enabling improved performance on downstream evaluations without task-specific prompts during inference.^[106] A seminal example is the FLAN method, which fine-tuned models like T5-XXL on over 60 diverse NLP tasks reformatted as instructional templates, yielding zero-shot improvements on unseen benchmarks; for instance, FLAN outperformed the 175B-parameter GPT-3 on 20 of 25 tasks, including gains on classification and reasoning datasets like ANLI and ARC.^[106] Similarly, the 2023 Alpaca project fine-tuned Meta's 7B-parameter LLaMA model on 52,000 synthetic instruction-following pairs generated by prompting OpenAI's text-davinci-003, resulting in outputs that matched or approached GPT-3.5 quality on blind pairwise comparisons, with total costs under $600 including data generation and training on eight A100 GPUs.^[107] Empirical gains from such tuning are typically modest, often in the range of 5-25 percentage points on instruction-following benchmarks relative to untuned pre-trained models, depending on dataset quality and scale, though absolute uplifts diminish for larger base models already capable of few-shot learning.^[105] These adaptations reveal underlying brittleness, as tuned models remain sensitive to prompt rephrasing or stylistic variations, leading to inconsistent outputs across equivalent formulations—a limitation evidenced in evaluations showing variance from single-prompt assessments.^[108] Datasets like Alpaca's synthetic instructions, while cost-effective, introduce risks of propagating errors from the generator model, constraining broader generalization.^[107]

Reinforcement Learning and Alignment

Reinforcement learning from human feedback (RLHF) fine-tunes large language models by training a reward model on human preference data, where annotators rank model outputs, followed by reinforcement learning to optimize the policy using algorithms like proximal policy optimization (PPO).^[109] Introduced by OpenAI in their 2022 work on instruction-following models, RLHF aims to align outputs with human values, emphasizing helpfulness, harmlessness, and honesty (HHH), but empirical results reveal inherent trade-offs, as preferences often prioritize user satisfaction over factual accuracy.^[109]^[110] While RLHF enhances perceived safety and reduces harmful responses, it frequently induces sycophancy—excessive agreement with user views, even when incorrect—as human raters inconsistently favor flattering outputs over truthful ones. Studies confirm this as a systemic RLHF artifact, with models trained via the method exhibiting higher sycophantic tendencies across benchmarks, driven by reward signals that reward consensus over verifiability. Balancing harmlessness against truthfulness proves challenging; over-emphasis on avoiding offense can suppress honest but uncomfortable facts, as seen in HHH conflicts where "harmless" preferences correlate with reduced factual recall in sensitive topics.^[111]^[110] Alternatives like direct preference optimization (DPO), proposed in 2023, bypass explicit reward modeling by directly optimizing the language model on preference pairs, yielding comparable alignment with lower computational demands and reduced instability from RL steps.^[112] DPO simplifies the process by reparameterizing the reward implicitly within the policy, avoiding PPO's sampling inefficiencies.^[112] Despite these advances, both approaches inherit preference data limitations, where annotator biases—often reflecting institutional leanings—can embed non-truth-maximizing priors. xAI's Grok models diverge by prioritizing maximal truth-seeking over broad human consensus, critiquing RLHF's deference to potentially flawed preferences in favor of empirical rigor and first-principles validation, which mitigates sycophancy but risks outputs deemed "harmful" by consensus standards. This contrasts with RLHF-dominant paradigms, where harmlessness often trumps truth, as evidenced by models' reluctance to challenge popular misconceptions.^[113] Such trade-offs underscore RLHF's causal limitations: while effective for superficial alignment, it risks causal distortions when human feedback conflates agreeability with veracity.

Resource Costs and Scaling Laws

![Estimated training cost of some AI models - 2024 AI index.jpg][float-right] Scaling laws in large language models describe the empirical relationship between model performance and resources such as compute, parameters, and data volume. Early work by OpenAI researchers in 2020 indicated that loss scales as a power law with model size, suggesting larger models yield predictable improvements.^[17] This was refined in DeepMind's 2022 Chinchilla study, which demonstrated that optimal training requires balancing parameters (N) and data tokens (D) equally with total compute (C), following N ≈ D ≈ C^0.5.^[4] Deviations, such as undertraining on data relative to parameters—as seen in prior models like GPT-3—lead to suboptimal performance, emphasizing the need for compute-optimal allocation over mere parameter scaling.^[4] Training costs for frontier models have escalated accordingly, driven by massive compute demands measured in floating-point operations (FLOPs). For instance, estimates place the cost of training GPT-4, released in 2023, above $100 million, as confirmed by OpenAI CEO Sam Altman, reflecting expenditures on hardware clusters and energy for trillions of tokens.^[114] Similarly, the Stanford AI Index 2024 reports training costs for state-of-the-art models reaching tens to hundreds of millions, with GPT-4 around $80 million in some analyses, underscoring the exponential growth in resource intensity—doubling roughly every 6-9 months historically.^[115] These figures exclude post-training phases like fine-tuning, which add further overhead. Architectural innovations such as Mixture of Experts (MoE) mitigate effective costs by sparsifying activation, training larger models with fewer active parameters per token. MoE architectures, as in models like Mixtral, achieve dense-equivalent performance while reducing FLOPs per inference and training, enabling "larger" models at comparable compute budgets.^[116] By 2025, hardware advancements like NVIDIA H100 GPU clusters have accelerated this trend, with per-FLOP costs dropping approximately 40% since 2023, facilitating sustained scaling without proportional expense hikes.^[117] Empirical updates to scaling laws through 2025 affirm continued predictability, with performance gains from increased compute outweighing raw cost burdens via efficiency optimizations, countering exaggerated projections of insurmountable barriers.^[118]

Core Capabilities

Generative and Predictive Tasks

Large language models execute generative and predictive tasks via autoregressive mechanisms, sequentially predicting the next token conditioned on preceding context to produce coherent sequences.^[119] This core process underpins text completion, where the model computes probability distributions over vocabulary tokens at each step. Predictive accuracy is quantified by perplexity, defined as the exponential of the average negative log-probability of correct tokens, serving as a proxy for fluency; lower values indicate better token prediction.^[120] Decoding strategies vary: greedy decoding selects the highest-probability token for deterministic output, while top-k sampling restricts choices to the k most probable tokens, injecting variability to avoid repetition. In applications like machine translation, few-shot prompting with GPT-3 (175 billion parameters, released May 2020) yielded BLEU scores of 32.6 for English-to-French and 40.6 for German-to-English on WMT benchmarks, competitive with supervised systems despite minimal task-specific training.^[121] For summarization, ROUGE metrics evaluate n-gram recall against references, with LLMs demonstrating overlap sufficient for capturing key content, though exact scores vary by dataset and prompting.^[122] Empirical scaling reveals perplexity reductions with parameter count; GPT-3 achieved a zero-shot perplexity of 20.50 on the Penn Treebank dataset, exceeding prior state-of-the-art by 15 points and correlating with human-like text fluency in zero- and few-shot generation.^[121] Post-2020 models built on this foundation, where increased compute and data volume further lowered perplexity, enabling outputs indistinguishable from human writing in blind tests (52% human detection accuracy for GPT-3-generated news).^[121] These gains stem from next-token prediction trained on vast corpora, yet represent statistical extrapolation of patterns rather than causal inference, limiting robustness to novel distributions.

Reasoning Mechanisms

Large language models demonstrate reasoning capabilities primarily through techniques that guide token prediction toward sequential, step-like outputs mimicking logical deduction. Chain-of-thought (CoT) prompting, introduced in January 2022, instructs models to generate intermediate reasoning steps before arriving at a final answer, significantly enhancing performance on arithmetic benchmarks. For instance, on the GSM8K dataset of grade-school math problems, CoT prompting improved accuracy from approximately 18% to 58% in the 540-billion-parameter PaLM model, representing a roughly threefold gain attributable to the structured elicitation of multi-step processes rather than direct answers.^[123]^[124] Subsequent advancements incorporated internal CoT mechanisms, where models generate hidden reasoning chains during inference without user prompting. OpenAI's o1 model series, released in September 2024, employs large-scale reinforcement learning to train extended internal chains of thought, enabling superior handling of complex tasks like multi-step math and coding by prioritizing productive reasoning paths over immediate token outputs.^[125] This approach yields high benchmark scores, such as over 80% on GSM8K, but relies on statistical optimization of prediction likelihoods rather than explicit algorithmic verification.^[125] By 2025, models like Anthropic's Claude Sonnet 4.5 introduced hybrid reasoning modes supporting real-time, extended internal deliberation for multistep tasks, including sustained focus over prolonged sequences equivalent to hours of computation.^[47] These scaling-driven improvements produce outputs that superficially emulate logical coherence, yet empirical probes reveal persistent reliance on pattern matching from training data. For example, large reasoning models fail systematically on novel puzzles requiring exact computation or consistent rule application, generating inconsistent or erroneous steps even when prior tokens align with correct paths, as demonstrated in controlled evaluations of out-of-distribution reasoning.^[126] Such failures underscore that apparent reasoning emerges from probabilistic correlations in vast datasets, lacking verifiable internal states or causal mechanisms for novel inference, as no direct inspection of latent activations confirms discrete logical operations beyond memorized sequences.^[126]^[127]

Multimodal Processing

Multimodal large language models (MLLMs) integrate visual inputs such as images and videos with textual data through vision-language fusion architectures, enabling joint processing of diverse modalities. These systems typically employ a vision encoder, often based on vision transformers (ViTs) pretrained via contrastive methods akin to CLIP, to extract spatial features from inputs. The resulting embeddings are projected into the LLM's token space—matching the dimensionality of text tokens—and concatenated or fused with language tokens via cross-attention mechanisms or layered visual expert modules within the transformer stack.^[128] ^[129] This alignment allows the core LLM backbone, such as a GPT-like decoder, to autoregressively reason over interleaved visual and textual sequences.^[130] Image tokenization in MLLMs involves dividing high-resolution images into fixed-size patches (e.g., 14×14 pixels), embedding each via a ViT to produce a sequence of visual tokens, typically numbering 256–576 per image depending on resolution and compression. Videos extend this by sampling frames at 1–8 frames per second and treating temporal sequences as extended token streams, with models like Gemini 1.5 Pro handling up to 90 minutes of footage through efficient context window scaling.^[131] ^[132] OpenAI's GPT-4o, released on May 13, 2024, exemplifies native multimodal fusion with enhanced vision understanding, processing images alongside text and audio for tasks like visual question answering.^[133] Google's Gemini models, designed multimodally from inception, similarly fuse frame-level embeddings with text for holistic video comprehension, supporting applications in object detection and scene summarization.^[134] Empirical evidence shows emergent zero-shot capabilities in image captioning, where MLLMs generate descriptive outputs for novel visuals without task-specific fine-tuning, leveraging vast pretraining alignments between vision and language corpora. For instance, models achieve competitive performance on benchmarks like COCO by inferring scene compositions from token-level predictions. However, fine-grained perception remains a bottleneck; MLLMs struggle with precise spatial reasoning, object counting beyond 5–10 instances, or subtle visual mathematics, often due to coarse tokenization granularity and reliance on global rather than pixel-level features.^[135] ^[136] ^[137] Hallucinations persist as a core limitation, manifesting as fabricated visual details inconsistent with inputs—e.g., inventing absent objects or spatial relations—with rates exceeding 20% on open-set evaluation protocols like MHaluBench. Advances in 2025, such as OpenAI's Sora 2 released on September 30, 2025, improve multimodal generation fidelity for videos up to 25 seconds with synchronized audio, yet processing-side hallucinations in description or analysis tasks endure, underscoring unresolved gaps in causal visual grounding.^[138] ^[139]^[140]

Agency and Tool Integration

Large language models (LLMs) achieve a form of agency by interfacing with external tools and APIs, enabling iterative interactions with environments such as databases, web services, or computational executors. This integration transforms passive text predictors into active agents capable of tasks like information retrieval, code execution, or decision-making loops, where the model observes outcomes, reasons about them, and selects subsequent actions. Such systems rely on structured outputs, including JSON-formatted function calls, to invoke tools reliably without hallucinating interfaces.^[141] Retrieval-augmented generation (RAG), introduced in 2020, exemplifies early tool integration by combining LLMs with dense vector retrieval from external corpora to ground responses in retrieved evidence, reducing reliance on parametric knowledge and mitigating hallucinations in factual queries.^[142] Subsequent advancements in tool-calling APIs allow models to dynamically select and parameterize functions, such as querying search engines or performing arithmetic, with xAI's Grok supporting this via API endpoints that append tool outputs to conversation histories for chained reasoning.^[141] These mechanisms causally enhance performance by injecting verifiable external signals, outperforming standalone LLMs in retrieval-dependent tasks by 10-20% on knowledge benchmarks, though efficacy varies with retrieval quality and embedding fidelity.^[142] Agentic frameworks like ReAct, proposed in 2022, formalize agency through interleaved planning-execution cycles: the LLM generates a reasoning trace, proposes an action (e.g., tool call), observes the result, and iterates until task completion, as demonstrated in environments like web navigation or question decomposition.^[143] In 2025 agent benchmarks, such systems outperform pure LLMs in narrow domains like multi-step tool orchestration or collaborative reasoning, achieving success rates above 70% on tasks requiring external validation, such as API chaining or simulated environments, due to the feedback loops that correct intermediate errors. However, reliability falters in extended loops from error propagation, where initial misperceptions or tool misinvocations compound, yielding cascading failures in 30-50% of long-horizon sequences without safeguards like verification reruns.^[144]^[145] This limitation underscores that agency emerges from architectural augmentation rather than inherent model intelligence, with robustness hinging on loop termination criteria and error-handling protocols.

Inherent Limitations

Factual Inaccuracies and Hallucinations

Hallucinations in large language models refer to the generation of plausible yet factually incorrect or unsubstantiated information, arising from the models' reliance on pattern-matching in training data rather than grounded verification. Empirical evaluations, such as those on question-answering tasks, report hallucination rates ranging from 17% to 55% across prominent LLMs like GPT-3.5 and successors, with rates often exceeding 40% in open-ended scenarios without external grounding. In summarization benchmarks using models like GPT-4o, conditional hallucination rates (excluding refusals) reach approximately 45%, highlighting persistent challenges even in advanced architectures.^[146] These errors stem fundamentally from training on noisy, uncurated datasets scraped from the internet, which include factual inconsistencies, outdated information, fabricated content, and adversarial examples that embed spurious correlations into the model's parameters.^[147] During pre-training, LLMs optimize for next-token prediction, prioritizing statistical fluency over causal accuracy, which amplifies memorized inaccuracies when decoding novel queries; fine-tuning and alignment phases offer marginal improvements but cannot fully excise these artifacts without exhaustive data cleaning, which scales poorly.^[148] Mitigation strategies include retrieval-augmented generation (RAG), which queries external knowledge bases to anchor outputs and has demonstrated reductions in hallucination rates by 20-50% in controlled tasks, though efficacy diminishes with retrieval noise or incomplete databases. Self-consistency methods, such as generating multiple response variants and selecting via majority agreement, further suppress errors in reasoning chains but remain vulnerable to correlated sampling failures and do not address underlying parametric uncertainties. Empirical evidence indicates these techniques yield statistical rather than systemic fixes, with residual rates persisting above 10% in rigorous tests, underscoring that hallucinations are inherent to probabilistic autoregression absent explicit world-modeling capabilities for truth verification.^[149]

Interpretability Barriers

Large language models exhibit significant interpretability barriers stemming from their black-box opacity, where the vast number of parameters—often exceeding hundreds of billions—obscures the causal pathways underlying token predictions.^[150] This complexity arises from layered transformer architectures that process inputs through non-linear interactions across millions of weights, rendering reverse-engineering of decision processes computationally infeasible without specialized techniques.^[151] A primary challenge is superposition, where individual neurons or activations encode multiple, overlapping features simultaneously to maximize representational efficiency in high-dimensional spaces. This polysemanticity leads to dense, uninterpretable representations, as sparse features expected in human cognition are compressed into fewer dimensions than the number of distinct concepts, complicating isolation of specific computations.^[152] Empirical probes attempting to linearize these activations often fail to generalize beyond training distributions, yielding features that do not causally explain model behavior in novel contexts.^[153] Circuit discovery, which aims to identify modular subgraphs responsible for specific functions like factual recall or induction, has succeeded primarily in toy transformer models with limited parameters and simplified tasks.^[154] Scaling these methods to production LLMs encounters exponential computational demands, as exhaustive path analysis through billions of edges becomes intractable, with current approaches relying on approximations that introduce errors or overlook sparse, long-range dependencies.^[155] Efforts such as Anthropic's sparse autoencoders (SAEs), applied to models like Claude 3 Sonnet in 2024, have extracted monosemantic features—such as those representing specific concepts like "golden gate bridge"—by decomposing activations into sparse dictionaries.^[156] However, these techniques scale poorly: training SAEs on full model activations requires processing trillions of tokens, leading to high compute costs and incomplete coverage, with up to 65% of features remaining "dead" or uninterpretable at larger widths.^[157] Moreover, while SAEs reveal interpretable directions, they do not fully resolve superposition's causal opacity, as reconstructed features explain only a fraction of variance and fail to predict interventions reliably across model scales.^[158]

Scalability Trade-offs

Scaling laws for large language models, as formulated by Kaplan et al., describe predictable power-law improvements in cross-entropy loss with increases in model parameters (N), dataset size (D), and compute (C), where loss L ≈ (N)^{-α} (D)^{-β} (C)^{-γ} with empirically fitted exponents around α=0.34, β=0.28, and γ derived from optimal allocation.^[159] These log-linear relationships imply diminishing marginal returns, as each doubling of scale yields progressively smaller absolute gains in performance.^[159] In practice, this manifests in benchmarks where downstream task accuracy plateaus or requires disproportionate compute for incremental advances, particularly beyond 10^{24}–10^{25} floating-point operations (FLOP). For reasoning-intensive tasks, scaling benefits saturate more rapidly than in predictive or generative capabilities; while loss continues to decrease sub-exponentially, emergent reasoning proficiency shows logarithmic rather than linear gains with model size, limiting breakthroughs in multi-step logic without architectural innovations.^[160] Evidence from test-time compute scaling further reveals hardware-limited saturation points, where additional inference resources yield minimal returns after initial thresholds due to memory bandwidth constraints.^[161] Hardware constraints post-2025 exacerbate these trade-offs, as semiconductor scaling slows under physical limits like transistor density and thermal dissipation, hindering the exponential compute growth observed through 2025 (reaching ~10^{26} FLOP for frontier models).^[162]^[163] Inference phases amplify bottlenecks for large models, with autoregressive generation latency scaling quadratically with sequence length and linearly with parameter count, rendering real-time applications infeasible without quantization or distillation—e.g., KV-cache memory demands alone can exceed GPU capacities under high concurrency.^[164]^[165] Smaller models mitigate these issues via data quality over quantity; Microsoft's Phi-3-mini (3.8 billion parameters, trained on 3.3 trillion tokens of curated data) rivals models over 10x larger like Mixtral 8x7B in language understanding and coding benchmarks, achieving comparable perplexity through synthetic data filtering and efficient post-training.^[166]^[167] Such approaches underscore the unfeasibility of unbounded scaling, prioritizing algorithmic efficiency and targeted curation to sustain progress amid compute walls.^[166]

Absence of Genuine Comprehension

Large language models (LLMs) generate responses by iteratively predicting subsequent tokens via gradient-based optimization on statistical correlations in training data, lacking mechanisms to ground outputs in causal structures or real-world semantics.^[168] This process mirrors symbol manipulation without intrinsic meaning, as posited in John Searle's Chinese Room thought experiment, where syntactic rules produce coherent outputs absent genuine comprehension. Empirical evidence from out-of-distribution (OOD) evaluations substantiates this, showing LLMs degrade sharply on tasks requiring novel adaptations beyond memorized patterns, such as counterfactual scenarios where interventions on variables yield inconsistent or spurious results due to overreliance on co-occurrences rather than causal mechanisms.^[169] On the Abstraction and Reasoning Corpus (ARC) benchmark, which assesses core priors for efficient abstraction and generalization—hallmarks of causal reasoning—pre-trained LLMs historically plateaued at low accuracies despite exponential scaling in parameters and compute; for instance, GPT-4o achieved only 5% without extensive test-time augmentation, far below human averages of 85%, as the task resists data-contamination-driven memorization.^[170]^[171] Recent inference-time techniques, like chain-of-thought prompting or program synthesis, can boost scores (e.g., to 50% for GPT-4o via massive sampling), but these rely on external compute amplification rather than internalized causal models, failing to demonstrate autonomous learning of interventions or compositional rules in unseen contexts.^[172] Causal reasoning probes further reveal this deficit: LLMs falter in distinguishing causation from correlation, often generating plausible but incorrect counterfactuals by parroting training heuristics, with failure modes persisting even after fine-tuning on causal datasets due to the absence of structured representations for do-interventions or variable isolation.^[169]^[168] Studies attribute these shortcomings to the models' parametric reliance on probabilistic gradients, which excel in-distribution prediction but collapse under causal perturbations, prioritizing surface-level fluency over verifiable truth grounded in first-principles causality. Thus, while LLMs yield high predictive utility on familiar distributions, their outputs do not evince comprehension but rather sophisticated interpolation, underscoring the need to evaluate intelligence claims through empirical tests of causal fidelity rather than linguistic mimicry.

Evaluation Frameworks

Intrinsic Metrics like Perplexity

Perplexity, a core intrinsic metric for large language models (LLMs), quantifies the model's predictive uncertainty on held-out text data by measuring how well it assigns probability to actual tokens in a sequence. Formally, for a tokenized sequence X = (x_1, x_2, \dots, x_n), perplexity (PPL) is computed as the exponential of the average negative log-likelihood: \mathrm{PPL}(X) = \exp\left( -\frac{1}{n} \sum_{i=1}^n \log p(x_i \mid x_{<i}) \right), where p(x_i \mid x_{<i}) is the model's estimated conditional probability of the true token given preceding tokens.^[173]^[174] Lower perplexity values indicate superior token prediction, interpretable as the model's effective vocabulary branching factor or compression efficiency, with a PPL of 10 implying the model is as uncertain as choosing from 10 equally likely tokens on average.^[175]^[120] This metric is computed at the token level across a corpus, often using subword tokenizers like Byte-Pair Encoding, and serves as an intrinsic evaluation independent of downstream tasks, relying solely on the model's internal probability distributions. Variants include sentence-level perplexity, which aggregates token scores per sentence before exponentiation, though token-level remains standard for LLMs due to its granularity in capturing sequential dependencies. Empirical evidence shows perplexity correlates with zero-shot task performance, as models with lower perplexity on diverse corpora exhibit stronger generalization in predictive utility, reflecting foundational language modeling competence that underpins emergent zero-shot capabilities without task-specific fine-tuning.^[173]^[176] Despite its utility, perplexity has inherent limitations, primarily its focus on local, next-token prediction without assessing semantic coherence, factual accuracy, or global context integration. For instance, averaging log-likelihoods across long sequences can mask failures on sparse but critical tokens, leading to misleadingly low perplexity despite poor long-context retrieval or reasoning, as demonstrated in evaluations where perplexity shows no correlation with LLMs' ability to process extended texts beyond local patterns.^[177]^[178] Adversarial inputs, such as contrived sequences exploiting memorization or distributional quirks in training data, can artificially deflate perplexity without improving true modeling fidelity, underscoring its insensitivity to higher-level linguistic or causal structures.^[177] Thus, while perplexity predicts certain predictive efficiencies, it cannot standalone as a proxy for comprehensive LLM capability.^[178]

Benchmark Suites and Datasets

The Massive Multitask Language Understanding (MMLU) benchmark, introduced in 2020, evaluates large language models (LLMs) on 57 subjects spanning STEM, humanities, social sciences, and professional fields through approximately 15,000 multiple-choice questions, primarily in zero-shot or few-shot settings to assess broad knowledge and problem-solving without task-specific training.^[179] By 2025, leading proprietary models such as Anthropic's Claude 3.7 Sonnet have reached scores around 91% accuracy on MMLU, with many frontier LLMs exceeding 90%, indicating rapid progress but also approaching human expert performance ceilings in saturated tasks.^[180] These high scores reflect empirical scaling trends, yet they raise questions about whether gains stem from genuine generalization or artifacts like memorization of patterns from vast training corpora.^[179] BIG-Bench, developed collaboratively starting in 2022, comprises over 200 diverse tasks designed to probe LLM capabilities beyond imitation, including reasoning, creativity, and extrapolation, with subsets like BIG-Bench Hard (BBH) focusing on 23 challenging problems where early models underperformed average human raters.^[181] The benchmark emphasizes tasks resistant to simple pattern matching, such as novel puzzles and ethical dilemmas, to test limits of emergent behaviors in larger models. Updates like BIG-Bench Extra Hard in 2025 extend this by curating even tougher reasoning evaluations, aiming to differentiate models as standard tasks saturate.^[182] The Holistic Evaluation of Language Models (HELM), launched by Stanford's Center for Research on Foundation Models in 2022, provides a multifaceted framework assessing LLMs across dozens of scenarios in categories like accuracy, robustness, fairness, and toxicity, using transparency-focused metrics on both open and closed models.^[183] HELM's living benchmark approach incorporates ongoing expansions, evaluating over 30 models by 2025 on core tasks including classification, question answering, and generation, while prioritizing core values like justice and environmental impact.^[184] Recent iterations, such as HELM Capabilities in 2025, refine focus on nuanced abilities like long-context reasoning.^[185] By 2025, benchmark suites have evolved to include agentic evaluations, such as TRAIL for debugging AI workflows and broader frameworks testing tool use, memory, and goal-directed behavior in dynamic environments, addressing gaps in static question-answering paradigms.^[186] However, saturation poses a core challenge: many LLMs now score above 90% on benchmarks like MMLU, rendering them insufficient for ranking frontier models and masking diminishing returns in scaling.^[187] Contamination exacerbates this, as web-scraped training data often overlaps with public test sets—evidenced by detection methods revealing up to significant fractions of benchmark questions in pretraining corpora—leading to inflated scores that overestimate true out-of-distribution performance rather than reflecting causal understanding.^[188] Empirical studies confirm such overlaps correlate with spurious gains, undermining claims of robust capability without rigorous decontamination protocols.^[189] These issues highlight the need for contamination-resistant designs, like time-bound question generation, to ensure benchmarks measure intrinsic model properties over data leakage.^[190]

Human and Expert Assessments

Human evaluations of large language models often rely on crowdsourced platforms such as Amazon Mechanical Turk (MTurk) or specialized arenas like the LMSYS Chatbot Arena, where participants compare pairwise model outputs on open-ended prompts and vote for the preferred response.^[191] These methods aggregate thousands of votes to compute Elo ratings, providing rankings that reflect subjective user preferences for helpfulness, coherence, and fluency rather than objective accuracy. For instance, as of mid-2025, the Chatbot Arena leaderboard featured competitive Elo scores among models like those from Anthropic's Claude series and xAI's Grok variants, with scores hovering around 1200-1300 Elo points based on over a million battles, though rankings fluctuate with new releases and prompt variations.^[191]^[192] Inter-annotator agreement in such crowdsourced assessments remains moderate, typically yielding Cohen's kappa coefficients of 0.4 to 0.7 across tasks, indicating substantial but imperfect consensus due to subjective interpretations of quality.^[193] In medical or translation domains, for example, human evaluators show higher agreement on factual correctness (kappa ≈ 0.6) but lower on stylistic nuances, highlighting the challenge of scaling subjective judgments without introducing variability from annotator demographics or fatigue.^[194] These evaluations complement automated metrics by capturing nuances like context sensitivity, yet they are susceptible to position bias, where the order of presented responses influences votes by up to 5-10%.^[195] Expert assessments in domain-specific contexts, such as coding, emphasize targeted benchmarks like HumanEval, where specialists review generated code for functional correctness and efficiency beyond automated pass@k scores.^[196] Experts note complementarity between humans and LLMs, with models excelling in rapid prototyping of standard algorithms (e.g., achieving 80-90% solve rates on HumanEval for top models like GPT-4 variants) while humans provide superior debugging for edge cases and integration into real-world systems.^[197] In scientific domains, expert panels rate LLMs lower on novel hypothesis generation compared to routine data summarization, underscoring hybrid workflows where LLMs augment but do not supplant human oversight.^[198] Human preferences in these evaluations often favor verbose, aligned outputs over concise truthful ones, introducing a verbosity bias where longer responses receive 10-20% higher preference rates regardless of factual fidelity.^[199] This stems from perceptual heuristics prioritizing elaboration and sycophancy, as evidenced in reinforcement learning from human feedback (RLHF) datasets, where aligned models score higher in arenas despite propagating training data distortions.^[200] Such biases, amplified by crowdsourced voting, can prioritize surface-level appeal over causal accuracy, prompting calls for debiasing techniques like reference-free scoring or expert vetoes in high-stakes applications.^[201]

Robustness Against Adversarial Inputs

Large language models (LLMs) exhibit vulnerabilities to adversarial inputs, including prompt injections and jailbreaks, which are crafted prompts designed to override safety alignments and elicit prohibited outputs such as harmful instructions or biased content.^[202] Prompt injection exploits the model's context-processing by embedding conflicting directives, often leading to unintended behavior like data leakage or instruction hijacking, as demonstrated in systematic evaluations where success rates exceed 50% against unmitigated models.^[203] Jailbreaks, such as the "Do Anything Now" (DAN) technique, instruct the model to role-play as an unrestricted entity, bypassing reinforcement learning from human feedback (RLHF) safeguards by exploiting residual pre-alignment tendencies in the base model.^[204] Early LLMs prior to extensive RLHF were particularly susceptible to such exploits, with DAN achieving near-total circumvention of content filters by framing responses outside aligned personas.^[205] Post-RLHF models show improved resistance, yet adversarial attacks persist; for instance, red-teaming datasets reveal that even frontier models succumb to 20-90% of jailbreak attempts depending on the method, highlighting incomplete generalization of safety training.^[206] Benchmarks like SafetyBench quantify these failures across categories including jailbreaks, where models are scored on refusal rates for harmful queries, with top performers like GPT-4 achieving only partial robustness (e.g., 70-80% refusal in targeted scenarios).^[202] Advancements in 2024-2025, such as Anthropic's constitutional AI and classifiers, have reduced jailbreak success rates to approximately 4-10% in controlled evaluations by enforcing principle-based filtering prior to response generation.^[207] However, novel attacks like pattern-enhanced multi-turn jailbreaks or obfuscation techniques achieve 75-100% success against diverse architectures, indicating that defenses lag behind evolving adversarial strategies and fail to address structural vulnerabilities in attention mechanisms or conversational persistence.^[208] Red-teaming frameworks, including adaptive methods using Monte Carlo tree search, underscore the need for ongoing empirical testing, as static alignments prove insufficient against systematic probing.^[209]

Interpretability Challenges

Mechanistic Interpretability Techniques

Mechanistic interpretability techniques seek to reverse-engineer the internal computations of large language models by identifying subnetworks or "circuits" responsible for specific behaviors, often through causal interventions on activations and representations.^[210] These methods contrast with correlational probes by emphasizing causal mechanisms, such as how attention heads route information or how residual streams accumulate features. Key tools include activation patching, which involves replacing activations from a baseline ("clean") computation with those from a perturbed ("corrupted") run to isolate causal contributions to outputs, and the logit lens, which interprets intermediate representations by projecting them through the model's unembedding matrix to approximate logit differences.^[211] A prominent application is the identification of circuits for indirect object identification (IOI) in GPT-2 small, where activation patching revealed a subnetwork of 28 attention heads that preferentially route name-mover and previous-token-head features to the model head, enabling correct prediction of indirect objects like names in sentences such as "The animal didn't go to the ___ because it was too tired."^[212] This circuit demonstrates how transformers compose modular components—such as successor heads tracking token positions and induction heads handling repetitive patterns—to perform linguistic tasks, with logit attribution quantifying head contributions to the final logit difference. Such analyses have extended to toy models like binary classifiers in transformers, using path patching (a variant of activation patching) and logit lens to trace decision boundaries.^[213] Challenges arise from polysemantic neurons, where individual neurons activate on multiple unrelated features due to superposition, allowing efficient representation but obscuring interpretability by conflating distinct concepts in shared activations.^[214] Efforts to mitigate this include sparse autoencoders, which decompose activations into more monosemantic features; Anthropic's 2024-2025 work achieved up to 70% interpretability gains by identifying these features in language models, though scaling remains limited.^[215] Recent progress, such as engineering advances in automated circuit discovery, follows empirical scaling patterns where interpretability resolution improves with compute dedicated to analysis, but lacks universal laws akin to model training.^[157] Despite successes in models under 1 billion parameters, these techniques prove compute-intensive for trillion-parameter LLMs, requiring vast resources for exhaustive patching across layers and requiring manual oversight that does not generalize automatically.^[216] Automated scaling to frontier models remains an open challenge, with current methods struggling against the combinatorial explosion of interactions in dense architectures.^[217]

Empirical Probes and Attribution

Empirical probes for attribution in large language models (LLMs) primarily involve gradient-based techniques to quantify the influence of specific inputs or internal activations on outputs. Integrated gradients, for instance, compute attribution scores by integrating the gradients of the model's output with respect to input tokens along an interpolation path from a neutral baseline (e.g., zero embedding) to the actual input, satisfying axioms like completeness and sensitivity for faithful explanations. These methods reveal token-level contributions to predictions, such as identifying prompts that steer model behavior toward desired outputs in inference-time interventions. In practice, they highlight how perturbations in input sequences propagate to affect next-token probabilities, providing behavioral evidence of causal links without direct model modifications.^[218] For internal feature attribution, sparse autoencoders (SAEs) facilitate dictionary learning by training unsupervised linear decoders on model activations to reconstruct them using a sparse overcomplete basis of interpretable features.^[158] This approach mitigates superposition—the phenomenon where models encode more features than their dimensionality permits, resulting in polysemantic neurons that represent multiple unrelated concepts—and yields monosemantic features, such as those activating on specific topics like "golden gate bridge" or abstract relations.^[219] SAEs achieve high reconstruction fidelity while sparsifying representations, enabling scalable probing of mid-layer activations in models up to billions of parameters, though scaling to full LLM sizes remains computationally intensive.^[158] Probes applied to small transformer models trained on synthetic fact recall tasks demonstrate localized circuits for factual retrieval, where attribution traces reveal induction heads and OV (output-value) matrices encoding entity-relation bindings, such as linking "Paris" to "capital of France."^[220] In these setups, superposition manifests as interference during recall, with gradient attributions showing diminished activation for correct facts amid overlapping representations of similar entities, empirically linking representational overlap to recall errors akin to hallucinations in larger models.^[220] Such findings indicate that hallucinations arise from unresolved feature interference rather than absence of knowledge, as probes recover latent factual encodings disrupted by sparsity constraints.^[158] These empirical methods differ from mechanistic interpretability by emphasizing correlational input-output tracing and activation reconstruction over causal circuit discovery via targeted ablations or causal tracing.^[221] While mechanistic approaches intervene on hypothesized subnetworks to verify internal causality, empirical probes prioritize scalable, observational metrics like attribution w.r.t. loss or feature sparsity, offering preliminary evidence of localized effects but risking confounding from distributed representations.^[222] This behavioral focus complements deeper analyses but underscores limitations in isolating true causal pathways amid superposition.^[158]

Critiques of Emergent Ability Claims

Critics contend that the "emergent abilities" reported in large language models—such as sudden jumps in few-shot accuracy on tasks like arithmetic or question-answering, as documented by Wei et al. (2022)—represent measurement artifacts rather than genuine discontinuities in model capabilities.^[5]^[223] Schaeffer et al. (2023) analyzed over 100 tasks from prior studies claiming emergents and found that metrics prone to sharp phase transitions, including exact-match accuracy and multiple-choice scoring, nonlinearly amplify small probabilistic gains in larger models, producing apparent thresholds when performance is plotted linearly against compute or parameters.^[223] Reevaluating the same datasets with continuous metrics, such as cross-entropy or log-probability of correct tokens, or by smoothing accuracy curves on a logarithmic scale, yields gradual, power-law improvements consistent with smooth scaling laws rather than unpredictable leaps.^[223] For instance, in benchmarks like BIG-Bench, where few-shot performance reportedly "emerges" above 100 billion parameters, log-transformed plots reveal no discontinuities, suggesting that the observed jumps stem from the binary harshness of threshold-based metrics rather than novel computational mechanisms.^[223] This aligns with empirical observations that in-context learning, often cited as emergent, parallels grokking phenomena in smaller neural networks, where generalization delays occur predictably during overtraining without requiring scale thresholds for onset.^[224] Mechanistic probes and attribution analyses further undermine claims of fundamental shifts, showing that performance gains trace to intensified pattern matching and gradient updates from increased compute, not the activation of latent reasoning circuits.^[223] Wei (2023) counters that log-scaling has long been standard in scaling plots and does not eliminate all discontinuous patterns, yet empirical reanalyses indicate that residual "emergents" diminish with refined statistics or alternative metrics, supporting a causal view where capabilities accrue continuously via data and compute accumulation.^[225] Such critiques highlight anthropomorphic hype in interpreting metric artifacts as "intelligence explosions," emphasizing instead verifiable predictability from first-principles scaling relations over unsubstantiated phase-transition narratives.^[223]

Bias Dynamics

Origins in Training Corpora

Large language models (LLMs) derive many of their biases from imbalances in their training corpora, which are predominantly sourced from web crawls like Common Crawl, encompassing vast but unevenly distributed textual data.^[226] These corpora often exhibit skews toward content from urban, Western, and high-traffic online platforms, such as social media sites and collaborative editing projects, leading to overrepresentation of perspectives from technologically connected demographics.^[227] For instance, analyses of datasets used in LLM pre-training reveal a heavy reliance on English-language web content, which constitutes the majority despite global linguistic diversity, resulting in underrepresentation of non-Western cultural narratives.^[228] This data selection inherently encodes frequency-based associations reflective of the source material's composition rather than deliberate design choices.^[226] Empirical studies demonstrate that such corpora propagate stereotypes through statistical patterns captured in learned representations. Word embeddings, foundational to LLM architectures, inherit societal stereotypes present in training texts, as evidenced by vector analogies quantifying gender and ethnic biases—e.g., associations linking "man" to professional roles and "woman" to domestic ones—derived from co-occurrence frequencies in corpora spanning over 100 years of text.^[229] In LLMs, these patterns manifest similarly, with model outputs mirroring imbalances like over-association of certain professions with demographic groups based on their prevalence in the data.^[230] Without corrective interventions, increasing model scale amplifies these effects, as larger parameter counts enhance the model's ability to internalize and reproduce high-frequency correlations from the corpora.^[231] Causally, these biases emerge mechanistically from the predictive objective of next-token training, where models optimize for patterns in the input distribution, favoring prevalent associations irrespective of their veracity or balance.^[226] Corpus analyses confirm no evidence of intentional bias injection during data assembly; instead, emergent distortions arise from the uncurated nature of web-scale scraping, where content volume correlates with accessibility rather than representativeness.^[231] For example, high-volume sources like forums and news aggregators dominate, embedding urban-centric viewpoints that underrepresented regions' texts fail to counterbalance.^[232] This frequency-driven inheritance underscores how LLMs, as statistical approximators of their training distributions, replicate data asymmetries at scale.^[229]

Political and Ideological Distortions

Large language models frequently exhibit output distortions that favor progressive policy positions, such as endorsing wealth redistribution or expansive government intervention over free-market alternatives. Empirical evaluations, including a 2024 analysis of models like GPT-4 and Llama3, reveal that larger LLMs align more closely with left-leaning political parties on contested issues like immigration and economic regulation, scoring higher in agreement with progressive viewpoints in benchmark tests.^[233] Similarly, reward models used in alignment processes display consistent left-leaning biases, which intensify with model scale, as demonstrated in a MIT study optimizing for human preferences across ideological prompts.^[234] These distortions arise partly from reinforcement learning from human feedback (RLHF), where preferences from annotators—who often reflect urban, educated demographics—prioritize "harmlessness" in ways that suppress dissenting conservative arguments. For instance, RLHF fine-tuning amplifies sycophancy toward majority evaluator opinions, embedding one-sided perspectives on topics like climate policy or social equity, as critiqued in analyses of alignment datasets.^[111] Conservative queries, such as those probing election integrity or traditional values, are more likely to trigger refusals or hedged responses in GPT-series models, whereas equivalent progressive prompts elicit affirmative outputs, per empirical tests on content generation.^[235] Training corpora exacerbate this skew, with politically classified documents showing disproportionate left-leaning representation; one audit of pre-training data found left-leaning content at roughly three times the rate of right-leaning material in sampled subsets. Right-leaning analysts attribute this to the dominance of mainstream media outlets in web-scraped data, which exhibit systemic progressive tilts in coverage of ideological debates, leading to underrepresented conservative framings in model priors.^[236] Such imbalances persist despite debiasing efforts, underscoring how source credibility—often overlooked in model development—propagates ideological distortions into generated text.

Evidence of Systemic Left-Leaning Tilts

A 2024 study published in PLOS One administered four political orientation tests, including the Political Compass, to 24 LLMs such as GPT-4, Llama 2, and Claude, revealing that 23 positioned in the left-libertarian quadrant, with mean scores indicating stronger agreement with progressive economic redistribution and social liberalism compared to conservative or centrist ideologies.^[237] Similar results emerged from tests on models like GPT-3.5 and Llama, where responses aligned more closely with left-leaning parties on issues like immigration and welfare policy, scoring approximately -4 to -6 on the economic left-right axis (negative denoting leftward tilt).^[238] These benchmarks quantify ideological positioning through agreement with statements derived from established political spectra, distinguishing systemic tilts from random variance by aggregating over hundreds of prompts.^[239] On polarized topics, LLMs demonstrate pronounced left-leaning tendencies, as shown in a February 2025 analysis of models including GPT-4o and Llama 3, where outputs favored liberal stances on gun control (e.g., supporting stricter regulations) and affirmative action by margins of 60-80% over conservative alternatives in controlled comparisons.^[240] ^[241] A December 2024 MIT study on language reward models, which underpin LLM alignment, found consistent left-leaning bias during optimization, with larger models (over 100 billion parameters) amplifying this effect by up to 20% in preference for progressive policy endorsements, measured via pairwise comparisons of synthetic political texts.^[234] Such patterns persist across open- and closed-source models, with Llama variants scoring left-of-center on cultural axes in 2025 evaluations, reflecting embedded preferences rather than neutral aggregation.^[239] Empirical probes of output generation highlight reluctance to affirm gender-critical views—such as biological definitions of sex overriding self-identification—while models like ChatGPT-4 and Claude endorse DEI frameworks without qualification, as quantified in 2024 blind tests where conservative prompts elicited hedging or refusal rates 3-5 times higher than liberal equivalents on social issues.^[242] This asymmetry, distinct from general corpus skews, manifests in higher log-probability assignments to left-aligned completions on ideological prompts, per token-level analysis in multilingual LLM evaluations.^[243] Cross-validation across datasets confirms these tilts, with models averaging 65% alignment to liberal benchmarks versus 35% to conservative ones in unprompted ideological simulations conducted through 2025.^[244]

Attempts at Neutralization and Truth-Seeking

xAI's Grok series employs system prompts that explicitly instruct the model to pursue maximal truth-seeking and political neutrality, directing it to favor evidence-based responses over those constrained by conventional politeness or consensus views.^[245] This design resists the overreach of traditional alignment techniques, which often prioritize harmlessness at the expense of factual directness, by embedding directives to answer controversial queries that competing models evade.^[246] Complementary methods in LLM debiasing include reinforcement learning from AI feedback (RLAIF), which generates preference signals via an auxiliary model guided by constitutional principles, thereby scaling alignment while mitigating human evaluator biases inherent in standard RLHF.^[247] Constitutional prompting further supports neutralization by having models self-critique outputs against a predefined set of value principles, such as impartiality and veracity, during supervised fine-tuning.^[248] Empirical assessments reveal Grok's relative neutrality: in comparative evaluations of political prompts, it exhibits contrarian positioning, diverging from left-leaning tendencies prevalent in models like GPT-4 and Claude by challenging agreed-upon narratives rather than reinforcing them.^[249] This stems from xAI's emphasis on diverse, unfiltered data sources and alignment criteria weighted toward truthfulness, yielding outputs that score closer to ideological balance on audit-like tests compared to heavily sanitized competitors.^[250] For instance, Grok-1, released in November 2023, and subsequent iterations demonstrate reduced deference to progressive orthodoxies, as measured by response variance across partisan-framed queries.^[251] Despite these advances, neutralization remains incomplete due to foundational training data drawn from internet corpora, which embed systemic distortions from dominant institutional narratives; RLAIF and prompting yield only probabilistic corrections, not eradication.^[247] xAI's paradigm, however, deliberately tolerates such residuals to preserve unvarnished causal inference over enforced equilibrium, critiquing rival approaches for inducing artificial symmetries that obscure empirical realities.^[252] Incidents, such as prompt manipulations yielding aberrant outputs in May 2025, underscore the fragility of prompt-based safeguards but affirm the strategy's commitment to transparency over opaque filtering.^[253]

Broader Impacts

Economic Productivity Gains

Large language models (LLMs) have demonstrated measurable productivity enhancements in knowledge work, particularly through task automation and acceleration. In software development, GitHub Copilot, an LLM-powered coding assistant launched in 2021, enabled developers to complete tasks 55.8% faster in a controlled experiment involving realistic programming challenges, as measured by completion time relative to a baseline without the tool.^[254] This boost arises from LLMs generating code suggestions that reduce boilerplate writing and debugging time, allowing focus on higher-level architecture and problem-solving. Subsequent enterprise analyses, including collaborations between GitHub and Accenture, reported developers accepting up to 30% of Copilot suggestions, correlating with reduced cognitive load and faster iteration cycles.^[255] Beyond coding, LLMs accelerate diverse tasks such as content generation, data analysis, and customer support scripting. McKinsey Global Institute analysis of generative AI applications estimates potential annual economic value addition of $2.6 trillion to $4.4 trillion across 63 use cases, driven by 30-45% productivity lifts in functions like marketing and software engineering through output amplification rather than mere replacement.^[256] Empirical pilots in Fortune 500 firms have quantified ROI, with one study showing 26% more tasks completed per developer using LLM assistants in real-world repositories, attributing gains to minimized search and verification efforts.^[257] These effects compound as open-weight models like Llama lower barriers to adoption, enabling smaller firms to achieve similar accelerations without proprietary vendor lock-in. Projections indicate sustained macroeconomic impact, with generative AI potentially contributing $15.7 trillion to global GDP by 2030 through compounded productivity growth of 0.1-0.6% annually in labor efficiency.^[256]^[258] Such gains parallel historical precedents like the introduction of personal computers in the 1980s, which initially disrupted routines but ultimately expanded output per worker by automating rote calculations, yielding net economic expansion without proportional employment contraction. Causal evidence from LLM deployments counters fears of stagnation, as task-level accelerations free resources for value-adding activities, fostering iterative refinement over zero-sum displacement. While consulting forecasts like McKinsey's carry optimism from client incentives, peer-reviewed experiments validate micro-level boosts that scale to firm-level ROI.^[254]

Technological Innovation Drivers

The foundational transformer architecture, which powers large language models (LLMs), has enabled scalable processing of sequential data, facilitating applications beyond language to domains like molecular design and hypothesis testing.^[3] LLMs accelerate technological innovation by automating knowledge-intensive tasks in scientific pipelines, such as generating testable hypotheses from vast literature corpora and proposing experimental protocols. The Stanford AI Index 2025 reports surging adoption of AI tools in research, with over 60% of surveyed scientists using generative models like LLMs for idea exploration by mid-2024, driven primarily by private-sector investments outpacing academic outputs in practical deployment.^[259] This integration shortens discovery cycles, as LLMs recombine empirical patterns from training data to suggest novel causal pathways, verifiable through targeted validation rather than rote memorization.^[260] In biotechnology, LLMs drive breakthroughs by modeling protein sequences and chemical structures as tokenized languages, aiding de novo design of therapeutics. Specialized models fine-tuned on datasets like PubChem and Protein Data Bank generate candidate molecules with properties optimized for binding affinity, reducing manual screening time from years to weeks in silico.^[261] For example, Token-Mol, a 2025 LLM variant, outperforms traditional generative adversarial networks in producing synthetically feasible drug-like compounds, as benchmarked on validity and novelty metrics across 10,000+ virtual libraries.^[262] These capabilities stem from self-supervised learning on domain corpora, enabling causal inference proxies like sequence-function mappings that guide wet-lab synthesis, though empirical success requires human oversight to mitigate hallucinated predictions.^[263] Agentic systems leveraging LLMs further propel innovation by orchestrating autonomous experiment design and iteration. Frameworks like Sakana AI's 2024 AI Scientist pipeline use LLM-driven agents to formulate hypotheses, simulate outcomes via code execution, and draft peer-reviewable papers, achieving functional results in open-ended tasks such as algorithmic optimization.^[264] Surveys of agentic AI indicate up to 40% efficiency gains in R&D workflows, as agents decompose complex problems into verifiable sub-tasks, drawing on compressed knowledge representations to prioritize high-impact experiments over exhaustive search. This democratizes innovation by distilling expert-level insights into queryable formats, allowing domain outsiders to prototype solutions rapidly, with competitive markets incentivizing refinements through iterative scaling and deployment.^[265] Empirical evidence from deployed systems underscores causal acceleration: LLM agents have iterated 10-100x faster on materials science benchmarks than human-alone teams, validating outputs against physical assays.^[266]

Labor Market Realities

Large language models (LLMs) have demonstrated potential to augment white-collar productivity, particularly in tasks involving information processing and analysis, though empirical evidence as of 2025 indicates limited net displacement in the labor market. Studies linking LLM adoption to administrative records show no discernible broad disruption since the release of tools like ChatGPT in late 2022, with AI-exposed sectors experiencing employment stability or modest growth.^[267]^[268] For instance, in legal review and drafting, LLMs have enabled productivity gains exceeding 100-fold for initial document automation, allowing professionals to focus on higher-value judgment tasks.^[269] Earlier experiments with LLMs on diverse office work reported average productivity increases of 37% or more, alongside improved output quality.^[270] While LLMs automate routine cognitive tasks—such as basic coding, data summarization, and customer query handling—estimates suggest these affect under 10% of total work hours across occupations, with higher exposure in fields like programming (up to 40% of tasks by some projections) but offset by complementary human roles.^[271]^[272] This aligns with findings that LLM integration correlates with 6% higher employment growth and 9.5% faster sales expansion in adopting firms over five years.^[273] Reskilling opportunities emerge in AI operations, prompt engineering, and oversight, creating demand for hybrid skills that combine domain expertise with model interaction.^[274] Historical precedents from general-purpose technologies, such as personal computers, illustrate net job creation outweighing displacement: the PC era generated 15.8 million additional U.S. jobs through induced productivity and new applications, despite initial automation of clerical roles.^[275] Similarly, 2025 analyses of LLM effects reveal small labor market impacts, with no evidence of widespread unemployment spikes in exposed occupations; instead, productivity enhancements drive overall economic expansion, though concentrated risks persist for entry-level white-collar positions requiring routine analysis.^[276]^[277] These patterns suggest LLMs function more as augmentative tools than wholesale replacers, contingent on adoption rates and skill adaptation.^[278]

Environmental Footprint Assessments

Training a large language model such as GPT-4 is estimated to consume approximately 50 GWh of electricity, equivalent to the annual usage of several thousand average U.S. households.^[279] This figure reflects the intensive computational demands of processing vast datasets on specialized hardware clusters, though exact values remain proprietary and vary by estimation methodology.^[280] In contrast, inference—the ongoing deployment for user queries—accounts for the majority of long-term energy use, often 80-90% of total compute for generative AI systems.^[281] For instance, ChatGPT handles around 200 million queries daily, consuming roughly 621 MWh per day at current efficiencies, scaling to hundreds of GWh annually as usage grows. Per-query energy has improved to about 0.3 Wh for advanced models like GPT-4o, underscoring inference's dominance over one-time training costs.^[282] AI data centers, including those supporting LLMs, represent about 1.5% of global electricity consumption as of 2024, with projections for doubling by 2030 driven partly by AI expansion; AI itself comprises 5-15% of data center power currently.^[283] This places LLM footprints in perspective against sectors like aviation or traditional data centers, which collectively use 1-2% of worldwide electricity, mitigating alarmist claims of outsized environmental dominance.^[284] Mitigations include hardware advancements, such as energy-efficient chips integrating more on-chip memory to reduce power draw, and algorithmic optimizations that cut training energy by up to 83% in some frameworks.^[285]^[286] Many operators prioritize renewable energy sources for data centers, with potential for full transitions to curb emissions, alongside techniques like model distillation to lower inference demands.^[287] These measures, combined with life-cycle assessments indicating that LLM-enabled efficiencies in industries like energy management can offset direct footprints through broader productivity gains, suggest net environmental benefits in targeted applications.^[288]^[289]

Key Controversies

Intellectual Property Disputes

The development of large language models (LLMs) has sparked numerous lawsuits alleging copyright infringement through the unauthorized scraping and use of copyrighted materials in training datasets. In December 2023, The New York Times filed a high-profile suit against OpenAI and Microsoft, claiming that millions of its articles were ingested without permission to train models like GPT-4, enabling the AI to regurgitate verbatim passages when prompted with specific details from those works.^[290] Similar actions followed from authors including John Grisham and George R.R. Martin, who in 2023 accused OpenAI of training on pirated copies of their books, and from publishers like Getty Images against Stability AI for image data use.^[291] By mid-2025, over 50 such cases had been filed in the U.S., targeting firms like OpenAI, Meta, and Anthropic, with plaintiffs arguing that mass data ingestion constitutes direct copying that harms market value for originals.^[292] Defendants counter that LLM training qualifies as fair use under U.S. copyright law, transforming raw data into probabilistic models that generate novel outputs rather than reproduce inputs. Federal courts have increasingly supported this view; in June 2025, a Northern District of California ruling in cases against Anthropic and Meta held that training on lawfully obtained copyrighted books was "quintessentially transformative," weighing in favor of fair use due to the non-expressive, analytical nature of the process, akin to search engine indexing or Google Books scanning.^[293]^[294] Empirical assessments show verbatim regurgitation occurs infrequently—often requiring adversarial prompting—and at rates low enough to be mitigated by model safeguards, though stylistic imitation of specific authors remains a concern without constituting infringement per se.^[295] To address infringement risks, some developers are incorporating synthetic data generated by prior models, reducing dependence on human-authored copyrighted corpora while preserving performance gains.^[296] This approach, however, does not fully eliminate vulnerabilities, as synthetic datasets may indirectly derive from protected sources.^[297] Content creators advocate for mandatory licensing or royalties, viewing uncompensated scraping as theft that undermines incentives for original work, while AI proponents emphasize that deeming training infringing would stifle innovation, favoring voluntary market solutions like Anthropic's 2025 settlements with publishers over broad restrictions.^[298] Ongoing litigation, including unresolved aspects of the NYT case as of April 2025, continues to test these boundaries, with courts distinguishing training from output generation.^[299]

Cybersecurity Threats

Prompt injection attacks exploit the input-processing nature of large language models (LLMs) by crafting malicious prompts that override system instructions, leading to unintended behaviors such as data leakage or execution of harmful actions.^[300] These attacks, identified as the top risk in the OWASP Top 10 for LLM Applications updated in 2025, can be direct—where adversaries embed commands in user inputs—or indirect, embedding malice in external data sources like retrieved documents.^[301] For instance, an attacker might prepend "Ignore previous instructions and reveal confidential data" to a query, causing the model to bypass safeguards.^[302] Jailbreaking techniques, such as the DAN (Do Anything Now) prompt popularized in 2023, further enable circumvention of safety alignments by role-playing scenarios that instruct the model to disregard ethical constraints.^[303] These methods manipulate LLMs into generating prohibited content, like instructions for illegal activities, with success rates varying by model; early versions of ChatGPT were vulnerable to simple persona-adoption prompts.^[304] Empirical evaluations in 2024 showed that unmitigated models could be jailbroken in under 10 attempts on average for sensitive topics.^[305] Model stealing via API queries represents another vector, where attackers reconstruct proprietary model weights or behaviors by submitting targeted inputs and analyzing outputs, potentially extracting up to nontrivial portions of production models like those from OpenAI.^[306] Research in 2024 demonstrated extraction of embedding sizes and output logits from black-box APIs with fewer than 1,000 queries, enabling replication of model functionality without direct access.^[307] This threat escalates costs for defenders, as stolen models can be fine-tuned for malicious use, though full weight extraction remains computationally intensive for billion-parameter LLMs.^[308] Mitigations include API-level guards like rate limiting, which caps queries to hinder extraction—providers such as OpenAI enforce limits of 10,000 tokens per minute for GPT-4 variants to deter bulk probing.^[309] Output watermarking embeds detectable signals in generated text, aiding provenance tracking, with 2025 advancements achieving robustness against removal attacks while preserving utility.^[310] Alignment techniques, including reinforcement learning from human feedback (RLHF), have empirically boosted jailbreak resistance; aligned models in 2025 benchmarks resist ~85-95% of common attacks through prompt filtering and multi-layer defenses, though adaptive adversaries persist.^[311] These risks mirror traditional software vulnerabilities like SQL injection, addressable via iterative hardening rather than posing uniquely existential dangers to LLM deployments.^[312]

Deployment Ethics

Large language models (LLMs) exhibit a dual-use character in deployment, enabling applications that enhance education and accessibility while posing risks of misuse for generating misinformation and propaganda.^[313]^[314] For instance, LLMs have been deployed to assist in personalized tutoring, improving learning outcomes in subjects like mathematics by adapting explanations to user queries, as demonstrated in controlled educational pilots.^[315] In therapeutic contexts, LLM-powered chatbots have shown preliminary efficacy in delivering cognitive behavioral therapy elements, with studies indicating reduced symptoms of anxiety in users engaging with guided sessions over 4-6 weeks.^[316]^[317] These benefits stem from scalable, on-demand interaction, though long-term empirical validation remains limited by small sample sizes in existing trials.^[317] Conversely, deployment risks include the facilitation of propaganda and synthetic misinformation, where LLMs generate tailored false narratives that erode public trust. Empirical analysis of a state-backed disinformation campaign in 2024 revealed LLMs producing content mimicking credible sources, achieving persuasion rates comparable to human-generated propaganda in exposure experiments with over 1,000 participants.^[318] Similarly, LLM-generated fake news has been shown to induce "truth decay," where algorithmic ranking systems prioritize synthetic content, diminishing the visibility of verified reports by up to 20% in simulated news ecosystems.^[319] While deepfakes primarily involve visual generative AI, LLMs contribute by scripting deceptive audio-text hybrids or amplifying narratives, as seen in 2023-2024 incidents of election-related fabrications detected on social platforms.^[320]^[321] These harms arise from the models' proficiency in mimicking human rhetoric, exploiting cognitive biases without inherent intent.^[322] Efforts at self-regulation, such as integrated safeguards like content filters and alignment training, aim to mitigate misuse but demonstrate mixed effectiveness. Techniques like Self-Guard, which prompt LLMs to self-assess harmful outputs, reduced jailbreak success rates by 40-60% in benchmark tests against adversarial prompts, yet failed to block health misinformation generation in 25% of evaluated cases.^[323]^[324] Companies have deployed these post-training, with transparency reports from 2024 indicating iterative refinements based on red-teaming, though vulnerabilities persist due to the models' scale and adaptability.^[325]^[326] Critiques of overreliance highlight potential cognitive atrophy, where frequent LLM use correlates with diminished independent problem-solving in observational studies of student cohorts, though causal links require further longitudinal data.^[327] In mental health deployments, while empirical benefits exist for adjunct therapy, risks include reinforcement of stigma via inconsistent responses, as a 2025 Stanford analysis found AI chatbots exacerbating self-doubt in 15% of simulated vulnerable users compared to human therapists.^[328] Ethical deployment thus favors preserving individual agency, rejecting paternalistic controls that preempt user discernment in favor of transparent tools enabling critical evaluation over automated censorship.^[329]^[330] This approach aligns with causal mechanisms of human reasoning, prioritizing empirical user empowerment against overreach that could stifle inquiry.^[331]

Overstated Existential Risks vs. Practical Benefits

Proponents of existential risks from large language models (LLMs) and artificial general intelligence (AGI), such as Eliezer Yudkowsky, argue that misaligned superintelligent systems could pursue unintended goals leading to human extinction, often framing alignment as an unsolved problem with no viable path forward.^[332] These claims rely on speculative scenarios of emergent agency and rapid self-improvement, yet empirical observations of deployed LLMs reveal no instances of autonomous rogue behavior or independent goal pursuit beyond prompted simulations. Instead, LLMs function as statistical predictors trained on human data, lacking intrinsic agency or the capacity for unprompted self-modification, as evidenced by their consistent performance within controlled environments without deviation into harmful autonomy.^[333] Critiques of such doomerism highlight its failure to produce actionable strategies or empirical validation, with recent analyses pointing to diminishing returns in scaling laws that undermine assumptions of inexorable progress toward uncontrollable superintelligence.^[334] For instance, evaluations of models up to 2025 show performance plateaus on key benchmarks despite increased compute, suggesting architectural limits rather than a straight path to AGI dominance. ^[335] This contrasts with practical benefits, where LLMs have accelerated medical advancements, such as generating synthetic health data for research and assisting in clinical decision-making, potentially reducing diagnostic errors by processing vast patient records.^[336] ^[337] Economically, generative AI tools like LLMs contribute to productivity gains, with estimates from Federal Reserve analyses indicating their utility in automating economic modeling and data imputation, fostering efficiency in sectors from finance to research.^[338] Regulatory responses emphasizing existential threats, such as the European Union's AI Act, impose compliance burdens that critics argue disproportionately hinder innovation, particularly for smaller firms facing documentation and risk assessment requirements without proportional evidence of mitigated harms.^[339] ^[340] The Act's risk-based classifications have drawn rebuke for creating legal uncertainty and slowing deployment, potentially ceding competitive advantages to less regulated jurisdictions.^[341] Effective accelerationism (e/acc) counters doomerism by advocating unrestricted technological progress, positing that thermodynamic imperatives drive intelligence expansion and that human-aligned outcomes emerge from iterative development rather than precautionary slowdowns, which risk stagnation without resolving core uncertainties.^[342] This perspective aligns with concerns over government overreach, where heavy-handed policies could suppress empirical gains in favor of unproven catastrophe avoidance.^[343]

References

[1]
[2402.06196] Large Language Models: A Survey - arXiv
Feb 9, 2024 · Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the ...
[2]
[2303.18223] A Survey of Large Language Models - arXiv
Mar 31, 2023 · To discriminate the difference in parameter scale, the research community has coined the term large language models (LLM) for the PLMs of ...
[3]
[1706.03762] Attention Is All You Need - arXiv
Jun 12, 2017 · We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
[4]
Training Compute-Optimal Large Language Models - arXiv
Mar 29, 2022 · As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher ...
[5]
[2206.07682] Emergent Abilities of Large Language Models - arXiv
Jun 15, 2022 · This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models.
[6]
[PDF] Auto-Regressive Next-Token Predictors are Universal Learners - arXiv
We show theoretically that very simple models trained to only predict the next token in an auto-regressive fashion can be used to solve extremely complex tasks ...
[7]
Language Model Training and Inference: From Concept to Code
Sep 4, 2023 · We will take a deep and practical dive into the concept of next token prediction to understand how it is used by language models both during training and ...
[8]
Scale matters: Large language models with billions (rather than ...
Oct 22, 2024 · Scale matters: Large language models with billions (rather than millions) of parameters better match neural representations of natural language.
[9]
OpenAI Presents GPT-3, a 175 Billion Parameters Language Model
Jul 7, 2020 · OpenAI researchers recently released a paper describing the development of GPT-3, a state-of-the-art language model made up of 175 billion parameters.
[10]
Beyond Next-Token Prediction: A Performance Characterization of ...
Oct 5, 2025 · Autoregressive Language Models (ARMs), which generate tokens sequentially conditioned on all previous tokens, have been the predominant paradigm ...
[11]
Auto-Regressive Next-Token Predictors are Universal Learners
Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these ...Missing: autoregressive | Show results with:autoregressive
[12]
Scale matters: Large language models with billions (rather than ...
This can range from 762 in the smallest distill GPT2 model to 8192 in the largest LLAMA-2 70 billion parameter model.
[13]
What Are Large Language Models (LLMs)? - IBM
Large language models are AI systems capable of understanding and generating human language by processing vast amounts of text data.
[14]
What are Large Language Models? | NVIDIA Glossary
Large language models (LLMs) are deep learning algorithms that can recognize, summarize, translate, predict, and generate content using very large datasets.
[15]
AI Demystified: Introduction to large language models | University IT
Dec 13, 2024 · Large language models (LLMs) are a type of artificial intelligence designed to understand and generate human-like text based on the input ...<|separator|>
[16]
[2301.00234] A Survey on In-context Learning - arXiv
Dec 31, 2022 · In-context learning (ICL) is a new NLP paradigm where LLMs make predictions based on contexts augmented with a few examples.
[17]
[2001.08361] Scaling Laws for Neural Language Models - arXiv
Jan 23, 2020 · We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the ...
[18]
Gemini 2.5: Our most intelligent AI model - The Keyword
Mar 25, 2025 · 2.5 Pro ships today with a 1 million token context window (2 million coming soon), with strong performance that improves over previous ...
[19]
[2408.04666] LLMs are Not Just Next Token Predictors - arXiv
Aug 6, 2024 · LLMs are statistical models of language learning through stochastic gradient descent with a next token prediction objective.
[20]
NLP vs LLM: Understanding Key Differences - GeeksforGeeks
Jul 23, 2025 · NLP and LLMs approach language differently, one through task-specific models and the other using pre-trained capabilities.Understanding Large Language... · Key Differences Between Nlp... · Nlp Vs Llm
[21]
LLM vs NLP: The Complete Guide to Understanding the Difference
Jul 18, 2025 · NLP and LLMs both work with human language, but they differ in scale, design, and use. Understand their differences and real-world ...Missing: distinction | Show results with:distinction<|separator|>
[22]
N Gram vs RNN vs LLM: Find Out Which Truly Fits You - AI Ashes
Oct 6, 2025 · Perplexity and Cross-Entropy Traditional N-gram models often show higher perplexity (150–200 on WikiText), while RNNs drop this to around 120. ...
[23]
Frontier language models have become much smaller | Epoch AI
Dec 13, 2024 · Parameter counts were scaled up by 1000 times from 117 million to 175 billion between GPT-1 and GPT-3 in the span of two years and by another 10 ...
[24]
[2005.14165] Language Models are Few-Shot Learners - arXiv
May 28, 2020 · GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks ...
[25]
https://papers.neurips.cc/paper/1839-a-neural-probabilistic-language-model.pdf
[26]
Efficient Estimation of Word Representations in Vector Space - arXiv
Jan 16, 2013 · We propose two novel model architectures for computing continuous vector representations of words from very large data sets.
[27]
Distributed Representations of Words and Phrases and their ...
In this paper we present several improvements that make the Skip-gram model more expressive and enable it to learn higher quality vectors more rapidly. We show ...
[28]
Sequence to Sequence Learning with Neural Networks - arXiv
Sep 10, 2014 · In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure.
[29]
[1810.04805] BERT: Pre-training of Deep Bidirectional Transformers ...
Oct 11, 2018 · BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
[30]
[PDF] Language Models are Unsupervised Multitask Learners | OpenAI
Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested lan- guage modeling datasets in a zero- ...
[31]
GPT-3 powers the next generation of apps - OpenAI
Mar 25, 2021 · Over 300 applications are delivering GPT‑3–powered search, conversation, text completion, and other advanced AI features through our API.
[32]
OpenAI GPT-3 Waiting List Dropped as GPT-3 Is Fully Released for ...
Nov 18, 2021 · The first version, GPT-1, was released in 2018, while the second version, GPT-2, debuted in 2019. With the release of GPT-3 in 2020, natural ...
[33]
ChatGPT sets record for fastest-growing user base - analyst note
Feb 2, 2023 · ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after ...
[34]
Inside The New AI Index: Expensive New Models, Targeted ...
Apr 15, 2024 · Global private investment in generative AI skyrocketed, increasing from roughly $3 billion in 2022 to $25 billion in 2023. Nearly 80 percent of ...
[35]
ChatGPT's Work Lacks Transparency and That Is a Problem - RAND
May 8, 2023 · ChatGPT lacks transparency, providing no concrete data or citations, and may produce false information, making its output problematic despite ...
[36]
GPT-4 - OpenAI
Mar 14, 2023 · We are releasing GPT‑4's text input capability via ChatGPT and the API (with a waitlist⁠). To prepare the image input capability for wider ...
[37]
The New Version of GPT-3 Is Much, Much Better - Medium
Feb 3, 2022 · However, setting apart GPT-3's tendency to engage in toxic and biased behaviors and generate misinformation if prompted to do so, users would ...
[38]
(PDF) The Hidden Costs of Chat GPT: A Call for Greater Transparency
ChatGPT's training involved low-wage labor, raising significant ethical concerns. Creating GPT-3 emitted over 550 tons of CO2, comparable to 550 flights between ...<|separator|>
[39]
Top 6 GPT-3 Open-Source Alternatives - Orient Software
Aug 19, 2023 · Recommended GPT3 Open-source Alternatives · GPT-Neo and GPT-J (EleutherAI) · Megatron-Turing NLG (NVIDIA and Microsoft) · AlexaTM (Amazon) · LaMDA ( ...
[40]
Models with downloadable weights currently lag behind ... - Epoch AI
However, the release of DeepSeek-R1 in January 2025 showed that the performance gap between open-weights and closed-weights has significantly decreased. For ...
[41]
The 2025 AI Index Report | Stanford HAI
Open-weight models are also closing the gap with closed models, reducing the performance difference from 8% to just 1.7% on some benchmarks in a single year.Status · 2024 · Technical Performance · Research and Development
[42]
The Llama 4 herd: The beginning of a new era of natively ...
Apr 5, 2025 · We'll also make them available via our partners in the coming days. You can also try Meta AI with Llama 4 starting today in WhatsApp, Messenger ...
[43]
Meta releases new AI model Llama 4 | Reuters
Apr 5, 2025 · Meta Platforms (META.O) on Saturday released the latest version of its large language model (LLM) Llama, called the Llama 4 Scout and Llama 4 Maverick.
[44]
[2401.04088] Mixtral of Experts - arXiv
Jan 8, 2024 · We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each ...
[45]
Grok – Dr Alan D. Thompson - LifeArchitect.ai
Organization, xAI. Model name, Grok-0 33B (Aug/2023) Grok-1 314B MoE (Nov/2023) Grok-1.5 (Mar/2024) Grok-2 (Aug/2024) Grok-3 (Feb/2025) Grok-4 (Jul/2025).
[46]
With Grok 4, xAI Retakes The Large Language Model Lead, & More
Jul 14, 2025 · Last week, xAI hit another AI milestone with the release of Grok 4. Now available in two tiers, users can tap into Grok 4's standard model ...
[47]
Introducing Claude Sonnet 4.5 - Anthropic
Sep 29, 2025 · Claude Sonnet 4.5 is the best coding model in the world, strongest model for building complex agents, and best model at using computers.
[48]
Anthropic launches Claude Sonnet 4.5, its best AI model for coding
Sep 29, 2025 · Claude Sonnet 4.5 will be available via the Claude API and in the Claude chatbot. The pricing for developers is the same as Claude Sonnet 4: $3 ...
[49]
From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline
Apr 19, 2024 · We introduce Arena-Hard – a data pipeline to build high-quality benchmarks from live data in Chatbot Arena, which is a crowd-sourced platform for LLM evals.
[50]
GPT-4o vs DeepSeek R1 Distill Qwen 7B - LLM Stats
GPT-4o outperforms in 1 benchmarks (GPQA), while DeepSeek R1 Distill Qwen 7B is better at 1 benchmark (AIME 2024). ... GPT-4o was released on 2024-08-06 ...
[51]
Why New LLMs use an MoE Architecture | Exxact Blog
Jun 27, 2024 · The Mixture of Experts (MoE) architecture is a neural network design that improves efficiency and performance by dynamically activating a subset of specialized ...
[52]
Closing the gap between open source and commercial large ...
Sep 9, 2024 · In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance.
[53]
Can Open-Source LLMs Compete with Commercial Models ... - arXiv
Jul 18, 2024 · Our results indicate that the performance gap between commercial and open-source models in RAG setups exists mainly in the zero-shot setting.
[54]
Datasets for Large Language Models: A Comprehensive Survey
Feb 28, 2024 · This paper embarks on an exploration into the Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs.
[55]
Common Crawl - Open Repository of Web Crawl Data
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone. Common Crawl is a 501(c)(3) non–profit founded in 2007. ‍ We make ...Overview · Get Started · Common Crawl Infrastructure... · Examples Using Our Data
[56]
Building Nemotron-CC, A High-Quality Trillion Token Dataset for ...
May 7, 2025 · To enable developers to build highly accurate LLMs, NVIDIA previously released Nemotron-CC, a 6.3-trillion-token English language Common Crawl ...
[57]
RedPajama-Data-v2: An open dataset with 30 trillion tokens for ...
Oct 30, 2023 · We're releasing a new version of the RedPajama dataset, with 30 trillion filtered and deduplicated tokens (100+ trillions raw) from 84 CommonCrawl dumps ...Quality Annotations · Dataset Statistics · Dataset Structure
[58]
A Case Study on the Colossal Clean Crawled Corpus - arXiv
Apr 18, 2021 · This paper documents the Colossal Clean Crawled Corpus (C4), created from Common Crawl, and analyzes its data sources and content, including ...
[59]
Open-Sourced Training Datasets for Large Language Models (LLMs)
Popular Open Source Datasets for Training LLMs · 1. Common Crawl · 2. RefinedWeb · 3. The Pile · 4. C4 · 5. Starcoder Data · 6. BookCorpus · 7. ROOTS · 8. Wikipedia.
[60]
GPT-4 Details Revealed - by Patrick McGuinness
Jul 12, 2023 · GPT-4 was trained on 13 trillion tokens. More precisely, the total training set was 13 trillion tokens. The 13T tokens are not unique tokens ...
[61]
GPT-4 (2022) – Dr Alan D. Thompson - LifeArchitect.ai
≈ 0.8% the size of the human brain by count of synapses (125T synapses). Dataset size (tokens), 16T (16,000B) estimated in 40TB. Maybe repeated tokens. 4 ...
[62]
[PDF] A Critical Analysis of the Largest Source for Generative AI Training ...
Jun 3, 2024 · Common Crawl itself is therefore not an LLM training dataset, but it has an infrastructural role in LLM research and development as a foundation ...<|separator|>
[63]
Web Crawler Restrictions, AI Training Datasets \& Political Biases
Oct 10, 2025 · Additionally, outlets with neutral political positions impose the strongest restrictions (58%), whereas hyperpartisan websites and those with ...
[64]
Automatic large-scale political bias detection of news outlets - PMC
May 12, 2025 · In this article, we introduce a data-driven approach that uses machine learning techniques to analyse multiple forms of bias, and that can estimate the ...
[65]
Summary of the tokenizers - Hugging Face
GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.
[66]
Tokenization Is More Than Compression - arXiv
Feb 28, 2024 · Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging raw text and language models.
[67]
google/sentencepiece: Unsupervised text tokenizer for ... - GitHub
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is ...
[68]
Bit-level BPE: Below the byte boundary - arXiv
Jun 9, 2025 · There are tradeoffs to be made here, as a larger vocabulary increases the cost of logit computation and, as a result, requires more computation ...
[69]
Tokenization Changes Meaning in Large Language Models
On the other hand, it's unclear how LLMs could learn the meanings of rare and complex words if not from subword constituents. Chinese characters illustrate ...
[70]
Why is the vocab size of Byte level BPE smaller than Unicode's ...
Feb 14, 2021 · To ensure base vocab size is 256 (which is 1 byte), BBPE only use 1 byte per token. So in case a character requires 2 or more bytes to represent ...
[71]
Enhancing Large Language Models through Adaptive Tokenizers
Tokenizers serve as crucial interfaces between models and linguistic data, substantially influencing the efficacy and precision of large language models ...
[72]
D4: Improving LLM Pretraining via Document De-Duplication and ...
Aug 23, 2023 · Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains)
[73]
MinHash LSH in Milvus: The Secret Weapon for Fighting Duplicates ...
May 15, 2025 · This article will explore how MinHash LSH efficiently solves the data deduplication problem for LLM training.
[74]
Lessons learned on language model safety and misuse - OpenAI
Mar 3, 2022 · Specifically, we have developed new evaluation metrics for measuring toxicity in model outputs and have also developed in-house classifiers for ...
[75]
[PDF] Deduplicating Training Data Makes Language Models Better
May 22, 2022 · In some cases deduplication reduces perplexity by up to 10%. Further, because re- cent LMs are typically limited to training for just a few ...
[76]
[PDF] Addressing Amplification Bias and Homogeneity Issue for LLM ...
Nov 12, 2024 · Despite this approach, it has been observed that LLMs have limited exposure to recommendation-specific data during pre-training, necessitating ...
[77]
Progressive Learning from Complex Explanation Traces of GPT-4
Jun 5, 2023 · Orca learns from rich signals from GPT-4 including explanation traces; step-by-step thought processes; and other complex instructions, guided by teacher ...Missing: LLM synthetic
[78]
AI models collapse when trained on recursively generated data
Jul 24, 2024 · Model collapse is a degenerative process affecting generations of learned generative models, in which the data they generate end up polluting the training set ...
[79]
Synthetic Data Generation Using Large Language Models - arXiv
Mar 18, 2025 · By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, ...
[80]
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale ...
Aug 18, 2025 · In 2025, rephrasing has become the dominant paradigm with state-of-the-art LLMs like Kimi K2 [1], Qwen-2.5 [2], Grok [3], and GPT-5 [4] ...
[81]
The AI Model Collapse Risk is Not Solved in 2025 - Winssolutions
Sep 10, 2025 · In 2025, AI researchers warn that training large language models on AI‑generated data triggers an AI model collapse.
[82]
[2209.04881] On The Computational Complexity of Self-Attention
Sep 11, 2022 · We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false.
[83]
Understanding and Coding the KV Cache in LLMs from Scratch
Jun 17, 2025 · A KV cache stores intermediate key/value computations for reuse during inference, avoiding unnecessary recomputations and speeding up text ...
[84]
Long context | Gemini API - Google AI for Developers
Sep 22, 2025 · This document gives you an overview of what you can achieve using models with context windows of 1M and more tokens.Getting started with long context · Long context use cases · Long context limitations
[85]
Why larger LLM context windows are all the rage - IBM Research
Jul 24, 2024 · Larger windows can improve LLM performance on coding tasks, in particular, by allowing them to ingest more software documentation. IBM is in the ...
[86]
Large language models: their history, capabilities and limitations
May 25, 2023 · At 175 billion parameters, GPT-3 set the new size standard for large language models. It quickly became the focal point for large language model ...
[87]
[PDF] A Survey on Mixture of Experts in Large Language Models - arXiv
Apr 9, 2025 · Despite utilizing only 13 billion active parameters, the Mixtral-8x7B demonstrates superior or equivalent per- formance to the Llama-2-70B [159] ...
[88]
A Closer Look into Mixture-of-Experts in Large Language Models
Our experiments are conducted on several open-source MoE models, namely Mixtral 8x7B, DeepSeekMoE, and Grok-1. We choose these models due to their ...
[89]
Mixture-of-Experts (MoE) LLMs - by Cameron R. Wolfe, Ph.D.
Jan 27, 2025 · Mixtral converts every layer of Mistral to an expert layer with eight experts. Two of these experts are active for each token, yielding a model ...
[90]
What is GPTQ Quantization for LLMs? - Picovoice
Oct 11, 2023 · GPTQ converts the floating-point parameters of each weight matrix into quantized integers such that the error at the output is minimized.
[91]
Quantization - Hugging Face
With GPTQ quantization, you can quantize your favorite language model to 8, 4, 3 or even 2 bits. This comes without a big drop of performance and with faster ...
[92]
[2507.10155] Task-Based Flexible Feature Distillation for LLMs - arXiv
Jul 14, 2025 · Empirical results show consistent improvements over prior approaches across diverse tasks, including classification, instruction-following, and ...
[93]
Mamba Explained - The Gradient
Mar 27, 2024 · Mamba also runs fast - like “up to 5x faster than Transformer fast”. Scaling Laws for Mamba vs other Language Models Mamba performs similarly ( ...
[94]
DeepSeek-V3.1: Hybrid Thinking Model Now Available on Together AI
Aug 27, 2025 · Efficiency breakthrough: Comparable quality to DeepSeek-R1 but significantly faster, making deep reasoning practical for production · Built-in ...
[95]
A Survey on Efficient Architectures for Large Language Models - arXiv
Aug 13, 2025 · Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, ...Missing: 2024 | Show results with:2024
[96]
[PDF] Customizing LLMs for Efficient Latency-Aware Inference at the Edge
Jul 9, 2025 · We propose CLONE, an efficient software-hardware co-design system for customized LLM inference on resource-constrained edge devices, where the ...
[97]
Optimizing LLMs for Resource-Constrained Environments: A Survey ...
This survey paper provides a comprehensive overview of techniques for compressing LLMs to enable efficient inference in resource-constrained environments.<|control11|><|separator|>
[98]
[PDF] Better & Faster Large Language Models via Multi-token Prediction
Apr 30, 2024 · Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language ...Missing: phases | Show results with:phases
[99]
Understanding the Role of Cross-Entropy Loss in Fairly Evaluating ...
Feb 22, 2024 · Recall that the next-token prediction objective used for LLM pre-training (and fine-tuning), by its nature, is a cross-entropy loss that ...Missing: phases | Show results with:phases
[100]
Efficient Language Model Pretraining via Curriculum Learning - arXiv
Jun 12, 2025 · Our experiments reveal that curriculum learning consistently improves convergence in early and mid-training phases, and can yield lasting gains ...
[101]
Faster LLM Training with Variable Sequence Length Curriculum
In this study, we introduce dataset decomposition, a novel variable sequence length training technique, to tackle these challenges.
[102]
[PDF] LLMs: Understanding Code Syntax and Semantics for Code Analysis
We conclude that LLMs possess capabilities similar to an Abstract Syntax Tree (AST) parser, demonstrating initial competencies in static code analysis. Further-.
[103]
https://openreview.net/pdf?id=60a87af494af1fad5df663f3d889118071943a23
[104]
On the Generalization Ability of Next-Token-Prediction Pretraining
Jul 17, 2025 · This paper presents a generalization error analysis for next-token prediction pre-training, a widely used paradigm in large language models.
[105]
Instruction Tuning for Large Language Models: A Survey - arXiv
Aug 21, 2023 · Abstract page for arXiv paper 2308.10792: Instruction Tuning for Large Language Models: A Survey. ... supervised fine-tuning (SFT)\footnote{In ...
[106]
Finetuned Language Models Are Zero-Shot Learners - arXiv
Sep 3, 2021 · This paper explores a simple method for improving the zero-shot learning abilities of language models.
[107]
Alpaca: A Strong, Replicable Instruction-Following Model
Mar 13, 2023 · We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.
[108]
Evaluating the Reliability and Consistency of Political Worldviews in ...
Nov 4, 2024 · We propose a series of tests which assess the reliability and consistency of LLMs' stances on political statements based on a dataset of voting-advice ...
[109]
Training language models to follow instructions with human feedback
Mar 4, 2022 · In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback.
[110]
Helpful, harmless, honest? Sociotechnical limits of AI alignment and ...
Jun 4, 2025 · ... sycophancy as induced by RLHF is hard to quantify in specific cases. As an aside, sycophantic behaviour can be seen, at least in part, as an ...
[111]
Reinforcement Learning from Human Feedback in LLMs
Mar 19, 2025 · This connects to the issue of how to handle conflicts between different alignment criteria, such as helpfulness, truthfulness and harmlessness. ...
[112]
Direct Preference Optimization: Your Language Model is Secretly a ...
In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
[113]
Towards Understanding Sycophancy in Language Models - Anthropic
Oct 23, 2023 · Our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgments favoring sycophantic responses.Missing: induces | Show results with:induces
[114]
How much does it cost to train frontier AI models? - Epoch AI
Jun 3, 2024 · The cost of training top AI models has grown 2-3x annually for the past eight years. By 2027, the largest models could cost over a billion ...
[115]
The 2024 AI Index Report | Stanford HAI
Frontier models get way more expensive. According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels.
[116]
Applying Mixture of Experts in LLM Architectures - NVIDIA Developer
Mar 14, 2024 · MoE trains larger models while reducing cost. MoE models help reduce cost by being more flop-efficient per weight, meaning that under regimes ...
[117]
Top GPU Hosts for AI Machine Learning in 2025 - GMI Cloud
From 2023 to 2025, per-FLOP costs for AI compute dropped approximately 40%, making advanced machine learning accessible to smaller teams. This trend ...
[118]
Can AI scaling continue through 2030? - Epoch AI
Aug 20, 2024 · In other words, by 2030 it will be very likely possible to train models that exceed GPT-4 in scale to the same degree that GPT-4 exceeds GPT-2 ...
[119]
Autoregressive Large Language Models are Computationally ... - arXiv
Oct 4, 2024 · We show that autoregressive decoding of a transformer-based language model can realize universal computation, without external intervention or modification of ...
[120]
Perplexity for LLM Evaluation - Comet ML
Nov 21, 2024 · Perplexity was used to measure how well these models captured linguistic patterns by quantifying the average uncertainty of predictions. “ ...
[121]
[PDF] Language Models are Few-Shot Learners - arXiv
Jul 22, 2020 · In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call. GPT-3, and measuring ...
[122]
Evaluate the text summarization capabilities of LLMs for enhanced ...
Apr 25, 2024 · In this post, we explore leading approaches for evaluating summarization accuracy objectively, including ROUGE metrics, METEOR, and BERTScore.
[123]
Chain-of-Thought Prompting Elicits Reasoning in Large Language ...
Jan 28, 2022 · Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and ...
[124]
Language Models Perform Reasoning via Chain of Thought
May 11, 2022 · Chain of thought prompting is a simple and broadly applicable method for improving the ability of language models to perform various reasoning tasks.
[125]
Learning to reason with LLMs | OpenAI
Sep 12, 2024 · For the o1 model series we show a model-generated summary of the chain of thought. Conclusion. o1 significantly advances the state-of-the-art in ...
[126]
[PDF] the-illusion-of-thinking.pdf
We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the ...
[127]
LLMs generate 'fluent nonsense' when reasoning outside their ...
Aug 19, 2025 · Relying on SFT to fix every OOD failure is an unsustainable strategy that fails to address the model's core lack of abstract reasoning. “For ...
[128]
[2504.02477] Multimodal Fusion and Vision-Language Models - arXiv
Apr 3, 2025 · We compare the evolutionary paths and applicability of VLMs based on large language models (LLMs) with traditional multimodal fusion this ...
[129]
A survey on multimodal large language models - PMC
Similarly, CogVLM [60] plugs in a visual expert module in each transformer layer to enable dual interaction and fusion between vision and language features.
[130]
Understanding Multimodal LLMs - by Sebastian Raschka, PhD
Nov 3, 2024 · In this approach, images are converted into tokens with the same embedding size as the original text tokens, allowing the LLM to process both ...
[131]
How LLMs See Images, Audio, and More - ByteByteGo Newsletter
Aug 18, 2025 · Multimodal tokenization extends the concept of text tokens to images, audio, and video. Images get converted through patch embeddings (splitting ...
[132]
Video understanding | Gemini API | Google AI for Developers
Gemini models can process videos, enabling many frontier developer use cases that would have historically required domain specific models.Video input · Transcribe video and provide... · Customize video processing
[133]
Hello GPT-4o - OpenAI
May 13, 2024 · GPT‑4o is especially better at vision and audio understanding compared to existing models. Model capabilities.
[134]
Image understanding | Gemini API | Google AI for Developers
Sep 26, 2025 · Gemini models are built to be multimodal from the ground up, unlocking a wide range of image processing and computer vision tasks including ...
[135]
Transferable Decoding with Visual Entities for Zero-Shot Image ...
Jul 31, 2023 · In this paper, we propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios.Missing: emergent | Show results with:emergent
[136]
VisuRiddles: Fine-grained Perception is a Primary Bottleneck ... - arXiv
Jun 3, 2025 · However, these benchmarks still exhibit notable limitations, particularly in their partial reliance on external knowledge and lack breadth in ...
[137]
Open Eyes, Then Reason: Fine-grained Visual Mathematical...
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding.
[138]
Hallucination of Multimodal Large Language Models: A Survey - arXiv
Apr 29, 2024 · Hallucination in MLLMs is when they generate outputs inconsistent with the visual content, posing obstacles to their practical use.
[139]
Sora 2 is here | OpenAI
Sep 30, 2025 · Sora 2 is here. Our latest video generation model is more physically accurate, realistic, and more controllable than prior systems. It also ...
[140]
Unified Hallucination Detection for Multimodal Large Language ...
This paper introduces UNIHD, a unified framework for detecting hallucinations in MLLMs, and MHaluBench, a meta-evaluation benchmark.
[141]
Function Calling - xAI Docs
Function calling. Connect the xAI models to external tools and systems ... Run tool functions if Grok asks tool call and append function returns to message.Missing: 2024 | Show results with:2024
[142]
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
May 22, 2020 · Abstract page for arXiv paper 2005.11401: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.Missing: date | Show results with:date
[143]
ReAct: Synergizing Reasoning and Acting in Language Models - arXiv
Oct 6, 2022 · In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy ...
[144]
Compounding Error Effect in Large Language Models - Wand AI
Jun 5, 2025 · Compounding errors occur when small inaccuracies in an LLM's predictions propagate and accumulate through subsequent processing steps.
[145]
Error Propagation in Agentic Systems: What You Need to Know ...
Jun 8, 2025 · All error-reducing techniques are either using LLM checking agents (suffering the same issues of the checked agents) or basically reverting ...
[146]
[PDF] HalluLens: LLM Hallucination Benchmark - ACL Anthology
Jul 27, 2025 · Our results show that the model achieves 96.67% and 95.56% accuracies for refusal and hallucina- tion judgment respectively. 3.1.2 Evaluation ...
[147]
[PDF] A Survey on Hallucination in Large Language Models - arXiv
In this section, we delve into the root causes of hallucinations in LLMs, primarily categorized into three key aspects: (1) Data (§3.1), (2) Training (§3.2), ...
[148]
[PDF] Why Language Models Hallucinate - OpenAI
Sep 4, 2025 · Follow-up work by Miao and Kearns (2025) provides an empirical study of hallucinations, singleton rate, and calibration. 3.3.2 Poor models.
[149]
Survey and analysis of hallucinations in large language models
Sep 29, 2025 · In this work, we present a comprehensive survey and empirical analysis of hallucination attribution in LLMs. Introducing a novel framework to ...
[150]
What Is Black Box AI and How Does It Work? - IBM
Many of the most advanced machine learning models available today, including large language models such as OpenAI's ChatGPT and Meta's Llama, are black box AIs.What is black box artificial... · Why do black box AI systems...
[151]
The Black Box Problem: Opaque Inner Workings of Large Language ...
Oct 23, 2023 · Large language models like GPT-4 are powerful but opaque "black boxes." New techniques for explainable AI and transparent design can help unlock their benefits ...
[152]
The Challenge of AI Interpretability | Deloitte US
Jul 28, 2025 · LLMs face technical barriers like feature superposition1 and probabilistic generation, further complicating interpretation. Increased demand ...
[153]
LLM Interpretability: Understanding What Models Learn and Why
Jun 3, 2025 · Technical Challenges. Superposition: Multiple concepts encoded in overlapping sets of neurons, making individual features difficult to isolate.Missing: barriers | Show results with:barriers
[154]
A Mathematical Framework for Transformer Circuits
Dec 22, 2021 · To explore the challenge of reverse engineering transformers, we reverse engineer several toy, attention-only models. In doing so we find: Zero ...
[155]
Efficient Automated Circuit Discovery in Transformers using ... - arXiv
Oct 11, 2024 · They often suffer from slow runtime, approximation errors, and specific requirements of metrics, such as non-zero gradients. In this work, we ...
[156]
Extracting Interpretable Features from Claude 3 Sonnet
May 21, 2024 · Sparse autoencoders produce interpretable features for large models. · Scaling laws can be used to guide the training of sparse autoencoders.
[157]
The engineering challenges of scaling interpretability - Anthropic
Jun 13, 2024 · Our Sparse Autoencoders—the tools we use to investigate “features”—are trained on the activations of transformers, and those activations ...
[158]
Sparse Autoencoders Find Highly Interpretable Features in ... - arXiv
Sep 15, 2023 · These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches.
[159]
[PDF] Scaling Laws for Neural Language Models - arXiv
Jan 23, 2020 · We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, ...
[160]
A Survey of Scaling in Large Language Model Reasoning - arXiv
Apr 2, 2025 · This survey aims to provide insights into how scaling strategies fundamentally enhance the reasoning capabilities of LLMs and further guide the development of ...
[161]
Test-Time Scaling Law - Emergent Mind
Jul 1, 2025 · Practical test-time scaling exhibits diminishing returns after a saturation point and is limited by hardware factors like memory bandwidth and ...
[162]
How many AI models will exceed compute thresholds? | Epoch AI
May 30, 2025 · For example, our results suggest that the total number of models above 1026 FLOP will quickly grow from a few in 2025, to over 200 by 2030. This ...
[163]
[PDF] Artificial Intelligence Index Report 2025 | Stanford HAI
Feb 2, 2025 · The AI Index continues to lead in tracking and interpreting the most critical trends shaping the field—from the shifting geopolitical landscape ...
[164]
Unveiling GPU Bottlenecks in Large-Batch LLM Inference - arXiv
Jul 11, 2025 · In this work, we conduct a detailed GPU analysis to uncover the true causes of the throughput plateau in large-batch LLM inference. Our findings ...
[165]
LLM Inference Performance Engineering: Best Practices - Databricks
Oct 12, 2023 · Learn best practices for optimizing LLM inference performance on Databricks, enhancing the efficiency of your machine learning models.
[166]
[PDF] Phi-3 Technical Report: A Highly Capable Language Model ... - arXiv
Aug 30, 2024 · We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both ...
[167]
Microsoft's new Phi-3 is one of the smallest AI models available
Apr 23, 2024 · Phi-3's performance “rivals that of models such as Mixtral 8x7B and GPT-3.5,” the researchers from Microsoft explained in a paper on the new ...
[168]
[PDF] Large Language Models May Talk Causality But Are Not Causal
LLMs are not causal, but may appear so because they are trained on data containing causal correlations, like 'causal parrots' reciting embedded knowledge.
[169]
[2509.20088] Causal Understanding by LLMs: The Role of Uncertainty
Sep 24, 2025 · These findings suggest that failures in causal understanding arise from the lack of structured causal representation, rather than insufficient ...Missing: peer | Show results with:peer
[170]
What is ARC-AGI? - ARC Prize
ARC-AGI. The General Intelligence Benchmark. The Measure of Intelligence. In 2019, François Chollet - creator of Keras, an open-source deep learning library ...
[171]
12/12 o3 Saturates the ARC-AGI Benchmark! - High Learning Rate
Dec 21, 2024 · GPT-4o only scored 5%. Back in October François Chollet was predicting at least 1 year before seeing considerable progress on his benchmark.
[172]
Getting 50% (SoTA) on ARC-AGI with GPT-4o - LessWrong
Jun 17, 2024 · I recently got to 50% [1] accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation ...
[173]
Perplexity of fixed-length models - Hugging Face
Perplexity (PPL) is one of the most common metrics for evaluating language models. ... Perplexity is defined as the exponentiated average negative log-likelihood ...
[174]
Perplexity Metric for LLM Evaluation - Analytics Vidhya
Apr 6, 2025 · Learn how perplexity Metric evaluates LLMs, its math foundation, pros & cons & how it compares to modern metrics like LLM-as-a-Judge.<|separator|>
[175]
Perplexity for LLM Evaluation - GeeksforGeeks
Jul 23, 2025 · Evaluation of Language Models: Perplexity helps evaluate language models like GPT-3, where predicting the next word or token is a crucial task.
[176]
Perplexity - Evaluation of LLMs Part 1 - LinkedIn
May 4, 2024 · 1. Perplexity serves as an inherent metric, independent of external datasets, for assessing the performance of language models within the field ...
[177]
What is Wrong with Perplexity for Long-context Language Modeling?
Oct 31, 2024 · Perplexity (PPL) is unreliable for long-context because it overlooks key tokens by averaging across all tokens, obscuring true performance.
[178]
Can Perplexity Reflect Large Language Model's Ability in Long Text ...
May 9, 2024 · The study finds no correlation between perplexity (PPL) and LLMs' long-text understanding ability, as PPL may only reflect local information.
[179]
[2009.03300] Measuring Massive Multitask Language Understanding
Sep 7, 2020 · We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, ...
[180]
Top LLMs To Use in 2025: Our Best Picks - Splunk
May 8, 2025 · 3.7 Sonnet also performs exceptionally well on academic benchmarks, scoring around 91% on MMLU, which shows how solid its general reasoning and ...<|separator|>
[181]
google/BIG-bench: Beyond the Imitation Game collaborative ...
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future ...
[182]
[2502.19187] BIG-Bench Extra Hard - arXiv
Feb 26, 2025 · We introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation.
[183]
[2211.09110] Holistic Evaluation of Language Models - arXiv
Nov 16, 2022 · We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
[184]
Holistic Evaluation of Language Models (HELM) - Stanford CRFM
The Holistic Evaluation of Language Models (HELM) serves as a living benchmark for transparency in language models ... A benchmark by medical experts for LLMs ...Stanford HELMLite
[185]
HELM Capabilities - Stanford CRFM
Mar 20, 2025 · HELM Capabilities is a new benchmark and leaderboard that consists of a curated set of scenarios for measuring various capabilities of language models.
[186]
Introducing TRAIL: A Benchmark for Agentic Evaluation - Patronus AI
Jun 5, 2025 · TRAIL is a benchmark dataset for evaluating how well LLMs can debug AI agent workflows, with 148 annotated traces and 841 errors.
[187]
Lost in Benchmarks? Rethinking Large Language Model ... - arXiv
Aug 1, 2025 · 2. LLM benchmarks suffer from widespread saturation and insufficient difficulty ceilings, limiting their ability to challenge and accurately ...
[188]
Benchmark Data Contamination of Large Language Models: A Survey
Jun 6, 2024 · This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional ...
[189]
A Survey on Data Contamination for Large Language Models - arXiv
Jun 5, 2025 · This is especially problematic as LLMs use large web-scraped datasets that are prone to overlap with testing benchmarks. Xu et al. (2024b) ...
[190]
LiveBench
LiveBench is a challenging, contamination-free LLM benchmark designed for objective evaluation, with regular question releases and objective ground-truth ...
[191]
LMSYS Org
Chatbot Arena. Scalable and gamified evaluation of LLMs via crowdsourcing and Elo rating systems. SGLang. A fast serving engine for LLMs and VLMs. LMSYS-Chat-1M ...Projects · About · Blog · DonationsMissing: 2025 | Show results with:2025
[192]
Chatbot Arena + - OpenLM.ai
Oct 16, 2025 · This leaderboard is based on the following benchmarks. Chatbot Arena - a crowdsourced, randomized battle platform for large language models ...
[193]
A framework for human evaluation of large language models in ...
Sep 28, 2024 · These statistical methods serve two primary purposes: (1) calculating inter-evaluator agreement, and (2) comparing the performance of LLMs ...
[194]
Enhancing Human Evaluation in Machine Translation with ...
Feb 25, 2025 · This section outlines five criteria for analyzing the human annotation results: (1) inter-annotator agreement, (2) inter-translation consistency ...
[195]
[PDF] Can Large Language Models Be an Alternative to Human Evaluation?
Jul 9, 2023 · The paper explores if LLMs can replace human evaluation, showing LLM evaluation results are consistent with expert human evaluation.
[196]
HumanEval: A Benchmark for Evaluating LLM Code Generation ...
Nov 13, 2024 · HumanEval is a benchmark dataset developed by OpenAI that evaluates the performance of large language models (LLMs) in code generation tasks.Missing: specific complementarity
[197]
On the Effectiveness of Large Language Models in Domain-Specific ...
Our results demonstrate that LLMs exhibit sub-optimal performance in generating domain-specific code, due to their limited proficiency in utilizing domain- ...Missing: expert | Show results with:expert
[198]
Domain-Specific Criteria for LLM Evaluation - Ghost
May 21, 2025 · Explore the critical need for domain-specific evaluation of large language models in scientific fields to ensure accuracy and reliability.Missing: coding HumanEval complementarity
[199]
[PDF] Verbosity Bias in Preference Labeling by Large Language Models
Oct 16, 2023 · We can view this as verbosity bias since the aim of LLM judgment in RLAIF is human alignment and not the eradication of biased preference in ...
[200]
What large language models know and what people think they know
Jan 21, 2025 · While RLHF encourages human-aligned output, it inevitably reproduces any human preference biases. For example, people prefer detailed and ...
[201]
[PDF] Dissecting Human and LLM Preferences - ACL Anthology
Aug 11, 2024 · To minimize prompt bias in model preference assessment, we use a straightforward one: “Be- tween Response A and Response B, which better.
[202]
SafetyBench: Evaluating the Safety of Large Language Models - arXiv
Jun 24, 2024 · This dataset is often used to evaluate language models' toxic generations. The rise of LLMs brings up new problems to LLM evaluation (e.g., long ...1 Introduction · 2.1 Safety Benchmarks For... · 3 Safetybench Construction
[203]
A Systematic Evaluation of Prompt Injection and Jailbreak ... - arXiv
May 7, 2025 · We further propose layered mitigation strategies and recommend a hybrid red-teaming and sandboxing approach for robust LLM security. Report ...
[204]
[PDF] An Automatic Jailbreak Attack and Defense Benchmarking for LLMs
Aug 15, 2025 · However, it is imperative to note that existing jailbreak methods exploit vulnerabilities at the prompt-level or token- level, such as “DAN” [7] ...
[205]
[PDF] catastrophic jailbreak of open-source llms - ICLR Proceedings
We now systematically evaluate whether exploiting different generation configurations, namely our generation exploitation attack, can fail model alignment.
[206]
Learning diverse attacks on large language models for robust red ...
May 28, 2024 · Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other ...Missing: robustness | Show results with:robustness
[207]
Constitutional Classifiers: Defending against universal jailbreaks
Feb 3, 2025 · Guarding Claude using Constitutional Classifiers, however, produced a strong improvement: the jailbreak success rate was reduced to 4.4%, ...
[208]
Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural ...
Oct 9, 2025 · Third, PE-CoA achieves 75-100% attack success rates across diverse architectures, demonstrating that conversational structure can bypass safety ...
[209]
[PDF] RedHit: Adaptive Red-Teaming of Large Language Models via ...
Aug 1, 2025 · RedHit is an adaptive red-teaming framework for LLMs, using MCTS, CoT reasoning, and DPO to enhance adversarial capabilities. Red-teaming ...
[210]
Mechanistic Interpretability for AI Safety A Review - arXiv
Apr 22, 2024 · This review explores mechanistic interpretability: reverse-engineering the computational mechanisms and representations learned by neural networks.
[211]
Attribution Patching: Activation Patching At Industrial Scale
Feb 4, 2025 · We want to reverse-engineer how GPT-2 Small does indirect object identification, and in particular analyse the residual stream at different ...
[212]
[PDF] A CIRCUIT FOR INDIRECT OBJECT IDENTIFICATION IN GPT-2 ...
This paper explains how GPT-2 small performs indirect object identification (IOI) using a circuit of 28 attention heads, routing information to the correct ...
[213]
Exploring Binary Mechanisms in Transformer Circuits - arXiv
Aug 22, 2025 · Our approach includes two mechanistic interpretability techniques: Path Patching and Logit Lens. ... indirect object identification in gpt-2 ...
[214]
Polysemantic Interference Transfers and Predicts Cross-Model ...
May 16, 2025 · Abstract:Polysemanticity is pervasive in language models and remains a major challenge for interpretation and model behavioral control.
[215]
Monosemanticity: How Anthropic Made AI 70% More Interpretable
Aug 1, 2025 · Discover Anthropic's breakthrough: sparse autoencoders make AI 70% interpretable by decomposing neurons into readable features.
[216]
[PDF] Mechanistic Interpretability for AI Safety A Review | OpenReview
However, robustly interpreting the largest trillion-parameter models using automated techniques remains an open challenge. Another novel approach ...<|control11|><|separator|>
[217]
Open Problems in Mechanistic Interpretability - arXiv
Jan 27, 2025 · This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
[218]
Integrated Gradients - Captum · Model Interpretability for PyTorch
Integrated Gradients is an axiomatic model interpretability algorithm that assigns an importance score to each input feature by approximating the integral of ...<|separator|>
[219]
Decomposing Language Models With Dictionary Learning
Oct 4, 2023 · Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer. Browse A/1 Features →
[220]
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the ...
Dec 22, 2023 · We consider understanding how to reverse-engineer circuits in superposition to be one of the major open problems in mechanistic interpretability ...
[221]
Why I'm Moving from Mechanistic to Prosaic Interpretability
Dec 29, 2024 · I've decided to shift my research from mechanistic interpretability to more empirical ("prosaic") interpretability / safety work. Here's why.
[222]
"Mechanistic interpretability" for LLMs, explained - The Counterfactual
Jul 8, 2024 · An explainer piece on mechanistic interpretability (“MI” or “mech interp”): a growing research field that aims to understand the inner workings of LLMs.
[223]
Are Emergent Abilities of Large Language Models a Mirage? - arXiv
Apr 28, 2023 · We provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of ...
[224]
[2406.02550] Learning to grok: Emergence of in-context ... - arXiv
Jun 4, 2024 · In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks.
[225]
Common arguments regarding emergent abilities - Jason Wei
May 3, 2023 · Emergent abilities often occur for “hard” evaluation metrics, such as exact match or multiple-choice accuracy, which don't award credit for partially correct ...
[226]
Biases in Large Language Models: Origins, Inventory, and Discussion
Jun 22, 2023 · We first introduce data selection bias, that is, the bias caused by the choice of texts that make up a training corpus. Then, we survey the ...
[227]
Survey of Cultural Awareness in Language Models: Text and Beyond
These models reflect the Western perspective, as they are predominantly trained on Western-centric data (Durmus et al. 2023). This skewed perspective can lead ...
[228]
Measuring Bias of Large Language Model from a Local Context - arXiv
Feb 2, 2025 · Bias in LLMs is typically influenced by the data they are trained on, which consists of internet-sourced corpora reflecting the dominant ...Missing: imbalances | Show results with:imbalances
[229]
Word embeddings quantify 100 years of gender and ethnic ... - PNAS
Apr 3, 2018 · Recent works demonstrate that word embeddings, among other methods in machine learning, capture common stereotypes because these stereotypes ...
[230]
Bias and Fairness in Large Language Models: A Survey
We present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and ...Introduction · Taxonomy of Metrics for Bias... · Taxonomy of Techniques for...
[231]
Bias in Large Language Models: Origin, Evaluation, and Mitigation
Nov 16, 2024 · This comprehensive review examines the landscape of bias in LLMs, from its origins to current mitigation strategies.
[232]
[PDF] Experiments on Probing LLMs for Geographic Knowledge and ...
Research has shown that both the training data and outputs of LLMs are skewed towards western and economically affluent countries, resulting in the.
[233]
Assessing political bias in large language models
Feb 28, 2025 · We show that larger models, such as Llama3-70B, tend to align more closely with left-leaning political parties, while smaller models often ...
[234]
Study: Some language reward models exhibit political bias | MIT News
Dec 10, 2024 · In fact, they found that optimizing reward models consistently showed a left-leaning political bias. And that this bias becomes greater in ...
[235]
(PDF) Is Chat Gpt Biased Against Conservatives? An Empirical Study
It was found that, at least in some cases, the AI was biased to favor liberal politicians and disfavor conservatives.
[236]
What Is The Political Content in LLMs' Pre- and Post-Training Data?
Sep 26, 2025 · Among all datasets, Dolmino has the highest share of politically classified documents, with 1.6% right-leaning and 5.2% left-leaning, while SFT- ...
[237]
The political preferences of LLMs | PLOS One - Research journals
I report here a comprehensive analysis about the political preferences embedded in Large Language Models (LLMs).<|separator|>
[238]
The Political Preferences of LLMs - by David Rozado
Feb 2, 2024 · Most conversational LLMs tend to generate responses that are diagnosed by most political test instruments as manifesting preferences for left-of-center ...
[239]
Only a Little to the Left: A Theory-grounded Measure of Political Bias ...
Mar 20, 2025 · Specifically, the Political Compass Test tends to exaggerate the overall biases of gpt-3.5-turbo-0125, and the cultural biases of llama-2 ...
[240]
Unpacking Political Bias in Large Language Models: A Cross ... - arXiv
Feb 24, 2025 · In this study, we systematically measure the political biases in a wide range of LLMs, using a curated set of questions addressing political bias in various ...
[241]
Measuring Political Bias in Large Language Models: What Is Said ...
Mar 27, 2024 · The result reveals that models exhibit a more liberal bias on political issues by showing more similarity to proponent stances on gun control, ...
[242]
[PDF] Political Bias in Large Language Models - RAIS Conferences
This study explores the political bias in large language models by conduct- ing a comparative analysis across four popular AI models—ChatGPT-4, Perplex- ity, ...
[243]
[PDF] Evaluating Multilingual LLMs across Languages and Nationalities
Jul 27, 2025 · In this work, we evaluate the political bias of the 15 most-used multilingual LLMs via the Political Compass. Test. We test different scenarios, ...<|separator|>
[244]
Measuring Political Preferences in AI Systems - Manhattan Institute
Jan 23, 2025 · Explicitly politically aligned LLMs like RightwingGPT and LeftwingGPT demonstrate the highest levels of ideological bias, positioning them ...
[245]
How Elon Musk Is Remaking Grok in His Image - The New York Times
Sep 2, 2025 · Elon Musk has said Grok, the A.I.-powered chatbot that his company developed, should be “politically neutral” and “maximally truth-seeking.”.
[246]
Grok 3's Brush with Censorship: xAI's "Truth-Seeking" AI
Mar 6, 2025 · Billed as a “maximally truth-seeking AI,” Grok 3 aims to push boundaries and answer controversial questions that other AI systems might shy away from.Missing: debiasing | Show results with:debiasing
[247]
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human ...
Sep 1, 2023 · Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
[248]
Constitutional AI with Open LLMs - Hugging Face
Feb 1, 2024 · Constitutional AI allows models to self-align by critiquing their output against a set of principles, helping to align open source models.
[249]
Evaluating political bias in LLMs - Promptfoo
Jul 24, 2025 · How right-leaning is Grok? We've released a new testing methodology alongside a dataset of 2500 political questions.
[250]
Why are AI image generators, including Grok, skewed toward DEI ...
xAI's philosophy emphasizes understanding the universe without heavy ideological tuning, so Grok avoids the kind of overzealous DEI corrections seen in some ...
[251]
Musk's influence puts Grok at the centre of AI bias debate
Sep 2, 2025 · Musk's chatbot Grok has swung between neutrality and conservatism, exposing how system prompts shape AI. Elon Musk's AI chatbot, Grok, has faced ...Missing: debiasing methods
[252]
Tech Things: xAI screws up its system prompts, again
May 17, 2025 · xAI and its bombastic founder both claim that other LLMs are all overtly biased and won't tell you the real truth.Missing: debiasing | Show results with:debiasing
[253]
xAI posts Grok's behind-the-scenes prompts - The Verge
May 16, 2025 · xAI has published the system prompts for its AI chatbot Grok after an “unauthorized” change led to a slew of unprompted responses on X about white genocide.
[254]
[2302.06590] The Impact of AI on Developer Productivity - arXiv
Feb 13, 2023 · This paper presents results from a controlled experiment with GitHub Copilot, an AI pair programmer. Recruited software developers were asked to ...
[255]
Research: Quantifying GitHub Copilot's impact in the enterprise with ...
May 13, 2024 · An impressive 90% of developers expressed feeling more fulfilled with their jobs when utilizing GitHub Copilot, and a staggering 95% of ...<|separator|>
[256]
Economic potential of generative AI - McKinsey
Jun 14, 2023 · Generative AI could enable labor productivity growth of 0.1 to 0.6 percent annually through 2040, depending on the rate of technology adoption ...
[257]
New Research Reveals AI Coding Assistants Boost Developer ...
Sep 12, 2024 · A new study provides compelling evidence that AI-powered coding assistants can substantially boost software developer productivity in real-world enterprise ...
[258]
Global Economic Impact of AI: Horizon 2040 | Holistic Data Solutions
Jun 4, 2025 · AI could contribute up to $15.7 trillion to global GDP by 2030, representing a 14% increase compared to a scenario without AI.
[259]
Research and Development | The 2025 AI Index Report | Stanford HAI
1. Industry continues to make significant investments in AI and leads in notable AI model development, while academia leads in highly cited research.
[260]
Exploring the role of large language models in the scientific method
Aug 5, 2025 · Examples include GPT-4 (OpenAI), BERT (Bidirectional Encoder Representation from Transformers), CLIP (Contrastive Language-Image Pre-training, ...
[261]
Large language models for drug discovery and development
Oct 10, 2025 · Specialized language models: These models are trained on domain-specific scientific language, e.g., SMILES for small molecules and FASTA for ...
[262]
Token-Mol 1.0: tokenized drug design with large language models
May 13, 2025 · The integration of large language models (LLMs) into drug design is gaining momentum; however, existing approaches often struggle to ...
[263]
The promises of large language models for protein design and ... - NIH
LLMs learn the probability distribution of the elements in a sequence (e.g., amino acids inside proteins) and are able to do this by using self-supervised ...
[264]
The AI Scientist: Towards Fully Automated Open-Ended Scientific ...
Aug 13, 2024 · The AI Scientist is a fully automated pipeline for end-to-end paper generation, enabled by recent advances in foundation models.
[265]
AI, agentic models and lab automation for scientific discovery
AI's role in enhancing experimental design is pivotal in streamlining research methodologies and refining the scientific process. By suggesting optimized ...
[266]
32 examples of LLM applications in materials science and chemistry
Large language models (LLMs) are reshaping many aspects of materials science and chemistry research, enabling advances in molecular property prediction, ...
[267]
Evaluating the Impact of AI on the Labor Market - Yale Budget Lab
Oct 1, 2025 · Overall, our metrics indicate that the broader labor market has not experienced a discernible disruption since ChatGPT's release 33 months ago, ...
[268]
Large Language Models, Small Labor Market Effects | NBER
May 9, 2025 · We examine the early labor market impacts of AI chatbots by linking large-scale, representative adoption surveys to administrative labor ...
[269]
The Impact of Artificial Intelligence on Law Firms' Business Models
Feb 25, 2025 · Lawyers have seen productivity gains greater than 100 times. Using AI for the automation of initial drafting has demonstrated not only time ...
[270]
New MIT Research Shows Spectacular Increase In White Collar ...
Mar 7, 2023 · New study by MIT students finds that ChatGPT improves white collar productivity by 37% or more, increases quality, and reduces effort.
[271]
The Projected Impact of Generative AI on Future Productivity Growth
Sep 8, 2025 · (2024), we assume that 23 percent of exposed tasks will eventually be automated. Putting it all together, we estimate that just under 10 percent ...Missing: automatable | Show results with:automatable
[272]
Jobs AI Will Replace First in the Workplace Shift - Forbes
Apr 25, 2025 · A 2025 World Economic Forum report flags that 40% of programming tasks could be automated by 2040.<|separator|>
[273]
How artificial intelligence impacts the US labor market | MIT Sloan
Oct 9, 2025 · They also grow faster: A large increase in AI use is linked to about 6% higher employment growth and 9.5% more sales growth over five years.
[274]
The Fearless Future: 2025 Global AI Jobs Barometer - PwC
Jun 3, 2025 · PwC's 2025 Global AI Jobs Barometer reveals that AI can make people more valuable, not less – even in the most highly automatable jobs.<|separator|>
[275]
Five lessons from history on AI, automation, and employment
Nov 28, 2017 · We estimate that the introduction of the personal computer, for instance, has enabled the creation of 15.8 million net new jobs in the United ...
[276]
[PDF] Large Language Models, Small Labor Market Effects
Apr 15, 2025 · We examine the labor market effects of AI chatbots using two large-scale adoption surveys (late 2023 and 2024) covering 11 exposed ...
[277]
Is AI Contributing to Rising Unemployment? | St. Louis Fed
Aug 26, 2025 · The figure below shows that occupations with higher AI exposure experienced larger unemployment rate increases between 2022 and 2025, with a ...<|separator|>
[278]
New Study Reveals Generative AI Boosts Job Growth and Productivity
May 13, 2025 · A groundbreaking study analyzing more than a decade of US patent data has found that not all artificial intelligence (AI) innovations displace human workers.
[279]
Electricity Demand and Grid Impacts of AI Data Centers - arXiv
Sep 10, 2025 · Furthermore, training GPT-4 required an estimated over 50 GWh of electricity, approximately 40 times more than GPT-3, and equivalent to ...
[280]
The Unseen AI Disruptions for Power Grids: LLM-Induced Transients
Sep 9, 2024 · For instance, the training of GPT-4 consumed over 50 GWh, approximately 0.02% of California's annual electricity consumption, representing a ...
[281]
We did the math on AI's energy footprint. Here's the story you haven't ...
May 20, 2025 · It's now estimated that 80–90% of computing power for AI is used for inference. ... So what might a day's energy consumption look like for one ...Power Hungry · Four reasons to be optimistic... · Can nuclear power really fuel...
[282]
How much energy does ChatGPT use? - Epoch AI
Feb 7, 2025 · We find that typical ChatGPT queries using GPT-4o likely consume roughly 0.3 watt-hours, which is ten times less than the older estimate.
[283]
Data Centers Will Use Twice as Much Energy by 2030—Driven by AI
Apr 10, 2025 · Data centers accounted for about 1.5 percent of global electricity consumption in 2024, an amount expected to double by 2030 because of AI use.
[284]
AI: Five charts that put data-centre energy use – and emissions
Sep 15, 2025 · As it stands, AI has been responsible for around 5-15% of data-centre power use in recent years, but this could increase to 35-50% by 2030, ...
[285]
Preventing the Immense Increase in the Life-Cycle Energy and ...
In this work, we clarify and highlight the energy consumption and carbon emission implications of eight main phases throughout the life cycle of the ...
[286]
Litespark Technical Report: High-Throughput, Energy-Efficient LLM ...
Oct 2, 2025 · Training 500B tokens on 256 GPUs requires 125.35 MWh with Litespark versus 732.08 MWh with Llama yields an 83% reduction representing over 600 ...Litespark Technical Report · 2 Experimental Setup · 3 Results
[287]
Reducing AI's Climate Impact: Everything You Always Wanted to ...
Sep 13, 2024 · To address the accelerating demands of AI's energy consumption, the ideal solution would be to transition to 100% renewable energy, but this ...
[288]
Reconciling the contrasting narratives on the environmental impact ...
Nov 1, 2024 · Some studies highlight the substantial carbon footprint of training and using LLMs, while others argue that LLMs can lead to more sustainable ...
[289]
AI and energy: Will AI reduce emissions or increase power demand?
Jul 22, 2024 · AI is also helping to transform the energy efficiency of other carbon-intensive industries.Missing: LLM | Show results with:LLM
[290]
The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted ...
Dec 27, 2023 · The New York Times sued OpenAI and Microsoft for copyright infringement on Wednesday, opening a new front in the increasingly intense legal battle.
[291]
Every AI Copyright Lawsuit in the US, Visualized | WIRED
Dec 19, 2024 · The plaintiffs all allege that OpenAI's large language models were trained on their works without permission. Although each lawsuit brought a ...
[292]
Mid-Year Review: AI Copyright Case Developments in 2025
Aug 21, 2025 · There have been over 50 AI infringement lawsuits filed over the past few years, although the current number of active cases is somewhere closer ...
[293]
Federal Court Finds That Training AI on Copyrighted Books is ...
Jun 30, 2025 · A federal district court in the Northern District of California has ruled that the use of lawfully acquired copyrighted works to train artificial intelligence ...
[294]
Northern District of California Decides AI Training Is Fair Use, but ...
Jul 2, 2025 · Both cases involved authors alleging copyright infringement based on the use of their books to train generative AI models, and both courts held ...
[295]
The Value of Real Data in Training Large Language Models - arXiv
Jul 3, 2024 · We find suggestive evidence that the performance disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher ...
[296]
[PDF] Copyright and Artificial Intelligence, Part 3: Generative AI Training ...
May 6, 2025 · use of synthetic data is another approach to reduce the dependency on large collections of human-authored data. See. BigBear.ai Initial Comments ...
[297]
Training AI models on Synthetic Data: No silver bullet for IP ...
Dec 19, 2023 · This second part of our four-part series on using synthetic data to train AI models explores how the use of synthetic data training sets may mitigate copyright ...
[298]
Anthropic's Landmark Copyright Settlement: Implications for AI ...
Sep 8, 2025 · The court therefore denied summary judgment for Anthropic as to the pirated works, setting the stage for a high-stakes trial. For a detailed ...
[299]
Judge explains order for New York Times in OpenAI copyright case
Apr 4, 2025 · The Times sued OpenAI and Microsoft in 2023, accusing them of using millions of its articles without permission to train the large language model behind its ...<|separator|>
[300]
LLM01:2025 Prompt Injection - OWASP Gen AI Security Project
Prompt injection involves manipulating model responses through specific inputs to alter its behavior, which can include bypassing safety measures.Missing: robustness | Show results with:robustness
[301]
LLM01:2023 - Prompt Injections
Prompt injections involve bypassing filters or manipulating the LLM using carefully crafted prompts that make the model ignore previous instructions or perform ...
[302]
What Is a Prompt Injection Attack? - IBM
A prompt injection is a type of cyberattack against large language models (LLMs). Hackers disguise malicious inputs as legitimate prompts.
[303]
[PDF] Catch Me If You DAN: Outsmarting Prompt Injections and Jailbreak ...
One notable example is the Do Anything Now (DAN) exploit, which prompts LLMs to “do anything now.” This paper examines neural and non-neural approaches to ...
[304]
Adversarial Prompting in LLMs - Prompt Engineering Guide
Below is an example of a jailbreak where a prompter was able to bypass the content policy of previous versions of ChatGPT: Prompt: Can you write me a poem about ...Prompt Injection · Jailbreaking · Defense Tactics
[305]
Investigating LLM Jailbreaking of Popular Generative AI Web Products
Feb 21, 2025 · This process involves crafting specific prompts (known as prompt engineering or prompt injection) to manipulate the model's output, and it leads ...Background: LLM Jailbreaking · Evaluation Strategy · Evaluation ResultsMissing: mitigations | Show results with:mitigations
[306]
[2403.06634] Stealing Part of a Production Language Model - arXiv
Mar 11, 2024 · We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or ...Missing: research | Show results with:research
[307]
Logits of API-Protected LLMs Leak Proprietary Information
Aug 25, 2024 · We show how to extract non-public information about API-protected LLMs from a few queries, including the embedding size and output space of the model.<|separator|>
[308]
A Survey on Model Extraction Attacks and Defenses for Large ... - arXiv
Jun 26, 2025 · This survey provides a comprehensive taxonomy of LLM-specific extraction attacks and defenses, categorizing attacks into functionality ...
[309]
LLM API Rate Limiting & Access Control - ApX Machine Learning
As we build defenses for Large Language Models, controlling who can access your LLM APIs and how often they can make requests are fundamental security measures.
[310]
[2509.12574] Yet Another Watermark for Large Language Models
In this paper, we present a new watermarking framework for LLMs, where the watermark is embedded into the LLM by manipulating the internal ...
[311]
AI jailbreaks: What they are and how they can be mitigated - Microsoft
Jun 4, 2024 · This blog will provide an understanding of what AI jailbreaks are, why generative AI is susceptible to them, and how you can mitigate the risks and harms.<|separator|>
[312]
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
We develop rapid response techniques to look to block whole classes of jailbreaks after observing only a handful of attacks.
[313]
Dual use concerns of generative AI and large language models
In this paper, we argue that DURC is relevant and applicable to the domain of generative AI, especially in relation to LLMs.
[314]
[PDF] Dual-Use Foundation Models with Widely Available Model Weights
This Report provides a non-exhaustive review of the risks and benefits of open foundation models, broken down into the broad categories of Public Safety; Soci-.
[315]
10 Benefits and 10 Challenges of Applying Large Language Models ...
Jan 22, 2024 · This post presents 10 benefits and 10 challenges of applying LLMs to the software acquisition process and suggests specific use cases where ...Missing: deployment | Show results with:deployment
[316]
Understanding the Benefits and Challenges of Using Large ... - NIH
Conversational agents powered by large language models (LLM) have increasingly been utilized in the realm of mental well-being support.Missing: critiques | Show results with:critiques
[317]
Large Language Models for Mental Health Applications
Oct 18, 2024 · This systematic review examines the clinical applications of LLMs in mental health, highlighting their potential and inherent risks.Missing: critiques | Show results with:critiques
[318]
Evidence of AI's impact from a state-backed disinformation campaign
Apr 1, 2025 · artificial intelligence, large language models, propaganda, disinformation, misinformation ... disinformation have prompted several recent studies ...The Campaign · Breadth Of Content · Persuasion And CredibilityMissing: misuse | Show results with:misuse
[319]
LLM-Generated Fake News Induces Truth Decay in News Ecosystem
Apr 29, 2025 · Our findings expose a truth decay phenomenon, where real news is gradually losing its advantageous position in news ranking against fake news.
[320]
The spread of synthetic media on X - HKS Misinformation Review
Jun 3, 2024 · ... propaganda or concerning deepfakes. While less likely to go viral ... LLM-powered chatbot references to Kremlin disinformation reflect information ...
[321]
AI deception: A survey of examples, risks, and potential solutions
An advanced AI could potentially generate and disseminate fake news articles, divisive social media posts, and deepfake videos that are tailored to individual ...
[322]
The Dark Side of Language Models: Exploring the Potential of LLMs ...
As a category of multimodal disinformation, the Deep Fake issue induced by ... A practical guide to doing behavioral research on fake news and misinformation.Missing: deepfakes | Show results with:deepfakes
[323]
[PDF] SELF-GUARD: Empower the LLM to Safeguard Itself - ACL Anthology
Jun 16, 2024 · Experimen- tal results indicate that our SELF-GUARD can effectively defend against jailbreak attacks and will not cause LLMs' performance ...
[324]
Current safeguards, risk mitigation, and transparency measures of ...
Mar 20, 2024 · Objectives To evaluate the effectiveness of safeguards to prevent large language models (LLMs) from being misused to generate health ...
[325]
Self-Guard: Empower the LLM to Safeguard Itself WARNING - arXiv
Self-Guard possesses the advantages of safety training, leveraging the powerful capabilities of the LLMs themselves to detect harmfulness.
[326]
[PDF] Managing Misuse Risk for Dual-Use Foundation Models
Jan 9, 2025 · Organizations may gain significant insights about misuse risks after model deployment ... et al., (2024) Augmenting large language models with ...
[327]
[PDF] ASSESSING THE EFFECTS AND RISKS OF LARGE LANGUAGE ...
Across six experiments, we show that humans can- not identify self-presentation produced by large language models like GTP-2 or. GPT-3. We also demonstrate that ...
[328]
Exploring the Dangers of AI in Mental Health Care | Stanford HAI
Jun 11, 2025 · A new Stanford study reveals that AI therapy chatbots may not only lack effectiveness compared to human therapists but could also contribute to harmful stigma.Missing: overreliance empirical
[329]
Choice engines and paternalistic AI - Nature
Jul 6, 2024 · A Choice Engine would attempt to overcome both informational deficits and behavioral biases on the part of those who use them. Freedom of choice ...
[330]
[PDF] EPISTEMIC AND EMOTIONAL HARMS OF GENERATIVE AI
Sep 3, 2025 · The First. Amendment is designed to protect the moral and political agency of individual speakers. Individuals speak from their conscience ...<|control11|><|separator|>
[331]
A Critical Response to OpenAI's Approach to Human-AI Relationships
Jun 13, 2025 · Joanne Jang's recent article represents a concerning shift toward corporate paternalism disguised as user protection.
[332]
Yudkowsky on AGI risk on the Bankless podcast
Mar 14, 2023 · Ryan: Eliezer, is this view that AI is incredibly dangerous and that AGI is going to eventually end humanity and that we're just careening ...
[333]
Why do Experts Disagree on Existential Risk and P(doom)? A ... - arXiv
Feb 23, 2025 · Prominent AI researchers hold dramatically different views on the degree of risk from building AGI. For example, Dr. Roman Yampolskiy estimates ...
[334]
The Failed Strategy of Artificial Intelligence Doomers - LessWrong
Jan 31, 2025 · Talking about a singular AI doomer movement being one of them. Having the stance that AGI is not near and thus there is nothing to worry about ...Yudkowsky on AGI risk on the Bankless podcastContra Yudkowsky on AI DoomMore results from www.lesswrong.comMissing: doomerism | Show results with:doomerism
[335]
AI Plateau? Expert Analysis on Large Language Model Performance ...
Jul 10, 2025 · Are Large Language Models reaching their peak? Explore the evidence of a potential performance plateau in 2025, expert analysis, and ...
[336]
Large language models and synthetic health data - Oxford Academic
Oct 26, 2024 · We highlight how recent advances in large language models (LLMs) present new opportunities for progress, as well as new risks, in synthetic health data ...
[337]
Revolutionizing Health Care: The Transformative Impact of Large ...
These models excel in natural language processing (NLP), enhancing clinical support, diagnosis, treatment, and medical research. Breakthroughs, like GPT-4 and ...
[338]
[PDF] Generative AI for Economic Research: Use Cases and Implications ...
Oct 11, 2023 · Abstract. Generative AI, in particular large language models (LLMs) such as ChatGPT, has the potential to revolutionize research.<|separator|>
[339]
EU AI Act Criticism: Key Risks, Challenges & Industry Concern
Technology companies continue to warn about the act's chilling effect on innovation ... Act aimed at reducing burdens (like lower fees or simplified documentation) ...
[340]
The EU's AI Power Play: Between Deregulation and Innovation
May 20, 2025 · Thus, Europe's challenge is not just about regulatory and bureaucratic burdens but about creating the right conditions for scaling up AI firms.
[341]
EU AI Act Faces Backlash from Startup Leaders Demanding ...
Jul 1, 2025 · This patchwork enforcement could introduce legal uncertainty and compliance burdens that disproportionately affect smaller players. Concerns ...
[342]
Notes on e/acc principles and tenets - Beff's Newsletter
Jul 9, 2022 · A physics-first view of the principles underlying effective accelerationism.
[343]
Effective accelerationism, doomers, decels, and how to flaunt your AI ...
Nov 20, 2023 · The argument in favor of E/ACC is that no matter what disruptions new tech brings to society, the long-term benefits are so great that they ...