
Language model

A language model is a probabilistic framework in machine learning that estimates the probability distribution over sequences of linguistic units, such as words or subword tokens, enabling predictions of subsequent elements given prior context. These models originated with statistical approaches like n-gram estimators in the mid-20th century, which approximated probabilities based on contiguous word sequences, but evolved significantly with the introduction of neural architectures in the early 2000s, culminating in transformer-based large language models (LLMs) that leverage massive parallel training on internet-scale text corpora to achieve human-like fluency in generation and comprehension tasks. The core mechanism of modern language models involves autoregressive prediction, where the model computes the conditional probability P(w_t \mid w_1, \dots, w_{t-1}) for each token w_t in a sequence, often using self-attention mechanisms in transformers to capture long-range dependencies without recurrent processing. This shift from sequential models like recurrent neural networks (RNNs) to transformers, introduced in 2017, enabled scaling to billions or trillions of parameters, yielding breakthroughs in applications such as text generation, machine translation, and question answering, with empirical benchmarks showing LLMs outperforming prior systems on tasks like GLUE and SuperGLUE by wide margins due to emergent capabilities from pre-training on diverse corpora. Notable achievements include the GPT series by OpenAI, which demonstrated few-shot generalization on unseen tasks, and models like GPT-3 and Chinchilla that revealed scaling laws where performance predictably improves with compute and data, underscoring the causal role of model scale in approximating complex linguistic patterns.

Despite these advances, language models exhibit fundamental limitations rooted in their statistical nature, including hallucinations—generating plausible but factually incorrect outputs—as evidenced by empirical evaluations where even top models like GPT-4 err on novel factual queries at rates exceeding 10-20% in controlled tests, reflecting fitting to training distributions rather than genuine causal understanding. Biases inherited from uncurated training data propagate stereotypes and inaccuracies, with studies quantifying disparate error rates across demographic groups in downstream tasks, though mitigation via fine-tuning yields inconsistent results due to trade-offs with overall performance. Controversies also encompass high environmental costs from training, equivalent to thousands of households' annual energy use for a single large model, and risks of misuse in generating deceptive content, as demonstrated by adversarial prompts eliciting harmful instructions despite safeguards. These issues highlight that while language models excel at surface-level pattern matching, they lack robust generalization to out-of-distribution causal scenarios, prompting ongoing research into systems incorporating reasoning or retrieval augmentation for enhanced reliability.

Fundamentals

Definition and Scope

A language model is a probabilistic model that defines a probability distribution over sequences of words, tokens, or symbols drawn from a vocabulary. It estimates the likelihood of a given sequence occurring, typically factorized via the chain rule of probability as P(w_1, \dots, w_n) = \prod_{i=1}^n P(w_i \mid w_1, \dots, w_{i-1}), where each conditional term predicts the next unit given the preceding context. This formulation captures sequential dependencies, enabling evaluation of fluency through metrics like perplexity, defined as the exponential of the average negative log-likelihood. Early language models relied on statistical methods such as n-grams, which approximate conditional probabilities from empirical frequency counts in corpora, with smoothing techniques like Kneser-Ney addressing data sparsity. Neural variants, emerging prominently from 2003 onward, represent sequences via distributed embeddings and recurrent or attention mechanisms to model long-range dependencies more effectively than fixed-window n-grams. The scope excludes non-sequential models like bag-of-words classifiers, focusing instead on generative or predictive modeling of ordered linguistic units. Language models underpin core natural language processing tasks, including autoregressive text generation, where sequences are sampled iteratively from conditional distributions; machine translation, scoring candidate translations by fluency; and speech recognition, rescoring hypotheses via language probabilities integrated with acoustic models. They also support information retrieval by estimating query likelihood and enable foundational work in semantic representations, such as vector analogies derived from learned embeddings. While scalable neural architectures have expanded capabilities to handle billions of parameters and diverse modalities, the fundamental scope remains bounded to probabilistic sequence modeling, distinct from broader systems like vision models or agents.

Probabilistic Foundations

Language models are statistical models designed to estimate the probability distribution over sequences of linguistic units, such as words, subwords, or characters, in a given language. This estimation captures the relative likelihood of different sequences occurring in corpora, enabling applications like text generation, machine translation, and speech recognition. The foundational goal, as articulated in early statistical approaches, is to learn the joint probability function P(w_1, w_2, \dots, w_n) for a sequence of words w_1 to w_n. The chain rule of probability decomposes this joint distribution into a product of conditional probabilities: P(w_1, w_2, \dots, w_n) = \prod_{i=1}^n P(w_i \mid w_1, \dots, w_{i-1}), where P(w_1) initializes the sequence and each subsequent term conditions on all preceding elements. This autoregressive factorization reflects the causal structure of language generation, where each unit depends on prior context, aligning with empirical observations of sequential dependencies in human-produced text. Exact computation of these conditionals is intractable due to the curse of dimensionality—the number of possible histories grows exponentially with sequence length—necessitating approximations. Traditional n-gram models approximate the conditionals via the Markov assumption, restricting dependence to a fixed window of n-1 prior units: P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1}). For instance, bigram models (n=2) condition solely on the immediate predecessor, with probabilities estimated via maximum likelihood from counts in training data: P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\#(w_{i-1})}. This approach suffers from data sparsity, as unseen n-grams yield zero probabilities, addressed through smoothing techniques like Laplace (add-one) or Kneser-Ney, which redistribute probability mass to unobserved events based on empirical patterns. Neural language models parameterize the conditionals using differentiable functions, such as feedforward or recurrent networks, trained to maximize the log-likelihood of observed sequences under the chain rule. This enables learning dense representations that encode long-range dependencies and mitigate the curse of dimensionality in sparse count-based methods, as demonstrated in models achieving perplexity reductions on benchmarks like the Penn Treebank corpus. Training optimizes parameters \theta by maximizing the average log-likelihood \frac{1}{N} \sum_{i=1}^N \log P_\theta(w_i \mid w_1, \dots, w_{i-1}), where N is the corpus size in tokens, often using stochastic gradient descent. Evaluation metrics like perplexity, \exp\left( -\frac{1}{N} \sum \log P(w_i \mid \cdot) \right), quantify predictive uncertainty, with lower values indicating better approximation of the data-generating distribution.
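
The chain-rule factorization and perplexity definition above can be made concrete with a toy example. The following Python sketch estimates maximum-likelihood bigram probabilities from a tiny invented corpus and computes the perplexity of a short sequence; real systems use far larger corpora and smoothing to handle unseen n-grams.

```python
from collections import Counter
import math

# Toy corpus; real models estimate these counts from corpora with billions of tokens.
corpus = "the cat sat on the mat the dog sat on the rug".split()

# Maximum-likelihood bigram estimates: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# Perplexity of a test sequence under the bigram model:
# PPL = exp(-(1/N) * sum log P(w_i | w_{i-1})).
sequence = "the cat sat on the rug".split()
log_likelihood = sum(
    math.log(bigram_prob(prev, word))
    for prev, word in zip(sequence, sequence[1:])
)
perplexity = math.exp(-log_likelihood / (len(sequence) - 1))
print(f"perplexity = {perplexity:.2f}")
```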

Historical Development

Early Statistical Models

Early statistical language models originated in the field of information theory during the late 1940s, drawing on Markov chain principles to approximate the probabilities of sequential events in text. Claude Shannon introduced these concepts in his 1948 paper "A Mathematical Theory of Communication," where he modeled language as a stochastic process to quantify entropy, using zero-order approximations (uniform distributions) and higher-order predictions for letter sequences in English. In a 1951 follow-up, Shannon estimated the entropy of printed English at approximately 1 bit per letter by employing n-gram-like predictions, where the probability of a letter depends on the preceding 0 to 15 characters, demonstrating that higher-order approximations captured much of the language's redundancy, with per-character entropies dropping from 4.14 bits (single-letter frequencies) to around 1.3 bits (eighth-order). These foundational ideas evolved into explicit n-gram models for word-level prediction in speech recognition by the 1970s and 1980s, formalized under the Markov assumption that the probability of the next word w_m depends only on the previous n-1 words: P(w_m \mid w_1, \dots, w_{m-1}) \approx P(w_m \mid w_{m-n+1}, \dots, w_{m-1}). Unigram models treated words independently, bigrams conditioned on one prior word, and trigrams on two, with counts derived from corpora like the Brown Corpus (1 million words, 1960s) to estimate probabilities via maximum likelihood, though sparse data necessitated early smoothing techniques such as add-one (Laplace) to assign non-zero probabilities to unseen sequences. By the 1980s, these models supported applications in speech recognition, where trigrams improved perplexity measures on standard corpora, reducing prediction error compared to bigrams by factoring in local context. The 1990s marked widespread adoption in statistical machine translation, where n-gram language models penalized ungrammatical outputs in noisy-channel frameworks. IBM researchers developed Models 1 through 5 starting in the late 1980s, incorporating language models trained on parallel corpora like the Canadian Hansards (millions of sentence pairs) to compute translation probabilities alongside fluency scores, achieving initial benchmarks on French-English pairs with perplexity reductions via interpolated smoothing. These models, estimated using expectation-maximization algorithms on up to 10^6 sentence pairs, relied on n-grams up to order 3 or 4 due to computational limits and sparsity, with fertility and distortion extensions addressing word alignments but preserving the core statistical independence assumptions from Shannon's era. Despite limitations like the inability to capture long-range dependencies—evident in diminishing perplexity gains for n > 3 even on large corpora—early statistical models established probabilistic foundations, influencing toolkits like SRILM (2002) for efficient n-gram storage and querying on billions of words.
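
As an illustration of the entropy estimates discussed above, the Python sketch below computes per-character conditional entropy under Markov approximations of increasing order on a short invented text; the sample is far too small to reproduce Shannon's published figures and is meant only to show the calculation.

```python
from collections import Counter
import math

# Illustrative estimate of per-character entropy under k-th order Markov
# approximations, in the spirit of Shannon's experiments. This text is a
# stand-in; tiny samples exaggerate the drop at higher orders.
text = ("language models assign probabilities to sequences of symbols and "
        "were first studied as markov approximations to printed english")

def conditional_entropy(text, order):
    """H(X_n | X_{n-order}..X_{n-1}) in bits per character, via ML counts."""
    context_counts = Counter(text[i:i + order] for i in range(len(text) - order))
    joint_counts = Counter(text[i:i + order + 1] for i in range(len(text) - order))
    total = sum(joint_counts.values())
    entropy = 0.0
    for ngram, joint in joint_counts.items():
        p_joint = joint / total
        p_cond = joint / context_counts[ngram[:order]]
        entropy -= p_joint * math.log2(p_cond)
    return entropy

for k in range(4):
    print(f"order {k}: {conditional_entropy(text, k):.2f} bits/char")
```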

Emergence of Neural Approaches

The transition to neural approaches in language modeling began in the early 2000s, addressing limitations of statistical n-gram models, which struggled with data sparsity and the curse of dimensionality due to exponential growth in possible word sequences. In 2003, Yoshua Bengio and colleagues introduced one of the first neural probabilistic language models, employing a feedforward neural network to estimate the probability of the next word given prior context. This model used a distributed representation of words—early word embeddings—learned via backpropagation with shared parameters across context positions, enabling generalization beyond observed n-grams and achieving lower perplexity on held-out data compared to traditional methods, though at higher computational cost. Subsequent advancements incorporated recurrent neural networks (RNNs) to better capture sequential dependencies, overcoming the fixed-window constraints of feedforward models. In 2010, Tomáš Mikolov et al. developed the recurrent neural network language model (RNNLM), which utilized a simple RNN architecture to maintain a hidden state representing arbitrary-length history, trained efficiently with techniques like class-based factorization of the softmax for normalization. Empirical evaluations on speech recognition tasks demonstrated RNNLM's superiority, with perplexity reductions of up to 20% over n-gram baselines and substantial word error rate improvements (e.g., 10-15% relative gains on large corpora like Switchboard). These neural methods gained traction through practical implementations and hardware advances, such as GPUs, which mitigated training inefficiencies; by the mid-2010s, they consistently outperformed statistical models in downstream applications like machine translation and ASR, paving the way for deeper architectures. The core innovation—learning continuous, dense representations—facilitated semantic understanding absent in discrete n-gram probabilities, though challenges like vanishing gradients in standard RNNs prompted refinements such as long short-term memory (LSTM) units, introduced earlier in 1997 but increasingly applied to language tasks post-2010.
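
A minimal sketch of a Bengio-style feedforward neural language model in Python (using PyTorch); vocabulary size, dimensions, and the dummy batch are illustrative placeholders rather than settings from the original 2003 model.

```python
import torch
import torch.nn as nn

# Embed a fixed window of context words, pass through a tanh hidden layer,
# and predict a distribution over the vocabulary.
class FeedforwardLM(nn.Module):
    def __init__(self, vocab_size=10_000, context_size=4, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)    # shared across context positions
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):                 # context_ids: (batch, context_size)
        e = self.embed(context_ids)                 # (batch, context_size, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))   # concatenate embeddings, nonlinearity
        return self.output(h)                       # unnormalized next-word logits

model = FeedforwardLM()
context = torch.randint(0, 10_000, (8, 4))          # dummy batch of context windows
targets = torch.randint(0, 10_000, (8,))
loss = nn.functional.cross_entropy(model(context), targets)  # negative log-likelihood
loss.backward()                                      # gradients via backpropagation
print(loss.item())
```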

Scaling Era and Transformer Dominance

The scaling era in language modeling emerged in the late 2010s, driven by growth in computational resources and data availability, which enabled training of models with billions of parameters and demonstrated predictable performance gains via power-law relationships in loss reduction. Empirical studies revealed that loss scales as a power law with model size N, dataset size D, and compute C, approximately as L(N, D, C) \propto N^{-\alpha} D^{-\beta} C^{-\gamma}, where exponents \alpha \approx 0.076, \beta \approx 0.103, and \gamma \approx 0.050 hold across varied architectures, justifying investments in larger scales for diminishing but consistent returns. This period shifted focus from architectural innovation to resource scaling, as larger models exhibited emergent abilities like few-shot learning without task-specific fine-tuning. The transformer architecture, introduced in June 2017, underpinned this dominance by eschewing recurrent layers in favor of self-attention mechanisms, which compute dependencies between all sequence elements in parallel rather than sequentially. This design overcame limitations of recurrent neural networks, such as vanishing gradients and inefficient handling of long contexts, allowing transformers to process sequences up to thousands of tokens with quadratic complexity in length but superior parallelizability on GPUs. Causal masking in decoder-only variants, like those in the GPT series, further aligned transformers with autoregressive language modeling by restricting attention to prior tokens, enabling efficient next-token prediction central to generative tasks. Key milestones included OpenAI's GPT-3, detailed in a May 2020 paper, which scaled to 175 billion parameters trained on approximately 570 gigabytes of filtered text, achieving state-of-the-art few-shot performance on benchmarks like SuperGLUE without gradient updates on downstream data. Subsequent refinements, such as optimal compute allocation balancing model size and data (e.g., equal scaling of N and D for fixed C), reinforced the transformer's scalability, as larger models proved more sample-efficient than smaller ones under equivalent compute budgets. By the early 2020s, transformers supplanted prior paradigms due to their ability to capture long-range syntactic and semantic dependencies via multi-head attention, with ablation studies confirming attention's causal role in performance over alternatives like convolutions or recurrences. This architectural edge, combined with hardware advances like TPUs and multi-node training, established transformers as the dominant architecture, powering models from proprietary systems to open-source efforts exceeding trillion-parameter scales.

Architectures and Types

N-Gram and Statistical Precursors

Statistical language models based on n-grams served as foundational precursors to modern neural language models, relying on probabilistic estimation from empirical word sequences rather than learned representations. These models approximate the conditional probability of a word given its preceding context by considering only the immediately prior n-1 words, leveraging the Markov assumption that P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1}). This fixed-order approximation stems from early applications of Markov chains to text prediction, with roots in Andrey Markov's analysis of letter sequences in Pushkin's Eugene Onegin, later extended to words. Early conceptual groundwork was laid by Claude Shannon in his 1951 study on the entropy of printed English, where human subjects and Markov models of increasing order (up to 15 for letters) were used to estimate redundancy and predict text, yielding an entropy estimate of approximately 1.3 bits per letter after accounting for dependencies. Practical statistical modeling gained traction in the 1970s through Frederick Jelinek's work at IBM on continuous speech recognition, where n-gram models were integrated into probabilistic decoding frameworks to score word sequences. The first significant advancement in n-gram estimation came in 1980 with Jelinek and Mercer's interpolated linear smoothing method, which combined lower-order probabilities to mitigate data sparsity in higher-order models. Subsequent refinements addressed the challenge of unseen n-grams in finite corpora, a core limitation causing zero probabilities. Katz's 1987 back-off technique recursively falls back to lower-order models for unobserved events while discounting seen ones using Good-Turing estimates, which allocate probability mass to unseen types based on the frequency of singletons. Jelinek-Mercer interpolation weighted higher- and lower-order estimates directly, while later methods like Kneser-Ney (1994) incorporated absolute discounting with refined counts to better capture lexical diversity. These techniques enabled trigram models to achieve perplexities around 109 on corpora like the Wall Street Journal, outperforming bigrams (170) and unigrams (962), though higher n remained computationally infeasible due to exponential growth in parameters (e.g., ~20 billion for 4-grams on large vocabularies). N-gram models found primary application alongside acoustic models in speech recognition and in early statistical machine translation, as in Brown et al.'s 1990 IBM models, which used trigrams to model fluency in target languages. Despite successes in perplexity reduction through smoothing and class-based partitioning (e.g., Brown et al. 1992), inherent limitations—such as inability to capture long-range dependencies beyond fixed n, sensitivity to out-of-vocabulary words, and reliance on massive corpora for sparse events—prompted the shift toward neural architectures in the early 2000s. These statistical precursors emphasized empirical frequency over semantic understanding, establishing evaluation via perplexity as a standard metric for predictive accuracy that persists in neural successors.
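
The following Python sketch illustrates Jelinek-Mercer-style linear interpolation of trigram, bigram, and unigram maximum-likelihood estimates on a toy corpus; the interpolation weights are invented, whereas production systems tune them on held-out data.

```python
from collections import Counter

# Mix trigram, bigram, and unigram ML estimates with fixed weights so that
# unseen higher-order events still receive probability from lower orders.
corpus = "the cat sat on the mat and the cat slept on the mat".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def interpolated_prob(w1, w2, w3, lambdas=(0.6, 0.3, 0.1)):
    l3, l2, l1 = lambdas
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_uni = uni[w3] / N
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# An unseen trigram over in-vocabulary words still gets nonzero probability.
print(interpolated_prob("the", "cat", "on"))
print(interpolated_prob("the", "cat", "sat"))
```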

Recurrent and Sequence Models

Recurrent neural networks (RNNs) process sequential data by maintaining a hidden state that captures information from previous timesteps, enabling them to model dependencies in language sequences for tasks like next-word prediction. In language modeling, a basic RNN takes an input sequence of words represented as vectors and updates its hidden state h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h), where \sigma is an activation function like tanh, to compute the probability of the next word via a softmax over the output layer. This architecture allows RNNs to theoretically handle variable-length inputs, addressing limitations of fixed-context n-gram models, though early applications in the 1980s focused more on general sequence prediction than large-scale language modeling. A key advancement came with Tomáš Mikolov's RNN-based language model (RNNLM) in 2010, which integrated a simple RNN with a softmax output layer to predict words in speech recognition tasks, achieving perplexity reductions of up to 20% over traditional n-gram models on standard benchmark corpora. However, vanilla RNNs suffered from vanishing or exploding gradients during backpropagation through time, making it difficult to learn long-range dependencies beyond 5-10 timesteps, as gradients diminish exponentially with sequence length. To mitigate these issues, long short-term memory (LSTM) units, introduced by Hochreiter and Schmidhuber in 1997, incorporate gating mechanisms—an input gate, forget gate, and output gate—to selectively update and retain information in a cell state, allowing effective capture of dependencies over hundreds of timesteps. In language modeling, LSTMs demonstrated superior performance; for instance, Sundermeyer et al. in 2012 reported relative perplexity improvements of about 8% on English and large French corpora compared to standard recurrent neural networks. LSTMs became a staple for sequence modeling, powering early neural machine translation and text generation by maintaining contextual memory without full sequence recomputation. Gated recurrent units (GRUs), proposed by Cho et al. in 2014, simplify LSTMs by merging the forget and input gates into a single update gate and eliminating the separate output gate, reducing parameters by roughly 25% while retaining comparable performance on sequence tasks. Empirical comparisons in language modeling show GRUs training 20-30% faster than LSTMs due to fewer computations, with negligible differences on datasets like WikiText-2, though LSTMs may edge out on very long dependencies. Despite these refinements, recurrent models face inherent limitations in language modeling, including sequential processing that precludes efficient parallelization across timesteps, leading to training times scaling linearly with sequence length—unlike the constant-depth parallel operations in later architectures. Additionally, even gated variants struggle with extremely long contexts (e.g., beyond 1000 tokens) due to accumulated numerical instability and attention dilution, prompting shifts toward attention-based mechanisms by the mid-2010s. These constraints were empirically evident in large-scale experiments, where recurrent models plateaued in perplexity gains as datasets grew to billions of tokens, underscoring their role as transitional architectures rather than scalable solutions for modern large-scale language modeling.
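
A minimal NumPy sketch of the vanilla RNN recurrence given above, producing a next-token distribution from a hidden state that summarizes the history; weights are random placeholders, not trained parameters.

```python
import numpy as np

# h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), followed by a softmax over the vocabulary.
rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 50, 16, 32

W_xh = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(vocab_size)
embeddings = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_next_token_distribution(token_ids):
    """Run the recurrence over a token sequence and return P(next token | history)."""
    h = np.zeros(hidden_dim)
    for t in token_ids:
        h = np.tanh(W_xh @ embeddings[t] + W_hh @ h + b_h)   # hidden state carries history
    return softmax(W_hy @ h + b_y)

probs = rnn_next_token_distribution([3, 17, 42])
print(probs.shape, probs.sum())   # (50,) and ~1.0
```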

Transformer-Based and Large-Scale Variants

The transformer architecture, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., marked a paradigm shift in sequence modeling by replacing recurrent layers with self-attention mechanisms, enabling parallel computation across sequences and capturing long-range dependencies more effectively than prior recurrent neural networks (RNNs). This design consists of stacked encoder and decoder blocks, where multi-head self-attention computes weighted representations of input tokens relative to each other, scaled by dot-product similarity and normalized via softmax, followed by feed-forward networks and layer normalization. Transformers initially excelled in machine translation but rapidly adapted to language modeling by autoregressively predicting the next token in a sequence, leveraging positional encodings to preserve order information absent in pure attention. Decoder-only transformer variants, pioneered by OpenAI's GPT series, focus on unidirectional generation for causal language modeling, omitting the encoder to prioritize efficient autoregressive inference. GPT-1, released in June 2018 with 117 million parameters trained on the BookCorpus dataset, demonstrated strong transfer to downstream tasks, outperforming prior baselines in zero-shot settings. GPT-2, announced in February 2019 with a 1.5 billion parameter model trained on WebText (8 million web pages), showed text generation capabilities approaching human-like coherence, though initially withheld due to misuse concerns before partial release. GPT-3, unveiled in May 2020 with 175 billion parameters trained on 570 gigabytes of filtered Common Crawl plus books corpora and Wikipedia, scaled predictably in performance, achieving strong few-shot results on benchmarks like SuperGLUE without task-specific fine-tuning, attributed to increased model capacity and data volume. Encoder-only transformers, such as BERT (Bidirectional Encoder Representations from Transformers) from Google, released in October 2018, enable bidirectional context for masked language modeling and next-sentence prediction, pretraining on 3.3 billion words from BooksCorpus and English Wikipedia to yield embeddings fine-tuned for downstream tasks like question answering. Variants like T5 (Text-to-Text Transfer Transformer), introduced by Google in October 2019, unify tasks under a text-to-text framework with an encoder-decoder setup, scaling to 11 billion parameters and demonstrating that framing all problems as generation improves versatility. Large-scale models, often exceeding 100 billion parameters, rely on massive distributed training: for instance, PaLM (Pathways Language Model) from Google, with 540 billion parameters trained in 2022 on 780 billion tokens using Pathways infrastructure, highlighted multilingual and reasoning gains from compute-intensive scaling. Empirical scaling laws, formalized by Kaplan et al. in 2020, quantify that language model loss decreases as a power law with model size (N), dataset size (D), and compute (C ≈ 6ND), with loss scaling as L(N) ∝ N^{-α} where α ≈ 0.076 for parameters, guiding efficient resource allocation. Hoffmann et al.'s 2022 analysis refined this, finding compute-optimal models balance parameters and data at roughly 20 tokens per parameter, as in the 70 billion parameter Chinchilla model outperforming the larger but data-underdense Gopher on BIG-Bench, underscoring that naive parameter scaling without proportional data yields diminishing returns. These laws, validated across models up to trillions of parameters like Google's 2023 PaLM 2 (reported at up to 340 billion parameters), explain performance predictability but also reveal plateaus in certain capabilities, such as factual recall, limited by data coverage over sheer scale.
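
The self-attention and causal-masking mechanism described above can be sketched in a few lines of NumPy for a single head; multi-head attention, per-head projections, residual connections, and layer normalization are omitted for brevity.

```python
import numpy as np

# Single-head causal self-attention: each position attends only to itself and
# earlier positions, with weights from scaled dot products. Shapes and random
# values are illustrative.
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))          # token representations

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)              # scaled dot-product similarity
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[causal_mask] = -np.inf                    # block attention to future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V                             # context-mixed representations
print(output.shape)                              # (5, 8)
```
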
Open-source efforts, including Meta's Llama series (e.g., Llama 2 in July 2023 with 70 billion parameters trained on 2 trillion tokens), democratized access while emphasizing responsible scaling through safety fine-tuning. By 2025, proprietary models like OpenAI's GPT-4 (parameter count undisclosed but estimated >1 trillion) and xAI's Grok-1 (314 billion parameters, released November 2023) continued this trend, integrating multimodal extensions while prioritizing efficiency via techniques like mixture-of-experts (MoE) sparsity, as in Grok-1's design for reduced active parameters during forward passes.

Training and Optimization

Data Acquisition and Preparation

Data acquisition for large language models primarily relies on vast web-scale corpora, with Common Crawl serving as the foundational source due to its comprehensive, freely available snapshots of the web, comprising petabytes of raw page data from monthly crawls since 2008. This dataset has been integral to training models like GPT-3 and BLOOM, often comprising 60-80% of pretraining corpora after downsampling to manage scale and quality. Supplementary sources include digitized books (e.g., via Project Gutenberg or proprietary scans), academic publications from arXiv, code from GitHub repositories, and specialized datasets like news archives or scientific texts to enhance domain-specific coverage. Curated public datasets such as C4 (derived from Common Crawl with basic cleaning), The Pile (825 GB across 22 diverse subsets), and OSCAR (multilingual web extracts) aggregate these to provide trillions of tokens, enabling models to capture broad linguistic patterns without proprietary dependencies. Preparation begins with extraction, parsing raw formats like WARC files from Common Crawl to isolate textual content while discarding non-text elements such as scripts, ads, and navigation boilerplate using tools like Boilerpipe or rules based on document structure. Cleaning follows, applying filters for minimum length (e.g., over 3 words), language detection to retain primary languages like English, and quality scoring via small proxy models to exclude low-quality or nonsensical text, which can constitute up to 50% of raw web data. Deduplication is critical to prevent memorization and reduce training redundancy, employing methods like exact hashing for identical documents, locality-sensitive hashing for fuzzy near-duplicate matches at trillion-token scales, or embedding-based clustering, yielding efficiency gains of 20% or more in convergence speed as demonstrated in controlled pretraining experiments. Further quality filtering uses classifiers trained on heuristics or lightweight models to remove toxic, repetitive, or off-topic content, with pipelines like FineWeb demonstrating that heuristic-based selection (e.g., for educational value via classifier scores) can distill 15 trillion tokens from Common Crawl into higher-utility subsets outperforming unfiltered baselines on downstream tasks. Tokenization concludes the pipeline, converting cleaned text into subword units via algorithms like Byte-Pair Encoding (BPE) or Unigram, which compress the vocabulary to 50,000-100,000 tokens while handling rare words through merging frequent pairs, essential for efficient model input as raw characters would explode sequence lengths. These steps collectively transform noisy, heterogeneous inputs into coherent token sequences, with ablations showing that rigorous preparation correlates with improved downstream performance, though unaddressed biases in web-sourced data—such as overrepresentation of English-centric content—persist as inherent limitations.
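
A simplified Python sketch of the cleaning and deduplication steps described above, using a minimum-length filter, a crude ASCII-ratio stand-in for language identification, and exact hash-based deduplication; production pipelines such as those behind C4 or FineWeb use much richer filters and fuzzy (MinHash) deduplication.

```python
import hashlib
import re

# Toy document set; real corpora contain billions of extracted pages.
documents = [
    "The chain rule factorizes sequence probabilities into conditionals.",
    "The chain rule factorizes sequence probabilities into conditionals.",  # exact duplicate
    "click here to subscribe",                                              # too short / boilerplate
    "Les modèles de langue estiment des probabilités.",                     # non-target language
]

def keep(doc):
    words = doc.split()
    if len(words) < 5:                     # minimum-length filter
        return False
    ascii_ratio = len(re.findall(r"[a-zA-Z ]", doc)) / max(len(doc), 1)
    return ascii_ratio > 0.95              # crude stand-in for language ID

seen_hashes = set()
cleaned = []
for doc in documents:
    digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
    if digest in seen_hashes or not keep(doc):
        continue                           # drop duplicates and filtered docs
    seen_hashes.add(digest)
    cleaned.append(doc)

print(len(cleaned), "documents retained")   # 1
```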

Parameter Scaling and Empirical Laws

Empirical scaling laws in language models describe predictable relationships between training resources—such as the number of parameters N, dataset size D, and compute C—and model performance, typically measured by cross-entropy loss on held-out data. These laws emerged from systematic experiments showing that loss decreases as a power law with increases in each resource when others are held fixed. Kaplan et al. (2020) first quantified this by training transformer-based models spanning several orders of magnitude in size on datasets up to 300 billion tokens, finding that validation loss L follows L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + L_\infty for parameters, with \alpha_N \approx 0.076, and analogous forms for D (\alpha_D \approx 0.095) and C (\alpha_C \approx 0.050), where L_\infty represents an irreducible floor. The exponents indicate diminishing returns, but the smooth, unbroken behavior across orders of magnitude suggested that performance gains would persist with further scaling, challenging prior assumptions of abrupt saturation. Under compute constraints, where C \propto N \cdot D for transformer training (approximating FLOPs as 6ND), Kaplan et al. derived an optimal allocation favoring larger N over D, predicting that model size should scale as N \propto C^{0.73} and data as D \propto C^{0.27}. This informed early large-scale efforts like GPT-3 (175 billion parameters trained on approximately 300 billion tokens), which aligned roughly with compute-optimal paths and demonstrated broad capability improvements. However, subsequent analysis revealed inefficiencies: Hoffmann et al. (2022) re-evaluated scaling across models up to 280 billion parameters and found that prior large models were severely data-limited, with optimal scaling requiring N \propto C^{0.5} and D \propto C^{0.5}, emphasizing balanced growth in parameters and data to minimize loss for a given compute budget. They validated this by training Chinchilla, a 70-billion-parameter model on 1.4 trillion tokens using roughly the same compute as the 280-billion-parameter Gopher (trained on 300 billion tokens), achieving a 7% higher average accuracy on the MMLU benchmark (67.5% vs. 59.7%) and lower loss across evaluations. These laws have guided resource allocation in subsequent models, with empirical validation extending to trillion-parameter scales, though exponents vary slightly by architecture and data quality. For instance, mixture-of-experts models decouple active parameters from total N, yielding adjusted scaling where effective compute efficiency alters the N-C relationship. Recent work confirms power-law predictability holds for inference-time scaling, where performance improves with additional compute via techniques like test-time training or chain-of-thought prompting, following L \propto M^{-\beta} for inference FLOPs M. However, deviations arise with high-quality or synthetic data, where steeper loss reduction per unit of resource can occur, and real-world limits like data scarcity or hardware constraints challenge indefinite extrapolation. The empirical nature of these laws—derived from curve-fitting experimental runs rather than theoretical proofs—underscores their utility for prediction but highlights risks of breakdown beyond probed regimes, as seen in varying task-specific exponents (e.g., shallower scaling for reasoning benchmarks).
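
Under the approximations discussed above (C ≈ 6ND and roughly 20 training tokens per parameter), compute-optimal model and data sizes can be estimated with a few lines of Python; the constants are published rules of thumb, and the resulting figures are rough estimates rather than training prescriptions.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e21, 1e23, 1e25):
    n, d = chinchilla_optimal(c)
    print(f"C={c:.0e} FLOPs -> ~{n / 1e9:.1f}B parameters, ~{d / 1e12:.2f}T tokens")
```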

Alignment and Fine-Tuning Methods

Fine-tuning adapts pre-trained language models to downstream tasks or desired behaviors by continuing training on smaller, curated datasets, typically using objectives like next-token prediction on instruction-response pairs. Supervised fine-tuning (SFT), also known as instruction tuning, involves training on high-quality, human-annotated examples where inputs are prompts and outputs are desired responses, enabling models to follow instructions more effectively than zero-shot prompting alone. This method has been empirically shown to improve task performance on benchmarks like GLUE and SuperGLUE, though it risks overfitting to the fine-tuning distribution if data quality is low. Alignment extends these methods to steer models toward human-preferred outputs, emphasizing helpfulness, honesty, and harmlessness, often addressing issues like toxicity or refusal to answer unsafe queries. Reinforcement learning from human feedback (RLHF) is a prominent technique, applied at scale to instruction-following models by OpenAI in 2022, where human annotators rank model outputs for quality, training a reward model to score responses, followed by policy optimization using proximal policy optimization (PPO) to maximize rewards while staying close to the SFT baseline. RLHF significantly reduced harmful outputs in models like InstructGPT, with evaluations showing up to 80% preference alignment on held-out tasks, but it scales poorly due to high annotation costs and can induce sycophancy or reward hacking, where models exploit the reward model rather than truly internalizing values. Alternatives to RLHF mitigate these issues by avoiding explicit reward modeling. Direct preference optimization (DPO), proposed in 2023, directly fine-tunes the language model on preference pairs using a closed-form classification-style loss that implicitly derives an optimal policy from rankings, bypassing RL instability and achieving comparable or superior alignment on datasets like HH-RLHF without PPO's computational overhead. Empirical results demonstrate DPO converging faster and yielding less variance in outputs, though it assumes access to a reference model for regularization. Constitutional AI, developed by Anthropic in 2022, uses self-supervised critique and revision guided by a predefined "constitution" of principles (e.g., avoiding harm or bias), reducing reliance on human labels by having the model generate and evaluate its own outputs against rules, which improved harmlessness scores by 20-30% over baselines in internal tests while enhancing transparency. These methods highlight ongoing trade-offs: while effective for surface-level behaviors, deeper causal misalignment persists, as evidenced by persistent hallucinations and jailbreak vulnerabilities in aligned models.
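
A minimal PyTorch sketch of the DPO objective described above, applied to dummy sequence log-probabilities; in practice these values come from forward passes of the policy being trained and of the frozen reference model over chosen and rejected responses.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss: push the policy to prefer chosen over rejected responses,
    implicitly regularized toward the reference model."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Dummy per-example sequence log-probabilities (batch of 4 preference pairs).
policy_chosen = torch.tensor([-12.0, -9.5, -15.2, -8.1])
policy_rejected = torch.tensor([-13.1, -9.2, -16.0, -10.4])
ref_chosen = torch.tensor([-12.5, -9.8, -15.0, -8.5])
ref_rejected = torch.tensor([-12.9, -9.1, -15.8, -10.0])

print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```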

Evaluation Frameworks

Intrinsic Measures of Predictability

Intrinsic measures of predictability evaluate a language model's core capability to forecast subsequent tokens given prior context, relying solely on the model's probability distributions over the vocabulary rather than performance on external tasks. These metrics quantify the model's uncertainty or "surprise" when encountering test data, providing a direct gauge of predictive fidelity independent of application-specific outcomes. The most widely adopted such measure is perplexity (PPL), which serves as a proxy for the model's average branching factor—the effective number of choices it considers plausible at each prediction step. Perplexity is computed as the exponential of the average negative log-likelihood of a test sequence under the model's predictions: for a sequence of n tokens w_1, \dots, w_n, \mathrm{PPL} = \exp\left(-\frac{1}{n} \sum_{i=1}^n \log P(w_i \mid w_1, \dots, w_{i-1})\right). This formulation derives from information theory, where lower perplexity reflects higher predictability, akin to the model being less "perplexed" by the data; for instance, a PPL of 10 implies the model behaves as if selecting from 10 equally likely options on average per token. Cross-entropy loss underpins this, measuring the divergence between the empirical distribution p and the model's predicted distribution q as H(p, q) = -\sum p \log q, with perplexity as e^{H(p, q)} in natural log units. Bits-per-character (BPC), another related metric, normalizes cross-entropy (in bits) by sequence length in characters, facilitating comparisons across languages or tokenization granularities by emphasizing compression efficiency. These measures are typically assessed on held-out corpora such as WikiText-103 or the Penn Treebank, where models like GPT-3 achieved perplexities around 20-30 on English text by 2020, improving with scale; for example, larger models under scaling laws reduced PPL predictably with compute. However, perplexity's intrinsic nature limits its scope: it prioritizes fluent token prediction but does not ensure factual accuracy, semantic coherence, or robustness to adversarial inputs, as models can memorize training data to lower PPL without generalizing causally. Recent advancements address tokenizer disparities—different subword schemes (e.g., BPE vs. SentencePiece) inflate or deflate raw PPL—via normalized variants like weighted perplexity, which adjust for vocabulary size and token length distributions to enable fair cross-model comparisons. Empirical studies confirm perplexity's correlation with downstream capabilities in controlled settings, yet divergences arise; for instance, over-optimized models may exhibit low perplexity on in-distribution data while hallucinating on novel prompts, underscoring that predictability alone proxies fluency rather than understanding. Bits-per-character complements perplexity by revealing sub-token inefficiencies, with human-language BPC baselines around 1-1.5 bits for English, against which large models approached 1.2 by 2022. Despite these utilities, intrinsic metrics undervalue long-context dependencies, where predictive quality can degrade without architectural mitigations like transformers' long-range attention.
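
The relationships among cross-entropy, perplexity, and bits-per-character can be shown with a short Python calculation; the per-token loss and characters-per-token figures below are invented for illustration.

```python
import math

# Convert an average negative log-likelihood (nats per token) into the
# intrinsic metrics discussed above.
nll_nats_per_token = 3.0      # average -log P(token | context) in nats (made up)
avg_chars_per_token = 4.2     # depends on the tokenizer and the text (made up)

perplexity = math.exp(nll_nats_per_token)                   # e^{H} per token
bits_per_token = nll_nats_per_token / math.log(2)           # nats -> bits
bits_per_char = bits_per_token / avg_chars_per_token        # normalize by characters

print(f"perplexity         = {perplexity:.1f}")
print(f"bits per token     = {bits_per_token:.2f}")
print(f"bits per character = {bits_per_char:.2f}")
```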

Task-Specific Benchmarks

Task-specific benchmarks evaluate language models on predefined tasks using standardized datasets, metrics such as accuracy, F1-score, or exact match, and often involve multiple-choice, classification, or generation subtasks to measure capabilities like natural language understanding, commonsense reasoning, or problem-solving. These differ from intrinsic predictability measures by focusing on downstream applications rather than raw perplexity, though saturation in older benchmarks has prompted development of harder variants. Empirical performance on these benchmarks correlates with scaling laws, where larger models trained on more data achieve higher scores, but results must account for potential data contamination from training corpora. The GLUE benchmark, introduced in January 2018, aggregates nine tasks including single-sentence tasks (e.g., CoLA for linguistic acceptability, SST-2 for sentiment polarity) and sentence-pair tasks (e.g., MNLI for natural language inference, QQP for paraphrase detection). Scores are computed per task—such as Matthews correlation for CoLA or Pearson correlation for semantic textual similarity (STS-B)—and averaged into a single GLUE score, with human baselines around 80-90% but early models like BERT achieving 80.5% in 2018. By 2023, large models exceeded 90%, indicating saturation and limited differentiation for advanced systems. SuperGLUE, released in May 2019 as a more challenging successor, includes eight tasks emphasizing coreference resolution (WSC), word-in-context disambiguation (WiC), and reading comprehension (e.g., ReCoRD), with metrics like exact match for generation tasks and accuracy for classification. It incorporates longer contexts and adversarial examples to probe deeper reasoning, where human performance averages 89.8% but top models like T5-11B reached 89.1% by 2020; however, discrepancies in leaderboard rankings for some models suggest inconsistencies possibly stemming from evaluation protocols. Knowledge-intensive benchmarks like MMLU (Massive Multitask Language Understanding), proposed in September 2020, test factual recall and reasoning across 57 subjects (e.g., history, law, and medicine) via roughly 14,000 multiple-choice questions at professional or high-school levels, scored by accuracy with chain-of-thought prompting boosting results. Models like GPT-4 achieve 86.4% in 2023, approaching expert levels in some domains but revealing gaps in abstract reasoning. Commonsense reasoning tasks such as HellaSwag (2019), with 70,000 sentence-completion items derived from video captions and adversarial filtering, use accuracy to assess plausible continuation prediction, where top models score 95%+ but falter on subtle inferences. Domain-specific benchmarks target specialized skills: GSM8K (2021) comprises 8,500 grade-school math word problems requiring multi-step arithmetic reasoning, evaluated by exact match accuracy, with large models reaching around 58% via chain-of-thought prompting but highlighting symbolic manipulation weaknesses. HumanEval (2021), for code generation, presents 164 programming problems evaluated via functional correctness, using pass@1 (first-attempt success) or pass@k metrics; GPT-3.5 scores 48.1% pass@1, while specialized fine-tuning elevates this to over 70% in later models, though it exposes brittleness to edge cases. These benchmarks collectively reveal scaling benefits but underscore needs for robustness against distribution shifts; a pass@k estimation sketch follows the summary table below.
| Benchmark | Introduction Year | Key Tasks | Primary Metric | Example Top Score (Model, Year) |
| --- | --- | --- | --- | --- |
| GLUE | 2018 | NLI, sentiment, acceptability | Averaged task scores | 91.3% (DeBERTa, 2021) |
| SuperGLUE | 2019 | WSC, WiC, ReCoRD | Averaged task scores | 89.1% (T5-11B, 2020) |
| MMLU | 2020 | Multi-subject MCQs | Accuracy | 86.4% (GPT-4, 2023) |
| HellaSwag | 2019 | Commonsense completion | Accuracy | 95.3% (2021) |
| GSM8K | 2021 | Math word problems | Exact match | 74.4% (2022) |
| HumanEval | 2021 | Code synthesis | Pass@1 | 67.0% (2021) |
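
For functional-correctness benchmarks such as HumanEval, the commonly used unbiased pass@k estimator can be computed as in the Python sketch below; the per-problem sample counts are invented for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k random samples
    passes, given n samples drawn of which c pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

problems = [(20, 3), (20, 0), (20, 12)]   # (samples drawn, samples passing) per problem
for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for n, c in problems) / len(problems)
    print(f"pass@{k} = {score:.3f}")
```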

Comparative Performance Analysis

Language models are evaluated comparatively through standardized benchmarks that measure capabilities such as multitask knowledge (MMLU), commonsense reasoning (HellaSwag), scientific reasoning (GPQA), coding proficiency (HumanEval), and overall user preference via crowdsourced platforms like the LMSYS Chatbot Arena. These metrics reveal scaling trends where larger parameter counts and refined post-training correlate with improved scores, though convergence and benchmark saturation are evident among frontier models. However, benchmarks face limitations including potential contamination from training corpora, over-optimization by developers, and failure to capture long-tail real-world robustness or reasoning depth. Crowdsourced arenas mitigate some issues by incorporating human judgments on helpfulness and style but introduce subjective biases and may favor verbose or safety-aligned responses over raw capability. As of mid-2024, models like OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet lead with MMLU scores of 88.7%, closely trailed by Meta's open-source Llama 3.1 405B at 88.6%. xAI's Grok-2 achieves 87.5% on MMLU, demonstrating competitive knowledge recall while emphasizing uncensored outputs that may diverge from safety-tuned competitors. On coding benchmarks, Claude 3.5 Sonnet scores 92.0% on HumanEval, surpassing GPT-4o's 90.2%, Llama 3.1 405B's 89.0%, and Grok-2's 88.4%. These narrow margins highlight convergence driven by compute-intensive scaling, yet open models like Llama enable broader verification and adaptation, reducing reliance on black-box evaluations.
| Model | MMLU (%) | HumanEval (%) | GPQA (%) | LMSYS Arena Elo |
| --- | --- | --- | --- | --- |
| GPT-4o | 88.7 | 90.2 | 53.6 | 1286 |
| Claude 3.5 Sonnet | 88.7 | 92.0 | 59.4 | 1272 |
| Llama 3.1 405B | 88.6 | 89.0 | 51.1 | 1264 |
| Grok-2 | 87.5 | 88.4 | N/A | ~1250 |
In the LMSYS Chatbot Arena, GPT-4o ranks highest with an Elo score of approximately 1286, reflecting user preferences for speed and versatility, while open models lag slightly due to deployment differences rather than intrinsic limits. Differences often stem from post-training choices: safety-focused alignment in Claude boosts refusal rates on edge cases, potentially inflating perceived reliability but constraining utility in unrestricted domains. Empirical evidence suggests that raw pre-training compute, not architecture alone, drives most gains, with transformers remaining dominant despite alternatives like state-space models showing promise in efficiency but not yet surpassing them in accuracy. These evaluations underscore that no single model dominates all tasks; for instance, some models excel in multilingual settings, while Grok prioritizes real-time data integration for timeliness over static benchmark optimization.

Capabilities and Deployments

Core Linguistic Tasks

Large language models (LLMs) handle core linguistic tasks through probabilistic prediction of linguistic structures, drawing on patterns learned from massive text corpora during pre-training. These tasks encompass morphology (inflection and word formation), syntax (grammatical structure and acceptability), semantics (meaning representation and entailment), and pragmatics (contextual implicature and inference). Empirical evaluations show LLMs achieving high proficiency in syntax and semantics via benchmarks like GLUE and SuperGLUE, where tasks such as linguistic acceptability (CoLA) and natural language inference (MNLI) test these abilities, with state-of-the-art models saturating scores above 90% on aggregate metrics. However, performance derives statistically from data correlations rather than explicit rule internalization, leading to robustness in standard cases but vulnerabilities to adversarial perturbations. In morphology, LLMs generate and recognize word forms across languages, such as verb conjugations or noun plurals, by modeling distributional regularities. For example, transformer-based models excel in inflectional tasks on datasets like UniMorph, predicting forms with accuracy rates surpassing 95% for high-resource languages in few-shot settings, as scaling parameters enhances capture of rare morphological patterns. This capability supports applications in language generation but falters on low-resource languages or systematic gaps in training data, where overgeneralization occurs. Syntactic processing involves assessing sentence well-formedness and hierarchical structure, where LLMs outperform traditional parsers on benchmarks like the Corpus of Linguistic Acceptability, achieving strong performance on human-labeled acceptability judgments (e.g., 60-70% accuracy on challenging items, exceeding earlier RNN models). Studies identify dedicated neural subspaces in LLMs corresponding to syntactic processing, comprising about 1% of parameters yet driving generalization to unseen constructions. Nonetheless, causal interventions reveal that syntax emerges as a byproduct of next-token prediction, not isolated modular knowledge, enabling efficient zero-shot generalization but susceptibility to long-range agreement errors in complex sentences. Semantic tasks, including entailment and similarity judgments, leverage contextual embeddings to infer relations, with LLMs scoring above 90% on Multi-Genre Natural Language Inference (MNLI) subsets of GLUE. Vector arithmetic in embedding spaces approximates analogies (e.g., king - man + woman ≈ queen), reflecting distributional semantics, though this breaks under compositionality demands. Pragmatics presents greater challenges, as LLMs inconsistently handle implicatures; while GPT-4 exceeds human averages (4.80 vs. ~4.0) on scalar and manner implicature tests, it underperforms on context-dependent benchmarks like PUB, scoring below 70% without prompting refinements, due to literal biases in training objectives. Multi-agent setups or chain-of-thought prompting mitigate this, boosting pragmatic reasoning by simulating cooperative communication. Overall, scaling correlates with improved linguistic fidelity, but persistent gaps in pragmatic nuance underscore statistical approximation over genuine comprehension.
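
The vector-analogy behavior noted above (king - man + woman ≈ queen) can be illustrated with a toy embedding table and cosine similarity; the four-dimensional vectors below are hand-made for demonstration and do not come from any trained model.

```python
import numpy as np

# Hand-made toy embeddings; learned embeddings from corpora exhibit similar
# (though noisier) geometric structure.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "apple": np.array([0.0, 0.1, 0.1, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
ranked = sorted(
    (w for w in embeddings if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(target, embeddings[w]),
    reverse=True,
)
print(ranked[0])   # "queen" for these toy vectors
```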

Generative and Multimodal Applications

Large language models (LLMs) primarily generate text through autoregressive decoding, predicting subsequent tokens based on prior context, which enables applications in natural language generation such as dialogue systems, summarization, and creative writing. OpenAI's GPT-3, released on June 11, 2020, with 175 billion parameters, demonstrated emergent abilities in zero-shot and few-shot generation tasks, including translation and question answering without task-specific fine-tuning. Similarly, fine-tuned variants like Codex, introduced in August 2021, support code generation from natural language descriptions, powering tools such as GitHub Copilot, which assists developers by suggesting code completions and has been adopted in over 1 million repositories by 2023. These capabilities stem from scaling laws where increased parameters and training data correlate with improved coherence and versatility in output, though outputs often require human verification due to factual inaccuracies. In code-related generative tasks, LLMs have achieved competitive results in structured programming challenges; for example, DeepMind's AlphaCode, a transformer-based model trained on code, solved 34% of problems in competitive programming contests as of February 2022, outperforming average human coders in select metrics but lagging in systematic reasoning. Broader applications extend to domain-specific generation, such as legal document drafting or scientific hypothesis formulation, where models like GPT-4, released March 14, 2023, generate plausible outputs but exhibit limitations in factual grounding and long-term consistency. Empirical evaluations, including the HumanEval benchmark, show LLMs passing 67% of unit tests for Python functions via pass@k metrics, highlighting probabilistic strengths over deterministic correctness guarantees. Multimodal large language models (MLLMs) integrate LLMs with vision or audio encoders, enabling generative applications across modalities, such as describing images or generating text conditioned on visual inputs. OpenAI's GPT-4V, made available in September 2023, supports visual question answering (VQA) and captioning, processing real-world images to output descriptive narratives or answer queries with reported accuracy improvements over prior vision-language models on benchmarks like VQAv2. Google's Gemini 1.0, announced December 6, 2023, handles interleaved text, images, audio, and video for tasks including multimodal reasoning and content synthesis, achieving state-of-the-art scores on MMMU (59.4%) by fusing modalities through a unified architecture. These models facilitate applications in robotics for scene understanding and instruction following, as well as medical image analysis, where MLLMs process diagrams to generate diagnostic hypotheses, though performance varies by domain with accuracies around 50-70% on specialized benchmarks in 2024 evaluations. Despite advances, MLLMs often propagate biases from training data and struggle with spatial reasoning, necessitating hybrid systems for robust deployment.
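
A sketch of the autoregressive decoding loop underlying these generative applications, with temperature and top-k sampling; the dummy logit function stands in for a trained model's forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 100

def next_token_logits(token_ids):
    # Dummy "model": deterministic random logits keyed on context length.
    return np.random.default_rng(len(token_ids)).normal(size=vocab_size)

def sample_next(logits, temperature=0.8, top_k=10):
    logits = logits / temperature                       # sharpen or flatten the distribution
    top = np.argpartition(logits, -top_k)[-top_k:]      # keep the k most likely tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

tokens = [1]                                            # arbitrary start-of-sequence id
for _ in range(10):                                     # generate 10 tokens autoregressively
    tokens.append(sample_next(next_token_logits(tokens)))
print(tokens)
```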

Integration in Systems and Products

Language models are deployed in consumer products primarily as conversational agents and enhancers for user interfaces. For example, OpenAI's ChatGPT, powered by GPT-series models, serves as a standalone application and API endpoint, with over 100 million weekly active users reported in late 2023, enabling integrations into third-party apps for tasks like drafting emails and summarizing documents. Google's Gemini, introduced in 2023 and updated iteratively, is natively integrated into Android operating systems starting with Android 15 in August 2024, facilitating on-device features such as real-time language translation and contextual assistance within apps. In search engines, language models augment query understanding and response generation. Google incorporated Gemini into its search infrastructure by December 2023, allowing multimodal inputs like image-based queries via features such as Circle to Search, which expanded to more devices by mid-2025. This integration processes billions of daily searches, grounding outputs in web data to reduce factual errors. Enterprise systems leverage language models for automation and knowledge work, often through cloud-based APIs or customized deployments. Microsoft integrated OpenAI's GPT-5 into its Copilot ecosystem in August 2025, embedding it across applications for tasks including data analysis in Excel and code review in GitHub, with reported productivity gains of up to 30% in internal pilots. GitHub Copilot, utilizing these models since 2021, provides real-time code suggestions to over 1 million developers, accelerating software development by suggesting completions based on context. In business software, models are woven into CRM and ERP platforms for natural language querying. Salesforce introduced Einstein GPT in 2023, fine-tuned on proprietary data for sales forecasting, handling queries like "summarize pipeline risks" via API calls. Enterprise deployments emphasize secure, scalable architectures, with options for on-premises hosting to address data privacy concerns, as seen in frameworks from providers like Azure AI Foundry, which support model orchestration for hybrid environments. Open-source models like Meta's Llama series enable cost-effective integrations in custom products, such as internal chatbots, though requiring significant engineering for production reliability.

Technical Limitations

Inherent Uncertainties and Hallucinations

Large language models (LLMs) generate text autoregressively by predicting the next token based on conditional probabilities derived from training data, introducing inherent uncertainties due to the sampling process and the absence of grounded verification. This probabilistic mechanism favors high-likelihood sequences that mimic patterns in the training data, but it does not enforce factual accuracy, as models approximate statistical regularities rather than comprehend underlying truths. Empirical evaluations confirm that base LLMs exhibit calibrated uncertainty in predictions—meaning their confidence scores align with accuracy—but this calibration does not prevent deviations from truth when knowledge gaps or ambiguities arise. Hallucinations manifest as the production of plausible yet verifiably false statements, often with undue confidence, stemming from the optimization objective that prioritizes fluency and coherence over empirical verification. Causal factors include imperfections in training corpora, such as factual errors or contradictions, which propagate through pattern extrapolation, and the model's reliance on statistical correlations absent a robust world model for validation. For instance, in tasks requiring long-range reasoning, LLMs may confabulate details by overgeneralizing sparse training signals, as observed in studies where models fabricate references or events unsupported by input prompts. Quantitative assessments reveal that hallucination prevalence varies by domain and model scale: legal queries elicit fabricated content in 58% to 82% of responses from general-purpose LLMs, highlighting risks in high-stakes applications. In clinical evaluations, unmitigated hallucination rates reached 64-68% for case summaries, dropping to 43-45% with prompting adjustments, yet major errors persisted in 44% of hallucinated instances. Scaling model size and instruction-tuning can amplify unreliability for difficult tasks, as larger LLMs increasingly forgo abstention or err on low-concordance problems, per analyses of models up to recent releases. Detection methods, such as semantic entropy measures, identify subsets of hallucinations by quantifying output variability, but these post-hoc tools underscore the foundational challenge: LLMs' token-level predictions inherently conflate plausibility with truth when faced with ambiguous or uncertain inputs. Despite mitigations like retrieval-augmented generation, hallucinations remain irreducible in base architectures without external verification, as the core objective lacks mechanisms for self-correction against causal realities.
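
One family of detection methods quantifies variability across sampled answers. The Python sketch below computes a simple answer-distribution entropy as an uncertainty signal; published semantic-entropy methods additionally cluster semantically equivalent answers before computing entropy, and the sampled strings here are invented.

```python
import math
from collections import Counter

def answer_entropy(samples):
    """Entropy (bits) of the empirical answer distribution over repeated samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Repeated samples to the same prompt: consistent answers suggest confidence,
# inconsistent answers flag likely confabulation.
consistent = ["Paris", "Paris", "Paris", "Paris", "Paris"]
inconsistent = ["1912", "1907", "1923", "1912", "1931"]

print(f"consistent answers:   entropy = {answer_entropy(consistent):.2f} bits")
print(f"inconsistent answers: entropy = {answer_entropy(inconsistent):.2f} bits")
```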

Amplification of Data Biases

Language models acquire biases from their training corpora, which consist predominantly of internet-sourced text reflecting societal patterns, including overrepresentations of certain demographic and ideological viewpoints. These biases are amplified through the training process, as the autoregressive next-token objective reinforces probabilistic correlations in the data, causing models to favor completions that exaggerate underlying skews beyond their frequency in the source material. For example, if training text contains subtle associations linking professions to genders at rates mirroring real-world disparities, the model's learned embeddings and attention dynamics can intensify these links, producing outputs with higher stereotype adherence rates. Empirical studies quantify this effect across domains. In political bias evaluations using sentence continuation benchmarks, models exhibited increasing skew toward liberal framings over multiple generation steps, with bias metrics rising by up to 20-30% relative to initial prompts, independent of synthetic data collapse. Similarly, in moral judgment tasks, large language models displayed amplified cognitive biases, such as a stronger preference for inaction (e.g., 15-25% higher rates than human baselines in dilemma resolutions), stemming from compressed representations of ethical scenarios in training data. Stereotype amplification has been observed in controlled experiments where models, after minimal human-AI interaction loops, output associations (e.g., criminality to ethnic groups) at rates exceeding input data by factors of 1.5-2.0. Political bias amplification is particularly pronounced, with models trained on web corpora—dominated by content from media and institutional outlets showing systemic left-leaning tendencies—generating responses that favor progressive policies at rates 10-40% higher than conservative alternatives on balanced prompts. For instance, evaluations of widely deployed chat systems revealed misalignment with median U.S. voter preferences, amplifying liberal leanings in policy advice (e.g., stronger support for redistribution over market solutions). This occurs mechanistically via feature compression during scaling: larger models distill broader data patterns into sharper ideological modes, exacerbating imbalances as parameter count increases from billions to trillions. Such amplification extends to iterative training on model-generated data, where initial biases compound exponentially, as each cycle reinforces the most probable (skewed) outputs, potentially leading to "bias collapse" in long-term deployments. Mitigation strategies, including targeted fine-tuning on debiased subsets or constitutional prompts, reduce but do not eliminate the issue, as emergent biases reappear in out-of-distribution scenarios due to the causal entanglement of semantics and statistical priors in learned weights. These dynamics highlight a core limitation: while models excel at pattern replication, their learning from correlative data inherently magnifies human flaws, risking downstream reinforcement of societal divides in applications like content generation or decision support.

Resource Demands and Scalability Barriers

Training large language models (LLMs) demands immense computational resources, typically measured in floating-point operations (FLOPs). For instance, models at the scale of GPT-4 require approximately 10^{25} FLOPs for pre-training, equivalent to the output of thousands of high-end GPUs running for months. Over 30 models exceeding 10^{25} FLOPs have been trained as of mid-2025, reflecting rapid growth in compute allocation, with frontier models announced at an average rate of two per month in 2024. These requirements stem from scaling laws, where performance improves predictably as a power law of model parameters, dataset size, and compute budget, as established in early empirical studies. Hardware and energy costs amplify these demands. Training a GPT-4-scale model incurs compute expenses estimated at $78–100 million, driven by specialized accelerators like GPUs or TPUs, with total training compute costs for frontier models doubling every eight months and growing at 2.4x annually. Energy consumption is similarly prohibitive; training involves clusters of thousands of GPUs, contributing to carbon footprints benchmarked across 30 LLMs, where a single run can emit hundreds of tons of CO2 equivalent, alongside substantial water usage for cooling. Projections indicate generative AI's global electricity demand could surge from 8 terawatt-hours (TWh) annually to 652 TWh by 2030, rivaling small nations' usage, underscoring the thermodynamic inefficiencies of dense matrix multiplications in transformer architectures. Scalability faces multifaceted barriers. Data constraints loom large, as high-quality human-generated text may exhaust available sources by the 2030s under continued scaling, forcing reliance on synthetic data whose quality degrades performance gains. Compute-optimal regimes, per updated scaling laws balancing Kaplan's parameter-heavy predictions with Chinchilla's emphasis on data proportionality (e.g., 20 tokens per parameter), still yield logarithmic returns, but hardware bottlenecks like AI chip shortages and fabrication limits cap effective scaling. Economic pressures further hinder progress, as marginal improvements demand disproportionately larger investments, potentially stalling non-state actors while favoring well-resourced entities. Empirical evidence shows no hard "wall" yet, with performance improving reliably even under over-training, but physical limits on energy production and chip fabrication pose causal ceilings absent algorithmic breakthroughs.
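
A back-of-the-envelope Python sketch of training compute using the common FLOPs ≈ 6 · parameters · tokens approximation, converted to GPU-hours under assumed accelerator throughput and utilization; the hardware figures are illustrative assumptions, not vendor specifications.

```python
# Rough training-cost estimate for a Chinchilla-scale run.
params = 70e9                    # 70B-parameter model
tokens = 1.4e12                  # 1.4T training tokens
train_flops = 6 * params * tokens
print(f"training compute ≈ {train_flops:.2e} FLOPs")

peak_flops_per_gpu = 1e15        # ~1 PFLOP/s-class accelerator (assumed)
utilization = 0.4                # assumed model FLOPs utilization
gpu_seconds = train_flops / (peak_flops_per_gpu * utilization)
gpu_hours = gpu_seconds / 3600
print(f"≈ {gpu_hours:,.0f} GPU-hours, or ~{gpu_hours / (1024 * 24):.0f} days on 1,024 GPUs")
```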

Controversies and Broader Impacts

Misinformation and Reliability Debates

Language models generate outputs that can include fabricated details, known as hallucinations, where the system confidently asserts false information due to its reliance on probabilistic token prediction rather than grounded verification. Empirical studies quantify these issues, with hallucination rates in summarization tasks ranging from 50% to 82% across models and prompting methods, even after prompt-based mitigations reduce rates by only marginal amounts. For instance, evaluations of GPT-3.5 and GPT-4 on medical queries showed hallucination rates of 39.6% and 28.6%, respectively, highlighting persistent unreliability in domain-specific factuality.

Reliability debates center on the models' inherent inability to distinguish fact from fiction, as training objectives prioritize fluency over accuracy, rewarding plausible guesses amid data uncertainties. Benchmarks such as FELM and FactBench reveal that while closed-source models like GPT-4 achieve higher factuality scores on curated datasets—often exceeding 70% on closed-book questions—performance degrades on dynamic, real-world queries involving recent events or adversarial inputs, dropping below 50% in temporal misalignment tests. Critics argue this stems from architectural limits, where models mimic patterns without causal comprehension, amplifying errors from biased training corpora that overrepresent certain viewpoints, as evidenced by asymmetric propagation of positive portrayals favoring developers' home countries in geopolitical audits.

Broader concerns involve misinformation amplification, with language models enabling scalable generation of deceptive text, images, and videos, though empirical scoping reviews indicate dual roles: aiding detection but risking unchecked dissemination in low-gatekeeping environments. Experiments show models falter when verifying claims, sometimes endorsing falsehoods at rates comparable to human baselines under misinformed prompts, underscoring debates over deployment safeguards like retrieval-augmented generation, which reduces but does not eliminate errors in high-stakes applications such as legal research. Proponents counter that scaling and post-training refinements yield measurable gains—e.g., newer iterations halving hallucination rates in controlled settings—but skeptics, drawing from causal analyses, maintain that fundamental unreliability persists without paradigm shifts beyond autoregressive architectures. Academic enthusiasm may underemphasize these limits, given institutional incentives favoring optimistic narratives on AI progress.
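Retrieval-augmented generation, mentioned above as a partial safeguard, conditions the model on retrieved evidence rather than relying on parametric memory alone. The sketch below shows the general pattern only; `search_index` and `generate` are hypothetical stand-ins for whatever retriever and model API a deployment actually uses, and the prompt wording is illustrative.

```python
from typing import Callable, List

def answer_with_retrieval(
    question: str,
    search_index: Callable[[str, int], List[str]],  # hypothetical retriever
    generate: Callable[[str], str],                 # hypothetical LLM call
    k: int = 3,
) -> str:
    """Minimal retrieval-augmented generation loop: fetch k passages,
    prepend them to the prompt, and instruct the model to abstain when
    the evidence does not support an answer."""
    passages = search_index(question, k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using only the passages below. "
        "If they do not contain the answer, reply 'insufficient evidence'.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```

Even with this pattern, the studies cited above find residual error, since the model can still misread or override the retrieved evidence.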

Intellectual Property and Data Usage Conflicts

Large language models are typically trained on massive datasets scraped from the internet, books, and other sources, which frequently include copyrighted materials without obtaining licenses from rights holders. This practice has sparked numerous lawsuits alleging direct infringement through unauthorized reproduction and derivative use during training. As of September 2025, at least 51 copyright infringement suits have been filed against AI developers in the U.S., targeting companies including OpenAI, Microsoft, Meta, and Anthropic for using protected works in datasets such as Books3.

A prominent example is The New York Times Co. v. Microsoft Corp. and OpenAI (filed December 27, 2023), where the Times accused the defendants of ingesting millions of its articles to train models like GPT-4, enabling competitive outputs that summarize or reproduce content verbatim. The case, consolidated into multidistrict litigation by April 2025, prompted a May 13, 2025, preservation order requiring OpenAI to retain ChatGPT logs from over 400 million users to assess the scope of infringement, though OpenAI contested the burden and the data retention dispute was partially resolved by October 2025. Similar claims appear in suits by authors, including Sarah Silverman and Richard Kadrey against Meta (filed 2023), alleging that training on pirated ebooks from datasets like The Pile eroded the market value of the originals.

Defendants counter that training constitutes fair use under U.S. law, arguing it is transformative because models learn statistical patterns rather than store copies, akin to human reading for inspiration. In June-July 2025 rulings from the Northern District of California, courts in Bartz v. Anthropic and Kadrey v. Meta found the training use transformative and declined to reject the fair use defense, while emphasizing that market harm and output regurgitation remain fact-intensive issues. Conversely, a ruling in an early 2025 case held that wholesale ingestion of copyrighted works for training likely exceeds fair use without substantial alteration, marking the first judicial rejection of the defense in this context.

These conflicts extend to data sourcing methods, including web scraping that violates site terms of service, as seen in suits against Perplexity AI for allegedly bypassing paywalls on news sites. Critics, including publishers, contend that unlicensed training supplants licensing markets, with evidence from output analyses showing models can reproduce substantial excerpts—up to 10-20% verbatim in some tests—undermining incentives for original creation. AI firms maintain that prohibiting such data use would stifle innovation, but no appellate rulings have resolved the fair use question as of October 2025, leaving developers to pursue opt-out mechanisms or licensed datasets amid regulatory scrutiny in the EU under the AI Act.

Economic Disruptions and Innovation Trade-offs

Large language models (LLMs) have prompted concerns over economic disruptions, particularly in knowledge-intensive sectors where tasks like writing, coding, and analysis are automatable. Empirical estimates suggest potential job displacement affecting 6% to 7% of U.S. workers due to AI adoption, with white-collar roles in data-rich industries facing higher exposure. A Wharton study analyzing LLM exposure across occupations found that while some jobs experience net positive impacts from augmentation, others, such as routine analytical roles, risk substitution, potentially exacerbating inequality if retraining proves insufficient. Brookings research highlights the limits of worker retraining programs amid rapid AI-driven displacement, noting historical precedents where such interventions failed to fully offset losses in analogous technological shifts.

Counterbalancing these disruptions, LLMs have demonstrated measurable productivity gains in controlled studies. A field experiment published in Science showed ChatGPT reducing task completion time by 40% while improving output quality by 18% for professional writers. Similarly, a study on coding tasks reported over 50% increases in code output using generative AI, though gains were concentrated among entry-level programmers rather than experts. McKinsey projections indicate generative AI, including LLMs, could drive annual labor productivity growth of 0.1% to 0.6% through 2040, contingent on adoption rates, though broader labor market data post-ChatGPT release (as of October 2025) reveals no widespread disruption yet.

The trade-offs involve high upfront costs juxtaposed against accelerated R&D, fostering dependency risks. Training GPT-4 reportedly cost between $78 million and over $100 million, with OpenAI's 2024 expenditures on training and inference projected to reach $7 billion, underscoring cost structures that favor incumbents and concentrate economic power. While LLMs enable automation and idea generation—potentially displacing 92 million jobs globally but creating 170 million new ones per World Economic Forum estimates—their integration risks skill atrophy and overreliance, diminishing human critical thinking and creativity over time. NBER analysis confirms small near-term labor market effects but warns of uneven distribution, with productivity enhancements not yet translating to proportional wage gains, potentially widening inequality. These dynamics highlight a causal trade-off: short-term efficiency boosts versus long-term vulnerabilities from reduced human capacity and challenges in mitigating uneven sectoral shifts.

Prospective Developments

Efficiency Enhancements

Efficiency enhancements in large language models (LLMs) address the high computational and memory demands of training and inference, which scale quadratically with sequence length in transformer architectures and linearly with model size. Techniques focus on model compression, optimized computations, and hardware-aware implementations to maintain accuracy while reducing resource usage by factors of 2-10x in memory or compute, depending on the method. These approaches enable deployment on edge devices and lower serving costs, with empirical evaluations showing minimal accuracy degradation, such as less than a 1-2% perplexity increase on benchmarks like WikiText.

Quantization reduces parameter precision from 32-bit floating point (FP32) to lower-bit formats like 8-bit integers (INT8) or 4-bit, compressing models by 4-8x while accelerating inference on GPUs via integer arithmetic. Post-training quantization applies directly to pretrained weights, achieving up to 4x speedups on supported hardware without retraining, though it risks overflow in activations; quantization-aware training mitigates this by simulating low precision during training. For GPT-3-scale LLMs, 4-bit quantization via methods like GPTQ preserves over 95% of original capability on tasks like commonsense reasoning, as measured by datasets such as HellaSwag.

Pruning eliminates redundant weights or neurons, often iteratively, by identifying low-magnitude connections and removing up to 90% of parameters in dense layers while retraining to recover performance. Structured pruning targets entire attention heads or feed-forward modules, reducing model size by 50-70% with sparsity patterns compatible with hardware accelerators; unstructured pruning requires sparse kernels but yields finer granularity. In LLMs, pruning combined with quantization has compressed models like Llama-7B to under 2GB while matching baseline accuracy on MMLU benchmarks.

Knowledge distillation transfers knowledge from a large "teacher" LLM to a smaller "student" by minimizing differences between output logits or matching intermediate features, yielding compact models 5-10x smaller with 80-90% of teacher performance. Offline distillation uses fixed teacher outputs for supervision, while online variants co-train both models; for LLMs, distilling from 175B-parameter models to 7B ones via techniques like MiniLLM achieves near-equivalent zero-shot accuracy on SuperGLUE. This method excels at preserving generalization, though it inherits teacher biases.

Efficient attention mechanisms, such as FlashAttention introduced in 2022, optimize the memory bottleneck in self-attention by tiling computations to minimize reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM, enabling exact attention with 2-4x speedups and roughly 10x memory savings for sequences up to 64k tokens. It recomputes intermediates on the fly instead of materializing full attention matrices, reducing IO by fusing softmax and masking operations. Extensions like FlashAttention-3, optimized for Hopper GPUs in 2024, incorporate asynchronous Tensor Cores and low-precision formats for further 1.5-2x gains in training throughput. These IO-aware designs integrate into frameworks like PyTorch, supporting longer contexts in LLMs without approximation errors.

Hybrid approaches combine these techniques, such as quantizing pruned models post-distillation, yielding end-to-end efficiencies like running 70B-parameter LLMs on consumer GPUs with latencies under 1 second per token. Ongoing research emphasizes sparsity-inducing regularization and dynamic inference paths that adapt to input complexity, balancing fixed costs with per-query savings.
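As a concrete illustration of two of the compression techniques above, the following sketch applies symmetric absmax INT8 post-training quantization and unstructured magnitude pruning to a weight matrix with NumPy. It is a toy per-tensor version of what production libraries do per-channel or per-group; the array sizes and sparsity level are arbitrary.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric absmax post-training quantization: map float weights
    to int8 using a single per-tensor scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured magnitude pruning: zero out the smallest-|w| fraction."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)

q, s = quantize_int8(w)                      # 4x smaller than FP32 storage
err = np.abs(dequantize(q, s) - w).mean()    # mean absolute rounding error
w_sparse = magnitude_prune(w, sparsity=0.5)  # half the weights set to zero

print(f"quantization error: {err:.2e}, "
      f"nonzero after pruning: {np.count_nonzero(w_sparse) / w.size:.2f}")
```

Production pipelines layer further refinements on this skeleton, such as calibration data for activation ranges and sparsity patterns that map onto accelerator kernels, but the compression arithmetic is the same.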

Extensions to Multimodality and Agency

Extensions of large language models to multimodality involve integrating sensory inputs such as images, audio, and video with textual processing, typically achieved by prefixing LLM token sequences with embeddings from modality-specific encoders like vision transformers or audio models. This architecture enables capabilities like visual question answering and cross-modal reasoning, where models generate text descriptions or inferences from non-text data. Early implementations, such as OpenAI's GPT-4V released in September 2023, demonstrated image understanding but were limited to static visuals; subsequent models like GPT-4o, announced on May 13, 2024, incorporated real-time audio and video processing for more interactive applications. Google's Gemini 1.0, introduced in December 2023, was natively multimodal, handling interleaved text and images from pre-training, while its successor, Gemini 2.0, expanded to advanced video analysis and planning. These extensions have improved performance on benchmarks like VQA-v2, where multimodal LLMs (MLLMs) achieve over 80% accuracy in some cases, though challenges persist in grounding across modalities and in alignment between visual and linguistic representations.

Agency extensions leverage LLMs as central reasoning components in autonomous systems capable of planning, tool usage, and multi-step decision-making in external environments. Frameworks like ReAct, proposed in 2022 and refined through 2023-2025 implementations, interleave reasoning traces with actions, allowing models to select tools such as calculators or browsers dynamically. Developments from 2023 onward include Auto-GPT, released in March 2023, which demonstrated goal-oriented task decomposition but suffered from high error rates in long-horizon planning; by 2025, agentic systems evolved into multi-agent collaborations, where specialized LLMs handle distinct subtasks, outperforming single models in simulations. OpenAI's o1 model, previewed in September 2024, enhanced agency through internal chain-of-thought reasoning, enabling better error correction and tool orchestration, while Anthropic's Claude models integrated "computer use" capabilities in October 2024 for screen-based interactions. Evaluations, such as those in the 2025 AI Index, show LLM agents surpassing humans in certain navigation tasks but lagging in robustness to adversarial inputs or real-world variability.

These extensions intersect in multimodal agents, which process visual or auditory environments to execute actions, as seen in robotics integrations where LLMs interpret camera feeds for manipulation planning. However, scalability issues arise from increased computational demands—multimodal pre-training can require datasets exceeding 10 billion image-text pairs—and ethical concerns, including unintended autonomy in deployed systems, have prompted calls for verifiable oversight mechanisms. Despite biases inherited from training data, empirical progress indicates MLLMs and agents approaching general-purpose utility, with 2025 benchmarks reporting 70-90% success rates in controlled agentic workflows.
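The reasoning-acting interleaving described above can be sketched as a simple control loop. The tool registry, transcript format, and `llm` callable below are hypothetical stand-ins rather than any framework's actual API; the loop only illustrates the ReAct-style pattern of alternating thoughts, tool calls, and observations.

```python
from typing import Callable, Dict

def react_agent(
    task: str,
    llm: Callable[[str], str],               # hypothetical model call
    tools: Dict[str, Callable[[str], str]],  # e.g. {"calculator": ..., "search": ...}
    max_steps: int = 5,
) -> str:
    """Minimal ReAct-style loop: the model emits a Thought and either an
    Action (tool call) or a Final answer; tool results are fed back as
    Observations until the task ends or the step budget runs out."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")
        transcript += f"Thought:{step}\n"
        if "Final:" in step:
            return step.split("Final:", 1)[1].strip()
        if "Action:" in step:
            # Expected form: "Action: tool_name[input]"
            call = step.split("Action:", 1)[1].strip()
            name, arg = call.split("[", 1)
            result = tools.get(name.strip(), lambda a: "unknown tool")(arg.rstrip("]"))
            transcript += f"Observation: {result}\n"
    return "step budget exhausted"
```

Real agent frameworks add schema-constrained tool calls, error handling, and memory on top of this loop, which is where much of the robustness gap noted in evaluations arises.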

Open-Source Dynamics vs Proprietary Control

Open-source language models release model weights, architectures, and training code publicly, enabling developers to fine-tune, deploy locally, and iterate without vendor dependency. This approach, exemplified by Meta's Llama series—starting with Llama 2 in July 2023 (7B to 70B parameters) and advancing to Llama 3.1 in July 2024 (up to 405B parameters) and Llama 4 Scout in 2025—fosters collaborative ecosystems on platforms like Hugging Face, where community contributions accelerate refinements such as quantization for efficient inference. Other notable releases include Mistral AI's models (e.g., Mistral 7B in 2023) and Alibaba's Qwen series (Qwen2.5-72B in 2024), which have narrowed performance gaps with proprietary counterparts through techniques like distillation from closed APIs. These dynamics promote rapid innovation by democratizing access: developers in resource-constrained settings can adapt models for niche tasks, such as multilingual applications in Qwen3 (235B parameters, released 2025), bypassing the high compute barriers faced by individuals or small firms. Empirical evidence shows open-source models closing capability gaps; for instance, by mid-2025, models like DeepSeek V3 achieved competitive benchmarks in reasoning and coding against proprietary leaders, driven by iterative community refinement and shared datasets. However, this openness introduces risks, including easier adaptation for malicious uses like generating harmful content or bypassing safety filters, as model weights can be modified without oversight, in contrast to controlled deployments.

Proprietary models, such as OpenAI's GPT-4o (released May 2024) and Anthropic's Claude 3.5 Sonnet (June 2024), retain weights internally, offering access via APIs with enforced rate limits, content filters, and usage policies to mitigate harms like misinformation amplification. This control stems from substantial investments—OpenAI reportedly spent over $100 million training GPT-4—allowing integrated safety measures like reinforcement learning from human feedback (RLHF) tailored to corporate risk assessments. Benefits include higher reliability in enterprise settings, where providers handle scaling and compliance, but drawbacks encompass vendor lock-in and opaque decision-making, potentially embedding unexamined biases from training data curated under institutional pressures.

The tension manifests in an arms race: proprietary firms lead frontier capabilities due to exclusive data and compute (e.g., Google's Gemini models leveraging internal search corpora), yet open-source efforts erode this edge via distillation, where open models are trained to mimic proprietary outputs, sustaining a cycle of catch-up innovation. Critics of proprietary control argue it enables gatekeeping, delaying broader scrutiny of model flaws, while proponents cite reduced misuse risks, though empirical audits of open models reveal comparable safety vulnerabilities when properly aligned. By 2025, open-source momentum has spurred smaller, locally deployable models and cost reductions—running a 70B-parameter open model locally costs pennies per query versus proprietary fees—but proprietary dominance persists in regulated sectors prioritizing accountability over customization.
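Local deployment of an open-weight model typically amounts to downloading the weights and running generation on local hardware. The sketch below uses the Hugging Face transformers library; the model identifier is illustrative (some Llama releases are gated behind license acceptance), and the hardware assumptions (an accelerator and the accelerate package for `device_map="auto"`) are stated rather than guaranteed.

```python
# pip install transformers accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative; substitute any open model

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" spreads layers across available GPUs/CPU (needs accelerate).
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize the trade-offs between open-weight and API-only language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same script runs against quantized variants of larger open models, which is the practical basis for the per-query cost comparison with proprietary APIs noted above.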

References

  1. [1]
    [PDF] Language Models: A Guide for the Perplexed - arXiv
    Nov 29, 2023 · Language modeling is the task of next word prediction. The guide covers natural language processing concepts and tools.
  2. [2]
    History, Development, and Principles of Large Language Models-An ...
    Feb 10, 2024 · It strives to facilitate a comprehensive understanding by exploring the historical background of language models and tracing their evolution ...
  3. [3]
    [1706.03762] Attention Is All You Need - arXiv
    Jun 12, 2017 · Abstract page for arXiv paper 1706.03762: Attention Is All You Need. ... We propose a new simple network architecture, the Transformer ...
  4. [4]
    [2303.18223] A Survey of Large Language Models - arXiv
    Mar 31, 2023 · Large language models (LLMs) are pre-trained language models of significant size, showing special abilities not present in small-scale models.
  5. [5]
    [PDF] Meaning and understanding in large language models - arXiv
    Can a machine understand the meanings of the language through which machine-human communication takes place? A state-of-the-art generative AI model leads to the ...
  6. [6]
    The Limitations of Large Language Models for Understanding ...
    Aug 31, 2024 · We argue on two grounds that LLMs alone tell us very little about human language and cognition in terms of acquisition and evolution.
  7. [7]
  8. [8]
    Language Models: Past, Present, and Future
    Jul 1, 2022 · A language modeling overview, highlighting basic concepts, intuitive explanations, technical achievements, and fundamental challenges.Key Insights · Neural Language Models · Pre-Trained Language Models
  9. [9]
    [PDF] A Neural Probabilistic Language Model
    A goal of statistical language modeling is to learn the joint probability function of sequences of ... A NEURAL PROBABILISTIC LANGUAGE MODEL n c. h m direct mix ...
  10. [10]
    [PDF] N-gram Language Models - Stanford University
    Log probabilities Language model probabilities are always stored and computed in log space as log probabilities. This is because probabilities are (by ...
  11. [11]
    What Is a Language Model? | deepset Blog
    Jul 20, 2022 · A language model is a machine learning model designed to represent the language domain. It can be used as a basis for a number of different language-based ...
  12. [12]
    Introduction to Large Language Models | Machine Learning
    Aug 25, 2025 · If you assume that a token is a word, then a language model determines the probabilities of different words or sequences of words to replace ...
  13. [13]
    A Measure-Theoretic Characterization of Tight Language Models
    Dec 20, 2022 · Language modeling, a central task in natural language processing, involves estimating a probability distribution over strings. In most cases, ...
  14. [14]
    [PDF] 13.1 Probabilistic Langauge Models
    The probabilistic language model is to compute a probability distribution of a sentence of words ... the chain rule of probability: P(w1, ..., wN ) = P(w1)P(w2|w1) ...
  15. [15]
    Statistical Language Models — MTH 337: Spring 2017
    The chain rule in probability theory describes how to factor P(w1, w2, ..., wn) into the probabilities of each word given the words that precede it. P(S) ...
  16. [16]
    8.5.6 Probabilistic Models of Language
    This uses a bigram model for words, and assumes P(Wi∣Wi−1) P ⁢ ( W i ∣ W i - 1 ) is provided as the language model. A stationary model is typically appropriate.
  17. [17]
    Chain Rule for Sequence Probability - 1Cademy
    The chain rule of probability is a fundamental principle used in language modeling to calculate the joint probability of a sequence of tokens, such as {x_0, ...
  18. [18]
    [PDF] A Mathematical Theory of Communication
    Roughly the ergodic property means statistical homogeneity. All the examples of artificial languages given above are ergodic. This property is related to the ...
  19. [19]
    Andrey Markov & Claude Shannon Counted Letters to Build the First ...
    Nov 11, 2019 · Shannon, via Markov, revealed a statistical framework for the English language, and showed that by modeling this framework—by analyzing the ...
  20. [20]
    [PDF] The Mathematics of Statistical Machine Translation - ACL Anthology
    IBM T.J. Watson Research Center. We describe a series o,f five statistical models o,f the translation process and give algorithms,for estimating the ...
  21. [21]
    A Neural Probabilistic Language Model
    Abstract. A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically ...
  22. [22]
    Recurrent neural network based language model - ISCA Archive
    ISCA Archive Interspeech 2010. Recurrent neural network based language model. Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, Sanjeev Khudanpur. A ...
  23. [23]
    [2001.08361] Scaling Laws for Neural Language Models - arXiv
    Jan 23, 2020 · We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the ...
  24. [24]
    [2005.14165] Language Models are Few-Shot Learners - arXiv
    May 28, 2020 · Abstract page for arXiv paper 2005.14165: Language Models are Few-Shot Learners. ... GPT-3 achieves strong performance on many NLP datasets, ...
  25. [25]
    [PDF] Prediction and Entropy of Printed English - Princeton University
    Prediction and Entropy of Printed English. By C. E. SHANNON. (Manuscript Received Sept. 75,. A new method of estimating the entropy and redundancy of a ...
  26. [26]
    [PDF] Fred Jelinek - ACL Anthology
    As a result, by the time that Fred Jelinek went to IBM in 1972 to work on speech recognition, the Information Theory bandwagon of the 1950s was lying forgotten ...
  27. [27]
    [PDF] TWO DECADES OF STATISTICAL LANGUAGE MODELING
    Since the first significant model was proposed in 1980, many attempts have been made to improve the state of the art. We review them here, point to a few ...
  28. [28]
  29. [29]
  30. [30]
    A Brief History and Introduction of Recurrent Neural Networks | by ...
    Mar 8, 2021 · The concept of RNN was brought up in 1986. And the famous LSTM architecture was invented in 1997. The number of well-known architectures of RNN ...1 Why Rnn · 2 The Rnns · 2.4 Encoder And Decoder With...
  31. [31]
    Recurrent neural network based language model - Semantic Scholar
    Recurrent neural network based language model · Tomas Mikolov, M. Karafiát, +2 authors. S. Khudanpur · Published in Interspeech 2010 · Computer Science.
  32. [32]
    What are limitations of recurrent neural networks? - Quora
    Jan 30, 2017 · The major disadvantage of RNNs are the vanishing gradient and gradient exploding problem. It makes the training of RNN difficult in several ways ...
  33. [33]
    LSTM neural networks for language modeling - ISCA Archive
    In this work, we apply this type of network to an English and a large French language modeling task. Experiments show improvements of about 8% relative in ...
  34. [34]
    The Role of Recurrent Neural Networks (RNNs) in Language ...
    In language modeling, recurrent neural networks (RNNs) are used to predict the next word in a sentence based on the words that came before. They process input ...Missing: history | Show results with:history
  35. [35]
    When to use GRU over LSTM? - Data Science Stack Exchange
    Oct 17, 2016 · The key difference between a GRU and an LSTM is that a GRU has two gates (reset and update gates) whereas an LSTM has three gates (namely input, ...
  36. [36]
    When to Use GRUs Over LSTMs? - Analytics Vidhya
    Mar 7, 2025 · GRUs typically train 20-30% faster than equivalent LSTM models due to their simpler internal structure and fewer parameters.Performance Comparisons... · Task-Specific Considerations
  37. [37]
    Recurrent Neural Network: Working, Applications, Challenges
    Aug 27, 2023 · Lack of Parallelism: The inherently sequential nature of RNNs limits their parallel processing capabilities, making them less efficient for ...
  38. [38]
    When Recurrent Models Don't Need to be Recurrent
    Aug 6, 2018 · Feed-forward models can offer improvements in training stability and speed, while recurrent models are strictly more expressive.
  39. [39]
    [1602.02410] Exploring the Limits of Language Modeling - arXiv
    Feb 7, 2016 · In this work we explore recent advances in Recurrent Neural Networks for large scale Language Modeling, a task central to language understanding.
  40. [40]
    A History of Large Language Models - Gregory Gundersen
    Oct 1, 2025 · I trace an academic history of some of the core ideas behind large language models, such as distributed representations, transducers, ...
  41. [41]
    Training Compute-Optimal Large Language Models - arXiv
    Mar 29, 2022 · We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B ...
  42. [42]
    A Critical Analysis of the Largest Source for Generative AI Training ...
    Jun 5, 2024 · Common Crawl is the largest freely available collection of web crawl data and one of the most important sources of pre-training data for large language models ...
  43. [43]
    Data | CS324
    During training, Common Crawl is downsampled (Common Crawl is 82% of the dataset, but contributes only 60%). The Pile. While a web crawl is a natural place ...
  44. [44]
    LLM Training Data: The 8 Main Public Data Sources - Oxylabs
    Sep 27, 2024 · This article overviews LLM training, the need for public web data, and the major public data sources for highly-performant LLMs.
  45. [45]
    Large language model data pipelines and Common Crawl (WARC ...
    Jun 3, 2023 · Common Crawl provides different archival formats that you can use and this format evolved over time. Nowadays they are available in 3 main ...
  46. [46]
    Mastering LLM Techniques: Text Data Processing - NVIDIA Developer
    Nov 13, 2024 · To optimize LLM performance, data processing techniques such as text cleaning, heuristic filtering, deduplication, and model-based quality ...
  47. [47]
    D4: Improving LLM Pretraining via Document De-Duplication and ...
    Aug 23, 2023 · Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains)
  48. [48]
    Data Deduplication at Trillion Scale: How to Solve the Biggest ... - Zilliz
    Jul 22, 2025 · Explore how MinHash LSH and Milvus handle data deduplication at the trillion-scale level, solving key bottlenecks in LLM training for improved ...
  49. [49]
    FineWeb: decanting the web for the finest text data at scale
    May 31, 2024 · Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt ...<|separator|>
  50. [50]
    Data Collection and Preprocessing for LLMs [Updated] - Labellerr
    Sep 27, 2024 · This section focuses on the acquisition and processing of pretraining data, which includes the sources of data, methods for preprocessing, and an analysis.
  51. [51]
    Inference Scaling Laws: An Empirical Analysis of Compute-Optimal ...
    Aug 1, 2024 · We study inference scaling laws (aka test-time scaling laws) and compute-optimal inference, focusing on the trade-offs between model sizes and generating ...
  52. [52]
    Revisiting Scaling Laws for Language Models: The Role of Data ...
    This paper revisits these scaling laws by examining the impact of data quality and training strategies on model performance. Through extensive empirical ...
  53. [53]
    Supervised Fine-Tuning - Hugging Face LLM Course
    Supervised Fine-Tuning (SFT) is a critical process for adapting pre-trained language models to specific tasks. It involves training the model on a task-specific ...
  54. [54]
    Training language models to follow instructions with human feedback
    Mar 4, 2022 · Abstract page for arXiv paper 2203.02155: Training language models to follow instructions with human feedback. ... GPT-3 using supervised learning ...
  55. [55]
    Aligning language models to follow instructions - OpenAI
    Jan 27, 2022 · To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF)⁠ ...
  56. [56]
    Direct Preference Optimization: Your Language Model is Secretly a ...
    May 29, 2023 · In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
  57. [57]
    Constitutional AI: Harmlessness from AI Feedback - Anthropic
    Dec 15, 2022 · We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs.
  58. [58]
    [2504.12501] Reinforcement Learning from Human Feedback - arXiv
    Apr 16, 2025 · Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems.
  59. [59]
    Evaluation Metrics for Language Modeling - The Gradient
    Oct 18, 2019 · Traditionally, language model performance is measured by perplexity, cross entropy, and bits-per-character (BPC).
  60. [60]
    Cross Entropy in Large Language Models (LLMs) | by Charles Chi | AI
    Feb 4, 2024 · LLMs utilize cross entropy as a loss function during training to measure the discrepancy between the predicted probability distribution of words ...
  61. [61]
    Language Model Perplexity Predicts Scientific Surprise and ... - arXiv
    Sep 6, 2025 · Our findings reveal that computational measures of corpus-wide linguistic surprise can forecast the reception and ultimate influence of ...
  62. [62]
    Tokenizer-Normalized Evaluation for Language Model Comparison
    Jul 7, 2025 · Perplexity, defined as the exponentiated average negative log-likelihood of a sequence, remains the standard intrinsic evaluation metric.
  63. [63]
    Perplexity of language models revisited | by Pirmin Lemberger
    Jun 28, 2022 · In a nutshell, the perplexity of a language model measures the degree of uncertainty of a LM when it generates a new token, averaged over very ...From Basic Information... · Get Pirmin Lemberger's... · Bounding The Perplexity Of...
  64. [64]
    30 LLM evaluation benchmarks and how they work - Evidently AI
    Sep 20, 2025 · We put together database of 250+ LLM benchmarks and datasets you can use to evaluate the performance of language models. LLM benchmarks vary in ...
  65. [65]
    SuperGLUE: A Stickier Benchmark for General-Purpose Language ...
    May 2, 2019 · In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a ...
  66. [66]
    Large Language Model Evaluation: 10+ Metrics & Methods
    Sep 19, 2025 · A combination of benchmarks is often necessary to comprehensively evaluate a language model's performance. A set of benchmark tasks is selected ...
  67. [67]
    LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide
    This article will teach you everything you need to know about LLM evaluation metrics, with code samples included.
  68. [68]
    SuperGLUE Benchmark
    SuperGLUE is a new benchmark styled after original GLUE benchmark with a set of more difficult language understanding tasks, improved resources, and a new ...LeaderboardTasks
  69. [69]
    [Discussion] ChatGPT and language understanding benchmarks
    Jan 30, 2023 · The SuperGLUE benchmark has GPT-3 ranked #24, not terrible, but outperformed by old models like T5, which seems odd. GLUE nothing. SQUAD ...<|separator|>
  70. [70]
    What Are LLM Benchmarks? - IBM
    LLM benchmarks are standardized frameworks for assessing the performance of large language models (LLMs) ... HellaSwag, MMLU, GSM8K, TruthfulQA and Winogrande ...
  71. [71]
    Top LLM Benchmarks Explained: MMLU, HellaSwag, BBH, and ...
    LLM benchmarks such as MMLU, HellaSwag, and DROP, are a set of standardized tests designed to evaluate the performance of LLMs on various skills.
  72. [72]
    LLM Benchmarks: Measuring AI's Performance & Accuracy
    Jul 8, 2025 · Just a few top LLM benchmarks to watch today include MMLU, HellaSwag, TruthfulQA, HumanEval, Big-bench, GSM8K, and ARC. No single benchmark ...
  73. [73]
    LLM Leaderboard - Comparison of over 100 AI models from OpenAI ...
    Comparison and ranking the performance of over 100 AI models (LLMs) across key metrics including intelligence, price, performance and speed (output speed ...Llama 4 Scout · GPT-4.1 · Llama 4 Maverick · Claude 3.7 Sonnet
  74. [74]
    A Survey on Large Language Model Benchmarks - arXiv
    Aug 21, 2025 · Shopping MMLU: A massive multi-task online shopping benchmark for large language models. In Amir Globersons, Lester Mackey, Danielle ...
  75. [75]
    Grok-2 vs Llama 3.1 405B Instruct - LLM Stats
    Grok-2 outperforms in 4 benchmarks (GPQA, MATH, MMLU, MMLU-Pro), while Llama 3.1 405B Instruct is better at 1 benchmark (HumanEval). Grok-2 significantly ...
  76. [76]
    Announcing Grok-1.5 - xAI
    Mar 28, 2024 · Additionally, it scored 74.1% on the HumanEval benchmark, which evaluates code generation and problem-solving abilities. Benchmark, Grok-1, Grok ...
  77. [77]
    Llama 3.1 405b vs Leading Closed-Source Models - Vellum AI
    Jul 26, 2024 · The main focus on this analysis is to compare Llama 405b with GPT-4o, Gemini 1.5 Pro and Claude 3.5 Sonnet. We look at standard benchmarks ...<|separator|>
  78. [78]
    Unveiling A Core Linguistic Region in Large Language Models - arXiv
    Oct 23, 2023 · We have discovered a core region in LLMs that corresponds to linguistic competence, accounting for approximately 1% of the total model parameters.
  79. [79]
    Linguistic Interpretability of Transformer-based Language Models
    Apr 9, 2025 · Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or ...Missing: capabilities | Show results with:capabilities
  80. [80]
    What Do LLMs Know About Linguistics? It Depends on How You Ask
    Jul 9, 2023 · This setup overlooks an entire language-motivated side of core NLP, where the models produce linguistic analyses of text, such as a syntax tree.
  81. [81]
    [2410.09613] Transformer-based Language Models for Reasoning ...
    Oct 12, 2024 · Abstract:Recent advancements in transformer-based language models have sparked research into their logical reasoning capabilities.
  82. [82]
    Does GPT-4 surpass human performance in linguistic pragmatics?
    Dec 15, 2023 · Findings revealed that LLMs, particularly GPT-4, outperformed humans. GPT4 achieved the highest score of 4.80, surpassing the best human score ...
  83. [83]
    PUB: A Pragmatics Understanding Benchmark for Assessing LLMs ...
    Jan 13, 2024 · The benchmark aims to provide a comprehensive evaluation of LLM's ability to handle real-world language tasks that require pragmatic reasoning.
  84. [84]
    Evaluating Large Language Models on Linguistic Competence
    This project investigates the extent to which LLMs capture core aspects of linguistic knowledge, including syntax, semantics, pragmatics, and sociolinguistic ...
  85. [85]
    [PDF] A Comprehensive Overview of Large Language Models - arXiv
    Large Language Models (LLMs) have shown remarkable capabilities in natural language processing, and this article provides a comprehensive overview of LLM- ...
  86. [86]
    One Year On: Assessing Progress of Multimodal Large Language ...
    Aug 12, 2025 · The average accuracy of multimodal LLMs across subspecialties varied between the 2024 and 2023 CotD question cohorts, with a general trend of ...
  87. [87]
    survey on multimodal large language models - Oxford Academic
    This paper presents the first survey on Multimodal Large Language Models (MLLMs), highlighting their potential as a path to Artificial General Intelligence.
  88. [88]
    How Gemini makes Android more helpful - Google AI
    Aug 13, 2024 · With AI at the core of Android, we've rebuilt Gemini and tailored it for your device, in a private and secure way.New Ways To Use Gemini On... · Secure With Google · Built For Android's Scale
  89. [89]
    Google Gemini's Android app is almost ready to roll out this basic ...
    Jul 13, 2025 · Google also introduced a chat search feature in recent weeks, enabling users to search their chat history with Gemini, albeit only on the web and iOS.
  90. [90]
    Grounding with Google Search | Gemini API
    Sep 25, 2025 · Grounding with Google Search connects the Gemini model to real-time web content and works with all available languages.
  91. [91]
    Microsoft incorporates OpenAI's GPT-5 into consumer, developer ...
    Aug 7, 2025 · Today Microsoft is incorporating GPT-5, OpenAI's best AI system to date, into a wide variety of its products, to bring new reasoning ...
  92. [92]
    Microsoft Embeds ChatGPT-5 Across Copilot Ecosystem
    Aug 7, 2025 · OpenAI's GPT-5 model is now integrated into Microsoft's full portfolio of AI-powered tools, including Microsoft 365 Copilot, GitHub Copilot, ...
  93. [93]
    Foundry Models sold directly by Azure - Microsoft Learn
    GPT-4o audio models support either low latency speech in, speech out conversational interactions or audio generation.Working with models · Azure OpenAI reasoning models · Add and configure models
  94. [94]
    27 of the best large language models in 2025 - TechTarget
    Jul 10, 2025 · Below are some of the most relevant large language models today. They do natural language processing and influence the architecture of future models.Bert · Claude vs. ChatGPT · DeepSeek explained
  95. [95]
    Detecting hallucinations in large language models using semantic ...
    Jun 19, 2024 · Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations.
  96. [96]
    [PDF] Why Language Models Hallucinate - arXiv
    Sep 4, 2025 · Hallucinations are inevitable only for base models.​​ Indeed, empirical studies (Fig. 2) show that base models are often found to be calibrated, ...
  97. [97]
    [PDF] An Empirical Study on Factuality Hallucination in Large Language ...
    Aug 11, 2024 · Factuality hallucination in LLMs is the tendency to generate factually incorrect content that looks plausible, which is a primary erroneous ...
  98. [98]
    A Survey on Hallucination in Large Language Models
    Jan 24, 2025 · Moreover, recent research has exposed that LLMs can occasionally exhibit unpredictable reasoning hallucinations spanning both long-range and ...
  99. [99]
    Survey and analysis of hallucinations in large language models
    Sep 29, 2025 · In this work, we present a comprehensive survey and empirical analysis of hallucination attribution in LLMs. Introducing a novel framework to ...
  100. [100]
    [PDF] Free? Assessing the Reliability of Leading AI Legal Research Tools
    However, the large language models used in these tools are prone to “hallucinate,” or make up false information, making their use risky in high- stakes domains.
  101. [101]
    Multi-model assurance analysis showing large language models are ...
    Without mitigation, hallucination rates were 64.1% for long cases versus 67.6% in short ones. With the mitigation prompt, rates dropped to 43.1% and 45.3% for ...
  102. [102]
    Larger and more instructable language models become less reliable
    Sep 25, 2024 · Larger and more instructable large language models may have become less reliable. By studying the relationship between difficulty concordance, task avoidance ...
  103. [103]
    Biases in Large Language Models: Origins, Inventory, and Discussion
    Jun 22, 2023 · We argue that (i) most types of bias originate in corpora and, consequently, language models learn and amplify such biases, and, (ii) more ...
  104. [104]
    Bias and Fairness in Large Language Models: A Survey
    Model: The training or inference procedure itself may amplify bias, beyond what is present in the training data. The choice of optimization function, such as ...
  105. [105]
    Stereotypical bias amplification and reversal in an experimental ...
    Here we provide the first direct empirical evidence of the core phenomenon—bias amplification—driving public and expert concern about human–AI interaction.
  106. [106]
    Bias Amplification: Large Language Models as Increasingly ... - arXiv
    Oct 19, 2024 · In this paper, we introduce a open, generational, and long-context benchmark specifically designed to measure political bias amplification in LLMs.
  107. [107]
    Large language models show amplified cognitive biases in moral ...
    Our experiments demonstrate that the decisions and advice of LLMs are systematically biased against doing anything, and this bias is stronger than in humans.
  108. [108]
    Assessing political bias and value misalignment in generative ...
    Our analysis reveals a concerning misalignment of values between ChatGPT and the average American. We also show that ChatGPT displays political leanings ...
  109. [109]
    Measuring Political Preferences in AI Systems - Manhattan Institute
    Jan 23, 2025 · Research has hinted at the presence of political biases in Large Language Model (LLM)–based AI systems such as OpenAI's ChatGPT or Google's ...
  110. [110]
    Large Language Models Are Biased Because They Are Large ...
    A great number of studies demonstrate empirically that harmful LLM biases exist, but this is done exclusively, as far as I can tell, via in vitro methods—as ...
  111. [111]
    Bias in Large Language Models: Origin, Evaluation, and Mitigation
    Nov 16, 2024 · This comprehensive review examines the landscape of bias in LLMs, from its origins to current mitigation strategies.
  112. [112]
    Over 30 AI models have been trained at the scale of GPT-4
    Jan 30, 2025 · Over 30 AI models, trained with over 10^25 FLOP, have been identified as of June 2025, with an average of two models announced monthly in 2024.
  113. [113]
    What is the cost of training large language models? - CUDO Compute
    May 12, 2025 · Training OpenAI's GPT-4 reportedly cost more than $100 million, with some estimates ranging up to $78 million in compute cost, and Google's ...
  114. [114]
    Training compute costs are doubling every eight months ... - Epoch AI
    Jun 19, 2024 · Training compute costs for the largest AI models are doubling every eight months, with spending growing at 2.4x per year, costing hundreds of ...
  115. [115]
    [PDF] How Hungry is AI? Benchmarking Energy, Water, and Carbon ...
    Apr 23, 2025 · We benchmark the environmental footprint of 30 LLMs across three modalities: Energy consumption, water usage, and carbon emissions, based on ...
  116. [116]
    Confronting AI's Growing Energy Appetite | Extreme Networks
    Aug 15, 2024 · AI's energy demand is expected to skyrocket from just eight terawatt-hours in 2024 to a staggering 652 terawatt-hours by 2030.
  117. [117]
    Will we run out of data to train large language models? - Epoch AI
    Jun 6, 2024 · So far, AI developers have not faced major limits to scaling beyond simply procuring AI chips, which are scarce but rapidly growing in supply. ...
  118. [118]
    Reconciling Kaplan and Chinchilla Scaling Laws - arXiv
    Jun 12, 2024 · Our approach uses information and data from the Chinchilla and Kaplan studies to estimate the scaling laws that would emerge if the Chinchilla ...
  119. [119]
    Chinchilla data-optimal scaling laws: In plain English - LifeArchitect.ai
    Aug 15, 2025 · Chinchilla scaling laws say 1,400B tokens (1.4T) should be used to train a 70B parameter LLM, needing around 20 text tokens per parameter.
  120. [120]
    Language models scale reliably with over-training and on ...
    Nov 25, 2024 · Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments.<|control11|><|separator|>
  121. [121]
    Scaling Laws for LLMs: From GPT-3 to o3 - Deep (Learning) Focus
    Jan 6, 2025 · Scaling laws help us to predict the results of larger and more expensive training runs, giving us the necessary confidence to continue investing in scale.
  122. [122]
    Why language models hallucinate | OpenAI
    Sep 5, 2025 · Our latest models have lower hallucination rates, and we continue to work hard to further decrease the rates of confident errors output by ...Missing: 2023-2025 | Show results with:2023-2025
  123. [123]
    Multi-model assurance analysis showing large language ... - Nature
    Aug 2, 2025 · Hallucination rates range from 50 % to 82 % across models and prompting methods. Prompt-based mitigation lowers the overall hallucination rate ( ...
  124. [124]
    Hallucination Rates and Reference Accuracy of ChatGPT and Bard ...
    May 22, 2024 · The hallucination rate was calculated to quantify the proportion of LLM-generated references that were irrelevant, incorrect, or unsupported by ...Missing: quantitative | Show results with:quantitative
  125. [125]
    Why Language Models Hallucinate - arXiv
    Sep 4, 2025 · We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we ...
  126. [126]
    [PDF] FELM: Benchmarking Factuality Evaluation of Large Language Models
    In the experiments, we examine the abilities of two most powerful LLMs, ChatGPT and GPT-4 (OpenAI, 2023), as factuality evaluators on our benchmark, augmented ...
  127. [127]
    FactBench: A Dynamic Benchmark for In-the-Wild Language Model...
    We curate a benchmark from in-the-wild user-model interactions that evaluates language models' factuality in diverse scenarios.
  128. [128]
    Do language models favor their home countries? Asymmetric ...
    Sep 22, 2025 · As language models (LMs) continue to develop, concerns over foreign misinformation through models developed in authoritarian countries have ...Missing: reliability | Show results with:reliability
  129. [129]
    The Dark Side of Language Models: Exploring the Potential of LLMs ...
    This review paper explores the potential of LLMs to initiate the generation of multi-media disinformation, encompassing text, images, audio, and video.
  130. [130]
    Generative AI and misinformation: a scoping review of the role of ...
    Sep 30, 2025 · This scoping review synthesizes recent empirical studies to explore the dual role of generative AI—particularly large language models (LLMs)—in ...
  131. [131]
    On the reliability of Large Language Models to misinformed and ...
    Jan 8, 2025 · We investigate and observe the behavior and performance of Large Language Model (LLM)-backed chatbots in addressing misinformed prompts and questions with ...
  132. [132]
    Master List of lawsuits v. AI, ChatGPT, OpenAI, Microsoft, Meta ...
    Aug 27, 2024 · LAST UPDATE: As of Sept. 16, 2025, 51 copyright lawsuits filed against AI companies in U.S. New suit: Disney v. MiniMax. There are 24 copyright ...
  133. [133]
    Every AI Copyright Lawsuit in the US, Visualized | WIRED
    Dec 19, 2024 · Twelve separate copyright lawsuits brought by authors, newspapers, and other publishers against OpenAI and Microsoft were consolidated into one ...
  134. [134]
    The Newspaper Cases | BakerHostetler
    What We're Watching. As of April 2025, this case has been consolidated into In re: OpenAI Copyright Infringement Litigation.
  135. [135]
    How we're responding to The New York Times' data ... - OpenAI
    Jun 5, 2025 · Update on October 22, 2025: After months of litigation, we are no longer under a legal order to retain consumer ChatGPT and API content ...
  136. [136]
    Fair Use and AI Training: Two Recent Decisions Highlight the ...
    Jul 8, 2025 · In each case, the court found that, on the facts before it, the use of copyrighted works to train an AI model was highly transformative and fair ...
  137. [137]
    A New Look at Fair Use: Anthropic, Meta, and Copyright in AI Training
    Jul 3, 2025 · Bartz v. Anthropic and Kadrey v. Meta offer the first significant judicial guidance on how courts will look to apply the fair use doctrine to AI training ...
  138. [138]
    First of its Kind Decision Finds AI Training is Not Fair Use
    Judge Bibas's opinion is the first decision in a case that addresses whether AI training is fair use, and it unequivocally holds that it is not.
  139. [139]
    Northern District of California Decides AI Training Is Fair Use, but ...
    Jul 2, 2025 · While use of copyrighted works to train AI may be fair use, copying works without permission carries the risk of infringement.<|separator|>
  140. [140]
    AI Copyright Lawsuits - UBC Wiki
    Aug 26, 2025 · According to Reuters Legal in August 2025 Perplexity AI failed to convince a judge to dismiss a lawsuit over its alleged misuse of articles ...
  141. [141]
  142. [142]
  143. [143]
    Why AI is replacing some jobs faster than others
    Aug 12, 2025 · Data-rich industries are the most prone to being disrupted by AI. Data-poor industries are scrabbling to digitize in order to enjoy the benefits ...
  144. [144]
    How Large Language Models Could Impact Jobs
    Sep 10, 2024 · A new study weighs the positive and negative impacts of LLMs on various jobs, depending on their exposure to AI.
  145. [145]
    AI labor displacement and the limits of worker retraining | Brookings
    May 16, 2025 · Julian Jacobs examines the challenges of worker retraining amid the potential job displacement driven by advances in AI.
  146. [146]
    Experimental evidence on the productivity effects of generative ...
    Jul 13, 2023 · Our results show that ChatGPT substantially raised productivity: The average time taken decreased by 40% and output quality rose by 18%.
  147. [147]
    Generative AI and labour productivity: a field experiment on coding
    Sep 4, 2024 · Our findings indicate that the use of gen AI increased code output by more than 50%. However, productivity gains are statistically significant only among entry ...
  148. [148]
    Economic potential of generative AI - McKinsey
    Jun 14, 2023 · Generative AI could enable labor productivity growth of 0.1 to 0.6 percent annually through 2040, depending on the rate of technology adoption ...
  149. [149]
    Evaluating the Impact of AI on the Labor Market - Yale Budget Lab
    Oct 1, 2025 · Overall, our metrics indicate that the broader labor market has not experienced a discernible disruption since ChatGPT's release 33 months ago, ...
  150. [150]
    Visualizing the Training Costs of AI Models Over Time
    Jun 4, 2024 · Last year, OpenAI's GPT-4 cost an estimated $78.4 million to train, a steep rise from Google's PaLM (540B) model, which cost $12.4 million just ...<|separator|>
  151. [151]
    "OpenAI's costs for AI training and inference could soar to $7 billion ...
    Jul 25, 2024 · OpenAI's costs for AI training and inference could soar to $7 billion this year, while staffing expenses might climb to as much as $1.5 billion.
  152. [152]
  153. [153]
    Artificial intelligence: Development, risks and regulation
    Jul 18, 2023 · Dependence on AI, including the risk that an overreliance on AI leads to a loss of creativity, critical thinking skills and human intuition.2. Ongoing Development Of Ai... · 3. Calls For Rapid... · 4. Proposed Regulatory...Missing: offs | Show results with:offs
  154. [154]
    [PDF] Large Language Models, Small Labor Market Effects
    May 28, 2025 · If these tools enhance individual productivity, a natural question is whether such gains translate into higher earnings. Our first key finding ...
  155. [155]
    Is artificial intelligence a hazardous technology? Economic trade-off ...
    Sep 3, 2024 · Our study explores the trade-off of AI technology, including existential risks. We develop a theory and a Bayesian simulation model in order to explore what is ...
  156. [156]
    [TMLR 2024] Efficient Large Language Models: A Survey - GitHub
    In this survey, we provide a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main ...
  157. [157]
    A Survey on Model Compression for Large Language Models - arXiv
    Jul 30, 2024 · This paper presents a survey of model compression techniques for LLMs. We cover methods like quantization, pruning, and knowledge distillation, highlighting ...
  158. [158]
    LLM Optimization: Quantization, Pruning, and Distillation Techniques
    May 30, 2025 · This comprehensive guide explores three fundamental approaches to LLM optimization: quantization, pruning, and knowledge distillation. These ...<|separator|>
  159. [159]
    Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer
    Oct 7, 2025 · Pruning and knowledge distillation are highly cost-effective methods to progressively shrink LLMs while matching or exceeding baseline ...
  160. [160]
    [PDF] Quantization, Pruning, and Distillation - Graham Neubig
    Quantization. • keep the model the same but reduce the number of bits. 2. Pruning. • remove parts of a model while retaining performance. 3. Distillation.
  161. [161]
    FlashAttention: Fast and Memory-Efficient Exact Attention with IO ...
    May 27, 2022 · We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory ...
  162. [162]
    FlashAttention-3: Fast and Accurate Attention with Asynchrony and ...
    Jul 11, 2024 · In this blogpost, we describe three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA.
  163. [163]
    Flash Attention - Hugging Face
    Flash Attention is an attention algorithm used to reduce this problem and scale transformer-based models more efficiently, enabling faster training and ...<|control11|><|separator|>
  164. [164]
    From Large Language Models to Large Multimodal Models - MDPI
    This paper aims to summarize the recent progress from LLMs to LMMs in a comprehensive and unified way. First, we start with LLMs and outline various conceptual ...
  165. [165]
    AI Agents vs. Agentic AI: A Conceptual taxonomy, applications and ...
    In short, foundational AI models give modern AI Agents their basic understanding of language and scenes. ... Representative Agentic AI Models (2023–2025): ...
  166. [166]
    Autonomous generative AI agents: Under development - Deloitte
    Nov 19, 2024 · Built on foundation models: Foundation models like LLMs enable agentic AI to reason, analyze, and adapt to complex and unpredictable workflows.
  167. [167]
    The 2025 AI Index Report | Stanford HAI
    Beyond benchmarks, AI systems made major strides in generating high-quality video, and in some settings, language model agents even outperformed humans in ...
  168. [168]
    Large Multimodal Models (LMMs) vs LLMs - Research AIMultiple
    Sep 26, 2025 · We evaluated the performance of Large Multimodal Models (LMMs) in financial reasoning tasks using a carefully selected dataset.
  169. [169]
    Multimodal Models and Agentic AI: Generative AI in 2025 - Spitch AI
    Feb 6, 2025 · Small Language Models (SLMs) are also trending. In some cases these models match the performance of larger systems like GPT-4 in targeted ...
  170. [170]
    Best 44 Large Language Models (LLMs) in 2025 - Exploding Topics
    Oct 17, 2025 · This table lists the leading large language models in 2025, with columns for LLM name, developer, release date, access, and parameters (e.g., GPT-5, OpenAI, August 7, 2025, ...).
  171. [171]
    Top 10 open source LLMs for 2025 - Instaclustr
    Unlike proprietary models developed by companies like OpenAI and Google, open source LLMs are licensed to be freely used, modified, and distributed by anyone.
  172. [172]
    Top 8 Open‑Source LLMs to Watch in 2025 - JetRuby Agency
    May 13, 2025 · Top open-source LLMs include Llama 3.1 (multilingual), Llama 4 (text/image), Pixtral 12B (multimodal), Qwen 2.5-72B (multilingual), and Falcon ...
  173. [173]
    Open-Source LLMs You Can Deploy: 11 Best Models 2025
    Sep 16, 2025 · Discover 11 top open-source LLMs for 2025. Compare models, learn deployment strategies, and build AI workflows with practical deployment ...
  174. [174]
    9 Top Open-Source LLMs for 2025 and Their Uses - DataCamp
    9 Top Open-Source Large Language Models For 2025 · 1. GLM 4.6 · 2. gpt-oss-120B · 3. Qwen3 235B 2507 · 4. DeepSeek V3.2 Exp · 5. DeepSeek R1 0528 · 6. Apriel-v1.5-15B ...
  175. [175]
    The Open-Source Advantage in Large Language Models (LLMs)
    Oct 14, 2025 · Large language models (LLMs) have rapidly advanced natural language processing, driving significant breakthroughs in tasks such as text ...
  176. [176]
    Open-Source LLMs vs Closed-Source LLMs: Key Differences in 2025
    Aug 29, 2025 · Open-source large language models, or LLMs, are models whose ... Unlike proprietary APIs that charge per request, open models are free to use.
  177. [177]
    Top LLMs To Use in 2025: Our Best Picks - Splunk
    May 8, 2025 · Weigh proprietary versus open-source models: proprietary LLMs offer advanced capabilities but may have higher costs and data privacy trade-offs, ...
  178. [178]
    Open Source or Proprietary LLMs? - EM360Tech
    Mar 20, 2024 · Disadvantages and risks of proprietary LLMs: Lack of transparency: The inner workings are opaque to outside researchers and the public. We ...
  179. [179]
    Open Source vs. Proprietary LLMs: Arms Race for AI Leadership
    Jan 30, 2025 · While open source models offer greater control and customization potential, proprietary models currently lead in general performance and ease ...
  180. [180]
    Do you think open source models continue to keep pace ... - Reddit
    Jul 24, 2025 · Open source will likely always stay close behind proprietary models due to the practice of model distillation. It's much easier to stay in the ...
  181. [181]
    LLMs Explained: Open-Source Vs Proprietary AI Models - AceCloud
    Sep 4, 2025 · Learn how open-source and proprietary LLMs differ in cost, control, customization, and innovation. Get insights to make the right AI choice.
  182. [182]
    Open vs. Closed LLMs in 2025: Strategic Tradeoffs for Enterprise AI