Language model
A language model is a probabilistic framework in machine learning that estimates the joint probability distribution over sequences of linguistic units, such as words or subword tokens, enabling predictions of subsequent elements given prior context.[1] These models originated with statistical approaches like n-gram estimators in the mid-20th century, which approximated probabilities based on contiguous word sequences, but evolved significantly with the introduction of neural architectures in the early 2000s, culminating in transformer-based large language models (LLMs) that leverage massive parallel training on internet-scale text corpora to achieve human-like fluency in generation and comprehension tasks.[2][3]

The core mechanism of modern language models involves autoregressive prediction, where the model computes the conditional probability P(w_t \mid w_1, \dots, w_{t-1}) for each token w_t in a sequence, often using self-attention mechanisms in transformers to capture long-range dependencies without recurrent processing.[4] This shift from sequential models like recurrent neural networks (RNNs) to transformers, introduced in 2017, enabled scaling to billions or trillions of parameters, yielding breakthroughs in applications such as machine translation, code generation, and question answering, with empirical benchmarks showing LLMs outperforming prior systems on tasks like GLUE and SuperGLUE by wide margins due to emergent capabilities from pre-training on diverse data. Notable achievements include the GPT series by OpenAI, which demonstrated zero-shot learning on unseen tasks, and models like PaLM and LLaMA that revealed scaling laws where performance predictably improves with compute and data volume, underscoring the causal role of model size in approximating complex linguistic patterns.

Despite these advances, language models exhibit fundamental limitations rooted in their statistical nature, including hallucinations—generating plausible but factually incorrect outputs—as evidenced by empirical evaluations where even top models like GPT-4 err on novel factual queries at rates exceeding 10-20% in controlled tests, reflecting overfitting to training distributions rather than genuine causal understanding.[5] Biases inherited from uncurated web data propagate stereotypes and inaccuracies, with studies quantifying disparate error rates across demographic groups in tasks like sentiment analysis, though mitigation via fine-tuning yields inconsistent results due to trade-offs with overall perplexity.[6]

Controversies also encompass high environmental costs from training, equivalent to thousands of households' annual energy use for a single large model, and risks of misuse in generating deceptive content, as demonstrated by adversarial prompts eliciting harmful instructions despite safeguards.[4] These issues highlight that while language models excel at surface-level mimicry, they lack robust generalization to out-of-distribution causal scenarios, prompting ongoing research into hybrid systems incorporating symbolic reasoning or retrieval augmentation for enhanced reliability.[7]

Fundamentals
Definition and Scope
A language model is a probabilistic model that defines a joint probability distribution over sequences of words, tokens, or symbols drawn from a natural language vocabulary. It estimates the likelihood of a given sequence occurring, typically factorized via the chain rule of probability as P(w_1, \dots, w_n) = \prod_{i=1}^n P(w_i \mid w_1, \dots, w_{i-1}), where each conditional term predicts the next element given prior context.[8][9] This formulation captures sequential dependencies, enabling evaluation of sequence fluency through metrics like perplexity, defined as the exponential of the average negative log-likelihood.[10]

Early language models relied on statistical methods such as n-grams, which approximate conditional probabilities from empirical frequency counts in corpora, with smoothing techniques such as Kneser-Ney addressing data sparsity.[10] Neural variants, emerging prominently from 2003 onward, represent sequences via distributed embeddings and recurrent or attention mechanisms to model long-range dependencies more effectively than fixed-window statistics.[9] The scope excludes non-sequential models like bag-of-words classifiers, focusing instead on generative or predictive modeling of ordered linguistic units.

Language models underpin core natural language processing tasks, including autoregressive text generation, where sequences are sampled iteratively from conditional distributions; machine translation, scoring candidate translations by fluency; and speech recognition, rescoring hypotheses via language probabilities integrated with acoustic models.[8] They also support information retrieval by estimating query likelihood and enable foundational work in semantic representations, such as vector analogies derived from learned embeddings.[11] While scalable neural architectures have expanded capabilities to handle billions of parameters and diverse modalities, the fundamental scope remains bounded to probabilistic sequence modeling, distinct from broader AI systems like vision models or reinforcement learning agents.[12]

Probabilistic Foundations
Language models are statistical models designed to estimate the probability distribution over sequences of linguistic units, such as words, subwords, or characters, in a given language. This estimation captures the relative likelihood of different sequences occurring in natural language corpora, enabling applications like text generation, machine translation, and speech recognition. The foundational goal, as articulated in early statistical approaches, is to learn the joint probability function P(w_1, w_2, \dots, w_n) for a sequence of words w_1 to w_n.[9][13]

The chain rule of probability decomposes this joint distribution into a product of conditional probabilities: P(w_1, w_2, \dots, w_n) = \prod_{i=1}^n P(w_i \mid w_1, \dots, w_{i-1}), where P(w_1) initializes the sequence and each subsequent term conditions on all preceding elements. This autoregressive factorization reflects the causal structure of language generation, where each unit depends on prior context, aligning with empirical observations of sequential dependencies in human-produced text. Exact computation of these conditionals is intractable due to combinatorial explosion—the number of possible histories grows exponentially with sequence length—necessitating approximations.[14][15]

Traditional n-gram models approximate the conditionals via the Markov assumption, restricting dependence to a fixed window of n-1 prior units: P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1}). For instance, bigram models (n=2) condition solely on the immediate predecessor, with probabilities estimated via maximum likelihood from counts in training data: P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\#(w_{i-1})}. This approach suffers from sparsity, as unseen n-grams yield zero probabilities, addressed through smoothing techniques like Laplace or Kneser-Ney, which redistribute probability mass to unobserved events based on empirical frequency patterns.[16][17]

Neural language models parameterize the conditionals using differentiable functions, such as feedforward or recurrent networks, trained to maximize the log-likelihood of observed sequences under the chain rule. This enables learning dense vector representations that encode long-range dependencies and mitigate the curse of dimensionality in sparse count-based methods, as demonstrated in models achieving perplexity reductions on benchmarks like the Penn Treebank corpus. Training optimizes parameters \theta by maximizing the average log-likelihood \frac{1}{N} \sum_{i=1}^N \log P_\theta(w_i \mid w_1, \dots, w_{i-1}), where N is the corpus size, often using stochastic gradient descent. Evaluation metrics like perplexity, \exp\left( -\frac{1}{N} \sum \log P(w_i \mid \cdot) \right), quantify predictive uncertainty, with lower values indicating better approximation of the data-generating distribution.[9][13]
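The count-based estimation and perplexity computation described above can be made concrete with a short sketch. The Python example below uses a toy corpus and add-one (Laplace) smoothing rather than a production smoothing scheme; the corpus, vocabulary, and held-out sentence are purely illustrative.

```python
from collections import Counter
from math import exp, log

# Toy corpus; real models are estimated from corpora with billions of tokens.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
V = len(vocab)

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word, alpha=1.0):
    """Add-alpha (Laplace) smoothed estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + alpha) / (unigram_counts[prev] + alpha * V)

def perplexity(tokens):
    """Exponential of the average negative log-likelihood under the bigram model."""
    log_likelihood = sum(log(bigram_prob(p, w)) for p, w in zip(tokens, tokens[1:]))
    return exp(-log_likelihood / (len(tokens) - 1))

print(perplexity("the cat sat on the rug".split()))
```

Lower perplexity on held-out text indicates that the smoothed counts assign more probability mass to the sequences actually observed.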
Historical Development

Early Statistical Models
Early statistical language models originated in the field of information theory during the late 1940s, drawing on Markov chain principles to approximate the probabilities of sequential events in text.[18] Claude Shannon introduced these concepts in his 1948 paper "A Mathematical Theory of Communication," where he modeled language as a stochastic process to quantify information entropy, using zero-order approximations (uniform distributions) and higher-order Markov predictions for letter sequences in English.[18] In a 1951 follow-up, Shannon estimated the entropy of printed English at approximately 1 bit per letter by employing n-gram-like predictions, where the probability of a letter depends on the preceding 0 to 15 characters, demonstrating that bigram and trigram approximations captured much of the language's redundancy with per-character entropies dropping from 4.14 bits (zero-order) to around 1.3 bits (eighth-order).[19]

These foundational ideas evolved into explicit n-gram models for word-level prediction in natural language processing by the 1970s and 1980s, formalized under the Markov assumption that the probability of the next word w_m depends only on the previous n-1 words: P(w_m \mid w_1, \dots, w_{m-1}) \approx P(w_m \mid w_{m-n+1}, \dots, w_{m-1}).[10] Unigram models treated words independently, bigrams conditioned on one prior word, and trigrams on two, with counts derived from corpora like the Brown Corpus (1 million words, 1960s) to estimate probabilities via maximum likelihood, though sparse data necessitated early smoothing techniques such as add-one (Laplace) to assign non-zero probabilities to unseen sequences.[10] By the 1980s, these models supported applications in speech recognition, where trigrams improved perplexity measures on datasets like the Wall Street Journal corpus, reducing prediction error compared to bigrams by factoring in local context.[10]

The 1990s marked widespread adoption in statistical machine translation, where n-gram language models penalized ungrammatical outputs in noisy-channel frameworks. IBM researchers developed Models 1 through 5 starting in the late 1980s, incorporating trigram language models trained on parallel corpora like the Canadian Hansards (millions of sentence pairs) to compute translation probabilities alongside fluency scores, achieving initial benchmarks on French-English pairs with perplexity reductions via interpolated smoothing.[20] These models, estimated using expectation-maximization algorithms on up to 10^6 sentence pairs, relied on n-grams up to order 3 or 4 due to computational limits and data sparsity, with fertility and distortion extensions addressing word alignments but preserving the core statistical independence assumptions from Shannon's era.[20]

Despite limitations like the inability to capture long-range dependencies—evident in higher perplexities for n>3 on large corpora—early statistical models established probabilistic foundations, influencing toolkits like SRILM (1999) for efficient n-gram storage and querying on billions of words.[10]
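Shannon's order-by-order entropy estimates can be illustrated with a rough sketch. The snippet below substitutes a short repeated sample string for a real corpus, so its numbers are degenerate; on realistic English text the estimate falls from roughly 4 bits per character when conditioning on no context toward Shannon's figure of about 1 bit as the Markov order grows.

```python
from collections import Counter, defaultdict
from math import log2

# Illustrative sample only; a genuine estimate needs a large, varied English corpus.
text = ("the quick brown fox jumps over the lazy dog " * 200).lower()

def conditional_entropy(sample, order):
    """Bits per character of an order-k character Markov model estimated from `sample`,
    i.e. the entropy of the next character given the k preceding characters."""
    context_counts = defaultdict(Counter)
    for i in range(order, len(sample)):
        context_counts[sample[i - order:i]][sample[i]] += 1
    total = sum(sum(c.values()) for c in context_counts.values())
    entropy = 0.0
    for counts in context_counts.values():
        n_context = sum(counts.values())
        for count in counts.values():
            entropy -= (count / total) * log2(count / n_context)
    return entropy

for k in range(0, 4):
    print(f"order {k}: {conditional_entropy(text, k):.2f} bits/char")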
Emergence of Neural Approaches

The transition to neural approaches in language modeling began in the early 2000s, addressing limitations of statistical n-gram models, which struggled with data sparsity and the curse of dimensionality due to exponential growth in possible word sequences.[21] In 2003, Yoshua Bengio and colleagues introduced one of the first neural probabilistic language models, employing a feedforward neural network to estimate the probability of the next word given prior context.[21] This model used a distributed representation of words—early word embeddings—learned via backpropagation with shared parameters across context positions, enabling generalization beyond observed n-grams and achieving lower perplexity on held-out data compared to traditional methods, though at higher computational cost.[21]

Subsequent advancements incorporated recurrent neural networks (RNNs) to better capture sequential dependencies, overcoming the fixed-window constraints of feedforward models. In 2010, Tomáš Mikolov et al. developed the recurrent neural network language model (RNNLM), which utilized a simple RNN architecture to maintain a hidden state representing arbitrary-length history, trained efficiently with techniques like importance sampling for normalization.[22] Empirical evaluations on speech recognition tasks demonstrated RNNLM's superiority, with perplexity reductions of up to 20% over n-gram baselines and substantial word error rate improvements (e.g., 10-15% relative gains on large corpora like Switchboard).[22]

These neural methods gained traction through practical implementations and hardware advances, such as GPUs, which mitigated training inefficiencies; by the mid-2010s, they consistently outperformed statistical models in downstream applications like machine translation and ASR, paving the way for deeper architectures.[22] The core innovation—learning continuous, dense vector representations—facilitated semantic understanding absent in discrete n-gram probabilities, though challenges like vanishing gradients in standard RNNs prompted refinements such as long short-term memory (LSTM) units, introduced earlier in 1997 but increasingly applied to language tasks post-2010.[21]
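A compressed sketch of this model family, not Bengio et al.'s exact configuration or hyperparameters, is shown below in PyTorch: a fixed window of context words is embedded, concatenated, passed through a nonlinear hidden layer, and projected to next-word logits trained with a cross-entropy (maximum likelihood) objective. All sizes are toy values chosen for illustration.

```python
import torch
import torch.nn as nn

class FeedforwardLM(nn.Module):
    """Fixed-window neural language model: embed n-1 context words,
    concatenate the embeddings, apply a hidden layer, predict the next word."""
    def __init__(self, vocab_size, context_size=3, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):             # context_ids: (batch, context_size)
        e = self.embed(context_ids).flatten(1)  # (batch, context_size * embed_dim)
        h = torch.tanh(self.hidden(e))
        return self.out(h)                      # unnormalized next-word logits

# Toy usage: a batch of 2 contexts over a 100-word vocabulary.
model = FeedforwardLM(vocab_size=100)
logits = model(torch.randint(0, 100, (2, 3)))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 100, (2,)))
loss.backward()  # gradients for gradient-descent training on next-word prediction
```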
Scaling Era and Transformer Dominance

The scaling era in language modeling emerged in the late 2010s, driven by exponential growth in computational resources and data availability, which enabled training of models with billions of parameters and demonstrated predictable performance gains via power-law relationships in loss reduction.[23] Empirical studies revealed that cross-entropy loss scales as a power law in each of model size N, dataset size D, and compute C when the other factors are not limiting, approximately L(N) \propto N^{-\alpha}, L(D) \propto D^{-\beta}, and L(C) \propto C^{-\gamma}, with exponents \alpha \approx 0.076, \beta \approx 0.103, and \gamma \approx 0.050 holding across varied architectures, justifying investments in larger scales for diminishing but consistent returns.[23] This period shifted focus from architectural innovation to resource scaling, as larger models exhibited emergent abilities like few-shot learning without task-specific fine-tuning.[24]

The Transformer architecture, introduced in June 2017, underpinned this dominance by eschewing recurrent layers in favor of self-attention mechanisms, which compute dependencies between all sequence elements in parallel rather than sequentially.[3] This design overcame limitations of recurrent neural networks, such as vanishing gradients and inefficient handling of long contexts, allowing transformers to process sequences up to thousands of tokens with quadratic complexity in length but superior parallelizability on GPUs.[3] Causal masking in decoder-only variants, like those in the GPT series, further aligned transformers with autoregressive language modeling by restricting attention to prior tokens, enabling efficient next-token prediction central to generative tasks.[24]

Key milestones included OpenAI's GPT-3, detailed in a May 2020 paper, which scaled to 175 billion parameters trained on approximately 570 gigabytes of filtered text, achieving state-of-the-art few-shot performance on benchmarks like SuperGLUE without gradient updates on downstream data.[24] Subsequent refinements, such as optimal compute allocation balancing model size and data (e.g., equal scaling of N and D for fixed C), reinforced the Transformer's scalability, as larger models proved more sample-efficient than smaller ones under equivalent compute budgets.[23]

By the early 2020s, transformers supplanted prior paradigms due to their ability to capture long-range syntactic and semantic dependencies via multi-head attention, with ablation studies confirming attention's causal role in performance over alternatives like convolutions or recurrences.[3] This architectural edge, combined with hardware advances like TPUs and multi-node training, established transformers as the de facto standard, powering models from proprietary systems to open-source efforts exceeding trillion-parameter scales.
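A loss trajectory predicted by such a power law can be tabulated in a few lines. The constants below (the scale N_c and the exponent) are illustrative values in the spirit of the published fits rather than authoritative numbers, and the formula deliberately ignores the data and compute terms.

```python
def power_law_loss(n_params, n_c=8.8e13, alpha_n=0.076):
    """Kaplan-style power law in model size alone: predicted cross-entropy loss
    (nats/token) falls smoothly as parameter count grows.
    n_c and alpha_n are illustrative constants, not exact published values."""
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"N = {n:.0e}: predicted loss = {power_law_loss(n):.3f}")
```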
Architectures and Types

N-Gram and Statistical Precursors
Statistical language models based on n-grams served as foundational precursors to modern neural language models, relying on probabilistic estimation from empirical word sequences rather than learned representations. These models approximate the conditional probability of a word given its preceding context by considering only the immediately prior n-1 words, leveraging the Markov assumption that P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1}).[10] This fixed-order approximation stems from early applications of Markov chains to text prediction, with roots in Andrey Markov's 1913 analysis of letter sequences in Russian literature, later extended to words.[10]

Early conceptual groundwork was laid by Claude Shannon in his 1951 study on the entropy of printed English, where human subjects and Markov models of increasing order (up to 15 for letters) were used to estimate redundancy and predict text, yielding an entropy rate of approximately 1.3 bits per letter after accounting for dependencies.[25] Practical statistical language modeling gained traction in the 1970s through Frederick Jelinek's work at IBM on continuous speech recognition, where n-gram models were integrated into hidden Markov model frameworks to score word sequences probabilistically.[26] The first significant advancement in n-gram estimation came in 1980 with Jelinek and Mercer's interpolated linear smoothing method, which combined lower-order probabilities to mitigate data sparsity in higher-order models.[27]

Subsequent refinements addressed the challenge of unseen n-grams in finite corpora, a core limitation causing zero probabilities. Katz's 1987 backing-off technique recursively falls back to lower-order models for unobserved events while discounting seen ones using Good-Turing estimates, which allocate probability mass to unseen types based on the frequency of singletons.[27] Jelinek-Mercer interpolation weighted higher- and lower-order estimates directly, while later methods like Kneser-Ney (1994) incorporated absolute discounting with refined continuation counts to better capture lexical diversity.[10] These techniques enabled trigram models to achieve perplexities around 109 on corpora like the Wall Street Journal, outperforming bigrams (170) and unigrams (962), though higher n remained computationally infeasible due to exponential growth in parameters (e.g., ~20 billion for 4-grams on large vocabularies).[10]

N-gram models found primary application in speech recognition, where they scored word sequences alongside acoustic models, and in early statistical machine translation, as in Brown et al.'s 1990 IBM models, which used trigrams to model fluency in target languages.[27] Despite successes in perplexity reduction through smoothing and class-based partitioning (e.g., Brown et al. 1992), inherent limitations—such as inability to capture long-range dependencies beyond fixed n, sensitivity to out-of-vocabulary words, and reliance on massive corpora for sparse events—prompted the shift toward neural architectures in the early 2000s.[27] These statistical precursors emphasized empirical frequency over semantic understanding, establishing evaluation via perplexity as a standard metric for predictive accuracy that persists in neural successors.[10]
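Jelinek-Mercer style linear interpolation, one of the smoothing methods discussed above, can be sketched in a few lines of Python. The corpus is a toy example, and the interpolation weight would normally be tuned on held-out data (for instance by expectation-maximization) rather than fixed.

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def interpolated_prob(prev, word, lam=0.7):
    """Linear interpolation of bigram and unigram maximum-likelihood estimates.
    lam is a toy value; in practice it is tuned on held-out data."""
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_unigram = unigrams[word] / total
    return lam * p_bigram + (1 - lam) * p_unigram

print(interpolated_prob("the", "dog"))  # seen bigram: dominated by the bigram estimate
print(interpolated_prob("dog", "mat"))  # unseen bigram: falls back to unigram mass
```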
Recurrent and Sequence Models

Recurrent neural networks (RNNs) process sequential data by maintaining a hidden state that captures information from previous timesteps, enabling them to model dependencies in language sequences for tasks like next-word prediction.[28] In language modeling, a basic RNN takes an input sequence of words represented as vectors and updates its hidden state h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h), where \sigma is an activation function like tanh, to compute the probability of the next word via a softmax over the output layer.[29] This architecture allows RNNs to theoretically handle variable-length inputs, addressing limitations of fixed-context n-gram models, though early applications in the 1980s focused more on general sequence prediction than large-scale language modeling.[30]

A key advancement came with Tomas Mikolov's RNN-based language model (RNNLM) in 2010, which integrated a simple RNN with a maximum entropy output layer to predict words in speech recognition tasks, achieving perplexity reductions of up to 20% over traditional n-gram models on corpora like the Wall Street Journal.[22][31] However, vanilla RNNs suffered from vanishing or exploding gradients during backpropagation through time, making it difficult to learn long-range dependencies beyond 5-10 timesteps, as gradients diminish exponentially with sequence length.[30][32]

To mitigate these issues, long short-term memory (LSTM) units, introduced by Hochreiter and Schmidhuber in 1997, incorporate gating mechanisms—an input gate, forget gate, and output gate—to selectively update and retain information in a cell state, allowing effective capture of dependencies over hundreds of timesteps.[30] In language modeling, LSTMs demonstrated superior performance; for instance, Sundermeyer et al. in 2012 reported relative perplexity improvements of about 8% on English and large French corpora compared to feedforward neural networks.[33] LSTMs became a staple for sequence modeling, powering early neural machine translation and text generation by maintaining contextual memory without full sequence recomputation.[34]

Gated recurrent units (GRUs), proposed by Cho et al. in 2014, simplify LSTMs by merging the forget and input gates into a single update gate and eliminating the separate output gate, reducing parameters by roughly 25% while retaining comparable performance on sequence tasks.[35] Empirical comparisons in language modeling show GRUs training 20-30% faster than LSTMs due to fewer computations, with negligible perplexity differences on datasets like WikiText-2, though LSTMs may edge out on very long dependencies.[36]

Despite these refinements, recurrent models face inherent limitations in language modeling, including sequential processing that precludes efficient parallelization across timesteps, leading to training times scaling linearly with sequence length—unlike the constant-time operations in later architectures.[37] Additionally, even gated variants struggle with extremely long contexts (e.g., beyond 1000 tokens) due to accumulated numerical instability and the difficulty of compressing long histories into a fixed-size hidden state, prompting shifts toward attention-based mechanisms by the mid-2010s.[38][39] These constraints were empirically evident in scaling experiments, where recurrent models plateaued in perplexity gains as datasets grew to billions of tokens, underscoring their role as transitional architectures rather than scalable solutions for modern large-scale language modeling.[40]
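The hidden-state update above translates directly into code. The NumPy sketch below, with randomly initialized toy parameters, runs one vanilla RNN step per token and returns a softmax distribution over the next word; a trained model would learn these weights by backpropagation through time rather than using random values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 50, 16, 32

# Randomly initialized toy parameters of a vanilla RNN language model.
E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))       # word embeddings
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))
b_y = np.zeros(vocab_size)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(token_id, h_prev):
    """One RNN timestep: update the hidden state, return P(next token)."""
    x_t = E[token_id]
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    p_next = softmax(W_hy @ h_t + b_y)
    return h_t, p_next

h = np.zeros(hidden_dim)
for tok in [3, 17, 5]:          # a toy token sequence
    h, p_next = step(tok, h)
print(p_next.argmax())          # most probable next token under the random weights
```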
Transformer-Based and Large-Scale Variants

The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., marked a paradigm shift in sequence modeling by replacing recurrent layers with self-attention mechanisms, enabling parallel computation across sequences and capturing long-range dependencies more effectively than prior recurrent neural networks (RNNs).[3] This design consists of stacked encoder and decoder blocks, where multi-head self-attention computes weighted representations of input tokens relative to each other, scaled by dot-product similarity and softened via softmax, followed by feed-forward networks and layer normalization.[3] Transformers initially excelled in machine translation but rapidly adapted to language modeling by autoregressively predicting the next token in a sequence, leveraging positional encodings to preserve order information absent in pure attention.[3]

Decoder-only Transformer variants, pioneered by OpenAI's GPT series, focus on unidirectional generation for causal language modeling, omitting the encoder to prioritize efficient autoregressive inference. GPT-1, released in June 2018 with 117 million parameters trained on the BookCorpus dataset, showed that generative pre-training followed by task-specific fine-tuning outperformed prior baselines, with early evidence of zero-shot transfer. GPT-2, announced in February 2019 with a 1.5 billion parameter model trained on WebText (8 million web pages), showed unsupervised text generation capabilities approaching human-like coherence, though initially withheld due to misuse concerns before partial release. GPT-3, unveiled in May 2020 with 175 billion parameters trained on 570 gigabytes of filtered Common Crawl data plus Books and Wikipedia, scaled predictably in performance, achieving strong few-shot results on benchmarks like SuperGLUE without task-specific fine-tuning, attributed to increased model capacity and data volume.[24]

Encoder-only Transformers, such as BERT (Bidirectional Encoder Representations from Transformers) from Google, released in October 2018, enable bidirectional context for masked language modeling and next-sentence prediction, pretraining on 3.3 billion words from BooksCorpus and English Wikipedia to yield embeddings fine-tuned for downstream tasks like question answering. Variants like T5 (Text-to-Text Transfer Transformer), introduced by Google in October 2019, unify tasks under a text-to-text framework with an encoder-decoder setup, scaling to 11 billion parameters by 2021 and demonstrating that framing all NLP problems as generation improves versatility.

Large-scale models, often exceeding 100 billion parameters, rely on massive distributed training: for instance, PaLM (Pathways Language Model) from Google, with 540 billion parameters trained in 2022 on 780 billion tokens using Pathways infrastructure, highlighted multilingual and reasoning gains from compute-intensive scaling. Empirical scaling laws, formalized by Kaplan et al. in 2020, quantify that language model loss decreases as a power law with model size (N), dataset size (D), and compute (C ≈ 6ND), with cross-entropy loss scaling as L(N) ∝ N^{-α} where α ≈ 0.076 for parameters, guiding efficient resource allocation.[23] Hoffmann et al.'s 2022 Chinchilla analysis refined this, finding compute-optimal models balance parameters and data at roughly 20 tokens per parameter, as in the 70 billion parameter Chinchilla model outperforming larger but data-underdense GPT-3 on BIG-Bench, underscoring that naive parameter scaling without proportional data yields diminishing returns.[41] These laws, validated across models up to trillions of parameters like Google's 2023 PaLM 2 (up to 340 billion parameters), explain performance predictability but also reveal plateaus in certain capabilities, such as factual recall, limited by training data quality over sheer scale.[23][41]

Open-source efforts, including Meta's LLaMA series (e.g., LLaMA 2 in July 2023 with 70 billion parameters trained on 2 trillion tokens), democratized access while emphasizing responsible scaling through safety fine-tuning. By 2025, proprietary models like OpenAI's GPT-4 (parameter count undisclosed but estimated >1 trillion) and xAI's Grok-1 (314 billion parameters, released November 2023) continued this trend, integrating multimodal extensions while prioritizing inference efficiency via techniques like mixture-of-experts (MoE) sparsity, as in Grok-1's design for reduced active parameters during forward passes.
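The scaled dot-product self-attention at the core of these models, restricted by a causal mask as in decoder-only variants, can be sketched as follows. This is a minimal single-head NumPy illustration with toy dimensions; it omits the multi-head projections, residual connections, and layer normalization of a full Transformer block.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with a causal mask,
    so each position attends only to itself and earlier positions."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # (seq_len, d_head) each
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)            # pairwise similarities
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                        # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # context-mixed representations

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
out = causal_self_attention(x, *(rng.normal(size=(d_model, d_head)) for _ in range(3)))
print(out.shape)  # (5, 4)
```

The quadratic cost in sequence length noted earlier is visible in the (seq_len, seq_len) score matrix, while the absence of any recurrence is what allows all positions to be computed in parallel.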
Training and Optimization

Data Acquisition and Preparation
Data acquisition for large language models primarily relies on vast web-scale corpora, with Common Crawl serving as the foundational source due to its comprehensive, freely available snapshots of the internet, comprising petabytes of raw HTML from monthly crawls since 2008.[42] This dataset has been integral to training models like GPT-3 and BLOOM, often comprising 60-80% of pretraining corpora after downsampling to manage scale and quality.[43] Supplementary sources include digitized books (e.g., via Project Gutenberg or proprietary scans), academic publications from arXiv, code from GitHub repositories, and specialized datasets like news archives or scientific texts to enhance domain-specific coverage.[44] Curated public datasets such as C4 (derived from Common Crawl with basic cleaning), The Pile (825 GB across 22 diverse subsets), and OSCAR (multilingual extracts) aggregate these to provide trillions of tokens, enabling models to capture broad linguistic patterns without proprietary dependencies.[44]

Preparation begins with extraction, parsing raw formats like WARC files from Common Crawl to isolate textual content while discarding non-text elements such as scripts, ads, and navigation boilerplate using tools like Boilerpipe or heuristic rules based on document structure.[45] Cleaning follows, applying filters for minimum length (e.g., sentences over 3 words), language detection to retain primary languages like English, and perplexity scoring via small proxy models to exclude low-quality or nonsensical text, which can constitute up to 50% of raw web data.[46]

Deduplication is critical to prevent overfitting and reduce training redundancy, employing methods like exact hashing for identical documents, MinHash locality-sensitive hashing for fuzzy near-duplicate matches at trillion-token scales, or embedding-based clustering, yielding efficiency gains of 20% or more in convergence speed as demonstrated in controlled pretraining experiments.[47][48] Further quality filtering uses classifiers trained on heuristics or lightweight models to remove toxic, repetitive, or off-topic content, with pipelines like FineWeb demonstrating that heuristic-based selection (e.g., for educational value via readability scores) can distill 15 trillion tokens from Common Crawl into higher-utility subsets outperforming unfiltered baselines on downstream tasks.[49]

Tokenization concludes the pipeline, converting cleaned text into subword units via algorithms like Byte-Pair Encoding (BPE) or Unigram, which compress vocabulary to 50,000-100,000 tokens while handling rare words through merging frequent pairs, essential for efficient model input as raw characters would explode sequence lengths.[50] These steps collectively transform noisy, heterogeneous inputs into coherent token sequences, with empirical evidence showing that rigorous preparation correlates with improved generalization, though unaddressed biases in web-sourced data—such as overrepresentation of English-centric content—persist as inherent limitations.[46]
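The merge loop at the heart of byte-pair encoding can be illustrated with a toy learner. Real tokenizers operate on bytes, handle pre-tokenization and word boundaries, and train on far larger corpora, so the sketch below is only a schematic of the algorithm with an invented word list.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE learner: repeatedly merge the most frequent adjacent symbol pair,
    starting from individual characters."""
    vocab = Counter(tuple(word) for word in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

print(learn_bpe(["low", "lower", "lowest", "newest", "widest"], num_merges=5))
```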
Parameter Scaling and Empirical Laws

Empirical scaling laws in language models describe predictable relationships between training resources—such as the number of parameters N, dataset size D, and compute C—and model performance, typically measured by cross-entropy loss on held-out data. These laws emerged from systematic experiments showing that loss decreases as a power law with increases in each resource when others are held fixed. Kaplan et al. (2020) first quantified this by training transformer-based models ranging from 10 million to 6 billion parameters on datasets up to 300 billion tokens, finding that validation loss L follows L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + L_\infty for parameters, with \alpha_N \approx 0.076, and analogous forms for D (\alpha_D \approx 0.095) and C (\alpha_C \approx 0.046), where L_\infty represents an irreducible loss floor.[23] The exponents indicate diminishing returns, but the smooth, unbroken power-law behavior across orders of magnitude suggested that performance gains would persist with further scaling, challenging prior assumptions of abrupt saturation.[23]

Under compute constraints, where C \propto N \cdot D for transformer training (approximating FLOPs as C \approx 6ND), Kaplan et al. derived an optimal allocation favoring larger N over D, predicting that model size should scale as N \propto C^{0.73} and data as D \propto C^{0.27}.[23] This informed early large-scale efforts like GPT-3 (175 billion parameters trained on approximately 300 billion tokens), which aligned roughly with compute-optimal paths and demonstrated broad capability improvements. However, subsequent analysis revealed inefficiencies: Hoffmann et al. (2022) re-evaluated scaling across models up to 280 billion parameters and found that prior large models were severely data-limited, with optimal scaling requiring N \propto C^{0.5} and D \propto C^{0.5}, emphasizing balanced growth in parameters and data to minimize loss for a given compute budget.[41] They validated this by training Chinchilla, a 70-billion-parameter model on 1.4 trillion tokens using the same compute as the 280-billion-parameter Gopher (trained on 300 billion tokens), achieving a 7% higher average accuracy on the MMLU benchmark (67.5% vs. 59.7%) and lower perplexity across evaluations.[41]

These laws have guided resource allocation in subsequent models, with empirical validation extending to trillion-parameter scales, though exponents vary slightly by architecture and data quality. For instance, mixture-of-experts models decouple active parameters from total N, yielding adjusted scaling where effective compute efficiency alters the N-C relationship. Recent work confirms power-law predictability holds for inference-time scaling, where performance improves with additional compute via techniques like test-time training or chain-of-thought prompting, following L \propto M^{-\beta} for inference FLOPs M. However, deviations arise with high-quality or synthetic data, where sub-scaling (steeper loss reduction per resource) can occur, and real-world limits like data scarcity or hardware constraints challenge indefinite extrapolation.[51][52] The empirical nature of these laws—derived from curve-fitting experimental runs rather than theoretical proofs—underscores their utility for prediction but highlights risks of breakdown beyond probed regimes, as seen in varying task-specific exponents (e.g., shallower scaling for reasoning benchmarks).[23][41]
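The compute-optimal balance described above reduces to simple arithmetic under the C ≈ 6ND approximation and the roughly 20-tokens-per-parameter rule of thumb. The helper below is a back-of-the-envelope sketch, not a reproduction of the published fitting procedure, and the token-per-parameter ratio is only the commonly quoted approximation.

```python
def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Rough compute-optimal sizing using C = 6 * N * D and D = r * N,
    which gives N = sqrt(C / (6 * r)) and D = r * N."""
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e21, 1e23, 1e25):
    n, d = compute_optimal_allocation(c)
    print(f"C = {c:.0e} FLOPs -> N = {n:.2e} params, D = {d:.2e} tokens")
```

For a budget near Chinchilla's (about 6 x 10^23 FLOPs) this recovers a model of roughly 70 billion parameters trained on about 1.4 trillion tokens, consistent with the figures cited above.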
Alignment and Fine-Tuning Methods

Fine-tuning adapts pre-trained language models to downstream tasks or desired behaviors by continuing training on smaller, curated datasets, typically using supervised learning objectives like next-token prediction on instruction-response pairs. Supervised fine-tuning (SFT), also known as instruction tuning, involves training on high-quality, human-annotated examples where inputs are prompts and outputs are desired responses, enabling models to follow instructions more effectively than zero-shot prompting alone. This method has been empirically shown to improve task performance on benchmarks like GLUE and SuperGLUE, though it risks overfitting to the fine-tuning distribution if data quality is low.[53]

Alignment extends fine-tuning to steer models toward human-preferred outputs, emphasizing helpfulness, honesty, and harmlessness, often addressing issues like toxicity or refusal to answer unsafe queries. Reinforcement learning from human feedback (RLHF) is a prominent technique, popularized for language models by OpenAI's InstructGPT work in 2022, where human annotators rank model outputs for quality, training a reward model to score responses, followed by policy optimization using proximal policy optimization (PPO) to maximize rewards while staying close to the SFT baseline.[54] RLHF significantly reduced harmful outputs in models like InstructGPT, with evaluations showing up to 80% preference alignment on held-out tasks, but it scales poorly due to high annotation costs and can induce sycophancy or reward hacking, where models exploit proxy rewards rather than truly understanding values.[55]

Alternatives to RLHF mitigate these issues by avoiding explicit reward modeling. Direct preference optimization (DPO), proposed in 2023, directly fine-tunes the language model on preference pairs using a loss that implicitly derives an optimal policy from human rankings, bypassing RL instability and achieving comparable or superior alignment on datasets like HH-RLHF without PPO's computational overhead.[56] Empirical results demonstrate DPO converging faster and yielding less variance in outputs, though it assumes access to a reference model for regularization. Constitutional AI, developed by Anthropic in 2022, uses self-supervised critique and revision guided by a predefined "constitution" of principles (e.g., avoiding harm or bias), reducing reliance on human labels by having the model generate and evaluate its own outputs against rules, which improved harmlessness scores by 20-30% over baselines in internal tests while enhancing transparency.[57] These methods highlight ongoing trade-offs: while effective for surface-level behaviors, deeper causal misalignment persists, as evidenced by persistent hallucinations and jailbreak vulnerabilities in aligned models.[58]
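The DPO objective can be written compactly. The sketch below assumes the caller supplies the summed log-probabilities of each chosen and rejected response under the trainable policy and the frozen reference model; the numeric values are made up for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss on a batch of preference pairs. Each argument is the summed
    log-probability of a full response under the policy or reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to widen the margin between preferred and dispreferred responses.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of 4 preference pairs with invented log-probabilities.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1, -7.3]),
                torch.tensor([-14.2, -11.0, -19.8, -9.9]),
                torch.tensor([-13.0, -10.0, -20.0, -8.0]),
                torch.tensor([-13.5, -10.5, -20.5, -9.0]))
print(loss)
```

The frozen reference terms play the regularizing role that the KL penalty plays in PPO-based RLHF, which is why DPO still requires a reference model even though it drops the separate reward model.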
Evaluation Frameworks

Intrinsic Measures of Predictability
Intrinsic measures of predictability evaluate a language model's core capability to forecast subsequent tokens in a sequence given prior context, relying solely on the model's probability distributions over the vocabulary rather than performance on external tasks. These metrics quantify the model's uncertainty or "surprise" when encountering test data, providing a direct gauge of predictive fidelity independent of application-specific outcomes. The most widely adopted such measure is perplexity (PPL), which serves as a proxy for the model's average branching factor—the effective number of choices it considers plausible at each prediction step.[59]

Perplexity is computed as the exponential of the average negative log-likelihood of a test sequence under the model's predictions: for a sequence of n tokens w_1, \dots, w_n, \mathrm{PPL} = \exp\left(-\frac{1}{n} \sum_{i=1}^n \log P(w_i \mid w_1, \dots, w_{i-1})\right). This formulation derives from information theory, where lower perplexity reflects higher predictability, akin to the model being less "perplexed" by the data; for instance, a PPL of 10 implies the model behaves as if selecting from 10 equally likely options on average per token. Cross-entropy loss underpins this, measuring the divergence between the empirical token distribution p and the model's predicted distribution q as H(p, q) = -\sum p \log q, with perplexity as e^{H(p, q)} in natural log units. Bits-per-character (BPC), another related metric, normalizes cross-entropy (in bits) by sequence length in characters, facilitating comparisons across languages or granularities by emphasizing compression efficiency.[59][60]

These measures are typically assessed on held-out corpora such as WikiText-103 or the C4 dataset, where models like GPT-3 achieved perplexities around 20-30 on English text by 2020, improving with scale; for example, larger models under Chinchilla scaling laws reduced PPL logarithmically with compute. However, perplexity's intrinsic nature limits its scope: it prioritizes fluent token prediction but does not ensure factual accuracy, semantic coherence, or robustness to adversarial inputs, as models can memorize training data to lower PPL without generalizing causally. Recent advancements address tokenizer disparities—different subword schemes (e.g., BPE vs. SentencePiece) inflate or deflate raw PPL—via normalized variants like weighted perplexity, which adjust for vocabulary size and token length distributions to enable fair cross-model comparisons.[61][62]

Empirical studies confirm perplexity's correlation with downstream capabilities in controlled settings, yet divergences arise; for instance, over-optimized models may exhibit low PPL on in-distribution data while hallucinating on novel prompts, underscoring that predictability alone proxies fluency rather than understanding. Bits-per-character complements perplexity by revealing sub-token inefficiencies, with human-language BPC baselines around 1-1.5 bits for English, against which models like PaLM approached 1.2 by 2022. Despite these utilities, intrinsic metrics undervalue long-context dependencies, where PPL can degrade quadratically without architectural mitigations like transformers' attention.[59][63]
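Both metrics follow mechanically from a model's per-token log-probabilities on held-out text, as in the short sketch below; the log-probabilities and character count are hypothetical values rather than outputs of any particular model.

```python
import math

def intrinsic_metrics(token_logprobs, num_chars):
    """Perplexity and bits-per-character from per-token natural-log probabilities
    assigned by a model to a held-out text of `num_chars` characters."""
    nll = -sum(token_logprobs)                        # total negative log-likelihood (nats)
    perplexity = math.exp(nll / len(token_logprobs))  # average branching factor per token
    bpc = nll / (math.log(2) * num_chars)             # convert nats to bits, normalize by chars
    return perplexity, bpc

# Hypothetical log-probabilities for a 6-token, 27-character test sentence.
logprobs = [-2.3, -0.7, -1.9, -0.2, -3.1, -0.9]
print(intrinsic_metrics(logprobs, num_chars=27))
```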
Task-Specific Benchmarks

Task-specific benchmarks evaluate language models on predefined natural language processing tasks using standardized datasets, metrics such as accuracy, F1-score, or exact match, and often involve multiple-choice, classification, or generation subtasks to measure capabilities like comprehension, inference, or problem-solving.[64] These differ from intrinsic predictability measures by focusing on downstream applications rather than raw token prediction, though saturation in older benchmarks like GLUE has prompted development of harder variants.[65] Empirical performance on these benchmarks correlates with scaling laws, where larger models trained on more data achieve higher scores, but results must account for potential data contamination from training corpora.[66]

The GLUE benchmark, introduced in January 2018, aggregates nine tasks including single-sentence classification (e.g., CoLA for linguistic acceptability, SST-2 for sentiment polarity) and sentence-pair tasks (e.g., MNLI for natural language inference, QQP for paraphrase detection).[64] Scores are computed per task—such as Matthews correlation for CoLA or Pearson correlation for semantic similarity (STS-B)—and averaged into a single GLUE score, with human baselines around 80-90% but early models like BERT achieving 80.5% in 2018.[67] By 2023, large models exceeded 90%, indicating saturation and limited differentiation for advanced systems.

SuperGLUE, released in May 2019 as a more challenging successor, includes eight tasks emphasizing coreference resolution (WSC), word-in-context disambiguation (WiC), and reading comprehension (ReCoRD), with metrics like exact match for generation tasks and accuracy for classification.[65] It incorporates longer contexts and adversarial examples to probe deeper reasoning, where human performance averages 89.8% but top models like T5-11B reached 89.1% by 2020; however, discrepancies in leaderboard rankings for models like GPT-3 suggest inconsistencies possibly from evaluation protocols.[68][69]

Knowledge-intensive benchmarks like MMLU (Massive Multitask Language Understanding), proposed in September 2020, test factual recall and reasoning across 57 subjects (e.g., history, law, STEM) via 14,000 multiple-choice questions at professional or high-school levels, scored by accuracy with chain-of-thought prompting boosting results. Models like GPT-4 achieve 86.4% in 2023, approaching expert levels in some domains but revealing gaps in abstract reasoning.[64] Commonsense reasoning tasks such as HellaSwag (2019), with 70,000 sentence-completion items derived from video captions and adversarial filtering, use accuracy to assess plausible continuation prediction, where models like GPT-4 score above 95% but falter on subtle inferences.

Domain-specific benchmarks target specialized skills: GSM8K (2021) comprises 8,500 grade-school math word problems requiring multi-step arithmetic reasoning, evaluated by exact match accuracy, with models like PaLM 540B reaching 58% via prompting but highlighting symbolic manipulation weaknesses. HumanEval (2021), for code generation, presents 164 Python programming problems solved via functional correctness, using pass@1 (first-attempt success) or pass@k metrics; GPT-3.5 scores 48.1% pass@1, while specialized fine-tuning elevates this to over 70% in later models, though it exposes brittleness to edge cases.
These benchmarks collectively reveal scaling benefits but underscore the need for robustness against distribution shifts.[70]

| Benchmark | Introduction Year | Key Tasks | Primary Metric | Example Top Score (Model, Year) |
|---|---|---|---|---|
| GLUE | 2018 | NLI, sentiment, paraphrase | Averaged task scores | 91.3% (DeBERTa, 2021)[64] |
| SuperGLUE | 2019 | Coreference, WiC, ReCoRD | Averaged task scores | 89.1% (T5-11B, 2020)[65] |
| MMLU | 2020 | Multi-subject MCQs | Accuracy | 86.4% (GPT-4, 2023)[71] |
| HellaSwag | 2019 | Commonsense completion | Accuracy | 95.3% (GPT-4, 2023)[64] |
| GSM8K | 2021 | Math word problems | Exact match | 74.4% (Minerva, 2022)[72] |
| HumanEval | 2021 | Code synthesis | Pass@1 | 67.0% (GPT-4, 2023) |
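The pass@k metric reported for HumanEval above is usually computed with an unbiased estimator: given n sampled completions per problem, of which c pass the unit tests, the per-problem estimate is 1 - C(n-c, k)/C(n, k), averaged over problems. A minimal sketch, with an invented sample count:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased per-problem estimate of pass@k from n sampled completions,
    c of which pass the unit tests."""
    if n - c < k:
        return 1.0  # too few failing samples to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one problem, 30 of which pass the tests.
print(pass_at_k(n=200, c=30, k=1))   # about 0.15
print(pass_at_k(n=200, c=30, k=10))  # substantially higher with more attempts
```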
Comparative Performance Analysis
Language models are evaluated comparatively through standardized benchmarks that measure capabilities such as multitask knowledge (MMLU), commonsense inference (HellaSwag), scientific reasoning (GPQA), coding proficiency (HumanEval), and overall user preference via crowdsourced platforms like the LMSYS Chatbot Arena. These metrics reveal scaling trends where larger parameter counts and refined training correlate with improved scores, though diminishing returns and benchmark saturation are evident among frontier models.[64][73]

However, benchmarks face limitations including potential data contamination from training corpora, over-optimization by developers, and failure to capture long-tail real-world robustness or causal reasoning depth. Crowdsourced arenas mitigate some issues by incorporating human judgments on helpfulness and coherence but introduce subjective biases and may favor verbose or safety-aligned responses over raw capability.[74]

As of mid-2024, proprietary models like OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet lead with MMLU scores of 88.7%, closely trailed by Meta's open-source Llama 3.1 405B at 88.6%. xAI's Grok-2 achieves 87.5% on MMLU, demonstrating competitive knowledge recall while emphasizing uncensored outputs that may diverge from safety-tuned competitors. On coding benchmarks, Claude 3.5 Sonnet scores 92.0% on HumanEval, surpassing GPT-4o's 90.2%, Llama 3.1 405B's 89.0%, and Grok-2's 88.4%. These narrow margins highlight convergence driven by compute-intensive scaling, yet open models like Llama enable broader verification and adaptation, reducing reliance on black-box proprietary evaluations.[75]

| Model | MMLU (%) | HumanEval (%) | GPQA (%) | LMSYS Arena Elo |
|---|---|---|---|---|
| GPT-4o | 88.7 | 90.2 | 53.6 | 1286 |
| Claude 3.5 Sonnet | 88.7 | 92.0 | 59.4 | 1272 |
| Llama 3.1 405B | 88.6 | 89.0 | 51.1 | 1264 |
| Grok-2 | 87.5 | 88.4 | N/A | ~1250 |