
Generative pre-trained transformer

A generative pre-trained transformer (GPT) is a type of large language model that employs a transformer-based architecture to generate human-like text by predicting the next token in a sequence. It was initially introduced by OpenAI in 2018 as a method for unsupervised pre-training followed by supervised fine-tuning on downstream tasks. This approach leverages massive datasets for pre-training, enabling the model to capture broad linguistic patterns before adaptation to specific applications like machine translation, summarization, and question-answering. The foundational GPT model, detailed in the 2018 paper "Improving Language Understanding by Generative Pre-Training," utilized a decoder-only transformer with 12 layers and masked self-attention, trained on the BookCorpus dataset to achieve state-of-the-art results on tasks such as natural language inference and question answering after fine-tuning.

Subsequent iterations scaled this paradigm dramatically. GPT-2 (2019) expanded to 1.5 billion parameters and demonstrated emergent capabilities in zero-shot task transfer without task-specific fine-tuning, raising concerns about potential misuse in generating deceptive content. GPT-3 (2020), with 175 billion parameters, further advanced few-shot learning, performing competitively on diverse benchmarks such as SuperGLUE through in-context examples alone, marking a shift toward versatile, general-purpose AI systems. Building on these, GPT-4 (2023) introduced multimodal capabilities, processing both text and images to produce coherent text outputs, and achieved human-level performance on professional exams such as the bar exam and SAT, while incorporating safety measures like reinforcement learning from human feedback (RLHF) to align outputs with ethical guidelines. The series has continued to evolve, with models such as GPT-4o (2024), GPT-5 (August 2025), and GPT-5.1 (November 2025) enhancing reasoning, multimodality, and conversational abilities.

The GPT series has profoundly influenced natural language processing by popularizing autoregressive generation and scaling laws, where model performance improves predictably with increased compute, data, and parameters, though it has also sparked debates on job displacement, bias amplification, and the need for robust safeguards against hallucinations and misinformation.

Overview and Background

Definition and Core Principles

A generative pre-trained transformer (GPT) is a type of large language model that employs an autoregressive transformer architecture, specifically designed to predict the next token in a sequence of text based on preceding tokens. This approach enables the model to generate coherent and contextually relevant text by modeling the probability distribution over possible continuations. Introduced as a semi-supervised method to enhance natural language understanding, GPT models are characterized by their ability to process and produce human-like text through sequential prediction.

At its core, the GPT framework rests on three key principles: generativity, pre-training, and transformer-based processing. Generativity refers to the model's capacity to produce novel text outputs rather than merely classifying or retrieving information, allowing applications in tasks like text completion and dialogue. Pre-training involves unsupervised learning on massive corpora of unlabeled text data, which equips the model with broad linguistic knowledge before any task-specific fine-tuning. The transformer foundation leverages self-attention mechanisms to efficiently capture long-range dependencies in sequences, replacing recurrent structures for parallelizable computation.

Autoregressive generation in GPT operates probabilistically, factorizing the joint probability of a sequence into conditional probabilities for each successive token. For a sequence x = (x_1, x_2, \dots, x_n), the model computes the likelihood as P(x) = \prod_{t=1}^n P(x_t \mid x_{<t}), where x_{<t} = (x_1, \dots, x_{t-1}). The next-token prediction is given by:

P(x_t \mid x_{<t}) = \text{softmax}\left( \text{Transformer}(x_{<t}) \right)

This formulation allows the model to sample or select tokens iteratively, building outputs token by token while conditioning on prior context.

The effectiveness of GPT models is profoundly influenced by scaling: increasing the number of parameters (often billions) and the volume of training data unlocks emergent abilities, such as in-context learning, where the model adapts to new tasks from examples provided in the prompt without parameter updates. These capabilities arise unpredictably at sufficient scale, marking a shift from predictable performance improvements to novel behaviors like few-shot reasoning.
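To make the factorization concrete, the following minimal sketch implements greedy autoregressive decoding in PyTorch. The `model` here is a hypothetical stand-in assumed to return logits of shape (batch, length, vocab_size); it is not any particular GPT implementation.

```python
import torch
import torch.nn.functional as F

def generate(model, prompt_ids, max_new_tokens=20):
    """Greedy autoregressive decoding: repeatedly pick the most probable
    next token given everything generated so far."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        x = torch.tensor([ids])                   # shape (1, t)
        logits = model(x)                         # assumed shape (1, t, vocab_size)
        probs = F.softmax(logits[0, -1], dim=-1)  # P(x_t | x_{<t})
        ids.append(int(torch.argmax(probs)))      # greedy choice
    return ids
```

Replacing the argmax with sampling from `probs` recovers the stochastic generation described above.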

Historical Context

Prior to the development of transformer-based models, natural language processing (NLP) relied heavily on recurrent neural networks (RNNs) and their variants, such as long short-term memory (LSTM) units, which processed sequences sequentially and struggled with capturing long-range dependencies due to issues like vanishing gradients. These architectures also faced challenges in parallelization during training, as each time step depended on the previous one, limiting scalability on modern hardware like GPUs.

The transformer architecture, introduced in 2017 by Vaswani et al., marked a pivotal advancement by replacing recurrence with self-attention mechanisms, enabling full parallelization and more effective handling of long sequences in tasks such as machine translation. This innovation, detailed in the paper "Attention Is All You Need," demonstrated superior performance on benchmarks like the WMT 2014 English-to-German translation task, achieving a BLEU score of 28.4 compared to previous state-of-the-art methods. The self-attention mechanism allowed the model to weigh the importance of different parts of the input simultaneously, addressing the sequential bottlenecks of RNNs and paving the way for larger-scale language models.

In the late 2010s, the field shifted toward pre-training paradigms to leverage vast amounts of unlabeled data, moving from purely discriminative models to generative approaches that could learn rich representations through unsupervised tasks like next-token prediction. This evolution was driven by the need to scale models amid growing datasets, as traditional supervised learning proved insufficient for the complexity of real-world language understanding. A key influence was BERT's bidirectional pre-training in 2018 by Devlin et al., which masked words and predicted them contextually from both directions, achieving state-of-the-art results on benchmarks like GLUE with an average score of 80.5; however, this contrasted with the unidirectional, autoregressive generative pre-training that would define GPT-style models.

Technical Architecture

Transformer Foundations

The Transformer architecture, introduced in 2017, forms the foundational backbone for generative pre-trained transformer models like GPT. It replaces recurrent neural networks with an architecture centered on self-attention, enabling parallel processing of sequences and capturing long-range dependencies more effectively. The original design features an encoder-decoder structure: the encoder processes input sequences into continuous representations, while the decoder generates outputs autoregressively by attending to the encoder's outputs and previously generated tokens. However, GPT models adapt this by employing a decoder-only architecture, which omits the encoder and relies solely on masked self-attention within the decoder to model sequential generation without bidirectional context from future tokens.

At the core of the Transformer is the self-attention mechanism, which allows each position in a sequence to attend to all others, computing weighted representations based on their relevance. This is implemented via scaled dot-product attention, where for input matrices of queries Q and keys K (of dimension d_k) and values V (of dimension d_v), the attention output is given by:

\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V

The scaling factor \sqrt{d_k} prevents vanishing gradients in the softmax, ensuring stable computation as dimensions increase. To enhance expressiveness, multi-head attention projects Q, K, and V into h parallel subspaces (typically h = 8), computes attention independently in each head, and concatenates the results before a final linear projection. This allows the model to jointly attend to information from different representation subspaces.

Transformers incorporate positional encodings to inject sequence order information, as self-attention is inherently permutation-invariant. In the original formulation, fixed sinusoidal encodings are added to input embeddings using functions of different frequencies:

PE_{(pos,2i)} = \sin\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right), \quad PE_{(pos,2i+1)} = \cos\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right)

where pos is the token position and i indexes the embedding dimension up to d_{\text{model}} (e.g., 512). Variants, such as those used in GPT, employ learned positional embeddings trained jointly with the model parameters, offering flexibility for varying sequence lengths.

Following the attention sub-layer, each layer includes position-wise feed-forward networks (two linear transformations with a ReLU in between) to apply non-linearities independently at each position, expanding to an inner dimension (e.g., 2048) before projecting back. For stable training at scale, Transformers integrate residual connections around each sub-layer (attention and feed-forward) and apply layer normalization afterward, yielding outputs of the form \text{LayerNorm}(x + \text{Sublayer}(x)). These elements mitigate vanishing gradients and enable the stacking of multiple layers (6 in the original design, 12 or more in GPT adaptations) without degradation in performance. This design supports autoregressive generation in GPT by conditioning each output on prior ones through causal masking in self-attention.
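A compact way to see how causal masking specializes this mechanism to the decoder-only setting is to implement scaled dot-product attention directly. The sketch below is illustrative PyTorch, single-head and unbatched for clarity, not a production implementation:

```python
import math
import torch

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask, so position t
    attends only to positions <= t (the decoder-only, GPT-style form)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (t, t) relevance scores
    t = scores.size(-1)
    future = torch.triu(torch.ones(t, t), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))  # hide future positions
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V

# Example: 5 tokens with d_k = d_v = 64.
Q, K, V = (torch.randn(5, 64) for _ in range(3))
out = causal_attention(Q, K, V)   # shape (5, 64)
```

Dropping the mask recovers the bidirectional attention used in encoder blocks.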

Pre-training and Fine-tuning Mechanisms

The pre-training phase of generative pre-trained transformers (GPTs) involves unsupervised learning on vast text corpora to develop broad language understanding capabilities. These corpora typically include diverse sources like web crawls and books, enabling the model to capture patterns in natural language without task-specific labels. The core objective is causal language modeling, where the model predicts the next token in a sequence given all preceding tokens, minimizing the loss defined as L = -\sum_t \log P(x_t \mid x_{<t}), with x_t denoting the target token and x_{<t} the prior context. This autoregressive approach leverages the transformer's self-attention mechanism to model long-range dependencies, fostering emergent abilities in text generation and comprehension.

Data preprocessing is crucial for handling internet-scale, heterogeneous inputs. Tokenization often employs Byte-Pair Encoding (BPE), which merges frequent character pairs into subword units to manage vocabulary size while preserving rare words and handling out-of-vocabulary terms efficiently. Context windows, typically spanning hundreds to thousands of tokens, limit the sequence length processed per input to balance computational feasibility with the need to capture extended contexts from diverse data sources like multilingual text and code. Preprocessing also involves deduplication, filtering for quality, and normalization to mitigate biases and noise inherent in large-scale scraping.

Fine-tuning adapts the pre-trained model for downstream tasks through supervised learning on curated, labeled datasets, aligning outputs more closely with specific objectives like classification or instruction-following. In advanced implementations, this extends to reinforcement learning from human feedback (RLHF), where a reward model trained on human preferences guides policy optimization via Proximal Policy Optimization (PPO), enhancing helpfulness, truthfulness, and harmlessness. This two-stage process (supervised fine-tuning followed by RLHF) refines the model's generative behavior while preserving its foundational knowledge.

Scaling laws provide empirical guidance for optimizing training efficiency, revealing power-law relationships between model performance (measured by loss or perplexity), parameter count, dataset size, and compute budget. Early findings indicated that loss decreases predictably with increased model size and data, but subsequent analysis refined this to the Chinchilla hypothesis, advocating an equal allocation of compute to parameters and training tokens (approximately 20 tokens per parameter) for compute-optimal performance, challenging prior emphases on extreme parameter scaling. These relations underscore the importance of balanced resource investment to achieve state-of-the-art generative proficiency.
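As a concrete reading of the causal language-modeling objective, the sketch below computes L = -\sum_t \log P(x_t \mid x_{<t}) from a model's logits by shifting targets one position, using PyTorch's cross-entropy. The tensor shapes are assumptions for illustration, not any specific model's interface:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, token_ids):
    """Next-token prediction loss: the logits at position t are scored
    against the token at position t+1, so the final position has no target."""
    # logits: (batch, seq_len, vocab_size); token_ids: (batch, seq_len), dtype long
    pred = logits[:, :-1, :]       # predictions for positions 1..T-1
    target = token_ids[:, 1:]      # the tokens those positions must predict
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),
        target.reshape(-1),
    )  # mean of -log P(x_t | x_{<t}) over all predicted positions
```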

Model Development and Evolution

Early Developments

The Generative Pre-trained Transformer (GPT) was first introduced by researchers at OpenAI in 2018 as a semi-supervised approach to natural language processing (NLP). The inaugural model, known as GPT-1, featured approximately 117 million parameters and utilized a 12-layer decoder-only Transformer architecture with 768-dimensional hidden states and 12 attention heads. It was pre-trained on the BookCorpus dataset, comprising over 7,000 unpublished books totaling around 800 million words, using causal language modeling to predict the next token in sequences up to 512 tokens long. This pre-training phase achieved a token-level perplexity of 18.4, enabling the model to learn general language representations without task-specific supervision. The foundational work was detailed in the paper "Improving Language Understanding by Generative Pre-Training" by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.

The primary motivation was to overcome the data efficiency challenges in NLP, where traditional supervised methods rely on scarce labeled datasets, leading to models that generalize poorly across tasks. In contrast, the GPT framework employed generative pre-training on vast unlabeled text to build robust representations, followed by supervised fine-tuning on smaller labeled datasets for specific downstream tasks, requiring minimal architectural modifications. This semi-supervised strategy drew inspiration from earlier pre-training techniques but adapted them to the Transformer architecture for scalable learning.

Evaluations demonstrated the efficacy of this approach through zero-shot and fine-tuned settings on the GLUE benchmark, a suite of diverse tasks including natural language inference, sentiment analysis, and textual similarity. GPT-1 achieved an overall GLUE score of 72.8, surpassing the prior state-of-the-art of 68.9 and setting new records on 7 out of 9 tasks. Notably, the fine-tuned model showed absolute gains such as 8.9% on the Stories Cloze Test for commonsense reasoning and 5.7% on the RACE dataset for question answering.

Initial reception positioned GPT-1 as a proof-of-concept for generative pre-training, highlighting its ability to outperform purely discriminative, task-specific models in limited-data scenarios. The work's influence is evidenced by over 17,000 citations, underscoring its role in shifting paradigms toward large-scale pre-training.

GPT Series Milestones

The GPT series marked significant advancements in scaling large language models, beginning with GPT-2 in 2019, which featured 1.5 billion parameters and was trained on the WebText dataset, comprising 40 gigabytes of text scraped from outbound Reddit links and filtered for quality. This model demonstrated strong unsupervised multitask learning capabilities across various language tasks, achieving state-of-the-art results on seven out of eight evaluated language modeling datasets without task-specific fine-tuning. Due to concerns over potential misuse, such as generating deceptive or harmful content, OpenAI initially withheld the full model release and instead conducted staged rollouts, including safety demonstrations and partnerships for responsible deployment research. The complete 1.5 billion parameter version, along with code and model weights, was eventually released in November 2019 to foster broader research.

Building on this foundation, GPT-3 was introduced in 2020 with 175 billion parameters, representing a substantial scale-up that enabled emergent abilities like in-context learning, where the model could perform tasks effectively using only a few examples provided in the prompt without any gradient updates. This paradigm allowed GPT-3 to generalize across diverse tasks, including translation, question answering, and cloze completion, often approaching or surpassing fine-tuned smaller models. OpenAI launched the GPT-3 API shortly after, enabling developer access, and later released variants like InstructGPT, a refined version optimized for instruction-following and creative tasks.

Subsequent iterations, GPT-3.5 in late 2022 and GPT-4 in 2023, incorporated reinforcement learning from human feedback (RLHF) to better align outputs with user intentions and reduce harmful responses, building on techniques first detailed in the InstructGPT framework. GPT-3.5 powered the initial ChatGPT interface, emphasizing conversational coherence and safety through RLHF fine-tuning on human preferences. GPT-4 extended this with multimodal capabilities, processing both text and images as inputs while generating text outputs, which improved performance on vision-language tasks like visual question answering. This model achieved human-level results on professional benchmarks such as the bar exam, underscoring the impact of scaling combined with alignment methods.

From 2024 onward, the series continued to evolve toward greater multimodality and reasoning depth. GPT-4o, released in May 2024, integrated real-time audio processing alongside text and vision, enabling responsive interactions with low latency and multilingual support, while maintaining cost efficiency at half the price of prior models for similar capabilities. Later that year, the o1-preview model introduced advanced internal reasoning chains, simulating step-by-step thought processes to tackle complex problems in science, mathematics, and coding more reliably than previous GPT variants. In August 2025, OpenAI released GPT-5, its most advanced model to date, combining enhanced reasoning capabilities with non-reasoning functionality under a unified system, enabling expert-level performance across diverse tasks while prioritizing speed and accessibility. This was followed by GPT-5.1 in November 2025, which further improved conversational fluency, instruction-following, and customization options, building on GPT-5's foundations for more adaptive and user-aligned interactions.
Across these developments, parameter counts remained undisclosed for proprietary reasons beyond GPT-3, but training compute trends escalated dramatically, with GPT-5 estimated to have used approximately 5 \times 10^{25} floating-point operations (FLOP) and frontier models by late 2025 exceeding previous scales, reflecting exponential growth in the computational resources used to drive capability improvements.

Variants and Adaptations

Foundation Models

Foundation models are large-scale machine learning models trained on broad, diverse datasets that can be adapted to a wide range of downstream tasks with minimal additional training, such as through prompting or fine-tuning. This paradigm emphasizes general-purpose capabilities derived from massive pre-training on internet-scale data, enabling versatile applications across domains like language, vision, and robotics without the need to build task-specific architectures from scratch. The Generative Pre-trained Transformer (GPT) series exemplifies foundation models through its emphasis on zero-shot and few-shot learning, where the model performs tasks based solely on prompts without prior task-specific training. GPT models, particularly GPT-3 with its 175 billion parameters, demonstrate emergent abilities (previously unobserved capabilities that arise unpredictably as model scale increases, such as multilingual translation and arithmetic reasoning) solely from larger training data and parameter counts. These behaviors highlight how GPT's architecture, built on the transformer framework, leverages scale to unlock broad generalization, allowing it to handle over 100 tasks via simple textual interfaces.

A key example of GPT's impact as a foundation model is its deployment through OpenAI's API, which has fostered an ecosystem of over 300 applications (as of 2021) integrating GPT-3 for features like search, conversation, and text completion, driving innovation and economic value in AI services. In comparison to other foundation models like Google's PaLM (540 billion parameters, focused on efficient scaling via the Pathways system) or Meta's LLaMA series (7 to 65 billion parameters, with open-weight releases for research accessibility), GPT stands out for its proprietary development and closed-source training details, limiting reproducibility but enabling controlled commercial scaling. Recent advancements include GPT-4o (2024), which extends capabilities to multimodal inputs like text, images, and audio for more integrated applications.

Task-Specific and Domain-Specific Models

Task-specific fine-tuning adapts pre-trained GPT models to particular tasks such as classification, summarization, or question-answering by incorporating instruction tuning, where models learn to follow user directives through supervised fine-tuning on task-oriented datasets. A prominent example is InstructGPT, released by OpenAI in 2022, which refines GPT-3 using reinforcement learning from human feedback (RLHF) to enhance instruction-following capabilities across diverse tasks, achieving outputs preferred by human evaluators over those of the base models while maintaining coherence.

Domain-specific models extend GPT architectures by pre-training or fine-tuning on specialized corpora to excel in niche areas, such as biomedical text or programming code. BioGPT, developed by Microsoft Research in 2022, is a generative Transformer language model pre-trained on over 15 million PubMed articles and abstracts, enabling tasks like named entity recognition and relation extraction with state-of-the-art results on datasets including BC5CDR (F1 score of 0.902). Similarly, CodeGPT, a GPT-2-based model from Microsoft, is pre-trained on code repositories from GitHub, facilitating code generation and understanding in languages like Python and Java.

Parameter-efficient techniques like Low-Rank Adaptation (LoRA) enable domain or task specialization without retraining the entire model, by injecting trainable low-rank matrices into transformer layers to update only a small fraction of parameters (often on the order of 0.1%) while preserving the base model's knowledge; a minimal sketch appears at the end of this subsection. These methods are evaluated on domain benchmarks, such as PubMedQA for biomedical question answering, where BioGPT achieves 78.2% accuracy, outperforming general-purpose baselines by leveraging domain-specific priors. Such adaptations yield trade-offs, including heightened precision in targeted domains at the expense of broader generality, as specialized models may underperform on out-of-domain tasks due to overfitting on niche data.

Open-source variants like GPT-J from EleutherAI, a 6-billion-parameter model released in 2021, support custom domain adaptation through accessible weights, allowing researchers to apply LoRA-style methods for tailored applications without proprietary barriers. More recent examples include adaptations of GPT models for advanced code generation, such as those integrated into tools like GitHub Copilot, which build on GPT architectures for real-time developer assistance as of 2025.
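As a rough illustration of the low-rank update LoRA applies, the sketch below wraps a frozen linear layer with trainable rank-r factors. This is a simplified reading of the technique, not the reference implementation; the values of r and alpha are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update (B @ A):
    only A and B receive gradients, so few parameters are tuned."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: adapt a 768-dimensional projection with rank 8.
layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(4, 768))                      # shape (4, 768)
```

Because B is initialized to zero, the adapted layer initially behaves exactly like the frozen base layer, and fine-tuning only has to learn the low-rank correction.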

Applications and Implications

Generative Capabilities

Generative pre-trained transformer (GPT) models generate text autoregressively by predicting the next token in a sequence based on preceding context. This process enables open-ended text production, where the model samples from a probability distribution over the vocabulary to continue the sequence iteratively.

To control the output's diversity and coherence, GPT models employ various sampling strategies during inference. Top-k sampling restricts selection to the k most probable tokens, promoting focused and coherent generations while limiting exposure to low-probability outliers; values such as k = 40 are commonly used, with larger cutoffs like k = 640 also studied. Nucleus (top-p) sampling, introduced by Holtzman et al. in 2019, dynamically truncates the distribution to the smallest set of tokens whose cumulative probability exceeds a threshold p (e.g., p = 0.95), adapting better to varying uncertainty levels and reducing incoherent outputs compared to fixed top-k. Temperature scaling divides the logits by a temperature t before the softmax: lower values (e.g., t = 0.7) enhance coherence by sharpening the distribution, while higher values increase diversity at the risk of incoherence or irrelevance. These methods, often combined, allow tuning for tasks requiring creativity versus precision.

GPT models demonstrate strong generative strengths in creative writing, dialogue, and code generation. In creative writing, GPT-4 produces coherent storytelling and poetry that rivals human output, such as generating satirical narratives from visual prompts or detailed fictional scenarios. For dialogue, it maintains context across multi-turn interactions, inferring user intent with high steerability and achieving a 70.2% human preference rate over its predecessor in conversational tasks. In code generation, GPT-4 solves 67% of Python programming problems on the HumanEval benchmark, producing functional code and identifying common security vulnerabilities.

Evaluation of GPT generation focuses on fluency, coherence, and quality. Perplexity measures fluency by quantifying prediction uncertainty, with GPT-3 achieving 20.50 on the Penn Treebank dataset, outperforming the prior state-of-the-art by 15 points. BLEU and ROUGE assess coherence against reference texts, as seen in GPT-3's 39.5 BLEU score for Romanian-to-English translation. Human judgments provide holistic quality assessments, with evaluators preferring GPT-4's outputs in roughly 70% of blind comparisons for creative and dialogic tasks.

Despite these capabilities, GPT generation suffers from repetition and hallucinations. Repetition arises from over-reliance on high-probability patterns, leading to redundant phrases in longer outputs. Hallucinations involve fabricating plausible but false information, such as incorrect historical facts, due to training data gaps or overconfidence in parametric knowledge. Mitigations include beam search, which explores multiple candidate sequences to favor diverse, high-scoring paths and reduce repetition, as applied in translation tasks with a beam width of 4. Constrained decoding techniques, like factual-nucleus sampling, further address hallucinations by dynamically adjusting probabilities to prioritize verifiable content during generation.
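The decoding controls described above compose naturally in a single sampling step. The sketch below is illustrative PyTorch; the default values echo those mentioned in the text but are not any model's official settings:

```python
import torch

def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.95):
    """One decoding step: temperature -> top-k -> nucleus (top-p) -> sample.
    `logits` is a 1-D tensor of raw scores over the vocabulary."""
    logits = logits / temperature              # sharpen (<1) or flatten (>1)
    vals, idx = torch.topk(logits, top_k)      # the k best tokens, sorted descending
    probs = torch.softmax(vals, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    keep = cumulative - probs < top_p          # smallest prefix covering mass p
    probs = probs * keep
    probs = probs / probs.sum()                # renormalize the survivors
    choice = torch.multinomial(probs, num_samples=1)
    return idx[choice].item()                  # vocabulary id of the sampled token

next_id = sample_next_token(torch.randn(50257))   # e.g., a GPT-2-sized vocabulary
```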

Broader Impacts and Challenges

The deployment of generative pre-trained transformer (GPT) models has sparked significant economic transformations, particularly in creative and knowledge-based sectors. In writing and creative fields, these models raise concerns about job displacement, as they automate tasks such as content generation and editing, potentially reducing demand for entry-level roles in journalism, copywriting, and marketing. Conversely, AI coding assistants like GitHub Copilot, powered by GPT architectures, have boosted developer productivity by accelerating code completion and debugging, with studies showing up to 55% faster task completion in controlled experiments without compromising quality. Overall, while some analyses indicate modest net labor market effects, with no significant changes in earnings or hours worked as of 2025, the potential for widespread task automation could reshape up to 80% of U.S. jobs by enabling 10% or more of their tasks to be performed twice as quickly.

Technical challenges in GPT deployment center on the immense computational demands, which impose high economic and environmental costs. Training large-scale models like GPT-4 is estimated to cost around $100 million, driven by the need for thousands of high-end GPUs and electricity consumption exceeding 50 GWh, roughly 40 times that of GPT-3. These requirements contribute to a substantial environmental footprint, including carbon emissions from data centers equivalent to the annual output of thousands of households, as well as resource depletion from server production and cooling water usage.

To address safety risks, organizations like OpenAI implement layered safeguards, including external red-teaming to probe for biases, misinformation, and harmful outputs. Red-teaming exercises have identified vulnerabilities such as racial and gender biases in responses, as well as susceptibility to jailbreaking prompts that bypass content filters to elicit unsafe outputs like explicit or biased content. Despite these mitigations, ongoing issues persist, with joint evaluations between OpenAI and Anthropic revealing persistent misalignment risks in advanced models as of 2025.

Looking toward future directions in 2025, efforts focus on efficient inference techniques like quantization, which reduces model precision to lower memory and energy use during deployment, achieving up to 75% reductions in computational overhead while maintaining accuracy for edge applications (a toy sketch follows below). Additionally, trends emphasize hybrid human-AI systems, integrating models with human oversight for real-time decision support, enhancing trustworthiness and adaptability in domains like healthcare and creative workflows.
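As a toy illustration of the precision reduction quantization performs, the sketch below maps float32 weights to int8 with a single per-tensor scale. Real deployment pipelines are considerably more sophisticated (per-channel scales, calibration, activation quantization); the function names here are hypothetical:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: store int8 codes plus one
    float scale, cutting weight memory roughly 4x versus float32."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale            # approximate original weights

w = torch.randn(768, 768)
q, s = quantize_int8(w)
max_err = (dequantize_int8(q, s) - w).abs().max() # small reconstruction error
```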

Terminology and Reception

Brand and Naming Issues

OpenAI has established "GPT" as a branded term for its series of generative pre-trained transformer models, with successful trademark registrations for specific iterations such as GPT-3 in 2021 and ongoing applications for later variants such as GPT-5. However, the broader term "GPT," standing for generative pre-trained transformer, has faced challenges in securing standalone protection in key jurisdictions. In 2024, the United States Patent and Trademark Office (USPTO) denied OpenAI's application to register "GPT" as a trademark, ruling that the term is "merely descriptive" of a category of AI models and has become generic through widespread use, preventing exclusive ownership. Initially, the European Union Intellectual Property Office (EUIPO) approved a trademark for "GPT" in 2024, but on October 23, 2025, the EUIPO invalidated OpenAI's trademarks for "GPT" and several model-specific marks up to "GPT-5," deeming "GPT" too generic for AI-related goods and citing lack of distinctiveness. This outcome aligns the EUIPO's assessment more closely with the USPTO's, underscoring growing recognition of genericization risks across regions.

The explosive popularity of OpenAI's GPT series, beginning with GPT-3 in 2020, has accelerated the genericization of the term, where "GPT" is increasingly used in media, academia, and industry to refer not just to OpenAI's models but to any similar autoregressive language models. This blurring has led to controversies over nomenclature, including debates about misuse that conflates "GPT" with the general transformer architecture or unrelated systems, fostering confusion among non-experts. While no major lawsuits directly challenging OpenAI's trademarks emerged from competitors between 2023 and 2025, the USPTO's 2024 denial emphasized "GPT" as a generic descriptor for generative technologies, intensifying discussions on protecting brand names amid rapid sector growth.

These branding issues have significant implications for AI literature and discourse, where imprecise use of "GPT" can obscure distinctions between OpenAI's proprietary models and open-source alternatives, prompting a shift toward more technical terminology like "decoder-only large language models" (LLMs) to denote the underlying architecture without brand connotations. This evolution mirrors historical cases of genericide, such as "Kleenex" becoming synonymous with facial tissues or "Xerox" for photocopying, where dominant brands risk losing exclusivity as their names enter common parlance as verbs or generics, evident in phrases like "GPT me a summary." OpenAI's brand guidelines continue to assert "GPT" as its trademark, but the trend toward generic adoption underscores the challenges of maintaining distinctiveness in a fast-evolving field.

Criticisms and Limitations

Generative pre-trained transformers (GPTs) have faced significant criticism for amplifying biases present in their training data, leading to outputs that perpetuate gender, racial, and other stereotypes. For instance, studies have shown that models generate text reinforcing gender stereotypes, such as associating certain professions disproportionately with men or women, due to skewed representations in web-scraped corpora. Similarly, racial biases manifest in responses that stereotype ethnic groups, as evidenced by evaluations of GPT-4 in which the model produced vignettes embedding harmful assumptions about race and gender in clinical scenarios. The seminal critique by Bender et al. (2021) highlights how large language models like GPTs act as "stochastic parrots," mindlessly regurgitating patterns from biased data without comprehension, thereby exacerbating societal inequalities. A 2024 UNESCO study further confirmed these tendencies in large language models, revealing regressive gender stereotypes and racial biases in generated content across multiple LLMs, including GPT variants.

Reliability issues in GPT models stem from frequent factual inaccuracies, often termed "hallucinations," where the system confidently outputs incorrect information. For example, general-purpose chatbots such as ChatGPT have been found to hallucinate in 58 to 82% of legal queries, fabricating details that mimic authoritative responses but lack veracity. These errors arise because GPTs rely on statistical pattern matching over training data rather than genuine understanding, failing to perform novel reasoning outside learned distributions. Research from Stanford HAI demonstrates that even advanced models like GPT-4 exhibit inconsistent reasoning, struggling with tasks requiring true logical inference beyond superficial mimicry. A 2024 Nature study on hallucinations classifies them as inherent distortions in LLMs, noting that GPTs cannot fully eliminate them without compromising fluency, as they prioritize probabilistic generation over factual grounding.

Theoretical critiques of GPTs center on an over-reliance on compute and data volume, which critics argue prioritizes scale over architectural or methodological innovation. The scaling hypothesis, positing that larger models inherently improve capabilities, has shown diminishing returns, as evidenced by a 2025 PNAS study in which increasing model size yielded sharply reduced gains in persuasiveness beyond certain thresholds. Debates persist on whether GPTs represent progress toward artificial general intelligence (AGI) or mere sophisticated mimicry; a 2023 paper contends that current transformer architectures enable only algorithmic imitation, unlikely to achieve true AGI without fundamental shifts. Apple researchers in 2024 critiqued claims of emergent reasoning in models like GPT-4o, showing they fail to use explicit algorithms and exhibit inconsistent puzzle-solving, suggesting pattern-based simulation rather than authentic reasoning. A 2024 article surveys these debates, noting that while GPT-4 sparked "AGI" discussions, experts emphasize its limitations in generalizable intelligence.

In response to these criticisms, the AI community has advanced alignment research to enhance GPT safety and equity, including OpenAI's 2025 collective alignment initiatives that incorporate public input to refine model behavior specifications. Efforts like Anthropic's 2025 joint evaluations with OpenAI focus on detecting hidden misalignments, such as scheming behaviors, through targeted testing and training adjustments. To address biases, researchers have developed diverse datasets and mitigation techniques; for example, a 2025 study on hybrid human-LLM crowdsourcing demonstrates that biases can be reduced to negligible levels by curating representative training data. Direct preference optimization methods, as outlined in a 2024 ACL paper, further align GPT-like models by fine-tuning on debiased preferences, improving fairness in outputs without extensive retraining. These ongoing interventions, including rubric-based rewards and verifiable reasoning protocols in 2025 models, aim to balance capabilities with reliability.
