Llama (language model)
Llama is a family of large language models developed by Meta AI, the artificial intelligence research division of Meta Platforms. The first models, released in February 2023, were efficient, research-oriented foundation models with up to 65 billion parameters, trained to achieve high performance on language understanding and generation benchmarks.[1] Subsequent iterations have progressively enhanced capabilities in multilingual processing, coding, reasoning, and vision-language tasks: Llama 2 in 2023; Llama 3 in April 2024, with 8 billion and 70 billion parameter variants; Llama 3.1 in July 2024, featuring a 405 billion parameter model described as the largest and most capable openly available model at the time; Llama 3.2 in September 2024, adding vision capabilities and lightweight variants for edge devices; and Llama 4 in April 2025, introducing natively multimodal models such as Scout and Maverick with extended context lengths.[2][3][4][5]

The models are distributed as open weights under Meta's custom license, which permits research, fine-tuning, and commercial deployment while imposing restrictions: entities exceeding certain user thresholds (e.g., 700 million monthly active users) require prior approval, attribution is mandatory, and Llama outputs may not be used to train competing models. These terms enable broad accessibility but have fueled debate over whether the models qualify as open-source software under standards such as those of the Open Source Initiative.[6][7] Llama releases have demonstrated competitive or superior performance against proprietary models on reasoning and coding benchmarks, with Llama 3.1 405B rivaling closed systems in evaluations, and the series has achieved more than tenfold usage growth since 2023 through integrations in applications ranging from chatbots to enterprise tools.[8][9] The emphasis on efficiency, including architectural innovations that enable smaller models to punch above their parameter counts, has positioned Llama as a cornerstone for accessible AI development, though its license limitations highlight tensions between corporate control and community-driven innovation in the AI ecosystem.[3][10]

Development History
Inception and Llama 1 (2023)
Meta's Fundamental AI Research (FAIR) lab initiated the LLaMA project to create efficient large language models capable of state-of-the-art performance using fewer parameters and less compute than prevailing models like GPT-3.[1] The effort emphasized foundational models for research, trained exclusively on publicly available data to prioritize accessibility and reproducibility.[1] This approach contrasted with closed proprietary systems, aiming to advance scientific understanding of language model scaling laws and robustness.[1]

On February 24, 2023, Meta publicly announced LLaMA (Large Language Model Meta AI), releasing model weights in four sizes: 7 billion, 13 billion, 33 billion, and 65 billion parameters.[1][11] The models employed a standard autoregressive transformer architecture, predicting subsequent tokens in sequences, and were trained on approximately 1 trillion tokens for the 7B variant and 1.4 trillion tokens for the larger models, drawn from text in the 20 most spoken languages using Latin and Cyrillic scripts.[1] Custom data curation filtered out low-quality content, focusing on high-quality subsets to enhance efficiency.[1]

LLaMA 1 demonstrated competitive or superior results on benchmarks such as MMLU and GSM8K compared to larger models, underscoring the viability of optimized training over sheer scale.[1] However, like its contemporaries, it exhibited limitations including factual inaccuracies, biases inherited from training data, and the potential to generate toxic outputs.[1] Access was restricted under a non-commercial research license, requiring researchers from academia, government, or industry to apply for approval, reflecting Meta's intent to support targeted scientific inquiry rather than broad commercial deployment.[1] This controlled release facilitated rapid community experimentation while mitigating misuse risks.[1]
Leak of Model Weights
In February 2023, Meta released LLaMA, a family of large language models ranging from 7 billion to 65 billion parameters, exclusively to approved academic researchers, civil society organizations, and government entities under a restrictive research license prohibiting commercial use.[12] On March 3, 2023, an anonymous user on 4chan posted a BitTorrent magnet link containing the model's checkpoint weights, making them publicly downloadable without authorization.[13][14] The leak originated from an individual with approved access, as Meta had distributed the weights to approximately 4,000 recipients prior to the incident, though the company had implemented download limits and monitoring to prevent unauthorized sharing.[15]

The unauthorized distribution rapidly proliferated across platforms like GitHub and torrent sites, enabling hobbyists and developers to run LLaMA on consumer hardware using optimized inference tools such as ggml, developed by Georgi Gerganov shortly after the leak.[15] Meta confirmed the breach but did not pursue aggressive legal enforcement against downloaders, citing challenges in tracking widespread dissemination and a strategic pivot toward greater openness in subsequent releases.[16] In response to the event, U.S. Senators including Richard Blumenthal sent a letter to Meta CEO Mark Zuckerberg on June 6, 2023, questioning the company's risk assessment processes, its safeguards against misuse for generating disinformation or harmful content, and its failure to notify authorities promptly.[16]

The leak accelerated community-driven fine-tuning efforts, including Stanford's Alpaca model, which achieved competitive performance with minimal additional training data, and derivatives such as Koala, demonstrating LLaMA's efficiency in resource-constrained environments.[12] While proponents argued it fostered innovation by democratizing access to high-performing models, critics highlighted the risks of deploying unmitigated systems capable of producing biased or unsafe outputs without Meta's intended controls.[15][16] The incident influenced Meta's decision to release Llama 2 under a more permissive license later in 2023, incorporating safety improvements absent from the leaked version.[12]

Llama 2 (2023)
Llama 2 is a collection of large language models developed by Meta, released on July 18, 2023, as a successor to the earlier Llama models.[17][18] The family includes base pretrained models and instruction-tuned variants (Llama 2-Chat) in three sizes: 7 billion, 13 billion, and 70 billion parameters.[19][20] These models were pretrained on approximately 2 trillion tokens of publicly available data, representing a 40% increase in training data volume compared to Llama 1, with a doubled context length of 4,096 tokens.[21][22]

Key architectural enhancements over Llama 1 include the adoption of grouped-query attention in the larger variants, which reduces memory usage and improves inference efficiency by sharing key and value heads across groups of query heads, along with optimizations for faster dialogue performance in the chat variants.[23] The tuning process for Llama 2-Chat combined supervised fine-tuning on high-quality human-annotated demonstrations with over 1 million human preference annotations and rejection sampling to enhance response quality and safety alignment.[21] These changes resulted in superior performance on reasoning, coding, and knowledge benchmarks relative to Llama 1 equivalents, though the 70B model still trailed proprietary models like GPT-3.5 in some evaluations.[24][22]

Meta released the model weights and inference code under the Llama 2 Community License, a custom agreement permitting research and commercial use for entities with fewer than 700 million monthly active users, beyond which a separate license from Meta is required. While this license enables broad access, including commercial applications, it does not meet the Open Source Initiative's Open Source Definition because it discriminates by user scale and does not allow unrestricted redistribution or modification.[25] Critics, including the OSI, have argued that labeling it "open source" misrepresents its restrictive nature, as the training dataset remains proprietary and cannot be independently reproduced.[26][27] Despite these limitations, the release facilitated widespread adoption, with integrations on platforms like Hugging Face and optimizations for hardware from AMD, Intel, Nvidia, and others.[19][28]

Llama 3 and Variants (2024)
Meta released Llama 3 on April 18, 2024, introducing pretrained and instruction-tuned language models in 8 billion (8B) and 70 billion (70B) parameter sizes.[2] These models were designed to support a wide array of applications, including multilingual tasks, coding, and reasoning, with pretraining on approximately 15 trillion tokens.[29] Meta positioned Llama 3 as achieving state-of-the-art performance among openly available large language models at the time, emphasizing improvements in logical reasoning and reduced hallucination rates compared to prior versions.[2]

In July 2024, Meta extended the Llama 3 family with Llama 3.1, released on July 23, featuring models in 8B, 70B, and a flagship 405B parameter variant.[3] The 405B model was described by Meta as the largest and most capable openly released foundation model, rivaling closed-source competitors in general knowledge, mathematics, and tool-use benchmarks, while incorporating a 128,000-token context window and native multilingual support for eight languages.[3] Llama 3.1 introduced built-in tool-calling capabilities, trained on three specific tools for tasks like code execution and web browsing, alongside enhanced safety mitigations to address risks such as jailbreaking and harmful outputs.[30]

Further variants arrived with Llama 3.2 on September 25, 2024, focusing on efficiency for edge devices and multimodal inputs.[4] This release included lightweight text-only models at 1B and 3B parameters, optimized for mobile deployment with reduced latency, and vision-enabled models at 11B and 90B parameters capable of processing image inputs alongside text for tasks such as visual question answering.[4] All Llama 3 variants were distributed under a community license permitting commercial use, with a separate license required for services exceeding 700 million monthly active users and additional safeguards mandated for high-risk applications.[31]

Llama 4 (2025)
Llama 4 represents Meta's initial release of natively multimodal large language models on April 5, 2025, marking a shift toward integrated text and vision processing via early fusion techniques that unify input tokens in a shared embedding space.[5] The family comprises two open-weight mixture-of-experts (MoE) models: Llama 4 Scout, with 109 billion total parameters (17 billion active across 16 experts), and Llama 4 Maverick, with roughly 400 billion total parameters (17 billion active across 128 experts).[5][32] Scout supports a context window of up to 10 million tokens, enabling extended reasoning over very long inputs, while Maverick's window is smaller (about 1 million tokens); both are optimized for efficient deployment on standard hardware.[33] Meta positioned these models as foundational for AI agents capable of advanced reasoning and action, with Scout emphasizing speed and Maverick targeting superior multimodal performance.[34]

Performance evaluations published by Meta indicate Llama 4 Maverick outperforms models like OpenAI's GPT-4o and Google's Gemini 2.0 Flash on multimodal benchmarks, particularly vision-language tasks, owing to its sparse MoE architecture that activates only a subset of parameters per inference.[5] Independent assessments confirm efficiency gains, with lower inference costs than dense counterparts of similar capability, though real-world scaling depends on expert routing quality.[35] Meta's announcements, including those from CEO Mark Zuckerberg, highlighted the models' role in advancing open-source AI, with over 650 million prior Llama downloads underscoring ecosystem adoption.[33][36]

Subsequent developments include delays to larger variants, such as the anticipated Llama 4 Behemoth, originally slated for summer but postponed to fall 2025 or beyond amid training challenges.[37] A planned Llama 4.X iteration targets a year-end 2025 release, focusing on enhanced reasoning capabilities.[38] These models retain Meta's permissive licensing for research and commercial use, excluding military applications, while incorporating safeguards like Llama Guard 4 for content moderation.[39] As of October 2025, Llama 4's open weights have facilitated rapid community fine-tuning, though Meta's internal benchmarks, which may be optimistic, require external validation for claims of frontier-level parity.[40]
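The efficiency argument for sparse MoE layers can be made concrete with a short sketch. The following PyTorch snippet is illustrative only: the router, expert sizes, and top-k choice are simplified assumptions for exposition and do not reproduce Meta's actual Llama 4 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-k routing.

    Each token is sent to only k of the n_experts expert networks, so the
    parameters touched per token are a small fraction of the layer's total,
    which is the sense in which Llama 4 Scout has 109B total but only 17B
    active parameters. Dimensions and routing details here are simplified.
    """

    def __init__(self, d_model=1024, d_ff=4096, n_experts=16, k=1):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                           # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # routing decision per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique().tolist():  # run each chosen expert once
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 1024)        # 8 token embeddings
print(TopKMoE()(tokens).shape)       # torch.Size([8, 1024])
```

Because only the selected experts run for a given token, inference cost tracks the active parameter count rather than the total, which is the basis for Meta's claim that these models can be deployed efficiently on standard hardware.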
Technical Specifications
Core Architecture
The Llama family of language models employs a decoder-only transformer architecture optimized for autoregressive generation, processing input sequences to predict subsequent tokens without an encoder component. This design facilitates efficient next-token prediction by stacking multiple identical transformer decoder layers, each comprising a self-attention mechanism followed by a feed-forward network. Linear projections within these layers omit bias terms to reduce parameters and enhance training stability, while embeddings for input tokens and output logits are in some variants tied to share learned representations.[41] Layer normalization uses RMSNorm applied before both the attention and feed-forward sublayers, promoting stable gradients during training compared to alternatives like LayerNorm. Feed-forward networks adopt the SwiGLU activation, which splits the up-projection into two linear branches and passes one through the Swish (SiLU) function to gate the other elementwise, improving expressivity over standard ReLU or GELU while maintaining computational efficiency. Positional information is encoded via Rotary Positional Embeddings (RoPE), which rotate query and key vectors in the attention mechanism to inject relative positional dependencies without absolute encodings, enabling extrapolation to longer contexts beyond training lengths.
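To make the rotation concrete, the following is a minimal sketch of RoPE in PyTorch, assuming a (sequence, heads, head dimension) tensor layout; the shapes and the default base frequency are illustrative rather than taken from Meta's reference code.

```python
import torch

def apply_rope(x, theta=500_000.0):
    # x: (seq_len, n_heads, head_dim), head_dim even.
    # theta is the base frequency: 10,000 in early Llama versions, 500,000 in Llama 3.
    seq_len, _, head_dim = x.shape
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)              # (seq_len, head_dim / 2)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]              # split channels into pairs
    # rotate each (x1, x2) pair by its position- and channel-dependent angle
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                       # back to (seq_len, n_heads, head_dim)

q = torch.randn(16, 8, 64)                           # 16 positions, 8 heads, head_dim 64
print(apply_rope(q).shape)                           # torch.Size([16, 8, 64])
```

Because the rotation angle depends only on position and channel index, the dot product between a rotated query and key is a function of their relative offset, which is what allows contexts longer than those seen in training to be handled gracefully.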
RoPE configurations vary by version: Llama 3 raises the base frequency θ to 500,000, supporting context lengths of up to 128,000 tokens in Llama 3.1.[41][29] Self-attention also evolves across versions for scalability: Llama 1 and the smaller Llama 2 models (7B and 13B) use multi-head attention, whereas the larger Llama 2 variants (34B and 70B) introduce grouped-query attention (GQA), partitioning query heads into groups that share a smaller set of key-value heads to balance expressivity and inference speed by shrinking the key-value cache. Llama 3 standardizes GQA across all sizes, including 8B, further optimizing for deployment on resource-constrained hardware. Llama 4 interleaves attention layers that omit positional embeddings entirely, relying on alternative mechanisms for sequence ordering to accommodate native multimodality and extended contexts. These choices prioritize inference efficiency and long-context handling, with dense model sizes scaling from 7 billion to 405 billion parameters across pretrained variants.[41][29][5]
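A compact way to see the key-value sharing in GQA is the sketch below, which expands a small set of KV heads so that groups of query heads attend to them. It assumes PyTorch's built-in scaled_dot_product_attention, and the head counts are loosely modeled on the 8B configuration (32 query heads sharing 8 KV heads) purely for illustration.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """Grouped-query attention: many query heads share a smaller set of
    key/value heads, shrinking the KV cache needed during inference.

    q: (batch, n_q_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim), with n_q_heads % n_kv_heads == 0
    """
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    # expand each KV head so every query head in its group sees the same K/V
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 32, 128, 128)   # 32 query heads
k = torch.randn(1, 8, 128, 128)    # 8 shared KV heads (group size 4)
v = torch.randn(1, 8, 128, 128)
print(grouped_query_attention(q, k, v).shape)   # torch.Size([1, 32, 128, 128])
```

Cutting the number of KV heads by a factor of four shrinks the KV cache, which dominates memory at long context lengths, by roughly the same factor while leaving the number of query heads, and hence attention expressivity, unchanged.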
Training Processes
The pre-training phase for Llama models employs a self-supervised next-token prediction objective on massive text corpora, using a decoder-only transformer architecture with optimizations such as grouped-query attention and rotary positional embeddings to improve efficiency and long-context handling. Meta conducts this distributed training on proprietary GPU clusters, leveraging frameworks like PyTorch with custom optimizations for mixed-precision arithmetic (e.g., FP16 or BF16) and sharded data parallelism in the style of ZeRO, with reported model FLOPs utilization of roughly 40 percent in later iterations. Learning rate schedules typically follow a linear warmup followed by cosine decay, with peak rates on the order of 10^-4, scaled inversely with model size.[2][3][42]

For the initial Llama 1 models (2023), pre-training occurred on approximately 1.4 trillion tokens sourced primarily from public web crawls, with compute estimated at around 5 × 10^23 FLOPs for the 65B variant, though exact figures were not publicly detailed by Meta. Llama 2 (2023) scaled to 2 trillion tokens across its 7B, 13B, and 70B variants, using roughly 8.26 × 10^23 FLOPs for the largest model, about 1.5 times the compute of its Llama 1 equivalent, while extending the training context length to 4,096 tokens.[43]

Llama 3 models (2024) expanded pre-training to over 15 trillion tokens for the 8B and 70B parameter sizes, a roughly sevenfold increase over Llama 2, with rigorous data preprocessing pipelines including heuristic quality filtering, deduplication, and classification to prioritize diverse, high-value sources such as academic texts and code. This phase emphasized context lengths of up to 8,192 tokens during training, followed by progressive extension techniques. The 405B Llama 3.1 variant maintained a similar token scale but incorporated 128,000-token context training, with final-stage linear annealing of the learning rate over the last 40 million tokens to stabilize convergence. Compute for Llama 3 70B reached approximately 6.3 × 10^24 FLOPs, derived from the standard scaling estimate of 6 × parameters × tokens.[2][3]

Llama 4 (2025) introduced multimodal pre-training with early fusion of text, image, and video tokens into a unified token stream, trained on over 30 trillion tokens using a mixture-of-experts (MoE) architecture for computational efficiency through sparse activation. A custom MetaP technique automated the setting of critical hyperparameters, such as per-layer learning rates and initialization scales, to mitigate instability in large-scale runs on clusters exceeding 100,000 H100-equivalent GPUs. Post-fusion alignment training integrated separate vision encoders with the LLM backbone, emphasizing causal reasoning across modalities.[5][44]
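The compute figures above follow directly from the 6 × parameters × tokens approximation; the short Python sketch below reproduces that arithmetic and shows a generic warmup-plus-cosine schedule of the kind described, with illustrative constants that are not Meta's actual settings.

```python
import math

def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard dense-transformer estimate: roughly 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

# Reproduce the figures quoted in this section
print(f"Llama 2 70B: {train_flops(70e9, 2e12):.2e} FLOPs")   # ~8.4e23, near the reported ~8.26e23
print(f"Llama 3 70B: {train_flops(70e9, 15e12):.2e} FLOPs")  # ~6.3e24

def lr_at_step(step: int, peak_lr: float = 3e-4, warmup: int = 2000,
               total: int = 250_000, min_ratio: float = 0.1) -> float:
    """Linear warmup, then cosine decay toward a fraction of the peak rate.
    Constants are illustrative; the actual schedules vary by model size."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak_lr * (min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress)))

print(lr_at_step(1_000), lr_at_step(125_000), lr_at_step(250_000))
```

The 6ND rule applies to dense models; for MoE variants such as Llama 4, the per-token cost scales with the active rather than the total parameter count.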
Datasets and Data Curation
The Llama family of models relies on large pretraining datasets drawn exclusively from publicly available sources to support ethical data usage and regulatory compliance, avoiding proprietary Meta user data such as private Facebook or Instagram posts.[2][41] This curation strategy prioritizes high-quality, diverse text to enhance model generalization while mitigating risks such as toxicity or bias amplification from unfiltered web scrapes.[2]

The original LLaMA models (2023) were pretrained on around 1.4 trillion tokens, processed through filtering pipelines that included deduplication and quality scoring to remove low-value content, drawing from web crawls like Common Crawl, academic repositories, and open code sources.[45] Llama 2 (2023) expanded this to 2 trillion tokens of public data, incorporating longer context handling and refined filtering to improve efficiency over its predecessor, though specific composition details remain proprietary to protect against scraping incentives.[21][41] These steps involved heuristic deduplication and relevance filtering, emphasizing scalability without compromising openness.[41]

Llama 3 (2024) and its variants scaled dramatically to over 15 trillion tokens, roughly seven times more than Llama 2, sourced from public internet data with enhanced multilingual coverage (over 5% non-English text spanning more than 30 languages) and four times more code data for technical proficiency.[2] Curation employed advanced pipelines: heuristic and NSFW filters to excise harmful content, semantic deduplication to eliminate redundancies, and classifiers trained on prior Llama outputs to score text quality, with the data mix optimized through experiments across domains such as STEM, coding, trivia, and history.[2] This multi-stage process, informed by scaling laws, aimed to balance volume with precision, reducing noise that could degrade reasoning or factual recall in downstream tasks.[2] Subsequent iterations, including Llama 3.1 (2024), maintained the roughly 15 trillion token scale while refining post-training filters for safety and alignment, underscoring a commitment to iterative quality over sheer quantity.[3]

Overall, Meta's approach contrasts with closed models by forgoing internal user-data hoards; reliance on web-scale public sources can still introduce biases from uncurated crawls, and because the exact dataset composition is not released, the corpus cannot be fully reproduced despite its public sourcing.[2][41]
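The kind of pipeline described here can be illustrated with a toy example; the normalization step, the hash-based exact deduplication, and the two quality heuristics below are invented for exposition and are far simpler than the classifier-based filters Meta describes.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # collapse whitespace and lowercase so near-identical documents hash alike
    return re.sub(r"\s+", " ", text).strip().lower()

def passes_heuristics(text: str, min_words: int = 50, max_symbol_ratio: float = 0.3) -> bool:
    # crude quality gates: enough words, and not dominated by non-alphanumeric symbols
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    return symbols / max(len(text), 1) <= max_symbol_ratio

def dedup_and_filter(docs: list[str]) -> list[str]:
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:          # drop exact duplicates after normalization
            continue
        seen.add(digest)
        if passes_heuristics(doc):  # drop short or symbol-heavy documents
            kept.append(doc)
    return kept
```

Production pipelines replace the exact-hash step with fuzzy or semantic deduplication (for example, MinHash or embedding similarity) and the hand-written heuristics with learned quality classifiers, but the staged filter-then-keep structure is the same.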
Fine-Tuning Methods
Supervised fine-tuning (SFT) forms the initial stage of adapting base Llama models for instruction following, training on datasets of prompt-response pairs drawn from public sources, synthetic data, and human annotations. For Llama 2, this included filtering and ranking outputs from the base model with a quality model, followed by training on approximately 27,000 high-quality, human-annotated conversations, supplemented by rejection sampling to promote output diversity.[46] Llama 3 extended this with over 10 million human preference annotations, combining SFT with rejection sampling and preference-based optimization during post-training.[2][47]

Reinforcement learning from human feedback (RLHF) refines SFT outputs by aligning them with human preferences, using a reward model trained on ranked response pairs to optimize the policy via proximal policy optimization (PPO). In Llama 2, the reward model drew on 1 million human preference labels across categories such as helpfulness and safety, enabling iterative improvements in coherence and reduced hallucinations.[46] Llama 3's RLHF incorporated iterative human feedback loops and direct preference optimization (DPO) variants to enhance robustness, with safety-specific RLHF targeting jailbreak vulnerabilities through adversarial red-teaming datasets exceeding 1 million examples.[2][47]

Additional alignment techniques include rejection sampling fine-tuning, in which multiple completions are generated per prompt, ranked by a reward model, and only the highest-scoring retained to expand the SFT dataset. Applied in Llama 2, this method yielded gains on helpfulness and harmlessness evaluations, with the 70B chat variant outperforming comparably sized closed models.[46] For safety, Meta pairs fine-tuning with system-level mitigations such as content filters, though critics note that RLHF's reliance on crowd-sourced preferences can embed subjective biases from annotator demographics, potentially limiting generalizability.[46]

Community adaptations of Llama models frequently employ parameter-efficient fine-tuning (PEFT) methods such as low-rank adaptation (LoRA), which updates only a small set of added low-rank matrices, reducing memory needs by up to 90% compared to full fine-tuning. Tools such as Hugging Face's PEFT library facilitate LoRA on Llama variants for domain-specific tasks, and quantized variants (QLoRA) enable single-GPU training on consumer hardware. While effective in resource-constrained settings, PEFT may underperform full fine-tuning on complex alignment objectives, with benchmarks showing drops in instruction adherence when the adapter rank is too low.
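As a concrete illustration of the PEFT workflow mentioned above, the sketch below attaches LoRA adapters to a Llama checkpoint using Hugging Face's transformers and peft libraries; the checkpoint name, target modules, and hyperparameters are illustrative choices, and downloading the weights requires accepting Meta's license on Hugging Face.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Meta-Llama-3-8B"   # gated repository; access requires accepting Meta's license

tokenizer = AutoTokenizer.from_pretrained(base_id)
# device_map="auto" needs the accelerate package installed
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")

# Low-rank adapters on the attention projections; only these small matrices are
# trained, which is what keeps memory needs far below full fine-tuning.
lora_cfg = LoraConfig(
    r=16,                        # adapter rank; too low a rank can hurt instruction adherence
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()       # typically well under 1% of total parameters
```

From here a standard trainer (for example, transformers' Trainer or TRL's SFTTrainer) can be run on an instruction dataset; only the adapter weights are saved, which amount to a small fraction of the full model's size.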
Performance Evaluation
Benchmark Results Across Versions
Successive iterations of the Llama family have demonstrated progressive gains in benchmark performance, particularly on evaluations of general knowledge, reasoning, coding, and mathematics, driven by increases in model parameters, refined training objectives, and expanded datasets. Early versions like Llama 2 established competitive baselines against closed-source models of similar scale, while later releases such as Llama 3.1 and Llama 4 approached or exceeded frontier capabilities in select domains. These results are primarily self-reported by Meta in accompanying technical reports and blog announcements, with independent verification possible via the open weights on platforms like Hugging Face, though benchmark saturation and potential overfitting to evaluation sets remain concerns in the field.[3][5]

Key quantitative improvements are evident on standard academic benchmarks. For instance, on the Massive Multitask Language Understanding (MMLU) test, which assesses zero- or few-shot performance across 57 subjects, Llama 2's 70B parameter model achieved 68.9% accuracy, trailing GPT-3.5 Turbo's 70.0% but surpassing it in efficiency per parameter.[48] Llama 3's 70B variant advanced to 82.0% on MMLU (5-shot), outperforming Mistral 7B and approaching GPT-4 levels at lower inference cost.[2] The Llama 3.1 405B model further improved to 88.6% on MMLU, rivaling GPT-4o's 88.7% while leading in multilingual subsets.[3]

| Model Version | Parameters | MMLU (%) | HumanEval (%) | GSM8K (%) |
|---|---|---|---|---|
| Llama 2 | 70B | 68.9 | 29.9 | 56.8 |
| Llama 3 | 70B | 82.0 | 62.3 | 79.6 |
| Llama 3.1 | 405B | 88.6 | 89.0 | 96.8 |
Comparative Analysis with Competitors
Llama models, particularly the Llama 3.1 405B variant released in July 2024, demonstrated competitive performance against closed-source counterparts on standardized benchmarks such as MMLU (87.3% accuracy, surpassing GPT-4 Turbo's 86.5% and Claude 3 Opus's 86.8%) and GPQA (diamond subset, 51.1% vs. GPT-4o's 48.0%).[50] However, Claude 3.5 Sonnet, launched in June 2024, frequently outperformed Llama 3.1 across coding tasks (e.g., HumanEval: 92% vs. Llama's 89%) and reasoning benchmarks like GPQA (59.4% vs. Llama's 51.1%), attributed to Anthropic's emphasis on safety-aligned post-training optimizations.[51][52] Gemini 1.5 Pro excelled in long-context retrieval (e.g., 71.9% on vision benchmarks vs. Llama's lower scores), leveraging Google's vast multimodal data, while GPT-4o maintained edges in speed and latency for real-time applications (2x faster inference than its predecessors).[53][54]

| Benchmark | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Pro |
|---|---|---|---|---|
| MMLU | 87.3% | 88.7% | 88.7% | 85.9% |
| HumanEval | 89% | 90.2% | 92% | 84.1% |
| GPQA | 51.1% | 48.0% | 59.4% | 53.9% |