Prompt engineering
Prompt engineering is the process of crafting and refining natural language instructions, known as prompts, to guide large language models (LLMs) and vision-language models (VLMs) toward generating accurate, relevant, and task-specific outputs without modifying the underlying model parameters.[1] The technique leverages the knowledge embedded in pretrained models, enabling users to extend their capabilities across diverse applications such as question answering, commonsense reasoning, and other natural language processing (NLP) tasks.[2] Emerging prominently with the rise of transformer-based LLMs like GPT-3 in the early 2020s, prompt engineering has evolved into a critical discipline in generative AI, optimizing interactions to maximize utility, truthfulness, and efficiency.[3][4]

At its core, prompt engineering involves structuring inputs as "programs" for AI systems, where the quality of the prompt directly influences performance metrics such as accuracy and coherence.[4] Key techniques include zero-shot prompting, where models infer tasks from descriptions alone; few-shot prompting, which provides examples to demonstrate desired behaviors; and chain-of-thought prompting, which encourages step-by-step reasoning to improve complex problem-solving.[1] Advanced methods extend to automatic prompt generation, in which LLMs themselves optimize instructions, often outperforming human-crafted prompts across 24 NLP tasks.[4] These approaches are particularly valuable in resource-constrained settings, as they avoid costly fine-tuning while eliciting structured knowledge from models pretrained on vast datasets.[2]

The field has developed rapidly since 2022: surveys as of 2024 document dozens of distinct prompting methods across more than 40 research papers, and advances continued into 2025 with context engineering and enhanced automatic optimization techniques.[2][5][6] Applications span information extraction, creative generation, and multimodal tasks in VLMs, with systematic surveys emphasizing the need for taxonomies to navigate the growing complexity of techniques.[1] Despite its promise, challenges persist in ensuring prompt robustness across models and domains, motivating ongoing research into automated and meta-prompting strategies.[2]

Fundamentals
Definition and Principles
Prompt engineering is the systematic process of designing, iterating, and refining inputs—typically textual prompts—to guide large language models (LLMs) or multimodal AI systems toward producing desired outputs. The practice involves crafting prompts that leverage the model's pre-trained knowledge without requiring retraining or fine-tuning, making it a cost-effective approach for optimizing performance across diverse tasks. By carefully structuring prompts, engineers can elicit more accurate, relevant, and coherent responses from models that operate as black boxes, whose internal mechanisms are not directly accessible.

The importance of prompt engineering stems from its ability to enhance model efficacy in real-world applications such as natural language generation, classification, question answering, and reasoning. It mitigates common issues like hallucinations—where models generate plausible but incorrect information—by constraining the output space and providing explicit guidance. Well-tested prompts can achieve results comparable to, or better than, supervised fine-tuning while reducing computational demands. In enterprise settings, for instance, prompt engineering enables rapid adaptation of LLMs to domain-specific needs, such as legal document analysis or customer support, without extensive data labeling.

Core principles of prompt engineering emphasize clarity, specificity, context provision, and iterative refinement to bridge the gap between human intent and model capabilities. Clarity requires unambiguous language, ensuring the prompt directly conveys the task without extraneous details. Specificity involves defining precise constraints, such as output format (e.g., JSON or bullet points) or length limits, to align responses with user expectations. Context provision entails supplying relevant background, examples, or role assignments (e.g., "You are a helpful assistant") to prime the model, drawing on its in-context learning abilities. Finally, iteration—testing variations and analyzing outputs—allows for progressive improvement, often guided by metrics like accuracy or coherence scores. These principles are particularly important for black-box models such as the GPT series, where prompt design serves as the primary interface for controlling behavior.

In practice, prompt engineering manifests at varying levels of structure. A simple zero-shot prompt might instruct: "Classify this text as positive or negative: The movie was thrilling and well-acted." This relies solely on the model's inherent understanding, without examples. In contrast, a more structured prompt adds context and a role: "You are a sentiment analyst. Review the following customer feedback and classify it as positive, negative, or neutral, explaining your reasoning: The service was prompt but the food arrived cold." Such refinements demonstrate how specificity and context can substantially improve output quality.
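The following is a minimal sketch of how such a structured prompt could be issued to a chat-style model, assuming the OpenAI Python SDK; the model name, role text, and feedback string are illustrative, and other providers expose similar chat interfaces.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Structured prompt: role assignment, explicit task, output constraint, and the input text.
system_role = "You are a sentiment analyst."
task = (
    "Review the following customer feedback and classify it as positive, "
    "negative, or neutral, explaining your reasoning in one sentence."
)
feedback = "The service was prompt but the food arrived cold."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": system_role},
        {"role": "user", "content": f"{task}\n\nFeedback: {feedback}"},
    ],
)
print(response.choices[0].message.content)
```

Separating the role, task, and input into distinct parts of the request mirrors the clarity, specificity, and context principles described above.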
Basic Prompting Methods
Basic prompting methods form the foundation of interacting with large language models (LLMs), enabling users to elicit desired outputs through carefully crafted instructions without retraining the model. These techniques prioritize simplicity and directness, making them accessible for beginners tackling straightforward tasks such as classification, translation, or generation. Among the core approaches are zero-shot and few-shot prompting, which rely on in-context learning to adapt the model's pre-trained knowledge to new problems.

Zero-shot prompting provides a direct natural language instruction to the model without any task-specific examples, allowing it to infer and perform the required action based solely on its training. For instance, a prompt like "Translate the following sentence to French: Hello world" can yield accurate translations for simple linguistic tasks, as the model draws on generalized patterns from its pre-training data. The method is particularly effective for well-represented domains like basic question answering or sentiment analysis; GPT-3, for example, achieved an 81.5 F1 score on the CoQA dataset in the zero-shot setting.[7] However, zero-shot prompting exhibits limitations in novel or complex domains such as natural language inference; on open-domain question answering, GPT-3 reached only 14.6% accuracy on Natural Questions in the zero-shot setting, reflecting the absence of guiding demonstrations that could clarify ambiguous instructions.[7]

Few-shot prompting builds on zero-shot prompting by incorporating a small number of examples (typically 1-5 input-output pairs) within the prompt to demonstrate the desired format, style, or reasoning pattern, thereby priming the model for better generalization. An analogy task might be prompted as "Q: Bird is to fly as fish is to? A: swim. Q: Car is to drive as boat is to? A:", leaving the model to complete the new query in the demonstrated structure. This approach enhances performance over zero-shot prompting, with GPT-3 reaching 85.0 F1 on CoQA and 71.2% accuracy on TriviaQA in few-shot settings, often rivaling fine-tuned models on benchmarks like reading comprehension.[7] The inclusion of examples mitigates formatting issues and improves reliability for tasks requiring specific stylistic adherence, though the demonstrations must be chosen to be diverse and representative to avoid biasing the model.

Role-playing prompts assign a specific persona or role to the model to shape its tone, expertise, and response perspective, simulating specialized knowledge or behavioral constraints. For example, "You are a helpful doctor. Diagnose the symptoms: persistent cough and fever" encourages the model to adopt a professional, empathetic voice while focusing on medical reasoning. This technique can improve zero-shot reasoning on arithmetic and commonsense tasks compared to standard prompts, as it leverages the model's ability to emulate roles seen in training data. Role-playing is especially useful for interactive applications like customer support or creative writing, where it influences output coherence and relevance without additional examples.

Effective prompts typically comprise four key structural elements: clear instructions detailing the task, relevant context to ground the response, the primary input data, and an output format specification to ensure parsable results.
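As a concrete illustration, the sketch below assembles these four elements, together with optional few-shot demonstrations, into a single prompt string; the classification task, example pairs, and helper function are hypothetical and not tied to any particular model or library.

```python
def build_prompt(instruction: str, context: str, examples: list[tuple[str, str]],
                 input_text: str, output_indicator: str) -> str:
    """Combine instruction, context, input data, and an output indicator
    (plus optional few-shot examples) into one prompt, using ### separators."""
    demo_block = "\n".join(f"Text: {x}\nLabel: {y}" for x, y in examples)
    return (
        f"{instruction}\n###\n"
        f"Context: {context}\n###\n"
        f"{demo_block}\n###\n"
        f"Text: {input_text}\n"
        f"{output_indicator}"
    )

prompt = build_prompt(
    instruction="Classify the sentiment of the text as positive, negative, or neutral.",
    context="The texts are short customer reviews of a restaurant.",
    examples=[("The staff were friendly and attentive.", "positive"),
              ("We waited an hour and nobody apologized.", "negative")],
    input_text="The service was prompt but the food arrived cold.",
    output_indicator='Answer with JSON in the form {"label": "..."}.',
)
print(prompt)
```

Placing the instruction first and marking section boundaries with separators follows the guidance elaborated next.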
Instructions should be placed at the beginning of the prompt for emphasis, such as "Summarize the following article in three bullet points," while context provides background like "Focus on environmental impacts." The input data follows as the core query, and output indicators—e.g., "Output in JSON format: {'key': 'value'}" or "Use bullet points"—guide structured generation, reducing ambiguity and improving usability across tasks. Separators such as "###" or triple quotes help delineate these elements and sharpen the model's focus.

Evaluating basic prompts involves assessing output quality through metrics like accuracy, which measures factual correctness against ground truth (e.g., exact match or F1 score), and coherence, which evaluates logical flow and relevance using human judgments or automated proxies like perplexity. Accuracy is critical for classification tasks, for instance, while coherence ensures narrative consistency in generation. An iterative refinement process is essential: start with zero-shot prompts, test on sample inputs, measure the metrics, then incorporate few-shot examples or role adjustments based on observed failures, repeating until performance stabilizes. This cycle, often yielding 10-20% gains per iteration on benchmarks, underscores the empirical nature of prompt design.
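A minimal sketch of such an evaluation loop, scoring candidate prompts by exact-match accuracy, is shown below; the `complete` stub standing in for an LLM call, the candidate templates, and the labeled samples are all hypothetical placeholders.

```python
from typing import Callable, List, Tuple

def accuracy(prompt_template: str,
             samples: List[Tuple[str, str]],
             complete: Callable[[str], str]) -> float:
    """Exact-match accuracy of one prompt template over labeled samples."""
    hits = 0
    for text, label in samples:
        output = complete(prompt_template.format(input=text)).strip().lower()
        hits += int(output == label)
    return hits / len(samples)

# Hypothetical candidates: a zero-shot baseline and a few-shot refinement.
candidates = [
    "Classify the sentiment as positive or negative: {input}\nSentiment:",
    ("Review: The plot was dull. Sentiment: negative\n"
     "Review: A delightful surprise. Sentiment: positive\n"
     "Review: {input} Sentiment:"),
]

samples = [("The movie was thrilling and well-acted.", "positive"),
           ("The service was slow and the food was cold.", "negative")]

def complete(prompt: str) -> str:
    return "positive"  # stub; replace with a real LLM call

# Keep the best-scoring prompt and iterate further on its failure cases.
best = max(candidates, key=lambda p: accuracy(p, samples, complete))
print(best)
```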
Historical Development
Origins in Early NLP
The roots of prompt engineering can be traced to early natural language processing (NLP) systems in the mid-20th century, where manual crafting of inputs was essential for eliciting desired responses from rule-based programs. A seminal example is ELIZA, developed in 1966 by Joseph Weizenbaum at MIT, which simulated conversation through pattern-matching rules and scripted responses to user inputs. ELIZA relied on hand-crafted templates to detect keywords in user statements and generate replies, such as rephrasing the input as a question to mimic a psychotherapist; this approach highlighted the critical role of input structure in guiding system behavior, though it was limited to rigid, predefined patterns.

In the 1990s, statistical NLP extended these ideas through template-filling techniques in information extraction tasks, particularly during the Message Understanding Conferences (MUC) organized by DARPA starting in 1987. Systems in MUC-1 and subsequent iterations used hand-crafted rules to parse texts and populate fixed templates with slots for entities like events, participants, and locations, as seen in early evaluations of naval message processing. This era marked a shift from purely symbolic AI to probabilistic methods, yet still required meticulous input preprocessing—such as rule-based annotation of training data—to achieve reliable parsing accuracy, often around 60-70% for template completion in controlled domains. The emphasis on crafting inputs to align with statistical models foreshadowed later prompting strategies.

Early analogs to prompting appeared in information retrieval (IR) systems of the 1970s and 1980s, where query formulation directly influenced search outcomes, and in machine learning pipelines involving feature engineering. In IR, Boolean queries—combining terms with operators like AND and OR—demanded precise phrasing to retrieve relevant documents, as demonstrated in the SMART system developed by Gerard Salton, which evaluated query effectiveness on test collections with recall rates varying by up to 30% based on formulation. Similarly, feature engineering in early machine learning for NLP tasks, such as part-of-speech tagging, involved manual selection and transformation of input representations (e.g., n-grams or lexical rules) to optimize classifier performance, underscoring input sensitivity as a core design principle.

The transition toward neural approaches in the 2000s and 2010s amplified these concepts, particularly with sequence-to-sequence (seq2seq) models that revealed how input phrasing affected output quality. Introduced by Sutskever et al. in 2014 for machine translation, seq2seq architectures using recurrent neural networks (RNNs) processed variable-length inputs to generate translations, where subtle changes in source sentence structure—such as word order or punctuation—could alter BLEU scores by 2-5 points, emphasizing the need for careful input design. This sensitivity extended to RNN-based tasks like sentiment analysis, where early models showed performance gains from engineered input formats, such as negation handling or context windows, achieving accuracies up to 85% on benchmark datasets when inputs were optimized. These developments bridged rule-based crafting to modern prompting, setting the stage for transformer-era innovations.

Key Advances with Transformer Models
The introduction of the Transformer architecture in 2017 revolutionized natural language processing by replacing recurrent layers with self-attention mechanisms, allowing models to capture long-range dependencies across entire input sequences in parallel.[8] This design enabled more flexible and context-aware handling of variable-length inputs, such as prompts, without the computational inefficiencies of sequential processing, thereby setting the stage for prompt engineering as a core interaction paradigm with large language models.

From 2018 to 2020, bidirectional models like BERT advanced prompt-based interactions through masked language modeling, where cloze-style prompts—requiring models to predict masked tokens based on bidirectional context—uncovered emergent abilities in tasks like question answering and sentiment analysis, often outperforming traditional fine-tuning approaches.[9] OpenAI's GPT-2, released in 2019, demonstrated unsupervised multitask learning via simple completion prompts, achieving state-of-the-art zero-shot performance on language modeling benchmarks with its 1.5 billion parameters.[10] The 2020 launch of GPT-3, scaling to 175 billion parameters, further amplified these capabilities, showing that few-shot prompts with in-context examples could elicit strong performance across diverse NLP tasks like translation and summarization, with improvements scaling logarithmically with prompt length and example count; this era popularized "prompt hacking" as practitioners iteratively refined inputs to unlock model potential.[11] Empirical studies on scaling laws from 2020 onward, including the 2022 Chinchilla analysis, confirmed that prompt efficacy in large autoregressive models correlates with increased parameter counts and training data, predicting performance gains of up to 10-20% on downstream tasks as models exceed 100 billion parameters.[12] Tools like PromptSource, introduced in 2022, standardized prompt creation and sharing by integrating datasets with templating functions, enabling researchers to curate task-specific inputs reproducibly and accelerating community-driven advancements in prompt design.[13]

By 2024 and 2025, prompt engineering extended to multimodal contexts with models like GPT-4o, which natively processes interleaved text, audio, and vision prompts to perform real-time reasoning, such as describing images while responding to voice queries, with latency reduced by 2x compared to GPT-4 Turbo.[14] This period also saw the proliferation of automated prompt optimization tools, integrated into ecosystems around models like Grok-2 (released August 2024), which supports advanced instruction-following via refined prompts and achieves competitive benchmarks in reasoning tasks.[15] In 2025, xAI released Grok-3 in February and Grok-4 in July, further enhancing multimodal prompting and reasoning capabilities in large-scale models.

Text-to-Text Techniques
In-Context Learning
In-context learning refers to the emergent ability of large language models (LLMs) to adapt to new tasks by conditioning their outputs on a few demonstrations provided directly in the input prompt, without any updates to the model's parameters. This capability was first prominently demonstrated in GPT-3, where the model generalized to unseen tasks using zero, one, or a small number of input-output examples embedded in the prompt, marking a shift from traditional fine-tuning approaches. Earlier models like GPT-2 showed preliminary signs of this behavior, but it became more reliable and pronounced in larger-scale architectures.

The underlying mechanisms of in-context learning involve the transformer's attention mechanism, which implicitly simulates a form of fine-tuning by weighting and integrating information from the prompt tokens during inference. Specifically, induction heads—specialized attention patterns—enable the model to detect and copy relevant patterns from the examples, facilitating task adaptation through gradient-like updates encoded in the forward pass. Effective in-context learning also depends on careful selection of prompt examples, prioritizing diversity to cover varied scenarios and relevance to the target input to maximize generalization.

In practice, in-context learning applies to tasks such as text classification and generation, where 3-5 input-output pairs are often sufficient to guide the model. For instance, in question answering, a prompt might include examples like:

Q: What is the capital of [France](/page/France)? A: [Paris](/page/Paris)
Q: What is the capital of Japan? A: [Tokyo](/page/Tokyo)
Q: What is the capital of [Brazil](/page/Brazil)? A:

The model then completes the response based on the pattern. Variants include dynamic few-shot learning, where examples are selected at inference time based on similarity to the query, enhancing adaptability without predefined prompts. However, limitations arise from context length constraints, as models struggle with long prompts exceeding token limits, typically around 4,000 tokens in early implementations.

Empirical studies show that in-context learning performance improves with increasing model size, as larger LLMs better capture complex patterns from few examples, and with prompt length up to the context window, where additional demonstrations boost accuracy until saturation. This approach extends to reasoning tasks through methods like chain-of-thought prompting, which builds on example-based adaptation by incorporating step-by-step demonstrations.
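The sketch below illustrates dynamic few-shot selection as described above, using a simple word-overlap heuristic to pick the demonstrations most relevant to the query; the demonstration pool, scoring function, and prompt layout are illustrative assumptions rather than a standard implementation.

```python
from typing import List, Tuple

def overlap(a: str, b: str) -> float:
    """Crude lexical similarity: fraction of shared lowercase words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def build_prompt(pool: List[Tuple[str, str]], query: str, k: int = 3) -> str:
    """Select the k demonstrations most similar to the query and format a few-shot prompt."""
    ranked = sorted(pool, key=lambda qa: overlap(qa[0], query), reverse=True)[:k]
    lines = [f"Q: {q} A: {a}" for q, a in ranked]
    lines.append(f"Q: {query} A:")
    return "\n".join(lines)

# Hypothetical demonstration pool.
pool = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
    ("Who wrote Hamlet?", "William Shakespeare"),
    ("What is the chemical symbol for gold?", "Au"),
]

print(build_prompt(pool, "What is the capital of Brazil?", k=2))
# The assembled prompt is then sent to an LLM, which completes the final answer.
```

In production systems, the word-overlap heuristic is typically replaced by embedding-based similarity search over a larger example store.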