
Natural language generation

Natural language generation (NLG) is the subfield of artificial intelligence and computational linguistics focused on the automatic production of human-readable text from structured data or other non-linguistic inputs to achieve specific communicative objectives. The process transforms abstract representations, such as databases, knowledge graphs, or semantic structures, into coherent, fluent, and contextually appropriate output. Unlike natural language understanding, which interprets text, NLG emphasizes the deliberate construction of text to meet goals like informing, persuading, or entertaining. The core architecture of NLG systems typically follows a pipeline comprising content planning (selecting relevant information), discourse structuring (organizing it logically), sentence planning (choosing words and aggregating messages), and surface realization (ensuring grammaticality and fluency).

Early NLG efforts in the 1970s and 1980s relied on rule-based and template-filling methods for simple tasks, such as generating reports or database summaries, but these were limited in flexibility and scalability. By the 1990s, more sophisticated systems emerged, incorporating knowledge representation techniques to handle complex domains like medical reporting. In the 2000s and 2010s, the advent of statistical and neural approaches revolutionized NLG, enabling end-to-end models that learn directly from data-to-text pairs without explicit modular stages. Transformer-based large language models, such as the GPT series (including the GPT-5 series released in 2025), have further advanced the field by producing diverse, creative text for applications including dialogue systems and content summarization. These neural methods excel at open-ended generation but introduce challenges like factual inaccuracies (hallucinations) and the need for controllability.

As of 2025, NLG applications span numerous domains, from personalized product descriptions derived from product ontologies to accessible explanations of data for non-experts. In task-oriented dialogue systems, NLG integrates with dialogue management to generate responses that align with user intents and system policies. Evaluation metrics for NLG emphasize fluency, adequacy, and informativeness, often combining automated scores such as BLEU with human judgments. Ongoing research addresses ethical concerns, such as bias mitigation and the trustworthiness of generated text, particularly in high-stakes areas like legal or healthcare communication.

Fundamentals

Definition and Scope

Natural language generation (NLG) is the subfield of artificial intelligence and computational linguistics concerned with the construction of computer systems that produce understandable texts in human languages from underlying non-linguistic representations of information, such as databases, knowledge bases, or other structured inputs. This process involves deliberately constructing natural language text to meet specified communicative goals, transforming data into coherent, fluent output that mimics human-like expression. In scope, NLG focuses on output generation rather than input parsing, distinguishing it from natural language understanding (NLU), which maps text to meaning representations. While NLU interprets unstructured language, NLG inverts this by generating text from semantic or structured sources, encompassing sub-tasks such as text summarization from documents and dialogue response creation in conversational systems. Within the broader natural language processing (NLP) pipeline, which encompasses both understanding and generation, NLG specifically handles the production phase. Key concepts in NLG include the mapping of diverse input types—ranging from numerical data to semantic representations—into varied output forms like reports, captions, or descriptive narratives. For instance, inputs from knowledge bases might yield explanatory texts, emphasizing the need for fluency and appropriateness in the generated language. As a core component of human-AI interaction, NLG enables machines to communicate effectively in natural language, bridging the gap between computational systems and human users by producing readable and informative text from abstract representations. This capability supports applications in AI-driven interfaces, where generated language enhances the accessibility and interpretability of machine outputs.

Historical Development

The origins of natural language generation (NLG), the subfield of artificial intelligence focused on producing coherent and contextually appropriate text from non-linguistic inputs, trace back to the mid-20th century amid broader advances in linguistics and computing. Foundational theoretical work in the 1950s and 1960s, particularly Noam Chomsky's introduction of generative grammar in Syntactic Structures (1957), emphasized hierarchical structures and transformational rules for language production, influencing early computational efforts to model text generation as a systematic process akin to human language production. By the 1970s, initial experiments in rule-based systems emerged, building on these linguistic theories to generate simple sentences from logical representations, though limited by computational constraints and a lack of empirical data.

The classical era of NLG in the 1980s and 1990s shifted toward structured pipeline architectures, emphasizing modular processes for content planning, sentence structuring, and surface realization. David D. McDonald's 1982 work on salience in selection mechanisms highlighted how prioritizing key information could guide text construction in rule-based generators, addressing challenges in choosing what to express from complex inputs. Concurrently, the PENMAN project, developed by William C. Mann at the Information Sciences Institute, introduced a comprehensive text generation system that integrated knowledge representation with rhetorical planning, enabling the production of multi-sentence discourses. A pivotal contribution was Rhetorical Structure Theory (RST), formalized by Mann and Thompson in 1988, which modeled text coherence through hierarchical relations between spans (e.g., elaboration, contrast), providing a framework for organizing generated content to mimic human argumentation and narrative flow. These template- and rule-driven approaches dominated, focusing on domain-specific applications like weather reports, but struggled with scalability and flexibility.

The 2000s marked a transition to data-driven paradigms, incorporating statistical methods to handle variability in language output. Irene Langkilde's forest-based generation system (2000) represented a breakthrough by combining symbolic input representations with statistical optimization over vast realization forests, drawing on techniques from statistical language modeling to select fluent sentences probabilistically rather than exhaustively via rules. This integration allowed NLG to leverage parallel corpora and n-gram models, improving robustness in noisy or ambiguous scenarios, and paved the way for systems that balanced interpretability with empirical performance. Key events during this period included the establishment of the International Natural Language Generation Conference (INLG), with workshops dating back to 1983 and formal conferences beginning around 2000, fostering collaboration on benchmarks and evaluation metrics.

From the 2010s onward, the advent of deep learning revolutionized NLG, enabling end-to-end models that bypassed traditional pipelines. The Transformer architecture, introduced by Vaswani et al. in 2017, used self-attention mechanisms to capture long-range dependencies in sequences, dramatically enhancing generation quality and efficiency for tasks like summarization and dialogue. Subsequent models like OpenAI's GPT series, starting with GPT in 2018, scaled unsupervised pretraining on massive corpora to produce diverse, context-aware text, while Google's T5 (Raffel et al., 2020) unified NLG tasks under a text-to-text framework, achieving state-of-the-art results through transfer learning on diverse datasets.
Notable advancements since then include OpenAI's GPT-3 (2020) and GPT-4 (2023), which demonstrated unprecedented scale in parameter size and performance, alongside models like Google's PaLM (2022) and Meta's LLaMA series, enhancing NLG's versatility and integration with multimodal tasks. The confluence of large datasets, neural networks, and increased computational power has since driven NLG toward more scalable, general-purpose systems, with ongoing INLG conferences highlighting practical impacts and ethical considerations.

Methodologies

Classical Pipeline Approaches

Classical pipeline approaches in natural language generation (NLG) rely on a modular, sequential architecture that decomposes the generation process into distinct stages, transforming non-linguistic input data—such as databases or semantic representations—into coherent human-readable text. The pipeline typically consists of three primary phases: content planning, which determines the relevant information to include; sentence planning (or microplanning), which organizes that information into logical structures; and surface realization, which applies linguistic rules to produce grammatical output. Unlike end-to-end neural models, these pipelines offer high interpretability and fine-grained control, allowing developers to intervene at specific stages for debugging or error correction, though they require extensive manual engineering.

The key components form a structured sequence in which each stage builds on the previous one to ensure systematic text production. In content planning, rules or schemas select and organize messages from input data, often drawing on domain-specific knowledge bases to decide what facts to convey and in what order, such as prioritizing critical events in a report. Sentence planning then aggregates related messages, performs lexical choice to select appropriate words, and generates referring expressions to maintain coherence. Finally, surface realization linearizes this abstract structure into surface forms using syntactic and morphological rules, ensuring fluency and correctness. This decomposition, rooted in early NLG theory, enables targeted development but demands integration across modules to avoid inconsistencies.

Rule-based methods dominate these pipelines, employing templates for simple slot-filling, formal grammars for syntactic construction, and knowledge bases for semantic guidance. Templates provide predefined patterns with placeholders for data, offering efficiency in controlled domains but limited variability. More sophisticated approaches use unification-based grammars, which merge feature structures to resolve choices like lexical selection through argumentation over rhetorical relations. A seminal example is FUF (Functional Unification Formalism), an early system that implements unification grammars to control lexical choice and generate varied realizations from abstract inputs, emphasizing declarative rules over procedural coding. These methods leverage hand-crafted resources, such as systemic grammars or meaning-text theory, to encode linguistic knowledge explicitly.

Despite their strengths, classical pipelines exhibit notable limitations, including rigidity in handling novel inputs or ambiguities, as rules cannot easily generalize beyond encoded scenarios. Developing and maintaining these systems is labor-intensive, requiring expert knowledge engineering for grammars, lexicons, and domain rules, which scales poorly to new applications. They dominated NLG research and deployment until the early 2000s, when statistical and machine learning techniques began offering greater flexibility.

A representative example is the FoG (Forecast Generator) system, which produces textual forecasts from meteorological data using a domain-specific sublanguage. FoG employs rule-based content planning to select key weather events, sentence planning for aggregation and phrasing choices (e.g., using hedged wording for uncertain predictions), and surface realization via templates and simple grammars to generate readable bulletins. Deployed for Canadian weather services, it demonstrated the practicality of pipelines in operational settings, producing forecasts in English and French while highlighting the need for corpus-informed rules to ensure naturalness. Modern alternatives, such as neural decoders, have since reduced the reliance on such hand-engineered systems for broader applicability.
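The staged design can be made concrete with a short sketch. The following Python example is a minimal illustration only: the field names, thresholds, and templates are hypothetical and are not taken from FoG or any deployed system.

```python
# Minimal sketch of a classical three-stage NLG pipeline for weather bulletins.
# Field names, thresholds, and templates are illustrative, not from a real system.

def content_planning(observations):
    """Content determination: select the facts worth reporting."""
    facts = []
    if observations["wind_kmh"] >= 40:
        facts.append(("wind", observations["wind_kmh"]))
    if observations["rain_mm"] > 0:
        facts.append(("rain", observations["rain_mm"]))
    return facts

def sentence_planning(facts):
    """Microplanning: group the selected facts into a message specification."""
    return {"messages": facts}

def surface_realization(plan):
    """Surface realization: fill slot-based templates to produce the final text."""
    templates = {
        "wind": "Winds will reach {0} km/h.",
        "rain": "Expect around {0} mm of rain.",
    }
    sentences = [templates[kind].format(value) for kind, value in plan["messages"]]
    return " ".join(sentences)

observations = {"wind_kmh": 55, "rain_mm": 12}
bulletin = surface_realization(sentence_planning(content_planning(observations)))
print(bulletin)  # Winds will reach 55 km/h. Expect around 12 mm of rain.
```

Even in this toy form, the separation of stages mirrors the classical pipeline: the selection rules can be changed without touching the templates, and vice versa.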

Machine Learning-Based Techniques

Machine learning-based techniques in natural language generation (NLG) represent a shift from rule-based systems to data-driven approaches that learn patterns directly from annotated corpora, enabling more flexible and scalable text production. Early statistical methods laid the foundation by employing probabilistic models to capture linguistic regularities. N-gram-based generation, for instance, models the probability of word sequences using Markov assumptions, where the likelihood of a word depends on the preceding n-1 words, facilitating simple yet effective text completion in NLG tasks. These models were often combined with maximum entropy frameworks, which optimize feature-based probabilities without assuming feature independence, as demonstrated in trainable systems for surface realization that generate text from semantic representations using annotated corpora.

A pivotal advancement came with neural architectures, particularly sequence-to-sequence (seq2seq) models, which use encoder-decoder frameworks to map input sequences—such as structured data or meaning representations—to output text. Introduced using long short-term memory (LSTM) networks, these models encode the input into a fixed-dimensional vector and decode it autoregressively, achieving strong performance in tasks like machine translation that parallel NLG applications. To address limitations in handling long-range dependencies, attention mechanisms were integrated, allowing the decoder to dynamically focus on relevant input parts during generation; this culminated in the Transformer architecture, which relies entirely on self-attention layers to process sequences in parallel, revolutionizing NLG by improving coherence and efficiency in producing fluent text from diverse inputs.

End-to-end learning extends these neural approaches by directly mapping structured inputs, like database records or RDF triples, to natural language outputs without intermediate symbolic stages, trained via maximum likelihood estimation. The core objective in such frameworks is to minimize the negative log-likelihood loss:

L = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, x)

where x denotes the input representation, y_{<t} the partial output up to timestep t, and T the output length, enabling models to learn holistic mappings from data. This has been applied effectively in data-to-text generation, producing descriptive text from tabular or graph-structured information. Key datasets supporting these methods include WebNLG, which provides RDF triple sets paired with verbalizations for training RDF-to-text systems across multiple languages, and the E2E dataset, comprising dialogue acts in the restaurant domain mapped to natural language references, designed to evaluate end-to-end NLG in spoken dialogue systems.

More recent developments leverage pre-trained large language models (LLMs) for NLG by fine-tuning them on task-specific data, enhancing generation quality through knowledge transferred from vast unlabeled corpora. Models like GPT, pre-trained generatively on next-token prediction, excel in open-ended text production and can incorporate controllability through structured prompts that guide output toward desired attributes, such as style or factual accuracy. Similarly, BERT's bidirectional pre-training on masked language modeling allows its encoder representations to inform conditional generation tasks, though encoder-decoder adaptations extend this design to full NLG. These techniques have demonstrated superior fluency and diversity in applications ranging from summarization to personalized content generation, often outperforming earlier neural baselines on automatic metrics such as BLEU in controlled evaluations.
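As a concrete illustration of the objective above, the sketch below computes the token-level negative log-likelihood of a single output sequence under a toy conditional distribution; the vocabulary and probabilities are invented, and a lookup table stands in for an encoder-decoder network.

```python
import math

# Toy conditional model P(y_t | y_<t, x): a lookup table keyed by the previous
# token stands in for a trained encoder-decoder. All probabilities are invented.
cond_prob = {
    ("<s>",): {"it": 0.6, "rain": 0.4},
    ("it",): {"rained": 0.7, "rains": 0.3},
    ("rained",): {"</s>": 0.9, "today": 0.1},
}

def negative_log_likelihood(target_tokens):
    """L = -sum_t log P(y_t | y_<t, x) for a single training example."""
    loss, prev = 0.0, ("<s>",)
    for token in target_tokens:
        loss -= math.log(cond_prob[prev][token])
        prev = (token,)
    return loss

print(negative_log_likelihood(["it", "rained", "</s>"]))
# -(ln 0.6 + ln 0.7 + ln 0.9), roughly 0.973
```

Training a neural generator amounts to minimizing this quantity, summed over the corpus, with respect to the model parameters that define the conditional distribution.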

Hybrid and Emerging Methods

Hybrid systems in natural language generation integrate rule-based planning with neural realization to leverage the strengths of both paradigms, enabling structured content selection while producing fluent outputs. For instance, the Step-by-Step approach of Moryossef et al. (2019) separates the process into a symbolic planning stage that ensures fidelity to input data and a neural generation stage for linguistic realization, improving control over output structure without sacrificing naturalness. This approach balances the interpretability and precision of classical methods with the flexibility of neural networks, particularly in data-to-text tasks where adherence to source information is crucial.

Controllable natural language generation techniques allow for targeted attribute control during text production, addressing limitations in unconstrained neural models. Plug-and-Play Language Models (PPLM) achieve this by steering pretrained language models using lightweight attribute classifiers that manipulate activation patterns without fine-tuning the base model, enabling attributes like sentiment or topic to be controlled dynamically. Reinforcement learning methods further enhance fidelity by optimizing generation for specific constraints, such as factual accuracy in summaries, through reward signals derived from external verifiers.

Multimodal natural language generation extends text production to incorporate non-textual inputs like images or videos, fostering richer interactions. Vision-language models such as CLIP facilitate this by aligning visual and textual representations, allowing generators to produce descriptive captions or narratives grounded in visual content through integrated encoding-decoding pipelines. Recent advancements as of 2025 include natively multimodal large language models like Llama 4 variants, which process text and images for cross-modal generation in applications such as visual question answering. These systems improve coherence between modalities, as seen in applications where image features guide narrative flow, reducing mismatches in generated descriptions.

Neuro-symbolic methods that merge neural networks with symbolic logic have become established approaches to enhance reasoning and interpretability in NLG. These integrate logical rules into neural architectures so that symbolic inference ensures consistency while neural components handle linguistic variability, as surveyed in recent frameworks from 2024. Ethical considerations, particularly bias mitigation, are integral to these developments; techniques such as data augmentation and counterfactual fairness interventions counteract gender or racial biases in generated text by balancing training distributions and evaluating outputs against fairness metrics.

Addressing scalability issues in large models for NLG involves strategies to curb hallucinations, where models produce unverifiable content. Retrieval-augmented generation (RAG) mitigates this by conditioning outputs on retrieved external knowledge, improving factual accuracy in knowledge-intensive tasks by up to 20-30% on benchmarks like open-domain question answering without expanding model parameters. Recent hybrid RAG systems as of 2025 further refine this by combining dense and sparse retrieval for enhanced performance in dynamic environments. This method supports efficient scaling by offloading memory to non-parametric stores, enabling reliable generation in resource-constrained environments.
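The retrieve-then-generate pattern behind RAG can be sketched in a few lines; the corpus, the word-overlap scoring, and the generate() stub below are placeholders standing in for a real dense retriever and language model, not a specific library API.

```python
# Conceptual retrieval-augmented generation (RAG) loop. The corpus, retrieval
# scoring, and generate() stub are placeholders for a real retriever and LLM.

corpus = [
    "NLG produces text from structured data.",
    "The Transformer architecture was introduced in 2017.",
    "RAG conditions generation on retrieved passages.",
]

def retrieve(query, k=2):
    """Rank passages by naive word overlap (stand-in for dense retrieval)."""
    query_words = set(query.lower().split())
    def overlap(passage):
        return len(query_words & set(passage.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def generate(prompt):
    """Placeholder for a language model call; here it just echoes the prompt."""
    return f"[model output conditioned on]\n{prompt}"

question = "When was the Transformer architecture introduced?"
context = "\n".join(retrieve(question))
print(generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"))
```

The key design point is that the generator only ever sees the retrieved passages alongside the question, so factual grounding can be updated by changing the external store rather than retraining the model.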

Core Processes

Content Determination

Content determination is the initial phase in the natural language generation (NLG) pipeline, where the system transforms raw input data—such as database records or sensor outputs—into a set of communicative goals by selecting, aggregating, and prioritizing relevant information for expression in text. This process ensures that the generated output focuses on key facts while avoiding redundancy, aligning the content with the intended communicative purpose, such as informing or persuading the audience. Aggregation involves grouping related data points to improve conciseness and reduce repetition.

Techniques for content determination include schema-based selection, which uses predefined templates to identify pertinent data elements based on domain-specific criteria, and Rhetorical Structure Theory (RST), which organizes selected content into a hierarchical discourse structure to guide overall text coherence. RST, introduced by Mann and Thompson, defines relations between text spans (e.g., elaboration or contrast) to prioritize information that supports the primary communicative intent. Content planning algorithms often employ rule-based systems to evaluate input against goals, such as including only statistically significant trends in a report.

A representative example occurs in automated report generation from sensor data, where the system selects key statistics—such as mean temperature and peak wind speed from hourly readings—while omitting redundant entries, applying aggregation rules to summarize numerical values into concise descriptors such as "The weather was mild, with gusty winds." This selection ensures the text remains focused and readable without overwhelming the reader with raw details.

Unique challenges in content determination arise when handling incomplete or conflicting sources, such as missing values in a dataset or contradictory records from multiple sensors, which can lead to biased or inaccurate selections if not resolved through imputation or heuristics. Systems must incorporate validation steps to detect and mitigate these issues, ensuring robust content choices. This phase applies to various input types, ranging from structured data like relational tables, where selection involves querying specific rows and columns, to semi-structured formats such as knowledge graphs, where traversal algorithms identify relevant nodes and edges for inclusion. The determined content then informs subsequent microplanning, where rhetorical relations and ordering are refined for textual expression.
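The selection-and-aggregation step can be expressed as simple rules over a data stream, as in the sketch below; the readings, thresholds, and missing-value strategy are hypothetical and chosen purely for illustration.

```python
# Illustrative content determination over hourly temperature readings
# (None marks a missing value). Thresholds are invented for the example.

readings = [12.0, 13.5, None, 14.2, 15.0, None, 16.8, 16.5]

def determine_content(values, trend_threshold=2.0):
    """Select salient facts and aggregate them instead of reporting every hour."""
    clean = [v for v in values if v is not None]   # naive handling of missing data
    facts = {"min": min(clean), "max": max(clean)}
    # Aggregate the hour-by-hour change into a single trend fact when it is large.
    if clean[-1] - clean[0] >= trend_threshold:
        facts["trend"] = "rising"
    return facts

print(determine_content(readings))
# {'min': 12.0, 'max': 16.8, 'trend': 'rising'}
```

The resulting fact set, rather than the raw readings, is what gets passed on to microplanning and realization.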

Microplanning

Microplanning is the intermediate stage in natural language generation (NLG) pipelines where selected content from the document planning phase is organized into coherent textual units, focusing on decisions that ensure logical flow and linguistic appropriateness before surface realization. This process transforms abstract representations, such as propositions or facts, into structured specifications for realization, addressing how information is packaged to achieve communicative goals. According to the classic framework outlined by Reiter and Dale, microplanning bridges high-level content selection and low-level syntactic formation by handling choices that impact readability and coherence.

Core tasks in microplanning include structuring sentences through discourse relations, lexical choice, and aggregation of clauses. Discourse relations, such as elaboration or contrast, are often modeled using Rhetorical Structure Theory (RST), which organizes text into hierarchical trees where relations link spans to convey intentions like explanation or justification. For instance, in explanatory texts, a cause-effect relation might connect a precipitating event to its outcome, ensuring the generated paragraph flows logically from bullet-point facts like "The patient experienced low oxygen levels" to "This led to respiratory distress." Lexical choice involves selecting words or phrases that best convey meaning while considering context, such as choosing "decline" over "drop" for medical reports to match register, guided by resources like VerbNet for semantic compatibility. Aggregation merges related clauses to avoid repetition, for example, combining multiple similar events into a single sentence like "The patient had three successive bradycardias down to 69 bpm" instead of separate statements, using rule-based heuristics or statistical methods.

Referring expressions are generated during microplanning to maintain coherence, resolving anaphora through theories like centering theory, which tracks salience across utterances to decide between pronouns and full descriptions. For example, in a sequence describing events, a highly salient entity (e.g., "the patient") might be referred to with a pronoun in subsequent sentences if it remains the focus, following principles of local coherence. An incremental algorithm prioritizes attributes in descriptions, such as type before color, to generate concise yet informative references like "the red car" only when necessary. Formalisms such as discourse schemas support these decisions by providing patterns for rhetorical relations in RST-based generation, ensuring relations are realized appropriately in text spans.

Linguistic features such as tense, aspect, and modality are selected based on contextual cues during microplanning to align with the intended temporal or evidential stance. For instance, past tense and perfective aspect might be chosen for completed events in reports, as in "The treatment had been administered," while modal verbs like "may" introduce uncertainty for hypothetical outcomes. These choices are encoded in semantic representations passed to realization, drawing from input specifications like event types and arguments.
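Two of these decisions, clause aggregation and referring-expression choice, can be illustrated with a short sketch; the message format, counting rule, and salience test are simplified assumptions rather than a faithful model of any published system.

```python
# Sketch of two microplanning decisions: aggregating repeated events and choosing
# a referring expression based on local salience. Rules are deliberately simplistic.

messages = [
    {"entity": "the patient", "event": "had a bradycardia at 08:10"},
    {"entity": "the patient", "event": "had a bradycardia at 08:25"},
    {"entity": "the patient", "event": "had a bradycardia at 08:40"},
]

def aggregate(msgs):
    """Merge repeated events about the same entity into one counted clause."""
    entity = msgs[0]["entity"]
    if all(m["entity"] == entity and "bradycardia" in m["event"] for m in msgs):
        return [{"entity": entity, "event": f"had {len(msgs)} successive bradycardias"}]
    return msgs

def refer(entity, previous_focus):
    """Pronominalize when the entity is already the local focus (gender assumed)."""
    return "he" if entity == previous_focus else entity

plan = aggregate(messages) + [{"entity": "the patient",
                               "event": "was given supplemental oxygen"}]
focus, sentences = None, []
for m in plan:
    subject = refer(m["entity"], focus)
    sentences.append(f"{subject[0].upper()}{subject[1:]} {m['event']}.")
    focus = m["entity"]
print(" ".join(sentences))
# The patient had 3 successive bradycardias. He was given supplemental oxygen.
```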

Realization and Generation

Realization and generation, often termed surface realization, constitutes the concluding phase of the natural language generation (NLG) pipeline, transforming abstract representations from microplanning—such as conceptual structures or deep syntactic forms—into coherent, grammatical text. This process ensures that the output adheres to linguistic rules, producing sentences that are syntactically correct and morphologically appropriate for the target language.

Syntactic realization maps logical forms to surface structures by selecting and ordering words within grammatical frameworks. Early systems leveraged Generalized Phrase Structure Grammar (GPSG), a context-free formalism that supports efficient generation through feature percolation and unification, enabling the production of varied syntactic variants from a single input representation. Tree-adjoining grammars (TAG) offer an alternative, using elementary trees as building blocks that can be combined via substitution and adjunction to handle dependencies like relative clauses, providing precise control over sentence complexity in NLG. Comprehensive implementations, such as the SURGE system, integrate systemic-grammar principles with functional unification to realize deep-syntactic inputs into full English sentences, demonstrating reusability across diverse NLG applications.

Morphological generation addresses word-level adjustments, inflecting lemmas according to syntactic features like tense, number, and case to form complete lexical items. For example, it conjugates verbs (e.g., "rain" to "rained" for the past tense) and pluralizes nouns based on orthographic rules and exceptions. Robust finite-state implementations achieve high accuracy by prioritizing rules for irregularities, such as deriving "stimuli" from "stimulus+s_N" while handling over 1,100 exceptional lemmata for consonant doubling and other patterns. A simple illustration of the process transforms an abstract input like "event: rain, location: city, time: yesterday" into the sentence "It rained in the city yesterday," where syntactic frames embed the event, adjuncts specify location and time, and morphological rules apply the past-tense inflection.

Key algorithms for syntactic realization include chart-based methods, which use bottom-up dynamic programming to parse and assemble structures from lexical entries, as adapted for Combinatory Categorial Grammar (CCG) to cover logical forms with bit-vector tracking for efficiency. Optimization techniques employ integer linear programming to jointly select lexical choices and structures, minimizing length or maximizing compactness while enforcing grammatical constraints, often integrating with content selection for improved output density. Output polishing refines the generated text by applying orthographic rules for capitalization, punctuation, and spacing, alongside basic checks for agreement, ensuring the final product reads naturally without altering core semantics.
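The rule-plus-exception organization typical of morphological generators can be sketched as follows; the exception lists and suffix rules are a small illustrative subset, not drawn from any particular finite-state lexicon.

```python
# Minimal morphological generation: regular suffix rules backed by an exception
# lexicon, in the spirit of the finite-state generators described above.
# The word lists are illustrative only.

PLURAL_EXCEPTIONS = {"stimulus": "stimuli", "child": "children", "sheep": "sheep"}
PAST_EXCEPTIONS = {"go": "went", "run": "ran", "be": "was"}

def pluralize(noun):
    """Inflect a noun lemma for plural number (exceptions take priority)."""
    if noun in PLURAL_EXCEPTIONS:
        return PLURAL_EXCEPTIONS[noun]
    if noun.endswith(("s", "x", "ch", "sh")):
        return noun + "es"
    return noun + "s"

def past_tense(verb):
    """Inflect a verb lemma for past tense (exceptions take priority)."""
    if verb in PAST_EXCEPTIONS:
        return PAST_EXCEPTIONS[verb]
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"

print(pluralize("stimulus"), pluralize("report"))  # stimuli reports
print(past_tense("rain"), past_tense("go"))        # rained went
```

Real systems apply the same priority ordering, consulting the exception lexicon first and falling back to orthographic rules, but with far larger rule sets and full feature bundles (tense, number, case, agreement).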

Applications

Data-to-Text Systems

Data-to-text systems in natural language generation (NLG) focus on transforming structured data, such as tables or time series, into coherent, human-readable text. These systems are particularly valuable in domains requiring regular reporting from quantitative inputs, where manual writing is time-intensive or error-prone. Early examples include the SUMTIME system, which generates textual forecasts from numerical meteorological data for marine weather reports, demonstrating how rule-based pipelines can produce reliable summaries tailored to specific user needs like safety-critical operations. Similarly, in finance, data-to-text approaches automate summaries of stock market data, converting tabular records of prices, volumes, and trends into overviews that highlight key movements and implications for investors.

Adapting classical NLG pipelines for data-to-text involves customizing stages like content determination and microplanning to handle tabular or relational inputs. For instance, meaning representation languages such as Abstract Meaning Representation (AMR) facilitate the mapping of structured data to semantic graphs, enabling systematic selection and aggregation of relevant facts while preserving logical relationships. This adaptation ensures that generated text adheres to domain-specific conventions, such as emphasizing temporal sequences in weather data or causal inferences in financial trends.

Case studies illustrate practical impacts, including systems for generating product descriptions from attribute-value pairs, which support scalable content creation for e-commerce catalogs. Such systems also enhance accessibility, particularly for visually impaired users, by converting geo-referenced or tabular data into narratives readable aloud via screen readers, as explored in projects linking map data to descriptive text. Modern enhancements leverage neural architectures for more fluent and context-aware generation. For example, end-to-end models like DataTuner employ sequence-to-sequence approaches to process structured inputs, improving fluency and factual alignment in outputs compared to traditional methods. Sports commentary has also grown as an application area, with systems trained on the SportSett:Basketball dataset producing NBA game recaps from play-by-play statistics, capturing highlights and narratives with fidelity to event data.

Conversational and Interactive Uses

Natural language generation (NLG) plays a pivotal role in conversational and interactive systems, enabling the production of human-like responses in real-time dialogue. These systems integrate NLG with natural language understanding (NLU) to form end-to-end pipelines that interpret user intents and generate coherent outputs, evolving from modular architectures to unified neural models that reduce error propagation. Early examples include rule-based chatbots like A.L.I.C.E., developed by Richard Wallace in 1995, which used pattern matching via the Artificial Intelligence Markup Language (AIML) to generate responses without deep contextual reasoning. This marked a foundational shift toward interactive NLG, though limited to scripted interactions.

Advancements in neural architectures have transformed conversational NLG, with systems like BlenderBot, released by Facebook AI Research in 2020, employing large-scale transformer-based models to produce open-domain responses that maintain fluency and relevance across turns. Techniques such as response generation from dialogue acts—abstract representations of communicative intentions—allow NLG modules to convert structured plans into natural utterances, often integrated with dialogue management frameworks like Partially Observable Markov Decision Processes (POMDPs) for tracking hidden user states and context. POMDPs enable probabilistic belief updates over dialogue history, facilitating adaptive generation in uncertain environments.

In practical applications, NLG powers task-oriented personal assistants like Apple's Siri and Amazon's Alexa, which generate responses to fulfill user goals such as scheduling or information retrieval by verbalizing dialogue states and actions. Customer service chatbots in banking domains similarly leverage NLG to produce personalized, context-aware replies, drawing on non-linguistic data like transaction histories to enhance response relevance in multi-turn interactions.

Key challenges in conversational NLG include maintaining coherent context across extended dialogues, where models must resolve coreferences and track evolving states to avoid repetition or drift. Handling ambiguity in user inputs—such as vague intents or polysemous queries—further complicates generation, often requiring clarification strategies to elicit precise information without disrupting flow. Recent advancements distinguish between retrieval-based approaches, which select pre-defined responses from a corpus for consistency and speed, and generative methods, which synthesize novel outputs for flexibility but risk hallucinations. Datasets like MultiWOZ, introduced by Budzianowski et al. in 2018, have driven progress by providing multi-domain, annotated dialogues for training end-to-end systems that simulate real-world task-oriented interactions.
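How an NLG module might turn a dialogue act into an utterance can be sketched with a few templates; the act types, slot names, and wording below are hypothetical and only loosely follow the restaurant domain used in datasets such as E2E and MultiWOZ.

```python
# Sketch of template-based response generation from a dialogue act, as an NLG
# module in a task-oriented system might do. Act types, slots, and templates
# are invented for illustration.

def realize(dialogue_act):
    act_type, slots = dialogue_act
    if act_type == "inform":
        return (f"{slots['name']} is a {slots['food']} restaurant "
                f"in the {slots['area']} of town.")
    if act_type == "request":
        return f"What {slots['slot']} are you looking for?"
    return "Sorry, could you rephrase that?"

act = ("inform", {"name": "La Tasca", "food": "Spanish", "area": "centre"})
print(realize(act))
# La Tasca is a Spanish restaurant in the centre of town.
```

A neural NLG module plays the same role but replaces the hand-written templates with a model conditioned on the dialogue act, trading predictability for variety.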

Multimedia and Creative Generation

Natural language generation (NLG) in multimedia contexts involves producing textual descriptions from non-textual inputs such as images and videos, enabling applications like automated captioning for accessibility and indexing. A seminal approach is the "Show and Tell" model, which combines a convolutional neural network (CNN) to encode visual features with a recurrent neural network (RNN) to decode them into coherent sentences, achieving state-of-the-art performance on benchmarks at the time. This encoder-decoder architecture has influenced subsequent multimodal NLG systems by demonstrating how visual embeddings can guide sequence generation. Datasets like MS COCO, released in 2014, have been pivotal for training such models, providing over 120,000 images paired with multiple human-annotated captions to support evaluation of descriptive accuracy and diversity.

In creative NLG tasks, systems generate artistic text outputs, such as stories or poetry, often using character-level RNNs to capture stylistic nuances. The Char-RNN framework exemplifies this by training on literary corpora like Shakespeare's works to produce novel verses, highlighting the potential of neural language models to mimic poetic structures through training on character sequences. Computational humor generation, particularly punchline prediction, employs probabilistic models to extend joke setups with unexpected resolutions, as seen in frameworks integrating surprise and ambiguity measures for pun creation. These methods underscore NLG's role in fostering originality, though outputs often require human refinement to align with cultural nuances.

Representative examples include meme generation, where multimodal models pair image templates with contextually humorous captions generated via transformer-based language models, as in systems trained on meme corpora to automate viral content creation. Interactive fiction leverages NLG for dynamic storytelling, with AI-driven engines generating branching narratives in response to user inputs, exemplified by platforms that use large language models to evolve plotlines in real time. A key challenge in these creative applications is ensuring novelty, as generative models tend to produce individually innovative text but reduce collective diversity by converging on similar patterns, potentially limiting broader artistic impact.

Multimodal NLG extends to integrating audio and speech inputs, particularly in accessibility tools that transcribe spoken content into readable text for the hearing impaired. Systems combining automatic speech recognition with NLG generate captions from live audio streams, improving accessibility in video conferencing and educational videos. Emerging trends highlight AI-human collaborations in artistic domains, such as using large language models for scriptwriting, where the model generates dialogue and plot outlines from prompts, facilitating co-creative processes in film and theater production. These integrations also extend to conversational elements in games, enhancing immersive narratives with generated responses.

Assessment and Challenges

Evaluation Metrics

Evaluating the quality of natural language generation (NLG) systems requires a combination of intrinsic and extrinsic metrics to assess aspects such as fluency, adequacy, and informativeness. Intrinsic metrics focus on the generated text in isolation, often comparing it to reference texts, while extrinsic metrics evaluate the text's effectiveness in achieving a specific task or goal, typically through user interaction or downstream performance. These approaches address the challenges of NLG evaluation, where traditional metrics from machine translation have been adapted but often fall short in capturing semantic nuances and contextual appropriateness.

Intrinsic metrics, such as BLEU, measure surface-level similarities between generated and reference texts using n-gram overlap. Introduced for machine translation evaluation, BLEU computes a score based on the precision of n-grams, modified by a brevity penalty to avoid favoring short outputs. The score is given by:

\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n \right)

where BP is the brevity penalty, p_n is the modified n-gram precision, w_n are weights (typically uniform), and N is the maximum n-gram order (often 4). Despite its widespread use, BLEU has limitations in capturing semantics, as it penalizes valid paraphrases and struggles with the diverse expressions common in NLG. Another intrinsic metric, ROUGE, is particularly suited for summarization tasks within NLG and emphasizes recall over precision by measuring overlap of n-grams, longest common subsequences, or skip-bigrams between generated and reference summaries. Variants like ROUGE-N (n-gram based) and ROUGE-L (sequence based) provide flexible assessments, though, like BLEU, they overlook deeper meaning and coherence. These metrics enable quick, automated comparisons but correlate poorly with human perceptions of quality in open-ended generation scenarios.

Extrinsic metrics assess NLG output through its practical impact, such as task success rates in applications like report generation, where user comprehension or accuracy is measured. For instance, in data-to-text systems, success might be quantified by how well generated reports inform user actions compared to human-written ones. Human judgments often complement these, using Likert scales to rate dimensions like fluency (grammaticality and naturalness) or adequacy (fidelity to the input data), providing nuanced insights but requiring careful guidelines to ensure reliability.

Advanced measures like BERTScore address the semantic shortcomings of n-gram-based metrics by leveraging contextual embeddings from pre-trained models such as BERT to compute token-level similarities via cosine distance. This yields precision, recall, and F1 scores that better align with human evaluations, especially for paraphrases and diverse phrasings in NLG tasks, though it remains computationally intensive. Benchmarks and shared tasks facilitate standardized evaluation, such as the Second Multilingual Surface Realisation Shared Task (SR'19), which assessed NLG systems across languages using automatic metrics alongside human assessments to promote multilingual robustness. These initiatives highlight the need for diverse datasets and metrics tailored to non-English generation.

Balancing human and automatic evaluation involves trade-offs: automatic metrics offer speed and reproducibility, while human judgments capture subjective qualities but introduce variability and cost. Crowdsourcing platforms like Amazon Mechanical Turk enable large-scale human evaluations by distributing annotation tasks to remote workers, often with quality controls such as qualification tests, though results must be validated against expert judgments to mitigate biases.
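A minimal single-reference BLEU implementation following the formula above is sketched below; it uses uniform weights and a small constant to guard against log(0), whereas production evaluations typically rely on corpus-level tools such as sacreBLEU with multiple references.

```python
import math
from collections import Counter

# Minimal sentence-level BLEU with one reference: clipped n-gram precisions,
# uniform weights, and the brevity penalty from the formula above.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # 1e-9 guards against log(0) when no n-gram matches (crude smoothing).
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(round(bleu("it rained in the city yesterday",
                 "it rained in the city yesterday"), 3))  # 1.0
```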

Key Limitations and Future Directions

Neural natural language generation (NLG) models frequently produce hallucinations, generating fluent but factually incorrect or unsubstantiated content due to inconsistencies in training data or inadequate decoding strategies. This issue is particularly pronounced in abstractive tasks like summarization, where models invent details not present in the input, undermining reliability in applications such as journalism or healthcare. Additionally, bias amplification occurs when models perpetuate and exacerbate societal stereotypes from training data, such as gender biases in occupational descriptions, as demonstrated in analyses of word embeddings that influence generated text.

Ethical concerns in NLG arise from the potential for misinformation propagation through hallucinated outputs, which can mislead users in high-stakes domains like legal or news reporting, and from gaps in controllability, where large language models (LLMs) struggle to adhere to user-specified constraints without veering into harmful content. Scalability challenges further compound these issues, as the high computational costs of training and deploying large models limit accessibility, restrict output length, and hinder real-time adaptation to new domains or data. Domain adaptation remains difficult, often requiring extensive retraining that exacerbates resource demands for non-English or specialized contexts. These limitations also highlight inadequacies in current evaluation metrics, which struggle to detect subtle hallucinations or biases comprehensively.

Future directions in NLG emphasize developing interpretable systems through explainable AI techniques, such as leveraging LLMs to generate human-readable rationales for outputs, to enhance trust and debugging in complex models. Integration with robotics for embodied communication represents another promising avenue, enabling robots to produce context-aware responses grounded in physical interactions and perception. Personalized generation, which tailors outputs to individual user profiles and contexts, is gaining traction to improve engagement in recommendation and dialogue systems. Research gaps persist in low-resource languages, where scarce datasets impede effective NLG development, and in real-time ethical filtering mechanisms to dynamically mitigate biases or misinformation during generation. Addressing these could involve hybrid approaches combining external knowledge bases with efficient, lightweight models to broaden NLG's applicability.
