
Natural language generation

Natural language generation (NLG) is the subfield of artificial intelligence and computational linguistics focused on the automatic production of human-readable text from structured data or other non-linguistic inputs to achieve specific communicative objectives. The process transforms abstract representations, such as databases, knowledge graphs, or semantic structures, into coherent, fluent, and contextually appropriate output. Unlike natural language understanding, which interprets text, NLG emphasizes the deliberate construction of text to meet goals like informing, persuading, or entertaining. The core architecture of NLG systems typically follows a pipeline comprising content planning (selecting relevant information), discourse structuring (organizing it logically), sentence planning (choosing words and aggregating messages), and surface realization (ensuring grammaticality and fluency).

Early NLG efforts in the 1970s and 1980s relied on rule-based and template-filling methods for simple tasks, such as generating reports or database summaries, but these were limited in flexibility and scalability. By the 1990s, more sophisticated systems emerged, incorporating knowledge representation techniques to handle complex domains like medical reporting. In the 2000s and 2010s, the advent of statistical and neural approaches revolutionized NLG, enabling end-to-end models that learn directly from data-to-text pairs without explicit modular stages. Transformer-based large language models, such as the GPT series (including the GPT-5 series released in 2025), have further advanced the field by producing diverse, creative text for applications including dialogue systems and content summarization. These neural methods excel at open-ended generation but introduce challenges like factual inaccuracies (hallucinations) and the need for controllability.

As of 2025, NLG applications span numerous domains, from personalized product descriptions derived from product ontologies to accessible explanations of data for non-experts. In task-oriented dialogue systems, NLG integrates with dialogue management to generate responses that align with user intents and system policies. Evaluation metrics for NLG emphasize fluency, adequacy, and informativeness, often combining automated scores such as BLEU with human judgments. Ongoing research addresses ethical concerns, such as bias mitigation and the trustworthiness of generated text, particularly in high-stakes areas like legal or healthcare communication.

Fundamentals

Definition and Scope

Natural language generation (NLG) is the subfield of artificial intelligence and computational linguistics concerned with the construction of computer systems that produce understandable texts in human languages from underlying non-linguistic representations of information, such as databases, knowledge bases, or other structured inputs. This process involves deliberately constructing natural language text to meet specified communicative goals, transforming data into coherent, fluent output that mimics human-like expression. In scope, NLG focuses on output generation rather than input parsing, distinguishing it from natural language understanding (NLU), which maps text to meaning representations. While NLU interprets unstructured language, NLG inverts this by generating text from semantic or structured sources, encompassing sub-tasks such as text summarization from documents and dialogue response creation in conversational systems. Within the broader natural language processing (NLP) pipeline, which encompasses both understanding and generation, NLG specifically handles the production phase. Key concepts in NLG include the mapping of diverse input types—ranging from numerical data to semantic representations—into varied output forms like reports, captions, or descriptive narratives. For instance, inputs from knowledge bases might yield explanatory texts, emphasizing the need for fluency and appropriateness in the generated language. As a core component of human-AI interaction, NLG enables machines to communicate effectively in natural language, bridging the gap between computational systems and human users by producing readable and informative text from abstract representations. This capability supports applications in AI-driven interfaces, where generated language enhances the accessibility and interpretability of machine outputs.

Historical Development

The origins of natural language generation (NLG), the subfield of artificial intelligence focused on producing coherent and contextually appropriate text from non-linguistic inputs, trace back to the mid-20th century amid broader advances in linguistics and computing. Foundational theoretical work in the 1950s and 1960s, particularly Noam Chomsky's introduction of generative grammar in Syntactic Structures (1957), emphasized hierarchical structures and transformational rules for language production, influencing early computational efforts to model text generation as a systematic process akin to human language production. By the 1970s, initial experiments in rule-based systems emerged, building on these linguistic theories to generate simple sentences from logical representations, though limited by computational constraints and a lack of empirical data.

The classical era of NLG in the 1980s and 1990s shifted toward structured pipeline architectures, emphasizing modular processes for content planning, sentence structuring, and surface realization. David D. McDonald's 1982 work on salience in selection mechanisms highlighted how prioritizing key information could guide text construction in rule-based generators, addressing challenges in choosing what to express from complex inputs. Concurrently, the PENMAN project, developed by William C. Mann at the Information Sciences Institute, introduced a comprehensive text generation system that integrated knowledge representation with rhetorical planning, enabling the production of multi-sentence discourses. A pivotal contribution was Rhetorical Structure Theory (RST), formalized by Mann and Thompson in 1988, which modeled text coherence through hierarchical relations between spans (e.g., elaboration, contrast), providing a framework for organizing generated content to mimic human argumentation and narrative flow. These template- and rule-driven approaches dominated, focusing on domain-specific applications like weather reports, but struggled with scalability and flexibility.

The 2000s marked a transition to data-driven paradigms, incorporating statistical methods to handle variability in language output. Irene Langkilde's forest-based generation system (2000) represented a breakthrough by combining symbolic input representations with statistical optimization over vast realization forests, drawing on techniques from statistical language modeling to select fluent sentences probabilistically rather than exhaustively via rules. This integration allowed NLG to leverage parallel corpora and n-gram models, improving robustness in noisy or ambiguous scenarios, and paved the way for systems that balanced interpretability with empirical performance. Key events during this period included the establishment of the International Natural Language Generation Conference (INLG), with workshops dating back to 1983 and formal conferences beginning around 2000, fostering collaboration on benchmarks and evaluation metrics.

From the 2010s onward, the advent of deep learning revolutionized NLG, enabling end-to-end models that bypassed traditional pipelines. The Transformer architecture, introduced by Vaswani et al. in 2017, used self-attention mechanisms to capture long-range dependencies in sequences, dramatically enhancing generation quality and efficiency for tasks like summarization and dialogue. Subsequent models like OpenAI's GPT series, starting with GPT in 2018, scaled unsupervised pretraining on massive corpora to produce diverse, context-aware text, while Google's T5 (Raffel et al., 2020) unified NLG tasks under a text-to-text framework, achieving state-of-the-art results through transfer learning on diverse datasets.
Notable advancements since then include OpenAI's GPT-3 (2020) and GPT-4 (2023), which demonstrated unprecedented scale in parameter size and performance, alongside models like Google's PaLM (2022) and Meta's LLaMA series, enhancing NLG's versatility and integration with multimodal tasks. The confluence of large datasets, neural networks, and increased computational power has since driven NLG toward more scalable, general-purpose systems, with ongoing INLG conferences highlighting practical impacts and ethical considerations.

Methodologies

Classical Pipeline Approaches

Classical pipeline approaches in natural language generation (NLG) rely on a modular, sequential architecture that decomposes the generation process into distinct stages, transforming non-linguistic input data—such as databases or semantic representations—into coherent human-readable text. The pipeline typically consists of three primary phases: content planning, which determines the relevant information to include; sentence planning (or microplanning), which organizes that information into logical structures; and surface realization, which applies linguistic rules to produce grammatical output. Unlike end-to-end neural models, these pipelines offer high interpretability and fine-grained control, allowing developers to intervene at specific stages for debugging or error correction, though they require extensive manual engineering.

The key components form a structured sequence in which each stage builds on the previous one to ensure systematic text production. In content planning, rules or schemas select and organize messages from input data, often drawing on domain-specific knowledge bases to decide what facts to convey and in what order, such as prioritizing critical events in a report. Sentence planning then aggregates related messages, performs lexical choice to select appropriate words, and generates referring expressions to maintain coherence. Finally, surface realization linearizes this abstract structure into surface forms using syntactic and morphological rules, ensuring fluency and correctness. This decomposition, rooted in early NLG theory, enables targeted development but demands integration across modules to avoid inconsistencies.

Rule-based methods dominate these pipelines, employing templates for simple slot-filling, formal grammars for syntactic construction, and knowledge bases for semantic guidance. Templates provide predefined patterns with placeholders for data, offering efficiency in controlled domains but limited variability. More sophisticated approaches use unification-based grammars, which merge feature structures to resolve choices like lexical selection through argumentation over rhetorical relations. A seminal example is FUF (Functional Unification Formalism), an early system that implements unification grammars to control lexical choice and generate varied realizations from abstract inputs, emphasizing declarative rules over procedural coding. These methods leverage hand-crafted resources, such as systemic grammars or meaning-text theory, to encode linguistic knowledge explicitly.

Despite their strengths, classical pipelines exhibit notable limitations, including rigidity in handling novel inputs or ambiguities, as rules cannot easily generalize beyond encoded scenarios. Developing and maintaining these systems is labor-intensive, requiring expert knowledge engineering for grammars, lexicons, and domain rules, which scales poorly to new applications. They dominated NLG research and deployment until the early 2000s, when statistical and machine learning techniques began offering greater flexibility.

A representative example is the FoG (Forecast Generator) system, which produces textual forecasts from meteorological data using a domain-specific sublanguage. FoG employs rule-based content planning to select key weather events, sentence planning for aggregation and phrasing choices (e.g., using hedged wording for uncertain predictions), and surface realization via templates and simple grammars to generate readable bulletins. Deployed for Canadian weather services, it demonstrated the practicality of pipelines in operational settings, producing forecasts in English and French while highlighting the need for corpus-informed rules to ensure naturalness. Modern alternatives, such as neural decoders, have since reduced the reliance on such hand-engineered systems for broader applicability.
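The staged design can be made concrete with a short sketch. The following Python example is a minimal illustration only: the field names, thresholds, and templates are hypothetical and are not taken from FoG or any deployed system.

```python
# Minimal sketch of a classical three-stage NLG pipeline for weather bulletins.
# Field names, thresholds, and templates are illustrative, not from a real system.

def content_planning(observations):
    """Content determination: select the facts worth reporting."""
    facts = []
    if observations["wind_kmh"] >= 40:
        facts.append(("wind", observations["wind_kmh"]))
    if observations["rain_mm"] > 0:
        facts.append(("rain", observations["rain_mm"]))
    return facts

def sentence_planning(facts):
    """Microplanning: group the selected facts into a message specification."""
    return {"messages": facts}

def surface_realization(plan):
    """Surface realization: fill slot-based templates to produce the final text."""
    templates = {
        "wind": "Winds will reach {0} km/h.",
        "rain": "Expect around {0} mm of rain.",
    }
    sentences = [templates[kind].format(value) for kind, value in plan["messages"]]
    return " ".join(sentences)

observations = {"wind_kmh": 55, "rain_mm": 12}
bulletin = surface_realization(sentence_planning(content_planning(observations)))
print(bulletin)  # Winds will reach 55 km/h. Expect around 12 mm of rain.
```

Even in this toy form, the separation of stages mirrors the classical pipeline: the selection rules can be changed without touching the templates, and vice versa.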

Machine Learning-Based Techniques

Machine learning-based techniques in natural language generation (NLG) represent a shift from rule-based systems to data-driven approaches that learn patterns directly from annotated corpora, enabling more flexible and scalable text production. Early statistical methods laid the foundation by employing probabilistic models to capture linguistic regularities. N-gram-based generation, for instance, models the probability of word sequences using Markov assumptions, where the likelihood of a word depends on the preceding n-1 words, facilitating simple yet effective text completion in NLG tasks. These models were often combined with maximum entropy frameworks, which optimize feature-based probabilities without assuming feature independence, as demonstrated in trainable systems for surface realization that generate text from semantic representations using annotated corpora.

A pivotal advancement came with neural architectures, particularly sequence-to-sequence (seq2seq) models, which use encoder-decoder frameworks to map input sequences—such as structured data or meaning representations—to output text. Introduced using long short-term memory (LSTM) networks, these models encode the input into a fixed-dimensional vector and decode it autoregressively, achieving strong performance in tasks like machine translation that parallel NLG applications. To address limitations in handling long-range dependencies, attention mechanisms were integrated, allowing the decoder to dynamically focus on relevant input parts during generation; this culminated in the Transformer architecture, which relies entirely on self-attention layers to process sequences in parallel, revolutionizing NLG by improving coherence and efficiency in producing fluent text from diverse inputs.

End-to-end learning extends these neural approaches by directly mapping structured inputs, like database records or RDF triples, to natural language outputs without intermediate symbolic stages, trained via maximum likelihood estimation. The core objective in such frameworks is to minimize the negative log-likelihood loss:

L = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, x)

where x denotes the input representation, y_{<t} the partial output up to timestep t, and T the output length, enabling models to learn holistic mappings from data. This has been applied effectively in data-to-text generation, producing descriptive text from tabular or graph-structured information. Key datasets supporting these methods include WebNLG, which provides RDF triple sets paired with verbalizations for training RDF-to-text systems across multiple languages, and the E2E dataset, comprising dialogue acts in the restaurant domain mapped to natural language references, designed to evaluate end-to-end NLG in spoken dialogue systems.

More recent developments leverage pre-trained large language models (LLMs) for NLG by fine-tuning them on task-specific data, enhancing generation quality through knowledge transferred from vast unlabeled corpora. Models like GPT, pre-trained generatively on next-token prediction, excel in open-ended text production and can incorporate controllability through structured prompts that guide output toward desired attributes, such as style or factual accuracy. Similarly, BERT's bidirectional pre-training on masked language modeling allows its encoder representations to inform conditional generation tasks, though encoder-decoder adaptations extend this design to full NLG. These techniques have demonstrated superior fluency and diversity in applications ranging from summarization to personalized content generation, often outperforming earlier neural baselines on automatic metrics such as BLEU in controlled evaluations.
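As a concrete illustration of the objective above, the sketch below computes the token-level negative log-likelihood of a single output sequence under a toy conditional distribution; the vocabulary and probabilities are invented, and a lookup table stands in for an encoder-decoder network.

```python
import math

# Toy conditional model P(y_t | y_<t, x): a lookup table keyed by the previous
# token stands in for a trained encoder-decoder. All probabilities are invented.
cond_prob = {
    ("<s>",): {"it": 0.6, "rain": 0.4},
    ("it",): {"rained": 0.7, "rains": 0.3},
    ("rained",): {"</s>": 0.9, "today": 0.1},
}

def negative_log_likelihood(target_tokens):
    """L = -sum_t log P(y_t | y_<t, x) for a single training example."""
    loss, prev = 0.0, ("<s>",)
    for token in target_tokens:
        loss -= math.log(cond_prob[prev][token])
        prev = (token,)
    return loss

print(negative_log_likelihood(["it", "rained", "</s>"]))
# -(ln 0.6 + ln 0.7 + ln 0.9), roughly 0.973
```

Training a neural generator amounts to minimizing this quantity, summed over the corpus, with respect to the model parameters that define the conditional distribution.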

Hybrid and Emerging Methods

Hybrid systems in natural language generation integrate rule-based planning with neural realization to leverage the strengths of both paradigms, enabling structured content selection while producing fluent outputs. For instance, the Step-by-Step approach of Moryossef et al. (2019) separates the process into a symbolic planning stage that ensures fidelity to input data and a neural generation stage for linguistic realization, improving control over output structure without sacrificing naturalness. This approach balances the interpretability and precision of classical methods with the flexibility of neural networks, particularly in data-to-text tasks where adherence to source information is crucial.

Controllable natural language generation techniques allow for targeted attribute control during text production, addressing limitations in unconstrained neural models. Plug-and-Play Language Models (PPLM) achieve this by steering pretrained language models using lightweight attribute classifiers that manipulate activation patterns without fine-tuning the base model, enabling attributes like sentiment or topic to be controlled dynamically. Reinforcement learning methods further enhance fidelity by optimizing generation for specific constraints, such as factual accuracy in summaries, through reward signals derived from external verifiers.

Multimodal natural language generation extends text production to incorporate non-textual inputs like images or videos, fostering richer interactions. Vision-language models such as CLIP facilitate this by aligning visual and textual representations, allowing generators to produce descriptive captions or narratives grounded in visual content through integrated encoding-decoding pipelines. Recent advancements as of 2025 include natively multimodal large language models like Llama 4 variants, which process text and images for cross-modal generation in applications such as visual question answering. These systems improve coherence between modalities, as seen in applications where image features guide narrative flow, reducing mismatches in generated descriptions.

Neuro-symbolic methods that merge neural networks with symbolic logic have become established approaches to enhance reasoning and interpretability in NLG. These integrate logical rules into neural architectures so that symbolic inference ensures consistency while neural components handle linguistic variability, as surveyed in recent frameworks from 2024. Ethical considerations, particularly bias mitigation, are integral to these developments; techniques such as data augmentation and counterfactual fairness interventions counteract gender or racial biases in generated text by balancing training distributions and evaluating outputs against fairness metrics.

Addressing scalability issues in large models for NLG involves strategies to curb hallucinations, where models produce unverifiable content. Retrieval-augmented generation (RAG) mitigates this by conditioning outputs on retrieved external knowledge, improving factual accuracy in knowledge-intensive tasks by up to 20-30% on benchmarks like open-domain question answering without expanding model parameters. Recent hybrid RAG systems as of 2025 further refine this by combining dense and sparse retrieval for enhanced performance in dynamic environments. This method supports efficient scaling by offloading memory to non-parametric stores, enabling reliable generation in resource-constrained environments.
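The retrieve-then-generate pattern behind RAG can be sketched in a few lines; the corpus, the word-overlap scoring, and the generate() stub below are placeholders standing in for a real dense retriever and language model, not a specific library API.

```python
# Conceptual retrieval-augmented generation (RAG) loop. The corpus, retrieval
# scoring, and generate() stub are placeholders for a real retriever and LLM.

corpus = [
    "NLG produces text from structured data.",
    "The Transformer architecture was introduced in 2017.",
    "RAG conditions generation on retrieved passages.",
]

def retrieve(query, k=2):
    """Rank passages by naive word overlap (stand-in for dense retrieval)."""
    query_words = set(query.lower().split())
    def overlap(passage):
        return len(query_words & set(passage.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def generate(prompt):
    """Placeholder for a language model call; here it just echoes the prompt."""
    return f"[model output conditioned on]\n{prompt}"

question = "When was the Transformer architecture introduced?"
context = "\n".join(retrieve(question))
print(generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"))
```

The key design point is that the generator only ever sees the retrieved passages alongside the question, so factual grounding can be updated by changing the external store rather than retraining the model.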

Core Processes

Content Determination

Content determination is the initial phase in the natural language generation (NLG) pipeline, where the system transforms raw input data—such as database records or sensor outputs—into a set of communicative goals by selecting, aggregating, and prioritizing relevant information for expression in text. This process ensures that the generated output focuses on key facts while avoiding redundancy, aligning the content with the intended communicative purpose, such as informing or persuading the audience. Aggregation involves grouping related data points to improve conciseness and reduce repetition.

Techniques for content determination include schema-based selection, which uses predefined templates to identify pertinent data elements based on domain-specific criteria, and Rhetorical Structure Theory (RST), which organizes selected content into a hierarchical discourse structure to guide overall text coherence. RST, introduced by Mann and Thompson, defines relations between text spans (e.g., elaboration or contrast) to prioritize information that supports the primary communicative intent. Content planning algorithms often employ rule-based systems to evaluate input against goals, such as including only statistically significant trends in a report.

A representative example occurs in automated report generation from sensor data, where the system selects key statistics—such as mean temperature and peak wind speed from hourly readings—while omitting redundant entries, applying aggregation rules to summarize numerical values into concise descriptors such as "The weather was mild, with gusty winds." This selection ensures the text remains focused and readable without overwhelming the reader with raw details.

Unique challenges in content determination arise when handling incomplete or conflicting sources, such as missing values in a dataset or contradictory records from multiple sensors, which can lead to biased or inaccurate selections if not resolved through imputation or heuristics. Systems must incorporate validation steps to detect and mitigate these issues, ensuring robust content choices. This phase applies to various input types, ranging from structured data like relational tables, where selection involves querying specific rows and columns, to semi-structured formats such as knowledge graphs, where traversal algorithms identify relevant nodes and edges for inclusion. The determined content then informs subsequent microplanning, where rhetorical relations and ordering are refined for textual expression.
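The selection-and-aggregation step can be expressed as simple rules over a data stream, as in the sketch below; the readings, thresholds, and missing-value strategy are hypothetical and chosen purely for illustration.

```python
# Illustrative content determination over hourly temperature readings
# (None marks a missing value). Thresholds are invented for the example.

readings = [12.0, 13.5, None, 14.2, 15.0, None, 16.8, 16.5]

def determine_content(values, trend_threshold=2.0):
    """Select salient facts and aggregate them instead of reporting every hour."""
    clean = [v for v in values if v is not None]   # naive handling of missing data
    facts = {"min": min(clean), "max": max(clean)}
    # Aggregate the hour-by-hour change into a single trend fact when it is large.
    if clean[-1] - clean[0] >= trend_threshold:
        facts["trend"] = "rising"
    return facts

print(determine_content(readings))
# {'min': 12.0, 'max': 16.8, 'trend': 'rising'}
```

The resulting fact set, rather than the raw readings, is what gets passed on to microplanning and realization.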

Microplanning

Microplanning is the intermediate stage in natural language generation (NLG) pipelines where selected content from the document planning phase is organized into coherent textual units, focusing on decisions that ensure logical flow and linguistic appropriateness before surface realization. This process transforms abstract representations, such as propositions or facts, into structured specifications for realization, addressing how information is packaged to achieve communicative goals. According to the classic framework outlined by Reiter and Dale, microplanning bridges high-level content selection and low-level syntactic formation by handling choices that impact readability and coherence.

Core tasks in microplanning include structuring sentences through discourse relations, lexical choice, and aggregation of clauses. Discourse relations, such as elaboration or contrast, are often modeled using Rhetorical Structure Theory (RST), which organizes text into hierarchical trees where relations link spans to convey intentions like explanation or justification. For instance, in explanatory texts, a cause-effect relation might connect a precipitating event to its outcome, ensuring the generated paragraph flows logically from bullet-point facts like "The patient experienced low oxygen levels" to "This led to respiratory distress." Lexical choice involves selecting words or phrases that best convey meaning while considering context, such as choosing "decline" over "drop" for medical reports to match register, guided by resources like VerbNet for semantic compatibility. Aggregation merges related clauses to avoid repetition, for example, combining multiple similar events into a single sentence like "The patient had three successive bradycardias down to 69 bpm" instead of separate statements, using rule-based heuristics or statistical methods.

Referring expressions are generated during microplanning to maintain coherence, resolving anaphora through theories like centering theory, which tracks salience across utterances to decide between pronouns and full descriptions. For example, in a sequence describing events, a highly salient entity (e.g., "the patient") might be referred to with a pronoun in subsequent sentences if it remains the focus, following principles of local coherence. An incremental algorithm prioritizes attributes in descriptions, such as type before color, to generate concise yet informative references like "the red car" only when necessary. Formalisms such as discourse schemas support these decisions by providing patterns for rhetorical relations in RST-based generation, ensuring relations are realized appropriately in text spans.

Linguistic features such as tense, aspect, and modality are selected based on contextual cues during microplanning to align with the intended temporal or evidential stance. For instance, past tense and perfective aspect might be chosen for completed events in reports, as in "The treatment had been administered," while modal verbs like "may" introduce uncertainty for hypothetical outcomes. These choices are encoded in semantic representations passed to realization, drawing from input specifications like event types and arguments.
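Two of these decisions, clause aggregation and referring-expression choice, can be illustrated with a short sketch; the message format, counting rule, and salience test are simplified assumptions rather than a faithful model of any published system.

```python
# Sketch of two microplanning decisions: aggregating repeated events and choosing
# a referring expression based on local salience. Rules are deliberately simplistic.

messages = [
    {"entity": "the patient", "event": "had a bradycardia at 08:10"},
    {"entity": "the patient", "event": "had a bradycardia at 08:25"},
    {"entity": "the patient", "event": "had a bradycardia at 08:40"},
]

def aggregate(msgs):
    """Merge repeated events about the same entity into one counted clause."""
    entity = msgs[0]["entity"]
    if all(m["entity"] == entity and "bradycardia" in m["event"] for m in msgs):
        return [{"entity": entity, "event": f"had {len(msgs)} successive bradycardias"}]
    return msgs

def refer(entity, previous_focus):
    """Pronominalize when the entity is already the local focus (gender assumed)."""
    return "he" if entity == previous_focus else entity

plan = aggregate(messages) + [{"entity": "the patient",
                               "event": "was given supplemental oxygen"}]
focus, sentences = None, []
for m in plan:
    subject = refer(m["entity"], focus)
    sentences.append(f"{subject[0].upper()}{subject[1:]} {m['event']}.")
    focus = m["entity"]
print(" ".join(sentences))
# The patient had 3 successive bradycardias. He was given supplemental oxygen.
```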

Realization and Generation

Realization and generation, often termed surface realization, constitutes the concluding phase of the natural language generation (NLG) pipeline, transforming abstract representations from microplanning—such as conceptual structures or deep syntactic forms—into coherent, grammatical text. This process ensures that the output adheres to linguistic rules, producing sentences that are syntactically correct and morphologically appropriate for the target language.

Syntactic realization maps logical forms to surface structures by selecting and ordering words within grammatical frameworks. Early systems leveraged Generalized Phrase Structure Grammar (GPSG), a context-free formalism that supports efficient generation through feature percolation and unification, enabling the production of varied syntactic variants from a single input representation. Tree-adjoining grammars (TAG) offer an alternative, using elementary trees as building blocks that can be combined via substitution and adjunction to handle dependencies like relative clauses, providing precise control over sentence complexity in NLG. Comprehensive implementations, such as the SURGE system, integrate systemic-grammar principles with functional unification to realize deep-syntactic inputs into full English sentences, demonstrating reusability across diverse NLG applications.

Morphological generation addresses word-level adjustments, inflecting lemmas according to syntactic features like tense, number, and case to form complete lexical items. For example, it conjugates verbs (e.g., "rain" to "rained" for the past tense) and pluralizes nouns based on orthographic rules and exceptions. Robust finite-state implementations achieve high accuracy by prioritizing rules for irregularities, such as deriving "stimuli" from "stimulus+s_N" while handling over 1,100 exceptional lemmata for consonant doubling and other patterns. A simple illustration of the process transforms an abstract input like "event: rain, location: city, time: yesterday" into the sentence "It rained in the city yesterday," where syntactic frames embed the event, adjuncts specify location and time, and morphological rules apply the past-tense inflection.

Key algorithms for syntactic realization include chart-based methods, which use bottom-up dynamic programming to parse and assemble structures from lexical entries, as adapted for Combinatory Categorial Grammar (CCG) to cover logical forms with bit-vector tracking for efficiency. Optimization techniques employ integer linear programming to jointly select lexical choices and structures, minimizing length or maximizing compactness while enforcing grammatical constraints, often integrating with content selection for improved output density. Output polishing refines the generated text by applying orthographic rules for capitalization, punctuation, and spacing, alongside basic checks for agreement, ensuring the final product reads naturally without altering core semantics.
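The rule-plus-exception organization typical of morphological generators can be sketched as follows; the exception lists and suffix rules are a small illustrative subset, not drawn from any particular finite-state lexicon.

```python
# Minimal morphological generation: regular suffix rules backed by an exception
# lexicon, in the spirit of the finite-state generators described above.
# The word lists are illustrative only.

PLURAL_EXCEPTIONS = {"stimulus": "stimuli", "child": "children", "sheep": "sheep"}
PAST_EXCEPTIONS = {"go": "went", "run": "ran", "be": "was"}

def pluralize(noun):
    """Inflect a noun lemma for plural number (exceptions take priority)."""
    if noun in PLURAL_EXCEPTIONS:
        return PLURAL_EXCEPTIONS[noun]
    if noun.endswith(("s", "x", "ch", "sh")):
        return noun + "es"
    return noun + "s"

def past_tense(verb):
    """Inflect a verb lemma for past tense (exceptions take priority)."""
    if verb in PAST_EXCEPTIONS:
        return PAST_EXCEPTIONS[verb]
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"

print(pluralize("stimulus"), pluralize("report"))  # stimuli reports
print(past_tense("rain"), past_tense("go"))        # rained went
```

Real systems apply the same priority ordering, consulting the exception lexicon first and falling back to orthographic rules, but with far larger rule sets and full feature bundles (tense, number, case, agreement).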

Applications

Data-to-Text Systems

Data-to-text systems in natural language generation (NLG) focus on transforming structured data, such as tables or time series, into coherent, human-readable text. These systems are particularly valuable in domains requiring regular reporting from quantitative inputs, where manual writing is time-intensive or error-prone. Early examples include the SUMTIME system, which generates textual forecasts from numerical meteorological data for marine weather reports, demonstrating how rule-based pipelines can produce reliable summaries tailored to specific user needs like safety-critical operations. Similarly, in finance, data-to-text approaches automate summaries of stock market data, converting tabular records of prices, volumes, and trends into overviews that highlight key movements and implications for investors.

Adapting classical NLG pipelines for data-to-text involves customizing stages like content determination and microplanning to handle tabular or relational inputs. For instance, meaning representation languages such as Abstract Meaning Representation (AMR) facilitate the mapping of structured data to semantic graphs, enabling systematic selection and aggregation of relevant facts while preserving logical relationships. This adaptation ensures that generated text adheres to domain-specific conventions, such as emphasizing temporal sequences in weather data or causal inferences in financial trends.

Case studies illustrate practical impacts, including systems for generating product descriptions from attribute-value pairs, which support scalable content creation for e-commerce catalogs. Such systems also enhance accessibility, particularly for visually impaired users, by converting geo-referenced or tabular data into narratives readable aloud via screen readers, as explored in projects linking map data to descriptive text. Modern enhancements leverage neural architectures for more fluent and context-aware generation. For example, end-to-end models like DataTuner employ sequence-to-sequence approaches to process structured inputs, improving fluency and factual alignment in outputs compared to traditional methods. Sports commentary has also grown as an application area, with systems trained on the SportSett:Basketball dataset producing NBA game recaps from play-by-play statistics, capturing highlights and narratives with fidelity to event data.

Conversational and Interactive Uses

Natural language generation (NLG) plays a pivotal role in conversational and interactive systems, enabling the production of human-like responses in real-time dialogue. These systems integrate NLG with natural language understanding (NLU) to form end-to-end pipelines that interpret user intents and generate coherent outputs, evolving from modular architectures to unified neural models that reduce error propagation. Early examples include rule-based chatbots like A.L.I.C.E., developed by Richard Wallace in 1995, which used pattern matching via the Artificial Intelligence Markup Language (AIML) to generate responses without deep contextual reasoning. This marked a foundational shift toward interactive NLG, though limited to scripted interactions.

Advancements in neural architectures have transformed conversational NLG, with systems like BlenderBot, released by Facebook AI Research in 2020, employing large-scale transformer-based models to produce open-domain responses that maintain fluency and relevance across turns. Techniques such as response generation from dialogue acts—abstract representations of communicative intentions—allow NLG modules to convert structured plans into natural utterances, often integrated with dialogue management frameworks like Partially Observable Markov Decision Processes (POMDPs) for tracking hidden user states and context. POMDPs enable probabilistic belief updates over dialogue history, facilitating adaptive generation in uncertain environments.

In practical applications, NLG powers task-oriented personal assistants like Apple's Siri and Amazon's Alexa, which generate responses to fulfill user goals such as scheduling or information retrieval by verbalizing dialogue states and actions. Customer service chatbots in banking domains similarly leverage NLG to produce personalized, context-aware replies, drawing on non-linguistic data like transaction histories to enhance response relevance in multi-turn interactions.

Key challenges in conversational NLG include maintaining coherent context across extended dialogues, where models must resolve coreferences and track evolving states to avoid repetition or drift. Handling ambiguity in user inputs—such as vague intents or polysemous queries—further complicates generation, often requiring clarification strategies to elicit precise information without disrupting flow. Recent advancements distinguish between retrieval-based approaches, which select pre-defined responses from a corpus for consistency and speed, and generative methods, which synthesize novel outputs for flexibility but risk hallucinations. Datasets like MultiWOZ, introduced by Budzianowski et al. in 2018, have driven progress by providing multi-domain, annotated dialogues for training end-to-end systems that simulate real-world task-oriented interactions.
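How an NLG module might turn a dialogue act into an utterance can be sketched with a few templates; the act types, slot names, and wording below are hypothetical and only loosely follow the restaurant domain used in datasets such as E2E and MultiWOZ.

```python
# Sketch of template-based response generation from a dialogue act, as an NLG
# module in a task-oriented system might do. Act types, slots, and templates
# are invented for illustration.

def realize(dialogue_act):
    act_type, slots = dialogue_act
    if act_type == "inform":
        return (f"{slots['name']} is a {slots['food']} restaurant "
                f"in the {slots['area']} of town.")
    if act_type == "request":
        return f"What {slots['slot']} are you looking for?"
    return "Sorry, could you rephrase that?"

act = ("inform", {"name": "La Tasca", "food": "Spanish", "area": "centre"})
print(realize(act))
# La Tasca is a Spanish restaurant in the centre of town.
```

A neural NLG module plays the same role but replaces the hand-written templates with a model conditioned on the dialogue act, trading predictability for variety.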

Multimedia and Creative Generation

Natural language generation (NLG) in multimedia contexts involves producing textual descriptions from non-textual inputs such as images and videos, enabling applications like automated captioning for accessibility and indexing. A seminal approach is the "Show and Tell" model, which combines a convolutional neural network (CNN) to encode visual features with a recurrent neural network (RNN) to decode them into coherent sentences, achieving state-of-the-art performance on benchmarks at the time. This encoder-decoder architecture has influenced subsequent multimodal NLG systems by demonstrating how visual embeddings can guide sequence generation. Datasets like MS COCO, released in 2014, have been pivotal for training such models, providing over 120,000 images paired with multiple human-annotated captions to support evaluation of descriptive accuracy and diversity.

In creative NLG tasks, systems generate artistic text outputs, such as stories or poetry, often using character-level RNNs to capture stylistic nuances. The Char-RNN framework exemplifies this by training on literary corpora like Shakespeare's works to produce novel verses, highlighting the potential of neural language models to mimic poetic structures through training on character sequences. Computational humor generation, particularly punchline prediction, employs probabilistic models to extend joke setups with unexpected resolutions, as seen in frameworks integrating surprise and ambiguity measures for pun creation. These methods underscore NLG's role in fostering originality, though outputs often require human refinement to align with cultural nuances.

Representative examples include meme generation, where multimodal models pair image templates with contextually humorous captions generated via transformer-based language models, as in systems trained on meme corpora to automate viral content creation. Interactive fiction leverages NLG for dynamic storytelling, with AI-driven engines generating branching narratives in response to user inputs, exemplified by platforms that use large language models to evolve plotlines in real time. A key challenge in these creative applications is ensuring novelty, as generative models tend to produce individually innovative text but reduce collective diversity by converging on similar patterns, potentially limiting broader artistic impact.

Multimodal NLG extends to integrating audio and speech inputs, particularly in accessibility tools that transcribe spoken content into readable text for the hearing impaired. Systems combining automatic speech recognition with NLG generate captions from live audio streams, improving accessibility in video conferencing and educational videos. Emerging trends highlight AI-human collaborations in artistic domains, such as using large language models for scriptwriting, where the model generates dialogue and plot outlines from prompts, facilitating co-creative processes in film and theater production. These integrations also extend to conversational elements in games, enhancing immersive narratives with generated responses.

Assessment and Challenges

Evaluation Metrics

Evaluating the quality of natural language generation (NLG) systems requires a combination of intrinsic and extrinsic metrics to assess aspects such as fluency, adequacy, and informativeness. Intrinsic metrics focus on the generated text in isolation, often comparing it to reference texts, while extrinsic metrics evaluate the text's effectiveness in achieving a specific task or goal, typically through user interaction or downstream performance. These approaches address the challenges of NLG evaluation, where traditional metrics from machine translation have been adapted but often fall short in capturing semantic nuances and contextual appropriateness.

Intrinsic metrics, such as BLEU, measure surface-level similarities between generated and reference texts using n-gram overlap. Introduced for machine translation evaluation, BLEU computes a score based on the precision of n-grams, modified by a brevity penalty to avoid favoring short outputs. The score is given by:

\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n \right)

where BP is the brevity penalty, p_n is the modified n-gram precision, w_n are weights (typically uniform), and N is the maximum n-gram order (often 4). Despite its widespread use, BLEU has limitations in capturing semantics, as it penalizes valid paraphrases and struggles with the diverse expressions common in NLG. Another intrinsic metric, ROUGE, is particularly suited for summarization tasks within NLG and emphasizes recall over precision by measuring overlap of n-grams, longest common subsequences, or skip-bigrams between generated and reference summaries. Variants like ROUGE-N (n-gram based) and ROUGE-L (sequence based) provide flexible assessments, though, like BLEU, they overlook deeper meaning and coherence. These metrics enable quick, automated comparisons but correlate poorly with human perceptions of quality in open-ended generation scenarios.

Extrinsic metrics assess NLG output through its practical impact, such as task success rates in applications like report generation, where user comprehension or accuracy is measured. For instance, in data-to-text systems, success might be quantified by how well generated reports inform user actions compared to human-written ones. Human judgments often complement these, using Likert scales to rate dimensions like fluency (grammaticality and naturalness) or adequacy (fidelity to the input data), providing nuanced insights but requiring careful guidelines to ensure reliability.

Advanced measures like BERTScore address the semantic shortcomings of n-gram-based metrics by leveraging contextual embeddings from pre-trained models such as BERT to compute token-level similarities via cosine distance. This yields precision, recall, and F1 scores that better align with human evaluations, especially for paraphrases and diverse phrasings in NLG tasks, though it remains computationally intensive. Benchmarks and shared tasks facilitate standardized evaluation, such as the Second Multilingual Surface Realisation Shared Task (SR'19), which assessed NLG systems across languages using automatic metrics alongside human assessments to promote multilingual robustness. These initiatives highlight the need for diverse datasets and metrics tailored to non-English generation.

Balancing human and automatic evaluation involves trade-offs: automatic metrics offer speed and reproducibility, while human judgments capture subjective qualities but introduce variability and cost. Crowdsourcing platforms like Amazon Mechanical Turk enable large-scale human evaluations by distributing annotation tasks to remote workers, often with quality controls such as qualification tests, though results must be validated against expert judgments to mitigate biases.
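A minimal single-reference BLEU implementation following the formula above is sketched below; it uses uniform weights and a small constant to guard against log(0), whereas production evaluations typically rely on corpus-level tools such as sacreBLEU with multiple references.

```python
import math
from collections import Counter

# Minimal sentence-level BLEU with one reference: clipped n-gram precisions,
# uniform weights, and the brevity penalty from the formula above.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # 1e-9 guards against log(0) when no n-gram matches (crude smoothing).
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(round(bleu("it rained in the city yesterday",
                 "it rained in the city yesterday"), 3))  # 1.0
```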

Key Limitations and Future Directions

Neural natural language generation (NLG) models frequently produce hallucinations, generating fluent but factually incorrect or unsubstantiated content due to inconsistencies in training data or inadequate decoding strategies. This issue is particularly pronounced in abstractive tasks like summarization, where models invent details not present in the input, undermining reliability in applications such as journalism or healthcare. Additionally, bias amplification occurs when models perpetuate and exacerbate societal stereotypes from training data, such as gender biases in occupational descriptions, as demonstrated in analyses of word embeddings that influence generated text.

Ethical concerns in NLG arise from the potential for misinformation propagation through hallucinated outputs, which can mislead users in high-stakes domains like legal or news reporting, and from gaps in controllability, where large language models (LLMs) struggle to adhere to user-specified constraints without veering into harmful content. Scalability challenges further compound these issues, as the high computational costs of training and deploying large models limit accessibility, restrict output length, and hinder real-time adaptation to new domains or data. Domain adaptation remains difficult, often requiring extensive retraining that exacerbates resource demands for non-English or specialized contexts. These limitations also highlight inadequacies in current evaluation metrics, which struggle to detect subtle hallucinations or biases comprehensively.

Future directions in NLG emphasize developing interpretable systems through explainable AI techniques, such as leveraging LLMs to generate human-readable rationales for outputs, to enhance trust and debugging in complex models. Integration with robotics for embodied communication represents another promising avenue, enabling robots to produce context-aware responses grounded in physical interactions and perception. Personalized generation, which tailors outputs to individual user profiles and contexts, is gaining traction to improve engagement in recommendation and dialogue systems. Research gaps persist in low-resource languages, where scarce datasets impede effective NLG development, and in real-time ethical filtering mechanisms to dynamically mitigate biases or misinformation during generation. Addressing these could involve hybrid approaches combining external knowledge bases with efficient, lightweight models to broaden NLG's applicability.
