
Automatic summarization

Automatic summarization, also known as automatic text summarization (ATS), is the computational process of generating a concise version of one or more source documents while retaining their core information content and overall meaning, typically compressing to 5–30% of the original length or less. This task addresses the challenge of information overload in an era of vast digital text data, enabling efficient access to key insights from sources like news articles, scientific papers, and legal documents. ATS methods are broadly categorized into two primary types: extractive summarization, which identifies and extracts salient sentences or phrases directly from the input text to form the summary, and abstractive summarization, which interprets the source material and generates novel sentences that convey the essential ideas in a more fluent, human-like manner. Hybrid approaches combine elements of both to leverage their strengths, such as the factual accuracy of extractive methods with the coherence of abstractive ones. Further distinctions include single-document versus multi-document summarization, where the latter aggregates information across multiple related sources, and generic versus query-focused summarization, the latter tailored to specific user needs. The field originated in the mid-20th century with early statistical techniques, notably Hans Peter Luhn's 1958 work on auto-abstracting, which used word frequency and proximity to select significant sentences from technical literature. Subsequent advancements in the 2000s incorporated machine learning and graph-based methods, evolving through recurrent neural networks (RNNs) and long short-term memory (LSTM) models in the 2010s to transformer-based architectures like BERT and the GPT series in the 2020s. Large language models (LLMs) have recently revolutionized ATS by enabling zero-shot, context-aware generation, though they introduce challenges like factual hallucinations. Applications of ATS span diverse domains, including news aggregation for quick overviews, medical report condensation to aid diagnostics, and legal analysis for faster case reviews, thereby enhancing productivity in information-intensive fields. Evaluation typically relies on intrinsic metrics like ROUGE scores for lexical overlap with reference summaries and extrinsic measures assessing summary utility in downstream tasks, alongside human judgments for coherence and relevance. Ongoing challenges include ensuring factual consistency, handling long-context documents, and adapting to domain-specific language without extensive retraining.

Fundamentals

Definition and Scope

Automatic summarization is the computational process of producing a concise text that captures the most important information from one or more source documents, typically compressing to 10–30% of the original length while preserving semantic meaning and key details without human involvement. This task aims to create a reductive transformation of the source material through selection, generalization, or interpretation, enabling efficient information access in an era of information overload. The scope of automatic summarization delineates it from related tasks like paraphrasing or translation by focusing on condensation rather than equivalence or reformulation. It distinguishes between generic summarization, which provides domain-independent overviews of the source content, and query-focused summarization, which generates tailored outputs responsive to specific queries or interests. Additionally, it covers single-document approaches that condense individual texts and multi-document strategies that integrate and synthesize across multiple sources to avoid redundancy and highlight contrasts or updates. Summaries may also vary in length and depth, ranging from brief indicative versions that outline main topics to more detailed informative ones that elaborate on core elements. Core objectives emphasize producing outputs that achieve coherence for logical flow and readability, coverage to include essential facts and viewpoints, non-redundancy to eliminate repetition, and fluency to ensure grammatical and stylistic naturalness akin to human writing. These goals guide system design to balance brevity with informativeness, often drawing on high-level approaches like extractive methods, which select existing phrases, versus abstractive ones, which generate new phrasing for the content. Foundational to automatic summarization are basic natural language processing concepts, including tokenization, which segments text into words, subwords, or sentences for analysis, and sentence parsing, which decomposes syntactic structures to identify dependencies and relationships. These preprocessing steps enable subsequent modeling of text semantics and discourse structure, forming the bedrock for more advanced summarization techniques.

Importance and Challenges

Automatic summarization addresses the challenges of information overload by enabling efficient processing of vast textual data, such as news feeds, legal documents, and medical reports, where it condenses complex information into concise forms to support quick comprehension and decision-making. In practical applications, it facilitates news aggregation by extracting essential events and opinions from multiple sources, streamlines legal document review by highlighting key clauses and precedents, and aids in medical report condensation by summarizing patient histories and diagnoses for healthcare professionals. Additionally, it enhances search engine snippets by providing brief overviews of web content, improving user navigation in online environments. These roles are particularly beneficial for accessibility, offering simplified summaries for non-native language speakers and visually impaired individuals through text-to-speech integrations. The societal impact of automatic summarization has intensified since the 2000s with the explosion of digital content, driven by the proliferation of material on social media, in academic publications, and in online news, which overwhelms human processing capacities, with global data creation already exceeding 150 zettabytes annually as of 2025. It delivers gains in time-sensitive fields like journalism, where automated tools can generate summaries of breaking stories in seconds, allowing reporters to focus on analysis rather than initial reading. Broader benefits include supporting research workflows by distilling large bodies of literature, thereby accelerating discoveries in fields like medicine, and aiding educational settings by creating digestible overviews for students. Overall, these advancements promote equitable information access amid growing data volumes. Despite its value, automatic summarization faces significant technical challenges, including gaps in semantic understanding that lead to incomplete or distorted representations of source nuance, particularly in context-dependent languages or specialized domains. Bias amplification occurs when models perpetuate skewed perspectives from training data, resulting in unbalanced summaries that favor dominant viewpoints. Hallucination in generative models, such as abstractive systems, introduces fabricated details not present in the original text, undermining reliability. Scalability issues arise with long texts, where computational demands and loss of coherence degrade performance on documents exceeding thousands of words. Ethical concerns further complicate deployment, as poor summaries risk misrepresenting source intent and disseminating misinformation, especially in high-stakes areas like legal or medical contexts where inaccuracies could lead to erroneous decisions. Ensuring preservation of authorial nuance requires safeguards against oversimplification, while addressing potential harms from biased outputs demands diverse training data and transparency in model operations. These issues highlight the need for robust guidelines to mitigate societal risks from automated content generation.

Methods

Extractive Summarization

Extractive summarization is a foundational approach in automatic text summarization that identifies and selects key sentences or phrases directly from the source document to form a coherent summary, without paraphrasing or generating new content. The core mechanism involves computing scores for candidate sentences based on linguistic and structural features, followed by ranking and selection to meet a desired summary length, typically 10–30% of the original text. Common features include sentence position, where earlier sentences often receive higher weights due to journalistic conventions placing important information upfront; TF-IDF, which measures term importance by balancing word frequency in the document against its rarity across a corpus; and centrality, which assesses a sentence's relatedness to the overall text through connectivity or representativeness. These scores are aggregated linearly or via learned models to prioritize sentences that capture the document's main ideas. Pioneering work in extractive summarization dates to the late 1950s with Hans Peter Luhn's frequency-based approach, which scans documents to identify "significant" words—those appearing frequently but excluding common stopwords—and selects sentences containing the highest concentrations of these words to form an "auto-abstract." This method laid the foundation for statistical extraction by emphasizing lexical salience without requiring deep semantic analysis. A decade later, in 1969, H. P. Edmundson advanced the field with his cue method, which combines multiple cues: location (favoring sentences near the beginning or end), frequency of key terms, presence of title words, and predefined "cue phrases" (e.g., "in conclusion" or "the purpose of") that signal importance, weighted subjectively to compute relevance scores. Edmundson's approach improved upon pure frequency methods by incorporating structural and contextual indicators, achieving better performance on scientific and technical texts in early evaluations. Statistical models in extractive summarization build on these early ideas with baselines and refinements focused on frequency and position. The Lead-3 baseline, a simple yet robust method, extracts the first three sentences of a document as the summary, exploiting the inverted pyramid structure common in news articles where key facts appear early; it serves as a strong reference in benchmarks, often outperforming more complex systems on datasets like CNN/DailyMail due to its alignment with human writing patterns. Frequency-driven extraction extends Luhn's principle by using variants of term weighting, such as summing TF-IDF scores across words in a sentence to gauge informativeness, then greedily selecting non-redundant high-scoring sentences to avoid overlap. These models prioritize computational efficiency and interpretability, making them suitable for large-scale applications. Extractive summarization offers advantages such as faithfulness to the source material, ensuring summaries contain verbatim content that avoids fabrication or distortion of facts, which is particularly valuable in domains like legal or medical texts. Additionally, its reliance on direct selection facilitates easier intrinsic evaluation using metrics like ROUGE, as overlap with the original can be precisely measured against gold-standard summaries. However, limitations include reduced coherence, as concatenated sentences may lack smooth transitions and cohesive flow, potentially resulting in a disjointed read. Redundancy is another challenge, where similar sentences might be selected if diversity is not explicitly enforced during selection.
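The scoring logic behind Luhn-style auto-abstracting can be illustrated in a few lines of Python. The following is a minimal sketch under simplifying assumptions: a toy stopword list, naive sentence splitting, and a significant-word density score standing in for Luhn's original cluster-based significance measure.

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that", "for", "with"}

def luhn_summary(text, num_sentences=2, top_k_words=10):
    # Split into sentences and collect frequent non-stopword terms.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)
    significant = {w for w, _ in freq.most_common(top_k_words)}

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        hits = sum(1 for t in tokens if t in significant)
        # Density of significant words, squared to reward clusters of them.
        return hits ** 2 / len(tokens) if tokens else 0.0

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Emit the selected sentences in their original document order.
    return " ".join(s for s in sentences if s in top)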
A representative example of extractive summarization via graph-based ranking is the TextRank algorithm, which constructs an undirected graph where nodes represent sentences and edges are weighted by sentence similarity (often computed over TF-IDF vectors); it then applies the PageRank algorithm to compute centrality scores, selecting the top-ranked sentences to form the summary. This approach captures global text structure by propagating importance through similarity links, improving over purely local features like frequency. Similarly, LexRank applies eigenvector centrality on a sentence similarity graph to identify salient nodes, emphasizing cluster-based representativeness for more diverse selections. These graph methods, while unsupervised, have demonstrated competitive performance on single-document tasks by modeling inter-sentence relationships.
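A compact, dependency-free sketch of sentence-level TextRank follows; it uses the normalized word-overlap similarity from Mihalcea and Tarau (2004) as edge weights and a damped power iteration in place of an off-the-shelf PageRank implementation.

import math
import re

def textrank_summary(sentences, num_sentences=3, damping=0.85, iterations=50):
    tokens = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)

    # Edge weights: shared words normalized by log sentence lengths.
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            overlap = len(tokens[i] & tokens[j])
            denom = math.log(len(tokens[i]) + 1) + math.log(len(tokens[j]) + 1)
            if overlap and denom > 0:
                sim[i][j] = sim[j][i] = overlap / denom
    out_weight = [sum(row) for row in sim]

    # Damped PageRank-style power iteration over the weighted graph.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n) if j != i and out_weight[j] > 0
            )
            for i in range(n)
        ]

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]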

Abstractive Summarization

Abstractive summarization generates novel sentences that paraphrase and synthesize information from the source text, aiming to capture its semantic essence in a more concise and fluent form than the original. The core mechanism entails first deriving a semantic representation of the input, such as through syntactic parse trees or dense vector embeddings, which encodes key concepts and relationships. This representation then informs a generation process that constructs new text, often guided by linguistic rules or learned patterns to ensure coherence and grammaticality. Early approaches to abstractive summarization, emerging in the 1980s, primarily utilized template-based systems and rule-driven paraphrasing. Kathleen McKeown's foundational work on discourse-focused text generation employed predefined templates populated with extracted entities and events from the source, combined with rules for rephrasing to produce summaries that mimicked human abstracts. These methods prioritized interpretability and control but were constrained by hand-crafted rules, limiting their scalability to diverse texts. The paradigm shifted toward neural architectures around 2014, with sequence-to-sequence models incorporating attention mechanisms enabling end-to-end learning for abstraction. Rush et al. (2015) pioneered this by introducing a local attention-based encoder-decoder model for sentence summarization, where the decoder generates each summary word conditioned on attended input representations, achieving substantial improvements over prior baselines on headline-generation datasets. Building on this, Nallapati et al. (2016) adapted attentional encoder-decoder recurrent neural networks for longer documents, addressing challenges like rare word handling and hierarchical structure to produce state-of-the-art abstractive outputs. A pivotal advancement came with pointer-generator networks, as proposed by See et al. (2017), which hybridize generation and extraction within a neural framework. This approach computes a probability distribution over the vocabulary that interpolates between generating unseen words and pointing to source tokens, allowing the model to reproduce factual details accurately while enabling paraphrasing for novelty; an added coverage mechanism further mitigates repetition by penalizing overlooked input elements. Despite these innovations, abstractive methods face persistent challenges like factual inconsistency, where models may hallucinate or distort information absent from the source, undermining reliability. Abstractive summarization offers advantages in producing human-like fluency and conciseness, enabling summaries that integrate information across sentences more naturally than extractive alternatives. However, it incurs higher computational demands due to the complexity of generation and remains error-prone, as the reliance on learned abstractions can amplify inaccuracies in underrepresented domains.
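Concretely, the pointer-generator model of See et al. (2017) defines the final output distribution as P(w) = p_{\text{gen}} \cdot P_{\text{vocab}}(w) + (1 - p_{\text{gen}}) \sum_{i: w_i = w} a_i, where P_{\text{vocab}} is the decoder's vocabulary distribution, a_i are the attention weights over source positions, and the generation probability p_{\text{gen}} \in [0, 1] is computed at each decoding step from the context vector, decoder state, and decoder input via a sigmoid; p_{\text{gen}} = 1 recovers pure abstractive generation, while p_{\text{gen}} = 0 reduces to pure copying from the source.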

Hybrid and Aided Approaches

Hybrid approaches in automatic summarization integrate extractive and abstractive techniques to leverage the strengths of both paradigms, typically employing extractive methods for initial content selection followed by abstractive refinement for coherent output generation. Early models from the 2010s, such as the hierarchical approach proposed by Wang et al., combined statistical sentence scoring with semi-supervised learning to identify salient elements before generating summaries, achieving improved coherence over pure extractive systems on multi-document tasks. In the neural era, models like the one introduced by Pilault et al. in 2020 used transformer-based extractive pre-selection to compress long documents into key segments, which were then abstractively summarized, demonstrating ROUGE score improvements of up to 2 points on long-document datasets compared to standalone abstractive baselines. More recent hybrids, such as SEHY (2022), exploit discourse information for extractive section selection prior to abstractive processing, balancing fidelity to source content with fluency (a minimal sketch of this extract-then-abstract pattern follows below). Aided summarization extends automatic methods by incorporating human guidance to enhance accuracy and adaptability, often through interactive interfaces where users refine outputs via queries, edits, or feedback loops. For instance, the interactive query-assisted system by Narayan et al. (2022) employs deep reinforcement learning to iteratively update summaries based on user-specified queries, enabling targeted exploration of document sets while reducing error risks. Semi-supervised hybrids, like the salient representation learning model by Zhong et al. (2023), blend statistical scoring for extractive candidate generation with neural abstractive refinement, using limited labeled data to train on unlabeled corpora for multi-document tasks. Crowd-sourced aided tools, such as those in the aspect-based summarization benchmark by Roit et al. (2023), involve controlled human annotations to guide pipelines, ensuring diverse perspectives in summary generation for open-domain topics. These systems address limitations of fully automated methods by allowing user interventions, such as editing phrases, to maintain factual accuracy. The benefits of hybrid and aided approaches lie in their ability to balance extractive fidelity—preserving original semantics—and abstractive creativity—producing novel, concise expressions—while mitigating issues like incoherence or factual errors in pure paradigms. For example, a joint extractive-abstractive model for financial narratives (2021) reported 5-10% gains in semantic consistency metrics over non-hybrid baselines, highlighting improved usability in domain-specific applications. Post-2020 developments emphasize human-AI collaboration frameworks, such as SUMMHELPER (2023), which facilitates real-time human-computer co-editing of summaries, and design space mappings by Zhang et al. (2022) that outline interaction modes like iterative feedback to foster user control and trust in collaborative summarization. These emerging systems, often integrated with large language models, promote scalable processes that enhance summary quality through complementary human oversight.
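As a concrete illustration of the extract-then-abstract pattern, the sketch below pre-selects high-TF-IDF sentences and then paraphrases them with a pretrained abstractive model. It assumes the scikit-learn and Hugging Face transformers libraries and the public facebook/bart-large-cnn checkpoint; the simple TF-IDF stage is an illustrative stand-in for the learned extractors used in the systems above.

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

def hybrid_summarize(document, keep=8):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    # Extractive stage: keep the sentences with the highest TF-IDF mass.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    condensed = " ".join(sentences[i] for i in sorted(ranked[:keep]))
    # Abstractive stage: paraphrase the pre-selected content.
    abstractor = pipeline("summarization", model="facebook/bart-large-cnn")
    return abstractor(condensed, max_length=130, min_length=30)[0]["summary_text"]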

Techniques

Keyphrase Extraction

Keyphrase extraction is the task of automatically identifying and selecting multi-word terms, such as noun phrases, that best represent the essence or main topics of a document. These keyphrases serve as concise descriptors of the document's content, aiding in indexing, retrieval, and understanding without requiring full reading. Unlike single keywords, keyphrases capture compound concepts (e.g., "automatic summarization" rather than just "summarization"), making them particularly valuable for representing complex ideas in technical or lengthy texts. Supervised methods for keyphrase extraction typically frame the problem as a sequence labeling or classification task, where classifiers are trained on annotated datasets to distinguish keyphrases from non-keyphrases. Common features include word position (e.g., proximity to the document's beginning or title, as phrases appearing early often indicate topical importance), frequency (e.g., term frequency-inverse document frequency, tf-idf, to weigh rarity and occurrence), and co-occurrence (e.g., measuring semantic relatedness with surrounding terms). For instance, Conditional Random Fields (CRF) models excel in this context by modeling dependencies across phrase boundaries, using features like part-of-speech tags, dependency parses, and contextual windows to label candidate phrases. In evaluations on scientific articles, CRF-based approaches have demonstrated superior performance over baselines like SVMs, achieving F-measures around 32-33% on datasets such as SemEval-2010. Unsupervised methods, in contrast, rely on intrinsic text properties without labeled training data, often employing graph-based ranking to identify salient phrases. The TextRank algorithm, introduced in 2004, exemplifies this approach by constructing a graph where candidate phrases (or words) serve as nodes and edges represent co-occurrence within a sliding window (typically 2-10 words). Node scores are computed iteratively using a PageRank-inspired voting mechanism, propagating importance from highly connected nodes until convergence, typically after 20-30 iterations; the top-scoring nodes form the extracted keyphrases. This method has shown competitive results on benchmarks like the Inspec dataset, with precision around 31% and recall around 43% for window size 2. Evaluation of keyphrase extraction commonly uses precision (fraction of extracted phrases that are correct), recall (fraction of gold-standard phrases retrieved), and their F-measure, often computed at top-K (e.g., the top 5 or 10 candidates) to assess performance under practical constraints. These metrics highlight trade-offs, such as high precision at low K versus broader recall at higher K, and are standard on datasets like Inspec or SemEval. In automatic summarization, extracted keyphrases are frequently applied as features for sentence scoring in extractive methods, where sentences containing more keyphrases receive higher scores, guiding the selection of summary content. This integration enhances focus on topical elements, improving summary relevance.
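These top-K metrics are straightforward to compute; the helper below scores a ranked list of predicted keyphrases against a gold-standard set, with lowercasing as a simplifying normalization (official evaluations often also apply stemming).

def prf_at_k(predicted, gold, k=10):
    # Compare the top-k ranked predictions against the gold-standard phrases.
    pred = [p.lower() for p in predicted[:k]]
    gold_set = {g.lower() for g in gold}
    correct = sum(1 for p in pred if p in gold_set)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1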

Single-Document Summarization

Single-document summarization focuses on generating a concise summary of the key content of a single text, such as a news article, scientific paper, or report, while preserving its core meaning and structure. Unlike multi-document approaches, it emphasizes the internal coherence and logical flow of one source, often using extractive or abstractive techniques tailored to the source's genre and content. Early methods relied on rule-based heuristics, but modern techniques incorporate machine learning to select or generate summary elements more effectively. Supervised learning approaches for extractive single-document summarization commonly employ features such as sentence length, position in the document, and similarity to the title or headings to rank and select important sentences. These features capture indicators of salience, like shorter sentences for key facts or those closely aligned with the document's title for topical relevance. A seminal work in this area is the trainable document summarizer by Kupiec et al., which uses a Bayesian classifier to estimate the probability of a sentence being included in a human-generated summary based on such features, achieving improved performance over baseline methods on technical documents. Adaptive methods in single-document summarization dynamically adjust the output based on user-specified needs, such as desired summary length or focus on particular aspects like key events or entities. For instance, systems can modulate sentence selection thresholds or reweight features at inference time to produce shorter or more targeted summaries without retraining. This enhances flexibility for diverse applications. Graph-based techniques model the document as a graph where sentences are nodes connected by similarity edges, enabling centrality measures to identify salient content. LexRank, introduced by Erkan and Radev, applies eigenvector centrality inspired by PageRank to compute lexical importance scores for sentences, treating the graph as a Markov chain and converging on stable rankings for extractive selection; this method outperforms frequency-based baselines on news corpora by better capturing thematic clusters. One key challenge in single-document summarization is maintaining narrative flow, particularly in non-news texts like stories or reports, where extractive methods may disrupt chronological or causal sequences by selecting disjoint sentences. Abstractive approaches aim to mitigate this through paraphrasing, but they risk introducing inconsistencies if not grounded in the source structure. Datasets like CNN/DailyMail, comprising over 300,000 news articles paired with human-written highlights, have become standard for training and evaluating single-document models, facilitating advancements in both extractive and abstractive paradigms.
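A minimal sketch of such feature-based sentence scoring, in the spirit of Kupiec et al.'s trainable summarizer, is shown below; the linear weights are illustrative assumptions, whereas the original system combined comparable evidence with a naive Bayes classifier learned from document-summary pairs.

import re

def salience_score(sentence, index, total, title, weights=(0.4, 0.2, 0.4)):
    w_pos, w_len, w_title = weights
    # Position: earlier sentences score higher (inverted-pyramid heuristic).
    position = 1.0 - index / max(total - 1, 1)
    tokens = re.findall(r"[a-z']+", sentence.lower())
    # Length: favor reasonably long sentences, capped at 20 tokens.
    length = min(len(tokens) / 20.0, 1.0)
    # Title overlap: fraction of title words the sentence covers.
    title_tokens = set(re.findall(r"[a-z']+", title.lower()))
    overlap = (len(title_tokens & set(tokens)) / len(title_tokens)) if title_tokens else 0.0
    return w_pos * position + w_len * length + w_title * overlap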

Multi-Document Summarization

Multi-document summarization (MDS) aims to generate a concise, coherent overview from a collection of related documents, such as news articles or research papers, by integrating key information while minimizing overlap and ensuring comprehensive coverage. Unlike single-document approaches, MDS must synthesize diverse perspectives, often from sources with varying emphases, to produce a unified narrative that captures the essence of the topic. This process emphasizes redundancy reduction—eliminating repetitive content across documents—and information fusion, where complementary details are fused into novel expressions. Early frameworks, such as those based on Cross-document Structure Theory (CST), highlight the need to model relations like elaboration (adding details) and subsumption (overlapping ideas) to achieve this balance. Core challenges in MDS include managing redundancy, where identical or similar facts recur across sources, potentially inflating summary length without adding value; handling contradictions, such as conflicting reports on events or findings; and performing topic clustering to identify and group sub-themes within the document set. Redundancy is often addressed through similarity measures like cosine distance on TF-IDF vectors, which filter out near-duplicate content during selection. Contradictions require relational modeling, as in CST-based approaches, where conflicting segments (e.g., one document claiming an outcome increase while another reports a decrease) are flagged for inclusion or reconciliation based on user needs, ensuring summaries avoid unsubstantiated claims. Topic clustering, meanwhile, involves grouping documents or sentences by shared themes, using techniques like k-means clustering to partition content and prevent scattered narratives. These issues are exacerbated in large corpora, where input size can exceed thousands of sentences, demanding scalable algorithms to maintain coherence. Key techniques for MDS include Maximal Marginal Relevance (MMR), introduced in 1998, which balances relevance to the central topic with novelty to promote diversity and curb redundancy. MMR selects sentences by maximizing a score that weighs similarity to a query or centroid against dissimilarity to already chosen elements, formalized as: \text{MMR} = \arg\max_{D_i \in R \setminus S} \left[ \lambda \cdot \text{Sim}_1(D_i, Q) - (1 - \lambda) \cdot \max_{D_j \in S} \text{Sim}_2(D_i, D_j) \right], where R is the candidate set, S the selected set, Q the query, \lambda tunes the trade-off (typically 0.5–0.7), and \text{Sim}_1, \text{Sim}_2 are cosine similarities. This greedy reranking has been widely adopted for extractive MDS, reducing overlap in news clusters by up to 20–30% in early evaluations (a code sketch of the greedy loop appears at the end of this section). Hierarchical clustering extends this by organizing documents into nested structures, such as temporal layers for evolving events, enabling summaries that reflect progression (e.g., initial reports to updates). In the SUMMA system, sentences are clustered recursively by burstiness—peaks in coverage—and evenness, optimizing for salience and coherence across levels, which improved human preference by 92% over flat methods on news corpora. Supervised approaches, such as Integer Linear Programming (ILP), formulate MDS as an optimization problem for optimal sentence selection under constraints like length limits. ILP models maximize a linear objective combining sentence importance (predicted via supervised regression on features like position and n-gram overlap) and diversity (e.g., unique bigram coverage), subject to binary variables indicating selection and non-overlap penalties.
A 2012 method using Support Vector Regression for importance scoring achieved state-of-the-art ROUGE-2 scores of 0.0817 on DUC 2005 datasets, outperforming greedy baselines by incorporating global constraints solvable in seconds via solvers like GLPK. This supervised paradigm trains on annotated corpora to prioritize informative, non-redundant content. Evaluation in MDS uniquely emphasizes coverage of events or entities across documents, assessing how well summaries capture distributed information rather than isolated facts. Metrics like ROUGE variants measure n-gram overlap with references, but event-focused approaches, such as QA-based scoring in the DIVERSE SUMM benchmark, quantify inclusivity by checking if summaries address diverse question-answer pairs (e.g., "what" and "how" events), revealing gaps in large language models where coverage hovers at 36% despite high faithfulness. Human judgments often prioritize event completeness, as partial coverage can mislead on multi-source topics. Practical examples include summarizing news clusters, as in the Multi-News dataset, which comprises 56,216 pairs of 2–10 articles on events like arrests or elections, enabling models to fuse timelines and perspectives into 260-word overviews that reduce redundancy by integrating overlapping reports. In scientific literature, MDS supports systematic reviews by synthesizing study abstracts; the MSLR2022 shared task, using datasets like MS² (20,000 reviews), tasked systems with generating conclusions on evidence directions (e.g., treatment effects), where top entries improved ROUGE-L by 2+ points via hybrid extractive-abstractive methods tailored to domain-specific clustering.
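The MMR reranking loop referenced above translates directly into code. The sketch below assumes precomputed cosine-similarity callbacks sim_to_query(i) (relevance of candidate i to the query or centroid) and sim(i, j) (redundancy between candidates), both hypothetical placeholders for whatever sentence representations a system uses.

def mmr_select(candidates, sim_to_query, sim, budget=5, lam=0.6):
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < budget:
        def mmr_score(i):
            # Penalize candidates similar to anything already selected.
            redundancy = max((sim(i, j) for j in selected), default=0.0)
            return lam * sim_to_query(i) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected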

Advanced Optimization Methods

Advanced optimization methods in automatic summarization leverage mathematical frameworks to select optimal summary elements under constraints such as length budgets. A prominent approach involves submodular functions, which are set functions exhibiting the property of diminishing returns, enabling efficient diverse subset selection for extractive tasks like sentence ranking and coverage maximization. These functions model summarization as optimizing an objective F(S), where S is the summary set, to balance representativeness and diversity while adhering to submodularity: F(A \cup \{e\}) - F(A) \geq F(B \cup \{e\}) - F(B) for all A \subseteq B and e \notin B. Recent advancements integrate large language models (LLMs) into these frameworks, using techniques like prompt-based optimization and fine-tuning to enhance abstractive summarization while addressing hallucinations through submodular coverage constraints. For example, as of 2024, LLM-based methods have improved ROUGE scores on benchmarks like CNN/DailyMail by incorporating deterministic constraints for factual consistency. In practice, submodular functions facilitate greedy algorithms that iteratively select sentences to maximize coverage, providing a principled way to approximate the best summary under budget constraints. Complementary to this, Bayesian approaches address uncertainty in summaries by modeling probabilistic dependencies, such as query relevance or sentence importance, through posterior distributions that incorporate prior knowledge and observed data. For instance, Bayesian query-focused summarization uses hidden variables to estimate sentence contributions, enabling robust handling of ambiguous inputs. These methods offer theoretical advantages, including greedy algorithms' approximation guarantees of (1 - 1/e)-optimality for maximizing monotone submodular functions under cardinality constraints. However, their limitations include high computational cost, often O(n^2) for evaluating marginal gains over large document sets, which can hinder scalability in real-time applications.
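The greedy scheme can be sketched with a simple monotone coverage objective, here the number of distinct content words covered by the chosen sentences; real systems add diversity and fidelity terms, but any monotone submodular F enjoys the same (1 - 1/e) greedy guarantee under a cardinality budget.

import re

def greedy_submodular_summary(sentences, budget=3):
    def words(s):
        return set(re.findall(r"[a-z']+", s.lower()))

    covered, selected = set(), []
    remaining = list(range(len(sentences)))
    for _ in range(min(budget, len(sentences))):
        # Marginal gain of sentence i: how many new words it would cover.
        gains = {i: len(words(sentences[i]) - covered) for i in remaining}
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            break  # no remaining sentence adds new coverage
        selected.append(best)
        covered |= words(sentences[best])
        remaining.remove(best)
    return [sentences[i] for i in sorted(selected)]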

Evaluation

Intrinsic and Extrinsic Metrics

Evaluation of automatic summarization systems relies on intrinsic and extrinsic metrics to assess summary quality. Intrinsic metrics evaluate the summary directly by comparing it to reference summaries or the source text, focusing on aspects such as content coverage, fluency, and coherence without requiring human task performance. These metrics are typically automated and domain-independent, enabling scalable assessment, though they may not fully capture semantic nuances. In contrast, extrinsic metrics measure the utility of a summary in supporting downstream tasks, such as question answering or document retrieval, where the summary's effectiveness is gauged by its impact on task outcomes like accuracy or completion time. A prominent intrinsic metric is ROUGE (Recall-Oriented Understudy for Gisting Evaluation), introduced in 2004, which quantifies n-gram overlap between the candidate summary and multiple reference summaries to approximate human judgments of informativeness. ROUGE variants include ROUGE-1 for unigram overlap, ROUGE-2 for bigram overlap emphasizing phrase-level matching, and ROUGE-L based on the longest common subsequence to account for sentence-level structure. The core ROUGE-N formula is defined as: \text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}, where \text{Count}_{\text{match}} is the count of each reference n-gram clipped to its count in the candidate, and the denominator sums the n-gram counts in the references; this recall-focused approach correlates well with human evaluations on datasets like DUC. Another intrinsic method is the Pyramid approach, proposed in 2004, which evaluates content selection by identifying semantic content units (SCUs) from human summaries and scoring a candidate summary based on how many unique SCUs it covers, weighted by their pyramid rank to reflect varying human priorities. This method addresses limitations in n-gram metrics by prioritizing semantic informativeness over surface form. Intrinsic evaluations can be categorized as intra-textual, comparing the summary to the source text for aspects like grammaticality or non-redundancy, or inter-textual, comparing it to reference summaries for content adequacy. Domain-independent intrinsic metrics, such as cosine similarity on text embeddings (e.g., using TF-IDF or neural embeddings like BERT), provide a generic measure of semantic overlap without relying on domain-specific references, often serving as a baseline for broader applicability. For instance, cosine similarity computes the angular distance between vector representations of the summary and reference, yielding values from -1 to 1, where higher scores indicate greater alignment. Extrinsic metrics assess summarization through task-oriented performance, revealing practical utility but requiring controlled experiments. In question answering tasks, summaries are evaluated by how well they enable accurate answers to queries derived from the source, with metrics like F1-score on answer extraction showing that effective summaries reduce retrieval time while maintaining high accuracy. Similarly, in reading comprehension benchmarks, extrinsic evaluation measures improvements in comprehension scores when participants use summaries versus full texts, demonstrating correlations with intrinsic scores but highlighting real-world impacts like faster decision-making in audit tasks. These approaches underscore that while intrinsic metrics scale well for development, extrinsic ones validate end-use effectiveness.
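The recall-oriented ROUGE-N computation above can be reproduced in a few lines; the sketch below uses naive whitespace tokenization, a simplification relative to the official ROUGE toolkit's preprocessing.

from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    cand = ngram_counts(candidate.lower().split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngram_counts(ref.lower().split(), n)
        # Clip each reference n-gram's credit to its count in the candidate.
        matched += sum(min(cand[g], c) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0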

Qualitative and Domain-Specific Assessment

Qualitative evaluation in automatic summarization relies on human assessors to rate summaries based on subjective criteria such as informativeness, fluency, and coherence, providing insights into aspects that automated metrics may overlook. Assessors typically use Likert-style scales, such as 5-point Mean Opinion Scores (MOS), where ratings range from "very poor" to "very good" for attributes like grammaticality, non-redundancy, referential clarity, focus, and structure & coherence, which collectively address readability and linguistic quality. For informativeness, overall responsiveness is scored on similar scales, evaluating how well the summary covers key content without redundancy. Pairwise comparisons, where assessors rank two summaries side-by-side for relative quality, are also employed to reduce rating variance and improve reliability. In domain-specific contexts, evaluation adapts these qualitative methods to prioritize fidelity to source nuances, including specialized vocabulary and critical entities. For biomedicine, human assessors focus on entity preservation, using rubrics like SaferDx or PDQI-9 to check for omission of key medical facts, diagnoses, and terminology accuracy, supported by tools such as UMLS-based scorers for groundedness and faithfulness. In finance, ratings emphasize numerical accuracy, verifying retention of vital figures like monetary values through entity-aware assessments to ensure factual precision in summaries. For legal domains, assessors evaluate preservation of case references, dates, and clause linkages, maintaining domain-specific relevance and coherence. These adaptations ensure summaries align intra-textually with source intricacies, such as technical terms in medical or legal texts. Challenges in qualitative and domain-specific assessment include high subjectivity, where inter-rater agreement varies, necessitating expert involvement that escalates costs. Creating reference summaries is resource-intensive, requiring domain experts for annotation and clear guidelines to mitigate variability. The Text Analysis Conference (TAC) exemplifies structured qualitative scoring, using 5-point scales for readability/fluency and overall responsiveness, alongside Pyramid-style methods for scoring content units to guide human judgments in guided summarization tasks. Such human evaluations complement automated baselines like ROUGE by capturing nuanced quality.

Modern Evaluation Challenges

One of the primary modern challenges in evaluating automatic summarization lies in assessing factuality, particularly the detection of hallucinations, where summaries introduce unsubstantiated or incorrect information not present in the source material. Traditional metrics like ROUGE often fail to capture these issues, as they prioritize lexical overlap rather than semantic fidelity, leading to overestimation of summary quality in abstractive systems. Recent approaches, such as the FactCC metric, employ weakly supervised models to verify factual consistency by applying rule-based transformations to source documents and detecting conflicts with generated summaries, achieving improved correlation with human judgments on datasets like CNN/DailyMail. Despite these advances, factuality evaluation remains complex due to its contextual nature, with automatic methods struggling to distinguish subtle factual errors in diverse domains. Multilingual evaluation introduces significant gaps, as most benchmarks and metrics are English-centric, resulting in poor generalization for non-English languages where data scarcity exacerbates issues like translation-induced errors in cross-lingual summarization. For instance, automatic metrics such as BERTScore exhibit biases toward high-resource languages, undervaluing summaries in low-resource ones like Swahili, and failing to account for linguistic nuances across scripts and morphologies. Efforts to address this include meta-evaluation datasets that test robustness across languages, revealing that reference-based evaluators correlate weakly with human assessments outside English, prompting calls for more inclusive, multilingual corpora. Bias and fairness pose additional hurdles, as summarization models can amplify inherent biases in source texts, such as gender or racial stereotypes in news articles, which standard metrics overlook by focusing on surface-level accuracy rather than equitable representation. Metrics like bias amplification ratios have been proposed to quantify how summaries exacerbate source biases, but their integration into evaluation pipelines remains limited, often requiring domain-specific adaptations. FactCC has been extended in fairness contexts to flag biased factual inconsistencies, yet comprehensive tools for detecting amplification in real-time generation are still emerging. Scalability challenges arise in evaluating long-form or streaming summaries, where processing extended inputs like books or live news feeds overwhelms traditional metrics designed for short texts, leading to incomplete assessments of consistency over thousands of tokens. Benchmarks for long-context tasks highlight failures in maintaining factual accuracy across extended narratives, with evaluators like those in the ETHIC framework showing that even large models degrade in performance on inputs exceeding 100,000 tokens. This necessitates efficient, hierarchical evaluation methods that can handle dynamic, incremental summarization without prohibitive computational costs. Emerging reference-free metrics leveraging large language models (LLMs) since around 2023 offer promising solutions by bypassing the need for gold-standard references, instead using models to score summaries on criteria like faithfulness and relevance directly against sources. For example, SummaC aggregates natural language inference scores over document-summary sentence pairs to check factual consistency, outperforming reference-based alternatives on benchmarks like XSum and achieving up to 20% better alignment with human evaluations.
However, these metrics face generalization failures across domains and styles, where models trained predominantly on English data produce inconsistent scores for morphologically rich or low-resource languages, underscoring the need for multilingual evaluation resources.

Applications

Commercial Systems

Commercial automatic summarization systems have proliferated since the mid-2010s, driven by the maturation of cloud computing and the integration of advanced natural language processing (NLP) capabilities into scalable services. This shift enabled enterprises to access sophisticated summarization without on-premises infrastructure, with major providers launching dedicated services around 2016, coinciding with the broader adoption of deep learning in cloud environments. Pricing models typically follow a pay-as-you-go structure, charging based on input volume (e.g., per 1,000 characters or API calls), often with tiered options for high-volume users and free tiers for initial testing. Google Cloud offers summarization through Vertex AI and Document AI, leveraging generative AI models for abstractive summarization that produce concise, human-like overviews of documents or text. These tools support combined extractive-abstractive approaches, where key sentences are identified before rephrasing, and integrate seamlessly with other Google Cloud services for enterprise workflows like document processing in finance or legal review. Developers access features via APIs, with options for custom tuning on proprietary data. IBM Watson, via watsonx.ai, provides document summarization using foundation models such as Granite for both extractive and abstractive methods, emphasizing secure cloud deployments for enterprise use. While earlier components like Tone Analyzer focused on sentiment alongside basic text processing, current capabilities extend to generative summarization for reports, transcripts, and legal documents, reducing processing time by up to 90% in case studies like the media firm Blendow Group. APIs enable integration into applications, supporting retrieval-augmented generation (RAG) for context-aware summaries. Microsoft Azure AI Language (formerly Text Analytics) delivers key phrase extraction alongside extractive and abstractive summarization, using encoder models to rank and generate summaries from unstructured text or conversations. Extractive mode selects salient sentences with relevance scores, while abstractive mode generates novel phrasing for coherence; both handle documents up to 125,000 characters total across batches via asynchronous APIs in languages like Python and C#. This facilitates enterprise applications in compliance monitoring and knowledge management, with scalability for batch processing. Open-source integrations like Hugging Face Transformers enable enterprise deployment of summarization models, such as BART or T5, fine-tuned for specific domains via the Hugging Face Hub. Companies leverage these for custom pipelines in production environments, deploying abstractive models that generate summaries while preserving key information, often combined with cloud infrastructure for inference at scale. Enterprise features include model sharing, evaluation metrics like ROUGE, and paid services for private repositories and accelerated inference. In news applications, Apple Intelligence incorporates summarization tools within iOS and macOS ecosystems, using on-device generative models to condense articles into digests for notifications and reading apps like Safari. This feature prioritizes key points from lengthy content, enhancing user experience in fast-paced media consumption, though it has faced challenges with accuracy in beta releases.

Real-World Use Cases

In journalism, automatic summarization has enabled the generation of concise news briefs from structured data, allowing outlets to scale coverage efficiently. Since 2014, the Associated Press (AP) has employed natural language generation (NLG) technology to automate summaries of corporate earnings reports, transforming raw financial data into readable articles and increasing output from about 300 stories per quarter to over 4,000 without additional staff. This approach has since expanded to other routine reporting, such as sports recaps and election results, freeing journalists for in-depth analysis while maintaining factual accuracy through templated extraction methods. In the legal and enterprise sectors, automatic summarization streamlines contract review and e-discovery processes by condensing voluminous documents into key insights, reducing manual review time significantly. Tools integrated with e-discovery platforms use extractive and abstractive techniques to highlight clauses, risks, and obligations in contracts, enabling faster due diligence in mergers and litigation. For instance, in e-discovery, summarization aids early case assessment by generating overviews of document sets, helping legal teams prioritize relevant evidence from terabytes of data and cutting review costs in large cases. In healthcare, automatic summarization facilitates patient record abstraction by synthesizing electronic health records (EHRs) into coherent narratives, supporting clinical decision-making and reducing cognitive overload for providers. Prototype systems demonstrate how summarization tools assist data abstractors in quality metric abstraction by extracting and prioritizing key events from longitudinal records, improving efficiency in tasks such as identifying comorbidities or treatment histories. Similarly, for biomedical literature, summarization techniques applied to abstracts automate the generation of lay summaries or evidence overviews from clinical trials, enhancing accessibility for researchers and patients; recent work highlights abstractive methods for condensing trial results while preserving medical accuracy. In education, automatic summarization condenses lecture notes and course materials, aiding student comprehension and study efficiency. Applications process video lectures or transcripts to produce structured summaries with key points, timestamps, and concept maps, as shown in evaluations where large language models like GPT-3 generated summaries that improved learner retention by 15-20% compared to unassisted notes. For accessibility aids, summarization supports students with disabilities by adapting content into simplified formats, such as variable-length overviews in e-learning platforms that integrate with screen readers, thereby promoting inclusive education through on-demand customization of dense academic texts. On social media, automatic summarization handles Twitter (now X) threads by distilling multi-post discussions into single-paragraph overviews, helping users navigate complex conversations quickly. Bots and platform features employ NLG to generate thread summaries, capturing main arguments and conclusions, which has been implemented in tools that process viral threads to boost engagement without overwhelming readers. In content moderation, summarization assists by abstracting user reports or dialogue chains, enabling moderators to triage abusive content faster; multimodal systems, for example, summarize text and image interactions in posts, reducing false positives in detection by providing contextual digests for human review.

History and Developments

Early Foundations

The origins of automatic summarization trace back to the 1950s, when early efforts focused on rule-based systems inspired by information retrieval (IR) techniques. In 1958, Hans Peter Luhn introduced one of the first automated methods for generating abstracts from technical papers, using a statistical approach to identify significant sentences based on word frequency and proximity, effectively creating "auto-abstracts" by extracting key excerpts without deep semantic understanding. This work, rooted in IBM's punch-card innovations, marked a foundational shift toward computational text processing and drew heavily from emerging IR practices, such as indexing and keyword weighting, to prioritize content relevance in large document collections. During the 1960s, these ideas influenced broader IR developments, including models that treated documents as bags of words, laying groundwork for later extractive summarization by emphasizing automated relevance ranking over manual abstracting. The 1970s and 1980s saw a pivot to linguistic and knowledge-based approaches, emphasizing deeper text comprehension through structured representations. Roger Schank's script theory, developed in the mid-1970s, proposed using predefined "scripts"—stereotypical sequences of events—to model human understanding of narratives, enabling systems to infer and summarize implied content from partial descriptions in stories or reports. This rationalist paradigm, prominent in artificial intelligence research, influenced summarization by incorporating world knowledge to parse and reorganize text elements, moving beyond simple extraction to simulate human-like inference, though it required extensive hand-crafted knowledge bases that limited scalability. By the late 1980s, these methods intersected with government-funded initiatives; the TIPSTER program, initiated in the early 1990s as a precursor to structured evaluations, began integrating linguistic tools for text analysis, funding research that bridged IR with computational linguistics to handle real-world document sets. In the 1990s, the field transitioned to statistical extractive methods, driven by advances in machine learning and the need for more robust, data-oriented systems. Seminal work by Julian Kupiec and colleagues in 1995 demonstrated a trainable summarizer using probabilistic models to score sentences based on features like title overlap and position, achieving effective extracts by learning from annotated corpora without relying on rigid rules. This data-driven shift, influenced by IR techniques for relevance ranking and early machine translation (MT) efforts in sentence alignment, enabled scalable summarization for news and technical texts, marking a departure from knowledge-intensive approaches toward empirical optimization. Key milestones included the first large-scale evaluations under the TIPSTER program's SUMMAC initiative in 1998, which assessed summarization's utility for real-world tasks, and the inception of the Document Understanding Conference (DUC) in 2001, building on these roots to standardize benchmarks through annual evaluations and associated workshops from 2002. These developments solidified influences from IR (e.g., term weighting) and MT (e.g., fluency in rephrasing), fostering a hybrid foundation for future progress.

Recent Advances

The advent of transformer architectures marked a pivotal shift in automatic summarization, transitioning from statistical methods to neural approaches capable of capturing complex semantic relationships. The introduction of the Transformer model in 2017 revolutionized the field by relying solely on attention mechanisms, eliminating the need for recurrent or convolutional layers, which enabled more efficient processing of long sequences and improved performance in sequence-to-sequence tasks like abstractive summarization. This architecture laid the groundwork for subsequent advancements, allowing models to generate coherent summaries that better mimic human-like abstraction. Building on Transformers, specialized models emerged for abstractive summarization. In 2019, BART (Bidirectional and Auto-Regressive Transformers) was proposed as a denoising autoencoder pretrained on large corpora through tasks like text infilling and sentence permutation, achieving state-of-the-art results on datasets such as CNN/Daily Mail after fine-tuning for summarization, where it outperformed prior baselines in ROUGE scores by up to 2-3 points. Similarly, the T5 (Text-to-Text Transfer Transformer) model, introduced in late 2019 and refined in 2020, unified all NLP tasks under a text-to-text framework, demonstrating superior abstractive summarization capabilities when fine-tuned, with ROUGE-2 improvements of approximately 1.5 points over non-pretrained models on news summarization benchmarks. The integration of large language models (LLMs) further advanced summarization, particularly through fine-tuning and zero-shot capabilities post-2020. Models in the GPT series, starting with GPT-3 in 2020, enabled zero-shot summarization by leveraging in-context prompting, where summaries are generated without task-specific training, achieving competitive ROUGE scores (around 20-25 on CNN/Daily Mail) comparable to supervised models in few-shot settings. Prompt-based approaches extended this by incorporating instructions for style or focus, enhancing flexibility without retraining. In the 2020s, innovations addressed limitations in context length and multilingual support; for instance, Longformer (2020) introduced sparse attention patterns to handle documents up to 4,096 tokens—four times longer than standard Transformers—improving summarization of extended texts like legal or scientific articles by reducing the quadratic cost of self-attention. Multilingual extensions, such as mT5 (2021), pretrained on 101 languages, facilitated cross-lingual summarization, yielding improvements of 5-10 points on non-English datasets when fine-tuned. Key datasets have driven these developments by providing diverse training resources. The XSum dataset (2018), comprising over 200,000 articles paired with single-sentence abstractive summaries, emphasized extreme summarization and novel content generation, boosting model training for concise outputs. Complementing this, MLSUM (2020) offered over 1.5 million article-summary pairs across five languages (French, German, Spanish, Russian, and Turkish), enabling multilingual model evaluation and reducing language biases in training. Emerging trends focus on controllability and ethical considerations. Controllable summarization techniques, such as CTRLsum (2020), allow users to specify attributes like length or entity focus via prompts or prefixes, generating tailored summaries with up to 15% better alignment to user intents on benchmarks like Multi-News.
Post-2023, ethical deployment has gained prominence, with research emphasizing mitigation of biases, hallucinations, and privacy risks in summarization systems, including guidelines for factual consistency evaluation and stakeholder impact assessment in deployment. From 2024 onward, advancements included enhanced zero-shot summarization with models like GPT-4, which achieved higher scores (e.g., ROUGE-1 around 40-45 on news benchmarks) compared to earlier LLMs through better contextual understanding. Techniques such as retrieval-augmented generation (RAG) integrated external retrieval to boost factual accuracy and reduce hallucinations in abstractive summaries, with graph-based variants like GraphRAG enabling query-focused multi-document summarization. New datasets, including CS-PaperSum (2025) for LLM-generated summaries of computer science papers, supported domain-specific training and evaluation as of 2025.

References

  1. [1]
    [PDF] A Comprehensive Survey on Automatic Text Summarization ... - arXiv
    Mar 21, 2025 · Definitions for Automatic Text Summarization. Definition 1 (Text Summarization). Text summarization can be defined as a mapping function fθ ...
  2. [2]
    [PDF] A Comprehensive Survey on Text Summarization Systems - HPCC lab
    This paper presents a taxonomy of summarization systems and defines the most important criteria for a summary which can be generated by a system. Additionally,.
  3. [3]
    The Automatic Creation of Literature Abstracts - IBM - IEEE Xplore
    Excerpts of technical papers and magazine articles that serve the purposes of conventional abstracts have been created entirely by automatic means.
  4. [4]
    [PDF] Introduction to the Special Issue on Summarization - ACL Anthology
    A summary can be loosely defined as a text that is produced from one or more ... Mani and M. T. Maybury, editors,. Advances in Automatic Text Summarization.
  5. [5]
  6. [6]
  7. [7]
    Deep learning for text summarization using NLP for automated news ...
    Oct 17, 2025 · Text summarization automatically generates a summary from the source document that includes all pertinent information and key phrases. In the ...Missing: prerequisites parsing
  8. [8]
  9. [9]
    Automatic summarization of scientific articles: A survey - ScienceDirect
    Automatically summarizing scientific articles would help researchers in their investigation by speeding up the research process.
  10. [10]
    A Systematic Survey of Text Summarization: From Statistical ...
    Extractive methods create summaries by extracting sentences from the original documents [131]. Abstractive methods generate the summary word by word with novel ...
  11. [11]
    [PDF] A Survey on Automatic Text Summarization
    Nov 21, 2007 · This simple definition captures three important aspects that characterize research on automatic summarization: • Summaries may be produced from ...
  12. [12]
  13. [13]
    New Methods in Automatic Extracting | Journal of the ACM
    This paper describes new methods of automatically extracting documents for screening purposes, i.e. the computer selection of sentences having the greatest ...
  14. [14]
    [PDF] Ranking Sentences for Extractive Summarization with ...
    In our first experiment, participants were presented with a news article and summaries generated by three systems: the. LEAD baseline, abstracts from See et al.
  15. [15]
    [PDF] TextRank: Bringing Order into Texts - ACL Anthology
    Mihalcea. 2004. Graph-based ranking algorithms for sen- tence extraction, applied to text summarization. In Pro- ceedings of the 42nd Annual ...
  16. [16]
  17. [17]
    A Comprehensive Survey of Abstractive Text Summarization Based ...
    Automatic text summarization (ATS) is becoming an extremely important means to solve this problem. The core of ATS is to mine the gist of the original text and ...
  18. [18]
    Text Generation - Cambridge University Press & Assessment
    Kathleen McKeown. Publisher: Cambridge University Press. Online publication date: December 2009. Print publication year: 1985. Online ISBN: 9780511620751. DOI ...
  19. [19]
    A Neural Attention Model for Abstractive Sentence Summarization
    Sep 2, 2015 · In this work, we propose a fully data-driven approach to abstractive sentence summarization. Our method utilizes a local attention-based model ...
  20. [20]
    Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond
    Summary of the key contribution by Nallapati et al. (2016) to abstractive summarization.
  21. [21]
    Get To The Point: Summarization with Pointer-Generator Networks
    We use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information.
  22. [22]
    The Factual Inconsistency Problem in Abstractive Text Summarization
    Apr 30, 2021 · This inconsistency between the original text and the summary has caused various concerns over its applicability, and the previous evaluation ...
  23. [23]
    [PDF] A Hybrid Hierarchical Model for Multi-Document Summarization
    While earlier work on summarization depends on a word score function, which is used to measure sentence rank scores based on (semi-)supervised learning ...
  24. [24]
    [PDF] On Extractive and Abstractive Neural Document Summarization with ...
    Nov 20, 2020 · We present a method to produce abstractive summaries of long documents that exceed several thousand words via neural abstractive ...
  25. [25]
    SEHY: A Simple yet Effective Hybrid Model for Summarization of ...
    We propose a Simple yet Effective HYbrid approach, which we call SEHY, that exploits the discourse information of a document to select salient sections instead ...
  26. [26]
    Interactive Query-Assisted Summarization via Deep Reinforcement ...
    Interactive summarization is a task that facilitates user-guided exploration of information within a document set. While one would like to employ state of the ...
  27. [27]
    [PDF] Multi-doc Hybrid Summarization via Salient Representation Learning
    Jul 10, 2023 · This paper proposes a hybrid approach that generates a human-readable summary and extracts key evidence using salient representation learning.
  28. [28]
    [PDF] Joint abstractive and extractive method for long financial document ...
    In our work we propose an end to end financial narrative summarization system that first selects salient sentences from the document and then paraphrases ...
  29. [29]
    [PDF] SUMMHELPER: Collaborative Human-Computer Summarization
    Dec 6, 2023 · Current approaches for text summarization are predominantly automatic, with rather limited space for human intervention and control.
  30. [30]
    [PDF] Mapping the Design Space of Human-AI Interaction in Text ...
    Jul 10, 2022 · From a human-centered perspective, we map the design opportunities and considerations for human-AI interaction in text summarization and broader ...
  31. [31]
    [PDF] Automatic Keyphrase Extraction: A Survey of the State of the Art
    Our goal in this paper is to survey the state of the art in keyphrase extraction, examining the major sources of errors made by existing systems and discussing ...
  32. [32]
    [PDF] Keyphrase Extraction in Scientific Articles: A Supervised Approach
    This paper contains the detailed approach of automatic extraction of Keyphrases from scientific articles (i.e. research papers) using a supervised tool like ...
  33. [33]
    [PDF] Extractive Summarization Using Supervised and ... - ACL Anthology
    We investigate the effectiveness of different sentence features with supervised learning to decide which sentences are important for summarization. After ...
  34. [34]
    (PDF) AdaSum: An adaptive model for summarization - ResearchGate
    The summaries are limited to 250 words in length. The DUC 2007 task was a complex question-focused summarization ...
  35. [35]
    [PDF] A Common Theory of Information Fusion from Multiple Text Sources ...
    We argue that CST is essential for the analysis of contradiction, redundancy, and complementarity in related documents and for multi-document summarization ...
  36. [36]
    [PDF] Survey on Multi-Document Summarization: Systematic Literature ...
    Dec 20, 2023 · The multi-document summarization methods try to produce high-quality summaries of documents with low redundancy. This study conducts a ...
  37. [37]
    [PDF] The Use of MMR, Diversity-Based Reranking for Reordering ...
    MMR combines query relevance with information novelty, aiming to reduce redundancy and maintain relevance in document re-ranking and summarization.
  38. [38]
    [PDF] Hierarchical Summarization: Scaling Up Multi-Document ...
    The hierarchical clustering serves as input to the second step – summarizing given the hierarchy. The hierarchical summary follows the hierarchical structure ...
  39. [39]
    [PDF] Extractive Multi-Document Summarization with Integer Linear ...
    We present a new method to generate extractive multi-document summaries. The method uses Integer Linear Programming to jointly maximize the importance of the ...
  40. [40]
    [PDF] A Multi-document Summarization Benchmark and a Case Study on ...
    Jun 16, 2024 · Your evaluation should consider coverage of the summary with regard to the question and answers (i.e. how much information in the question ...
  41. [41]
    [PDF] Multi-News: a Large-Scale Multi-Document Summarization Dataset ...
    Table 1: An example from our multi-document summarization dataset showing the input documents and their summary. Content found in the summary is color-coded.
  42. [42]
    [PDF] A Shared Task on Multi-document Summarization for Literature ...
    Oct 12, 2022 · For example, one may only want to include the results sentence from an input document if it studies the same population and research question.
  43. [43]
    A Class of Submodular Functions for Document Summarization
    Hui Lin and Jeff Bilmes. 2011. A Class of Submodular Functions for Document Summarization. In Proceedings of the 49th Annual Meeting of the Association for ...
  44. [44]
    [PDF] ROUGE: A Package for Automatic Evaluation of Summaries
    ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by ...
  45. [45]
    [PDF] A Methodology for Extrinsic Evaluation of Text Summarization
    Extrinsic evaluations concentrate on the use of summaries in a specific task, e.g., executing instructions, information retrieval, question answering, and ...
  46. [46]
    Question Answering as an Automatic Evaluation Metric for News ...
    We present an alternative, extrinsic, evaluation metric for this task, Answering Performance for Evaluation of Summaries.
  47. [47]
    [PDF] Extrinsic Summarization Evaluation: A Decision Audit Task
    In this work we describe a large-scale extrinsic evaluation of automatic speech summarization technologies for meeting speech. The particular task is a ...
  48. [48]
    TAC 2011 Guided Summarization Task Guidelines
    The goal of the update component in TAC Summarization is to train automatic summarization systems to recognize new (or non-redundant) information in the second ...
  49. [49]
    Current and future state of evaluation of large language models for ...
    In this review, we examine the current state of LLM evaluation in summarization tasks, highlighting both its applications and limitations in the medical domain.
  50. [50]
    [PDF] Legal and Financial Document Summarization Using Transformer
    Experimental results demonstrate marked improvements in summary quality, factual accuracy, and domain-specific relevance. The proposed system is scalable ...
  51. [51]
    Evaluating the Factual Consistency of Abstractive Text Summarization
    We propose a weakly-supervised, model-based approach for verifying factual consistency and identifying conflicts between source documents and generated ...
  52. [52]
    Reference-free Summarization Evaluation via Semantic Correlation ...
    In this paper, we propose a new automatic reference-free evaluation metric that compares semantic distribution between source document and summary.
  53. [53]
    A Comprehensive Survey on Automatic Text Summarization ... - arXiv
    Mar 20, 2025 · Automatic Text Summarization (ATS) systems are conventionally classified as “Extractive”, “Abstractive”, and “Hybrid” based on their generation ...
  54. [54]
    What is summarization? - Azure AI services | Microsoft Learn
    Sep 27, 2025 · Summarization is a feature offered by Azure AI Language, a combination of generative Large Language models and task-optimized encoder models.
  55. [55]
    AI summarization | Google Cloud
    Use Google's large language models (LLMs), generative AI, and Google Cloud services to summarize documents and text.
  56. [56]
    AI Document Summarization - IBM
    Dec 15, 2023 · Summarization is the ability to condense long documents into a concise summary that captures the key points of the larger work.
  57. [57]
  58. [58]
    Summarization - Hugging Face
    Summarization creates a shorter version of a document or an article that captures all the important information. Along with translation, it is another ...
  59. [59]
  60. [60]
    The AP Is Using Robots To Write Earnings Reports | TechCrunch
    Jul 1, 2014 · The Associated Press is going to start using “automation technology” AKA robots to deliver stories recapping the quarterly earnings reports ...
  61. [61]
    Artificial Intelligence | The Associated Press
    At AP, we're exploring with AI to see how the technology might streamline news production and enhance editorial efficiency.
  62. [62]
    How AI Enhances Legal Document Review
    Feb 13, 2025 · AI streamlines the legal document review process by automating tasks like eDiscovery, document summarization, and drafting, helping lawyers ...
  63. [63]
    Four Practical Use Cases for Applying Text Summarisation to ... - Epiq
    In eDiscovery, ECA and EDA are vital steps to understand the strengths, weaknesses, and potential scope of a case. During these early phases, legal teams ...
  64. [64]
    Can Patient Record Summarization Support Quality Metric ... - NIH
    Our overall research question for this study is whether a patient record summarization tool, such as HARVEST, supports the needs of DAs during their abstraction ...
  65. [65]
    Automatically Summarizing Evidence from Clinical Trials
    We present TrialsSummarizer, a system that aims to automatically summarize evidence presented in the set of randomized controlled trials most relevant to a ...
  66. [66]
    A systematic review of automatic text summarization for biomedical ...
    Aug 2, 2021 · This review investigates biomedical text summarization, which reduces document length while preserving essence, using hybrid methods, and ...
  67. [67]
    [PDF] Automatically Generated Summaries of Video Lectures May ...
    We introduce a novel technique for automatically summarizing lecture videos using large language models such as GPT-3, and we ...
  68. [68]
    AccessiLearnAI: An Accessibility-First, AI-Powered E-Learning ...
    Adaptive Content Summarization: The platform provides on-demand text summarization at different levels (short, intermediate, or detailed) using an integrated ...
  69. [69]
    Twitter Auto-Generated Summaries: Simplifying Threads - Digilogy
    Mar 7, 2025 · Learn how Twitter's new auto-generated summaries for threads enhance content discovery and make conversations easier to digest.
  70. [70]
    [PDF] AUTOMATED CONTENT MODERATION
    Most social media companies rely on a hybrid model of automated filtering and human moderators to locate and remove undesirable content. Human moderators ...
  71. [71]
    The Automatic Creation of Literature Abstracts - Semantic Scholar
    The Automatic Creation of Literature Abstracts · H. P. Luhn · Published in IBM Journal of Research and… 1 April 1958 · Computer Science.
  72. [72]
    [PDF] Information Retrieval: The Early Years - Now Publishers
    My own biases influence the story in that most papers have an experimental flavor as opposed to ones more theoretical in nature. Chapter 2 deals with the early ...
  73. [73]
    [PDF] SCRIPTS, PLANS, AND KNOWLEDGE Roger C. Schank and ... - IJCAI
    We describe a theoretical system intended to facilitate the use of knowledge in an understanding system. The notion of script is introduced to ...
  74. [74]
    [PDF] Scripts, Plans, Goals, and Understanding - Colin Allen
    Therefore artificial intelligence (henceforth AI) has had to leave such approaches behind and become much more psychological (cf. Schank and Colby, 1973; Bobrow ...
  75. [75]
    [PDF] Automatic Text Summarization in TIPSTER - ACL Anthology
    Automatic Text Summarization was added as a major research thrust of the TIPSTER program during TIPSTER Phase III, 1996-1998. It is a natural extension of the ...
  76. [76]
    A trainable document summarizer | Proceedings of the 18th annual ...
    Julian Kupiec. Xerox Palo Alto Research Center, 3333 Coyote Hill Road ... Published: 01 July 1995.
  77. [77]
    [PDF] A trainable document summarizer - Semantic Scholar
    A trainable document summarizer. Julian Kupiec, Jan O. Pedersen, and ...
  78. [78]
    The TIPSTER SUMMAC Text Summarization Evaluation
    Automatic Text Summarization was added as a major research thrust of the TIPSTER program during TIPSTER Phase III, 1996-1998. It is a natural extension of ...
  79. [79]
    [1706.03762] Attention Is All You Need - arXiv
    Jun 12, 2017 · We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
  80. [80]
    BART: Denoising Sequence-to-Sequence Pre-training for Natural ...
    Oct 29, 2019 · We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function.
  81. [81]
    [2004.05150] Longformer: The Long-Document Transformer - arXiv
    Apr 10, 2020 · Longformer is a transformer model with a linear attention mechanism, designed to process long sequences by combining local and global attention.
  82. [82]
    [2004.14900] MLSUM: The Multilingual Summarization Corpus - arXiv
    Apr 30, 2020 · We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five ...
  83. [83]
    CTRLsum: Towards Generic Controllable Text Summarization
    Oct 12, 2025 · Our approach enables users to control multiple aspects of generated summaries by interacting with the summarization system through textual input ...
  84. [84]
    Responsible AI Considerations in Text Summarization Research
    We focus on how, which, and when responsible AI issues are covered, which relevant stakeholders are considered, and mismatches between stated and realized ...