
Computational linguistics

Computational linguistics is an interdisciplinary field that applies computational methods to the analysis of natural language, encompassing both theoretical modeling of linguistic structures and practical development of systems for language processing, understanding, and generation. It bridges linguistics, computer science, artificial intelligence, and cognitive science to analyze written and spoken language through algorithms, statistical models, and machine learning techniques. The field emerged in the mid-20th century and has evolved into a cornerstone of modern language technologies, powering applications such as machine translation, speech recognition, and chatbots. The origins of computational linguistics trace back to the 1950s, with early work focusing on machine translation and question-answering systems, influenced by Noam Chomsky's formal language theories and Warren Weaver's proposals for automated translation. From the rule-based, symbolic approaches of the 1970s and 1980s, exemplified by systems like SHRDLU, the discipline shifted to statistical methods in the 1990s, leveraging large corpora and probabilistic models for tasks such as part-of-speech tagging and machine translation. The 2010s marked a turn toward deep learning and neural networks, with breakthroughs in transformer architectures and large language models (LLMs) enabling unprecedented advances in contextual language understanding. Today, the field continues to integrate empirical, data-driven techniques with linguistic theory, addressing challenges like multilingual processing and ethical considerations in AI. Key subfields within computational linguistics include syntax and parsing, which involve algorithmic analysis of sentence structure; semantics, focusing on meaning representation and inference; and pragmatics, which models context and discourse in communication. It is closely intertwined with natural language processing (NLP), where computational linguistics provides the foundational theories for engineering robust language technologies, such as machine translation, dialogue systems, and information retrieval. Recent trends emphasize multimodal integration (combining text with images and speech) and fairness in models to mitigate biases, as seen in ongoing research on LLMs and their societal impacts. Looking ahead, the discipline is poised to advance explainable AI, robust multilingual capabilities, and human-AI collaboration, ensuring language technologies remain reliable and inclusive.

Introduction and Overview

Definition and Scope

Computational linguistics is the scientific and engineering discipline concerned with the computational modeling of natural language, employing methods from linguistics, computer science, and artificial intelligence to understand, generate, and process written and spoken language. It focuses on empirical and algorithmic approaches to linguistic phenomena, integrating theoretical insights with practical implementations to build systems capable of handling language data. As an interdisciplinary field, it draws from linguistics for foundational principles, from cognitive science for insights into language processing, and from computer science for algorithmic tools, aiming to create robust models of human language capabilities. The scope of computational linguistics encompasses core problems central to language modeling, including ambiguity resolution, which addresses structural uncertainties (such as in the sentence "She carried the groceries for Mary," interpretable as either benefiting Mary or carrying them on her behalf) and lexical ambiguities (such as multiple word senses or quantifier scopes); syntactic parsing, which involves analyzing sentence structure to determine grammatical relations; semantic interpretation, which derives meaning representations from text; and discourse analysis, which examines coherence and reference across multiple sentences or utterances. These problems highlight the field's emphasis on algorithmic solutions to the complexities of natural language, distinguishing it from pure linguistics by prioritizing testable, data-informed models over purely descriptive theory, and from broader computer science by grounding its computations in linguistic structures rather than general-purpose algorithms. Over time, the field's scope has evolved from early rule-based paradigms, which relied on hand-crafted grammars to encode linguistic rules, to data-driven approaches that leverage statistical and machine learning techniques on large corpora for more flexible and scalable language modeling. Key subfields within computational linguistics mirror the levels of linguistic analysis but adapt them to computational frameworks: morphology deals with the structure and formation of words through rules and patterns; phonology deals with the sound systems and rules governing pronunciation; syntax focuses on sentence-level organization and dependencies; semantics addresses meaning construction and representation; and pragmatics explores context-dependent language use, such as implicature and speaker intentions. This structure enables the field to tackle hierarchical aspects of language from subword units to full discourse. Computational linguistics serves as the theoretical backbone for natural language processing (NLP), a more application-oriented engineering subset concerned with deploying language technologies in real-world systems.

Relation to Linguistics and Computer Science

Computational linguistics serves as an interdisciplinary bridge between linguistics and computer science, integrating formal models of language structure with computational techniques to analyze and generate natural language. In its overlap with theoretical linguistics, the field employs formal grammars, such as context-free grammars (CFGs), to computationally model linguistic competence, the tacit knowledge speakers have of their language's grammatical rules. These grammars, introduced by Noam Chomsky, allow for the precise description of sentence structure, enabling computational systems to parse sentences and predict grammaticality in ways that mirror human linguistic intuition. The ties to computer science are evident in the emphasis on algorithmic efficiency and complexity theory, where concepts like the Chomsky hierarchy classify formal languages by their generative power and computational tractability. This hierarchy, ranging from regular languages (recognizable by finite automata) to recursively enumerable languages, informs the design of parsers and analyzers by highlighting the trade-offs between expressive power and the computational resources required for processing. Practical implementation occurs through programming paradigms, where linguistic models are encoded in software to handle tasks like syntax tree construction, ensuring scalability for large-scale language data. Computational linguistics also draws influence from cognitive science by developing simulations of human language processing, which test hypotheses about how the brain acquires and comprehends language through algorithmic approximations. These models, often inspired by Chomsky's theories of universal grammar, explore cognitive mechanisms like incremental parsing and ambiguity resolution, providing empirical grounds for theories of mental language representation. Over time, computational linguistics has evolved as a distinct subdiscipline spanning both linguistics and computer science, exemplified by the use of finite-state automata for morphological analysis, which efficiently model word-formation rules in agglutinative languages. This approach combines linguistic insights into morpheme concatenation with computer science's automata theory to create robust analyzers for tasks such as lemmatization and morphological tagging. The field's institutional maturation is marked by the founding in 1962 of what became the Association for Computational Linguistics (ACL), originally the Association for Machine Translation and Computational Linguistics, which fosters collaboration across these domains through conferences and publications.
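To make the finite-state idea concrete, the following minimal Python sketch simulates a tiny two-state morphological analyzer over a hypothetical lexicon and a single plural suffix; real systems use full finite-state transducer toolkits and far richer rule sets.

```python
# A minimal sketch of finite-state-style morphological analysis: a toy analyzer
# that segments English nouns into a stem plus an optional plural suffix.
# The lexicon and suffix are hypothetical simplifications, not a real FST.

TOY_LEXICON = {"cat", "dog", "parser", "grammar"}   # hypothetical stems
PLURAL_SUFFIX = "s"

def analyze(word: str):
    """Return (stem, suffix tags) if the word is accepted, else None."""
    # State 1: accept a bare stem.
    if word in TOY_LEXICON:
        return word, []
    # State 2: accept stem + plural suffix (one further arc in the automaton).
    if word.endswith(PLURAL_SUFFIX) and word[:-1] in TOY_LEXICON:
        return word[:-1], ["PL"]
    return None  # rejected: no accepting path

if __name__ == "__main__":
    for w in ["cats", "parser", "dogs", "xyz"]:
        print(w, "->", analyze(w))
```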

History

Origins and Early Developments

The roots of computational linguistics trace back to 17th-century philosophical projects aimed at creating universal languages to facilitate unambiguous communication and reasoning. Gottfried Wilhelm Leibniz's concept of the characteristica universalis, a formal symbolic system intended to represent all human thoughts and enable mechanical resolution of disputes through calculation, prefigured modern efforts to model language computationally by emphasizing structured, logical representations over natural language ambiguity. In the mid-20th century, the field emerged amid post-World War II advances in computing, with Claude Shannon's 1948 formulation of information theory providing a foundational framework for quantifying linguistic uncertainty. Shannon introduced entropy as a measure of information in communication systems, applying it to model natural language as a probabilistic source whose predictability (or redundancy) could be statistically analyzed, influencing early approaches to language processing and machine translation. This theoretical groundwork catalyzed practical initiatives, notably Warren Weaver's 1949 memorandum, which framed machine translation as a cryptographic problem solvable via universal logical structures and statistical methods, drawing directly on Shannon's ideas to envision automated interlingual conversion. The first dedicated forum, the 1952 Conference on Mechanical Translation at MIT organized by Yehoshua Bar-Hillel, gathered engineers, linguists, and logicians to explore computational language manipulation, marking the discipline's formal inception. Culminating these efforts, the 1954 Georgetown-IBM experiment demonstrated machine translation by converting more than 60 Russian sentences into English using a limited vocabulary of about 250 words and six grammar rules on an IBM 701 computer, sparking widespread interest despite its rudimentary scope.

Mid-20th Century Milestones

The mid-20th century marked a pivotal phase in computational linguistics, characterized by the formalization of language structures and the development of initial computational systems, building on theoretical advances such as Chomsky's formal grammar theory, which provided a framework for modeling language as computable hierarchies. A key milestone was the rise of formal language theory in the late 1950s and early 1960s, where Noam Chomsky's hierarchy classified languages by their generative power, from regular languages up to the recursively enumerable languages, enabling rigorous analysis of language computability and influencing early algorithms for syntactic processing. This theoretical foundation shifted focus from simplistic finite-state models to more expressive phrase-structure grammars, laying groundwork for practical implementations in parsing. In 1966, the Automatic Language Processing Advisory Committee (ALPAC) issued a seminal report critiquing the state of machine translation research, highlighting its high costs, limited accuracy, and failure to achieve fully automated systems despite a decade of U.S. government funding exceeding $20 million. The report concluded that machine translation was not yet viable for practical use and recommended redirecting resources toward basic linguistic research and computational tools, resulting in drastic funding cuts that nearly halted U.S. efforts in the field for two decades and prompted a paradigm shift toward more modest, knowledge-based approaches. This critique underscored the challenges of rule-based systems and emphasized the need for deeper integration of linguistics and computation. The development of early natural language understanding systems exemplified this evolving landscape, with Terry Winograd's SHRDLU program in 1970 representing a breakthrough in interactive natural language understanding within a constrained domain. SHRDLU enabled a simulated robot arm to manipulate blocks in a "blocks world" through English commands, using procedural representations to parse semantics and perform actions like "Pick up a big red block," demonstrating robust understanding of context, reference, and inference in a limited environment. Published in detail in 1972, the system highlighted the potential of combining syntactic parsing with world knowledge, influencing subsequent research in dialogue systems and knowledge representation. Institutional milestones further solidified the field's growth, including the establishment of the American Journal of Computational Linguistics in 1974 by David G. Hays, which became a primary venue for publishing advances in formal models and algorithms. Renamed Computational Linguistics in 1984, the journal's inaugural issues focused on topics like grammar formalisms and early parsing techniques, fostering a dedicated community under the Association for Computational Linguistics. By the late 1970s, computational linguistics began transitioning from purely rule-based systems to empirical approaches incorporating probability, with probabilistic parsing emerging as a technique to handle ambiguity by assigning likelihoods to parse trees based on stochastic context-free grammars. This approach, introduced in early work on statistical inference for grammars, allowed parsers to favor high-probability structures over exhaustive enumeration, improving robustness on real-world language data and setting the stage for data-driven techniques.

Theoretical Foundations

Chomsky's Influence

Noam Chomsky's work profoundly shaped computational linguistics by providing formal frameworks for modeling language as a generative system, emphasizing innate structures over purely learned associations. In his 1956 paper "Three Models for the Description of Language," Chomsky introduced a hierarchy classifying formal grammars into four types (regular, context-free, context-sensitive, and recursively enumerable), each corresponding to increasing expressive power and computational complexity. This hierarchy demonstrated that natural languages likely require at least context-free grammars for adequate description, influencing the design of early computational parsers and applications in language processing. Chomsky's 1957 book Syntactic Structures formalized generative grammar, positing that languages are generated by finite sets of rules operating over an underlying abstract system, rather than through probabilistic Markov processes alone. These rules enable the infinite productivity of language from finite means, a core principle that computational linguists adopted to build rule-based systems for syntax analysis. The work critiqued finite-state models as insufficient for capturing linguistic recursion and long-distance dependencies, paving the way for more sophisticated algorithmic implementations. Chomsky's 1959 review of B.F. Skinner's Verbal Behavior mounted a sharp critique of behaviorism, arguing that language acquisition cannot be explained solely by stimulus-response reinforcement, as it fails to account for the rapid, creative use of novel sentences by children. This led to his hypothesis of universal grammar (UG), an innate biological endowment of linguistic principles shared across humans, which constrains possible grammars and facilitates acquisition despite limited input. In computational terms, UG inspired models assuming built-in biases in learning algorithms, contrasting with purely data-driven approaches. Building on these ideas, Chomsky's transformational-generative grammar, elaborated in Aspects of the Theory of Syntax (1965), introduced transformations as rules converting deep structures (abstract representations of meaning) into surface structures for utterance. Early computational implementations, such as augmented transition network (ATN) parsers, drew directly from these transformations to handle syntactic derivations efficiently. Chomsky's rationalist perspective, emphasizing innate knowledge, sparked ongoing debates in computational linguistics against empiricist views that prioritize data-driven learning from corpora. His frameworks underscored the need for theories balancing internal structure with external evidence, influencing hybrid models in language acquisition simulations.

Language Acquisition Models

Computational linguistics explores language acquisition through models that simulate how learners infer grammatical structures from limited input, often building on assumptions like Chomsky's universal grammar as an innate starting point for parameter setting. Connectionist models, inspired by neural processing in the brain, use parallel distributed processing (PDP) networks to mimic incremental child learning without explicit rules. In the seminal PDP framework, Rumelhart and McClelland demonstrated how multi-layer neural networks can acquire English past-tense forms by adjusting connection weights based on exposure to stem and past-tense pairs, capturing overgeneralization errors like "goed" seen in children. This approach emphasizes emergent grammatical knowledge from statistical patterns in input, influencing later recurrent network models for sequence learning in syntax acquisition. Bayesian models frame language acquisition as probabilistic inference, where learners induce grammars by updating hypotheses over possible structures given observed data. These models employ priors to favor simpler or more general grammars, enabling grammar induction from ambiguous input; for instance, Chen's 1995 algorithm uses greedy search in a Bayesian framework to learn probabilistic context-free grammars that outperform n-gram baselines on corpus data. In acquisition contexts, such approaches simulate how children resolve syntactic ambiguities, as in Perfors et al.'s work on hierarchical structure learning via Bayesian inference over tree hypotheses. Usage-based theories are implemented computationally by treating linguistic knowledge as an inventory of form-meaning pairings learned incrementally from usage. These models extract constructions (stored patterns like "the X-er the Y-er") through frequency-based generalization, as in Dunn's usage-based approach integrating unsupervised techniques to build construction inventories from corpora. Computational implementations, such as those reviewed by Doumen et al., demonstrate incremental learning in which novel utterances are parsed against existing constructions and new ones are abstracted via generalization, aligning with child overextension patterns. A central challenge in these models is the poverty of the stimulus, where input data sparsity hinders learning of rare or abstract structures, such as auxiliary fronting in questions. Computational simulations show that without strong inductive biases, models underperform on low-frequency phenomena, mirroring debates on whether innate constraints or rich statistical cues suffice in data-sparse environments. Recent work as of 2024 examines how large language models (LLMs) perform on poverty-of-the-stimulus tasks, often succeeding through scale but revealing limitations in generalizing hierarchical structures without explicit biases, thus testing nativist claims in modern computational settings. Key experiments include the MOSAIC system (Model of Syntax Acquisition in Children), a performance-limited model that segments morphemes and induces syntax via distributional analysis of child-directed speech. Developed in the early 2000s and refined in later versions, MOSAIC simulates early errors like optional infinitives by prioritizing recent input chunks, accurately replicating cross-linguistic patterns in English and Dutch verb marking without predefined rules.
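The Bayesian framing above can be illustrated with a toy sketch: two hypothetical grammar hypotheses, with made-up prior and likelihood values, are compared by their posterior probabilities as evidence accumulates. This is an illustration of the inference pattern only, not any published acquisition model.

```python
# A minimal sketch of Bayesian grammar selection: a learner combines a prior over
# two hypothetical grammars with the likelihood of observed utterance "patterns".
# All names and probability values are illustrative assumptions.
import math

GRAMMARS = {
    "linear":       {"prior": 0.6, "likelihood": {"aux-front-simple": 0.9, "aux-front-embedded": 0.1}},
    "hierarchical": {"prior": 0.4, "likelihood": {"aux-front-simple": 0.7, "aux-front-embedded": 0.7}},
}

def log_posterior(grammar, observations):
    """Unnormalized log posterior: log prior + sum of log likelihoods."""
    g = GRAMMARS[grammar]
    return math.log(g["prior"]) + sum(math.log(g["likelihood"][o]) for o in observations)

def posterior(observations):
    logs = {name: log_posterior(name, observations) for name in GRAMMARS}
    z = sum(math.exp(v) for v in logs.values())          # normalizing constant
    return {name: math.exp(v) / z for name, v in logs.items()}

if __name__ == "__main__":
    # After several embedded auxiliary-fronting examples, the hierarchical grammar
    # overtakes the simpler hypothesis despite its lower prior.
    print(posterior(["aux-front-simple"]))
    print(posterior(["aux-front-simple"] + ["aux-front-embedded"] * 3))
```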

Core Methods and Resources

Annotated Corpora and Data

Annotated corpora in computational linguistics consist of linguistic data, such as text or speech, systematically enriched with interpretive layers to facilitate analysis, model training, and evaluation of language processing systems. These resources typically include annotations for syntactic structure, semantics, coreference, or discourse relations, enabling the study of language patterns and the development of algorithms that mimic human language understanding. Treebanks represent a primary type, providing hierarchical syntactic annotations; for instance, the Penn Treebank, released in 1993, contains over 4.5 million words from sources like the Wall Street Journal, tagged with part-of-speech labels and phrase structure trees using a standardized 36-tag set. Semantic corpora, such as FrameNet, developed since 1997, annotate sentences with frame-semantic roles, linking lexical units to predefined conceptual frames derived from Fillmore's frame semantics theory, covering over 1,200 frames across approximately 200,000 annotated sentences. The annotation process involves creating detailed guidelines to ensure consistency and reliability across annotators. For part-of-speech tagging, guidelines specify tag assignments based on contextual and morphological cues, as in the Penn Treebank's scheme that distinguishes nuances like proper nouns (NNP) from common nouns (NN). Dependency parsing annotations follow frameworks like Universal Dependencies, which define 37 universal relation types (e.g., nsubj for nominal subjects) and emphasize content words as heads, with guidelines promoting cross-linguistic consistency through single-head trees in which each word depends on exactly one head. Coreference resolution annotations, as in the OntoNotes corpus (version 5.0, 2013), mark entities and their co-referring mentions, evaluated with entity- and mention-level metrics, with guidelines addressing apposition, predication, and generic mentions to capture discourse-level links. These processes often require multiple annotation passes, with adjudication by experts to resolve discrepancies. Challenges in creating annotated corpora include achieving high inter-annotator agreement, typically measured by Cohen's kappa, which accounts for chance agreement; this is particularly difficult for semantic tasks due to subjective interpretations. Scalability poses particular issues for low-resource languages, where limited numbers of native speakers, orthographic variability, and a lack of expert annotators hinder corpus development, resulting in datasets often under 10,000 sentences compared to millions of words for high-resource languages like English. The evolution of annotation has shifted from labor-intensive manual efforts by trained linguists in early corpora like the Penn Treebank to more efficient crowdsourced and semi-automated approaches. Crowdsourcing platforms, as evaluated in studies of non-expert annotation, enable rapid scaling by distributing tasks to lay annotators with quality controls like majority voting. Semi-automated methods pre-annotate data using preliminary models (e.g., initial parsers) for human correction, maintaining annotation quality through iterative refinement. These corpora have profoundly impacted computational linguistics by enabling supervised learning paradigms, in which annotated examples train models to generalize linguistic patterns. For example, the Switchboard Dialog Act corpus, comprising 1,155 telephone conversations with roughly 205,000 utterances annotated for dialog acts (e.g., statements, questions), has supported the training of dialogue systems and speech recognition models since its 1997 release.
Such resources provide gold-standard training data for tasks like syntactic parsing, where treebanks directly inform supervised parsers that achieve attachment accuracies exceeding 90% on held-out test sets.
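Since inter-annotator agreement is central to corpus quality, the following minimal Python sketch computes Cohen's kappa over two hypothetical POS-tagging passes; production pipelines typically rely on established statistics libraries rather than hand-rolled code.

```python
# A minimal sketch of Cohen's kappa for inter-annotator agreement on POS tags.
# The two annotation sequences below are hypothetical examples.
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Chance agreement from each annotator's marginal label distribution.
    c1, c2 = Counter(ann1), Counter(ann2)
    labels = set(c1) | set(c2)
    expected = sum((c1[l] / n) * (c2[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    a1 = ["NN", "VB", "NN", "DT", "NN", "JJ"]
    a2 = ["NN", "VB", "NNP", "DT", "NN", "JJ"]
    print(round(cohens_kappa(a1, a2), 3))
```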

Parsing and Syntactic Analysis

Parsing and syntactic analysis in computational linguistics involves the computational modeling of sentence structure to identify hierarchical or relational dependencies among words, enabling deeper understanding of grammatical organization. This process typically employs formal grammars and algorithms to resolve the syntactic structure of input sentences, distinguishing between phrase-level groupings (constituency) and word-to-word relations (dependency). Early approaches relied on rule-based systems derived from linguistic theories, while modern implementations often integrate statistical training to handle real-world variability. These techniques form the backbone of many language processing tasks, providing structured representations essential for further semantic interpretation. Constituency parsing aims to divide a sentence into nested constituents, such as noun phrases or verb phrases, based on context-free grammars (CFGs). A foundational algorithm for this task is the Cocke-Younger-Kasami (CKY) parser, developed in the 1960s, which efficiently recognizes whether a string belongs to the language generated by a CFG in Chomsky normal form. The CKY algorithm uses dynamic programming to build a triangular chart, filling cells bottom-up to combine subspans into larger constituents, achieving a time complexity of O(n^3) for a sentence of length n. This cubic complexity arises from considering all possible span lengths, starting positions, and split points, making it suitable for sentences up to moderate lengths but computationally intensive for longer ones. Dependency parsing, in contrast, models structure as a set of directed dependencies between words, emphasizing head-dependent relations without intermediate phrasal nodes. Transition-based models simulate the parsing process incrementally, using a stack and a buffer to build the tree through a sequence of actions like shift, left-arc, and right-arc. The arc-standard system, introduced by Nivre in the early 2000s, exemplifies this approach by processing the sentence left-to-right in a single pass, attaching dependents to heads via two arc transitions that ensure projective trees. This linear-time method facilitates easy integration with machine-learned classifiers for action prediction, though it requires careful oracle design to handle non-projectivity in extensions. Feature structures provide a mechanism to represent complex syntactic information, such as agreement, case, and lexical properties, in a declarative framework. In Head-Driven Phrase Structure Grammar (HPSG), developed by Pollard and Sag in the late 1980s and formalized in their subsequent work, unification serves as the core operation to merge compatible feature structures during parsing. Unification succeeds if structures are compatible, combining attributes like part-of-speech and agreement into a single representation, or fails otherwise, enforcing grammatical constraints without explicit rule ordering. This typed feature structure approach allows HPSG parsers to handle intricate phenomena like long-distance dependencies through lexical inheritance and structure-sharing. Evaluation of syntactic parsers relies on metrics that compare predicted structures against gold-standard annotations, focusing on boundary and label accuracy. The PARSEVAL measures, proposed by Black et al. in 1991, compute precision and recall for bracketed constituents by aligning spans and ignoring punctuation or empty categories, with the F1-score as their harmonic mean. For dependency parsing, unlabeled attachment score (UAS) and labeled attachment score (LAS) assess arc correctness, often exceeding 90% on standard benchmarks. These metrics prioritize exact matches for constituents or arcs, revealing parser robustness to structural variation.
Natural language sentences often exhibit structural ambiguity, where multiple parses fit the input, necessitating efficient disambiguation strategies. Chart parsing addresses this by maintaining a shared representation of partial parses in a chart, avoiding redundant computation across alternative derivations, as in the bottom-up CKY algorithm or top-down Earley variants. To manage exponential ambiguity in practice, beam search prunes low-probability paths during parsing, retaining only the top-k hypotheses at each step based on scores from probabilistic models. This heuristic reduces the search space while approximating the maximum-likelihood parse, commonly achieving near-optimal accuracy with beam widths of 5-10 in statistical parsers. These methods rely on annotated corpora like the Penn Treebank to train their probabilistic components.
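The chart-based dynamic programming described above can be illustrated with a minimal CKY recognizer in Python; the grammar and lexicon below are toy examples in Chomsky normal form, and a full parser would additionally store backpointers to recover trees and probabilities to rank them.

```python
# A minimal sketch of CKY recognition over a toy CNF grammar.
# The grammar, lexicon, and sentences are hypothetical illustrations.
from itertools import product

BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICON = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}

def cky_recognize(words, start="S"):
    n = len(words)
    # chart[i][j] holds the nonterminals spanning words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(w, set()))
    for span in range(2, n + 1):                 # span widths 2..n
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # split point
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((b, c), set())
    return start in chart[0][n]

if __name__ == "__main__":
    print(cky_recognize("the dog chased the cat".split()))   # True
    print(cky_recognize("chased the the dog".split()))       # False
```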

Modern Techniques

Statistical and Probabilistic Approaches

The advent of statistical and probabilistic approaches in computational linguistics marked a shift from rule-based systems to data-driven methods, leveraging large corpora to model language patterns empirically. These techniques treat language as a probabilistic process, estimating the likelihood of linguistic structures based on observed frequencies in text corpora. Pioneered in the late 1980s and 1990s by researchers at IBM, this framework emphasized probability estimation and smoothing to handle the sparsity and variability of natural language. N-gram models form a foundational component of statistical language modeling, approximating the probability of a word by conditioning on a fixed window of preceding words. In a unigram model, the probability of a word w_i is simply P(w_i), independent of context, while a bigram model estimates P(w_i | w_{i-1}) as the count of the pair (w_{i-1}, w_i) divided by the count of w_{i-1}. Higher-order n-grams extend this to P(w_i | w_{i-n+1} \dots w_{i-1}), enabling prediction tasks like speech recognition and text generation. To address data sparsity, where unseen n-grams yield zero probabilities, smoothing techniques such as Laplace (add-one) smoothing adjust counts by adding a small constant to numerators and denominators, ensuring non-zero estimates for all combinations. Hidden Markov Models (HMMs) extend probabilistic modeling to sequential labeling tasks, such as part-of-speech (POS) tagging, by representing words as observations emitted from hidden states corresponding to POS tags. An HMM defines transition probabilities between tags P(t_i | t_{i-1}) and emission probabilities P(w_i | t_i), both estimated from annotated corpora via maximum likelihood. The Viterbi algorithm efficiently decodes the most likely tag sequence for a given sentence by dynamic programming, maximizing the joint probability P(W, T) = \prod P(t_i | t_{i-1}) P(w_i | t_i). This approach achieved robust performance on unrestricted text, with error rates around 3-5% on English corpora in early implementations. A key evaluation metric for language models is perplexity, which quantifies predictive uncertainty as PP(W) = 2^{H(p)}, where H(p) is the cross-entropy H(p) = -\frac{1}{N} \sum_{i=1}^N \log_2 p(w_i | w_{i-n+1} \dots w_{i-1}) over a sequence W of length N. Lower perplexity indicates better modeling of the data distribution. The noisy-channel model provided an early probabilistic framework for machine translation, positing that a target sentence is a "noisy" version of the source, decoded by maximizing P(T|S) \propto P(S|T) P(T), where P(S|T) is a translation model and P(T) is a language model. This inspired the IBM Models 1 through 5, developed in the early 1990s, which progressively incorporated alignment probabilities, fertility, and reordering, with Model 1 using uniform alignments and expectation-maximization for parameter estimation. These models laid the groundwork for statistical machine translation systems, achieving promising results on French-English translation tasks by the mid-1990s. The empirical success of these approaches was aided by resources like the Brown Corpus, a 1-million-word tagged collection of 1960s American English texts, which enabled training and evaluation of statistical taggers and models. In a landmark 1992 study, Brown et al. showed that class-based n-gram models, which cluster words into classes, yielded modest perplexity improvements of approximately 3% over traditional word-based n-grams when interpolated with them, highlighting the viability of probabilistic methods for broad-coverage language processing.
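As an illustration of these estimates, the following minimal Python sketch builds a bigram model with Laplace (add-one) smoothing from a tiny hypothetical corpus and computes perplexity with base-2 logarithms; real language models are trained on far larger corpora with more sophisticated smoothing.

```python
# A minimal sketch of a smoothed bigram language model and perplexity evaluation.
# The toy training sentences are hypothetical examples.
import math
from collections import Counter

train = ["<s> the cat sat </s>", "<s> the dog sat </s>", "<s> the cat ran </s>"]
tokens = [s.split() for s in train]
vocab = {w for sent in tokens for w in sent}
V = len(vocab)

unigrams = Counter(w for sent in tokens for w in sent[:-1])          # history counts
bigrams = Counter((a, b) for sent in tokens for a, b in zip(sent, sent[1:]))

def p_bigram(prev, word):
    """Add-one estimate P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def perplexity(sentence):
    words = sentence.split()
    log_prob = sum(math.log2(p_bigram(a, b)) for a, b in zip(words, words[1:]))
    n = len(words) - 1                       # number of predicted tokens
    return 2 ** (-log_prob / n)

if __name__ == "__main__":
    print(round(perplexity("<s> the cat sat </s>"), 2))   # seen pattern: lower perplexity
    print(round(perplexity("<s> the dog ran </s>"), 2))   # unseen bigram: higher perplexity
```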

Neural Networks and Deep Learning

The integration of neural networks and deep learning into computational linguistics since the 2010s has revolutionized language processing by enabling end-to-end learning from raw text data, surpassing traditional feature-engineered approaches. Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) units, emerged as foundational tools for modeling sequential dependencies in language tasks such as language modeling and machine translation. Introduced by Hochreiter and Schmidhuber in 1997, LSTMs address the vanishing gradient problem in standard RNNs through gating mechanisms that regulate information flow, allowing effective capture of long-range dependencies in sequences up to thousands of timesteps. Their application in machine translation gained prominence with sequence-to-sequence models, where LSTMs encode input sentences into fixed-dimensional representations and decode them into outputs, achieving state-of-the-art results on benchmarks like WMT 2014 English-to-French with a BLEU score of 34.8. A key advancement in neural representations for language was the development of dense word embeddings, which map words to low-dimensional vectors preserving semantic and syntactic similarities. Word2Vec, proposed by Mikolov et al. in 2013, learns these embeddings via skip-gram or continuous bag-of-words models trained on large corpora, enabling arithmetic operations like "king - man + woman ≈ queen" to reflect analogies. Complementing this, GloVe (Global Vectors) by Pennington et al. in 2014 constructs embeddings by factorizing global word co-occurrence matrices, outperforming Word2Vec on word similarity tasks such as WordSim-353 with a Spearman correlation of 0.76. These static embeddings provided a robust foundation for downstream neural models in computational linguistics. The Transformer architecture, introduced by Vaswani et al. in 2017, marked a paradigm shift by replacing recurrence with attention mechanisms, enabling parallelizable training and better handling of long sequences. Central to Transformers is the self-attention mechanism, which computes weighted representations of input tokens relative to each other: \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V where Q, K, and V are query, key, and value projections of the input, and d_k is the dimension of the keys; this formulation scales quadratically with sequence length n in time and memory complexity, O(n^2 d), limiting efficiency for very long inputs but excelling at capturing global dependencies. Transformers achieved superior performance on tasks like English-to-German translation, attaining a BLEU score of 28.4 on WMT 2014, surpassing prior RNN-based systems. Building on Transformers, pre-trained models like BERT (Bidirectional Encoder Representations from Transformers), developed by Devlin et al. in 2018, introduced contextual embeddings through masked language modeling, where the model predicts randomly masked tokens in bidirectional context during pre-training on corpora like BooksCorpus and English Wikipedia. Fine-tuned BERT variants set new benchmarks, such as a score of 80.5 on the GLUE benchmark, by leveraging transfer learning from unlabeled data to diverse linguistic applications. In multilingual computational linguistics, mBERT extends BERT to 104 languages via joint pre-training, enabling cross-lingual transfer in which models trained on high-resource languages like English perform effectively on low-resource ones, as demonstrated by zero-shot accuracies of roughly 65-80% on tasks like XNLI across 15 languages. Subsequent developments have focused on scaling Transformer-based models to billions of parameters, leading to large language models (LLMs) such as the GPT series. GPT-3, introduced by Brown et al.
in 2020, demonstrated emergent few-shot abilities on a variety of tasks through in-context prompting, without task-specific fine-tuning. As of 2025, advancements include decoder-only architectures, mixture-of-experts models for efficiency, and improved pre-training on diverse multilingual data, further enhancing capabilities in generation, reasoning, and low-resource languages.
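The scaled dot-product attention formula above can be written directly in a few lines of NumPy; the dimensions, random projections, and token embeddings in this sketch are illustrative stand-ins rather than a trained Transformer layer.

```python
# A minimal sketch of scaled dot-product self-attention for one toy sequence.
# All dimensions and weights below are hypothetical, untrained examples.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) pairwise token affinities
    return softmax(scores) @ V               # weighted mix of value vectors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d_model, d_k = 4, 8, 8                # 4 tokens, toy dimensions
    X = rng.normal(size=(n, d_model))        # stand-in token embeddings
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)   # (4, 8)
```

Note that the (n, n) score matrix is where the quadratic O(n^2 d) cost mentioned above arises.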

Applications

Machine Translation and Generation

Machine translation (MT) involves computational systems that automatically convert text from one language to another, a core application in computational linguistics that bridges linguistic theory with practical language processing. Early approaches relied on explicit linguistic rules, while later paradigms shifted to data-driven methods leveraging large corpora for training. These systems have evolved from rigid rule sets to probabilistic models and now to neural architectures, enabling more fluent and context-aware translations. Despite advances, challenges persist in handling ambiguity, cultural nuances, and low-resource languages. Rule-based machine translation (RBMT) systems form the foundational paradigm, employing hand-crafted linguistic rules to analyze source text, transfer structures, and generate target output. Transfer models directly map source language structures to target equivalents using bilingual rules for morphology, syntax, and semantics, often requiring language-pair-specific components that limit scalability to multiple languages. To address this, interlingua representations introduce a language-neutral intermediate layer that abstracts meaning into a universal form, allowing pivot translations across pairs without exhaustive rule sets for each. Seminal work on interlingua dates to early proposals for logical formalization in mechanical translation, emphasizing semantic preservation over surface forms. Systems like UNITRAN exemplified this by using principle-based rules for interlingual pivoting, influencing multilingual efforts. Statistical machine translation (SMT) marked a paradigm shift by treating translation as a probabilistic modeling task, using parallel corpora to infer alignments and generate outputs without explicit rules. Phrase-based systems, a dominant SMT variant, segment source text into phrases rather than words, learning translation probabilities, reordering models, and language model scores from aligned bilingual data. Alignment identifies correspondences between source and target phrases via expectation-maximization algorithms, while decoding searches for the highest-probability output sequence using heuristics like beam search. The Moses toolkit, released in 2007, standardized phrase-based SMT implementation, supporting factored models for linguistic features and achieving competitive performance on benchmarks like the Europarl corpora. SMT's reliance on parallel data enabled broader coverage but struggled with long-range dependencies and fluency. Neural machine translation (NMT) revolutionized MT through end-to-end learning with deep neural networks, surpassing SMT in fluency and adequacy for many language pairs. Sequence-to-sequence (Seq2Seq) models, introduced in 2014, employ an encoder-decoder architecture where the encoder (typically an RNN or LSTM) compresses the source sentence into a fixed context vector, and the decoder generates the target autoregressively. This framework, applied to MT, learns direct mappings from input to output, incorporating attention mechanisms in later variants to weigh relevant source parts dynamically. Sutskever et al. demonstrated Seq2Seq's efficacy on English-to-French translation, achieving improvements over phrase-based baselines by capturing global context. NMT's data-hungry nature benefits from large parallel corpora, though it requires substantial computational resources for training. Subsequent advances, such as Transformer architectures introduced in 2017, have further improved performance by enabling parallelization and better handling of long dependencies, powering production systems such as Google Translate as of 2025.
Evaluation of MT systems prioritizes automatic metrics correlating with human judgments of adequacy and fluency. The BLEU score, proposed in 2002, quantifies translation quality via n-gram overlap between machine output and reference translations, modified by a brevity penalty to discourage short outputs. Formally, \text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right), where p_n is the modified n-gram precision, BP the brevity penalty, w_n weights (often 1/N), and N typically 4; it established a standard for comparing systems, though critics note its limitations in capturing semantic fidelity. Text generation in computational linguistics extends MT principles to producing new text, such as summaries or dialogues, often adapting large language models for linguistic tasks. GPT-like models, autoregressive transformers pretrained on vast corpora, generate coherent sequences by predicting tokens conditioned on prior context, but face challenges in maintaining long-term fluency and coherence. Repetition, factual inconsistencies, and topic drift arise in part from exposure bias in training, where models optimize next-token likelihood without global planning. Adaptations for generation incorporate linguistic constraints, like syntactic trees, to enhance grammaticality, yet evaluating generated text remains subjective, relying on metrics like perplexity or human assessments. The seminal GPT-3 scaled this approach to 175 billion parameters, enabling few-shot generation but amplifying hallucination risks in specialized linguistic applications. Parallel corpora aid training by providing aligned examples for controlled generation. Since 2020, larger models like GPT-4 and open-source alternatives (e.g., the Llama series) have advanced capabilities in creative and factual generation, integrated into applications like chatbots and writing tools as of 2025.
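The BLEU formula can be illustrated with a minimal sentence-level sketch in Python that computes clipped (modified) n-gram precisions and a brevity penalty; established toolkits add smoothing and corpus-level aggregation, and the example sentences here are hypothetical.

```python
# A minimal sketch of sentence-level BLEU following the formula above.
# Real evaluations use smoothed, corpus-level implementations.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clip candidate counts by reference counts (modified precision).
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        p_n = max(overlap / total, 1e-9)      # floor to avoid log(0) on tiny examples
        log_precisions.append(math.log(p_n))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

if __name__ == "__main__":
    ref = "the cat is on the mat"
    print(round(bleu("the cat is on the mat", ref), 3))   # 1.0 for an exact match
    print(round(bleu("the cat sat on a mat", ref), 3))    # much lower score
```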

Information Retrieval and Question Answering

Information retrieval (IR) in computational linguistics focuses on developing algorithms to efficiently search large text collections and rank documents relevant to a user's query, leveraging linguistic structures for improved accuracy. A cornerstone of classical IR is the vector space model (VSM), which treats texts as points in a high-dimensional space where each dimension represents a unique term from the corpus vocabulary. Queries and documents are represented as vectors, with relevance determined by the cosine similarity between them, allowing for a geometric interpretation of semantic proximity. This model, introduced by Salton et al. in 1975, revolutionized automated indexing and retrieval by enabling scalable similarity computations without relying on exact keyword matches. Central to the VSM's effectiveness is the TF-IDF weighting scheme, which assigns importance scores to terms based on their frequency within a document and rarity across the entire collection. Term frequency (TF) quantifies how often a term appears in a specific document, emphasizing content density, while inverse document frequency (IDF) penalizes ubiquitous terms like "the" by taking the logarithm of the inverse ratio of documents containing the term. Proposed by Spärck Jones in 1972 as a measure of term specificity, TF-IDF enhances ranking by prioritizing discriminative vocabulary, as demonstrated in early experiments on bibliographic datasets where it achieved significant improvements in precision at top ranks compared to unweighted or frequency-only baselines. For instance, on collections of scientific abstracts, TF-IDF-weighted VSM improved retrieval effectiveness substantially. Semantic search advances IR by integrating linguistic meaning beyond surface terms, often using dense vector representations from neural embeddings to capture contextual and synonymous relationships. These embeddings project words or documents into a low-dimensional space where proximity reflects semantic relatedness, enabling matches for paraphrased queries like associating "purchase ticket" with "buy pass." Neural embedding techniques, such as those from skip-gram models, have been adapted for IR to improve query expansion and document reranking, with studies showing 10-20% gains in mean average precision on TREC benchmarks when replacing sparse TF-IDF vectors with embedding-based ones. Question answering (QA) systems extend IR by not only retrieving relevant texts but also extracting or generating precise answers to queries, often in open-domain settings without predefined knowledge bases. The Stanford Question Answering Dataset (SQuAD), released in 2016, provides a foundational benchmark with over 100,000 crowd-sourced questions on Wikipedia passages, each paired with an exact answer span, facilitating evaluation of extractive QA models. SQuAD's scale, nearly two orders of magnitude larger than prior datasets, has driven progress in models that jointly parse context and questions, with human performance at 82.3% exact match (EM) and 91.2% F1 score. While human performance initially served as the upper bound, advanced transformer-based models have since surpassed these scores, with top results exceeding 90% EM and 95% F1 as of 2021. A seminal open-domain QA approach is DrQA, introduced by Chen et al. in 2017, which pipelines coarse retrieval from Wikipedia using TF-IDF-matched candidates with a fine-grained neural reader for answer extraction. The reader employs bidirectional LSTMs to encode question-passage pairs and predict answer spans via pointer networks, achieving 69.5% F1 on SQuAD (for the reader component) and competitive results on TriviaQA without external training data.
This retrieval-reading paradigm has influenced subsequent systems by balancing efficiency and accuracy in large-scale corpora. In conversational AI, dialog systems employ intent recognition to classify user goals (e.g., "reserve hotel") and slot-filling to populate structured attributes (e.g., "check-in date: 2025-11-10"). These tasks are often handled jointly to leverage shared linguistic cues, as in attention-based RNN models from 2016 that encode utterances for simultaneous intent and slot labeling, outperforming cascaded pipelines by a small margin (approximately 0.5-1% in accuracy and F1) on the ATIS dataset. Intent detection typically frames the problem as classification over predefined categories using softmax over embedding projections, while slot-filling applies BIO tagging to token sequences. Surveys highlight over 20 public datasets like MultiWOZ for task-oriented dialog, underscoring the need for multilingual and multi-domain robustness; recent large language models have further advanced end-to-end dialog handling as of 2025.
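The TF-IDF weighting and cosine ranking described above can be sketched in a few lines of Python; the three-document collection and query are hypothetical, and production systems use inverted indexes and normalized weighting variants.

```python
# A minimal sketch of TF-IDF vectors and cosine-similarity ranking in a vector
# space model over a tiny hypothetical document collection.
import math
from collections import Counter

docs = {
    "d1": "statistical machine translation of text",
    "d2": "neural machine translation with attention",
    "d3": "syntactic parsing of natural language text",
}

def tf_idf_vectors(documents):
    tokenized = {d: doc.split() for d, doc in documents.items()}
    n_docs = len(documents)
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    idf = {t: math.log(n_docs / df[t]) for t in df}
    vectors = {d: {t: tf * idf[t] for t, tf in Counter(toks).items()}
               for d, toks in tokenized.items()}
    return vectors, idf

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    vectors, idf = tf_idf_vectors(docs)
    query_vec = {t: idf.get(t, 0.0) for t in "machine translation".split()}
    ranking = sorted(docs, key=lambda d: cosine(query_vec, vectors[d]), reverse=True)
    print(ranking)   # documents about machine translation rank ahead of d3
```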

Challenges and Future Directions

Evaluation Metrics and Benchmarks

Evaluation in computational linguistics relies on standardized metrics and benchmarks to assess the performance of language processing systems, ensuring comparability across models and tasks. Intrinsic metrics evaluate specific components of a system in isolation, providing direct measures of accuracy for subtasks such as part-of-speech (POS) tagging, syntactic parsing, and text summarization. For POS tagging and parsing, the F1-score is widely used, balancing precision (the proportion of predicted tags or parses that are correct) and recall (the proportion of gold-standard tags or parses that are correctly identified) through the harmonic mean formula F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}. This metric is particularly effective for imbalanced datasets common in linguistic annotation, where rare syntactic structures might otherwise skew results. In text summarization, the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) family of metrics measures overlap between generated summaries and reference texts using n-gram matches, with variants like ROUGE-1 (unigrams) and ROUGE-L (longest common subsequence) capturing lexical and structural similarity. Extrinsic metrics, in contrast, assess the impact of linguistic components on downstream tasks, often incorporating human judgments to evaluate overall utility. For instance, in machine translation, adequacy scores from human evaluators rate how well a translation conveys the meaning of the source text on a fixed scale (e.g., 0-5), complementing fluency assessments to gauge real-world effectiveness. These task-specific evaluations reveal how well core linguistic analyses contribute to end-to-end performance, though they are resource-intensive due to the need for human annotators. Benchmarks provide unified platforms for comparing systems across multiple tasks, fostering progress in natural language understanding (NLU). The GLUE benchmark, introduced in 2018, aggregates nine diverse NLU tasks, including sentiment analysis, natural language inference, and paraphrase detection, using a composite score to rank models on their generalization ability. Building on this, SuperGLUE (2019) escalates difficulty with eight more challenging tasks, emphasizing reasoning and coreference resolution to better test advanced models. For syntactic tasks, the Conference on Computational Natural Language Learning (CoNLL) shared tasks, held annually since 1999, standardize evaluations like dependency parsing through datasets in multiple languages, reporting metrics such as unlabeled attachment score (UAS) and labeled attachment score (LAS) to measure parse accuracy. Despite their utility, these metrics and benchmarks face significant limitations. Many automatic metrics exhibit low correlation with human judgments, particularly for nuanced tasks like summarization, where surface-level n-gram overlaps fail to capture semantic adequacy or coherence. Additionally, systems often falter on adversarial examples, subtly perturbed inputs designed to exploit model weaknesses, highlighting gaps in real-world robustness, as seen in NLU benchmarks where top models drop substantially in accuracy under such attacks. Trends toward multilingual evaluation have continued to address English-centric biases, with foundational benchmarks like XTREME (2020) testing cross-lingual transfer across 40 languages and nine tasks, including POS tagging and natural language inference.
As of 2025, evaluation has evolved significantly with the rise of large language models (LLMs), incorporating comprehensive frameworks such as the Holistic Evaluation of Language Models (HELM, 2022), which assesses models across multiple metrics including accuracy, fairness, and robustness, and the Beyond the Imitation Game benchmark (BIG-bench, 2022), featuring over 200 diverse tasks to probe emergent abilities. Recent developments, including benchmark papers at venues such as Findings of the ACL 2025 and the NeurIPS 2024 Datasets & Benchmarks track, further emphasize standardized evaluations for LLM safety, efficiency, and multilingual capabilities.
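As a small illustration of the intrinsic metrics above, the following Python sketch computes precision, recall, and F1 over predicted versus gold constituent brackets in the spirit of PARSEVAL-style scoring; the bracket sets are hypothetical examples.

```python
# A minimal sketch of precision, recall, and F1 over labeled constituent spans.
# The gold and predicted bracket sets below are hypothetical.
def precision_recall_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    # Brackets as (label, start, end) spans over a 5-word sentence.
    gold = [("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5), ("S", 0, 5)]
    pred = [("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5), ("S", 0, 5)]
    p, r, f = precision_recall_f1(pred, gold)
    print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```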

Ethical Considerations and Bias

Computational linguistics faces significant ethical challenges arising from biases embedded in training data and models, which can perpetuate societal inequalities. Underrepresentation of certain demographic groups in corpora often leads to skewed representations, such as gender stereotypes in word embeddings where terms like "computer programmer" are more closely associated with male terms than female ones due to historical imbalances in text sources like news articles. These biases stem from corpora reflecting real-world disparities, amplifying demographic skews in downstream applications like coreference resolution or machine translation. To evaluate and address these issues, researchers employ fairness metrics tailored to language tasks, including demographic parity, which requires equal positive prediction rates across protected groups (e.g., gender), and equalized odds, which ensures comparable true positive and false positive rates between groups conditional on the true label. In NLP contexts, these metrics are applied to assess disparities in tasks like toxicity detection, where models may disproportionately flag content from minority dialects. Privacy concerns further complicate ethical practice, particularly in data collection and annotation, where sensitive user information from social media or health records can be inadvertently memorized and extracted via attacks like membership inference, risking breaches of regulations such as the GDPR. Annotation processes involving crowdsourced labor also raise issues of consent and re-identification when labeling personal narratives. Debiasing techniques have emerged to mitigate these problems, such as adversarial training, where a discriminator is trained alongside the main model to remove sensitive attribute signals from representations, as demonstrated in text classification and embedding models to reduce gender bias propagation. Counterfactual data augmentation complements this by generating synthetic examples that alter protected attributes while preserving semantics, such as swapping gendered pronouns in sentences to balance training sets and lessen stereotypes in language generation. Neural models can amplify these biases during training, exacerbating underrepresentation effects in low-resource languages. Broader impacts include the field's role in misinformation detection, where biased models may fail to equitably identify false claims across cultural contexts, and the push for inclusive AI design influenced by post-2020 regulations like the EU AI Act (Regulation (EU) 2024/1689), which requires risk management and conformity assessments, including measures to minimize biases, for high-risk AI systems to promote transparency and equity. Looking ahead to 2025 and beyond, future directions in computational linguistics emphasize developing explainable models and robust evaluation frameworks that align more closely with human judgments, particularly for LLMs, while addressing emerging ethical challenges such as safety alignment to prevent harmful outputs, the environmental cost of large-scale training, and the global implementation of regulations like the EU AI Act's phased requirements for high-risk systems (fully applicable by 2027). Ongoing research, as highlighted in 2025 tutorials, focuses on integrating privacy-preserving techniques and fairness considerations in multilingual and cross-cultural applications to ensure equitable and responsible language technologies.
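The two group-fairness metrics named above can be computed directly from model outputs; the following Python sketch does so for hypothetical toxicity-classifier predictions, labels, and group memberships, as an illustration rather than a complete auditing procedure.

```python
# A minimal sketch of demographic parity and equalized odds gaps over hypothetical
# binary predictions, gold labels, and protected-group memberships.
def rate(values):
    return sum(values) / len(values) if values else 0.0

def demographic_parity_gap(preds, groups):
    """Absolute difference in positive-prediction rates between the two groups."""
    g0 = [p for p, g in zip(preds, groups) if g == 0]
    g1 = [p for p, g in zip(preds, groups) if g == 1]
    return abs(rate(g0) - rate(g1))

def equalized_odds_gap(preds, labels, groups):
    """Max difference in true-positive and false-positive rates across groups."""
    def tpr_fpr(group):
        pos = [p for p, y, g in zip(preds, labels, groups) if g == group and y == 1]
        neg = [p for p, y, g in zip(preds, labels, groups) if g == group and y == 0]
        return rate(pos), rate(neg)
    tpr0, fpr0 = tpr_fpr(0)
    tpr1, fpr1 = tpr_fpr(1)
    return max(abs(tpr0 - tpr1), abs(fpr0 - fpr1))

if __name__ == "__main__":
    preds  = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical model decisions
    labels = [1, 0, 1, 0, 0, 1, 1, 0]   # gold labels
    groups = [0, 0, 0, 0, 1, 1, 1, 1]   # protected-attribute group membership
    print(demographic_parity_gap(preds, groups))
    print(equalized_odds_gap(preds, labels, groups))
```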

References

  1. [1]
    Computational Linguistics - Stanford Encyclopedia of Philosophy
    Feb 6, 2014 · Computational linguistics is the scientific and engineering discipline concerned with understanding written and spoken language from a computational ...
  2. [2]
    Speech and Language Processing
    An introduction to natural language processing, computational linguistics, and speech recognition with language models, 3rd edition.
  3. [3]
    Computational Linguistics - an overview | ScienceDirect Topics
    Computational Linguistics (CL) is an area in linguistics that makes use of computational methodologies to identify patterns and similarities in the use of ...
  4. [4]
    The Future of Computational Linguistics: On Beyond Alchemy
    Over the decades, fashions in Computational Linguistics have changed again and again, with major shifts in motivations, methods and applications.<|control11|><|separator|>
  5. [5]
    Opening a New Chapter for Computational Linguistics
    Mar 15, 2025 · Under this new vision, Computational Linguistics strongly encourages research that engages with emerging paradigms, with a particular emphasis ...
  6. [6]
    Natural Language Processing and Computational Linguistics
    Dec 23, 2021 · In this narrower definition, linguistics is concerned with the rules followed by languages as a system, whereas CL, as a subfield of linguistics ...
  7. [7]
    [PDF] Context-Free Grammars and Constituency Parsing
    Our focus in this chapter is context-free grammars and the CKY algorithm for parsing them. Context-free grammars are the backbone of many formal mod- els of the ...
  8. [8]
    [PDF] TIIKEE MODELS FOR TIE DESCRIPTION OF LANGUAGE
    We study the formal properties of a set of grammatical trans- formations that carry sentences with phra.se structure into new sentences with derived phrase.
  9. [9]
    [PDF] Computational Cognitive Linguistics - ACL Anthology
    Cognitive Linguistics provides the basic mechanisms for explaining human language processing, but has traditionally been informal and non-computational.
  10. [10]
    [PDF] Introduction: Cognitive Issues in Natural Language Processing
    This special issue is dedicated to get a better picture of the relationships between computational linguistics and cognitive science. It specifically raises ...
  11. [11]
    What is the ACL and what is Computational Linguistics?
    The association was founded in 1962, originally named the Association for Machine Translation and Computational Linguistics (AMTCL), and became the ACL in 1968.
  12. [12]
    [PDF] A Mathematical Theory of Communication
    This case has applications not only in communication theory, but also in the theory of computing machines, the design of telephone exchanges and other fields.
  13. [13]
    [PDF] warren weaver - CMU School of Computer Science
    15, 1949. It is reprinted by his permission because it is a historical document for machine translation. When he sent it to some 200 of his acquaintances in ...
  14. [14]
    [PDF] The conference on mechanical translation held at M.I.T., June 17-20 ...
    In 1952 he or- ganized a Conference on Mechanical Translation at M.I.T. This report is concerned with providing a precis of the papers and discussions at the ...Missing: Symposium | Show results with:Symposium
  15. [15]
    The Georgetown-IBM experiment demonstrated in January 1954
    The public demonstration of a Russian-English machine translation system in New York in January 1954 – a collaboration of IBM and Georgetown University.
  16. [16]
    Three models for the description of language - IEEE Xplore
    We investigate several conceptions of linguistic structure to determine whether or not they can provide simple and revealing grammars.
  17. [17]
    [PDF] ALPAC-1966.pdf - The John W. Hutchins Machine Translation Archive
    Note: Other CIA funds in support of the Georgetown machine-translation project (amounting to $205,000) were transferred to NSF. See above. 110. Page 125 ...
  18. [18]
    [PDF] ALPAC -- the (in)famous report - ACL Anthology
    The best known event in the history of machine translation is without doubt the publication thirty years ago in November 1966 of the report by the Automatic ...
  19. [19]
    About the Journal | Computational Linguistics
    The journal was established in 1974 by David Hays under the name American Journal of Computational Linguistics and was renamed to its current title in 1984.
  20. [20]
    American Journal of Computational Linguistics (September 1974)
    American Journal of Computational Linguistics (September 1974). David G. Hays (Editor). Anthology ID: J74-1; Month: September; Year: 1974; Address: ...
  21. [21]
    Syntactic Structures, Noam Chomsky - Penn Linguistics
    No information is available for this page. · Learn why
  22. [22]
    Review of B. F. Skinner's Verbal Behavior - Chomsky.info
    The book under review is the product of study of linguistic behavior extending over more than twenty years. Earlier versions of it have been fairly widely ...
  23. [23]
    [PDF] ASPECTS OF THE THEORY OF SYNTAX - Colin Phillips |
    This is Special Technical Report Number 1 I of the Research Labora- tory of Electronics of the Massachusetts Institute of Technology.
  24. [24]
    Parallel Distributed Processing, Volume 1: Explorations in the ...
    They describe a new theory of cognition called connectionism that is challenging the idea of symbolic computation that has traditionally been at the center of ...
  25. [25]
    [PDF] Connectionist Modeling of Language: Examples and Implications
    Rumelhart and McClelland (1986) argued for an alterna- tive view of language in which all items coexist within a sin- gle system whose representations and ...
  26. [26]
    Bayesian Grammar Induction for Language Modeling - ACL Anthology
    Chen. 1995. Bayesian Grammar Induction for Language Modeling. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 228–235, Cambridge, ...Missing: acquisition | Show results with:acquisition
  27. [27]
    [2407.07606] The Computational Learning of Construction Grammars
    Jul 10, 2024 · Abstract:This paper documents and reviews the state of the art concerning computational models of construction grammar learning.
  28. [28]
    [PDF] How poor is the stimulus? Evaluating hierarchical generalization in ...
    Jul 9, 2023 · When acquiring syntax, children consistently choose hierarchical rules over competing non- hierarchical possibilities. Is this preference.
  29. [29]
    MOSAIC+: a cross-linguistic model of verb-marking in typically ...
    MOSAIC (Model of Syntax Acquisition in Children) is a computational model of language learning that successfully simulates the developmental patterning of ...
  30. [30]
    Building a Large Annotated Corpus of English: The Penn Treebank
    Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2): ...
  31. [31]
    UD Guidelines - Universal Dependencies
    UD guidelines cover tokenization, morphology, syntax, enhanced dependencies, CoNLL-U format, and annotation guidelines.
  32. [32]
    A Survey of Corpora for Germanic Low-Resource Languages and ...
    A systematic survey of available corpora (most importantly, annotated corpora, which are particularly valuable for NLP research).
  33. [33]
    Corpus Annotation through Crowdsourcing: Towards Best Practice ...
    In this paper we focus on the use of crowdsourcing methods for corpus acquisition and propose a set of best practice guidelines based in our own experiences in ...
  34. [34]
    Switchboard-1 Release 2 - Linguistic Data Consortium - LDC Catalog
    This corpus contains labels for 1155 5-minute conversations comprising 205,000 utterances and 1.4 million words. The Switchboard Dialog Act Corpus is available ...
  35. [35]
    Head-Driven Phrase Structure Grammar, Pollard, Sag
    This book presents the most complete exposition of the theory of head-driven phrase structure grammar (HPSG), introduced in the authors' Information-Based ...
  36. [36]
    Class-Based n-gram Models of Natural Language - ACL Anthology
    We address the problem of predicting a word from previous words in a sample of text. In particular, we discuss n-gram models based on classes of words.
  37. [37]
    Long Short-Term Memory
    Hochreiter, S. and Schmidhuber, J. (1997). LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems 9. MIT ...
  38. [38]
    Sequence to Sequence Learning with Neural Networks - arXiv
    In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure.
  39. [39]
    Efficient Estimation of Word Representations in Vector Space - arXiv
    We propose two novel model architectures for computing continuous vector representations of words from very large data sets.
  40. [40]
    GloVe: Global Vectors for Word Representation - ACL Anthology
    Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on ...
  41. [41]
    [1706.03762] Attention Is All You Need - arXiv
    We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
  42. [42]
    [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers ...
    BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
  43. [43]
    [1906.01502] How multilingual is Multilingual BERT? - arXiv
    In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2018) as a single language model pre-trained from monolingual corpora in 104 ...
  44. [44]
    Machine translation - ScienceDirect
    This paper gives an intellectual overview of the field of machine translation of natural languages (MT). Now 50 years old, this field is one of the oldest ...
  45. [45]
    Interlingual machine translation
    Summary: The first part of this paper considers some of the reasons why mechanical translation via a logically formalized interlingua is worth pursuing.
  46. [46]
    Interlingua for multilingual machine translation - ACL Anthology
    In machine translation systems, an intermediate representation is necessary to express the result of sentence analysis. This represents syntactic structure ...
  47. [47]
    UNITRAN: An Interlingual Approach to Machine Translation (1987)
    The translation model described in this paper moves away from the language-specific rule-based design, and moves toward a linguistically motivated principle ...
  48. [48]
    Moses: open source toolkit for statistical machine translation
    Moses has shown that it achieves results comparable to the most competitive and widely used statistical machine translation systems in translation quality and ...
  49. [49]
    BLEU: a Method for Automatic Evaluation of Machine Translation
    BLEU is a method for automatic machine translation evaluation, measuring closeness to human translations using a weighted average of phrase matches. It is ...
  50. [50]
    Text Generation: A Systematic Literature Review of Tasks ... - arXiv
    In this paper, we provide an overview of recent text generation research between January 2017 and August 2023, as it permeates most activities in NLP concerned ...
  51. [51]
    Long Text Generation by Modeling Sentence-Level and Discourse ...
    Generating long and coherent text is an important but challenging task, particularly for open-ended language generation tasks such as ...
  52. [52]
    A vector space model for automatic indexing - ACM Digital Library
    Salton, G., and Yang, C.S. On the specification of term values in automatic indexing. J. Documen. 29, 4 (Dec. 1973), 351-372.
  53. [53]
    SQuAD: 100000+ Questions for Machine Comprehension of Text
    We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by ...
  54. [54]
    [1704.00051] Reading Wikipedia to Answer Open-Domain Questions
    This paper uses Wikipedia for open-domain QA, where answers are text spans in articles, combining search and a neural network model.
  55. [55]
    A Joint Model of Intent Determination and Slot Filling for Spoken ...
    The joint model combines intent determination and slot filling using a shared representation, trained with a united loss function, and outperforms separate ...
  56. [56]
    A Survey of Intent Classification and Slot-Filling Datasets for Task ...
    This survey catalogs publicly available datasets for intent classification and slot-filling in task-oriented dialog, aiming to increase their ...
  57. [57]
    ROUGE: A Package for Automatic Evaluation of Summaries
    Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for ...
  58. [58]
    Taking MT Evaluation Metrics to Extremes: Beyond Correlation with ...
    In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects.
  59. [59]
    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural ...
    GLUE is a tool for evaluating and analyzing NLU models across diverse tasks. It is model-agnostic and incentivizes sharing knowledge.
  60. [60]
    SuperGLUE: A Stickier Benchmark for General-Purpose Language ...
    In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a ...
  61. [61]
    CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to ...
    The CoNLL 2018 shared task involved training dependency parsers for many languages using Universal Dependencies, without gold-standard annotation.
  62. [62]
    MENLI: Robust Evaluation Metrics from Natural Language Inference
    Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial ...
  63. [63]
    XTREME: A Massively Multilingual Multi-task Benchmark for ... - arXiv
    A multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
  64. [64]
    Man is to Computer Programmer as Woman is to Homemaker ...
    We show that even word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent.
  65. [65]
    Word embeddings quantify 100 years of gender and ethnic ... - PNAS
    We develop a framework to demonstrate how the temporal dynamics of the embedding helps to quantify changes in stereotypes and attitudes toward women and ethnic ...
  66. [66]
    Quantifying Social Biases in NLP: A Generalization and Empirical ...
    In NLP, group fairness metrics are based on performance comparisons for different sets of examples, for example, the comparison of two F1 scores: one for ...
  67. [67]
    Differential Privacy in Natural Language Processing: The Story So Far
    This paper aims to summarize the vulnerabilities addressed by Differential Privacy, the current thinking, and above all, the crucial next steps ...
  68. [68]
    Privacy leakages on NLP models and mitigations through ... - Hal-Inria
    In this paper, we present the main privacy concerns in NLP and a case study conducted in collaboration with the Hospices Civils de Lyon (HCL) to ...
  69. [69]
    Bias Mitigation for Large Language Models using Adversarial ...
    We propose a novel debiasing method that employs adversarial learning during model pre-training. Without hyperparameter optimization our comparably ...