Topic model

A topic model is a type of unsupervised machine learning algorithm designed to automatically discover and annotate latent topics, or thematic patterns, within a large collection of documents, such as text corpora, by analyzing co-occurrences of words without requiring labeled data.^[1] These models treat documents as mixtures of topics and topics as distributions over vocabulary terms, enabling the extraction of interpretable summaries that reveal underlying structures in unstructured text.^[2] Probabilistic topic models, the dominant paradigm in this field, formalize the discovery process through generative statistical frameworks that simulate how documents are produced from hidden topic distributions.^[1] The foundational method, Latent Dirichlet Allocation (LDA), introduced in 2003, posits a three-level hierarchical Bayesian model in which each document is a finite mixture over a fixed number of topics, and each topic is a distribution over the vocabulary drawn from a Dirichlet prior.^[2] Inference in LDA typically employs variational methods or sampling techniques such as Gibbs sampling to approximate posterior distributions, which allows application to large datasets.^[1]

Topic modeling evolved from earlier techniques such as latent semantic analysis (LSA) in the 1990s, which used singular value decomposition for dimensionality reduction, and probabilistic latent semantic indexing (pLSA) in 1999, which introduced mixture models but was prone to overfitting and generalized poorly to unseen documents.^[1] LDA addressed these limitations by incorporating Bayesian priors, paving the way for extensions such as dynamic topic models for time-evolving corpora, supervised variants for classification tasks, and nonparametric models like hierarchical Dirichlet processes that infer the number of topics automatically.^[1]

Applications of topic models span natural language processing, information retrieval, and beyond, including document classification, recommendation systems, and exploratory analysis of scientific literature or social media streams.^[1] As of 2025, neural topic models integrate deep learning components, such as variational autoencoders and large language models (LLMs), to better handle short texts and to exploit embeddings, with reported improvements in coherence and scalability on modern hardware.^[3]^[4] Despite challenges in evaluation, which often relies on perplexity metrics or human judgments, these techniques remain essential for making sense of vast digital archives.^[1]

Introduction

Definition and Core Concepts

A topic model is a statistical technique for discovering latent thematic structure in a collection of documents, representing each document as a mixture of topics and each topic as a probability distribution over words in a fixed vocabulary.^[2] These models operate under the bag-of-words assumption, treating documents as unordered collections of words where the order and specific positions do not influence the thematic representation, focusing instead on word frequencies to capture semantic content.^[2] Latent topics serve as hidden variables that explain the observed words, enabling the model to infer underlying themes without explicit supervision or annotations.^[5] At the core of topic models is a generative probabilistic framework, which posits an imaginary process by which documents are produced: first, a distribution over topics is selected for the document; then, for each word position, a topic is drawn from this distribution, and a word is sampled from the corresponding topic's word distribution.^[2] This process treats topics as unobserved components that generate the observable word sequences, allowing the model to reverse-engineer the latent structure from the data.^[5] For a single document in this framework, the basic generative process follows a Dirichlet-multinomial model, where the topic proportions \theta are drawn from a Dirichlet distribution parameterized by \alpha; for each word, a topic assignment z is sampled from a multinomial distribution governed by \theta; and the word w is then drawn from a multinomial distribution over the vocabulary conditioned on the selected topic's parameters \phi_z:

\begin{align*} \theta &\sim \mathrm{Dir}(\alpha), \\ z &\sim \mathrm{Mult}(\theta), \\ w &\sim \mathrm{Mult}(\phi_z). \end{align*}

^[2] As an illustration, consider a corpus of news articles: a topic model might uncover a "politics" topic with high probabilities for words like "election," "government," and "policy," alongside a "sports" topic featuring terms such as "game," "team," and "score," thereby revealing thematic clusters across the collection.^[5]
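The three sampling steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the six-word vocabulary, the two topic-word distributions in `phi`, and all helper names are hypothetical, and the Dirichlet draw relies on the standard construction of normalizing independent Gamma variates.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def sample_categorical(probs, rng):
    """Draw an index k with probability probs[k]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, phi, vocab, n_words, rng):
    """theta ~ Dir(alpha); for each word: z ~ Mult(theta), w ~ Mult(phi[z])."""
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)   # topic assignment for this position
        w = sample_categorical(phi[z], rng)  # word id drawn from the chosen topic
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words

# Hypothetical toy setup: topic 0 resembles "politics", topic 1 "sports".
vocab = ["election", "government", "policy", "game", "team", "score"]
phi = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],
]
rng = random.Random(0)
theta, topics, words = generate_document([0.5, 0.5], phi, vocab, 10, rng)
print(theta)
print(list(zip(topics, words)))
```

Because the two toy topics share no words, every generated word reveals its topic; real topics overlap, which is why inference is needed to recover the assignments.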

Role in Natural Language Processing

Topic models play a pivotal role in natural language processing (NLP) by enabling the unsupervised discovery of latent semantic structures within large collections of unstructured text data. This capability bridges the gap between raw textual inputs and structured representations, such as topic distributions over documents, allowing for interpretable insights into thematic content without requiring labeled training data. By modeling text as mixtures of topics—where each topic is a distribution over words—topic models facilitate the extraction of hidden patterns that reflect underlying themes, making them essential for handling the high volume and variability of natural language.^[6] In information retrieval, topic models enhance performance through topic-based indexing, which captures document semantics beyond simple term matching and improves relevance ranking in ad-hoc search tasks.^[7] For instance, by representing documents as mixtures of topics rather than sparse bag-of-words vectors, these models enable more effective smoothing and query expansion, leading to higher retrieval precision.^[7] Similarly, in sentiment analysis, topic models contribute by disentangling sentiment from topical content, as seen in joint sentiment-topic frameworks that simultaneously infer polarity and themes, thereby improving the accuracy of opinion mining in reviews or social media.^[8] A key advantage of topic models in NLP is their ability to reduce dimensionality, transforming high-dimensional vocabulary spaces—often exceeding 10,000 terms—into compact K-dimensional topic representations, where K is typically much smaller (e.g., 50–200 topics). 
This reduction not only mitigates the curse of dimensionality but also serves as a foundational step for downstream tasks like document clustering, where topic vectors enable efficient grouping of similar texts based on shared themes.^[6] For example, topic models have been applied to email filtering, automatically identifying thematic categories such as "work-related" or "promotions" from unlabeled inboxes, which supports rule-based organization and spam detection without manual annotation.^[9]

Historical Development

Origins in Information Retrieval

The foundations of topic modeling trace back to early information retrieval (IR) systems developed in the 1960s, which emphasized structured representations of text to improve search efficiency. The SMART system, pioneered by Gerard Salton at Cornell University, introduced term-document matrices as a core mechanism for indexing and retrieving documents based on weighted term frequencies, laying essential groundwork for later latent structure techniques.^[10] These matrices captured associations between terms and documents but struggled with synonymy and polysemy, highlighting the need for methods that could uncover deeper semantic relationships beyond exact term matching.^[11] During the 1970s and 1980s, vector space models emerged as a dominant paradigm in IR, representing documents and queries as vectors in a high-dimensional space where similarity was measured via cosine distance or dot products. This approach, formalized by Salton and colleagues, enabled ranking based on term co-occurrences but revealed limitations in handling semantic nuances, such as related terms not explicitly co-occurring, which spurred interest in dimensionality reduction to reveal latent topical structures.^[11] Conferences like the ACM SIGIR, with its early meetings in the 1970s fostering key discussions on these challenges, played a pivotal role in driving innovations toward more sophisticated retrieval models.^[12] A landmark advancement came in 1990 with the introduction of Latent Semantic Indexing (LSI) by Deerwester et al., which applied singular value decomposition (SVD) to term-document matrices for dimensionality reduction, thereby capturing implicit associations among terms and documents to enhance retrieval accuracy. 
LSI addressed some vector space model shortcomings by approximating latent semantic factors, yet its deterministic nature lacked a probabilistic interpretation, limiting its ability to model uncertainty in term distributions and motivating subsequent probabilistic extensions. This transition toward probabilistic frameworks built directly on LSI's insights into latent structures.

Evolution from Latent Semantic Analysis

Latent Semantic Indexing (LSI), a deterministic matrix factorization method for uncovering latent topics in document collections, laid the groundwork for subsequent probabilistic approaches by addressing synonymy and polysemy in information retrieval. However, LSI's reliance on singular value decomposition lacked a statistical foundation for generative modeling, prompting the development of probabilistic alternatives in the late 1990s.

In 1999, Thomas Hofmann introduced Probabilistic Latent Semantic Analysis (pLSA), also known as Probabilistic Latent Semantic Indexing, as a statistical extension of LSI that incorporates a latent class model, termed the aspect model, to generate word-document co-occurrences probabilistically.^[14] The aspect model posits that each word in a document is generated by first selecting a latent topic z conditioned on the document and then sampling the word from that topic's distribution. The resulting likelihood-based framework is fitted via expectation-maximization and outperformed LSI in retrieval tasks.^[14] Despite these advances, pLSA suffered from overfitting because it relies on maximum-likelihood estimation without hierarchical priors: its parameter count grows linearly with the training corpus (kV + kM, where k is the number of topics, V the vocabulary size, and M the number of documents), and it generalizes poorly to unseen documents because it lacks a proper generative process for new data.^[2]

This led to a pivotal shift in 2003 with the introduction of Latent Dirichlet Allocation (LDA) by David M. Blei, Andrew Y. Ng, and Michael I. Jordan, which established a fully generative Bayesian model for topic discovery by imposing Dirichlet priors on topic distributions to promote sparsity and coherence while keeping the parameter count independent of corpus size.^[2] LDA's hierarchical structure (document-topic proportions \theta are drawn from a Dirichlet distribution, and topic-word distributions \phi are drawn similarly) supports approximate inference through variational methods or sampling, and moved topic modeling from maximum-likelihood point estimates toward a Bayesian treatment of exchangeable words and documents that better captures corpus-level regularities.^[2] Published in the Journal of Machine Learning Research, this work marked a key milestone in enabling broader applications beyond retrieval, such as visualization and summarization.^[2] Following LDA, the integration of Bayesian nonparametrics after 2005 further evolved topic models by allowing the number of topics to grow adaptively with the data, as in the Hierarchical Dirichlet Process, which yields infinite mixture models that do not require the number of topics to be fixed in advance.^[15]

Mathematical Foundations

Probabilistic Frameworks

Topic models operate within a probabilistic framework that conceptualizes documents as observed data generated from mixtures of hidden latent topics. In this setup, each document is represented as a distribution over topics, and each topic as a distribution over words, enabling the model to capture the underlying thematic structure of a corpus through stochastic processes. This approach draws on Bayesian principles to infer the posterior distribution of hidden variables—such as topic assignments and mixture proportions—given the observed words, providing a principled way to handle uncertainty in topic discovery.^[5] A key aspect of this framework is the use of conjugate priors to ensure computational tractability. The Dirichlet distribution serves as the conjugate prior for the multinomial distributions governing topic mixtures and word distributions, allowing for efficient posterior updates in Bayesian inference. This choice facilitates the integration of prior knowledge about sparsity and smoothness in topic assignments, which is crucial for modeling real-world text data where topics are often sparse. Graphical models, often depicted using plate notation, compactly represent the generative process by illustrating dependencies and repetitions across documents and words; for instance, plates denote replication over multiple documents (D) and words within each document (N).^[2]^[5] The hierarchical structure distinguishes corpus-level from document-level distributions, enabling shared topics across the entire collection while allowing topic mixtures to vary per document. In models like Latent Dirichlet Allocation (LDA), the per-topic word distributions φ are drawn once from a corpus-level Dirichlet prior parameterized by β, promoting coherence across documents, whereas per-document topic proportions θ are drawn independently from a document-level Dirichlet prior parameterized by α. This setup captures both global thematic consistency and document-specific emphases. 
The full joint distribution over the observed words W, latent topic assignments Z, document-topic distributions θ, and topic-word distributions φ, given hyperparameters α and β, is given by:

\begin{aligned} p(W, Z, \theta, \phi \mid \alpha, \beta) = \prod_{k=1}^{K} p(\phi_k \mid \beta) \prod_{d=1}^{D} \left[ p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d) p(w_{d,n} \mid z_{d,n}, \phi_{z_{d,n}}) \right], \end{aligned}

where K is the number of topics, D the number of documents, and N_d the number of words in document d; here, p(φ_k | β) is Dirichlet, p(θ_d | α) is Dirichlet, p(z_{d,n} | θ_d) is multinomial, and p(w_{d,n} | z_{d,n}, φ) is multinomial. This formulation encapsulates the generative process and serves as the foundation for inference in probabilistic topic models.^[2]
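As a worked instance of the joint distribution above, the following sketch evaluates its logarithm for fully specified values of Z, θ, and φ. It assumes symmetric scalar hyperparameters α and β, and the function names and data layout (lists of word ids and topic ids) are illustrative choices, not part of any particular library.

```python
import math

def log_dirichlet_pdf(x, conc):
    """Log density of a symmetric Dirichlet with concentration `conc` at point x."""
    k = len(x)
    log_norm = math.lgamma(k * conc) - k * math.lgamma(conc)
    return log_norm + sum((conc - 1.0) * math.log(xi) for xi in x)

def log_joint(docs, assignments, theta, phi, alpha, beta):
    """log p(W, Z, theta, phi | alpha, beta), following the factorization term by term.

    docs[d][n]        : word id of the n-th word in document d
    assignments[d][n] : topic id z_{d,n} of that word
    theta[d]          : topic proportions of document d (length K)
    phi[k]            : word distribution of topic k (length V)
    """
    # Product over topics of p(phi_k | beta).
    total = sum(log_dirichlet_pdf(phi_k, beta) for phi_k in phi)
    for d, words in enumerate(docs):
        # p(theta_d | alpha) for this document.
        total += log_dirichlet_pdf(theta[d], alpha)
        for n, w in enumerate(words):
            z = assignments[d][n]
            # p(z_{d,n} | theta_d) * p(w_{d,n} | phi_{z_{d,n}}).
            total += math.log(theta[d][z]) + math.log(phi[z][w])
    return total

# Toy check: with alpha = beta = 1 the Dirichlet densities are uniform (log density 0),
# so only the two multinomial terms per word contribute.
docs = [[0, 1]]
assignments = [[0, 1]]
theta = [[0.5, 0.5]]
phi = [[0.5, 0.5], [0.5, 0.5]]
print(log_joint(docs, assignments, theta, phi, alpha=1.0, beta=1.0))
```

Inference algorithms never evaluate this quantity for a single configuration only; they need the posterior over all configurations, which is what makes the problem hard.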

Matrix Factorization Approaches

Matrix factorization approaches to topic modeling provide a non-probabilistic framework for discovering latent topics by decomposing the term-document matrix into lower-rank factors, offering deterministic alternatives to probabilistic methods.^[16] In this paradigm, the term-document matrix W \in \mathbb{R}^{m \times n}, where m is the vocabulary size and n is the number of documents, is approximated as W \approx U V, with U \in \mathbb{R}^{m \times k} representing the topic-word matrix (each column holds one topic's weights over words) and V \in \mathbb{R}^{k \times n} the document-topic matrix (each column holds one document's weights over the k topics).^[16] Here W denotes the data matrix itself, not the observed word sequence of the probabilistic formulation. This decomposition uncovers topics as coherent groups of terms and assigns documents to mixtures of these topics without assuming generative processes. The cornerstone of these approaches is Non-negative Matrix Factorization (NMF), introduced by Lee and Seung in 1999, which enforces non-negativity constraints on U and V to yield interpretable, parts-based representations.^[16] Unlike unconstrained factorizations such as principal component analysis, whose components may contain negative entries that cancel one another, NMF's non-negativity ensures that topics emerge as additive combinations of word features, promoting intuitive and human-readable results, as demonstrated in early applications to text data where semantic features naturally arise.^[16] NMF is optimized by minimizing the Frobenius norm of the reconstruction error:

\min_{U, V} \| W - U V \|_F^2

subject to U \geq 0 and V \geq 0, typically solved using multiplicative update rules that iteratively refine the factors while preserving non-negativity:

U \leftarrow U \odot \frac{W V^T}{U V V^T}, \quad V \leftarrow V \odot \frac{U^T W}{U^T U V},

where \odot denotes element-wise multiplication and the fractions denote element-wise division.^[17] Each update leaves the objective non-increasing, so the iteration settles toward a local optimum rather than a guaranteed global one, while remaining efficient enough for large-scale text corpora.^[17] NMF offers distinct advantages in topic modeling: the factor matrices tend to be sparse, which reduces noise and highlights key terms per topic, and the additive structure facilitates visualization by allowing documents to be represented as weighted sums of basis elements. For instance, sparse U columns emphasize a subset of words defining each topic, aiding interpretability in document clustering tasks. A related method, Archetypal Analysis, further constrains the factors to lie within the convex hull of the data points, so that each archetype is a convex combination of observed data and tends toward the extremes of the collection.^[18] Introduced by Cutler and Breiman in 1994, this method emphasizes boundary points and is useful for identifying pure topic prototypes in diverse datasets.^[18] In contrast to probabilistic frameworks that model uncertainty through distributions, matrix factorization approaches like NMF prioritize optimization-based decompositions for scalable, reproducible topic extractions.^[16]

Key Algorithms and Models

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora, formulated as a three-level hierarchical Bayesian model. In this framework, documents are generated as mixtures of latent topics, where each topic is itself a mixture of words drawn from a shared vocabulary. This hierarchical structure assumes that both the mixing proportions for topics within documents and the distributions over words within topics follow Dirichlet priors, enabling the discovery of coherent thematic patterns across large document sets.^[2] The generative process underlying LDA operates at multiple levels. Globally, for each topic k = 1, \dots, K, the topic-word distribution \phi_k is drawn from a Dirichlet distribution: \phi_k \sim \text{Dir}(\beta). For each document m = 1, \dots, M, the document-topic mixture \theta_m is drawn from \theta_m \sim \text{Dir}(\alpha). Then, for each word position n = 1, \dots, N_m in document m, a topic assignment z_{m,n} is sampled from the multinomial distribution z_{m,n} \sim \text{Mult}(\theta_m), and the observed word w_{m,n} is drawn from w_{m,n} \sim \text{Mult}(\phi_{z_{m,n}}). This process models documents as bags of words exchangeably, capturing the latent topical structure through the assignments Z and parameters \theta, \phi.^[2] Key hyperparameters in LDA include \alpha and \beta, which shape the resulting distributions. The parameter \alpha governs the sparsity of the document-topic mixtures \theta_m, where smaller values of \alpha encourage sparser representations with fewer dominant topics per document. Similarly, \beta controls the smoothness of the topic-word distributions \phi_k, with smaller values leading to more peaked (less smooth) distributions that concentrate probability mass on fewer words per topic. 
In practice, the number of topics K must be chosen by the user; values on the order of 50 to 100 are common when modeling large text corpora, although the appropriate setting depends on corpus size and the desired balance between granularity and interpretability.^[2]^[19]^[20] Inference in LDA aims to estimate the posterior distribution over the latent topic assignments Z, document mixtures \theta, and topic-word distributions \phi given the observed words W:

p(Z, \theta, \phi \mid W, \alpha, \beta) \propto p(W, Z, \theta, \phi \mid \alpha, \beta)

This posterior lacks a closed-form solution owing to the intricate dependencies introduced by the Dirichlet priors and multinomial likelihood, requiring approximate methods for computation.^[2]

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis (pLSA), also referred to as Probabilistic Latent Semantic Indexing (pLSI), is an unsupervised probabilistic technique for discovering latent topics in a collection of documents. Introduced by Thomas Hofmann in 1999, it extends latent semantic analysis by incorporating a statistical mixture model to capture the probabilistic relationships between words and documents through unobserved latent variables representing topics or aspects.^[14] In pLSA, documents are viewed as mixtures of these latent topics, and topics are distributions over words, enabling the model to represent the co-occurrence patterns in text data more flexibly than deterministic methods.^[14] The core formulation of pLSA, known as the aspect model, posits that the probability of observing a word w in a document d is generated through a latent topic z:

P(w \mid d) = \sum_z P(z \mid d) P(w \mid z)

Here, P(z \mid d) represents the mixing proportions of topics in document d, while P(w \mid z) denotes the probability of word w under topic z.^[14] This generative process treats each word occurrence as independently drawn from one of the topics associated with its document, assuming a fixed number of topics Z. To estimate the model parameters, pLSA employs the Expectation-Maximization (EM) algorithm, which iteratively maximizes the log-likelihood of the observed word-document data:

\log P(W \mid D) = \sum_{d} \sum_{w} n(d, w) \log \sum_z P(z \mid d) P(w \mid z),

where n(d, w) is the number of times word w occurs in document d.

The E-step computes posterior probabilities over latent topics, and the M-step updates the topic mixtures and word distributions accordingly.^[14] Despite its foundational role in probabilistic topic modeling, pLSA has notable limitations. The model lacks a proper generative story for new documents, making it unsuitable for assigning probabilities to unseen documents without retraining, as parameters are tied directly to the training corpus.^[2] Additionally, without regularization mechanisms like priors, pLSA is prone to overfitting, particularly as the number of parameters scales linearly with the training set size, leading to poor generalization on sparse data.^[2] These issues motivated extensions such as Latent Dirichlet Allocation, which introduces Bayesian priors to mitigate them.^[2]
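The EM procedure for the aspect model can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: the input is a dense count table `counts[d][w]`, each (document, word) pair is weighted by its count in the likelihood, and the function names, random initialization, and fixed iteration count are illustrative choices rather than Hofmann's original procedure (which also discusses tempered EM).

```python
import math
import random

def normalize(values):
    """Scale non-negative values so they sum to one."""
    total = sum(values)
    return [v / total for v in values]

def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit P(z|d) and P(w|z) by EM on a count table counts[d][w]."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(n_topics)]
    for _ in range(n_iter):
        new_zd = [[0.0] * n_topics for _ in range(n_docs)]
        new_wz = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w) proportional to P(z|d) P(w|z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_zd[d][z] += counts[d][w] * post[z]
                    new_wz[z][w] += counts[d][w] * post[z]
        # M-step: renormalize expected counts into probabilities.
        p_z_d = [normalize(row) for row in new_zd]
        p_w_z = [normalize(row) for row in new_wz]
    return p_z_d, p_w_z

def log_likelihood(counts, p_z_d, p_w_z):
    """Count-weighted log-likelihood of the observed words given their documents."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                p_w_d = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
                total += n * math.log(p_w_d)
    return total

# Hypothetical toy corpus: two documents favor words 0-1, two favor words 2-3.
counts = [[5, 4, 1, 0], [4, 5, 0, 1], [0, 1, 5, 4], [1, 0, 4, 5]]
p_z_d, p_w_z = plsa_em(counts, n_topics=2)
print(log_likelihood(counts, p_z_d, p_w_z))
```

Note that `p_z_d` has one row per training document, which makes the linear growth in parameters and the difficulty with unseen documents concrete.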

Non-Negative Matrix Factorization

Non-negative matrix factorization (NMF) decomposes a non-negative input matrix V \in \mathbb{R}_{\geq 0}^{n \times m} (for text, a term-document matrix with n vocabulary terms as rows and m documents as columns) into two lower-rank non-negative matrices W \in \mathbb{R}_{\geq 0}^{n \times r} and H \in \mathbb{R}_{\geq 0}^{r \times m}, such that V \approx WH, where r \ll \min(n, m). This standard notation differs from the W \approx U V convention used under Matrix Factorization Approaches, where W denoted the data matrix. In topic modeling, the columns of W serve as basis vectors representing topics as non-negative weightings over words, while each column of H gives the weights of the r topics in the corresponding document.^[21] The non-negativity constraint promotes an additive parts-based representation, enhancing interpretability because data points are reconstructed from localized components that can only add, rather than from holistic components that cancel one another. For instance, when applied to grayscale pixel images of faces, NMF learns basis images corresponding to distinct facial parts like eyes, noses, and mouths. Similarly, in text corpora, it identifies coherent groups of words forming semantic topics, such as terms related to chemistry (e.g., "aluminum," "copper," "iron") or government (e.g., "constitution," "court," "rights").^[21] NMF was originally proposed in 1999 as a method for unsupervised learning of object parts, with demonstrations on both image and text data for discovering semantic features. Extensions in 2001 focused on practical algorithms for computing the factorization, enabling its broader adoption in topic modeling tasks.^[22] Common algorithms for NMF include multiplicative updates and alternating least squares, both of which maintain non-negativity; the multiplicative updates never increase the objective, although in general they reach only a local optimum.^[22] Multiplicative updates, derived as diagonally rescaled gradient descent, minimize objectives like the Frobenius norm \|V - WH\|_F^2 through iterative element-wise multiplication.^[22] The update rules are:

H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}

W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}

A small positive constant \epsilon can be added to denominators for numerical stability.^[22] Analogous rules apply for minimizing the generalized Kullback-Leibler divergence.^[22] Alternating least squares (ALS) alternately optimizes W and H by solving non-negative least squares subproblems, often using active-set methods for efficiency in high dimensions.
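The multiplicative update rules above can be written out directly. The sketch below uses plain Python lists so that it stays self-contained; the helper names, the random initialization, and the fixed iteration count are assumptions made for illustration, and a practical implementation would use an optimized linear algebra library instead.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf_multiplicative(V, r, n_iter=200, eps=1e-9, seed=0):
    """Approximate V (n x m, non-negative) as W (n x r) times H (r x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(r)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(r)]
    errors = []
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise, with eps in the denominator.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(r)]
        # W <- W * (V H^T) / (W H H^T), element-wise, using the updated H.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)] for i in range(n)]
        errors.append(frobenius_error(V, W, H))
    return W, H, errors

# Hypothetical 4-term x 4-document matrix with two clear blocks.
V = [[1.0, 1.0, 0.0, 0.0], [2.0, 2.0, 0.0, 0.0], [0.0, 0.0, 3.0, 3.0], [0.0, 0.0, 1.0, 1.0]]
W, H, errors = nmf_multiplicative(V, r=2)
print(errors[0], errors[-1])
```

Because every update multiplies a non-negative entry by a non-negative ratio, the factors stay non-negative without any explicit projection step.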

Inference Methods

Variational Inference Techniques

Variational inference approximates the intractable posterior distributions in probabilistic topic models by selecting a tractable variational distribution q(Z, \theta, \phi) that minimizes the Kullback-Leibler (KL) divergence to the true posterior p(Z, \theta, \phi \mid W).^[2] This process equivalently maximizes the evidence lower bound (ELBO), which provides a tractable lower bound on the marginal log-likelihood of the observed data W.^[2] The ELBO is formulated as

L(q) = \mathbb{E}_q \left[ \log p(W, Z, \theta, \phi) - \log q(Z, \theta, \phi) \right],

where the expectation is taken with respect to q; because L(q) and the KL divergence sum to the fixed log evidence \log p(W), maximizing L(q) tightens the bound and is equivalent to minimizing the divergence.^[2] A common approach within variational inference employs a mean-field approximation, which factorizes the variational distribution so that the latent variables are independent under q, such as q(Z, \theta, \phi) = \prod_k q(\phi_k \mid \lambda_k) \prod_d \left[ q(\theta_d \mid \gamma_d) \prod_n q(z_{dn} \mid \varphi_{dn}) \right], with Dirichlet forms for q(\phi_k) and q(\theta_d) and multinomial forms for the topic assignments Z. Here \lambda_k, \gamma_d, and \varphi_{dn} are free variational parameters, distinct from the model quantities \phi and \beta.^[2] For Latent Dirichlet Allocation (LDA), inference proceeds via a coordinate ascent algorithm that iteratively optimizes the variational parameters.^[2] In the expectation (E) step, the variational posterior over topic assignments for each word is updated as q(z_{dn}=k) \propto \exp\left( \psi(\gamma_{dk}) + \psi(\lambda_{k w_{dn}}) - \psi\left( \sum_w \lambda_{kw} \right) \right), where \psi denotes the digamma function, \gamma_{dk} parameterizes the per-document topic proportions, and \lambda_{kw} parameterizes the variational Dirichlet over the word distribution of topic k; the document parameters are then refreshed as \gamma_{dk} = \alpha + \sum_n q(z_{dn}=k).^[2] The maximization (M) step then updates the topic-level parameters and, optionally, the Dirichlet hyperparameters \alpha and \beta, by maximizing the resulting ELBO.^[2] These variational techniques scale efficiently to large corpora, supporting inference over millions of documents through deterministic optimization that avoids the high variance of sampling-based alternatives.^[5] This scalability comes at the cost of introducing approximation bias in the posterior estimates, prioritizing computational speed over the asymptotically exact but slower convergence of methods like MCMC.^[5]
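The per-document part of the coordinate ascent described above can be sketched as follows. This is an illustration under assumptions, not the reference algorithm: `topic_param[k][w]` stands for the Dirichlet parameters of the topic-word distributions that appear inside the digamma terms of the update, α is taken to be a symmetric scalar, the loop runs a fixed number of iterations instead of checking convergence, and the small `digamma` helper is a standard recurrence plus asymptotic series written here to avoid external dependencies.

```python
import math

def digamma(x):
    """Digamma function for x > 0 via upward recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series

def e_step_document(word_ids, topic_param, alpha, n_iter=50):
    """Update the variational parameters of a single document.

    Returns gamma (length K) and resp, where resp[n][k] approximates q(z_n = k).
    """
    n_topics = len(topic_param)
    # psi(param_kw) - psi(sum_w param_kw): expected log word probability per topic.
    elog_word = []
    for row in topic_param:
        row_total = digamma(sum(row))
        elog_word.append([digamma(v) - row_total for v in row])
    gamma = [alpha + len(word_ids) / n_topics] * n_topics
    resp = []
    for _ in range(n_iter):
        resp = []
        for w in word_ids:
            log_r = [digamma(gamma[k]) + elog_word[k][w] for k in range(n_topics)]
            shift = max(log_r)  # subtract the maximum for numerical stability
            unnorm = [math.exp(v - shift) for v in log_r]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # gamma_k = alpha + expected number of words assigned to topic k.
        gamma = [alpha + sum(r[k] for r in resp) for k in range(n_topics)]
    return gamma, resp

# Hypothetical two-topic model over four words; the document uses words 0 and 1 only.
topic_param = [[10.0, 10.0, 0.1, 0.1], [0.1, 0.1, 10.0, 10.0]]
gamma, resp = e_step_document([0, 1, 0, 1], topic_param, alpha=0.1)
print(gamma)
```

The term \psi(\sum_k \gamma_{dk}) is omitted from the update because it is constant across topics and cancels when the responsibilities are normalized.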

Sampling-Based Methods

Sampling-based methods for inference in topic models primarily rely on Markov Chain Monte Carlo (MCMC) techniques to approximate the posterior distribution over latent variables, such as topic assignments, by generating samples from the joint distribution. These approaches are particularly valuable for models like Latent Dirichlet Allocation (LDA), where exact inference is intractable due to the high-dimensional parameter space. Unlike deterministic approximations, MCMC methods provide asymptotically exact samples from the posterior, enabling better quantification of uncertainty in topic assignments and model parameters.^[23] A cornerstone of these methods is collapsed Gibbs sampling, which integrates out the continuous parameters (topic proportions θ and word distributions φ) to sample directly from the conditional distribution over topic assignments z. In LDA, the process iteratively samples the topic z_i for each word position i from its full conditional distribution, excluding the current assignment to avoid self-influence:

P(z_i = k \mid z_{-i}, w, \alpha, \beta) \propto (n_{d,k}^{-i} + \alpha) \frac{n_{k,t}^{-i} + \beta}{n_{k,\cdot}^{-i} + V \beta}

Here, d is the document containing word i, t is the observed word type, n_{d,k}^{-i} is the number of words in document d assigned to topic k excluding i, n_{k,t}^{-i} is the number of times word t is assigned to topic k excluding i, n_{k,\cdot}^{-i} is the total number of assignments to topic k excluding i, V is the vocabulary size, and α, β are symmetric Dirichlet hyperparameters. This sampling is repeated across all word positions in a sweep, and sweeps continue until the chain appears to have reached stationarity, as judged by convergence diagnostics. The method was notably implemented and applied to scientific abstracts by Griffiths and Steyvers in 2004, demonstrating its efficacy for discovering coherent topics.^[23] To ensure reliable inference, MCMC chains require a burn-in period in which initial samples, biased by the starting values, are discarded while the chain converges to the stationary distribution; for instance, the first 1,000 iterations are often discarded in LDA applications. Thinning, or subsampling the chain at regular intervals (e.g., every 100 iterations), further reduces autocorrelation so that the retained samples are closer to independent when estimating posterior expectations such as topic-word distributions. While these techniques enhance accuracy, they increase computational cost compared to faster approximations. For efficiency, extensions incorporate the alias method, which draws from a fixed discrete distribution in O(1) time after an alias table has been precomputed; AliasLDA combines such tables with Metropolis-Hastings corrections to reduce the amortized per-token sampling cost from O(K) to roughly the number of topics actually present in the document. Overall, sampling-based methods excel in capturing posterior uncertainty but remain computationally intensive, often requiring thousands of iterations for large corpora.^[23]^[24]
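The full conditional above translates almost line for line into code. The following is a minimal sketch, assuming symmetric scalar α and β and documents given as lists of integer word ids; the function name and the toy corpus are hypothetical, and burn-in, thinning, and convergence diagnostics are left out for brevity.

```python
import random

def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha, beta, n_sweeps, seed=0):
    """Collapsed Gibbs sampling for LDA; returns assignments and count tables."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # words in doc d assigned to topic k
    n_kt = [[0] * vocab_size for _ in range(n_topics)]  # times word t is assigned to topic k
    n_k = [0] * n_topics                                # total assignments to topic k

    # Random initialization of every topic assignment.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for t in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kt[k][t] += 1
            n_k[k] += 1

    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, t in enumerate(doc):
                # Remove the current assignment (the "-i" counts).
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kt[k][t] -= 1
                n_k[k] -= 1
                # Unnormalized full conditional for each candidate topic.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kt[j][t] + beta) / (n_k[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                # Record the new assignment and restore the counts.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kt[k][t] += 1
                n_k[k] += 1
    return z, n_dk, n_kt, n_k

# Hypothetical toy corpus over a six-word vocabulary.
toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 2, 1, 1], [5, 3, 4, 4]]
z, n_dk, n_kt, n_k = collapsed_gibbs_lda(
    toy_docs, n_topics=2, vocab_size=6, alpha=0.1, beta=0.01, n_sweeps=200
)
print(n_dk)
```

Estimates of θ and φ can then be read off the count tables, for example as (n_dk + α) and (n_kt + β) normalized over topics and words respectively.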

Evaluation and Metrics

Intrinsic Measures

Intrinsic measures assess the quality of topic models internally, using only the model's parameters and the underlying corpus, without external tasks or human judgments. These metrics primarily evaluate how well the model captures the statistical structure of the data, focusing on fit and predictive generalization. Key examples include perplexity and held-out likelihood, which are derived from probabilistic principles and are applicable to models like Latent Dirichlet Allocation (LDA). Perplexity quantifies the model's predictive power on held-out test data by measuring how surprised the model is by unseen words, with lower values indicating better performance. It is computed as the exponential of the negative average log-likelihood per word across the test set:

\text{perplexity}(D_{\text{test}}) = \exp\left( -\frac{\sum_{d=1}^M \log p(w_d)}{\sum_{d=1}^M N_d} \right)

where D_{\text{test}} consists of M documents, w_d denotes the sequence of words in document d, and N_d is the length of d in words.^[2] This metric originates from language modeling and has been adapted for topic models to gauge generalization, as demonstrated in early LDA evaluations where LDA attained lower perplexity than unigram baselines.^[2] Despite its utility, perplexity has notable limitations in evaluating semantic quality, as it emphasizes likelihood-based fit over human-interpretable topic coherence or diversity.^[25] For instance, models with high perplexity may still produce meaningful topics, while low-perplexity models can yield semantically poor distributions.^[25] Held-out likelihood forms the basis for perplexity, directly estimating the probability of unseen documents under the model, p(w \mid \theta, \phi, \alpha), where \theta are document-topic distributions, \phi are topic-word distributions, and \alpha are hyperparameters. Due to the intractability of exact computation in LDA, approximations such as importance sampling or bridge sampling are employed.^[26] Log-likelihood on the training data measures in-sample fit but tends to favor overparameterized models due to overfitting, making it less reliable for model selection.^[26] One simple approximation is the harmonic mean estimator, which approximates the marginal likelihood as the harmonic mean of the conditional likelihoods evaluated at posterior samples z^{(s)}:

p(w \mid \theta, \phi, \alpha) \approx \left( \frac{1}{S} \sum_{s=1}^S \frac{1}{p(w \mid z^{(s)}, \phi)} \right)^{-1},

where S is the number of samples drawn from the posterior p(z \mid w, \theta, \phi, \alpha). This estimator is simple to compute from the output of a Gibbs sampler, but it can suffer from very high variance and tends to overestimate the marginal likelihood.^[26] As a representative example, LDA models trained on the 20 Newsgroups dataset (a collection of approximately 20,000 documents across 20 categories) have been reported to yield perplexity scores around 1068 for 128 topics, providing a benchmark for comparing inference methods and hyperparameters.^[27]
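The two formulas in this section can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the per-document and per-sample log-likelihoods are assumed to have been computed already by whatever inference procedure is in use. The harmonic mean is evaluated in log space to avoid numerical underflow.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-sum_d log p(w_d) / sum_d N_d) over a held-out set."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))

def log_harmonic_mean(sample_log_likelihoods):
    """Log of the harmonic mean of p(w | z^(s), phi) over S samples.

    Uses log HM = log S - logsumexp(-log p_s) for numerical stability.
    """
    negated = [-ll for ll in sample_log_likelihoods]
    m = max(negated)
    logsumexp = m + math.log(sum(math.exp(x - m) for x in negated))
    return math.log(len(negated)) - logsumexp

# Toy check (assumed data): a model that gives every token probability 1/50
# has perplexity 50, whatever the document lengths are.
lengths = [10, 30]
log_liks = [n * math.log(1 / 50) for n in lengths]
ppl = perplexity(log_liks, lengths)

# The harmonic mean is dominated by the least likely sample, which is one
# reason the estimator has high variance.
hm_equal = log_harmonic_mean([-100.0, -100.0, -100.0])
hm_skewed = log_harmonic_mean([-10.0, -1000.0])
```

In the skewed case the result is close to the smallest log-likelihood plus log 2, showing how a single poor sample controls the estimate.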

Extrinsic Measures

Extrinsic measures assess the practical utility and semantic quality of topic models by evaluating their performance in downstream applications and alignment with human judgments, rather than solely internal statistical properties. These metrics emphasize interpretability and effectiveness in real-world tasks, such as enhancing document classification or information retrieval systems. By focusing on external criteria, extrinsic evaluations help determine how well topics support broader NLP objectives, including user-facing applications where coherent and diverse topics improve outcomes like recommendation accuracy or search relevance. A primary extrinsic metric is topic coherence, which quantifies the semantic relatedness among the top words representing a topic, serving as a proxy for human interpretability. Coherence scores are derived from co-occurrence patterns in a reference corpus, such as Wikipedia, and have been validated against human annotations where evaluators rate topics on scales from coherent to incoherent. For instance, automatic coherence measures achieve Spearman rank correlations of up to 0.78 with human judgments on datasets like news articles and books, approaching inter-annotator agreement levels of 0.79–0.82.^[28] Human annotations typically involve multiple raters assessing 200–300 topics from models like LDA, providing gold-standard benchmarks for tuning and comparison.^[28] Prominent coherence variants include the UMass measure and Normalized Pointwise Mutual Information (NPMI). The UMass coherence computes the sum over pairs of top words of the log of their conditional co-occurrence probability, normalized by the total number of pairs:

\text{UMass} = \frac{2}{N(N-1)} \sum_{i=2}^{N} \sum_{j=1}^{i-1} \log \frac{D(w_i, w_j) + 1}{D(w_j)},

where N is the number of top words per topic, ordered by decreasing probability under the topic, D(w_i, w_j) is the number of documents containing both w_i and w_j, and D(w_j) is the number of documents containing w_j. The ratio is a smoothed estimate of the conditional probability P(w_i \mid w_j), with the added 1 preventing the logarithm of zero. This asymmetric measure favors word pairs that frequently co-occur in documents, promoting interpretable topics.^[29] NPMI normalizes pointwise mutual information to a bounded range, which reduces its sensitivity to rare words:

\text{NPMI}(w_i, w_j) = \frac{\log \frac{P(w_i, w_j)}{P(w_i) P(w_j)}}{-\log P(w_i, w_j)},

yielding values between -1 and 1, with higher scores indicating stronger semantic association based on joint and marginal probabilities from a large reference corpus. NPMI often outperforms UMass in correlating with human ratings due to its symmetry and normalization.^[28] Another widely adopted measure is C_v coherence, which combines document-level co-occurrence with pairwise word similarities derived from co-occurrence statistics (functioning as distributional embeddings) to compute an average indirect cosine similarity across topic words. This hybrid approach captures both topical proximity in documents and broader semantic links, achieving Pearson correlations of up to 0.859 with human evaluations on benchmarks like the 20 Newsgroups dataset. C_v is particularly effective for models producing diverse, human-readable topics, as it balances local context with global word relations. Topic coherence metrics are instrumental in hyperparameter tuning, such as selecting the optimal number of topics K, by plotting coherence scores against K and identifying peaks that indicate balanced granularity. Studies have demonstrated that maximizing semantic coherence during inference, such as through asymmetric priors in LDA, can improve the proportion of interpretable topics compared to standard settings.^[30] Recent advances include LLM-based metrics, such as Contextualized Topic Coherence (CTC), which leverage large language models to evaluate topic interpretability by considering contextual patterns and embeddings, achieving higher correlations with human judgments than traditional measures.^[31] Beyond coherence, extrinsic evaluations often examine integration in downstream tasks, where topic distributions serve as features for classifiers or retrieval systems. 
In document classification, topic-enhanced models have shown improvements in F1-scores on tasks like sentiment analysis, as topics provide compact, interpretable representations that capture latent themes missed by bag-of-words approaches. Similarly, in information retrieval, topics boost precision at rank 10 by aligning queries with thematic document clusters, enhancing relevance in large corpora. To ensure non-redundancy, diversity metrics complement coherence by quantifying topic overlap, typically via the average pairwise cosine similarity between topic-word probability vectors:

\text{TD} = 1 - \frac{1}{K(K-1)} \sum_{i \neq j} \cos(\phi_i, \phi_j),

where \phi_i and \phi_j are the topic-word distributions of topics i and j, and K is the number of topics; values closer to 1 indicate greater diversity. High diversity prevents topics from converging on similar terms, supporting comprehensive coverage in applications like corpus exploration.^[32]
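The coherence and diversity measures discussed in this section can be sketched with document-level counts. The code below is an illustrative sketch under stated assumptions: documents are represented as sets of words, every top word occurs in at least one document, and UMass is taken in its smoothed document-count form, the average of log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs. All function names are hypothetical.

```python
import math

def doc_freq(docs, *words):
    """Number of documents (sets of words) that contain all given words."""
    return sum(1 for d in docs if all(w in d for w in words))

def umass(top_words, docs):
    """Average of log((D(w_i, w_j) + 1) / D(w_j)) with w_j ranked before w_i."""
    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            joint = doc_freq(docs, top_words[i], top_words[j]) + 1
            scores.append(math.log(joint / doc_freq(docs, top_words[j])))
    return sum(scores) / len(scores)

def npmi(w_i, w_j, docs):
    """Normalized PMI from document-level probabilities, in [-1, 1]."""
    n = len(docs)
    p_i = doc_freq(docs, w_i) / n
    p_j = doc_freq(docs, w_j) / n
    p_ij = doc_freq(docs, w_i, w_j) / n
    if p_ij == 0:
        return -1.0  # convention: the words never co-occur
    if p_ij == 1.0:
        return 1.0   # convention for the degenerate 0/0 case
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def topic_diversity(topic_word):
    """1 minus the mean pairwise cosine similarity of topic-word vectors."""
    k = len(topic_word)
    total = sum(cosine(topic_word[i], topic_word[j])
                for i in range(k) for j in range(k) if i != j)
    return 1 - total / (k * (k - 1))

# Toy corpus (assumed data for illustration only).
docs = [{"cat", "dog"}, {"cat", "dog", "fish"}, {"cat"}, {"stock", "bond"}]
umass_score = umass(["cat", "dog"], docs)
npmi_related = npmi("cat", "dog", docs)
npmi_unrelated = npmi("cat", "stock", docs)
td_distinct = topic_diversity([[1.0, 0.0], [0.0, 1.0]])
td_identical = topic_diversity([[0.5, 0.5], [0.5, 0.5]])
```

A practical implementation would estimate the NPMI probabilities from sliding windows over a large reference corpus rather than from whole documents; the document-level counts here keep the example short.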

Applications

Text Corpus Analysis

Topic models have been widely applied to general text processing tasks, enabling the organization and exploration of large document collections through the discovery of latent themes. In document clustering, topic models project documents into a lower-dimensional space of topic distributions, facilitating the grouping of similar texts based on shared thematic content rather than exact word matches. This approach improves clustering accuracy by capturing semantic similarities, as demonstrated in integrations of topic modeling with traditional clustering algorithms like k-means, where topic weights serve as features for partitioning documents. For browsing, topic models support interactive navigation by providing summaries of document sets via topic proportions, allowing users to drill down into relevant subsets without exhaustive reading. A prominent application is in digital libraries, where topic models enhance search and discovery in vast archives. For instance, JSTOR's Topicgraph tool employs topic modeling to generate visual overviews of books, highlighting key topics and linking them to specific pages for efficient exploration of long-form content. This facilitates scholarly browsing by revealing thematic structures in monographs and journals, scaling to millions of digitized texts. Trend analysis in social media represents another key use, particularly for detecting evolving discussions over time. On platforms like Twitter, topic models identify emerging topics from streaming data, tracking shifts in public discourse such as event-driven conversations. Dynamic topic models extend this by modeling topic evolution across time slices, capturing how themes like political events or cultural trends change in large corpora, as introduced in the seminal work on dynamic topic models applied to historical document collections. 
Illustrative examples include visualizations of topic models on the New York Times Annotated Corpus, where interfaces allow users to explore article themes through interactive topic maps, revealing patterns in journalistic coverage over decades. Joint sentiment-topic models further enrich analysis by simultaneously inferring topics and associated polarities, enabling nuanced insights into opinion dynamics within text corpora, such as product reviews or news comments. Scalability to massive datasets is achieved through online variants of LDA, which update topic distributions incrementally as new documents arrive, processing millions of documents efficiently without requiring full-batch recomputation. This makes topic modeling viable for real-time applications on web-scale text, maintaining model quality while reducing computational demands.
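The clustering use described above, where topic weights serve as features for k-means, can be sketched as follows. This is a minimal pure-Python version of Lloyd's algorithm, assuming the document-topic proportions have already been inferred by a topic model; the function names, the toy proportions, and the caller-supplied initial centroids are all illustrative assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centers, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied initial centroids."""
    centers = [list(c) for c in initial_centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(len(centers)),
                key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Assumed document-topic proportions for four documents and three topics.
doc_topics = [
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.9, 0.1],
]
labels, centers = kmeans(doc_topics, [doc_topics[0], doc_topics[2]])
```

Because the features are topic proportions rather than raw word counts, the first two documents are grouped together even if they share few exact words.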

Biomedical and Scientific Literature

Topic modeling has been extensively applied to biomedical and scientific literature to uncover latent themes and trends in vast collections of research articles, particularly from databases like PubMed. By analyzing abstracts and full texts, methods such as Latent Dirichlet Allocation (LDA) enable the identification of evolving research foci, including disease mechanisms, treatment advancements, and interdisciplinary connections. For instance, LDA applied to corpora of millions of PubMed articles has revealed temporal shifts in research emphasis, such as the progression of studies on disease trajectories from basic etiology to clinical interventions.^[33]^[34] These applications facilitate quantitative biomedicine by grouping related publications, helping researchers synthesize knowledge without manual curation.

A notable example in cancer research involves survival-supervised LDA (survLDA), which integrates gene expression data with survival outcomes to characterize cancer subtypes. In a 2012 study, survLDA was used to model heterogeneous gene expression patterns in cancer datasets, identifying prognostic subtypes by linking topic distributions to patient survival, thereby providing interpretable biomarkers for personalized medicine.^[35] This approach shows how topic models extend beyond text to multimodal biomedical data, enhancing subtype discovery in oncology. Similarly, topic-based meta-analysis uses these models to aggregate evidence across studies; for example, LDA clusters publications by thematic similarity, enabling systematic reviews of treatment efficacy in sparse or heterogeneous datasets such as those for rare diseases.^[34]

Integration of topic modeling with network analysis has advanced drug discovery by mapping relationships between drugs, pathways, and genes in scientific literature. 
A pathway-based LDA variant analyzes PubMed texts to infer probabilistic associations, constructing networks that reveal potential drug targets and repurposing opportunities, such as linking off-target effects to novel therapeutic pathways.^[36] This method outperforms traditional keyword searches by capturing contextual co-occurrences in biomedical narratives. Biomedical texts often feature sparse medical terms and domain-specific jargon, posing challenges for standard topic models due to high dimensionality and rarity of specialized vocabulary. To address this, advanced variants like multiple kernel fuzzy topic modeling (MKFTM) incorporate fuzzy membership and kernel functions to handle sparsity, improving topic coherence in PubMed abstracts by reducing noise from infrequent terms while preserving semantic relevance.^[37] Additionally, specialized priors, such as those in Graph-Sparse LDA, enforce structured sparsity based on biomedical ontologies or graphs, enabling more interpretable topics that align with known biological relationships and mitigate overfitting in jargon-heavy corpora.^[38] These adaptations ensure robust performance in quantitative analyses of scientific literature, where evaluation metrics like topic coherence are crucial for validating domain-specific insights.^[34]

Creative and Multimedia Domains

Topic modeling extends to creative and multimedia domains, enabling the analysis of stylistic evolution, genre patterns, and collaborative influences in music, art, and digital humanities. In music, these methods process lyrics and symbolic representations like MIDI files to discover latent themes and genres. For example, applying BERTopic to 537,553 English song lyrics from diverse genres such as pop, rap, rock, country, and R&B uncovered 541 topics, revealing thematic shifts over 70 years—from romantic motifs like "tears_heart_wish" dominant in the 1960s–1980s to increased sexualization and profanity, such as the "nigga_niggas_bitch" topic comprising 37.88% of rap lyrics since the 1990s—thus highlighting genre-specific evolutions akin to those in Billboard chart analyses.^[39] Similarly, BERTopic on 3,455 song lyrics from 14 artists generated 215 topic clusters, measuring artist similarity via shared topics (e.g., hip-hop artists like 50 Cent, 2Pac, and Eminem overlapping in 5–6 emotional and event-based themes), which supports modeling collaborative patterns in creative works.^[40] Symbolic music data, such as MIDI sequences, benefits from specialized topic models that account for temporal structure. The Variable-gram Topic Model integrates latent topics with a Dirichlet Variable-Length Markov Model to learn probabilistic representations of melodic sequences within genres, outperforming standard LDA in next-note prediction on datasets like 264 Scottish and Irish folk reels by distinguishing musically meaningful regimes such as keys (e.g., G major vs. D major) and tempos.^[41] This approach models improvisation topics, as in jazz, by capturing contextual dependencies in sequential phrases, facilitating analysis of creative processes like spontaneous variation in solos.^[41] In digital humanities, topic models aid author attribution for artistic texts, treating authorship as a latent stylistic topic. 
The Disjoint Author-Document Topic model (DADT), an extension of LDA, projects authors and documents into separate topic spaces, achieving state-of-the-art accuracy (e.g., 93.64% on small essay sets and 28.62% on large blog corpora with 19,320 authors) by capturing genre-agnostic stylistic markers applicable to literary arts.^[42]

Multimedia extensions incorporate correlated topic models to handle images alongside text in creative analysis. The Topic Correlation Model (TCM) jointly models textual topics via LDA and image features via bag-of-visual-words (e.g., SIFT descriptors), enabling cross-modal retrieval on datasets like TVGraz and supporting stylistic studies in visual arts.

Unique to arts applications, topic models address sequential and multimodal data through dynamic variants. The Document Influence Model, a dynamic topic extension, analyzes 24,941 songs (1922–2010) to track topic evolution over time slices, using time-decay kernels to quantify how influential tracks (e.g., innovative ones from the 1970s) shape subsequent genre topics, thus modeling stylistic progression in music corpora.^[43] TCM further integrates sequential image-text pairs for multimodal creativity, such as correlating narrative descriptions with artistic visuals in digital archives.

Recent Advances and Challenges

Neural and Deep Learning Integrations

One significant advancement in neural topic modeling emerged with ProdLDA, introduced in 2017, which adapts Latent Dirichlet Allocation (LDA) using a variational autoencoder framework to enable scalable inference through amortized optimization.^[44] This model replaces LDA's word-level mixture of multinomials with a product of experts, allowing end-to-end training in which document representations are learned via an encoder-decoder architecture, resulting in more coherent topics than standard LDA, as measured by automated coherence scores on benchmark datasets like 20 Newsgroups.^[44] ProdLDA's amortized inference approximates the posterior distribution efficiently during training, addressing limitations of classical LDA by integrating neural components for better representation of topic-document relationships without requiring collapsed Gibbs sampling.^[44]

Taking a clustering-based approach, BERTopic, developed in 2020, leverages transformer-based embeddings from BERT combined with class-based TF-IDF (c-TF-IDF) to generate dynamic and interpretable topics from document clusters.^[45] The approach first embeds documents using BERT to capture contextual semantics, then applies dimensionality reduction via UMAP followed by HDBSCAN clustering, and finally represents topics with c-TF-IDF weighted by cluster assignments, enabling the model to handle evolving topics over time without retraining the entire pipeline.^[46] This integration has shown superior performance in topic diversity and coherence on short-text corpora, such as social media posts, where traditional bag-of-words models struggle due to sparsity.^[45]

Neural topic models have further improved short-text handling and enabled zero-shot topic discovery by incorporating pre-trained language models, allowing inference on unseen domains without fine-tuning.^[47] For instance, contextualized embeddings from multilingual transformers facilitate cross-lingual topic extraction in zero-shot settings, outperforming non-neural baselines by up to 12% in F1 scores on classification tasks derived from topics.^[48]

Recent developments from 2023 to 2025 have extended these to multimodal settings, such as neural topic models for text-image pairs, where joint variational inference on visual and textual features enhances topic interpretability in datasets such as artwork collections, achieving up to 174.8% improvement in recommendation accuracy over unimodal baselines.^[49]^[50] In applications with large language models (LLMs), neural topic models support interpretable prompting by providing structured topic representations that guide zero-shot generation, as seen in frameworks where LLMs rival traditional methods for topic assignment on long-context inputs.^[51] Scalability is bolstered through transformer architectures, enabling efficient processing of massive corpora via parallelizable embeddings and amortized inference, which reduces computational overhead by orders of magnitude compared to sampling-based alternatives.^[52] These integrations facilitate end-to-end training, where topic discovery and downstream tasks like classification are optimized jointly, promoting broader adoption in dynamic environments.^[44]
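The class-based TF-IDF step that BERTopic uses to describe each cluster can be illustrated in isolation. The sketch below assumes the weighting W_{t,c} = tf_{t,c} · log(1 + A / f_t), where tf_{t,c} is the count of term t in cluster c, f_t is its count across all clusters, and A is the average number of words per cluster. It is not the library's implementation: the function names are hypothetical, the cluster assignments are given rather than produced by embedding, UMAP, and HDBSCAN, and raw counts are used without any further normalization.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters: dict mapping a cluster id to a list of tokenized documents."""
    # Treat each cluster as one concatenated "class document".
    class_counts = {
        c: Counter(tok for doc in docs for tok in doc)
        for c, docs in clusters.items()
    }
    corpus_counts = Counter()
    for counts in class_counts.values():
        corpus_counts.update(counts)
    avg_words = sum(corpus_counts.values()) / len(class_counts)
    return {
        c: {t: tf * math.log(1 + avg_words / corpus_counts[t])
            for t, tf in counts.items()}
        for c, counts in class_counts.items()
    }

def top_terms(weights, n=2):
    """Highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

# Assumed toy clusters; "the" appears in both and is down-weighted.
clusters = {
    0: [["goal", "team", "the"], ["team", "goal", "the"]],
    1: [["stock", "market", "the"], ["market", "stock", "the"]],
}
weights = c_tf_idf(clusters)
sports_terms = top_terms(weights[0])
finance_terms = top_terms(weights[1])
```

Terms that are frequent in one cluster but rare elsewhere receive the largest weights, which is what makes the resulting topic labels cluster-specific.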

Scalability and Interpretability Issues

Scalability remains a primary challenge in topic modeling, particularly for big data applications where corpora exceed millions of documents. Traditional inference methods, such as Markov chain Monte Carlo sampling, suffer from high computational costs and slow convergence on large-scale datasets, often requiring days or weeks for training. To mitigate this, online variational Bayes approaches enable incremental learning by processing documents in mini-batches, allowing models like latent Dirichlet allocation to scale to massive streaming data without full recomputation.^[53] Similarly, distributed frameworks for hierarchical topic models spread computation across clusters, achieving linear speedup for corpora up to billions of tokens while maintaining topic quality.^[54]

Interpretability in topic models is hindered by issues of stability and diversity: topics must be consistent across multiple runs and sufficiently distinct to provide meaningful insights. Instability arises when random initializations lead to different topic-word distributions, complicating reliable analysis; it is typically assessed by comparing the similarity of topics, for example the overlap of their top words, across reruns. Diversity ensures that topics capture broad aspects of the corpus without overlap and is evaluated through measures like topic-word exclusivity, which penalizes redundant themes. Neural topic models exacerbate these challenges, as opaque embeddings can produce less human-readable topics than classical methods.^[55]^[56] Efforts to enhance interpretability include regularization techniques that promote coherent and diverse topics, such as semantic similarity constraints in variational autoencoders.^[57]

Neural topic models integrating word embeddings inherit biases from pre-trained representations, resulting in skewed topics that amplify societal prejudices, such as gender stereotypes in word co-occurrences. 
For instance, embeddings trained on web corpora often associate professional terms with masculine attributes, leading to biased topic clusters in downstream applications like document classification. Post-2020 studies employing topic modeling on AI literature have revealed ethical issues in biased topic discovery, including the reinforcement of discriminatory narratives in social media analysis and the need for debiasing interventions to ensure equitable outcomes.^[58]^[59] These biases pose risks of perpetuating inequities, prompting calls for fairness-aware training in topic discovery pipelines.^[60] Future directions in topic modeling emphasize hybrid symbolic-neural architectures, which combine neural embeddings for pattern recognition with symbolic rules for explicit reasoning, improving both scalability and interpretability in complex domains. Real-time streaming topic models, leveraging online updates and embedding spaces, enable dynamic topic evolution on live data feeds like social media, supporting applications in crisis monitoring. Standardization of evaluation remains crucial, with ongoing efforts to develop unified benchmarks for coherence, diversity, and downstream utility to facilitate reproducible comparisons across models.^[61]^[62]^[63]^[64]
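Run-to-run stability, discussed above, is often quantified by matching topics between two runs and measuring how much their top words overlap. The sketch below is one simple way to do this, assuming Jaccard overlap of top-word sets and a greedy best match per topic rather than an optimal one-to-one assignment. The function names and the toy topics are hypothetical, and each topic is assumed to have at least one top word.

```python
def jaccard(a, b):
    """Overlap of two top-word lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def run_stability(run_a, run_b):
    """Mean, over topics in run_a, of the best Jaccard match found in run_b."""
    return sum(max(jaccard(t, u) for u in run_b) for t in run_a) / len(run_a)

# Assumed top words from two runs with different random initializations.
run_a = [["gene", "cell", "protein"], ["market", "stock", "price"]]
run_b = [["stock", "market", "trade"], ["cell", "gene", "protein"]]
stability = run_stability(run_a, run_b)
```

A score of 1 means every topic in the first run has an exact counterpart in the second; here one topic matches fully and the other shares two of four distinct words, so the score falls between the extremes.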

References

[1]
[PDF] Introduction to Probabilistic Topic Models
Probabilistic topic models are a suite of algorithms whose aim is to discover the hidden thematic structure in large archives of documents.
[2]
[PDF] Latent Dirichlet Allocation - Journal of Machine Learning Research
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level ...
[3]
[2401.15351] A Survey on Neural Topic Models - arXiv
Jan 27, 2024 · In this paper, we present a comprehensive survey on neural topic models concerning methods, applications, and challenges.
[4]
[PDF] Probabilistic topic models - Columbia CS
In generative probabilistic modeling, we treat our data as arising from a generative process that includes hidden variables. This generative process ...
[5]
LDA-based document models for ad-hoc retrieval - Semantic Scholar
LDA-based document models for ad-hoc retrieval.
[6]
[PDF] LDA-Based Document Models for Ad-hoc Retrieval
LDA-Based Document Models for Ad-hoc Retrieval. Xing Wei and W. Bruce Croft. Computer Science Department. University of Massachusetts Amherst. 140 Governors ...
[7]
Joint sentiment/topic model for sentiment analysis | Proceedings of ...
This paper proposes a novel probabilistic modeling framework based on Latent Dirichlet Allocation (LDA), called joint sentiment/topic model (JST),
[8]
https://dl.acm.org/doi/10.1145/1645953.1646003
[9]
[PDF] The SMART system - AN INTRODUCTION Gerard Salton - SIGIR
The first eleven sections of the present report are devoted to a detailed description of the SMART document retrieval system. This system is designed to process ...
[10]
A vector space model for automatic indexing - ACM Digital Library
Salton, G. Automatic Information Organization and Retrieval. McGraw-Hill, New York, 1968, Ch. 4.
[11]
History of the SIGIR conferences - SIGIR'07
The first official SIGIR conference was held in 1978 in Rochester, New York in the USA chaired by James Iverson. The second conference in Dallas, Texas in the ...
[12]
Probabilistic latent semantic indexing - ACM Digital Library
GILDEA, D., AND HOFMANN, T. Topic-based ... In Proceedings of the 6th European Conference on Speech Communication and Technology (EUROSPEECH) (1999).
[13]
Probabilistic Topic Models - Communications of the ACM
Apr 1, 2012 · This generative process defines a joint probability distribution over both the observed and hidden random variables. We perform data analysis by ...
[14]
Learning the parts of objects by non-negative matrix factorization
Oct 21, 1999 · Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text.
[15]
Algorithms for Non-negative Matrix Factorization - NIPS papers
Authors. Daniel D. Lee, H. Sebastian Seung. Abstract. Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for ...
[16]
Archetypal Analysis: Technometrics - Taylor & Francis Online
Archetypal analysis represents each individual in a data set as a mixture of individuals of pure type or archetypes. The archetypes themselves are restricted to ...
[17]
Selection of the Optimal Number of Topics for LDA Topic Model ...
Latent Dirichlet Allocation (LDA) is a document topic generation model proposed by Blei et al. (2003) after introducing the Dirichlet distribution based on ...
[18]
[PDF] Learning Topic Models — Going beyond SVD - arXiv
Apr 10, 2012 · in the thousands or tens of thousands, but the number of topics is usually somewhere in the range from 50 to 100. Note that separability ...
[19]
[PDF] Learning the parts of objects by non-negative matrix factorization
When non-negative matrix factorization is implemented as a neural network, parts-based representations emerge by virtue of two properties: the firing rates ...
[20]
https://proceedings.neurips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf
[21]
Finding scientific topics - PNAS
We applied our Gibbs sampling algorithm to this dataset, together with the two algorithms that have previously been used for inference in Latent Dirichlet ...
[22]
[PDF] Reducing the Sampling Complexity of Topic Models
Aug 24, 2014 · Sampling complexity is reduced by scaling with instantiated topics, using a Metropolis-Hastings step, sparsity, and amortized sampling via ...
[23]
Evaluation methods for topic models - ACM Digital Library
Evaluation methods for topic models. Authors: Hanna M. Wallach (University of Massachusetts, Amherst, MA), Iain Murray, ...
[24]
[PDF] Evaluation Methods for Topic Models
This method is computationally expensive, but is often accurate. For the harmonic mean method, B = ...
[25]
[PDF] Ordering-sensitive and Semantic-aware Topic Modeling - arXiv
Feb 12, 2015 · Latent Dirichlet Allocation (LDA): In the LDA model. (Blei, Ng, and ... More specifically, for 20 Newsgroups data set, the perplexity de-.
[26]
Automatic Evaluation of Topic Coherence - ACL Anthology
Automatic Evaluation of Topic Coherence. David Newman, Jey Han Lau, Karl Grieser, Timothy Baldwin.
[27]
[PDF] Evaluating topic coherence measures
The main contribution of this paper is to compare coherence measures of different complexity with human ratings. Furthermore, we include in our study not just ...
[28]
Optimizing Semantic Coherence in Topic Models - ACL Anthology
Optimizing Semantic Coherence in Topic Models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 262–272, ...
[29]
Benchmarking topic models on scientific articles using BERTeley
Röder M., Both A., Hinneburg A. Exploring the space of topic coherence measures. Proceedings of the Eighth ACM International Conference on Web Search and Data ...
[30]
(PDF) Leveraging Topic Modelling to Analyze Biomedical Research ...
Jun 15, 2024 · The results of this study suggest that topic modelling using the LDA can be used to identify trends in biomedical research with high accuracy.
[31]
An overview of topic modeling and its current applications in ...
Sep 20, 2016 · The aim of topic modeling is to discover the themes that run through a corpus by analyzing the words of the original texts. We call these themes ...
[32]
https://www.sciencedirect.com/science/article/pii/S2949719123000419
[33]
https://www.researchgate.net/publication/381696091_Leveraging_Topic_Modelling_to_Analyze_Biomedical_Research_Trends_from_the_PubMed_Database_Using_LDA_Method
[34]
A novel multiple kernel fuzzy topic modeling technique for ...
Jul 12, 2022 · We described our proposed multiple kernel fuzzy topic modeling method that discover the uncover hidden topics in biomedical text documents.
[35]
Graph-Sparse LDA: A Topic Model with Structured Sparsity
Feb 21, 2015 · Graph-Sparse LDA recovers sparse, interpretable summaries on two real-world biomedical datasets while matching state-of-the-art prediction performance.
[36]
Summary of BERTopic Use for Analyzing Song Lyrics Across Genres
[37]
[PDF] Measuring the Similarity of Song Artists using Topic Modelling
Oct 10, 2022 · In this paper, we propose an topic modeling-based approach for measuring the similarity of the music artists based only on their song lyrics.
[38]
[PDF] A Topic Model for Melodic Sequences
We examine the problem of learning a probabilistic model for melody directly from musical sequences belonging to the same genre. This ...
[39]
Authorship Attribution with Topic Models | Computational Linguistics
Utilizing our model in authorship attribution yields state-of-the-art performance on several data sets, containing either formal texts written by a few authors ...
[40]
[PDF] Modeling Musical Influence with Topic Models
Here we model influence as a process where one song affects the “musical language” of a musical stream, or “topic”.
[41]
Autoencoding Variational Inference For Topic Models - arXiv
Mar 4, 2017 · By changing only one line of code from LDA, we find that ProdLDA yields much more interpretable topics, even if LDA is trained via collapsed ...
[42]
BERTopic: Neural topic modeling with a class-based TF-IDF ... - arXiv
Mar 11, 2022 · We present BERTopic, a topic model that extends this process by extracting coherent topic representation through the development of a class-based variation of ...
[43]
BERTopic - Maarten Grootendorst
BERTopic is a topic modeling technique that leverages transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics.
[44]
[PDF] Cross-lingual Contextualized Topic Models with Zero-shot Learning
This paper introduces a novel neural topic modeling architecture in which we replace the input BoW document representations with multilingual contextualized ...
[45]
[PDF] Leveraging Zero-Shot Text Classification by Topic Modeling - HAL
Jun 4, 2022 · We show that. ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in the F1 score in ...
[46]
Neural Multimodal Topic Modeling: A Comprehensive Evaluation
Mar 26, 2024 · This paper presents the first systematic and comprehensive evaluation of multimodal topic modeling of documents containing both text and images.
[47]
MultArtRec: A Multimodal Neural Topic Modeling for Integrating ...
Jan 10, 2024 · MultArtRec is a neural topic modeling system for artwork recommendation, using image and text features to extract user preferences.
[48]
https://telecom-paris.hal.science/hal-03628242/file/preprint_zeroberto.pdf
[49]
https://arxiv.org/abs/2403.17308
[50]
[PDF] Scalable Topic Modeling: Online Learning, Diagnostics, and ... - DTIC
While stochastic variational inference scaled Bayesian computation up to massive data, black box variational inference expands the scope of scalable Bayesian ...
[51]
[PDF] Scalable Training of Hierarchical Topic Models - VLDB Endowment
Large-scale topic models serve as basic tools for feature extraction and dimensionality reduction in many practical applications.
[52]
A Review of Stability in Topic Modeling: Metrics for Assessing and ...
This paper fills that gap and provides a systematic review of different approaches to measure stability and of various techniques that are intended to improve ...
[53]
Enhancing Topic Interpretability for Neural Topic Modeling through ...
For topic interpretability, we choose two kinds of common metrics: topic coherence, and topic diversity. Topic coherence measures the average NPMI over the ...
[54]
[PDF] Measuring the Interpretability of Statistical Topics
One key concern with topic models lies with how well human beings can actually understand the topics, or the problem of topic interpretability. It may be true ...
[55]
Bias in word embeddings | Proceedings of the 2020 Conference on ...
Jan 27, 2020 · Recent studies demonstrate that word embeddings contain and amplify biases present in data, such as stereotypes and prejudice.
[56]
Topic Modeling in Embedding Spaces - MIT Press Direct
Jul 1, 2020 · Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics ...
[57]
Navigating the Muddy Waters of Bias in Artificial Intelligence Research
Oct 30, 2025 · In this study, we employ topic modeling on 6,520 articles to explore how the AI research community interprets the concept of bias. Our results ...
[58]
Neuro-Symbolic AI: Explainability, Challenges, and Future Trends
Nov 7, 2024 · This article proposes a classification for explainability by considering both model design and behavior of 191 studies from 2013, focusing on neuro-symbolic AI.
[59]
Real-Time Topic Modeling for Streaming Embedding Spaces - arXiv
Sep 1, 2025 · Applying this technique, we create Chronotome, a tool for interactively exploring evolving themes in time-based data -- in real time. We ...
[60]
Beyond standardization: a comprehensive review of topic modeling ...
Jun 30, 2025 · Beyond standardization: a comprehensive review of topic modeling validation methods for computational social science research.
[61]
Quantum approaches for inference and decision-making in quantum ...
Sep 6, 2025 · To address this, we propose a recursive quantum-classical Bayesian network inference method inspired by the forward–backward algorithm. By ...