
Latent semantic analysis

Latent semantic analysis (LSA) is a technique in natural language processing, particularly distributional semantics, that analyzes relationships between terms and documents in a corpus. Also known as latent semantic indexing (LSI) in information retrieval contexts, LSA applies singular value decomposition (SVD) to a term-document matrix to uncover latent semantic structures, reducing dimensionality while capturing associations beyond exact keyword matches. Developed in the late 1980s by Scott Deerwester and colleagues, it addresses issues like vocabulary mismatch in search systems by representing documents and queries in a lower-dimensional space where similarity is computed using cosine similarity. LSA constructs a semantic space from term co-occurrences, enabling applications in information retrieval, text classification, and cognitive modeling. Despite its influence, it relies on a bag-of-words approach, ignoring word order and syntax, and faces scalability challenges due to the computational cost of SVD.

Overview

Definition and core principles

Latent semantic analysis (LSA) is an unsupervised statistical technique that employs singular value decomposition to analyze relationships between terms and documents in a corpus, representing them as vectors in a lower-dimensional space that captures underlying semantic similarities. Developed to enhance information retrieval, LSA uncovers latent structures in word usage patterns without requiring explicit labeling or training on annotated data. By reducing the dimensionality of the representation while preserving key associations, LSA enables more robust modeling of text semantics compared to traditional bag-of-words approaches.

A primary limitation of exact term matching in information retrieval is its failure to account for synonymy—where different words like "car" and "automobile" denote the same concept—and polysemy, where a single word has multiple meanings depending on context; LSA addresses these by deriving hidden associations from co-occurrence patterns across documents. For instance, if documents frequently pair "car" with "driver" and "road," while "automobile" appears with similar contextual terms, LSA infers their semantic equivalence even without direct overlap in a query. This latent structure allows retrieval of relevant documents that might not share exact query terms but align semantically.

To illustrate, consider a simple term-document matrix where rows represent terms (e.g., "human," "interface," "computer," "user," "system," "response," "time") and columns represent short documents or passages; entries indicate term frequencies. After dimensionality reduction via singular value decomposition (SVD), the matrix approximates a lower-rank form (e.g., retaining 2-3 dimensions), projecting terms and documents into a shared semantic space where similarity is measured by the cosine between vectors. For example, in a reduced space, vectors for "human" and "user" would cluster closely due to shared latent dimensions related to human-computer interaction concepts, enabling similarity scores that reflect intuitive relatedness (e.g., ~0.8 for high similarity).

At its core, LSA extends the vector space model by assuming that observed word distributions arise from a smaller set of latent topics or concepts, which can be statistically estimated to filter noise from irregular usage. This principle posits that semantic meaning emerges from higher-order co-occurrences rather than isolated terms, with the reduced space approximating human-like judgments of word and document similarity. By focusing on these latent factors, LSA provides a principled way to bridge surface-level text variations with deeper conceptual links.
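As a minimal sketch of these core principles (not drawn from any specific published example), the following Python snippet builds a made-up 4×3 count matrix, truncates its SVD to two dimensions, and compares two term vectors by cosine similarity; the term labels and counts are purely illustrative.

import numpy as np

# Toy term-document count matrix; rows: "human", "interface", "user", "tree" (hypothetical)
A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

# Full SVD, then keep the top k = 2 dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]          # reduced k-dimensional term vectors

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(term_vecs[0], term_vecs[1]))  # similarity of "human" vs. "interface" in the reduced space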

Distinction from latent semantic indexing

Latent semantic indexing (LSI) refers to the specific application of latent semantic analysis (LSA) techniques to the construction of term-document matrices for enhancing the relevance of search results in information retrieval systems. By reducing the dimensionality of these matrices through singular value decomposition, LSI uncovers hidden associations between terms and documents, allowing for more accurate matching beyond exact keyword overlaps. The term "latent semantic indexing" originated in a 1988 paper by Deerwester and colleagues, which introduced the method as a way to address vocabulary mismatches in textual retrieval, though it built on earlier explorations of semantic structures in text.

While the terms LSA and LSI are frequently used interchangeably in the literature, LSI is technically narrower, focusing on indexing applications rather than the broader framework of semantic analysis. A primary distinction lies in their scope: LSA serves as a general approach to semantic representation that captures semantic relationships in text data across various domains, whereas LSI particularly emphasizes retrieval mechanisms like query expansion—where user queries are projected into the reduced semantic space to incorporate related latent terms—and subsequent document ranking based on cosine similarities to mitigate issues such as synonymy and polysemy. For instance, in a database search for "car," LSI might automatically expand the query to include latent terms like "automobile" or "vehicle" derived from co-occurrence patterns in the corpus, thereby retrieving more relevant documents without requiring explicit synonyms from the user.

Mathematical Foundations

Term-document matrix

The term-document matrix, also known as the document-term matrix in some contexts, serves as the foundational data structure in latent semantic analysis (LSA), representing the frequency of terms across a collection of documents. In this matrix, rows correspond to unique terms (typically words or word stems), columns correspond to individual documents, and each cell entry quantifies the association between a term and a document, often through raw frequency counts or weighted measures. This structure captures the explicit distributional patterns of terms in the corpus, enabling subsequent mathematical transformations to uncover latent relationships.

Construction of the term-document matrix begins with preprocessing the raw text to ensure a clean and standardized representation. Key steps include tokenization, which breaks down documents into individual words or tokens by identifying word boundaries and handling punctuation; removal of stop words, such as common function words like "the" or "is" that appear frequently but carry little semantic value; and stemming or lemmatization, which reduces words to their base or root form (e.g., "running" to "run") to merge variants and reduce vocabulary size. These steps minimize noise, address morphological variations, and focus on content-bearing terms, typically excluding those appearing in fewer than a certain number of documents to avoid rare outliers. For instance, in early implementations, terms were automatically extracted from titles and abstracts using predefined stop lists, without stemming in some cases to preserve specific meanings.

Variants of the term-document matrix differ primarily in how cell values are computed to better reflect term importance. The raw frequency matrix uses simple counts of term occurrences in each document, which can overemphasize frequently repeated terms within a single document. To mitigate this, weighted versions apply schemes like term frequency-inverse document frequency (tf-idf), where each entry is calculated as the term's frequency in the document multiplied by the logarithm of the total number of documents divided by the number of documents containing that term; this downweights common terms across the corpus while highlighting distinctive ones. Alternative weightings, such as logarithmic transformations (e.g., log(frequency + 1)) or entropy-based adjustments, further normalize for distributional biases and document length. These variants enhance the matrix's ability to represent semantic content before dimensionality reduction.

A representative example illustrates the matrix's structure for a small corpus of three documents: Document 1 ("The cat sat on the mat"), Document 2 ("The dog chased the cat"), and Document 3 ("The mat and the dog"). After preprocessing (tokenization, stop-word removal for "the", and stemming), the unique terms are "cat", "sat", "on", "mat", "dog", "chase", "and". Using raw term frequencies, the resulting 7×3 matrix is sparse, with many zeros indicating non-occurrences:
Term      Doc1   Doc2   Doc3
cat       1      1      0
sat       1      0      0
on        1      0      0
mat       1      0      1
dog       0      1      1
chase     0      1      0
and       0      0      1
This example highlights the high dimensionality even in small corpora—real-world applications often involve thousands of terms (rows), leading to matrices like 5,823 × 1,033 for collections of medical abstracts. The term-document matrix exhibits key properties that underscore its limitations for direct semantic analysis. It is inherently sparse, with the vast majority of entries as zeros due to the limited overlap of terms across documents; for example, sparsity rates can exceed 99% in typical corpora. This sparsity exacerbates the curse of dimensionality, where high-dimensional spaces (often tens of thousands of terms) result in unreliable similarity measures, such as cosine distances approaching uniformity. Moreover, the matrix fails to capture semantics directly because it treats polysemous terms (e.g., "bank" as river edge or financial institution) as single vectors averaging their contexts, ignoring contextual nuances and leading to distorted associations. These issues motivate dimensionality reduction techniques like singular value decomposition to approximate a lower-dimensional semantic space.
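To make the construction concrete, the following Python sketch builds the same three-document example with scikit-learn's CountVectorizer; only "the" is dropped as a stop word to stay close to the table above, and no stemmer is applied, so "chased" remains unstemmed (a fuller pipeline would add one).

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat",
        "The dog chased the cat",
        "The mat and the dog"]

# Drop only "the" so the vocabulary stays close to the worked example above
vectorizer = CountVectorizer(stop_words=["the"])
X = vectorizer.fit_transform(docs)               # documents x terms, sparse matrix

A = X.T                                          # transpose to terms x documents
print(vectorizer.get_feature_names_out())        # vocabulary (rows of A)
print(A.toarray())                               # raw count entries
print(1.0 - A.nnz / (A.shape[0] * A.shape[1]))   # fraction of zero entries (sparsity)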

Singular value decomposition

Singular value decomposition (SVD) is a matrix factorization technique that decomposes a rectangular matrix A, such as the term-document matrix in latent semantic analysis, into the product of three matrices: A = U \Sigma V^T, where U and V are orthogonal matrices, and \Sigma is a diagonal matrix containing non-negative real numbers known as singular values, arranged in decreasing order. The columns of U are the left singular vectors, associated with the rows (terms) of A; the columns of V are the right singular vectors, associated with its columns (documents); and the diagonal entries of \Sigma quantify the strengths of these directions.

In the context of latent semantic analysis, the left singular vectors in U correspond to term concepts, capturing latent associations among terms; the right singular vectors in V represent document concepts, encoding latent structures across documents; and the singular values in \Sigma indicate the relative importance or strength of each latent dimension. This decomposition reveals orthogonal latent factors that underlie the original associations in the term-document matrix, allowing for the recovery of hidden semantic relationships not apparent in the surface-level co-occurrence data.

A key property of SVD is its ability to provide low-rank approximations of the original matrix. The rank-k approximation is given by A_k = U_k \Sigma_k V_k^T, where U_k and V_k consist of the first k columns of U and V, respectively, and \Sigma_k is the k \times k diagonal matrix of the largest k singular values. By the Eckart-Young theorem, this truncated SVD yields the optimal rank-k approximation to A in both the Frobenius norm and the spectral norm, minimizing the error \|A - A_k\| for a given rank k.

SVD is particularly suited for latent semantic analysis because it extracts a set of uncorrelated orthogonal factors that approximate the original term-document matrix while reducing dimensionality and capturing the underlying semantic structure through these latent dimensions. This enables the representation of terms and documents in a lower-dimensional space where similarities are derived from projections onto these orthogonal concept axes, providing a robust mathematical foundation for uncovering latent meanings.
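A brief NumPy sketch, using a random matrix as a stand-in for a term-document matrix, illustrates the factorization and the Eckart-Young property that the Frobenius error of the rank-k truncation equals the root sum of squares of the discarded singular values.

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((7, 5))                       # stand-in term-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(A, U @ np.diag(s) @ Vt)   # exact reconstruction A = U Sigma V^T

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation (Eckart-Young)
print(np.linalg.norm(A - A_k, "fro"))        # Frobenius error of the truncation
print(np.sqrt(np.sum(s[k:] ** 2)))           # equals the error: sqrt of discarded singular values squared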

Methodology

Matrix construction and preprocessing

The process of matrix construction for latent semantic analysis (LSA) begins with corpus collection, where a set of relevant documents is gathered to form the basis of the analysis. These documents can include abstracts, full texts, or other textual units from domains such as scientific literature or news articles, ensuring representation of the target semantic space. The corpus size typically ranges from hundreds to thousands of documents for initial applications, though scalable methods allow for larger sets.

Preprocessing follows to clean and normalize the text, starting with tokenization to split documents into individual words or terms based on whitespace and punctuation boundaries. Punctuation marks are removed to avoid fragmenting meaningful terms, and all text is converted to lowercase to ensure case-insensitive matching and reduce vocabulary redundancy—for instance, treating "Car" and "car" as identical. Stop words, such as common function words like "the" or "and," are then eliminated using predefined lists to focus on content-bearing terms, which can reduce the effective vocabulary by up to 30-50% without significant loss of meaning. Stemming or lemmatization may optionally be applied to conflate word variants (e.g., "running" to "run"), though early LSA implementations often omitted this to preserve exact term associations.

Vocabulary building involves extracting unique terms from the cleaned corpus to define the rows of the term-document matrix. Terms appearing in only one document are typically filtered out to reduce noise and sparsity, resulting in a controlled set of index terms—for example, reducing from over 10,000 potential words to around 5,000-6,000 in standard test collections. This step ensures the matrix captures co-occurrence patterns relevant to the corpus's semantics.

Weighting schemes are applied to populate the matrix entries, with term frequency-inverse document frequency (tf-idf) being a widely adopted method to emphasize discriminative terms. The tf-idf weight for a term t in document d is calculated as \text{tf-idf}(t, d) = \text{tf}(t, d) \times \log \left( \frac{N}{\text{df}(t)} \right), where \text{tf}(t, d) is the frequency of t in d, N is the total number of documents, and \text{df}(t) is the number of documents containing t. For example, in a corpus of N = 1000 documents, if the term "algorithm" appears 5 times in a document (\text{tf} = 5) and in 50 documents (\text{df} = 50), its tf-idf score is 5 \times \log(1000 / 50) = 5 \times \log(20) \approx 5 \times 1.30 = 6.50, highlighting its relative importance. This weighting transforms raw frequencies into a sparse matrix where zeros represent term absences, prioritizing terms that are rare across the corpus yet frequent in specific documents.

For large corpora exceeding computational limits for singular value decomposition, strategies such as random sampling of documents or truncation to the top-k most frequent terms and documents are employed to manage dimensions while preserving semantic structure. Sampling might select a representative subset (e.g., 10-20% of documents), and truncation could limit the vocabulary to the 10,000-50,000 highest-frequency terms, reducing sparsity and processing time without severely degrading the latent structure. Specific challenges arise with multilingual text, where morphological differences across languages necessitate language-specific tokenization, stop-word lists, and stemming rules, often requiring separate matrices or cross-lingual alignment techniques to enable joint analysis.
Domain-specific vocabularies pose additional issues, as general corpora may underrepresent specialized terms (e.g., medical terminology in a legal corpus), demanding tailored preprocessing to include or adapt domain terms for accurate semantic capture. The output is a sparse term-document matrix, with rows as terms, columns as documents, and entries as tf-idf weights (or raw frequencies), primed for dimensionality reduction via singular value decomposition.
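The worked tf-idf figure above can be reproduced directly; this small Python check assumes base-10 logarithms, as implied by the 1.30 factor, whereas library implementations such as scikit-learn's TfidfVectorizer use smoothed natural-log variants and therefore give different absolute values.

import math

tf = 5        # occurrences of "algorithm" in the document
N = 1000      # total documents in the corpus
df = 50       # documents containing "algorithm"

tfidf = tf * math.log10(N / df)
print(round(tfidf, 2))   # ≈ 6.5, matching the worked example above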

Dimensionality reduction process

The dimensionality reduction process in latent semantic analysis (LSA) begins with the application of singular value decomposition (SVD) to the term-document matrix A, which is decomposed as A = T S D^T, where T is the term matrix of left singular vectors, S is the diagonal matrix of singular values, and D is the document matrix of right singular vectors. To reduce dimensionality, the full decomposition is truncated by retaining only the top k singular values and corresponding vectors, yielding the rank-k approximation A_k = T_k S_k D_k^T. This truncation step is performed after computing the full SVD, often using iterative algorithms for large matrices to approximate the decomposition efficiently.

The rationale for lowering the rank to k (typically 100-300) lies in discarding smaller singular values, which represent noise, variability in word usage, and less significant associations, thereby emphasizing the major underlying semantic dimensions that capture reliable patterns in the data. For instance, in empirical evaluations on medical abstract collections, ranks around 100 have been shown to filter out incidental co-occurrences while preserving core topical structures.

Selecting the optimal k involves methods such as cross-validation on retrieval performance or examining the "elbow" in the singular value spectrum, where additional dimensions yield diminishing returns in explained variation. Heuristics like retaining dimensions that explain a high proportion of the total variance—such as 90% of the sum of squared singular values—can guide the choice, though this approach may not always align with task-specific effectiveness in retrieval, as higher ranks (e.g., 300) sometimes improve synonym recognition without proportionally increasing variance captured. The fidelity of the reduced approximation is quantified by the Frobenius norm \|A - A_k\|_F, which measures the least-squares difference and decreases monotonically as k increases, but practical trade-offs favor smaller k to balance fidelity with computational efficiency and noise reduction. Following reduction, terms and documents are projected into the k-dimensional space as vectors from T_k S_k and S_k D_k^T, respectively, enabling compact representations for subsequent analysis.
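As one hedged illustration of the variance heuristic described above, the following sketch picks the smallest k whose cumulative squared singular values exceed 90%; the random stand-in matrix makes the printed k illustrative only, and task-based validation remains preferable in practice.

import numpy as np

rng = np.random.default_rng(1)
A = rng.random((2000, 400))                        # stand-in for a term-document matrix

_, s, _ = np.linalg.svd(A, full_matrices=False)
explained = np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative share of squared singular values

k_90 = int(np.searchsorted(explained, 0.90)) + 1   # smallest k capturing 90% of the variance
print(k_90)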

Derivation of semantic space

In latent semantic analysis (LSA), the derivation of the semantic space begins with the reduced-rank approximation of the term-document matrix A \approx U_k \Sigma_k V_k^T, where U_k is the t \times k matrix of the first k left singular vectors (term-concept loadings), \Sigma_k is the k \times k diagonal matrix of the top k singular values, and V_k is the d \times k matrix of the first k right singular vectors (document-concept loadings). This approximation projects terms and documents into a lower-dimensional space where each dimension represents a latent concept derived from patterns of term co-occurrences across documents.

The term vectors in this semantic space, denoted T', are obtained as T' = U_k \Sigma_k^{1/2}, providing k-dimensional representations for each of the t terms that balance the influence of singular values. Similarly, the document vectors D' are derived as D' = \Sigma_k^{1/2} V_k^T, yielding k-dimensional representations for each of the d documents. The concept vectors themselves are captured by the columns of U_k (for terms) and V_k (for documents), forming an orthogonal basis that encodes shared semantic structures. These transformations ensure that the inner product between a term vector \mathbf{t}_i' and a document vector \mathbf{d}_j' approximates the original matrix entry A_{ij}, preserving associative information while revealing hidden relationships.

Similarity in the semantic space is typically measured using the cosine similarity between vectors: \cos(\theta) = \frac{\mathbf{t}_i' \cdot \mathbf{d}_j'}{\|\mathbf{t}_i'\| \|\mathbf{d}_j'\|}. This metric quantifies the angle between vectors, emphasizing directional alignment over magnitude differences, which highlights semantic relatedness independent of term frequency variations. The orthogonal dimensions of the space emerge as latent factors, each corresponding to clusters of co-occurring terms interpreted as "topics" or semantic themes, such as medical contexts where terms like "hospital" and "treatment" load highly on the same dimension.

The latent structure arises because the SVD uncovers higher-order associations: terms that do not co-occur directly but share indirect contextual links (e.g., through intermediary documents) become aligned in the reduced space. For instance, in a corpus discussing healthcare, the vectors for "physician" and "doctor" may initially differ in the full term-document matrix due to exact term mismatches, but after projection, their coordinates align closely along dimensions capturing medical co-occurrences, yielding a high cosine similarity (e.g., above 0.8 in empirical tests on encyclopedic texts). This resolves synonymy by inferring equivalence from distributional patterns.

Theoretically, LSA's derivation draws from correspondence analysis, which applies singular value decomposition to contingency tables to reveal associations between categorical variables, adapting this to term-document matrices as a form of distributional semantics where meaning is induced from co-occurrence statistics. This connection positions LSA as an algebraic method for approximating semantic spaces akin to human knowledge acquisition from contextual exposure.
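A short NumPy sketch, with a random stand-in matrix, shows the T' = U_k Σ_k^{1/2} and D' = Σ_k^{1/2} V_k^T construction and verifies that the inner product of a term vector and a document vector approximates the corresponding cell of A.

import numpy as np

rng = np.random.default_rng(2)
A = rng.random((50, 20))                       # stand-in term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
T = U[:, :k] * np.sqrt(s[:k])                  # term vectors, rows of T' = U_k Sigma_k^{1/2}
D = (np.sqrt(s[:k])[:, None] * Vt[:k, :]).T    # document vectors, rows of (Sigma_k^{1/2} V_k^T)^T

i, j = 3, 7
approx = T[i] @ D[j]                           # approximates A[i, j]
cos = T[i] @ D[j] / (np.linalg.norm(T[i]) * np.linalg.norm(D[j]))
print(A[i, j], approx, cos)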

Applications

Information retrieval

Latent semantic indexing (LSI), a variant of latent semantic analysis tailored for information retrieval, constructs an index from the reduced-dimensional semantic space derived from singular value decomposition of the term-document matrix. This approach enables the representation of documents as vectors in a lower-dimensional space that captures latent associations among terms, allowing for more effective storage and retrieval without relying solely on exact term matches. Queries are projected into this same space as pseudo-document vectors, facilitating expanded matching where semantically related documents can be retrieved even if they lack the query's explicit terms.

In retrieval processes, relevance ranking is performed using cosine similarity between the query vector and document vectors in the semantic space, where higher cosine values indicate greater semantic alignment. This method inherently handles synonymy by associating co-occurring terms across documents, improving recall for queries involving related but non-identical vocabulary. Polysemy is addressed to some extent through contextual weighting in the reduced space, as documents provide disambiguating associations that modulate term meanings, though full resolution remains challenging without additional techniques. The derivation of this semantic space enables these similarity measures, allowing LSI to extend traditional vector space models by uncovering hidden relationships.

Early applications integrated LSI with vector space retrieval systems such as SMART, demonstrating its ability to enhance retrieval in test collections such as MED and CISI, where it retrieved relevant technical memoranda despite term mismatches—for instance, associating terms that never co-occur directly through their shared contexts. In modern contexts, LSI has been applied to legal search, as in the TREC 2010 Legal Track using the Enron email corpus, where essential dimensions of LSI improved hypothetical retrieval for topics like financial misconduct, outperforming standard LSI on F1 scores for several queries.

Evaluations show LSI yielding precision improvements of about 13% over bag-of-words term matching on the MED collection (0.51 vs. 0.45 average precision) and up to 30% better overall retrieval performance across large-scale collections compared to lexical methods. These gains particularly manifest in synonymy handling, boosting recall by 20-30% in scenarios with vocabulary mismatches. Query augmentation in LSI often involves projecting the query into the reduced space to identify and incorporate latent terms, effectively expanding it with semantically related concepts to refine results without manual intervention.
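A minimal retrieval sketch along these lines, using scikit-learn on a toy four-document corpus (the documents and the two-dimensional space are assumptions for illustration), folds a query into the latent space and ranks documents by cosine similarity.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car is driven on the road",
        "the truck is driven on the highway",
        "an automobile needs a driver",
        "bank interest rates are rising"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)             # documents x terms

svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(X)                # documents projected into the latent space

# Fold the query into the same space and rank documents by cosine similarity
query_vec = svd.transform(vectorizer.transform(["automobile"]))
scores = cosine_similarity(query_vec, doc_vecs).ravel()
print(scores.argsort()[::-1])                  # document indices, most similar first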

Text classification and clustering

In text classification, Latent Semantic Analysis (LSA) transforms high-dimensional term-document matrices into lower-dimensional semantic spaces, where document vectors serve as feature representations for classifiers such as support vector machines (SVM) and k-nearest neighbors (k-NN). This approach captures underlying semantic relationships, enabling more robust topic labeling by mitigating issues like synonymy that plague bag-of-words methods. A specific technique involves creating pseudo-documents for each class by averaging the LSA-reduced vectors of training documents in that category, which then act as class prototypes for classifying new documents based on cosine similarity.

LSA has been applied to practical tasks like news article categorization, where reduced document vectors improve the assignment of articles to topics such as politics or sports by leveraging latent associations between terms. Similarly, in spam detection, LSA-derived features enhance filtering by identifying semantic patterns in suspicious content, outperforming traditional keyword matching. Empirical studies demonstrate that LSA-based features can yield F1-score improvements of around 13% over tf-idf baselines in multi-class settings, particularly for low-frequency topics, due to the semantic enrichment provided by dimensionality reduction. This gain stems from the technique's ability to produce more efficient, noise-reduced representations without requiring extensive feature engineering.

For unsupervised text clustering, LSA projects documents into a semantic space where algorithms like k-means or hierarchical clustering group them based on vector similarities, uncovering latent themes without predefined labels. K-means, for instance, partitions the reduced space into clusters representing coherent topics, while hierarchical methods build dendrograms to reveal nested structures in document collections. These approaches excel in discovering hidden groupings, such as thematic clusters in large corpora, by emphasizing semantic proximity over exact term matches. Studies show LSA enhances clustering coherence over tf-idf by better handling synonymy, leading to more interpretable and accurate groupings in applications like document organization. Overall, the dimensionality reduction in LSA contributes to computational efficiency in both classification and clustering tasks.
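The clustering use case can be sketched with a standard scikit-learn pipeline of tf-idf weighting, truncated SVD, and normalization feeding k-means; the four documents and two clusters below are toy assumptions, not a benchmark.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans

docs = ["stocks fell on inflation fears",
        "the central bank raised interest rates",
        "the team won the championship game",
        "the striker scored two goals"]

# LSA feature extraction: tf-idf -> truncated SVD -> length normalization
lsa = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=2, random_state=0),
                    Normalizer(copy=False))
doc_vecs = lsa.fit_transform(docs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_vecs)
print(labels)   # ideally separates the finance documents from the sports documents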

Cognitive and educational modeling

Latent semantic analysis (LSA) has been applied in cognitive modeling to simulate human-like processes of language comprehension and memory representation. In particular, Landauer, Foltz, and Laham demonstrated that LSA can mimic human judgments of text coherence by computing cosine similarities between sentences or passages, achieving correlations as high as r = 0.93 with expert human ratings on tasks involving thematic consistency in medical texts. This approach posits that LSA's reduced-dimensional space captures latent semantic structures akin to associative memory in humans, enabling predictions of comprehension difficulty without explicit syntactic rules. By leveraging its ability to handle synonymy and polysemy, LSA provides a basis for modeling cognitive similarity in verbal tasks, such as word associations and passage recall.

Psychological validation of LSA's cognitive fidelity comes from its strong alignments with human performance benchmarks. For instance, LSA's similarity measures correlate with human synonymy judgments at r ≈ 0.70 on average in the TOEFL synonym test, where it correctly identifies synonyms in 65% of cases, comparable to non-native speakers' accuracy. Broader evaluations show correlations ranging from 0.7 to 0.9 across tasks like word similarity judgments (r = 0.90) and semantic priming experiments, supporting LSA as a plausible model of acquired verbal knowledge. Landauer and Dumais further validated this through simulations of vocabulary acquisition, where LSA incrementally learns word meanings from exposure to textual contexts, replicating human-like growth curves in vocabulary without predefined rules.

In educational modeling, LSA underpins tools for automated assessment and learning support. The Intelligent Essay Assessor (IEA), developed by Landauer and colleagues, uses LSA to score essay content by comparing semantic vectors to reference materials, achieving correlations of 0.80 or higher with human graders on standardized prompts, including those similar to TOEFL writing tasks. This enables holistic evaluation of conceptual understanding over surface features, as demonstrated in applications for grading TOEFL-style essays where LSA detects thematic relevance with r = 0.75-0.85 agreement. Additionally, LSA facilitates plagiarism detection by quantifying semantic overlap between student work and source texts, identifying paraphrased content through cosine similarities exceeding 0.70 thresholds in detection systems.

Post-2010 adaptations have integrated LSA into intelligent tutoring systems for personalized education. For example, LSA-enhanced systems analyze student interactions with texts to model knowledge gaps, simulating vocabulary development and providing adaptive feedback in online platforms, with correlations to learning outcomes around r = 0.60-0.80 in empirical studies. These applications extend LSA's cognitive roots to educational interventions, such as assessing semantic coherence in learner-generated essays to predict mastery levels.

Implementation

Algorithms and software libraries

The core algorithm for latent semantic analysis (LSA) relies on truncated singular value decomposition (SVD), which approximates the term-document matrix by retaining only the top k singular values and corresponding vectors to reduce dimensionality while preserving semantic relationships. Variants such as randomized SVD enhance efficiency for large matrices by using random projections to approximate the decomposition, achieving near-optimal accuracy with lower computational cost compared to full SVD. These methods, including the implicitly restarted Lanczos bidiagonalization (IRLBA) approach, are particularly suited for the sparse matrices typical in text data.

Several open-source libraries facilitate implementation across programming languages. In Python, scikit-learn's TruncatedSVD supports both randomized and ARPACK-based algorithms for SVD on term-frequency or TF-IDF matrices, with the n_components parameter setting the reduced dimensionality k. Gensim provides the LsiModel class, which builds LSA models via online SVD computation on a preprocessed corpus, enabling efficient topic extraction and similarity queries. For R users, the irlba package implements fast truncated SVD using augmented IRLBA, optimized for large sparse matrices and directly applicable to LSA pipelines. In distributed environments, Apache Spark's MLlib offers SVD functionality through the RowMatrix class, allowing scalable decomposition on massive datasets via in-memory computation across clusters.

The pipeline can be outlined in Python as follows, assuming a term-document matrix A \in \mathbb{R}^{m \times n} where m is the number of terms and n is the number of documents:
import numpy as np

# Step 1: Construct the term-document matrix A (m terms x n documents),
# e.g., with TF-IDF weighting; a small random matrix stands in here.
m, n, k = 500, 100, 20
A = np.random.rand(m, n)

# Step 2: Compute the truncated SVD: A ≈ U_k * diag(sigma_k) * Vt_k, with k << min(m, n)
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
U_k, sigma_k, Vt_k = U[:, :k], sigma[:k], Vt[:k, :]

# Step 3: Derive reduced representations
term_vectors = U_k * sigma_k               # rows of U_k Sigma_k: k-dimensional term vectors
doc_vectors = (sigma_k[:, None] * Vt_k).T  # rows of (Sigma_k V_k^T)^T: k-dimensional document vectors

# Step 4: Fold a new document's raw term vector t into the semantic space
t = np.random.rand(m)                      # term frequencies of an unseen document
new_doc_vector = (t @ U_k) / sigma_k       # t^T U_k Sigma_k^{-1}
This process folds new vectors into the existing semantic space without recomputing the full decomposition. For large-scale corpora exceeding single-machine memory, incremental SVD algorithms update the decomposition as new data arrives, avoiding full recomputation; for instance, the Generalized Hebbian Algorithm learns the SVD incrementally from streaming data, suitable for evolving text collections in LSA. Distributed frameworks like Spark MLlib enable parallel SVD on partitioned matrices, scaling to billions of terms and documents by distributing matrix operations across nodes.

Best practices for LSA implementation emphasize the dimensionality parameter k, typically starting with values between 100 and 300 to balance noise reduction and information retention, as higher k risks overfitting to corpus idiosyncrasies while lower k may oversimplify semantics. Validation involves cross-validation on downstream tasks like retrieval or clustering quality, or assessing explained variance from singular values to select k where additional dimensions contribute minimally (e.g., retaining 90% of variance). Preprocessing consistency, such as stopword removal and stemming, is crucial, and libraries like scikit-learn and Gensim offer built-in tools for reproducible tuning via grid search over k.
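Complementing the NumPy pipeline above, a minimal Gensim sketch shows the LsiModel workflow on toy token lists (loosely echoing the classic human-computer interaction example); num_topics is kept far smaller than the 100-300 typical of real corpora.

from gensim import corpora, models, similarities

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system", "response", "time"],
         ["graph", "trees", "minors"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# Build a 2-dimensional LSA (LSI) model and a cosine-similarity index over the corpus
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[bow_corpus])

query = lsi[dictionary.doc2bow(["user", "computer"])]
print(list(index[query]))   # cosine similarities of the query to each document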

Computational requirements and optimization

The singular value decomposition (SVD) central to latent semantic analysis (LSA) presents significant computational challenges due to its complexity. For an m \times n term-document matrix, where m is the number of terms and n the number of documents, the standard full SVD requires O(\min(m, n)^2 \max(m, n)) time, which in typical LSA scenarios with n \gg m approximates to O(m^2 n). This cubic scaling in the smaller dimension becomes a bottleneck for large corpora, where processing millions of documents can take hours or days on conventional hardware, limiting applicability to massive text collections.

Memory demands further exacerbate these issues, particularly for storing and manipulating sparse term-document matrices. A corpus of around 40,000 documents and 130,000 unique terms, with millions of nonzero entries, may require approximately 1 GB of memory for computations involving reorthogonalization across hundreds of dimensions. For larger scales, such as 1 million documents, memory usage for the dense factors from SVD (e.g., the left singular vectors) can exceed 100 GB despite sparse input representations, necessitating high-capacity servers or distributed systems, often accelerated by multi-core CPUs or GPUs to handle matrix operations efficiently. GPU implementations, for instance, can reduce runtime by orders of magnitude through parallel matrix factorizations.

Optimizations focus on approximation and parallelism to scale LSA. Approximate SVD methods, such as randomized projections and Lanczos iterations, lower complexity to near-linear time O(m n \log k + (m + n) k^2) for rank-k approximations, enabling effective semantic mapping with minimal accuracy degradation on large datasets. Parallelization via the Message Passing Interface (MPI) distributes SVD across clusters, achieving speedups proportional to the number of nodes for term-document matrices from web-scale corpora. These approaches trade exactness for feasibility, preserving LSA's core benefits in semantic representation.

A key trade-off arises in handling dynamic data: full SVD recomputes the entire decomposition, ideal for static corpora but impractical for frequent updates, while incremental SVD updates only affected components in O(k^2) time per addition, supporting streaming scenarios like real-time document indexing with comparable precision to full recomputation when increments are small relative to the corpus size. Post-2020 advancements leverage cloud platforms, such as AWS SageMaker, for elastic, managed workflows that automate scaling, storage, and acceleration without on-premises hardware constraints.
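As an illustration of the approximate approach, the following sketch runs scikit-learn's randomized_svd on a sparse random stand-in for a large term-document matrix; the matrix shape, density, and rank are arbitrary choices for demonstration.

from scipy.sparse import random as sparse_random
from sklearn.utils.extmath import randomized_svd

# Sparse stand-in for a large term-document matrix (1% nonzero entries)
A = sparse_random(20000, 5000, density=0.01, format="csr", random_state=0)

# Randomized SVD: cost grows roughly with the number of nonzeros for modest k
U, s, Vt = randomized_svd(A, n_components=100, n_iter=5, random_state=0)
print(U.shape, s.shape, Vt.shape)   # (20000, 100) (100,) (100, 5000)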

Advantages

Handling synonymy and polysemy

Latent semantic analysis (LSA) addresses synonymy by leveraging co-occurrence patterns across a large corpus to position related terms closely in a reduced-dimensional semantic space, allowing words with similar meanings to share representational dimensions even without direct overlap in documents. For instance, terms like "big" and "large" become aligned through their associations with common contexts, such as descriptions of size in various texts, enabling indirect inference of equivalence. This mechanism stems from the singular value decomposition (SVD) process, which uncovers latent factors that capture higher-order associations beyond explicit word matches.

In handling polysemy, LSA represents each word as a single vector that averages its multiple meanings weighted by contextual frequencies in the training corpus, but achieves context-dependent disambiguation through document-specific or query projections in the semantic space. For example, the word "bank" receives influences from financial (e.g., co-occurring with "money" or "account") and geographical (e.g., co-occurring with "river" or "shore") contexts, allowing its vector to shift toward the appropriate sense when projected against a query or document vector emphasizing one domain. This averaging in the global semantic space facilitates partial resolution, as demonstrated in psychological priming experiments where LSA replicated human response times to polysemous words like "mint" (financial vs. herbal senses) by computing cosine similarities conditioned on contextual associates.

Quantitative evaluations highlight LSA's advantages over bag-of-words models in disambiguation tasks; for synonymy resolution, LSA achieved approximately 65% accuracy on the TOEFL synonym test using a 4.6-million-word corpus, compared to 15-20% for simple co-occurrence methods without dimensionality reduction, representing over a threefold improvement. In information retrieval contexts involving synonymy, LSA improved precision by 13% on medical abstracts (from 0.45 to 0.51) by retrieving relevant documents despite term mismatches. For polysemy, LSA's context projections correlated significantly (p < 0.001) with human priming effects in experiments, outperforming non-semantic models that ignore latent structures. These gains arise from the derivation of a shared semantic space that averages diverse contexts while enabling targeted alignments.

The global semantic space in LSA thus provides a mechanism for averaging contexts across the corpus, as evidenced in psychological simulations where it mimicked human vocabulary acquisition and judgments without explicit rules, learning up to 75% of word knowledge through indirect associations from text exposure equivalent to children's reading. However, LSA's averaging approach is not perfect for rare polysemes, where infrequent senses may be underrepresented in the corpus, leading to suboptimal disambiguation in edge cases.
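The synonymy effect can be demonstrated on a toy corpus in which "car" and "automobile" never co-occur in the same document: their raw term vectors are orthogonal, while their vectors in a two-dimensional reduced space are typically far more similar. The documents below are invented for illustration.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["car driver road", "automobile driver road",
        "car road trip", "automobile engine driver",
        "river bank water", "bank money loan"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
terms = list(vec.get_feature_names_out())

A = X.T.toarray().astype(float)        # terms x documents

# Reduce to k = 2 latent dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]

i, j = terms.index("car"), terms.index("automobile")
print(cosine_similarity(A[[i]], A[[j]]))                   # raw space: 0, since they never co-occur
print(cosine_similarity(term_vecs[[i]], term_vecs[[j]]))   # reduced space: typically much higher here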

Dimensionality reduction benefits

One of the primary advantages of dimensionality reduction in latent semantic analysis (LSA) stems from its use of singular value decomposition (SVD) to eliminate insignificant singular values, thereby focusing on the core variances in the term-document matrix and reducing noise from unreliable or incidental associations. This process filters out obscuring noise in high-dimensional text data, where sparse and erratic term co-occurrences can distort semantic representations, allowing LSA to emphasize major associative patterns that reflect underlying meanings.

By truncating the decomposition to a lower rank—typically reducing dimensions from tens or hundreds of thousands (corresponding to vocabulary size) to 100-500—LSA achieves substantial computational efficiency, particularly in similarity computations such as cosine measures between documents or queries. For instance, processing large corpora becomes feasible in 2-4 hours on modest hardware when the representation is limited to a few hundred dimensions, enabling scalable applications in information retrieval without exhaustive full-matrix operations.

This reduction also promotes improved generalization by smoothing sparse data representations, which mitigates overfitting in downstream tasks like classification or retrieval by capturing higher-order relationships rather than rote term matches. In sparse text environments, where direct word overlaps are infrequent, the lower-dimensional space infers latent connections, enhancing robustness to variations in wording and reducing sensitivity to dataset-specific quirks. Empirically, selecting an optimal number of dimensions, such as k = 300, balances noise and structure, as demonstrated in simulations on encyclopedic corpora where this choice roughly tripled the effective knowledge extraction compared to unoptimized ranks. Benchmarks on datasets like those in TREC-3 further illustrate these gains, with LSI (LSA's indexing variant) achieving improvements in average precision for some ad hoc retrieval tasks over baseline vector models when using 200-300 dimensions, highlighting the practical impact of reduction on retrieval performance. Overall, LSA's dimensionality reduction laid foundational groundwork for modern NLP embeddings by demonstrating how low-rank approximations can distill semantic essence from vast text data.

Limitations and Challenges

Scalability and performance issues

Latent semantic analysis (LSA) faces significant scalability challenges primarily due to the computational demands of singular value decomposition (SVD), which is central to its dimensionality reduction process. For corpora exceeding one million documents, standard SVD becomes infeasible without specialized hardware or approximations, as the matrix factorization requires storing and processing dense representations of term-document matrices that grow quadratically with corpus size. In practice, early implementations were limited to tens of thousands of documents, with processing times extending to days on single machines, and no reported successes beyond the million-document threshold at the time. Even with modern hardware, term limits hover around 1.7 million for a 300-dimensional space under 4 GB memory constraints, highlighting the inherent bottlenecks for very large-scale applications.

Performance issues in LSA stem from high memory usage and the slow convergence of iterative solvers employed for SVD on sparse, high-dimensional matrices. Traditional implementations demand substantial RAM for orthogonalization steps—up to 20 GB for a corpus with nine million terms—far exceeding optimized alternatives that reduce this to under 300 MB. Iterative methods like ARPACK, commonly used for large sparse matrices, exhibit prolonged convergence times (e.g., 1.7 hours for certain benchmarks) due to conditioning challenges in noisy or ill-conditioned data, exacerbating runtime inefficiencies for retraining or iterative model updates.

In real-world scenarios, such as web-scale search, LSA is considered outdated compared to inverted indexing techniques, which offer superior efficiency for massive, heterogeneous collections. LSA's full-matrix approach incurs O(t · d · c) complexity (where t is the number of terms, d the number of documents, and c the reduced dimensionality), leading to gigabytes of memory overhead even for subsets of corpora like TREC (e.g., 1.7 GB for 15% coverage), while inverted indexes enable fast, sparse retrieval without such overheads. Consequently, LSA underperforms in precision on large corpora relative to methods like BM25, which achieve higher recall by leveraging lightweight term-document mappings suited to distributed web environments.

Mitigation strategies, such as approximate and sparse SVD variants, address these issues but introduce gaps in semantic fidelity. Techniques like sparse LSA reduce memory by orders of magnitude (e.g., 132 MB to 0.6 MB for news corpora) and accelerate projections (e.g., 31 ms to 0.25 ms), yet they yield slight accuracy drops—under 1% on benchmarks like 20 Newsgroups—due to enforced sparsity that discards subtle term associations. Recent post-2015 critiques further underscore LSA's limitations in diverse linguistic contexts, where it underperforms on low-resource and typologically varied languages compared to multilingual embeddings, as semantic capture degrades without sufficient monolingual training data.

Interpretability and evaluation difficulties

One major challenge in latent semantic analysis (LSA) is the interpretability of its latent dimensions, which emerge as abstract, mathematical factors from singular value decomposition rather than explicit semantic categories. These dimensions lack intuitive labels and do not directly correspond to human-understandable topics or concepts, making it difficult to ascribe meaningful interpretations to them without additional post-processing techniques like rotation methods (e.g., varimax), which simplify the factor structure but do not alter the underlying similarity computations.

Evaluating the effectiveness of LSA is complicated by the absence of definitive ground truth for semantic relationships, as true meaning in language is inherently subjective and context-dependent. Instead, assessments rely on proxy measures such as correlations with human judgments of similarity (correlations up to 0.6) or performance on tasks like synonym recognition (e.g., LSA achieved 65% accuracy on TOEFL synonym questions, comparable to non-native English speakers at 64-65%, but not equivalent to deeper perceptions of meaning). Common metrics like cosine similarity, while useful for quantifying semantic proximity in the reduced space, exhibit pitfalls such as sensitivity to document length and term frequency, leading to inflated values for longer texts without necessarily reflecting deeper causal semantic alignment. This agreement with human judgments is empirical rather than mechanistic, limiting its reliability as a standalone evaluator of LSA's semantic capture.

Significant gaps persist in standardizing benchmarks for assessing the quality of LSA's latent dimensions, with no universally accepted protocols for validating dimension relevance beyond task-specific outcomes. The selection of the dimensionality k remains highly subjective, often determined empirically through trial-and-error based on corpus characteristics, where low k underrepresents relationships and high k introduces noise, without consensus on optimal values (typically ranging from 100 to 300). In modern contexts, LSA's interpretability challenges persist relative to probabilistic models like latent Dirichlet allocation (LDA), which can produce more readily labelable topic distributions in some applications.

Alternatives

Probabilistic topic models

Probabilistic topic models offer a generative alternative to the deterministic matrix decomposition of latent semantic analysis (LSA), framing documents as arising from underlying latent topics through probabilistic processes. These models, rooted in Bayesian statistics, treat topics as distributions over terms and documents as mixtures drawn from these distributions, enabling more flexible representations of semantic structure. Among them, latent Dirichlet allocation (LDA) stands as a seminal approach, introduced as a three-level hierarchical Bayesian model for collections of discrete data such as text corpora.

In LDA, each document is generated by first sampling a distribution over topics from a Dirichlet prior, then selecting words by drawing a topic assignment for each position and sampling the word from the corresponding topic's multinomial distribution over the vocabulary. Topics themselves are modeled as distributions over terms, capturing co-occurrence patterns probabilistically rather than through matrix factorization. This generative process is compactly represented using plate notation in graphical models, where plates denote repetitions over documents and words, with shared parameters like the topic-term matrix at the corpus level and per-document topic mixtures.

A key distinction from LSA lies in LDA's probabilistic framework, which assigns soft, posterior probabilities to topic memberships instead of LSA's hard, deterministic projections derived from SVD on term-document matrices. This allows LDA to naturally accommodate uncertainty and multiple latent meanings in the data. Over LSA, LDA provides advantages in handling documents that span multiple topics via mixture weights and supports posterior inference for unseen documents, facilitating tasks like dynamic topic tracking. LDA parameters are typically inferred using approximate methods such as collapsed Gibbs sampling, which iteratively samples topic assignments conditioning on observed words, or variational Bayes, which optimizes a lower bound on the posterior. These techniques have been applied effectively in discovering topics in news corpora, such as identifying themes like politics or sports from article collections. Empirically, LDA demonstrates superiority over LSA in topic quality metrics, reflecting more interpretable and semantically cohesive topics.
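A brief scikit-learn sketch contrasts the two representations: LatentDirichletAllocation (which uses variational inference) returns per-document topic proportions that sum to one, unlike LSA's unnormalized projections; the four documents are invented examples.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the election results surprised the senate",
        "voters backed the new policy in parliament",
        "the team scored late to win the match",
        "the coach praised the striker after the game"]

X = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)      # soft, per-document topic proportions
print(doc_topics.round(2))             # each row sums to 1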

Neural network-based embeddings

Neural network-based embeddings represent a class of methods that leverage neural architectures to generate dense vector representations of words or documents, serving as powerful alternatives to LSA for capturing semantic relationships. Unlike LSA's reliance on linear algebraic techniques like SVD, these approaches train neural networks on large corpora to learn distributed representations that encode both syntactic and semantic information in lower-dimensional spaces. Seminal models such as Word2Vec and GloVe introduced static word embeddings, while later transformer-based models like BERT enable contextualized representations that adapt to surrounding text.

Word2Vec, developed by Mikolov et al., employs shallow neural networks—either continuous bag-of-words (CBOW) or skip-gram architectures—to predict words from their contexts or vice versa, producing fixed vectors that capture semantic regularities. These embeddings excel at modeling local word co-occurrences, enabling arithmetic operations that reflect semantic analogies, such as the vector equation king - man + woman ≈ queen. Similarly, GloVe by Pennington et al. combines global matrix factorization with local context windows, optimizing a log-bilinear model on word co-occurrence statistics to yield embeddings that balance predictive and count-based strengths. Both methods train efficiently on massive datasets, often billions of words, and have been widely adopted for tasks like sentiment analysis and named entity recognition due to their ability to handle synonymy more effectively than LSA.

Advancing beyond static embeddings, transformer architectures introduced by Vaswani et al. form the backbone of contextual models like BERT, developed by Devlin et al., which pre-train bidirectional encoders on masked language modeling and next-sentence prediction objectives. BERT's embeddings are dynamic, generating context-dependent vectors that outperform static methods in downstream tasks such as question answering and natural language inference, where they achieve state-of-the-art results on benchmarks like GLUE by capturing nuanced context. Key advantages include scalability via GPU acceleration, allowing training on corpora exceeding terabytes, and superior handling of long-range dependencies through self-attention mechanisms, contrasting with LSA's limitations in non-linear semantic modeling.

Comparisons highlight neural embeddings' edge over LSA in semantic tasks; for example, Word2Vec achieves up to 68.5% accuracy on word analogy benchmarks, surpassing LSA's performance by capturing finer-grained relationships in large-scale data. BERT further elevates this, attaining Spearman correlations of approximately 0.7-0.8 on word similarity evaluations like WordSim-353, compared to LSA's typical 0.4-0.5 range, due to its contextual depth. Post-2015 research has explored hybrids, such as initializing neural models with LSA-derived features to boost convergence in low-resource settings, or combining Word2Vec with LSA via spherical k-means for enhanced topic modeling. These integrations leverage LSA's foundational influence on distributional semantics while benefiting from neural methods' expressiveness.
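For comparison with LSA's count-based vectors, a minimal Gensim Word2Vec sketch on an invented toy corpus is shown below; with so little data the similarity and analogy outputs are illustrative only, since real models are trained on billions of tokens.

from gensim.models import Word2Vec

# Tiny toy corpus; every sentence and token here is an invented example
sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["the", "man", "walks", "the", "dog"],
             ["the", "woman", "walks", "the", "dog"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200, seed=0)

print(model.wv.similarity("king", "queen"))                            # cosine similarity of static vectors
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))  # analogy query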

History

Origins and key publications

The origins of latent semantic analysis (LSA) trace back to advancements in statistical techniques for text analysis during the 1970s, particularly correspondence analysis developed by Jean-Paul Benzécri as a method for exploring associations in contingency tables. This approach laid groundwork for dimensionality reduction in categorical data, which later influenced LSA's use of singular value decomposition to uncover hidden structures in term-document matrices. Additionally, LSA built upon the vector space model for information retrieval introduced by Gerard Salton and colleagues in 1975, which represented documents and queries as vectors in a high-dimensional space to measure similarity based on term overlap.

In the 1980s, the primary motivation for developing LSA stemmed from persistent challenges in information retrieval systems, especially the "vocabulary problem" where synonymy and polysemy led to mismatches between user queries and document content in large databases. Traditional keyword-based methods failed to capture semantic relationships, resulting in poor recall for queries using different but related terms, as highlighted in studies of legal and bibliographic corpora. To address this, researchers at Bell Communications Research (Bellcore) proposed LSA as a way to derive latent semantic structures from term co-occurrences, enabling better handling of term variability without manual thesauri.

The seminal publication introducing LSA—initially termed latent semantic indexing (LSI)—was "Using Latent Semantic Analysis to Improve Access to Textual Information" by Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman, presented at the 1988 SIGCHI conference, with the fuller journal treatment, "Indexing by Latent Semantic Analysis," published in 1990. This work demonstrated initial empirical improvements in retrieval precision by reducing dimensions while preserving semantic associations.

Early extensions in the 1990s emphasized LSA's cognitive implications, with George Furnas and Thomas Landauer exploring its alignment with human word meaning acquisition through statistical exposure. Their collaborative efforts, including simulations of psycholinguistic phenomena like the "tip-of-the-tongue" effect, positioned LSA as a computational model of semantic memory. First implementations emerged around 1990, coinciding with the journal publication and a related patent granted in 1989, enabling practical applications in experimental retrieval systems. By 1998, psychological validations solidified LSA's theoretical foundations, as Landauer demonstrated its ability to predict human judgments of text coherence and synonymy across diverse corpora, correlating highly with empirical benchmarks in comprehension and vocabulary tasks.

Evolution and modern adaptations

Following its foundational development in the late 1980s and 1990s, latent semantic analysis (LSA) underwent significant extensions in the 2000s to address multilingual and cross-language challenges in information retrieval. Researchers adapted LSA by constructing joint term-document matrices across languages or aligning semantic spaces, enabling effective cross-language document retrieval without requiring query translation. For instance, techniques leveraging LSA for automatic cross-language retrieval demonstrated performance comparable to human-translated queries in evaluations on diverse language pairs. These advancements built on earlier work, such as latent semantic indexing for multilingual IR, and were reviewed in comprehensive surveys highlighting LSA's growing utility in global text processing. A notable timeline highlight from this period was the integration of LSA into commercial applications, exemplified by its use in enhanced search and text analysis systems around 2003, as explored in publications on practical deployments.

Post-2010, LSA found modern applications in hybrid systems combining it with machine learning techniques, particularly in educational AI and automated assessment. In education, LSA-powered tools emerged for automated scoring and feedback in massive open online courses (MOOCs) after 2015, providing formative feedback on open-ended responses by comparing student submissions to reference materials via cosine similarity. To address scalability gaps for big data, approximations like randomized singular value decomposition (SVD) were increasingly adopted in implementations during the 2020s, enabling efficient dimensionality reduction on large corpora. Libraries such as scikit-learn incorporate randomized SVD in their TruncatedSVD module specifically for LSA, achieving near-exact low-rank approximations with reduced computational overhead compared to full SVD.

As of 2025, LSA occupies a niche yet influential role in natural language processing, serving as a conceptual precursor to modern embedding models by demonstrating the benefits of large-scale semantic representations in capturing contextual meaning. Recent papers continue to explore LSA's viability in low-resource languages, such as evaluations in multilingual information retrieval for language pairs like English-French and integrations with pre-trained models for tasks like readability assessment in education.

References

  1. [1]
    [PDF] Latent Semantic Indexing: An overview
    LSI analysis recovers the original semantic structure of the space and its original dimensions. [Deerwester et al] describe the three major advantages of ...
  2. [2]
    [PDF] Latent semantic analysis
    This article reviews latent semantic analysis (LSA), a theory of meaning as well as a method for extracting that meaning from passages of text, ...
  3. [3]
    [PDF] Strengths, Limitations, and Extensions of LSA
    1) It picks up the word importance score from the information provided by the corpus. 2) It sets up semantic similarity between words. This widely extends the ...
  4. [4]
    [PDF] Indexing by Latent Semantic Analysis Scott Deerwester Graduate ...
    ABSTRACT. A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the ...
  5. [5]
    [PDF] An Introduction to Latent Semantic Analysis - LSA.colorado.edu
    Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations ...
  6. [6]
    Using latent semantic analysis to improve access to textual information
    This paper describes a new approach for dealing with the vocabulary problem in human-computer interaction.
  7. [7]
    Latent semantic analysis - Dumais - 2004 - ASIS&T Digital Library
    Sep 22, 2005 · A technique for improving information retrieval. The key insight in LSA was to reduce the dimensionality of the information retrieval problem.Missing: pdf | Show results with:pdf
  8. [8]
    [PDF] Matrix decompositions and latent semantic indexing
    ... and columns, and a rank in the tens of thousands as well. In latent semantic indexing (sometimes referred to as latent semantic analysis (LSA)), we use the ...<|control11|><|separator|>
  9. [9]
  10. [10]
    Document Matrix - an overview | ScienceDirect Topics
    Document matrices are fundamental in Latent Semantic Analysis (LSA), where a TF-IDF weighted term-document matrix is constructed to represent a corpus. Rows ...
  11. [11]
  12. [12]
    [PDF] Introduction to Information Retrieval - Stanford University
    Aug 1, 2006 · ... Introduction to. Information. Retrieval. Christopher D. Manning. Prabhakar Raghavan. Hinrich Schütze. Cambridge University Press. Cambridge ...
  13. [13]
    Large-scale latent semantic analysis | Behavior Research Methods
    Feb 8, 2011 · ... Landauer, & Harshman, 1990; Landauer, Foltz, & Laham, 1998 ... Olney, A.M. Large-scale latent semantic analysis. Behav Res 43, 414 ...
  14. [14]
    Evaluation of Latent Semantic Analysis in Multilingual Information ...
    Oct 19, 2025 · trustworthiness and reproducibility. 3.2 Preprocessing. Constructing the term-document matrix requires rigorous. preprocessing. For building ...
  15. [15]
    [PDF] Using Latent Semantic Analysis to Build a Domain-Dependent ...
    For this, we induce a domain-dependent sentiment lexicon ap- plying Latent Semantic Analysis (LSA) on prod- uct reviews corpus, gathered from Ciao. The clas-.Missing: vocabulary | Show results with:vocabulary
  16. [16]
    Text Mining Using Latent Semantic Analysis: An Illustration through ...
    Mar 1, 2018 · The traditional approach of using enough factors to explain a percentage of variance (such as 85 percent) often does not work (Albright 2004).
  17. [17]
    A comparison of latent semantic analysis and correspondence ...
    May 18, 2023 · In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices.
  18. [18]
    Large-scale information retrieval with latent semantic indexing
    One vector-space approach, Latent Semantic Indexing (LSI), has achieved up to 30% better retrieval performance than lexical searching techniques by employing a ...
  19. [19]
    [PDF] Applying Latent Semantic Indexing on the TREC 2010 Legal Dataset
    In this section we begin with a description of vector-space retrieval, which forms the foundation for Latent Semantic Indexing (LSI). We also present a brief ...
  20. [20]
    [PDF] Improving Text Classification using Local Latent Semantic Indexing
    Latent Semantic Indexing (LSI) has been shown to be extremely useful in information retrieval, but it is not an optimal representation for text ...
  21. [21]
    [PDF] An Application of Latent Semantic Analysis for Text Categorization
    An Application of Latent Semantic Analysis for Text Categorization ... mathematical programming, organizational behavior, engineering, decision analysis ...
  22. [22]
    BBC News. LSA for news classification - Kaggle
    In this work I'll show how you can improve the classification model using LSA (truncated SVD). LSA (latent semantic analysis) is a text data analysis method ...
  23. [23]
    Using latent semantic indexing to filter spam - ACM Digital Library
    Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41:391--407, 1990. Crossref · Google Scholar. [4]. Gee, K. Text ...
  24. [24]
    Text clustering on latent thematic spaces - ACM Digital Library
    Sep 9, 2007 · Experiments conducted on subsets of two standard text corpora evaluate distinct clustering strategies based on latent thematic spaces and ...
  25. [25]
    [PDF] Comparison of Latent Semantic Analysis and Vector Space Model ...
    Experimental results show that in most cases LSA outperforms VSM and could even slightly outperform the explicit document description by a taxonomy of keywords, ...
  26. [26]
    Latent semantic analysis - Dumais - 2004 - ASIS&T Digital Library
    Sep 22, 2005 · Deerwester, S., Dumais, S., Landauer, T., Furnas, G., & Beck, L. (1988). Improving information retrieval with latent semantic indexing.
  27. [27]
    [PDF] A Solution to Plato's Problem - Statistics & Data Science
    A new general theory of acquired similarity and knowledge representation, latent semantic analysis. (LSA), is presented and used to successfully simulate such ...
  28. [28]
    A solution to Plato's problem: The latent semantic analysis theory of ...
    A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such ...
  29. [29]
    [PDF] Implementation and Applications of the Intelligent Essay Assessor
    The Intelligent Essay Assessor (IEA) uses LSA to assess text content, and is used for automated essay scoring, especially in ELA, and for evaluating summaries.
  30. [30]
    (PDF) The Intelligent Essay Assessor - ResearchGate
    Aug 7, 2025 · The Intelligent Essay Assessor (IEA; Landauer et al., 2000) uses a combination of ML and NLP techniques to analyze essays and assign a score based on various ...
  31. [31]
    Contextual Plagiarism Detection Using Latent Semantic Analysis
    Latent Semantic Analysis (LSA) is a method which tries to find deeper correlation ...
  32. [32]
    [PDF] Using latent semantic analysis to measure coherence in essays by ...
    These factors could affect LSA measures of coherence as the more repeated words a text contains, the higher the cosines between sentences tend to be. Page 3 ...
  33. [33]
    TruncatedSVD — scikit-learn 1.7.2 documentation
    TruncatedSVD performs linear dimensionality reduction using truncated SVD, also known as LSA for term count/tf-idf matrices. It does not center data before SVD.
  34. [34]
    [PDF] Complexity analysis of Singular Value Decomposition and its variants
    Oct 14, 2019 · This tutorial compares the time and space complexity of SVD, truncated SVD, Krylov method, and Randomized PCA. Traditional SVD can be slow and ...
  35. [35]
    [PDF] irlba: Fast Truncated Singular Value Decomposition and Principal ...
    Oct 3, 2022 · The irlba package provides fast, memory-efficient methods for truncated singular value decomposition and principal components analysis of large ...
  36. [36]
    Dimensionality Reduction - RDD-based API - Apache Spark
    SVD Example. spark.mllib provides SVD functionality to row-oriented matrices, provided in the RowMatrix class. Refer to the SingularValueDecomposition Python ...
  37. [37]
    Latent Semantic Analysis (LSA) for Text Classification Tutorial
    Mar 25, 2016 · Latent Semantic Analysis (LSA) creates a vector representation of a document using tf-idf and then uses SVD for dimensionality reduction.
  38. [38]
  39. [39]
    [PDF] Sparse Latent Semantic Analysis - NYU Stern
    In summary, Sparse LSA and NN Sparse LSA show their advantages when the dimensionality of latent space is large. They can achieve good classification.
  40. [40]
    [PDF] GPU Accelerated Semantic Search Using Latent Semantic Analysis
    We present a parallel LSA implementation on the GPU, using CUDA and the cuSolver library for common matrix factorization and triangular solve routines for ...
  41. [41]
    Parallel SVD Computing in the Latent Semantic Indexing ...
    The Latent Semantic Indexing (LSI) is a concept-based automatic indexing method for overcoming the two fundamental problems which exist in the traditional ...
  42. [42]
    [PDF] SPEEDING UP LATENT SEMANTIC ANALYSIS - SciTePress
    At the core of LSA is the Singular Value Decomposition algorithm (SVD), a linear algebra routine for matrix factorization. This paper introduces a streamed ...
  43. [43]
    An Experimental Study of Incremental SVD on Latent Semantic ...
    The results indicate that when increased documents are less than original documents, incremental update method has an approximate precision to re-computing ...
  44. [44]
    The center for all your data, analytics, and AI – Amazon SageMaker
    Amazon SageMaker is the center for data, analytics, and AI, providing an integrated experience with unified access to all your data.
  45. [45]
    [PDF] Latent Semantic Indexing (LSI): TREC-3 Report
    This paper reports on recent developments of the Latent Semantic Indexing (LSI) retrieval method for TREC-3 and compares LSI to keyword vector matching ...
  46. [46]
    [PDF] Comparison of Human and Latent Semantic Analysis (LSA ... - DTIC
    This report compares the performance of a commonly used technique, Latent Semantic Analysis. (LSA), with human similarity judgements and identifies the settings ...
  47. [47]
    A Comparison of Latent Semantic Analysis and Latent Dirichlet ...
    Nov 27, 2023 · This article reviews and compares the LSA and LDA topic models. This article also introduces a methodology for comparing the semantic spaces obtained by the ...
  48. [48]
    [PDF] Latent Dirichlet Allocation - Journal of Machine Learning Research
    We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level ...
  49. [49]
    [PDF] Exploring Topic Coherence over Many Models and Many Topics
    Jul 12, 2012 · Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a ...
  50. [50]
    Efficient Estimation of Word Representations in Vector Space - arXiv
    Jan 16, 2013 · Efficient Estimation of Word Representations in Vector Space, by Tomas Mikolov and 3 other authors.
  51. [51]
    [PDF] GloVe: Global Vectors for Word Representation - Stanford NLP Group
    These vectors can be used as features in a variety of applications, such as information retrieval (Manning et al., 2008), document classification (Sebastiani, ...
  52. [52]
    [1706.03762] Attention Is All You Need - arXiv
    Jun 12, 2017 · We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
  53. [53]
    [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers ...
    Oct 11, 2018 · We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike ...
  54. [54]
    Word2vec-based latent semantic analysis (W2V-LSA) for topic ...
    Aug 15, 2020 · We propose a new topic modeling method called Word2vec-based Latent Semantic Analysis (W2V-LSA), which is based on Word2vec and Spherical k-means clustering.
  55. [55]
    Computer information retrieval using latent semantic structure
    This invention relates generally to computer-based information retrieval and, in particular, to user accessibility to and display of textual material stored in ...
  56. [56]
    Learning and Representing Verbal Meaning - Thomas K. Landauer ...
    Latent semantic analysis (LSA) is a theory of how word meaning—and possibly other knowledge—is derived from statistics of experience, and of how passage ...
  57. [57]
    Latent Semantic Analysis - Microsoft Research
    Nov 1, 2003 · Presents a literature review that covers the following topics related to Latent Semantic Analysis (LSA): (1) LSA overview; (2) applications ...
  58. [58]
    Using Semantic Technologies for Formative Assessment and ...
    Aug 15, 2018 · Latent semantic analysis (LSA) and G-Rubric are used for automated feedback on open-ended questions, providing enriched and personalized ...
  59. [59]
    2.5. Decomposing signals in components (matrix factorization problems)
    Randomized SVD for LSA in big data.
  60. [60]
    The Antecedents of Transformer Models - Simon Dennis, Kevin ...
    Nov 18, 2024 · An early model that demonstrated the importance of scale was latent semantic analysis (LSA; Dumais et al., 1988). LSA was a significant ...
  61. [61]
    Combining Latent Semantic Analysis and Pre-trained Model for ...
    Aug 23, 2022 · Furthermore, LSA is a statistical approach that is more stable and feasible for low-resource languages. In addition, we also integrate PhoBERT, ...Missing: papers | Show results with:papers<|separator|>