
Multilocus sequence typing

Multilocus sequence typing (MLST) is a standardized molecular typing method for characterizing bacterial isolates by sequencing short, variable fragments (typically 450–500 base pairs) from multiple genes, usually seven, to assign unambiguous allelic profiles that define distinct sequence types (STs). Developed in 1998 as a portable alternative to earlier techniques like multilocus enzyme electrophoresis (MLEE), MLST exploits the stability of housekeeping genes, which are essential for basic cellular functions and evolve slowly, to identify clonal lineages and population structures. The methodology involves PCR amplification of gene fragments, followed by sequencing and comparison against curated databases to identify alleles. Each unique sequence at a locus receives a distinct integer identifier, and the combination of these integers forms the allelic profile that defines the ST; any difference between two alleles is counted as a single evolutionary event, regardless of how many nucleotides differ. For example, the original scheme for Neisseria meningitidis targeted the genes abcZ, adk, aroE, fumC, gdh, pdhC, and pgm, giving high discriminatory power with billions of possible STs (over 20 billion for seven loci with 30 alleles each). The approach is highly reproducible across laboratories, because sequence data are electronically portable and stored in public databases like PubMLST, which allows global comparisons without exchanging physical strains. MLST's key advantages include its precision at the DNA level, which surpasses phenotypic methods by detecting subtle genetic variations, and its applicability to direct clinical samples like cerebrospinal fluid or blood without prior culturing. It has been instrumental in epidemiological surveillance, such as tracking hypervirulent clones of Staphylococcus aureus, Streptococcus pneumoniae, and meningococci during outbreaks, and in evolutionary studies that infer population dynamics with tools like eBURST for identifying clonal complexes.
While traditional MLST focuses on a fixed set of loci, extensions like core genome MLST (cgMLST) incorporate thousands of genes for enhanced resolution in the era of whole-genome sequencing, though the original scheme remains foundational for its simplicity and cost-effectiveness in resource-limited settings.

Introduction

Definition

Multilocus sequence typing (MLST) is a sequence-based molecular typing method used to characterize microbial isolates, particularly bacteria, by examining allelic variation in multiple genes to assign unique sequence types (STs). This approach provides a standardized, portable system for identifying and tracking clonal lineages within populations, enabling unambiguous comparisons across laboratories worldwide. Typically, MLST schemes analyze internal fragments of seven housekeeping genes, though the exact number varies with the species-specific scheme.

Housekeeping genes in MLST are essential loci that encode proteins involved in basic cellular functions such as metabolism. They evolve slowly because they are conserved, which makes them suitable for resolving long-term evolutionary relationships without being overly influenced by recent selective pressures. Alleles are the distinct variants at each locus, identified by differences in the nucleotide sequences of the fragments, and are assigned unique identifiers in their order of discovery in a reference database. A sequence type (ST) is then defined as the specific combination of alleles across all loci examined. For example, an isolate with alleles 1, 3, 5, 2, 4, 7, and 6 at seven loci has the allelic profile 1-3-5-2-4-7-6, and the database assigns that profile a single arbitrary ST number.

The standardization of MLST relies on a global database in which alleles are cataloged and numbered sequentially, so that the same sequence always receives the same allele number regardless of the laboratory performing the analysis, which facilitates comparison of results between laboratories. This database-driven system, initially developed for pathogens like Neisseria meningitidis, has been extended to numerous bacterial species and some eukaryotes, supporting epidemiological surveillance and evolutionary studies.

History

Multilocus sequence typing (MLST) emerged in 1998 as a sequence-based alternative to multilocus enzyme electrophoresis (MLEE), which had been used since the 1980s to infer genetic variation through protein mobility patterns but suffered from limited portability and reproducibility. The method was first proposed and validated by Maiden et al. for Neisseria meningitidis, the causative agent of meningococcal disease, by sequencing internal fragments of seven housekeeping genes to define allelic profiles and sequence types (STs) that unambiguously identify clones. This approach, developed by a team including Brian G. Spratt, addressed MLEE's ambiguities by relying on DNA sequences that can be exchanged globally without loss of information. Concurrently, Mark C. Enright and Spratt extended MLST to Streptococcus pneumoniae, sequencing seven loci in 295 isolates to delineate clones linked to invasive infections, one of the earliest applications beyond Neisseria.

Adoption accelerated in the early 2000s with schemes for key pathogens, including Staphylococcus aureus, for which Enright et al. in 2000 sequenced seven housekeeping genes in methicillin-susceptible and methicillin-resistant isolates to track epidemic clones worldwide. The launch of the first online MLST database in 2000 via the PubMLST platform enabled centralized allele curation and ST assignment, promoting standardized global comparisons and reducing inter-laboratory discrepancies. By 2004, MLST had achieved de facto standardization for bacterial surveillance, with over a dozen schemes published for diverse species and integrated into frameworks for outbreak detection and vaccine evaluation. In the early 2000s, MLST also expanded to fungal pathogens, exemplified by the first scheme for Candida albicans, developed in 2003, which gained traction in the 2010s for epidemiological studies of antifungal resistance.

Integration with next-generation sequencing (NGS) transformed the technique by allowing MLST profiles to be extracted from whole-genome assemblies, as demonstrated in 2012 for 66 species using short-read data. This evolution enhanced scalability without sacrificing portability. By 2020, over 100 MLST schemes existed across bacteria and eukaryotes, hosted on PubMLST for more than 100 species; as of 2025, PubMLST hosts schemes for over 130 species and genera, establishing MLST as a foundational tool in microbial genomics.

Methodology

Gene Selection and Sequencing

In multilocus sequence typing (MLST), housekeeping genes are selected as loci according to specific criteria to ensure reliable characterization of bacterial population structure. Typically, seven conserved genes are chosen, each providing an internal fragment of 400–500 base pairs (bp) with enough nucleotide variation to distinguish clones while remaining stable across strains. These genes must be unlinked on the chromosome to avoid bias from physical proximity, and informative in terms of allelic diversity, ideally with an average of around 30 alleles per locus to resolve billions of potential sequence types. For example, in the original MLST scheme for Neisseria meningitidis, genes such as abcZ (encoding a putative ABC transporter) and adk (adenylate kinase) were selected for their selective neutrality and congruence with multilocus enzyme data, contributing to high discriminatory power with an average of 17 alleles per locus.

The process begins with DNA extraction from bacterial isolates, followed by polymerase chain reaction (PCR) amplification of the targeted internal gene fragments using primers designed to be universal within the species. These primers bind conserved regions flanking the variable central portion of each gene, enabling reliable amplification and typically yielding products of approximately 450 bp. The amplified fragments are then sequenced in both directions using traditional Sanger sequencing on automated DNA sequencers, such as the Applied Biosystems Prism 377, to achieve the high-accuracy base calls needed for unambiguous allele identification. Sequences obtained this way are portable and comparable across laboratories, because the data consist of exact nucleotide sequences rather than interpreted electrophoretic patterns.

Quality control measures are integral to maintaining the integrity of MLST data, including standardization of fragment lengths to around 450 bp for consistent sequencing coverage. Loci prone to recombination, which could confound clonal inferences, are excluded during scheme development so that the selected genes reflect stable, vertically inherited variation. After sequencing, ambiguities or poor-quality reads are manually resolved or discarded.

Since around 2010, MLST has evolved toward high-throughput implementations using next-generation sequencing (NGS) technologies, such as 454 pyrosequencing, to reduce costs and increase scalability while preserving the core gene-targeting approach. In these methods, barcoded amplicons from multiple isolates are pooled and sequenced in a single run, enabling profiling of up to 96 samples simultaneously and lowering the per-isolate cost to approximately $38 USD, reported as a tenfold reduction compared to traditional Sanger-based workflows. This shift has facilitated broader application in clinical and surveillance settings without compromising the accuracy of locus-specific sequencing.
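The allele counts quoted above translate directly into the number of distinguishable allelic profiles: the product of the allele counts across loci. The following is a minimal sketch of that arithmetic; the function name `possible_profiles` is illustrative and not part of any MLST tool, and the equal-count-per-locus inputs are a simplifying assumption.

```python
import math
from typing import Iterable


def possible_profiles(alleles_per_locus: Iterable[int]) -> int:
    """Upper bound on distinct allelic profiles: product of allele counts per locus."""
    return math.prod(alleles_per_locus)


# Seven loci with 30 alleles each (the idealized figure in the text).
print(f"{possible_profiles([30] * 7):,}")  # 21,870,000,000

# Seven loci with 17 alleles each (the average reported for the original scheme).
print(f"{possible_profiles([17] * 7):,}")  # 410,338,673
```

This is a theoretical ceiling only; real populations occupy a small fraction of the possible profiles because alleles are not independently assorted in clonal organisms.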

Allele Assignment and Sequence Type Determination

Once the nucleotide sequences of the selected housekeeping gene fragments are obtained, allele identification begins by comparing each to a centralized database of known alleles, typically using tools such as BLAST for exact matching. Identical sequences receive the same arbitrary number at that locus, while any sequence differing by even a single nucleotide is classified as novel and assigned a new unique integer identifier. This ensures unambiguous categorization, because differences are not quantified by percentage divergence but treated as discrete allelic variants.

The sequence type (ST) is then determined by combining the allele numbers from the standard set of loci, usually seven, into a unique allelic profile, such as 1-5-3-7-2-4-6, which is assigned a distinct ST designation (e.g., ST-42). This profile serves as a portable numerical identifier for the strain, enabling direct comparison across global datasets without exchanging the sequence data themselves. To infer evolutionary relationships and define clonal complexes (CCs), the ST profiles are analyzed using algorithms like eBURST or minimum spanning tree methods, which cluster related STs by the number of shared alleles, typically grouping those differing at a single locus (single-locus variants, SLVs) into a CC while setting apart more divergent STs.

Several software tools automate allele assignment and ST calculation, integrating with databases like PubMLST's BIGSdb platform for querying and updating profiles. Command-line utilities such as mlst and commercial applications like BioNumerics support batch processing of Sanger or whole-genome sequencing data, performing BLAST-based alignments and generating STs efficiently. These tools also address potential ambiguities, such as mixed signals from double infections (e.g., superimposed peaks in chromatograms or ambiguous bases), by flagging inconclusive loci for manual review or exclusion.

MLST's resolution for strain differentiation relies on allelic mismatches across loci: isolates with identical profiles at all positions are deemed the same ST, those differing at one locus are close relatives within a CC, and differences at 2–7 loci discriminate more distantly related STs, providing a scalable metric for population structure analysis.
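The assignment logic described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how PubMLST or any named tool is implemented: the in-memory dictionaries stand in for a curated database, the sequences are made up, and real databases assign new allele numbers only after curator review rather than automatically. All function names are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple


def assign_allele(seq: str, known: Dict[str, int]) -> Tuple[int, bool]:
    """Exact-match lookup; any unseen sequence becomes a new allele (toy behavior)."""
    seq = seq.upper()
    if seq in known:
        return known[seq], False
    new_id = max(known.values(), default=0) + 1  # next integer in order of discovery
    known[seq] = new_id
    return new_id, True


def sequence_type(profile: Sequence[int],
                  st_table: Dict[Tuple[int, ...], int]) -> Optional[int]:
    """Map an allelic profile to its ST number, or None if the profile is new."""
    return st_table.get(tuple(profile))


def loci_differing(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of loci at which two profiles carry different alleles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


# Made-up alleles for one locus; a single-base change yields a novel allele.
adk_alleles = {"ACGTACGT": 1, "ACGTACGA": 2}
print(assign_allele("acgtacgt", adk_alleles))  # (1, False)
print(assign_allele("ACGTACGG", adk_alleles))  # (3, True)

# The profile/ST pairing mirrors the example in the text (1-5-3-7-2-4-6 -> ST-42).
st_table = {(1, 5, 3, 7, 2, 4, 6): 42}
print(sequence_type([1, 5, 3, 7, 2, 4, 6], st_table))  # 42
print(loci_differing((1, 5, 3, 7, 2, 4, 6), (1, 5, 3, 7, 2, 4, 9)))  # 1 (an SLV)
```

Exact string matching is used here instead of BLAST to keep the sketch self-contained; it gives the same answer whenever the query fragment is trimmed to the scheme's defined start and end positions.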

Comparison with Other Typing Methods

Pre-Genomic Techniques

Prior to the advent of multilocus sequence typing (MLST), bacterial strain typing relied on pre-genomic techniques that were primarily phenotypic or early genotypic methods, offering limited resolution and portability for epidemiological studies. These approaches, while foundational in identifying clonal relationships and tracking outbreaks, suffered from subjectivity, labor intensity, and difficulty in inter-laboratory comparison, prompting the development of more standardized sequence-based alternatives like MLST.

Multilocus enzyme electrophoresis (MLEE) emerged as an early genotypic method in the 1970s and 1980s, analyzing the electrophoretic mobility of 10 to 20 housekeeping enzymes to infer allelic variation at multiple loci. This semiquantitative technique detected slowly evolving, neutral mutations, providing insights into long-term population structure and clonal complexes in pathogens like Neisseria meningitidis. However, MLEE's reliance on gel-based assays introduced variability due to differences in enzyme extraction, gel preparation, and staining, resulting in poor reproducibility and limited portability across laboratories; results were often compared visually or via dendrograms, hindering global data exchange.

Pulsed-field gel electrophoresis (PFGE), developed in the 1980s and standardized for surveillance programs like PulseNet in the 1990s, offered higher resolution for short-term epidemiology by digesting bacterial DNA with rare-cutting restriction enzymes and separating large fragments (up to 1 Mb) using alternating electric fields. This method excelled in outbreak investigations, such as distinguishing Salmonella or Escherichia coli strains, with discriminatory power often exceeding that of MLST (e.g., Simpson's index of 0.999 for Pseudomonas aeruginosa isolates). Despite its "gold standard" status for local tracking, PFGE was labor-intensive, requiring 2–3 days per analysis, and non-portable because of protocol variations and the need for image-based pattern matching, which complicated international comparisons and did not reveal broader evolutionary relationships.

Serotyping and phage typing were simpler phenotypic approaches: serotyping uses antisera to detect surface antigens (e.g., O and H antigens in Salmonella), and phage typing assesses susceptibility to specific bacteriophages to differentiate strains in species like Staphylococcus aureus. These methods were quick and inexpensive but limited to strains expressing the relevant antigens or receptors, often leaving 10–15% of strains untypeable, and were prone to cross-reactivity or phase variation, reducing discriminatory power for clonal analysis. Reproducibility was low because of environmental factors and the labor-intensive preparation of antisera or phage stocks, restricting their utility beyond initial grouping.

MLST addressed these shortcomings by sequencing short fragments of multiple genes to assign unambiguous alleles, creating portable, digital sequence types that support centralized global databases. Unlike the image-dependent or gel-variable outputs of MLEE and PFGE, or the antigen-limited scope of serotyping and phage typing, MLST's sequence data ensure high reproducibility (approaching 100%) and support both outbreak investigation and long-term evolutionary tracking, as demonstrated in its application to N. meningitidis with billions of possible types from seven loci (over 20 billion assuming 30 alleles each). This allelic profiling approach, while more costly, transformed bacterial typing by allowing electronic data exchange without loss of resolution.
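The Simpson's index figure quoted for PFGE is a measure of discriminatory power: the probability that two isolates drawn at random from a collection fall into different types. A minimal sketch follows, assuming the finite-sample form commonly used for typing methods, D = 1 − Σ nⱼ(nⱼ − 1) / (N(N − 1)); the source does not state which form was used, and the isolate labels below are made up.

```python
from collections import Counter
from typing import Hashable, Iterable


def simpsons_diversity(types: Iterable[Hashable]) -> float:
    """Probability that two isolates sampled without replacement have different types."""
    counts = Counter(types)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two isolates")
    same_type_pairs = sum(c * (c - 1) for c in counts.values())
    return 1.0 - same_type_pairs / (n * (n - 1))


# Toy collection: four isolates, three types.
isolates = ["ST8", "ST8", "ST5", "ST398"]
print(round(simpsons_diversity(isolates), 4))  # 0.8333
```

A value near 1 means almost every isolate has its own type, and a value of 0 means the method cannot separate any of them. Comparing the index for two methods on the same isolate collection is the usual way to support statements such as "PFGE is more discriminatory than MLST".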

Whole-Genome Sequencing Approaches

Whole-genome sequencing (WGS) approaches have largely supplanted or complemented multilocus sequence typing (MLST) in bacterial pathogen surveillance by providing genome-wide resolution through single-nucleotide polymorphism (SNP) analysis. Traditional MLST relies on sequence variation at just seven loci, which often lacks the discriminatory power to distinguish closely related strains in outbreak investigations. In contrast, SNP typing from WGS identifies thousands of variants across the entire bacterial genome, differentiating isolates that share identical MLST profiles. For instance, in one analysis of outbreak strains, WGS-based SNP typing resolved 249 unique profiles, far exceeding the resolution of standard MLST.

A key WGS-based extension of MLST is core genome MLST (cgMLST), which expands the scheme to 1,000–3,000 conserved loci present in the core genome of a bacterial species or genus, rather than the seven loci of classical MLST. This approach keeps the gene-by-gene allelic numbering system of MLST while achieving higher resolution for epidemiological clustering. For example, the PulseNet network employs a cgMLST scheme with 2,513 loci for surveillance, detecting foodborne clusters with greater precision than traditional methods.

WGS methods, including SNP typing and cgMLST, offer advantages over MLST by detecting recombination events and variation in accessory genes, which fall outside MLST's focus on core housekeeping genes. These capabilities give insight into bacterial evolution, virulence factors, and antimicrobial resistance profiles that MLST cannot capture. However, WGS requires more computational resources than MLST, typically 8–64 GB of RAM and 4–16 processing cores for assembly and variant calling of individual bacterial isolates, with higher demands for large datasets.

Interoperability between MLST and WGS approaches is ensured by designing cgMLST schemes to include the traditional MLST loci as a subset, allowing backward compatibility and integration of historical data into modern genomic databases. This hierarchical structure supports scalable analysis: MLST provides a portable nomenclature for global comparisons, while cgMLST or full WGS refines resolution for local outbreaks.

Applications

Bacterial

Multilocus sequence typing (MLST) plays a pivotal role in bacterial surveillance by assigning sequence types (STs) to isolates, which makes it possible to trace the clonal complexes responsible for outbreaks. In public health practice, MLST identifies related strains within clonal complexes, allowing authorities to track transmission pathways and implement targeted interventions. For instance, typing of clonal complex 8 (CC8) in methicillin-resistant Staphylococcus aureus (MRSA) has been instrumental in delineating the global spread of hospital- and community-acquired strains, with ST8 variants like USA300 emerging as dominant pandemic lineages.

Specific examples highlight MLST's utility in linking bacterial strains to sources and disease outcomes. In Campylobacter jejuni, the ST-21 complex predominates in human infections and is strongly associated with animal reservoirs, aiding source attribution in foodborne outbreaks. For Neisseria meningitidis, the ST-11 complex defines hypervirulent lineages responsible for invasive meningococcal disease, enabling rapid identification of outbreak strains across regions. The ST-398 lineage of Staphylococcus aureus is a livestock-associated MRSA that has been transmitted zoonotically to humans, particularly in agricultural settings, underscoring MLST's value in one-health surveillance. In Streptococcus pyogenes, ST-28 is frequently linked to invasive group A streptococcal infections, helping to monitor shifts in virulence. Similarly, Cronobacter sakazakii ST-4 is a stable clone implicated in severe neonatal cases, often traced to contaminated powdered infant formula.

MLST's integration into surveillance networks enhances real-time outbreak detection and response. Through platforms like PulseNet, MLST-derived data support standardized subtyping of foodborne pathogens, facilitating interstate and international investigations. Recent advancements as of 2025 have validated core genome MLST (cgMLST) extensions for Shiga toxin-producing Escherichia coli (STEC) confirmation, improving the accuracy of linking clinical cases to environmental sources in national surveillance systems.

To infer evolutionary relationships and population structure from MLST data, the eBURST algorithm groups STs into clonal complexes based on shared alleles and predicts ancestral founders and their recent descendants, which helps reconstruct patterns of descent and detect emerging variants. This approach has been widely adopted for analyzing bacterial populations in large datasets, providing insights into recombination events and long-term evolution.
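The grouping step that eBURST performs can be illustrated with a simplified sketch. This is not the eBURST implementation: it links STs that are single-locus variants (SLVs) of one another, takes the connected groups as clonal complexes, and picks as predicted founder the ST with the most SLVs in its group. Bootstrap support, double-locus variants, and isolate frequencies are omitted, and the profiles are made up.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Profile = Tuple[int, ...]


def locus_differences(a: Profile, b: Profile) -> int:
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def group_by_slv(profiles: Dict[int, Profile]) -> List[Set[int]]:
    """Connected components of the graph linking STs that differ at exactly one locus."""
    parent = {st: st for st in profiles}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        if locus_differences(profiles[a], profiles[b]) == 1:
            parent[find(a)] = find(b)

    groups: Dict[int, Set[int]] = {}
    for st in profiles:
        groups.setdefault(find(st), set()).add(st)
    # Largest group first; ties broken by lowest ST number.
    return sorted(groups.values(), key=lambda g: (-len(g), min(g)))


def predicted_founder(group: Set[int], profiles: Dict[int, Profile]) -> int:
    """ST with the most SLVs inside its group (lowest ST number wins ties)."""
    def slv_count(st: int) -> int:
        return sum(locus_differences(profiles[st], profiles[o]) == 1
                   for o in group if o != st)
    return max(sorted(group), key=slv_count)


toy_profiles = {
    1: (1, 1, 1, 1, 1, 1, 1),
    2: (1, 1, 1, 1, 1, 1, 2),  # SLV of ST1
    3: (1, 1, 1, 1, 1, 3, 1),  # SLV of ST1
    4: (1, 1, 1, 1, 1, 3, 4),  # SLV of ST3, double-locus variant of ST1
    6: (1, 1, 1, 1, 2, 1, 1),  # SLV of ST1
    5: (9, 9, 9, 9, 9, 9, 9),  # unrelated singleton
}
complexes = group_by_slv(toy_profiles)
print(complexes)                                      # [{1, 2, 3, 4, 6}, {5}]
print(predicted_founder(complexes[0], toy_profiles))  # 1
```

The pairwise comparison is quadratic in the number of STs, which is adequate for a sketch but would need indexing for databases with tens of thousands of profiles.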

Fungal and Eukaryotic Pathogens

Multilocus sequence typing (MLST) has been adapted for fungal pathogens. A prominent example is Candida albicans, for which a standardized scheme of seven housekeeping loci, described as including genes encoding an alcohol dehydrogenase, an adenylate kinase, and glucose-6-phosphate dehydrogenase, enables the identification of sequence types (STs) linked to clinical outcomes. This approach has facilitated tracking of antifungal resistance, as certain STs, like those in Clade 1, correlate with reduced antifungal susceptibility in clinical cases. Similarly, for Cryptococcus neoformans, an MLST scheme based on seven loci (CAP59, GPD1, LAC1, PLB1, SOD1, URA5, and the IGS1 region of rDNA) has been used to link clinical isolates to environmental sources and investigate outbreaks, revealing clonal expansions in immunocompromised patients. These fungal applications underscore MLST's utility in delineating population structure under antifungal selective pressure.

Extensions of MLST to eukaryotic pathogens beyond fungi include protozoan parasites like Plasmodium, where multilocus schemes incorporating 10 neutral loci have assessed strain diversity and transmission dynamics in malaria-endemic regions, highlighting recombination events that influence antigenic variation. For Trypanosoma cruzi, the causative agent of Chagas disease, optimized MLST using six to eight housekeeping genes (e.g., dihydrofolate reductase-thymidylate synthase) has resolved discrete typing units (DTUs) and informed epidemiological surveillance, helping to trace sylvatic-to-domestic transmission cycles across Latin America. These schemes reveal higher genetic diversity in eukaryotic pathogens than in many bacteria, reflecting, among other factors, larger effective population sizes.

Adaptations for fungi and other eukaryotes address distinct genomic challenges, such as larger genomes and higher intron densities, which require loci to be chosen with exon-intron boundaries in mind so that alleles are called consistently across strains. Unlike bacterial MLST, which benefits from compact prokaryotic genomes, eukaryotic schemes often require bioinformatic adjustments for splicing variants, and fewer standardized protocols exist; for instance, Candida schemes were refined in the early 2010s to incorporate whole-genome data for better resolution. Recent developments include the application of an established five-locus MLST scheme for Scedosporium species in a 2025 environmental survey in Lebanon, using actin, calmodulin, β-tubulin, RNA polymerase II (RPB2), and manganese superoxide dismutase to assess genetic diversity and potential links to clinical isolates, with implications for outbreak detection in vulnerable populations such as cystic fibrosis patients. This highlights ongoing efforts to extend MLST to emerging fungal threats.

Advantages and Limitations

Advantages

One key advantage of multilocus sequence typing (MLST) is its portability: data can be exchanged electronically between laboratories worldwide without transporting physical samples, which makes centralized databases for global surveillance possible. Typing results from diverse research groups and organizations can be integrated directly, supporting rapid identification of emerging clones. MLST also offers high reproducibility because it relies on unambiguous allele assignments based on exact nucleotide sequences, producing standardized integer-based sequence types (STs) that eliminate the subjective interpretation common in methods like pulsed-field gel electrophoresis (PFGE). Each unique allelic profile is assigned a distinct ST number, so results are consistent regardless of the laboratory or equipment used.

The method's scalability has improved dramatically with the adoption of next-generation sequencing (NGS), which allows hundreds to thousands of isolates to be processed efficiently at costs below those of traditional Sanger-based MLST. Furthermore, MLST supports evolutionary inference: analysis of ST clustering reveals population structure, recombination rates, and clonal relationships within microbial populations. Allelic variation in housekeeping genes provides insight into evolutionary dynamics, such as the balance between mutation and recombination, which is critical for understanding adaptation and spread.

Limitations

One major limitation of classical multilocus sequence typing (MLST) is its restricted discriminatory power: it typically examines sequences from only seven loci, which represent a mere 0.1–0.2% of a bacterial genome. This limited sampling often fails to detect fine-scale genetic differences among closely related strains, so that isolates involved in distinct outbreaks receive identical sequence types (STs). For instance, in Staphylococcus aureus, multiple strains such as USA300 variants share the same ST8 despite substantial genomic diversity and epidemiological independence.

Recombination further exacerbates this issue by shuffling alleles among lineages, potentially masking phylogenetic relationships and leading to inaccurate inferences of strain evolution. In species with high recombination rates, the exchange of genetic material can create mosaic alleles that confound MLST-based clustering, because the small number of loci provides too little sequence data to resolve the genome-wide phylogeny.

The traditional implementation of MLST, which depends on Sanger sequencing for allele identification, is also constrained by cost and labor, making it impractical for large-scale surveillance. Sequencing costs for a single isolate can approximate $50, which is prohibitive for routine application across thousands of samples. Additionally, the method requires access to curated databases for allele assignment; novel variants not previously cataloged cannot be typed immediately and must undergo manual submission and verification, which delays analysis and creates bottlenecks in real-time surveillance.

Locus selection in MLST introduces inherent bias, because the focus on conserved genes, chosen for their stability under neutral evolution, overlooks variation in genes associated with adaptive traits such as virulence factors or resistance determinants. This neutral focus limits MLST's utility for studying pathogen-host interactions or the emergence of resistant clones, where accessory genome elements play critical roles. Finally, classical MLST proves inadequate for hypervariable organisms, including highly recombinant bacteria and viruses, in which rapid mutation and gene exchange outpace the method's capacity to track diversity without supplementary approaches. In such cases, the fixed set of loci fails to capture the extensive genomic plasticity, making ST assignments unreliable for outbreak delineation or evolutionary reconstruction.

Extensions and Recent Developments

Core Genome MLST

Core genome multilocus sequence typing (cgMLST) expands on traditional multilocus sequence typing by analyzing a larger set of conserved genes from the core genome, typically 1,000 to 4,000 loci, which represent the genes present in nearly all isolates of a bacterial species while excluding accessory or variable genomic elements. This approach uses whole-genome sequencing (WGS) data to provide higher resolution for strain typing than classical MLST, which relies on only 7–10 genes, and effectively treats classical schemes as a subset of the broader cgMLST framework. By focusing on the core genome, cgMLST retains portability and standardization across laboratories, as allelic profiles can be compared consistently regardless of sequencing platform.

cgMLST emerged in the early 2010s as whole-genome sequencing became more accessible, with early schemes designed to address the limitations of traditional typing in capturing fine-scale genomic variation for epidemiological purposes. A notable example is the cgMLST scheme for Listeria monocytogenes implemented in the PulseNet surveillance network, which used 1,748 core loci by 2013 to enable tracking of foodborne outbreaks. This scheme was built on genome comparisons of diverse isolates, identifying loci present in at least 95% of strains to balance coverage and discriminatory power, and has since been refined for broader application.

In the cgMLST process, WGS data from bacterial isolates are assembled into contigs, followed by gene-by-gene allele calling, in which each locus is compared against a reference database of known allele sequences to assign the matching variant based on exact or threshold-based similarity. The resulting allelic profile, a numerical string representing the variant at each locus, is then used to calculate genetic relatedness between strains, often as a Hamming distance, which counts the number of differing alleles as a simple metric of divergence. The allele-calling step can be performed on raw reads or assemblies using standardized software, which supports reproducibility and limits the effect of assembly errors on accuracy.

Compared to classical MLST, cgMLST offers superior resolution for outbreak detection by detecting subtle genetic differences that distinguish closely related strains within hours of sequencing, as demonstrated in a 2025 validation study for Shiga toxin-producing Escherichia coli (STEC) surveillance across European networks, where cgMLST clusters aligned with epidemiological links missed by traditional methods. This enhanced discrimination has led to its standardization for over 10 bacterial pathogens, facilitating global surveillance and reducing false-negative outbreak identifications.
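The relatedness calculation described above, counting the loci at which two allelic profiles carry different alleles, can be sketched as follows. Skipping loci that are missing (uncalled) in either isolate is an assumption made here for illustration; pipelines differ in how they treat missing data, and the four-locus profiles below are made up and far shorter than a real cgMLST profile.

```python
from itertools import combinations
from typing import Dict, Optional, Sequence, Tuple

AllelicProfile = Sequence[Optional[int]]  # None marks a locus with no allele call


def allelic_distance(a: AllelicProfile, b: AllelicProfile) -> int:
    """Count loci with different alleles, ignoring loci missing in either profile."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci in the same order")
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)


def pairwise_distances(
    profiles: Dict[str, AllelicProfile]
) -> Dict[Tuple[str, str], int]:
    """Distance for every unordered pair of isolates."""
    return {(i, j): allelic_distance(profiles[i], profiles[j])
            for i, j in combinations(sorted(profiles), 2)}


isolates = {
    "iso_A": [1, 2, 3, 4],
    "iso_B": [1, 2, 3, 9],     # one allele differs from iso_A
    "iso_C": [1, 5, None, 9],  # third locus not called
}
for pair, d in pairwise_distances(isolates).items():
    print(pair, d)
# ('iso_A', 'iso_B') 1
# ('iso_A', 'iso_C') 2
# ('iso_B', 'iso_C') 1
```

In practice a distance threshold chosen per species is applied to such a matrix to define outbreak clusters; the threshold itself is an epidemiological judgment and is not encoded here.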

Whole Genome MLST

Whole genome multilocus sequence typing (wgMLST) extends traditional MLST by performing allelic profiling across the entire bacterial genome, covering all protein-coding genes in a reference set, including both core and accessory elements, to achieve maximal discrimination. This approach converts whole-genome sequencing data into discrete alleles at thousands of loci, often 3,000 to 5,000 or more, enabling precise comparisons of isolates without relying solely on single-nucleotide polymorphisms (SNPs). Unlike core genome MLST (cgMLST), which limits analysis to conserved genes, wgMLST incorporates accessory genes to capture the full genomic diversity, making it particularly suited to resolving closely related isolates in outbreak investigations.

wgMLST emerged in the mid-2010s as whole-genome sequencing became routine, with foundational proposals building on earlier gene-by-gene typing concepts to exploit full genomic data. A key advancement occurred in 2016 with its application to bacterial pathogens using schemas of nearly 4,000 loci derived from the open reading frames of reference strains. More recently, in 2022, researchers developed a wgMLST schema based on 208 complete genomes, identifying 3,044 target loci to support scalable high-resolution typing. Such schemas are refined iteratively using diverse isolate collections to ensure portability across laboratories.

In wgMLST analysis, genomes are scanned for matches to the predefined loci in the schema, unique allele numbers are assigned to variants, and the resulting profiles are used for distance-based clustering with allelic distance metrics. Tools like ChewBBACA support schema creation by identifying candidate loci through sequence similarity thresholds and performing allele calling on assemblies, while also evaluating schema quality with metrics like locus conservation. This process has higher computational demands than traditional MLST because of the number of loci, often requiring assembled genomes and significant processing time, though cloud-based platforms mitigate this for large datasets.

wgMLST enables very fine strain discrimination in challenging scenarios, such as typing from degraded environmental or clinical forensic samples, and its allele-based profiles are more portable across international databases than SNP calls. It also bridges classical MLST's portability with the resolution of SNP analysis, offering allele profiles that correlate strongly with SNP distances for epidemiological surveillance.

Databases and Resources

PubMLST Database

The PubMLST database, established in 2003 at the as an extension of the initial multilocus sequence typing (MLST) efforts begun in 1998, serves as the primary global repository for MLST and related gene-by-gene typing data for bacterial and fungal pathogens. It hosts over 100 MLST schemes covering more than 130 microbial species and genera, encompassing classical MLST, ribosomal MLST (rMLST), core genome MLST (cgMLST), and whole genome MLST (wgMLST) approaches. The database catalogs millions of alleles across these schemes, alongside millions of sequence type (ST) profiles and isolate records, enabling standardized comparisons of microbial population structures worldwide. Key features include curated collections of sequences, which are novel variants of or genes assigned unique integer identifiers, and ST profiles that combine these alleles into portable genotypes for each isolate. Isolate is integrated, capturing details such as geographic location (including GPS coordinates), isolation source, and clinical or epidemiological phenotypes like patterns or factors. The database employs the Bacterial Isolate Genome Sequence Database (BIGSdb) platform as its backend, providing tools for data submission via web interfaces or RESTful APIs, advanced querying by , ST, or filters, and automated allele calling from raw sequences. As of 2025, registration is required to access , , and isolate data added after December 31, 2024. PubMLST offers free, open-access downloads of entire schemes, allele libraries, and isolate datasets in formats suitable for phylogenetic or epidemiological analyses, with no restrictions on usage for research or applications. Its BIGSdb supports extensible plugins for in-depth analyses, such as constructing phylogenetic trees, goeBURST clustering for inferring evolutionary relationships, or mapping onto minimum spanning trees using tools like GrapeTree. 
Maintenance is community-driven: approximately 125 volunteer curators worldwide validate submissions and update schemes, which keeps the data accurate and current. The platform also integrates with the National Center for Biotechnology Information (NCBI) to facilitate the incorporation of whole-genome sequencing data from public repositories.
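
To illustrate how an allelic profile resolves to a sequence type, the sketch below looks up an ST in a tab-delimited profile table with one column per locus. It is a hedged illustration rather than BIGSdb's own code: the column layout is assumed, the helper names load_profiles and assign_st are hypothetical, and the ST and allele numbers are invented toy values, not real PubMLST records. The locus names are those of the Neisseria meningitidis scheme.

```python
import csv
import io
from typing import Dict, Optional, Tuple

# The seven loci of the Neisseria meningitidis MLST scheme.
LOCI = ("abcZ", "adk", "aroE", "fumC", "gdh", "pdhC", "pgm")

# Toy profile table: the ST and allele numbers are invented for illustration.
TOY_ROWS = [
    ("ST",) + LOCI,
    (1, 1, 1, 1, 1, 1, 1, 1),
    (2, 1, 1, 2, 1, 1, 1, 1),
    (3, 2, 3, 1, 4, 1, 2, 1),
]
PROFILES_TSV = "\n".join("\t".join(str(v) for v in row) for row in TOY_ROWS)


def load_profiles(text: str, loci: Tuple[str, ...] = LOCI) -> Dict[Tuple[int, ...], int]:
    """Parse a tab-delimited profile table into {allele tuple: ST}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {tuple(int(row[locus]) for locus in loci): int(row["ST"]) for row in reader}


def assign_st(
    alleles: Dict[str, int],
    table: Dict[Tuple[int, ...], int],
    loci: Tuple[str, ...] = LOCI,
) -> Optional[int]:
    """Return the ST whose profile matches every locus, or None if unknown."""
    return table.get(tuple(alleles[locus] for locus in loci))


table = load_profiles(PROFILES_TSV)
known = dict(zip(LOCI, (1, 1, 2, 1, 1, 1, 1)))
novel = dict(zip(LOCI, (1, 1, 9, 1, 1, 1, 1)))

print(assign_st(known, table))  # 2
print(assign_st(novel, table))  # None
```

A result of None corresponds to a combination of alleles that is not yet in the table; in a curated database such a profile would be submitted so that a curator can assign a new ST.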

Other Specialized Databases

In addition to the central PubMLST resource, several specialized databases focus on pathogen-specific or regionally tailored multilocus sequence typing (MLST) schemes to support targeted surveillance and research. For instance, the TB Portals database, maintained by the National Institute of Allergy and Infectious Diseases, integrates genomic and clinical data from over 28,000 cases worldwide, facilitating sequence-based typing analyses that complement traditional MLST approaches for Mycobacterium tuberculosis. Similarly, the Centers for Disease Control and Prevention (CDC) employs a whole-genome MLST (wgMLST) scheme for M. tuberculosis that uses 2,672 genetic loci to improve discrimination in outbreak investigations. EnteroBase is a key repository for enteric bacteria, including Salmonella and Escherichia coli; it applies core-genome MLST (cgMLST), among other methods, to analyze more than 1.1 million bacterial genomes for epidemiological tracking and prediction. EnteroBase's schemes build on a wgMLST scheme of 21,065 loci, enabling finer resolution of serovar-specific transmission in foodborne outbreaks.

Regional networks further specialize MLST applications for surveillance. PulseNet International, a global collaboration of public health laboratories, standardizes cgMLST schemes for foodborne pathogens such as Shiga toxin-producing E. coli, using whole-genome sequencing data to detect multistate outbreaks and trace contamination sources across borders. In Europe, the European Surveillance System (TESSy) of the European Centre for Disease Prevention and Control (ECDC) collects and harmonizes molecular typing data, including MLST profiles, from member states to monitor pathogen spread in real time.

Emerging databases extend MLST to non-bacterial pathogens. FungiDB, part of the VEuPathDB bioinformatics platform, integrates genomic datasets for fungal species, supporting MLST schemes that sequence seven housekeeping genes to delineate clonal complexes and track antifungal resistance in clinical isolates.
For viruses, adaptations of MLST principles appear in HIV-1 surveillance, where pol gene sequencing is used to profile subtypes and mutations, though not in a traditional allelic database format.

These specialized databases often link to PubMLST through shared allelic profiles, which promotes data interoperability. Harmonization nevertheless remains difficult because schemes differ, nomenclature is inconsistent, and resources for curation are limited.