Multilocus sequence typing

Multilocus sequence typing (MLST) is a standardized molecular method for characterizing bacterial isolates by sequencing short, variable fragments (typically 450–500 base pairs) from multiple housekeeping genes, usually seven, to assign unambiguous allelic profiles that define distinct sequence types (STs).^[1]^[2] Developed in 1998 as a portable alternative to earlier techniques like multilocus enzyme electrophoresis (MLEE), MLST exploits the stability of housekeeping genes, which are essential for basic cellular functions and evolve slowly, to identify clonal lineages and population structures in pathogenic bacteria.^[1]^[3]

The methodology involves PCR amplification of gene fragments, followed by sequencing and comparison against curated databases to identify alleles. Each unique sequence at a locus receives a distinct integer identifier, and the combination of these integers forms the allelic profile that defines the ST; any nucleotide difference is treated as a single evolutionary event regardless of how many bases differ.^[2]^[3] For example, the original scheme for Neisseria meningitidis targeted genes such as abcZ, adk, aroE, fumC, gdh, pdhC, and pgm, enabling high discriminatory power with potentially billions of possible STs (e.g., over 20 billion for seven loci with 30 alleles each).^[1] This approach is highly reproducible across laboratories, as sequence data are electronically portable and stored in public databases like PubMLST, facilitating global comparisons without exchanging physical strains.^[2]^[3]

MLST's key advantages include its precision at the DNA level, which surpasses phenotypic methods by detecting subtle genetic variations, and its applicability to direct clinical samples like cerebrospinal fluid or blood without prior culturing.^[3] It has been instrumental in epidemiological surveillance, such as tracking hypervirulent clones of Staphylococcus aureus, Streptococcus pneumoniae, and meningococci during outbreaks, and in evolutionary studies to infer population dynamics via tools like eBURST for identifying clonal complexes.^[1]^[3] While traditional MLST focuses on a fixed set of loci, extensions like core genome MLST (cgMLST) incorporate thousands of genes for enhanced resolution in the era of whole-genome sequencing, though the original scheme remains foundational for its simplicity and cost-effectiveness in resource-limited settings.^[3]

Introduction

Definition

Multilocus sequence typing (MLST) is a nucleotide sequence-based molecular typing method used to characterize microbial isolates, particularly bacteria, by examining allelic variations in multiple housekeeping genes to assign unique sequence types (STs).^[1] This approach provides a standardized, portable system for identifying and tracking clonal lineages within pathogen populations, enabling unambiguous comparisons across laboratories worldwide.^[4] Typically, MLST schemes analyze internal fragments of seven housekeeping genes, though the exact number can vary by species-specific protocol.^[4]

Housekeeping genes in MLST are essential loci that encode proteins involved in basic cellular functions, such as metabolism, and evolve slowly due to their conserved nature, making them ideal for resolving long-term evolutionary relationships without being overly influenced by recent selective pressures. Alleles represent distinct sequence variants at each locus, identified by differences in nucleotide sequences of the gene fragments, and are assigned unique integer identifiers based on their order of discovery in a centralized database.^[1] A sequence type (ST) is then defined as the specific combination of alleles across all loci examined; for example, an isolate with alleles 1, 3, 5, 2, 4, 7, and 6 at seven loci has the allelic profile 1-3-5-2-4-7-6, and that profile is assigned a single ST number in the database rather than being named by its concatenated allele numbers.^[4]

The standardization of MLST relies on a global database where alleles are cataloged and numbered sequentially, ensuring that the same sequence always receives the same allele number regardless of the laboratory performing the analysis, which facilitates data portability and reproducibility.^[1] This database-driven system, initially developed for pathogens like Neisseria meningitidis, has been extended to numerous bacterial species and some eukaryotes, supporting epidemiological surveillance and population genetics studies.^[4]
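The order-of-discovery numbering described above can be illustrated with a small sketch. The `AlleleRegistry` class and the short sequences are hypothetical stand-ins for a curated database and real gene fragments; an actual database adds curation and validation steps that are not modelled here.

```python
from typing import Dict


class AlleleRegistry:
    """Toy in-memory stand-in for a curated allele database (hypothetical)."""

    def __init__(self) -> None:
        # locus name -> {sequence -> allele number}
        self._alleles: Dict[str, Dict[str, int]] = {}

    def assign(self, locus: str, sequence: str) -> int:
        """Return the allele number for a sequence, registering it if new."""
        known = self._alleles.setdefault(locus, {})
        seq = sequence.upper()
        if seq not in known:
            # Numbers are handed out per locus in order of discovery, from 1.
            known[seq] = len(known) + 1
        return known[seq]


registry = AlleleRegistry()
first = registry.assign("adk", "ATGGCA")
second = registry.assign("adk", "ATGGCT")  # one nucleotide differs: new allele
repeat = registry.assign("adk", "atggca")  # same sequence: same number again
print(first, second, repeat)
```

Because the number depends only on the sequence and the locus, two laboratories submitting the same fragment end up with the same allele number, which is the property that makes the profiles portable.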

History

Multilocus sequence typing (MLST) emerged in 1998 as a sequence-based alternative to multilocus enzyme electrophoresis (MLEE), which had been used since the 1980s to infer genetic variation through protein mobility patterns but suffered from limited portability and reproducibility. The method was first proposed and validated by Maiden et al. for Neisseria meningitidis, the causative agent of meningococcal disease, by sequencing internal fragments of seven housekeeping genes to define allelic profiles and sequence types (STs) that unambiguously identify clones.^[1] This approach, developed by a team including Brian G. Spratt, addressed MLEE's ambiguities by leveraging unambiguous DNA sequences for global data exchange. Concurrently, Mark C. Enright and Spratt extended MLST to Streptococcus pneumoniae, sequencing seven loci in 295 isolates to delineate clones linked to invasive infections, marking one of the earliest applications beyond Neisseria.^[5] Adoption accelerated in the early 2000s with schemes for key pathogens, including Staphylococcus aureus, where Enright et al. in 2000 sequenced seven housekeeping genes in methicillin-susceptible and resistant isolates to track epidemic clones worldwide. The launch of the inaugural MLST database for Neisseria in 2000 via the PubMLST platform enabled centralized allele curation and ST assignment, promoting standardized global comparisons and reducing inter-laboratory discrepancies.^[6] By 2004, MLST had achieved de facto standardization for bacterial surveillance, with over a dozen schemes published for diverse species like Campylobacter jejuni and Haemophilus influenzae, integrated into public health frameworks for outbreak detection and vaccine evaluation.^[7] In the early 2000s, MLST expanded to fungal pathogens, exemplified by the first scheme for Candida albicans developed in 2003, which gained traction in the 2010s for epidemiological studies of antifungal resistance. 
Integration with next-generation sequencing (NGS) transformed the technique, allowing in silico extraction of MLST profiles from whole-genome assemblies, as demonstrated in 2012 for 66 bacterial species using short-read data. This evolution enhanced scalability without sacrificing portability. By 2020, more than 100 MLST schemes covering bacteria and eukaryotes were hosted on PubMLST; as of 2025, PubMLST hosts schemes for over 130 species and genera.^[8] These developments have established MLST as a foundational tool in microbial genomics.^[9]^[10]

Methodology

Gene Selection and Sequencing

In multilocus sequence typing (MLST), housekeeping genes are selected as loci based on specific criteria to ensure reliable characterization of bacterial population structure. Typically, seven conserved housekeeping genes are chosen, each providing internal fragments of 400–500 base pairs (bp) that exhibit sufficient nucleotide variation for distinguishing clones while maintaining stability across strains.^[2] These genes must be unlinked on the chromosome to avoid bias from physical proximity and informative in terms of allelic diversity, ideally with an average of around 30 alleles per locus to resolve billions of potential sequence types.^[2] For example, in the original MLST scheme for Neisseria meningitidis, genes such as abcZ (encoding a putative ABC transporter) and adk (adenylate kinase) were selected for their selective neutrality and congruence with multilocus enzyme electrophoresis data, contributing to high discriminatory power with an average of 17 alleles per locus.^[1] The process begins with DNA extraction from bacterial isolates, followed by polymerase chain reaction (PCR) amplification of the targeted internal gene fragments using primers designed to be universal within the species. These primers anneal to conserved regions that flank the variable central portions of the genes, enabling reliable amplification and typically yielding products of approximately 450 bp for standardization.^[2] The amplified fragments are then sequenced bidirectionally using traditional Sanger sequencing on automated DNA sequencers, such as the Applied Biosystems Prism 377, to achieve high-accuracy base calls essential for unambiguous allele identification.^[1] This method ensures that sequences are portable and comparable across laboratories, as the data consist of exact nucleotide sequences rather than interpreted electrophoretic patterns. 
Quality control measures are integral to maintain the integrity of MLST data, including standardization of fragment lengths to around 450 bp to facilitate consistent sequencing coverage and alignment. Loci prone to recombination, which could confound clonal inferences, are excluded during scheme development to ensure the selected genes reflect stable, vertically inherited variation.^[1] Post-sequencing, ambiguities or poor-quality reads are manually resolved or discarded to uphold data reliability. Since around 2010, MLST has evolved toward high-throughput implementations leveraging next-generation sequencing (NGS) technologies, such as Roche 454, to reduce costs and increase scalability while preserving the core gene-targeting approach. In these methods, barcoded PCR amplicons from multiple isolates are pooled and sequenced in a single run, enabling profiling of up to 96 samples simultaneously and lowering the per-isolate cost to approximately $38 USD—a tenfold reduction compared to traditional Sanger-based workflows.^[11] This shift has facilitated broader application in clinical and surveillance settings without compromising the accuracy of locus-specific sequencing.
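The "billions of potential sequence types" figure in this section follows from simple multiplication of allele counts across loci. The sketch below reproduces that arithmetic for the two averages quoted (30 and 17 alleles per locus); the function name is illustrative, and the calculation assumes alleles combine freely across loci, which is an upper bound rather than a count of observed STs.

```python
from math import prod
from typing import Sequence


def possible_sequence_types(alleles_per_locus: Sequence[int]) -> int:
    """Upper bound on distinct allelic profiles: product of per-locus counts."""
    return prod(alleles_per_locus)


seven_loci_30 = possible_sequence_types([30] * 7)
seven_loci_17 = possible_sequence_types([17] * 7)
print(f"{seven_loci_30:,}")  # 21,870,000,000
print(f"{seven_loci_17:,}")  # 410,338,673
```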

Allele Assignment and Sequence Type Determination

Once the nucleotide sequences of the selected housekeeping gene fragments are obtained, allele identification begins by comparing each sequence to a centralized database of known alleles, typically using sequence alignment tools such as BLAST for exact matching. Identical sequences receive the same arbitrary allele number at that locus, while any sequence differing by even a single nucleotide is classified as novel and assigned a new unique integer identifier. This process ensures unambiguous categorization, as nucleotide differences are not quantified by percentage divergence but treated as discrete allelic variants.^[4]^[1] The sequence type (ST) is then determined by concatenating the allele numbers from the standard set of loci—usually seven—into a unique allelic profile, such as 1-5-3-7-2-4-6, which is assigned a distinct ST designation (e.g., ST-42). This profile serves as a portable, numerical identifier for the strain, enabling direct comparison across global datasets without reliance on sequence data itself. To infer evolutionary relationships and define clonal complexes (CCs), the ST profiles are analyzed using algorithms like eBURST or minimum spanning tree methods, which cluster related STs based on the number of shared alleles, typically grouping those differing at a single locus (single-locus variants, SLVs) into a CC while highlighting more divergent subtypes.^[1]^[12] Several software tools automate allele assignment and ST calculation, integrating with databases like PubMLST's BIGSdb platform for querying and updating profiles. Command-line utilities such as mlst and commercial applications like BioNumerics facilitate batch processing of Sanger or whole-genome sequencing data, performing BLAST-based alignments and generating STs efficiently. 
These tools also address potential ambiguities, such as mixed signals from double infections (e.g., superimposed peaks in chromatograms or ambiguous bases), by flagging inconclusive loci for manual review or exclusion to maintain profile integrity.^[13]^[14]^[15] MLST's resolution for strain differentiation relies on allelic mismatches across loci, where isolates with identical profiles at all positions are deemed the same strain, those differing at one locus represent close relatives within a CC, and differences at 2–7 loci allow discrimination of more distantly related subtypes, providing a scalable metric for population structure analysis.^[1]
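As a minimal sketch of the steps in this section, the code below looks up an ST from an allelic profile and classifies two profiles by the number of loci at which they differ. The profile table is hypothetical (only the 1-5-3-7-2-4-6 / ST-42 pairing echoes the example in the text), and a dictionary lookup stands in for a database query.

```python
from typing import Dict, Optional, Tuple

Profile = Tuple[int, ...]

# Hypothetical profile table: allelic profile -> ST number.
ST_TABLE: Dict[Profile, int] = {
    (1, 5, 3, 7, 2, 4, 6): 42,
    (1, 5, 3, 7, 2, 4, 9): 43,
    (2, 8, 3, 1, 2, 9, 9): 44,
}


def lookup_st(profile: Profile) -> Optional[int]:
    """Return the ST for a known profile, or None if the profile is novel."""
    return ST_TABLE.get(profile)


def locus_differences(a: Profile, b: Profile) -> int:
    """Count loci at which two profiles carry different allele numbers."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


def relationship(a: Profile, b: Profile) -> str:
    """Label a pair of profiles by how many loci differ."""
    diffs = locus_differences(a, b)
    if diffs == 0:
        return "identical profile"
    if diffs == 1:
        return "single-locus variant"
    return f"differs at {diffs} loci"


st42, st43, st44 = ST_TABLE.keys()
print(lookup_st(st42))
print(relationship(st42, st43))
print(relationship(st42, st44))
```

Note that the comparison counts mismatching loci only; it deliberately ignores how many nucleotides separate the two alleles, in line with the allele-based definition above.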

Comparison with Other Typing Methods

Pre-Genomic Techniques

Prior to the advent of multilocus sequence typing (MLST), bacterial strain typing relied on pre-genomic techniques that were primarily phenotypic or early genotypic methods, offering limited resolution and portability for epidemiological studies. These approaches, while foundational in identifying clonal relationships and tracking outbreaks, suffered from subjectivity, labor intensity, and challenges in inter-laboratory comparisons, prompting the development of more standardized sequence-based alternatives like MLST.^[16] Multilocus enzyme electrophoresis (MLEE) emerged as an early genotypic method in the 1970s and 1980s, analyzing the electrophoretic mobility of proteins from 10 to 20 housekeeping enzymes to infer genetic variation at multiple loci. This semiquantitative technique detected slowly evolving, neutral mutations, providing insights into long-term population structure and clonal complexes in pathogens like Neisseria meningitidis. However, MLEE's reliance on gel-based assays introduced variability due to differences in enzyme extraction, gel preparation, and staining, resulting in poor reproducibility and limited portability across laboratories; results were often compared visually or via dendrograms, hindering global data sharing.^[17]^[18] Pulsed-field gel electrophoresis (PFGE), developed in the 1980s and standardized for surveillance programs like PulseNet in the 1990s, offered higher resolution for short-term epidemiology by digesting bacterial DNA with rare-cutting restriction enzymes and separating large fragments (up to 1 Mb) using alternating electric fields. This method excelled in outbreak investigations, such as distinguishing Salmonella or Escherichia coli strains, with discriminatory power often exceeding that of MLST (e.g., Simpson's index of 0.999 for Pseudomonas aeruginosa isolates). 
Despite its "gold standard" status for local tracking, PFGE was labor-intensive, requiring 2–3 days per analysis, and non-portable due to protocol variations and the need for image-based pattern matching, which complicated international comparisons and failed to reveal broader evolutionary relationships.^[19]^[16]^[18] Serotyping and phage typing represented simpler phenotypic approaches, with serotyping using antisera to detect surface antigens (e.g., O and H antigens in Salmonella) and phage typing assessing susceptibility to specific bacteriophages for strain differentiation in species like Staphylococcus aureus. These methods were quick and inexpensive but limited to strains that express the relevant antigens or receptors, often leaving 10–15% of strains untypeable and prone to cross-reactivity or phase variation, reducing discriminatory power for clonal analysis. Reproducibility was low due to environmental factors and labor-intensive antiserum or phage propagation, restricting their utility beyond initial grouping.^[16]^[18] MLST addressed these shortcomings by sequencing short fragments of multiple housekeeping genes to assign unambiguous alleles, enabling the creation of portable, digital sequence types that can be compared directly through centralized global databases. Unlike the image-dependent or gel-variable outputs of MLEE and PFGE, or the antigen-limited scope of serotyping and phage typing, MLST's sequence data ensure high reproducibility (approaching 100%) and facilitate both local outbreak resolution and long-term evolutionary tracking, as demonstrated in its application to N. meningitidis with potentially billions of possible types from seven loci (e.g., over 20 billion assuming 30 alleles each).^[2] This allelic profiling approach, while more costly, revolutionized epidemiology by allowing electronic data exchange without loss of resolution.^[17]^[19]^[16]
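Discriminatory power figures such as the Simpson's index quoted above for PFGE can be computed from the type assigned to each isolate. The sketch below uses the Hunter–Gaston form of Simpson's index of diversity, which is common in typing studies; whether the cited value was computed with exactly this form is an assumption, and the isolate labels are invented.

```python
from collections import Counter
from typing import Hashable, Iterable


def simpsons_index_of_diversity(types: Iterable[Hashable]) -> float:
    """Probability that two isolates drawn without replacement differ in type.

    D = 1 - sum(n_j * (n_j - 1)) / (N * (N - 1)), where n_j is the number of
    isolates of type j and N is the total number of isolates.
    """
    counts = Counter(types)
    total = sum(counts.values())
    if total < 2:
        raise ValueError("at least two isolates are required")
    same_type_pairs = sum(n * (n - 1) for n in counts.values())
    return 1.0 - same_type_pairs / (total * (total - 1))


isolates = ["ST8", "ST8", "ST5", "ST30"]  # invented example labels
print(round(simpsons_index_of_diversity(isolates), 4))  # 0.8333
```

A value near 1 means almost every pair of isolates is separated by the method; a value of 0 means every isolate received the same type.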

Whole-Genome Sequencing Approaches

Whole-genome sequencing (WGS) approaches have largely supplanted or complemented multilocus sequence typing (MLST) in bacterial pathogen surveillance by providing genome-wide resolution through single nucleotide polymorphism (SNP) analysis. Traditional MLST relies on sequence variation at just seven housekeeping gene loci, which often lacks the discriminatory power to distinguish closely related strains in outbreak investigations. In contrast, SNP typing from WGS identifies thousands of variants across the entire bacterial genome, enabling finer-scale differentiation of isolates that share identical MLST profiles. For instance, in analyses of Salmonella enterica, WGS-based SNP typing resolved 249 unique profiles among outbreak strains, far exceeding the resolution of standard MLST.^[20] A key WGS-based extension of MLST is core genome MLST (cgMLST), which expands the scheme to encompass 1,000–3,000 conserved loci present in the core genome of a bacterial species or population, rather than the limited seven loci of classical MLST. This approach maintains the gene-by-gene allelic numbering system of MLST while achieving higher resolution for epidemiological clustering. For example, the PulseNet network employs a cgMLST scheme with 2,513 loci for Escherichia coli surveillance, facilitating the detection of foodborne pathogen clusters with greater precision than traditional methods.^[21]^[22] WGS methods, including SNP typing and cgMLST, offer advantages over MLST by detecting recombination events and variations in accessory genes, which are absent from MLST's focus on core housekeeping genes alone. These capabilities allow for comprehensive insights into bacterial evolution, virulence factors, and antimicrobial resistance profiles that MLST cannot capture. 
However, WGS requires more computational resources than MLST, typically 8-64 GB of RAM and multiple processing cores (4-16) for alignment and variant calling of individual bacterial isolates, with higher demands for large datasets or assemblies.^[23]^[24] Interoperability between MLST and WGS approaches is ensured by designing cgMLST schemes to include traditional MLST loci as a subset, allowing backward compatibility and integration of historical data into modern genomic databases. This hierarchical structure supports scalable analysis, where MLST provides a portable baseline for global comparisons, while cgMLST or full WGS refines resolution for local outbreaks.^[21]
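The backward-compatibility idea, with classical loci as a subset of a larger scheme, can be sketched as a projection of a wide allelic profile onto the seven classical loci. The locus names `core_0001` and `core_0002` and all allele numbers are hypothetical, and real cgMLST schemes may name their loci differently from the classical scheme.

```python
from typing import Dict, Sequence, Tuple

# Classical loci of the original N. meningitidis scheme, as listed earlier.
MLST_LOCI = ("abcZ", "adk", "aroE", "fumC", "gdh", "pdhC", "pgm")


def project_to_mlst(
    cg_profile: Dict[str, int], mlst_loci: Sequence[str] = MLST_LOCI
) -> Tuple[int, ...]:
    """Extract the classical allelic profile from a wider gene-by-gene profile."""
    missing = [locus for locus in mlst_loci if locus not in cg_profile]
    if missing:
        raise KeyError(f"profile lacks classical loci: {missing}")
    return tuple(cg_profile[locus] for locus in mlst_loci)


# Two invented isolates that agree at the classical loci but not elsewhere.
isolate_a = {
    "abcZ": 2, "adk": 3, "aroE": 4, "fumC": 3, "gdh": 8, "pdhC": 4, "pgm": 6,
    "core_0001": 12, "core_0002": 5,
}
isolate_b = {**isolate_a, "core_0002": 9}

same_mlst = project_to_mlst(isolate_a) == project_to_mlst(isolate_b)
same_cgmlst = isolate_a == isolate_b
print(same_mlst, same_cgmlst)  # True False
```

The example shows the trade-off described in this section: the projected profiles remain comparable with historical MLST data, while the additional loci separate isolates that classical MLST cannot.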

Applications

Bacterial Pathogen Surveillance

Multilocus sequence typing (MLST) plays a pivotal role in bacterial pathogen surveillance by enabling the assignment of sequence types (STs) to isolates, which facilitates the tracing of clonal complexes responsible for outbreaks and the monitoring of population dynamics. In epidemiology, MLST identifies related strains within clonal complexes, allowing public health authorities to track transmission pathways and implement targeted interventions. For instance, the clonal complex 8 (CC8) in methicillin-resistant Staphylococcus aureus (MRSA) has been instrumental in delineating the global spread of hospital- and community-acquired strains, with ST8 variants like USA300 emerging as dominant pandemic lineages.^[25]^[26] Specific examples highlight MLST's utility in linking bacterial strains to sources and disease outcomes. In Campylobacter jejuni, the ST-21 complex predominates in human infections and is strongly associated with poultry reservoirs, aiding source attribution in foodborne outbreaks. For Neisseria meningitidis, the ST-11 complex defines hypervirulent lineages responsible for epidemic meningococcal disease, enabling rapid identification of outbreak strains across regions. The ST-398 lineage of Staphylococcus aureus is a livestock-associated MRSA clone that has zoonotically transmitted to humans, particularly in agricultural settings, underscoring MLST's value in one-health surveillance. In Streptococcus pyogenes, ST-28 is frequently linked to invasive group A streptococcal infections, such as necrotizing fasciitis, helping to monitor shifts in virulence and antimicrobial resistance. Similarly, Cronobacter sakazakii ST-4 is a stable clone implicated in severe neonatal meningitis cases, often traced to contaminated powdered infant formula.^[27]^[28]^[29]^[30]^[31]^[32]^[33]^[34]^[35] MLST's integration into public health surveillance networks enhances real-time outbreak detection and response. 
Through platforms like PulseNet, MLST-derived STs support standardized subtyping of foodborne pathogens, facilitating interstate and international cluster investigations. Recent advancements as of 2025 have validated core genome MLST (cgMLST) extensions for Shiga toxin-producing Escherichia coli (STEC) cluster confirmation, improving the accuracy of linking clinical cases to environmental sources in national surveillance systems.^[22] To infer evolutionary relationships and population structure from MLST data, the eBURST algorithm groups STs into clonal complexes based on shared alleles, predicting ancestral founders and recent descendants to reconstruct transmission networks and detect emerging variants. This approach has been widely adopted for analyzing bacterial diversity in surveillance datasets, providing insights into recombination events and long-term epidemiology.^[36]
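The grouping and founder prediction attributed to eBURST can be approximated in a few lines. This is a simplified sketch, not the published algorithm: it links STs that are single-locus variants of one another, treats connected components as groups, and picks the ST with the most single-locus variants as the predicted founder, with ties broken arbitrarily by lowest ST number. The profiles are invented.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Profile = Tuple[int, ...]


def slv_links(profiles: Dict[int, Profile]) -> Dict[int, Set[int]]:
    """Map each ST to the STs that differ from it at exactly one locus."""
    links: Dict[int, Set[int]] = {st: set() for st in profiles}
    for a, b in combinations(profiles, 2):
        differences = sum(x != y for x, y in zip(profiles[a], profiles[b]))
        if differences == 1:
            links[a].add(b)
            links[b].add(a)
    return links


def clonal_groups(links: Dict[int, Set[int]]) -> List[Set[int]]:
    """Connected components of the single-locus-variant graph."""
    seen: Set[int] = set()
    groups: List[Set[int]] = []
    for start in links:
        if start in seen:
            continue
        group: Set[int] = set()
        stack = [start]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(links[current] - group)
        seen |= group
        groups.append(group)
    return groups


def predicted_founder(group: Set[int], links: Dict[int, Set[int]]) -> int:
    """ST with the most single-locus variants; ties go to the lowest ST."""
    return max(sorted(group), key=lambda st: len(links[st]))


profiles = {
    1: (1, 1, 1, 1, 1, 1, 1),
    2: (1, 1, 1, 1, 1, 1, 2),
    3: (1, 1, 1, 1, 1, 3, 1),
    4: (1, 1, 1, 1, 4, 1, 1),
    5: (9, 9, 9, 9, 9, 9, 9),
}
links = slv_links(profiles)
groups = clonal_groups(links)
for group in groups:
    print(sorted(group), "founder:", predicted_founder(group, links))
```

Here ST 1 is linked to three variants and is chosen as founder of the first group, while ST 5 forms a singleton.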

Fungal and Eukaryotic Pathogens

Multilocus sequence typing (MLST) has been adapted for fungal pathogens, with a prominent example being Candida albicans, where a standardized scheme utilizing seven housekeeping loci, such as AAT1a (aspartate aminotransferase), ACC1 (acetyl-CoA carboxylase), and ZWF1b (glucose-6-phosphate dehydrogenase), enables the identification of sequence types (STs) linked to clinical outcomes.^[37] This approach has facilitated tracking of azole resistance, as certain STs, like those in Clade 1, correlate with reduced susceptibility to fluconazole and other antifungals in invasive candidiasis cases.^[38] Similarly, for Cryptococcus neoformans, an MLST scheme based on seven loci (CAP59, GPD1, LAC1, PLB1, SOD1, URA5, and the IGS1 region of rDNA) has been employed to link clinical isolates to environmental sources and investigate outbreaks, revealing clonal expansions in immunocompromised patients.^[39] These fungal applications underscore MLST's utility in delineating population structure amid antifungal selective pressures.^[40]

Extensions of MLST to eukaryotic pathogens beyond fungi include protozoan parasites like Plasmodium falciparum, where multilocus schemes incorporating 10 neutral loci have assessed strain diversity and transmission dynamics in malaria-endemic regions, highlighting recombination events that influence antigenic variation.^[41] For Trypanosoma cruzi, the causative agent of Chagas disease, optimized MLST using six to eight housekeeping genes (e.g., dihydrofolate reductase-thymidylate synthase, glucose-6-phosphate isomerase) has resolved discrete typing units (DTUs) and informed epidemiological surveillance, aiding in tracing sylvatic-to-domestic transmission cycles across Latin America.^[42] These schemes reveal higher genetic diversity in eukaryotic pathogens compared to many bacteria, reflecting sexual reproduction and larger effective population sizes.^[43]

Adaptations for fungi and eukaryotes address unique genomic challenges, such as larger genome sizes and higher intron densities, necessitating locus selection that spans exon-intron boundaries to ensure consistent allele calling across strains.^[44] Unlike bacterial MLST, which benefits from compact prokaryotic genomes, eukaryotic schemes often require bioinformatics adjustments for splicing variants, leading to fewer standardized protocols; for instance, Candida schemes were refined in the early 2010s to incorporate whole-genome data for better resolution.^[45] Recent developments include the application of an established five-locus MLST scheme for Scedosporium species in a 2025 environmental survey in Lebanon, utilizing actin, calmodulin, β-tubulin, RNA polymerase II (RPB2), and manganese superoxide dismutase to assess genetic diversity and potential links to clinical isolates, with implications for outbreak detection in vulnerable populations such as cystic fibrosis patients.^[46] This work illustrates ongoing efforts to extend MLST to emerging fungal threats.

Advantages and Limitations

Advantages

One key advantage of multilocus sequence typing (MLST) is its portability, as nucleotide sequence data can be electronically exchanged between laboratories worldwide without the need to transport physical samples, facilitating the creation of centralized international databases for global surveillance.^[1] This enables seamless integration of typing results across diverse research groups and public health organizations, supporting rapid identification of emerging clones in real-time epidemiology.^[47] MLST also offers high reproducibility due to its reliance on unambiguous allele assignments based on exact nucleotide sequences, resulting in standardized integer-based sequence types (STs) that eliminate subjective interpretations common in methods like pulsed-field gel electrophoresis (PFGE).^[48] For instance, each unique allelic profile is assigned a distinct ST number, ensuring consistent results regardless of the performing laboratory or equipment used.^[1] The method's scalability has improved dramatically with the adoption of next-generation sequencing (NGS), allowing high-throughput processing of hundreds to thousands of isolates efficiently, with costs now below those of traditional Sanger-based MLST.^[49] Furthermore, MLST enables robust evolutionary inferences by analyzing ST clustering to reveal population structures, recombination rates, and clonal relationships within microbial populations.^[50] This allelic variation in housekeeping genes provides insights into genetic diversity and evolutionary dynamics, such as the balance between mutation and recombination, which are critical for understanding pathogen adaptation and spread.^[51]

Limitations

One major limitation of classical multilocus sequence typing (MLST) is its restricted discriminatory power, as it typically examines sequences from only seven housekeeping gene loci, which represent a mere 0.1–0.2% of a bacterial genome. This limited sampling often fails to detect fine-scale genetic differences among closely related strains, resulting in identical sequence types (STs) assigned to isolates involved in distinct outbreaks. For instance, in Staphylococcus aureus, multiple strains such as USA300 variants share the same ST8 despite exhibiting substantial genomic diversity and epidemiological independence.^[52] Recombination events further exacerbate this issue by homogenizing alleles across loci, potentially masking phylogenetic relationships and leading to inaccurate inferences of strain evolution. In species with high recombination rates, such as Neisseria gonorrhoeae or Streptococcus pneumoniae, the exchange of genetic material can create mosaic alleles that confound MLST-based clustering, as the method's reliance on a small number of loci provides insufficient sequence data to resolve true genome-wide phylogeny.^[53]^[52] The traditional implementation of MLST, which depends on Sanger sequencing for allele identification, is also constrained by high costs and labor-intensive processes, making it impractical for large-scale surveillance. 
Sequencing costs for a single isolate can approximate $50, rendering the approach prohibitive for routine application across thousands of samples without specialized infrastructure.^[54] Additionally, the method requires access to curated databases for allele assignment; novel variants not previously cataloged cannot be immediately typed and must undergo manual submission and verification, delaying analysis and introducing bottlenecks in real-time epidemiology.^[55] Locus selection in MLST introduces inherent bias, as the focus on conserved housekeeping genes—chosen for their stability under neutral evolution—overlooks variation in genes associated with adaptive traits like virulence factors or antibiotic resistance determinants. This neutral bias limits MLST's utility for studying pathogen-host interactions or emergence of resistant clones, where accessory genome elements play critical roles.^[56] Finally, classical MLST proves inadequate for hypervariable species, including highly recombinant bacteria and viruses, where rapid mutation and gene exchange outpace the method's capacity to track diversity without supplementary approaches. In such cases, the fixed set of loci fails to capture the extensive genomic plasticity, rendering ST assignments unreliable for outbreak delineation or evolutionary reconstruction.^[56]
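The "0.1–0.2% of a bacterial genome" figure above can be checked with a short calculation. The sketch assumes seven fragments of about 450 bp, as described in the methodology, and uses round genome sizes of 2.2 Mb and 2.8 Mb purely as illustrative assumptions; larger genomes give a proportionally smaller fraction.

```python
def percent_of_genome(genome_bp: int, loci: int = 7, fragment_bp: int = 450) -> float:
    """Percentage of a genome covered by the sequenced MLST fragments."""
    return 100.0 * loci * fragment_bp / genome_bp


# Illustrative round genome sizes (assumed, not taken from the text).
for genome_bp in (2_200_000, 2_800_000):
    print(f"{genome_bp:,} bp genome: {percent_of_genome(genome_bp):.3f}%")
```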

Extensions and Recent Developments

Core Genome MLST

Core genome multilocus sequence typing (cgMLST) is an advanced genotyping method that expands on traditional multilocus sequence typing by analyzing a larger set of conserved genes from the core genome, typically ranging from 1,000 to 4,000 loci, which represent the essential genes present in nearly all strains of a bacterial species while excluding accessory or variable genomic elements.^[57] This approach leverages whole-genome sequencing (WGS) data to provide higher resolution for strain differentiation compared to classical MLST, which relies on only 7–10 housekeeping genes, effectively treating classical schemes as a subset of the broader cgMLST framework.^[58] By focusing on the core genome, cgMLST ensures portability and standardization across laboratories, as allelic profiles can be consistently compared regardless of sequencing platform.^[59]

The development of cgMLST emerged around 2012 as whole-genome sequencing became more accessible, with early schemes designed to address the limitations of traditional typing in capturing fine-scale genomic variation for epidemiological purposes.^[21] A notable example is the cgMLST scheme for Listeria monocytogenes implemented in the PulseNet surveillance network, which utilized 1,748 core loci by 2013 to enable real-time tracking of foodborne outbreaks.^[60] This Listeria scheme was built on comparative genomics of diverse isolates, identifying loci present in at least 95% of strains to balance coverage and discriminatory power, and has since been refined for broader application.^[61]

In the cgMLST process, WGS data from bacterial isolates are assembled into contigs, followed by gene-by-gene allele calling, where each core locus is compared against a reference database of known allele sequences to assign the closest matching variant based on exact or threshold-based similarity.^[57] The resulting allelic profile, a numerical string representing variants at each locus, is then used to calculate genetic relatedness between strains, often employing Hamming distance, which quantifies the number of differing alleles as a simple metric of evolutionary divergence.^[62] This allele-calling step can be performed on raw reads or assemblies using standardized software, ensuring reproducibility and minimizing assembly errors that could affect accuracy.^[63]

Compared to classical MLST, cgMLST offers superior resolution for outbreak detection by detecting subtle genetic differences that distinguish closely related strains within hours of sequencing, as demonstrated in a 2025 validation study for Shiga toxin-producing Escherichia coli (STEC) surveillance across European networks, where cgMLST clusters aligned precisely with epidemiological links missed by traditional methods.^[64] This enhanced discrimination has led to its standardization for over 10 bacterial pathogens, including Salmonella enterica, Staphylococcus aureus, and Klebsiella pneumoniae, facilitating global surveillance and reducing false-negative outbreak identifications.^[58]
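The Hamming distance mentioned above can be sketched for allelic profiles stored as locus-to-allele mappings. Skipping loci that are uncalled in either isolate is an assumption made here for illustration; tools differ in how they treat missing data, and the locus names and allele numbers are invented.

```python
from typing import Dict, Optional

AllelicProfile = Dict[str, Optional[int]]  # None marks an uncalled locus


def hamming_distance(a: AllelicProfile, b: AllelicProfile) -> int:
    """Number of loci, called in both profiles, with different allele numbers."""
    shared = a.keys() & b.keys()
    return sum(
        1
        for locus in shared
        if a[locus] is not None and b[locus] is not None and a[locus] != b[locus]
    )


isolate_1: AllelicProfile = {"L1": 1, "L2": 4, "L3": None, "L4": 7}
isolate_2: AllelicProfile = {"L1": 1, "L2": 5, "L3": 2, "L4": 8}
print(hamming_distance(isolate_1, isolate_2))  # 2 (L3 is ignored)
```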

Whole Genome MLST

Whole genome multilocus sequence typing (wgMLST) extends traditional MLST by performing allelic profiling across the entire bacterial pan-genome, encompassing all protein-coding genes in a reference set, including both core and accessory elements, to achieve maximal strain discrimination. The approach converts whole-genome sequencing data into discrete alleles at thousands of loci, often 3,000 to 5,000 or more, enabling precise comparisons of genetic variation without relying solely on single-nucleotide polymorphisms (SNPs). Unlike core genome MLST (cgMLST), which limits analysis to conserved genes, wgMLST incorporates variable accessory genes to capture the full genomic diversity, making it particularly suited to resolving closely related isolates in outbreak investigations.^[65]^[66]

wgMLST emerged in the mid-2010s as whole-genome sequencing became routine, with foundational proposals around 2015 building on earlier gene-by-gene typing concepts. A key advance came in 2016 with its application to pathogens such as Yersinia pestis, for which schemas were defined using nearly 4,000 loci derived from the open reading frames of reference strains. In 2022, researchers developed a wgMLST schema for Streptococcus pyogenes based on 208 complete genomes, identifying 3,044 target loci to support scalable high-resolution typing of this diverse pathogen. These schemas are iteratively refined using diverse isolate collections to ensure portability across laboratories.^[65]^[67]

In wgMLST analysis, genomes are scanned for matches to the predefined loci of the schema; each distinct variant at a locus is assigned a unique allele number, and the resulting profiles are used for distance-based clustering, for example with the Hamming distance. Tools such as chewBBACA support schema creation by identifying candidate loci through sequence similarity thresholds, perform allele calling on assemblies, and evaluate schema quality with metrics such as locus conservation. The process is more computationally demanding than traditional MLST because of the number of loci, and it often requires assembled genomes and significant processing time, though cloud-based platforms mitigate this for large datasets.^[68]^[69]

wgMLST enables forensic-level strain discrimination in challenging scenarios, such as typing Yersinia pestis from degraded environmental or clinical forensic samples, where its allele-based profiles are more portable across international databases than SNP-based results. It also bridges the portability of classical MLST and the resolution of SNP analysis, offering allele profiles that correlate strongly with SNP distances for epidemiological surveillance of pathogens such as Listeria monocytogenes.^[70]^[71]
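
The allele-numbering step can be sketched in a deliberately simplified form. The sketch assumes exact sequence matching only and represents each genome as a hypothetical dictionary from locus name to sequence; it is not how chewBBACA or any other production tool works, since those also handle gene prediction, similarity thresholds, and quality checks. Because wgMLST includes accessory loci that may be absent from a given isolate, the distance function here skips loci that are not called in both profiles, which is one common convention rather than the only one.

```python
def call_allele(allele_db, locus, sequence):
    """Return the allele number for `sequence` at `locus`, registering it if novel."""
    known = allele_db.setdefault(locus, {})
    if sequence not in known:
        known[sequence] = len(known) + 1  # next unused integer identifier
    return known[sequence]


def build_profile(allele_db, loci, genome):
    """Map each schema locus to an allele number, or None when the locus is absent."""
    return [
        call_allele(allele_db, locus, genome[locus]) if locus in genome else None
        for locus in loci
    ]


def distance_ignoring_missing(profile_a, profile_b):
    """Hamming distance restricted to loci called in both profiles."""
    return sum(
        a != b
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None
    )


# Hypothetical schema with two core and two accessory loci.
loci = ["core1", "core2", "acc1", "acc2"]
allele_db = {}

# Toy genomes: locus name -> sequence found in the assembly (invented data).
genome_x = {"core1": "ATGAAA", "core2": "ATGCCC", "acc1": "ATGTTT"}
genome_y = {"core1": "ATGAAA", "core2": "ATGCCG", "acc1": "ATGTTT", "acc2": "ATGGGG"}

profile_x = build_profile(allele_db, loci, genome_x)
profile_y = build_profile(allele_db, loci, genome_y)
print(profile_x, profile_y, distance_ignoring_missing(profile_x, profile_y))
```

Here the two genomes share alleles at `core1` and `acc1`, differ at `core2`, and cannot be compared at `acc2` because the locus is absent from the first genome.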

Databases and Resources

PubMLST Database

The PubMLST database, established in 2003 at the University of Oxford as an extension of the multilocus sequence typing (MLST) efforts begun in 1998, serves as the primary global repository for MLST and related gene-by-gene typing data for bacterial and fungal pathogens. It hosts over 100 MLST schemes covering more than 130 microbial species and genera, encompassing classical MLST, ribosomal MLST (rMLST), core genome MLST (cgMLST), and whole genome MLST (wgMLST) approaches. The database catalogs millions of alleles across these schemes, alongside millions of sequence type (ST) profiles and isolate records, enabling standardized comparisons of microbial population structures worldwide.^[9]^[72]

Key features include curated collections of allele sequences, in which each distinct variant of a housekeeping or core gene is assigned a unique integer identifier, and ST profiles that combine these alleles into a portable genotype for each isolate. Isolate metadata are integrated, capturing provenance details such as geographic location (including GPS coordinates), isolation source, and clinical or epidemiological phenotypes such as antimicrobial resistance patterns or virulence factors. The database runs on the Bacterial Isolate Genome Sequence Database (BIGSdb) platform, which provides tools for data submission via web interfaces or RESTful APIs, advanced querying by allele, ST, or metadata filters, and automated allele calling from raw sequences. As of 2025, registration is required to access allele, profile, and isolate data added after December 31, 2024.^[9]^[73]^[74]^[8]

PubMLST offers free, open-access downloads of entire schemes, allele libraries, and isolate datasets in formats suitable for phylogenetic or epidemiological analyses, with no restrictions on usage for research or public health applications. Its BIGSdb infrastructure supports extensible plugins for in-depth analyses, such as constructing phylogenetic trees, goeBURST clustering for inferring evolutionary relationships, or visualizing STs on minimum spanning trees with tools such as GrapeTree. Maintenance is community-driven, involving approximately 125 volunteer curators worldwide who validate submissions and update schemes to maintain data quality and relevance. The platform also integrates with the National Center for Biotechnology Information (NCBI) to facilitate the incorporation of whole-genome sequencing data from public repositories.^[9]^[72]^[74]
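
As a rough illustration of programmatic access, the sketch below builds a request for a single ST profile. The base URL, the route pattern (`/db/{database}/schemes/{scheme_id}/profiles/{st}`), and the example database name are assumptions based on the publicly documented BIGSdb REST interface and should be checked against the current API documentation before use. The helper names are hypothetical, and the sketch does not handle authentication, which may be needed for records subject to the registration requirement.

```python
import json
import urllib.request

# Assumed API root; verify against the current PubMLST/BIGSdb documentation.
BASE_URL = "https://rest.pubmlst.org"


def profile_url(database, scheme_id, st):
    """Build the (assumed) REST route for one sequence type profile."""
    return f"{BASE_URL}/db/{database}/schemes/{scheme_id}/profiles/{st}"


def fetch_profile(database, scheme_id, st, timeout=30):
    """Download a profile record as a dictionary (performs a network request)."""
    with urllib.request.urlopen(profile_url(database, scheme_id, st), timeout=timeout) as response:
        return json.load(response)


# Example: the URL for ST 11 in scheme 1 of an assumed Neisseria definitions database.
example_url = profile_url("pubmlst_neisseria_seqdef", 1, 11)
print(example_url)
```

Only the URL is constructed when the block is run as written; calling `fetch_profile` with the same arguments would issue the actual request, and the structure of the returned record depends on the live service.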

Other Specialized Databases

In addition to the central PubMLST resource, several specialized databases focus on pathogen-specific or regionally tailored multilocus sequence typing (MLST) schemes to support targeted surveillance and research. For instance, the TB Portals database, maintained by the National Institute of Allergy and Infectious Diseases, integrates genomic and clinical data from over 28,000 Mycobacterium tuberculosis cases worldwide, facilitating sequence-based typing analyses that complement traditional MLST approaches for tuberculosis epidemiology.^[75] Similarly, the Centers for Disease Control and Prevention (CDC) employs a whole-genome MLST (wgMLST) scheme for M. tuberculosis that uses 2,672 genetic loci to enhance strain discrimination in outbreak investigations.^[76] EnteroBase serves as a key repository for Enterobacteriaceae, including Salmonella and Escherichia coli, where it applies core-genome MLST (cgMLST) and hierarchical clustering to analyze more than 1.1 million bacterial genomes for epidemiological tracking and antimicrobial resistance prediction.^[77]^[78] EnteroBase's Salmonella schemes are built on a wgMLST scheme comprising 21,065 loci, enabling finer resolution of serovar-specific transmissions in foodborne outbreaks.^[79]

Regional networks further specialize MLST applications for surveillance. PulseNet International, a global collaboration of public health laboratories, standardizes cgMLST schemes for foodborne pathogens such as Shiga toxin-producing E. coli, using whole-genome sequencing data to detect multistate outbreaks and trace contamination sources across borders.^[80] In Europe, The European Surveillance System (TESSy) of the European Centre for Disease Prevention and Control (ECDC) collects and harmonizes molecular typing data, including MLST profiles, from member states to monitor antimicrobial resistance and pathogen spread in real time.^[81]^[82]

Emerging databases extend MLST to non-bacterial pathogens. FungiDB, part of the VEuPathDB bioinformatics platform, integrates genomic datasets for fungal species such as Candida albicans, supporting MLST schemes that sequence seven housekeeping genes to delineate clonal complexes and track antifungal resistance in clinical isolates.^[83]^[37] For viruses, adaptations of MLST principles appear in HIV-1 surveillance, where pol gene sequencing is used to profile subtypes and drug resistance mutations, though not in a traditional allelic database format.^[84] These specialized databases often link to PubMLST via shared allelic profiles and APIs, promoting data interoperability, but harmonization remains challenging because of differing schemes, inconsistent nomenclature, and limited resources for curation.^[85]^[21]^[86]