
SAMtools

SAMtools is a suite of command-line utilities written in C for manipulating and analyzing high-throughput sequencing data, primarily focused on processing alignments stored in the SAM, BAM, and CRAM formats. It enables essential operations such as viewing, sorting, indexing, merging, and generating per-position summaries from read alignments, supporting efficient handling of large datasets from next-generation sequencing platforms. Originally developed to facilitate post-processing of alignments for projects like the 1000 Genomes Project, SAMtools has become a foundational tool in bioinformatics pipelines worldwide.

Introduced in 2009 by Heng Li and colleagues from the Broad Institute, Wellcome Trust Sanger Institute, and other institutions as part of the 1000 Genomes Project Data Processing Subgroup, SAMtools was released alongside the specification of the SAM format, a flexible, tab-delimited text format for representing sequence alignments against reference genomes. The software addressed the need for standardized, efficient tools to manage the growing volume of sequencing data, with its binary BAM counterpart providing compact storage and rapid random-access capabilities. Over time, the project evolved: in 2010, variant calling components were separated into BCFtools, and by 2014, the codebase was restructured into three coordinated repositories, HTSlib (a C library for high-throughput sequencing data I/O), SAMtools (for alignment manipulation), and BCFtools (for variant data), to improve modularity and maintenance. This restructuring marked the 1.0 release of SAMtools, which doubled the codebase size and introduced support for the CRAM format for further compression.

Key features of SAMtools include multi-threading for sorting and BAM writing (added in version 0.1.19), on-the-fly indexing during file writing (version 1.10), and utilities like samtools view for format conversion and subsetting, samtools sort for coordinate or name ordering, samtools index for enabling fast queries, and samtools stats for generating alignment summaries.
It supports alignments from short reads (e.g., Illumina) to long reads (up to 128 Mbp from PacBio or Nanopore), making it versatile across sequencing technologies and species, including vertebrates, plants, and microbes.

SAMtools has had a profound impact on research, with its original 2009 publication cited over 56,000 times and the software installed more than seven million times via package managers like Bioconda. Actively maintained under the MIT license on GitHub, it features extensive testing (>700 unit tests), continuous integration across platforms, and ongoing enhancements for handling massive datasets and 64-bit integer support for large genomes, with the latest stable release series (version 1.22) in 2025. As of 2025, SAMtools and its sister projects continue to underpin a wide range of major genomic analyses, ensuring compatibility with evolving standards in the field.

Introduction

Overview

SAMtools is a widely used software suite in bioinformatics for processing and analyzing high-throughput sequencing data, particularly alignments between sequencing reads and reference genomes. It provides a collection of command-line utilities that enable efficient manipulation of files in the Sequence Alignment/Map (SAM), Binary Alignment/Map (BAM), and CRAM formats, supporting tasks such as viewing alignments, sorting, merging, indexing, and generating consensus sequences. Developed to address the challenges of handling massive datasets from next-generation sequencing technologies, SAMtools facilitates downstream analyses like variant calling and structural variant detection by ensuring fast and reliable data access.

The core of SAMtools relies on the HTSlib library, which implements low-level input/output operations for compressed and indexed genomic files, allowing for random access to specific regions without loading entire datasets into memory. This design choice enhances scalability, making it suitable for terabyte-scale files common in modern projects. Originally introduced as part of the 1000 Genomes Project, SAMtools has become a foundational tool in sequencing pipelines, integrated with aligners like BWA and variant callers like GATK.

Over the years, the project has expanded to include complementary tools like BCFtools for handling variant data in binary call format (BCF), reflecting its evolution into a comprehensive toolkit for genomic data. Its open-source nature and active maintenance by a global developer community ensure compatibility with emerging sequencing technologies, including long-read platforms. SAMtools remains essential for reproducible research, with its utilities cited in thousands of studies for enabling scalable bioinformatics workflows.

Key Components

SAMtools is a modular software suite centered on the HTSlib library and a collection of command-line tools for manipulating high-throughput sequencing alignments in SAM, BAM, and CRAM formats. HTSlib, a C library developed as part of the project, provides low-level input/output operations, including parsing, compression, and indexing support, enabling efficient handling of large datasets across local and remote files. This library forms the backbone, allowing tools to perform operations like format conversion and region-based querying without redundant code. The suite's primary tools are subcommands under the samtools executable, categorized into file manipulation, processing, and analysis functions. These utilities support Unix-style piping for seamless integration into bioinformatics workflows and leverage multi-threading for performance on modern hardware. Originally introduced to address the need for standardized handling, the components emphasize speed and compatibility with evolving sequencing technologies.
Component | Description
--- | ---
view | Converts between SAM, BAM, and CRAM formats; filters alignments by region, flags, or quality. FASTA/FASTQ extraction from alignments is handled by the separate fasta and fastq subcommands. Essential for initial inspection and subsetting of data.
sort | Sorts alignments by coordinate or query name, producing coordinate-sorted BAM files required for most downstream analyses; supports temporary file management for large inputs.
index | Generates .bai or .csi index files for BAM, or .crai for CRAM, enabling fast random access to genomic regions without full file loading; FASTA indexing is available via the separate faidx subcommand.
mpileup | Generates a textual pileup of aligned bases at each genomic position; for BCF or VCF output suitable for variant calling, use bcftools mpileup. Includes options for base quality adjustments.
merge | Combines multiple sorted alignment files into one, preserving sort order; useful for aggregating results from multiple sequencing lanes or multi-sample experiments.
markdup | Identifies and flags duplicates in sorted alignments based on mapping position and orientation; outputs updated BAM with duplicate metrics.
stats | Generates comprehensive statistics on alignments, including total reads, mapping rates, insert size distributions, and per-chromosome coverage; aids in quality assessment.
Additional specialized tools, such as fixmate for correcting mate-pair information and calmd for recalculating the MD and NM tags (optionally with base alignment quality adjustment), extend functionality for targeted tasks like error correction and amplicon analysis. Together, these components facilitate the full spectrum of alignment processing, from import to preparation for analysis tools.

History

Development Origins

SAMtools originated in late 2008 as part of efforts to standardize the representation and processing of high-throughput sequencing alignments for the 1000 Genomes Project, a large-scale initiative to sequence human genomes and identify genetic variants. The project required a flexible format to accommodate diverse sequencing technologies, such as Illumina/Solexa, AB/SOLiD, and Roche/454, and various alignment tools, enabling efficient downstream analyses like variant detection and genotype calling. Prior to this, alignments were often stored in proprietary or tool-specific formats, hindering interoperability and scalability for projects handling up to 10^11 base pairs of data.

The SAM format was conceptualized and named on October 21, 2008, by Heng Li, a key developer from the Sanger Institute, in collaboration with members of the 1000 Genomes Project Data Processing Subgroup, including Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Gonçalo Abecasis, and Richard Durbin. Development emphasized streamability for processing large files without loading them fully into memory, leading on October 22, 2008, to the adoption of fixed columns, optional tags, extended CIGAR strings for representing insertions and deletions, and a binning index for quick lookups. By November 3, 2008, a dual text/binary format was established, with the binary BAM format introduced on November 7 to support compression via BGZF (Blocked GNU Zip Format) for reduced storage and faster access. The final draft of the specification was circulated on December 8, 2008.

SAMtools, the accompanying software suite for manipulating SAM and BAM files, was first publicly released on December 22, 2008, initially supporting file viewing, sorting, indexing, and basic variant calling. This release addressed the immediate needs of the 1000 Genomes Project for modular tools that could interface with alignment software and facilitate genomic analyses.
The toolkit's design prioritized efficiency and extensibility, allowing integration with emerging sequencing pipelines while maintaining compatibility with the evolving SAM specification. Early development was driven by practical challenges in handling terabyte-scale datasets, with contributions from the broader bioinformatics community shaping its core utilities.

Major Releases

SAMtools' initial release, version 0.1.1, occurred on December 22, 2008, introducing core utilities for converting between the text SAM format and the binary BAM format, sorting alignments, indexing files, and generating pileup data for variant detection. This laid the groundwork for efficient processing of high-throughput sequencing alignments, with early versions emphasizing compact storage and basic manipulation to handle the growing volume of genomic data from projects like the 1000 Genomes Project.

The 0.x series evolved through over 20 releases up to version 0.1.20 in August 2014, incorporating incremental enhancements such as improved error handling and performance tweaks. A key advancement came in version 0.1.9 (October 2010), which restructured the integrated variant caller into the standalone BCFtools package for better modularity. Version 0.1.19 (March 2013) marked a performance milestone by adding multi-threading support for sorting and BAM writing, enabling faster processing on multi-core systems.

Version 1.0, released on August 15, 2014, represented a major architectural overhaul, splitting the monolithic package into three coordinated projects: HTSlib as the foundational C library for file I/O, BCFtools for variant calling and manipulation, and SAMtools dedicated to alignment-specific operations. This restructuring improved maintainability and interoperability, while introducing native support for the CRAM format, a reference-based compression scheme that reduces file sizes by up to 30% compared to BAM without sacrificing random access efficiency. Automatic detection of input/output formats (SAM, BAM, CRAM) was also added, simplifying user workflows.

Post-1.0 releases in the 1.x series prioritized performance, threading, and compatibility with modern sequencing paradigms. Version 1.10 (December 2019) introduced on-the-fly indexing during BAM and CRAM writing, eliminating separate indexing steps and accelerating throughput.
Enhanced multi-threading across read/write operations followed in version 1.11 (September 2020), alongside new commands like 'ampliconclip' and 'ampliconstats' tailored for amplicon-based sequencing, which clip primers and compute coverage metrics to support targeted resequencing applications.

Later releases have further refined these capabilities. Version 1.18 (July 2023) added minimizer-based sorting options and a '--duplicate-count' option in 'markdup' to track duplicates more precisely. Version 1.19 (December 2023) extended the 'coverage' command with '--plot-depth' for visualizing read depth distributions and introduced lexicographical name sorting in 'merge' and 'sort' for consistent handling of multi-sample data. Version 1.20 (April 2024) introduced '--max-depth' for 'bedcov' and support for multiple '-d' options in 'fastq'. Version 1.21 (September 2024) added region filtering to a further subcommand and a 'reset' command to remove auxiliary tags. Version 1.22 (May 2025) debuted the 'checksum' command for integrity verification of alignment files and shifted the default CRAM output to version 3.1, leveraging advanced codecs for better compression ratios while maintaining compatibility with tools supporting CRAM 3.0. The most recent update, version 1.22.1 (July 2025), primarily addresses bugs, including a use-after-free issue in 'mpileup -a' and buffer overflows in CRAM parsing, ensuring robustness for large-scale genomic analyses.

Supported Formats

SAM Format

The Sequence Alignment/Map (SAM) format is a TAB-delimited text-based standard for representing alignments of sequencing reads against reference sequences, designed to accommodate both short and long reads from diverse high-throughput sequencing platforms. Introduced in 2009 as part of the SAMtools suite, it addresses the need for a unified format amid varying alignment tools and sequencing technologies, such as Illumina and Roche/454, enabling efficient downstream analyses like variant calling and genotype calling. The format supports up to 128 megabase pairs per read and scales to datasets exceeding 100 gigabase pairs, as demonstrated in its adoption by the 1000 Genomes Project for processing large-scale alignments.

A SAM file consists of two main sections: an optional header and an alignment section. The header begins with lines prefixed by '@', providing metadata such as reference sequence dictionary entries (e.g., @SQ lines specifying sequence names and lengths), read group information (@RG for sample metadata like platform and library), program details (@PG for analysis steps), and comments (@CO). These header lines ensure reproducibility and facilitate coordinate sorting, with the specification recommending a specific order for tags to maintain compatibility. The alignment section follows, where each line represents a single read alignment or unmapped read, using 1-based coordinates for positions to align with biological conventions.

Each alignment line in the SAM format includes exactly 11 mandatory fields, separated by tabs, followed by zero or more optional fields. These mandatory fields are:
Field | Description | Example
--- | --- | ---
QNAME | Query name (read identifier), up to 254 characters | read_001
FLAG | Bitwise flag indicating read properties (e.g., paired, unmapped) | 0 (unpaired, mapped)
RNAME | Reference sequence name, or '*' if unmapped | chr1
POS | 1-based leftmost mapping position, or 0 if unmapped | 1000
MAPQ | Mapping quality (Phred-scaled probability of random placement), 255 for unavailable | 60
CIGAR | Concise Idiosyncratic Gapped Alignment Report string describing matches, insertions, deletions, etc. | 50M (50 matches)
RNEXT | Mate reference name for paired reads, or '*' if unavailable (called MRNM in early versions) | '=' (same as RNAME)
PNEXT | 1-based position of mate read (called MPOS in early versions) | 2000
TLEN | Observed template length (insert size), signed | 1000
SEQ | Query sequence as a string of ACGTN, or '*' if unavailable | AGCT...
QUAL | ASCII-encoded Phred quality scores for SEQ bases (+33 offset), or '*' if unavailable | !''*+...
The FLAG field is a 12-bit bitmask encoding read attributes, such as 0x0001 for paired-end reads, 0x0004 for unmapped reads, and 0x0400 for duplicate-marked reads, allowing tools to select or exclude alignments based on these properties. The CIGAR string uses extended operators like 'M' for alignment match/mismatch, 'I' for insertion to the reference, 'D' for deletion from the reference, 'N' for skipped reference regions (e.g., introns), and 'S' for soft-clipped bases outside the aligned region.

Optional fields appear after the mandatory ones in the format TAG:TYPE:VALUE, where TAG is a two-character code (e.g., 'NM' for edit distance), TYPE specifies the value format (e.g., 'i' for integer, 'Z' for string, 'f' for float), and VALUE holds the data. Common tags include AS for alignment score, MD for mismatch details, and RG for read group assignment, with over 100 standardized tags defined in the SAMtags specification to support advanced features like structural variant annotations. These optional fields enhance flexibility without mandating their presence, ensuring the format remains lightweight for basic use while extensible for complex analyses.

The SAM format's text-based design promotes human readability and interoperability across tools, but for efficiency, it pairs with the binary BAM equivalent, which compresses files (e.g., reducing a 112 Gbp dataset from 116 GB to under 30 GB) while preserving all information and supporting random access via indexing. The specification, initially version 1.0 released in 2013 and maintained by the Global Alliance for Genomics and Health (GA4GH), has been updated to version 1.6 as of November 2024, with ongoing refinements.
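The field layout, FLAG bits, CIGAR operators, and TAG:TYPE:VALUE convention described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for HTSlib or a full SAM parser: the names `SamRecord`, `decode_flag`, `reference_span`, and `parse_sam_line`, the flag labels, and the example read are all assumptions made for this sketch. The bit values and CIGAR operator set follow the SAM specification.

```python
import re
from typing import NamedTuple

# FLAG bit values from the SAM specification; the label strings are our own.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse", 0x40: "read1", 0x80: "read2",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = frozenset("MDN=X")  # operators that advance along the reference


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by an alignment's CIGAR string."""
    if cigar == "*":
        return 0
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into 11 mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    tags = {}
    for item in cols[11:]:
        tag, typ, value = item.split(":", 2)
        if typ == "i":
            value = int(value)
        elif typ == "f":
            value = float(value)
        tags[tag] = value  # other types are kept as strings in this sketch
    return SamRecord(cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]),
                     cols[5], cols[6], int(cols[7]), int(cols[8]),
                     cols[9], cols[10], tags)


if __name__ == "__main__":
    # Hypothetical read, made up for illustration.
    line = "read_001\t99\tchr1\t1000\t60\t5S45M\t=\t1200\t250\t*\t*\tNM:i:1\tRG:Z:grp1"
    rec = parse_sam_line(line)
    print(decode_flag(rec.flag))
    print(reference_span(rec.cigar))
```

Soft-clipped bases ('S') and insertions ('I') consume the read but not the reference, which is why `reference_span` ignores them.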

BAM Format

The BAM (Binary Alignment/Map) format is a compressed binary representation of the SAM (Sequence Alignment/Map) format, designed for efficient storage and random access to high-throughput sequencing alignment data. Developed as part of the SAMtools suite, BAM retains all information from SAM, including mandatory fields such as query name (QNAME), alignment flag (FLAG), and reference sequence position (POS), while encoding them in a compact binary form to reduce file sizes significantly, for instance compressing 112 Gbp of uncompressed data (approximately 116 GB) to under 30 GB. This format uses little-endian byte order and supports both aligned and unaligned reads, making it suitable for diverse sequencing platforms and read types.

A BAM file begins with a 4-byte magic string "BAM\1", followed by a uint32_t indicating the length of the header section, which contains the textual SAM header (e.g., lines starting with @HD or @SQ) in raw form. This is succeeded by a uint32_t specifying the number of reference sequences, each described by a binary entry with the sequence name length (uint32_t), name (null-terminated string), and length (uint32_t). Alignment records follow, each prefixed by a uint32_t block size (excluding the size field itself) and comprising core fields such as reference ID (int32_t, -1 for unmapped), position (0-based int32_t), read name length (uint8_t), mapping quality (uint8_t), bin (uint16_t), number of CIGAR operations (uint16_t), FLAG (uint16_t), sequence length (uint32_t), next reference ID and position (int32_t each), template length (int32_t), and variable-length arrays for the read name, CIGAR string (uint32_t array), sequence (packed uint8_t array using 4 bits per base), and quality scores (uint8_t array). Optional tags are encoded as key-value pairs (2-byte tag + 1-byte type + variable value), allowing flexible extension without altering the core structure.
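The header layout described above (magic string, header text length, header text, reference count, then one name/length entry per reference) can be read with Python's `struct` module. The sketch below is illustrative only: `build_bam_header` and `parse_bam_header` are hypothetical helper names, and the code works on already-decompressed bytes, so it does not by itself read a real `.bam` file from disk.

```python
import struct


def build_bam_header(text, refs):
    """Serialize a minimal uncompressed BAM header (for demonstration only)."""
    text_bytes = text.encode("ascii")
    parts = [b"BAM\x01", struct.pack("<I", len(text_bytes)), text_bytes,
             struct.pack("<I", len(refs))]
    for name, length in refs:
        name_bytes = name.encode("ascii") + b"\x00"  # names are NUL-terminated
        parts.append(struct.pack("<I", len(name_bytes)))
        parts.append(name_bytes)
        parts.append(struct.pack("<I", length))
    return b"".join(parts)


def parse_bam_header(raw):
    """Return (header_text, [(ref_name, ref_length), ...], bytes_consumed)."""
    if raw[:4] != b"BAM\x01":
        raise ValueError("missing BAM magic string")
    offset = 4
    (l_text,) = struct.unpack_from("<I", raw, offset)  # little-endian uint32
    offset += 4
    text = raw[offset:offset + l_text].rstrip(b"\x00").decode("ascii")
    offset += l_text
    (n_ref,) = struct.unpack_from("<I", raw, offset)
    offset += 4
    refs = []
    for _ in range(n_ref):
        (l_name,) = struct.unpack_from("<I", raw, offset)
        offset += 4
        name = raw[offset:offset + l_name - 1].decode("ascii")  # drop the NUL
        offset += l_name
        (l_ref,) = struct.unpack_from("<I", raw, offset)
        offset += 4
        refs.append((name, l_ref))
    return text, refs, offset


if __name__ == "__main__":
    # Made-up header text and reference lengths.
    sam_text = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n"
    raw = build_bam_header(sam_text, [("chr1", 1000)])
    print(parse_bam_header(raw))
```

The returned offset marks where the first alignment record's block size field would begin.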
BAM files employ BGZF (Blocked GNU Zip Format) compression, a gzip-compatible method that divides data into independent blocks of up to 64 KB, enabling parallel decompression and random access without full file loading. This is handled via the HTSlib library integrated with SAMtools, which also facilitates conversion between SAM and BAM (e.g., processing 112 Gbp of data in about 10 hours on standard hardware).

For efficient querying, BAM files must be sorted by coordinate and indexed using the BAI (BAM index) format, which employs a hierarchical binning system based on genomic regions (e.g., bin 0 covers the entire 512 Mbp range, with finer bins down to 16 Kbp) combined with linear offsets for chunks within bins. This indexing allows retrieval of alignments overlapping a specific region with typically one disk seek, supporting operations like samtools view on regions with low memory overhead (under 30 MB for large datasets).

Key differences from SAM include the shift to 0-based positioning (versus SAM's 1-based), binary encoding of sequences and qualities (e.g., bases packed into 4-bit values using the order =ACMGRSVTWYHKDBN), and support for CIGAR strings with more than 65,535 operations via the CG:B,I optional tag, which avoids overflowing the 16-bit operation count. These features make BAM the preferred format for SAMtools workflows, such as sorting (samtools sort), merging, and statistical analysis, where it provides substantial speed and space savings over text-based alternatives. The format's specification, version 1.6 as of November 2024, ensures backward compatibility while accommodating evolving sequencing technologies.
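Both the binning scheme and the 4-bit base packing lend themselves to a short worked example. The sketch below ports the bin calculation given in the SAM specification (there named `reg2bin`) to Python and adds two hypothetical helpers, `pack_bases` and `unpack_bases`, for the `=ACMGRSVTWYHKDBN` code order. It assumes 0-based, half-open intervals and is meant as an illustration rather than as HTSlib's actual implementation.

```python
SEQ_CODES = "=ACMGRSVTWYHKDBN"  # 4-bit code i stands for SEQ_CODES[i]


def reg2bin(beg: int, end: int) -> int:
    """Smallest BAI bin fully containing the 0-based half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:  # 16 kbp bins (2**14 bases)
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:  # 128 kbp bins
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:  # 1 Mbp bins
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:  # 8 Mbp bins
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:  # 64 Mbp bins
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0                    # bin 0 spans the whole 512 Mbp range


def pack_bases(seq: str) -> bytes:
    """Pack bases two per byte, first base in the high nibble."""
    codes = [SEQ_CODES.index(base) for base in seq.upper()]
    if len(codes) % 2:
        codes.append(0)  # pad the final low nibble
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def unpack_bases(packed: bytes, length: int) -> str:
    """Inverse of pack_bases; length is the number of bases stored."""
    out = []
    for byte in packed:
        out.append(SEQ_CODES[byte >> 4])
        out.append(SEQ_CODES[byte & 0xF])
    return "".join(out[:length])


if __name__ == "__main__":
    print(reg2bin(1000, 1050))       # a short read inside one 16 kbp window
    print(pack_bases("ACGT").hex())
    print(unpack_bases(pack_bases("ACGTN"), 5))
```

An alignment that straddles a 16 kbp boundary falls into a coarser bin, which is why an index query has to check one candidate bin per level.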

CRAM Format

The CRAM format is a reference-based columnar storage format designed for high-efficiency compression of biological sequence alignments, offering significant space savings over the BAM format while maintaining full compatibility with the SAM specification. Developed as an extension of the SAM/BAM ecosystem, CRAM encodes alignment data in a way that leverages the reference genome to store only differences from the reference sequence, such as substitutions, insertions, and deletions, rather than full read sequences. This approach enables compression ratios typically 3 to 4 times better than BAM for short-read data, with file sizes reduced by 50-70% in practice for Illumina sequencing outputs.

CRAM files are structured as a sequence of containers, beginning with a fixed 26-byte file definition that identifies the format version, followed by a header container storing the SAM header and reference sequence information (with checksums for validation). Subsequent containers each carry a compression header that defines per-field encoding parameters and group alignment records into slices, logical units of up to 100,000 records. Slices consist of a core data block (a bit-packed stream of encoded alignment fields) and optional external blocks (byte streams for less compressible data like read names or quality scores), with all blocks compressed using algorithms such as gzip, LZMA, or rANS entropy coding. The file concludes with an EOF container for integrity verification. This modular design supports random access via external indexing (e.g., .crai files) and selective decoding, allowing tools to load only relevant slices without decompressing the entire file.

Encoding in CRAM is field-specific and adaptive: core alignment attributes (e.g., flags, mapping quality, positions) use variable-length integer encodings like ITF-8, while read features (e.g., base substitutions as delta offsets from the reference) are represented as arrays of operations to reconstruct the sequence on demand.
A reference sequence is normally required for decoding, but CRAM can optionally embed reference slices for portability, with MD5-based validation to ensure consistency. Version 3.1, released in 2021, introduced advanced codecs including rANS4x16 for faster encoding, adaptive arithmetic coding, fqzcomp for quality scores, and a name tokenizer for read identifiers, yielding 7-15% additional compression gains over version 3.0 for high-coverage short reads, and enabling processing speeds up to 3 times faster than equivalent BAM operations in benchmarks like samtools flagstat on large datasets. CRAM version 3.1 became the default CRAM output version in SAMtools and HTSlib with the 1.22 releases in May 2025. These enhancements are implemented in the HTSlib library, ensuring seamless integration with SAMtools commands such as view, sort, and index for CRAM I/O.

Compared to BAM's record-oriented BGZF compression, CRAM's columnar structure and reference dependency reduce redundancy in aligned data, making it particularly advantageous for archival storage in large-scale genomics projects, though it requires reference availability during decoding, a trade-off mitigated by widespread reference standardization. Support for controlled lossy compression (e.g., via quality score binning) further optimizes space for variant calling pipelines, with the aim of leaving downstream accuracy largely unaffected. CRAM was first integrated into SAMtools with version 1.0 in 2014, evolving from early prototypes, with version 3.1 serving as the default CRAM version in recent releases.
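As an illustration of the variable-length integer encodings mentioned above, the following sketch implements ITF-8 as the CRAM specification describes it, to the best of our understanding: the number of leading 1 bits in the first byte gives the count of extra bytes, and the fifth byte of the longest form carries only four value bits. The function names `itf8_encode` and `itf8_decode` are our own, and values are treated as unsigned 32-bit integers here, so the byte layout should be checked against the specification before relying on it.

```python
def itf8_encode(value: int) -> bytes:
    """Encode a 32-bit integer in 1 to 5 bytes (smaller values use fewer bytes)."""
    value &= 0xFFFFFFFF
    if value < 0x80:
        return bytes([value])
    if value < 0x4000:
        return bytes([(value >> 8) | 0x80, value & 0xFF])
    if value < 0x200000:
        return bytes([(value >> 16) | 0xC0, (value >> 8) & 0xFF, value & 0xFF])
    if value < 0x10000000:
        return bytes([(value >> 24) | 0xE0, (value >> 16) & 0xFF,
                      (value >> 8) & 0xFF, value & 0xFF])
    # 5-byte form: only the low 4 bits of the last byte are used.
    return bytes([0xF0 | (value >> 28), (value >> 20) & 0xFF,
                  (value >> 12) & 0xFF, (value >> 4) & 0xFF, value & 0x0F])


def itf8_decode(data: bytes, offset: int = 0):
    """Return (value, next_offset) for the ITF-8 integer starting at offset."""
    b0 = data[offset]
    if b0 < 0x80:
        return b0, offset + 1
    if b0 < 0xC0:
        return ((b0 & 0x3F) << 8) | data[offset + 1], offset + 2
    if b0 < 0xE0:
        return (((b0 & 0x1F) << 16) | (data[offset + 1] << 8)
                | data[offset + 2]), offset + 3
    if b0 < 0xF0:
        return (((b0 & 0x0F) << 24) | (data[offset + 1] << 16)
                | (data[offset + 2] << 8) | data[offset + 3]), offset + 4
    value = (((b0 & 0x0F) << 28) | (data[offset + 1] << 20)
             | (data[offset + 2] << 12) | (data[offset + 3] << 4)
             | (data[offset + 4] & 0x0F))
    return value & 0xFFFFFFFF, offset + 5


if __name__ == "__main__":
    for n in (0, 127, 128, 300, 70000, 5_000_000, 0xFFFFFFFF):
        encoded = itf8_encode(n)
        print(n, encoded.hex(), itf8_decode(encoded)[0])
```

Because most alignment fields hold small values, the one- and two-byte forms dominate in practice, which is the point of using a variable-length encoding before the general-purpose compressors run.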

Core Commands

File Manipulation

SAMtools offers a suite of core commands dedicated to file manipulation tasks, enabling users to view, convert, sort, index, merge, and otherwise process alignment files in SAM, BAM, and CRAM formats. These utilities are fundamental for handling high-throughput sequencing data, as they facilitate efficient data preparation, subset extraction, and integration in bioinformatics pipelines without requiring full reloading of large files. Designed for speed and low memory usage, these commands leverage the compressed BAM and CRAM formats to manage terabyte-scale datasets effectively.

The samtools view command serves as the primary tool for inspecting and converting alignment files. It extracts and prints alignments from input files to standard output in SAM format by default, but supports output conversion to BAM (-b) or CRAM (-C) via specified options, with the output file designated using -o. For region-specific extraction, an indexed input file is required, allowing queries like samtools view input.bam chr1:10000-20000 to retrieve alignments in a genomic region. This command is versatile for initial data exploration and format interoperability, often piped to other tools for downstream analysis.

Sorting alignments is handled by samtools sort, which rearranges records by coordinate (default) or read name (-n) to produce a sorted BAM output. Essential for indexing and efficient querying, it uses temporary files prefixed by -T and supports multi-threading with -@ for speed on large inputs. For instance, samtools sort -o sorted.bam input.bam generates a coordinate-sorted file suitable for subsequent operations, reducing random access times in variant calling workflows.

Once sorted, files can be indexed using samtools index to enable rapid region-based access. This creates a .bai (BAI) or .csi (CSI) index file alongside the input, with -b or -c options selecting the index type; BAI is standard for BAM files whose reference sequences are shorter than 2^29 bases.
The command requires coordinate-sorted input and is non-destructive, as in samtools index sorted.bam, which produces sorted.bam.bai for use with tools like samtools view or genome browsers. Indexing dramatically improves performance on datasets exceeding gigabytes, avoiding sequential scans.

Merging multiple sorted files is accomplished with samtools merge, which combines inputs while maintaining sort order and merging headers. It accepts a list of BAM/CRAM files as arguments, outputting to a specified file, and includes options like -n for name-sorted inputs or -f to force overwriting. An example workflow is samtools merge -o merged.bam sample1.bam sample2.bam, ideal for consolidating lane-level alignments from sequencing runs. For simpler concatenation without order preservation, samtools cat joins files of compatible formats using -h for a shared header source.

Additional manipulation includes samtools split, which partitions a file by read group (RG) tag into separate outputs named after the input file, as in samtools split merged.bam, which yields one output file per read group. Header replacement is streamlined by samtools reheader, which applies a new SAM header to a BAM/CRAM file efficiently: samtools reheader newheader.sam input.bam > output.bam. These commands support targeted file restructuring, such as separating samples or correcting headers, enhancing data organization in multi-sample studies.

For read shuffling and grouping, samtools collate groups reads that share a name so that paired reads sit next to each other, using samtools collate -o output.bam input.bam to produce output without a full sort, which aids duplicate marking. Complementing this, samtools fixmate populates mate information in name-sorted or collated files, adding flags and coordinates via samtools fixmate -O bam input.bam output.bam, and optionally removes secondary and unmapped reads (-r).
Finally, samtools markdup identifies and marks duplicates in coordinate-sorted files, with options to remove them (-r) or output statistics (-s), as in samtools markdup input.bam output.bam. These utilities help preserve data integrity during manipulation, which is critical for accurate downstream analyses like variant detection.
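The sort, index, and region-extraction steps described in this section are often chained together. The sketch below assembles those three command lines in Python using only options mentioned above (`-@`, `-o`, `-b`); the function names, file names, and thread count are placeholders chosen for illustration, and by default nothing is executed, so samtools need not be installed to try it.

```python
import shlex
import subprocess


def sort_index_extract(in_bam, sorted_bam, region, region_bam, threads=4):
    """Return the samtools command lines for sort -> index -> region extraction."""
    return [
        # Coordinate sort (the default order), written to sorted_bam.
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, in_bam],
        # Build the index next to the sorted file (sorted_bam + ".bai").
        ["samtools", "index", sorted_bam],
        # Extract one region as BAM; this step relies on the index.
        ["samtools", "view", "-b", "-o", region_bam, sorted_bam, region],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failure


if __name__ == "__main__":
    # Hypothetical file names.
    cmds = sort_index_extract("input.bam", "sorted.bam",
                              "chr1:10000-20000", "region.bam")
    run_all(cmds)  # dry run: prints the three command lines
```

Passing arguments as lists rather than a single shell string avoids quoting problems with unusual file names.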

Alignment Processing

SAMtools provides a suite of commands for processing alignment files in SAM, BAM, and CRAM formats, enabling tasks such as viewing, filtering, sorting, indexing, and merging alignments to facilitate downstream genomic analysis. These operations are essential for managing high-throughput sequencing data, ensuring efficient access and manipulation of read alignments against reference genomes.

The samtools view command is a foundational tool for extracting and filtering alignments from input files. It converts between formats (e.g., BAM to SAM) and restricts output to specific genomic regions using positional arguments or the -L option for BED files, allowing users to focus on subsets of data without loading entire files into memory. For instance, samtools view -b input.bam chr1:1000-2000 outputs BAM-format alignments for the specified region, supporting rapid querying in large datasets.

Sorting and indexing are critical for optimizing alignment files for random access and efficient processing. The samtools sort command rearranges records by coordinate (default) or read name (with -n), producing sorted output that is a prerequisite for many analyses; it uses temporary files specified by -T to handle large inputs. Following sorting, samtools index generates indexes (BAI for BAM with -b, or CSI for larger references with -c), enabling fast region-based retrieval via samtools view without rescanning the entire file. This combination significantly reduces computational overhead in workflows involving repeated region queries.

Merging and concatenation support the integration of multiple alignment files, often from multiple sequencing lanes or multi-sample experiments. The samtools merge command combines sorted files while preserving order and merging headers (optionally from a separate file with -h), suitable for consolidating data from distributed alignments; it includes -f to force overwriting existing outputs.
In contrast, samtools cat simply concatenates unsorted files with identical reference dictionaries, providing a lightweight option for appending alignments without re-sorting.

Duplicate handling and mate-pair fixing address common artifacts in sequencing data. The samtools markdup command identifies and marks PCR or optical duplicates based on mapping coordinates and orientation, with options like -r to remove them and -s to report statistics, improving accuracy in variant calling pipelines. Similarly, samtools fixmate updates mate information in name-sorted files, correcting flags and positions for paired-end reads (using -m to add the mate score tags that markdup relies on), which is vital for downstream paired-end analyses.

Additional processing tools include samtools split for dividing files by read group into separate outputs, aiding in sample-specific workflows, and samtools reheader for replacing headers without altering alignment records, useful for updating metadata (with -i for in-place modification of CRAM files). These commands collectively enable robust preprocessing of alignments, ensuring data integrity and compatibility with tools like variant callers.
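Duplicate marking depends on the order of the preceding steps: reads are grouped by name, mate information is filled in, the file is coordinate-sorted, and only then are duplicates marked. The sketch below wires those four stages together as a Unix pipe from Python. The helper names and file names are illustrative assumptions, and the `-O` (write to standard output) and `-u` (uncompressed output) options follow the pipeline shown in the samtools markdup manual page as we recall it; check them against your installed version.

```python
import shlex
import subprocess


def markdup_stages(in_bam, out_bam):
    """Command lines for collate | fixmate -m | sort | markdup, in pipe order."""
    return [
        ["samtools", "collate", "-O", "-u", in_bam],    # group reads by name
        ["samtools", "fixmate", "-m", "-u", "-", "-"],  # add mate score tags
        ["samtools", "sort", "-u", "-"],                # coordinate sort
        ["samtools", "markdup", "-", out_bam],          # flag duplicates
    ]


def as_shell_pipeline(stages):
    """Render the stages as a single shell pipeline string for inspection."""
    return " | ".join(shlex.join(argv) for argv in stages)


def run_pipeline(stages):
    """Connect the stages with pipes and raise if any stage fails."""
    procs = []
    prev_stdout = None
    for i, argv in enumerate(stages):
        last = i == len(stages) - 1
        proc = subprocess.Popen(argv, stdin=prev_stdout,
                                stdout=None if last else subprocess.PIPE)
        if prev_stdout is not None:
            prev_stdout.close()  # let upstream see SIGPIPE if downstream exits
        prev_stdout = proc.stdout
        procs.append(proc)
    for argv, proc in zip(stages, procs):
        if proc.wait() != 0:
            raise subprocess.CalledProcessError(proc.returncode, argv)


if __name__ == "__main__":
    stages = markdup_stages("example.bam", "markdup.bam")  # hypothetical files
    print(as_shell_pipeline(stages))
    # run_pipeline(stages)  # uncomment where samtools and example.bam exist
```

Streaming uncompressed data between stages avoids writing and re-reading three intermediate BAM files.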

Statistics and Inspection

SAMtools provides several commands dedicated to generating statistics and inspecting files in SAM, BAM, or CRAM formats, enabling users to assess alignment quality, coverage, and read properties without extensive processing. These tools are essential for quality control in high-throughput sequencing workflows, offering both summary metrics and detailed views of the data. The primary commands include samtools stats, samtools flagstat, samtools idxstats, and samtools view, each targeting specific aspects of file inspection and statistical analysis.

The samtools stats command computes comprehensive statistics from alignment files, producing a text-based report that can be visualized using the accompanying plot-bamstats script. It categorizes metrics into sections such as summary numbers (e.g., total reads, mapped percentage), insert size distributions, coverage depths, and GC-content biases, distinguishing between paired and unpaired reads based on SAM flags like PAIRED (0x1), READ1 (0x40), and READ2 (0x80). For instance, it reports averages like insert size and coverage, along with histograms for insert sizes and quality scores, allowing users to identify issues such as library preparation artifacts or sequencing biases. Options like -c for custom coverage ranges (default: 1-1000) or -d to exclude duplicates enable targeted analysis, and the output supports region-specific queries when the input is indexed. This command is particularly useful for overall file inspection, as it processes the entire file or specified regions efficiently.

For flag-based inspection, samtools flagstat analyzes the FLAG field of alignments according to the SAM specification, counting reads across 13 categories such as total, mapped, paired, duplicates, and QC failures. It outputs counts and percentages for primary, secondary, and supplementary alignments, split by QC pass/fail status (FLAG 0x200), providing a quick overview of mapping quality and potential artifacts like unmapped or improperly paired reads.
The default output is human-readable (e.g., "122 + 28 in total"), but it supports JSON or TSV formats for programmatic use, with multi-threading via -@ for large files. This tool is lightweight and runs in a single pass, making it ideal for rapid quality checks during alignment pipelines. The samtools idxstats command retrieves per-reference-sequence statistics from an indexed BAM file, reporting the reference name, length, number of mapped read segments, and unmapped read segments in a TAB-delimited format. It requires prior indexing with samtools index for efficiency, though unindexed files can be processed by full scan (slower for large datasets). This is valuable for inspecting alignment distribution across chromosomes or contigs, highlighting uneven coverage or unmapped portions, and it may overcount multi-mapped or fragmented reads. An example output line might read "* 0 0 100" for unmapped reads, aiding in decisions about downstream filtering. Inspection at the level is facilitated by samtools [view](/page/View), which extracts and displays records from files, supporting filtering by , mapping (via -q), flags (include/exclude with -f/-F), or tags. For coordinate-sorted and indexed inputs, it enables fast random access to regions (e.g., samtools [view](/page/View) input.bam chr1:1000-2000), outputting in SAM, BAM, or formats. Options like -h include headers, -c counts matches without printing, and -L uses files for targeted viewing, making it a versatile tool for detailed examination of specific alignments or during workflows.
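Because flagstat categories and the -f/-F filters of samtools view both operate on individual bits of the FLAG field, it can help to see how a numeric FLAG breaks down. The following is a minimal sketch in POSIX shell, not part of SAMtools: decode_flag is a hypothetical helper name, and the bit names follow the SAM specification (0x1 PAIRED through 0x800 SUPPLEMENTARY). SAMtools itself ships a samtools flags subcommand for the same purpose.

```shell
#!/bin/sh
# decode_flag: hypothetical helper that lists the SAM FLAG bits set in an integer.
# Bit order follows the SAM specification: 0x1, 0x2, 0x4, ... 0x800.
decode_flag() {
    flag=$1
    out=""
    bit=1
    for name in PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE \
                READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY; do
        # A bit is set when the bitwise AND with its mask is non-zero.
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
        bit=$(( bit * 2 ))
    done
    printf '%s\n' "${out:-NONE}"
}

decode_flag 99     # PAIRED,PROPER_PAIR,MREVERSE,READ1
decode_flag 1540   # UNMAP,QCFAIL,DUP
```

Read this way, a filter such as -F 4 excludes every record whose decoded list contains UNMAP, while -f 2 keeps only records whose list contains PROPER_PAIR.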

Usage and Examples

Basic Usage

SAMtools provides a straightforward command-line interface for essential operations on files in SAM, BAM, and CRAM formats, such as conversion, sorting, indexing, and viewing. These core functionalities enable users to process high-throughput sequencing data efficiently, often in a Unix environment using pipes. For instance, input can be piped from other tools, and remote files can be accessed via FTP or HTTP URLs. Basic operations require the software to be installed and typically involve specifying input files, output options, and optional flags for filtering or formatting.

The view command is fundamental for inspecting, converting, and filtering alignments. It reads SAM, BAM, or CRAM files and outputs in SAM format by default, but can produce BAM or CRAM with flags (-b and -C, respectively). To convert a SAM file to BAM, use:
samtools view -b input.sam > output.bam
This compresses the text-based SAM into the BAM format for compact storage. For viewing the first few alignments without conversion, pipe to head:
samtools view input.bam | head -5
Region-specific extraction requires a coordinate-sorted and indexed input file, such as retrieving reads from chromosome 1, positions 10000-20000:
samtools view -b input.bam "1:10000-20000" > region.bam
Filtering by read group, mapping quality, or flags uses options such as -r, -q, and -f/-F; for example, -F 4 keeps mapped reads only.

Sorting alignments by genomic coordinates is a prerequisite for many downstream steps, including indexing. The sort command rearranges records in a BAM or CRAM file, outputting a new sorted file. A basic example sorts by leftmost coordinate:
samtools sort -o output.sorted.bam input.bam
This uses up to 768 MiB of memory per thread by default (adjustable with -m) and supports multi-threading with -@ for large files; for example, with 4 threads:
samtools sort -@ 4 -o output.sorted.bam input.bam
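Since the memory limit applies per thread, the approximate in-memory budget grows with the thread count. The arithmetic can be sketched as follows; sort_memory_mib is a hypothetical helper, and the result is only a rough estimate of buffer memory, not a measured figure.

```shell
#!/bin/sh
# sort_memory_mib THREADS [PER_THREAD_MIB]
# Hypothetical helper: approximate total sort buffer memory in MiB.
sort_memory_mib() {
    threads=$1
    per_thread=${2:-768}   # 768 MiB mirrors the per-thread default described above
    echo $(( threads * per_thread ))
}

sort_memory_mib 4        # prints 3072 (about 3 GiB for four threads)
sort_memory_mib 8 1024   # prints 8192 (eight threads at 1 GiB each)
```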
Sorting by read name instead (useful for paired-end processing such as fixmate) adds the -n flag. The process is memory-efficient, handling datasets up to hundreds of gigabases with modest resources.

Indexing a sorted BAM or CRAM file enables rapid random access to specific genomic regions without scanning the entire file. The index command generates a binary index file (BAI for BAM, CRAI for CRAM). For a sorted BAM:
samtools index output.sorted.bam
This creates output.sorted.bam.bai alongside the input. SAM files can be indexed only after BGZF compression (SAM.gz); the same command then applies, producing a .bai index. CSI indices, requested with -c, are needed when a reference sequence is longer than 2^29 bases (512 Mbp), which the BAI format cannot address. Once a file is indexed, commands like view use the index for efficient querying.

Basic statistics on alignment files can be generated with flagstat, providing counts of mapped, unmapped, and duplicate reads. Run:
samtools flagstat input.bam
This outputs a summary report, such as total reads and mapping rates, essential for quality control.

These operations form the foundation of SAMtools workflows, often chained together for data preparation in pipelines.
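When flagstat is run inside a script, the mapping rate usually has to be pulled out of the report. The sketch below shows one way to do that with awk on the default human-readable layout ("N + M in total", "N + M mapped"). The sample text and its counts are invented for illustration, and mapped_percent is a hypothetical helper rather than a SAMtools feature.

```shell
#!/bin/sh
# Invented flagstat-style report used only to exercise the parser.
flagstat_sample='1000 + 20 in total (QC-passed reads + QC-failed reads)
980 + 20 primary
20 + 0 secondary
0 + 0 supplementary
50 + 0 duplicates
950 + 10 mapped (95.00% : 50.00%)
930 + 10 primary mapped (94.90% : 50.00%)'

# mapped_percent: read a flagstat report on stdin and print the
# percentage of QC-passed reads that are mapped.
mapped_percent() {
    awk '
        $4 == "in" && $5 == "total" { total = $1 }   # QC-passed total
        $4 == "mapped"              { mapped = $1 }  # skips "primary mapped"
        END {
            if (total > 0) printf "%.2f\n", 100 * mapped / total
            else print "NA"
        }'
}

printf '%s\n' "$flagstat_sample" | mapped_percent   # prints 95.00
```

On real data the same function would be fed directly from the tool, for example samtools flagstat input.bam | mapped_percent.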

Common Workflows

One of the most prevalent workflows in high-throughput sequencing analysis involves converting raw FASTQ files to compressed BAM or CRAM formats for efficient storage and analysis. This typically begins with alignment of reads to a reference genome using tools like BWA-MEM or minimap2, producing a SAM file that is then processed with SAMtools commands to fix mate-pair information, sort by genomic position, mark duplicates, and convert to the desired format. For instance, the fixmate command resolves pairing issues in paired-end data, while sort ensures the coordinate ordering essential for indexing and variant calling. The markdup step identifies and flags duplicates to avoid biases in coverage estimates. This workflow reduces file sizes significantly: CRAM files can be 23-55% of the size of equivalent BAM files (45-77% smaller), depending on the dataset and compression settings, which makes the workflow foundational for whole-genome sequencing pipelines.

A representative pipeline for this conversion pipes the aligner output directly into SAMtools for streaming processing:
minimap2 -a -x sr reference.fa reads.fastq | samtools fixmate -m - - | samtools sort -@ 8 -T /tmp/temp - | samtools markdup -r - final.bam
This avoids intermediate files and leverages multi-threading for speed on large datasets. Note that -r removes duplicates outright; omitting it leaves them in the output with the duplicate flag set. Conversion to CRAM requires specifying the reference with -T reference.fa in the view command, enabling reference-based compression that stores differences from the reference rather than full sequences. This approach is particularly useful in resource-constrained environments, as CRAM files can be decoded on-the-fly without full decompression.

For whole-genome sequencing (WGS) or whole-exome sequencing (WES), a standard workflow extends the process into variant calling and refinement. After initial alignment with BWA-MEM to produce sorted BAM files via SAMtools fixmate and sort, base quality score recalibration (BQSR) and duplicate marking are applied using external tools like GATK, followed by merging multiple lanes with SAMtools merge. Variant calling then uses BCFtools mpileup to generate genotype likelihoods from the BAM, piped into bcftools call for variant calling:
bcftools mpileup -Ou -f ref.fa input.bam | bcftools call -mv -Ob -o calls.bcf
This produces a BCF containing candidate SNPs and indels, which can be indexed with bcftools index for quick querying. Depth handling matters at this stage: the -d option of mpileup caps the number of reads considered per position in each file (default 250) rather than skipping low-depth regions, so minimum-depth requirements are normally applied afterwards during filtering. Such settings balance sensitivity and specificity in detecting variants at the ~30x coverage typical for WGS.

Post-calling, VCF filtering refines variants using BCFtools to remove artifacts based on quality metrics like the QUAL score, read depth (DP), and strand bias. A common post-call filter retains only sites that pass bcftools filter -i 'QUAL>20 && DP>10' -Ob -o filtered.bcf calls.bcf, with SNPs and indels separated via TYPE for tailored rules; for example, indels may require a minimum number of supporting reads (IDV > 2) to mitigate errors. Pre-call options in mpileup, such as -L 250, which caps the per-file depth used for indel calling, limit spurious calls in high-coverage regions. This step is crucial for reducing false positives, with empirical thresholds often tuned against truth sets using bcftools isec to compare true and false positives. Functional annotation with dedicated annotation tools typically follows.

CRAM-specific workflows optimize for reference-dependent storage in collaborative or archival settings, requiring alignments to be position-sorted before encoding to maintain compression ratios of around 1:4 to 1:6 relative to uncompressed data. As of version 1.22 (2025), SAMtools defaults to CRAM 3.1, which provides further compression improvements over previous versions. Key commands include samtools view -T ref.fa cram.cram for on-demand decoding and running mpileup directly on CRAM for pileups without conversion. Best practices include recording reference MD5 checksums in the header (the M5 tags on @SQ lines) and setting environment variables like REF_PATH for remote reference fetching, ensuring seamless access in distributed systems like the European Nucleotide Archive. This format supports workflows where storage is a bottleneck, as partial decoding reduces I/O compared to BAM, though it demands consistent reference availability to avoid decoding failures.

Basic inspection and quality control form another routine workflow, often preceding analysis. After indexing a BAM with samtools index aligned.bam, the flagstat command summarizes alignment metrics: samtools flagstat aligned.bam reports total reads, mapped percentage (typically >95% for good-quality data), and duplicates. Depth statistics via samtools depth -a aligned.bam > coverage.txt quantify per-position coverage, aiding in identifying biases. These steps are typically quick even on gigabyte-scale files and provide essential diagnostics without full reprocessing.
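The per-position output of samtools depth is a three-column, tab-separated table (reference name, 1-based position, depth), which lends itself to simple summaries. As a minimal sketch, the following awk-based helper computes mean depth per reference sequence. The sample rows are invented, and mean_depth is a hypothetical name rather than a SAMtools command.

```shell
#!/bin/sh
# Invented rows in the layout written by samtools depth: name, position, depth.
depth_sample=$(printf 'chr1\t1\t10\nchr1\t2\t12\nchr1\t3\t0\nchr2\t1\t8\n')

# mean_depth: read depth rows on stdin, print "name<TAB>mean" per reference.
mean_depth() {
    awk -F '\t' '
        { sum[$1] += $3; n[$1]++ }   # accumulate depth and position count
        END {
            for (c in sum) printf "%s\t%.2f\n", c, sum[c] / n[c]
        }' | sort                    # awk array order is unspecified, so sort
}

printf '%s\n' "$depth_sample" | mean_depth
# chr1    7.33
# chr2    8.00
```

With real data the function would read the file produced above, as in mean_depth < coverage.txt. Because -a reports zero-depth positions as well, the mean reflects the full reference length rather than only covered bases.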

Integration and Extensions

With HTSlib and BCFtools

SAMtools, BCFtools, and HTSlib form a tightly integrated ecosystem for processing high-throughput sequencing data, with HTSlib serving as the foundational library that enables efficient reading, writing, and manipulation of formats such as SAM/BAM/CRAM for alignments and VCF/BCF for variants. HTSlib provides a unified API for these operations, allowing SAMtools and BCFtools to share core functionalities like binary format handling, indexing, and multi-threading support, which enhances performance and ensures compatibility across the tools. This integration originated from the restructuring of the original SAMtools project, where HTSlib was extracted as a standalone library to facilitate independent development and third-party embedding, while SAMtools focused on alignment processing and BCFtools on variant calling.

In practice, SAMtools depends on HTSlib for all file I/O operations, such as sorting, merging, and indexing BAM files, with source distributions including bundled HTSlib copies for standalone builds. Similarly, BCFtools leverages HTSlib for VCF/BCF manipulation, including conversion between text and binary formats, enabling seamless handling of large-scale datasets. Because HTSlib provides the core functionality for both tools, this shared dependency minimizes code duplication and allows updates like in-file indexing (introduced in version 1.10) to propagate efficiently across SAMtools and BCFtools.

A primary example of integration occurs in variant calling pipelines, where SAMtools prepares aligned BAM files that serve as input to BCFtools' mpileup command, which generates pileups and genotype likelihoods using HTSlib for efficient data access. For instance, after sorting and indexing with samtools sort and samtools index, the pipeline proceeds to:
bcftools mpileup -f ref.fa input.bam | bcftools call -mv -Ob -o calls.bcf
This one-liner produces a compressed BCF file of variants by combining pileup generation and calling, relying on HTSlib's threading for speed on multi-core systems. Such pipelines, common in whole-genome sequencing (WGS) analysis, demonstrate how the tools interoperate: SAMtools handles preprocessing of alignments, while BCFtools performs downstream variant detection and filtering, all underpinned by HTSlib's format-agnostic efficiency.

Historically, BCFtools evolved from SAMtools' variant-calling components (e.g., the original mpileup and call functionality dating back to SAMtools 0.1.9), later becoming an independent project to better support multi-sample and gVCF formats. This modular design has enabled the ecosystem's growth, with extensive ongoing development fostering high-impact applications in research while maintaining low memory usage and platform independence. As of mid-2025, the latest releases are SAMtools 1.22.1, BCFtools 1.22, and HTSlib 1.22.1, continuing to support evolving sequencing technologies and pipelines.

With Other Bioinformatics Tools

SAMtools is frequently integrated into next-generation sequencing (NGS) workflows alongside aligners such as BWA, where BWA generates initial SAM files from read alignments to a reference genome, and SAMtools subsequently converts these to compressed BAM format, sorts them, and generates indices for efficient random access. This integration ensures compatibility in standard pipelines, as BAM files produced by SAMtools are directly usable by downstream tools.

In variant calling pipelines, SAMtools pairs with the Genome Analysis Toolkit (GATK) by providing pre-processed BAM files (sorted, indexed, and filtered for mapping quality) that serve as input for GATK's HaplotypeCaller or other callers, enabling accurate identification of single nucleotide variants and indels. For instance, after alignment with BWA and duplicate marking, SAMtools' view and index commands prepare files that GATK requires for base quality score recalibration and joint genotyping across samples. This combination has become a cornerstone of best practices for high-confidence variant detection in clinical and research sequencing.

SAMtools also complements Picard tools for quality control and duplicate handling. While Picard's MarkDuplicates identifies and tags PCR or optical duplicates in BAM files, SAMtools can preprocess or postprocess these files via sorting (samtools sort) or fixing pairs (samtools fixmate), ensuring interoperability in pipelines that prioritize duplicate removal to reduce bias in downstream analyses. Although newer SAMtools versions include a markdup command as an alternative, Picard remains preferred in GATK-centric workflows for its robust handling of read groups and metrics reporting.

For visualization, SAMtools generates indices (via samtools index) that enable loading of BAM or CRAM files into the Integrative Genomics Viewer (IGV), allowing interactive inspection of alignments, coverage, and variants without file conversion. This integration supports exploratory analysis, where users can correlate SAMtools-derived statistics (e.g., from samtools stats) with IGV's graphical overlays to validate pipeline outputs.

Beyond these, SAMtools interfaces with tools like BEDTools for intersection-based operations on alignments and regions, where indexed BAM files from SAMtools feed into BEDTools' bamToBed or coverage calculations, facilitating tasks such as peak calling integration in ChIP-seq workflows. Overall, SAMtools' standardized formats and Unix-pipe compatibility make it a foundational component in modular pipelines, often scripted for reproducible integration across diverse tools.
