SAMtools

SAMtools is a suite of command-line utilities written in C for manipulating and analyzing high-throughput sequencing data, primarily focused on processing alignments stored in the Sequence Alignment/Map (SAM), Binary Alignment/Map (BAM), and CRAM formats.^[1]^[2] It enables essential operations such as viewing, sorting, indexing, merging, and generating per-position summaries from read alignments, supporting efficient handling of large datasets from next-generation sequencing platforms.^[3] Originally developed to facilitate post-processing of alignments for projects like the 1000 Genomes Project, SAMtools has become a foundational tool in bioinformatics pipelines worldwide.^[4]

Introduced in 2009 by Heng Li and colleagues from the Broad Institute, Wellcome Trust Sanger Institute, and other institutions as part of the 1000 Genomes Project Data Processing Subgroup, SAMtools was released alongside the specification of the SAM format, a flexible, tab-delimited text format for representing sequence alignments against reference genomes.^[1] The software addressed the need for standardized, efficient tools to manage the growing volume of sequencing data, with its binary BAM format providing compact storage and rapid random access capabilities.^[4] Over time, the project evolved: in 2010, variant calling components were separated into BCFtools, and by 2014, the codebase was restructured into three coordinated repositories to improve modularity and maintenance: HTSlib (a C library for high-throughput sequencing data I/O), SAMtools (for alignment manipulation), and BCFtools (for variant data).^[3] This restructuring marked the 1.0 release of SAMtools, which doubled the codebase size and introduced support for the CRAM format for further compression.^[3]

Key features of SAMtools include threading for parallel processing (added in version 0.1.19), on-the-fly indexing during file writing (version 1.10), and utilities like samtools view for format conversion and subsetting, samtools sort for coordinate or name ordering, samtools index for enabling fast queries, and samtools stats for generating alignment summaries.^[5] It supports alignments from short reads (e.g., Illumina) to long reads (up to 128 Mbp from PacBio or Oxford Nanopore), making it versatile across sequencing technologies and species, including vertebrates, plants, and microbes.^[1]^[3]

SAMtools has had a profound impact on genomics research, with its original 2009 publication cited over 56,000 times and the software installed more than seven million times via package managers like Bioconda.^[3]^[6]^[7] Actively maintained under the MIT license on GitHub, it features extensive testing (>700 unit tests), continuous integration across platforms, and ongoing enhancements for handling massive datasets and 64-bit integer support for large genomes, with the latest stable release (version 1.22) in 2025.^[3]^[8] As of 2025, SAMtools and its sister projects continue to underpin major genomic analyses, from variant detection to population studies, ensuring compatibility with evolving standards in the field.^[3]

Introduction

Overview

SAMtools is a widely used software suite in bioinformatics for processing and analyzing high-throughput sequencing data, particularly alignments between sequencing reads and reference genomes. It provides a collection of command-line utilities that enable efficient manipulation of files in the Sequence Alignment/Map (SAM), Binary Alignment/Map (BAM), and CRAM formats, supporting tasks such as viewing alignments, sorting, merging, indexing, and generating consensus sequences. Developed to address the challenges of handling massive datasets from next-generation sequencing technologies, SAMtools facilitates downstream analyses like variant calling and structural variant detection by ensuring fast and reliable data access.^[9]^[10]

The core of SAMtools relies on the HTSlib library, which implements low-level input/output operations for compressed and indexed genomic files, allowing for random access to specific regions without loading entire datasets into memory. This design choice enhances performance, making it suitable for processing terabyte-scale alignment files common in modern genomics projects. Originally introduced as part of the 1000 Genomes Project, SAMtools has become a foundational tool in sequencing pipelines, integrated with aligners like BWA and variant callers like GATK.^[11]^[9]^[12]

Over the years, the project has expanded to include complementary tools like BCFtools for handling variant data in binary call format (BCF), reflecting its evolution into a comprehensive ecosystem for genomic data management. Its open-source nature and active maintenance by a global developer community ensure compatibility with emerging sequencing technologies, including long-read platforms. SAMtools remains essential for reproducible research, with its utilities cited in thousands of studies for enabling scalable bioinformatics workflows.^[10]^[12]

Key Components

SAMtools is a modular software suite centered on the HTSlib library and a collection of command-line tools for manipulating high-throughput sequencing alignments in SAM, BAM, and CRAM formats. HTSlib, a C library developed as part of the project, provides low-level input/output operations, including parsing, compression, and indexing support, enabling efficient handling of large datasets across local and remote files. This library forms the backbone, allowing tools to perform operations like format conversion and region-based querying without redundant code.^[13]^[10]

The suite's primary tools are subcommands under the samtools executable, categorized into file manipulation, processing, and analysis functions. These utilities support Unix-style piping for seamless integration into bioinformatics workflows and leverage multi-threading for performance on modern hardware. Originally introduced to address the need for standardized alignment handling, the components emphasize speed and compatibility with evolving sequencing technologies.^[14]^[9]

Component	Description
view	Converts between SAM, BAM, and CRAM formats and filters alignments by region, flags, or mapping quality. Essential for initial inspection and subsetting of data; conversion of alignments to FASTA/FASTQ is handled by the separate fasta and fastq commands.^[14]
sort	Sorts alignments by coordinate or query name, producing coordinate-sorted BAM files required for most downstream analyses; supports temporary file management for large inputs.^[14]^[10]
index	Generates .bai or .csi index files for BAM and .crai index files for CRAM, enabling fast random access to genomic regions without full file loading; FASTA files are indexed separately with the `faidx` command.^[14]
mpileup	Generates a textual pileup of aligned bases at each genomic position; for BCF or VCF output suitable for variant calling, use bcftools mpileup. Includes options for indel handling and base quality adjustments.^[14]^[9]
merge	Combines multiple sorted alignment files into one, preserving metadata; useful for aggregating results from parallel processing or multi-sample experiments.^[14]
markdup	Identifies and flags PCR duplicates in sorted alignments based on mapping position and orientation; outputs updated BAM with duplicate metrics.^[14]^[10]
stats	Generates comprehensive statistics on alignments, including total reads, mapping rates, insert size distributions, and per-chromosome coverage; aids in quality assessment.^[14]

Additional specialized tools, such as fixmate for correcting mate-pair information and calmd for regenerating MD and NM tags (with optional base alignment quality adjustment), extend functionality for targeted tasks. Together, these components facilitate the full spectrum of alignment processing, from raw data import to preparation for tertiary analysis tools.^[14]^[10]

History

Development Origins

SAMtools originated in late 2008 as part of efforts to standardize the representation and processing of high-throughput sequencing alignments for the 1000 Genomes Project, a large-scale initiative to sequence human genomes and identify genetic variants.^[4] The project required a flexible format to accommodate diverse sequencing technologies, such as Illumina/Solexa, AB/SOLiD, and Roche/454, and various alignment tools, enabling efficient downstream analyses like variant detection and genotype calling.^[4] Prior to this, alignments were often stored in proprietary or tool-specific formats, hindering interoperability and scalability for projects handling up to 10^11 base pairs of data.^[4]

The Sequence Alignment/Map (SAM) format was conceptualized and named on October 21, 2008, by Heng Li, a key developer from the Wellcome Trust Sanger Institute, in collaboration with members of the 1000 Genomes Project Data Processing Subgroup, including Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Gonçalo Abecasis, and Richard Durbin.^[15]^[4] Development emphasized streamability, so that large files could be processed without being loaded fully into memory. This led to the adoption, on October 22, 2008, of fixed columns, optional tags, extended CIGAR strings for representing insertions and deletions, and a binning index for quick lookups.^[15] By November 3, 2008, a dual text/binary format was established, with the binary BAM format introduced on November 7 to support compression via BGZF (Blocked GNU Zip Format) for reduced storage and faster access.^[15] The final draft, under the MIT license, was sent to the 1000 Genomes Project on December 8, 2008.^[15]

SAMtools, the accompanying software suite for manipulating SAM and BAM files, was first publicly released on December 22, 2008, initially supporting file viewing, sorting, indexing, and basic variant calling.^[15] This release addressed the immediate needs of the 1000 Genomes Project for modular tools that could interface with alignment software and facilitate genomic analyses.^[10] The toolkit's design prioritized efficiency and extensibility, allowing integration with emerging sequencing pipelines while maintaining compatibility with the evolving SAM specification.^[4] Early development was driven by practical challenges in handling terabyte-scale datasets, with contributions from the broader bioinformatics community shaping its core utilities.^[10]

Major Releases

SAMtools' initial release, version 0.1.1, occurred on December 22, 2008, introducing core utilities for converting between the SAM text format and the binary BAM format, sorting alignments, indexing files, and generating pileup data for variant detection. This laid the groundwork for efficient processing of high-throughput sequencing alignments, with early versions emphasizing compact storage and basic manipulation to handle the growing volume of genomic data from projects like the 1000 Genomes Project.^[10]

The 0.x series evolved through over 20 releases up to version 0.1.20 in August 2014, incorporating incremental enhancements such as improved error handling and performance tweaks. A key advancement came in version 0.1.9 (October 2010), which restructured the integrated variant caller into the standalone BCFtools package for better modularity.^[16]^[10] Version 0.1.19 (March 2013) marked a performance milestone by adding multi-threading support for sorting and for BAM writing in the 'view' command, enabling faster processing on multi-core systems.^[10]

Version 1.0, released on August 15, 2014, represented a major architectural overhaul, splitting the monolithic package into three coordinated projects: HTSlib as the foundational C library for file I/O, BCFtools for variant calling and manipulation, and SAMtools dedicated to alignment-specific operations.^[17]^[18] This restructuring improved maintainability and interoperability, while introducing native support for the CRAM format, a reference-based compression scheme that reduces file sizes by up to 30% compared to BAM without sacrificing random access efficiency. Automatic detection of input/output formats (SAM, BAM, CRAM) was also added, simplifying user workflows.^[10]^[18]

Post-1.0 releases in the 1.x series prioritized scalability, threading, and integration with modern sequencing paradigms. Version 1.10 (December 2019) introduced on-the-fly indexing during BAM and CRAM writing, eliminating separate indexing steps and accelerating pipeline throughput.^[19] Enhanced multi-threading across read/write operations followed in version 1.11 (September 2020), alongside the new commands 'ampliconclip' and 'ampliconstats' for amplicon-based sequencing, which clip primers and compute coverage metrics to support targeted resequencing applications.^[20]^[10]

Later major releases have refined compression, sorting, and analysis capabilities. Version 1.18 (July 2023) added minimizer-based sorting for faster coordinate queries and a '--duplicate-count' option in 'markdup' to track PCR duplicates more precisely.^[21] Version 1.19 (December 2023) extended the 'coverage' command with '--plot-depth' for visualizing read depth distributions and introduced lexicographical name sorting in 'merge' and 'sort' for consistent handling of multi-sample data.^[22] Version 1.20 (April 2024) introduced '--max-depth' for 'bedcov' and support for multiple '-d' options in 'fastq'.^[23] Version 1.21 (September 2024) added region filtering for 'cat' and a 'reset' command to remove auxiliary tags.^[24] Version 1.22 (May 2025) debuted the 'checksum' command for integrity verification of alignment files and shifted the default CRAM output to version 3.1, leveraging advanced codecs for better compression ratios while maintaining backward compatibility with tools supporting CRAM 3.0.^[25] The most recent update, version 1.22.1 (July 2025), primarily addresses bugs, including a use-after-free issue in 'mpileup -a' and buffer overflows in CRAM parsing, ensuring robustness for large-scale genomic analyses.^[26]

Supported Formats

SAM Format

The Sequence Alignment/Map (SAM) format is a TAB-delimited text-based standard for representing alignments of sequencing reads against reference sequences, designed to accommodate both short and long reads from diverse high-throughput sequencing platforms.^[9] Introduced in 2009 as part of the SAMtools suite, it addresses the need for a unified format amid varying alignment tools and sequencing technologies, such as Illumina and Roche/454, enabling efficient downstream analyses like variant calling and genotyping.^[9] The format supports up to 128 megabase pairs per read and scales to datasets exceeding 100 gigabase pairs, as demonstrated in its adoption by the 1000 Genomes Project for processing large-scale human genome alignments.^[9]

A SAM file consists of two main sections: an optional header and an alignment section.^[27] The header begins with lines prefixed by '@', providing metadata such as reference sequence dictionary entries (e.g., @SQ lines specifying sequence names and lengths), read group information (@RG for sample metadata like platform and library), program details (@PG for analysis steps), and comments (@CO).^[27] These header lines ensure reproducibility and facilitate coordinate sorting, with the specification recommending a specific order for tags to maintain compatibility.^[28] The alignment section follows, where each line represents a single read alignment or unmapped sequence, using 1-based coordinates for positions to align with biological conventions.^[27]

Each alignment line in the SAM format includes exactly 11 mandatory fields, separated by tabs, followed by zero or more optional fields.^[28] These mandatory fields are:

Field	Description	Example
QNAME	Query name (read identifier), up to 254 characters	read_001
FLAG	Bitwise flag indicating read properties (e.g., paired, unmapped)	0 (unpaired, mapped)
RNAME	Reference sequence name, or '*' if unmapped	chr1
POS	1-based leftmost mapping position, or 0 if unmapped	1000
MAPQ	Mapping quality (Phred-scaled probability of random placement), 255 for unavailable	60
CIGAR	Concise Idiosyncratic Gapped Alignment Report string describing matches, insertions, deletions, etc.	50M (50 matches)
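RNEXT	Reference name of the mate or next read (MRNM in early drafts); '=' if identical to RNAME, or '*' if unavailable	= (same as RNAME)
PNEXT	1-based position of the mate or next read (MPOS in early drafts), or 0 if unavailable	2000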
TLEN	Observed template length (insert size), signed	1000
SEQ	Query sequence as a string of ACGTN, or '*' if unavailable	AGCT...
QUAL	ASCII-encoded Phred quality scores for SEQ bases (+33 offset), or '*' if unavailable	!''*+...
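
The layout described above (optional '@' header lines, then one tab-separated record per alignment) can be read with a few lines of Python. This is a minimal sketch, not how SAMtools or HTSlib parse files: `SamRecord`, `parse_sam`, and `phred_scores` are hypothetical names, the mate fields are named after the current specification's RNEXT and PNEXT, and the example record is made up for illustration.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM fields plus any raw optional fields."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost position, 0 if unmapped
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]  # optional fields kept as raw TAG:TYPE:VALUE strings


def parse_sam(text):
    """Split SAM text into header lines (grouped by record type) and records."""
    header = {}
    records = []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            # Header line: "@SQ<TAB>SN:chr1<TAB>LN:..." -> key "@SQ"
            kind, _, rest = line.partition("\t")
            header.setdefault(kind, []).append(rest)
            continue
        f = line.split("\t")
        if len(f) < 11:
            raise ValueError(f"expected at least 11 fields, got {len(f)}")
        records.append(SamRecord(
            f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
            f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:],
        ))
    return header, records


def phred_scores(qual):
    """Decode a Phred+33 QUAL string into integer quality scores."""
    return [] if qual == "*" else [ord(c) - 33 for c in qual]


# Made-up example: one header with a single reference, one mapped read.
EXAMPLE = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "read_001\t0\tchr1\t1000\t60\t4M\t*\t0\t0\tAGCT\tIIII\tNM:i:0\n"
)

header, records = parse_sam(EXAMPLE)
print(header["@SQ"])
print(records[0].rname, records[0].pos, phred_scores(records[0].qual))
```

Treating any line that starts with '@' as a header line is safe here because the specification's QNAME character set excludes '@'.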

The FLAG field is a 12-bit integer encoding read attributes, such as 0x0001 for paired-end reads, 0x0004 for unmapped reads, and 0x0400 for duplicate-marked reads, allowing tools to filter or process alignments based on these properties.^[27] The CIGAR string uses extended operators like 'M' for alignment match/mismatch, 'I' for insertion to the reference, 'D' for deletion from the reference, 'N' for skipped reference regions (e.g., introns), and 'S' for soft-clipped bases outside the aligned region.^[9]

Optional fields appear after the mandatory ones in the format TAG:TYPE:VALUE, where TAG is a two-character code (e.g., 'NM' for edit distance), TYPE specifies the value format (e.g., 'i' for integer, 'Z' for string, 'f' for float), and VALUE holds the data.^[27] Common tags include AS for alignment score, MD for mismatch details, and RG for read group assignment, with over 100 standardized tags defined in the SAMtags specification to support advanced features like structural variant annotations.^[29] These optional fields enhance flexibility without mandating their presence, ensuring the format remains lightweight for basic use while extensible for complex analyses.^[30]

The SAM format's text-based design promotes human readability and interoperability across tools, but for efficiency, it pairs with the binary BAM equivalent, which compresses files (e.g., reducing a 112 Gbp dataset from 116 GB to under 30 GB) while preserving all information for random access via indexing.^[9] The specification, initially version 1.0 released in 2013 and maintained by the Global Alliance for Genomics and Health (GA4GH), has been updated to version 1.6 as of November 2024, with ongoing refinements.^[28]
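
To make the FLAG bits, CIGAR operators, and TAG:TYPE:VALUE fields concrete, the following Python sketch decodes each of them. The function names are hypothetical, the flag labels follow the short names commonly used for the twelve bits, and the tag parser only converts the 'i' and 'f' types, leaving every other type as a string.

```python
import re

# The twelve FLAG bits defined by the SAM specification.
FLAG_BITS = {
    0x1: "PAIRED", 0x2: "PROPER_PAIR", 0x4: "UNMAP", 0x8: "MUNMAP",
    0x10: "REVERSE", 0x20: "MREVERSE", 0x40: "READ1", 0x80: "READ2",
    0x100: "SECONDARY", 0x200: "QCFAIL", 0x400: "DUP", 0x800: "SUPPLEMENTARY",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """List the names of all bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reference_span(cigar):
    """Reference bases covered by a CIGAR: M, D, N, = and X consume the reference."""
    if cigar == "*":
        return 0
    ops = CIGAR_OP.findall(cigar)
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return sum(int(n) for n, op in ops if op in "MDN=X")


def parse_tag(field):
    """Split one optional field into (tag, value), converting 'i' and 'f' types."""
    tag, typ, value = field.split(":", 2)
    if typ == "i":
        return tag, int(value)
    if typ == "f":
        return tag, float(value)
    return tag, value  # A, Z, H and B values are left as strings in this sketch


print(decode_flag(99))
print(reference_span("10S30M2I5D20M"))
print(parse_tag("NM:i:3"))
```

Soft clips ('S') and insertions ('I') add to the read length but not to the reference span, which is why the leftmost position plus the reference span gives the end of the alignment on the reference.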

BAM Format

The BAM (Binary Alignment/Map) format is a compressed binary representation of the SAM (Sequence Alignment/Map) format, designed for efficient storage and random access to high-throughput sequencing alignment data.^[28]^[4] Developed as part of the SAMtools suite, BAM retains all information from SAM, including mandatory fields such as query name (QNAME), alignment flag (FLAG), and reference sequence position (POS), while encoding them in a compact binary structure to reduce file sizes significantly; for instance, compressing 112 Gbp of uncompressed SAM data (approximately 116 GB) to under 30 GB.^[4] This format uses little-endian byte order and supports both aligned and unaligned reads, making it suitable for diverse sequencing platforms and read types.^[28]

A BAM file begins with a 4-byte magic string "BAM\1", followed by a uint32_t indicating the length of the header section, which contains the textual SAM header (e.g., lines starting with @HD or @SQ) in raw form.^[28] This is succeeded by a uint32_t specifying the number of reference sequences, each described by a binary entry with the sequence name length (uint32_t), name (null-terminated string), and length (uint32_t).^[28] Alignment records follow, each prefixed by a uint32_t block size (excluding the size field itself) and comprising core fields in this order: reference ID (int32_t, -1 for unmapped), position (0-based int32_t), read name length (uint8_t), mapping quality (uint8_t), bin (uint16_t), number of CIGAR operations (uint16_t), FLAG (uint16_t), sequence length (uint32_t), next reference ID and position (int32_t each), and template length (int32_t). These are followed by variable-length arrays for the read name, the CIGAR operations (uint32_t array), the sequence (uint8_t array packing two bases per byte at 4 bits each), and the quality scores (one byte per base).^[28] Optional tags are encoded as key-value pairs (2-byte tag + 1-byte type + variable value), allowing flexible extension without altering the core structure.^[28]

BAM files employ BGZF (Blocked GNU Zip Format) compression, a gzip-compatible method that divides data into independent blocks of up to 64 KB, enabling parallel decompression and random access without full file loading.^[28]^[4] This compression is handled via the HTSlib library integrated with SAMtools, which also facilitates conversion between SAM and BAM (e.g., processing 112 Gbp of data in about 10 hours on standard hardware).^[4]

For efficient querying, BAM files must be sorted by coordinate and indexed using the BAI (BAM index) format, which employs a hierarchical binning system based on genomic regions (bin 0 spans a 512 Mbp region, with progressively finer bins down to 16 kbp) combined with a linear index of file offsets.^[28] This indexing allows retrieval of alignments overlapping a specific interval with typically one disk seek, supporting operations like samtools view on regions with low memory overhead (under 30 MB for large datasets).^[28]^[4]

Key differences from SAM include the shift to 0-based positioning (versus SAM's 1-based), binary encoding of sequences and qualities (bases packed into 4-bit values using the order =ACMGRSVTWYHKDBN), and support for CIGAR strings with more than 65,535 operations via the CG:B,I optional tag, which avoids overflowing the 16-bit operation count.^[28] These features make BAM the preferred format for SAMtools workflows, such as sorting (samtools sort), merging, and statistical analysis, where it provides substantial speed and space savings over text-based alternatives.^[4] The format's specification, version 1.6 as of November 2024, ensures backward compatibility while accommodating evolving sequencing technologies.^[28]
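
Three of the mechanisms described above are small enough to sketch in Python: reading a BGZF block's size from its gzip extra field, unpacking 4-bit bases, and computing the index bin for a region. The function names are hypothetical. `reg2bin` is a transcription of the C routine given in the SAM specification, and `BGZF_EOF` is the 28-byte empty block that the specification defines as the end-of-file marker; nothing here is taken from the HTSlib source.

```python
import gzip
import struct

# The empty BGZF block that terminates a well-formed BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"
    "42430200" "1b00" "0300" "00000000" "00000000"
)

SEQ_CODES = "=ACMGRSVTWYHKDBN"  # 4-bit code -> base


def bgzf_block_size(block):
    """Total size in bytes of the BGZF block at the start of `block`."""
    if block[:3] != b"\x1f\x8b\x08" or not block[3] & 4:
        raise ValueError("not a gzip member with an extra field")
    xlen = struct.unpack_from("<H", block, 10)[0]
    extra = block[12:12 + xlen]
    off = 0
    while off + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, off)
        if (si1, si2, slen) == (66, 67, 2):  # the 'BC' subfield holds BSIZE
            return struct.unpack_from("<H", extra, off + 4)[0] + 1
        off += 4 + slen
    raise ValueError("BC subfield not found")


def decode_seq(packed, l_seq):
    """Unpack l_seq bases stored two per byte, high nibble first."""
    bases = []
    for i in range(l_seq):
        byte = packed[i // 2]
        nibble = byte >> 4 if i % 2 == 0 else byte & 0x0F
        bases.append(SEQ_CODES[nibble])
    return "".join(bases)


def reg2bin(beg, end):
    """Smallest bin containing the 0-based half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


print(bgzf_block_size(BGZF_EOF), gzip.decompress(BGZF_EOF))
print(decode_seq(bytes([0x12, 0x48]), 4))
print(reg2bin(0, 1), reg2bin(0, 1 << 29))
```

Because each BGZF block is a complete gzip member, the standard `gzip` module can decompress it directly, which is what makes the format gzip-compatible while still allowing a reader to seek to any block boundary.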

CRAM Format

The CRAM (Compressed Reference-oriented Alignment Map) format is a reference-based columnar storage format designed for high-efficiency compression of biological sequence alignments, offering significant space savings over the BAM format while maintaining full compatibility with the SAM specification. Developed as an extension of the SAM/BAM ecosystem, CRAM encodes alignment data in a way that leverages the reference genome to store only differences from the reference sequence, such as substitutions, insertions, and deletions, rather than full read sequences. This approach enables lossless compression ratios typically 3 to 4 times better than BAM for short-read data, with file sizes reduced by 50-70% in practice for Illumina sequencing outputs.^[31]^[32]

CRAM files are structured as a sequence of containers, beginning with a fixed 26-byte file definition that holds the "CRAM" magic number, the major and minor format version, and a 20-byte file identifier, followed by a CRAM header container storing the SAM header, including reference sequence entries with MD5 checksums for validation. Subsequent data containers group alignments into slices, logical units whose size is chosen by the writer (HTSlib defaults to 10,000 records per slice), each governed by a compression header that defines per-field encoding parameters. Slices consist of a core data block (a bit-packed stream of encoded alignment fields) and optional external blocks (byte streams for less compressible data like read names or quality scores), with all blocks compressed using algorithms such as gzip, LZMA, or rANS entropy coding. The file concludes with an EOF container for integrity verification. This modular design supports random access via external indexing (e.g., .crai files) and selective decoding, allowing tools to load only relevant slices without decompressing the entire file.^[31]

Encoding in CRAM is field-specific and adaptive: core alignment attributes (e.g., flags, mapping quality, positions) use variable-length integer encodings like ITF-8 or Huffman coding, while read features (e.g., base substitutions as delta offsets from the reference) are represented as arrays of operations to reconstruct the sequence on demand. A reference is normally required for decoding, although CRAM can optionally embed the relevant reference segments for portability, with MD5-based validation to ensure consistency. Version 3.1, released in 2021, introduced advanced codecs including rANS4x16 for faster entropy encoding, adaptive arithmetic coding, fqzcomp for quality scores, and a name tokenizer for read identifiers, yielding 7-15% additional compression gains over version 3.0 for high-coverage short reads, and enabling processing speeds up to 3 times faster than equivalent BAM operations in benchmarks like samtools flagstat on large datasets. CRAM version 3.1 became the default CRAM output version with the 1.22 releases of SAMtools and HTSlib (May 2025).^[31]^[32]^[33] These enhancements are implemented in the HTSlib library, ensuring seamless integration with SAMtools commands such as view, sort, and index for CRAM I/O.^[31]^[32]

Compared to BAM's record-oriented BGZF compression, CRAM's columnar structure and reference dependency reduce redundancy in aligned data, making it particularly advantageous for archival storage in large-scale genomics projects. The cost is that the reference must be available during decoding, a trade-off mitigated by widespread reference standardization. Support for controlled lossy compression (e.g., via quality score binning) further optimizes space for variant calling pipelines without impacting accuracy in downstream analyses. CRAM was first integrated into SAMtools with version 1.0 in 2014, evolving from early prototypes at the European Bioinformatics Institute and the Wellcome Sanger Institute into a standard part of the suite.^[18]^[10]
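
Two of the low-level pieces mentioned above, the 26-byte file definition and ITF-8 integers, can be decoded with the sketch below. It assumes the layout given in the CRAM specification (4-byte "CRAM" magic, one byte each for major and minor version, 20-byte file identifier) and the specification's ITF-8 rule that the number of leading 1 bits in the first byte gives the number of following bytes. `parse_cram_file_definition` and `read_itf8` are hypothetical names, and the sample bytes are constructed by hand rather than read from a real file.

```python
import struct


def parse_cram_file_definition(data):
    """Return (major, minor, file_id) from the first 26 bytes of a CRAM file."""
    if len(data) < 26:
        raise ValueError("a CRAM file definition is 26 bytes long")
    magic, major, minor, file_id = struct.unpack_from("<4sBB20s", data, 0)
    if magic != b"CRAM":
        raise ValueError("missing CRAM magic number")
    return major, minor, file_id.rstrip(b"\x00")


def read_itf8(buf, offset=0):
    """Decode one ITF-8 integer; returns (signed 32-bit value, next offset)."""
    first = buf[offset]
    if first < 0x80:      # 0xxxxxxx: value fits in 7 bits
        return first, offset + 1
    if first < 0xC0:      # 10xxxxxx + 1 byte
        value = ((first & 0x3F) << 8) | buf[offset + 1]
        return value, offset + 2
    if first < 0xE0:      # 110xxxxx + 2 bytes
        value = ((first & 0x1F) << 16) | (buf[offset + 1] << 8) | buf[offset + 2]
        return value, offset + 3
    if first < 0xF0:      # 1110xxxx + 3 bytes
        value = (((first & 0x0F) << 24) | (buf[offset + 1] << 16)
                 | (buf[offset + 2] << 8) | buf[offset + 3])
        return value, offset + 4
    # 1111xxxx + 4 bytes; only the low 4 bits of the last byte are used
    value = (((first & 0x0F) << 28) | (buf[offset + 1] << 20)
             | (buf[offset + 2] << 12) | (buf[offset + 3] << 4)
             | (buf[offset + 4] & 0x0F))
    if value >= 1 << 31:  # reinterpret as a signed 32-bit integer
        value -= 1 << 32
    return value, offset + 5


# Hand-built file definition claiming CRAM version 3.1.
definition = b"CRAM" + bytes([3, 1]) + b"example.cram".ljust(20, b"\x00")
print(parse_cram_file_definition(definition))
print(read_itf8(b"\x80\x80"), read_itf8(b"\xff\xff\xff\xff\x0f"))
```

Small values, which dominate fields such as flags and position deltas, occupy a single byte under ITF-8, which is part of why the columnar layout compresses well.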

Core Commands

File Manipulation

SAMtools offers a suite of core commands dedicated to file manipulation tasks, enabling users to view, convert, sort, index, merge, and otherwise process alignment files in SAM, BAM, and CRAM formats. These utilities are fundamental for handling high-throughput sequencing data, as they facilitate efficient data preparation, subset extraction, and integration in bioinformatics pipelines without requiring full reloading of large files. Designed for speed and low memory usage, these commands leverage the compressed BAM and CRAM formats to manage terabyte-scale datasets effectively.^[14]

The samtools view command serves as the primary tool for inspecting and converting alignment files. It extracts and prints alignments from input files to standard output in SAM format by default, but supports output conversion to BAM (-b) or CRAM (-C) via specified options, with the output file designated using -o. For region-specific extraction, an indexed input file is required, allowing queries like samtools view input.bam chr1:10000-20000 to retrieve alignments in a genomic interval. This command is versatile for initial data exploration and format interoperability, often piped to other tools for downstream analysis.^[34]

Sorting alignments is handled by samtools sort, which rearranges records by coordinate (default) or read name (-n) to produce a sorted BAM output. Essential for indexing and efficient querying, it writes temporary files under the prefix given by -T and supports multi-threading with -@ for parallel processing on large inputs. For instance, samtools sort -o sorted.bam input.bam generates a coordinate-sorted file suitable for subsequent operations, reducing random access times in variant calling workflows.^[35]

Once sorted, files can be indexed using samtools index to enable rapid region-based access. This creates a .bai (BAI) or .csi (CSI) index file alongside the input, with the -b or -c option selecting the index type; BAI is the default for BAM and supports reference sequences of up to 2^29 bases (512 Mbp), beyond which CSI is needed. The command requires coordinate-sorted input and is non-destructive, as in samtools index sorted.bam, which produces sorted.bam.bai for use with tools like samtools view or genome browsers. Indexing dramatically improves performance on datasets exceeding gigabytes, avoiding sequential scans.^[36]

Merging multiple sorted files is accomplished with samtools merge, which combines inputs while maintaining sort order and merging headers. It accepts a list of BAM/CRAM files as arguments, outputting to a specified file, and includes options like -n for inputs sorted by read name or -f to force overwriting. An example workflow is samtools merge -o merged.bam sample1.bam sample2.bam, ideal for consolidating lane-level alignments from sequencing runs. For simpler concatenation without order preservation, samtools cat joins files of compatible formats using -h for a shared header source.

Additional manipulation includes samtools split, which partitions a file by read group (RG) tag into separate outputs named after the input, as in samtools split merged.bam yielding files such as merged_0.bam and merged_1.bam under the default naming pattern. Header replacement is streamlined by samtools reheader, which applies a new SAM header to a BAM/CRAM file efficiently: samtools reheader newheader.sam input.bam > output.bam. These commands support targeted file restructuring, such as separating samples or correcting metadata, enhancing data organization in multi-sample studies.

For read shuffling and grouping, samtools collate places reads that share a name next to each other without performing a full sort, as in samtools collate -o output.bam input.bam, which is a fast way to prepare input for fixmate and duplicate marking. Complementing this, samtools fixmate fills in mate coordinates, insert sizes, and mate-related flags in name-sorted or name-collated files, as in samtools fixmate -O bam input.bam output.bam, and optionally removes secondary and unmapped reads (-r). Finally, samtools markdup identifies and marks duplicates in coordinate-sorted files that were previously processed with fixmate -m, with options to remove them (-r) or output statistics (-s), as in samtools markdup input.bam output.bam. These utilities ensure data integrity during manipulation, critical for accurate downstream analyses like variant detection.
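
The duplicate-marking preparation described above (collate, fixmate, sort, markdup, then index) can be sketched as a small Python helper that assembles the command lines. This is an illustrative sketch rather than an official recipe: `markdup_commands` and `run_pipeline` are hypothetical helpers, the intermediate file names are made up, and `-@` is passed as the number of additional threads. Only `run_pipeline` actually invokes samtools, and it is not called here.

```python
import shlex
import shutil
import subprocess


def markdup_commands(input_bam, prefix, extra_threads=0):
    """Build the samtools command lines for a duplicate-marking pipeline.

    `prefix` is a path prefix for intermediate and final files (hypothetical
    naming scheme: <prefix>.collated.bam, <prefix>.fixmate.bam, ...).
    """
    threads = ["-@", str(extra_threads)]
    collated = f"{prefix}.collated.bam"
    fixed = f"{prefix}.fixmate.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    marked = f"{prefix}.markdup.bam"
    return [
        # Group reads with the same name together (no full sort needed).
        ["samtools", "collate", *threads, "-o", collated, input_bam],
        # Fill in mate fields; -m adds the mate score tags markdup relies on.
        ["samtools", "fixmate", *threads, "-m", collated, fixed],
        # markdup needs coordinate-sorted input.
        ["samtools", "sort", *threads, "-o", sorted_bam, fixed],
        ["samtools", "markdup", *threads, sorted_bam, marked],
        ["samtools", "index", *threads, marked],
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on PATH")
    for cmd in commands:
        subprocess.run(cmd, check=True)


for cmd in markdup_commands("input.bam", "work/sample1", extra_threads=2):
    print(shlex.join(cmd))
```

Building the argument lists separately from running them keeps the pipeline easy to inspect or log before any large file is touched, and passing lists rather than a shell string avoids quoting problems with unusual file names.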

Alignment Processing

SAMtools provides a suite of commands for processing alignment files in SAM, BAM, and CRAM formats, enabling tasks such as viewing, filtering, sorting, indexing, and merging alignments to facilitate downstream genomic analysis.^[14] These operations are essential for managing high-throughput sequencing data, ensuring efficient access and manipulation of read alignments against reference genomes.^[14]

The samtools view command is a foundational tool for extracting and filtering alignments from input files. It converts between formats (e.g., BAM to SAM) and restricts output to specific genomic regions using positional arguments or the -L option for BED files, allowing users to focus on subsets of data without loading entire files into memory.^[34] For instance, samtools view -b input.bam chr1:1000-2000 outputs BAM-format alignments for the specified region, supporting rapid querying in large datasets.^[34]

Sorting and indexing are critical for optimizing alignment files for random access and efficient processing. The samtools sort command rearranges alignments by coordinate (default) or read name (with -n), producing sorted output that is a prerequisite for many analyses; it uses temporary files specified by -T to handle large inputs.^[14] Following sorting, samtools index generates binary indexes (BAI for BAM with -b, or CSI for larger references with -c), enabling fast region-based retrieval via samtools view without rescanning the entire file.^[14] This combination significantly reduces computational overhead in workflows involving repeated region queries.^[14]

Merging and concatenation support the integration of multiple alignment files, often from parallel processing or multi-sample experiments. The samtools merge command combines sorted files while preserving order and merging headers (optionally from a separate file with -h), suitable for consolidating data from distributed alignments; it includes -f to force overwriting existing outputs.^[14] In contrast, samtools cat simply concatenates unsorted files with identical reference dictionaries, providing a lightweight option for appending alignments without re-sorting.^[14]

Duplicate handling and mate-pair fixing address common artifacts in sequencing data. The samtools markdup command identifies and marks PCR or optical duplicates based on mapping coordinates and orientation, with options like -r to remove them and -s to report basic statistics, improving accuracy in variant calling pipelines.^[14] Similarly, samtools fixmate updates mate information in name-sorted files, correcting flags and positions for paired-end reads (using -m to add the mate score tags that markdup relies on), which is vital for downstream paired-end analyses.^[14]

Additional processing tools include samtools split for dividing files by read group into separate outputs, aiding in sample-specific workflows, and samtools reheader for replacing headers without altering alignment records, useful for updating metadata in CRAM files (with -i for in-place modification).^[14] These commands collectively enable robust preprocessing of alignments, ensuring data integrity and compatibility with tools like variant callers.^[14]
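
What a region query such as chr1:1000-2000 selects, and how flag and mapping-quality filters combine with it, can be illustrated with a short Python sketch. It assumes 1-based inclusive region coordinates and mirrors the usual semantics of requiring all bits of one mask and rejecting any bit of another. The function names are hypothetical, and unlike HTSlib the region parser does not handle reference names that themselves contain colons.

```python
def parse_region(region):
    """Parse 'name', 'name:start' or 'name:start-end' (1-based, inclusive).

    Returns (name, start, end) with end=None meaning 'to the end of the
    reference'. Commas inside numbers are ignored.
    """
    name, sep, span = region.rpartition(":")
    if not sep:
        return region, 1, None
    start_text, dash, end_text = span.replace(",", "").partition("-")
    start = int(start_text)
    end = int(end_text) if dash and end_text else None
    return name, start, end


def overlaps(pos, ref_span, start, end):
    """True if a read at 1-based `pos` covering `ref_span` bases hits the region."""
    read_end = pos + max(ref_span, 1) - 1
    return read_end >= start and (end is None or pos <= end)


def passes_filters(flag, mapq, require=0, exclude=0, min_mapq=0):
    """Keep a read if all `require` bits are set, no `exclude` bit is set,
    and its mapping quality is at least `min_mapq`."""
    return (flag & require) == require and (flag & exclude) == 0 and mapq >= min_mapq


name, start, end = parse_region("chr1:1000-2000")
print(name, start, end)
print(overlaps(990, 50, start, end))
print(passes_filters(flag=99, mapq=60, require=0x2, exclude=0x904, min_mapq=30))
```

A read only needs to overlap the interval to be reported, so alignments that start before the region but extend into it are included; this is why the reference span derived from the CIGAR string matters and not just the leftmost position.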

Statistics and Inspection

SAMtools provides several commands dedicated to generating statistics and inspecting alignment files in SAM, BAM, or CRAM formats, enabling users to assess data quality, alignment coverage, and read properties without extensive processing. These tools are essential for quality control in high-throughput sequencing workflows, offering both summary metrics and detailed views of the data. The primary commands include samtools stats, samtools flagstat, samtools idxstats, and samtools view, each targeting specific aspects of file inspection and statistical analysis.^[14]

The samtools stats command computes comprehensive statistics from alignment files, producing a text-based report that can be visualized using the accompanying plot-bamstats script. It categorizes metrics into sections such as summary numbers (e.g., total reads, mapped percentage), insert size distributions, coverage depths, and GC-content biases, distinguishing between paired and unpaired reads based on SAM flags like PAIRED (0x1), READ1 (0x40), and READ2 (0x80). For instance, it reports averages like mean insert size, along with histograms for insert sizes and quality scores, allowing users to identify issues such as library preparation artifacts or sequencing biases. Options like -c to set the coverage histogram range (default: 1 to 1000) or -d to exclude duplicates enable targeted analysis, and region-specific queries are supported when the input is indexed.^[37]^[28]

For flag-based inspection, samtools flagstat analyzes the FLAG field of alignments according to the SAM specification, counting reads in categories such as total, mapped, paired, duplicates, and properly paired. It outputs counts and percentages for primary, secondary, and supplementary alignments, split by QC pass/fail status (FLAG 0x200), providing a quick overview of mapping quality and potential artifacts like unmapped or improperly paired reads. The default output is human-readable (e.g., "122 + 28 in total"), but it supports JSON or TSV formats for programmatic use, with multi-threading via -@ for large files. This tool is lightweight and runs in a single pass, making it ideal for rapid quality checks during alignment pipelines.^[38]^[28]

The samtools idxstats command retrieves per-reference-sequence statistics, reporting the reference name, length, number of mapped read segments, and number of unmapped read segments in a TAB-delimited format. It is fastest on a BAM file indexed with samtools index, where the counts are read from the index; other inputs are processed by a full scan, which is slower for large datasets. This is valuable for inspecting alignment distribution across chromosomes or contigs, highlighting uneven coverage or unmapped portions, though it may overcount multi-mapped or fragmented reads. The final output line uses the name "*" for unplaced unmapped reads (e.g., "* 0 0 100"), aiding in decisions about downstream filtering.^[39]^[36]

Inspection at the alignment level is facilitated by samtools view, which extracts and displays records from files, supporting filtering by region, mapping quality (via -q), flags (include/exclude with -f/-F), or tags. For coordinate-sorted and indexed inputs, it enables fast random access to regions (e.g., samtools view input.bam chr1:1000-2000), outputting in SAM, BAM, or CRAM formats. Options like -h include headers, -c counts matches without printing, and -L uses BED files for targeted viewing, making it a versatile tool for detailed examination of specific alignments or conversion during inspection workflows.^[34]
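Because stats, flagstat, and the -f/-F filters of view all interpret the same FLAG bit field, it can help to see how a numeric FLAG breaks down into bits. The following is an illustrative sketch only: decode_flag is a hypothetical helper, and the bit names follow the SAM specification's flag table (SAMtools ships its own samtools flags command for the same purpose).

```shell
# Sketch: print the names of the SAM FLAG bits set in a numeric value.
# decode_flag is a hypothetical helper written for this example.
decode_flag() {
    flag=$1
    names=""
    # Each entry is BIT:NAME, with bit values taken from the SAM specification.
    for pair in 1:PAIRED 2:PROPER_PAIR 4:UNMAP 8:MUNMAP 16:REVERSE 32:MREVERSE \
                64:READ1 128:READ2 256:SECONDARY 512:QCFAIL 1024:DUP 2048:SUPPLEMENTARY; do
        bit=${pair%%:*}
        name=${pair#*:}
        # Keep the name when the corresponding bit is set.
        if [ $(( flag & bit )) -ne 0 ]; then
            names="${names:+$names,}$name"
        fi
    done
    echo "$names"
}
```

For example, decode_flag 99 prints PAIRED,PROPER_PAIR,MREVERSE,READ1, the typical flag of a forward-strand first read in a properly paired alignment.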

Usage and Examples

Basic Usage

SAMtools provides a straightforward command-line interface for essential operations on alignment files in SAM, BAM, and CRAM formats, such as conversion, sorting, indexing, and viewing.^[14] These core functionalities enable users to process high-throughput sequencing data efficiently, often in a Unix pipeline using standard input/output streams.^[14] For instance, input files can be piped from other tools, and remote files accessed via URLs like FTP or HTTP.^[14] Basic operations require the software to be installed and typically involve specifying input files, output options, and optional flags for filtering or formatting.^[14]

The view command is fundamental for inspecting, converting, and filtering alignments. It reads SAM, BAM, or CRAM files and outputs in SAM format by default, but can produce BAM or CRAM with flags.^[34] To convert a SAM file to BAM, use:

samtools view -b input.sam > output.bam

This compresses the text-based SAM into the binary BAM format for compact storage.^[34] For viewing the first few alignments without conversion, pipe to head:

samtools view input.bam | head -5

Region-specific extraction requires a coordinate-sorted and indexed input file. To retrieve the reads overlapping positions 10000-20000 of chromosome 1:

samtools view -b input.bam "1:10000-20000" > region.bam

Filtering by read group, mapping quality, or flags is achieved with options such as -r, -q, and -f/-F (for example, -F 4 keeps only mapped reads).^[34]

Sorting alignments by genomic coordinates is a prerequisite for many downstream analyses, including indexing and visualization. The sort command rearranges records in a BAM or CRAM file, outputting a new sorted file.^[35] A basic example sorts by leftmost coordinate:

samtools sort -o output.sorted.bam input.bam

This uses up to 768 MiB of memory per thread by default (adjustable with -m) and supports multi-threading with -@ for large files; for example, with 4 threads:

samtools sort -@ 4 -o output.sorted.bam input.bam

Sorting by read name instead (useful for paired-end data) adds the -n flag.^[35] Because sort spills to temporary files when the input exceeds its memory limit, it handles datasets of up to hundreds of gigabases with modest resources.^[1]

Indexing a sorted BAM or CRAM file enables rapid random access to specific genomic regions without scanning the entire file. The index command generates a binary index file (BAI for BAM, CRAI for CRAM).^[36] For a sorted BAM:

samtools index output.sorted.bam

This creates output.sorted.bam.bai alongside the input. The same command applies to SAM files, but only if they have been BGZF-compressed first (for example with bgzip); plain gzip compression is not sufficient.^[36] A CSI index, requested with -c, is needed when reference sequences are longer than 512 Mbp, the limit of the BAI format. Once indexed, commands like view leverage the index for efficient querying.^[36]

Basic statistics on alignment files can be generated with flagstat, providing counts of mapped, unmapped, and duplicate reads. Run:

samtools flagstat input.bam

This outputs a summary report, such as total reads and mapping rates, essential for quality assessment.^[14] These operations form the foundation of SAMtools workflows, often chained together for data preparation in genomics pipelines.^[1]
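As an illustration of using the flagstat report in a script, the sketch below recomputes the mapped percentage of QC-passed reads from the default text output. It assumes the line layout shown in the example above ("122 + 28 in total ..." and a matching "... mapped (...)" line); mapped_percent is a hypothetical helper, not a SAMtools command.

```shell
# Sketch: compute the QC-passed mapped percentage from default flagstat text.
# mapped_percent is a hypothetical helper; it reads flagstat output on stdin.
mapped_percent() {
    awk '
        # "N + M in total (...)": first field is the QC-passed read count.
        $4 == "in" && $5 == "total" { total = $1 }
        # "N + M mapped (...)": the "primary mapped" line has a different 4th field.
        $4 == "mapped"              { mapped = $1 }
        END {
            if (total > 0) printf "%.2f\n", 100 * mapped / total
            else print "NA"
        }'
}
```

It would typically be used as samtools flagstat input.bam | mapped_percent, for instance to fail a pipeline step when the value falls below a chosen threshold.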

Common Workflows

One of the most prevalent workflows in high-throughput sequencing analysis involves converting raw FASTQ files to compressed BAM or CRAM formats for efficient storage and downstream processing. This typically begins with alignment of reads to a reference genome using tools like BWA-MEM or minimap2, producing a SAM file that is then processed with SAMtools commands to fix mate-pair information, sort by genomic position, mark duplicates, and convert to the desired format. For instance, the fixmate command fills in mate information for paired-end data, while sort ensures the coordinate ordering essential for indexing and variant calling. The markdup step identifies and flags PCR duplicates to avoid biases in coverage estimates. This workflow reduces file sizes significantly: CRAM files can be 23-55% of the size of the equivalent BAM files (45-77% smaller), depending on the dataset and compression settings. It is foundational for whole-genome sequencing pipelines.^[40]^[41]

A representative pipeline for this conversion pipes the aligner output directly into SAMtools for streaming processing:

minimap2 -a -x sr reference.fa reads.fastq | samtools fixmate -m - - | samtools sort -@ 8 -T /tmp/temp - | samtools markdup -r - final.bam

Streaming in this way avoids intermediate files and leverages multi-threading for speed on large datasets; note that the -r option makes markdup remove duplicates rather than only flag them. Conversion to CRAM requires specifying the reference with -T reference.fa in the view command, enabling reference-based compression that stores differences from the reference rather than full sequences. This approach is particularly useful in storage-constrained environments, as CRAM files can be decoded on the fly without full decompression.^[40]

For whole-genome sequencing (WGS) or whole-exome sequencing (WES), a standard workflow extends the alignment process into variant calling and refinement. After initial mapping with BWA-MEM, SAMtools fixmate and sort produce coordinate-sorted BAM files; base quality score recalibration (BQSR) and duplicate marking are applied using external tools like GATK, and multiple lanes are combined with SAMtools merge. Variant calling then uses bcftools mpileup to compute genotype likelihoods from the BAM, piped into bcftools call for genotyping: bcftools mpileup -Ou -f ref.fa input.bam | bcftools call -mv -Ob -o calls.bcf. This produces a binary BCF file of SNP and indel calls, which can be indexed with bcftools index for quick querying. Depth-related options in mpileup, such as -d (the maximum per-file depth considered at a position, 250 by default), trade sensitivity against run time and memory at the ~30x coverage typical for WGS.^[42]

Post-calling, VCF filtering refines variants using BCFtools to remove artifacts based on quality metrics like QUAL score, read depth (DP), and strand bias (SP). A common post-call filter keeps only sites that satisfy a threshold expression, as in bcftools filter -i 'QUAL>20 && DP>10' -Ob -o filtered.bcf calls.bcf. SNPs and indels can be separated via TYPE for tailored rules; for example, indels may be required to have a minimum number of supporting reads (IDV > 2) to mitigate alignment errors. Pre-call options in mpileup, such as -L (the maximum per-file depth at which indels are called, 250 by default), limit spurious indel calls in very high-coverage regions. This step is crucial for reducing false positives; thresholds are often tuned empirically against truth sets, using bcftools isec to count true and false positives. Integration with annotation tools follows for functional impact assessment.^[43]

CRAM-specific workflows optimize for reference-based storage in collaborative or archival settings, requiring alignments to be position-sorted before encoding to maintain compression ratios around 1:4 to 1:6 versus uncompressed SAM. As of version 1.22 (released May 2025), SAMtools defaults to CRAM 3.1, which provides further compression improvements over previous versions.^[41]^[25] Key commands include samtools view -T ref.fa cram.cram for on-demand decoding, and mpileup can read CRAM directly without conversion. Because decoding depends on the reference, the @SQ header lines should carry MD5 checksums (M5 tags) of the reference sequences, and environment variables like REF_PATH tell HTSlib where to find or fetch references, ensuring seamless access in distributed systems like the European Nucleotide Archive. This format supports workflows where storage is a bottleneck, as partial decoding reduces I/O compared to BAM, though it demands consistent reference availability to avoid decoding failures.^[44]

Basic inspection and quality control form another routine workflow, often preceding analysis. After indexing a BAM with samtools index aligned.bam, the flagstat command summarizes alignment metrics: samtools flagstat aligned.bam, reporting total reads, mapped percentage (typically >95% for good data), and duplicates. Depth statistics via samtools depth -a aligned.bam > coverage.txt quantify per-position coverage, aiding in identifying biases. These steps need only a single pass over the file and provide essential diagnostics without full reprocessing.^[14]
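The per-position table written by samtools depth can be summarized with a few lines of awk. The sketch below assumes the three TAB-separated columns that samtools depth emits for a single input file (reference name, position, depth); mean_depth and breadth_at are hypothetical helpers introduced only for this example.

```shell
# Sketch: summarize "chrom<TAB>pos<TAB>depth" lines, as written by
# samtools depth for a single input file. Both helpers are hypothetical.

# Mean depth over all reported positions.
mean_depth() {
    awk -F '\t' '{ sum += $3; n++ }
        END { if (n > 0) printf "%.2f\n", sum / n; else print "NA" }'
}

# Fraction of reported positions with depth >= the given minimum.
breadth_at() {
    awk -F '\t' -v min="$1" '{ n++; if ($3 >= min) hit++ }
        END { if (n > 0) printf "%.4f\n", hit / n; else print "NA" }'
}
```

Typical use would be mean_depth < coverage.txt or breadth_at 10 < coverage.txt. Because -a makes samtools depth report zero-coverage positions as well, both figures then describe the whole reference rather than only the covered part.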

Integration and Extensions

With HTSlib and BCFtools

SAMtools, BCFtools, and HTSlib form a tightly integrated ecosystem for processing high-throughput sequencing data, with HTSlib serving as the foundational C library that enables efficient reading, writing, and manipulation of formats such as SAM/BAM/CRAM for alignments and VCF/BCF for variants.^[45] HTSlib provides a unified API for these operations, allowing SAMtools and BCFtools to share core functionalities like binary format handling, indexing, and multi-threading support, which enhances performance and ensures compatibility across the tools.^[2] This integration originated from the 2014 restructuring of the original SAMtools project, where HTSlib was extracted as a standalone library to facilitate independent development and third-party embedding, while SAMtools focused on alignment processing and BCFtools on variant calling.^[45]

In practice, SAMtools depends on HTSlib for all file I/O operations, such as sorting, merging, and indexing BAM files, with source distributions including bundled HTSlib copies for standalone builds.^[12] Similarly, BCFtools leverages HTSlib for VCF/BCF manipulation, including conversion between text and binary formats, enabling seamless handling of large-scale variant datasets.^[46] This shared dependency minimizes code duplication, since HTSlib provides the core functionality supporting both tools, and allows updates like on-the-fly indexing during file writing (introduced in version 1.10) to propagate efficiently across SAMtools and BCFtools.^[45]

A primary example of integration occurs in variant calling workflows, where SAMtools prepares aligned BAM files that serve as input to BCFtools' mpileup command, which generates pileups and genotype likelihoods using HTSlib for efficient data access.^[42] For instance, after alignment and sorting with samtools sort and samtools index, the pipeline proceeds to bcftools mpileup -f ref.fa input.bam | bcftools call -mv -Ob -o calls.bcf, producing a compressed BCF file of variants; this one-liner combines pileup generation and calling, relying on HTSlib's threading for speed on multi-core systems.^[47] Such workflows, common in whole-genome sequencing (WGS) and exome analysis, demonstrate how the tools interoperate: SAMtools handles preprocessing and quality control of alignments, while BCFtools performs downstream variant detection and filtering, all underpinned by HTSlib's format-agnostic efficiency.^[42]

Historically, BCFtools evolved from SAMtools' variant-calling components (e.g., the original mpileup and call in SAMtools 0.1.9, 2010), becoming an independent package in the 2014 restructuring to better support multi-sample and gVCF formats.^[45] This modular design has enabled the ecosystem's growth, with extensive ongoing development fostering high-impact applications in genomics research while maintaining low memory usage and platform independence. As of mid-2025, the latest releases are SAMtools 1.22.1, BCFtools 1.22, and HTSlib 1.22.1, continuing to support evolving sequencing technologies and pipelines.^[48]
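The hand-off from SAMtools to BCFtools described above can be wrapped in a small function. This is a sketch under stated assumptions, not a tuned calling pipeline: call_variants and its argument names are hypothetical, the input BAM is assumed to be coordinate-sorted, and the reference FASTA is assumed to be the one used for alignment.

```shell
# Sketch: call variants from a sorted BAM and index the result.
# call_variants and its arguments are hypothetical names for this example.
call_variants() {
    ref=$1     # reference FASTA used for alignment
    bam=$2     # coordinate-sorted BAM
    out=$3     # output BCF path

    # -Ou streams uncompressed BCF between the two steps to avoid
    # compressing and decompressing data inside the pipe.
    bcftools mpileup -Ou -f "$ref" "$bam" |
        bcftools call -mv -Ob -o "$out" || return 1

    # Index the compressed BCF so regions can be queried later.
    bcftools index "$out"
}
```

For example, call_variants ref.fa input.bam calls.bcf reproduces the one-liner from the text and additionally leaves an index next to calls.bcf.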

With Other Bioinformatics Tools

SAMtools is frequently integrated into next-generation sequencing (NGS) workflows alongside aligners such as BWA, where BWA generates initial SAM files from read alignments to a reference genome, and SAMtools subsequently converts these to compressed BAM format, sorts them, and generates indices for efficient downstream processing.^[49] This integration ensures compatibility in standard pipelines, as the sorted, indexed BAM files produced by SAMtools are directly usable by downstream tools.^[5]

In variant calling pipelines, SAMtools pairs with the Genome Analysis Toolkit (GATK) by providing pre-processed BAM files (sorted, indexed, and filtered for mapping quality) that serve as input for GATK's HaplotypeCaller or other callers, enabling accurate identification of single nucleotide variants and indels. For instance, after alignment with BWA and duplicate marking, SAMtools' view and index commands prepare files that GATK requires for base quality score recalibration and joint genotyping across samples.^[49] This combination has become a cornerstone of best practices for high-confidence variant detection in clinical and research sequencing.^[50]

SAMtools also complements Picard tools for quality control and duplicate handling; while Picard's MarkDuplicates identifies and tags PCR or optical duplicates in BAM files, SAMtools can preprocess or postprocess these files via sorting (samtools sort) or fixing mate pairs (samtools fixmate), ensuring seamless interoperability in pipelines that prioritize duplicate removal to reduce bias in downstream analyses.^[51] Although newer SAMtools versions include a markdup command as an alternative, Picard remains preferred in GATK-centric workflows for its robust handling of read groups and metrics reporting.^[5]

For visualization, SAMtools generates indices (via samtools index) that enable loading of BAM or CRAM files into the Integrative Genomics Viewer (IGV), allowing interactive inspection of alignments, coverage, and variants without file conversion. This integration supports exploratory analysis, where users can correlate SAMtools-derived statistics (e.g., from samtools stats) with IGV's graphical overlays to validate pipeline outputs.^[5]

Beyond these, SAMtools interfaces with tools like BEDTools for intersection-based operations on alignments and regions, where indexed BAM files from SAMtools feed into BEDTools' bamtobed or coverage calculations, facilitating tasks such as peak calling integration in ChIP-seq workflows. Overall, SAMtools' standardized formats and Unix-pipe compatibility make it a foundational component in modular pipelines, often scripted with Nextflow or Snakemake for reproducible integration across diverse tools.^[49]
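The BWA-to-SAMtools hand-off mentioned above is commonly written as a single pipe. The following is an illustrative sketch: align_sort_index and its arguments are hypothetical, the reference is assumed to have been indexed beforehand with bwa index, and setting both the read-group ID and SM fields to the sample name is a simplification (real pipelines usually use lane-specific IDs).

```shell
# Sketch: align paired-end reads with BWA-MEM, then sort and index with SAMtools.
# align_sort_index and its arguments are hypothetical names for this example.
align_sort_index() {
    ref=$1       # reference FASTA, already indexed with "bwa index"
    r1=$2        # FASTQ with first reads
    r2=$3        # FASTQ with second reads
    sample=$4    # sample name, reused here as read-group ID (a simplification)

    # The \t sequences stay literal in the shell; bwa converts them to TABs.
    rg="@RG\tID:${sample}\tSM:${sample}"

    # "-" tells samtools sort to read the SAM stream from standard input.
    bwa mem -R "$rg" "$ref" "$r1" "$r2" |
        samtools sort -o "${sample}.sorted.bam" - || return 1

    samtools index "${sample}.sorted.bam"
}
```

The resulting sorted, indexed BAM with a read group is the kind of input that downstream tools such as GATK or IGV expect.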

References

[1]
https://doi.org/10.1093/bioinformatics/btp352
[2]
Samtools
Samtools and BCFtools use HTSlib internally for reading/writing high-throughput sequencing data.
[3]
https://doi.org/10.1093/gigascience/giab008
[4]
The Sequence Alignment/Map format and SAMtools - PMC - NIH
SAMtools is a library and software package for parsing and manipulating alignments in the SAM/BAM format. It is able to convert from other alignment formats, ...
[5]
samtools(1) manual page
[6]
Sequence Alignment/Map format and SAMtools - Oxford Academic
In this article, we present an overview of the SAM format and briefly introduce the companion SAMtools software package. A detailed format specification and the ...
[7]
Twelve years of SAMtools and BCFtools | GigaScience
Feb 16, 2021 · SAMtools and BCFtools are tools for processing sequencing data. SAMtools works with alignment data, while BCFtools handles variant data. ...
[8]
Samtools
Samtools is a suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories.
[9]
samtools/samtools: Tools (written in C using htslib) for ... - GitHub
This is the official development repository for samtools. The original samtools package has been split into three separate but tightly coordinated projects.
[10]
samtools/htslib: C library for high-throughput sequencing data formats
HTSlib is an implementation of a unified C library for accessing common file formats, such as SAM, CRAM and VCF, used for high-throughput sequencing data.
[11]
samtools(1) manual page
May 30, 2025 · Samtools is a set of utilities that manipulate alignments in the SAM (Sequence Alignment/Map), BAM, and CRAM formats.
[12]
The early history of the SAM/BAM format - Heng Li's blog
Jan 27, 2015 · 2008-12-08: Final draft sent to 1000g. Adopted the MIT license. 2008-12-22: First public release of samtools. It is still working on most BAMs ...
[13]
SAM tools - Browse /samtools at SourceForge.net
[14]
SAM tools - Browse /samtools/1.0 at SourceForge.net
Release files for SAMtools 1.0, dated 2014-08-15.
[15]
Samtools CRAMS in support for improved compression formats
Aug 15, 2014 · Samtools 1.0 is freely available at http://www.htslib.org/. This new version supports the highly efficient genomic data format CRAM, adds new ...
[16]
samtools(1) manual page
Dec 6, 2019 · Samtools is a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the SAM (Sequence Alignment/Map) format.
[17]
samtools(1) manual page
Sep 22, 2020 · Samtools is a set of utilities that manipulate alignments in the SAM (Sequence Alignment/Map), BAM, and CRAM formats.
[18]
sam(5) manual page - Samtools
Sequence Alignment/Map (SAM) format is TAB-delimited. Apart from the header lines, which are started with the `@' symbol, each alignment line consists of:
[19]
[PDF] Sequence Alignment/Map Format Specification - Samtools
Nov 6, 2024 · Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and variable number of optional fields ...
[20]
[PDF] Sequence Alignment/Map Optional Fields Specification - Samtools
The SAM format can be used to represent de novo assemblies, generally by using padded reference sequences and the annotation tags described here ...
[21]
samtools/hts-specs: Specifications of SAM/BAM and related ... - GitHub
SAMv1.tex is the canonical specification for the SAM (Sequence Alignment/Map) format, BAM (its binary equivalent), and the BAI format for indexing BAM files.
[22]
[PDF] CRAM format specification (version 3.1) - Samtools
Sep 4, 2024 · A CRAM file consists of a fixed length file definition, followed by a CRAM header container, then zero or more data containers, and finally a ...
[23]
CRAM 3.1: advances in the CRAM file format - PMC - NIH
CRAM 3.1 is the first major update to CRAM since 2014. It keeps the underlying format unchanged, but adds new compression codecs. This is the first CRAM version ...
[24]
samtools-view(1) manual page
May 30, 2025 · Prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format (with no header).
[25]
samtools-sort(1) manual page
[26]
samtools-index(1) manual page
May 30, 2025 · Index coordinate-sorted BGZIP-compressed SAM, BAM or CRAM files for fast random access. Note for SAM this only works if the file has been BGZF compressed first.
[27]
samtools-stats(1) manual page
May 30, 2025 · samtools stats collects statistics from BAM files and outputs in a text format. The output can be visualized graphically using plot-bamstats.
[28]
samtools-flagstat(1) manual page
[29]
samtools-idxstats(1) manual page
May 30, 2025 · Retrieve and print stats in the index file corresponding to the input file. Before calling idxstats, the input BAM file should be indexed by samtools index.
[30]
FASTQ to BAM / CRAM - Samtools
It is possible to store unaligned data in BAM or CRAM, and indeed it may be preferable as it permits meta-data in the header and per-record auxiliary tags.
[31]
WGS/WES Mapping to Variant Calls - Samtools
The standard workflow for working with DNA sequence data consists of three major steps: Mapping; Improvement; Variant Calling. Mapping. For reads from 70bp up ...
[32]
Filtering of VCF Files - Samtools
This guide is covering filtering a single WGS sample with an expectation of broadly even allele frequencies and therefore any recommendations need to be taken ...
[33]
Using CRAM within Samtools
For a workflow this has a few fundamental effects: Alignments should be kept in chromosome/position sort order. The reference must be available at all times.
[34]
Twelve years of SAMtools and BCFtools - PMC - PubMed Central
The 1.0 release introduced support for the better-compressed CRAM format [11]. ... Heng Li, Department of Data Sciences, Dana-Farber Cancer Institute, 450 ...
[35]
samtools/bcftools: This is the official development ... - GitHub
This is the official development repository for BCFtools. It contains all the vcf* commands which previously lived in the htslib repository.
[36]
Variant calling - bcftools mpileup - Samtools
The first mpileup part generates genotype likelihoods at each genomic position with coverage. The second call part makes the actual calls.
[37]
From FastQ data to high confidence variant calls: the Genome ... - NIH
This unit describes how to use BWA and the Genome Analysis Toolkit (GATK) to map genome sequencing data to a reference and produce high-quality variant calls.
[38]
Best practices for variant calling in clinical sequencing
Oct 26, 2020 · Tools such as Picard [28] and Sambamba [29] identify and mark duplicate reads in a BAM file to exclude them from downstream analysis. The GATK ...
[39]
Read groups - GATK - Broad Institute
Note that some Picard tools have the ability to modify ID s when merging SAM files in order to avoid collisions. In Illumina data, read group ID s are composed ...