BAM (file format)
The BAM (Binary Alignment/Map) file format is a compressed, binary representation of the SAM (Sequence Alignment/Map) format, designed for efficient storage of, and random access to, aligned high-throughput sequencing data in bioinformatics applications.[1][2] It serves as a standard container for genomic alignments generated from diverse sequencing technologies, such as Illumina, Roche/454, and AB/SOLiD, enabling downstream analyses like variant calling and structural variation detection.[2]
Developed in 2009 by a team including Heng Li, Bob Handsaker, and Richard Durbin as part of the 1000 Genomes Project Data Processing Subgroup, BAM addresses the challenges of managing massive datasets from next-generation sequencing, where alignments can span billions of base pairs.[2] The format was introduced alongside SAMtools, a suite of utilities for manipulating SAM and BAM files, to standardize alignment reporting across tools and promote interoperability in genomic research.[2] Its specification has evolved through several versions; SAM/BAM v1.6, the current version as of 2024, incorporates enhancements for long reads and supplementary alignments while maintaining backward compatibility.[1]
Key features of BAM include its use of BGZF (Blocked GNU Zip Format) for compression, which reduces file sizes significantly (for example, 112 Gbp of Illumina data occupies about 116 GB) while allowing parallel decompression and random access without loading the full file.[2][1] BAM files must be coordinate-sorted to support indexing via the BAI (BAM Alignment Index) format, which divides the genome into hierarchical bins for rapid retrieval of alignments in specified regions, with bin sizes ranging from 2^29 base pairs down to 2^14 base pairs.[1] Each BAM record preserves all SAM fields, including the 11 mandatory ones (e.g., reference ID, position, CIGAR string, sequence, and base qualities) encoded efficiently in binary, along with optional tags for metadata such as alignment scores and custom attributes.[1] This structure makes BAM the preferred format for submission to repositories like the NCBI Sequence Read Archive (SRA), where it is explicitly recommended over text-based alternatives for its compactness and machine-readability.[3]
Overview
Definition and Purpose
The BAM (Binary Alignment/Map) format is a binary-encoded file format that serves as the compressed equivalent of the Sequence Alignment/Map (SAM) format, specifically designed for storing alignments of high-throughput sequencing reads to reference genomes or transcriptomes. It enables efficient representation of both short and long nucleotide sequences, accommodating data from diverse sequencing platforms such as Illumina, AB/SOLiD, and Roche/454.[2] It also supports long-read platforms like Pacific Biosciences.[4]
The primary purpose of BAM is to facilitate compact storage and rapid retrieval of alignment data, supporting essential bioinformatics workflows including variant detection in genomic resequencing and gene expression quantification in RNA-seq analyses. By providing a standardized structure for alignment records, BAM underpins scalable processing of massive datasets generated by next-generation sequencing, allowing tools like variant callers to efficiently access specific genomic regions without loading entire files.[2][5]
BAM offers key advantages over text-based formats like uncompressed SAM, including substantially reduced file sizes (often approximately 1 byte per aligned base) and accelerated input/output operations, which are critical for managing terabyte-scale sequencing outputs. These efficiencies stem from its binary encoding and built-in compression, enabling random access through indexing while maintaining lossless data integrity.[2][1]
BAM was developed in late 2008 and introduced in 2009 to address the surging volumes of data from emerging next-generation sequencing technologies, which demanded more performant alternatives to existing alignment storage methods for projects such as the 1000 Genomes Project.[2]
History and Development
The BAM format originated in late 2008 as a binary counterpart to the text-based Sequence Alignment/Map (SAM) format, developed by Heng Li and colleagues at the Wellcome Trust Sanger Institute as part of the SAMtools project.[6] This effort addressed the limitations of text-based formats for scalability in processing high-throughput sequencing data, introducing compression and efficient random access capabilities. The first implementation emerged rapidly, with key design decisions—including the adoption of a dual text/binary structure, BGZF compression, and indexing—finalized between October and December 2008. The initial public release of SAMtools version 0.1.0, which included BAM support, occurred on December 22, 2008, under the MIT license.[6][7] A pivotal milestone came with the format's early adoption by the 1000 Genomes Project, to which a final draft specification was sent on December 8, 2008, leading to its use for releasing project alignments.[7] Subsequent updates enhanced functionality and interoperability; notably, SAMtools version 1.0, released in August 2014, introduced improved compression options like CRAM support and restructured the codebase by extracting HTSlib as a standalone library, providing a stable C API for reading and writing BAM files across diverse tools.[8] This refactoring facilitated broader adoption in bioinformatics pipelines, enabling more efficient handling of variant calling and downstream analyses. 
Standardization of BAM has evolved through community-driven documentation, initially detailed in the 2009 SAMtools paper and later formalized in the official Sequence Alignment/Map Format Specification (version 1.6), maintained via the HTSlib project; as of November 2025, version 1.6 remains current with minor updates in 2024.[7][1] Contributions from institutions like the Broad Institute (through tools such as GATK and Picard) and EMBL-EBI, which integrates BAM in resources like Ensembl, have refined the specification to ensure compatibility and extensibility.[9][10] The format's development has been propelled by the explosive growth in genomic data volumes, necessitating robust solutions for petabyte-scale datasets from sequencers like Illumina.[7][10]
Format Specifications
Header Structure
The BAM file header is the initial section of a BAM file, providing essential metadata for the binary alignment data that follows. It begins with the magic string "BAM\1", four bytes that identify the file format.[1] The header is structured as a series of length-prefixed blocks, enabling efficient parsing before the alignment records are processed.[1]
Immediately after the magic string, a 32-bit unsigned integer (uint32_t) specifies the length of the header text block (l_text), which is limited to less than 2³¹ bytes and contains the header in plain-text SAM format.[1] This text may include the file format version via the @HD VN tag (e.g., "1.6" for the current specification) and optional human-readable information, such as the sorting order indicated by the @HD SO tag (e.g., "coordinate" or "unsorted").[1] The header text is followed by another 32-bit unsigned integer (n_ref) denoting the number of reference sequences in the dictionary.[1] For each reference sequence, the structure includes a 32-bit unsigned integer for the length of the reference name (l_name, which counts the terminating NUL and is limited to less than 2³¹ bytes), the NUL-terminated name itself, and a 32-bit unsigned integer for the reference sequence length (l_ref, less than 2³¹ bases).[1] Optional uniform resource identifiers (URIs) for reference sequences, such as URLs to genome assemblies (e.g., for hg38), can be specified within the header text using @SQ UR tags.[1]
All data in the BAM header employs little-endian byte order for consistent cross-platform compatibility, with 32-bit unsigned integers used for lengths, counts, and reference sizes.[1] Positions within alignment records, by contrast, are stored as 32-bit signed integers so that -1 can mark a missing value.[1]
This header organization facilitates file validation by confirming the BAM format through the magic string and providing a complete reference sequence dictionary that describes the genome or reference assembly used for alignments, thereby ensuring tool compatibility and preventing mismatches in downstream analyses.[1]
Alignment Records
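Because each record's refID is an index into the reference dictionary, it can help to see the header layout from the previous section in code before turning to the records themselves. The following is a minimal Python sketch, not a reference implementation: it assumes `buf` already holds the decompressed byte stream, and the function name `parse_bam_header` and the sample header text are illustrative choices rather than anything defined by the specification.

```python
import struct

def parse_bam_header(buf):
    """Parse a decompressed BAM header; return (text, references, next_offset)."""
    if buf[:4] != b"BAM\x01":
        raise ValueError("bad magic: not a BAM stream")
    off = 4
    (l_text,) = struct.unpack_from("<I", buf, off)      # little-endian uint32
    off += 4
    # Header text is plain SAM text; strip any NUL padding.
    text = buf[off:off + l_text].rstrip(b"\x00").decode("utf-8")
    off += l_text
    (n_ref,) = struct.unpack_from("<I", buf, off)
    off += 4
    references = []
    for _ in range(n_ref):
        (l_name,) = struct.unpack_from("<I", buf, off)  # includes trailing NUL
        off += 4
        name = buf[off:off + l_name - 1].decode("ascii")
        off += l_name
        (l_ref,) = struct.unpack_from("<I", buf, off)   # reference length in bases
        off += 4
        references.append((name, l_ref))
    return text, references, off

# Hypothetical one-reference header built by hand for illustration.
sam_text = b"@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n"
header_bytes = (
    b"BAM\x01"
    + struct.pack("<I", len(sam_text)) + sam_text
    + struct.pack("<I", 1)
    + struct.pack("<I", 5) + b"chr1\x00" + struct.pack("<I", 1000)
)
text, references, next_offset = parse_bam_header(header_bytes)
print(references)
```

The returned offset marks where the first alignment record would begin in the same decompressed stream.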
Alignment records in BAM files represent the core data units, encoding individual sequencing reads and their alignments to reference sequences in a compact binary format. Each record begins with a block size field followed by a fixed-size core section of 32 bytes, which captures essential mapping metadata, and then a variable-length section containing the read name, CIGAR, read sequence, quality scores, and optional auxiliary tags. This structure enables efficient storage and random access, particularly when combined with indexing.[1]
The block size is a 4-byte unsigned integer (uint32_t) giving the total length of the record excluding this field itself, which lets parsers step from one record to the next. The core then lists its fields in a fixed order. The reference sequence ID (refID) is a 4-byte signed integer (int32_t) ranging from -1 (for reads without a reference) to the number of references minus one, identifying the aligned chromosome or contig. The position (pos) is a 4-byte signed integer representing the 0-based leftmost mapping coordinate (equivalent to the SAM POS field minus one), also set to -1 for unmapped reads. Next come the read name length (l_read_name), a 1-byte unsigned integer (uint8_t) equal to the length of QNAME plus 1 (for the NUL), and the mapping quality (MAPQ), a 1-byte unsigned integer (uint8_t) from 0 to 255 (255 for unknown). The BIN field, a 2-byte unsigned integer (uint16_t) used for indexing purposes in BAI files, is computed via the reg2bin function from the alignment's start and end positions. The number of CIGAR operations (n_cigar_op) is a 2-byte unsigned integer (uint16_t), which limits the CIGAR array to 65,535 operations; longer alignments are handled via the CG tag in the variable section. The FLAG field, another 2-byte unsigned integer (uint16_t), encodes bitwise flags for read properties such as paired-end status (0x1), proper pairing (0x2), duplicate marking (0x400), and supplementary alignment (0x800).
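The FLAG bits mentioned above can be unpacked with a small lookup table. This is a sketch for illustration: the bit values follow the SAM specification, but the short labels in `FLAG_BITS` and the helper name `decode_flag` are assumptions introduced here, not names used by any particular tool.

```python
# Bit mask -> illustrative label (the labels are this example's own naming).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the labels of all bits set in a 16-bit FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

print(decode_flag(99))
print(decode_flag(0x400 | 0x800))
```

For example, a FLAG of 99 (0x63) combines the masks 0x1, 0x2, 0x20, and 0x40.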
The sequence length (l_seq), a 4-byte unsigned integer (uint32_t), specifies the length of the query sequence. Mate reference ID (next_refID) and mate position (next_pos) are 4-byte signed integers (int32_t) each, indicating the paired read's reference and 0-based position (PNEXT minus one), both set to -1 if unknown. Finally, the template length (tlen), a 4-byte signed integer (int32_t), represents the observed length of the full insert for paired reads, or zero if unavailable. All positions in these fields are encoded as signed 32-bit integers.[1]
The variable section immediately follows the core, beginning with the read name (QNAME), a NUL-terminated string of length l_read_name. The CIGAR string is then encoded as an array of n_cigar_op 4-byte little-endian unsigned integers (uint32_t), each packing the operation length in the upper 28 bits and the operation code in the lower 4 bits as (length << 4 | code), describing matches, insertions, deletions, and other events. The query sequence is stored in a packed format using 4 bits per base, with each base mapped to its index in the string "=ACMGRSVTWYHKDBN" (so = is 0, A is 1, C is 2, G is 4, T is 8, and N is 15); two bases share each byte, the earlier base in the high-order 4 bits, giving an array of (l_seq + 1)/2 uint8_t bytes. If SEQ is '*' in SAM, l_seq is 0 and both the sequence and quality arrays are empty. Quality scores follow as an array of l_seq uint8_t bytes, each holding the raw Phred-scaled quality (the SAM QUAL character code minus 33, typically 0-93); if QUAL is '*' while SEQ is present, the array is filled with 0xFF bytes. Auxiliary tags appear last as key-value pairs: each starts with a two-character tag name (e.g., "NM" for edit distance, "MD" for the mismatch string, "AS" for alignment score), followed by a 1-byte value type (e.g., 'i' for a 32-bit signed integer, 'Z' for a NUL-terminated string), and the value itself, allowing flexible addition of metadata such as alignment scores or probability estimates without altering the core structure.
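The record layout described above can be sketched in Python as follows. This is a minimal illustration that assumes the record bytes are already decompressed; the function name `parse_record`, the dictionary keys, and the hand-built sample record are assumptions made for the example, and auxiliary tags are left undecoded.

```python
import struct

SEQ_CODES = "=ACMGRSVTWYHKDBN"  # index = 4-bit code stored in BAM
CIGAR_OPS = "MIDNSHP=X"         # index = low 4 bits of each CIGAR word

def parse_record(buf, off=0):
    """Decode one alignment record at buf[off]; return (fields, next_offset)."""
    (block_size,) = struct.unpack_from("<I", buf, off)
    end = off + 4 + block_size
    # 32-byte fixed core, in on-disk order.
    (ref_id, pos, l_read_name, mapq, bin_, n_cigar_op, flag, l_seq,
     next_ref_id, next_pos, tlen) = struct.unpack_from("<iiBBHHHIiii", buf, off + 4)
    p = off + 36
    qname = buf[p:p + l_read_name - 1].decode("ascii")  # drop trailing NUL
    p += l_read_name
    words = struct.unpack_from("<%dI" % n_cigar_op, buf, p)
    cigar = "".join("%d%s" % (w >> 4, CIGAR_OPS[w & 0xF]) for w in words) or "*"
    p += 4 * n_cigar_op
    n_seq_bytes = (l_seq + 1) // 2
    # High nibble holds the earlier base; trim the padding nibble if l_seq is odd.
    seq = "".join(SEQ_CODES[b >> 4] + SEQ_CODES[b & 0xF]
                  for b in buf[p:p + n_seq_bytes])[:l_seq] or "*"
    p += n_seq_bytes
    raw_qual = buf[p:p + l_seq]
    # All bytes 0xFF means QUAL was '*' in SAM.
    qual = None if l_seq and set(raw_qual) == {0xFF} else list(raw_qual)
    p += l_seq
    fields = {
        "ref_id": ref_id, "pos": pos, "mapq": mapq, "bin": bin_, "flag": flag,
        "next_ref_id": next_ref_id, "next_pos": next_pos, "tlen": tlen,
        "qname": qname, "cigar": cigar, "seq": seq, "qual": qual,
        "aux": buf[p:end],  # raw auxiliary tag bytes, left undecoded here
    }
    return fields, end

# Hypothetical record: read "r1", 4M at 0-based position 99 on reference 0.
core = struct.pack("<iiBBHHHIiii", 0, 99, 3, 60, 4681, 1, 0, 4, -1, -1, 0)
body = (core + b"r1\x00" + struct.pack("<I", (4 << 4) | 0)
        + bytes([0x12, 0x48])      # A=1, C=2 | G=4, T=8
        + bytes([30, 30, 30, 30]))
record = struct.pack("<I", len(body)) + body
fields, next_off = parse_record(record)
print(fields["qname"], fields["pos"], fields["cigar"], fields["seq"], fields["qual"])
```

Returning the next offset lets a caller loop over consecutive records in the same buffer.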
Auxiliary tags are optional and can be numerous, contributing to the record's variability.[1]
Compression and Indexing
BGZF Compression
BGZF (Blocked GNU Zip Format) is a block-based variant of the gzip format, specifically designed to enable efficient random access in large genomic files while maintaining compatibility with standard gzip tools for sequential processing.[1] It divides the uncompressed data into independent blocks of up to 2^16 bytes (64 KB) each, with each block compressed using the DEFLATE algorithm, allowing parallel processing and seeking without requiring full file decompression.[1] This structure contrasts with typical gzip files, which treat the entire file as a single compressed stream: BGZF adds a 'BC' extra subfield to the gzip header of each block to record the block's total size (BSIZE, stored as the size minus 1), so a reader can step from block to block without decompressing.[11]
Each BGZF block is a complete gzip member: a gzip header carrying the BC extra subfield, the DEFLATE-compressed payload, and an 8-byte trailer containing a CRC32 checksum (4 bytes) for integrity verification and the uncompressed size of the block (ISIZE, 4 bytes).[1] A well-formed BGZF file ends with an empty block, a 28-byte end-of-file marker that lets tools detect truncation.[1] Virtual file offsets in BGZF are represented as 64-bit values combining the compressed offset of a block (shifted left by 16 bits) and the uncompressed offset within the block, enabling precise seeking to any position by decompressing only the relevant blocks.[1]
In BAM files, both the header and the alignment records are written through BGZF, so the whole file is a series of BGZF blocks; a reader decompresses the leading block or blocks to obtain the header and can then seek directly to the blocks holding the records of interest.[1] The binary-encoded alignment data thus serves as the compressed payload, optimizing storage for high-throughput sequencing outputs.[12]
BGZF provides significant benefits for BAM files, achieving a storage cost of approximately 1 byte per input base for typical Illumina GA data, which substantially reduces file sizes compared to uncompressed formats.[12] It supports multi-threaded compression and decompression through implementations like HTSlib, improving performance on modern hardware for large datasets.[11] Additionally, BGZF files remain fully compatible with standard gzip utilities for linear reading, ensuring broad interoperability.[11] BGZF was defined and integrated into the SAMtools toolkit in 2009 as part of the initial BAM format specification, with ongoing development in the HTSlib library to handle its compression mechanics.[12]
BAM Indexing
BAM indexing uses the BAI (BAM Index) format, a companion file to a coordinate-sorted BAM file that is typically named with a .bai extension. The BAI file begins with the magic string "BAI\1" (4 bytes) followed by the number of reference sequences (n_ref, 4 bytes); for each reference sequence, it contains both a binning index and a linear index to enable rapid retrieval of alignments overlapping specific genomic regions. The binning index partitions the reference into hierarchical bins, while the linear index provides offsets for fixed 16 kbp intervals, allowing efficient seeking without scanning the entire BAM file.[1]
The core of the BAI structure is the binning system, which divides each reference sequence into 37,449 bins (numbered 0 to 37,448) across six levels; the number 37450 is additionally reserved as a pseudo-bin for per-reference metadata. Bin 0 (level 0) spans the whole 2^29 bp coordinate range, and each subsequent level splits a bin into eight, down to the 32,768 level-5 bins of 2^14 bp (16 kbp) each. The reg2bin function assigns an alignment to the smallest bin that fully contains it: with end taken as the last covered base (the exclusive end minus one), if beg>>14 == end>>14 the bin is ((1<<15)-1)/7 + (beg>>14); otherwise the same test is repeated with shifts of 17, 20, 23, and 26 bits, falling back to bin 0. The reg2bins function computes the list of bins that may hold alignments overlapping a region; for a query spanning the maximum reference length this is every bin, including all 32,768 at the finest level. Each bin entry includes a 4-byte bin number, the number of chunks (n_chunk, 4 bytes), and for each chunk, two 64-bit virtual offsets: chunk_beg (start of the chunk) and chunk_end (end of the chunk). The linear index follows, with n_intv (4 bytes; at most 2^15 for a 2^29 bp reference) indicating the number of 16 kbp intervals, each with an 8-byte ioffset pointing to the first alignment overlapping that interval. Additionally, the number of unplaced unmapped reads (those with no reference or position) may be stored once at the end of the file as an optional 8-byte n_no_coor count.[1]
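The bin arithmetic above is easier to follow in code. The sketch below is a Python transcription of the reg2bin and reg2bins routines that the specification gives in C; coordinates are 0-based and half-open, and the list-returning form of `reg2bins` is an adaptation made for this example.

```python
def reg2bin(beg, end):
    """Smallest bin fully containing the 0-based half-open interval [beg, end)."""
    end -= 1  # work with the last covered base
    # (shift, first bin number at that level), finest level first.
    for shift, first in ((14, 4681), (17, 585), (20, 73), (23, 9), (26, 1)):
        if beg >> shift == end >> shift:
            return first + (beg >> shift)
    return 0

def reg2bins(beg, end):
    """All bins that may contain alignments overlapping [beg, end)."""
    end -= 1
    bins = [0]
    for shift, first in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        bins.extend(range(first + (beg >> shift), first + (end >> shift) + 1))
    return bins

# The 1-based region chr1:1000-2000 is [999, 2000) in 0-based half-open form.
print(reg2bin(999, 2000))
print(reg2bins(999, 2000))
print(len(reg2bins(0, 1 << 29)))
```

The level offsets 1, 9, 73, 585, and 4681 are the values of ((1<<3)-1)/7 through ((1<<15)-1)/7, that is, the running count of bins in all coarser levels.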
Virtual offsets in the BAI are 64-bit integers representing positions in the BGZF-compressed BAM file, formatted as (block_offset << 16) | in_block_offset, where block_offset is the byte offset to the start of a BGZF block and in_block_offset (≤ 65,535) is the position within the uncompressed block data. This design allows random access by seeking to the appropriate compressed block and then decompressing only up to the needed offset, leveraging BGZF's block-based compression for efficiency.[1]
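As a rough illustration of how a virtual offset is resolved, the Python sketch below builds two BGZF-style blocks in memory and reads bytes back through a virtual offset. It is a simplification under stated assumptions: the BC subfield is taken to be the only gzip extra subfield (XLEN of 6), a read is assumed not to cross a block boundary, and the helper names (`make_bgzf_block`, `read_bgzf_block`, `read_at`, and the two offset helpers) are introduced here for the example.

```python
import io
import struct
import zlib

HEADER_FMT = "<BBBBIBBHBBHH"  # 18 bytes: gzip header plus the BC subfield

def make_voffset(block_offset, in_block_offset):
    return (block_offset << 16) | in_block_offset

def split_voffset(voffset):
    return voffset >> 16, voffset & 0xFFFF

def make_bgzf_block(data):
    """Wrap up to 64 KB of data in one BGZF-style block (for the demo only)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw DEFLATE, no zlib wrapper
    cdata = comp.compress(data) + comp.flush()
    bsize = 18 + len(cdata) + 8 - 1                 # total block size minus 1
    header = struct.pack(HEADER_FMT, 31, 139, 8, 4, 0, 0, 255, 6, 66, 67, 2, bsize)
    return header + cdata + struct.pack("<II", zlib.crc32(data), len(data))

def read_bgzf_block(fh, block_offset):
    """Seek to a block start and return its decompressed payload."""
    fh.seek(block_offset)
    (id1, id2, cm, flg, _mtime, _xfl, _os, xlen,
     si1, si2, slen, bsize) = struct.unpack(HEADER_FMT, fh.read(18))
    if (id1, id2, cm, flg, xlen, si1, si2, slen) != (31, 139, 8, 4, 6, 66, 67, 2):
        raise ValueError("not a BGZF block with a single BC subfield")
    cdata = fh.read(bsize + 1 - 18 - 8)
    crc, isize = struct.unpack("<II", fh.read(8))
    data = zlib.decompress(cdata, -15)
    if zlib.crc32(data) != crc or len(data) != isize:
        raise ValueError("BGZF block failed CRC32/ISIZE check")
    return data

def read_at(fh, voffset, n):
    """Read n bytes at a virtual offset (within a single block)."""
    block_offset, in_block_offset = split_voffset(voffset)
    return read_bgzf_block(fh, block_offset)[in_block_offset:in_block_offset + n]

block0 = make_bgzf_block(b"hello world")
block1 = make_bgzf_block(b"BAM records here")
stream = io.BytesIO(block0 + block1)
voff = make_voffset(len(block0), 4)   # second block, 4 bytes into its payload
print(read_at(stream, voff, 7))
```

Only the second block is decompressed in this example; the first is skipped entirely by the seek.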
BAI indices are generated using the samtools index command on a coordinate-sorted BAM file, producing the .bai file and enabling operations like fetching alignments in a specific region (e.g., "chr1:1000-2000") by identifying overlapping bins, retrieving their chunk offsets, and using the linear index to trim non-overlapping alignments—typically requiring just one seek operation. This facilitates targeted queries in genomic analysis pipelines without full file traversal.[13][1]
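The lookup procedure just described can be sketched in Python. This is an illustrative reading of the BAI layout rather than how samtools implements it: `parse_bai` and `candidate_chunks` are names invented for the example, the index bytes are built by hand, and chunk merging and the final per-record overlap filter are omitted.

```python
import struct

def reg2bins(beg, end):
    """Bins that may hold alignments overlapping the 0-based interval [beg, end)."""
    end -= 1
    bins = [0]
    for shift, first in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        bins.extend(range(first + (beg >> shift), first + (end >> shift) + 1))
    return bins

def parse_bai(buf):
    """Return ([(bins, linear) per reference], n_no_coor or None)."""
    if buf[:4] != b"BAI\x01":
        raise ValueError("bad magic: not a BAI index")
    off = 4
    (n_ref,) = struct.unpack_from("<i", buf, off)
    off += 4
    refs = []
    for _ in range(n_ref):
        (n_bin,) = struct.unpack_from("<i", buf, off)
        off += 4
        bins = {}
        for _ in range(n_bin):
            bin_id, n_chunk = struct.unpack_from("<Ii", buf, off)
            off += 8
            vals = struct.unpack_from("<%dQ" % (2 * n_chunk), buf, off)
            off += 16 * n_chunk
            bins[bin_id] = list(zip(vals[0::2], vals[1::2]))  # (chunk_beg, chunk_end)
        (n_intv,) = struct.unpack_from("<i", buf, off)
        off += 4
        linear = list(struct.unpack_from("<%dQ" % n_intv, buf, off))
        off += 8 * n_intv
        refs.append((bins, linear))
    # Optional trailing count of reads without coordinates.
    n_no_coor = struct.unpack_from("<Q", buf, off)[0] if len(buf) - off >= 8 else None
    return refs, n_no_coor

def candidate_chunks(ref_index, beg, end):
    """Chunks worth reading for [beg, end), using the linear index to skip early ones."""
    bins, linear = ref_index
    window = beg >> 14
    min_offset = linear[window] if window < len(linear) else 0
    chunks = []
    for b in reg2bins(beg, end):
        for chunk_beg, chunk_end in bins.get(b, ()):
            if chunk_end > min_offset:  # chunk may contain overlapping alignments
                chunks.append((chunk_beg, chunk_end))
    return sorted(chunks)

# Hand-built index: one reference, two level-5 bins, two linear-index windows.
bai_bytes = (
    b"BAI\x01" + struct.pack("<i", 1)
    + struct.pack("<i", 2)
    + struct.pack("<IiQQ", 4681, 1, 100 << 16, 200 << 16)
    + struct.pack("<IiQQ", 4682, 1, 200 << 16, 300 << 16)
    + struct.pack("<i", 2) + struct.pack("<QQ", 100 << 16, 200 << 16)
    + struct.pack("<Q", 7)
)
refs, n_no_coor = parse_bai(bai_bytes)
print(candidate_chunks(refs[0], 999, 2000))   # region chr1:1000-2000
```

Each returned pair is a range of virtual offsets; a reader would seek to the first, decode records until the second, and discard any that do not overlap the query.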
Despite its efficiency, the fixed hierarchical binning in BAI has limitations: it supports reference sequences up to 512 Mbp (2^29 - 1 bases), and an alignment that straddles a bin boundary is placed in a larger, coarser bin, so a query may retrieve extraneous records that must be filtered after retrieval. For whole-chromosome access, bin 0 (level 0) covers the entire sequence. Unplaced unmapped reads, whose position is -1, all receive bin 4680, the value of reg2bin(-1, 0).[1]
An alternative to the BAI format is the CSI (Coordinate-Sorted Index), which generalizes the binning scheme with a configurable minimum bin size and number of levels and therefore supports reference sequences longer than 512 Mbp; it is generated with samtools index using the -c option.[14]
Relation to SAM
Key Differences
The BAM (Binary Alignment/Map) format serves as the binary counterpart to the SAM (Sequence Alignment/Map) format, encoding the same alignment data in a compact, machine-readable structure rather than human-readable text. While SAM uses plain-text strings for elements like reference names and aligned bases (e.g., "ACGT"), BAM employs binary encoding, such as representing nucleotide bases with 4-bit codes (e.g., 1 for A, 2 for C) packed two per byte, which significantly reduces file sizes, typically by a factor of 3-4 compared to uncompressed SAM files.[1][15] Compression is mandatory in BAM via the BGZF (Blocked GNU Zip Format) scheme, enabling efficient storage without loss of information.[1]
In terms of performance, BAM is optimized for high-speed input/output operations and random access to specific genomic regions, particularly when paired with indexing (e.g., BAI files), making it suitable for processing large-scale genomic datasets. In contrast, SAM's text-based nature facilitates manual inspection using standard text editors but results in slower parsing and sequential access, which can be prohibitive for terabyte-scale files common in next-generation sequencing.[1][9] BAM's block structure also allows for multi-threaded decompression and processing in tools like SAMtools, further enhancing computational efficiency over SAM.
Both formats maintain identical logical structures, including a header section for metadata (e.g., reference sequences) and alignment records with fields like query name, flag, and mapping quality, ensuring seamless data equivalence. However, BAM's binary encoding precludes direct text editing, necessitating conversion to SAM for any manual modifications, whereas SAM supports straightforward alterations with basic tools.[1][15]
Practically, BAM is preferred for long-term storage and integration into automated analysis pipelines due to its compactness and speed, while SAM is typically generated as the initial output from aligners such as BWA for debugging or visual review before conversion to BAM.[16][9] This division leverages SAM's readability for human oversight and BAM's efficiency for downstream computational tasks like variant calling.[15]
Conversion Processes
Conversion between BAM and SAM formats is bidirectional and primarily facilitated by the samtools view command from the SAMtools toolkit. To convert a SAM file to BAM, the command samtools view -b input.sam -o output.bam is used, where -b specifies binary BAM output; the input format is auto-detected. This process compresses the text-based SAM into the efficient binary BAM while preserving the alignment records. Conversely, converting BAM to SAM employs samtools view -h input.bam -o output.sam, with the -h flag ensuring the header is included in the decompressed text output, which is essential for downstream interpretation of the alignments.[17]
In typical next-generation sequencing (NGS) workflows, aligners such as BWA or Bowtie2 generate SAM output, which is then immediately converted to BAM for compact storage and efficient random access; this step is often automated via piping to avoid intermediate disk writes of large uncompressed SAM files, for example, bwa mem reference.fa reads.fastq | samtools view -b - > alignments.bam. Such integration minimizes storage overhead and facilitates subsequent operations like sorting and indexing directly on the BAM.[18][17]
Key considerations during conversion include preserving the input file's sorting order, either by coordinate (required for indexing) or by name, to maintain compatibility with tools expecting specific arrangements; unsorted inputs should be sorted after conversion using samtools sort if needed. For large files, streaming through pipes avoids writing intermediate files to disk, and uncompressed BAM output (-u flag) saves the time otherwise spent compressing and decompressing data passed between piped samtools commands. Post-conversion validation is recommended using samtools quickcheck output.bam, which verifies file integrity by checking the header and end-of-file markers without full decompression.[17][19]
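The end-of-file check mentioned above can be approximated in a few lines of Python. This is a sketch of the idea only, not a reimplementation of samtools quickcheck: it tests that the file starts with a BGZF-style gzip header and ends with the 28-byte empty BGZF block that the specification defines as the end-of-file marker. The function name `looks_complete` is an assumption made for the example.

```python
import io

# The empty BGZF block that terminates a complete file (28 bytes).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"  # gzip header with XLEN = 6
    "4243" "0200" "1b00"                 # BC subfield, BSIZE = 27
    "0300"                               # empty DEFLATE stream
    "00000000" "00000000"                # CRC32 = 0, ISIZE = 0
)

def looks_complete(fh):
    """True if fh starts like BGZF and ends with the BGZF end-of-file marker."""
    fh.seek(0)
    starts_ok = fh.read(4) == b"\x1f\x8b\x08\x04"
    fh.seek(0, io.SEEK_END)
    size = fh.tell()
    if size < len(BGZF_EOF):
        return False
    fh.seek(size - len(BGZF_EOF))
    return starts_ok and fh.read(len(BGZF_EOF)) == BGZF_EOF

# In-memory stand-ins; with a real file use open("output.bam", "rb").
print(looks_complete(io.BytesIO(BGZF_EOF)))        # empty but well-terminated
print(looks_complete(io.BytesIO(BGZF_EOF[:-1])))   # truncated by one byte
```

A missing marker usually indicates that the writer was interrupted before it finished, which is why truncated transfers are a common cause of this failure.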
Advanced options enhance efficiency and flexibility: multi-threading with the -@ INT flag (e.g., -@ 4 for four threads) accelerates compression during SAM-to-BAM conversion on multi-core systems. Subsetting specific genomic regions during conversion is possible by specifying the region as a positional argument on indexed BAM inputs (e.g., samtools view -b -o output.bam input.bam chr1:100-200), outputting only alignments overlapping the specified interval to reduce file size for targeted analyses.[17]
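When conversions are scripted, the command lines above can be assembled programmatically. The Python sketch below only builds the argument list from the flags already described (-b, -h, -o, -@, and a trailing region); the helper names `view_cmd` and `run_view` are hypothetical, and actually running the command assumes samtools is installed and on the PATH.

```python
import subprocess

def view_cmd(src, dst, to_bam=True, threads=0, region=None):
    """Build a samtools view argument list for SAM->BAM or BAM->SAM conversion."""
    cmd = ["samtools", "view", "-b" if to_bam else "-h"]
    if threads:
        cmd += ["-@", str(threads)]
    cmd += ["-o", dst, src]
    if region:
        cmd.append(region)  # region queries need an indexed, sorted input
    return cmd

def run_view(*args, **kwargs):
    """Run the conversion, raising CalledProcessError on a non-zero exit."""
    subprocess.run(view_cmd(*args, **kwargs), check=True)

print(" ".join(view_cmd("input.sam", "output.bam", threads=4)))
print(" ".join(view_cmd("input.bam", "output.bam", region="chr1:100-200")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names.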
Common pitfalls include failing to re-index a BAM file after any post-conversion edits or filtering, which can lead to loss of random access efficiency and errors in region-based queries; always run samtools index on modified BAMs to generate the accompanying .bai file. Additionally, omitting the header during BAM-to-SAM conversion may render the output unusable for tools requiring reference sequence information.[17]