
BAM (file format)

The BAM (Binary Alignment/Map) file format is a compressed, binary representation of the SAM (Sequence Alignment/Map) format, specifically designed for efficient storage and retrieval of aligned high-throughput sequencing data in bioinformatics applications. It serves as a standard container for genomic alignments generated from diverse sequencing technologies, such as Illumina, Roche/454, and AB SOLiD, enabling downstream analyses like variant calling and structural variation detection. Developed in 2009 by a team including Heng Li, Bob Handsaker, and Richard Durbin as part of the 1000 Genomes Project Data Processing Subgroup, BAM addresses the challenges of managing massive datasets from next-generation sequencing, where alignments can span billions of base pairs. The format was introduced alongside SAMtools, a suite of utilities for manipulating SAM and BAM files, to standardize alignment reporting across tools and promote interoperability in genomic research. Its specification has evolved through versions, with SAM/BAM v1.6, the current version as of 2024, incorporating enhancements for long reads and supplementary alignments while maintaining backward compatibility. Key features of BAM include its use of BGZF (Blocked GNU Zip Format) for compression, which reduces file sizes significantly (for example, compressing 112 Gbp of Illumina data to about 116 GB) while allowing parallel decompression and random access without full file loading. BAM files must be coordinate-sorted to support indexing via the BAI (BAM index) format, which divides the reference sequence into hierarchical bins for rapid retrieval of alignments in specified regions, with bins typically spanning from 2^29 base pairs down to 2^14 base pairs. Each BAM record preserves all SAM fields, including the 11 mandatory ones (e.g., reference ID, position, CIGAR string, and sequence) encoded efficiently in binary, along with optional tags for metadata like alignment scores and custom attributes. This structure makes BAM the preferred format for submission to repositories like the NCBI Sequence Read Archive (SRA), where it is explicitly recommended over text-based alternatives for its compactness and machine-readability.

Overview

Definition and Purpose

The BAM (Binary Alignment/Map) format is a binary-encoded format that serves as the compressed equivalent of the Sequence Alignment/Map (SAM) format, specifically designed for storing alignments of high-throughput sequencing reads to reference genomes or transcriptomes. It enables efficient representation of both short and long sequences, accommodating data from diverse sequencing platforms such as Illumina, AB/SOLiD, and Roche/454, as well as long-read platforms. The primary purpose of BAM is to facilitate compact storage and rapid retrieval of alignment data, supporting essential bioinformatics workflows including variant detection in genomic resequencing and gene expression quantification in RNA-seq analyses. By providing a standardized structure for alignment records, BAM underpins scalable processing of massive datasets generated by next-generation sequencing, allowing tools like variant callers to efficiently access specific genomic regions without loading entire files. BAM offers key advantages over text-based formats like uncompressed SAM, including substantially reduced file sizes (often achieving approximately 1 byte per aligned base) and accelerated input/output operations, which are critical for managing terabyte-scale sequencing outputs. These efficiencies stem from its binary encoding and built-in compression, enabling random access through indexing while maintaining a lossless representation of the alignment information. BAM was developed in late 2008 and introduced in 2009 to address the surging volumes of data from emerging next-generation sequencing technologies, which demanded more performant alternatives to existing alignment storage methods for projects such as the 1000 Genomes Project.
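As a concrete illustration of such region-level access, the following minimal sketch uses the Pysam bindings (described later in this article) to pull only the reads overlapping one interval from an indexed BAM file; the file name and region are hypothetical examples, not prescribed by the format.

```python
# Minimal sketch: region query on a coordinate-sorted, indexed BAM with pysam.
# "example.bam" is a hypothetical file; an accompanying .bai index must exist.
import pysam

with pysam.AlignmentFile("example.bam", "rb") as bam:
    # fetch() uses the index to seek directly to reads overlapping the region,
    # so the rest of the file is never decompressed.
    for read in bam.fetch("chr1", 100_000, 101_000):
        print(read.query_name, read.reference_start, read.mapping_quality)
```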

History and Development

The BAM format originated in late 2008 as a binary counterpart to the text-based Sequence Alignment/Map (SAM) format, developed by Heng Li and colleagues at the Wellcome Trust Sanger Institute as part of the 1000 Genomes Project. This effort addressed the limitations of text-based formats for processing high-throughput sequencing data, introducing compression and efficient random-access capabilities. The first implementation emerged rapidly, with key design decisions, including the adoption of a dual text/binary structure, BGZF compression, and indexing, finalized between October and December 2008. The initial public release of SAMtools version 0.1.0, which included BAM support, occurred on December 22, 2008, under the MIT License. A pivotal milestone came with the format's early adoption by the 1000 Genomes Project, to which a final draft specification was sent on December 8, 2008, leading to its use for releasing project alignments. Subsequent updates enhanced functionality and interoperability; notably, SAMtools version 1.0, released in August 2014, introduced improved compression options such as CRAM support and restructured the codebase by extracting HTSlib as a standalone library, providing a stable C API for reading and writing BAM files across diverse tools. This refactoring facilitated broader adoption in bioinformatics pipelines, enabling more efficient handling of variant calling and downstream analyses. Standardization of BAM has evolved through community-driven documentation, initially detailed in the 2009 SAMtools paper and later formalized in the official Sequence Alignment/Map Format Specification (version 1.6), maintained via the HTSlib project; as of November 2025, version 1.6 remains current with minor updates in 2024. Contributions from institutions such as the Broad Institute, through tools like GATK and Picard, and EMBL-EBI, which integrates BAM in resources like Ensembl, have refined the specification to ensure compatibility and extensibility. The format's development has been propelled by the explosive growth in genomic data volumes, necessitating robust solutions for petabyte-scale datasets from Illumina and other sequencers.

Format Specifications

Header Structure

The BAM file header serves as the initial fixed section of a BAM file, providing essential for the that follows. It begins with the "BAM\1", consisting of four bytes that uniquely identify the . This header is structured as a series of length-prefixed blocks, enabling efficient parsing and ensuring the file's integrity before processing the records. Immediately after the magic string, a 32-bit unsigned (uint32_t) specifies the length of the header text block (l_text), which is limited to less than 2³¹ bytes and contains the header in plain-text SAM format. This text may include the file format version via the @HD VN tag (e.g., "1.6" for the current specification) and optional human-readable information, such as the sorting order indicated by the @HD SO tag (e.g., "coordinate" or "unsorted"). The header text is followed by another 32-bit unsigned (n_ref) denoting the number of sequences in the dictionary. For each sequence, the structure includes a 32-bit unsigned for the length of the name (a NUL-terminated string, also limited to less than 2³¹ bytes including the NUL), the name itself, and a 32-bit unsigned for the sequence length (maximum < 2³¹ bases). Optional uniform resource identifiers (URIs) for sequences, such as URLs to genome assemblies (e.g., for hg38), can be specified within the header text using @SQ UR tags. All data in the BAM header employs little-endian byte order for consistent cross-platform compatibility, with 32-bit unsigned integers used for lengths, counts, and reference sizes. While positions within alignment records utilize 32-bit signed or unsigned integers (or 64-bit in extended formats), the header itself relies on these 32-bit types to maintain a compact structure. This header organization facilitates file validation by confirming the BAM format through the magic string and providing a complete reference sequence dictionary that describes the genome or reference assembly used for alignments, thereby ensuring tool compatibility and preventing mismatches in downstream analyses.
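The fixed fields described above can be parsed directly from the decompressed byte stream. The sketch below assumes a hypothetical file named example.bam and relies on the fact that BGZF blocks are valid gzip members, so Python's standard gzip module can stream the file sequentially (no random access is attempted here).

```python
# Sketch of parsing the fixed BAM header fields: magic string, SAM header text,
# and the reference sequence dictionary. All integers are little-endian.
import gzip
import struct

with gzip.open("example.bam", "rb") as f:
    magic = f.read(4)
    assert magic == b"BAM\x01", "not a BAM file"

    # Length of the plain-text SAM header, then the header text itself.
    (l_text,) = struct.unpack("<i", f.read(4))
    sam_header_text = f.read(l_text).decode("ascii", errors="replace")

    # Reference dictionary: name length, NUL-terminated name, sequence length.
    (n_ref,) = struct.unpack("<i", f.read(4))
    references = []
    for _ in range(n_ref):
        (l_name,) = struct.unpack("<i", f.read(4))
        name = f.read(l_name).rstrip(b"\x00").decode("ascii")
        (l_ref,) = struct.unpack("<i", f.read(4))
        references.append((name, l_ref))

print(sam_header_text[:200])   # start of the @HD/@SQ text
print(references[:3])          # first few (name, length) pairs
```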

Alignment Records

Alignment records in BAM files represent the core data units, encoding individual sequencing reads and their alignments to reference sequences in a compact binary format. Each record begins with a fixed-size core section of 32 bytes (after the initial block size field), which captures essential mapping metadata, followed by a variable-length section containing the read name, CIGAR operations, read sequence, quality scores, and optional auxiliary tags. This structure enables efficient storage and random access, particularly when combined with indexing. Each record starts with a 4-byte unsigned integer (uint32_t) specifying the block size, which indicates the total length of the record excluding this field itself, allowing parsers to step from record to record. This is followed by the reference sequence ID (refID), a 4-byte signed integer (int32_t) ranging from -1 (for unmapped reads) to the number of references minus one, identifying the aligned chromosome or contig. The position (pos) is a 4-byte signed integer representing the 0-based leftmost mapping coordinate (equivalent to the SAM POS field minus one), also set to -1 for unmapped reads. Next come the read name length (l_read_name), a 1-byte unsigned integer (uint8_t) equal to the length of QNAME plus 1 (for the terminating NUL), and the mapping quality (MAPQ), a 1-byte unsigned integer (uint8_t) from 0 to 255 (255 for unknown). The BIN field, a 2-byte unsigned integer (uint16_t) used for indexing purposes in BAI files, is computed via the reg2bin function from the alignment's start and end positions. The number of CIGAR operations (n_cigar_op) is a 2-byte unsigned integer (uint16_t), limiting the CIGAR string to 65,535 operations; longer alignments are handled via the CG tag in the variable section. The FLAG field, another 2-byte unsigned integer (uint16_t), encodes bitwise flags for read properties such as paired-end status (0x1), proper pairing (0x2), PCR or optical duplicate (0x400), and supplementary alignment (0x800). The sequence length (l_seq), a 4-byte integer, specifies the length of the query sequence. Mate reference ID (next_refID) and mate position (next_pos) are 4-byte signed integers (int32_t) each, indicating the paired read's reference and 0-based position (PNEXT minus one), both set to -1 if unknown. Finally, the template length (tlen), a 4-byte signed integer (int32_t), represents the observed length of the full insert for paired reads, or zero if unavailable. All positions in these fields are encoded as 32-bit integers for precision across large genomes. The variable section immediately follows the core, beginning with the read name (QNAME), a NUL-terminated string of length l_read_name. The CIGAR string is then encoded as an array of n_cigar_op 4-byte little-endian unsigned integers (uint32_t), each packing the operation length in the upper 28 bits and the operation code in the lower 4 bits as (length << 4 | code), describing matches, insertions, deletions, and other events. The query sequence is stored in a packed format using 4 bits per base, with each nucleotide mapped to a code from the string "=ACMGRSVTWYHKDBN" (e.g., A=1, C=2, G=4, T=8, N=15) and two bases packed per byte, high nibble first, giving an array of (l_seq + 1)/2 bytes; if SEQ is '*' in SAM, l_seq is 0 and the sequence and quality fields are empty. Quality scores follow as an array of uint8_t bytes of length l_seq, each holding the raw Phred-scaled value (the SAM ASCII character minus 33, typically 0-93); if base qualities are omitted, the field is filled with 0xFF bytes.
Auxiliary tags appear last as key-value entries: each tag starts with a two-character tag name (e.g., "NM" for edit distance, "MD" for mismatch details as a printable string, "AS" for alignment score), followed by a 1-byte type specifier (e.g., 'i' for integer, 'Z' for string), and the value itself, allowing flexible addition of metadata such as alignment scores or error-probability estimates without altering the core structure. These tags are optional and can be numerous, contributing to the record's variability.
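For illustration, the following sketch decodes the two packed encodings described above, the (length << 4 | code) CIGAR words and the 4-bit base codes; the input values are made-up examples, not data read from a real file.

```python
# Illustrative decoding of packed BAM record fields.
CIGAR_OPS = "MIDNSHP=X"            # CIGAR operation codes 0-8
SEQ_NT16 = "=ACMGRSVTWYHKDBN"      # 4-bit base codes 0-15

def decode_cigar(cigar_u32s):
    """Turn packed uint32 CIGAR values into a SAM-style CIGAR string."""
    return "".join(f"{v >> 4}{CIGAR_OPS[v & 0xF]}" for v in cigar_u32s)

def decode_seq(packed, l_seq):
    """Unpack l_seq bases stored two per byte, high nibble first."""
    bases = []
    for i in range(l_seq):
        byte = packed[i // 2]
        nibble = (byte >> 4) if i % 2 == 0 else (byte & 0xF)
        bases.append(SEQ_NT16[nibble])
    return "".join(bases)

# 100M followed by 2S: (100 << 4 | 0) and (2 << 4 | 4).
print(decode_cigar([100 << 4 | 0, 2 << 4 | 4]))    # -> "100M2S"
# Bytes 0x12, 0x48, 0xF0 encode the five bases A, C, G, T, N.
print(decode_seq(bytes([0x12, 0x48, 0xF0]), 5))    # -> "ACGTN"
```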

Compression and Indexing

BGZF Compression

BGZF (Blocked GNU Zip Format) is a block-based extension of the gzip compression format, specifically designed to enable efficient random access in large genomic files while maintaining compatibility with standard gzip tools for sequential processing. It divides the uncompressed data into independent blocks of up to 2^16 bytes (64 KB) each, with each block compressed using the DEFLATE algorithm, allowing parallel processing and seeking without requiring full file decompression. This structure contrasts with standard gzip, which treats the entire file as a single compressed stream, by incorporating a 'BC' subfield in the gzip header of each block to record its compressed length, facilitating quick navigation. The internal structure of a BGZF block begins with a standard gzip header carrying the 'BC' subfield, followed by the DEFLATE-compressed data, and concludes with an 8-byte trailer containing a CRC32 checksum (4 bytes) for integrity verification and the uncompressed size (ISIZE, 4 bytes) of the block; a complete BGZF file additionally ends with a special 28-byte empty block that serves as an end-of-file marker. Virtual file offsets in BGZF are represented as 64-bit values combining the compressed offset of a block (shifted left by 16 bits) and the uncompressed offset within that block, enabling precise seeking to any position by decompressing only the relevant blocks. In BAM files, the entire content, from the binary header through the alignment records, is stored as a sequence of BGZF blocks, so the binary-encoded alignment data serves as the compressed payload, optimizing storage for high-throughput sequencing outputs. BGZF provides significant benefits for BAM files, achieving a compression ratio of approximately 1 byte per input base for typical Illumina GA data, which substantially reduces file sizes compared to uncompressed formats. It supports multi-threaded compression and decompression through implementations like HTSlib, improving performance on modern hardware for large datasets. Additionally, BGZF files remain fully compatible with standard gzip utilities for linear reading, ensuring broad interoperability. BGZF was defined and integrated into the SAMtools toolkit in 2009 as part of the initial BAM format specification, with ongoing development in the HTSlib library to handle its compression mechanics.
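A brief sketch of the virtual-offset arithmetic described above; the offset values are arbitrary examples chosen for demonstration.

```python
# BGZF virtual offsets: upper 48 bits address the start of a compressed block,
# lower 16 bits are the offset within the uncompressed block contents.

def make_voffset(coffset: int, uoffset: int) -> int:
    """Pack a compressed-block offset and an in-block offset into one 64-bit value."""
    assert 0 <= uoffset < (1 << 16), "in-block offset must fit in 16 bits"
    return (coffset << 16) | uoffset

def split_voffset(voffset: int) -> tuple[int, int]:
    """Recover (compressed block offset, uncompressed in-block offset)."""
    return voffset >> 16, voffset & 0xFFFF

v = make_voffset(1_234_567, 890)
print(v, split_voffset(v))   # -> packed value, then (1234567, 890)
```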

BAM Indexing

BAM indexing utilizes the BAI (BAM Index) format, which serves as a companion file to a coordinate-sorted BAM file, typically named with a .bai extension. The BAI file begins with a magic string "BAI\1" (4 bytes) followed by the number of reference sequences (n_ref, 4 bytes), and for each reference sequence it contains both a binning (bin-based) index and a linear index to enable rapid retrieval of alignments overlapping specific genomic regions. The binning index partitions the reference into hierarchical bins, while the linear index provides offsets for fixed 16 kbp intervals, allowing efficient seeking without scanning the entire BAM file. The core of the BAI structure is the binning system, which divides each reference sequence into up to 37,450 distinct bins across six levels; bin numbers are calculated by the reg2bin function, with bins at the finest level (level 5) spanning 2^14 bases (16 kbp) and bin 0 (level 0) covering an entire reference sequence. For a given region defined by start (beg) and end (end) positions, the containing bin is determined hierarchically: if beg>>14 == end>>14, the bin is ((1<<15)-1)/7 + (beg>>14); otherwise, progressively coarser levels are tried until one fully covers the region. The reg2bins function computes the list of bins overlapping a region, which for large regions can include many bins, up to approximately 32,768 at the finest level for a query spanning a maximum-length reference. Each bin entry includes a 4-byte bin number, the number of chunks (n_chunk, 4 bytes), and for each chunk, two 64-bit virtual offsets: chunk_beg (the virtual offset at which the chunk starts) and chunk_end (the virtual offset at which it ends). The linear index follows, with n_intv (≤ 2^17, 4 bytes) indicating the number of 16 kbp windows, each with an 8-byte ioffset pointing to the first alignment in that window. Additionally, a count of unplaced unmapped reads (n_no_coor, 8 bytes) may be stored at the end of the index. Virtual offsets in the BAI are 64-bit integers representing positions in the BGZF-compressed BAM file, formatted as (block_offset << 16) | in_block_offset, where block_offset is the byte offset to the start of a BGZF block and in_block_offset (≤ 65,535) is the position within the uncompressed block data. This design allows random access by seeking to the appropriate compressed block and then decompressing only up to the needed offset, leveraging BGZF's block-based compression for efficiency. BAI indices are generated using the samtools index command on a coordinate-sorted BAM file, producing the .bai file and enabling operations like fetching alignments in a specific region (e.g., "chr1:1000-2000") by identifying overlapping bins, retrieving their chunk offsets, and using the linear index to skip chunks that cannot contain overlapping alignments, typically requiring just one seek operation. This facilitates targeted queries in genomic analysis pipelines without full file traversal. Despite its efficiency, the fixed hierarchical binning in BAI has limitations: it supports reference sequences only up to 2^29 - 1 bases (about 512 Mbp), and a query may need to inspect multiple bins whose chunks include extraneous alignments that must be filtered after retrieval. For whole-chromosome access, coarse bins such as bin 0 (level 0) cover the entire sequence, while unmapped reads without coordinates are conventionally placed in a single bin (4680, corresponding to reg2bin(-1, 0)), which limits the usefulness of the index for such reads. A worked rendering of the bin calculations appears below.
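The sketch below is a Python rendering of the reg2bin and reg2bins calculations given as C pseudocode in the SAM/BAM specification; coordinates are 0-based and half-open, and the example region is arbitrary.

```python
# Binning calculations used by the BAI index.

def reg2bin(beg: int, end: int) -> int:
    """Smallest bin that fully contains the region [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14: return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17: return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20: return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23: return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26: return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0

def reg2bins(beg: int, end: int) -> list[int]:
    """All bins that may contain alignments overlapping [beg, end)."""
    end -= 1
    bins = [0]
    for shift, offset in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        bins.extend(range(offset + (beg >> shift), offset + (end >> shift) + 1))
    return bins

print(reg2bin(999, 2000))        # smallest containing bin for chr:1000-2000
print(len(reg2bins(999, 2000)))  # number of bins a reader must inspect
```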
An alternative to the BAI format is the CSI (coordinate-sorted index), which generalizes the binning scheme with a configurable minimum interval size and depth and therefore supports reference sequences longer than 512 Mbp; it is generated using tools like samtools index with the -c option.

Relation to SAM

Key Differences

The BAM (Binary Alignment/Map) format serves as the binary counterpart to the SAM (Sequence Alignment/Map) format, encoding the same alignment data in a compact, machine-readable structure rather than human-readable text. While SAM uses plain-text strings for elements like reference names and aligned bases (e.g., "ACGT"), BAM employs binary encoding, such as representing nucleotide bases with 4-bit codes packed two per byte, which significantly reduces file sizes, typically by a factor of 3-4 compared to uncompressed SAM files. This compression is mandatory in BAM via the BGZF (Blocked GNU Zip Format) scheme, enabling efficient storage without loss of information. In terms of performance, BAM is optimized for high-speed input/output operations and random access to specific genomic regions, particularly when paired with indexing (e.g., BAI files), making it suitable for processing large-scale genomic datasets. In contrast, SAM's text-based nature facilitates manual inspection using standard text editors but results in slower parsing and sequential access, which can be prohibitive for terabyte-scale files common in next-generation sequencing. BAM's binary structure also allows for multi-threaded decompression and processing in tools like SAMtools, further enhancing computational efficiency over SAM. Both formats maintain identical logical structures, including a header section for metadata (e.g., reference sequences) and alignment records with fields like query name, flag, and mapping quality, ensuring seamless data equivalence. However, BAM's binary rigidity precludes direct text editing, necessitating conversion to SAM for any manual modifications, whereas SAM supports straightforward alterations with basic tools. Practically, BAM is preferred for long-term storage and integration into automated analysis pipelines due to its compactness and speed, while SAM is typically generated as the initial output from aligners such as BWA or Bowtie2 for debugging or visual review before conversion to BAM. This division leverages SAM's readability for human oversight and BAM's efficiency for downstream computational tasks like variant calling.

Conversion Processes

Conversion between BAM and SAM formats is bidirectional and primarily facilitated by the samtools view command from the SAMtools toolkit. To convert a SAM file to BAM, the command samtools view -b input.sam -o output.bam is used, where -b specifies binary BAM output; the input format is auto-detected. This process compresses the text-based SAM into the efficient binary BAM while preserving the alignment records. Conversely, converting BAM to SAM employs samtools view -h input.bam -o output.sam, with the -h flag ensuring the header is included in the decompressed text output, which is essential for downstream interpretation of the alignments. In typical next-generation sequencing (NGS) workflows, aligners such as BWA or Bowtie2 generate SAM output, which is then immediately converted to BAM for compact storage and efficient random access; this step is often automated via piping to avoid intermediate disk writes of large uncompressed SAM files, for example, bwa mem reference.fa reads.fastq | samtools view -b - > alignments.bam. Such integration minimizes storage overhead and facilitates subsequent operations like sorting and indexing directly on the BAM. Key considerations during conversion include preserving the input file's sorting order, either by coordinate (preferred for indexing) or by name, to maintain compatibility with tools expecting specific arrangements; unsorted inputs should be sorted post-conversion using samtools sort if needed. For handling large files, streaming via pipes, optionally with uncompressed BAM output (-u flag), avoids large intermediate files and unnecessary compression overhead by processing data in chunks rather than loading entire files. Post-conversion validation is recommended using samtools quickcheck output.bam, which verifies file integrity by checking the header and end-of-file markers without full decompression. Advanced options enhance efficiency and flexibility: multi-threading with the -@ INT flag (e.g., -@ 4 for four threads) accelerates compression during SAM-to-BAM conversion on multi-core systems. Subsetting specific genomic regions during conversion is possible by specifying the region as a positional argument on indexed BAM inputs (e.g., samtools view -b -o output.bam input.bam chr1:100-200), outputting only alignments overlapping the specified interval to reduce file size for targeted analyses. Common pitfalls include failing to re-index a BAM file after any post-conversion edits or filtering, which can lead to loss of random access and errors in region-based queries; always run samtools index on modified BAMs to generate the accompanying .bai file. Additionally, omitting the header during BAM-to-SAM conversion may render the output unusable for tools requiring reference sequence information.
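The same conversion, sorting, and indexing steps can also be scripted with Pysam rather than the command line; the sketch below is a minimal illustration under the assumption of hypothetical input and output file names.

```python
# SAM-to-BAM conversion with pysam, mirroring the samtools view commands above.
import pysam

with pysam.AlignmentFile("input.sam", "r") as sam_in, \
     pysam.AlignmentFile("output.bam", "wb", template=sam_in) as bam_out:
    for record in sam_in:        # the header is copied via template=
        bam_out.write(record)

# Coordinate-sort and index the result so region queries work, equivalent to
# running `samtools sort` and `samtools index`.
pysam.sort("-o", "output.sorted.bam", "output.bam")
pysam.index("output.sorted.bam")
```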

Applications and Tools

Common Software Tools

SAMtools is a core open-source toolkit for manipulating alignments in the BAM format, providing utilities for viewing, sorting, merging, indexing, and generating statistics from BAM files. Written in C and relying on HTSlib for high-throughput sequencing input/output operations, it supports conversion between SAM and BAM formats and is widely used for efficient processing of large-scale genomic data. Picard, developed by the Broad Institute, is a Java-based suite of command-line tools designed for validating, marking duplicates in, sorting, and gathering BAM files, helping ensure data quality in high-throughput sequencing pipelines. It integrates with the Genome Analysis Toolkit (GATK) ecosystem for downstream variant analysis and handles formats including SAM, BAM, and CRAM. HTSlib serves as a low-level C library providing programmatic access to BAM files for reading, writing, and querying alignments, with support for multithreading, plugin extensions, and efficient indexing mechanisms like BAI or CSI. It forms the foundational I/O layer for numerous tools, including SAMtools and BCFtools, enabling high-performance operations on compressed BAM data without decompressing entire files. BCFtools acts as a companion toolkit to SAMtools, primarily for variant calling and manipulation, where it processes BAM files via the mpileup engine to generate genotype likelihoods and outputs results in VCF or BCF format for further analysis. It supports multi-sample variant detection from aligned reads in BAM, facilitating downstream genomic interpretation. Among other notable tools, Pysam offers Python bindings to HTSlib, allowing scripted reading, writing, and manipulation of BAM files with facilities for alignment iteration and tag access. HTSJDK provides a Java API for accessing BAM and related formats, supporting SAM record parsing and integration into Java-based bioinformatics applications. Additionally, SamJdk (part of the jvarkit suite) enables BAM filtering using Java expressions compiled in memory for custom data extraction.
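As a small example of the alignment iteration and tag access that Pysam exposes, the following sketch reads the first record of a hypothetical BAM file and reports its NM (edit distance) tag when present.

```python
# Tiny sketch of tag access with pysam; "example.bam" is a hypothetical file.
import pysam

with pysam.AlignmentFile("example.bam", "rb") as bam:
    for read in bam:
        if read.has_tag("NM"):
            print(read.query_name, "edit distance:", read.get_tag("NM"))
        break  # inspect only the first record in this sketch
```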

Integration in Workflows

BAM files serve as a central component in standard next-generation sequencing (NGS) bioinformatics pipelines, where they facilitate efficient storage and processing of alignment data. Typically, the process begins with alignment tools such as BWA-MEM, which output alignments in SAM format; these are then converted to BAM using samtools view for compression and binary efficiency. Subsequent steps involve sorting the BAM file by genomic coordinates with samtools sort to enable rapid regional queries, followed by indexing via samtools index to create .bai files that support random access without loading the entire file. Duplicate read removal is commonly performed using Picard's MarkDuplicates tool on the sorted BAM, which identifies and flags PCR artifacts to improve downstream accuracy. Finally, variant calling pipelines like GATK's HaplotypeCaller rely on these indexed, processed BAM files as input to detect single nucleotide variants and small indels across the genome. In large-scale genomic projects, BAM files play a pivotal role in standardization and sharing, enabling collaborative analysis across institutions. For instance, the ENCODE project utilizes BAM as the primary format for aligned sequencing data in its uniform pipelines, producing filtered BAM files for assays like ChIP-seq and ATAC-seq to ensure reproducibility and interoperability. Similarly, the TCGA project distributes aligned BAM files through the Genomic Data Commons (GDC), where they form the basis for controlled-access sharing of whole-genome and exome sequencing data, supporting multi-omics integration and pan-cancer studies. The format's compatibility with BGZF compression allows for parallel processing on computing clusters, distributing tasks like regional extraction across nodes to handle petabyte-scale datasets efficiently. Advanced applications leverage BAM files for specialized analyses beyond basic variant calling. In ChIP-seq workflows, tools like MACS2 accept BAM inputs to model signal enrichment and call peaks for transcription factor binding sites or histone modifications, accounting for read duplicates and control samples to generate narrowPeak or broadPeak files. For RNA-seq quantification, featureCounts processes aligned BAM files to assign reads to genomic features like exons, producing count matrices for differential expression analysis with tools such as DESeq2. Structural variant detection pipelines, such as Delly, use duplicate-marked and indexed BAM files from paired tumor-normal samples to identify deletions, insertions, and translocations by analyzing split and discordant reads. Best practices for BAM integration emphasize optimization for performance and integrity. Indexing is essential for all query-based operations, as it allows tools to retrieve specific genomic regions without decompressing the full file, significantly reducing I/O overhead in iterative analyses. Coordinate-sorted BAM files are preferred over name-sorted ones for most applications, as they align with genomic order and facilitate merging and visualization in genome browsers. To ensure data fidelity, especially in shared repositories, checksums should be computed and stored alongside BAM files to verify integrity during transfers or long-term archiving. Challenges in BAM workflows often arise with multi-sample cohorts and high-volume data. Merging BAM files from multiple lanes or replicates requires careful coordinate sorting and duplicate handling to avoid artifacts, typically using samtools merge followed by re-indexing, which can be computationally intensive for cohorts exceeding hundreds of samples.
Scalability issues emerge with whole-genome datasets, where per-sample BAM files routinely exceed 100 GB, necessitating distributed computing frameworks like Apache Spark for parallel merging and processing to manage storage and runtime constraints effectively; a small sketch of partitioning work by reference sequence is shown below.
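One common way such parallelism is exploited is to treat each reference sequence as an independent work unit, which the index makes cheap to extract; the sketch below illustrates the idea with Pysam, using a hypothetical file name.

```python
# Per-reference-sequence work units on an indexed, coordinate-sorted BAM.
# "sample.sorted.bam" is hypothetical; an accompanying .bai index must exist.
import pysam

with pysam.AlignmentFile("sample.sorted.bam", "rb") as bam:
    for contig, length in zip(bam.references, bam.lengths):
        # fetch() uses the index, so each reference sequence is read in
        # isolation; duplicate-flagged reads (FLAG 0x400) are skipped here.
        n = sum(1 for r in bam.fetch(contig) if not r.is_duplicate)
        print(f"{contig}\t{length}\t{n} non-duplicate reads")
```

In a cluster setting, each iteration of this loop could instead be dispatched as a separate job, since the index lets every worker seek directly to its own reference sequence.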
