The Variant Call Format (VCF) is a standardized, tab-delimited text file format widely used in bioinformatics to store and exchange genetic variation data, including single nucleotide polymorphisms (SNPs), insertions, deletions (indels), structural variants, and copy number variations, along with associated annotations and quality metrics relative to a reference genome.[1] Developed in 2010 by the 1000 Genomes Project to represent human genetic variation across large-scale sequencing efforts, VCF has become a de facto standard adopted by numerous initiatives, such as UK10K, dbSNP, and the NHLBI Exome Project, owing to its flexibility, its scalability to millions of variants from thousands of samples, and its support for both diploid and non-diploid genomes. It is now maintained by the Global Alliance for Genomics and Health (GA4GH).[1][2][3]
VCF files typically begin with metadata lines prefixed by "##" to describe file format version, reference genome, and custom fields, followed by a header line prefixed by "#" defining the columns, and then data lines each representing a single variant site.[4] The core fixed columns include chromosome (CHROM), position (POS), variant identifier (ID), reference allele (REF), alternate allele(s) (ALT), quality score (QUAL), filter status (FILTER), and informational tags (INFO), with optional per-sample genotype data introduced via a FORMAT column.[4] To address storage and retrieval efficiency, VCF files are often compressed using bgzip and indexed with tabix for rapid querying of specific genomic regions, while a binary counterpart, BCF version 2, provides further optimization for computational pipelines.[1] The format's extensibility allows for symbolic alleles (e.g., for deletions) and specialized notations for complex rearrangements, ensuring compatibility with evolving genomic analyses, as reflected in its latest specification, version 4.5, released in October 2024.[4]
Overview
Purpose and Scope
The Variant Call Format (VCF) is a standardized, TAB-delimited text file format designed for efficiently storing and describing genetic variants identified from high-throughput sequencing data, including single nucleotide polymorphisms (SNPs), insertions, deletions (indels), structural variants, and copy number variations.[1][4] Its primary purpose is to facilitate the seamless exchange, storage, and analysis of genomic variation data across diverse bioinformatics tools, pipelines, and research consortia, thereby promoting interoperability in large-scale genomic studies.[1][5]
VCF supports both single-sample and multi-sample files, accommodating phased genotypes (indicating haplotype information) as well as unphased genotypes (representing diploid calls without phase).[4] Originating from the 1000 Genomes Project in 2010, the format was developed to standardize the representation of variant calls in population-scale sequencing efforts.[1]
Key benefits of VCF include its human-readable structure, which allows direct inspection without specialized software, and its extensibility through metadata fields that enable the inclusion of custom annotations and quality metrics.[4][5] Widely adopted in bioinformatics, VCF has become a de facto standard for variant data in projects ranging from clinical genomics to evolutionary studies, with a compressed binary counterpart (BCF) available for more efficient storage and querying.[1][4]
Basic Components
The Variant Call Format (VCF) file is structured as a tab-delimited text file divided into two primary sections: the header and the body. The header encompasses all lines beginning with "##" for meta-information and a single line starting with "#CHROM" to define the column structure, while the body consists of all subsequent non-header lines representing individual variant records. This organization enables efficient storage and parsing of genomic variation data in a standardized manner.[4]
The header plays a crucial role in providing essential metadata that contextualizes the file's contents, including the VCF version, contig sequences with their lengths for coordinate mapping, sample identifiers, and definitions for custom fields. These elements ensure interoperability across bioinformatics tools by specifying how to interpret genomic positions and associated annotations without embedding such details in each data line. For instance, contig information outlines the reference genome's chromosome structures, while sample names delineate the genotype columns for multi-sample files.[4][1]
In contrast, the body contains the core variant data, with each line detailing a genetic variant at a precise genomic locus, including reference and alternate alleles, quality metrics, filter statuses, and optional genotype calls for listed samples. This section lists variants in ascending order by position within each contig, facilitating sequential processing in downstream analyses such as population genetics or clinical reporting. VCF's structure supports its widespread use in genomics workflows for variant discovery and interpretation.[4][1]
A minimal VCF skeleton illustrates this division, featuring meta-information for the file format and a contig, followed by the column header and one example data line:
##fileformat=VCFv4.5
##contig=<ID=20,length=62435964>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2 sample3
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
Here, the "##" lines form the meta-information portion of the header, the "#CHROM" line completes the header by naming the fixed columns (CHROM for chromosome, POS for 1-based position, ID for variant identifier, REF for reference allele, ALT for alternate allele, QUAL for phred-scaled quality, FILTER for pass/fail status, and INFO for annotations), followed by FORMAT and sample columns. The body line provides a sample variant entry for three individuals.[4]
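The header/body division described above can be sketched in a few lines of Python. This is a minimal illustration, not a validating parser: the function name `split_vcf` is hypothetical, and the example text assumes real TAB separators between columns, as the format requires.

```python
def split_vcf(text):
    """Split VCF text into (meta_lines, column_names, records)."""
    meta, columns, records = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):        # meta-information line
            meta.append(line)
        elif line.startswith("#"):       # the single #CHROM column header line
            columns = line[1:].split("\t")
        elif line:                       # body: one variant record per line
            records.append(line.split("\t"))
    return meta, columns, records


example = "\n".join([
    "##fileformat=VCFv4.5",
    "##contig=<ID=20,length=62435964>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\tsample3",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ"
    "\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.",
])

meta, columns, records = split_vcf(example)
print(len(meta), columns[:3], records[0][:5])
```

Every record should have as many fields as the column header line, which gives a cheap sanity check before any further parsing.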
History and Development
Origins
The Variant Call Format (VCF) was initially proposed in 2010 by the International 1000 Genomes Project as a standardized approach to storing genetic variants, driven by the rapid growth of high-throughput sequencing data that demanded efficient, interoperable data management.[1] This effort addressed the challenges of handling millions of variant sites across thousands of samples in large-scale human genomics studies.[1]
Key development contributions came from leading institutions involved in the 1000 Genomes Project, including the EMBL European Bioinformatics Institute, the Wellcome Trust Sanger Institute, and the Broad Institute, with the first formal specification documented as VCF version 4.0.[1] Prominent researchers such as Richard Durbin, Mark A. DePristo, and Gonçalo Abecasis played central roles in defining the format's structure and guidelines.[1]
The primary motivation for VCF was to establish a flexible, tab-delimited textual format that could accommodate diverse variant types—including single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants—while replacing fragmented ad-hoc file formats prevalent in genomic research at the time.[1] This design emphasized extensibility for annotations and compatibility with existing tools, facilitating data sharing and analysis in collaborative projects.[1]
Early adoption of VCF occurred swiftly following its introduction, with integration into major analysis pipelines such as the Genome Analysis Toolkit (GATK) from the Broad Institute and SAMtools from the Wellcome Trust Sanger Institute by 2011, enabling widespread use in variant calling and processing workflows.[1]
Version Evolution
The Variant Call Format (VCF) originated within the 1000 Genomes Project, with its initial specification version 4.0 released in 2010.[1]
Version 4.1 followed in 2013, incorporating minor fixes and clarifications to improve consistency and usability in variant data representation.[6]
In 2014, VCF 4.2 introduced enhanced support for structural variants, including better encoding of complex rearrangements and symbolic alleles to accommodate larger genomic alterations.[7]
The 4.3 specification, released in 2017, added provisions for phased genotype data, enabling more precise representation of haplotype information across samples.[8]
Version 4.4, published in 2021, refined alternative allele descriptions and expanded options for structural variant annotations, such as tandem repeat notations and event types.[9]
Most recently, VCF 4.5 was issued in October 2024, with improvements to metadata structures and contig definitions to better support diverse reference genomes and sequencing technologies.[4]
Since 2013, the VCF specification has been maintained under the HTSlib and SAMtools project on GitHub, transitioning to open-source governance that incorporates community input through issues, discussions, and pull requests.[10][11]
This collaborative model, now overseen by the Global Alliance for Genomics and Health (GA4GH) Large Scale Genomics work stream, has driven iterative refinements while ensuring broad adoption.[12]
Key milestones include the 2013 shift to GitHub-hosted development, which democratized contributions, and the progression to version 4.5 as the current stable specification as of November 2025.[10]
Each version preserves backward compatibility with earlier iterations, allowing existing VCF files to remain valid, though deprecations—such as certain INFO field usages—are explicitly documented to guide migrations.[4]
Textual Format Structure
The header section of a textual Variant Call Format (VCF) file consists of meta-information lines prefixed with "##", followed by a single column header line prefixed with "#CHROM". These lines precede the data records and provide essential metadata about the file's format, reference genome, and custom fields, enabling parsers to interpret the subsequent variant data correctly.[4]
The mandatory elements include the file format version line, which must be the first meta-information line and specifies the VCF version in the form "##fileformat=VCFvX.Y", where X.Y denotes the major and minor version numbers (e.g., VCFv4.5). Additionally, the column header line is required and must start with "#CHROM" followed by tab-separated fixed columns—POS, ID, REF, ALT, QUAL, FILTER, INFO—and, if sample genotypes are present, optional sample columns appended thereafter. This header line defines the structure of all data records in the file.[4]
Optional meta-information lines allow for additional details, such as "##contig=<...>" to describe reference sequences (e.g., chromosome identifiers and lengths), "##reference" to indicate the genome build (e.g., GRCh38), "##SAMPLE" for sample metadata like individual IDs or phenotypes, and "##ALT" to define types of alternate alleles (e.g., deletions or symbolic variants). Other common optional lines include "##INFO" and "##FORMAT" for custom field definitions, as well as lines for file source, date, or assembly details. These elements enhance interoperability but are not required for basic compliance.[4]
Meta-information lines follow specific syntax rules: simple directives use "##KEY=VALUE" (e.g., "##fileformat=VCFv4.5"), while structured definitions use "##KEY=<TAG1=val1,TAG2=val2,...>" with comma-separated key-value pairs enclosed in angle brackets. INFO and FORMAT definitions require the tags ID (unique identifier), Number (expected number of values, such as 1, A for one per alternate allele, R for one per allele including the reference, G for one per possible genotype, or . for variable), Type (Integer, Float, Flag, Character, or String, with Flag applying to INFO only), and Description (a quoted string explaining the field); FILTER and ALT definitions require only ID and Description. Within a quoted Description, double quotes and backslashes must be escaped with a backslash (\" and \\) to prevent parsing errors. Unstructured lines can use "##KEY=free text" for arbitrary metadata.[4]
The following example illustrates a typical header snippet for a VCF file aligned to the human reference genome:
##fileformat=VCFv4.5
##fileDate=20241109
##reference=GRCh38
##contig=<ID=1,length=248956422>
##contig=<ID=2,length=242193529>
##ALT=<ID=DEL,Description="Deletion relative to the reference">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at this position">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency for each alternate allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype, encoded as allele values separated by either of / or |">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth at this position for this sample">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
This structure ensures the header provides a self-describing blueprint for the file's contents.[4]
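To make the meta-information syntax concrete, the following sketch parses one header line into a key and either a plain string or a dictionary of tags. It is an illustration under simplifying assumptions (well-formed input, quoted values escaped with backslashes); `parse_meta` and `_PAIR` are hypothetical names, not part of any library.

```python
import re

# TAG=value pairs; a value is either a double-quoted string or runs to the next comma.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_meta(line):
    """Return (key, value); value is a dict for structured lines, else a string."""
    key, _, value = line[2:].partition("=")
    if not (value.startswith("<") and value.endswith(">")):
        return key, value                      # e.g. ##fileformat=VCFv4.5
    fields = {}
    for name, raw in _PAIR.findall(value[1:-1]):
        if raw.startswith('"'):
            # Strip the quotes and undo backslash escapes inside the description.
            raw = re.sub(r"\\(.)", r"\1", raw[1:-1])
        fields[name] = raw
    return key, fields


gt_line = ('##FORMAT=<ID=GT,Number=1,Type=String,'
           'Description="Genotype, encoded as allele values separated by either of / or |">')
print(parse_meta("##fileformat=VCFv4.5"))
print(parse_meta("##contig=<ID=1,length=248956422>"))
print(parse_meta(gt_line))
```

Matching quoted values as a unit is what keeps the comma inside the GT description from being mistaken for a tag separator.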
Data Record Lines
The body of a Variant Call Format (VCF) file consists of data record lines that describe genetic variants at specific genomic positions. These lines are TAB-delimited, lack the "#" prefix found in header lines, and each represents a single variant site, potentially with multiple alternative alleles.[4] The number and order of columns in these lines are defined by the header, with fixed core columns followed by optional sample-specific genotype data.
Variant types are encoded using the REF (reference) and ALT (alternate) allele columns. For single nucleotide polymorphisms (SNPs) or substitutions, both REF and ALT specify single bases, such as REF=A and ALT=T. Insertions are represented with the reference base preceding the insertion in REF and that base plus the inserted sequence in ALT (e.g., REF=A and ALT=AT for insertion of T after a reference A), while deletions use a REF sequence longer than the corresponding ALT (e.g., REF=AT and ALT=A for deletion of the T). These conventions allow compact representation of small indels without explicit type tags.[4]
In sample columns, which contain genotype data for individuals or samples, phasing information is indicated by separators between alleles. The "|" symbol denotes phased genotypes (e.g., 0|1, indicating one haplotype carries the reference allele and the other the alternate), while "/" indicates unphased genotypes (e.g., 0/1, where the alleles are not linked to specific haplotypes). This distinction supports downstream analyses requiring haplotype resolution.[4]
Multi-allelic variants, where multiple alternate alleles exist at the same site, are handled by listing them in the ALT column separated by commas (e.g., ALT=G,T for two possible substitutions). Each data record thus captures biallelic or polyallelic complexity at one position, with genotypes in sample columns referencing these alleles by index (0 for REF, 1 for the first ALT, etc.).[4]
The following is an example of a data record line for a SNP, assuming a header has defined the columns (sample data adapted from the specification for illustration):
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
This line describes a G-to-A substitution at position 14370 on chromosome 20, with phased and unphased genotypes across three samples.[4]
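The REF/ALT and phasing conventions above can be illustrated with two small helpers. This is a rough sketch: `classify` is a length-based heuristic that ignores breakends, the "*" allele, and complex substitutions, and `parse_gt` assumes a genotype uses a single separator style throughout. Both names are hypothetical.

```python
import re


def classify(ref, alt):
    """Heuristically label one REF/ALT pair by allele lengths."""
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNP"  # equal-length, multi-base substitution


def parse_gt(gt):
    """Return (allele_indices, phased); a missing allele '.' becomes None."""
    phased = "|" in gt
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    return alleles, phased


# A multi-allelic ALT column is split on commas and each allele is classified separately.
print([classify("A", alt) for alt in "G,AT".split(",")])
print(parse_gt("1|0"), parse_gt("1/1"), parse_gt("./."))
```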
Column Definitions
VCF data lines consist of eight fixed, mandatory columns, optionally followed by a FORMAT column and one column per sample.[4] The first column, #CHROM, specifies the chromosome or contig identifier as a string without whitespace.[4] The second column, POS, indicates the 1-based reference position as a required integer, with records sorted by position within each reference sequence.[4] The third column, ID, provides semicolon-separated unique variant identifiers, such as dbSNP rsIDs (e.g., rs6054257), using strings without whitespace or semicolons, or "." if unknown.[4] The fourth column, REF, denotes the reference allele as a string of bases (A, C, G, T, or N, case-insensitive).[4] The fifth column, ALT, lists comma-separated alternate allele(s), each of which may be a base string (A, C, G, T, or N, case-insensitive), "*" for an allele missing because of an overlapping deletion, or a symbolic notation (e.g., <DEL> for deletions); "." indicates that there are no alternate alleles.[4] The sixth column, QUAL, reports the Phred-scaled quality score of the assertion made in ALT as a float, or "." if unknown.[4] The seventh column, FILTER, indicates the filter status as a string: "PASS" if the variant passes all filters, semicolon-separated filter names if it fails specific ones, or "." if filters have not been applied.[4] The eighth column, INFO, contains semicolon-separated key-value annotations for additional variant information, or "." if none.[4]
The ninth column, FORMAT, appears only if sample data is present and gives the colon-separated order of the fields in the subsequent sample columns.[4] Variable columns then follow, one per sample, containing genotype and related data (e.g., GT for genotype, GQ for quality) in the order specified by FORMAT, enabling multi-sample analysis.[4]
Key conventions govern these columns: POS must be a positive integer; REF and ALT alleles are represented as strings of standard nucleotide bases, with multiple bases allowed for indels; QUAL is a non-negative float; and FILTER terms are user-defined but must be declared in the header.[4] Missing or unknown values are denoted by ".", ensuring consistent parsing across tools.[4]
For illustration, consider this example data line from the VCF specification:[4]
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
| Column | Value | Description |
|---|---|---|
| #CHROM | 20 | Chromosome 20. |
| POS | 14370 | 1-based position on the reference. |
| ID | rs6054257 | dbSNP identifier for the variant. |
| REF | G | Reference base. |
| ALT | A | Alternate base (single nucleotide polymorphism). |
| QUAL | 29 | Phred-scaled quality score. |
| FILTER | PASS | Variant passes all filters. |
| INFO | NS=3;DP=14;AF=0.5;DB;H2 | Annotations: 3 samples with data (NS), total depth 14 (DP), allele frequency 0.5 (AF), dbSNP membership flag (DB), HapMap2 membership flag (H2). |
| FORMAT | GT:GQ:DP:HQ | Fields: genotype (GT), genotype quality (GQ), depth (DP), haplotype qualities (HQ). |
| Sample 1 | 0\|0:48:1:51,51 | Homozygous reference, phased (0\|0), GQ=48, DP=1, HQ=51,51. |
| Sample 2 | 1\|0:48:8:51,51 | Heterozygous, phased (1\|0), GQ=48, DP=8, HQ=51,51. |
| Sample 3 | 1/1:43:5:.,. | Homozygous alternate, unphased (1/1), GQ=43, DP=5, missing HQ. |
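The column breakdown in the table can be reproduced programmatically. The sketch below converts one TAB-separated data line into a dictionary, applying the "." missing-value convention; `parse_record` and `FIXED` are hypothetical names, and sample names are assumed to come from the column header line.

```python
FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_record(line, sample_names=()):
    """Parse one VCF data line into a dict keyed by column name."""
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(FIXED, cols[:8]))
    rec["POS"] = int(rec["POS"])                                  # 1-based position
    rec["ID"] = [] if rec["ID"] == "." else rec["ID"].split(";")
    rec["ALT"] = [] if rec["ALT"] == "." else rec["ALT"].split(",")
    rec["QUAL"] = None if rec["QUAL"] == "." else float(rec["QUAL"])
    rec["FILTER"] = None if rec["FILTER"] == "." else rec["FILTER"].split(";")
    if len(cols) > 8:                                             # genotype data present
        rec["FORMAT"] = cols[8].split(":")
        rec["samples"] = dict(zip(sample_names, cols[9:]))
    return rec


line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ"
        "\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.")
rec = parse_record(line, ["NA00001", "NA00002", "NA00003"])
print(rec["CHROM"], rec["POS"], rec["ALT"], rec["QUAL"], rec["samples"]["NA00003"])
```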
INFO Fields
The INFO fields in the Variant Call Format (VCF) provide site-level annotations for genetic variants, capturing metadata such as allele frequencies, read depths, and quality metrics in the eighth column of data records. These fields consist of semicolon-separated key-value pairs, where each key corresponds to a predefined identifier (ID) declared in the file header, and values are encoded to handle special characters if necessary.[4]
All INFO fields must be explicitly defined in the VCF header using lines formatted as ##INFO=<ID=<identifier>,Number=<count>,Type=<data type>,Description="<description>">, with optional Source and Version attributes; the ID, Number, Type, and Description are mandatory for each entry. The Number specifies the expected number of values per field, such as 1 for a single scalar, A for one value per alternate allele (in the order listed in the ALT column), R for one per allele (reference first), G for one per possible genotype, . for unknown or variable length, or 0 for flags that carry no value. Supported Types include Integer for whole numbers, Float for decimal numbers, Flag for boolean indicators without values, Character for single characters, and String for text sequences.[4]
In usage, INFO fields annotate variant properties across samples without per-sample granularity; for instance, vector-valued fields like those with Number=A produce comma-separated lists matching the ALT alleles, while flags (e.g., indicating indel status) appear without an equals sign or value. Duplicate keys are prohibited within a single record, and missing values are represented by a dot (.). All IDs used in the INFO column must have corresponding header declarations to ensure interoperability. The INFO column precedes the FORMAT column in data lines, providing global variant context before sample-specific details.[4]
Common INFO fields include allele-related metrics such as AC (allele count: total number of alternate alleles observed in genotypes, Number=A, Type=Integer), AN (total number of alleles: total alleles in called genotypes, often 2×number of samples for diploids, Number=1, Type=Integer), and AF (allele frequency: estimated frequency of each alternate allele, computed as AC/AN, Number=A, Type=Float). Depth and quality fields frequently encountered are DP (total depth: approximate total read depth across samples at the locus, Number=1, Type=Integer) and MQ (mapping quality: root mean square mapping quality across samples, Number=1, Type=Float). Statistical tests for bias detection include BaseQRankSum (Z-score from the Wilcoxon rank sum test comparing base qualities of reads supporting reference versus alternate alleles, Number=1, Type=Float) and ClippingRankSum (Z-score from the Wilcoxon rank sum test comparing clipping counts—hard-clipped bases—in reads supporting reference versus alternate alleles, Number=1, Type=Float), both used to flag potential artifacts in variant calls.[4][13][14]
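As a worked illustration of the allele metrics, the sketch below derives AC, AN, and AF from a list of GT strings, skipping missing alleles. The helper name `allele_counts` is hypothetical, and the AF = AC/AN relationship is the simple observed-frequency definition described above rather than a model-based estimate.

```python
import re


def allele_counts(genotypes, n_alt):
    """Return (AC, AN, AF) computed from GT strings such as '0|1' or './.'."""
    ac = [0] * n_alt          # one count per alternate allele (Number=A)
    an = 0                    # total number of called alleles (Number=1)
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue      # missing alleles do not contribute to AN
            an += 1
            if int(allele) > 0:
                ac[int(allele) - 1] += 1
    af = [count / an for count in ac] if an else [None] * n_alt
    return ac, an, af


# Genotypes of the three samples in the specification's example record.
print(allele_counts(["0|0", "1|0", "1/1"], n_alt=1))
```

For the three example genotypes this gives three alternate alleles out of six called alleles, which agrees with the AF=0.5 annotation on that record.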
For example, a header might define ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth"> and ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">, while a data line INFO string could appear as DP=100;AF=0.05,0.02;INDEL, where DP=100 indicates total depth, AF provides frequencies for two alternate alleles, and INDEL serves as a flag without a value.[4]
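A short sketch of how such an INFO string might be decoded follows. The `types` mapping stands in for the Type declared in the header's ##INFO lines and is an assumption of this example, as is the function name `parse_info`; every non-flag value is returned as a list for simplicity, regardless of its declared Number.

```python
def parse_info(info, types=None):
    """Parse an INFO column into a dict; flags map to True, other keys to lists."""
    if info == ".":
        return {}
    types = types or {}
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:                       # a Flag has no '=' and no value
            parsed[key] = True
            continue
        cast = types.get(key, str)        # default to String when no type is known
        parsed[key] = [None if v == "." else cast(v) for v in value.split(",")]
    return parsed


info = parse_info("DP=100;AF=0.05,0.02;INDEL", types={"DP": int, "AF": float})
print(info)
```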
The FORMAT column in a Variant Call Format (VCF) file specifies the structure and order of per-sample genotype data, using a colon-separated list of tags (e.g., GT:AD:DP) that describe the fields provided for each sample.[4] This column appears after the INFO column in the header line and data records whenever genotype data is present, and the subsequent sample-specific columns contain values in the exact order defined by the FORMAT tags.[4] If GT is present, it must be the first tag, and each tag is declared in the file's meta-information header using lines prefixed with ##FORMAT.[4]
Common FORMAT tags include GT for genotype, which represents the called alleles for each sample using numeric indices (e.g., 0/1 for an unphased heterozygous call where 0 is the reference allele and 1 is the first alternate allele, or 0|1 for a phased heterozygous call); the slash (/) denotes unphased data, while the pipe (|) indicates phasing information.[4] AD gives the read depth for each allele (e.g., 10,20 for reference and alternate depths); DP indicates the total read depth at the locus for the sample; GQ is the genotype quality, a Phred-scaled probability that the genotype call is incorrect (e.g., 30 meaning a 0.001 probability of error); and PL lists Phred-scaled likelihoods for all possible genotypes, given in the specification's fixed genotype order rather than sorted by likelihood (e.g., 0,10,100 for genotypes 0/0, 0/1, and 1/1 at a biallelic site, where the value 0 marks the most likely genotype).[4]
Each FORMAT tag includes a Number and Type specification in its header definition, similar to INFO fields but applied per sample: Number can be 1 (one value per sample), A (one per alternate allele), R (one per allele, including reference), G (one per possible genotype; for diploid samples, this is (A+1)(A+2)/2 where A is the number of alternate alleles), or . (unknown/variable); Type is Integer (whole numbers), Float (decimal numbers), Character (single printable ASCII character), or String (free text).[4] For multi-allelic sites with multiple alternate alleles in the ALT column, genotypes use the higher allele indices (e.g., 1/2 for a sample heterozygous for the first and second alternates, or 1|2 when phased).[4] Missing values in any FORMAT field are denoted by a dot (.), such as ./. for a missing genotype or 0/1:. for missing depth data, preserving the structure across samples.[4]
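The genotype count used by Number=G can be checked numerically. The sketch below generalises the diploid formula (A+1)(A+2)/2 to arbitrary ploidy as a count of allele multisets, and lists diploid genotypes in the order in which the VCF specification arranges per-genotype values, where genotype j/k with j ≤ k sits at index k(k+1)/2 + j. Function names are hypothetical.

```python
from math import comb


def n_genotypes(n_alt, ploidy=2):
    """Number of unordered genotypes for n_alt alternate alleles plus the reference."""
    return comb(n_alt + ploidy, ploidy)


def diploid_genotype_order(n_alt):
    """Diploid genotypes (j, k) with j <= k, in per-genotype field order."""
    return [(j, k) for k in range(n_alt + 1) for j in range(k + 1)]


def diploid_genotype_index(j, k):
    """Position of genotype j/k within a Number=G vector such as PL."""
    j, k = sorted((j, k))
    return k * (k + 1) // 2 + j


print(n_genotypes(1), n_genotypes(2))
print(diploid_genotype_order(2))
print(diploid_genotype_index(1, 2))
```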
In a multi-sample VCF, the FORMAT column applies uniformly to all samples, with each sample's column providing colon-separated values matching the tag order.[4] For example, a data line with FORMAT=GT:AD:DP:GQ:PL might have sample values like "0/1:10,20:30:99:120,0,255" for a heterozygous sample with 30 total reads (10 reference, 20 alternate), a high-quality genotype call, and normalized likelihoods in which the heterozygous genotype has PL=0 and is therefore favored over homozygous reference or alternate.[4]
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
20 14370 rs6054257 G A 29 PASS NS=2;DP=28;AF=0.25 GT:AD:DP:GQ:PL 0/0:14,0:14:48:0,48,837 0/1:8,6:14:48:48,0,837
In this example, Sample1 is homozygous reference (GT=0/0) with all 14 reads supporting the reference allele, while Sample2 is heterozygous (GT=0/1) with 8 reference and 6 alternate reads. Both have a genotype quality of 48, and in each case the PL value of 0 falls on the called genotype.[4]
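The pairing of FORMAT tags with sample values, and the Phred scaling behind GQ and PL, can be sketched as follows. The helper names are hypothetical; the conversion assumes the standard Phred relationship Q = -10 log10(p), and `pl_to_probs` simply renormalises the likelihoods, which amounts to assuming a flat prior over genotypes.

```python
def parse_sample(fmt, sample):
    """Pair FORMAT tags with one sample's colon-separated values."""
    keys = fmt.split(":")
    values = sample.split(":")
    values += ["."] * (len(keys) - len(values))   # trailing fields may be omitted
    return dict(zip(keys, values))


def phred_to_prob(q):
    """Convert a Phred-scaled value to a probability: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def pl_to_probs(pl):
    """Turn Phred-scaled genotype likelihoods into normalised probabilities."""
    weights = [phred_to_prob(p) for p in pl]
    total = sum(weights)
    return [w / total for w in weights]


sample1 = parse_sample("GT:AD:DP:GQ:PL", "0/0:14,0:14:48:0,48,837")
probs = pl_to_probs([int(x) for x in sample1["PL"].split(",")])
print(sample1["GT"], sample1["AD"], probs)
print(phred_to_prob(30))   # error probability for GQ=30
```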
Version 4.5 of the VCF specification, released in October 2024, introduced enhancements for FORMAT fields, including new Number types such as P (one value per ploidy level) and M (one value per base modification), and support for encoding base modifications like 5-methylcytosine (e.g., via tags such as M5mC).[4]
BCF Encoding
The Binary Call Format (BCF) serves as the compressed binary counterpart to the textual Variant Call Format (VCF), enabling more efficient storage and random access of genomic variant data by combining a typed binary encoding of records with block-based gzip (BGZF) compression of the file.[4] BCF files typically use the .bcf extension and require complete specification of all INFO, FORMAT, and FILTER fields in the header to ensure unambiguous decoding, supporting a subset of VCF features with fully typed data elements.[4] The current version, BCF2.2, is specified alongside VCF 4.5 and includes features such as an explicit END_OF_VECTOR value for padding variable-length fields and an optional IDX key in header lines.[4]
The BCF header begins with a fixed magic string "BCF" followed by two uint8_t bytes indicating the major and minor version numbers (e.g., 2 and 2 for BCF2.2), ensuring compatibility checks during file reading.[4] A uint32_t specifies the length of the subsequent NUL-terminated textual header, which mirrors the VCF header lines (from ##fileformat to the column header #CHROM), providing metadata like contig definitions and field descriptions.[4] The lookup tables used by the records are derived from this header text rather than stored separately: a dictionary of strings assigns integer indices to the FILTER, INFO, and FORMAT IDs (with PASS at index 0), and a dictionary of contigs assigns indices to the sequences declared in ##contig lines, so that records can refer to these names by compact integers.[4]
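The fixed prefix described above is simple enough to decode by hand. The sketch below assumes the bytes have already been decompressed (for example by a BGZF or gzip reader) and that multi-byte integers are little-endian; `parse_bcf_prefix` is a hypothetical helper and the blob is a synthetic example built in place, not data from a real file.

```python
import struct


def parse_bcf_prefix(data):
    """Decode magic, version and header text from decompressed BCF bytes."""
    if data[:3] != b"BCF":
        raise ValueError("not a BCF stream")
    major, minor = data[3], data[4]                 # two uint8_t version bytes
    (l_text,) = struct.unpack_from("<I", data, 5)   # uint32_t header length, incl. NUL
    text = data[9:9 + l_text].rstrip(b"\x00").decode("utf-8")
    return major, minor, text


# Build a tiny synthetic prefix to exercise the parser.
header_text = (b"##fileformat=VCFv4.5\n"
               b"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n\x00")
blob = b"BCF\x02\x02" + struct.pack("<I", len(header_text)) + header_text

major, minor, text = parse_bcf_prefix(blob)
print(major, minor, text.splitlines()[0])
```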
The body of a BCF file consists of a sequence of variable-length records stored in BGZF blocks for seekable access, with the file terminating in an empty BGZF block.[4] Each record is prefixed by two uint32_t lengths giving the size in bytes of a shared site section (common data such as position and alleles) and an individual section (per-sample genotype data).[4] The shared section holds the CHROM index (int32_t offset into the contig dictionary), 0-based POS (int32_t), reference length rlen (int32_t), QUAL score (float), a uint32_t combining allele and INFO counts (n_allele << 16 | n_info), and a uint32_t combining FORMAT and sample counts (n_fmt << 24 | n_sample), followed by the ID as a typed string, the REF and ALT alleles as typed strings, the FILTER field as a typed vector of integer dictionary indices (with 0 denoting PASS), and the INFO key-value pairs.[4] Data elements use a typed encoding scheme in which the low 4 bits of a typing byte specify the atomic type (1=int8_t, 2=int16_t, 3=int32_t, 5=float, 7=char), which implies the element size, and the high 4 bits give the number of elements (0-14, with 15 signalling that the actual length follows as a typed integer). Missing values are represented by reserved codes such as 0x80 for int8_t, and shorter vectors are padded with a reserved END_OF_VECTOR value (e.g., 0x81 for int8_t or 0x8001 for int16_t).[4]
In the individual section, each FORMAT field is stored as a typed key (its dictionary index) followed by a typing byte and the values for all samples.[4] GT alleles are encoded as integers of the form (allele+1)<<1 | phased, so that a missing allele is 0 and an unphased reference allele is 2, while other fields (e.g., DP for depth, GQ for quality) are stored as typed arrays. Allele indices refer to the record's REF/ALT list, and every sample's vector for a given field has the same length, given by the typing byte, with shorter vectors padded by END_OF_VECTOR values, enabling compact representation without text delimiters.[4]
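The bit-level conventions for typing bytes and GT values can be illustrated directly. This is a sketch of the arithmetic only, with hypothetical function names; it treats the phase flag as a per-allele bit, as in the (allele+1)<<1 | phased expression, and does not read or write real BCF records.

```python
def decode_type_byte(byte):
    """Split a typing byte into (element_count, type_code)."""
    return byte >> 4, byte & 0x0F   # high 4 bits: count; low 4 bits: atomic type


def encode_allele(allele, phased):
    """Encode one GT allele as (allele + 1) << 1 | phased; None means missing."""
    index = -1 if allele is None else allele
    return ((index + 1) << 1) | int(phased)


def decode_allele(value):
    """Invert encode_allele, returning (allele_or_None, phased)."""
    index = (value >> 1) - 1
    return (None if index < 0 else index), bool(value & 1)


print(decode_type_byte(0x21))                            # two int8_t values
print([encode_allele(0, False), encode_allele(1, False)])  # unphased 0/1
print([encode_allele(0, False), encode_allele(1, True)])   # 0|1, phase bit on 2nd allele
print(decode_allele(5), decode_allele(0))
```

A count nibble of 15 is not a literal length; it indicates that the true element count follows the typing byte as a separate typed integer.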
Conversion between VCF and BCF is facilitated by tools in the HTSlib suite, such as bcftools view -O b to generate a BCF from VCF input. BCF files are indexed with CSI (.csi) for random querying, while BGZF-compressed VCF files can use either CSI or tabix (.tbi) indexes.[15] This bidirectional support ensures BCF files can be processed interchangeably with VCF in pipelines while benefiting from reduced file sizes and faster I/O.[15]
Compression Mechanisms
The Binary Call Format (BCF) employs several strategies to minimize file size while maintaining efficient access for large-scale genomic datasets. The primary mechanism is BGZF, a block-based extension of zlib compression that allows parallel decompression and random access without loading the full file, similar to its use in BAM files. The header, which embeds the original VCF metadata as text, is stored at the start of the same BGZF stream as the variant records.[4]
Fixed fields are stored as binary values rather than text. The position (POS) field is stored as an absolute 0-based int32_t. Allele representations (REF and ALT) are stored directly as typed, length-prefixed character strings and are not dictionary-encoded. Genotype data is stored as typed integer vectors for each FORMAT field, with one integer per allele for GT (two per diploid sample).[4][15]
String and array fields in INFO and FORMAT are handled as typed blocks, with compression provided by BGZF. This is particularly effective for repetitive annotations or uniform quality scores. INFO and FORMAT keys and FILTER names are stored as integer offsets into the header-derived dictionary (e.g., "PASS" for FILTER is index 0), while the values themselves are stored as typed data.[4]
Indexing further enhances performance on compressed files. The Coordinate-Sorted Index (CSI) provides a binary structure for BCF, enabling queries over genomic intervals without decompressing the entire file, while Tabix Index (TBI) supports similar functionality for BGZF-compressed VCF equivalents. These indices facilitate subset retrieval, yielding substantial benefits for large files; for instance, BCF achieves over 90% size reduction compared to uncompressed VCF, with a typical single-record encoding of 96 bytes versus hundreds in text format.[4]
Despite these efficiencies, BCF has trade-offs: decoding depends on specialized tools such as bcftools or HTSlib, because the binary structure is not human-readable like VCF. For a dataset with 1 million variants and modest sample counts, BCF files are typically 4-10 MB, compared with 50-100 MB for uncompressed VCF or 10-20 MB for gzipped VCF; the storage savings and faster sequential access come at the cost of an initial conversion step.[4][16]
Specifications and Extensions
Version-Specific Features
The Variant Call Format (VCF) has evolved through several major versions, each introducing targeted enhancements to support advanced genomic analyses while maintaining core compatibility. Version 4.2, released in 2012, marked a significant expansion for representing structural variants by introducing symbolic alleles in the ALT field, such as <INS> for insertions and <DEL> for deletions, which allow compact notation without specifying exact sequences for large events.[7] This version also formalized the breakend (BND) notation for complex rearrangements, using bracketed ALT strings of the form t[p[, t]p], ]p]t, or [p[t (for example, G[chr2:1234[), where t is the replacement sequence and p is the mate position, to denote breakpoint orientation; INFO tags such as MATEID and EVENT link related records.[7]
Version 4.3, published in 2017, refined phasing and multi-allelic representations to better accommodate haplotype data from population studies. It introduced the PS (Phase Set) tag in the FORMAT field as a non-negative integer to group phased genotypes across sites on the same chromosome, enabling efficient tracking of haplotype blocks without separate files.[8] Additionally, it improved multi-allelic handling by allowing INFO fields like AC (Allele Count) and AF (Allele Frequency) to use Number=A for one value per alternate allele, and added FORMAT tags AD, ADF, and ADR for per-allele depth counts, facilitating precise allele-specific quality assessment.[8]
In version 4.4, released in 2022, standardization efforts focused on variant spans and semantics for broader tool interoperability. The END INFO tag was formalized as required for symbolic alleles like <*> (representing unspecified non-reference alleles in gVCF blocks), defining the end position of the variant interval to support accurate indexing and querying.[9] Alt allele semantics were enhanced with support for tandem repeat contractions/expansions via <CNV:TR> in ALT, using accompanying INFO fields like RN (Repeat Name) and RUS (Reference Unit Sequence) to describe repeat units, while clarifying breakend parsing rules for colons in contig names.[9]
Version 4.5, finalized on October 9, 2024, emphasized metadata structure and reference handling for modern assemblies and annotations. It extended contig metadata with a URL tag (e.g., ##contig=<ID=chr1,URL=ftp://example/assembly.fa>), allowing direct links to sequence files for unambiguous reference resolution across diverse genomes.[4] Nested metadata was introduced in INFO and FORMAT headers via optional Source and Version keys (e.g., ##INFO=<ID=CLNSIG,Source="ClinVar",Version="20220804">), promoting traceable annotations, while deprecating legacy tags like AA (Ancestral Allele) in favor of structured alternatives and replacing END with SVLEN or FORMAT:LEN for variant lengths.[4] New support for base modifications (e.g., FORMAT:M5mC for 5-methylcytosine) and Number=P (ploidy-matched values) further enables epigenomic integrations.[4]
VCF files declare their version via the mandatory ##fileformat=VCFv4.x line in the header, enabling tools to detect and adapt to features dynamically.[7] Most parsers, such as bcftools, maintain backward compatibility by ignoring unrecognized tags and defaulting to prior semantics (e.g., treating absent PS as unphased), though full utilization of later features requires version-aware processing.[17]
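Version detection of this kind can be sketched in a few lines of Python. The function name is illustrative, and the sketch assumes the fileformat declaration is the first line of the file.

```python
import re


def vcf_version(first_line):
    """Parse '##fileformat=VCFv4.x' into a (major, minor) tuple."""
    match = re.fullmatch(r"##fileformat=VCFv(\d+)\.(\d+)", first_line.strip())
    if match is None:
        raise ValueError("missing or malformed ##fileformat line")
    return int(match.group(1)), int(match.group(2))


version = vcf_version("##fileformat=VCFv4.5\n")
# Tuples compare element by element, so feature gates read naturally.
has_44_semantics = version >= (4, 4)
```

Comparing tuples rather than strings avoids ordering mistakes if a minor version ever reaches two digits.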
For illustration, consider a deletion structural variant record evolving across versions. In VCF 4.1 or earlier, it might appear as a simple indel:
chr1 1000 . AT A . PASS . GT:AD 0/1:10,5
In 4.2, symbolic notation and END are added:
chr1 1000 . N <DEL> . PASS SVTYPE=DEL;END=2000 GT:AD 0/1:10,5
By 4.3, multi-allelic depth refines it (if overlapping another variant):
chr1 1000 . N <DEL>,C . PASS SVTYPE=DEL;END=2000 GT:AD:ADF:ADR 0/1:10,5,0:5,3,0:5,2,0
In 4.4, <*> could span non-variant blocks if gVCF-style:
chr1 1000 . N <DEL>,<*> . PASS SVTYPE=DEL;END=2000 GT:AD 0/1:10,5,0
And in 4.5, metadata nesting and deprecation yield:
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="SV type",Source="Custom",Version="1.0">
chr1 1000 . N <DEL> . PASS SVTYPE=DEL;SVLEN=1000 GT:AD 0/1:10,5
This progression highlights how versions build expressiveness without breaking core parsing.[4]
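A version-tolerant reader has to derive the same interval from any of these record styles. The Python sketch below parses a data line and computes the end coordinate from END, from SVLEN, or from the REF length, in that order. The function names are illustrative; whitespace splitting is used because the example records in this section are space-separated, whereas real VCF data lines are tab-delimited, and the absolute value of SVLEN is taken so that both negative and positive deletion lengths are accepted.

```python
def parse_record(line):
    """Split a data line into the fixed fields and an INFO dictionary."""
    fields = line.split()
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_map = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_map[key] = value if value else True  # flags carry no value
    return {"chrom": chrom, "pos": int(pos), "ref": ref,
            "alts": alt.split(","), "info": info_map}


def variant_end(record):
    """Return the 1-based end position covered by the record."""
    info = record["info"]
    if "END" in info:
        return int(info["END"])
    if "SVLEN" in info:
        # Use the first value; abs() tolerates the older negative convention.
        return record["pos"] + abs(int(info["SVLEN"].split(",")[0]))
    return record["pos"] + len(record["ref"]) - 1


simple = parse_record("chr1 1000 . AT A . PASS . GT:AD 0/1:10,5")
symbolic = parse_record("chr1 1000 . N <DEL> . PASS SVTYPE=DEL;END=2000 GT:AD 0/1:10,5")
by_length = parse_record("chr1 1000 . N <DEL> . PASS SVTYPE=DEL;SVLEN=-1000 GT:AD 0/1:10,5")
```

Here the simple indel ends at position 1001, while both symbolic forms of the deletion resolve to position 2000.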
The Genomic VCF (gVCF) extends the standard VCF format to represent both variant and non-variant genomic sites efficiently, addressing the inefficiency of standard VCF in handling large reference-matched regions by encoding continuous blocks of non-variant sites with reference confidence intervals.[18] In gVCF files, non-variant blocks are typically denoted by a symbolic allele such as <NON_REF> or <*> in the ALT field, with the END INFO tag marking block boundaries and fields such as MIN_DP summarizing quality across the block, while variant sites retain standard VCF notation. This structure facilitates joint genotyping pipelines, such as GATK's GenotypeGVCFs tool, where multiple gVCFs are combined to produce a multi-sample VCF.[19] gVCF builds directly on VCF's header and column structure, requiring only additional metadata in the FORMAT and INFO fields (e.g., GL for genotype likelihoods in blocks), and has become integral to scalable genomic analysis workflows.[18]
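The block encoding can be illustrated with a simplified Python sketch that collapses consecutive non-variant sites into blocks whenever their genotype quality (GQ) falls in the same band. The band edges, function names, and output layout are assumptions for illustration, loosely modeled on GATK-style records that use the <NON_REF> symbolic allele; real callers apply their own banding rules and emit additional fields.

```python
from bisect import bisect_right


def to_blocks(start, gqs, band_edges=(20, 60)):
    """Group consecutive sites by GQ band; start is the 1-based first position."""
    blocks = []
    for offset, gq in enumerate(gqs):
        pos = start + offset
        band = bisect_right(band_edges, gq)  # index of the band containing gq
        if blocks and blocks[-1]["band"] == band:
            # Same band as the previous site: extend the open block.
            blocks[-1]["end"] = pos
            blocks[-1]["min_gq"] = min(blocks[-1]["min_gq"], gq)
        else:
            blocks.append({"start": pos, "end": pos, "band": band, "min_gq": gq})
    return blocks


def block_line(chrom, block, ref_base="N"):
    """Render one block as a tab-delimited gVCF-style record."""
    return "\t".join([
        chrom, str(block["start"]), ".", ref_base, "<NON_REF>", ".", ".",
        f"END={block['end']}", "GT:GQ", f"0/0:{block['min_gq']}",
    ])


# Six hypothetical reference sites starting at position 100.
blocks = to_blocks(100, [50, 55, 45, 10, 12, 99])
lines = [block_line("chr1", b) for b in blocks]
```

In this example six sites collapse into three records, and reporting the minimum GQ per block keeps the stated confidence conservative for every site the block covers.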
The haplotype VCF (hVCF) provides a hierarchical extension to VCF for representing phased haplotype variants, particularly in pangenome or multi-sample contexts where variants are organized across haplotype paths rather than linear reference coordinates.[20] Developed for tools like the Practical Haplotype Graph (PHG), hVCF introduces fields such as HP (haplotype path) in the INFO column to denote variant associations with specific haplotypes, enabling multi-level calls from individual to population scales by nesting variants within haplotype structures; this addresses limitations in standard VCF for capturing phase and graph-based genomes without fragmenting data across multiple files.[21] Like gVCF, hVCF adheres to core VCF syntax but extends it for hierarchical organization, supporting applications in crop genomics and beyond.[22]
Specialized VCF extensions for domain-specific data, such as immunogenetics and copy number analysis, leverage standard VCF's flexibility through custom INFO and FORMAT tags to encode additional annotations without altering the base format. For HLA typing in immunogenetics, VCF files incorporate HLA allele calls via INFO tags like HLA_GT for genotypes and HLA alleles resolved from exonic variants, often generated by tools like HLA-LA or OptiType that output compliant VCFs for integration into broader variant datasets; this approach handles the high polymorphism of HLA loci while maintaining interoperability.[23] Similarly, for copy number variants (CNVs), VCF uses predefined INFO fields such as CN (copy number value), CNQ (phred-scaled copy number quality), and SVTYPE=CNV to represent segmental duplications or deletions, with END marking event boundaries; these extensions, formalized in VCF 4.2, enable precise CNV reporting in somatic and germline contexts without requiring entirely new formats.[7][24]
As of 2025, gVCF has seen widespread adoption in cloud-based genomics platforms, including Illumina's DRAGEN for scalable joint genotyping of large cohorts and GA4GH initiatives for VCF interoperability, enabling efficient storage and processing of whole-genome data in distributed environments by compactly representing non-variant regions.[25][26] These related formats collectively mitigate standard VCF's gaps in density and hierarchy, enhancing efficiency in joint analysis and specialized applications while preserving backward compatibility.[18]