
General feature format

The General Feature Format (GFF) is a tab-delimited text file format standardized in bioinformatics for representing the locations, structures, and attributes of genomic features, such as genes, exons, introns, and regulatory elements, relative to a reference DNA sequence.^[1] Developed initially by the Sanger Institute in 1997, GFF enables the annotation and exchange of sequence-based data across tools and databases, facilitating genome assembly, gene prediction, and functional analysis.^[1] Each feature is described on a single line comprising nine tab-separated columns: the sequence identifier (e.g., chromosome name), source of the feature (e.g., annotation tool), feature type (e.g., "gene" or "CDS" per the Sequence Ontology), start and end genomic coordinates (1-based, inclusive), a numerical score (often omitted as "."), strand orientation ("+" or "-"), phase for coding sequences (0, 1, 2, or "."), and an attributes field for key-value pairs like ID or Name.^[2]^[3] GFF has evolved through versions, with GFF3 (introduced in 2004) as the current recommended standard, incorporating hierarchical relationships via parent-child feature linkages and controlled vocabularies to ensure interoperability.^[1] A related format, the Gene Transfer Format (GTF), is a variant of GFF2 primarily used for transcript annotations. It restricts feature types to a small set of common ones (e.g., "transcript", "exon", "CDS") and uses a specific semicolon-separated format for attributes in the ninth column, but retains all nine columns including phase.^[4] GFF files are widely supported by genome browsers (e.g., UCSC Genome Browser, Ensembl) and analysis pipelines (e.g., for RNA-seq or variant calling), with tools like AGAT for validation and conversion to ensure data integrity.^[3]^[5] Despite its flexibility, GFF's unstructured attributes can pose parsing challenges, prompting extensions like GFF3's formal ontology integration.^[6]

Introduction

Overview

The General Feature Format (GFF) is a tab-delimited text file format designed for representing genomic features, such as genes, exons, transcripts, and regulatory elements, on DNA, RNA, or protein sequences.^[2] Developed as a standard in bioinformatics, GFF enables the structured description of sequence annotations in a machine-readable way. GFF is primarily used for genome annotation, facilitating the integration of experimental and computational results in sequence analysis pipelines, and supporting data exchange across bioinformatics tools and databases.^[7] It allows researchers to share feature predictions and experimental data without requiring proprietary formats, promoting interoperability in large-scale genomic projects. The format's key advantages lie in its simplicity, which permits easy parsing and manipulation using standard utilities like awk or Perl, and its flexibility for encoding hierarchical relationships among features, such as nested exons within genes. Additionally, GFF is compatible with the Sequence Ontology (SO), a controlled vocabulary that standardizes feature types and relationships, enhancing semantic consistency across annotations. In terms of basic composition, a GFF file typically begins with optional header directives and consists of one line per feature, capturing essential annotation details in a consistent structure.^[2] The format has evolved across versions to address growing annotation complexity, with GFF3 as the modern iteration.

History and Development

The General Feature Format (GFF) originated from a 1997 meeting on computational genefinding held at the Isaac Newton Institute for Mathematical Sciences in Cambridge, UK, amid efforts to standardize data exchange for the Human Genome Project. Proposed by Richard Durbin of the Sanger Centre (now Wellcome Sanger Institute) and David Haussler of the University of California, Santa Cruz, it served as a simple, tab-delimited text format for describing gene predictions and other genomic features, enabling interoperability between gene-finding tools without requiring full system integration.^[5]^[1] Early iterations focused on basic functionality, with GFF1 providing a foundational nine-column structure for feature records. GFF2, formalized around 1999, enhanced this by introducing structured tag-value attributes, support for RNA and protein sequences, and flexible scoring options, addressing limitations in representing complex annotations. By 2004, GFF3 emerged to incorporate multi-level parent-child relationships for hierarchical features and compatibility with controlled vocabularies, mitigating fragmentation from ad hoc extensions in prior versions and promoting ontology-driven descriptions.^[1]^[8]^[9] The Sequence Ontology Consortium, established in 2003, has overseen the specifications for GFF3 since its introduction in 2004, aligning the format with the Sequence Ontology to ensure precise, semantically rich annotations for biological sequences.^[10]^[11] Key adoption milestones reflect GFF's role in advancing genomic data sharing. In the 2000s, it integrated into the Ensembl project, enabling browser-based visualization and export of annotations across vertebrate and invertebrate genomes. The 2010s saw its prominence in initiatives like modENCODE, where it standardized functional element annotations for Drosophila melanogaster and Caenorhabditis elegans. 
By 2020, GFF3 had become a primary format for NCBI genome submissions, facilitating global interoperability in the post-genomic era's data deluge.^[12]^[13]^[7] These developments addressed critical challenges in the post-genomic landscape, where diverse sequencing projects demanded a flexible yet rigorous standard for exchanging annotations without loss of hierarchical or ontological detail.^[10]

Versions

GFF1 and GFF2

The General Feature Format (GFF) version 1, introduced in 1997 during a meeting on computational gene finding at the Isaac Newton Institute in Cambridge, UK, organized by the Sanger Centre and the University of California, Santa Cruz, established a basic tab-delimited file structure for describing genomic features associated with DNA sequences.^[8]^[5] GFF1 consisted of eight mandatory fields (sequence name, source, feature type, start position, end position, score, strand, and frame) followed by an optional group field for simple feature association, with no dedicated attributes field.^[1] The format enforced strict constraints, such as strings limited to 256 characters and lines to 32 kilobytes, and required a score of 0 if no value was available, while the frame field used '0', '1', '2', or '.' to indicate the reading frame offset.^[5] This flat structure supported only ungrouped or simply grouped features, lacking support for hierarchical relationships beyond basic linking.^[14]

GFF2, proposed in late 1998 and formalized in 1999 with a beta release from the Sanger Centre, expanded the format to nine fields by replacing the optional group with a structured attributes field, enabling tag-value pairs separated by semicolons for more descriptive annotations.^[1]^[8] Key enhancements included allowing '.' for missing scores, strands, or frames; permitting start and end positions to extend beyond sequence lengths (with software handling clipping); and introducing meta-comments like ##gff-version 2 for file headers.^[5] The attributes field facilitated basic hierarchies, such as linking exons to a gene via a shared identifier in the group-like structure (e.g., gene_id "GENE1"), but remained limited to free-text entries without standardization.^[14] Feature types were still arbitrary strings, though encouraged to follow common nomenclature, and the format gained flexibility for RNA and protein sequences by setting strand and frame to '.' when inapplicable.^[1]

Both GFF1 and GFF2 were constrained to two-level feature hierarchies, such as gene-to-exon relationships, which proved insufficient for representing complex transcripts with multiple isoforms or nested structures like gene-transcript-exon.^[14]^[5] The free-text nature of attributes in GFF2 led to inconsistent parsing across tools, as there was no enforced vocabulary or ontology integration, and the lack of multi-level grouping hindered annotations for eukaryotic genomes with alternative splicing.^[8] These limitations, particularly the inability to handle multi-isoform genes and standardized semantic annotations, drove the development of subsequent versions to support more sophisticated genomic data representation.^[5]

A representative GFF2 line, tab-separated for clarity, might appear as:

seqname	source	feature	start	end	score	strand	frame	attributes
chr1	GENSCAN	exon	100	200	.	+	0	gene_id "GENE1"; transcript_id "TRAN1"

This example illustrates a single exon feature linked to a gene and transcript via attributes, highlighting the format's tab-delimited, nine-field structure.^[14]^[1]

GFF3

The General Feature Format version 3 (GFF3) was introduced in 2004 as an enhanced standard for representing genomic features, addressing limitations in earlier versions such as their inability to handle complex, nested structures.^[5] It supports arbitrary-depth hierarchies, for example linking genes to mRNAs, exons, and coding sequences (CDS) through parent-child relationships defined in the attributes field.^[15] The format is widely used in bioinformatics for exchanging annotations across databases and tools, emphasizing interoperability while maintaining compatibility with prior GFF iterations where possible.^[15]

Key improvements over GFF2 include renaming the first column from "seqname" to "seqid" for clearer identification of reference sequences, adoption of a controlled vocabulary from the Sequence Ontology (SO) for feature types to ensure semantic consistency, and standardization of attribute tags such as ID for unique identifiers and Parent for hierarchical linkages.^[15]^[16] These changes enable more precise and scalable representation of biological data, such as multi-exon transcripts or regulatory elements nested within genes.^[15]

GFF3 files should be encoded in UTF-8 and begin with the mandatory header directive ##gff-version 3 (optionally with a fuller version number such as 3.1.26), followed optionally by ##sequence-region directives that define the coordinate range of each seqid.^[15] The hierarchy is implemented via IDs assigned to features and corresponding Parent references; for instance, an exon record might include ID=exon:001;Parent=transcript:001 to associate it with its parent transcript.^[15]

For validation, GFF3 expects feature types to be SO terms and IDs to be unique within the file, except that a discontinuous feature written on several lines repeats the same ID on each of its lines. Sorting records first by seqid and then by ascending start position is a widely followed convention rather than a format requirement; it facilitates efficient processing and querying, and some indexing tools depend on it.^[15]^[16] These conventions support data integrity and compatibility with parsers like those in BioPerl or GBrowse.^[15]
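The ID/Parent mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference parser: the sample records, the function names `build_children` and `descendants`, and the simplistic attribute splitting (no percent-decoding) are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical four-record GFF3 snippet: gene -> mRNA -> two exons.
GFF3_TEXT = """\
##gff-version 3
chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN
chr1\texample\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001
chr1\texample\texon\t1050\t1500\t.\t+\t.\tID=exon00001;Parent=mRNA00001
chr1\texample\texon\t3000\t3902\t.\t+\t.\tID=exon00002;Parent=mRNA00001
"""


def build_children(text):
    """Map each parent ID to the IDs of its direct children."""
    children = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        column9 = line.split("\t")[8]
        attrs = dict(pair.split("=", 1) for pair in column9.split(";") if pair)
        # Parent may list several comma-separated IDs (multi-parent features).
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                children[parent].append(attrs.get("ID"))
    return children


def descendants(children, root):
    """Return all features below `root`, depth first."""
    found = []
    for child in children.get(root, []):
        found.append(child)
        found.extend(descendants(children, child))
    return found


tree = build_children(GFF3_TEXT)
print(descendants(tree, "gene00001"))
```

Because the file itself stays flat, the tree only exists once a reader has resolved every Parent reference, which is why parsers typically load a whole group of related records before traversing it.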

Relation to GTF

The Gene Transfer Format (GTF) originated in the early 2000s as a specialized variant of the General Feature Format version 2 (GFF2), developed by the Ensembl project to streamline the representation of gene structures such as transcripts and exons during the Drosophila and Human Genome Projects.^[5] It was designed as a simplified subset of GFF2, focusing on gene-centric annotations while retaining the core nine-column structure of tab-delimited fields for sequence name, source, feature type, start, end, score, strand, frame, and attributes.^[2] Unlike broader GFF applications, GTF emphasizes mandatory attributes like "gene_id" and "transcript_id" to link features hierarchically, with "gene_id" required for all features and "transcript_id" for all except gene-level entries, typically in formats such as ENSGXXXXXXXXXXX.X for Ensembl genes.^[17]

Key differences between GTF and full GFF specifications include GTF's restriction to a limited set of predefined feature types (e.g., gene, transcript, exon, CDS) and attribute tags, which simplifies parsing but eliminates support for ontologies like the Sequence Ontology.^[5] It assumes a two-level hierarchy (genes containing transcripts and their subfeatures) without the multi-level parent-child relationships possible in later formats. Its attributes are written as a tag, a space, and a double-quoted value, with each pair terminated by a semicolon (for example gene_id "GENE1";), in contrast to the unquoted tag=value syntax of GFF3.^[17] These constraints make GTF less versatile for complex annotations but easier for targeted gene modeling.^[18]

GTF is primarily employed in genome browsers like the UCSC Genome Browser for displaying gene tracks and in RNA-seq pipelines for aligning and quantifying transcripts, because its straightforward structure facilitates rapid feature extraction.^[19] This ease of parsing supports efficient workflows in transcriptomics, where tools can directly access gene and transcript identifiers without handling extensive ontology mappings.^[5] However, its limited flexibility compared to GFF3 restricts applications beyond basic gene structures. Conversion tools, such as those in the AGAT toolkit, enable mapping GTF files to GFF3, but this process can result in loss of hierarchical depth or require manual addition of ontology terms to accommodate GFF3's requirements, potentially complicating multi-level feature relationships.^[5] Despite the preference for GFF3 in modern standards, GTF remains widely used, including in NCBI RefSeq annotations as of 2024, owing to its entrenched role in legacy datasets and compatible software ecosystems.^[7]
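The syntactic difference between the two attribute styles can be illustrated with a short Python sketch. It rewrites only the ninth column, from GTF's quoted pairs to GFF3's tag=value form; it does not create the ID and Parent tags that a full converter such as AGAT would have to derive. The function name `gtf_attrs_to_gff3` and the regular expression are assumptions for this example, and unquoted numeric GTF values are deliberately not handled.

```python
import re

# One GTF attribute: tag, whitespace, double-quoted value, optional semicolon.
GTF_PAIR = re.compile(r'\s*(\S+)\s+"([^"]*)"\s*;?')

# Characters with special meaning in a GFF3 attribute column.
RESERVED = {";": "%3B", "=": "%3D", "&": "%26", ",": "%2C", "%": "%25", "\t": "%09"}


def escape_value(value):
    """Percent-encode characters that would break GFF3 tag=value parsing."""
    return "".join(RESERVED.get(ch, ch) for ch in value)


def gtf_attrs_to_gff3(column9):
    """Rewrite a GTF attribute column in GFF3 tag=value syntax."""
    pairs = GTF_PAIR.findall(column9)
    return ";".join(f"{tag}={escape_value(value)}" for tag, value in pairs)


print(gtf_attrs_to_gff3('gene_id "GENE1"; transcript_id "TRAN1"'))
```

Escaping is done character by character so that a literal percent sign in the GTF value is encoded once and never double-encoded.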

File Structure

Header Directives

Header directives in the General Feature Format (GFF), particularly for GFF3, are metadata lines that precede the feature records and provide context for interpreting the file, such as the format version, sequence coordinate spaces, and references to ontologies, without changing how the feature lines themselves are parsed.^[20] All but the version declaration are optional. These directives support interoperability across tools and databases by defining the scope and provenance of the annotations.^[15]

Directive lines begin with two hash symbols (##), followed immediately by the directive name and any arguments, with no leading whitespace, and each directive occupies its own line.^[20] Parsers typically set these lines aside when processing feature records but use them for validation and metadata extraction, such as confirming file compliance or resolving sequence identifiers.^[15] By contrast, a line beginning with a single hash (#) is an ordinary comment and is skipped entirely.^[20]

The only mandatory directive in GFF3 files is ##gff-version 3, which must appear as the very first line to declare the file's adherence to the GFF3 specification.^[20] This ensures parsers recognize the format and apply the correct rules for fields and attributes.^[15] A fuller version number, such as 3.1.26, may be given to identify a minor revision, but the base "3" remains the standard declaration.^[20]

Among the common optional directives, ##sequence-region seqid start end defines the coordinate boundaries for a specific sequence identifier (seqid), helping tools perform bounds checking and avoid errors from out-of-range features; multiple such lines can be used for different sequences.^[20] The ##species directive specifies the organism using a URI, typically from NCBI Taxonomy, as in ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606 for humans.^[20] Similarly, ##source-ontology so_url links to the ontology used for source field terms, such as the Sequence Ontology at http://www.sequenceontology.org/. Other directives like ##feature-ontology and ##genome-build provide additional context for feature types and assembly versions, respectively.^[20]

Best practices recommend including all relevant directives at the file's beginning to promote data portability and validation, with sequence-region strongly encouraged to explicitly bound annotations.^[20] Files should use UTF-8 encoding for broad compatibility, and tools like the modENCODE GFF3 validator can check adherence to these conventions.^[20] After the headers, feature lines follow, often sorted by sequence and position for efficient processing.^[15]
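As a rough illustration of how a reader might separate directives from comments and feature lines, the following Python sketch collects the header directives at the top of a file. The function name `read_directives` and the sample lines are hypothetical, and the sketch stops at the first feature line, so directives that appear later in a file are not collected.

```python
def read_directives(lines):
    """Collect (name, arguments) pairs from the '##' lines at the top of a file."""
    directives = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            parts = line[2:].split(None, 1)  # directive name, then the rest
            if not parts:
                continue  # a bare '##' carries no directive
            name = parts[0]
            args = parts[1].split() if len(parts) > 1 else []
            directives.append((name, args))
        elif line.startswith("#") or not line.strip():
            continue  # single-hash comment or blank line
        else:
            break  # first feature line: the header is over
    return directives


sample = [
    "##gff-version 3",
    "##sequence-region ctg123 1 1497228",
    "# free-text comment, skipped",
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001",
]
for name, args in read_directives(sample):
    print(name, args)
```

A stricter reader would additionally check that the first collected directive is gff-version before accepting the file.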

Feature Records

Feature records in the General Feature Format (GFF) constitute the primary data lines that describe individual genomic features, such as genes, exons, or regulatory elements, and appear after any optional header directives in the file. Each record is a single tab-delimited line comprising exactly nine fields, providing essential details on the feature's sequence location, type, and properties. These lines form the bulk of a GFF file's content, enabling the representation of annotations in a compact, machine-readable structure.^[15] GFF3 files recommend feature records be sorted lexicographically by the seqid (sequence identifier in the first field) and, within each seqid, by ascending order of the start position (fourth field); this ordering facilitates efficient indexing and traversal during analysis.^[15] Coordinates in these records use a 1-based numbering system, where the start and end positions (fields 4 and 5) delineate an inclusive interval—both endpoints are part of the feature, and start must not exceed end, permitting zero-length features for elements like insertion sites. The strand indicator (seventh field) denotes orientation as '+' (forward), '-' (reverse), '.' (unspecified or inapplicable), or '?' (unknown), while the score (sixth field) holds a floating-point value for metrics like confidence or probability, or '.' if absent.^[15]^[21] Hierarchical organization of features, including nested structures such as pseudogenes within genes or operons grouping multiple genes, is achieved through the attributes field (ninth field), which employs standardized tags like 'ID' for unique identifiers and 'Parent' to establish parent-child relationships across records. 
This approach maintains the file's flat format while supporting complex genomic models.^[15] Standard GFF parsers generally ignore or flag invalid records to prevent crashes, such as those with fewer or more than nine fields, improper sorting, or syntax errors; frequent problems include unsorted lines causing sequential processing failures and malformed attributes, like unescaped semicolons or equals signs that break tag-value separation.^[22] A populated example of a gene feature record is:

chr1	GeneScan	gene	1000	2150	12.34	+	.	ID=gene00001;Name=example_gene;biotype=protein_coding

This illustrates a gene on chromosome 1, spanning positions 1000 to 2150 with a score of 12.34 on the forward strand, including basic attributes for identification and description.^[23]
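A record like the one above can be split into typed fields with a short Python sketch. The `Feature` tuple and the helper names are assumptions for illustration; the conversion helper shows how the 1-based inclusive interval maps onto the 0-based, half-open convention used by Python slicing.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    score: Optional[float]   # None when the column holds "."
    strand: str
    phase: Optional[int]     # None when the column holds "."
    attributes: str          # left unparsed in this sketch


def parse_feature(line):
    """Split one tab-delimited record into nine typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    start, end = int(start), int(end)
    if start > end:
        raise ValueError("start must not exceed end")
    if strand not in {"+", "-", ".", "?"}:
        raise ValueError(f"unexpected strand {strand!r}")
    return Feature(seqid, source, ftype, start, end,
                   None if score == "." else float(score),
                   strand,
                   None if phase == "." else int(phase),
                   attrs)


def to_zero_based_half_open(feature):
    """Convert the inclusive 1-based interval to a (start, end) slice."""
    return feature.start - 1, feature.end


record = parse_feature(
    "chr1\tGeneScan\tgene\t1000\t2150\t12.34\t+\t.\t"
    "ID=gene00001;Name=example_gene;biotype=protein_coding"
)
print(record.type, record.end - record.start + 1, to_zero_based_half_open(record))
```

Because both endpoints are inclusive, the feature length is end - start + 1, which is 1151 bases for this record.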

The Nine Fields

The General Feature Format version 3 (GFF3) structures each feature record as a single tab-delimited line comprising exactly nine fields, providing essential details about genomic features such as genes, exons, and coding sequences.^[20] These fields follow a standardized order and data types to ensure interoperability across bioinformatics tools and databases.^[12] The format emphasizes precision in coordinate systems and feature typing while allowing flexibility in optional elements like scores.

Field	Name	Description	Data Type/Example
1	seqid	A unique identifier for the reference sequence (landmark), such as a chromosome name or scaffold ID (e.g., "chr1" or "ctg123"). It serves as the coordinate system's reference and must not include unescaped whitespace or leading ">" characters.	String (e.g., ctg123)^[20]
2	source	The origin of the feature data, typically the name of the algorithm, database, or tool that generated it (e.g., "Ensembl" or "Genescan"). This field acts as a qualifier extending the feature's ontology.	String (e.g., Ensembl)^[20]^[12]
3	type	The feature's biological type, drawn from the Sequence Ontology (SO) vocabulary or its accession (e.g., "gene" corresponding to SO:0000704, or "exon"). All types must be subclasses of "sequence_feature" (SO:0000110).	String (e.g., gene)^[20]
4	start	The 1-based starting position of the feature on the seqid, as a positive integer (e.g., 1000). For zero-length features, start equals end.	Integer (e.g., 1000)^[20]^[12]
5	end	The 1-based ending position of the feature on the seqid, as a positive integer always greater than or equal to start (e.g., 9000).	Integer (e.g., 9000)^[20]^[12]
6	score	A numerical score associated with the feature, often indicating confidence or significance (e.g., an E-value like 6.2e-45 for similarity searches or a P-value for predictions). Use "." if no score is applicable.	Float or "." (e.g., 6.2e-45)^[20]
7	strand	The strand orientation of the feature: "+" for the forward strand, "-" for the reverse strand, "." for unstranded features, or "?" for unknown.	One of "+", "-", ".", "?" (e.g., +)^[20]^[12]
8	phase	For coding sequence (CDS) features, the offset (0, 1, or 2) indicating the position of the first base within a codon; use "." otherwise. Detailed usage for CDS features is covered in the core elements section.	One of 0, 1, 2, or "." (e.g., 0)^[20]^[12]
9	attributes	Semicolon-separated key-value pairs providing additional metadata (e.g., "ID=gene00001;Name=EDEN"). Keys and values must escape special characters like semicolons, commas, equals signs, and ampersands using URL percent-encoding (e.g., "%3B" for ";"); multiple values for a key are comma-separated.	Semicolon-delimited string (e.g., ID=gene1;Name=ABC)^[20]

Fields are strictly separated by horizontal tabs (not spaces), with no leading or trailing whitespace around delimiters to maintain parseability.^[20] Undefined or inapplicable values are represented by a single period (".") in the respective field.^[12] The format uses UTF-8 encoding for broad compatibility, and values in fields like seqid and source must escape non-alphanumeric characters outside the allowed set [a-zA-Z0-9.:$^*@!+_?-|] to prevent parsing errors.^[20] These conventions ensure that GFF3 files remain machine-readable and consistent across diverse genomic datasets.
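The escaping rule can be illustrated with a small Python sketch that percent-encodes every character outside the allowed set quoted above and decodes it again on reading. The helper names `escape_seqid` and `unescape_field` are hypothetical; decoding relies on the standard library's `urllib.parse.unquote`.

```python
import re
from urllib.parse import unquote

# Anything outside the allowed set [a-zA-Z0-9.:^*$@!+_?-|] needs escaping.
_NEEDS_ESCAPE = re.compile(r"[^a-zA-Z0-9.:^*$@!+_?\-|]")


def escape_seqid(raw):
    """Percent-encode characters that may not appear literally in a seqid."""
    def encode(match):
        # Encode as UTF-8 bytes so non-ASCII characters round-trip correctly.
        return "".join(f"%{byte:02X}" for byte in match.group().encode("utf-8"))
    return _NEEDS_ESCAPE.sub(encode, raw)


def unescape_field(value):
    """Reverse percent-encoding when reading a column value."""
    return unquote(value)


escaped = escape_seqid("chr 1>x")
print(escaped, "->", unescape_field(escaped))
```

The percent sign itself falls outside the allowed set, so it is encoded as %25 and an already escaped value is never misread on decoding.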

Core Elements

Phase for CDS Features

The phase field in the General Feature Format (GFF3) gives, for a coding sequence (CDS) feature, the number of bases that must be skipped from the 5' end of that feature to reach the first base of the next complete codon, using the integer values 0, 1, or 2. The field applies only to CDS features: 0 indicates that the feature begins with the first base of a codon, 1 indicates that one base must be skipped (the second base of the feature starts a codon), and 2 indicates that two bases must be skipped (the third base starts a codon); for non-CDS features, the value is a period (.). For features on the reverse strand, the 5' end is the feature's end coordinate. The CDS feature type itself is a standard annotation for protein-coding exons in genomic sequences.

In spliced transcripts comprising multiple CDS features, the phase ensures reading frame continuity across exon junctions by accounting for the cumulative length of preceding segments. For example, if the first CDS has phase 0 and its length modulo 3 equals 1, the next CDS is assigned phase 2 to preserve the translational frame.^[24] In general, a CDS segment of length L and phase p is followed by a segment of phase (3 - ((L - p) mod 3)) mod 3, so phases are propagated segment by segment in 5' to 3' order. This phase information is essential for downstream protein prediction, as it allows tools to correctly align and translate discontinuous CDS segments into amino acid sequences without introducing frame shifts.^[25] For instance, TransDecoder utilizes the phase to generate precise peptide predictions from GFF3-annotated transcripts, supporting reliable proteomic analysis in de novo assemblies.^[25]
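The worked example above (phase 0, length modulo 3 equal to 1, next phase 2) generalizes to a simple recurrence, sketched here in Python. The function name `assign_phases` is hypothetical, and the sketch assumes the segment lengths are supplied in 5' to 3' transcript order, which for reverse-strand genes means descending genomic coordinates.

```python
def assign_phases(cds_lengths, first_phase=0):
    """Return the phase of each CDS segment, given lengths in 5'->3' order."""
    phases = []
    phase = first_phase
    for length in cds_lengths:
        phases.append(phase)
        # Bases left over after the skipped bases and all whole codons
        # determine how many bases the next segment must skip.
        phase = (3 - (length - phase) % 3) % 3
    return phases


# First segment: 100 bases, 100 % 3 == 1, so the next segment has phase 2.
print(assign_phases([100, 50, 30]))  # [0, 2, 0]
```

The second segment skips 2 bases and then holds exactly 48 coding bases (16 codons), so the third segment starts cleanly on a codon boundary with phase 0.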

Attributes and Annotations

In the General Feature Format (GFF3), the attributes field, which occupies the ninth column of each feature record, stores additional metadata about the feature in a structured manner. This field consists of semicolon-separated "tag=value" pairs, where tags are case-sensitive identifiers and values can be single strings or comma-separated lists for multiple items. For instance, a typical entry might appear as ID=exon001;Parent=transcript001;Name=exon_A, allowing for precise annotation without disrupting the core tab-delimited structure of the file. Reserved characters such as semicolon (;), equals (=), comma (,), ampersand (&), and tab must be escaped using percent-encoding per RFC 3986 to prevent parsing errors, while spaces within values are permitted as long as they do not introduce ambiguity.^[26]^[15]

Several standard tags are defined to ensure interoperability and facilitate hierarchical organization. The ID tag provides a unique identifier for the feature, essential for referencing in other records and required for features that span multiple lines or have child features. The Parent tag establishes parent-child relationships by linking to the ID of a parent feature, enabling the construction of complex hierarchies such as multiple exons belonging to a single transcript (e.g., Parent=transcript001,transcript002 for shared exons across transcripts). Other common tags include Name for a human-readable label, which need not be unique; Note for descriptive free-text commentary; Alias for alternative identifiers; and Dbxref for cross-references to external databases. Tag names that begin with an uppercase letter are reserved for attributes defined by the specification, whereas tags that begin with a lowercase letter can be used freely for custom, application-specific attributes. Best practices recommend adhering to Sequence Ontology (SO)-compliant tags where possible, avoiding unstructured free text outside of Note, and omitting a trailing semicolon at the end of the field to comply with the specification.^[26]^[15]^[16]

The use of Parent=ID linkages is central to building tree-like structures in GFF3 files, representing biological hierarchies like genes containing transcripts, which in turn contain exons or coding sequences. This mechanism supports multi-parent relationships, reflecting real-world complexities such as alternative splicing, and allows parsers to reconstruct feature graphs efficiently. In contrast, earlier versions like GFF2 relied on a free-text "group" field in the ninth column, which lacked enforced structure and often led to inconsistent parsing across tools; GFF3's formalized attributes address these issues by mandating the tag-value syntax and separating identifiers from group affiliations for clearer semantics. Despite these improvements, implementers must ensure ID uniqueness within a file to avoid resolution errors during hierarchy traversal.^[26]^[15]
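The tag=value syntax and its escaping rules can be illustrated with a short Python sketch. The function name `parse_attributes` is hypothetical, and the set of tags treated as comma-separated lists (Parent, Alias, Note, Dbxref, Ontology_term) is an assumption of this example; other tags are kept as single strings.

```python
from urllib.parse import unquote

# Tags whose values are treated as comma-separated lists in this sketch.
MULTI_VALUE_TAGS = {"Parent", "Alias", "Note", "Dbxref", "Ontology_term"}


def parse_attributes(column9):
    """Parse a GFF3 attribute column into a dictionary."""
    attrs = {}
    for pair in column9.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"attribute without '=': {pair!r}")
        tag = unquote(tag)
        if tag in MULTI_VALUE_TAGS:
            # Split on literal commas first, then decode each item, so an
            # escaped comma (%2C) inside a value is preserved.
            attrs[tag] = [unquote(item) for item in value.split(",")]
        else:
            attrs[tag] = unquote(value)
    return attrs


print(parse_attributes("ID=exon001;Parent=transcript001,transcript002;Name=exon%3BA"))
```

Splitting on the delimiters before percent-decoding is the important ordering: decoding first would turn escaped semicolons and commas back into separators.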

Ontology Integration

The Sequence Ontology (SO) serves as the foundational controlled vocabulary for GFF3, standardizing the description of genomic features to promote interoperability across bioinformatics tools and databases. By assigning unique identifiers and hierarchical relationships to biological sequence elements, SO ensures that annotations are semantically consistent, allowing researchers to describe complex genomic structures using precise, machine-readable terms rather than ad hoc labels.^[27]^[10] In GFF3 implementation, the third column (feature type) is strictly constrained to SO terms or their accession numbers, such as SO:0000704 for a gene or SO:0000188 for an intron, ensuring that every feature record adheres to a unified semantic framework. Attribute values in the ninth column can further reference SO terms via the predefined Ontology_term tag (e.g., Ontology_term=SO:0000234), linking qualitative annotations like functional properties to the ontology. To explicitly declare the ontology version, GFF3 files may include the optional ##feature-ontology directive, which points to an OBO-format file containing the relevant SO release, such as http://purl.obolibrary.org/obo/so.obo; similarly, the ##attribute-ontology directive supports future extensions for attribute vocabularies, though no standard attribute ontology exists yet.^[20]^[28]^[29] This integration yields key benefits, including semantic querying capabilities—for instance, tools can efficiently retrieve all mRNA features (SO:0000234) across datasets without ambiguity—and extensibility for specialized annotations, such as RNA modifications or non-coding elements, by leveraging SO's hierarchical is_a and part_of relationships. Compliance is verified using dedicated tools like the modENCODE GFF3 Validator, which cross-checks feature types and relationships against the current SO release to detect inconsistencies or invalid hierarchies. 
As of 2025, the latest SO version (uploaded September 11, 2025) incorporates ongoing refinements to reflect advances in sequence annotation.^[20]^[30]^[31] SO evolves through community-driven updates to accommodate new biological insights, such as the addition of terms for non-coding RNAs (e.g., SO:0001877 for lncRNA, long non-coding RNA), which may require revising existing GFF3 files to align with updated hierarchies and prevent annotation drift. These changes, tracked via SO's release history, underscore the need for periodic validation to maintain file accuracy in dynamic genomic research.^[32]^[33]

Extensions and Applications

Recent Developments

In 2020, the AgBioData consortium initiated discussions through its GFF3 working group to address longstanding issues in the format.^[34] Building on these discussions, a 2022 framework proposed by the AgBioData working group outlined extensions to GFF3 for improved interoperability, particularly in standardizing annotations using Sequence Ontology (SO) terms for epigenetic features. The recommendations advocate integrating SO-compliant terms in the third column (type) and leveraging the attributes field for metadata, ensuring compatibility with tools like VEP and Apollo. These extensions prioritize semantic standardization via ontology URIs, such as http://purl.obolibrary.org/obo/so.obo, to support emerging data types while avoiding format overhauls.^[34]

Since 2016, the National Center for Biotechnology Information (NCBI) has offered a beta submission process for annotated genomes using GFF3 or GTF files, allowing integration of annotations with FASTA or ASN.1 files via table2asn. The process enforces requirements like unique locus_tags and transcript_ids but tolerates minor GFF3 deviations to facilitate submissions from diverse sequencing technologies, including support for complex assemblies from long-read data.^[7]

As of 2025, no official GFF4 specification has been released, with ongoing efforts instead concentrating on GFF3 extensions that introduce custom terms while preserving backward compatibility. For instance, the existing ##feature-ontology directive can point to a user-extended version of SO, allowing integration of domain-specific features without breaking existing parsers. Key challenges include balancing this compatibility with demands from single-cell transcriptomics and metagenomics, where hierarchical annotations for cellular heterogeneity and microbial communities require robust multi-parent relationships and pan-genome coordinates, areas flagged for future refinement in GFF3.^[34]

Use in Genomic Annotation

The General Feature Format (GFF), particularly its GFF3 iteration, plays a central role in genomic annotation pipelines by enabling the integration of ab initio gene prediction tools with empirical evidence such as RNA-seq alignments and protein homologies. In pipelines like MAKER, GFF3 serves as both input for iterative evidence-driven predictions and output for final gene models, allowing users to refine annotations through multiple rounds of alignment and prediction.^[35] Similarly, BRAKER uses GFF3 to output high-accuracy eukaryotic gene structures derived from GeneMark and AUGUSTUS predictions, facilitating automated annotation of novel genomes without extensive training data.^[36]

GFF3 has become a standard for data exchange in major genomic databases, promoting interoperability across platforms. Ensembl routinely provides gene and transcript annotations in GFF3 format via its FTP site, enabling researchers to download comprehensive feature sets for human and other eukaryotic genomes.^[37] FlyBase employs GFF3 for Drosophila annotations, including those from modENCODE projects, to distribute regulatory and expression data.^[38] In variant calling workflows, scripts convert VCF files to GFF3 so that structural variants can be overlaid on reference annotations and variant positions mapped to gene features for functional impact assessment.^[39]

Beyond pipelines, GFF3 supports diverse applications in genomic analysis. For RNA-seq quantification, featureCounts uses GTF or compatible GFF annotations to assign reads to exons and genes, providing efficient gene-level expression counts across large datasets.^[40] In comparative genomics, GFF3 files from multiple species enable alignment of orthologous features for evolutionary studies, such as identifying conserved regulatory elements.^[41] Functional annotation integrates Gene Ontology (GO) terms directly into GFF3 attributes via the Ontology_term tag, linking predicted features to biological processes, molecular functions, and cellular components for pathway enrichment analysis.^[42]

Notable case studies highlight GFF3's practical impact. During the Human Genome Project era, early GFF versions facilitated annotation exchange among consortia and later evolved into GFF3 for standardized feature description in the post-assembly phase.^[1] The modENCODE project relied on GFF3 for annotating fly and worm genomes, integrating chromatin and expression data to produce comprehensive regulatory maps.^[43] In the 2020s, the Telomere-to-Telomere (T2T) Consortium's CHM13 assembly was annotated by NCBI in GFF3/GTF formats, capturing telomeric and centromeric features that long-read sequencing had made accessible for the first time.^[44]

Looking ahead, GFF3's adoption is expanding in pan-genomics, where it accommodates variable genomic regions across populations by representing diverse haplotypes and structural variants in a unified format, as demonstrated by tools like Roary for prokaryotic pan-genomes and by emerging eukaryotic pan-genome efforts.^[45] This trend supports scalable analyses of genetic diversity, particularly in non-model organisms with complex architectures.^[46]
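To make the Ontology_term mechanism concrete, here is a minimal Python sketch that splits a GFF3 attributes column into tag/value lists and pulls out GO identifiers. It assumes the usual GFF3 conventions (semicolon-separated tag=value pairs, comma-separated multiple values, percent-encoded reserved characters); the function names and the example identifiers are illustrative rather than taken from any particular annotation.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attributes column into {tag: [values, ...]}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        # Split on commas before decoding so encoded commas stay inside values.
        attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attrs


def go_terms(column9):
    """Return only the GO identifiers listed under Ontology_term."""
    terms = parse_attributes(column9).get("Ontology_term", [])
    return [t for t in terms if t.startswith("GO:")]


# Illustrative attributes column; %2D and %3B are percent-encoded '-' and ';'.
EXAMPLE = (
    "ID=mRNA1;Parent=gene1;"
    "Ontology_term=GO:0003677,GO:0006355,SO:0000234;"
    "Note=DNA%2Dbinding%3B putative"
)

print(go_terms(EXAMPLE))
print(parse_attributes(EXAMPLE)["Note"])
```

Returning every value as a list keeps single-valued and multi-valued tags uniform, which simplifies downstream code that joins features to GO-based enrichment tables.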

Software and Tools

Validation Tools

Validation tools for the General Feature Format (GFF) ensure compliance with the GFF3 specification, including syntax, structural integrity, and adherence to standards like the Sequence Ontology (SO). These tools detect issues such as malformed fields, invalid hierarchies, and inconsistencies in feature relationships, which are critical for downstream genomic analyses.^[47]

GenomeTools' gt gff3validator is a command-line utility that strictly validates GFF3 file structure, sorting order, and ontology terms, while supporting custom ontology schemas for extended checks.^[47] It verifies parent-child relationships, Dbxref and Ontology_term attributes, and overall file tidiness, making it suitable for large-scale annotations.^[48] Like other validators, it flags issues in attribute escaping and coordinate overlaps that could disrupt parsing.^[49]

Another Gtf/Gff Analysis Toolkit (AGAT), released in the 2020s, provides comprehensive validation for both GFF3 and GTF formats, detecting errors such as frame shifts in coding regions, duplicate IDs, and missing mandatory attributes.^[50] AGAT not only identifies problems but also applies fixes for standardization, including sorting features and padding incomplete hierarchies, which enhances its utility in annotation workflows.^[51] It reports phase continuity discrepancies and enforces attribute consistency, helping users create compliant files.

Online validation options include the checks built into NCBI's beta process for annotated genome submission, which test GFF3 files for GenBank compatibility, including internal stops and feature validity.^[7] The European Bioinformatics Institute (EBI) supports GFF3 validation through its submission toolkit for the European Nucleotide Archive (ENA), facilitating conversion and error checking for EMBL flat file generation.^[52] These services report common issues such as coordinate inconsistencies without requiring a full local annotation pipeline.
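The kinds of checks described above can be illustrated with a small Python sketch. It is not the logic of gt gff3validator, AGAT, or any NCBI or EBI service; it only demonstrates three simple tests (column count, coordinate sanity, and ID/Parent consistency) under the simplifying assumption that every ID appears on exactly one line, which real GFF3 does not require for discontinuous features such as multi-segment CDS.

```python
def check_gff3(lines):
    """Return a list of (line_number, message) for a few basic problems."""
    problems = []
    seen_ids = set()
    parent_refs = []  # (line_number, parent_id), resolved after the full pass

    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines, comments, directives
        cols = line.split("\t")
        if len(cols) != 9:
            problems.append((n, "expected 9 tab-separated columns"))
            continue

        start, end = cols[3], cols[4]
        if (not (start.isdigit() and end.isdigit())
                or int(start) < 1 or int(start) > int(end)):
            problems.append((n, "invalid coordinates"))

        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        fid = attrs.get("ID")
        if fid is not None:
            if fid in seen_ids:
                # Simplification: treats any repeated ID as an error.
                problems.append((n, f"duplicate ID {fid}"))
            seen_ids.add(fid)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parent_refs.append((n, parent))

    # Parents may be defined after their children, so resolve them last.
    for n, parent in parent_refs:
        if parent not in seen_ids:
            problems.append((n, f"unknown Parent {parent}"))
    return problems


SAMPLE = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\tdemo\texon\t500\t400\t.\t+\t.\tID=exon1;Parent=mrna1",
    "chr1\tdemo\texon\t600\t900\t.\t+\t.\tID=exon1;Parent=mrna2",
]

for line_number, message in check_gff3(SAMPLE):
    print(line_number, message)
```

Dedicated validators go much further, for example checking types against the ontology, phase continuity across CDS segments, and attribute escaping.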

Parsing and Processing Libraries

Several programmatic libraries facilitate the reading, writing, and manipulation of General Feature Format (GFF) data, particularly GFF3, enabling bioinformatics workflows to process genomic annotations efficiently. These libraries typically parse tabular GFF structures into object-oriented representations, allowing for hierarchical feature extraction and integration with broader analysis pipelines. Key implementations span multiple programming languages, emphasizing scalability for large datasets such as whole-genome annotations.^[53]^[54]

In the Biopython ecosystem, GFF3 parsing is provided by the companion bcbio-gff package (imported as BCBio.GFF) rather than by Bio.SeqIO itself. Its parser reads GFF3 files into SeqRecord objects, where features are represented as SeqFeature instances organized in parent-child hierarchies, such as genes containing mRNA subfeatures with nested CDS and exon elements. The package also writes GFF3 output via GFF3Writer, preserving hierarchies through sub_features attributes, and can combine annotations with sequence data from FASTA files.^[53]

BioPerl, the Perl toolkit for bioinformatics, provides the Bio::Tools::GFF module for parsing GFF formats, including GFF3, into Bio::SeqFeatureI objects. It handles ontology lookups by mapping feature types to Sequence Ontology terms, which keeps annotations semantically consistent, and supports hierarchy traversal through methods for accessing parent and child features. This enables operations like extracting nested features (e.g., exons within transcripts) while respecting GFF3's convention that attribute names beginning with an uppercase letter are reserved.^[55]^[56]

For low-level, high-throughput processing, libgff is a C++ library specialized in GFF/GTF parsing, extracted from the GffRead codebase of the Cufflinks ecosystem for RNA-seq analysis. It offers efficient, memory-conscious reading of large annotation files and supports rapid feature extraction and format conversion without the overhead of higher-level languages, making it suitable for performance-critical pipelines that handle high-volume sequencing data.^[57]^[58]

The Python library gffutils provides advanced querying capabilities by importing GFF3 and GTF files into SQLite databases via create_db(), so that feature hierarchies can be queried from the database instead of re-read from text. Users can extract subfeatures (e.g., db.children(gene, featuretype='exon')), find features overlapping an interval with region queries, and traverse parents for hierarchical context. It supports conversion to formats like BED through attribute manipulation and performs well on large files, such as human genome annotations with approximately 1 million features, because database indexing avoids repeated parsing.^[54]
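The database-backed approach can be sketched with Python's standard sqlite3 module. This is not gffutils and does not reproduce its schema or API; the table names, column names, and the children() helper below are assumptions chosen only to show why loading features once into an indexed database makes hierarchy queries cheap.

```python
import sqlite3


def build_db(lines):
    """Load GFF3 feature lines into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features (id TEXT PRIMARY KEY, seqid TEXT, "
        "featuretype TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    conn.execute("CREATE TABLE relations (parent TEXT, child TEXT)")
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        conn.execute(
            "INSERT INTO features VALUES (?, ?, ?, ?, ?)",
            (attrs["ID"], cols[0], cols[2], int(cols[3]), int(cols[4])),
        )
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                conn.execute(
                    "INSERT INTO relations VALUES (?, ?)", (parent, attrs["ID"])
                )
    conn.execute("CREATE INDEX idx_rel_parent ON relations (parent)")
    return conn


def children(conn, parent_id, featuretype=None):
    """Return IDs of all descendants, optionally filtered by feature type."""
    sql = """
        WITH RECURSIVE descendants(id) AS (
            SELECT child FROM relations WHERE parent = ?
            UNION
            SELECT r.child FROM relations r
            JOIN descendants d ON r.parent = d.id
        )
        SELECT f.id FROM features f
        JOIN descendants d ON f.id = d.id
        WHERE (? IS NULL OR f.featuretype = ?)
        ORDER BY f.start_pos, f.id
    """
    rows = conn.execute(sql, (parent_id, featuretype, featuretype))
    return [row[0] for row in rows]


SAMPLE = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\tdemo\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=mrna1",
    "chr1\tdemo\texon\t500\t900\t.\t+\t.\tID=exon2;Parent=mrna1",
]

conn = build_db(SAMPLE)
print(children(conn, "gene1", featuretype="exon"))
```

The recursive query walks from gene to mRNA to exon in a single statement, which is the kind of multi-level lookup that is awkward to do by rescanning a flat text file.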

Servers and Visualization Clients

The Generic Genome Browser (GBrowse), developed as part of the Generic Model Organism Database (GMOD) project, serves as a web-based server for dynamically hosting and displaying GFF annotations. It supports loading GFF2 and GFF3 files into a relational database backend, enabling users to query and visualize genomic features through customizable tracks and plugins. GBrowse's architecture allows integration with multiple data sources, so annotations can be served in real time without extensive preprocessing.^[59]

The Distributed Annotation System (DAS) provides a protocol for decentralized servers to share GFF-based annotations across distributed resources. DAS servers store annotations in GFF format and respond to client queries for specific genomic regions, allowing data from multiple independent sources to be aggregated into a unified view. This system promotes collaborative annotation by enabling third-party contributions without a central repository.^[60]^[61]

JBrowse, a modern web-based visualization client introduced in the 2010s, loads GFF3 files to render interactive genomic tracks directly in the browser. It supports plugin extensions for variant analysis and other advanced features, emphasizing scalability for large datasets through client-side rendering. JBrowse's JavaScript-based design makes it embeddable in web applications while handling GFF imports for dynamic exploration.^[62]^[63]

The UCSC Genome Browser allows users to import GFF and GTF files as custom tracks on its server-side platform, where the backend processes and renders annotations alongside reference genomes. This server-managed approach supports visualization of user-uploaded GFF data in context with public datasets, including zooming and track reconfiguration.^[64]

The Integrative Genomics Viewer (IGV), a desktop client with ongoing updates into the 2020s, visualizes GFF annotations through high-performance zooming and panning across genomic regions. It supports direct loading of GFF files for feature tracks, integrating them with sequencing data for exploratory analysis. IGV's updates have enhanced support for large-scale annotations, maintaining efficiency for local workflows.^[65]^[66]

Integration of GFF data in these tools often leverages REST APIs, such as Ensembl's, for exporting annotations in GFF format to enable programmatic access and scalability. For handling big data, indexing techniques like tabix are employed to accelerate queries and visualization rendering.^[67]
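Region queries of the kind served by annotation servers and accelerated by indexing can be illustrated with a deliberately simplified Python sketch. It uses flat fixed-size bins, which is not the tabix algorithm (tabix works on sorted, block-compressed files with a hierarchical binning scheme plus a linear index); the class name RegionIndex and the bin size are assumptions made for this example.

```python
from collections import defaultdict

BIN_SIZE = 16384  # arbitrary bin width chosen for this illustration


class RegionIndex:
    """Toy interval index: features are filed under every bin they touch."""

    def __init__(self, bin_size=BIN_SIZE):
        self.bin_size = bin_size
        self.bins = defaultdict(list)

    def _bin_range(self, start, end):
        # Coordinates are 1-based and inclusive, as in GFF.
        return range((start - 1) // self.bin_size,
                     (end - 1) // self.bin_size + 1)

    def add(self, seqid, start, end, feature_id):
        for b in self._bin_range(start, end):
            self.bins[(seqid, b)].append((start, end, feature_id))

    def query(self, seqid, start, end):
        """Return sorted (start, end, id) tuples overlapping the region."""
        hits = set()  # a feature spanning several bins is reported once
        for b in self._bin_range(start, end):
            for f_start, f_end, fid in self.bins.get((seqid, b), ()):
                if f_start <= end and f_end >= start:
                    hits.add((f_start, f_end, fid))
        return sorted(hits)


index = RegionIndex()
index.add("chr1", 100, 900, "gene1")
index.add("chr1", 20000, 40000, "gene2")
index.add("chr2", 100, 900, "gene3")

print(index.query("chr1", 850, 1000))
print(index.query("chr1", 30000, 30010))
```

Only the bins touched by the query are inspected, so lookup cost depends on the size of the requested region rather than on the size of the whole annotation, which is the property that makes indexed files practical for interactive browsers.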

References

[1]
General Feature Format
GFF allows people to develop features and have them tested without having to maintain a complete feature-finding system. Equally, it would help those developing ...
[2]
GFF/GTF File Format - Ensembl
The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines.
[3]
Data File Formats - UCSC Genome Browser
GFF (General Feature Format) lines are based on the Sanger GFF2 specification. GFF lines have nine required fields that must be tab-separated. If the fields ...
[4]
Differences Between GTF and GFF Files in Genomic Data Analysis
Feb 27, 2024 · GFF is a standard file format used for storing genomic sequences and annotations, developed by the Sanger Centre (v2) and the Sequence Ontology ...
[5]
The GTF/GFF formats - AGAT's documentation!
The GTF/GFF formats are 9-column text formats used to describe and represent genomic features. The formats have quite evolved since 1997.
[6]
GPress: a framework for querying general feature format (GFF) files ...
In particular, the GFF files are used for describing genes and other features of DNA, RNA and protein sequences, and they contain several annotations for each ...
[7]
Annotating Genomes with GFF3 or GTF files - NCBI - NIH
Mar 8, 2024 · This page describes how to create an annotated genome submission from GFF3 or GTF files, using the beta version of our process.
[8]
GFF/GTF formats - Genome Annotation
The GTF/GFF formats. The GTF/GFF formats are 9-column text formats used to describe and represent genomic features. The formats have quite evolved since 1997, ...
[9]
GFF3 - FAIRsharing
The Generic Feature Format Version 3 (GFF3) format was developed after earlier formats, although widely used, became fragmented into multiple incompatible ...
[10]
The Sequence Ontology: a tool for the unification of genome ...
Apr 29, 2005 · For development purposes, SOFA was stabilized and released (in May 2004) for at least 12 months to allow development of software and formats.
[11]
https://www.sequenceontology.org/
[12]
GFF3 File Format - Ensembl
The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines.
[13]
The modENCODE Data Coordination Center: lessons in harvesting ...
Aug 19, 2011 · The model organism Encyclopedia of DNA Elements (modENCODE) project is a National Human Genome Research Institute (NHGRI) initiative designed to ...
[14]
GMOD
### GFF2 Specification Summary
[15]
GMOD
### Summary of GFF3 Content
[16]
http://www.sequenceontology.org/gff3.shtml
[17]
Data format - GENCODE
Format description of GENCODE GTF. A. TAB-separated standard GTF columns. a Scaffolds, patches and haplotypes names correspond to their GRC accessions.
[18]
GTF2.2: A Gene Annotation Format - The Brent Lab
GTF stands for Gene transfer format. It borrows from GFF, but has additional structure that warrants a separate definition and format name.
[19]
Frequently Asked Questions: Gene tracks - Genome Browser FAQ
We provide files in GTF format, which is an extension to GFF2, for most assemblies. More information on GTF format can be found in our FAQ. These files are ...
[20]
Specifications/gff3.md at master · The-Sequence-Ontology/Specifications
**Summary of Historical Information and Relation to Sequence Ontology for GFF3:**
[21]
https://www.ensembl.org/info/website/upload/gff.html
[22]
gff3_QC full documentation
This QC program aims to detect over 50 types of formatting errors. Errors are detected by reviewing three types of feature sets in a GFF3 file.
[23]
https://www.ensembl.org/info/website/upload/gff3.html
[24]
Calculate CDS phase in gff3 format - Biostars
Aug 28, 2018 · The phase of a CDS feature depends on the associated upstream CDS feature. If there is the length/3 of the previous CDS feature leaves a remainder of 1, your ...
[25]
Home
### Summary: TransDecoder and GFF Phase for CDS Features
[26]
None
### Summary of GFF3 Pragmas, Headers, and Directives Starting with ##
[27]
Sequence Ontology
SO was initially developed by the Gene Ontology Consortium. ... Our aim is to develop an ontology suitable for describing the features of biological sequences.
[28]
GENE - The MISO Sequence Ontology Browser
SO:0000704 (SOWiki) · A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory ...
[29]
INTRON - The MISO Sequence Ontology Browser
intron (CURRENT_SVN). SO Accession: SO:0000188 (SOWiki). Definition: A region of a primary transcript that is transcribed, but removed from within the ...
[30]
Sequence Types and Features Ontology - NCBO BioPortal
Sequence Types and Features Ontology. Last uploaded: September 11, 2025 ... The autocomplete widget accesses ontology content from the latest ...
[31]
https://github.com/modENCODE-DCC/validator
[32]
http://sequenceontology.org/browser/current_svn/term/SO:0001463
[33]
Evolution of the Sequence Ontology terms and relationships - PMC
The Sequence Ontology is an established ontology, with a large user community, for the purpose of genomic annotation. We are reforming the ontology to ...
[34]
[2202.07782] Recommendations for extending the GFF3 ... - arXiv
Feb 15, 2022 · We suggest improvements for each of the GFF3 fields, as well as the special cases of modeling functional annotations, and standard protein-coding genes.
[35]
[PDF] "Genome Annotation and Curation Using MAKER and MAKER-P". In
MAKER uses two output formats, GFF3 and FASTA. Gene predictions, evidence align- ments, repetitive elements, and the final gene models are output in GFF3 format ...
[36]
Whole-Genome Annotation with BRAKER - PMC - NIH
BRAKER is a pipeline for highly accurate and fully automated gene prediction in novel eukaryotic genomes. It combines two major tools: GeneMark-ES/ET and ...
[37]
FTP Download - Ensembl
About the data · FASTA: FASTA sequence databases of Ensembl gene, transcript and protein model predictions. · Annotated sequence · MySQL · GTF · GFF3 · JSON · EMF ...
[38]
FlyBase: introduction of the Drosophila melanogaster Release 6 ...
Nov 14, 2014 · Various cuts of the data are provided in multiple formats, ranging from GFF3 (http://www.sequenceontology.org/gff3.shtml), FASTA (21) and ...
[39]
need help with converting VCF to GTF/GFF format - SEQanswers
Mar 2, 2011 · VCF to BED/GFF is doable with an awk script. BED and GFF are essentially interchangeable with awk as well. GFF/BED to VCF is not really doable ...
[40]
featureCounts: an efficient general purpose program for assigning ...
We present featureCounts, a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments.
[41]
Roary: the pan genome pipeline - GitHub Pages
Roary takes GFF3 files as input. They must contain the nucleotide sequence at the end of the file. Input files from Prokka. All GFF3 files created by Prokka are ...
[42]
[PDF] Recommendations for extending the GFF3 specification for ... - arXiv
Sep 15, 2020 · Providing concrete guidelines for generating GFF3, and creating a standard representation of the most common biological data types in GFF3 that ...
[43]
FlyBase:ModENCODE data at FlyBase - FlyBase Wiki
Jun 30, 2025 · FlyBase offers a subset of modENCODE datasets that characterize gene expression and chromatin/transcriptional regulation in Drosophila.
[44]
(PDF) A telomere to telomere phased genome assembly and ...
Jul 15, 2025 · A telomere to telomere phased genome assembly and annotation for the Australian central bearded dragon Pogona vitticeps ... gff3 format ...
[45]
Roary: rapid large-scale prokaryote pan genome analysis
We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and accessory genes.
[46]
[PDF] standards for handling and analyzing plant pan- genomes
Jul 28, 2025 · Annotations must be provided in GFF3 or GTF format (compliant with. Sequence Ontology), with the sequence IDs in the first column exactly ...
[47]
manual page for GT-GFF3VALIDATOR(1) - GenomeTools
gt-gff3validator strictly validates GFF3 files. It can validate parent-child relationships using an ontology file and check Dbxref and Ontology_term attributes.
[48]
The modENCODE Data Coordination Center: lessons in harvesting ...
Aug 19, 2011 · Automated QC. To enforce consistency across all submissions to the modENCODE DCC, we developed a modular automated vetting tool written in Perl.
[49]
https://genometools.org/tools.html
[50]
GFF3 Online Validator - GenomeTools
The GFF3 online validator takes a GFF3 file (up to 50MB, .gz or .bz2) and validates it against the GFF3 specification.
[51]
Tools - GenomeTools
gt gff3validator Strictly validate given GFF3 files. gt gtf_to_gff3 Parse GTF2.2 file and convert it to GFF3. gt hop Cognate sequence-based homopolymer ...
[52]
NBISweden/AGAT: Another Gtf/Gff Analysis Toolkit https ... - GitHub
AGAT has the power to check, fix, pad missing information (features/attributes) of any kind of GTF and GFF to create complete, sorted and standardised gff3 ...
[53]
What can AGAT do for you?
AGAT has the power to check, fix, pad missing information (features/attributes) of any kind of GTF and GFF to create complete, sorted and standardised gff3 ...
[54]
EMBLmyGFF3: a converter facilitating genome annotation ... - NIH
Aug 13, 2018 · A robust universal converter from GFF3 format to EMBL format compatible with genome annotation submission to the European Nucleotide Archive.
[55]
Parsing GFF Files - Biopython
Biopython provides a full featured GFF parser which will handle several versions of GFF: GFF3, GFF2, and GTF. It supports writing GFF3, the latest version.
[56]
Introduction — gffutils 0.13 documentation - GitHub Pages
gffutils is a Python package for working with GFF and GTF files in a hierarchical manner. It allows operations which would be complicated or time-consuming ...
[57]
Bio::Tools::GFF(3pm) - Debian Manpages
Jan 15, 2017 · NAME¶. Bio::Tools::GFF - A Bio::SeqAnalysisParserI compliant GFF format parser. SYNOPSIS¶. use Bio::Tools::GFF; # specify input via -fh or ...
[58]
GFF3 sequence format - BioPerl
All attributes that begin with an uppercase letter are reserved for later use. Attributes that begin with a lowercase letter can be used freely by applications.
[59]
COMBINE-lab/libgff - GitHub
This is an attempt to perform a simple libraryfication of the GFF/GTF parsing code that is used in GFFRead codebase.
[60]
GFF Utilities: GffRead and GffCompare - F1000Research
Apr 28, 2020 · GffRead and GffCompare are open source programs that provide extensive and efficient solutions to manipulate files in a GTF or GFF format.
[61]
The Generic Genome Browser: A Building Block for a Model ... - NIH
Within MySQL, however, GBrowse supports two distinct schemata. One schema, called Bio::DB::GFF, is a simple schema that requires minimal preparation on the part ...
[62]
Das | Biodas | Distributed Anotation System
Distributed Annotation System (DAS) - A server system for the sharing of Reference Sequences, a system conceptually composed of a Reference Server and ...
[63]
The Distributed Annotation System - PMC - PubMed Central
Feb 18, 2000 · DAS allows sequence annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side ...
[64]
JBrowse: a dynamic web platform for genome visualization and ...
Apr 12, 2016 · JBrowse is a fast and full-featured genome browser built with JavaScript and HTML5. It is easily embedded into websites or apps but can also be served as a ...
[65]
Indexed file formats tutorial · JBrowse 1
With genometools, it has added validation checking mechanisms that are helpful gt gff3 -sortlines -tidy data/volvox.gff3 > data/volvox.sorted.gff3. With ...
[66]
Genome Browser Custom Tracks
The Genome Browser provides dozens of aligned annotation tracks that have been computed at UCSC or have been provided by outside collaborators.
[67]
IGV: Integrative Genomics Viewer
Supports both Jupyter and Google Colab. igv-reports. Generate self-contained HTML reports that consist of a table of genomic sites and ...
[68]
Integrative Genomics Viewer (IGV): high-performance genomics ...
Apr 19, 2012 · IGV supports a number of formats for genomic annotations, including BED, GFF, GTF2 [20] and PSL [21]. Visual representation of annotations ...
[69]
Exporting data via website - Ensembl
From these links you can export sequence, features in BED, CSV, TSV, GTF, GFF and GFF3 formats, and EMBL or GenBank flatfiles. ... The icon at the top right of ...