The FASTA format is a simple, text-based standard for representing biological sequences, such as DNA, RNA, or protein (amino acid) chains, using single-letter codes for nucleotides or residues, and it supports storing multiple sequences in a single file.[1] Developed in 1985 by William R. Pearson and David J. Lipman as part of their sequence alignment software package, the format originated from tools like FAST-P (for proteins) and FAST-N (for nucleotides), with "FASTA" standing for "FAST-All" to denote its broad applicability across sequence types.[2] It quickly became a cornerstone of bioinformatics due to its human-readable simplicity, lack of proprietary restrictions, and compatibility with early computational tools for database searches and alignments.[1]
At its core, a FASTA file consists of one or more sequence records. Each record starts with a definition line prefixed by a greater-than symbol (">"), followed by a unique identifier (typically ≤25 characters, using alphanumeric symbols, hyphens, underscores, or similar) and optional descriptive text; the sequence data follows on subsequent lines, conventionally wrapped at 60–80 characters for readability.[3] For nucleotide sequences, valid symbols include A, C, G, T (or U for RNA), and N for ambiguities, while protein sequences use the 20 standard amino acid codes (e.g., A for alanine, C for cysteine); the basic format carries no alignment information, and sequence lines must not contain spaces or numbers.[3] This structure ensures portability across platforms. Files often use extensions like .fasta, .fa, or .seq, and although the format has no formal magic number, a FASTA file is conventionally recognized by the ">" at the start of its first non-empty line.[1]
Widely adopted since its inception, the FASTA format underpins major databases like GenBank and UniProt, facilitating sequence submission, retrieval, and analysis in tools from BLAST to modern genomic pipelines, and remains actively maintained through community conventions without a central governing body.[3] Its evolution includes extensions for qualifiers (e.g., organism names in brackets) and variants like FASTQ for quality scores, but the core remains unchanged for its efficiency in handling large-scale biological data.[1]
Introduction
Definition and Purpose
The FASTA format is a simple, human-readable, text-based standard for representing biological sequences, including nucleotide sequences from DNA or RNA and amino acid sequences from proteins. It structures data with a single-line header, known as the definition line, that begins with a greater-than symbol ('>') followed by an identifier and optional description, succeeded by one or more lines of sequence characters without spaces or numbers.[3][1]
The primary purpose of the FASTA format is to enable efficient storage, exchange, and processing of biological sequence data in bioinformatics workflows, supporting tasks such as sequence alignment, similarity searches against databases, and interoperability among diverse software tools, all while avoiding complex binary encodings. This design promotes seamless data sharing across research communities and platforms, making it a foundational format in genomics and proteomics.[4][5]
Key advantages of FASTA include its platform-independent portability as a plain ASCII text file, straightforward parsing by both human users and computational algorithms due to its minimalistic structure, and flexibility to accommodate multiple sequences in one file for batch processing. These features have established FASTA as a de facto standard in bioinformatics, enhancing reproducibility and collaboration.[1][4]
For illustration, a basic single-sequence FASTA snippet appears as follows:
>seq1
ATGCATGCAGCTAG
This example shows the header line identifying the sequence, followed by the raw sequence data.[3]
The format originated with the FASTP and FASTA sequence comparison programs to handle input and output for rapid similarity detection in biological databases.[6][7]
History and Development
The FASTA format, originally known as the "Pearson format," was developed in 1985 by David J. Lipman, a researcher at the National Institutes of Health (NIH), and William R. Pearson, then a collaborator from the University of Virginia, as part of their FASTP program for efficient protein sequence database searches. The format was subsequently incorporated into the FASTA sequence alignment software package, released in 1988, which extended the approach to both protein and DNA sequences using a word-based indexing method to identify local alignments quickly and sensitively without requiring full dynamic programming. This addressed the growing need for handling expanding sequence databases in early bioinformatics, allowing researchers to detect distant homologies that slower methods might miss.
After the National Center for Biotechnology Information (NCBI) assumed responsibility for GenBank in 1992,[8] FASTA use continued to spread, and by the 2000s it had become a de facto standard in bioinformatics, widely used across sequence analysis software due to its simplicity and portability.[9] This evolution influenced subsequent formats, such as FASTQ, which extended FASTA by incorporating quality scores for next-generation sequencing data.[10]
Definition Line Structure
The definition line, also known as the header or descriptor line, in a FASTA file initiates each sequence record and begins with a greater-than symbol (">"), immediately followed by a unique sequence identifier without intervening spaces. This identifier serves to uniquely label the sequence and is typically an alphanumeric string, such as a locus name, database accession number, or custom label, often limited to a maximum of 25 characters in NCBI submissions. Valid characters for the identifier include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).[3][11]
Following the identifier, an optional description field may contain free-text information, such as the species name, gene or protein function, strain details, or other annotations, delimited by a space. This description can incorporate brackets for structured modifiers in NCBI formats, such as [organism=scientific name] or [moltype=mRNA], allowing automated parsing for database integration; multiple modifiers can be chained without spaces around the equals sign. The entire definition line must form a single continuous line without hard returns or line breaks, and it supports alphanumeric characters, spaces, common punctuation (e.g., parentheses, semicolons), and symbols, though excessive special characters should be avoided to ensure compatibility. In multi-sequence files, each definition line uniquely distinguishes its associated sequence.[3][12][11]
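The split between identifier, free-text description, and bracketed modifiers described above can be sketched in a few lines of Python. This is a minimal illustration, not NCBI's own parser: the function name `parse_defline` and the regular expression are assumptions made for this example, and only a single space (not a tab) is treated as the identifier delimiter.

```python
import re

# Hypothetical pattern for NCBI-style [key=value] modifiers.
MODIFIER_RE = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")

def parse_defline(line):
    """Split a definition line into (identifier, description, modifiers)."""
    if not line.startswith(">"):
        raise ValueError("definition line must start with '>'")
    body = line[1:].rstrip("\r\n")
    # The identifier runs up to the first space; the rest is description.
    identifier, _, rest = body.partition(" ")
    modifiers = {key.strip(): value.strip()
                 for key, value in MODIFIER_RE.findall(rest)}
    # Remove the modifiers and collapse leftover runs of whitespace.
    description = " ".join(MODIFIER_RE.sub("", rest).split())
    return identifier, description, modifiers
```

Bracketed text without an equals sign, such as a trailing organism name, is left in the description because it does not match the modifier pattern.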
NCBI employs specific conventions for identifiers in its FASTA outputs, historically using the format gi|GI_number|database|accession.version| followed by the description, where vertical bars (|) act as delimiters for parsing components: the "gi|" prefix denotes a GenInfo Identifier (a unique numeric ID assigned sequentially), followed by the source database (e.g., gb for GenBank, ref for RefSeq), the accession number, and its version. Although GI numbers were phased out in 2016 for new records, this pipe-delimited structure remains prevalent in legacy files and some tools. Modern NCBI FASTA headers often simplify to >accession.version description, as seen in nucleotide examples like >U49845.1 Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds, or protein examples like >NP_001234.1 hypothetical protein [Homo sapiens]. These conventions facilitate precise retrieval and annotation in bioinformatics pipelines.[12][13][14]
>gi|1293614|gb|U49845.1| Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds
[14]
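As a rough illustration of how the pipe-delimited identifier can be taken apart, the following Python sketch handles the legacy gi form, a shorter database|accession form, and the bare accession.version form. The function name `parse_ncbi_id` and the returned dictionary keys are assumptions for this example; real NCBI identifiers include further variants that it does not cover.

```python
def parse_ncbi_id(identifier):
    """Split an NCBI-style sequence identifier into its components."""
    fields = identifier.strip("|").split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        # Legacy form: gi|GI_number|database|accession.version|
        gi, db, acc_ver = fields[1], fields[2], fields[3]
    elif len(fields) >= 2:
        # Shorter form: database|accession.version|
        gi, db, acc_ver = None, fields[0], fields[1]
    else:
        # Modern form: accession.version
        gi, db, acc_ver = None, None, fields[0]
    accession, _, version = acc_ver.partition(".")
    return {"gi": gi, "db": db,
            "accession": accession, "version": version or None}
```

Applied to the identifier in the example header, this yields GI 1293614, database gb, accession U49845, and version 1.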
Sequence Representation
In the FASTA format, biological sequences are encoded using standardized single-letter codes in the data lines that immediately follow the definition line. For nucleotide sequences, these include A for adenine, C for cytosine, G for guanine, T for thymine (or U for uracil in RNA sequences), and N for any unknown or ambiguous base, adhering to IUPAC nomenclature.[3] For protein sequences, the 20 standard amino acids are represented by their conventional single-letter codes—such as A for alanine, C for cysteine, D for aspartic acid, E for glutamic acid, F for phenylalanine, and others—while B indicates asparagine or aspartic acid (Asx), Z denotes glutamine or glutamic acid (Glx), and X represents any or unknown amino acid.[15]
Sequence lines are conventionally written in uppercase letters to ensure uniformity and compatibility with bioinformatics tools, though lowercase letters are accepted and occasionally employed to denote annotations like low-complexity regions or, in specialized cases, modified residues.[16] Only valid biological symbols from the respective alphabet are permitted in these lines; spaces, numbers, punctuation, or other extraneous characters are prohibited to maintain data integrity.[17]
Ambiguities beyond the standard codes follow IUPAC guidelines for both nucleotides (e.g., R for A or G, Y for C or T) and proteins (limited to B, Z, and X as noted), allowing representation of uncertain positions without disrupting sequence parsing.[3][15] Gaps, indicated by the hyphen '-', are supported for sequence alignments to denote insertions or deletions but are typically omitted in files containing pure, unaligned sequences to avoid implying structural variations.[17][3]
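A simple way to check a sequence against these alphabets is to collect any characters that fall outside the permitted set. The sketch below is illustrative only: the names `NUCLEOTIDE_CODES`, `PROTEIN_CODES`, and `invalid_symbols` are assumptions for this example, the protein set is limited to the 20 standard residues plus B, Z, and X as described above, and input is folded to uppercase because lowercase letters are accepted.

```python
# IUPAC nucleotide codes, including U for RNA and N for any base.
NUCLEOTIDE_CODES = set("ACGTUNRYSWKMBDHV")
# 20 standard amino acids plus the ambiguity codes B, Z, and X.
PROTEIN_CODES = set("ACDEFGHIKLMNPQRSTVWY" "BZX")

def invalid_symbols(sequence, alphabet, allow_gaps=False):
    """Return the sorted list of distinct characters not in the alphabet."""
    allowed = set(alphabet)
    if allow_gaps:
        allowed.add("-")  # hyphen gaps are only meaningful in alignments
    return sorted({ch for ch in sequence.upper() if ch not in allowed})
```

An empty result means every character is valid; spaces and digits are reported like any other disallowed symbol.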
The following code block illustrates a short nucleotide sequence example:
>example_dna | Sample DNA sequence
ATCGATCGNNATCG
Here, the sequence uses standard nucleotide codes with N indicating ambiguous bases.[3]
In contrast, a short protein sequence example employs amino acid codes:
>example_protein | Sample protein sequence
MVKPLFTGILABXZ
This includes standard codes (e.g., M for methionine, A for alanine) along with ambiguities B (Asx) and Z (Glx).[15]
File Organization
Single and Multi-Sequence Files
The FASTA format accommodates both single-sequence and multi-sequence files, allowing flexibility in organizing biological sequence data. In a single-sequence file, the content begins with a single definition line starting with a greater-than symbol (">"), followed immediately by the sequence data, which continues until the end of the file (EOF). This structure is straightforward and commonly used for individual nucleotide or protein sequences, such as a single gene or contig.[3]
Multi-sequence files, often referred to as multi-FASTA, extend this by including multiple sequence records within one file. Each record starts with its own ">" definition line, followed by the corresponding sequence data; there is no explicit end marker for individual sequences. Subsequent records are delineated by the next ">" line, enabling the concatenation of several single-sequence files into a cohesive multi-sequence file without additional delimiters. This organization is widely used for datasets like gene families, alignments, or genome assemblies comprising multiple contigs.[3][18]
Parsing a FASTA file relies on recognizing ">" lines as the start of new records. For both single and multi-sequence files, software reads line by line: sequence data is collected from non-">" lines until the next ">" line or EOF is encountered, marking the boundary of the current record. This simple, line-based logic ensures robust handling across tools, though care must be taken to preserve unique identifiers in definition lines to avoid conflicts in multi-sequence contexts.[3][18]
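The line-based logic described above can be sketched as a small Python generator. This is a minimal illustration rather than any particular tool's implementation; the name `read_fasta` is hypothetical, and the choice to skip blank lines and to reject sequence data that precedes the first ">" line is an assumption of this sketch.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # A new ">" line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    # End of input closes the final record.
    if header is not None:
        yield header, "".join(chunks)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, and only one record is held in memory at a time.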
FASTA files are efficient for small to medium-sized datasets, typically handling sequences up to several megabases without issue due to their plain-text nature. However, for very large genomes—such as eukaryotic assemblies with extensive contigs—files can become unwieldy, exceeding gigabytes and straining memory during parsing or submission; in such cases, splitting into multiple files (e.g., by contig groups) is recommended to improve manageability.[3]
The following snippet illustrates a basic multi-sequence FASTA file with two short nucleotide sequences:
>seq1 Human gene example
ATGCCGTAGCTAGCTAGCTAGC
>seq2 Mouse ortholog
ATGCCGTAATTAGCTAGCTAGC
Here, each sequence is separated by its definition line, demonstrating the format's delimiter-free record structure.[18][3]
The FASTA format employs several non-mandatory conventions to enhance readability and ensure compatibility across bioinformatics tools. A primary recommendation is to wrap sequence lines at 60 to 80 characters, with 60 characters as a common default in the original FASTA implementation for display purposes and 80 characters as the upper limit suggested by major databases to prevent excessive line lengths that could complicate parsing or viewing.[16][3] This range balances human readability with efficient processing, as longer lines may exceed terminal widths or impose memory constraints in older software.
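The wrapping convention is straightforward to apply when writing records. The following Python sketch uses a hypothetical `format_record` helper with a default width of 60 characters; the name and the default are choices made for this example.

```python
def format_record(header, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence at `width`."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    lines = [">" + header]
    # Slice the sequence into consecutive chunks of at most `width` symbols.
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

A 130-residue sequence written this way occupies three sequence lines of 60, 60, and 10 characters.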
Regarding whitespace, sequences must consist solely of valid symbols (such as IUPAC nucleotide or amino acid codes) without embedded spaces, as any whitespace within sequence data is invalid and typically stripped or causes errors in parsers.[3] Blank lines are optional and may be inserted between records for visual clarity in multi-sequence files, though they are not required and must be positioned only at the file's start, between records, or at the end to avoid disrupting standard parsers.[19]
FASTA files support various end-of-line conventions, including Unix-style LF (\n), Windows-style CRLF (\r\n), and legacy Mac-style CR (\r), with most modern tools automatically normalizing these during input to ensure cross-platform compatibility.[20] The format lacks a built-in mechanism for indicating the file version itself, but individual sequence headers may incorporate version numbers (e.g., from database entries like GenBank or UniProt) to track updates to specific sequences.[21]
Best practices further emphasize cleanliness: avoid trailing spaces on any lines, as they constitute unnecessary whitespace that could lead to inconsistencies in file sizes or parsing artifacts, and refrain from including numeric line numbers or annotations, which are not part of the core specification and may be misinterpreted by software expecting pure sequence data.[22] These conventions, while flexible, promote interoperability when followed during file creation and exchange.
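Several of these conventions can be enforced mechanically before a file is exchanged. The sketch below, built around a hypothetical `clean_fasta_text` function, normalizes the three line-ending styles to LF, strips trailing whitespace from definition lines, and removes whitespace embedded in sequence lines. Dropping blank lines entirely is a simplification chosen for this example, not a requirement of the format.

```python
def clean_fasta_text(text):
    """Normalize line endings and whitespace in FASTA text."""
    # Convert CRLF first, then any remaining bare CR, to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    cleaned = []
    for line in text.split("\n"):
        if line.startswith(">"):
            cleaned.append(line.rstrip())      # no trailing spaces on headers
        else:
            stripped = "".join(line.split())   # no whitespace inside sequences
            if stripped:                       # blank lines are dropped
                cleaned.append(stripped)
    return "\n".join(cleaned) + "\n" if cleaned else ""
```

When reading from disk, Python's default text mode already translates all three line-ending styles to "\n", so the explicit replacement matters mainly for text obtained by other means.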
Variations and Extensions
File Compression and Handling
FASTA files lack a strict standard for filename extensions, with common conventions including .fasta, .fa, .fna for nucleotide sequences, .faa for protein sequences, and occasionally .fst or .seq.[23][24]
Compression is widely employed to manage the size of FASTA files, particularly for large datasets, most often with gzip (producing .gz files) and occasionally with ZIP archives.[25] These methods preserve the original text-based structure, so the decompressed content can be parsed by bioinformatics software without any change to the format. For instance, utilities such as SeqKit natively support reading and writing gzip-compressed FASTA files, enabling efficient processing of compressed inputs like input.fasta.gz.[25]
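Because gzip compression leaves the text unchanged, a reader only needs to choose the right opener. The Python sketch below uses the standard-library gzip module; the helper names `open_fasta` and `count_records`, the reliance on the .gz suffix, and the ASCII encoding are assumptions made for this example.

```python
import gzip

def open_fasta(path):
    """Open a FASTA file as text, decompressing on the fly for .gz paths."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "r", encoding="ascii")

def count_records(path):
    """Count records by counting lines that begin with '>'."""
    with open_fasta(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

The file is streamed line by line in both cases, so the compressed input is never fully expanded in memory.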
Encryption of FASTA files is uncommon due to the format's primary role in open scientific exchange, but it can be applied using general-purpose tools like GPG for secure storage or transfer of sensitive genomic data.[26] This process is not inherent to the FASTA specification and necessitates full decryption prior to parsing, as encrypted files are binary and incompatible with standard sequence readers.[26] Specialized tools, such as Cryfa, offer tailored encryption for FASTA alongside compression, maintaining usability in bioinformatics workflows while enhancing privacy for formats like FASTA and FASTQ.[27]
For handling large FASTA files, which can exceed gigabytes in size for comprehensive genomes or metagenomes, indexing is recommended to enable random access without loading the entire file into memory.[28] The samtools faidx command creates an index file (e.g., reference.fasta.fai) that maps sequence identifiers to byte offsets, facilitating rapid extraction of subsequences in FASTA format.[28] This approach supports both uncompressed and BGZF-compressed inputs, improving efficiency in tools like alignment pipelines.[28]
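To show why a byte-offset index allows random access, the following Python sketch builds an in-memory table with the same five pieces of information that a .fai file records per sequence (name, length, byte offset of the first base, bases per line, and bytes per line) and uses it to slice a subsequence. It is a simplified illustration, not samtools' implementation: the function names are hypothetical, the data is held as a bytes object rather than read with file seeks, and every sequence line in a record except the last is assumed to have the same length.

```python
def build_index(data):
    """Return {name: (length, offset, linebases, linewidth)} for FASTA bytes."""
    index, name, pos = {}, None, 0
    for raw in data.splitlines(keepends=True):
        if raw.startswith(b">"):
            name = raw[1:].split()[0].decode("ascii")
            # The sequence starts right after the header line.
            index[name] = [0, pos + len(raw), 0, 0]
        elif name is not None and raw.strip():
            entry = index[name]
            bases = len(raw.rstrip(b"\r\n"))
            if entry[2] == 0:
                # First sequence line fixes bases per line and bytes per line.
                entry[2], entry[3] = bases, len(raw)
            entry[0] += bases
        pos += len(raw)
    return {key: tuple(value) for key, value in index.items()}

def fetch(data, index, name, start, end):
    """Return bases in the 0-based half-open range [start, end)."""
    length, offset, linebases, linewidth = index[name]
    if not 0 <= start <= end <= length:
        raise ValueError("range outside sequence")

    def byte_pos(i):
        # Whole lines before base i, plus the position within its line.
        return offset + (i // linebases) * linewidth + i % linebases

    chunk = data[byte_pos(start):byte_pos(end)]
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

With a real file, the two computed byte positions would drive a seek and a bounded read, so only the requested region is touched.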
The ASCII-based nature of FASTA ensures high portability across platforms, as it relies solely on standard text characters without proprietary encoding or binary elements.[3] A plain-text file raises no byte-order concerns, and the differing line-ending conventions of each platform are normalized by most tools, so FASTA files are directly readable on Windows, macOS, Linux, and other systems using basic text editors or command-line tools.[3]
Specialized Extensions
Specialized extensions to the FASTA format adapt its simple structure to meet niche requirements in bioinformatics, such as enhanced visualization, alignment representation, quality annotation, and domain-specific notations, while building on the core header and sequence lines.
In multiple sequence alignments, FASTA files commonly incorporate gaps denoted by hyphens ('-') to represent insertions or deletions, enabling the storage and exchange of aligned sequences without dedicated formats. This practice is supported by alignment tools like MAFFT, which output results in this extended form, often including coordinate annotations in headers for reference positioning.
Prior to the standardization of FASTQ, quality scores were typically stored in separate files with a .qual extension alongside FASTA sequence files to capture sequencing error estimates; these separate files were largely replaced by FASTQ's structured integration of Phred scores in a single file.[10]
For RNA analysis, secondary structure information is frequently encoded using dot-bracket notation—where paired bases are marked by matching parentheses and unpaired ones by dots—either as a parallel string to the sequence in modified FASTA files or in dedicated structure lines. The FASTR format exemplifies this by pairing nucleotide sequences with their corresponding dot-bracket representations in a single file, facilitating integrated prediction and visualization workflows.[29]
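Dot-bracket strings are easy to validate and decode with a stack. The Python sketch below converts a structure string into a list of pairing partners; the function name `pair_table` is hypothetical, only plain parentheses and dots are handled (no pseudoknot brackets), and no particular file layout such as FASTR's is assumed.

```python
def pair_table(structure):
    """Map each position to its pairing partner (0-based), or None if unpaired."""
    partners = [None] * len(structure)
    stack = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)  # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()  # most recent unmatched opening bracket
            partners[i], partners[j] = j, i
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return partners
```

A structure string should have the same length as its nucleotide sequence, which a caller can check before decoding.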
In proteomics, post-translational modifications (PTMs) like phosphorylation or glycosylation are annotated via the Proteomics Standards Initiative Extended FASTA Format (PEFF), which extends headers with metadata keywords (e.g., MOD_RES for modified residues) and sequence markers using controlled vocabularies from resources like UniMod. PEFF maintains compatibility with standard FASTA parsers while embedding PTM sites, variants, and processing events, improving accuracy in mass spectrometry-based identifications.
Color-enhanced FASTA representations, used primarily for motif highlighting in web-based viewers, employ HTML-like tags (e.g., <span style="color:red"> around specific residues) or predefined character schemes within sequence lines to denote features like conserved domains or hydrophobicity; tools like bioSyntax apply such coloring schemes (e.g., Zappo or CLUSTAL) during file rendering for better interpretability.[30]
These extensions, while useful for targeted applications, often compromise interoperability, as standard parsers may ignore or misinterpret non-core elements like PTM metadata or structure notations, leading to data loss; libraries such as Biopython's SeqIO handle vanilla FASTA robustly but require custom adaptations for formats like PEFF or FASTR.[31]
Several command-line tools facilitate the reading, writing, and manipulation of FASTA files in bioinformatics workflows. The Unix utility grep is commonly employed for basic extraction tasks, such as counting the number of sequences in a FASTA file by identifying lines beginning with >, using the command grep -c "^>" file.fasta, which outputs the count without loading the entire file into memory.[32] For more advanced filtering and conversion, seqtk provides efficient operations like sub-sampling, trimming, and format conversion for FASTA files, processing large datasets rapidly due to its lightweight design.[33] Similarly, bioawk extends the standard awk utility with built-in support for biological formats, enabling pattern-based extraction and manipulation of FASTA sequences, such as printing sequence lengths or filtering by identifier.[34]
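For readers working outside the shell, the same kinds of summaries can be produced in a few lines of Python. The sketch below reports the length of each sequence keyed by identifier, and the number of keys corresponds to the count that grep -c "^>" would give when identifiers are unique. The function name `sequence_lengths` is an assumption for this example, and it is not a reimplementation of seqtk or bioawk.

```python
def sequence_lengths(lines):
    """Return {identifier: sequence length} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token.
            current = line[1:].split()[0] if len(line) > 1 else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths
```

Passing an open file handle processes the file line by line without loading it whole.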
Programming libraries offer robust programmatic access to FASTA files. In Python, Biopython's SeqIO module serves as a primary tool for parsing, writing, and converting FASTA data, using iterator-based functions like SeqIO.parse() to handle records sequentially and support operations such as slicing or formatting without requiring full file loading.[35] The EMBOSS suite includes seqret, a versatile command-line program for reformatting FASTA files to other sequence formats or extracting subsequences, which makes it useful for data preparation in pipelines.[36]
Alignment software routinely incorporates native FASTA support for input and output. BLAST, developed by NCBI, accepts multi-sequence FASTA queries for similarity searches against databases, enabling batch processing of nucleotide or protein sequences in standard FASTA format.[37] Clustal Omega, a multiple sequence alignment tool from EMBL-EBI, reads unaligned FASTA inputs to generate alignments using progressive methods, supporting up to thousands of sequences efficiently.[38] MAFFT, another alignment program, ingests FASTA files for accurate multiple alignments of DNA, RNA, or proteins, with options for handling diverse sequence lengths via FFT-based algorithms.
Graphical editors provide visual interfaces for multi-FASTA manipulation. Jalview allows users to import FASTA alignments for interactive editing, including gap insertion, residue modification, and consensus sequence generation, with export back to FASTA format.[39] UGENE offers a comprehensive alignment editor that loads multi-FASTA files, supports sequence trimming, annotation addition, and real-time visualization, suitable for exploratory analysis of aligned datasets.[40]
For handling large FASTA files, streaming parsers are critical to mitigate memory constraints, as they process records iteratively rather than loading entire datasets. Tools like Biopython's SeqIO and seqtk implement such streaming, enabling analysis of gigabyte-scale files on standard hardware by yielding one record at a time, thus avoiding the out-of-memory errors common in parsers that read the whole file into memory.[35][33]
Integration with Databases and Pipelines
The FASTA format is widely utilized for exporting sequence data from major bioinformatics databases, facilitating easy access and integration into downstream analyses. The National Center for Biotechnology Information (NCBI) provides nucleotide and protein sequences from GenBank and RefSeq in FASTA format, with weekly updates available for transcript and protein records to support research reproducibility.[41] Similarly, UniProt offers canonical protein sequences and manually curated isoforms in FASTA format, enabling researchers to retrieve comprehensive sets of protein data for functional annotation and comparative studies.[15] These exports ensure that FASTA files serve as a lightweight, standardized intermediary for database dissemination without the overhead of full annotation files.
In bioinformatics pipelines, particularly those involving next-generation sequencing (NGS), FASTA files function as essential inputs for reference genomes and outputs for assembled contigs. Aligners such as BWA and Bowtie2 require FASTA-formatted reference sequences to build indexes and map reads efficiently, streamlining variant calling and genome assembly workflows.[42] Genome assemblers like SPAdes generate contigs and scaffolds directly in FASTA format (e.g., contigs.fasta), which can then be fed into subsequent annotation or alignment steps.[43] This bidirectional role in pipelines enhances interoperability, as FASTA's simplicity allows seamless integration across tools without format-specific parsing.
Conversion between FASTA and other formats, such as GenBank or GFF, is supported by libraries like Biopython, which enable programmatic transformation for pipeline automation and data submission preparation. In cloud-based platforms like Galaxy, FASTA files are natively handled as a core datatype, supporting workflows for sequence manipulation, alignment, and statistical analysis through integrated tools. Regarding standards, FASTA aligns with International Nucleotide Sequence Database Collaboration (INSDC) guidelines for genome assembly submissions, where it is used alongside AGP files to provide sequence data for deposition into member databases like GenBank, EMBL, and DDBJ.[44]
A key challenge in integrating FASTA with dynamic databases lies in managing versioned sequences, as reference FASTA files evolve with database updates, complicating reproducibility in long-term pipelines. Tools and workflows must account for versioning to archive incremental changes in large FASTA databases efficiently, preventing mismatches between analyses and outdated references.[45] This issue is exacerbated in collaborative environments, where unversioned FASTA exports from sources like NCBI or UniProt can lead to inconsistencies in downstream results.
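One lightweight way to detect such mismatches, offered here as an assumption of this example rather than a method from the cited sources, is to record a content digest next to each accession.version so that a later export can be compared against it. The Python sketch below uses the standard-library hashlib module; the function name `sequence_digest` and the choice of SHA-256 over an uppercased, whitespace-free sequence are illustrative.

```python
import hashlib

def sequence_digest(sequence):
    """Return a SHA-256 hex digest of a sequence, ignoring case and wrapping."""
    # Remove line breaks and spaces, then fold case, so that re-wrapped or
    # lowercased copies of the same sequence produce the same digest.
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()
```

Two exports that carry the same identifier but different digests indicate that the underlying sequence changed between database releases.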