FASTA format

The FASTA format is a simple, text-based standard for representing biological sequences, such as DNA, RNA, or protein (amino acid) chains, using single-letter codes for nucleotides or residues, and it supports storing multiple sequences in a single file.^[1] Developed in 1985 by William R. Pearson and David J. Lipman as part of their sequence alignment software package, the format originated from tools like FAST-P (for proteins) and FAST-N (for nucleotides), with "FASTA" standing for "FAST-All" to denote its broad applicability across sequence types.^[2] It quickly became a cornerstone of bioinformatics due to its human-readable simplicity, lack of proprietary restrictions, and compatibility with early computational tools for database searches and alignments.^[1]

At its core, a FASTA file consists of one or more sequence records. Each record starts with a definition line prefixed by a greater-than symbol (">"), followed by a unique identifier (typically ≤25 characters, using alphanumeric symbols, hyphens, underscores, or similar) and optional descriptive text. The sequence data itself follows on subsequent lines, conventionally wrapped at 60–80 characters for readability.^[3] For nucleotide sequences, valid symbols include A, C, G, T (or U for RNA), and N for ambiguities, while protein sequences use the 20 standard amino acid codes (e.g., A for alanine, C for cysteine). Plain sequence records carry no gap or alignment information, and sequence lines must not contain spaces or numbers.^[3] This structure ensures portability across platforms; files often use extensions like .fasta, .fa, or .seq, and a FASTA file can be recognized by the ">" character that opens its first non-empty line.^[1]

Widely adopted since its inception, the FASTA format underpins major databases like GenBank and UniProt, facilitating sequence submission, retrieval, and analysis in tools from BLAST to modern genomic pipelines, and it is maintained through community conventions rather than by a central governing body.^[3] Its evolution includes extensions for qualifiers (e.g., organism names in brackets) and variants like FASTQ for quality scores, but the core has remained unchanged because it handles large-scale biological data efficiently.^[1]

Introduction

Definition and Purpose

The FASTA format is a simple, human-readable, text-based standard for representing biological sequences, including nucleotide sequences from DNA or RNA and amino acid sequences from proteins. It structures data with a single-line header, known as the definition line, that begins with a greater-than symbol ('>') followed by an identifier and optional description. The header is followed by one or more lines of sequence characters without spaces or numbers.^[3]^[1]

The primary purpose of the FASTA format is to enable efficient storage, exchange, and processing of biological sequence data in bioinformatics workflows. It supports tasks such as sequence alignment, similarity searches against databases, and interoperability among diverse software tools, all while avoiding complex binary encodings. This design promotes data sharing across research communities and platforms, making it a foundational format in genomics and proteomics.^[4]^[5]

Key advantages of FASTA include its platform-independent portability as a plain ASCII text file, straightforward parsing by both human users and computational algorithms due to its minimalistic structure, and flexibility to accommodate multiple sequences in one file for batch processing. These features have established FASTA as a de facto standard in bioinformatics, enhancing reproducibility and collaboration.^[1]^[4]

For illustration, a basic single-sequence FASTA snippet appears as follows:

>seq1
ATGCATGCAGCTAG

This example shows the header line identifying the sequence, followed by the raw sequence data.^[3] The format originated with the FASTP and FASTA sequence comparison programs to handle input and output for rapid similarity detection in biological databases.^[6]^[7]

History and Development

The FASTA format, originally known as the "Pearson format," was developed in 1985 by David J. Lipman, a researcher at the National Institutes of Health (NIH), and William R. Pearson of the University of Virginia, as part of their FASTP program for efficient protein sequence database searches. The format was subsequently incorporated into the FASTA sequence alignment software package, released in 1988, which extended the approach to both protein and DNA sequences. The package uses a word-based indexing method to identify local alignments quickly and sensitively without requiring full dynamic programming. This addressed the growing need to search expanding sequence databases in early bioinformatics, allowing researchers to detect distant homologies without the cost of slower, exhaustive methods.

The National Center for Biotechnology Information (NCBI) assumed responsibility for GenBank in 1992, and its services accept and distribute sequences in FASTA format, which broadened the format's reach.^[8] By the 2000s, FASTA had become a de facto standard in bioinformatics, widely used across sequence analysis software due to its simplicity and portability.^[9] This evolution influenced subsequent formats, such as FASTQ, which extended FASTA by incorporating quality scores for next-generation sequencing data.^[10]

Format Specification

Definition Line Structure

The definition line, also known as the header or descriptor line, initiates each sequence record in a FASTA file. It begins with a greater-than symbol (">"), immediately followed by a unique sequence identifier without intervening spaces. This identifier uniquely labels the sequence and is typically an alphanumeric string, such as a locus name, database accession number, or custom label, often limited to a maximum of 25 characters in NCBI submissions. Valid characters for the identifier include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).^[3]^[11]

Following the identifier, an optional description field, separated from it by a space, may contain free-text information such as the species name, gene or protein function, strain details, or other annotations. This description can incorporate brackets for structured modifiers in NCBI formats, such as [organism=scientific name] or [moltype=mRNA], allowing automated parsing for database integration; multiple modifiers can be chained, with no spaces around the equals sign. The entire definition line must form a single continuous line without hard returns or line breaks. It supports alphanumeric characters, spaces, common punctuation (e.g., parentheses, semicolons), and symbols, though excessive special characters should be avoided to ensure compatibility. In multi-sequence files, each definition line uniquely distinguishes its associated sequence.^[3]^[12]^[11]

NCBI employs specific conventions for identifiers in its FASTA outputs. Historically it used the format gi|GI_number|database|accession.version| followed by the description, where vertical bars (|) act as delimiters for parsing components: the "gi|" prefix denotes a GenInfo Identifier (a unique numeric ID assigned sequentially), followed by the source database (e.g., gb for GenBank, ref for RefSeq), the accession number, and its version. Although GI numbers were phased out in 2016 for new records, this pipe-delimited structure remains prevalent in legacy files and some tools. Modern NCBI FASTA headers often simplify to >accession.version description, as seen in nucleotide examples like >U49845.1 Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds, or protein examples like >NP_001234.1 hypothetical protein [Homo sapiens]. These conventions facilitate precise retrieval and annotation in bioinformatics pipelines.^[12]^[13]^[14]

>gi|1293614|gb|U49845.1| Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds

^[14]
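The header layout described above can be illustrated with a minimal Python sketch. The names `Defline`, `parse_defline`, and `split_legacy_ncbi_id` are hypothetical helpers introduced here for illustration, not part of any NCBI tool, and the sketch assumes the identifier ends at the first whitespace character.

```python
from typing import Dict, NamedTuple


class Defline(NamedTuple):
    seq_id: str       # text between ">" and the first whitespace
    description: str  # remainder of the line, possibly empty


def parse_defline(line: str) -> Defline:
    """Split a FASTA definition line into identifier and description."""
    if not line.startswith(">"):
        raise ValueError("definition line must start with '>'")
    parts = line[1:].strip().split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return Defline(seq_id, description)


def split_legacy_ncbi_id(seq_id: str) -> Dict[str, str]:
    """Break a legacy 'gi|number|db|accession.version|' identifier into parts."""
    fields = seq_id.strip("|").split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return {"gi": fields[1], "db": fields[2], "accession": fields[3]}
    # Assumed modern style: the identifier is already accession.version.
    return {"accession": seq_id}


legacy = parse_defline(">gi|1293614|gb|U49845.1| Saccharomyces cerevisiae TCP1-beta gene, partial cds")
modern = parse_defline(">NP_001234.1 hypothetical protein [Homo sapiens]")
print(split_legacy_ncbi_id(legacy.seq_id))
print(split_legacy_ncbi_id(modern.seq_id))
```

Splitting on the first whitespace keeps the parser indifferent to what the description contains, which matters because descriptions are free text.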

Sequence Representation

In the FASTA format, biological sequences are encoded using standardized single-letter codes in the data lines that immediately follow the definition line. For nucleotide sequences, these include A for adenine, C for cytosine, G for guanine, T for thymine (or U for uracil in RNA sequences), and N for any unknown or ambiguous base, adhering to IUPAC nomenclature.^[3] For protein sequences, the 20 standard amino acids are represented by their conventional single-letter codes, such as A for alanine, C for cysteine, D for aspartic acid, E for glutamic acid, and F for phenylalanine. In addition, B indicates asparagine or aspartic acid (Asx), Z denotes glutamine or glutamic acid (Glx), and X represents any or unknown amino acid.^[15]

Sequence lines are conventionally written in uppercase letters to ensure uniformity and compatibility with bioinformatics tools, though lowercase letters are accepted and occasionally employed to denote annotations like low-complexity regions or, in specialized cases, modified residues.^[16] Only valid biological symbols from the respective alphabet are permitted in these lines; spaces, numbers, and other extraneous characters are prohibited to maintain data integrity.^[17]

Ambiguities beyond the standard codes follow IUPAC guidelines for both nucleotides (e.g., R for A or G, Y for C or T) and proteins (most commonly B, Z, and X as noted), allowing representation of uncertain positions without disrupting sequence parsing.^[3]^[15] Gaps, indicated by the hyphen '-', are supported in sequence alignments to denote insertions or deletions but are typically omitted from files containing unaligned sequences, where they would carry no meaning.^[17]^[3]

The following code block illustrates a short nucleotide sequence example:

>example_dna | Sample DNA sequence
ATCGATCGNNATCG

Here, the sequence uses standard nucleotide codes with N indicating ambiguous bases.^[3] In contrast, a short protein sequence example employs amino acid codes:

>example_protein | Sample protein sequence
MVKPLFTGILABXZ

This includes standard codes (e.g., M for methionine, A for alanine) along with ambiguities B (Asx) and Z (Glx).^[15]
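As a rough illustration of these alphabets, the following Python sketch reports any characters in a sequence that fall outside the expected symbol set. The constant and function names are hypothetical, and the alphabets are one reasonable reading of the codes listed above (IUPAC nucleotide codes, and the 20 standard amino acids plus B, Z, and X), not a definitive validator.

```python
# IUPAC nucleotide codes: the four bases, U for RNA, and the ambiguity symbols.
NUCLEOTIDE_CODES = set("ACGTURYSWKMBDHVN")
# The 20 standard amino acids plus the ambiguity codes B, Z and X.
PROTEIN_CODES = set("ACDEFGHIKLMNPQRSTVWY") | set("BZX")


def invalid_symbols(sequence: str, alphabet: set, allow_gaps: bool = False) -> set:
    """Return the set of characters in `sequence` that are not in `alphabet`."""
    allowed = alphabet | ({"-"} if allow_gaps else set())
    # Compare in uppercase because lowercase residues are accepted in FASTA.
    return {ch for ch in sequence.upper() if ch not in allowed}


print(invalid_symbols("ATCGATCGNNATCG", NUCLEOTIDE_CODES))
print(invalid_symbols("MVKPLFTGILABXZ", PROTEIN_CODES))
print(invalid_symbols("ATG 12", NUCLEOTIDE_CODES))
```

Returning the offending characters, rather than a plain true/false result, makes it easier to see whether a file contains stray whitespace, digits, or gap characters.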

File Organization

Single and Multi-Sequence Files

The FASTA format accommodates both single-sequence and multi-sequence files, allowing flexibility in organizing biological sequence data. In a single-sequence file, the content begins with a single definition line starting with a greater-than symbol (">"), followed immediately by the sequence data, which continues until the end of the file (EOF). This structure is straightforward and commonly used for individual nucleotide or protein sequences, such as a single gene or contig.^[3]

Multi-sequence files, often referred to as multi-FASTA files, extend this by including multiple sequence records within one file. Each record starts with its own ">" definition line, followed by the corresponding sequence data; there is no explicit end marker for individual sequences. Subsequent records are delineated by the next ">" line, so several single-sequence files can be concatenated into one multi-sequence file without additional delimiters. This organization is widely used for datasets like gene families, alignments, or genome assemblies comprising multiple contigs.^[3]^[18]

Parsing a FASTA file relies on recognizing ">" lines as the start of new records. For both single and multi-sequence files, software reads line by line: sequence data is collected from non-">" lines until the next ">" line or EOF is encountered, marking the boundary of the current record. This simple, line-based logic ensures robust handling across tools, though identifiers in definition lines must remain unique to avoid conflicts in multi-sequence contexts.^[3]^[18]

Because they are plain text, FASTA files are efficient for small to medium-sized datasets and typically handle sequences up to several megabases without issue. For very large genomes, such as eukaryotic assemblies with extensive contigs, files can exceed gigabytes and strain memory during parsing or submission; in such cases, splitting into multiple files (e.g., by contig groups) is recommended to improve manageability.^[3]

The following snippet illustrates a basic multi-sequence FASTA file with two short nucleotide sequences:

>seq1 Human gene example
ATGCCGTAGCTAGCTAGCTAGC
>seq2 Mouse ortholog
ATGCCGTAATTAGCTAGCTAGC

Here, each sequence is separated by its definition line, demonstrating the format's delimiter-free record structure.^[18]^[3]
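The line-based parsing logic described in this section can be sketched in a few lines of Python. The function name `read_fasta` is a hypothetical helper, and the sketch assumes blank lines may be skipped and that the header is returned without its leading ">"; real parsers differ in how strictly they treat malformed input.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (definition_line, sequence) pairs, one record at a time."""
    header = None
    chunks = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # a new ">" closes the previous record
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # EOF closes the last record


example = io.StringIO(
    ">seq1 Human gene example\nATGCCGTAGCTAGCTAGCTAGC\n"
    ">seq2 Mouse ortholog\nATGCCGTAATT\nAGCTAGCTAGC\n"
)
for header, sequence in read_fasta(example):
    print(header, len(sequence))
```

Writing the parser as a generator means only one record is held in memory at a time, which suits the large files discussed above.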

Formatting Conventions

The FASTA format employs several non-mandatory conventions to enhance readability and ensure compatibility across bioinformatics tools. A primary recommendation is to wrap sequence lines at 60 to 80 characters, with 60 characters as a common default in the original FASTA implementation for display purposes and 80 characters as the upper limit suggested by major databases to prevent excessive line lengths that could complicate parsing or viewing.^[16]^[3] This range balances human readability with efficient processing, as longer lines may exceed terminal widths or impose memory constraints in older software.

Regarding whitespace, sequences must consist solely of valid symbols (such as IUPAC nucleotide or amino acid codes) without embedded spaces; whitespace within sequence data is invalid and is typically either stripped by parsers or reported as an error.^[3] Blank lines are optional and may be inserted between records for visual clarity in multi-sequence files. If present, they should appear only at the start of the file, between records, or at the end, so that they do not disrupt standard parsers.^[19]

FASTA files are found with different end-of-line conventions, including Unix-style LF (\n), Windows-style CRLF (\r\n), and legacy Mac-style CR (\r). Most modern tools accept LF and CRLF and normalize them during input, but support for CR-only files is not universal, so converting such files before processing is safer.^[20] The format lacks a built-in mechanism for indicating the file version itself, but individual sequence headers may incorporate version numbers (e.g., from database entries like GenBank or UniProt) to track updates to specific sequences.^[21]

Best practices further emphasize cleanliness. Trailing spaces should be avoided on all lines, as they can lead to inconsistencies in file sizes or parsing artifacts. Numeric line numbers and inline annotations should also be left out, since they are not part of the core specification and may be misinterpreted by software expecting pure sequence data.^[22] These conventions, while flexible, promote interoperability when followed during file creation and exchange.
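A minimal sketch of a writer that follows these conventions is shown below. The function name `format_record` and the default width of 60 are illustrative choices, and the sketch assumes Unix-style line endings on output.

```python
def format_record(header: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record with the sequence wrapped at `width` characters."""
    if width < 1:
        raise ValueError("width must be positive")
    # Drop any embedded whitespace, including stray CR or LF characters.
    cleaned = "".join(sequence.split())
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    # Strip the header so no trailing spaces end up on the definition line.
    return "\n".join([">" + header.strip()] + lines) + "\n"


print(format_record("seq1 demo", "ACGT" * 40, width=60), end="")
```

Wrapping every line of a record at the same width also keeps the file compatible with byte-offset indexing schemes that rely on uniform line lengths.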

Variations and Extensions

File Compression and Handling

FASTA files lack a strict standard for filename extensions. Common conventions include .fasta and .fa for generic files, .fna for nucleotide sequences, .faa for protein sequences, and occasionally .fst or .seq.^[23]^[24]

Compression is widely employed to manage the size of FASTA files, particularly for large datasets, using tools like gzip to produce .gz files or ZIP to produce .zip archives.^[25] These methods preserve the original text-based structure, so a decompressed file is identical to the original and can be parsed directly by bioinformatics software.^[25] For instance, utilities such as SeqKit natively support reading and writing gzip-compressed FASTA files, enabling efficient processing of compressed inputs like input.fasta.gz.^[25]

Encryption of FASTA files is uncommon due to the format's primary role in open scientific exchange, but it can be applied using general-purpose tools like GPG for secure storage or transfer of sensitive genomic data.^[26] This process is not part of the FASTA specification and requires full decryption before parsing, as encrypted files are binary and incompatible with standard sequence readers.^[26] Specialized tools, such as Cryfa, offer encryption combined with compression for genomic formats including FASTA and FASTQ, maintaining usability in bioinformatics workflows while enhancing privacy.^[27]

For large FASTA files, which can exceed gigabytes in size for comprehensive genomes or metagenomes, indexing is recommended to enable random access without loading the entire file into memory.^[28] The samtools faidx command creates an index file (e.g., reference.fasta.fai) that maps sequence identifiers to byte offsets, allowing rapid extraction of subsequences in FASTA format.^[28] This approach supports both uncompressed and BGZF-compressed inputs, improving efficiency in tools like alignment pipelines.^[28]

The ASCII-based nature of FASTA ensures high portability across platforms, as it relies solely on standard text characters without proprietary encoding or binary elements.^[3] A plain-text file has no byte-order dependence, so FASTA files are directly readable on Windows, macOS, Linux, and other systems using basic text editors or command-line tools; the main platform difference that remains is the line-ending convention.^[3]
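The byte-offset indexing idea can be sketched as follows. This is not the samtools implementation; it is a simplified Python illustration that records, for each sequence, the same kind of information a .fai index holds (length, offset of the first base, bases per line, bytes per line). The names `build_index` and `fetch` are hypothetical, and the sketch assumes every sequence line in a record except the last has the same length.

```python
import io
from typing import BinaryIO, Dict, Tuple

# One entry per record: (sequence length, byte offset of the first base,
# bases per line, bytes per line including the line terminator).
IndexEntry = Tuple[int, int, int, int]


def build_index(handle: BinaryIO) -> Dict[str, IndexEntry]:
    """Scan a FASTA file once and record where each sequence starts."""
    index: Dict[str, IndexEntry] = {}
    name = None
    length = offset = line_bases = line_width = 0
    position = 0  # byte offset of the line currently being read
    for raw in handle:
        if raw.startswith(b">"):
            if name is not None:
                index[name] = (length, offset, line_bases, line_width)
            fields = raw[1:].split()
            name = fields[0].decode("ascii") if fields else ""
            length = line_bases = line_width = 0
            offset = position + len(raw)  # sequence begins right after the header
        elif name is not None and raw.strip():
            bases = len(raw.rstrip(b"\r\n"))
            if line_bases == 0:  # the first sequence line sets the wrap width
                line_bases, line_width = bases, len(raw)
            length += bases
        position += len(raw)
    if name is not None:
        index[name] = (length, offset, line_bases, line_width)
    return index


def fetch(handle: BinaryIO, entry: IndexEntry, start: int, end: int) -> str:
    """Return bases start..end-1 (0-based) without reading the whole file."""
    length, offset, line_bases, line_width = entry
    end = min(end, length)
    if start < 0 or start >= end:
        return ""

    def byte_position(base: int) -> int:
        return offset + (base // line_bases) * line_width + base % line_bases

    handle.seek(byte_position(start))
    chunk = handle.read(byte_position(end - 1) - byte_position(start) + 1)
    return b"".join(chunk.split()).decode("ascii")  # drop embedded newlines


data = b">chr1 demo\nACGTAC\nGTACGT\nAC\n>chr2\nTTTT\n"
index = build_index(io.BytesIO(data))
print(index["chr1"])
print(fetch(io.BytesIO(data), index["chr1"], 4, 9))
```

The offset arithmetic only works when lines are wrapped uniformly, which is one practical reason for the line-wrapping conventions described earlier.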

Specialized Extensions

Specialized extensions to the FASTA format adapt its simple structure to meet niche requirements in bioinformatics, such as enhanced visualization, alignment representation, quality annotation, and domain-specific notations, while building on the core header and sequence lines.

In multiple sequence alignments, FASTA files commonly incorporate gaps denoted by hyphens ('-') to represent insertions or deletions, enabling the storage and exchange of aligned sequences without dedicated formats. This practice is supported by alignment tools like MAFFT, which can output results in this gapped form, sometimes with coordinate annotations in headers for reference positioning. Prior to the standardization of FASTQ, quality scores were typically stored in separate files with a .qual extension alongside FASTA sequence files to capture sequencing error estimates; these separate files were largely replaced by FASTQ, which stores Phred scores together with the sequence in a single file.^[10]

For RNA analysis, secondary structure information is frequently encoded using dot-bracket notation, in which paired bases are marked by matching parentheses and unpaired ones by dots. The structure string is stored either as a parallel string to the sequence in modified FASTA files or in dedicated structure lines. The FASTR format exemplifies this by pairing nucleotide sequences with their corresponding dot-bracket representations in a single file, facilitating integrated prediction and visualization workflows.^[29]

In proteomics, post-translational modifications (PTMs) like phosphorylation or glycosylation are annotated via the Proteomics Standards Initiative Extended FASTA Format (PEFF), which extends headers with backslash-prefixed key-value pairs (e.g., \ModResPsi or \ModResUnimod for modified residues) that draw on controlled vocabularies from resources like UniMod. PEFF remains largely readable by standard FASTA parsers while describing PTM sites, variants, and processing events in the header, improving accuracy in mass spectrometry-based identifications.

Color-enhanced FASTA representations are used primarily for motif highlighting. Coloring is generally applied when a file is displayed rather than stored in it: tools like bioSyntax apply coloring schemes (e.g., Zappo or CLUSTAL) during file rendering for better interpretability, and web-based viewers may wrap residues in HTML-like tags (e.g., <span style="color:red">) to denote features like conserved domains or hydrophobicity. Such markup inside sequence lines is not valid FASTA and should be limited to display output.^[30]

These extensions, while useful for targeted applications, often compromise interoperability, as standard parsers may ignore or misinterpret non-core elements like PTM metadata or structure notations, leading to data loss. Libraries such as Biopython's SeqIO handle plain FASTA robustly but require custom adaptations for formats like PEFF or FASTR.^[31]
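To make the dot-bracket convention concrete, the following Python sketch pairs a sequence with its structure string and recovers the base-pair positions. The function name `paired_positions` is hypothetical, the example sequence is invented for illustration, and the sketch handles only plain parentheses and dots, not the extended bracket types some tools use for pseudoknots.

```python
from typing import List, Tuple


def paired_positions(structure: str) -> List[Tuple[int, int]]:
    """Return sorted (i, j) index pairs for matching parentheses in dot-bracket notation."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)  # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))  # closes the most recent open base
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


sequence = "GGGAAACCC"   # illustrative hairpin, not taken from a real record
structure = "(((...)))"
assert len(sequence) == len(structure)  # one structure symbol per base
for i, j in paired_positions(structure):
    print(i, sequence[i], j, sequence[j])
```

A standard FASTA parser would treat the structure line as sequence data, which is why such files need a format-aware reader.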

Usage in Bioinformatics

Common Software Tools

Several command-line tools facilitate the reading, writing, and manipulation of FASTA files in bioinformatics workflows. The Unix utility grep is commonly employed for basic extraction tasks, such as counting the number of sequences in a FASTA file by identifying lines beginning with >, using the command grep -c "^>" file.fasta, which outputs the count without loading the entire file into memory.^[32] For more advanced filtering and conversion, seqtk provides efficient operations like sub-sampling, trimming, and format conversion for FASTA files, processing large datasets rapidly due to its lightweight design.^[33] Similarly, bioawk extends the standard awk utility with built-in support for biological formats, enabling pattern-based extraction and manipulation of FASTA sequences, such as printing sequence lengths or filtering by identifier.^[34]

Programming libraries offer programmatic access to FASTA files. In Python, Biopython's SeqIO module serves as a primary tool for parsing, writing, and converting FASTA data, using iterator-based functions like SeqIO.parse() to handle records sequentially and support operations such as slicing or formatting without requiring full file loading.^[35] The EMBOSS suite includes seqret, a versatile command-line program for reformatting FASTA files to other sequence formats or extracting subsequences, which makes it useful for data preparation in pipelines.^[36]

Alignment software routinely incorporates native FASTA support for input and output. BLAST, developed by NCBI, accepts multi-sequence FASTA queries for similarity searches against databases, enabling batch processing of nucleotide or protein sequences in standard FASTA format.^[37] Clustal Omega, a multiple sequence alignment tool from EMBL-EBI, reads unaligned FASTA inputs to generate alignments using progressive methods, supporting up to thousands of sequences efficiently.^[38] MAFFT, another alignment program, ingests FASTA files for accurate multiple alignments of DNA, RNA, or proteins, with options for handling diverse sequence lengths via FFT-based algorithms.

Graphical editors provide visual interfaces for multi-FASTA manipulation. Jalview allows users to import FASTA alignments for interactive editing, including gap insertion, residue modification, and consensus sequence generation, with export back to FASTA format.^[39] UGENE offers a comprehensive alignment editor that loads multi-FASTA files and supports sequence trimming, annotation addition, and real-time visualization, making it suitable for exploratory analysis of aligned datasets.^[40]

For large FASTA files, streaming parsers are critical to mitigate memory constraints, as they process records iteratively rather than loading entire datasets. Tools like Biopython's SeqIO and seqtk implement such streaming, enabling analysis of gigabyte-scale files on standard hardware by yielding one record at a time and thus avoiding the out-of-memory errors common in parsers that read a whole file into memory first.^[35]^[33]
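For readers who prefer Python to the shell, the record count produced by grep -c "^>" can be approximated with the standard library alone. The helper names `open_fasta` and `count_records` are hypothetical, and the sketch assumes gzip-compressed inputs can be recognized by the two-byte gzip signature at the start of the file.

```python
import gzip
import os
import tempfile


def open_fasta(path: str):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic number
    return gzip.open(path, "rt") if is_gzip else open(path, "r")


def count_records(path: str) -> int:
    """Count definition lines while streaming the file line by line."""
    with open_fasta(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Small demonstration using temporary files.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "demo.fasta")
    packed = os.path.join(tmp, "demo.fasta.gz")
    records = ">seq1\nACGT\n>seq2\nTTGA\n"
    with open(plain, "w") as out:
        out.write(records)
    with gzip.open(packed, "wt") as out:
        out.write(records)
    print(count_records(plain), count_records(packed))
```

Because the file is consumed one line at a time, memory use stays flat regardless of file size, which mirrors the streaming behavior described above.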

Integration with Databases and Pipelines

The FASTA format is widely used for exporting sequence data from major bioinformatics databases, facilitating easy access and integration into downstream analyses. The National Center for Biotechnology Information (NCBI) provides nucleotide and protein sequences from GenBank and RefSeq in FASTA format, with weekly updates available for transcript and protein records to support research reproducibility.^[41] Similarly, UniProt offers canonical protein sequences and manually curated isoforms in FASTA format, enabling researchers to retrieve comprehensive sets of protein data for functional annotation and comparative studies.^[15] These exports let FASTA files serve as a lightweight, standardized intermediary for database dissemination without the overhead of full annotation files.

In bioinformatics pipelines, particularly those involving next-generation sequencing (NGS), FASTA files function as essential inputs for reference genomes and outputs for assembled contigs. Aligners such as BWA and Bowtie2 require FASTA-formatted reference sequences to build indexes and map reads efficiently, streamlining variant calling and genome assembly workflows.^[42] Genome assemblers like SPAdes generate contigs and scaffolds directly in FASTA format (e.g., contigs.fasta), which can then be fed into subsequent annotation or alignment steps.^[43] This role as both input and output enhances interoperability, as FASTA's simplicity lets tools exchange data without format-specific parsing.

Conversion between FASTA and other formats, such as GenBank or GFF, is supported by libraries like Biopython, which enable programmatic transformation for pipeline automation and data submission preparation. In cloud-based platforms like Galaxy, FASTA files are natively handled as a core datatype, supporting workflows for sequence manipulation, alignment, and statistical analysis through integrated tools. Regarding standards, FASTA aligns with International Nucleotide Sequence Database Collaboration (INSDC) guidelines for genome assembly submissions, where it is used alongside AGP files to provide sequence data for deposition into member databases like GenBank, EMBL, and DDBJ.^[44]

A key challenge in integrating FASTA with dynamic databases lies in managing versioned sequences: reference FASTA files evolve with database updates, complicating reproducibility in long-term pipelines. Tools and workflows must account for versioning to archive incremental changes in large FASTA databases efficiently, preventing mismatches between analyses and outdated references.^[45] This issue is exacerbated in collaborative environments, where unversioned FASTA exports from sources like NCBI or UniProt can lead to inconsistencies in downstream results.

References

[1]
FASTA Database Format - Library of Congress
FASTA is a text-based, bioinformatic data format used to store nucleotide or amino acid sequences (e.g. Deoxyribonucleic Acid [DNA] or Ribonucleic Acid [RNA]).
[2]
https://doi.org/10.1126/science.2983426
[3]
FASTA Format for Nucleotide Sequences - NCBI - NIH
Jun 18, 2025 · In FASTA format the line before the nucleotide sequence, called the FASTA definition line, must begin with a carat (">"), followed by a unique SeqID (sequence ...
[4]
FASTAFS: file system virtualisation of random access compressed ...
FASTA is a file format used for storing nucleotide and amino acid polymeric sequences and is compatible with a high variety of bioinformatics software. It is ...
[5]
BTEP course - Bioinformatics - National Cancer Institute
SEQUENCE FILE FORMATS FASTA Format. File can contain one or more sequences. The format specifies a single header line which starts with a ">" character, follow ...
[6]
Improved tools for biological sequence comparison - PubMed
Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988 Apr;85(8):2444-8. doi: 10.1073/pnas.85.8.2444. Authors. W R Pearson , D J ...
[7]
A Brief History of NCBI's Formation and Growth - NIH
1992: GenBank at NCBI. NCBI assumes responsibility for GenBank, a database of nucleotide sequences, and collaborates in its development with international ...
[8]
DATA RESOURCES AND ANALYSES FAIR Header Reference ...
Dec 1, 2023 · The FASTA format specification (originally the "Pearson format") was created by William Pearson and David Lipman in 1985 (4), but has since ...
[9]
The Sanger FASTQ file format for sequences with quality scores ...
Dec 16, 2009 · This article defines the FASTQ format, covering the original Sanger standard, the Solexa/Illumina variants and conversion between them.
[10]
BankIt Submission Help: Definition Lines in FASTA sequence format
Definition Lines should be at the end of the first line in the nucleotide FASTA format which should be followed by a hard return and then the actual nucleotide ...
[11]
Modifiers for FASTA Definition Lines - NCBI - NIH
Sep 13, 2024 · Many of the descriptors that refer to the sequenced molecule and the genetic code can be edited using the FASTA definition line. In all cases, ...
[12]
NCBI is Phasing Out Sequence GIs - Here's What You Need to Know
Jul 15, 2016 · Any code that parses GI numbers from NCBI FASTA records (again, from any NCBI source) will break. Why? Same reason. The GI numbers will no ...
[13]
Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) - Nucleotide - NCBI
- **FASTA Definition Line**: `>gi|1293614|gb|U49845.1| Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds`
[14]
fasta in UniProt help search
How to retrieve sets of protein sequences? UniProtKB canonical sequences are also available in FASTA format, as are additional manually curated isoform ...
[15]
[PDF] The FASTA program package Introduction
Nov 21, 2019 · The FASTA programs offer several advantages over BLAST: 1. Rigorous algorithms unavailable in BLAST (Table I).
[16]
[PDF] Briefly: Bioinformatics File Formats
Sequence can contain: newline characters (“\n”), ACGT, N, acgt, n, x, . or - (gaps), IUPAC ambiguity codes BDHV etc., alternates like [A/T], ...
[17]
FASTA format example - Bioinformatics
FASTA format description. A sequence in FASTA format consists of: One line starting with a ">" sign, followed by a sequence identification code.
[18]
Script for breaking large .fa files into smaller files of [N] sequences
Feb 21, 2014 · NCBI requires assemblies containing more than 20k contigs to be split into chunks of 10k. One of the assemblies I'm submitting has over 800k ...
[19]
FASTA/QUAL format (skbio.io.format.fasta)
If a boolean array, it indicates characters to write in lowercase. Characters in the sequence corresponding to True values will be written in lowercase. The ...
[20]
seq_io::fasta - Rust
The parser handles UNIX (LF) and Windows (CRLF) line endings, but not old Mac-style (CR) endings. However, FASTA writing currently always uses UNIX line endings ...
[21]
FASTA headers | UniProt help
Jun 11, 2025 · This format is only available for UniParc entry sets that correspond to the sequences of a proteome. It contains biological information from the ...
[22]
FASTA file format — PacBioFileFormats 13.0.0 documentation
Sequences in FASTA files should be wrapped at a uniform line length, to enable indexing. (A common convention is to wrap lines at 60 characters.) Windows and ...
[23]
FASTA sequence format - BioPerl
Description. One of the oldest and simplest sequence formats. A single header line followed by 1 or more sequence lines.
[24]
File formats used in bioinformatics - GitHub username - rnnh
FASTA filename extensions. FASTA files usually end with the extension .fasta . This extension is arbitrary, as the content of the file determines its format ...
[25]
Usage - SeqKit - Ultrafast FASTA/Q kit - Wei Shen's Bioinformatic tools
To keep the order, just compress the FASTA file (input.fasta) and use the compressed one (input.fasta.gz) as the input. 2. Use "seqkit grep" for extracting ...
[26]
File encryption and decryption made easy with GPG - Red Hat
Jun 15, 2021 · GPG can encrypt files using `-c`, decrypt with `-d`, or extract/decrypt with no option. Encrypting adds .gpg extension. Decrypting removes ...
[27]
Cryfa: a secure encryption tool for genomic data - Oxford Academic
In this paper, we propose Cryfa, a fast secure encryption tool for genomic data, namely in Fasta, Fastq, VCF, SAM and BAM formats.
[28]
samtools-faidx(1) manual page
May 30, 2025 · Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file.
[29]
FASTR: A novel data format for concomitant representation of RNA ...
RNA secondary structure is typically represented using a combination of a string of nucleotide characters along with the corresponding dot-bracket notation ...
[30]
bioSyntax: syntax highlighting for computational biology
Aug 22, 2018 · In gedit, sublime and vim, amino acid FASTA files can be coloured using CLUSTAL [35], Taylor [36], Zappo [19] or hydrophobicity [20] colour ...
[31]
Proteomics Standards Initiative Extended FASTA Format - PMC
First, FASTA files cannot contain metadata about the collection itself: its origin, its production date, key assumptions and parameters used in its production, ...
[32]
Useful Bash Commands To Handle Fasta Files - Biostars
Feb 22, 2012 · My top used bash commands for fasta files: (1) counting number of sequences in a fasta file: grep -c "^>" file.fa (2) add something to end of all header lines.
[33]
lh3/seqtk: Toolkit for processing sequences in FASTA/Q formats
Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be ...
[34]
lh3/bioawk: BWK awk modified for biological data - GitHub
Bioawk is an extension to Brian Kernighan's awk, adding the support of several common biological data formats, including optionally gzip'ed BED, GFF, SAM, ...
[35]
Introduction to SeqIO - Biopython
FASTA format variant with no line wrapping and exactly two lines per record. FASTQ files are a bit like FASTA files but also include sequencing qualities. In ...
[36]
seqret manual - EMBOSS
Because you can specify the output sequence format, seqret is a program to reformat a sequence. ... Simple Pearson FASTA format, an alias for "fasta" format. ncbi ...
[37]
Query Input and database selection - BLAST - NIH
FASTA. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished ...
[38]
Clustal Omega < Job Dispatcher < EMBL-EBI
Clustal Omega is a new multiple sequence alignment program that uses seeded guide trees and HMM profile-profile techniques to generate alignments between three ...
[39]
Jalview Home Page - Jalview
Jalview is a free cross-platform program for multiple sequence alignment editing, visualisation and analysis. Use it to align, view and edit sequence ...
[40]
Alignment Editor Features - Unipro UGENE
The Alignment Editor is a powerful tool for visualizing and editing DNA, RNA, or protein multiple sequence alignments. The editor supports various multiple ...
[41]
RefSeq Frequently Asked Questions (FAQ) - NCBI
Nov 15, 2010 · Data is provided on a weekly basis for transcript and protein records in FASTA and GenBank flat file formats.
[42]
Alignment - NGS Analysis - NYU
Alternative aligners such as Bowtie2 may be used. Note: Most ... sequence (in FASTA format) using the following command: bwa index <reference.fasta>.
[43]
SPAdes output - SPAdes Assembly Toolkit
SPAdes produces assembly graph in GFA 1.2 and legacy FASTG formats. To view GFA and FASTG files we recommend to use Bandage-NG visualization tool.
[44]
INSDC standards for genome assembly submission
This document lays out the requirements for submission of genome assembly information into INSDC databases.
[45]
Sequence database versioning for command line and Galaxy ...
As outlined in motivation, the challenge is to efficiently archive versions of large FASTA format reference sequence databases which usually grow with many ...