
FASTA format

The FASTA format is a simple, text-based standard for representing biological sequences, such as DNA, RNA, or protein (amino acid) chains, using single-letter codes for nucleotides or residues, and it supports storing multiple sequences in a single file. Developed in 1985 by William R. Pearson and David J. Lipman as part of their sequence alignment software package, the format originated from tools like FAST-P (for proteins) and FAST-N (for nucleotides), with "FASTA" standing for "FAST-All" to denote its applicability across sequence types. It quickly became a cornerstone of bioinformatics due to its human-readable simplicity, lack of proprietary restrictions, and compatibility with early computational tools for database searches and alignments.

At its core, a file consists of one or more sequence records. Each record starts with a definition line prefixed by a greater-than symbol (">"), followed by a sequence identifier (typically ≤25 characters, using alphanumeric characters, hyphens, underscores, or similar) and optional descriptive text; the sequence data itself follows on subsequent lines, conventionally wrapped at 60–80 characters for readability. For nucleotide sequences, valid symbols include A, C, G, T (or U for RNA), and N for ambiguities, while protein sequences use the 20 standard amino acid codes (e.g., A for alanine, C for cysteine). The basic format does not inherently represent gaps or alignments, and sequence lines must not contain spaces or numbers. This structure ensures portability across platforms; files often use extensions like .fasta, .fa, or .seq and can be recognized by the ">" character that begins the first non-empty line.

Widely adopted since its inception, the FASTA format is used by major sequence databases for sequence submission and retrieval and by analysis tools up to modern genomic pipelines, and it is maintained through community conventions without a central governing body. Its evolution includes extensions for qualifiers (e.g., organism names in brackets) and variants like FASTQ for quality scores, but the core remains unchanged.

Introduction

Definition and Purpose

The FASTA format is a simple, human-readable, text-based standard for representing biological sequences, including nucleotide sequences from DNA or RNA and amino acid sequences from proteins. Each record has a single-line header, known as the definition line, that begins with a greater-than symbol ('>') followed by an identifier and an optional description; one or more lines of sequence characters, without spaces or numbers, follow it.

The primary purpose of the FASTA format is to support the storage and exchange of biological sequence data in bioinformatics workflows, including similarity searches against databases and data transfer among diverse software tools, without complex binary encodings. This design promotes data sharing across research communities and platforms, making it a foundational format in bioinformatics.

Key advantages of FASTA include its platform-independent portability as plain ASCII text, straightforward parsing by both human readers and programs due to its minimal structure, and the ability to hold multiple sequences in one file. These features have established FASTA as a widely used standard in bioinformatics. For illustration, a basic single-sequence snippet appears as follows:
>seq1
ATGCATGCAGCTAG
This example shows the header line identifying the sequence as seq1, followed by the raw sequence data. The format originated with the FASTP and FASTA sequence comparison programs, which used it for input and output in rapid similarity searches of biological databases.

History and Development

The FASTA format, originally known as the "Pearson format," was developed in 1985 by David J. Lipman, a researcher at the National Institutes of Health (NIH), and William R. Pearson of the University of Virginia, as part of their FASTP program for efficient protein similarity searches. The format was subsequently incorporated into the FASTA software package, released in 1988, which extended the approach to both protein and DNA sequences using a word-based indexing method to identify local alignments quickly and sensitively without requiring full dynamic programming. This addressed the growing need to search expanding sequence databases in early bioinformatics, allowing researchers to detect distant homologies that slower methods might miss.

A key milestone occurred after the National Center for Biotechnology Information (NCBI) assumed responsibility for GenBank in 1992. By the 2000s, FASTA had become a de facto standard in bioinformatics, widely used across software due to its simplicity and portability. It also influenced subsequent formats, such as FASTQ, which extended FASTA by incorporating quality scores for next-generation sequencing data.

Format Specification

Definition Line Structure

The definition line, also known as the header or descriptor line, starts each sequence record in a FASTA file. It begins with a greater-than symbol (">"), immediately followed, with no intervening space, by a unique sequence identifier. The identifier is typically an alphanumeric string, such as a locus name, database accession number, or custom label, and is often limited to 25 characters in NCBI submissions. Valid characters for the identifier include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).

After the identifier, an optional description field, separated from it by a space, may contain free-text information such as the species name, the gene or protein name, or other annotations. In NCBI formats the description can carry bracketed modifiers, such as [organism=scientific name] or [moltype=mRNA], which allow automated parsing for database integration; multiple modifiers can be chained, and no spaces are placed around the equals sign. The entire definition line must be a single line without hard returns. It may contain alphanumeric characters, spaces, and common punctuation (e.g., parentheses, semicolons), though excessive special characters should be avoided to ensure compatibility. In multi-sequence files, each definition line must distinguish its sequence from the others.

NCBI employs specific conventions for identifiers in its FASTA outputs. Historically it used the form gi|GI_number|database|accession.version| followed by the description, with vertical bars (|) delimiting the components: "gi" introduces the GenInfo Identifier (a unique numeric ID assigned sequentially), followed by a tag for the source database (e.g., gb for GenBank, ref for RefSeq) and the accession number with its version. Although GI numbers were phased out in 2016 for new records, this pipe-delimited structure remains prevalent in legacy files and some tools. Modern NCBI headers often simplify to >accession.version description, as in >U49845.1 TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds, or, for a protein, >NP_001234.1 hypothetical protein [Homo sapiens]. These conventions facilitate precise retrieval and annotation in bioinformatics pipelines. The legacy form of the first of these headers is:
>gi|1293614|gb|U49845.1| Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds
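
The header conventions described in this section can be illustrated with a small parser. The following is a minimal sketch, not NCBI's own parsing logic: the names `Defline` and `parse_defline` are hypothetical, and it assumes that the identifier ends at the first space and that a legacy header follows the gi|GI_number|database|accession.version| layout exactly.

```python
from typing import NamedTuple, Optional


class Defline(NamedTuple):
    """Parsed pieces of a FASTA definition line (illustrative structure)."""
    seq_id: str                      # accession.version or a custom identifier
    description: str                 # free text after the first space
    gi: Optional[str] = None         # legacy GI number, if present
    database: Optional[str] = None   # legacy source tag such as "gb" or "ref"


def parse_defline(line: str) -> Defline:
    """Split a definition line into identifier, description, and legacy fields."""
    if not line.startswith(">"):
        raise ValueError("definition line must start with '>'")
    # The identifier is everything up to the first space; the rest is description.
    head, _, description = line[1:].rstrip("\r\n").partition(" ")
    fields = head.split("|")
    # Legacy NCBI form: gi|GI_number|database|accession.version|
    if len(fields) >= 4 and fields[0] == "gi":
        return Defline(fields[3], description.strip(),
                       gi=fields[1], database=fields[2])
    return Defline(head, description.strip())


legacy = parse_defline(">gi|1293614|gb|U49845.1| Saccharomyces cerevisiae TCP1-beta gene")
modern = parse_defline(">NP_001234.1 hypothetical protein [Homo sapiens]")
```

Returning the accession.version as `seq_id` in both cases means that downstream code does not need to care whether a file uses the legacy or the simplified header style.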

Sequence Representation

In the FASTA format, biological sequences are encoded using standardized single-letter codes in the data lines that immediately follow the definition line. For nucleotide sequences, these include A for adenine, C for cytosine, G for guanine, T for thymine (or U for uracil in RNA sequences), and N for any unknown or ambiguous nucleotide, following IUPAC nomenclature. For protein sequences, the 20 standard amino acids are represented by their conventional single-letter codes, such as A for alanine, C for cysteine, D for aspartic acid, E for glutamic acid, and F for phenylalanine; in addition, B indicates aspartic acid or asparagine (Asx), Z denotes glutamic acid or glutamine (Glx), and X represents any or an unknown amino acid.

Sequence lines are conventionally written in uppercase letters for uniformity and compatibility with bioinformatics tools, though lowercase letters are accepted and occasionally used to mark annotations such as low-complexity regions or, in specialized cases, modified residues. Only symbols from the respective alphabet are permitted in these lines; spaces, numbers, and other extraneous characters are prohibited to maintain data integrity. Ambiguity codes beyond the basic symbols follow IUPAC guidelines for both nucleotides (e.g., R for A or G, Y for C or T) and proteins (limited to B, Z, and X as noted), allowing uncertain positions to be represented without disrupting sequence parsing. Gaps, indicated by the hyphen '-', are the one exception to the rule against punctuation: they are supported in sequence alignments to denote insertions or deletions but are omitted from files containing unaligned sequences. The following example illustrates a short nucleotide sequence:
>example_dna | Sample DNA sequence
ATCGATCGNNATCG
Here, the sequence uses standard codes, with N marking two ambiguous bases. In contrast, a short protein sequence example uses amino acid codes:
>example_protein | Sample protein sequence
MVKPLFTGILABXZ
This includes standard codes (e.g., M for methionine, A for alanine) along with the ambiguity codes B (Asx), X (any amino acid), and Z (Glx).
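
The alphabet rules above lend themselves to a simple validity check. The sketch below is illustrative only: the constant and function names are hypothetical, and the nucleotide set assumes the full IUPAC ambiguity codes (R, Y, S, W, K, M, B, D, H, V) in addition to the symbols named in this section. Lowercase input is accepted, and gaps are allowed only on request.

```python
# Bases, N, and the IUPAC nucleotide ambiguity codes.
NUCLEOTIDE = set("ACGTUN" "RYSWKMBDHV")
# The 20 standard amino acids plus the ambiguity codes B, Z, and X.
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY" "BZX")


def invalid_symbols(sequence, alphabet, allow_gaps=False):
    """Return (position, character) pairs for symbols outside the alphabet.

    Positions are 0-based. Case is ignored, so lowercase (e.g. soft-masked)
    residues are treated like their uppercase counterparts.
    """
    allowed = set(alphabet) | ({"-"} if allow_gaps else set())
    return [(i, ch) for i, ch in enumerate(sequence) if ch.upper() not in allowed]


# The two example records from this section pass the check.
dna_problems = invalid_symbols("ATCGATCGNNATCG", NUCLEOTIDE)
protein_problems = invalid_symbols("MVKPLFTGILABXZ", PROTEIN)
```

Reporting positions rather than raising on the first bad character makes it easier to show a user every problem in a submitted sequence at once.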

File Organization

Single and Multi-Sequence Files

The FASTA format accommodates both single-sequence and multi-sequence files, allowing flexibility in organizing biological sequence data. In a single-sequence file, the content begins with a single definition line starting with a greater-than symbol (">"), followed immediately by the sequence data, which continues until the end of the file (EOF). This structure is straightforward and commonly used for individual nucleotide or protein sequences, such as a single gene or contig.

Multi-sequence files, often referred to as multi-FASTA files, extend this by including multiple records within one file. Each record starts with its own ">" definition line, followed by the corresponding sequence data; there is no explicit end marker for individual sequences. Each record ends where the next ">" line begins, so several single-sequence files can be concatenated into one multi-sequence file without additional delimiters. This organization is widely used for datasets like gene families, alignments, or genome assemblies comprising multiple contigs.

Parsing a file relies on recognizing ">" lines as the start of new records. For both single and multi-sequence files, software reads line by line: sequence data is collected from non-">" lines until the next ">" line or EOF is encountered, marking the boundary of the current record. This simple, line-based logic is handled consistently across tools, though identifiers in definition lines must stay unique to avoid conflicts in multi-sequence files.

FASTA files are efficient for small to medium-sized datasets, typically handling sequences up to several megabases without issue due to their plain-text nature. For very large genomes, such as eukaryotic assemblies with many contigs, files can exceed gigabytes in size and strain memory during parsing or submission; in such cases, splitting into multiple files (e.g., by contig groups) is recommended to improve manageability.

The following snippet illustrates a basic multi-sequence FASTA file with two short nucleotide sequences:
>seq1 Human gene example
ATGCCGTAGCTAGCTAGCTAGC
>seq2 Mouse ortholog
ATGCCGTAATTAGCTAGCTAGC
Here, each sequence is separated by its definition line, demonstrating the format's delimiter-free record structure.
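
The line-by-line parsing logic described in this section can be sketched as a generator. This is a minimal illustration under stated assumptions, not a replacement for an established parser: the function name `read_fasta` is hypothetical, blank lines are skipped, and sequence data appearing before the first ">" line is treated as an error.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of text lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    # End of input closes the last record.
    if header is not None:
        yield header, "".join(chunks)


example = """>seq1 Human gene example
ATGCCGTAGCTAGCTAGCTAGC
>seq2 Mouse ortholog
ATGCCGTAATTAGCTAGCTAGC
"""
records = list(read_fasta(example.splitlines()))
```

Because the function accepts any iterable of lines, the same code works on an open file handle, so only one record needs to be held in memory at a time.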

Formatting Conventions

The FASTA format employs several non-mandatory conventions to enhance readability and ensure compatibility across bioinformatics tools. A primary recommendation is to wrap sequence lines at 60 to 80 characters: 60 characters is a common default from the original FASTA implementation, and 80 characters is the upper limit suggested by major databases to prevent line lengths that could complicate parsing or viewing. This range balances human readability with efficient processing, as longer lines may exceed terminal widths or strain older software.

Regarding whitespace, sequences must consist solely of valid symbols (such as IUPAC nucleotide or amino acid codes) without embedded spaces; whitespace within sequence data is invalid and is typically either stripped by parsers or reported as an error. Blank lines are optional. They may appear at the start of the file, between records, or at the end, but not inside a record, where they can disrupt standard parsers.

FASTA files may use any of the common end-of-line conventions, including Unix-style LF (\n), Windows-style CRLF (\r\n), and legacy Mac-style CR (\r), and most modern tools normalize these on input for cross-platform compatibility. The format has no built-in mechanism for indicating a version of the file itself, but individual sequence headers may carry version numbers (e.g., the accession.version identifiers assigned by sequence databases) to track updates to specific sequences.

Best practices further emphasize cleanliness: avoid trailing spaces on any line, as they can cause inconsistencies in file sizes or parsing artifacts, and do not include line numbers or annotations in sequence lines, since these are not part of the core specification and may be misinterpreted by software expecting pure sequence data. Although these conventions are flexible, following them when creating and exchanging files promotes interoperability.
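
The wrapping and line-ending conventions above can be applied when writing a record. The following sketch uses hypothetical helper names (`format_record`, `normalize_newlines`) and assumes a default width of 60 characters and Unix-style line endings on output.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def format_record(header: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record with sequence lines wrapped at `width`."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = [">" + header]
    # Slice the sequence into fixed-width pieces; the last may be shorter.
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    # No trailing spaces are introduced, and the record ends with a newline.
    return "\n".join(lines) + "\n"


wrapped = format_record("seq1", "A" * 130)
```

Wrapping every line except the last at one uniform width also keeps a file compatible with indexers that rely on a constant line length within each record.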

Variations and Extensions

File Compression and Handling

FASTA files lack a strict standard for filename extensions; common conventions include .fasta, .fa, .fna for nucleotide sequences, .faa for protein sequences, and occasionally .fst or .seq. Compression is widely employed to manage file size, particularly for large datasets, using tools like gzip to produce .gz files or using .zip archives. These methods preserve the original text-based structure, so decompressed files can be parsed by bioinformatics software without any change to the format. Some utilities, such as SeqKit, read and write gzip-compressed files natively, which allows compressed inputs like input.fasta.gz to be processed directly.

Encryption of FASTA files is uncommon because the format is used mainly for open scientific exchange, but general-purpose tools like GPG can be applied for secure storage or transfer of sensitive genomic data. Encryption is not part of the FASTA specification, and files must be fully decrypted before parsing, since encrypted files cannot be read by standard sequence readers. Specialized tools such as Cryfa provide encryption designed for genomic data in formats like FASTA and FASTQ.

For large FASTA files, which can exceed gigabytes in size for comprehensive genomes or metagenomes, indexing is recommended to enable random access without loading the entire file into memory. The samtools faidx command creates an index file (e.g., reference.fasta.fai) that maps sequence identifiers to byte offsets, allowing rapid extraction of subsequences in FASTA format. This approach supports both uncompressed and BGZF-compressed inputs.

The ASCII-based nature of FASTA ensures high portability across platforms, as it relies solely on standard text characters without proprietary encoding or binary elements. A plain-text file has no byte-order or character-set issues, and differences in line endings are handled by most tools, so FASTA files are directly readable on Windows, macOS, Linux, and other systems using basic text editors or command-line tools.
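
To make the compression and indexing ideas concrete, the sketch below opens plain or gzip-compressed FASTA by filename suffix and builds a simple in-memory map from identifier to byte offset for an uncompressed file. It is a simplified analogue of offset-based indexing, not the .fai file format written by samtools faidx; all function names are hypothetical, and the index assumes ASCII content and identifiers that end at the first whitespace.

```python
import gzip
from typing import Dict


def open_fasta(path: str):
    """Open a FASTA file as text, decompressing if the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "r", encoding="ascii")


def build_offset_index(path: str) -> Dict[str, int]:
    """Map each identifier to the byte offset of its '>' line (uncompressed file)."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                parts = line[1:].split()
                index[parts[0].decode("ascii") if parts else ""] = offset
    return index


def fetch(path: str, index: Dict[str, int], name: str) -> str:
    """Seek to one record and return its sequence without reading the rest."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(index[name])
        handle.readline()  # skip the definition line
        for line in handle:
            if line.startswith(b">"):
                break  # next record reached
            chunks.append(line.strip())
    return b"".join(chunks).decode("ascii")
```

A typical use would be to call `build_offset_index` once per file and then `fetch` individual records on demand. Byte offsets are only meaningful for the uncompressed file, which is one reason indexed access to compressed data relies on a block-based scheme such as BGZF rather than plain gzip.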

Specialized Extensions

Specialized extensions adapt the simple structure of the FASTA format to niche requirements in bioinformatics, such as alignment representation, quality information, domain-specific notations, and enhanced visualization, while building on the core header and sequence lines.

In multiple sequence alignments, FASTA files commonly incorporate gaps denoted by hyphens ('-') to represent insertions or deletions, enabling the storage and exchange of aligned sequences without dedicated formats. This practice is supported by alignment tools like MAFFT, which output results in this extended form, often including coordinate annotations in headers for reference positioning.

Before the adoption of FASTQ, quality scores were typically stored in separate files with a .qual extension alongside FASTA sequence files to capture sequencing error estimates; FASTQ largely replaced this arrangement by combining sequences and Phred quality scores in a single file.

For RNA analysis, secondary structure information is frequently encoded using dot-bracket notation, in which paired bases are marked by matching parentheses and unpaired ones by dots, either as a string parallel to the sequence in modified FASTA files or in dedicated structure lines. The FASTR format exemplifies this by pairing nucleotide sequences with their corresponding dot-bracket representations in a single file, facilitating integrated prediction and visualization workflows.

In proteomics, post-translational modifications (PTMs) are annotated via the Proteomics Standards Initiative Extended FASTA Format (PEFF), which extends headers with key-value annotations (e.g., ModResPsi or ModResUnimod for modified residues) that draw on controlled vocabularies from resources like UniMod. PEFF maintains compatibility with standard parsers while embedding PTM sites, variants, and processing events, improving accuracy in mass spectrometry-based identifications.

Color-enhanced FASTA representations, used primarily for motif highlighting in viewers, employ HTML-like tags (e.g., <span style="color:red"> around specific residues) or predefined character schemes within sequence lines to denote features like conserved domains or hydrophobicity; tools like bioSyntax apply coloring schemes (e.g., Zappo or Taylor) when rendering a file for better interpretability.

These extensions, while useful for targeted applications, often compromise interoperability, as standard parsers may ignore or misinterpret non-core elements like markup tags or structure notations, leading to parsing errors. Libraries such as Biopython's SeqIO handle standard FASTA robustly but require custom adaptations for formats like PEFF or FASTR.
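
The dot-bracket notation mentioned above can be decoded with a stack. The following sketch is a generic illustration of the notation and does not parse the FASTR format itself; the function name `dot_bracket_pairs` is hypothetical, and only the characters '(', ')' and '.' are accepted (pseudoknot brackets are out of scope).

```python
from typing import List, Tuple


def dot_bracket_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return sorted 0-based (open, close) index pairs of paired positions."""
    stack = []   # positions of '(' still waiting for a partner
    pairs = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# A hairpin: three stacked base pairs enclosing a three-base loop.
hairpin = dot_bracket_pairs("(((...)))")
```

Checking that the structure string has the same length as its sequence, and that it is balanced, is a cheap way to catch records in which the structure line was mistaken for sequence data.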

Usage in Bioinformatics

Common Software Tools

Several command-line tools facilitate the reading, writing, and manipulation of FASTA files in bioinformatics workflows. The Unix utility grep is commonly employed for basic extraction tasks, such as counting the number of sequences in a file by matching lines that begin with >, using the command grep -c "^>" file.fasta, which outputs the count without loading the entire file into memory. For more advanced filtering and conversion, seqtk provides efficient operations like sub-sampling, trimming, and format conversion for FASTA and FASTQ files, processing large datasets rapidly due to its lightweight design. Similarly, bioawk extends the standard awk utility with built-in support for biological formats, enabling pattern-based extraction and manipulation of sequences, such as printing sequence lengths or filtering by identifier.

Programming libraries offer programmatic access to FASTA files. In Python, Biopython's SeqIO module serves as a primary tool for parsing, writing, and converting FASTA data, using iterator-based functions like SeqIO.parse() to handle records sequentially and support operations such as slicing or formatting without requiring full file loading. The EMBOSS suite includes seqret, a versatile command-line program for reformatting FASTA files to other sequence formats or extracting subsequences, which makes it useful for data preparation in pipelines.

Alignment software routinely incorporates native FASTA support for input and output. BLAST, developed by NCBI, accepts multi-sequence FASTA queries for similarity searches against databases, enabling batch processing of nucleotide or protein sequences. Clustal Omega, a multiple sequence alignment tool available from EMBL-EBI, reads unaligned FASTA inputs and generates alignments using seeded guide trees and HMM profile-profile techniques, handling thousands of sequences efficiently. MAFFT, another alignment program, reads FASTA files to produce multiple alignments of DNA, RNA, or proteins, with options for handling diverse sequence lengths via FFT-based algorithms.

Graphical editors provide visual interfaces for multi-FASTA manipulation. Jalview allows users to import FASTA alignments for interactive editing, including gap insertion, residue modification, and consensus generation, with export back to FASTA format. UGENE offers an alignment editor that loads multi-FASTA files and supports trimming and the addition of sequences, making it suitable for exploratory analysis of aligned datasets.

For large FASTA files, streaming parsers are critical to mitigate memory constraints, as they process records iteratively rather than loading entire datasets. Tools like Biopython's SeqIO and seqtk implement such streaming, enabling analysis of gigabyte-scale files on standard hardware by yielding one record at a time and thus avoiding the out-of-memory errors common in parsers that load whole files.
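
The streaming approach described above can be shown with two small functions that never hold more than one line in memory. This is a standard-library sketch with hypothetical names, not the implementation used by any of the tools mentioned: `count_records` mirrors what grep -c "^>" reports, and `sequence_lengths` yields one (identifier, length) pair per record.

```python
from typing import Iterable, Iterator, Tuple


def count_records(lines: Iterable[str]) -> int:
    """Count definition lines, like `grep -c "^>"`."""
    return sum(1 for line in lines if line.startswith(">"))


def sequence_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (identifier, length) per record without storing sequences."""
    name = None
    length = 0
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            fields = line[1:].split()
            name = fields[0] if fields else ""  # identifier ends at whitespace
            length = 0
        elif name is not None:
            length += len(line)  # only the running total is kept
    if name is not None:
        yield name, length


sample = [
    ">seq1 Human gene example\n",
    "ATGCCGTAGCTAGCTAGCTAGC\n",
    ">seq2 Mouse ortholog\n",
    "ATGCCGTAATTAGCTAGCTAGC\n",
]
lengths = list(sequence_lengths(sample))
```

Passing an open file handle instead of a list gives the same result for files of any size, since Python file objects iterate line by line.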

Integration with Databases and Pipelines

The FASTA format is widely utilized for exporting sequence data from major bioinformatics databases, facilitating easy access and integration into downstream analyses. The National Center for Biotechnology Information (NCBI) provides nucleotide and protein sequences from GenBank and RefSeq in FASTA format, with weekly RefSeq updates available for transcript and protein records. Similarly, UniProt offers canonical protein sequences and manually curated isoforms in FASTA format, enabling researchers to retrieve comprehensive sets of protein data for functional annotation and comparative studies. These exports let FASTA files serve as a lightweight, standardized intermediary for database dissemination without the overhead of full annotation files.

In bioinformatics pipelines, particularly those involving next-generation sequencing (NGS), FASTA files function as inputs for reference genomes and as outputs for assembled contigs. Aligners such as BWA and Bowtie2 require FASTA-formatted reference sequences to build indexes and map reads, which supports variant calling and genome assembly workflows. Genome assemblers like SPAdes write contigs and scaffolds directly in FASTA format (e.g., contigs.fasta), which can then be fed into downstream steps. This dual role as input and output enhances interoperability, since the simplicity of FASTA lets tools exchange data without custom converters.

Conversion between FASTA and other formats, such as GFF, is supported by bioinformatics libraries, which enable programmatic transformation for pipeline automation and data submission preparation. In platforms like Galaxy, FASTA is handled as a core datatype, supporting workflows for sequence manipulation, alignment, and statistical analysis through integrated tools. Regarding standards, FASTA aligns with International Nucleotide Sequence Database Collaboration (INSDC) guidelines for genome assembly submissions, where it provides the sequence data for deposition into member databases like GenBank, EMBL, and DDBJ.

A key challenge in integrating FASTA with dynamic databases lies in managing versioned sequences, as reference files evolve with database updates, complicating reproducibility in long-term pipelines. Tools and workflows must track versions in order to archive incremental changes in large databases efficiently and to prevent mismatches between analyses and outdated references. This issue is exacerbated in collaborative environments, where unversioned exports from sources like NCBI or UniProt can lead to inconsistencies in downstream results.
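
One lightweight way to guard against the version mismatches described above is to record, for every reference sequence, both its versioned identifier and a digest of its content. The sketch below is an assumption about how such bookkeeping might look, not an established standard: the function names are hypothetical, identifiers are assumed to follow the accession.version form, and SHA-256 over the uppercased sequence is an arbitrary choice of fingerprint.

```python
import hashlib
from typing import Optional, Tuple


def split_accession(seq_id: str) -> Tuple[str, Optional[int]]:
    """Split 'accession.version' into its parts; version is None if absent."""
    accession, dot, version = seq_id.rpartition(".")
    if dot and accession and version.isdigit():
        return accession, int(version)
    return seq_id, None


def sequence_digest(sequence: str) -> str:
    """Fingerprint a sequence so that silent content changes can be detected."""
    # Uppercasing makes the digest insensitive to soft-masking (lowercase).
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


accession, version = split_accession("U49845.1")
fingerprint = sequence_digest("ATGCCGTAGCTAGCTAGCTAGC")
```

Storing the accession, version, and digest alongside analysis results lets a later run confirm that it is using the same reference sequences, even if the FASTA file was downloaded again in the meantime.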
