In bioinformatics, FASTA format
is a text-based file format used
to exchange sequence data between
genetic sequence databases. A file
in this format looks like this:
>SEQUENCE_1
;comment line 1 (optional)
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
;comment line 1 (optional)
;comment line 2 (optional)
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
Each sequence record begins with a
header line (starting with a '>')
which gives a name and/or a unique
identifier for the sequence, and
often additional information as
well. Many sequence databases use
standardized headers, which helps
when automatically extracting
information from the header. Often
the first 'word' of the header is
a unique identifier for the
sequence.
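As a minimal sketch of that convention, the Python function below splits a header line into its first whitespace-delimited word (treated as the identifier) and the rest (treated as a free-text description). The name `split_header` is hypothetical and is not part of any standard tool or library; the sketch assumes the identifier contains no spaces.

```python
def split_header(line):
    """Split a FASTA header line into (identifier, description)."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line: %r" % line)
    # Drop the leading '>' and split once on the first run of whitespace.
    parts = line[1:].strip().split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description
```

For a header with no description, such as `>SEQUENCE_1`, the description comes back as an empty string.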
After the header line, one or
more comments, distinguished by a
semi-colon at the beginning of the
line, may occur. Most databases
and bioinformatics applications do
not recognize these comments so
their use is discouraged, but they
are part of the official format.
After the header line and
comments, one or more sequence
lines may follow. Sequences may be
protein sequences or DNA
sequences. Each sequence line must
be shorter than 80 characters, and
sequences can contain gaps or
alignment characters (see
sequence alignment).
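The record layout described above (a header line, optional comment lines, then one or more sequence lines) can be read with a short loop. The following Python sketch is illustrative only: the name `parse_fasta` is hypothetical, blank lines are skipped as an assumption (the description above does not mention them), and any line starting with ';' inside a record is treated as a comment regardless of where it appears.

```python
def parse_fasta(lines):
    """Yield (header, comments, sequence) for each record in an iterable of lines."""
    header, comments, chunks = None, [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines carry no data and are skipped
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, comments, "".join(chunks)
            header, comments, chunks = line[1:].strip(), [], []
        elif line.startswith(";"):
            # Comment lines are kept only when they belong to a record.
            if header is not None:
                comments.append(line[1:].strip())
        else:
            if header is None:
                raise ValueError("sequence data found before the first header")
            # Sequence lines are concatenated into one string per record.
            chunks.append(line.strip())
    if header is not None:
        yield header, comments, "".join(chunks)
```

Because the function accepts any iterable of lines, it can be used on an open file, for example `for header, comments, seq in parse_fasta(open(path)): ...`, where `path` is a placeholder for a real file name.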
FASTA format files often have
file extensions like .fa,
.mpfa, .fna, or .fsa, among
others.
The simple format of FASTA
files makes them easy to
manipulate using text processing
tools and
scripting languages like
Perl.
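As one small illustration of this kind of text processing, the sketch below counts the records in a FASTA file by counting header lines. It is written in Python rather than Perl, and the name `count_records` is hypothetical.

```python
def count_records(lines):
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))
```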
The NCBI has gone so far as to
define a standard for its FASTA
header (although in practice this
is a bit messy). The
formatdb
man page has this to say on
the subject of FASTA format
databases: "formatdb will
automatically parse the SeqID and
create indexes, but the database
identifiers in the FASTA
definition line must follow the
conventions of the FASTA Defline
Format."
However, it does not give a
definitive description of the
FASTA defline format; an attempt
to describe that format is given
below.
Database name                      Identifier syntax
GenBank                            gi|gi-number|gb|accession|locus
EMBL Data Library                  gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan        gi|gi-number|dbj|accession|locus
NBRF PIR                           pir||entry
Protein Research Foundation        prf||name
SWISS-PROT                         sp|accession|name
Brookhaven Protein Data Bank (1)   pdb|entry|chain
Brookhaven Protein Data Bank (2)   entry:chain|PDBID|CHAIN|SEQUENCE
Patents                            pat|country|number
GenInfo Backbone Id                bbs|number
General database identifier        gnl|database|identifier
NCBI Reference Sequence            ref|accession|locus
Local Sequence identifier          lcl|identifier
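The identifier syntaxes listed above share a pattern: a short tag followed by a fixed number of '|'-separated fields, with some identifiers chaining two tags (for example gi followed by gb). The Python sketch below splits an identifier along those lines. It is an illustration built only from the list above, not NCBI's own parser: the names `FIELDS` and `parse_defline_id` are hypothetical, the second Protein Data Bank form (entry:chain|...) is not handled, and the sample identifier in the usage note is made up.

```python
# Field names for each tag, taken from the list above.
# None marks a field that the list shows as empty (e.g. pir||entry).
FIELDS = {
    "gi": ("gi-number",),
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_defline_id(seqid):
    """Split a '|'-separated identifier into a list of (tag, fields) pairs."""
    tokens = seqid.split("|")
    result = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag not in FIELDS:
            raise ValueError("unknown identifier tag: %r" % tag)
        names = FIELDS[tag]
        values = tokens[i + 1 : i + 1 + len(names)]
        # Pad with empty strings if trailing fields are missing.
        values += [""] * (len(names) - len(values))
        fields = {n: v for n, v in zip(names, values) if n is not None}
        result.append((tag, fields))
        i += 1 + len(names)
    return result
```

For example, the made-up identifier `gi|12345|gb|ABC123|LOCUS1` is split into a gi part carrying the gi-number and a gb part carrying the accession and locus.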
See also