From Wikipedia,
the free encyclopedia.
FASTA is a
sequence alignment package
first described (as FASTP) by
David J. Lipman and
William R. Pearson in
1985 in the article
Rapid and sensitive protein
similarity searches. The
original FASTP program was
designed for protein sequence
similarity searching. FASTA,
described in
1988 (Improved Tools for
Biological Sequence Comparison),
added the ability to do DNA:DNA
searches and translated protein:DNA
searches, and provided a more
sophisticated shuffling program for
evaluating statistical
significance. There are several
programs in this package that
allow the alignment of
protein sequences and
DNA sequences. FASTA is
pronounced "FAST-Aye" and stands
for "FAST-All", because it works
with any alphabet; the name extends
"FAST-P" (protein) and "FAST-N"
(nucleotide) alignment.
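The idea of evaluating significance by shuffling, mentioned above, can be illustrated with a minimal Python sketch. It is not the package's own shuffling program: the function names, the toy k-mer score, the number of trials and the fixed seed are all assumptions chosen for illustration. The target sequence is shuffled repeatedly, each shuffled copy is scored against the query, and the observed score is expressed as a z-score relative to that null distribution.

```python
import random
import statistics


def shared_kmers(a: str, b: str, k: int = 2) -> int:
    """Toy similarity score (an assumption): distinct k-mers found in both sequences."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def shuffle_z_score(query, target, score=shared_kmers, trials=200, seed=0):
    """Compare the observed score with scores against shuffled copies of target."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = score(query, target)
    letters = list(target)
    null_scores = []
    for _ in range(trials):
        rng.shuffle(letters)  # shuffling keeps composition but destroys order
        null_scores.append(score(query, "".join(letters)))
    spread = statistics.pstdev(null_scores)
    if spread == 0:
        return 0.0  # no variation among shuffled scores; z-score is undefined
    return (observed - statistics.mean(null_scores)) / spread


z = shuffle_z_score("MKTAYIAKQRQISFVKSHFSRQ", "MKTAYIAKQRQISFVKSHFSRQ")
```

A large positive z-score suggests that the observed similarity depends on residue order and not only on composition.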
The current FASTA package
contains programs for protein:protein,
DNA:DNA, protein:translated DNA
(with frameshifts), and ordered or
unordered peptide searches. In
addition to rapid heuristic search
methods, the FASTA package
provides SSEARCH, an
implementation of the optimal
Smith-Waterman algorithm. A
major focus of the package is the
calculation of accurate similarity
statistics, so that biologists can
judge whether an alignment is
likely to have occurred by chance,
or whether it can be used to infer
homology. The FASTA package is
available from
ftp.virginia.edu/pub/fasta.
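To illustrate what an optimal local alignment computes, the following is a minimal Python sketch of Smith-Waterman scoring. It is not the SSEARCH implementation: the function name and the simple match, mismatch and linear gap values are assumptions made for illustration, whereas production tools typically use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, since only the score is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + step,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Under these assumed scores, two identical sequences of length 4 score 8, and sequences with no residue in common score 0.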
A sequence in FASTA format
begins with a single-line
description, followed by lines of
sequence data. The description
line is distinguished from the
sequence data by a greater-than
(">") symbol in the first column.
The word following the ">" symbol
is the identifier of the sequence,
and the rest of the line is the
description (both are optional).
There should be no space between
the ">" and the first letter of
the identifier. It is recommended
that all lines of text be shorter
than 80 characters. The sequence
ends if another line starting with
a ">" appears; this indicates the
start of another sequence. An
example sequence in FASTA format:
>gi|5524211|gb|AAD44166.1| cytochrome b Elephas maximus maximus
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
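The format rules above can be sketched as a small Python parser. This is a minimal illustration, not the FASTA package's own reader: the names parse_fasta and _make_record are hypothetical, and skipping blank lines is an assumption the format description does not state. The sequence data in the usage example is shortened, and the second record is invented for illustration.

```python
from typing import Iterator, List, Tuple


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # The first word after ">" is the identifier; the rest is the description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


def parse_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each record in text."""
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):  # ">" in the first column starts a record
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.strip())  # sequence data may span many lines
    if header is not None:
        yield _make_record(header, chunks)


example = """>gi|5524211|gb|AAD44166.1| cytochrome b Elephas maximus maximus
LCLYTHIGRN
IENY
>seq2
ACGT
"""
records = list(parse_fasta(example))
```

Each record is returned as a tuple, so a header without a description simply yields an empty string in the second position.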
Sequences are expected to be
represented in the standard IUB/IUPAC
amino acid and
nucleic acid codes, with these
exceptions: lower-case letters are
accepted and are mapped into
upper-case; a single hyphen or
dash can be used to represent a
gap character; and in amino acid
sequences, U and * are acceptable
letters (see below). Before
submitting a request, any
numerical digits in the query
sequence should either be removed
or replaced by appropriate letter
codes (e.g., N for unknown nucleic
acid residue or X for unknown
amino acid residue).
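The clean-up rules above can be sketched in Python as follows. The function name normalize_sequence and its digit_replacement parameter are hypothetical. Ignoring whitespace and rejecting other characters are assumptions, not behaviour documented for the FASTA programs.

```python
from typing import Optional


def normalize_sequence(raw: str, digit_replacement: Optional[str] = None) -> str:
    """Upper-case letters, keep '-' and '*', and drop or replace digits."""
    out = []
    for ch in raw:
        if ch.isspace():
            continue  # assumption: whitespace inside sequence data is ignored
        if ch.isdigit():
            # Digits are removed, or replaced by a code such as "N" or "X".
            if digit_replacement is not None:
                out.append(digit_replacement)
            continue
        if (ch.isascii() and ch.isalpha()) or ch in "-*":
            out.append(ch.upper())  # lower-case maps to upper-case
        else:
            raise ValueError(f"unexpected character {ch!r} in sequence")
    return "".join(out)
```

For example, normalize_sequence("acgt 10 ttga-") gives "ACGTTTGA-", and normalize_sequence("ac1gt", "N") gives "ACNGT".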
The nucleic acid codes
supported are:
| Nucleic Acid Code | Meaning |
| --- | --- |
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine |
| U | Uracil |
| R | G A (puRine) |
| Y | T C (pYrimidine) |
| K | G T (Ketone) |
| M | A C (aMino group) |
| S | G C (Strong interaction) |
| W | A T (Weak interaction) |
| B | G T C (not A) (B comes after A) |
| D | G A T (not C) (D comes after C) |
| H | A C T (not G) (H comes after G) |
| V | G C A (not T, not U) (V comes after U) |
| N | A G C T (aNy) |
| - | gap of indeterminate length |
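The ambiguity codes in the table can be represented as sets of bases, which makes it easy to test whether two codes could denote the same nucleotide. The following Python sketch is illustrative only: the names IUPAC_NUCLEOTIDES and codes_compatible are hypothetical, and treating U as equivalent to T is an assumption made so that RNA and DNA codes can be compared. The gap character is deliberately left out of the mapping.

```python
# Each code maps to the bases it may stand for, as listed in the table.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def _bases(code: str) -> set:
    # Assumption: U is treated as T so RNA and DNA codes are comparable.
    return set(IUPAC_NUCLEOTIDES[code.upper()].replace("U", "T"))


def codes_compatible(x: str, y: str) -> bool:
    """Return True if the two codes share at least one possible base."""
    return bool(_bases(x) & _bases(y))
```

With this mapping, R is compatible with A but not with Y, and B is not compatible with A. A code missing from the mapping, such as the gap character, raises a KeyError.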
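The amino acid codes supported
are:
| Amino Acid Code | Meaning |
| --- | --- |
| A | Alanine |
| B | Aspartic acid or Asparagine |
| C | Cysteine |
| D | Aspartic acid |
| E | Glutamic acid |
| F | Phenylalanine |
| G | Glycine |
| H | Histidine |
| I | Isoleucine |
| K | Lysine |
| L | Leucine |
| M | Methionine |
| N | Asparagine |
| P | Proline |
| Q | Glutamine |
| R | Arginine |
| S | Serine |
| T | Threonine |
| U | Selenocysteine |
| V | Valine |
| W | Tryptophan |
| Y | Tyrosine |
| Z | Glutamic acid or Glutamine |
| X | any |
| * | translation stop |
| - | gap of indeterminate length |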
See Also
FASTA format