In the field of
bioinformatics, a sequence
database is a large collection
of
DNA,
protein, or other sequences
stored on a computer. A database
can include sequences from only
one organism, as in databases
including all the proteins in
Saccharomyces cerevisiae, or
it can include sequences from all
organisms whose DNA has been
sequenced.
Sequence databases can be
searched using a variety of
methods. The most common is
probably a similarity search, which
looks for sequences similar to a
target protein or gene whose
sequence is already known to the
user. The BLAST program is a
widely used tool of this type.
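
The idea of a similarity search can be illustrated with a minimal sketch. It is not BLAST's algorithm: it simply ranks records by the fraction of short substrings (k-mers) they share with the query. The names `kmers`, `jaccard`, `search` and `toy_db`, and the sequences themselves, are hypothetical and chosen only for illustration.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(database, query, k=3):
    """Rank database records by k-mer similarity to the query, best first."""
    query_kmers = kmers(query, k)
    scored = [(jaccard(query_kmers, kmers(seq, k)), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical toy database: record name -> sequence (made-up data).
toy_db = {
    "rec1": "ATGGCGTACGTTAGC",
    "rec2": "ATGGCGTACGTTAGG",
    "rec3": "TTTTTTTTTTTTTTT",
}
hits = search(toy_db, "ATGGCGTACGTTAGC")
```

Here `hits` lists `rec1` first, since it is identical to the query, followed by the near-identical `rec2` and the unrelated `rec3`. Real search tools use far more sensitive scoring and indexing than this set comparison.
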
A major problem with all the
large genetic sequence databases
is that records are deposited in
them from a wide range of sources,
from individual researchers to
large genome sequencing centers.
As a result, the sequences
themselves, and especially the
biological annotations attached to
these sequences, vary tremendously
in quality. There is also
considerable redundancy, because
multiple labs often submit
sequences that are identical, or
nearly identical, to others
already in the databases.
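
As a rough illustration of the redundancy problem, the sketch below groups submitted records whose sequences are exactly identical, ignoring letter case. It deliberately does not handle the nearly identical case, which would need a similarity threshold. The function names and the record identifiers are hypothetical.

```python
from collections import defaultdict


def group_identical(records):
    """Group record IDs by exact (case-insensitive) sequence."""
    groups = defaultdict(list)
    for record_id, seq in records:
        groups[seq.upper()].append(record_id)
    return dict(groups)


def redundant_groups(records):
    """Return only the groups that contain more than one record."""
    return {seq: ids for seq, ids in group_identical(records).items()
            if len(ids) > 1}


# Hypothetical submissions from three labs (made-up data).
submissions = [
    ("labA_001", "ATGGCGTAC"),
    ("labB_417", "atggcgtac"),
    ("labC_009", "ATGGCGTAA"),
]
duplicates = redundant_groups(submissions)
```

With this input, `duplicates` contains a single group holding the first two records. The third differs by one base, so exact matching does not flag it.
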
Many annotations are based not
on laboratory experiments, but on
the results of sequence similarity
searches against previously annotated
sequences. Of course, once a
sequence has been annotated based
on similarity to others, and
itself deposited in the database,
it can also become the basis for
future annotations. This leads to
the transitive annotation
problem: several such annotation
transfers by sequence similarity
may separate a particular
database record from actual
wet-lab experimental
information. Therefore, one must
always regard the biological
annotations in major sequence
databases with a considerable
degree of skepticism, unless they
can be verified by reference to
published papers describing
high-quality experimental data, or
at least by reference to a human-curated
sequence database.
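
The transitive annotation problem can be made concrete with a small sketch that counts how many similarity-based transfers separate a record from experimental evidence. It assumes a hypothetical provenance table in which each record names either the record its annotation was copied from or the string "experiment". Real databases do not necessarily record provenance in this form.

```python
def transfer_depth(records, record_id):
    """Count similarity-based transfers between a record and an experiment.

    Returns None if the chain never reaches experimental evidence
    (the source record is missing, or the chain loops back on itself).
    """
    depth = 0
    seen = set()
    current = record_id
    while current in records and current not in seen:
        seen.add(current)
        source = records[current]
        if source == "experiment":
            return depth
        depth += 1
        current = source
    return None


# Hypothetical provenance table: record -> source of its annotation.
provenance = {
    "P1": "experiment",  # annotated from wet-lab data
    "P2": "P1",          # annotation copied from P1 by similarity
    "P3": "P2",          # annotation copied from P2 by similarity
    "P4": "P9",          # source record is not in the table
}
depth_p3 = transfer_depth(provenance, "P3")
depth_p4 = transfer_depth(provenance, "P4")
```

In this example `depth_p3` is 2, meaning two transfers lie between `P3` and any experiment, while `depth_p4` is `None` because its chain cannot be traced back to experimental data. The larger the depth, the more skepticism the annotation deserves.
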
See also