From Wikipedia,
the free encyclopedia.
Sequence alignment is an
arrangement of two or more
sequences, highlighting their
similarity. The sequences are
padded with gaps (usually denoted
by dashes) so that wherever
possible, columns contain
identical or similar
characters from the sequences
involved:
tcctctgcctctgccatcat---caaccccaaagt
|||| ||| ||||| ||||| ||||||||||||
tcctgtgcatctgcaatcatgggcaaccccaaagt
It is usually used to study the
evolution of the sequences
from a common ancestor, especially
biological sequences such as
protein sequences or
DNA sequences. Mismatches in
the alignment correspond to
mutations, and gaps correspond to
insertions or deletions. Sequence
alignment can also be used to
study things like the evolution of
languages and the similarity
between texts.
The term sequence alignment
may also refer to the process of
constructing such alignment or
finding significant alignments in
a database of potentially
unrelated sequences.
Pairwise alignment
Pairwise sequence alignment
methods are concerned with finding
the best-matching piecewise
(local) or global alignments of
protein (amino
acid) or
DNA (nucleic
acid) sequences.
Typically, the purpose of this
is to find
homologues (relatives) of a
gene or gene-product in a
database of known examples.
This information is useful for
answering a variety of
biological questions. The most
important application of pairwise
alignment is identification of
sequences of unknown structure or
function. Another important use is
the study of
molecular evolution.
Global alignment
A global alignment between two
sequences is an alignment in which
all the characters in both
sequences participate in the
alignment. Global alignments are
useful mostly for finding
closely-related sequences. As
these sequences are also easily
identified by local alignment
methods global alignment is now
somewhat deprecated as a
technique. Further, there are
several complications to molecular
evolution (such as
domain shuffling - see below)
which prevent these methods from
being useful.
Local alignment
Local alignment methods find
related regions within sequences -
in other words they can consist of
a subset of the characters within
each sequence. For example,
positions 20-40 of sequence A
might be aligned with positions
50-70 of sequence B.
This is a more flexible
technique than global alignment
and has the advantage that related
regions which appear in a
different order in the two
proteins (which is known as
domain shuffling) can be
identified as being related. This
is not possible with global
alignment methods.
Significance of alignment
The typical assumption in the
use of alignment is the mechanism
of molecular evolution. DNA
carries over
genetic material from
generation to generation, by
virtue of its semi-conservative
duplication mechanism. Changes in
the material are introduced by
occasional errors and
mutations in the duplication,
and by
viruses and other mechanisms
which sometimes move sub-sequences
within the
chromosome and between
individuals. Consequently, an
alignment between sequences
indicates that the sequences
evolved from a common ancestor
which contained the matching
subsequences. In the case of
genetic sequences, it implies that
their carriers evolved from a
common ancestor.
Using assumptions about the
probabilities of these change
events, we can estimate the time
when sequences diverged from a
common ancestor or the time
required for changing one sequence
into another. However, there is
disagreement about the value and
nature of these probabilities for
biological evolution. One school
of thought assumes a simple,
constant rate of change (an
application of
Occam's Razor) while the other
school assumes short evolutionary
periods when changes were
extremely high.
Many genetic changes are
evolutionarily non-significant
(that is, neutral, neither
positive or negative) in that they
do not affect the selective
advantage (the
fitness) of the carrier. This
is the basis of Kimura's Neutral
theory. This is partly supported
by examples such as the
Globin protein sequence which
shows normal functionality
in-spite of several changes at
various bases but loses
functionality only when changes
occur at critical sequence
locations.
The actual biological meaning
of any alignment can never be
absolutely guaranteed. However,
statistical methods can be
used to assess the likelihood of
finding an alignment between two
regions (or sequences) by chance,
given the size of the database and
its composition.
Two important related issues
for sequence alignment are:
- How should we choose the
best alignment between two
sequences (or regions of
sequences)?
- How do we rank the
alignments between a query and
the many sequences in the
database, according to their
significance in the domain (such
as biological significance)?
In studying evolution, the
first question can be addressed by
developing a model of how likely
are certain changes between
characters in the sequence. There
are many ways to do this, none of
which is generally superior. The
models are derived empirically
using related sequences, and are
expressed as
substitution matrices. These
matrices are used by the
algorithms named below to evaluate
alternative alignments between two
sequences. The actual biological
quality of the alignments then
depends upon the evolutionary
model used to generate the score.
The second question is purely
statistical. Research has
determined a few hard theoretical
rules and many approximations. It
is now generally accepted that the
scores of alignments between
random sequences follow the
extreme value distribution.
Pairwise alignment programs such
as
BLAST use
simulation to estimate the
parameters of this distribution
given a particular query,
database, substitution matrix and
certain other parameters.
Alignments can then be given a
statistical significance value,
allowing inference of possible
relationships between sequences.
Structural alignment
One potential way to validate
sequence alignments is to use
structural alignments where
possible. Because protein
structure is more conserved
through evolution than is sequence
(as shown by
Cyrus Chothia and
Arthur Lesk in
1986), structural alignments
can be more reliable over long
evolutionary distances, when the
sequences have diverged so much
that simple sequence comparison
cannot detect their similarity.
However, although structural
alignments are useful, they are
completely artifactual.
Multiple alignment
Multiple alignment is an
extension of pairwise alignment to
incorporate more than two
sequences into an alignment.
Multiple alignment methods try to
align all of the sequences in a
specified set. Alignments help in
the identification of common
regions between the sequences.
There are several approaches to
creating multiple sequence
alignments, one of the most
popular being the progressive
alignment strategy used by the
Clustal family of programs.
Clustal is used in
cladistics to build
phylogenetic trees, and to build
sequence profiles which are used
by PSI-BLAST and
Hidden Markov model- (HMM-)
methods to search sequence
databases for more distant
relatives.
Multiple sequence alignment is
computationally difficult and is
classified as an
NP-Hard problem.
Algorithms
Pairwise global alignment only.
Pairwise, local or global
alignment. It works by generating
all six possible translations of
each nucleotide sequence, then
uses those translations for
protein search.
Framesearch
This is an extension of
Smith-Waterman, for pairwise
alignment between a protein
sequence and a
nucleotide sequence.
Framesearch dynamically
considers every possible
single-nucleotide insertion or
deletion (indel) to generate the
translation that best matches the
protein sequence. This is unlike
six-frame-translated BLAST or
Smith-Waterman, which generate all
six possible translations of each
nucleotide sequence and use
them for protein search. This
allows dramatically better
alignments when the nucleotide
sequence contains many indels, as
is often the case with single-read
expressed sequence tag
sequences and very early versions
of genomic sequences.
Unfortunately Framesearch is
extremely
CPU-intensive. Its practical
use typically requires a large
Computer cluster or
specialized computer hardware
optimized for running
dynamic programming algorithms
on a large scale. Several
companies sell such systems, which
use either custom
integrated circuits or
Field-programmable gate array
technology.
Software
SSearch
Implements the standard
Smith-Waterman algorithm. It is
considerably slower than the more
modern BLAST and
FASTA methods. However,
Smith-Waterman remains the gold
standard for protein-protein
or nucleotide-nucleotide pairwise
alignment; its speed can be
improved by using specialized
hardware or a computer cluster.
BLAST (Basic Local Alignment
Search Tool)
Main article:
BLAST
This method uses a pre-computed
hash table to serve as an
index for short sequences. Given a
query sequence, the sub-sequences
are looked up in the index to
reduce the amount of time and
searching involved. Several
parameters need to be provided to
make this method faster or more
accurate. Once patterns that match
the search sequence are found,
more accurate and intensive
algorithms may be applied.
BLAST uses a pairwise local
search and uses a number of
methods to increase the speed of
the original Smith-Waterman
algorithm.
FASTA
Main article:
FASTA
Pairwise local search.
Superseded by BLAST.
Clustal
Main article:
Clustal
Progressive multiple alignment
method. Comes in several varieties
(ClustalW, ClustalX etc.)