From Wikipedia,
the free encyclopedia.
The Needleman-Wunsch
algorithm performs a global
alignment on two sequences (called
A and B here). It is commonly used
in
bioinformatics to align
protein or
nucleotide sequences. The
algorithm was proposed in 1970 by
Saul Needleman and
Christian Wunsch in their
paper A general method
applicable to the search for
similarities in the amino acid
sequence of two proteins, J
Mol Biol. 48(3):443-53.
The Needleman-Wunsch
algorithm is an example a of
dynamic programming, and is
guaranteed to find the alignment
with the maximum score. Needleman-Wunsch
is the first instance of dynamic
programming being applied to
biological sequence comparison.
Scores for aligned characters
are specified by a
similarity matrix. Here,
S(i,j)
is the similarity of characters i
and j. It uses a linear
gap penalty, here called d.
For example, if the similarity
matrix was
| - |
A |
G |
C |
T |
| A |
10 |
-1 |
-3 |
-4 |
| G |
-1 |
7 |
-5 |
-3 |
| C |
-3 |
-5 |
9 |
0 |
| T |
-4 |
-3 |
0 |
8 |
then the alignment:
AGACTAGTTAC
CGA---GACGT
with a gap penalty of -5, would
have the following score...
To find the alignment with the
highest score, a two-dimensional
array (or
matrix) is allocated. This
matrix is often called the F
matrix, and its (i,j)th entry is
often denoted
Fij
There is one column for each
character in sequence A, and one
row for each character in sequence
B. Thus, if we are aligning
sequences of sizes n and m, the
amount of memory used by the
algorithm is in O(nm). (However,
there is a modified version of the
algorithm which uses only O(m + n)
space, at the cost of a higher
running time. This modification is
in fact a general technique which
applies to many dynamic
programming algorithms; this
method was introduced in
Hirschberg's algorithm for
solving the
longest common subsequence problem.)
As the algorithm progresses,
the Fij
will be assigned to be the optimal
score for the alignment of the
first i characters in A and the
first j characters in B. The
principle of optimality is then
applied as follows.
Basis:
F11 = 0
F1j = 0
Fi1 = 0
Recursion, based on the principle of optimality:
Fij = max(Fi − 1,j − 1 + S(Ai − 1,Bj − 1),Fi,j − 1 + d,Fi − 1,j + d)
The pseudo-code for the
algorithm to compute the F matrix
therefore looks like this (array
indexes start at 0):
for i=0 to length(A)-1
F(i,0) <- 0
for j=0 to length(B)-1
F(0,j) <- 0
for i=1 to length(A)-1
for j = 1 to length(B)-1
{
Choice1 <- F(i-1,j-1) + S(A(i-1), B(j-1))
Choice2 <- F(i-1, j) - d
Choice3 <- F(i, j-1) - d
F(i,j) <- max(Choice1, Choice2, Choice3)
}
Once the F matrix is computed,
note that the bottom right hand
corner of the matrix is the
maximum score for any alignments.
To compute which alignment
actually gives this score, you can
start from the bottom left cell,
and compare the value with the
three possible sources(Choice1,
Choice2, and Choice3 above) to see
which it came from. If it was
Choice1, then A(i) and B(i) are
aligned, if it was Choice2 then
A(i) is aligned with a gap, and if
it was Choice3, then B(i) is
aligned with a gap.
AlignmentA <- ""
AlignmentB <- ""
i <- length(A) - 1
j <- length(B) - 1
while (i > 0 AND j > 0)
{
Score <- F(i,j)
ScoreDiag <- F(i - 1, j - 1)
ScoreUp <- F(i, j - 1)
ScoreLeft <- F(i - 1, j)
if (Score - S(A(i), B(j)) == ScoreDiag)
{
AlignmentA <- A(i) + AlignmentA
AlignmentB <- B(j) + AlignmentB
i <- i - 1
j <- j - 1
}
else if (Score == ScoreLeft - d)
{
AlignmentA <- A(i) + AlignmentA
AlignmentB <- "-" + AlignmentB
i <- i - 1
}
otherwise (Score == ScoreUp - d)
{
AlignmentA <- "-" + AlignmentA
AlignmentB <- B(j) + AlignmentB
j <- j - 1
}
}
while (i >= 0)
{
AlignmentA <- A(i) + AlignmentA
AlignmentB <- "-" + AlignmentB
i <- i - 1
}
while (j >= 0)
{
AlignmentA <- "-" + AlignmentA
AlignmentB <- B(j) + AlignmentB
j <- j - 1
}