A similarity matrix is a
matrix of scores, each of which
expresses the similarity between
two data points. Similarity
matrices are closely related to
their counterparts,
distance matrices.
Use in sequence alignment
Similarity matrices are used in
sequence alignment. Higher
scores are given to more similar
characters, and lower or negative
scores to dissimilar characters.
Nucleotide similarity matrices
are used to align
nucleic acid sequences.
Because only four nucleotides are
commonly found in DNA, namely
adenine (A), cytosine (C),
guanine (G) and thymine (T),
nucleotide similarity matrices
are much simpler than
protein similarity matrices.
For example, a simple matrix will
assign identical bases a score of
+1 and non-identical bases a score
of -1. A more complicated matrix
would give a higher score to
transitions (changes from a
pyrimidine such as C or T to
another pyrimidine, or from a
purine such as A or G to
another purine) than to
transversions (from a pyrimidine
to a purine or vice versa). The
match/mismatch ratio of the matrix
sets the target evolutionary
distance (States et al. 1991,
Methods: A Companion to Methods
in Enzymology 3:66-70). The +1/-3
DNA matrix used by BLASTN is best
suited to finding matches between
sequences that are 99% identical,
whereas a +1/-1 (or +4/-4) matrix
is much more sensitive to
divergent sequences, as it is
optimal for matches between
sequences that are about 70%
identical.
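The two kinds of nucleotide matrix described above can be sketched in a few lines of Python. This is only an illustration: the function names (`build_matrix`, `score_ungapped`) and the transition/transversion scores of -1 and -2 are assumptions chosen for the example, not values taken from any particular alignment program.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = "ACGT"


def is_transition(x, y):
    """True if x -> y is a purine<->purine or pyrimidine<->pyrimidine change."""
    return x != y and ({x, y} <= PURINES or {x, y} <= PYRIMIDINES)


def build_matrix(match=1, transition=-1, transversion=-1):
    """Return a dict mapping (base, base) pairs to similarity scores."""
    matrix = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                matrix[(x, y)] = match
            elif is_transition(x, y):
                matrix[(x, y)] = transition
            else:
                matrix[(x, y)] = transversion
    return matrix


def score_ungapped(seq1, seq2, matrix):
    """Sum the per-position scores of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


# Simple +1/-1 matrix: every mismatch is penalised equally.
simple = build_matrix(match=1, transition=-1, transversion=-1)

# Hypothetical matrix that penalises transversions more than transitions.
ts_tv = build_matrix(match=1, transition=-1, transversion=-2)

# +1/-3 scoring, the match/mismatch values the text attributes to BLASTN.
strict = build_matrix(match=1, transition=-3, transversion=-3)
```

Scoring `"ACGT"` against `"GCGA"` (one transition, two matches, one transversion) with each matrix shows how the same pair of sequences is rewarded or penalised differently depending on the match/mismatch ratio.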
Amino acid similarity matrices
are more complicated, because
there are 20 amino acids coded for
by the
genetic code. Therefore, the
similarity matrix for amino acids
contains 400 entries (although it
is usually
symmetric). The first approach
scored all amino acid changes
equally. A later refinement was to
determine amino acid similarities
based on how many base changes
are required to convert a codon
for one amino acid into a codon
for the other. This model is
better, but it does not take into
account the selective pressure
acting on amino acid changes.
Better models took into account
the chemical properties of amino
acids.
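The codon-based refinement mentioned above can be sketched as follows, assuming the standard genetic code. The minimum number of base changes between any codon for one amino acid and any codon for the other is computed for every pair. Converting that count into a similarity score as `3 - changes` is an assumption made purely for illustration; the helper names are hypothetical as well.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid they encode, skipping stop codons ("*").
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        CODONS_FOR.setdefault(aa, []).append(codon)


def hamming(c1, c2):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(c1, c2))


def min_base_changes(aa1, aa2):
    """Fewest base changes needed to turn a codon for aa1 into one for aa2."""
    return min(
        hamming(c1, c2)
        for c1 in CODONS_FOR[aa1]
        for c2 in CODONS_FOR[aa2]
    )


def codon_similarity(aa1, aa2):
    """Illustrative score (an assumption): fewer base changes, higher score."""
    return 3 - min_base_changes(aa1, aa2)


residues = sorted(CODONS_FOR)
matrix = {(a, b): codon_similarity(a, b) for a in residues for b in residues}

# 20 x 20 = 400 entries, but symmetry leaves only 210 distinct pairs.
unique_pairs = {tuple(sorted(pair)) for pair in matrix}
```

Because the score depends only on an unordered pair of residues, the resulting matrix is symmetric, which is why only the upper or lower triangle of such a matrix is usually tabulated.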
One approach has been to
empirically generate the
similarity matrices. The Dayhoff
method used phylogenetic trees and
sequences taken from species on
the tree. This approach has given
rise to the
PAM series of matrices. PAM
matrices are labelled according to
how many accepted point mutations
(amino acid replacements) have
occurred per 100 amino acids.
While the PAM matrices benefit
from having a well-understood
evolutionary model, they are most
useful at short evolutionary
distances (PAM10 to PAM120). At
long evolutionary distances, for
example PAM250 (about 20%
identity), it has been shown that
the BLOSUM matrices are much more
effective.
The BLOSUM series was
generated by comparing a number of
divergent sequences. BLOSUM
matrices are labelled according to
the percent identity threshold at
which sequences were clustered
when the matrix was built
(BLOSUM62, for example, clusters
sequences that are at least 62%
identical), so a lower
BLOSUM number corresponds to a
higher PAM number.