From Wikipedia,
the free encyclopedia.
A substitution model
describes the process from which a
sequence of characters of a fixed
size from some alphabet changes
into another set of traits. For
example, in
cladistics, each position the
sequence might correspond to a
property of a species which can
either be present or absent. The
alphabet could then consist of "0"
for presence and "1" for absence.
Then the sequence 00110 could
mean, for example, that a species
does not have feathers or lay
eggs, does have fur, is
warm-blooded, and cannot breathe
underwater. Another sequence 11010
would mean that a species has
feathers, lays eggs, does not have
fur, is warm-blooded, and cannot
breathe underwater. In
phylogenetics, sequences are
often obtained by firstly
obtaining a nucleotide or protein
sequence alignment, and then
taking the bases or amino acids at
corresponding positions in the
alignment as the characters.
Sequences achieved by this might
look like AGCGGAGCTTA and
GCCGTAGACGC.
Substitution models are used
for a number of things:
- Constructing evolutionary
trees in phylogenetics or
cladistics.
- Simulating sequences to test
other methods and algorithms.
Neutral, independent, finite
sites models
Most substitution models used
to date are neutral, independent,
finite sites models.
- Neutral
- Selection does not operate
on the substitutions, and so
they are unconstrained.
- Independent
- Changes in one site do not
affect the probability of
changes in another site.
- Finite Sites
- There are finitely many
sites, and so over evolution, a
single site can be changed
multiple times. This means that,
for example, if a character has
value 0 at time 0 and at time t,
it could be that no changes
occurred, or that it changed to
a 1 and back to a 0, or that it
changed to a 1 and back to a 0
and then to a 1 and then back to
a 0, and so on.
The molecular clock and the
units of time
Different substitution models
deal with time differently.
- It is very common to measure
time in substitutions. For
example, if I was going to
construct a phylogenetic tree
from a substitution model, I
could just measure the distance
along the branches of the trees
in substitutions. This is
convenient, because it avoids
any question of whether the rate
of substitution with respect to
the unit of time has changed or
not(because by definition the
number of substitutions per
substitution is one), and it
doesn't need any information
about timescales that could be
called into question.
- The molecular clock
assumption is also very common.
This assumes that the rate of
substitutions with respect to
time is constant. This is just
multiplying factor(usually
called μ,
the number of substitutions per
unit time) different from
measuring time in substitutions.
To carry out this type of
analysis, you need to estimate
μ
first(which requires you know at
least one branch length ahead of
time, often a difficult task,
which can easily be disputed by
others).
- The assumption of a
molecular clock is often
unrealistic, especially across
long periods of evolution. For
example, even though rodents are
genetically very similar to
primates, they have undergone a
much higher number of
substitutions in the estimated
time since divergence, at least
in some regions of the genome.
This could be due to the shorter
generation time. When studying
events like the
Cambrian explosion under a
molecular clock assumption, poor
concurrence between
cladistic and phylogenetic
data is often observed. There
has been some work on models
allowing variable rate of
evolution(see for example
Kishino, Thorne, and Bruno:
Performance of a divergence
estimation method under a
probabilistic model of rate
evolution. Molecular Biology of
Evolution 18: 352-361(2001) and
Thorne, Kishino and Painter:
Estimating the rate of evolution
of the rate of molecular
evolution: Molecular Biology of
Evolution 15: 1647-1657(1998)).
Time reversible models
Most useful substitution models
are
time reversible. In terms of
substitution models, this simply
means that over time, the relative
frequencies of each character do
not change.
For a time reversible model, we
can't tell the direction of time.
For example A -> C -> G is the
same as G -> C -> A
The reason for this is because
when we are analysing real
biological data, we do not have
access to the ancestral species,
only to the extant species present
today. However, when a model is
time-reversible, which species was
the ancestral species is
irrelevant. Instead, we can root
the phylogenetic tree at any
arbitrary extant species, and then
re-root the tree using other data
later(or just leave the tree
unrooted).
A time reversible model
satisfies the following properly
π1Q12
= π2Q21
The mathematics of
substitution models
Neutral, independent, finite
sites models(assuming a constant
rate of evolution) have two
parameters, Π, a vector of base(or
character) frequencies at time
zero(for a time reversible model,
this vector usually referred to as
the equilibrium base frequencies,
and applies at all times), and the
rate matrix, Q, which describes
the rate at which bases of one
type change into bases of another
type(so Qij
for
is the rate at which base i goes
to base j). For convenience, the
diagonals of the Q matrix are
chosen so the rows sum to
zero(which is convenient).

The transition matrix function
is a function from the branch
lengths(in some units of time,
possibly in substitutions), to a
matrix of conditional
probabilities. It is denoted
P(t)
The entry in the i-th column and
the j-th row(Pij(t))
is the probability, after time t,
that there is a base j at a given
base, conditional on there being a
i in that position at time 0. When
the model is time reversible, this
can be performed between any two
sequences, even if one is not the
ancestor of the other, if you know
the total branch length between
them.
The asymptotic properties of
Pij(t)
are such that
,
i.e. there is no change in base
composition between a sequence and
itself, and
,
or in other words, as time goes to
infinity, the probability of
finding base j at a position given
there was an i at that position
originally goes to the probability
that there is base j at that
position(regardless of the
original base).
The transition matrix can be
computed from the rate matrix and
the equilibrium base frequencies
by P(t)
= eQt.
Since Q is a matrix, this is a
matrix exponential, and must be
approximated by the
Taylor series expansion
.
The time reversibility(or
stationarity) constraint is
ΠQ =
0(because the rows where
defined to sum to zero, and the
overall base frequencies must not
systematically change from
Π).
This is equivalent to saying
ΠP(t)
= Π for all t.
GTR: Generalised time
reversible
GTR is the most general
neutral, independent,
finite-sites, time-reversible
model possible. It was first
described in a general form by
Simon Tavaré in 1986.
The GTR parameters consist of
an equilibrium base frequency
vector, Π =
(π1π2π3π4),
giving the frequency at which each
base occurs at each site, and the
rate matrix

Therefore, GTR(for four
characters, as is often the case
in phylogenetics) requires 6
parameters substitution rate
parameters, as well as 4
equilibrium base frequency
parameters. However, this is
usually eliminated down to 9
parameters plus
μ,
the overall number of
substitutions per unit time. When
measuring time in substitutions(μ=1)
only 9 free parameters remain.
In general, to compute the
number of parameters, you count
the number of entries above the
diagonal in the matrix, i.e. for n
trait values per site
,
and then add n for the equilibrium
base frequencies, and subtract 1
because μ
is fixed. You get
.
For example, for an amino acid
sequence(there are 20 "standard"
amino acids that make up
proteins), you would find
there are 209 parameters. However,
when studying coding regions of
the genome, it is more common to
work with a
codon substitution model(a
codon is three bases and codes for
one amino acid in a protein).
There are 43
= 64 codons, but the rates
for transitions between codons
which differ by more than one
amino acid is assumed to be zero.
Hence, there are
parameters.
JC69 model (Jukes and Cantor,
1969)
JC69 is the simplest
substitution model. There are
several assumptions. It assumes
equal base frequencies (
)
and equal mutation rates. The only
parameter of this model is
therefore μ,
the overall substitution rate.


K80 model (Kimura, 1980)
Distinguish between
Transition and
Transversion (α/β)
Equal base frequencies (
)
Rate matrix

F81 model (Felsenstein 1981)
Unequal base frequencies (
)
Rate matrix

HKY85 model (Hasegawa, Kishino,
and Yano 1985)
Distinguish between
Transition and
Transversion (α/β)
Unequal base frequencies (
)
Rate matrix

T92 model (Tamura 1992)
One frequency only
πGC


Rate matrix

TN93 model (Tamura and Nei
1993)
Distinguish between two
different types of
Transition (A <-> G) is
different to (C<->T)
Unequal base frequencies (
)
Rate matrix

External links
References
Jukes, T. H., and C. R. Cantor.
1969. Evolution of protein
molecules. Pp. 21-123 in H. N.
Munro, ed. Mammalian protein
metabolism. Academic Press, New
York.
Kimura, M. 1980. A simple
method for estimating evolutionary
rate of base substitution through
comparative studies of nucleotide
sequences. Journal of Molecular
Evolution 16:111-120.
Felsenstein, J. 1981.
Evolutionary trees from DNA
sequences: a maximum likelihood
approach. Journal of Molecular
Evolution 17:368-376.
Hasegawa, M., H. Kishino, and
T. Yano. 1985. Dating the
human-ape splitting by a molecular
clock of mitochondrial DNA.
Journal of Molecular Evolution
22:160-174.
Tamura, K. 1992. Estimation of
the number of nucleotide
substitutions when there are
strong transition-transversion and
G+C content biases. Molecular
Biology and Evolution 9:678-687.
Tamura, K., and M. Nei. 1993.
Estimation of the number of
nucleotide substitutions in the
control region of mitochondrial
DNA in humans and chimpanzees.
Molecular Biology and Evolution
10:512-526.