Substitution model

From Wikipedia, the free encyclopedia.

 

A substitution model describes the process by which a fixed-length sequence of characters drawn from some alphabet changes into another sequence over the course of evolution. For example, in cladistics, each position in the sequence might correspond to a property of a species which can either be present or absent. The alphabet could then consist of "0" for absence and "1" for presence. Then the sequence 00110 could mean, for example, that a species does not have feathers or lay eggs, does have fur, is warm-blooded, and cannot breathe underwater. Another sequence 11010 would mean that a species has feathers, lays eggs, does not have fur, is warm-blooded, and cannot breathe underwater. In phylogenetics, sequences are often obtained by first producing a nucleotide or protein sequence alignment, and then taking the bases or amino acids at corresponding positions in the alignment as the characters. Sequences obtained this way might look like AGCGGAGCTTA and GCCGTAGACGC.
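The character encoding can be made concrete with a small sketch. It follows the worked example, in which 00110 is read as no feathers, no eggs, fur, warm-blooded, no underwater breathing, so "1" marks a trait that is present. The names `TRAITS`, `decode` and `differing_sites` are illustrative only.

```python
# Trait order is taken from the worked example; "1" = present, "0" = absent.
TRAITS = [
    "has feathers",
    "lays eggs",
    "has fur",
    "is warm-blooded",
    "can breathe underwater",
]


def decode(sequence):
    """Map a 0/1 character sequence to a {trait: bool} dictionary."""
    if len(sequence) != len(TRAITS):
        raise ValueError("sequence length must match the number of traits")
    return {trait: ch == "1" for trait, ch in zip(TRAITS, sequence)}


def differing_sites(a, b):
    """Return the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(decode("00110"))
print(differing_sites("00110", "11010"))
print(differing_sites("AGCGGAGCTTA", "GCCGTAGACGC"))
```

The same `differing_sites` helper works for the binary trait sequences and for aligned nucleotide sequences, since a substitution model only cares about which character occupies each position.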

Substitution models are used for a number of things:

  1. Constructing evolutionary trees in phylogenetics or cladistics.
  2. Simulating sequences to test other methods and algorithms.


Neutral, independent, finite sites models

Most substitution models used to date are neutral, independent, finite sites models.

Neutral 
Selection does not operate on the substitutions, and so they are unconstrained.
Independent 
Changes in one site do not affect the probability of changes in another site.
Finite Sites 
There are finitely many sites, and so over evolution, a single site can be changed multiple times. This means that, for example, if a character has value 0 at time 0 and at time t, it could be that no changes occurred, or that it changed to a 1 and back to a 0, or that it changed to a 1 and back to a 0 and then to a 1 and then back to a 0, and so on.

The molecular clock and the units of time

Different substitution models deal with time differently.

  • It is very common to measure time in substitutions. For example, when constructing a phylogenetic tree from a substitution model, the distance along each branch can simply be measured in substitutions. This is convenient because it avoids any question of whether the rate of substitution per unit time has changed (by definition, the number of substitutions per substitution is one), and it requires no information about timescales that could be called into question.
  • The molecular clock assumption is also very common. It assumes that the rate of substitution with respect to time is constant. Time then differs from a distance measured in substitutions only by a multiplying factor (usually called μ, the number of substitutions per unit time). To carry out this type of analysis, μ must be estimated first, which requires knowing at least one branch length in time units in advance; this is often difficult, and the estimate can easily be disputed.
  • The assumption of a molecular clock is often unrealistic, especially across long periods of evolution. For example, even though rodents are genetically very similar to primates, they have undergone a much higher number of substitutions in the estimated time since divergence, at least in some regions of the genome. This could be due to their shorter generation time. When studying events like the Cambrian explosion under a molecular clock assumption, poor concurrence between cladistic and phylogenetic data is often observed. There has been some work on models allowing a variable rate of evolution (see for example Kishino, Thorne, and Bruno: Performance of a divergence time estimation method under a probabilistic model of rate evolution. Molecular Biology and Evolution 18: 352-361 (2001) and Thorne, Kishino and Painter: Estimating the rate of evolution of the rate of molecular evolution. Molecular Biology and Evolution 15: 1647-1657 (1998)).
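As a rough illustration of the molecular clock bookkeeping, the sketch below estimates μ from one calibration branch and then converts another branch from substitutions to time. The function names and all numbers are hypothetical and chosen only to show the arithmetic.

```python
def estimate_mu(branch_length_subs, known_time):
    """Estimate mu (substitutions per unit time) from one calibrated branch."""
    if known_time <= 0:
        raise ValueError("known_time must be positive")
    return branch_length_subs / known_time


def substitutions_to_time(branch_length_subs, mu):
    """Convert a branch length in substitutions to time, assuming a constant rate mu."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    return branch_length_subs / mu


# Hypothetical calibration: 0.12 substitutions per site accumulated over 6 time units.
mu = estimate_mu(0.12, 6.0)
# Another branch of the same tree, 0.05 substitutions per site long.
print(mu, substitutions_to_time(0.05, mu))
```

Every time estimate inherits the uncertainty of the single calibration branch, which is why the choice of calibration is so easily disputed.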

Time reversible models

Most useful substitution models are time reversible. Mathematically, this means that the process looks the same whether it is run forwards or backwards in time; one consequence is that the relative frequencies of each character do not change over time.

For a time reversible model, we can't tell the direction of time. For example, A -> C -> G is equivalent to G -> C -> A.

This matters because when we analyse real biological data, we do not have access to the ancestral species, only to the extant species present today. When a model is time reversible, it is irrelevant which species was the ancestor. Instead, we can root the phylogenetic tree at any arbitrary extant species, and then re-root the tree using other data later (or just leave the tree unrooted).

A time reversible model satisfies the following property for every pair of characters i and j: \pi_i Q_{ij} = \pi_j Q_{ji}, where \pi_i is the equilibrium frequency of character i and Q_{ij} is the rate of change from i to j.
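The reversibility property can be checked numerically for any candidate pair of frequencies and rates. The following is a minimal sketch; the function name `is_time_reversible` and the two small example matrices are made up for illustration.

```python
import math


def is_time_reversible(pi, Q, tol=1e-12):
    """Check pi[i] * Q[i][j] == pi[j] * Q[j][i] for every pair i < j."""
    n = len(pi)
    return all(
        math.isclose(pi[i] * Q[i][j], pi[j] * Q[j][i], abs_tol=tol)
        for i in range(n)
        for j in range(i + 1, n)
    )


# Two-state example: 0.75 * 1 == 0.25 * 3, so the property holds.
print(is_time_reversible([0.75, 0.25], [[-1.0, 1.0], [3.0, -3.0]]))

# Three-state cycle 0 -> 1 -> 2 -> 0: stationary under uniform frequencies,
# but changes only ever run in one direction, so it is not reversible.
cycle = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0], [1.0, 0.0, -1.0]]
print(is_time_reversible([1 / 3, 1 / 3, 1 / 3], cycle))
```

The cyclic example shows that unchanging character frequencies alone do not guarantee reversibility.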


The mathematics of substitution models

Neutral, independent, finite sites models (assuming a constant rate of evolution) have two parameters: Π, a vector of base (or character) frequencies at time zero (for a time reversible model, this vector is usually referred to as the equilibrium base frequencies, and applies at all times), and the rate matrix, Q, which describes the rate at which bases of one type change into bases of another type (so Q_{ij} for i \ne j is the rate at which base i goes to base j). For convenience, the diagonal entries of Q are chosen so that each row sums to zero: Q_{ii} = -\sum_{j \ne i} Q_{ij}

The transition matrix function is a function from the branch lengths (in some units of time, possibly in substitutions) to a matrix of conditional probabilities. It is denoted P(t). The entry in the i-th row and the j-th column, P_{ij}(t), is the probability, after time t, that there is a base j at a given position, conditional on there being a base i at that position at time 0. When the model is time reversible, this can be computed between any two sequences, even if one is not the ancestor of the other, provided the total branch length between them is known.

The asymptotic properties of P_{ij}(t) are such that \lim_{t \rightarrow 0} P_{ij}(t) = \delta_{ij} (1 if i = j and 0 otherwise), i.e. there is no change in base composition between a sequence and itself, and \lim_{t \rightarrow \infty} P_{ij}(t) = \pi_{j}, or in other words, as time goes to infinity, the probability of finding base j at a position given there was an i at that position originally goes to the probability that there is base j at that position (regardless of the original base).

The transition matrix can be computed from the rate matrix by P(t) = e^{Qt}. Since Q is a matrix, this is a matrix exponential, defined by the Taylor series P(t) = \sum_{n=0}^{\infty}{Q^n {{t^n} \over {n!}}}. In practice the series is truncated, or the exponential is evaluated by other means such as diagonalising Q.
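A truncated Taylor series is easy to sketch in plain Python. The helper names `mat_mul` and `transition_matrix` and the default of 60 terms are assumptions made for this illustration; the truncation behaves well only when the entries of Qt are modest in size, and a library routine such as SciPy's `scipy.linalg.expm` would normally be preferred for real analyses.

```python
import math


def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transition_matrix(Q, t, terms=60):
    """Approximate P(t) = exp(Qt) by summing the first `terms` Taylor terms."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    # term holds (Qt)^k / k!, starting with the identity matrix for k = 0.
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    P = [row[:] for row in term]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, Qt)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P


# Two-state check: for Q = [[-1, 1], [1, -1]] the exact answer is
# P_00(t) = 1/2 + 1/2 * exp(-2t).
P = transition_matrix([[-1.0, 1.0], [1.0, -1.0]], 0.5)
print(P[0][0], 0.5 + 0.5 * math.exp(-1.0))
```

Each row of the result should sum to one, since from any starting base the process must end in some base.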

The stationarity constraint is ΠQ = 0, which says that the overall base frequencies do not systematically change from Π. It follows from time reversibility because the rows of Q were defined to sum to zero: \sum_i \pi_i Q_{ij} = \pi_j \sum_i Q_{ji} = 0. This is equivalent to saying ΠP(t) = Π for all t.


GTR: Generalised time reversible

GTR is the most general neutral, independent, finite-sites, time-reversible model possible. It was first described in a general form by Simon Tavaré in 1986.

The GTR parameters consist of an equilibrium base frequency vector, Π = (\pi_1, \pi_2, \pi_3, \pi_4), giving the frequency at which each base occurs at each site, and the rate matrix

Q = \begin{pmatrix} {-(x_1+x_2+x_3)} & x_1 & x_2 & x_3 \\ {\pi_1 x_1 \over \pi_2} & {-({\pi_1 x_1 \over \pi_2} + x_4 + x_5)} & x_4 & x_5 \\ {\pi_1 x_2 \over \pi_3} & {\pi_2 x_4 \over \pi_3} & {-({\pi_1 x_2 \over \pi_3} + {\pi_2 x_4 \over \pi_3} + x_6)} & x_6 \\ {\pi_1 x_3 \over \pi_4} & {\pi_2 x_5 \over \pi_4} & {\pi_3 x_6 \over \pi_4} & {-({\pi_1 x_3 \over \pi_4} + {\pi_2 x_5 \over \pi_4} + {\pi_3 x_6 \over \pi_4})} \end{pmatrix}

Therefore, GTR (for four characters, as is often the case in phylogenetics) requires 6 substitution rate parameters, as well as 4 equilibrium base frequency parameters. Because the frequencies must sum to 1, only 3 of them are free, which leaves 9 parameters, one of which sets μ, the overall number of substitutions per unit time. When measuring time in substitutions (μ = 1), only 8 free parameters remain.

In general, to compute the number of parameters, count the number of entries above the diagonal in the matrix, i.e. {{n^2-n} \over 2} for n trait values per site, then add n - 1 for the equilibrium frequencies (which must sum to 1), and subtract 1 because μ is fixed. This gives {{n^2-n} \over 2} + (n - 1) - 1 = {1 \over 2}n^2 + {1 \over 2}n - 2. For example, for an amino acid sequence (there are 20 "standard" amino acids that make up proteins), there are 208 parameters. However, when studying coding regions of the genome, it is more common to work with a codon substitution model (a codon is three bases and codes for one amino acid in a protein). There are 4^3 = 64 codons, but the rates for changes between codons which differ by more than one base are assumed to be zero. Each codon has 9 neighbours that differ from it at exactly one base, so {{64 \times 9} \over 2} = 288 rates remain, giving 288 + 63 - 1 = 350 parameters.
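The structure of the GTR rate matrix can be sketched as follows: the six rates x1 to x6 fill the entries above the diagonal, the entries below the diagonal follow from the reversibility condition, and the diagonal is set so that each row sums to zero. The function name `gtr_rate_matrix` and the numerical values are illustrative assumptions only.

```python
def gtr_rate_matrix(pi, x):
    """Build the 4x4 GTR rate matrix from frequencies pi and rates x = (x1, ..., x6)."""
    x1, x2, x3, x4, x5, x6 = x
    upper = {(0, 1): x1, (0, 2): x2, (0, 3): x3, (1, 2): x4, (1, 3): x5, (2, 3): x6}
    Q = [[0.0] * 4 for _ in range(4)]
    for (i, j), rate in upper.items():
        Q[i][j] = rate
        # Reversibility: pi[i] * Q[i][j] == pi[j] * Q[j][i].
        Q[j][i] = pi[i] * rate / pi[j]
    for i in range(4):
        # Diagonal entries make every row sum to zero.
        Q[i][i] = -sum(Q[i][j] for j in range(4) if j != i)
    return Q


pi = [0.1, 0.2, 0.3, 0.4]                        # hypothetical frequencies
Q = gtr_rate_matrix(pi, [1.0, 2.0, 0.5, 1.5, 0.8, 1.2])  # hypothetical rates
for row in Q:
    print(row)
```

Multiplying the frequency vector by the resulting matrix should give a vector of zeros, which is a quick way to confirm that the frequencies are indeed the equilibrium ones.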


 


JC69 model (Jukes and Cantor, 1969)

JC69 is the simplest substitution model. It assumes equal base frequencies (\pi_1 = \pi_2 = \pi_3 = \pi_4 = {1\over4}) and equal mutation rates. The only parameter of this model is therefore μ, the overall substitution rate.

Q = \begin{pmatrix} {-{3\mu\over 4}} & {\mu\over 4} & {\mu\over 4} & {\mu\over 4} \\ {\mu\over 4} & {-{3\mu\over 4}} & {\mu\over 4} & {\mu\over 4} \\ {\mu\over 4} & {\mu\over 4} & {-{3\mu\over 4}} & {\mu\over 4} \\ {\mu\over 4} & {\mu\over 4} & {\mu\over 4} & {-{3\mu\over 4}} \end{pmatrix}

The diagonal entries are -{3\mu\over 4} so that each row sums to zero. The corresponding transition matrix is

P = \begin{pmatrix} {{1\over4} + {3\over4}e^{-\mu t}} & {{1\over4} - {1\over4}e^{-\mu t}} & {{1\over4} - {1\over4}e^{-\mu t}} & {{1\over4} - {1\over4}e^{-\mu t}} \\ {{1\over4} - {1\over4}e^{-\mu t}} & {{1\over4} + {3\over4}e^{-\mu t}} & {{1\over4} - {1\over4}e^{-\mu t}} & {{1\over4} - {1\over4}e^{-\mu t}} \\ {{1\over4} - {1\over4}e^{-\mu t}} & {{1\over4} - {1\over4}e^{-\mu t}} & {{1\over4} + {3\over4}e^{-\mu t}} & {{1\over4} - {1\over4}e^{-\mu t}} \\ {{1\over4} - {1\over4}e^{-\mu t}} & {{1\over4} - {1\over4}e^{-\mu t}} & {{1\over4} - {1\over4}e^{-\mu t}} & {{1\over4} + {3\over4}e^{-\mu t}} \end{pmatrix}
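The JC69 transition probabilities are simple enough to evaluate directly. With off-diagonal rates of μ/4, the only nonzero eigenvalue of Q is -μ, so the closed form decays as e^{-μt}; the sketch below uses that form. The function names are illustrative, and the value μ = 2 is arbitrary.

```python
import math


def jc69_probability(i, j, t, mu=1.0):
    """P_ij(t) under JC69 with off-diagonal rates mu / 4."""
    decay = math.exp(-mu * t)
    if i == j:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def jc69_matrix(t, mu=1.0):
    """Full 4x4 transition matrix P(t)."""
    return [[jc69_probability(i, j, t, mu) for j in range(4)] for i in range(4)]


mu = 2.0
P = jc69_matrix(0.3, mu)
print(P[0])

# Near t = 0 the off-diagonal slope recovers the rate mu / 4 from Q.
h = 1e-6
print(jc69_probability(0, 1, h, mu) / h, mu / 4)
```

At t = 0 the matrix is the identity, and for large t every entry approaches 1/4, the equilibrium frequency of each base.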


K80 model (Kimura, 1980)

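Distinguishes between transitions and transversions, with κ = α/β the ratio of the transition rate α to the transversion rate β.

Equal base frequencies (\pi_T = \pi_C = \pi_A = \pi_G = {1\over4})

Rate matrix, with rows and columns in the order T, C, A, G and each diagonal entry * set so that its row sums to zero: Q= \begin{pmatrix} {*} & {\kappa} & {1} & {1} \\ {\kappa} & {*} & {1} & {1} \\ {1} & {1} & {*} & {\kappa} \\ {1} & {1} & {\kappa} & {*}  \end{pmatrix}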


 


F81 model (Felsenstein 1981)


Unequal base frequencies (\pi_T \ne \pi_C \ne \pi_A \ne \pi_G)

Rate matrix Q= \begin{pmatrix} {*} & {\pi_C} & {\pi_A} & {\pi_G} \\ {\pi_T} & {*} & {\pi_A} & {\pi_G} \\ {\pi_T} & {\pi_C} & {*} & {\pi_G} \\ {\pi_T} & {\pi_C} & {\pi_A} & {*}  \end{pmatrix}


 


HKY85 model (Hasegawa, Kishino, and Yano 1985)

Distinguishes between transitions and transversions through the rate ratio κ (α/β)

Unequal base frequencies (\pi_T \ne \pi_C \ne \pi_A \ne \pi_G)

Rate matrix Q= \begin{pmatrix} {*} & {\kappa\pi_C} & {\pi_A} & {\pi_G} \\ {\kappa\pi_T} & {*} & {\pi_A} & {\pi_G} \\ {\pi_T} & {\pi_C} & {*} & {\kappa\pi_G} \\ {\pi_T} & {\pi_C} & {\kappa\pi_A} & {*}  \end{pmatrix}
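The HKY85 matrix can be assembled mechanically: every off-diagonal entry is the frequency of the target base, multiplied by κ when the change is a transition (T <-> C or A <-> G). The sketch below assumes the row and column order T, C, A, G used in the matrix above and fills each diagonal entry so that its row sums to zero. The function name and the example values are hypothetical.

```python
BASES = "TCAG"
TRANSITIONS = {("T", "C"), ("C", "T"), ("A", "G"), ("G", "A")}


def hky85_rate_matrix(pi, kappa):
    """Build Q in T, C, A, G order from a base -> frequency mapping and kappa."""
    Q = []
    for a in BASES:
        row = [
            0.0 if a == b else (kappa if (a, b) in TRANSITIONS else 1.0) * pi[b]
            for b in BASES
        ]
        row[BASES.index(a)] = -sum(row)  # diagonal makes the row sum to zero
        Q.append(row)
    return Q


pi = {"T": 0.1, "C": 0.2, "A": 0.3, "G": 0.4}  # hypothetical frequencies
for row in hky85_rate_matrix(pi, kappa=2.0):
    print(row)
```

Setting `kappa=1.0` gives the off-diagonal pattern of the F81 matrix, and setting all four frequencies to 1/4 gives the K80 pattern up to an overall factor of 1/4.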


T92 model (Tamura 1992)

Only one frequency parameter, \pi_{GC}, the G+C content:

\pi_G = \pi_C = {\pi_{GC}\over 2}

\pi_A = \pi_T = {(1-\pi_{GC})\over 2}


Rate matrix Q= \begin{pmatrix} {*} & {\kappa\pi_{GC}/2} & {(1-\pi_{GC})/2} & {\pi_{GC}/2} \\ {\kappa(1-\pi_{GC})/2} & {*} & {(1-\pi_{GC})/2} & {\pi_{GC}/2} \\ {(1-\pi_{GC})/2} & {\pi_{GC}/2} & {*} & {\kappa\pi_{GC}/2} \\ {(1-\pi_{GC})/2} & {\pi_{GC}/2} & {\kappa(1-\pi_{GC})/2} & {*}  \end{pmatrix}
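The T92 model needs only two numbers, the G+C content and κ. A minimal sketch, assuming the T, C, A, G ordering of the matrix above and using illustrative function names and values, derives the four base frequencies and then builds the rate matrix from them:

```python
ORDER = "TCAG"
PURINES = {"A", "G"}


def t92_frequencies(pi_gc):
    """Derive all four base frequencies from the G+C content."""
    if not 0.0 < pi_gc < 1.0:
        raise ValueError("pi_gc must lie strictly between 0 and 1")
    at = (1.0 - pi_gc) / 2.0
    gc = pi_gc / 2.0
    return {"T": at, "C": gc, "A": at, "G": gc}


def t92_rate_matrix(pi_gc, kappa):
    """Build Q in T, C, A, G order; diagonal entries make rows sum to zero."""
    freqs = t92_frequencies(pi_gc)
    Q = []
    for a in ORDER:
        row = []
        for b in ORDER:
            if a == b:
                row.append(0.0)
                continue
            # A transition stays within purines or within pyrimidines.
            is_transition = (a in PURINES) == (b in PURINES)
            row.append((kappa if is_transition else 1.0) * freqs[b])
        row[ORDER.index(a)] = -sum(row)
        Q.append(row)
    return Q


print(t92_frequencies(0.6))
for row in t92_rate_matrix(0.6, kappa=3.0):
    print(row)
```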


 


TN93 model (Tamura and Nei 1993)

Distinguishes between the two types of transition: (A <-> G) is allowed a different rate from (C <-> T), through \kappa_2 and \kappa_1 respectively

Unequal base frequencies (\pi_T \ne \pi_C \ne \pi_A \ne \pi_G)

Rate matrix Q= \begin{pmatrix} {*} & {\kappa_1\pi_C} & {\pi_A} & {\pi_G} \\ {\kappa_1\pi_T} & {*} & {\pi_A} & {\pi_G} \\ {\pi_T} & {\pi_C} & {*} & {\kappa_2\pi_G} \\ {\pi_T} & {\pi_C} & {\kappa_2\pi_A} & {*}  \end{pmatrix}
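The TN93 matrix differs from a single-κ model only in which multiplier each transition receives. The sketch below assumes the T, C, A, G ordering of the matrix above, applies `kappa1` to T <-> C and `kappa2` to A <-> G, and also checks the reversibility condition; all names and values are illustrative.

```python
ORDER = "TCAG"


def tn93_rate_matrix(pi, kappa1, kappa2):
    """Build Q in T, C, A, G order from a base -> frequency mapping."""

    def scale(a, b):
        pair = {a, b}
        if pair == {"T", "C"}:
            return kappa1  # pyrimidine transition
        if pair == {"A", "G"}:
            return kappa2  # purine transition
        return 1.0         # transversion

    Q = []
    for a in ORDER:
        row = [0.0 if a == b else scale(a, b) * pi[b] for b in ORDER]
        row[ORDER.index(a)] = -sum(row)  # diagonal makes the row sum to zero
        Q.append(row)
    return Q


def is_reversible(pi, Q, tol=1e-12):
    """Check pi_a * Q_ab == pi_b * Q_ba for every pair of bases."""
    return all(
        abs(pi[a] * Q[i][j] - pi[b] * Q[j][i]) <= tol
        for i, a in enumerate(ORDER)
        for j, b in enumerate(ORDER)
    )


pi = {"T": 0.1, "C": 0.2, "A": 0.3, "G": 0.4}  # hypothetical frequencies
Q = tn93_rate_matrix(pi, kappa1=2.0, kappa2=5.0)
print(is_reversible(pi, Q))
```

When `kappa1` and `kappa2` are equal, the two transition types are no longer distinguished and the matrix has a single transition multiplier.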


 


References

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21-132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.

Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16:111-120.

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368-376.

Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22:160-174.

Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Molecular Biology and Evolution 9:678-687.

Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10:512-526.


