A sequence motif is a
nucleotide or
amino-acid
sequence pattern that is
widespread and has, or is
conjectured to have, a
biological significance.
An example is the N-glycosylation
site motif:
- Asn, followed by anything
but Pro, followed by either Ser
or Thr, followed by anything but
Pro
where the three-letter
abbreviations are the conventional
designations for
amino acids (see
genetic code).
Overview
When a sequence motif appears
in the
exon of a
gene, it may
encode the "structural
motif" of a
protein, that is, a
stereotypical element of the
overall structure of the
protein. Nevertheless, motifs need
not be associated with a
distinctive
secondary structure. "Noncoding"
sequences are not
translated into proteins, and
nucleic acids with such motifs
need not deviate from the typical
shape (e.g. the "B-form"
DNA
double helix).
Outside of gene exons, there
exist
regulatory sequence motifs
and motifs within so-called "junk" DNA,
such as
satellite DNA. Some of these
are believed to affect the shape
of nucleic acids (see for example
RNA self-splicing), but this
is only sometimes the case. For
example, many
DNA binding proteins that have
affinities for specific motifs
only bind DNA in its
double-helical form. They are able
to recognize motifs through
contact with the double helix's
major or minor groove.
Short coding motifs, which
appear to lack secondary
structure, include those that
label proteins for delivery to
particular parts of a
cell, or mark them for
phosphorylation.
Within a sequence or
database of sequences,
researchers search for motifs
using computer-based techniques of
sequence analysis, such as
BLAST. Such techniques belong
to the discipline of
bioinformatics.
See also
consensus sequence.
Motif bioinformatics
Consider the N-glycosylation
site motif mentioned above:
- Asn, followed by anything
but Pro, followed by either Ser
or Thr, followed by anything but
Pro
This pattern may be written as
N{P}[ST]{P}
where N=Asn, P=Pro,
S=Ser, T=Thr;
{X} means any amino
acid except X; and
[XY] means either
X or Y.
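As a rough illustration, the pattern can be rewritten as an ordinary regular expression and searched for in a protein sequence. The sketch below uses Python's re module; the function name find_sites and the example sequence are invented for illustration and do not come from any motif database.

```python
import re

# N{P}[ST]{P} as a Python regular expression:
# {P} (anything but Pro) becomes the negated class [^P].
# Note: [^P] accepts any character other than P, so the input is
# assumed to contain only one-letter amino acid codes.
N_GLYC = re.compile(r"N[^P][ST][^P]")

def find_sites(protein):
    """Return (0-based start, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in N_GLYC.finditer(protein)]

# Hypothetical sequence: "NGSA" fits the pattern, while "NPTL" (Pro after Asn)
# and "NATP" (Pro in the last position) do not.
example = "MKNGSAANPTLLNATP"
sites = find_sites(example)
```

Because finditer reports non-overlapping matches only, overlapping candidate sites would need a lookahead-based search instead.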
The notation [XY]
does not give any indication of
the probability of X
or Y occurring in the
pattern. Sometimes patterns are
defined in terms of a
probabilistic model such as a
hidden Markov model.
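To make the missing information concrete, the sketch below tabulates how often each residue occurs at each position of a set of aligned motif occurrences. It is a much simpler stand-in for a probabilistic model, not a hidden Markov model, and the four aligned instances are invented data used only for illustration.

```python
from collections import Counter

# Hypothetical aligned occurrences of a four-residue motif (invented data).
instances = ["NGSA", "NATV", "NVSL", "NISG"]

def column_frequencies(seqs):
    """Relative frequency of each residue at each aligned position."""
    n = len(seqs)
    return [
        {aa: count / n for aa, count in Counter(column).items()}
        for column in zip(*seqs)   # zip(*seqs) yields one tuple per position
    ]

freqs = column_frequencies(instances)
# freqs[2] records how often S and T each occur at the third position,
# which the bracket notation [ST] cannot express.
```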
Motifs and consensus sequences
The notation [XYZ]
means X or Y
or Z, but does not
indicate the likelihood of any
particular match. For this reason,
two or more patterns are often
associated with a single motif:
the defining pattern, and various
typical patterns.
For example, the defining
sequence for the IQ motif may be
taken to be:
[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]
where x signifies
any amino acid, and the square
brackets indicate an alternative
(see below for further details
about notation).
Usually, however, the first
letter is I, and both
[RK] choices resolve
to R. Since the last
choice is so wide, the pattern
IQxxxRGxxxR is
sometimes equated with the IQ
motif itself, but a more accurate
description would be a
consensus sequence for the IQ
motif.
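The difference between the defining pattern and the consensus sequence can be sketched with two regular expressions. In the example below, x is written as "." and the two peptides are made up for illustration; the function name classify is likewise hypothetical.

```python
import re

# Defining pattern [FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY], with x written as ".".
DEFINING = re.compile(r"[FILV]Q...[RK]G...[RK]..[FILVWY]")
# Consensus sequence IQxxxRGxxxR.
CONSENSUS = re.compile(r"IQ...RG...R")

def classify(peptide):
    """Return (matches defining pattern, matches consensus sequence)."""
    return bool(DEFINING.search(peptide)), bool(CONSENSUS.search(peptide))

# Invented peptides: the first uses the usual residues (I, R, R),
# the second uses permitted but less typical ones (L, K, K).
typical = classify("IQAAARGAAARAAL")
atypical = classify("LQAAAKGAAAKAAW")
```

The second peptide satisfies the defining pattern but not the consensus, which is why equating the consensus with the motif itself would miss some valid occurrences.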
Software
There are software programs
which, given multiple input
sequences, attempt to identify one
or more candidate motifs. One
example is
MEME, which generates
statistical information for each
candidate.
Discovery through evolutionary conservation
Motifs have been discovered by
studying similar genes in
different species. For example, by
aligning the amino acid sequences
specified by the GCM (glial
cells missing) gene in human,
mouse and D. melanogaster,
Akiyama and others discovered a
pattern which they called the GCM
motif. It spans about 150 amino
acid residues, and begins as
follows:
WDIND*.*P..*...D.F.*W***.**.IYS**...A.*H*S*WAMRNTNNHN
Here each .
signifies a single amino acid or a
gap, and each *
indicates one member of a
closely related family of amino
acids.
The authors were able to show
that the motif has DNA binding
activity.
Pattern description notations
Several notations for
describing motifs are in use but
most of them are variants of
standard notations for
regular expressions and use
these conventions:
- there is an alphabet of
single characters, each denoting
a specific amino acid or a set
of amino acids;
- a string of characters drawn
from the alphabet denotes a
sequence of the corresponding
amino acids;
- any string of characters
drawn from the alphabet enclosed
in square brackets matches any
one of the corresponding amino
acids; e.g.
[abc]
matches any of the amino acids
represented by a or
b or c.
The fundamental idea behind all
these notations is the matching
principle, which assigns a meaning
to a sequence of elements of the
pattern notation:
- a sequence of elements of
the pattern notation matches a
sequence of amino acids if and
only if the latter sequence can
be partitioned into subsequences
in such a way that each pattern
element matches the
corresponding subsequence in
turn.
Thus the pattern [AB] [CDE]
F matches the six amino
acid sequences corresponding to
ACF, ADF,
AEF, BCF,
BDF, and BEF.
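The matching principle for this small pattern can be checked by enumerating every combination of choices. The sketch below represents each pattern element as a string of allowed one-letter codes; the function name expand is an assumption made for illustration.

```python
from itertools import product

def expand(elements):
    """List every sequence matched by a pattern given as a list of elements,
    where each element is a string of the allowed one-letter codes."""
    return ["".join(choice) for choice in product(*elements)]

# [AB] [CDE] F: two choices, then three choices, then a single letter.
matches = expand(["AB", "CDE", "F"])
```

The number of matched sequences is the product of the element sizes, here 2 x 3 x 1 = 6.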
Different pattern description
notations have other ways of
forming pattern elements. One of
these notations is the PROSITE
notation, described in the
following subsection.
PROSITE pattern notation
The PROSITE notation uses the
IUPAC one-letter codes and
conforms to the above description
with the exception that a
concatenation symbol, '-',
is used between pattern elements,
but it is often dropped between
letters of the pattern alphabet.
PROSITE allows the following
pattern elements in addition to
those described previously:
- The lower case letter 'x'
can be used as a pattern element
to denote any amino acid.
- A string of characters drawn
from the alphabet and enclosed
in braces (curly brackets)
denotes any amino acid except
for those in the string. For
example,
{ST}
denotes any amino acid other
than S or T.
- If a pattern is restricted
to the N-terminal of a sequence,
the pattern is prefixed with '<'.
- If a pattern is restricted
to the C-terminal of a sequence,
the pattern is suffixed with '>'.
- The character '>' can also
occur inside the final square
bracket element of a pattern, so
that S[T>] matches both "ST"
and "S>", the latter being an S
at the C-terminal of the sequence.
- If e is a pattern element, and
m and n are two decimal integers
with m <= n, then:
  - e(m) is equivalent to the
  repetition of e exactly m times;
  - e(m,n) is equivalent to the
  repetition of e exactly k times
  for any integer k satisfying
  m <= k <= n.
Some examples:
x(3) is
equivalent to x-x-x.
x(2,4) matches
any sequence that matches
x-x or x-x-x
or x-x-x-x.
The signature of the C2H2-type
zinc finger domain is:
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
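The elements described in this subsection map fairly directly onto regular expression syntax. The following is a minimal sketch of such a translation, covering only the elements listed above; the function name prosite_to_regex and the test peptide are invented for illustration, and this is not the parser used by PROSITE itself.

```python
import re

# One pattern element: optional '<', a core, optional repeat count, optional '>'.
TOKEN = re.compile(
    r"(<)?"                              # N-terminal anchor
    r"(x|[A-Z]|\[[A-Z>]+\]|\{[A-Z]+\})"  # x, one letter, [set] or {excluded set}
    r"(?:\((\d+)(?:,(\d+))?\))?"         # (m) or (m,n)
    r"(>)?"                              # C-terminal anchor
)

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    text = pattern.replace("-", "")      # '-' only separates elements
    parts, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"cannot parse {text[pos:]!r}")
        n_anchor, core, lo, hi, c_anchor = m.groups()
        if core == "x":                  # any amino acid
            core = "."
        elif core.startswith("{"):       # anything except the listed residues
            core = "[^" + core[1:-1] + "]"
        elif ">" in core:                # e.g. [T>]: a listed residue or the end
            core = "(?:[" + core[1:-1].replace(">", "") + "]|$)"
        if lo is not None:               # repetition count
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(("^" if n_anchor else "") + core + ("$" if c_anchor else ""))
        pos = m.end()
    return "".join(parts)

ZINC_FINGER = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
# Invented peptide built to satisfy the signature, not a real protein.
peptide = "CAACAAAL" + "A" * 8 + "HAAAH"
hit = re.search(ZINC_FINGER, peptide)
```

Stripping the '-' separators first means the sketch also accepts patterns in which the separator has been dropped between letters.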
Another scheme
This example comes from the
paper by Matsuda and colleagues
cited below.
The
E. coli lactose
operon repressor LacI (PDB
id 1lccA) and E. coli
catabolite gene activator (PDB id
3gapA) both have a
helix-turn-helix motif, but
their amino acid sequences do not
show much similarity, as shown in
the table below.
Matsuda and colleagues devised
a code called the 3D chain code
for representing a protein
structure as a string of letters.
This encoding scheme reveals the
similarity between the proteins
much more clearly than the amino
acid sequence:
| PDB id | 3D chain code | Amino acid sequence |
| 1lccA | TWWWWWWWKCLKWWWWWWG | LYDVAEYAGVSYQTVSRVV |
| 3gapA | KWWWWWWGKCFKWWWWWWW | RQEIGQIVGCSRETVGRIL |
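One crude way to quantify the difference is to count the positions at which the two strings carry the same letter. The sketch below does this for both rows of the table; position-by-position identity is an assumption chosen for simplicity and is not the comparison method of Matsuda and colleagues.

```python
def identity(a, b):
    """Fraction of aligned positions at which two strings carry the same letter."""
    pairs = list(zip(a, b))              # position by position, no gaps
    return sum(x == y for x, y in pairs) / len(pairs)

# Strings copied from the table above.
chain_code = {"1lccA": "TWWWWWWWKCLKWWWWWWG", "3gapA": "KWWWWWWGKCFKWWWWWWW"}
amino_acids = {"1lccA": "LYDVAEYAGVSYQTVSRVV", "3gapA": "RQEIGQIVGCSRETVGRIL"}

chain_identity = identity(chain_code["1lccA"], chain_code["3gapA"])
sequence_identity = identity(amino_acids["1lccA"], amino_acids["3gapA"])
```

Comparing chain_identity with sequence_identity shows that the 3D chain codes agree at far more positions than the amino acid sequences do.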
References
- Akiyama, Y. et al. (1996).
The gcm-motif: a novel
DNA-binding motif conserved in
Drosophila and mammals. Proc.
Natl. Acad. Sci. USA 93
14912–14916.
- Matsuda, Hideo; Taniguchi,
Fumihiro; & Hashimoto, Akihiro
(January 1997). An Approach to
Detection of Protein Structural
Motifs using an Encoding Scheme
of Backbone Conformations.
Proc. of 2nd Pacific Symposium
on Biocomputing 280–291.