From Wikipedia,
the free encyclopedia.
Bioinformatics or
computational biology is the
use of techniques from
applied mathematics,
informatics,
statistics, and
computer science to solve
biological problems. Research
in computational biology often
overlaps with
systems biology. Major
research efforts in the field
include
sequence alignment,
gene finding, genome assembly,
protein structure alignment,
protein structure prediction,
prediction of
gene expression and
protein-protein interactions,
and the modeling of
evolution. The terms
bioinformatics and
computational biology are
often used interchangeably,
although the latter typically
focuses on algorithm development
and specific computational
methods. A common thread in
projects in bioinformatics and
computational biology is the use
of mathematical tools to extract
useful information from
noisy data produced by
high-throughput biological
techniques. (The field of
data mining overlaps with
computational biology in this
regard.) Representative problems
in computational biology include
the assembly of high-quality
DNA sequences from fragmentary
"shotgun" DNA
sequencing, and the prediction
of
gene regulation with data from
mRNA
microarrays or
mass spectrometry.
Making sense of the huge
amounts of DNA data
produced by gene
sequencing projects is just
one of the tasks faced by
bioinformatics.
Major research areas
Sequence analysis
Main articles:
Sequence alignment,
Sequence database
Since the
Phage Φ-X174 was
sequenced in 1977, the
DNA sequences of more and more
organisms have been decoded and
stored in electronic databases.
This data is analyzed to determine
genes that code for
proteins, as well as
regulatory sequences. A comparison
of genes within a
species or between different
species can show similarities
between protein functions, or
relations between species (the use
of
molecular systematics to
construct
phylogenetic trees). With the
growing amount of data, it long
ago became impractical to analyze
DNA sequences manually. Today,
computer programs are used to
search the
genomes of thousands of
organisms, containing billions of
nucleotides. These programs
can compensate for mutations
(exchanged, deleted or inserted
bases) in the DNA sequence, in
order to identify sequences that
are related, but not identical. A
variant of this
sequence alignment is used in
the sequencing process itself. The
so-called
shotgun sequencing technique
(which was used, for example, by
The Institute for Genomic Research
to sequence the first bacterial
genome, Haemophilus influenzae)
does not give a sequential list of
nucleotides, but instead the
sequences of thousands of small
DNA fragments (each about 600-800
nucleotides long). The ends of
these fragments overlap and, when
aligned in the right way, make up
the complete genome. Shotgun
sequencing yields sequence data
quickly, but the task of
assembling the fragments can be
quite complicated for larger
genomes. In the case of the
Human Genome Project, it took
several months of CPU time (on a
circa-2000 vintage DEC Alpha
computer) to assemble the
fragments. Shotgun sequencing is
the method of choice for virtually
all genomes sequenced today, and
genome assembly algorithms are
a critical area of bioinformatics
research.
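The idea of joining overlapping fragments can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeated regions); the function names and the example fragments are invented for illustration, and production assemblers use far more elaborate methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap."""
    # Drop duplicates and fragments wholly contained in another fragment.
    frags = [f for f in dict.fromkeys(fragments)
             if not any(f != g and f in g for g in fragments)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge; return the separate contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical reads taken from the made-up sequence ATGGCGTACGTTAGC.
reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
print(greedy_assemble(reads))
```

The pairwise search is quadratic in the number of fragments, which hints at why assembling a large genome from many thousands of reads is computationally expensive.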
Another aspect of
bioinformatics in sequence
analysis is the automatic
search for genes and
regulatory sequences within a
genome. Not all of the nucleotides
within a genome belong to genes. Within
the genome of higher organisms,
large parts of the DNA do not
serve any obvious purpose. This
so-called
junk DNA may, however, contain
unrecognized functional elements.
Bioinformatics helps to bridge the
gap between genome and
proteome projects, for example
in the use of DNA sequence for
protein identification.
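The simplest form of an automatic gene search is a scan for open reading frames (ORFs): stretches that begin with a start codon and run, in the same reading frame, to a stop codon. The sketch below assumes the standard genetic code, examines only the forward strand, and ignores introns and regulatory signals, so it is an illustration rather than a real gene finder. The function name and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon and
    includes both. `min_codons` counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None
    return orfs


for start, end, seq in find_orfs("CCATGAAATTTTAGGG"):
    print(start, end, seq)
```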
See also:
sequence analysis,
sequence profiling tool,
sequence motif.
Genome annotation
Main article:
Gene finding
In the context of genomics,
annotation is the process of
marking the genes and other
biological features in a DNA
sequence. The first genome
annotation software system was
designed in 1995 by Owen White,
who was part of the team that
sequenced and analyzed the first
genome of a free-living organism
to be decoded, the bacterium
Haemophilus influenzae. Dr.
White built a software system to
find the genes (places in the DNA
sequence that encode a protein),
the transfer RNA, and other
features, and to make initial
assignments of function to those
genes. Most current genome
annotation systems work similarly,
but the programs available for
analysis of genomic DNA are
constantly changing and improving.
The
Ensembl system contains a
genome annotation pipeline for the
human genome (as well as others),
originally developed by Ewan
Birney while at the
Wellcome Trust Sanger Institute
near
Cambridge, England[1].
Computational evolutionary
biology
Evolutionary biology is the
study of the origin and descent of
species, as well as their change
over time. Recent developments in
genome sequencing and the ubiquity
of fast computers enable
researchers to trace evolution of
species by tracing changes in
their DNA. Research in
computational evolutionary biology
(CEB) from the
pre-genome era involved building
computational models of
populations and watching their
behavior over time.
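A classic population model of this kind is the Wright-Fisher model of genetic drift, chosen here purely as an illustration (the text above does not name a specific model). The sketch tracks the frequency of one allele in a population of fixed size; the function name, parameters, and default seed are assumptions.

```python
import random


def wright_fisher(pop_size: int, init_freq: float, generations: int,
                  seed: int = 0) -> list[float]:
    """Simulate allele frequency under pure genetic drift.

    Each generation, every one of the `pop_size` gene copies is drawn
    independently from the previous generation's allele frequency.
    """
    rng = random.Random(seed)
    freq = init_freq
    history = [freq]
    for _ in range(generations):
        count = sum(rng.random() < freq for _ in range(pop_size))
        freq = count / pop_size
        history.append(freq)
    return history


trajectory = wright_fisher(pop_size=100, init_freq=0.5, generations=50)
print(trajectory[-1])
```

Running the simulation with different seeds shows how chance alone can drive an allele to loss or fixation, especially in small populations.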
The field of
genetic algorithms might be
described as the rough inverse of
CEB — rather than investigating
evolution through computer
programs, it aims to improve
computer programs through
evolutionary principles.
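A minimal genetic algorithm can be sketched on a toy problem: evolving a bit string toward all ones. Everything in this example (the fitness function, tournament selection, single-point crossover, the parameter values) is an illustrative assumption rather than a standard taken from the text.

```python
import random


def evolve_bitstring(length=20, pop_size=30, generations=60,
                     mutation_rate=0.02, seed=1):
    """Evolve bit strings toward all ones; return (best, best_fitness_history)."""
    rng = random.Random(seed)
    fitness = sum  # fitness of an individual = number of 1 bits
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []

    def pick(pop):
        # Tournament selection: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        next_pop = [best[:]]  # elitism: carry the best individual over unchanged
        while len(next_pop) < pop_size:
            p1, p2 = pick(population), pick(population)
            cut = rng.randrange(1, length)          # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]              # random bit-flip mutation
            next_pop.append(child)
        population = next_pop

    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve_bitstring()
print(sum(best), "ones out of", len(best))
```

Because the best individual is always copied into the next generation, the best fitness recorded in `history` can never decrease.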
Gene expression analysis
The
expression of many genes can
be determined by measuring
mRNA levels with multiple
techniques including
microarrays, expressed cDNA
sequence tag (EST)
sequencing,
serial analysis of gene expression
(SAGE) tag sequencing,
massively parallel signature
sequencing (MPSS), or by
measuring protein concentrations
with high-throughput
mass spectrometry. All of
these techniques are extremely
noise-prone and/or subject to bias
in the biological measurement, and
a major research area in
computational biology involves
developing statistical tools to
separate
signal from
noise in high-throughput (HT) gene
expression studies. HT
studies are often used to
determine the genes implicated in
a disorder: one might compare
microarray data from cancerous
epithelial cells to data from
non-cancerous cells to determine
which genes are up-regulated and
down-regulated in cancer.
Expression data is also used to
infer gene regulation: one might
compare microarray data from a
wide variety of states of an
organism to form hypotheses about
the genes involved in each state.
In a single-cell organism, one
might compare stages of the
cell cycle, along with various
stress conditions (heat shock,
starvation, etc.). One can then
apply
clustering algorithms to that
expression data to determine which
genes are co-expressed. Further
analysis could take a variety of
directions: one 2004 study
analyzed the
promoter sequences of
co-expressed (clustered together)
genes to find common
regulatory elements and used
machine learning techniques to
identify the promoter elements
involved in regulating each
cluster[2].
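The co-expression step can be sketched with a very simple clustering rule: genes whose expression profiles correlate strongly across conditions are placed in the same group. The single-linkage grouping, the 0.9 threshold, the gene names, and the expression values below are all made up for illustration; real studies use more robust clustering methods and normalized data.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def coexpression_clusters(profiles: dict[str, list[float]],
                          threshold: float = 0.9) -> list[list[str]]:
    """Group genes linked by a correlation of at least `threshold`."""
    genes = list(profiles)
    cluster_of = {g: i for i, g in enumerate(genes)}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(profiles[g], profiles[h]) >= threshold:
                old, new = cluster_of[h], cluster_of[g]
                for k in cluster_of:          # merge the two clusters
                    if cluster_of[k] == old:
                        cluster_of[k] = new
    clusters: dict[int, list[str]] = {}
    for g in genes:
        clusters.setdefault(cluster_of[g], []).append(g)
    return sorted(clusters.values())


# Hypothetical expression levels across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.1, 2.0],
}
print(coexpression_clusters(profiles))
```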
Protein expression analysis
Protein
microarrays and high
throughput (HT)
mass spectrometry (MS) can
provide a snapshot of the proteins
present in a biological sample.
Bioinformatics is very much
involved in making sense of
protein microarray and HT MS data;
the former involves a number of
the same problems involved in
examining microarrays targeted at
mRNA, the latter involves the
bioinformatics problem of matching
MS data against protein sequence
databases.
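One common way to match MS data against a sequence database is peptide mass fingerprinting: each database protein is digested in silico, and the theoretical peptide masses are compared with the observed ones. The sketch below is a simplified illustration. It assumes trypsin-like cleavage (after K or R unless followed by P), uses rounded monoisotopic residue masses that should be checked against a reference table before any real use, and takes its protein names and sequences from a hypothetical database.

```python
# Approximate monoisotopic residue masses in daltons (rounded; an assumption).
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein: str) -> list[str]:
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO_MASS[aa] for aa in peptide) + WATER


def match_masses(observed: list[float], database: dict[str, str],
                 tol: float = 0.01) -> dict[str, int]:
    """Count, per protein, how many observed masses match a theoretical peptide."""
    hits = {}
    for name, seq in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(seq)]
        hits[name] = sum(any(abs(m - t) <= tol for t in theoretical)
                         for m in observed)
    return hits


database = {"prot1": "GAKRPSKLL", "prot2": "MMMMK"}  # hypothetical entries
observed = [peptide_mass("GAK"), peptide_mass("LL")]  # simulated measurements
print(match_masses(observed, database))
```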
Analysis of mutations in
cancer
Massive sequencing efforts are
currently underway to identify
point mutations in a variety
of
genes in
cancer. The sheer volume of
data produced requires automated
systems to read sequence data, and
to compare the sequencing results
to the known sequence of the
human genome, including known
germline polymorphisms.
Oligonucleotide microarrays,
including
comparative genomic hybridization
and
single nucleotide polymorphism
arrays, which can simultaneously
probe up to several
hundred thousand sites throughout
the genome, are being used to
identify chromosomal gains and
losses in cancer.
Hidden Markov model and
change-point analysis methods
are being developed to infer real
copy number changes from often
noisy data. Further informatics
approaches are being developed to
understand the implications of
lesions found to be recurrent
across many tumors.
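The simplest version of change-point analysis looks for a single position along the genome where the mean signal shifts, for example from a normal copy number to a gain. The sketch below picks the split that minimizes the summed squared error of the two segments. It is a minimal illustration with invented log-ratio values, not the hidden Markov model or multi-change-point methods used in practice.

```python
def best_change_point(values: list[float]) -> tuple[int, float, float]:
    """Find the split index that best divides `values` into two flat segments.

    Returns (index, left_mean, right_mean), where the left segment is
    values[:index] and the right segment is values[index:].
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")

    def sse(segment: list[float]) -> float:
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    best = min(range(1, n), key=lambda k: sse(values[:k]) + sse(values[k:]))
    left, right = values[:best], values[best:]
    return best, sum(left) / len(left), sum(right) / len(right)


# Hypothetical noisy log ratios along a chromosome: no change, then a gain.
log_ratios = [0.1, -0.1, 0.0, 0.05, -0.05, 0.9, 1.1, 1.0, 0.95]
print(best_change_point(log_ratios))
```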
Structure prediction
Main article:
Protein structure prediction
Protein structure prediction is
another important application of
bioinformatics. The
amino acid sequence of a
protein, the so-called primary
structure, can be easily
determined from the sequence on
the gene that codes for it. In the
vast majority of cases, this
primary structure uniquely
determines a structure in its
native environment. (Of course,
there are exceptions, such as the
bovine spongiform encephalopathy
- aka Mad Cow Disease - prion.)
Knowledge of this structure is
vital in understanding the
function of the protein. For lack
of better terms, structural
information is usually classified
as one of
secondary,
tertiary and
quaternary structures. A
viable general solution to such
predictions remains an open
problem. As of now, most efforts
have been directed towards
heuristics that work most of the
time.
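Deriving the primary structure from a coding DNA sequence is a direct table lookup, as the following sketch shows. It assumes the standard genetic code and a sequence that is already in the correct reading frame with introns removed; the function name is illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate coding DNA into one-letter amino acids, stopping at a stop codon.

    Codons containing unexpected characters are reported as 'X'.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAATAG"))
```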
One of the key ideas in
bioinformatics research is the
notion of
homology. In the genomic
branch of bioinformatics, homology
is used to predict the function of
a gene: if the sequence of gene
A, whose function is known, is
homologous to the sequence of gene
B, whose function is
unknown, one could infer that B
may share A's function. In the
structural branch of
bioinformatics, homology is used to
determine which parts of the
protein are important in structure
formation and interaction with
other proteins. In a technique
called homology modelling, this
information is used to predict the
structure of a protein once the
structure of a homologous protein
is known. This currently remains
the only way to predict protein
structures reliably.
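In practice, homology between two sequences is usually inferred from how well they align. The sketch below implements a basic global alignment (the Needleman-Wunsch dynamic programming approach) with a simple scoring scheme and reports percent identity. The scoring values and sequences are illustrative assumptions, and a high identity is only evidence for homology, not proof of it.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> tuple[str, str, int]:
    """Globally align `a` and `b`; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Share of alignment columns holding identical, non-gap characters."""
    if not aligned_a:
        return 0.0
    same = sum(p == q and p != "-" for p, q in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


top, bottom, total = needleman_wunsch("GATTACA", "GATACA")
print(top)
print(bottom)
print(total, round(percent_identity(top, bottom), 1))
```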
One example is the structural
similarity between
hemoglobin in humans and the
hemoglobin in legumes (leghemoglobin).
Both proteins serve the same
purpose of transporting oxygen in
their respective organisms. Although
their amino acid sequences differ
greatly, their
protein structures are virtually
identical, which reflects their
near-identical purposes.
Other techniques for predicting
protein structure include protein
threading and de novo (from
scratch) physics-based modeling.
See also
structural motif and
structural domain.
Preserving biodiversity
Bioinformatics is often used
for preserving
biodiversity. The most
important information collected
includes
species names, descriptions,
distributions, status and size of
populations,
habitat needs, and how each
organism interacts with other
species. This information is
compiled in computer
databases, accessed with
software programs that find,
visualize, and analyze it
automatically, and,
most importantly, communicated to
other people, especially over the
internet.
DNA sequences of
endangered species can be
preserved, and names and
descriptions of specimens living
in captivity are stored in order
to allow as much access to the
information needed to preserve
biodiversity as possible.
An example of this application
is the Species 2000 project[3].
It is an internet-based global
research project which intends to
provide information about every
known species of
plant,
animal,
fungus, and
microbe in existence to be the
foundation for studies of global
biodiversity. Anyone in the world
will be able to find vast
information about any known
species from an array of
participating databases.
Modeling biological systems
Main article:
Systems biology
Systems biology involves the
use of
computer simulations of
cellular subsystems (such as
the networks of metabolites and
enzymes which comprise
metabolism,
signal transduction pathways
and
gene regulatory networks) to
both analyze and visualize the
complex connections of these
cellular processes.
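A small simulation of this kind might model a single gene whose protein represses its own transcription. The sketch below integrates two ordinary differential equations (for mRNA and protein) with the forward Euler method. The model form and every parameter value are illustrative assumptions, not measurements from any real system.

```python
def simulate_autorepressor(steps: int = 5000, dt: float = 0.01,
                           alpha: float = 10.0, K: float = 1.0, n: int = 2,
                           beta: float = 1.0, delta_m: float = 1.0,
                           delta_p: float = 1.0) -> list[tuple[float, float, float]]:
    """Simulate a self-repressing gene; return a list of (time, mRNA, protein).

    dm/dt = alpha / (1 + (p / K)**n) - delta_m * m   (repressed transcription)
    dp/dt = beta * m - delta_p * p                   (translation and decay)
    """
    m, p = 0.0, 0.0
    trajectory = [(0.0, m, p)]
    for step in range(1, steps + 1):
        dm = alpha / (1 + (p / K) ** n) - delta_m * m
        dp = beta * m - delta_p * p
        m += dt * dm  # forward Euler update
        p += dt * dp
        trajectory.append((step * dt, m, p))
    return trajectory


t, m, p = simulate_autorepressor()[-1]
print(f"t={t:.1f}  mRNA={m:.3f}  protein={p:.3f}")
```

With these default values the system settles at a steady state where mRNA and protein levels both equal 2, since that is the solution of 10 / (1 + p^2) = p.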
Artificial life or virtual
evolution attempts to understand
evolutionary processes via the
computer simulation of simple
(artificial) life forms.
Other applications
Morphometrics is used to
analyze pictures of
embryos to track and to
predict the fate of cell clusters
during
morphogenesis.
Software tools
The computational biology tool
best-known among biologists is
probably
BLAST, an algorithm for
searching large sequence (protein,
DNA) databases.
NCBI provides a popular
implementation that searches their
massive sequence databases.
Computer scripting languages
such as
Perl and
Python are often used to
interface with
biological databases and
parse output from
bioinformatics programs.
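A typical scripting task of this kind is reading sequences in FASTA format, a plain-text format in which each record starts with a header line beginning with ">". The sketch below uses only the Python standard library rather than a package such as BioPython; the function names and the example records are illustrative.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {identifier: sequence}.

    The identifier is the first word after '>' on each header line.
    """
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def gc_content(seq: str) -> float:
    """Fraction of bases in `seq` that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


example = ">seq1 example record\nATGC\nGGCC\n>seq2\nAATT\n"
for name, seq in parse_fasta(example).items():
    print(name, len(seq), gc_content(seq))
```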
Bioinformatic meta search
engines (Entrez,
Bioinformatic Harvester) help
find relevant information from
several databases.
Communities of bioinformatics
programmers have set up
free/open source projects such
as
EMBOSS,
Bioconductor,
BioPerl,
BioPython,
BioRuby, and
BioJava which develop and
distribute shared programming
tools and objects (as program
modules) that make bioinformatics
easier.
Notes & references
- [1] Ensembl Genome Browser
- [2] Beer MA, Tavazoie S.
"Predicting gene expression from
sequence." Cell. 2004 Apr
16;117(2):185-98.
- [3] Species 2000