Title: Introduction to bioinformatics Lecture 3
1Introduction to bioinformatics: Lecture 3
High-throughput biological data: the data deluge,
bioinformatics algorithms, and evolution
2Last lecture
- Many different genomics datasets:
- Genome sequencing: more than 300 species
completely sequenced and data in public domain
(i.e. information is freely available); a virus
genome can be sequenced in a day
- Gene expression (microarray) data: many
microarrays measured per day
- Proteomics: Protein Data Bank (PDB) - as of
Tuesday February 07, 2006 there are 35026
structures. http://www.rcsb.org/pdb/
- Protein-protein interaction data: many databases
worldwide
- Metabolic pathway, regulation and signaling data:
many databases worldwide
3Growth in number of protein tertiary structures
4The data deluge
- Although a lot of tertiary structural data is
being produced (preceding slide), there is the
- SEQUENCE-STRUCTURE-FUNCTION GAP
- The gap between sequence data on the one hand,
and structure or function data on the other, is
widening rapidly: sequence data grows much faster
5High-throughput Biological Data: The data deluge
- Hidden in all these data classes is information
that reflects
- the existence, organization, activity and
functionality of biological machineries at
different levels in living organisms
Utilising and analysing this information
computationally, as effectively as possible, is
essential for bioinformatics
6Data issues: from data to distributed knowledge
- Data collection: getting the data
- Data representation: data standards, data
normalisation ..
- Data organisation and storage: database issues
..
- Data analysis and data mining: discovering
knowledge, patterns/signals, from data,
establishing associations among data patterns
- Data utilisation and application: from data
patterns/signals to models for bio-machineries
- Data visualization: viewing complex data
- Data transmission: data collection, retrieval,
..
7Bio-Data Analysis and Data Mining
- Analysis and mining tools exist and are developed
for
- DNA sequence assembly
- Genetic map construction
- Sequence comparison and database searching
- Gene finding
- Gene expression data analysis
- Phylogenetic tree analysis, e.g. to infer
horizontally-transferred genes
- Mass spectrometry data analysis for protein
complex characterization
8Bio-Data Analysis and Data Mining
- As the amount and types of data and their cross
connections increase rapidly,
- the number of analysis tools needed will go up
exponentially if we do not reuse techniques
- blast, blastp, blastx, blastn, from the BLAST
family of tools (we will cover BLAST later)
- gene finding tools for human, mouse, fly, rice,
cyanobacteria, ..
- tools for finding various signals in genomic
sequences, protein-binding sites, splice junction
sites, translation start sites, ..
9Bio-Data Analysis and Data Mining
- Many of these data analysis problems are
fundamentally the same problem(s) and can be
solved using the same set of tools
- e.g.
- clustering or
- optimal segmentation by Dynamic Programming
- We will cover both of these techniques in later
lectures
10Bio-data Analysis, Data Mining and Integrative
Bioinformatics
To have analysis capabilities covering a wide
range of problems, we need to discover the common
fundamental structures of these problems.
HOWEVER, in biology one size does NOT fit all.
An important goal of bioinformatics is the
development of a data analysis infrastructure in
support of genomics and beyond.
11Protein structure hierarchical levels
12Protein complexes for photosynthesis in plants
13Protein folding problem
Each protein sequence "knows" how to fold into
its tertiary structure. We still do not
understand exactly how and why.
Figure labels: SECONDARY STRUCTURE (helices,
strands), 1-step process, 2-step process,
TERTIARY STRUCTURE (fold)
The 1-step process is based on a hydrophobic
collapse; the 2-step process, more common in
forming larger proteins, is called the framework
model of folding.
14Protein folding: a step on the way is secondary
structure prediction
- Long history -- first widely used algorithm was
by Chou and Fasman (1974)
- Different algorithms have been developed over the
years to crack the problem
- Statistical approaches
- Neural networks (first from speech recognition)
- K-nearest neighbour algorithms
- Support Vector machines
15Algorithms in bioinformatics (recap)
- Sometimes the same basic algorithm can be re-used
for different problems (1-method-multiple-problem)
- Normally, biological problems are approached by
different researchers using a variety of methods
(1-problem-multiple-method)
16Algorithms in bioinformatics
- string algorithms
- dynamic programming
- machine learning (Neural Networks, k-Nearest
Neighbour, Support Vector Machines, Genetic
Algorithm, ..)
- Markov chain models, hidden Markov models,
Markov Chain Monte Carlo (MCMC) algorithms
- molecular mechanics, e.g. molecular dynamics,
Monte Carlo, simplified force fields
- stochastic context free grammars
- EM algorithms
- Gibbs sampling
- clustering
- tree algorithms
- text analysis
- hybrid/combinatorial techniques and more
17Sequence analysis and homology searching
18Finding genes and regulatory elements
There are many different regulation signals such
as start, stop and skip messages hidden in the
genome for each gene, but what and where are they?
19Expression data
20Functional genomics
Monte Carlo
21Protein translation
22What is life?
- NASA astrobiology program:
- "Life is a self-sustained chemical system
capable of undergoing Darwinian evolution"
23Evolution
- Four requirements:
- Template structure providing stability (DNA)
- Copying mechanism (DNA replication, e.g. in
meiosis)
- Mechanism providing variation (mutations;
insertions and deletions; crossing-over; etc.)
- Selection: some traits lead to greater fitness of
one individual relative to another. Darwin wrote
of "survival of the fittest"
Evolution is a conservative process: the vast
majority of mutations will not be selected (i.e.
will not make it, as they lead to worse
performance or are even lethal). This is called
negative (or purifying) selection.
24Orthology/paralogy
Orthologous genes are homologous (corresponding)
genes in different species. Paralogous genes are
homologous genes within the same species (genome).
25Changing molecular sequences
- Mutations: changing nucleotides ("letters")
within DNA, also called "point mutations"
- A, G: purines; C, T/U: pyrimidines
- Transition: purine -> purine or pyrimidine ->
pyrimidine
- Transversion: purine -> pyrimidine or pyrimidine
-> purine
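The transition/transversion rule above can be sketched in a few lines of Python. This is only an illustration and not part of the original lecture; the function name classify_substitution is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def classify_substitution(old_base, new_base):
    """Classify a single-nucleotide change as a transition or transversion."""
    old_base, new_base = old_base.upper(), new_base.upper()
    for base in (old_base, new_base):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a nucleotide: {base!r}")
    if old_base == new_base:
        return "none"
    # Same chemical class on both sides: transition.
    same_class = ({old_base, new_base} <= PURINES
                  or {old_base, new_base} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for pair in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
        print(pair, classify_substitution(*pair))
```

Of the twelve possible changes between the four DNA bases, four are transitions and eight are transversions.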
26Types of point mutation
- Synonymous mutation: mutation that does not lead
to an amino acid change (where in the codon are
these expected?)
- Non-synonymous mutation: does lead to an amino
acid change
- Missense mutation: one a.a. replaced by another
a.a.
- Nonsense mutation: a.a. replaced by stop codon
(what happens with protein?)
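As a hedged illustration of these categories, the sketch below classifies a single-base change in one codon using the standard genetic code. The helper name classify_point_mutation and the 0-based position argument are assumptions made for this example; the codon table is built from the standard code written in TCAG order.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_point_mutation(codon, position, new_base):
    """Classify replacing codon[position] (0-based) by new_base."""
    if CODON_TABLE[codon] == "*":
        raise ValueError("expected a sense codon, got a stop codon")
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"   # amino acid replaced by a stop codon
    return "missense"       # amino acid replaced by another amino acid


if __name__ == "__main__":
    print(classify_point_mutation("GGG", 2, "A"))  # third codon position
    print(classify_point_mutation("AAA", 1, "G"))  # second codon position
    print(classify_point_mutation("TGG", 2, "A"))  # creates TGA
```

Trying all three positions of a few codons is a quick way to explore the question of where in the codon synonymous changes are expected.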
27Ka/Ks Ratios
- Ks is defined as the number of synonymous
nucleotide substitutions per synonymous site
- Ka is defined as the number of nonsynonymous
nucleotide substitutions per nonsynonymous site
- The Ka/Ks ratio is used to estimate the type of
selection exerted on a given gene or DNA
fragment
- Aligned orthologous sequences are needed to
calculate Ka/Ks ratios (we will talk about
alignment later).
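To make the definitions of Ka and Ks concrete, here is a rough sketch in the spirit of the Nei-Gojobori counting approach. It is a simplification under stated assumptions, not the method used in the lecture: the two sequences are assumed to be in-frame, gap-free, equally long and upper-case DNA; codon pairs that differ at more than one position are skipped; a change to a stop codon is counted as nonsynonymous; and a Jukes-Cantor correction is applied. All function names are hypothetical.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_sites(codon):
    """Number of synonymous sites in a codon (between 0 and 3)."""
    synonymous_changes = 0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    synonymous_changes += 1
    return synonymous_changes / 3  # three possible changes per position


def count_sites_and_differences(seq_a, seq_b):
    """Return (syn_diffs, syn_sites, nonsyn_diffs, nonsyn_sites)."""
    syn_diffs = nonsyn_diffs = codons_used = 0
    syn_sites = 0.0
    for i in range(0, len(seq_a) - 2, 3):
        ca, cb = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(ca, cb))
        if n_diff > 1:
            continue  # simplification: ignore multiply-changed codons
        codons_used += 1
        syn_sites += (synonymous_sites(ca) + synonymous_sites(cb)) / 2
        if n_diff == 1:
            if CODON_TABLE[ca] == CODON_TABLE[cb]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    nonsyn_sites = 3 * codons_used - syn_sites
    return syn_diffs, syn_sites, nonsyn_diffs, nonsyn_sites


def jukes_cantor(p):
    """Correct an observed proportion of differences for multiple hits."""
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(seq_a, seq_b):
    """Return (Ka, Ks); fails if there are no synonymous sites or p >= 0.75."""
    sd, s, nd, n = count_sites_and_differences(seq_a, seq_b)
    return jukes_cantor(nd / n), jukes_cantor(sd / s)


if __name__ == "__main__":
    ka, ks = ka_ks("ATGTTTGGGAAA", "ATGTTCGGGAGA")
    print(f"Ka = {ka:.3f}, Ks = {ks:.3f}, Ka/Ks = {ka / ks:.3f}")
```

Real analyses use much longer alignments and dedicated software; with only a handful of codons the estimates are very noisy.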
28Ka/Ks ratios
The frequency of different values of Ka/Ks for
835 mouse-rat orthologous genes. Figures on the x
axis represent the middle figure of each bin;
that is, the 0.05 bin collects data from 0 to 0.1
29Ka/Ks ratios
- Three types of selection:
- 1. Negative (purifying) selection: Ka/Ks < 1
- 2. Neutral selection (Kimura): Ka/Ks = 1
- 3. Positive selection: Ka/Ks > 1
30Human Evolution
31Divergent Evolution
- Ancestral sequence: ABCD
- Descendant sequences: ACCD (B -> C, mutation)
and ABD (C -> ø, deletion)
- Pairwise alignment of the descendants:
    ACCD        ACCD
    AB-D   or   A-BD
32Evolution
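- Ancestral sequence: ABCD
- Descendant sequences: ACCD (B -> C, mutation)
and ABD (C -> ø, deletion)
- Pairwise alignment of the descendants:
    ACCD        ACCD
    AB-D   or   A-BD
- True alignment: ACCD / AB-D, which places the
mutation in column 2 and the deletion in column 3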
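The ambiguity between the two candidate alignments can be checked with a small dynamic-programming sketch. It assumes a simple scoring scheme (match +1, mismatch -1, gap -1) chosen only for this example; the function names are hypothetical and the lecture's own alignment method is covered later.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # prefix of a against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # prefix of b against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score[-1][-1]


def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Score one given gapped alignment column by column."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


if __name__ == "__main__":
    print("optimal score:", global_alignment_score("ACCD", "ABD"))
    print("ACCD / AB-D  :", score_alignment("ACCD", "AB-D"))
    print("ACCD / A-BD  :", score_alignment("ACCD", "A-BD"))
```

Under this scoring scheme both candidate alignments reach the optimal score, so the score alone cannot tell which one reflects the true evolutionary history.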
33Consequence of evolution
- Notion of comparative analysis (Darwin)
- What you know about one species might be
transferable to another, for example from mouse
to human
- Provides a framework to do the multi-level
large-scale analysis of the genomics data
plethora
34Flavodoxin-cheY Multiple Sequence Alignment
35Human
Yeast
We need to be able to do automatic pathway
comparison (pathway alignment).
This pathway diagram shows a comparison of
pathways in (left) Homo sapiens (human) and
(right) Saccharomyces cerevisiae (baker's yeast).
Changes in controlling enzymes (square boxes in
red) and the pathway itself have occurred (yeast
has one altered (overtaking) path in the graph).
36The citric-acid cycle
http://en.wikipedia.org/wiki/Krebs_cycle
37The citric-acid cycle
Fig. 1. (a) A graphical representation of the
reactions of the citric-acid cycle (CAC),
including the connections with pyruvate and
phosphoenolpyruvate, and the glyoxylate shunt.
When there are two enzymes that are not
homologous to each other but that catalyse the
same reaction (non-homologous gene displacement),
one is marked with a solid line and the other
with a dashed line. The oxidative direction is
clockwise. The enzymes with their EC numbers are
as follows: 1, citrate synthase (4.1.3.7); 2,
aconitase (4.2.1.3); 3, isocitrate dehydrogenase
(1.1.1.42); 4, 2-ketoglutarate dehydrogenase
(solid line; 1.2.4.2 and 2.3.1.61) and
2-ketoglutarate ferredoxin oxidoreductase (dashed
line; 1.2.7.3); 5, succinyl-CoA synthetase
(solid line; 6.2.1.5) or
succinyl-CoA:acetoacetate-CoA transferase (dashed
line; 2.8.3.5); 6,
succinate dehydrogenase or fumarate reductase
(1.3.99.1); 7, fumarase (4.2.1.2) class I (dashed
line) and class II (solid line); 8,
bacterial-type malate dehydrogenase (solid line)
or archaeal-type malate dehydrogenase (dashed
line) (1.1.1.37); 9, isocitrate lyase (4.1.3.1);
10, malate synthase (4.1.3.2); 11,
phosphoenolpyruvate carboxykinase (4.1.1.49) or
phosphoenolpyruvate carboxylase (4.1.1.32); 12,
malic enzyme (1.1.1.40 or 1.1.1.38); 13, pyruvate
carboxylase or oxaloacetate decarboxylase
(6.4.1.1); 14, pyruvate dehydrogenase (solid
line; 1.2.4.1 and 2.3.1.12) and pyruvate
ferredoxin oxidoreductase (dashed line; 1.2.7.1).
M. A. Huynen, T. Dandekar and P. Bork, "Variation
and evolution of the citric acid cycle: a genomic
approach", Trends Microbiol, 7, 281-29 (1999)
38The citric-acid cycle
(b) Individual species might not have a complete
CAC. This diagram shows the genes for the CAC for
each unicellular species for which a genome
sequence has been published, together with the
phylogeny of the species. The distance-based
phylogeny was constructed using the fraction of
genes shared between genomes as a similarity
criterion. The major kingdoms of life are
indicated in red (Archaea), blue (Bacteria) and
yellow (Eukarya). Question marks represent
reactions for which there is biochemical evidence
in the species itself or in a related species but
for which no genes could be found. Genes that lie
in a single operon are shown in the same color.
Genes were assumed to be located in a single
operon when they were transcribed in the same
direction and the stretches of non-coding DNA
separating them were less than 50 nucleotides in
length.
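The operon criterion described in this caption (same transcription direction, fewer than 50 non-coding nucleotides between neighbouring genes) can be sketched as below. The Gene record, the 1-based inclusive coordinates and the function name group_into_operons are assumptions made for this example, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def group_into_operons(genes, max_gap=50):
    """Group neighbouring same-strand genes separated by < max_gap nt."""
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            previous = operons[-1][-1]
            # Non-coding nucleotides between the two genes
            # (negative if the genes overlap).
            gap = gene.start - previous.end - 1
            if gene.strand == previous.strand and gap < max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


if __name__ == "__main__":
    example = [
        Gene("a", 1, 100, "+"),
        Gene("b", 120, 300, "+"),   # 19 nt after a, same strand
        Gene("c", 400, 500, "+"),   # 99 nt after b
        Gene("d", 510, 600, "-"),   # close to c, opposite strand
    ]
    for operon in group_into_operons(example):
        print([g.name for g in operon])
```

The gene names and coordinates in the usage example are made up purely to exercise the rule.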
M. A. Huynen, T. Dandekar and P. Bork, "Variation
and evolution of the citric acid cycle: a genomic
approach", Trends Microbiol, 7, 281-29 (1999)
39Thinking about evolution
- Is the evolutionary model applicable to other
systems?
- Story telling in old cultures
- Richard Dawkins' book The Selfish Gene
talks about "memes"
- The Genetic Algorithm (GA) is arguably one of
the best computational optimisation strategies
around, and is modelled directly on Darwinian
evolution
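A toy genetic algorithm shows how the evolutionary requirements (template, copying, variation, selection) map onto an optimisation loop. This is a minimal sketch on an assumed toy problem (maximise the number of 1s in a bit string); the function name evolve and all parameter values are illustrative choices, not recommendations.

```python
import random


def evolve(n_bits=20, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Evolve bit strings towards all 1s; return (best, best-fitness history)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    fitness = sum                      # fitness = number of 1 bits
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_generation = [population[0][:]]   # elitism: copy the best
        while len(next_generation) < pop_size:
            # Selection: the fitter of three random individuals is a parent.
            parent_1 = max(rng.sample(population, 3), key=fitness)
            parent_2 = max(rng.sample(population, 3), key=fitness)
            # Variation 1: one-point crossover.
            cut = rng.randrange(1, n_bits)
            child = parent_1[:cut] + parent_2[cut:]
            # Variation 2: point mutations (bit flips).
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_generation.append(child)
        population = next_generation
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = evolve()
    print("best fitness per generation:", history)
    print("best individual:", "".join(map(str, best)))
```

Because the best individual is always copied unchanged into the next generation, the best fitness in this sketch never decreases from one generation to the next.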