Title: Principles of Sequence Alignments
1Principles of Sequence Alignments
Jens Kleinjung National Institute for Medical
Research London
Cagliari 26.09.2008
2Overview
- Why?
- Inheritance, Evolution
- DNA and Culture
- DNA .dat or .bin ?
- Diseases
- When?
- Divergence, Homology
- Convergence
- Inferences
- How?
- Reality v. Observation
- Alignment Model
- Information and Noise
- Extended Model
- Word Matching
- Dynamic Programming
- Multiple Alignment
- HMMs
- Structure Alignment
31. Sequence Alignment: WHY?
This section provides the larger context
within which sequence alignment is applied.
4Thanks
Some slides are authored by Dr. Cedric
Notredame or my colleagues from the VU Amsterdam.
51.1 Inheritance, Evolution
6Wrong Model of Evolution
Before Charles Darwin, evolutionary history
was thought to be a linear process.
7Correct Model of Evolution
But in fact, it is a tree based on progressive
divergence.
81.2 DNA and Culture
Texts have been used in human culture for cultural
inheritance for thousands of years. With the
sequencing of the human genome, molecular inheritance
became available in text form. This formalisation
opens the possibility of deciphering the book of
life.
91.3 DNA .dat or .bin ?
1.3.1. The DNA stores the sequences of our
genes. In this respect the DNA behaves like data
(.dat).
1.3.2. The DNA has regulatory areas that control
the expression of genes. In this respect the
DNA behaves like a program (.bin).
1.3.3. DNA has a temporal dimension. Consider
that caterpillar and butterfly are phenotypes
of the same genome.
101.4 Diseases
1.4.1. Each disease can be traced back to a
molecular cause. To ultimately understand
diseases, we need to know the underlying
molecular mechanisms and ideally their effect at
the level of the cell, organ and organism.
1.4.2. Bioinformatics is mostly
concerned with the molecular level, Medicine
mostly with organs and the organism. Systems
Biology is a new field that bridges the different
levels.
112. Sequence Alignment: WHEN?
This section specifies the situations in which
sequence alignments help in understanding molecular
relationships.
122.1 Divergence and Homology
Divergent evolution is the rule.
...PANEEFRITTATA...
homologous sequences B C
...PA-EEFRRRITTATA...
protein pan in species B
13Comments to Slide 2.1
2.1.1. Mutations of the DNA occur constantly.
Mutations in somatic cells remain localised
(although in the case of cancer they can spread),
while mutations in the germline will be contained
in the child generation and may be fixed in the
population.
2.1.2. Within each species the
exchange of genetic material keeps the average
sequence distance short. Example: Despite
geographic separation there is only one human
species.
14Comments to Slide 2.1
2.1.3. If two groups of a species stop exchanging
genes, their genomes will diverge up to a point
where their genetic material becomes incompatible
for reproduction.
2.1.4. Speciation event: The
point in time when two groups of a species have
diverged such that they cannot produce viable
offspring together. Because species cannot mix
genes with each other, mutations gradually
increase the sequence distance over time.
2.1.5. Species definition: Two species cannot produce
viable offspring together.
15Comments to Slide 2.1
Example: Horse and Donkey
16Comments to Slide 2.1 Sequence Alignment
Pairwise alignment of homologous sequences.
Insert gaps (-) to bridge INDEL events.
Sequence identity: 11/15 = 73%.
One can assume homology if the sequence identity
is > 30%.
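The identity figure above can be reproduced with a few lines of Python. This is a minimal sketch: the two aligned rows are hypothetical, chosen only so that they give 11 identical columns out of 15, and the choice to count gap columns in the denominator is an assumption, since the slide does not state its convention.

```python
def sequence_identity(row_a, row_b):
    """Return (identical columns, total columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column counts as identical only if both residues match and are not gaps.
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return identical, len(row_a)

# Hypothetical aligned rows (not taken from the original slide).
row_a = "PANEEFR--ITTATA"
row_b = "PA-EEFRRRIATATA"

identical, columns = sequence_identity(row_a, row_b)
print(f"{identical}/{columns} = {100 * identical / columns:.0f}%")  # 11/15 = 73%
```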
17Alignment and Homology
2.1.6. A sequence alignment matches amino acid
or nucleotide residues that are evolutionarily
related.
2.1.7. Therefore, sequence alignment
only makes sense for homologous sequences!
2.1.8. The sequence space is so
immense (20^200 for a protein with 200 residues)
that the probability of a significant similarity
between unrelated (non-homologous) sequences is
close to zero.
2.1.9. The only exceptions are
relatively short segments of convergent evolution.
182.2 Convergence
Convergent evolution is the exception.
2.2.1. Convergent evolution happens when the
same selection pressure generates the same result
in two unrelated organisms. Example: The
similarity of swift and swallow (Italian: rondone
and rondine).
19Convergence in Molecules
202.3 Inferences
Knowing the sequence, one can often infer
other molecular properties.
2.3.1. Structure: Certain sequence motifs are
strongly correlated with the molecular
conformation, for example (Pro-Hyp-Gly)x forms
the collagen triple helix.
2.3.2. Function: Functional sites such as Ca2+
binding sites and catalytic triads have
characteristic sequence motifs (patterns).
2.3.3. Interaction: Binding
between proteins is often associated with
sequence patterns.
21Scheme of Mutual Information in Molecular
Evolution
Slide by Cedric Notredame
223. Sequence Alignment: HOW?
This section describes the most important and
most frequently used techniques for sequence
comparison.
233.1 Alignment Model
243.2 Reality versus Observation Minkowski space
in Biology
253.3 Information and Noise
3.3.1. Pairs of identical or similar amino acid
residues carry the information needed to construct
an alignment.
3.3.2. In terms of alignability,
mutations, in particular INDEL events, represent
noise, because the original signal (amino acid
residue) is lost.
263.4 Extended Model
273.5 Word Matching
3.5.1. Exact word matching: Unix grep, suffix trees.
3.5.2. Approximate word matching (pattern
matching): Unix grep, pattern matching syntax in
Perl or C.
3.5.3. Fast word matching with
precise statistics: BLAST.
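The word matching ideas listed here can be sketched in Python. This is an illustrative sketch only: the function names `build_word_index` and `word_hits` are hypothetical, the word length k = 3 is an assumption, and the lookup table stands in for the seeding step of a BLAST-like search without any of its statistics.

```python
import re
from collections import defaultdict

def build_word_index(seq, k):
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def word_hits(query, index, k):
    """Return (query position, target position) pairs of exact word matches."""
    hits = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            hits.append((q, t))
    return hits

target = "PANEEFRITTATA"
index = build_word_index(target, 3)
print(word_hits("EEFRRR", index, 3))       # [(0, 3), (1, 4)]

# Approximate (pattern) matching: '.' accepts any residue at that position.
print(re.search(r"PA.EEF", target).start())  # 0
```

Hits that lie on the same diagonal (here query offset 0 and 1 against target offset 3 and 4) are the ones a seed-and-extend method would try to join into a longer match.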
283.6 Dynamic Programming
Dynamic Programming is an iterative algorithm
with quadratic time complexity and quadratic
memory usage that generates the optimal pairwise
alignment given a scoring scheme, i.e. a
substitution matrix and gap penalties.
29Global and Local Alignment
Global Alignment (Needleman-Wunsch): highest-scoring
path from top left to bottom right.
Local Alignment (Smith-Waterman): highest-scoring
segment with cell scores > 0.
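Local alignment can be sketched as follows. This is a minimal Smith-Waterman sketch under stated assumptions: the scoring values (match +1, mismatch -1, linear gap -2) are hypothetical and replace the substitution matrix and affine gap penalties a real program would use.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Cell scores are floored at 0, so a segment can start anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(smith_waterman("GAGGCGA", "GAGTGA"))  # (3, 'GAG', 'GAG')
```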
30Amino Acid Substitution Matrix
Substituting amino acid A with B:
Score(A,B) = log [ p(AB) / (p(A) p(B)) ]
p(AB): probability of the aligned amino acid pair
AB in trusted alignments.
p(A) p(B): probability of observing AB in a random
alignment (background probability).
This type of score (log [p(observed) / p(random)])
is a log-odds ratio. Its average over all aligned
pairs is the relative entropy, which is related
to the mutual information.
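The log-odds score can be written as a one-line function. This is a sketch: the probabilities used below are made-up numbers for illustration, and the base-2 logarithm is an assumption (published matrices use various bases and scaling factors).

```python
import math

def log_odds(p_ab, p_a, p_b, base=2.0):
    """Log-odds substitution score: log( p(AB) / (p(A) p(B)) )."""
    return math.log(p_ab / (p_a * p_b), base)

# Hypothetical probabilities, not taken from any published matrix.
print(round(log_odds(0.02, 0.1, 0.1), 3))    # pair seen twice as often as random: 1.0
print(round(log_odds(0.005, 0.1, 0.1), 3))   # pair seen half as often as random: -1.0
```

A positive score means the pair is observed more often in trusted alignments than expected by chance; a negative score means it is observed less often.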
31Amino Acid Substitution Matrix
Similar amino acids (coloured blocks) often
have positive substitution scores: the
substitution occurs more often in proteins than
expected at random.
32Gap Penalties
One can envisage sequence alignment as placing
gaps at the correct place. A gap is the model of
an INDEL event. There are many types of INDEL
events, but usually only one gap penalty
parametrisation. The affine gap penalty scheme,
with a high gap-open (go) penalty and a low
gap-extension (ge) value, is common.
Score(gap) = go + l * ge, with l = gap length.
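The affine scheme can be illustrated numerically. This is a sketch assuming the additive form go + l * ge; the default penalty values are hypothetical.

```python
def affine_gap_score(length, gap_open=-10.0, gap_extend=-1.0):
    """Score of a single gap of the given length: go + l * ge."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + length * gap_extend

print(affine_gap_score(1))      # -11.0
print(affine_gap_score(3))      # -13.0
print(3 * affine_gap_score(1))  # -33.0
```

With a high gap-open and a low gap-extension value, one gap of length 3 costs much less than three separate gaps of length 1, which reflects the idea that a single INDEL event can insert or delete several residues at once.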
33The DP Algorithm Initialise
Task: align GAGGCGA with GAGTGA!
DP algorithm
DP alignment matrix
34The DP Algorithm First Column and Row
Procedure: Fill the DP matrix using the DP
algorithm!
35The DP Algorithm Neighbour Cell Scores
Cell scores shown on the slide: 1, -4, -4
36The DP Algorithm The First Step
Cell scores shown on the slide: -3, -6, -1
37The DP Algorithm and so on ...
Cell scores shown on the slide: -3, -8, -3
38The DP Algorithm The Complete DP Matrix
393.6 Dynamic Programming
GA-GTGA
GAGGCGA

GAGT-GA
GAGGCGA
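The fill and traceback steps of the DP algorithm can be sketched as a global (Needleman-Wunsch) alignment of the two example sequences. The scoring values here (match +1, mismatch -1, linear gap -2) are assumptions and need not match the ones used on the slides, so the cell scores will differ from those shown above.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (optimal global score, aligned a, aligned b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and row: leading gaps only.
    for i in range(1, n + 1):
        H[i][0] = i * gap
    for j in range(1, m + 1):
        H[0][j] = j * gap
    # Each cell takes the best of its diagonal, upper and left neighbour.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
    # Trace back from bottom right to top left.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if H[i][j] == H[i - 1][j - 1] + s:
                top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
                continue
        if i > 0 and H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return H[n][m], "".join(reversed(top)), "".join(reversed(bottom))

score, row_a, row_b = needleman_wunsch("GAGGCGA", "GAGTGA")
print(score)
print(row_a)
print(row_b)
```

Several alignments can share the optimal score; which one the traceback returns depends on the order in which the three neighbours are tested.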
403.6 Pairwise Alignment Quality
Vogt et al., JMB 249, 816-831,1995
413.7 Multiple Alignment
Multiple Alignment extends the concept of
pairwise sequence alignment to > 2 sequences. The
aim is to align all evolutionarily related amino
acid or nucleotide residues in the same column.
42Multiple DP
In principle one could perform multi-dimensional
Dynamic Programming, but that becomes very slow
for many sequences. Complexity: l^n, with l =
sequence length and n = number of sequences.
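A short calculation shows how quickly the l^n cost grows. This is a rough sketch that counts only the number of matrix cells; the sequence length of 200 is an arbitrary example value.

```python
def dp_cells(length, n_sequences):
    """Approximate number of cells in an n-dimensional DP matrix (l^n)."""
    return length ** n_sequences

for n in (2, 3, 4, 5):
    print(n, dp_cells(200, n))
```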
43A Multiple Sequence Alignment
44Progressive Alignment
45Profiles from Guide Tree
46Comments to Progressive Alignment
3.7.1. The idea is that the pairwise alignments
between the closest (highest-scoring)
sequences have the fewest errors.
3.7.2. Errors in the pairwise alignments
will not be corrected in the progressive phase!
3.7.3. At later stages in the scheme one
needs to score and align sequences against
profiles and profiles against profiles.
47Improvement of Alignment Quality
3.7.4. Use consistency of matching
(transitivity): if residue A matches B and B
matches C, does A also match C? All the
top-performing multiple alignment programs use
consistency scores.
3.7.5. Use profiles to
enhance positional information. Before the actual
sequence alignment, collect all homologues from
the database and use the profile for the
alignment instead of the single sequence.
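The transitivity check can be sketched with pairwise alignments stored as residue-index mappings. This is a toy illustration only: the function name `consistent_triplets` and the three mappings are hypothetical, and real consistency-based programs turn such agreement into weighted scores rather than a simple count.

```python
def consistent_triplets(ab, bc, ac):
    """Count residues of A whose match in C via B agrees with the direct match.

    Each argument maps a residue index in the first sequence to the index
    of its aligned residue in the second sequence.
    """
    agree = total = 0
    for i, j in ab.items():
        if j in bc and i in ac:
            total += 1
            agree += int(bc[j] == ac[i])
    return agree, total

# Hypothetical pairwise alignments of three sequences A, B and C.
ab = {0: 0, 1: 1, 2: 3}
bc = {0: 0, 1: 2, 3: 3}
ac = {0: 0, 1: 1, 2: 3}

print(consistent_triplets(ab, bc, ac))  # (2, 3)
```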
483.8 Hidden Markov Models
3.8.1. Hidden Markov Models are similar to finite
state automata.
3.8.2. The HMM system
represents a Markov chain with states and
transitions between states. Walking along the
Markov chain, each state emits a character with a
given (hidden) probability and each transition
leads to a new state with a given (hidden)
probability.
3.8.3. To determine the
probabilities, a HMM has to be trained on
representative data.
49Profile HMM
Assuming that the HMM above has been trained on
the multiple alignment (profile) of a protein
family, the square boxes represent the observed
substitutions in alignment column 1, 2, 3, 4 and
the arrows between them the associated transition
probabilities.
50A HMM Example
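Probability of the path:
0.4 x 0.3 x 0.46 x 0.6 x 0.97 x 0.5 x 0.015 x 0.73 x 0.01 x 1 = 1.76 x 10^-6
In log space (natural logarithm):
log(0.4) + log(0.3) + log(0.46) + log(0.6) + log(0.97)
+ log(0.5) + log(0.015) + log(0.73) + log(0.01)
+ log(1) = -13.25
http://compbio.soe.ucsc.edu/ismb99.handouts/KK185FP.html#hmm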
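The two numbers on this slide can be checked directly. The sketch below multiplies the ten listed probabilities and, alternatively, sums their natural logarithms; summing logs is the usual way to avoid numerical underflow for long sequences.

```python
import math

# Emission and transition probabilities along the path, as listed on the slide.
probs = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

path_probability = math.prod(probs)
log_probability = sum(math.log(p) for p in probs)

print(f"{path_probability:.2e}")  # 1.76e-06
print(f"{log_probability:.2f}")   # -13.25
```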
51What Can We Do With A HMM?
3.8.4. Compute the probability of a sequence.
3.8.5. Generate sequences according to
the HMM parametrisation.
3.8.6. Compute the
probability of a sequence alignment using the
Viterbi algorithm (most probable path).
3.8.7. Given 3.8.6., decide whether a sequence
belongs to a protein family.
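The Viterbi algorithm mentioned in 3.8.6 can be sketched for a small HMM. Everything about the model below is an assumption made for illustration: the two state names, the two-letter alphabet and all probabilities are invented, and a real profile HMM has match, insert and delete states per alignment column.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log-probability)."""
    # V[t][s]: log-probability of the best path that ends in state s at step t.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r] + math.log(trans_p[r][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, V[-1][last]

# Hypothetical two-state model; all numbers are made up.
states = ("conserved", "variable")
start_p = {"conserved": 0.6, "variable": 0.4}
trans_p = {"conserved": {"conserved": 0.8, "variable": 0.2},
           "variable": {"conserved": 0.3, "variable": 0.7}}
emit_p = {"conserved": {"A": 0.9, "G": 0.1},
          "variable": {"A": 0.5, "G": 0.5}}

obs = "AAGG"
path, logp = viterbi(obs, states, start_p, trans_p, emit_p)
print(path, round(logp, 3))
```

Comparing this log-probability with the one obtained under a background model is one way to decide whether a sequence belongs to the family the HMM was trained on.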
523.9 Structure Alignment
Structure alignment is a computationally hard
problem. Residue matches are not independent (as
in sequence alignment) because of the rigid
3D-structure.
53Possible Alignment Schemes
3.9.1. Align in all possible orientations (6D
search space). Infeasible for most structure
pairs.
3.9.2. Use a coarse-grained
representation (grid) and align in all possible
orientations (reduced 6D search space). Limited
to few pairwise comparisons.
3.9.3. Create
optimal sub-solutions (fragment matches) and
assemble these into a near-optimal total solution.
The search space is approximately n^2, with n =
number of fragments.
54A Structure Alignment Scheme
553.10 Phylogeny
This section explains basic aspects of phylogeny.
563.10 Phylogeny
Phylogenetic tree by Charles Darwin in the
Origin of Species.
57Phylogenetic tree (unrooted)
Trees are binary. Leaves are observed taxonomical
units. Edge (branch) lengths are proportional to
evolutionary distance.
58Phylogenetic tree (rooted)
59How to Root a Tree
- Outgroup: place root between distant sequence
  and rest group.
- Midpoint: place root at midpoint of longest path
  (sum of branches between any two observed
  taxonomical units (OTUs)).
- Gene duplication: place root between paralogous
  gene copies.
60Combinatorics of Trees
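The slide content is not preserved here. As a hedged illustration of why tree combinatorics matters, the standard counts for strictly binary trees with n labelled leaves are (2n-5)!! unrooted and (2n-3)!! rooted topologies. The sketch below assumes these formulas apply to the trees discussed in this section.

```python
def double_factorial(k):
    """k!! = k * (k-2) * (k-4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def unrooted_trees(n_leaves):
    """Number of unrooted binary tree topologies, n_leaves >= 3."""
    return double_factorial(2 * n_leaves - 5)

def rooted_trees(n_leaves):
    """Number of rooted binary tree topologies, n_leaves >= 2."""
    return double_factorial(2 * n_leaves - 3)

for n in (4, 5, 10):
    print(n, unrooted_trees(n), rooted_trees(n))
```

The numbers grow so fast that exhaustive evaluation of all trees is only possible for a handful of sequences.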
61Distances on a Tree
One assumes that the species distance
is proportional to the sequence distance. The
distance (in evolutionary time) is represented by
the horizontal branch length.
62Phylogenetic Distance Computation
Parsimony: fewest number of evolutionary events
(mutations); relatively often fails to
reconstruct the correct phylogeny.
Distance-based: pairwise sequence distances.
Maximum Likelihood: L(ikelihood) =
P(robability)(Tree | Data).
Probability obtained with the Bayesian method:
P(Tree | Data) = P(Data | Tree) P(Tree) / P(Data).
Algorithm: Markov chain Monte Carlo (MCMC).
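The Bayes relation P(Tree | Data) = P(Data | Tree) P(Tree) / P(Data) can be illustrated for a handful of candidate trees. This is a toy sketch: the tree names, likelihood values and uniform prior are invented, and P(Data) is obtained by summing over the listed candidates only, which is feasible here but not for realistic tree spaces (hence MCMC sampling).

```python
def tree_posteriors(likelihoods, priors):
    """Posterior P(Tree | Data) for each candidate tree via Bayes' rule."""
    joint = {t: likelihoods[t] * priors[t] for t in likelihoods}
    p_data = sum(joint.values())  # P(Data), summed over the candidates
    return {t: joint[t] / p_data for t in joint}

# Hypothetical values of P(Data | Tree) for three candidate topologies.
likelihoods = {"tree1": 0.02, "tree2": 0.01, "tree3": 0.01}
priors = {"tree1": 1 / 3, "tree2": 1 / 3, "tree3": 1 / 3}

posteriors = tree_posteriors(likelihoods, priors)
for name, p in posteriors.items():
    print(name, round(p, 2))
```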
63Tree Confidence Intervals
Bayesian method: Compute the probability of the
tree given the data and compare it to the
probability of other well-fitting trees.
Distance method (bootstrap): Select multiple
alignment columns with replacement and recalculate
the tree. Compare branches with the original
(target) tree. Repeat 100-1000 times, i.e.
calculate 100-1000 different trees, and derive
confidence intervals for internal nodes.
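The resampling step of the bootstrap can be sketched as follows. This covers only the column selection with replacement; the tree recalculation and branch comparison are left out, and the small alignment is a hypothetical example.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (same number of columns)."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in cols) for row in alignment]

# Hypothetical multiple alignment with three sequences and seven columns.
alignment = ["PANEEFR",
             "PA-EEFR",
             "PANDEFK"]

rng = random.Random(0)  # fixed seed so that the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate would then be passed to the same tree-building method as the original alignment, and the fraction of replicate trees containing a given branch serves as its support value.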
643.11 Practical Tips
3.11.1. Biological information is noisy. Use as
much information as you can to draw conclusions
or to build models (HMM instead of Pairwise
Alignment).
3.11.2. Be aware of the limitations
of the methods and combine methods to overcome
their limitations. Example: Use secondary
structure prediction together with sequence
alignment.
3.11.3. Use the hierarchical
organisation of biomolecules to dissect the
problem into subproblems. Example: Are there
gaps in secondary structure elements?
653.12 References
This section lists some resources for
sequence alignment.
66Substitution Matrices
PAM: Original substitution matrix based on a Markov
process of amino acid substitution.
BLOSUM: Uses the substitutions observed in
conserved blocks of multiple sequence alignments.
GONNET: Modern version of PAM matrices.
67Alignment Programs
Pairwise Alignment: Lalign (local alignment),
SSEARCH (rigorous Smith-Waterman local alignment).
Multiple Alignment: T-Coffee (consistency-based
alignment), Praline (homology-enhanced alignment).
Structure Alignment: TMalign (pairwise),
MAMMOTH (multiple).
68Databases
There is a database for nearly everything.
NR: The non-redundant sequence database of the
NCBI.
PFAM: Collection of HMMs of protein families.
PDB: The database of protein and
nucleic acid structures.
GO: The Gene Ontology links different levels of
biological information together.
694. Sequence Alignment: What Else?
This section lists some of the (numerous)
alignment techniques and applications that have
not been treated in detail in this presentation.
704.1 Protein Alignment
4.1.1 Repeats
4.1.2 Biphasic gap schemes
4.1.3 Divide-and-Conquer
714.2 RNA / DNA Alignment
4.2.1 Suffix Trees
4.2.2 Segment Alignment
4.2.3 Genome Alignment
724.3 Prediction and Alignment
4.3.1 Secondary Structure Prediction (70-80%
correct)
4.3.2 Functional-site Prediction
4.3.3 3D-Structure Prediction (not yet generally
applicable; see the folding problem)