Multiple Alignment by profile HMM training and Phylogenetic Trees

Title: Probability Theory and Basic Alignment of String Sequences. Author: apstjhan. Last modified by: Nastya. Created Date: 11/17/2004 1:36:46 PM. Document presentation ...

1
Multiple Alignment by profile HMM training
and Phylogenetic Trees

Elze de Groot
Anastacia Berdnikova

2
Topics

Multiple alignment with known HMM
HMM training from unaligned sequences
Avoiding local maxima
Simulated annealing
Noise injection
Stochastic sampling traceback algorithm
Model surgery
Phylogenetic trees

3
Multiple alignment with known profile HMM

Multiple alignment and model known -> align a large
number of other family members
Calculate the Viterbi alignment for every sequence
Residues in the same match state are aligned in
columns
That's a difference between profile HMM and
traditional multiple alignment
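
The column rule above (residues in the same match state share a column) can be sketched as follows. This is a minimal illustration, not the slides' own code: it assumes the Viterbi paths have already been computed, and the function name `align_from_paths` and the `(kind, k)` path encoding are hypothetical. Inserted residues are written in lower case and padded with `.`, deletes are written as `-`.

```python
def align_from_paths(seqs, paths, M):
    """Turn per-sequence state paths into a multiple alignment.

    paths[s] is a list of (kind, k) pairs with kind in {"M", "I", "D"} and
    k the module index (insert state I0 precedes the first match state).
    """
    rows = []
    for seq, path in zip(seqs, paths):
        match = ["-"] * (M + 1)    # match[k] for k = 1..M; "-" marks a delete
        inserts = [""] * (M + 1)   # inserts[k] = residues emitted by insert state k
        pos = 0
        for kind, k in path:
            if kind == "M":
                match[k] = seq[pos].upper()
                pos += 1
            elif kind == "I":
                inserts[k] += seq[pos].lower()
                pos += 1
            # "D" emits nothing, so the column keeps its "-"
        rows.append((match, inserts))

    # Each insert region is as wide as the longest insertion seen there.
    width = [max(len(ins[k]) for _, ins in rows) for k in range(M + 1)]
    lines = []
    for match, inserts in rows:
        parts = [inserts[0].ljust(width[0], ".")]
        for k in range(1, M + 1):
            parts.append(match[k])
            parts.append(inserts[k].ljust(width[k], "."))
        lines.append("".join(parts))
    return lines


# Toy example with M = 3 match states (paths are made up for illustration).
seqs = ["FPH", "FAPH", "FH"]
paths = [
    [("M", 1), ("M", 2), ("M", 3)],
    [("M", 1), ("I", 1), ("M", 2), ("M", 3)],
    [("M", 1), ("D", 2), ("M", 3)],
]
alignment = align_from_paths(seqs, paths, 3)
```

No sequence is compared with any other sequence here; each one is placed against the model alone.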

4
Example

Model estimated from an alignment

5
Example continued

The most probable paths and alignment

6
Profile HMM training from unaligned sequences

Algorithm

7
Initial Model

Choose length of model
- M is number of match states
- set M to be the average length
Choose initial models carefully
Randomness in choice of initial model

8
Parameter Estimation

Use forward and backward variables to re-estimate
emission and transition probability parameters
Baum-Welch re-estimation can be replaced by
the Viterbi alternative

9
Forward algorithm

10
Backward algorithm

11
Baum-Welch re-estimation equations

Expected emission counts from sequence x

12
Baum-Welch re-estimation equations

Expected transition counts from sequence x
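
The expected emission and transition counts can be sketched for a small, generic HMM. This is a simplified illustration and not the profile HMM recursion of the slides: it assumes a plain HMM with an initial distribution `init`, no explicit End state and no silent delete states. All function names and the toy parameters are hypothetical.

```python
def forward(x, init, a, e):
    """f[i][k] = P(x[0..i], state at position i is k)."""
    K = len(init)
    f = [[init[k] * e[k][x[0]] for k in range(K)]]
    for i in range(1, len(x)):
        f.append([e[l][x[i]] * sum(f[i - 1][k] * a[k][l] for k in range(K))
                  for l in range(K)])
    return f


def backward(x, a, e):
    """b[i][k] = P(x[i+1..] | state at position i is k)."""
    K, L = len(a), len(x)
    b = [[1.0] * K for _ in range(L)]
    for i in range(L - 2, -1, -1):
        b[i] = [sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in range(K))
                for k in range(K)]
    return b


def expected_counts(x, init, a, e):
    """Return P(x), expected transition counts A and emission counts E."""
    K = len(init)
    f, b = forward(x, init, a, e), backward(x, a, e)
    px = sum(f[-1])
    # E[k][c]: expected number of times state k emits symbol c
    E = [dict.fromkeys(e[k], 0.0) for k in range(K)]
    for i, c in enumerate(x):
        for k in range(K):
            E[k][c] += f[i][k] * b[i][k] / px
    # A[k][l]: expected number of times the transition k -> l is used
    A = [[sum(f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l]
              for i in range(len(x) - 1)) / px
          for l in range(K)] for k in range(K)]
    return px, A, E


# Toy two-state model over the symbols "H" and "T".
init = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
e = [{"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}]
px, A, E = expected_counts("HHTH", init, a, e)

# Re-estimated transition probabilities: normalise each row of counts.
new_a = [[A[k][l] / sum(A[k]) for l in range(2)] for k in range(2)]
```

With several training sequences the counts would be summed over all of them before normalising.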

13
Avoiding local maxima

Baum-Welch is guaranteed to find a local maximum
It is not guaranteed to be anywhere near the global optimum
or a biologically reasonable solution
Reason: models are long -> many ways to reach a
wrong solution

14
Avoiding local maxima

Use a stochastic search algorithm
Commonly used: simulated annealing

15
Simulated annealing

Some compounds only crystallise if they are
slowly annealed from high to low temperature
Optimisation problem: minimise an energy function
E(x)
Maximising a function is the same as minimising the
negative of that function

16
Simulated annealing (2)

Temperature T
Probability of state x is given by the Gibbs
distribution: $P(x) = \frac{1}{Z} \exp(-E(x)/T)$
Partition function: $Z = \int \exp(-E(x)/T)\,dx$
x is usually multidimensional, so it is impossible to
calculate Z

17
Simulated annealing (3)

$T \to 0$: all configurations except those with the lowest energy
have probability 0 (system is frozen)
$T \to \infty$: all configurations have the same probability (system is
molten)
As with crystallisation, the minimum can be found by
sampling this distribution at a high temperature
first and then at gradually decreasing temperatures
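
The two temperature limits can be checked numerically on a toy problem. This is a minimal sketch assuming a finite set of states, so that Z is a plain sum; the function name `gibbs` and the energies are made up for illustration.

```python
import math


def gibbs(energies, T):
    """Return P(x) = exp(-E(x)/T) / Z for a finite list of energies."""
    e_min = min(energies)
    # Subtracting e_min rescales numerator and Z by the same factor,
    # which avoids overflow or underflow without changing P(x).
    weights = [math.exp(-(E - e_min) / T) for E in energies]
    Z = sum(weights)
    return [w / Z for w in weights]


energies = [1.0, 2.0, 4.0]
cold = gibbs(energies, 0.01)    # nearly frozen: mass on the lowest energy
hot = gibbs(energies, 1000.0)   # nearly molten: close to uniform
```

For a real model the state space is far too large to enumerate, which is why Z cannot be computed this way in general.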

18
Simulated annealing for HMM

Natural energy function: negative log of the
likelihood, $-\log P(\text{data} \mid \theta)$
Sampling this distribution is non-trivial; the two methods I'm going to mention
are approximations

19
Noise injection

Add noise to the counts estimated in the
forward-backward procedure and let the size of the noise
decrease slowly
In Krogh et al. (1994) the noise was generated by
a random walk in the initial model
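
The idea of shrinking noise can be sketched as below. This is only an illustration under loud assumptions: the noise here is uniform random noise with a geometric decay, which is not the random-walk scheme of Krogh et al., and the counts are held fixed instead of being re-estimated by forward-backward in every iteration. All names and numbers are hypothetical.

```python
import random


def noisy_counts(counts, noise_size, rng):
    """Add uniform noise in [0, noise_size) to every expected count."""
    return [c + noise_size * rng.random() for c in counts]


def normalise(counts):
    total = sum(counts)
    return [c / total for c in counts]


rng = random.Random(0)
counts = [8.0, 1.0, 1.0]   # stand-in for counts from forward-backward
noise_size = 5.0
schedule = []
for iteration in range(20):
    probs = normalise(noisy_counts(counts, noise_size, rng))
    schedule.append(probs)
    noise_size *= 0.7      # assumed decay rate; the noise slowly dies out
```

Early iterations give heavily perturbed parameters, while the last ones are close to the noise-free estimates.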

20
Simulated annealing Viterbi estimation

If there are N sequences, there's an exact
translation from the N paths $\pi_1, \ldots, \pi_N$ to the
parameters of the model
Treat the paths as fundamental parameters in
which to maximise the likelihood
Simulated annealing is done in these variables
instead of the model parameters

21
Simulated annealing Viterbi estimation

Denominator is Z, the partition function -> sum
over all paths
Can be obtained by a modified forward algorithm
using exponentiated transition and emission
parameters

22
Simulated annealing Viterbi estimation

Exponentiated transition parameter:
$\hat{a}_{ij} = a_{ij}^{1/T}$
Exponentiated emission parameter:
$\hat{e}_j(x) = e_j(x)^{1/T}$
Used in place of the unmodified probability
parameters in the forward algorithm
Z is the result of the forward algorithm
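
That the forward algorithm with exponentiated parameters yields Z can be checked on a toy model by comparing it with a brute-force sum over all paths. This is a minimal sketch assuming a plain HMM with an initial distribution and no End state, not a profile HMM; the function names and parameters are hypothetical.

```python
from itertools import product


def tempered_forward(x, init, a, e, T):
    """Forward recursion with every probability raised to the power 1/T."""
    K = len(init)
    p = 1.0 / T
    f = [(init[k] * e[k][x[0]]) ** p for k in range(K)]
    for c in x[1:]:
        f = [e[l][c] ** p * sum(f[k] * a[k][l] ** p for k in range(K))
             for l in range(K)]
    return sum(f)


def brute_force_Z(x, init, a, e, T):
    """Sum P(x, path)^(1/T) over every possible state path."""
    K = len(init)
    Z = 0.0
    for path in product(range(K), repeat=len(x)):
        prob = init[path[0]] * e[path[0]][x[0]]
        for i in range(1, len(x)):
            prob *= a[path[i - 1]][path[i]] * e[path[i]][x[i]]
        Z += prob ** (1.0 / T)
    return Z


init = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
e = [{"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}]
x = "HTHH"
Z_fast = tempered_forward(x, init, a, e, 2.0)
Z_slow = brute_force_Z(x, init, a, e, 2.0)
```

The brute-force sum grows exponentially with sequence length, while the forward recursion stays linear in it.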

23
Simulated annealing Viterbi estimation

Algorithm: Stochastic sampling traceback
algorithm for HMMs

Initialisation: $\pi_{L+1} = \text{End}$.
Recursion: for $L+1 \ge i \ge 1$,

24
Simulated annealing Viterbi vs Viterbi

Key difference:
Viterbi selects the most probable path for each
sequence
Simulated annealing samples each path according
to the likelihood of the path
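
The difference between selecting and sampling a path can be sketched on a toy model. This is an illustration under assumptions, not the slides' algorithm verbatim: it uses a plain HMM with an initial distribution and no End state, so the traceback starts from the last position instead of from an End state. Function names and parameters are hypothetical.

```python
import random


def forward_table(x, init, a, e, T):
    """Forward values computed with parameters raised to the power 1/T."""
    K, p = len(init), 1.0 / T
    f = [[(init[k] * e[k][x[0]]) ** p for k in range(K)]]
    for c in x[1:]:
        prev = f[-1]
        f.append([e[l][c] ** p * sum(prev[k] * a[k][l] ** p for k in range(K))
                  for l in range(K)])
    return f


def sample_path(x, init, a, e, T, rng):
    """Sample a path with probability proportional to P(x, path)^(1/T)."""
    K = len(init)
    f = forward_table(x, init, a, e, T)
    path = [rng.choices(range(K), weights=f[-1])[0]]
    for i in range(len(x) - 1, 0, -1):
        nxt = path[-1]
        weights = [f[i - 1][k] * a[k][nxt] ** (1.0 / T) for k in range(K)]
        path.append(rng.choices(range(K), weights=weights)[0])
    path.reverse()
    return path


def viterbi_path(x, init, a, e):
    """Return the single most probable path."""
    K = len(init)
    v = [init[k] * e[k][x[0]] for k in range(K)]
    back = []
    for c in x[1:]:
        ptr = [max(range(K), key=lambda k: v[k] * a[k][l]) for l in range(K)]
        v = [e[l][c] * v[ptr[l]] * a[ptr[l]][l] for l in range(K)]
        back.append(ptr)
    path = [max(range(K), key=lambda k: v[k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


init = [0.5, 0.5]
a = [[0.9, 0.1], [0.1, 0.9]]
e = [{"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}]
x = "HHHH"
best = viterbi_path(x, init, a, e)
cold_sample = sample_path(x, init, a, e, 0.05, random.Random(0))
warm_samples = [tuple(sample_path(x, init, a, e, 1.0, random.Random(s)))
                for s in range(200)]
```

At a low temperature the sampled path coincides with the Viterbi path in practice, while at T = 1 the samples spread over several paths.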

25
Model Surgery

While training a model, two things can happen:
(a) some match states are redundant and should be
absorbed in an insert state
(b) one or more insert states absorb too much
sequence, in which case they should be expanded

26
Model Surgery

How much is a certain transition used by the training
sequences?
Usage of a match state is the sum of counts for all
letters in the state

27
Model surgery

If a match state is used by fewer than ½ of the sequences
-> delete the module
If more than ½ of the sequences use the transitions
into an insert state, this is expanded to new
modules
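
The two thresholds can be sketched as a small decision function. This is a hedged illustration only: it assumes the usage counts have already been collected, it does not decide how many new modules an expanded insert state receives, and the names `surgery_plan`, `cut_match` and `cut_insert` are hypothetical (they are not SAM's API).

```python
def surgery_plan(match_usage, insert_usage, n_seqs,
                 cut_match=0.5, cut_insert=0.5):
    """Return (match states to delete, insert states to expand).

    match_usage[k-1] is the usage of match state k (k = 1..M);
    insert_usage[k] is the usage of insert state k (k = 0..M).
    """
    delete = [k for k, used in enumerate(match_usage, start=1)
              if used < cut_match * n_seqs]
    expand = [k for k, used in enumerate(insert_usage)
              if used > cut_insert * n_seqs]
    return delete, expand


# Made-up usage counts for 7 training sequences and 4 match states.
delete, expand = surgery_plan([7, 7, 3, 6], [0, 1, 5, 0, 2], n_seqs=7)
```

Lowering `cut_insert` makes the model grow more readily, since more insert states pass the threshold.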

28
Model surgery Example SAM

I tried a sequence in SAM with and without model
surgery
Same 7 sequences as in example before
Parameters: <cutinsert 0.25> <cutmatch 0.5> ->
delete any match state used by fewer than half
the sequences, and insert match states for any
insert node used by more than one quarter of
the sequences

29
Model surgery Example SAM

Without model surgery
>seq1
FPHFD.....L...S.....-HGSAQ
>seq2
FESFG.....D...LstpdaVMGNPK
>seq3
FDRFKhlkteA...E.....MKASED
>seq4
FTQFA.....G...Kdles.IKGTAP
>seq5
FPKFK.....G...LttadqLKKSAD
>seq6
FSFLK.....GtseV.....PQNNPE
>seq7
FGFSG.....A...-.....--SDPG

With model surgery
>seq1
FPHF.DLS-..-..--HGSAQ
>seq2
FESF.GDLStpD..AVMGNPK
>seq3
FDRF.KHLK..TeaEMKASED
>seq4
FTQFaGKDL..E..SIKGTAP
>seq5
FPKF.KGLTtaD..QLKKSAD
>seq6
FSFL.KGTS..E..VPQNNPE
>seq7
FGFS.G---..-..--ASDPG

30
Building phylogenetic trees

31
Overview

The tree of life description
Background on trees

32
Multiple alignment and trees

Alignment of sequences should take account of
their evolutionary relationship (Sankoff, Morel &
Cedergren, 1973).
Several progressive alignment algorithms use a
guide tree (to guide the clustering process).
So we begin by building trees.

33
The tree of life

The similarity of molecular mechanisms of the
organisms that have been studied strongly
suggests that all organisms on Earth had a common
ancestor. Thus any set of species is related,
and this relationship is called a phylogeny.
Usually the relationship can be represented by a
phylogenetic tree.

Zuckerkandl & Pauling's 1962 paper showed that
molecular sequences provide sets of morphological
characters that can carry a large amount of
information.
An assumption: the sequences we want to analyze
phylogenetically have descended from some
common ancestral gene in a common ancestral
species.
Gene duplication exists -> we have to check the
assumption carefully.

35
Gene duplication and speciation

By another mechanism, gene duplication, two
sequences can also be separated and diverge from
the common ancestor.
Genes which diverged because of speciation are
called orthologues. Genes which diverged by gene
duplication are called paralogues.

36
A tree of orthologues: alpha haemoglobins
HBA_ACCGE, HBA_AEGMO, HBA_AILFU, HBA_AILME,
HBA_ALCAA, HBA_ALLMI, HBA_AMBME, HBA_ANAPL
(SWISS-PROT).

37
A tree of paralogues: HBAT_HUMAN, HBAZ_HUMAN,
HBA_HUMAN, HBB_HUMAN, HBD_HUMAN, HBE_HUMAN,
HBG_HUMAN, MYG_HUMAN (SWISS-PROT).

38
Background on trees

All trees will be assumed to be binary (an edge
that branches splits into two daughter edges).
Each edge of the tree has a certain amount of
evolutionary divergence associated with it. We
adopt the general term "length", which will be
represented by the lengths of edges in figures.
A true biological phylogeny has a root, or
ultimate ancestor of all sequences.

39
Rooted and unrooted tree
40

A tree with a given labelling will be called a
labelled branching pattern.
We refer to this as the tree topology and denote
it by T.
The lengths of the edges are denoted $t_i$, with a suitable
numbering scheme for the i's.

41
Counting and labelling

Rooted tree:
n leaves, plus (n-1) branch nodes in addition to
leaves -> we have 2n-1 nodes in all, and 2n-2
edges.
Leaves are numbered 1..n, branch nodes n+1 .. 2n-1;
the (2n-1)th node is the root.

42
Counting and labelling

Unrooted tree:
n leaves, 2n-2 nodes and 2n-3 edges.
A root can be added at any of its edges -> we can
get 2n-3 rooted trees.

43
Number of rooted and unrooted trees

A root can be added at any edge, producing 2n-3
rooted trees from each unrooted tree -> there are
(2n-3) times as many rooted trees as unrooted
trees, for a given number n of leaves.

44
Instead of the root, we can add an extra edge or
branch with a distinct label in its leaf.
45

There are three such trees with four leaves, each with (2n-3) = 5 edges;
they are distinct labelled branching patterns.
There are then five ways of adding a further
branch labelled with a distinct label (5),
giving in all 3 × 5 = 15 unrooted trees with five
leaves.
The number of unrooted trees with n leaves is
equal to 3 · 5 · ... · (2n-5) = (2n-5)!! So, we have
(2n-3)!! rooted trees with n leaves.
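
The counting argument can be checked with a few lines of code. This is a small sketch with hypothetical function names; it compares the double-factorial formulas with the edge-insertion argument, in which each new leaf can be attached to any of the 2n-3 edges of a tree with n leaves.

```python
def double_factorial(k):
    """k!! = k * (k-2) * (k-4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted(n):
    """Number of unrooted labelled binary trees with n >= 3 leaves."""
    return double_factorial(2 * n - 5)


def count_rooted(n):
    """Number of rooted labelled binary trees with n >= 2 leaves."""
    return double_factorial(2 * n - 3)


def count_unrooted_by_insertion(n):
    """Build the count by adding one leaf at a time to any existing edge."""
    trees = 1                      # a single unrooted tree with 3 leaves
    for leaves in range(3, n):
        trees *= 2 * leaves - 3    # number of edges available for the new leaf
    return trees


table = [(n, count_unrooted(n), count_rooted(n)) for n in range(3, 11)]
```

Already for ten leaves there are more than two million unrooted trees, so exhaustive enumeration is only feasible for very small n.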

46
Building phylogenetic trees

Questions?

47
Exercise 7.2

The trees with three and four leaves in Figure
7.3 all have the same unlabelled branching
pattern. For both rooted and unrooted trees, how
many leaves do there have to be to obtain more
than one unlabelled branching pattern? Find a
recurrence relation for the number of rooted
trees. (Hint: consider the trees formed by
joining two trees at their root.)

48
Exercise 7.2

49
Exercise 7.3

All trees considered so far have been binary, but
one can envisage ternary trees that, in their
rooted form, have three branches descending from
a branch node. If there are m branch nodes in an
unrooted ternary tree, how many leaves are there
and how many edges?

50
Exercise 7.4

Consider next a composite unrooted tree with m
ternary branch nodes and n binary branch nodes.
How many leaves are there, and how many edges?
Let $N_{m,n}$ denote the number of distinct labelled
branching patterns of this tree. Extend the
counting argument for binary trees to show that
$N_{m,n} = (3m + 2n - 1)\,N_{m,n-1} + (n + 1)\,N_{m-1,n+1}$
(Hint: the first term after the "=" counts
the number of ways that a new edge can be added
to an existing edge, thereby creating an
additional binary node; the second term
corresponds to edges added at binary nodes,
thereby producing ternary nodes.)