Title: Motif Refinement using Hybrid Expectation Maximization Algorithm
1. Motif Refinement using Hybrid Expectation Maximization Algorithm
Chandan Reddy, Yao-Chung Weng, Hsiao-Dong Chiang
School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853
2. Motif Finding Problem
- Motifs are patterns in DNA and protein sequences that are strongly conserved, i.e. they have important biological functions such as gene regulation and gene interaction.
- Finding these conserved patterns can be very useful for controlling the expression of genes.
- The motif finding problem is to detect novel, over-represented, unknown signals in a set of sequences (e.g. transcription factor binding sites in a genome).
3. Motif Finding Problem
Consensus pattern: CCGATTACCGA
An (l, d) = (11, 2) consensus pattern: length l = 11, with up to d = 2 mutations per instance.
4. Problem Definition
- Without any previous knowledge of the consensus pattern, discover all instances (alignment positions) of the motifs, and then recover the final pattern to which all these instances are within a given number of mutations.
5. Complexity of the Problem
Let n be the length of each DNA sequence, l the length of the motif, t the number of sequences, and d the number of mutations in a motif.
Running time of a brute-force approach: there are (n - l + 1) l-mers in each of the t sequences, so there are (n - l + 1)^t combinations of l-mers across the t sequences. Typically, n is much larger than l, e.g. n ≈ 600, t ≈ 20.
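As a quick sanity check on the size of this search space, using the slide's example values (all numbers here are just the illustrative n, l, t above):

```python
# Size of the brute-force search space for the motif finding problem.
# Parameter values taken from the slide's example: n ~ 600, l = 11, t = 20.
n, l, t = 600, 11, 20

lmers_per_sequence = n - l + 1                 # l-mers in one sequence of length n
total_combinations = lmers_per_sequence ** t   # one l-mer chosen per sequence

print(lmers_per_sequence)   # 590
print(total_combinations)   # 590**20, roughly 2.6e55 combinations
```

This is why exhaustive search is hopeless and sampling/refinement methods are needed.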
6. Existing methodologies
- Generative probabilistic representation (continuous)
  - Gibbs Sampling
  - Expectation Maximization
  - Greedy CONSENSUS
  - HMM based
- Mismatch representation (discrete consensus)
  - Projection Methods
  - Multiprofiler
  - Suffix Trees
7. Existing methodologies
- Global solvers
  - Advantage: explore the neighborhood of globally optimal solutions.
  - Disadvantage: miss better solutions locally.
  - e.g. Random Projection, Pattern Branching, etc.
- Local solvers
  - Advantage: return the best solution in a neighborhood.
  - Disadvantage: rely heavily on initial conditions.
  - e.g. EM, Gibbs Sampling, Greedy CONSENSUS, etc.
8. Our Approach
- Perform a global solver to estimate the neighborhood of a promising solution (Random Projection).
- Using this neighborhood as the initial guess, apply a local solver to refine the solution toward the global optimum (Expectation Maximization).
- Perform an efficient neighborhood search to jump out of the convergence region and find other local solutions systematically.
- This hybrid approach combines the advantages of both global and local solvers.
9. Random Projection
- Implements a hash function h(x) that maps each l-mer onto a k-dimensional space.
- Hashes all possible l-mers in the t sequences into 4^k buckets, where each bucket corresponds to a unique k-mer.
- Imposing certain conditions and setting a reasonable bucket threshold S, the buckets that exceed S are returned as the solution.
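The bucketing step can be sketched as follows; `project_lmers`, the toy sequences, and the threshold value are illustrative assumptions, not the authors' code:

```python
import random
from collections import defaultdict

def project_lmers(sequences, l, k, seed=0):
    """Hash every l-mer onto k randomly chosen positions (its k-mer key).

    Sketch of the random-projection idea on the slide: the k selected
    positions define the hash h(x); each bucket is labeled by a unique k-mer.
    """
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))  # the projection h(x)
    buckets = defaultdict(list)
    for s_idx, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)  # k-mer bucket label
            buckets[key].append((s_idx, start))
    return positions, buckets

seqs = ["ACGTACGTACGTACG", "TTACGTACGTACGAA"]
positions, buckets = project_lmers(seqs, l=8, k=4)
# Buckets whose size exceeds a threshold S become candidate motif neighborhoods.
S = 3
candidates = {key: hits for key, hits in buckets.items() if len(hits) >= S}
```

Each surviving bucket gives a set of (sequence, offset) pairs used to seed the EM refinement.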
10. Expectation Maximization
- Expectation Maximization is a local optimal solver with which we refine the solution yielded by the random projection step. The EM method iteratively updates the solution until it converges to a locally optimal one.
- Steps:
  - Compute the scoring function.
  - Iterate the Expectation step and the Maximization step.
11. Profile Space
A profile is a matrix of probabilities, where the rows represent possible bases and the columns represent consecutive sequence positions.
- Applying the coefficient formula to the profile space constructs the PSSM.
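A minimal profile/PSSM construction, using a simple pseudocount as a stand-in for the slide's coefficient formula (which is not shown here):

```python
import numpy as np

def build_pssm(instances, pseudocount=0.5):
    """Build a profile matrix from aligned motif instances (l-mers).

    Rows are bases A, C, G, T; columns are motif positions, as on the slide.
    The pseudocount is a hypothetical smoothing choice, not the authors'
    exact coefficient formula.
    """
    bases = "ACGT"
    l = len(instances[0])
    counts = np.full((4, l), pseudocount)
    for inst in instances:
        for pos, ch in enumerate(inst):
            counts[bases.index(ch), pos] += 1
    return counts / counts.sum(axis=0)  # each column sums to 1

profile = build_pssm(["CCGATT", "CCGAAT", "CCGCTT"])
```

Each column is a probability distribution over the four bases at that motif position.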
12. Scoring function - Maximum Likelihood
13. Expectation Step
- The Expectation step returns the expected number of the jth residue at each position of the motif instance and over the whole sequence. The algorithm is as follows:
  - Obtain θ_{k,j} from the previous M-step iteration.
  - Use θ_{k,j} to calculate the probability of all possible l-mers against the expected motif.
  - Given the probability of each l-mer, calculate the probability of the correct starting position for each l-mer using Bayes' formula.
  - Multiplying this weight into each position of each l-mer, calculate the expected number of residue j at position k.
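The four steps above can be sketched in code; the function name, the likelihood-ratio form, and the uniform toy model are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def e_step(sequences, theta, theta0, l):
    """One Expectation step, following the four bullets on the slide.

    theta  : 4 x l motif profile (rows A, C, G, T), playing the role of theta_{k,j}
    theta0 : length-4 background distribution, playing the role of theta_{0,j}
    Returns expected base counts per motif position and the posterior
    start-position weights.
    """
    bases = "ACGT"
    E = np.zeros((4, l))
    posteriors = []
    for seq in sequences:
        starts = range(len(seq) - l + 1)
        # Step 2: likelihood ratio of each l-mer under motif vs. background.
        lik = np.array([
            np.prod([theta[bases.index(seq[s + j]), j]
                     / theta0[bases.index(seq[s + j])] for j in range(l)])
            for s in starts
        ])
        z = lik / lik.sum()  # Step 3: Bayes posterior of each start position.
        posteriors.append(z)
        for s, w in zip(starts, z):  # Step 4: weight counts into E.
            for j in range(l):
                E[bases.index(seq[s + j]), j] += w
    return E, posteriors

# Tiny usage with a uniform model (hypothetical numbers): every start is
# equally likely, so each posterior weight is 1 / (n - l + 1).
theta = np.full((4, 4), 0.25)
theta0 = np.full(4, 0.25)
E, posteriors = e_step(["ACGTACGT"], theta, theta0, l=4)
```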
14. Maximization Step
- The Maximization step receives the expected values passed on by the E-step, calculates the new probabilities θ_{k,j} and θ_{0,j}, and returns them to the E-step:
  θ^(q)_{k,j} = E_{k,j} / t,  θ^(q)_{0,j} = E_{0,j} / (t(n - l))
- If θ^(q) = θ^(q-1), the iteration ends. All the locally optimal solution sites are returned, with the consensus made up of the residue j having the highest probability at each position k.
- Otherwise, θ^(q)_{k,j} and θ^(q)_{0,j} are used in the (q+1)th iteration of the E-step.
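The update and stopping rule above translate almost directly into code; the count matrices `E` and `E0` are assumed to come from an E-step, and the tolerance is a practical stand-in for the slide's exact-equality test:

```python
import numpy as np

def m_step(E, E0, t, n, l):
    """Maximization step from the slide:
    theta_{k,j}^(q) = E_{k,j} / t   and   theta_{0,j}^(q) = E_{0,j} / (t(n - l)).
    E (4 x l) and E0 (length 4) are expected counts from the E-step.
    """
    return E / t, E0 / (t * (n - l))

def converged(theta_new, theta_old, tol=1e-9):
    # The slide stops when theta^(q) equals theta^(q-1); numerically we
    # stop when the largest entrywise change falls below a small tolerance.
    return np.max(np.abs(theta_new - theta_old)) < tol

# Hypothetical counts: t = 4 sequences of length n = 10, motif length l = 3.
theta, theta0 = m_step(np.ones((4, 3)), np.ones(4), t=4, n=10, l=3)
```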
15. (No transcript; figure slide)
16. Basic Idea
- One-to-one correspondence of the critical points:
  Local Minimum  ↔  Stable Equilibrium Point
  Saddle Point   ↔  Decomposition Point
  Local Maximum  ↔  Source
17. Theoretical Background
Practical Stability Boundary: the problem of finding all the tier-1 stable equilibrium points of x_s is the problem of finding all the decomposition points on its stability boundary.
18. Theoretical background
Theorem (Unstable manifold of a type-1 equilibrium point): Let x_s1 be a stable e.p. of the gradient system (2) and x_d a type-1 e.p. on the practical stability boundary ∂A_p(x_s). Assume there exist ε and δ such that ‖∇f(x)‖ > ε unless x ∈ {x : ∇f(x) = 0}. If every e.p. of (1) is hyperbolic and its stable and unstable manifolds satisfy the transversality condition, then there exists another stable e.p. x_s2 to which the one-dimensional unstable manifold of x_d converges.
Our method finds the stability boundary between the two local minima and traces it to find the saddle point. We use a new trajectory-adjustment procedure to move along the practical stability boundary.
19. Definitions
Def 1: x̄ is said to be a critical point of (1) if it satisfies ∇f(x̄) = 0, where f(x) is the objective function, assumed to be in C²(ℝⁿ, ℝ). The corresponding nonlinear dynamical system is
  dx/dt = f(x)  -------- Eq. (1)
The solution curve of Eq. (1) starting from x at time t = 0 is called a trajectory and is denoted Φ(x, ·) : ℝ → ℝⁿ. A state vector x̄ is called an equilibrium point (e.p.) of Eq. (1) if f(x̄) = 0.
20. Definitions (contd.)
Def 2: An equilibrium point is said to be hyperbolic if the Jacobian of f at that point has no eigenvalues with zero real part. A hyperbolic e.p. is a
- Stable e.p. if all the eigenvalues of its Jacobian have negative real part;
- Unstable e.p. if some eigenvalues have positive real part;
- Type-k e.p. if its Jacobian has exactly k eigenvalues with positive real part.
We propose to build a negative gradient system associated with (1), as shown below:
  dx/dt = -∇f(x)  -------- Eq. (2)
21. Definitions (contd.)
A dynamical system is completely stable if every trajectory of the system leads to one of its stable equilibrium points.
Def 3: The stability region (or region of attraction) of a stable equilibrium point x_s of the nonlinear dynamical system (1) is denoted by A(x_s) and is
  A(x_s) = { x ∈ ℝⁿ : lim_{t→∞} Φ(x, t) = x_s }
The boundary of the stability region is called the stability boundary of x_s and is represented as ∂A(x_s).
22. Definitions (contd.)
Def 4: The practical stability region of a stable equilibrium point x_s of the nonlinear dynamical system (1), denoted A_p(x_s), is the interior of the closure of A(x_s). The practical stability boundary (∂A_p(x_s)) is a subset of the stability boundary; it eliminates the complex portion of the stability boundary that has no contact with the complement of the closure of the stability region.
Def 5: A decomposition point is a type-1 equilibrium point x_d on the practical stability boundary of a stable equilibrium point x_s.
23. Theoretical background
Theorem 1 (Unstable manifold of a type-1 equilibrium point): Let x_s1 be a stable e.p. of the gradient system (2) and x_d a type-1 e.p. on the practical stability boundary ∂A_p(x_s). Assume there exist ε and δ such that ‖∇f(x)‖ > ε unless x ∈ {x : ∇f(x) = 0}. If every e.p. of (1) is hyperbolic and its stable and unstable manifolds satisfy the transversality condition, then there exists another stable e.p. x_s2 to which the one-dimensional unstable manifold of x_d converges.
Our method finds the stability boundary between the two local minima and traces it to find the saddle point. We use a new trajectory-adjustment procedure to move along the practical stability boundary.
24. Our Method
25. Search Directions
26. Search Directions
27. Our Method
- The exit-point method is implemented so that EM can move out of its convergence region to seek other locally optimal solutions:
- Construct a PSSM from the initial alignments.
- Calculate the eigenvectors of the Hessian matrix.
- Find exit points (or saddle points) along each eigenvector.
- Apply EM from the new stability/convergence region.
- Repeat from the first step.
- Return the max score A, a1i, a2j.
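The exit-point search along Hessian eigenvectors can be illustrated on a toy smooth objective. Everything here (the function name, the double-well objective, the fixed step size) is a hypothetical stand-in; the paper applies the idea to the EM likelihood with a trajectory-adjustment procedure, not this simple line search:

```python
import numpy as np

def exit_point_search(f, hess, x_min, step=0.05, max_steps=200):
    """From a local minimum x_min, walk along each Hessian eigenvector
    (both directions) until f starts decreasing again: an approximate
    exit point out of the current convergence region. A local solver
    (EM in the paper) can then be restarted beyond each exit point.
    """
    exits = []
    _, vecs = np.linalg.eigh(hess(x_min))
    for i in range(vecs.shape[1]):
        for direction in (vecs[:, i], -vecs[:, i]):
            x, prev = x_min.copy(), f(x_min)
            for _ in range(max_steps):
                x = x + step * direction
                val = f(x)
                if val < prev:            # value turned downhill: crossed a ridge
                    exits.append(x.copy())
                    break
                prev = val
    return exits

# Toy double-well: minima near (-1, 0) and (1, 0), saddle at the origin.
f = lambda x: (x[0]**2 - 1)**2 + x[1]**2
hess = lambda x: np.array([[12 * x[0]**2 - 4, 0.0], [0.0, 2.0]])
exits = exit_point_search(f, hess, np.array([-1.0, 0.0]))
```

Starting from the minimum at (-1, 0), only the eigendirection toward the other well yields an exit point, found just past the saddle at the origin.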
28. Results
29. Improvements in the Alignment Scores
30. Improvements in the Alignment Scores
Random Projection method results
31. Performance Coefficient
K is the set of residue positions of the planted motif instances, and P is the corresponding set of predicted positions; the performance coefficient is |K ∩ P| / |K ∪ P|.
32. Results
Different Motifs and the average score using
random starts. The first tier and second tier
improvements on synthetic data.
33. Results
Different Motifs and the average score using
random projection. The first tier and second tier
improvements on synthetic data.
34. Results
Different Motifs and the average score using
random projections and the first tier and second
tier improvements on real human sequences.
35. Results on Real data
36. Concluding discussion
- Using a dynamical-system approach, we have shown that the EM algorithm can be improved significantly.
- In the context of motif finding, there are many locally optimal solutions, so it is important to search the neighborhood space.
- Try different global methods and other techniques such as GibbsDNA.
37. Questions and suggestions!