Title: Motif Refinement using Hybrid Expectation Maximization Algorithm
1. Motif Refinement using Hybrid Expectation Maximization Algorithm
Chandan Reddy, Yao-Chung Weng, Hsiao-Dong Chiang
School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853
2. Motif Finding Problem
- Motifs are patterns in DNA and protein sequences that are strongly conserved, i.e. they have important biological functions such as gene regulation and gene interaction.
- Finding these conserved patterns can be very useful for controlling the expression of genes.
- The motif finding problem is to detect novel, over-represented, unknown signals in a set of sequences (e.g. transcription factor binding sites in a genome).
3. Motif Finding Problem
Consensus pattern: CCGATTACCGA
An (l, d) = (11, 2) consensus pattern: length l = 11, with up to d = 2 mutations per instance.
4. Problem Definition
- Without any previous knowledge of the consensus pattern, discover all instances (alignment positions) of the motifs, and then recover the final pattern to which all these instances are within a given number of mutations.
5. Complexity of the Problem
Let n be the length of each DNA sequence, l the length of the motif, t the number of sequences, and d the number of mutations in a motif.
Running time of a brute-force approach: there are (n - l + 1) l-mers in each of the t sequences, so there are (n - l + 1)^t combinations of l-mers across the t sequences. Typically, n is much larger than l, e.g. n ≈ 600, t ≈ 20.
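As a quick sanity check on the size of this search space, using the slide's example values (all numbers here are just the illustrative n, l, t above):

```python
# Size of the brute-force search space for the motif finding problem.
# Parameter values taken from the slide's example: n ~ 600, l = 11, t = 20.
n, l, t = 600, 11, 20

lmers_per_sequence = n - l + 1                 # l-mers in one sequence of length n
total_combinations = lmers_per_sequence ** t   # one l-mer chosen per sequence

print(lmers_per_sequence)   # 590
print(total_combinations)   # 590**20, roughly 2.6e55 combinations
```

This is why exhaustive search is hopeless and sampling/refinement methods are needed.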
6. Existing methodologies
- Generative probabilistic representation (continuous)
  - Gibbs Sampling
  - Expectation Maximization
  - Greedy CONSENSUS
  - HMM based
- Mismatch representation (discrete consensus)
  - Projection Methods
  - Multiprofiler
  - Suffix Trees
7. Existing methodologies
- Global solvers
  - Advantage: explore the neighborhood of globally optimal solutions.
  - Disadvantage: miss better solutions locally.
  - e.g. Random Projection, Pattern Branching, etc.
- Local solvers
  - Advantage: return the best solution in a neighborhood.
  - Disadvantage: rely heavily on initial conditions.
  - e.g. EM, Gibbs Sampling, Greedy CONSENSUS, etc.
8. Our Approach
- Perform a global solver to estimate the neighborhood of a promising solution (Random Projection).
- Using this neighborhood as the initial guess, apply a local solver to refine the solution toward the global optimum (Expectation Maximization).
- Perform an efficient neighborhood search to jump out of the convergence region and find other local solutions systematically.
- This hybrid approach combines the advantages of both global and local solvers.
9. Random Projection
- Implements a hash function h(x) that maps each l-mer onto a k-dimensional space.
- Hashes all possible l-mers in the t sequences into 4^k buckets, where each bucket corresponds to a unique k-mer.
- Imposing certain conditions and setting a reasonable bucket threshold S, the buckets that exceed S are returned as the solution.
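The bucketing step can be sketched as follows; `project_lmers`, the toy sequences, and the threshold value are illustrative assumptions, not the authors' code:

```python
import random
from collections import defaultdict

def project_lmers(sequences, l, k, seed=0):
    """Hash every l-mer onto k randomly chosen positions (its k-mer key).

    Sketch of the random-projection idea on the slide: the k selected
    positions define the hash h(x); each bucket is labeled by a unique k-mer.
    """
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))  # the projection h(x)
    buckets = defaultdict(list)
    for s_idx, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)  # k-mer bucket label
            buckets[key].append((s_idx, start))
    return positions, buckets

seqs = ["ACGTACGTACGTACG", "TTACGTACGTACGAA"]
positions, buckets = project_lmers(seqs, l=8, k=4)
# Buckets whose size exceeds a threshold S become candidate motif neighborhoods.
S = 3
candidates = {key: hits for key, hits in buckets.items() if len(hits) >= S}
```

Each surviving bucket gives a set of (sequence, offset) pairs used to seed the EM refinement.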
10. Expectation Maximization
- Expectation Maximization is a local optimal solver with which we refine the solution yielded by the random projection step. The EM method iteratively updates the solution until it converges to a locally optimal one.
- Steps:
  - Compute the scoring function.
  - Iterate the Expectation step and the Maximization step.
11. Profile Space
A profile is a matrix of probabilities, where the rows represent possible bases and the columns represent consecutive sequence positions.
- Applying the coefficient formula to the profile space constructs the PSSM.
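A minimal profile/PSSM construction, using a simple pseudocount as a stand-in for the slide's coefficient formula (which is not shown here):

```python
import numpy as np

def build_pssm(instances, pseudocount=0.5):
    """Build a profile matrix from aligned motif instances (l-mers).

    Rows are bases A, C, G, T; columns are motif positions, as on the slide.
    The pseudocount is a hypothetical smoothing choice, not the authors'
    exact coefficient formula.
    """
    bases = "ACGT"
    l = len(instances[0])
    counts = np.full((4, l), pseudocount)
    for inst in instances:
        for pos, ch in enumerate(inst):
            counts[bases.index(ch), pos] += 1
    return counts / counts.sum(axis=0)  # each column sums to 1

profile = build_pssm(["CCGATT", "CCGAAT", "CCGCTT"])
```

Each column is a probability distribution over the four bases at that motif position.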
12. Scoring function - Maximum Likelihood
13. Expectation Step
- The Expectation step returns the expected number of the jth residue at each position of the motif instance and over the whole sequence. The algorithm is as follows:
  - Obtain θ_{k,j} from the previous M-step iteration.
  - Use θ_{k,j} to calculate the probability of all possible l-mers against the expected motif.
  - Given the probability of each l-mer, calculate the probability of the correct starting position for each l-mer using Bayes' formula.
  - Multiplying this weight into each position of each l-mer, calculate the expected number of residue j at position k.
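The four steps above can be sketched in code; the function name, the likelihood-ratio form, and the uniform toy model are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def e_step(sequences, theta, theta0, l):
    """One Expectation step, following the four bullets on the slide.

    theta  : 4 x l motif profile (rows A, C, G, T), playing the role of theta_{k,j}
    theta0 : length-4 background distribution, playing the role of theta_{0,j}
    Returns expected base counts per motif position and the posterior
    start-position weights.
    """
    bases = "ACGT"
    E = np.zeros((4, l))
    posteriors = []
    for seq in sequences:
        starts = range(len(seq) - l + 1)
        # Step 2: likelihood ratio of each l-mer under motif vs. background.
        lik = np.array([
            np.prod([theta[bases.index(seq[s + j]), j]
                     / theta0[bases.index(seq[s + j])] for j in range(l)])
            for s in starts
        ])
        z = lik / lik.sum()  # Step 3: Bayes posterior of each start position.
        posteriors.append(z)
        for s, w in zip(starts, z):  # Step 4: weight counts into E.
            for j in range(l):
                E[bases.index(seq[s + j]), j] += w
    return E, posteriors

# Tiny usage with a uniform model (hypothetical numbers): every start is
# equally likely, so each posterior weight is 1 / (n - l + 1).
theta = np.full((4, 4), 0.25)
theta0 = np.full(4, 0.25)
E, posteriors = e_step(["ACGTACGT"], theta, theta0, l=4)
```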
14. Maximization Step
- The Maximization step receives the expected values passed on by the E-step, calculates the new probabilities θ_{k,j} and θ_{0,j}, and returns them to the E-step:
  θ^(q)_{k,j} = E_{k,j} / t,  θ^(q)_{0,j} = E_{0,j} / (t(n - l))
- If θ^(q) = θ^(q-1), the iteration ends. All the locally optimal solution sites are returned, with the consensus made up of the residue j having the highest probability at each position k.
- Otherwise, θ^(q)_{k,j} and θ^(q)_{0,j} are used in the (q+1)th iteration of the E-step.
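The update and stopping rule above translate almost directly into code; the count matrices `E` and `E0` are assumed to come from an E-step, and the tolerance is a practical stand-in for the slide's exact-equality test:

```python
import numpy as np

def m_step(E, E0, t, n, l):
    """Maximization step from the slide:
    theta_{k,j}^(q) = E_{k,j} / t   and   theta_{0,j}^(q) = E_{0,j} / (t(n - l)).
    E (4 x l) and E0 (length 4) are expected counts from the E-step.
    """
    return E / t, E0 / (t * (n - l))

def converged(theta_new, theta_old, tol=1e-9):
    # The slide stops when theta^(q) equals theta^(q-1); numerically we
    # stop when the largest entrywise change falls below a small tolerance.
    return np.max(np.abs(theta_new - theta_old)) < tol

# Hypothetical counts: t = 4 sequences of length n = 10, motif length l = 3.
theta, theta0 = m_step(np.ones((4, 3)), np.ones(4), t=4, n=10, l=3)
```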
15. (No transcript; figure slide)
16. Basic Idea
- One-to-one correspondence of the critical points:
  Local Minimum  ↔  Stable Equilibrium Point
  Saddle Point   ↔  Decomposition Point
  Local Maximum  ↔  Source
17. Theoretical Background
Practical Stability Boundary: the problem of finding all the tier-1 stable equilibrium points of x_s is the problem of finding all the decomposition points on its stability boundary.
18. Theoretical background
Theorem (Unstable manifold of a type-1 equilibrium point): Let x_s1 be a stable e.p. of the gradient system (2) and x_d a type-1 e.p. on the practical stability boundary ∂A_p(x_s). Assume there exist ε and δ such that ‖∇f(x)‖ > ε unless x ∈ {x : ∇f(x) = 0}. If every e.p. of (1) is hyperbolic and its stable and unstable manifolds satisfy the transversality condition, then there exists another stable e.p. x_s2 to which the one-dimensional unstable manifold of x_d converges.
Our method finds the stability boundary between the two local minima and traces it to find the saddle point. We use a new trajectory-adjustment procedure to move along the practical stability boundary.
19. Definitions
Def 1: x̄ is said to be a critical point of (1) if it satisfies ∇f(x̄) = 0, where f(x) is the objective function, assumed to be in C²(ℝⁿ, ℝ). The corresponding nonlinear dynamical system is
  dx/dt = f(x)  -------- Eq. (1)
The solution curve of Eq. (1) starting from x at time t = 0 is called a trajectory and is denoted Φ(x, ·) : ℝ → ℝⁿ. A state vector x̄ is called an equilibrium point (e.p.) of Eq. (1) if f(x̄) = 0.
20. Definitions (contd.)
Def 2: An equilibrium point is said to be hyperbolic if the Jacobian of f at that point has no eigenvalues with zero real part. A hyperbolic e.p. is a
- Stable e.p. if all the eigenvalues of its Jacobian have negative real part;
- Unstable e.p. if some eigenvalues have positive real part;
- Type-k e.p. if its Jacobian has exactly k eigenvalues with positive real part.
We propose to build a negative gradient system associated with (1), as shown below:
  dx/dt = -∇f(x)  -------- Eq. (2)
21. Definitions (contd.)
A dynamical system is completely stable if every trajectory of the system leads to one of its stable equilibrium points.
Def 3: The stability region (or region of attraction) of a stable equilibrium point x_s of the nonlinear dynamical system (1) is denoted by A(x_s) and is
  A(x_s) = { x ∈ ℝⁿ : lim_{t→∞} Φ(x, t) = x_s }
The boundary of the stability region is called the stability boundary of x_s and is represented as ∂A(x_s).
22. Definitions (contd.)
Def 4: The practical stability region of a stable equilibrium point x_s of the nonlinear dynamical system (1), denoted A_p(x_s), is the interior of the closure of A(x_s). The practical stability boundary (∂A_p(x_s)) is a subset of the stability boundary; it eliminates the complex portion of the stability boundary that has no contact with the complement of the closure of the stability region.
Def 5: A decomposition point is a type-1 equilibrium point x_d on the practical stability boundary of a stable equilibrium point x_s.
23. Theoretical background
Theorem 1 (Unstable manifold of a type-1 equilibrium point): Let x_s1 be a stable e.p. of the gradient system (2) and x_d a type-1 e.p. on the practical stability boundary ∂A_p(x_s). Assume there exist ε and δ such that ‖∇f(x)‖ > ε unless x ∈ {x : ∇f(x) = 0}. If every e.p. of (1) is hyperbolic and its stable and unstable manifolds satisfy the transversality condition, then there exists another stable e.p. x_s2 to which the one-dimensional unstable manifold of x_d converges.
Our method finds the stability boundary between the two local minima and traces it to find the saddle point. We use a new trajectory-adjustment procedure to move along the practical stability boundary.
24. Our Method
25. Search Directions
26. Search Directions
27. Our Method
- The exit-point method is implemented so that EM can move out of its convergence region to seek other locally optimal solutions:
- Construct a PSSM from the initial alignments.
- Calculate the eigenvectors of the Hessian matrix.
- Find exit points (or saddle points) along each eigenvector.
- Apply EM from the new stability/convergence region.
- Repeat from the first step.
- Return the max score A, a1i, a2j.
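The exit-point search along Hessian eigenvectors can be illustrated on a toy smooth objective. Everything here (the function name, the double-well objective, the fixed step size) is a hypothetical stand-in; the paper applies the idea to the EM likelihood with a trajectory-adjustment procedure, not this simple line search:

```python
import numpy as np

def exit_point_search(f, hess, x_min, step=0.05, max_steps=200):
    """From a local minimum x_min, walk along each Hessian eigenvector
    (both directions) until f starts decreasing again: an approximate
    exit point out of the current convergence region. A local solver
    (EM in the paper) can then be restarted beyond each exit point.
    """
    exits = []
    _, vecs = np.linalg.eigh(hess(x_min))
    for i in range(vecs.shape[1]):
        for direction in (vecs[:, i], -vecs[:, i]):
            x, prev = x_min.copy(), f(x_min)
            for _ in range(max_steps):
                x = x + step * direction
                val = f(x)
                if val < prev:            # value turned downhill: crossed a ridge
                    exits.append(x.copy())
                    break
                prev = val
    return exits

# Toy double-well: minima near (-1, 0) and (1, 0), saddle at the origin.
f = lambda x: (x[0]**2 - 1)**2 + x[1]**2
hess = lambda x: np.array([[12 * x[0]**2 - 4, 0.0], [0.0, 2.0]])
exits = exit_point_search(f, hess, np.array([-1.0, 0.0]))
```

Starting from the minimum at (-1, 0), only the eigendirection toward the other well yields an exit point, found just past the saddle at the origin.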
28. Results
29. Improvements in the Alignment Scores
30. Improvements in the Alignment Scores
Random Projection method results
31. Performance Coefficient
K is the set of residue positions of the planted motif instances, and P is the corresponding set of predicted positions; the performance coefficient is |K ∩ P| / |K ∪ P|.
32. Results
Different Motifs and the average score using
random starts. The first tier and second tier
improvements on synthetic data.
33. Results
Different Motifs and the average score using
random projection. The first tier and second tier
improvements on synthetic data.
34. Results
Different Motifs and the average score using
random projections and the first tier and second
tier improvements on real human sequences.
35. Results on Real data
36. Concluding discussion
- Using a dynamical-system approach, we have shown that the EM algorithm can be improved significantly.
- In the context of motif finding, there are many locally optimal solutions, so it is important to search the neighborhood space.
- Try different global methods and other techniques such as GibbsDNA.
37. Questions and suggestions!