Guiding motif discovery by iterative pattern refinement

Provided by: bioInform1

1
Guiding motif discovery by iterative pattern refinement

Zhiping Wang
Advisors: Sun Kim, Mehmet Dalkilic
School of Informatics, Indiana University

2
Outline

Introduction and motivation
Our framework for motif discovery
Pattern finding
Build seed motif
Extract subsequences
Find motif
Iterative refinement
Performance of our framework
Discussion and Future work

3
Introduction: motifs and their applications

Protein motifs are short patterns conserved in proteins.
They are generally important for the function of a protein or the maintenance of protein structures:
Enzyme catalytic sites.
Regions involved in binding a molecule (ADP/ATP, DNA) or another protein.
A fold important for the general 3D structure.
Applications:
Distinguish protein groups based on such patterns.
Classify a sequenced protein into a specific family of proteins.

4
Introduction - motif discovery

PROSITE: patterns are found manually.
Deterministic algorithms, based on expectation maximization:
MEME (time consuming)
Stochastic algorithms (Gibbs sampling), which make random jumps in the search space:
Gibbs Sampler
AlignACE

5
Motivation

Motif discovery is, in a sense, the search for signal against noise.
The model for noise depends largely on the input sequences (see previous capstones).
Our goal is to use subsequences to guide motif discovery.
We use an iterative pattern refinement procedure to improve the performance of motif discovery.

7
Test Data Preparation

1. Download the PROSITE pattern and sequence databases.
2. Parse all positive sequences for each PROSITE ID and store them as a PROSITE family.
3. All sequences of one family contain the same PROSITE pattern.
4. We used PROSITE families for motif discovery.
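
The claim that every sequence of a family contains the family's pattern can be checked mechanically. The sketch below is not part of the original slides: it translates a pattern written in standard PROSITE syntax (brackets for allowed residues, braces for excluded residues, `x` for any residue, `(n)` or `(n,m)` for repetition, `<` and `>` for the sequence ends) into a Python regular expression. The function name `prosite_to_regex` and the test sequence are illustrative assumptions, and rarer syntax such as anchors inside brackets is not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regex (illustrative helper)."""
    regex = ""
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if m:                                # repetition such as x(4) or x(8,10)
            element = m.group(1)
            repeat = "{" + m.group(2) + ("," + m.group(3) if m.group(3) else "") + "}"
        if element == "x":                   # any residue
            core = "."
        elif element.startswith("{"):        # excluded residues
            core = "[^" + element[1:-1] + "]"
        else:                                # single residue or [allowed residues]
            core = element
        regex += prefix + core + repeat + suffix
    return regex

# Assumed bracketed form of a pattern; the sequence is made up for the demo.
regex = prosite_to_regex("C-x-[DN]-x(4)-[FY]-x-C-x-C.")
print(regex, bool(re.search(regex, "MKCADAAAAFACACGG")))
```

A family could then be verified by running `re.search` with the translated pattern over each of its sequences.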

8
Framework Overview

1. Find patterns in a PROSITE family.
2. Build seed motifs according to the patterns.
3. Select subsequences based on the seed motifs.
4. Run a motif finding program (MEME) on the subsequences.
5. Search for the motifs over the entire family using MAST.
6. Select subsequences around the motif regions.
7. Go to step 4, until the final motif is stable.

10
Pattern Finding - thresholds

For each PROSITE family, we first find conserved patterns.
Three thresholds determine whether a pattern qualifies:
1. Length of the pattern.
2. Log-odds score of a first-order Markov model against a random model.
3. Support: the number of different sequences in which the pattern occurs.
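
A minimal sketch of how the three thresholds might be evaluated, under assumptions the slides do not state: the first-order Markov model is estimated from residue-pair counts in the family, the "random model" is taken to be the family's single-residue frequencies, add-one smoothing over a 20-letter alphabet is applied, and the threshold values are arbitrary. All function names are hypothetical.

```python
import math
from collections import Counter

def train_counts(sequences):
    """Count residues, adjacent residue pairs, and pair-initial residues."""
    singles, pairs, firsts = Counter(), Counter(), Counter()
    for seq in sequences:
        singles.update(seq)
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
            firsts[a] += 1
    return singles, pairs, firsts

def log_odds(pattern, singles, pairs, firsts, alphabet_size=20):
    """Log2 ratio of first-order Markov probability to background probability.

    The first residue has the same probability under both models and cancels.
    """
    total = sum(singles.values())
    score = 0.0
    for a, b in zip(pattern, pattern[1:]):
        p_markov = (pairs[(a, b)] + 1) / (firsts[a] + alphabet_size)
        p_random = (singles[b] + 1) / (total + alphabet_size)
        score += math.log2(p_markov / p_random)
    return score

def support(pattern, sequences):
    """Number of different sequences that contain the pattern."""
    return sum(pattern in seq for seq in sequences)

def is_qualified(pattern, sequences, min_length=3, min_log_odds=0.0, min_support=2):
    """Apply the three thresholds (values here are placeholders, not the authors')."""
    counts = train_counts(sequences)
    return (len(pattern) >= min_length
            and log_odds(pattern, *counts) >= min_log_odds
            and support(pattern, sequences) >= min_support)

family = ["ACLGA", "TCLGT", "GCLGG"]   # toy sequences for the demo
print(is_qualified("CLG", family), is_qualified("AAA", family))
```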

11
Pattern Finding - algorithm

1. Use the thresholds to scan the sequences in one family and find the qualified patterns in each sequence.
2. Rank the sequences according to how many qualified patterns each sequence has.
3. Output the qualified patterns in the top half of the sequences.
4. Repeat this algorithm (go to step 1) on the remaining half of the sequences until no more patterns can be found.
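
The four steps can be sketched as a loop that halves the sequence set each round. This is an illustration under simplifying assumptions, not the authors' program: patterns are fixed-length exact substrings, only the length and support thresholds are applied (the log-odds threshold is omitted), and "top half" is rounded up. The names `qualified_patterns` and `find_patterns` are hypothetical.

```python
from collections import defaultdict

def qualified_patterns(sequences, length=3, min_support=2):
    """Map each qualifying k-mer to the set of sequence indices that contain it."""
    occurrences = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - length + 1):
            occurrences[seq[i:i + length]].add(idx)
    return {p: idxs for p, idxs in occurrences.items() if len(idxs) >= min_support}

def find_patterns(sequences, length=3, min_support=2):
    remaining = list(sequences)
    found = []
    while remaining:
        # Step 1: scan the remaining sequences for qualified patterns.
        patterns = qualified_patterns(remaining, length, min_support)
        if not patterns:
            break
        # Step 2: rank sequences by how many qualified patterns they contain.
        counts = [0] * len(remaining)
        for idxs in patterns.values():
            for i in idxs:
                counts[i] += 1
        order = sorted(range(len(remaining)), key=lambda i: -counts[i])
        # Step 3: output the patterns that occur in the top half.
        half = (len(remaining) + 1) // 2
        top = set(order[:half])
        for pattern, idxs in patterns.items():
            if idxs & top and pattern not in found:
                found.append(pattern)
        # Step 4: repeat on the rest of the sequences.
        remaining = [remaining[i] for i in order[half:]]
    return found

print(find_patterns(["ACLGA", "TCLGT", "GCLGG", "WWWWW"]))
```

Because at least one sequence leaves the working set each round, the loop always terminates.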

12
Pattern Finding - example

Qualified Patterns (p1, p2, p3)

14
Build Seed Motif

1. Start from the pattern with maximal support and use it as the seed motif.
2. Calculate the score of each candidate pattern (from sequences not covered by the seed motif) against the seed motif:
$S_i = \sum_{j=1}^{n} S_{i\text{-}j}\, W_j$
where $S_i$ is the score of candidate pattern $i$ against the seed motif, $S_{i\text{-}j}$ is the score of candidate pattern $i$ against the $j$th pattern in the seed motif, and $W_j$ is the weight (support ratio) of the $j$th pattern in the seed motif.
3. Add the pattern with the highest score to the seed motif, provided that score is also larger than a score threshold.
4. Go to step 2, until no more patterns can be added to the seed motif.

15
Build Seed Motif - example

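Calculate pattern scores (threshold = 5). Support values assume no shared sequences.

Pattern  Sequence  Support  Weight    Score to motif
P1       CLG       4        W1 = 1    -
P2       CLN       2        -         13
P3       ALG       2        -         10
P4       ALN       2        -         4

Comparing P1 (C L G) with P2 (C L N) position by position gives 9, 4 and 0.
$S_{2\text{-}1} = 9 + 4 + 0 = 13$, so $S_2 = S_{2\text{-}1} W_1 = 13$.
P2 has the highest score (13 > 5) and is added to the seed motif.
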
16
Build Seed Motif - example

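Calculate pattern scores (threshold = 5). Support values assume no shared sequences.

Pattern  Sequence  Support  Weight             Score to motif
P1       CLG       4        W1 = 4 / (4 + 2)   -
P2       CLN       2        W2 = 2 / (4 + 2)   -
P3       ALG       2        -                  8
P4       ALN       2        -                  6

$S_{3\text{-}1} = 10$, $S_{3\text{-}2} = 4$, so $S_3 = S_{3\text{-}1} W_1 + S_{3\text{-}2} W_2 = 8 > 5$.
$S_{4\text{-}1} = 4$, $S_{4\text{-}2} = 10$, so $S_4 = S_{4\text{-}1} W_1 + S_{4\text{-}2} W_2 = 6 > 5$.
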
17
Build Seed Motif - example

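Calculate pattern scores (threshold = 5). Support values assume no shared sequences.

Pattern  Sequence  Support  Weight       Score to motif
P1       CLG       4        W1 = 4 / 8   -
P2       CLN       2        W2 = 2 / 8   -
P3       ALG       2        W3 = 2 / 8   -
P4       ALN       2        -            6.5

$S_{4\text{-}1} = 4$, $S_{4\text{-}2} = 10$, $S_{4\text{-}3} = 8$, so $S_4 = S_{4\text{-}1} W_1 + S_{4\text{-}2} W_2 + S_{4\text{-}3} W_3 = 2 + 2.5 + 2 = 6.5 > 5$.
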
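
The seed-building loop can be traced in code. The sketch below is an illustration, not the authors' implementation. It assumes that the pattern-to-pattern score is a position-wise sum of residue substitution scores; the table holds only the pairs needed for this example, with values chosen to reproduce the pairwise scores on the slides (they coincide with BLOSUM62 entries, although the slides do not name a matrix). As in the example, no sequences are shared, so every remaining pattern is a candidate. With the listed weights 4/8, 2/8 and 2/8, the last round evaluates to 6.5. Exact fractions avoid rounding noise.

```python
from fractions import Fraction

# Residue-pair scores needed for the example (assumed; equal to BLOSUM62 entries).
SUBST = {("C", "C"): 9, ("L", "L"): 4, ("G", "G"): 6, ("N", "N"): 6,
         ("A", "A"): 4, ("A", "C"): 0, ("G", "N"): 0}

def residue_score(a, b):
    """Symmetric lookup; raises KeyError for pairs missing from the toy table."""
    return SUBST[(a, b)] if (a, b) in SUBST else SUBST[(b, a)]

def pattern_score(p, q):
    """Score of pattern p against pattern q: sum over aligned positions."""
    return sum(residue_score(a, b) for a, b in zip(p, q))

def build_seed_motif(support, threshold=5):
    """Greedy seed construction; returns the motif and the (pattern, score) history."""
    candidates = dict(support)
    first = max(candidates, key=candidates.get)   # step 1: maximal support
    motif = [first]
    del candidates[first]
    history = []
    while candidates:
        total = sum(support[m] for m in motif)
        # Step 2: S_i = sum_j S_{i-j} * W_j, with W_j the support ratio.
        scores = {c: sum(pattern_score(c, m) * Fraction(support[m], total)
                         for m in motif)
                  for c in candidates}
        best = max(scores, key=scores.get)
        if scores[best] <= threshold:             # step 3: must exceed the threshold
            break
        history.append((best, scores[best]))
        motif.append(best)
        del candidates[best]                      # step 4: repeat
    return motif, history

motif, history = build_seed_motif({"CLG": 4, "CLN": 2, "ALG": 2, "ALN": 2})
print(motif, [float(score) for _, score in history])
```
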
18
Build Seed Motif
20
Extract Subsequences
21
Find Motif
MEME
22
Iterative refinement
motif1, motif2, motif3
-> MAST over the entire PROSITE family
-> sub1, sub2, sub3
-> MEME
-> motif1, motif2, motif3
-> Stable? If no, return to the MAST step; if yes, choose the best motif.
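
The refinement cycle can be sketched as a loop that stops once the motif set no longer changes. The code below is a toy illustration only: `search` and `discover` are hypothetical stand-ins for MAST and MEME (exact substring search and most-shared k-mer), `extract` uses an assumed flank of 2 residues, and `max_rounds` is an added safeguard, since nothing in the procedure guarantees that the motifs become stable.

```python
from collections import Counter

def search(motifs, family):
    """Stand-in for MAST: first exact occurrence of any motif in each sequence."""
    hits = []
    for idx, seq in enumerate(family):
        for motif in motifs:
            start = seq.find(motif)
            if start >= 0:
                hits.append((idx, start, len(motif)))
                break
    return hits

def extract(hits, family, flank=2):
    """Cut a window around each hit, extended by `flank` residues on both sides."""
    return [family[idx][max(0, start - flank):start + length + flank]
            for idx, start, length in hits]

def discover(subsequences, width=3):
    """Stand-in for MEME: the k-mer present in the most subsequences."""
    counts = Counter()
    for sub in subsequences:
        counts.update({sub[i:i + width] for i in range(len(sub) - width + 1)})
    if not counts:
        return []
    # Highest count first; ties broken alphabetically for determinism.
    return [min(counts, key=lambda kmer: (-counts[kmer], kmer))]

def refine(motifs, family, max_rounds=10):
    """Repeat search -> extract -> discover until the motifs are stable."""
    for _ in range(max_rounds):
        subsequences = extract(search(motifs, family), family)
        new_motifs = discover(subsequences)
        if new_motifs == motifs:      # stable: stop iterating
            break
        motifs = new_motifs
    return motifs

family = ["AACLGTT", "GGCLGAA", "TTCLGCC"]   # toy family for the demo
print(refine(["CLG"], family))
```

In a real pipeline the two stand-ins would be replaced by calls to the actual tools, and the stability test would compare motif models rather than strings.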
24
Experiment

We randomly chose 17 PROSITE families as the test data set.
We ran MEME directly on these families and took the best motif for each of them.
We ran our framework on the same families and took the best motif.
We compared the results.

25
PROSITE Patterns

PS00010 C-x-DN-x(4)-FY-x-C-x-C.
PS00011 x(12)-E-x(3)-E-x-C-x(6)-DEN-x-LIVMFY-x(9)-FYW.
PS00014 KRHQSA-DENQ-E-L>.
PS00018 D-x-DNS-ILVFYW-DENSTG-DNQGHRK-GP-LIVMC-DENQSTAGC-x(2)-DE-LIVMFYW.
PS00020 LIVM-x-SGN-LIVM-DAGHE-SAG-x-DNEAG-LIVM-x-DEAG-x(4)-LIVM-x-LM-SAG-LIVM-LIVMT-W-x-LIVM(2).
PS00099 AG-LIVMA-STAGCLIVM-STAG-LIVMA-C-x-AG-x-AG-x-AG-x-SAG.
PS00342 STAGCN-RKH-LIVMAFY>.
PS00343 L-P-x-T-G-STGAVDE.
PS00409 KRHEQSTAG-G-FYLIVM-ST-LT-LIVP-E-LIVMFWSTAG(14).
PS00881 DNEG-x-LIVFA-LIVMY-LVAST-H-N-STC.
PS01286 P-x(8,10)-LM-R-x-GE-LIVP-x-G-C.
PS00012 DEQGSTALMKRH-LIVMFYSTAC-GNQ-LIVMFYAG-DNEKHS-S-LIVMST-PCFY-STAGCPQLIVMF-LIVMATN-DENQGTAKRHLM-LIVMWSTA-LIVGSTACR-x(2)-LIVMFA.
PS00019 EQ-x(2)-ATV-FY-x(2)-W-x-N.
PS00660 W-LIV-x(3)-KRQ-x-LIVM-x(2)-QH-x(0,2)-LIVMF-x(6,8)-LIVMF-x(3,5)-F-FY-x(2)-DENS.
PS00661 HYW-x(9)-DENQSTV-SA-x(3)-FY-LIVM-x(2)-ACV-x(2)-LM-x(2)-FY-G-x-DENQST-LIVMFYS.
PS00889 LIVMF-G-E-x-GAS-LIVM-x(5,11)-R-STAQ-A-x-LIVMA-x-STACV.
PS01177 CSH-C-x(2)-GAP-x(7,8)-GASTDEQR-C-GASTDEQL-x(3,9)-GASTDEQN-x(2)-CE-x(6,7)-C-C.

26
Performance

The result of the comparison.

PS00010 PS00011 PS00018 PS00020 PS00409 PS00881 PS00012 PS01286 PS00099 PS00019 PS00660 PS00014 PS00342 PS00343 PS00661 PS00889 PS01177
MEME ? ?
Framework ? ?

28
Discussion

One flaw: local optima.
PS01286 is the only family on which our framework performs worse.
PROSITE pattern: P-x(8,10)-LM-R-x-GE-LIVP-x-G-C
MEME: TNS W HE GN RG I AGS LM R LV E LV YLF G C
Our framework:
1. EP W x(4) L G x L KM x VI T GA VI IA T Q G
2. x(4)-P-x(8)-LM-R-x-E-LV-x-G-C

29
Future Work

Design our own motif discovery algorithm
Convert the framework to a complete program
Test the performance of our program on more PROSITE patterns

30
Acknowledgement

Prof. Sun Kim
Prof. Mehmet Dalkilic (Memo)
Arvind Gopu
Scott Martin
