Title: Guiding motif discovery by iterative pattern refinement
1Guiding motif discovery by iterative pattern
refinement
- Zhiping Wang
- Advisors: Sun Kim, Mehmet Dalkilic
- School of Informatics, Indiana University
2Outline
- Introduction and motivation
- Our framework for motif discovery
- Pattern finding
- Build seed motif
- Extract subsequences
- Find motif
- Iterative refinement
- Performance of our framework
- Discussion and Future work
3Introduction - motifs and their applications
- Protein motifs are short patterns conserved in proteins.
- They are generally important for the function of a protein or the maintenance of protein structures.
- Enzyme catalytic sites
- Regions involved in binding a molecule (ADP/ATP, DNA) or another protein.
- A fold important for general 3D structure.
- Distinguish protein groups based on such patterns.
- Classify a sequenced protein into a specific family of proteins.
4Introduction - motif discovery
- PROSITE: patterns are found manually
- Deterministic algorithms, expectation-maximization based
- MEME (time consuming)
- Stochastic algorithms (Gibbs sampling), random jumps in the search space
- Gibbs Sampler
- AlignACE
5Motivation
- Motif discovery is, in a sense, looking for signals against noise.
- The model for noise largely depends on the input sequences (see previous capstones).
- Our goal is to use subsequences to guide motif discovery.
- We use an iterative pattern refinement procedure to improve the performance of motif discovery.
7Test Data Preparation
- 1. Download the PROSITE pattern and sequence databases.
- 2. Parse all positive sequences for each PROSITE ID and store them as a PROSITE family.
- 3. All sequences of one family contain the same PROSITE pattern.
- 4. We used PROSITE families for motif discovery.
8Framework Overview
- 1. Find patterns in a PROSITE family
- 2. Build seed motifs according to the patterns
- 3. Select subsequences based on the seed motifs
- 4. Run the motif finding program (MEME) on the subsequences
- 5. Search for the motifs using MAST over the entire family
- 6. Select subsequences around the motif regions
- 7. Go to step 4, until the final motif is stable
10Pattern Finding - thresholds
- For each PROSITE family, we find conserved patterns first.
- Three thresholds to find a qualified pattern
- 1. Length of the pattern.
- 2. Log-odds value of a first-order Markov model relative to a random model.
- 3. Support value: the number of different sequences in which the pattern occurs.
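The second threshold can be illustrated with a minimal sketch. It assumes the first-order Markov model is estimated from the family's own sequences and that the "random model" is a uniform background; the slides do not specify either choice, and the function names and the pseudocount are hypothetical.

```python
import math
from collections import Counter

def train_first_order(sequences, alphabet, pseudocount=1.0):
    """Estimate start and transition probabilities (with pseudocounts)."""
    init, trans = Counter(), Counter()
    for s in sequences:
        if s:
            init[s[0]] += 1
        for a, b in zip(s, s[1:]):
            trans[(a, b)] += 1
    k = len(alphabet)
    n_init = sum(init.values())
    p_init = {a: (init[a] + pseudocount) / (n_init + pseudocount * k)
              for a in alphabet}
    p_trans = {}
    for a in alphabet:
        row_total = sum(trans[(a, b)] for b in alphabet)
        for b in alphabet:
            p_trans[(a, b)] = (trans[(a, b)] + pseudocount) / (row_total + pseudocount * k)
    return p_init, p_trans

def log_odds(pattern, p_init, p_trans, background):
    """log2 P(pattern | Markov model) - log2 P(pattern | random model)."""
    markov = math.log2(p_init[pattern[0]])
    for a, b in zip(pattern, pattern[1:]):
        markov += math.log2(p_trans[(a, b)])
    random_model = sum(math.log2(background[c]) for c in pattern)
    return markov - random_model

# Toy two-letter alphabet, purely for illustration.
alphabet = "AC"
p_init, p_trans = train_first_order(["ACAC", "ACAC"], alphabet)
background = {a: 1.0 / len(alphabet) for a in alphabet}  # assumed uniform
score_aca = log_odds("ACA", p_init, p_trans, background)  # positive: typical of the family
score_aaa = log_odds("AAA", p_init, p_trans, background)  # negative: atypical
```

A candidate pattern would pass this threshold when its log-odds value exceeds a chosen cutoff; the cutoff itself is not given in the slides.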
11Pattern Finding - algorithm
- 1. Use the thresholds to scan the sequences in one family and find the qualified patterns in each sequence.
- 2. Rank the sequences by the number of qualified patterns each contains.
- 3. Output the qualified patterns in the top half of the sequences.
- 4. Repeat the algorithm (go to step 1) on the remaining half of the sequences until no more patterns can be found.
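The rank-and-halve loop can be sketched as follows. This is a simplified illustration, not the authors' program: it treats a "pattern" as a fixed-length substring and applies only the length and support thresholds, leaving out the log-odds filter. The function names and default values are hypothetical.

```python
from collections import defaultdict

def qualified_patterns(pool, length, min_support):
    """For each sequence in the pool, the k-mers that meet the support threshold."""
    support = defaultdict(set)
    for idx, seq in enumerate(pool):
        for i in range(len(seq) - length + 1):
            support[seq[i:i + length]].add(idx)
    good = {p for p, ids in support.items() if len(ids) >= min_support}
    return [{p for p in good if idx in support[p]} for idx in range(len(pool))]

def find_patterns(sequences, length=3, min_support=2):
    pool = list(sequences)
    found = []
    while pool:
        per_seq = qualified_patterns(pool, length, min_support)
        if not any(per_seq):
            break  # no more patterns can be found
        # Rank sequences by how many qualified patterns they contain.
        order = sorted(range(len(pool)), key=lambda i: len(per_seq[i]), reverse=True)
        half = (len(pool) + 1) // 2
        top, rest = order[:half], order[half:]
        # Output the patterns of the top half.
        for i in top:
            for p in sorted(per_seq[i]):
                if p not in found:
                    found.append(p)
        # Repeat on the remaining half.
        pool = [pool[i] for i in rest]
    return found

family = ["AACLGAA", "TTCLGTT", "GGCLGGG", "PPALNPP", "QQALNQQ", "MMMMMMM"]
patterns = find_patterns(family)
print(patterns)  # ['CLG', 'ALN']
```

Re-scanning the remaining half lets a pattern such as ALN, which is shared only by lower-ranked sequences, be reported in a later round.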
12Pattern Finding - example
- Qualified Patterns (p1, p2, p3)
14Build Seed Motif
- 1. Start from the pattern with maximal support and use it as the seed motif.
- 2. Calculate the scores of the candidate patterns (in sequences not covered by the seed motif) to the seed motif.
- $S_i = \sum_{j=1}^{n} S_{i\text{-}j} W_j$
- $S_i$: score of candidate pattern $i$ to the seed motif
- $S_{i\text{-}j}$: score of candidate pattern $i$ to the $j$th pattern in the seed motif
- $W_j$: the weight (support ratio) of the $j$th pattern in the seed motif
- $n$: the number of patterns currently in the seed motif
- 3. Add the pattern with the highest score (if it also exceeds a score threshold) to the seed motif.
- 4. Go to step 2, until no more patterns can be added to the seed motif.
15Build Seed Motif - example
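- Calculate pattern scores (threshold = 5); supports assume no shared sequences
Pattern  Sequence  Support  Weight   Score to motif
P1       CLG       4        W1 = 1   -
P2       CLN       2        -        13
P3       ALG       2        -        10
P4       ALN       2        -        4
- Position-by-position scores of P2 (C L N) against P1 (C L G): 9, 4, 0
- $S_{2\text{-}1} = 9 + 4 + 0 = 13$, so $S_2 = S_{2\text{-}1} W_1 = 13$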
16Build Seed Motif - example
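- Calculate pattern scores (threshold = 5); supports assume no shared sequences
Pattern  Sequence  Support  Weight            Score to motif
P1       CLG       4        W1 = 4 / (4 + 2)  -
P2       CLN       2        W2 = 2 / (4 + 2)  -
P3       ALG       2        -                 8
P4       ALN       2        -                 6
- $S_{3\text{-}1} = 10$, $S_{3\text{-}2} = 4$, so $S_3 = S_{3\text{-}1} W_1 + S_{3\text{-}2} W_2 = 8 > 5$
- $S_{4\text{-}1} = 4$, $S_{4\text{-}2} = 10$, so $S_4 = S_{4\text{-}1} W_1 + S_{4\text{-}2} W_2 = 6 > 5$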
17Build Seed Motif - example
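- Calculate pattern scores (threshold = 5); supports assume no shared sequences
Pattern  Sequence  Support  Weight      Score to motif
P1       CLG       4        W1 = 4 / 8  -
P2       CLN       2        W2 = 2 / 8  -
P3       ALG       2        W3 = 2 / 8  -
P4       ALN       2        -           6.5
- $S_{4\text{-}1} = 4$, $S_{4\text{-}2} = 10$, $S_{4\text{-}3} = 8$
- $S_4 = S_{4\text{-}1} W_1 + S_{4\text{-}2} W_2 + S_{4\text{-}3} W_3 = 2 + 2.5 + 2 = 6.5 > 5$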
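The greedy procedure and the worked example can be reproduced with a short sketch. The slides do not name the residue scoring matrix, so the small table below is an assumption: it contains only the identity scores implied by the example (C-C 9, L-L 4, G-G 6, A-A 4, N-N 6, which happen to agree with BLOSUM62), and every other pair is scored 0. All function names are hypothetical.

```python
def pair_score(p, q, sub):
    """Position-by-position score of two equal-length patterns."""
    return sum(sub.get((a, b), sub.get((b, a), 0)) for a, b in zip(p, q))

def score_to_motif(candidate, motif, support, sub):
    """S_i: sum over motif patterns j of S_i-j times W_j (support ratio)."""
    total = sum(support[m] for m in motif)
    return sum(pair_score(candidate, m, sub) * support[m] for m in motif) / total

def build_seed_motif(support, sub, threshold):
    # Step 1: start from the pattern with maximal support.
    motif = [max(support, key=support.get)]
    candidates = [p for p in support if p not in motif]
    while candidates:
        # Steps 2-3: score every candidate and take the best one.
        best = max(candidates, key=lambda p: score_to_motif(p, motif, support, sub))
        if score_to_motif(best, motif, support, sub) <= threshold:
            break  # nothing left that clears the threshold
        motif.append(best)
        candidates.remove(best)
    return motif

# Assumed toy scoring table; unlisted residue pairs score 0.
sub = {("C", "C"): 9, ("L", "L"): 4, ("G", "G"): 6, ("A", "A"): 4, ("N", "N"): 6}
support = {"CLG": 4, "CLN": 2, "ALG": 2, "ALN": 2}

print(score_to_motif("CLN", ["CLG"], support, sub))         # 13.0
print(score_to_motif("ALG", ["CLG", "CLN"], support, sub))  # 8.0
print(score_to_motif("ALN", ["CLG", "CLN"], support, sub))  # 6.0
seed = build_seed_motif(support, sub, threshold=5)
print(seed)  # ['CLG', 'CLN', 'ALG', 'ALN']
```

The sketch assumes, as the example does, that the patterns share no sequences, so each weight is simply the pattern's support divided by the total support of the patterns already in the motif.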
18Build Seed Motif
20Extract Subsequences
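One plausible reading of this step, sketched under stated assumptions: for each sequence, locate the first occurrence of any seed-motif pattern and keep a window around it. The `flank` parameter and the function name are hypothetical; the slides do not say how wide the extracted region is or how sequences without a hit are handled (here they are skipped).

```python
def extract_subsequences(sequences, seed_patterns, flank=10):
    """Cut a window of `flank` residues on each side of the first seed hit."""
    subs = []
    for seq in sequences:
        hits = [(seq.find(p), p) for p in seed_patterns if p in seq]
        if not hits:
            continue  # sequence not covered by the seed motif
        start, pattern = min(hits)  # leftmost hit
        lo = max(0, start - flank)
        hi = min(len(seq), start + len(pattern) + flank)
        subs.append(seq[lo:hi])
    return subs

subs = extract_subsequences(["AAAAACLGTTTTT", "GGGG"], ["CLG"], flank=2)
print(subs)  # ['AACLGTT']
```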
21Find Motif
MEME
22Iterative refinement
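- Flowchart, in order:
- motif1, motif2, motif3 are searched with MAST over the entire PROSITE family
- Subsequences sub1, sub2, sub3 are selected around the hits
- MEME is run on the subsequences, giving new motif1, motif2, motif3
- Stable? If no, return to the MAST step; if yes, choose the best motif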
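The control flow of the refinement loop can be sketched as below. MEME and MAST are represented by caller-supplied functions (`find_motifs`, `search_family`) rather than real command-line invocations, and `extract`, `max_rounds`, and the equality-based stability test are assumptions; the slides do not define what "stable" means precisely, and the final "choose the best motif" selection is left to the caller.

```python
def refine(family, subsequences, find_motifs, search_family, extract, max_rounds=20):
    """Alternate motif finding and family-wide search until the motifs stop changing."""
    previous = None
    for _ in range(max_rounds):
        motifs = find_motifs(subsequences)        # MEME step (stand-in)
        if motifs == previous:                    # assumed stability criterion
            break
        previous = motifs
        hits = search_family(motifs, family)      # MAST step (stand-in)
        subsequences = extract(family, hits)      # new subsequences around the hits
    return previous

# Scripted stand-ins, only to show the loop terminating once the motifs repeat.
script = iter([["m1"], ["m2"], ["m2"]])
calls = []

def fake_find(subs):
    calls.append(list(subs))
    return next(script)

def fake_search(motifs, family):
    return [(i, 0) for i, _ in enumerate(family)]

def fake_extract(family, hits):
    return [family[i] for i, _ in hits]

final = refine(["AAA", "BBB"], ["A"], fake_find, fake_search, fake_extract)
print(final)  # ['m2']
```

The `max_rounds` cap guards against a run that oscillates between motif sets and never meets the stability test.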
24Experiment
- We randomly chose 17 PROSITE families as the test data set.
- Ran MEME directly on these families and got the best motif for each of them.
- Ran our framework and got the best motif.
- Compared the results.
25PROSITE Patterns
- PS00010 C-x-DN-x(4)-FY-x-C-x-C.
- PS00011 x(12)-E-x(3)-E-x-C-x(6)-DEN-x-LIVMFY-x(9)-FYW.
- PS00014 KRHQSA-DENQ-E-L>.
- PS00018 D-x-DNS-ILVFYW-DENSTG-DNQGHRK-GP-LIVMC-DENQSTAGC-x(2)-DE-LIVMFYW.
- PS00020 LIVM-x-SGN-LIVM-DAGHE-SAG-x-DNEAG-LIVM-x-DEAG-x(4)-LIVM-x-LM-SAG-LIVM-LIVMT-W-x-LIVM(2).
- PS00099 AG-LIVMA-STAGCLIVM-STAG-LIVMA-C-x-AG-x-AG-x-AG-x-SAG.
- PS00342 STAGCN-RKH-LIVMAFY>.
- PS00343 L-P-x-T-G-STGAVDE.
- PS00409 KRHEQSTAG-G-FYLIVM-ST-LT-LIVP-E-LIVMFWSTAG(14).
- PS00881 DNEG-x-LIVFA-LIVMY-LVAST-H-N-STC.
- PS01286 P-x(8,10)-LM-R-x-GE-LIVP-x-G-C.
- PS00012 DEQGSTALMKRH-LIVMFYSTAC-GNQ-LIVMFYAG-DNEKHS-S-LIVMST-PCFY-STAGCPQLIVMF-LIVMATN-DENQGTAKRHLM-LIVMWSTA-LIVGSTACR-x(2)-LIVMFA.
- PS00019 EQ-x(2)-ATV-FY-x(2)-W-x-N.
- PS00660 W-LIV-x(3)-KRQ-x-LIVM-x(2)-QH-x(0,2)-LIVMF-x(6,8)-LIVMF-x(3,5)-F-FY-x(2)-DENS.
- PS00661 HYW-x(9)-DENQSTV-SA-x(3)-FY-LIVM-x(2)-ACV-x(2)-LM-x(2)-FY-G-x-DENQST-LIVMFYS.
- PS00889 LIVMF-G-E-x-GAS-LIVM-x(5,11)-R-STAQ-A-x-LIVMA-x-STACV.
- PS01177 CSH-C-x(2)-GAP-x(7,8)-GASTDEQR-C-GASTDEQL-x(3,9)-GASTDEQN-x(2)-CE-x(6,7)-C-C.
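To check whether a sequence contains a PROSITE pattern, the pattern can be translated into a regular expression. The sketch below assumes standard PROSITE notation, in which `[...]` lists allowed residues, `{...}` lists forbidden residues, `x` is any residue, `(n)` or `(n,m)` is a repeat count, and `<` / `>` anchor the pattern to the sequence ends. The patterns as printed here carry no brackets, so the bracketed strings used below are assumptions about how those patterns read in full notation. The function name is hypothetical, and rarer syntax (such as `>` inside brackets) is not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regular expression."""
    out = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # C-terminal anchor
            suffix, element = "$", element[:-1]
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        # "[ABC]" and single letters are already valid regex fragments.
        repeat = ""
        if lo:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        out.append(prefix + core + repeat + suffix)
    return "".join(out)

regex = prosite_to_regex("C-x-[DN]-x(4)-[FY]-x-C-x-C.")
print(regex)  # C.[DN].{4}[FY].C.C
hit = re.search(regex, "AACADAAAAFACACAA")
print(hit.group(0))  # CADAAAAFACAC
```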
26Performance
- The result of the comparison.
PS00010 PS00011 PS00018 PS00020 PS00409 PS00881 PS00012 PS01286 PS00099 PS00019 PS00660 PS00014 PS00342 PS00343 PS00661 PS00889 PS01177
MEME ? ?
Framework ? ?
28Discussion
- One flaw: local optima
- PS01286 is the only family on which our framework performs worse
- PROSITE pattern: P-x(8,10)-LM-R-x-GE-LIVP-x-G-C
- MEME: TNS W HE GN RG I AGS LM R LV E LV YLF G C
- Our framework:
- 1. EP W x(4) L G x L KM x VI T GA VI IA T Q G
- 2. x(4)-P-x(8)-LM-R-x-E-LV-x-G-C
29Future Work
- Design our own motif discovery algorithm
- Convert the framework to a complete program
- Test the performance of our program on more
PROSITE patterns
30Acknowledgement
- Prof. Sun Kim
- Prof. Mehmet Dalkilic (Memo)
- Arvind Gopu
- Scott Martin