Guiding motif discovery by iterative pattern refinement

1
Guiding motif discovery by iterative pattern
refinement
  • Zhiping Wang
  • Advisors: Sun Kim, Mehmet Dalkilic
  • School of Informatics, Indiana University

2
Outline
  • Introduction and motivation
  • Our framework for motif discovery
  • Pattern finding
  • Build seed motif
  • Extract subsequences
  • Find motif
  • Iterative refinement
  • Performance of our framework
  • Discussion and Future work

3
Introduction - motifs and their applications
  • Protein motifs are short patterns conserved in
    proteins.
  • They are generally important for the function of
    a protein or the maintenance of protein
    structures.
  • Enzyme catalytic sites
  • Regions involved in binding a molecule (ADP/ATP,
    DNA) or another protein.
  • A fold important for general 3D structure.
  • Distinguish protein groups based on such
    patterns.
  • Classify a sequenced protein into a specific
    family of proteins.

4
Introduction - motif discovery
  • PROSITE: patterns are found manually
  • Deterministic algorithms, based on expectation
    maximization: MEME (time consuming)
  • Stochastic algorithms (Gibbs sampling), which
    make random jumps in the search space:
    Gibbs Sampler, AlignACE

5
Motivation
  • Motif discovery is, in a sense, a search for
    signal against noise.
  • The model for noise largely depends on the input
    sequences (See previous capstones).
  • Our goal is to use subsequences to guide motif
    discovery.
  • We use an iterative pattern refinement procedure
    to improve the performance of motif discovery.

7
Test Data Preparation
  1. Download the PROSITE pattern and sequence
    databases.
  2. Parse all positive sequences for each PROSITE
    ID and store them as a PROSITE family.
  3. All sequences of one family contain the same
    PROSITE pattern.
  4. We used PROSITE families for motif discovery.
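
The property in step 3 can be checked mechanically. The sketch below converts a pattern written in standard PROSITE syntax (bracketed alternatives, braced exclusions, x for any residue) into a Python regular expression and tests whether every sequence of a family contains it. The function names, the example pattern string, and the toy sequences are illustrative assumptions, not part of the original data preparation, and rare syntax such as anchors inside brackets is not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as 'C-x-[DN]-x(4)-[FY]-x-C-x-C.' to a regex."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")   # '<' anchors to the N-terminus
        anchor_end = element.endswith(">")       # '>' anchors to the C-terminus
        element = element.strip("<>")
        # Separate the residue part from an optional repeat such as (4) or (8,10).
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        residues, low, high = match.groups()
        if residues == "x":
            core = "."                           # any residue
        elif residues.startswith("{") and residues.endswith("}"):
            core = "[^" + residues[1:-1] + "]"   # excluded residues
        else:
            core = residues                      # single letter or [ABC]
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)

def family_contains_pattern(sequences, pattern: str) -> bool:
    """Return True if every sequence contains at least one match of the pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    return all(regex.search(seq) for seq in sequences)

# Toy family (hypothetical sequences) that all contain the example pattern.
toy_family = ["MKCGDAAAAFGCACLL", "CANWWWWYTCHCK"]
print(prosite_to_regex("C-x-[DN]-x(4)-[FY]-x-C-x-C."))
print(family_contains_pattern(toy_family, "C-x-[DN]-x(4)-[FY]-x-C-x-C."))
```

A sequence that fails the check would indicate either a parsing problem or a sequence that was not annotated as a true positive for that PROSITE ID.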

8
Framework Overview
  1. Find patterns in a PROSITE family
  2. Build seed motifs according to patterns
  3. Select subsequences based on seed motifs
  4. Run motif finding program (MEME) on the
    subsequences
  5. Search motifs using MAST over entire family
  6. Select subsequences around the motif regions
  7. Go to step 4, until the final motif is stable

10
Pattern Finding - thresholds
  • For each PROSITE family, we first find conserved
    patterns.
  • Three thresholds determine a qualified pattern:
  1. Length of the pattern.
  2. Log-odds value of a first-order Markov model
    against a random model.
  3. Support value: the number of different
    sequences in which the pattern occurs.
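
The three thresholds can be sketched as follows. This is a minimal illustration under stated assumptions: the "random model" is taken to be independent background residue frequencies estimated from the same sequences, add-one pseudocounts are used, the log is base 2, and the default threshold values are placeholders. None of these choices are given in the slides, and sequences are assumed to use only the 20 standard amino-acid letters.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_models(sequences, pseudocount=1.0):
    """Estimate background frequencies and first-order transition probabilities."""
    single, pair = Counter(), Counter()
    for seq in sequences:
        single.update(seq)
        pair.update(zip(seq, seq[1:]))
    total = sum(single[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    background = {a: (single[a] + pseudocount) / total for a in AMINO_ACIDS}
    transition = {}
    for a in AMINO_ACIDS:
        row = sum(pair[(a, b)] for b in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
        for b in AMINO_ACIDS:
            transition[(a, b)] = (pair[(a, b)] + pseudocount) / row
    return background, transition

def log_odds(pattern, background, transition):
    """Log-odds of the Markov model against the background model.

    The first residue has the same probability under both models here,
    so only the transitions contribute.
    """
    return sum(math.log2(transition[(prev, cur)] / background[cur])
               for prev, cur in zip(pattern, pattern[1:]))

def support(pattern, sequences):
    """Number of different sequences that contain the pattern."""
    return sum(pattern in seq for seq in sequences)

def is_qualified(pattern, sequences, background, transition,
                 min_length=3, min_log_odds=1.0, min_support=2):
    """Apply the three thresholds (the default values are assumptions)."""
    return (len(pattern) >= min_length
            and log_odds(pattern, background, transition) >= min_log_odds
            and support(pattern, sequences) >= min_support)

toy = ["MKCLGAT", "TTCLGPK", "AACLGWW", "PQRSTVW"]
bg, tr = train_models(toy)
print(support("CLG", toy), is_qualified("CLG", toy, bg, tr))
```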

11
Pattern Finding - algorithm
  1. Use the thresholds to scan the sequences in one
    family and find the qualified patterns in each
    sequence.
  2. Rank the sequences by the number of qualified
    patterns each contains.
  3. Output the qualified patterns in the top half
    of the sequences.
  4. Repeat the algorithm (go to step 1) on the
    remaining half of the sequences until no more
    patterns can be found.
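
The steps above can be sketched as a loop that repeatedly halves the sequence set. To keep the example self-contained, a "qualified pattern" is simplified here to a fixed-length k-mer that occurs in at least `min_support` of the sequences still under consideration. That simplification, the function names, and the rounding of the top half are assumptions for illustration only.

```python
def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def qualified_in(seq, pool, k, min_support):
    """k-mers of seq that occur in at least min_support sequences of pool."""
    return {p for p in kmers(seq, k)
            if sum(p in other for other in pool) >= min_support}

def find_patterns(sequences, k=3, min_support=2):
    """Rank sequences by qualified patterns, output the top half, repeat on the rest."""
    remaining = list(sequences)
    found = []
    while remaining:
        # Step 1: qualified patterns per sequence, relative to the current set.
        per_seq = [qualified_in(s, remaining, k, min_support) for s in remaining]
        if not any(per_seq):
            break  # no more patterns can be found
        # Step 2: rank sequences by how many qualified patterns they contain.
        order = sorted(range(len(remaining)),
                       key=lambda i: len(per_seq[i]), reverse=True)
        half = (len(order) + 1) // 2  # top half, rounded up (assumption)
        # Step 3: output the patterns of the top half.
        for i in order[:half]:
            for p in sorted(per_seq[i]):
                if p not in found:
                    found.append(p)
        # Step 4: continue with the remaining half.
        remaining = [remaining[i] for i in order[half:]]
    return found

toy = ["MCLGKALNW", "TCLGPALNQ", "HHCLGHH", "GGALNGG", "QQQQQQQ"]
print(find_patterns(toy))
```

Because each round removes at least one sequence, the loop always terminates, even when patterns keep qualifying.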

12
Pattern Finding - example
  • Qualified Patterns (p1, p2, p3)

14
Build Seed Motif
  1. Start from the pattern with maximal support and
    use it as the seed motif.
  2. Calculate the score of each candidate pattern
    (in sequences not covered by the seed motif)
    against the seed motif:
    $S_i = \sum_{j=1}^{n} S_{i\text{-}j} W_j$
      • $S_i$: score of candidate pattern $i$ against
        the seed motif
      • $S_{i\text{-}j}$: score of candidate pattern $i$
        against the $j$th pattern in the seed motif
      • $W_j$: the weight (support ratio) of the $j$th
        pattern in the seed motif
  3. Add the pattern with the highest score to the
    seed motif, provided the score also exceeds a
    score threshold.
  4. Go to step 2 until no more patterns can be added
    to the seed motif.
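
A minimal sketch of this procedure, run on the four example patterns used in the following slides (CLG, CLN, ALG, ALN with supports 4, 2, 2, 2 and threshold 5). The residue-pair scores are entered by hand for just the pairs needed. They agree with the per-position scores shown in the example and appear to match BLOSUM62, but the scoring matrix is not named in the slides, so treat them as an assumption. The restriction to sequences not covered by the seed motif is omitted because the example supposes no shared sequences.

```python
# Hand-entered residue-pair scores (assumed; only the pairs used below).
PAIR_SCORES = {("C", "C"): 9, ("L", "L"): 4, ("G", "G"): 6, ("N", "N"): 6,
               ("A", "A"): 4, ("A", "C"): 0, ("G", "N"): 0}

def residue_score(a, b):
    """Symmetric lookup; raises KeyError for pairs not listed above."""
    key = (a, b) if (a, b) in PAIR_SCORES else (b, a)
    return PAIR_SCORES[key]

def pattern_score(p, q):
    """S_{i-j}: position-by-position score of two equal-length patterns."""
    return sum(residue_score(a, b) for a, b in zip(p, q))

def score_to_motif(candidate, motif, support):
    """S_i = sum_j S_{i-j} * W_j, with W_j the support ratio of pattern j."""
    total = sum(support[p] for p in motif)
    return sum(pattern_score(candidate, p) * support[p] / total for p in motif)

def build_seed_motif(support, threshold):
    """Grow the seed motif greedily; also return the scores of each round."""
    motif = [max(support, key=support.get)]          # step 1: maximal support
    candidates = [p for p in support if p not in motif]
    history = []
    while candidates:
        scores = {c: score_to_motif(c, motif, support) for c in candidates}  # step 2
        best = max(scores, key=scores.get)
        if scores[best] <= threshold:
            break                                    # nothing exceeds the threshold
        history.append(scores)
        motif.append(best)                           # step 3
        candidates.remove(best)
    return motif, history

support = {"CLG": 4, "CLN": 2, "ALG": 2, "ALN": 2}
motif, history = build_seed_motif(support, threshold=5)
print(motif)
print(history[0])  # first-round scores against the seed CLG
print(history[1])  # second-round scores against the seed CLG + CLN
```

The first round gives 13, 10 and 4 for CLN, ALG and ALN, and the second round gives 8 and 6 for ALG and ALN, in line with the worked example.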

15
Build Seed Motif - example
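  • Calculate pattern scores (threshold 5)

Pattern  Sequence  Support  Weight      Score to motif
P1       CLG       4        W1 = 1      -
P2       CLN       2        -           13
P3       ALG       2        -           10
P4       ALN       2        -           4
(Support: suppose no shared sequences.)

  • Aligning P2 (C L N) with P1 (C L G) gives the
    per-position scores 9, 4, 0.
  • $S_{2\text{-}1} = 9 + 4 + 0 = 13$, so
    $S_2 = S_{2\text{-}1} W_1 = 13$.
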
16
Build Seed Motif - example
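  • Calculate pattern scores (threshold 5)

Pattern  Sequence  Support  Weight              Score to motif
P1       CLG       4        W1 = 4 / (4 + 2)    -
P2       CLN       2        W2 = 2 / (4 + 2)    -
P3       ALG       2        -                   8
P4       ALN       2        -                   6
(Support: suppose no shared sequences.)

  • $S_{3\text{-}1} = 10$, $S_{3\text{-}2} = 4$, so
    $S_3 = S_{3\text{-}1} W_1 + S_{3\text{-}2} W_2 = 8 > 5$.
  • $S_{4\text{-}1} = 4$, $S_{4\text{-}2} = 10$, so
    $S_4 = S_{4\text{-}1} W_1 + S_{4\text{-}2} W_2 = 6 > 5$.
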
17
Build Seed Motif - example
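  • Calculate pattern scores (threshold 5)

Pattern  Sequence  Support  Weight        Score to motif
P1       CLG       4        W1 = 4 / 8    -
P2       CLN       2        W2 = 2 / 8    -
P3       ALG       2        W3 = 2 / 8    -
P4       ALN       2        -             6.5
(Support: suppose no shared sequences.)

  • $S_{4\text{-}1} = 4$, $S_{4\text{-}2} = 10$,
    $S_{4\text{-}3} = 8$, so
    $S_4 = S_{4\text{-}1} W_1 + S_{4\text{-}2} W_2 + S_{4\text{-}3} W_3 = 6.5 > 5$.
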
18
Build Seed Motif
20
Extract Subsequences
21
Find Motif
MEME
22
Iterative refinement
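  • Run MAST with motif1, motif2, motif3 over the
    entire PROSITE family to obtain sub1, sub2, sub3.
  • Run MEME on sub1, sub2, sub3 to obtain new
    motif1, motif2, motif3.
  • Stable? If no, repeat from the MAST step; if yes,
    choose the best motif.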
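
The refinement loop can be sketched generically. In the sketch below, `find_motifs` and `search_family` are hypothetical callables standing in for the MEME and MAST steps. They do not invoke the real programs, and no command-line options are implied. "Stable" is taken to mean that the motif set is unchanged between two consecutive rounds, which is an assumption, as is the `max_rounds` safeguard.

```python
def refine(family, initial_subseqs, find_motifs, search_family, max_rounds=20):
    """Iterate motif finding and family-wide search until the motifs stop changing.

    find_motifs(subsequences) -> list of motifs          (stand-in for MEME)
    search_family(motifs, family) -> list of subsequences (stand-in for MAST
    plus extraction of the regions around the hits)
    """
    motifs = find_motifs(initial_subseqs)
    for _ in range(max_rounds):
        subseqs = search_family(motifs, family)
        if not subseqs:
            break                      # motifs no longer hit the family
        new_motifs = find_motifs(subseqs)
        if new_motifs == motifs:
            break                      # stable: stop refining
        motifs = new_motifs
    return motifs
```

Passing the two steps in as functions keeps the loop independent of how the external tools are run and makes it easy to test with toy stand-ins.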
24
Experiment
  1. We randomly chose 17 PROSITE families as test
    data set.
  2. Ran MEME directly on these families and got the
    best motif for each of them.
  3. Ran our framework and got the best motif.
  4. Compared the results.

25
PROSITE Patterns
  • PS00010: C-x-DN-x(4)-FY-x-C-x-C.
  • PS00011: x(12)-E-x(3)-E-x-C-x(6)-DEN-x-LIVMFY-
    x(9)-FYW.
  • PS00014: KRHQSA-DENQ-E-L>.
  • PS00018: D-x-DNS-ILVFYW-DENSTG-DNQGHRK-GP-
    LIVMC-DENQSTAGC-x(2)-DE-LIVMFYW.
  • PS00020: LIVM-x-SGN-LIVM-DAGHE-SAG-x-DNEAG-
    LIVM-x-DEAG-x(4)-LIVM-x-LM-SAG-LIVM-LIVMT-
    W-x-LIVM(2).
  • PS00099: AG-LIVMA-STAGCLIVM-STAG-LIVMA-C-
    x-AG-x-AG-x-AG-x-SAG.
  • PS00342: STAGCN-RKH-LIVMAFY>.
  • PS00343: L-P-x-T-G-STGAVDE.
  • PS00409: KRHEQSTAG-G-FYLIVM-ST-LT-LIVP-E-
    LIVMFWSTAG(14).
  • PS00881: DNEG-x-LIVFA-LIVMY-LVAST-H-N-STC.
  • PS01286: P-x(8,10)-LM-R-x-GE-LIVP-x-G-C.
  • PS00012: DEQGSTALMKRH-LIVMFYSTAC-GNQ-LIVMFYAG-
    DNEKHS-S-LIVMST-PCFY-STAGCPQLIVMF-LIVMATN-
    DENQGTAKRHLM-LIVMWSTA-LIVGSTACR-x(2)-LIVMFA.
  • PS00019: EQ-x(2)-ATV-FY-x(2)-W-x-N.
  • PS00660: W-LIV-x(3)-KRQ-x-LIVM-x(2)-QH-x(0,2)-
    LIVMF-x(6,8)-LIVMF-x(3,5)-F-FY-x(2)-DENS.
  • PS00661: HYW-x(9)-DENQSTV-SA-x(3)-FY-LIVM-
    x(2)-ACV-x(2)-LM-x(2)-FY-G-x-DENQST-LIVMFYS.
  • PS00889: LIVMF-G-E-x-GAS-LIVM-x(5,11)-R-STAQ-
    A-x-LIVMA-x-STACV.
  • PS01177: CSH-C-x(2)-GAP-x(7,8)-GASTDEQR-C-
    GASTDEQL-x(3,9)-GASTDEQN-x(2)-CE-x(6,7)-C-C.

26
Performance
  • The result of the comparison.

[Table. Columns: PS00010, PS00011, PS00018, PS00020, PS00409, PS00881, PS00012, PS01286, PS00099, PS00019, PS00660, PS00014, PS00342, PS00343, PS00661, PS00889, PS01177. Rows: MEME, Framework. Cell entries not preserved.]

28
Discussion
  • One flaw: local optima
  • PS01286 is the only family on which our framework
    performs worse than MEME
  • PROSITE pattern: P-x(8,10)-LM-R-x-GE-LIVP-x-G-C
  • MEME: TNS W HE GN RG I AGS LM R LV E LV YLF G C
  • Our framework:
    1. EP W x(4) L G x L KM x VI T GA VI IA T Q G
    2. x(4)-P-x(8)-LM-R-x-E-LV-x-G-C

29
Future Work
  • Design our own motif discovery algorithm
  • Convert the framework to a complete program
  • Test the performance of our program on more
    PROSITE patterns

30
Acknowledgement
  • Prof. Sun Kim
  • Prof. Mehmet Dalkilic (Memo)
  • Arvind Gopu
  • Scott Martin