Data Mining for Protein Structure Prediction - PowerPoint PPT Presentation

Provided by: ipam
Transcript and Presenter's Notes


1
Data Mining for Protein Structure Prediction
  • Mohammed J. Zaki
  • SPIDER Data Mining Project: Scalable, Parallel
    and Interactive Data Mining and Exploration at
    RPI
  • http://www.cs.rpi.edu/~zaki

2
Outline of the Talk
  • How do proteins form?
  • Protein folding problem
  • Contact map mining
  • Using HMMs based on local motifs
  • Mining physical dense frequent patterns
    (non-local motifs)
  • Future directions
  • Heuristic rules
  • Folding pathways

3
How do Proteins Form?
4
How do Proteins Form?
  • Building Blocks of Biological Systems
  • DNA (nucleotides, 4 types): information
    carrier/encoder
  • RNA: bridge from DNA to protein
  • Protein (amino acids, 20 types): action
    molecules
  • Processes
  • Replication of DNA
  • Transcription of gene (DNA) to messenger RNA
    (mRNA)
  • Splicing of non-coding regions of the genes
    (introns)
  • Translation of mRNA into proteins
  • Folding of proteins into 3D structure
  • Biochemical or structural functions of proteins

5
(No Transcript)
6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
Protein Folding Problem
10
Protein Structures
  • Primary structure
  • Un-branched polymer
  • 20 side chains (residues or amino acids)
  • Higher order structures
  • Secondary: local (consecutive) in sequence
  • Tertiary: 3D fold of one polypeptide chain
  • Quaternary: chains packing together

11
Amino Acid
12
Polypeptide Chain
13
Torsion Angles
14
The Protein Folding Problem
15
Contact Map Mining
16
Contact Map
  • Amino acids Ai and Aj are in contact if their 3D
    distance is less than a threshold (7 Å)
  • Sequence separation is given as |i - j|
  • Contact map C is an N x N matrix, where
  • C(i,j) = 1 if Ai and Aj are in contact
  • C(i,j) = 0 otherwise
  • Consider all pairs with |i - j| > 4
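
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `contact_map`, the use of one representative 3D point per residue, and the parameter names are assumptions, since the slide does not say which atom is used to measure the distance.

```python
import math

def contact_map(coords, threshold=7.0, min_separation=4):
    """Build an N x N 0/1 contact map from one 3D point per residue.

    C[i][j] = 1 when the distance is below `threshold` (angstroms) and the
    sequence separation |i - j| is greater than `min_separation`.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            far_in_sequence = abs(i - j) > min_separation
            close_in_space = math.dist(coords[i], coords[j]) < threshold
            if far_in_sequence and close_in_space:
                cmap[i][j] = 1
    return cmap

# Toy input (not real protein data): six points spaced 1 angstrom apart.
toy_coords = [(float(k), 0.0, 0.0) for k in range(6)]
toy_map = contact_map(toy_coords)
```

In the toy input only residues 0 and 5 are separated by more than four positions, so they form the only contact (recorded symmetrically).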

17
Protein 2igd: 3D Structure
(Figure labels: anti-parallel beta sheets, alpha helix, parallel beta sheets)
18
Contact Map (2igd PDB)
(Figure: contact map with axes Amino Acid Ai and Amino Acid Aj; labeled regions: anti-parallel beta sheets, parallel beta sheets, alpha helix)
19
How Much Information Is in Amino Acids Alone?
Classification Problem
  • A pair of amino acids (Ai,Aj) is an instance
  • The class is C (1) or NC (0), i.e., contact or
    non-contact
  • Highly skewed class distribution
  • 1.7% C and 98.3% NC: 300K C vs. 17.3M NC
  • Features for each instance
  • Ai and Aj
  • Class: C or NC

20
Predicting Protein Contacts
  • Predict contacts for new sequence

21
Classification via Association Mining
  • Association mining is good for skewed data
  • Mining: mine frequent itemsets in the contact
    data (Dc)
  • P(X | Dc) = Frequency(X | Dc) / |Dc|
  • Counting: find P(X | Dnc)
  • Pruning
  • Likelihood of a contact: r = P(X | Dc) / P(X | Dnc)
  • Prune pattern X if the ratio r of contact to
    non-contact probability is less than some
    threshold
  • i.e., keep only the patterns highly predictive of
    contacts
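
A minimal sketch of the mine, count and prune steps follows. It enumerates small itemsets by brute force rather than with a real frequent-itemset miner, and the item labels such as "Ai=L", the function names, and the thresholds `min_support` and `min_ratio` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def itemset_counts(instances, max_size=2):
    """Count how many instances contain each itemset of up to max_size items."""
    counts = Counter()
    for items in instances:
        for k in range(1, max_size + 1):
            for subset in combinations(sorted(items), k):
                counts[subset] += 1
    return counts

def predictive_patterns(contacts, non_contacts, min_support=2, min_ratio=2.0):
    """Keep itemsets frequent in Dc whose ratio r = P(X|Dc) / P(X|Dnc) >= min_ratio."""
    c_counts = itemset_counts(contacts)
    nc_counts = itemset_counts(non_contacts)
    kept = {}
    for pattern, count in c_counts.items():
        if count < min_support:
            continue  # not frequent in the contact data
        p_c = count / len(contacts)
        p_nc = nc_counts.get(pattern, 0) / len(non_contacts)
        ratio = float("inf") if p_nc == 0 else p_c / p_nc
        if ratio >= min_ratio:
            kept[pattern] = ratio
    return kept

# Toy data (invented for the example): each instance is a set of feature items.
contacts = [{"Ai=L", "Aj=V"}, {"Ai=L", "Aj=V"}, {"Ai=L", "Aj=I"}]
non_contacts = [{"Ai=L", "Aj=K"}, {"Ai=D", "Aj=V"}, {"Ai=D", "Aj=K"}, {"Ai=L", "Aj=E"}]
patterns = predictive_patterns(contacts, non_contacts)
```

Patterns that never occur in the non-contact data get an infinite ratio here; a real system would more likely smooth the counts, which is a design choice the slide does not cover.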

22
Testing Phase
  • 90-10 split into training and testing
  • 2.4 million pairs, with 36K contacts (1.5%)
  • Evidence calculation
  • Find matching patterns P for each instance
  • Compute cumulative frequency in C and NC
  • Sc = sum of Frequency(X | Dc) where X in P
  • Snc = sum of Frequency(X | Dnc) where X in P
  • Compute evidence as the ratio Sc / Snc
  • Prediction: sort instances on evidence
  • Predict top PR fraction as contacts
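
The evidence calculation and ranking can be sketched as below. The pattern tables `freq_c` and `freq_nc`, the function names, and the way the cutoff is passed as a plain `fraction` of all instances are assumptions; the slide only says that the top PR fraction is predicted as contacts.

```python
def evidence(instance, freq_c, freq_nc):
    """Evidence ratio Sc / Snc over all mined patterns contained in the instance."""
    matches = [p for p in freq_c if set(p) <= instance]
    s_c = sum(freq_c[p] for p in matches)
    s_nc = sum(freq_nc.get(p, 0) for p in matches)
    if s_c == 0:
        return 0.0  # no matching pattern, so no evidence for a contact
    return float("inf") if s_nc == 0 else s_c / s_nc

def predict_contacts(instances, freq_c, freq_nc, fraction):
    """Return indices of the top `fraction` of instances, ranked by evidence."""
    ranked = sorted(
        range(len(instances)),
        key=lambda k: evidence(instances[k], freq_c, freq_nc),
        reverse=True,
    )
    return ranked[: round(fraction * len(instances))]

# Toy pattern frequencies (invented): counts in contact and non-contact data.
freq_c = {("Ai=L",): 90, ("Aj=V",): 60}
freq_nc = {("Ai=L",): 30, ("Aj=V",): 10}
test_instances = [
    {"Ai=D", "Aj=K"},
    {"Ai=L", "Aj=V"},
    {"Ai=L", "Aj=K"},
    {"Ai=D", "Aj=V"},
]
predicted = predict_contacts(test_instances, freq_c, freq_nc, fraction=0.5)
```

With these toy numbers the evidence values are 0, 3.75, 3.0 and 6.0, so the top half is instances 3 and 1.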

23
Experiments
  • 794 proteins from the Protein Data Bank
  • Distinct structures (< 25% similarity)
  • Longest: 907, smallest: 35 amino acids
  • 90-10 split for training-testing
  • Total pairs: 20 million (> 2.5 GB)
  • Contacts: 330 thousand (1.6%)
  • Highly uneven class distribution

24
Evaluation Metrics
  • Na: set of all pairs
  • Na: all pairs with positive evidence
  • Ntc: true contacts in test data
  • Ntc: true contacts with positive evidence
  • Npc: predicted contacts
  • Ntpc: correctly predicted contacts
  • Accuracy = Ntpc / Npc
  • Coverage = Ntpc / Ntc
  • Prediction Ratio (PR) = Ntc / Na
  • Random Predictor Accuracy = Ntc / Na
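
As a small worked example of the accuracy and coverage ratios, the sketch below treats predictions and true contacts as sets of residue-index pairs. The function name and the toy pairs are invented for illustration.

```python
def accuracy_and_coverage(predicted, true_contacts):
    """Accuracy = Ntpc / Npc and coverage = Ntpc / Ntc for sets of (i, j) pairs."""
    predicted, true_contacts = set(predicted), set(true_contacts)
    n_tpc = len(predicted & true_contacts)  # correctly predicted contacts
    accuracy = n_tpc / len(predicted) if predicted else 0.0
    coverage = n_tpc / len(true_contacts) if true_contacts else 0.0
    return accuracy, coverage

# Toy example: 4 predictions, 3 true contacts, 2 of them predicted correctly.
predicted_pairs = {(1, 9), (2, 10), (3, 12), (4, 20)}
true_pairs = {(1, 9), (2, 10), (5, 30)}
acc, cov = accuracy_and_coverage(predicted_pairs, true_pairs)
```

Here accuracy is 2/4 and coverage is 2/3. When the number of predicted contacts equals the number of true contacts the two ratios coincide, which may be what the results slides report as the crossover point.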

25
Results (Amino Acids, All Lengths)
Crossover: 7% accuracy and 7% coverage, 2 times
over random
26
Results (Amino Acids, by Length)
1-100: 12% accuracy (A) and coverage (C)
100-170: 6% A and C
170-300: 4.5% A and C
300+: 2% A and C
27
Using HMMs based on Local Motifs to Improve
Classification
28
An HMM for Local Predictions
  • HMMSTR (Chris Bystroff, Biology, RPI)
  • Build a library of short sequences that tend to
    fold uniquely across protein families: the
    I-Sites Library
  • Treat each motif as a Markov chain
  • Merge the motifs into a global HMM for local
    structure prediction

29
Training the HMM
  • Build I-sites Library
  • Short sequence motifs (3 to 19 residues)
  • Exhaustive clustering of sequences
  • Non-redundant PDB dataset (< 25% similarity)
  • Build an HMM
  • Each of 262 motifs is a chain of Markov states
  • Each state has sequence and structure for one
    position
  • Merge I-sites motifs hierarchically to get one
    global HMM for all the motifs

30
HMM Output
  • Total of 282 States in the HMM
  • Each state produces or emits
  • Amino acid profile (20 probability values)
  • Secondary structure (D) (helix, strand or loop)
  • Backbone angles (R) (11 dihedral angle symbols)
  • Finer structural context (C) (10 context symbols)

31
I-Sites Motifs (Initiation Sites)
Beta Hairpin
Beta to Alpha
Helix C-Cap
32
(No Transcript)
33
Data Format and Preparation
  • Take the 794 PDB proteins
  • Compute optimal alignment to HMM
  • Find the best state sequence for the observed
    amino acids
  • Output probability distribution of a residue over
    all the 282 HMM states
  • Integrate the 3 datasets
  • Alignment probability distribution (N x 282)
  • Amino acid and context information (D, R, C)
  • Contact map (N x N)

34
HMMSTR Output (per Protein)
35
Adding features from HMMSTR
  • The class is C (1) or NC (0)
  • Highly skewed class distribution
  • Approx. 1.5% C and 98.5% NC
  • Features for each instance
  • Ai, Aj, Di, Dj, Ri, Rj, Ci, Cj
  • Profile: pi1, pi2, ..., pi20, pj1, pj2, ..., pj20
  • HMM states: qi1, qi2, ..., qi282, qj1, qj2, ..., qj282
  • Class: C or NC

36
HMM and AA (R,D,C), All Lengths
Left: crossover at 19% accuracy and coverage, 5.3
times over random
Right: crossover (RDC) at 17% accuracy and
coverage, 5 times over random
37
HMM + AA + R,D,C (by Length)
1-100: 30% accuracy (A) and coverage (C)
100-170: 17% A and C
170-300: 10% A and C
300+: 6% A and C
38
Predicted Contact Map (2igd)
39
Summary of Classification Results
  • Challenging prediction problem
  • In essence, we have to predict a contact matrix
    for a new protein
  • Hybrid HMM/Associations approach
  • Best results to date: 19% overall
    accuracy/coverage, 30% for short proteins
  • 14.4% accuracy (Fariselli and Casadio, 1999; NN)
  • 13% accuracy (Thomas et al., 1996)
  • Short proteins: 26% (Olmea and Valencia, 1997)

40
Mining Physical Dense Frequent Patterns
(non-local motifs)
41
Characterizing Physical, Protein-like Contact Maps
  • A very small subset of all contact maps code for
    physically possible proteins (self-avoiding,
    globular chains)
  • A contact map must
  • Satisfy geometric constraints
  • Represent low-energy structure
  • What are the typical non-local interactions?
  • Frequent dense 0/1 submatrices in contact maps
  • 3-step approach: 1) data generation, 2) dense
    pattern mining, and 3) mapping to structure space

42
Dense Pattern Mining
  • 12,524 protein-like 60 residue structures
  • Use HMMSTR to generate protein-like sequences
  • Use ROSETTA to generate their structures
  • Monte Carlo fragment insertion (from I-sites
    library)
  • Up to 5 possible low-energy structures retained
  • Frequent 2D Pattern Mining
  • Use a W x W sliding window (W = window size)
  • Measure density under each window
  • (N-W)^2 / 2 possible windows per protein of
    length N
  • Look for minimum density; scale away from the
    diagonal
  • Try different window sizes
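
The sliding-window density scan might look like the sketch below. The enumeration convention (windows whose top-left corner lies strictly above the diagonal), the function name, and the fixed `min_density` cutoff are assumptions; the slide's exact window count and its scaling of the density threshold away from the diagonal are not modeled here.

```python
def dense_windows(cmap, w, min_density):
    """Return (i, j, density) for each w x w window with at least min_density 1-cells.

    Only windows whose top-left corner (i, j) satisfies j > i are scanned,
    because the contact map is symmetric.
    """
    n = len(cmap)
    found = []
    for i in range(n - w + 1):
        for j in range(i + 1, n - w + 1):
            density = sum(cmap[i + a][j + b] for a in range(w) for b in range(w))
            if density >= min_density:
                found.append((i, j, density))
    return found

# Toy 6 x 6 contact map with three contacts near the top-right corner.
toy = [[0] * 6 for _ in range(6)]
for i, j in [(0, 4), (0, 5), (1, 4)]:
    toy[i][j] = toy[j][i] = 1
windows = dense_windows(toy, w=2, min_density=2)
```

On the toy map two windows reach the cutoff: the one at (0, 3) with two contacts and the one at (0, 4) with three.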

43
Counting Dense Patterns
  • Naïve approach: for W = 5, N = 60 there are 1485
    windows per protein. Total of 15 million possible
    windows for 12,524 proteins
  • Test if two submatrices are equal
  • Linear search: O(P x W^2) with P current dense
    patterns
  • Hash based: O(W^2)
  • Our approach: 2-level hashing
  • O(W) time

44
Pattern (WxW Submatrix) Encoding
  • Encode submatrix as string (W integers)
  • Submatrix row -> integer value
  • 00000 -> 0
  • 01100 -> 12
  • 01000 -> 8
  • 01000 -> 8
  • 00000 -> 0
  • Concatenated string: 0.12.8.8.0
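
The encoding can be reproduced with a short function: each row of the 0/1 submatrix is read as a binary number, and the W integers are joined with dots. The function name `encode_submatrix` is hypothetical.

```python
def encode_submatrix(rows):
    """Encode a W x W 0/1 submatrix as W dot-separated integers, one per row."""
    values = [int("".join(str(bit) for bit in row), 2) for row in rows]
    return ".".join(str(v) for v in values)

# The 5 x 5 pattern from the slide.
pattern = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
string_id = encode_submatrix(pattern)  # "0.12.8.8.0"
```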

45
Two-level Hashing
  • String ID (M)
  • Level 1 (approximate)
  • Level 2 (exact): h2(M) = StringID(M)
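
One way to picture two-level hashing is a table of buckets keyed by a cheap approximate hash, where each bucket is keyed by the exact string ID. The transcript does not give the level-1 function, so the sum of the row integers used below is purely an assumption, as are the class and method names.

```python
from collections import defaultdict

def row_values(rows):
    """Read each 0/1 row of the submatrix as a binary integer."""
    return [int("".join(map(str, row)), 2) for row in rows]

class TwoLevelCounter:
    """Count submatrix occurrences with an approximate then an exact hash."""

    def __init__(self):
        self.buckets = defaultdict(dict)  # level-1 key -> {string ID: count}

    def add(self, rows):
        values = row_values(rows)
        h1 = sum(values)                         # level 1: approximate (assumed)
        string_id = ".".join(map(str, values))   # level 2: exact string ID
        bucket = self.buckets[h1]
        bucket[string_id] = bucket.get(string_id, 0) + 1

    def support(self, rows):
        values = row_values(rows)
        string_id = ".".join(map(str, values))
        return self.buckets.get(sum(values), {}).get(string_id, 0)

# Two different 3 x 3 patterns that collide at level 1 (both sums are 5).
pat_a = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]  # string ID 3.2.0
pat_b = [[0, 1, 0], [0, 1, 1], [0, 0, 0]]  # string ID 2.3.0
counter = TwoLevelCounter()
for pat in (pat_a, pat_a, pat_b):
    counter.add(pat)
```

The two toy patterns land in the same level-1 bucket and are separated only by the exact string ID, which is the situation the second level exists to handle.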

46
Binding Patterns to Protein Sequence and
Structure
  • Using window size W = 5
  • StringID = 0.12.8.8.0, Support = 170
  • 00000
  • 01100
  • 01000
  • 01000
  • 00000
  • Occurrences
  • pdb-name  (X,Y)  X_sequence  Y_sequence  Interaction
  • 1070.0    52,30  ILLKN       TFVRI       alpha::beta
  • 1145.0    51,13  VFALH       GFHIA       alpha::strand
  • 1251.2    42,6   EVCLR       GSKFG       alpha::strand
  • 1312.0    54,11  HGYDE       ATFAK       alpha::beta
  • 1732.0    49,6   HRFAK       KELAG       alpha::beta
  • 2895.0    49,7   SRCLD       DTIYY       alpha::beta
  • ...

47
Frequent Dense Local Patterns
48
Frequent Dense Non-Local Patterns
Alpha Alpha
Alpha Beta Sheet
49
Frequent Dense Non-Local Patterns
Alpha Beta Turn
Beta Sheet Beta Turn
50
Future Directions
51
Mining Physicality Rules
  • Comprehensive list of non-local motifs
  • I-sites library catalogs local motifs
  • Mining heuristic rules for physicality
  • Based on simple geometric constraints
  • Rules governing contacts and non-contacts
  • Parallel beta sheets: if C(i,j) = 1 and
    C(i+2,j+2) = 1, then C(i,j+2) = 0 and
    C(i+2,j) = 0
  • Anti-parallel beta sheets: if C(i,j+2) = 1 and
    C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
  • Alpha helices: if C(i,i+4) = 1, C(i,j) = 1, and
    C(i+4,j) = 1, then C(i+2,j) = 0
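
The three rules can be read as local consistency checks on a contact map. The sketch below assumes the offsets in the rules are i+2, j+2 and i+4, and the function names and the toy map are invented for illustration; each function returns True when the rule's premise holds but its conclusion is broken.

```python
def violates_parallel_rule(c, i, j):
    """C(i,j) = 1 and C(i+2,j+2) = 1 require C(i,j+2) = 0 and C(i+2,j) = 0."""
    return bool(c[i][j] and c[i + 2][j + 2] and (c[i][j + 2] or c[i + 2][j]))

def violates_antiparallel_rule(c, i, j):
    """C(i,j+2) = 1 and C(i+2,j) = 1 require C(i,j) = 0 and C(i+2,j+2) = 0."""
    return bool(c[i][j + 2] and c[i + 2][j] and (c[i][j] or c[i + 2][j + 2]))

def violates_helix_rule(c, i, j):
    """C(i,i+4) = 1, C(i,j) = 1 and C(i+4,j) = 1 require C(i+2,j) = 0."""
    return bool(c[i][i + 4] and c[i][j] and c[i + 4][j] and c[i + 2][j])

def set_contact(c, i, j):
    c[i][j] = c[j][i] = 1

# Toy 8 x 8 map: a parallel ladder (0,5), (2,7) is consistent on its own.
ladder = [[0] * 8 for _ in range(8)]
set_contact(ladder, 0, 5)
set_contact(ladder, 2, 7)
ok_before = violates_parallel_rule(ladder, 0, 5)   # False
set_contact(ladder, 0, 7)                          # add a cross contact
bad_after = violates_parallel_rule(ladder, 0, 5)   # True
```

A map that triggers any of these checks would be flagged as non-physical under the heuristic; the caller is responsible for keeping the indices inside the matrix.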

52
Heuristic Rules of Physicality
  • Parallel beta sheets: if C(i,j) = 1 and
    C(i+2,j+2) = 1, then C(i,j+2) = 0 and
    C(i+2,j) = 0
  • Anti-parallel beta sheets: if C(i,j+2) = 1 and
    C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
53
Protein Folding Pathways
  • Rules for Pathways in Contact Map Space
  • Pathway is time-ordered sequence of contacts
  • Condensation rule: new contact within Smax
  • U(i,j) < Smax, where U(i,j) is the number of
    unfolded residues from i to j
  • Pathway prediction is complementary to structure
    prediction

54
Contact Map Folding Pathways