Protein Structural Prediction

Slides: 52
Provided by: root
Learn more at: https://web.stanford.edu
1
Protein Structural Prediction
2
Structure Determines Function
The Protein Folding Problem
  • What determines structure?
  • Energy
  • Kinematics
  • How can we determine structure?
  • Experimental methods
  • Computational predictions

3
Protein Structure Prediction
  • ab initio
  • Use just first principles: energy, geometry, and
    kinematics
  • Homology
  • Find the best match to a database of sequences
    with known 3D-structure
  • Threading
  • Meta-servers and other methods

4
Threading
MTYKLILN . NGVDGEWTYTE
Main difference between homology-based prediction
and threading: threading uses the structure to
compute the energy function during alignment
  • Threading is the golden mean between
    homology-based prediction and molecular modeling
    (?)

5
Threading Overview
  • Build a structural template database
  • Define a sequence-structure energy function
  • Apply a threading algorithm to query sequence
  • Perform local refinement of secondary structure
  • Report best resulting structural model

6
Threading Search Space
Protein Sequence X
Protein Structure Y
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
7
Threading Template Database
  • FSSP, SCOP, CATH
  • Remove pairs of proteins with highly similar
    structures
  • Efficiency
  • Statistical skew in favor of large families

8
Threading Energy Function
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
  • Es: how well a residue fits a structural
    environment
  • Ep: how preferable it is to put two particular
    residues nearby
  • Em: how often a residue mutates to the template
    residue
  • Eg: alignment gap penalty
  • Ess: compatibility with local secondary structure
    prediction
Total energy: $E = w_m E_m + w_s E_s + w_p E_p + w_g E_g + w_{ss} E_{ss}$
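The total energy is a weighted sum of the five terms. A minimal sketch, assuming each term has already been computed for one candidate alignment; the numeric values and weights below are hypothetical and only illustrate the arithmetic:

```python
def total_energy(terms, weights):
    """Weighted sum of threading energy terms.

    `terms` and `weights` are dicts keyed by term name:
    "m" (mutation), "s" (singleton/environment), "p" (pairwise),
    "g" (gap), "ss" (secondary structure).
    """
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical term values and weights, for illustration only.
terms = {"m": -2.0, "s": -1.5, "p": -0.5, "g": 3.0, "ss": -1.0}
weights = {"m": 1.0, "s": 1.0, "p": 2.0, "g": 0.5, "ss": 1.0}
energy = total_energy(terms, weights)
```

In a real system the weights would be fitted on training data; here they are simply picked by hand.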
9
Threading Formulation
  • Contact graph captures amino acid interactions
  • Cores represent important local structure units
  • No gaps within each core

[Figure: contact graph over cores C1 to C4 and their
alignment positions t1 to t4 on the sequence]
10
Threading Formulation
CMG (v, ?)
11
Threading Formulation
  • From Lathrop & Smith

12
Threading Search Space
Protein Sequence X
Protein Structure Y
How Hard is Threading?
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
CORES
13
How Hard is Threading?
  • At least as hard as MAX-CUT
  • MAX-CUT: given a graph G = (V, E), find a cut
    (S, T) of V with the maximum number of edges
    between S and T.
  • The bad news: APX-complete even when each node
    has at most B edges (where B > 2)

14
Reduction of MAX-CUT to Threading
Sequence: 0 1 0 1 0 1 0 1 0 1 0 1 0 1
Cores: v1 v2 v3 v4 v5 v6 v7
The sequence consists of |V| 01-pairs
  • |V| cores; each core i has length 1 and
    corresponds to vi
  • Let Ep(0,1) = 1: every edge labeled 0-1 or 1-0
    gets a score of 1
  • Then, size of cut = threading score
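The reduction can be checked by brute force on a tiny graph. The sketch below is an illustration under assumptions: the 4-vertex graph is made up, cores are 0-indexed, and both problems are solved by exhaustive enumeration, which is only feasible for very small inputs.

```python
from itertools import combinations, product


def max_cut(n, edges):
    """Brute-force MAX-CUT: try every 0/1 labelling of the n vertices."""
    return max(
        sum(labels[u] != labels[v] for u, v in edges)
        for labels in product((0, 1), repeat=n)
    )


def best_threading_score(n, edges):
    """Best threading of n length-1 cores onto the sequence 0101...01."""
    sequence = [0, 1] * n
    best = 0
    # Cores stay in order, so a threading is an increasing choice of positions.
    for positions in combinations(range(len(sequence)), n):
        letters = [sequence[p] for p in positions]
        # Ep(0,1) = 1: an edge scores 1 when its two cores see different letters.
        score = sum(letters[u] != letters[v] for u, v in edges)
        best = max(best, score)
    return best


# Hypothetical contact graph: a 4-cycle plus one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
cut_size = max_cut(4, edges)
threading_score = best_threading_score(4, edges)
```

Every 0/1 labelling is reachable because core i can always be placed on either letter of the i-th 01-pair, so the two optima coincide.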

15
Threading with Branch & Bound
  • Set of solutions can be partitioned into subsets
    (branch)
  • Upper limit on a subset's solution can be
    computed fast (bound)
  • Branch & Bound
  • Select subset with best possible bound
  • Subdivide it, and compute a bound for each subset
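A minimal best-first sketch of this loop on a toy threading problem. Everything here is an assumption made for illustration: cores have length 1, `single` and `pair` are hypothetical score functions, the score is maximised, and the bound simply replaces every undecided term by its best possible value.

```python
import heapq
from itertools import combinations


def branch_and_bound(n_cores, n_pos, single, pair):
    """Best-first branch and bound; returns (score, positions), maximising."""
    best_single = [max(single(i, k) for k in range(n_pos)) for i in range(n_cores)]
    best_pair = {
        (i, j): max(pair(i, k, j, l) for k, l in combinations(range(n_pos), 2))
        for i, j in combinations(range(n_cores), 2)
    }

    def upper_bound(placed):
        m = len(placed)
        # Exact score of the cores placed so far.
        exact = sum(single(i, k) for i, k in enumerate(placed))
        exact += sum(pair(i, placed[i], j, placed[j])
                     for i, j in combinations(range(m), 2))
        # Optimistic score for everything that involves an unplaced core.
        optimistic = sum(best_single[m:])
        optimistic += sum(v for (i, j), v in best_pair.items() if j >= m)
        return exact + optimistic

    heap = [(-upper_bound(()), ())]
    while heap:
        neg_bound, placed = heapq.heappop(heap)  # subset with the best bound
        if len(placed) == n_cores:
            return -neg_bound, placed  # bound is exact for a full threading
        start = placed[-1] + 1 if placed else 0
        remaining = n_cores - len(placed)
        # Branch: one child per position for the next core, keeping order.
        for k in range(start, n_pos - remaining + 1):
            child = placed + (k,)
            heapq.heappush(heap, (-upper_bound(child), child))
    return None


# Hypothetical toy scores.
def single(i, k):
    return (7 * i + 3 * k) % 5


def pair(i, k, j, l):
    return 1 if l - k == 2 else 0


score, placed = branch_and_bound(3, 6, single, pair)
```

Because the bound never underestimates, the first complete threading popped from the heap is optimal. A tighter bound prunes more subsets but costs more to compute.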

16
Threading with Branch & Bound
  • Key to this algorithm is the tradeoff in choosing
    the lower bound:
  • efficient to compute
  • tight

17
Threading with Integer Programming
maximize z = 6x + 5y (linear function)
subject to 3x + y ≤ 11, −x + 2y ≤ 5, x, y ≥ 0
(linear constraints)
Linear Program: the objective and linear constraints
above
Integer Program: additionally, integral constraints
(nonlinear): x, y ∈ {0, 1}
RAPTOR: integer programming-based threading,
perhaps the best protein threading system
18
Threading with Integer Programming
  • x(i,k) denotes that core i is aligned to
    sequence position k
  • y(i,k,j,l) denotes that core i is aligned to
    position k and core j is aligned to position l
  • D(i) all positions where core i can be aligned
    to
  • R(i, j, k) set of possible alignments of core j,
    given that core i aligns to position k
  • core_i = (head_i, tail_i, length_i = tail_i − head_i
    + 1)

19
Threading with Integer Programming
  • Cores are aligned in order
  • Each y variable is 1 if and only if its two x
    variables are 1, so x and y represent exactly the
    same threading
  • Each core has only one alignment position
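The x and y variables can be made concrete on a tiny instance. This is a sketch under assumptions, not RAPTOR's code: positions are 0-based core start positions, gaps between cores may have any length, and the linking constraint is written as x(i,k) = Σ over l in R(i,j,k) of y(i,k,j,l), which is one standard way to express "y is 1 iff both x are 1" linearly.

```python
from itertools import product


def domains(lengths, n):
    """D(i): start positions where core i can be aligned."""
    return [range(sum(lengths[:i]), n - sum(lengths[i:]) + 1)
            for i in range(len(lengths))]


def compatible(lengths, D, i, j, k):
    """R(i, j, k): positions of core j (j > i) allowed when core i is at k."""
    return [l for l in D[j] if l >= k + sum(lengths[i:j])]


def threadings(lengths, n):
    """All in-order, non-overlapping placements of the cores."""
    for ks in product(*domains(lengths, n)):
        if all(ks[i + 1] >= ks[i] + lengths[i] for i in range(len(ks) - 1)):
            yield ks


def indicators(ks, lengths, n):
    """0/1 variables x(i,k) and y(i,k,j,l) describing one threading."""
    D, m = domains(lengths, n), len(lengths)
    x = {(i, k): int(ks[i] == k) for i in range(m) for k in D[i]}
    y = {(i, k, j, l): x[i, k] * x[j, l]
         for i in range(m) for j in range(i + 1, m)
         for k in D[i] for l in compatible(lengths, D, i, j, k)}
    return x, y


def satisfies_constraints(x, y, lengths, n):
    D, m = domains(lengths, n), len(lengths)
    # Each core has exactly one alignment position.
    one_position = all(sum(x[i, k] for k in D[i]) == 1 for i in range(m))
    # x(i,k) equals the sum of its y variables over compatible positions.
    linked = all(
        x[i, k] == sum(y[i, k, j, l] for l in compatible(lengths, D, i, j, k))
        for i in range(m) for j in range(i + 1, m) for k in D[i]
    )
    return one_position and linked


# Hypothetical instance: two cores of length 2 and 3, sequence of length 7.
all_threadings = list(threadings([2, 3], 7))
```

Calling `satisfies_constraints(*indicators(ks, [2, 3], 7), [2, 3], 7)` for each `ks` in `all_threadings` checks that every valid threading yields a feasible 0/1 assignment.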
20
Energy Function is Linear
  • Sequence substitution score
  • Fitness of aa in each position (example,
    hydrophobicity)
  • Agreement with secondary structure prediction
  • Pairwise interaction between two cores
  • Gap between two successive cores

21
LP Relaxation and (again) Branch & Bound
  • Relax the integral constraint to
  • x(i,j), y(i,k,j,l) ≥ 0
  • Solve the LP using a standard method
  • (RAPTOR uses IBM's OSL)
  • If the resulting solution is integral, done
  • Else, select one non-integral variable
    (heuristically), and generate two subproblems by
    setting it to 0 and to 1; use Branch & Bound
  • In practice, in RAPTOR only 1% of the instances
    in the test database required step 4; almost
    all solutions are integral!
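The relax, solve, branch loop can be sketched on a two-variable toy problem: maximize 6x + 5y subject to 3x + y ≤ 11, −x + 2y ≤ 5, x, y ≥ 0 with x and y integer (the inequality directions are assumed). The sketch differs from the threading setting in two labeled ways: the LP solver is a tiny vertex-enumeration routine that only works for two variables and a bounded region, and branching uses floor/ceiling splits for general integers rather than fixing a binary variable to 0 or 1.

```python
import math
from fractions import Fraction
from itertools import combinations


def solve_lp(c, cons):
    """Maximise c[0]*x + c[1]*y subject to a*x + b*y <= r for (a, b, r) in cons.

    Toy solver: checks every intersection of two constraint lines.
    Returns (value, (x, y)) or None if no feasible vertex exists.
    """
    best = None
    for (a1, b1, r1), (a2, b2, r2) in combinations(cons, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel lines, no vertex
        x = Fraction(r1 * b2 - r2 * b1, det)
        y = Fraction(a1 * r2 - a2 * r1, det)
        if all(a * x + b * y <= r for a, b, r in cons):
            z = c[0] * x + c[1] * y
            if best is None or z > best[0]:
                best = (z, (x, y))
    return best


def branch_and_bound(c, cons):
    """LP-relaxation branch and bound for a 2-variable integer program."""
    best = None
    stack = [list(cons)]
    while stack:
        current = stack.pop()
        relaxed = solve_lp(c, current)
        if relaxed is None:
            continue  # infeasible subproblem
        z, point = relaxed
        if best is not None and z <= best[0]:
            continue  # bound: relaxation cannot beat the incumbent
        fractional = [i for i, v in enumerate(point) if v.denominator != 1]
        if not fractional:
            best = (z, point)  # integral solution, new incumbent
            continue
        i = fractional[0]
        lo = math.floor(point[i])
        unit = (1, 0) if i == 0 else (0, 1)
        stack.append(current + [(unit[0], unit[1], lo)])            # var <= lo
        stack.append(current + [(-unit[0], -unit[1], -(lo + 1))])   # var >= lo + 1
    return best


cons = [(3, 1, 11), (-1, 2, 5), (-1, 0, 0), (0, -1, 0)]
lp_value, lp_point = solve_lp((6, 5), cons)
ip_value, ip_point = branch_and_bound((6, 5), cons)
```

On this instance the relaxation alone is fractional, so branching is needed; when the relaxed optimum is already integral, the loop stops after a single LP solve.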

22
CAFASP
  • GOAL
  • The goal of CAFASP is to evaluate the performance
    of fully automatic structure prediction servers
    available to the community. In contrast to the
    normal CASP procedure, CAFASP aims to answer the
    question of how well servers do without any
    intervention of experts, i.e. how well ANY user
    using only automated methods can predict protein
    structure. CAFASP assesses the performance of
    methods without the user intervention allowed in
    CASP.

23
Performance Evaluation in CAFASP3
Servers with names in italics are meta-servers.
MaxSub score ranges from 0 to 1. Therefore, the
maximum total score is 30.
(http://www.cs.bgu.ac.il/~dfischer/CAFASP3,
released in December 2002.)
24
One structure where RAPTOR did best
Red: true structure. Blue: correct part of
prediction. Green: wrong part of prediction.
  • Target size: 144
  • Superimposable size within 5 Å: 118
  • RMSD: 1.9

25
Some more results by other programs
26
Some more results by other programs
27
Some more results by other programs
28
Structural Motifs
beta helix
beta barrel
beta trefoil
29
Structural Motif Recognition
  • Secondary Structure Prediction
  • Find the α helices, β sheets, loops in a protein
    sequence
  • Given an amino acid residue sequence, does it
    fold as a
  • Coiled Coil?
  • β helix?
  • β barrel?
  • Zinc finger?
  • Intermediate goals towards folding
  • Useful information about the function of a
    protein
  • More amenable to sequence analysis, than full
    fold prediction

30
Structural Motif Recognition
  • Collect a database of known motifs and
    corresponding amino acid subsequences
  • Devise a method/model to match a new sequence
    to existing motif database
  • Verify computationally on a test set (divide
    database into training and testing subsets)
  • Verify in lab

31
Structural Motif Recognition Methods
  • Alignment
  • Neural Nets
  • Hidden Markov Models
  • Threading
  • Profile-based Methods
  • Other Statistical Methods

32
Predicting Coiled Coils
33
Predicting Coiled Coils
  • NewCoils: multiply probabilities of residue
    frequencies in each coiled-coil position

34
Predicting Coiled Coils
  • PairCoil: multiply pairwise probabilities of
    spatially neighboring positions
  • Use a sliding window of length 28
  • Perfect score separation between true and false
    examples (false: non-coiled-coil α helices)
  • Berger et al. PNAS 1995
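A minimal sketch of the single-position (NewCoils-style) scoring idea with a sliding window of 28. It is not the published method: the probability table `toy_prob` is invented for illustration (it only favours hydrophobic residues at heptad positions a and d), and the PairCoil pairwise terms are left out. Products of probabilities are computed as sums of logs.

```python
import math

HYDROPHOBIC = set("LIVMA")


def toy_prob(aa, heptad_pos):
    """Hypothetical residue probability at a heptad position (0 = a ... 6 = g)."""
    if heptad_pos in (0, 3):  # core positions a and d
        return 0.15 if aa in HYDROPHOBIC else 0.02
    return 0.03 if aa in HYDROPHOBIC else 0.06


def window_score(window, register):
    """Log of the product of per-position probabilities for one register."""
    return sum(math.log(toy_prob(aa, (register + i) % 7))
               for i, aa in enumerate(window))


def best_window_scores(seq, width=28):
    """Best score over the 7 heptad registers for every window of `width`."""
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        scores.append(max(window_score(window, r) for r in range(7)))
    return scores


coil_like = "LKELEDK" * 4      # hydrophobic residues at positions a and d
non_coil = "GSGSGSG" * 4
coil_score = best_window_scores(coil_like)[0]
non_coil_score = best_window_scores(non_coil)[0]
```

A real predictor would estimate the table from known coiled coils and compare scores against a background distribution before calling a region.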

35
Predicting β helices
  • Helix composed of three parallel β sheets
  • Very few solved structures, very different from
    one another
  • Absent in eukaryotes!
  • Probably evolved subsequent to prok/euk split

36
Predicting β helices
  • Only available program: BetaWrap
  • The rungs subproblem
  • Given the location of a T2 turn of one rung, find
    location of T2 turn of next rung
  • Distribution of turn lengths
  • Bonus/penalty for stacked pairs in the parallel
    strands
  • Discard if highly charged residues in the
    inward-pointing positions of β strand
  • From a rung to multiple rungs
  • Find multiple initial B2-T2-B3 rungs
  • Use sequence template based on hydrophobicity to
    find many candidate rungs
  • Find optimal wrap by DP heuristic score,
    based on 5 consistent rungs
  • Completing the parse
  • Find B1 strands by locally optimizing their
    location

37
Predicting β helices
  • BetaWrap gives scores that separate true from
    false β helices
  • Bradley et al. PNAS 2001

38
Predicting β trefoils
http://betawrappro.csail.mit.edu/
Similar idea: use a combination of domain-specific
expert knowledge with statistics
WRAP-AND-PACK
  • WRAP: search for antiparallel β strands to wrap a
    cap
  • PACK: place the side chains in the interior of
    the wrapped β strands
39
Predicting Secondary Structure
  • Given amino acid sequence, classify positions
    into α helices, β strands, or loops
  • In general, harder than protein motif
    identification
  • Best methods rely on Neural Networks
  • Similarly good separation can be achieved by SVMs
  • PSIPRED
  • Given a sequence x, generate profile using
    PSI-BLAST
  • Pass the profile to a pre-trained NN
  • Output classification: α helix / β strand / loop

40
PSIPRED
Profile M
  • Training & Testing
  • Start with database of determined folds (< 1.87
    Å)
  • Remove redundancy: any pair of proteins with
    high similarity (found by PSI-BLAST); 187
    remaining proteins
  • 3-fold cross validation
  • 76% classification accuracy
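The evaluation protocol can be sketched in a few lines. This is an illustration under assumptions: accuracy is taken to be the per-residue three-state agreement (often called Q3), and the folds are formed by simple striding over the protein list, which may differ from how the original split was made.

```python
def q3(predicted, observed):
    """Fraction of positions whose predicted state (H/E/C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def k_folds(items, k=3):
    """Yield (train, test) splits; each item is in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# Hypothetical protein identifiers standing in for the 187 proteins.
proteins = [f"prot{i}" for i in range(6)]
splits = list(k_folds(proteins, k=3))
example_accuracy = q3("CCHHE", "CCHHC")
```

For each split, a model would be trained on `train` and `q3` averaged over the proteins in `test`.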

41
PSIPRED server
PSIPRED PREDICTION RESULTS
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix,
E=strand, C=coil)
AA: Target sequence
PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)

Example 1
Conf: 9888788777656877765688766579
Pred: CCCCCCCCCCCCCCCCCCCCCCCCCCCC
AA:   PEPTIDEPEPTIDEPEPTIDEPEPTIDE

Example 2
Conf: 998888872100111210012112359
Pred: CCCCCCCCCCCHHHHHHHHCCCCCCCC
AA:   PTYPTYPTXXXXXXXXXXXXTEETEET

Example 3
Conf: 91025687432236422336410232027743223653334679
Pred: CCCCCCCCCCCCCCCCCCCCCCCEEEECCCCCCCCCCCCCCCCC
AA:   THISISAPRXTEINSEQXENCETHISISAPRXTEINSEQXENCE
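Output in this horizontal layout is easy to post-process. The sketch below assumes the data rows are prefixed with `Conf:`, `Pred:` and `AA:` and that the legend lines have already been removed; both helper names are hypothetical and not part of PSIPRED.

```python
from itertools import groupby


def parse_hformat(text):
    """Concatenate the Conf/Pred/AA data rows of a horizontal report."""
    rows = {"Conf": "", "Pred": "", "AA": ""}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in rows:
            rows[key.strip()] += value.strip()
    return rows


def segments(pred):
    """Run-length encode a prediction string into (state, start, end) tuples."""
    out, pos = [], 0
    for state, run in groupby(pred):
        length = len(list(run))
        out.append((state, pos, pos + length))  # end is exclusive
        pos += length
    return out


report = "\n".join([
    "Conf: 9888788777656877765688766579",
    "Pred: " + "C" * 28,
    "AA:   " + "PEPTIDE" * 4,
])
rows = parse_hformat(report)
helix_example = segments("C" * 11 + "H" * 8 + "C" * 8)
```

`segments` gives the boundaries of each predicted helix, strand, or coil run, which is usually what downstream analysis needs.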
42
TRILOGY: Sequence-Structure Patterns
  • Identify short sequence-structure patterns of 3
    amino acids
  • Find statistically significant ones
    (hypergeometric distribution)
  • Correct for multiple trials
  • These patterns may have structural or functional
    importance
  • Pseq: R1 x(a-b) R2 x(c-d) R3
  • Pstr: 3 Cα-Cα distances, 3 Cα-Cβ vectors
  • Start with short patterns of 3 amino acids
  • V, I, L, M, F, Y, W, D, E, K, R, H, N,
    Q, S, T, A, G, S
  • Extend to longer patterns
  • Bradley et al. PNAS 99:8500-8505, 2002

43
TRILOGY
44
TRILOGY Extension
Glue together two 3-aa patterns that overlap in 2
amino acids
$\text{P-score} = \sum_{i=M_{pat}}^{\min(M_{seq}, M_{str})} \binom{M_{seq}}{i} \binom{T - M_{seq}}{M_{str} - i} \binom{T}{M_{str}}^{-1}$
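The P-score is a hypergeometric tail probability and can be computed exactly. In the sketch below the interpretation of the symbols is an assumption read off the formula: T is the total number of residue triplets considered, Mseq the number matching the sequence pattern, Mstr the number matching the structure pattern, and Mpat the number matching both.

```python
from fractions import Fraction
from math import comb


def p_score(T, m_seq, m_str, m_pat):
    """Probability of at least m_pat joint matches by chance.

    Sums C(m_seq, i) * C(T - m_seq, m_str - i) / C(T, m_str)
    for i from m_pat to min(m_seq, m_str).
    """
    upper = min(m_seq, m_str)
    numerator = sum(
        comb(m_seq, i) * comb(T - m_seq, m_str - i)
        for i in range(m_pat, upper + 1)
    )
    return Fraction(numerator, comb(T, m_str))


# Hypothetical counts, for illustration only.
example = p_score(T=10, m_seq=4, m_str=3, m_pat=2)
```

Using exact fractions avoids underflow for very small probabilities; the result can be converted with `float(...)` when needed. The correction for multiple trials mentioned earlier would be applied on top of this value.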
45
TRILOGY Longer Patterns
  • β-α-β unit found in three proteins with the
    TIM-barrel fold
  • NAD/RAD binding motif found in several folds
  • Type-II β turn between unpaired β strands
  • Helix-hairpin-helix DNA-binding motif
  • A β-hairpin connected with a crossover to a third
    β-strand
  • Three strands of an anti-parallel β-sheet
  • A fold with repeated aligned β-sheets
Four Cysteines forming 4 S-S disulfide bonds
46
Small Libraries of Structural Fragments for
Representing Protein Structures
47
Fragment Libraries For Structure Modeling
predicted structure
known structures
48
Small Libraries of Protein Fragments
  • Kolodny, Koehl, Guibas, Levitt, JMB 2002
  • Goal
  • Small alphabet of protein structural fragments
    that can be used to represent any structure
  • Generate fragments from known proteins
  • Cluster fragments to identify common structural
    motifs
  • Test library accuracy on proteins not in the
    initial set

49
Small Libraries of Protein Fragments
  • Dataset: 200 unique protein domains with most
    reliable distinct structures from SCOP
  • 36,397 residues
  • Divide each protein domain into consecutive
    fragments beginning at random initial position
  • Library: four sets of backbone fragments
  • 4, 5, 6, and 7-residue long fragments
  • Cluster the resulting small structures into k
    clusters using cRMS, and applying k-means
    clustering with simulated annealing
  • Cluster with k-means
  • Iteratively break & join clusters with simulated
    annealing to optimize total variance Σ(x − µ)²
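The k-means step and the total-variance objective can be sketched as follows. This is a simplified illustration, not the published procedure: fragments are stood in for by plain coordinate tuples compared with squared Euclidean distance (the real library compares fragments by cRMS after superposition), the initial centroids are just the first k points, and the simulated-annealing break & join refinement is omitted.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iters=100):
    """Plain Lloyd's k-means; returns (centroids, clusters)."""
    centroids = list(points[:k])  # deterministic start (assumption)
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def total_variance(centroids, clusters):
    """Sum of squared distances of every point to its cluster centroid."""
    return sum(sq_dist(p, mu) for mu, c in zip(centroids, clusters) for p in c)


# Hypothetical 2-D stand-ins for fragment descriptors.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centroids, clusters = k_means(points, k=2)
variance = total_variance(centroids, clusters)
```

Plain k-means only finds a local optimum of the variance, which is the motivation for refining the clusters further with a stochastic search.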

50
Evaluating the Quality of a Library
  • Test set of 145 highly reliable protein
    structures (Park & Levitt)
  • Protein structures broken into set of overlapping
    fragments of length f
  • Find for each protein fragment the most similar
    fragment in the library (cRMS)
  • Local Fit: average cRMS value over all fragments
    in all proteins in the test set
  • Global Fit: find best composition of structure
    out of overlapping fragments
  • Complexity is O(|Library|^N)
  • Greedy approach extends the C best structures so
    far from position 1 to N

51
Results