Protein Structure Prediction - PowerPoint PPT Presentation

1
Protein Structure Prediction
  • Jinbo Xu
  • j3xu_at_tti-c.org
  • Toyota Technological Institute at Chicago

2
Outline
  • Introduction to Protein Structures
  • Introduction to Protein Structure Prediction
  • Ab initio folding
  • Protein threading
  • Homology modeling
  • CASP structure prediction competition
  • My Work
  • RAPTOR protein threading
  • SVM approach to fold recognition
  • TreePack protein side-chain packing
  • Prediction examples

3
Biology in One Slide
Organism
Protein
4
Proteins
Proteins are the building blocks of life. In a
cell, 70% is water and 15-20% are proteins.
Examples:
  • hormones: regulate metabolism
  • structural: hair, wool, muscle
  • antibodies: immune response
  • enzymes: chemical reactions
5
Amino Acids
A protein is composed of a central backbone and a
collection of (typically) 50-2000 amino acids
(a.k.a. residues). There are 20 different kinds
of amino acids, each consisting of up to 18
atoms, e.g.,
Name            3-letter code   1-letter code
Leucine         Leu             L
Alanine         Ala             A
Serine          Ser             S
Glycine         Gly             G
Valine          Val             V
Glutamic acid   Glu             E
Threonine       Thr             T
6
Amino Acids
Side chain
Each amino acid is identified by its side chain,
which determines the properties of this amino
acid.
7
Side Chain Properties
  • Hydrophobic residues stay inside, while hydrophilic
    residues stay close to water
  • Oppositely charged amino acids can form salt
    bridges.
  • Polar amino acids can participate in hydrogen
    bonding

The amino acid names are colored according to
their type: positively charged, negatively
charged, polar but not charged, aliphatic
(nonpolar), and aromatic. Amino acids that are
essential to mammals are marked with an asterisk
(*).
8
Protein Folding
  • Proteins must fold to function
  • Some diseases are caused by misfolding
  • e.g., mad cow disease

9
Protein Structure
[Figure: polypeptide backbone drawn as a chain of repeating N-CH-C(=O) units joined by peptide bonds, running from the amino terminus (H3N) to the carboxyl terminus (COO-)]
10
Three Structure Levels
  • Primary structure: sequence of amino acids
  • e.g., DRVYIHPF
  • Secondary structure: local folding patterns
  • e.g., alpha-helix, beta-sheet, loop
  • Tertiary structure: complete 3D fold

[Figure: PDB ID 12as, with helix, beta sheet, and loop regions labeled]
11
Beta Sheet Examples
Anti-parallel beta sheet
Parallel beta sheet
12
Beta Sheet Examples (Contd)
13
Helix Examples
14
Domain, Fold, Motif
  • A protein chain can have several domains
  • A domain is a discrete portion of a protein that
    can fold independently and possess its own function
  • The overall shape of a domain is called a fold.
    There are only a few thousand possible folds.
  • Sequence motif: highly conserved protein
    subsequence
  • Structure motif: highly conserved substructure

15
Protein Data Bank
  • About 30,000 protein structures, solved using
    experimental techniques; 800 are unique
    structural folds

Same structural folds
Different structural folds
16
Protein Structure Determination
  • High-resolution structure determination
  • X-ray crystallography (1 Å)
  • Nuclear magnetic resonance (NMR) (1-2.5 Å)
  • Low-resolution structure determination
  • Cryo-EM (electron microscopy) (10-15 Å)

17
X-ray crystallography
  • Most accurate
  • An extremely pure protein sample is needed.
  • The protein sample must form crystals that are
    relatively large and without flaws. This is
    generally the biggest problem.
  • Many proteins aren't amenable to crystallization
    at all (e.g., proteins that do their work inside
    a cell membrane).
  • $100K per structure

18
Nuclear Magnetic Resonance
  • Fairly accurate
  • No need for crystals
  • Limited to small, soluble proteins.

19
Protein Classification
  • Family: homologous, same ancestor, high sequence
    identity, similar structures
  • Superfamily: distantly homologous, same ancestor,
    sequence identity around 25-30%, similar
    structures.
  • Fold: only shapes are similar, no homologous
    relationship, low sequence identity.
  • Protein classification databases: Pfam, SCOP,
    CATH, FSSP

20
Pfam
  • http://www.sanger.ac.uk/Software/Pfam/
  • Protein sequence classification database
  • As of December 2005, 8183 families
  • Most sequences are covered
  • Multiple sequence alignment for each family, then
    modeled by an HMM

21
SCOP
  • http://scop.mrc-lmb.cam.ac.uk/scop/
  • Protein structure classification database,
    manually curated
  • 70859 domains, 25973 PDB entries

22
The Problem
  • Protein functions are determined by 3D structures
  • 30,000 protein structures in PDB (Protein Data
    Bank)
  • Experimental determination of protein structures
    is time-consuming and expensive
  • Many protein sequences are available

protein structure
medicine
sequence
function
23
Outline
  • Introduction to Protein Structures
  • Introduction to Protein Structure Prediction
  • Ab initio folding
  • Protein threading
  • Homology modeling
  • CASP structure prediction competition
  • My Work
  • RAPTOR protein threading
  • SVM approach to fold recognition
  • TreePack protein side-chain packing
  • Prediction examples

24
Protein Structure Prediction
  • In theory, a protein structure can be solved
    computationally
  • A protein folds into a 3D structure that minimizes
    its free potential energy
  • The problem can be formulated as a search problem
    for minimum energy
  • the search space is enormous
  • the number of local minima increases exponentially

Computationally, it is an exceedingly difficult
problem
25
Who Cares?
  • Long history: more than 30 years
  • Listed as a grand challenge problem
  • IBM's Big Blue
  • Competitions: CASP (1992-2006)
  • Useful for:
  • Drug design
  • Function annotation
  • Rational protein engineering
  • Target selection

26
Observations
  • Sequences determine structures
  • Proteins fold into minimum energy state.
  • Structures are more conserved than sequences. Two
    proteins with 30% sequence identity likely share
    the same fold.

27
What determines structures?
  • Hydrogen bonds: essential in stabilizing the
    basic secondary structures
  • Hydrophobic effects: strongest determinants of
    protein structures
  • Van der Waals forces: stabilizing the hydrophobic
    cores
  • Electrostatic forces: oppositely charged side
    chains form salt bridges

28
Protein Structure Prediction
  • Stage 1: Backbone Prediction
  • Ab initio folding
  • Homology modeling
  • Protein threading
  • Stage 2: Loop Modeling
  • Stage 3: Side-Chain Packing
  • Stage 4: Structure Refinement

The picture is adapted from http://www.cs.ucdavis.edu/koehl/ProModel/fillgap.html
29
State of The Art
  • Ab initio folding (simulation-based method)
  • 1998: Duan and Kollman
  • 36 residues, 1000 ns, 256 processors, 2 months
  • Did not find the native structure
  • Template-based (or knowledge-based) methods
  • Homology modeling: sequence-sequence alignment,
    works if sequence identity > 25%
  • Protein threading: sequence-structure alignment,
    can go beyond the 25% limit

30
Ab Initio Folding
  • Based on first principles: build structures
    purely from protein sequences
  • An energy function to describe the protein:
  • bond energy
  • bond angle energy
  • dihedral angle energy
  • van der Waals energy
  • electrostatic energy
  • Calculate the structure by minimizing the
    energy function
  • Not practical in general
  • Computationally very expensive
  • Accuracy is poor
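
To make the idea of an energy function concrete, here is a minimal sketch that includes only two of the terms listed above (bond energy and van der Waals energy) for a toy chain of points. The function names, the harmonic bond form, the Lennard-Jones 12-6 form, and all constants are illustrative assumptions, not the force field used in the talk. The sketch only evaluates two fixed conformations; a real ab initio method would search over conformations.

```python
import math

def bond_energy(coords, rest_length=1.0, k_bond=100.0):
    """Harmonic penalty for consecutive atoms deviating from rest_length (assumed form)."""
    total = 0.0
    for p, q in zip(coords, coords[1:]):
        total += k_bond * (math.dist(p, q) - rest_length) ** 2
    return total

def van_der_waals_energy(coords, epsilon=1.0, sigma=1.0):
    """Lennard-Jones 12-6 term over non-bonded pairs (|i - j| >= 2) (assumed form)."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):
            r = math.dist(coords[i], coords[j])
            total += 4 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    return total

def total_energy(coords):
    """Sum of the two toy terms; a full force field would add angle, dihedral, and electrostatic terms."""
    return bond_energy(coords) + van_der_waals_energy(coords)

# Two toy conformations of a four-atom chain with ideal bond lengths.
straight = [(float(i), 0.0, 0.0) for i in range(4)]
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]

print(total_energy(straight), total_energy(bent))
```

In this toy setup the bent chain has lower energy because its non-bonded atoms sit closer to the attractive well of the Lennard-Jones term.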

31
Lattice Model
  • A simplified representation of proteins
  • Arrange atoms at grid points by Monte Carlo
    simulation
  • NMR restraints can speed up simulation

Figure taken from Jeff Skolnick et al.
32
Fragment Assembly
  • Also called mini-threading
  • David Baker's method
  • Construct a library of small structure fragments
    with 9 residues.
  • Correlate each small structure fragment with some
    small sequence segments
  • Cut a target sequence into many small segments.
    Each segment has some potential structural
    fragments.
  • Assemble these fragments by Monte Carlo
    simulation
  • Thousands of simulations. Simulated structures
    are clustered and ranked by energy.
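
The assembly step can be sketched as a Monte Carlo loop that repeatedly swaps a fragment into the current conformation and accepts or rejects the change with the Metropolis rule. Everything below is a toy under stated assumptions: conformations are lists of one torsion angle per residue, fragments have 3 residues rather than 9, the library is shared by all positions rather than position-specific, and `toy_energy` is a hypothetical stand-in for a real energy function.

```python
import math
import random

def toy_energy(angles):
    """Hypothetical energy: favors torsion angles near -60 degrees."""
    return sum((a + 60.0) ** 2 for a in angles) / len(angles)

def fragment_assembly(n_residues, library, frag_len=3, steps=2000,
                      temperature=50.0, seed=0):
    """Monte Carlo fragment insertion with Metropolis acceptance.

    Returns the lowest-energy conformation seen and its energy.
    """
    rng = random.Random(seed)
    angles = [180.0] * n_residues            # start from an extended chain
    energy = toy_energy(angles)
    best, best_energy = list(angles), energy
    for _ in range(steps):
        start = rng.randrange(n_residues - frag_len + 1)
        trial = list(angles)
        trial[start:start + frag_len] = rng.choice(library)   # insert a fragment
        trial_energy = toy_energy(trial)
        delta = trial_energy - energy
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            angles, energy = trial, trial_energy
            if energy < best_energy:
                best, best_energy = list(angles), energy
    return best, best_energy

# Hypothetical fragment library: each fragment is a list of frag_len torsion angles.
library = [[-60.0, -60.0, -60.0], [-120.0, -120.0, -120.0], [-60.0, -120.0, 60.0]]
best, best_energy = fragment_assembly(12, library)
print(best_energy)
```

In practice many independent runs with different seeds would be made, and the resulting structures clustered and ranked by energy, as the slide describes.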

Target sequence
33
Comparative Modeling
  • Homology modeling
  • identification of homologous proteins through
    sequence alignment
  • structure prediction through placing residues
    into corresponding positions of homologous
    structure models
  • Protein threading
  • make structure prediction through identification
    of good sequence-structure fit

34
PDB New Fold Growth
Old fold
New fold
  • Only a few thousand unique folds in nature
  • 90% of new structures deposited to PDB in the
    past three years have similar structural folds

35
Comparative Modeling
  • Find homologous proteins
  • Identify cores and loops: conserved segments are
    cores, otherwise loops
  • Core modeling: copy backbone coordinates from the
    homologous protein with known structure
  • Loop modeling
  • Side chain modeling
  • Refinement

36
Homology Modeling
Query Sequence
DRVYIHPFADRVYIHPFA
  • PSI-BLAST
  • HMM
  • Smith-Waterman algorithm

Protein sequence classification database
The Best Match
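
Of the search tools listed on this slide, the Smith-Waterman algorithm is the simplest to show in full. The sketch below computes only the best local alignment score, and it assumes a flat match/mismatch score and a linear gap penalty; real homology searches use a substitution matrix such as BLOSUM and affine gap penalties. The parameter values are illustrative.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)       # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

query = "DRVYIHPFADRVYIHPFA"
print(smith_waterman(query, "DRVYIHPF"))
```

Keeping only two rows of the table is enough when just the score is needed; recovering the alignment itself requires storing the full table or traceback pointers.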
37
Protein Threading
38
Threading Example
39
Comparative Modeling Procedures
  • Step 1: Construction of Template Library
  • Step 2: Design of Scoring Function
  • Step 3: Sequence-Sequence or Sequence-Structure
    Alignment
  • Step 4: Template Selection and Model Construction

40
Template Database
  • A representative set of protein structures from
    PDB.
  • Representative structures should have high
    resolution
  • X-ray structures better than NMR structures
  • Non-redundant
  • Some tools: PDB_SELECT, SCOP, FSSP

PDB: Protein Data Bank at http://www.rcsb.org/pdb/Welcome.do
41
Alignment Model
  • Sequence-sequence alignment
  • Sequence-profile alignment
  • Sequence-HMM model alignment
  • e.g. SAM-T02 (K. Karplus et al.)
  • Profile-sequence alignment
  • e.g. PDB-Blast (A. Godzik et al.)
  • Profile-profile alignment
  • e.g. PROSPECT-II (Y. Xu et al.)
  • Sequence-structure alignment (threading)

42
Fold Recognition and Model Building
  • Alignment algorithms can only generate alignments
  • The best alignments should be chosen
  • Z-score
  • Machine learning approaches like neural networks,
    SVM, boosting
  • Some tools to generate a 3D structure from the
    alignment
  • MODELLER (http://salilab.org/modeller/modeller.html)
  • MaxSprout (http://www.ebi.ac.uk/maxsprout/)
  • Jackal (http://honiglab.cpmc.columbia.edu/programs/jackal/intro.html)

43
CASP/CAFASP
  • CASP: Critical Assessment of Structure Prediction
  • CAFASP: Critical Assessment of Fully Automated
    Structure Prediction

CASP Predictor
CAFASP Predictor
  • Won't get tired
  • High-throughput
  • High-throughput

44
CASP/CAFASP (contd)
  • Public
  • Organized by structure community
  • Evaluated by an unbiased third party
  • Held every two years
  • Blind
  • Experimental structures to be determined by
    structure centers after competition
  • Drawback: < 100 targets
  • Blindness
  • Some centers are reluctant to release their
    structures

45
CAFASP/CASP (contd)
  • Time for each target
  • Individual servers: 48 hours
  • Meta servers: 96 hours
  • CASP5 predictors: 3 to 4 months
  • Resources for predictors
  • No X-ray, NMR machines (of course)
  • CAFASP3 predictors: no manual intervention
  • CASP5 predictors: anything (servers, Google, ...)
  • Evaluation
  • CASP5: assessed by experts and computer
  • CAFASP3: evaluated by a computer program.

46
Test Protein Category
  • Homology Modeling (HM) targets
  • Easy HM: has a homologous protein in PDB
  • Hard HM: has a distantly homologous protein in PDB
  • Also called Comparative Modeling (CM) targets
  • Fold Recognition (FR) targets
  • Has a similar fold in PDB
  • New Fold (NF) targets
  • No similar fold in PDB

47
Online Servers
http://www.bioinformatics.uwaterloo.ca/j3xu/raptor/index.php
http://robetta.bakerlab.org/index.html
http://www.sbg.bio.ic.ac.uk/phyre/
48
Outline
  • Introduction to Protein Structures
  • Introduction to Protein Structure Prediction
  • Ab initio folding
  • Protein threading
  • Homology modeling
  • CASP structure prediction competition
  • My Work
  • RAPTOR protein threading
  • SVM approach to fold recognition
  • TreePack protein side-chain packing
  • Prediction examples

49
Protein Structure Prediction
  • Stage 1: Backbone Prediction
  • Ab initio folding
  • Homology modeling
  • Protein threading
  • Stage 2: Loop Modeling
  • Stage 3: Side-Chain Packing
  • Stage 4: Structure Refinement

The picture is adapted from http://www.cs.ucdavis.edu/koehl/ProModel/fillgap.html
50
Scoring Function
  • E_s (fitness score): how well a residue fits a
    structural environment
  • E_p (pairwise potential): how preferable it is to
    put two particular residues nearby
  • E_m (mutation score): sequence similarity between
    query and template proteins
  • E_g (gap score): alignment gap penalty
  • E_ss (secondary structure score): how consistent
    the secondary structures are

$E = E_p + E_s + E_m + E_g + E_{ss}$
Minimize E to find a sequence-template alignment
51
Scoring Sequence Similarity
Position-specific scoring matrix for query positions 1-17 (log-odds scores):
Pos AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1 D   -3  -2  -1   5  -5   5   4  -3  -2  -5  -5  -1  -3  -5  -3  -2  -3  -5  -4  -4
  2 K   -2  -3   0  -3  -4   1   1  -4  -3  -3  -3   0   1  -4   3   0   5  -5  -4  -3
  3 A    5  -4  -4  -4  -3  -3   0  -3  -4   0  -3  -1  -3  -4  -2   0   2  -5  -4   2
  4 T    3  -4   1  -4  -3  -3  -3  -2  -4  -3  -3  -2   2  -4  -4   2   4  -5  -4   1
  5 I    4  -4  -4  -5  -3   0  -4  -2  -5   0   1   0  -2   0  -3  -1  -3  -5  -4   4
  6 P    4   0  -4  -4  -1  -3  -2  -3  -4   0  -4  -3  -3  -3   2   0   2  -5  -4   1
  7 S    2  -4  -3  -3  -4  -1   2   3  -1  -4  -2  -2  -2   0  -4   2  -3  -5  -4   1
  8 E    3  -4  -2  -4  -4  -2   3   0  -1   0  -1   0  -2   2   0  -2  -2   0  -3   0
  9 S    2   0  -2  -4  -4  -3   1   1  -1  -3   0  -2   2  -3   1   0   2  -5  -4  -1
 10 P    1  -4  -2  -2  -4  -3   0   0  -1   0   1  -2  -3  -4   4   1   2  -5  -4   1
 11 F    3  -4  -4  -3  -4  -2  -1  -3  -5  -3  -1  -2  -2   0   2   1   2   2  -4   2
 12 A    4  -2  -4  -4  -4   0  -1   3  -4  -3  -1  -1  -2  -3  -3   1  -1  -5  -2  -2
 13 A    3   0   2  -1  -4   2   0  -1  -4  -3  -4  -3  -2   0   2   2  -2  -5  -4   0
 14 A    4  -3  -3   0  -4   1  -3   0  -4  -4  -2   0  -2  -5   2   0   2  -5  -5  -2
 15 E    2  -2   1   0  -4  -2   3  -1  -4  -3  -3  -1   3  -5   3   0  -1  -5  -5  -1
 16 V    3  -4  -1   2  -4  -1  -1  -2  -4  -1  -2  -1  -3  -5  -4   0   3  -6  -5   1
 17 A    2   1  -2  -2   0   2  -1  -2   0   1  -1  -1  -1  -1   1   0  -1  -5  -2   2

Observed percentages per position, followed by the information per position (Info) and the relative weight of observed matches to pseudocounts (Weight):
Pos AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V  Info Weight
  1 D    0   0   0  33   0  30  36   0   0   0   0   0   0   0   0   0   0   0   0   0  1.09   0.67
  2 K    0   0   3   0   0   7   7   0   0   0   0   6   4   0  15   4  53   0   0   0  0.81   0.87
  3 A   50   0   0   0   0   0   8   0   0   5   0   4   0   0   2   3  12   0   0  17  0.67   1.07
  4 T   29   0   6   0   0   0   0   1   0   0   0   1   5   0   0  13  32   0   0  12  0.68   1.17
  5 I   34   0   0   0   0   5   0   2   0   3  13   5   0   4   1   3   0   0   0  31  0.56   1.19
  6 P   46   7   0   0   1   0   1   0   0   5   0   0   0   1  11   6  14   0   0   8  0.67   1.28
  7 S   18   0   0   0   0   2  15  25   1   0   4   1   1   4   0  17   0   0   0  12  0.47   1.33
  8 E   29   0   1   0   0   1  19   6   2   4   7   6   1   9   4   1   1   1   0   8  0.34   1.34
  9 S   21   4   1   0   0   0   8  13   1   0   9   2   5   1   6   7  15   0   0   4  0.29   1.37
 10 P   11   0   2   1   0   0   5   7   1   5  13   2   0   0  20  12  12   0   0   9  0.36   1.35
 11 F   25   0   0   1   0   1   3   2   0   0   6   1   1   5   9  11  15   3   0  16  0.42   1.41
 12 A   37   2   0   0   0   4   4  24   0   1   6   3   1   1   1  11   2   0   2   2  0.57   1.42
 13 A   22   5  10   2   0   9   6   4   0   1   1   0   1   4   9  15   2   0   0   9  0.32   1.40
 14 A   43   1   0   5   0   6   0   6   0   0   3   5   1   0   9   7  12   0   0   3  0.58   1.41
 15 E   21   1   6   4   0   1  19   4   0   1   2   4   8   0  16   5   3   0   0   5  0.43   1.42
 16 V   31   0   3  11   0   2   4   2   0   4   4   4   0   0   0   4  21   0   0  11  0.48   1.46
 17 A   17   9   1   2   2  10   3   2   2   9   4   3   1   2   6   8   3   0   1  13  0.18   1.40

Sequence similarity: profile similarity
52
Scoring: Fitness Score
The score compares three quantities:
  • probability of amino acid a occurring with
    solvent accessibility s
  • probability of amino acid a occurring
  • probability of solvent accessibility s occurring
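
The slide's formula did not survive in the transcript. One common way to combine the three probabilities above is a log-odds ratio, and the sketch below assumes that form: score(a, s) = -log(P(a, s) / (P(a) P(s))), with the sign chosen so that lower is better, matching "Minimize E". The function name, the environment labels, and the counts are hypothetical.

```python
import math
from collections import Counter

def fitness_scores(observations):
    """Log-odds fitness score for each (amino acid, environment) pair.

    observations: list of (amino_acid, environment) tuples counted in known structures.
    Returns -log(P(a, s) / (P(a) * P(s))); a lower value means a better fit.
    """
    n = len(observations)
    joint = Counter(observations)
    aa = Counter(a for a, _ in observations)
    env = Counter(s for _, s in observations)
    return {
        (a, s): -math.log((count / n) / ((aa[a] / n) * (env[s] / n)))
        for (a, s), count in joint.items()
    }

# Hypothetical counts: leucine is mostly buried, serine mostly exposed.
obs = ([("L", "buried")] * 8 + [("L", "exposed")] * 2
       + [("S", "buried")] * 2 + [("S", "exposed")] * 8)
scores = fitness_scores(obs)
print(scores[("L", "buried")], scores[("L", "exposed")])
```

With these counts, leucine in a buried environment gets a negative (favorable) score and leucine in an exposed environment a positive one.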
53
Scoring: Pairwise Potential
The score compares three quantities:
  • probability of a and b occurring with distance <
    cutoff
  • probability of amino acid a occurring
  • probability of amino acid b occurring
54
Scoring: Secondary Structure
  • 1. Difference between predicted secondary structure
    and template secondary structure
  • 2. PSIPRED for secondary structure prediction

55
Contact Graph
  • Each residue as a vertex
  • One edge between two residues if their spatial
    distance is within given cutoff.
  • Cores are the most conserved segments in the
    template
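
A contact graph of this kind can be built directly from residue coordinates. The sketch below assumes one representative 3D point per residue (for example the C-alpha atom) and a 7.0 angstrom cutoff; both are illustrative choices, since the slide does not state which atoms or cutoff are used.

```python
import math
from itertools import combinations

def contact_graph(coords, cutoff=7.0):
    """Return the set of edges (i, j), i < j, for residue pairs within `cutoff`."""
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(coords), 2)
        if math.dist(p, q) <= cutoff
    }

# Four hypothetical residue positions (angstroms).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 5.0, 0.0)]
edges = contact_graph(coords)
print(sorted(edges))
```

The all-pairs loop is quadratic in the number of residues, which is fine for a single template; a spatial grid would be the usual choice for larger inputs.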

template
56
Simplified Contact Graph
57
Alignment Example
58
Alignment Example
59
Calculation of Alignment Score
60
Hardness of Protein Threading
  • Protein threading is NP-hard
  • Reduction from Max-Cut
  • Given a graph, number its nodes in a certain order.
    Assume there are M nodes.
  • Consider a sequence of length 2M of the form
    PHPH...PH
  • The pairwise score is 1 only if two different types
    of residues are mapped to the two ends of one graph
    edge, otherwise 0

61
Threading Algorithms
  • Approximation Algorithm
  • Interaction-frozen algorithm (A. Godzik et al.)
  • Monte Carlo sampling (S.H. Bryant et al.)
  • Double dynamic programming (D. Jones et al.)
  • Exact Algorithm
  • Branch-and-bound (R.H. Lathrop and T.F. Smith)
  • PROSPECT-I uses divide-and-conquer (Y. Xu et al.)
  • Linear programming by RAPTOR (J. Xu et al.)

62
Outline
  • Introduction to Protein Structures
  • Introduction to Protein Structure Prediction
  • Ab initio folding
  • Protein threading
  • Homology modeling
  • CASP structure prediction competition
  • My Work
  • RAPTOR protein threading
  • SVM approach to fold recognition
  • TreePack protein side-chain packing
  • Prediction examples

63
Template Ranking
  • Alignment score
  • residue composition bias
  • template length bias
  • Z-score (S.H. Bryant et al., 1995)
  • statistical test, cancel out bias
  • time-consuming to calculate, sequences must be
    shuffled and threaded many times (100 times)
  • Classification-based methods
  • A threading pair is positive if they have similar
    structures
  • Noise caused by bad alignments
  • Regression-based method
  • Predict alignment accuracy
  • Rank templates based on predicted accuracy
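
The Z-score procedure described above (shuffle the sequence, re-thread it, and compare the native score with the shuffled distribution) can be sketched as follows. `toy_score` is a hypothetical stand-in for a real threading score and counts identical positions, so here a higher score is better; with an energy-like score the sign of the Z-score would flip.

```python
import random
import statistics

def z_score(score_fn, sequence, template, n_shuffles=100, seed=0):
    """Z-score of the native score against scores of shuffled sequences."""
    rng = random.Random(seed)
    native = score_fn(sequence, template)
    shuffled_scores = []
    for _ in range(n_shuffles):
        residues = list(sequence)
        rng.shuffle(residues)          # keeps composition, destroys order
        shuffled_scores.append(score_fn("".join(residues), template))
    mean = statistics.mean(shuffled_scores)
    stdev = statistics.stdev(shuffled_scores)
    return (native - mean) / stdev

def toy_score(sequence, template):
    """Hypothetical stand-in for threading: number of identical positions."""
    return sum(a == b for a, b in zip(sequence, template))

z = z_score(toy_score, "DRVYIHPFADRVYIHPFA", "DRVYIHPFADRVYIHPFA")
print(z)
```

Because shuffling preserves residue composition and length, the comparison cancels out the composition and length biases listed on the slide, at the cost of re-scoring the sequence `n_shuffles` times.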

64
Feature Extraction
  • topology of a predicted structure
  • sizes
  • pairwise contacts
  • independent domain?
  • sequence-template alignment
  • alignment scores
  • sequence identity
  • gaps

features
65
CAFASP3 result
Servers with names in italics are meta servers
(http://www.cs.bgu.ac.il/dfischer/CAFASP3,
released in December 2002).
66
Protein Structure Prediction
  • Stage 1: Backbone Prediction
  • Ab initio folding
  • Homology modeling
  • Protein threading
  • Stage 2: Loop Modeling
  • Stage 3: Side-Chain Packing
  • Stage 4: Structure Refinement

The picture is adapted from http://www.cs.ucdavis.edu/koehl/ProModel/fillgap.html
67
Protein Side-Chain Packing
  • Problem: given the backbone coordinates of a
    protein, predict the coordinates of the
    side-chain atoms
  • Insight: a protein structure is a geometric
    object with special features
  • Method: decompose a protein structure into some
    very small blocks

68
Side-Chain Packing
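[Figure: candidate side-chain positions labeled with the values 0.3, 0.2, 0.3, 0.7, 0.1, 0.4, 0.1, 0.1, 0.6; one pair is marked as a clash]
Each residue has many possible side-chain
positions. Each possible position is called a
rotamer. Atomic clashes must be avoided.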
69
Energy Function
Assume rotamer A(i) is assigned to residue i. The
side-chain packing quality is measured by an energy
function with two kinds of terms:
  • occurring preference: the higher the occurring
    probability, the smaller the value
  • clash penalty: depends on the distance between
    two atoms and their atom radii
[Plot: clash penalty; labeled values 10, 0.82, and 1]
Minimize the energy function to obtain the best
side-chain packing.
70
Related Work
  • NP-hard [Akutsu, 1997; Pierce et al., 2002] and
    NP-complete to achieve an approximation ratio
    O(N) [Chazelle et al., 2004]
  • Dead-End Elimination: eliminate rotamers
    one by one
  • SCWRL: biconnected decomposition of a protein
    structure [Dunbrack et al., 2003]
  • One of the most popular side-chain packing
    programs
  • Linear integer programming [Althaus et al., 2000;
    Eriksson et al., 2001; Kingsford et al., 2004]
  • The formulation is similar to that used in RAPTOR

71
Residue Interaction Graph
  • Each residue as a vertex
  • Two residues interact if there is a potential
    clash between their rotamer atoms
  • Add one edge between two residues that interact.

[Figure: residue interaction graph with vertices a, b, c, d, e, f, h, i, j, k, l, m, s]
72
Key Observations
  • A residue interaction graph is a geometric
    neighborhood graph
  • Each rotamer is bound to its backbone position
    within a constant distance
  • There is no interaction edge between two residues
    if their distance is beyond D. D is a constant
    depending on rotamer diameter.
  • Residue interaction graphs are sparse!
  • Any two residue centers cannot be too close.
    Their distance is at least a constant C.

No previous algorithms exploit these features!
73
Tree Decomposition [Robertson & Seymour, 1986]
Greedy minimum degree heuristic
  • Choose the vertex with minimal degree
  • The chosen vertex and its neighbors form a
    component
  • Add an edge between every pair of neighbors of
    the chosen vertex
  • Remove the chosen vertex
  • Repeat the above steps until the graph is empty
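
The greedy steps above can be sketched in a few lines. The graph representation (a dict of neighbor sets), the function name, and the alphabetical tie-breaking are assumptions made for illustration. The sketch returns the components and the resulting width only; linking the components into a tree, which a full tree decomposition also requires, is omitted.

```python
def greedy_min_degree(graph):
    """Greedy minimum-degree heuristic.

    graph: dict mapping each vertex to the set of its neighbors (undirected).
    Returns (components, width): one component per eliminated vertex, and
    the width, i.e. the largest component size minus 1.
    """
    adj = {v: set(nbrs) for v, nbrs in graph.items()}    # work on a copy
    components = []
    while adj:
        # Choose the vertex with minimal degree (ties broken by name).
        v = min(adj, key=lambda u: (len(adj[u]), u))
        nbrs = adj.pop(v)                                # remove the chosen vertex
        components.append({v} | nbrs)
        for a in nbrs:
            adj[a].discard(v)
            adj[a] |= nbrs - {a}                         # connect all neighbors pairwise
    width = max(len(c) for c in components) - 1
    return components, width

# A 4-cycle a-b-c-d-a, whose tree width is 2.
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
components, width = greedy_min_degree(cycle)
print(components, width)
```

The heuristic gives an upper bound on the tree width; it is not guaranteed to be optimal on general graphs.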

74
Tree Decomposition (Contd)
Tree Decomposition
Tree width is the maximal component size minus 1.
75
Side-Chain Packing Algorithm
  • 1. Bottom-to-top: calculate the minimal energy
    function
  • 2. Top-to-bottom: extract the optimal assignment
  • 3. Time complexity: exponential in tree width,
    linear in graph size
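
The two passes are easiest to see in the special case where the residue interaction graph is itself a tree (tree width 1), so every component is a single edge. The sketch below covers only that special case, not the general tree-decomposition algorithm; the data layout, the function names, and the `clash` penalty are hypothetical.

```python
def pack_on_tree(children, root, singleton, pairwise):
    """Minimum-energy rotamer assignment when the interaction graph is a tree.

    children: dict vertex -> list of child vertices
    singleton[v][r]: energy of rotamer r at residue v
    pairwise(u, ru, v, rv): interaction energy of two rotamer choices
    """
    best = {}     # best[v][r]: minimal energy of v's subtree given rotamer r at v
    choice = {}   # choice[(v, r, c)]: best rotamer of child c given rotamer r at v

    def bottom_up(v):
        for c in children.get(v, []):
            bottom_up(c)
        best[v] = []
        for r, e in enumerate(singleton[v]):
            for c in children.get(v, []):
                rc = min(range(len(singleton[c])),
                         key=lambda k: best[c][k] + pairwise(v, r, c, k))
                choice[(v, r, c)] = rc
                e += best[c][rc] + pairwise(v, r, c, rc)
            best[v].append(e)

    def top_down(v, r, assignment):
        assignment[v] = r
        for c in children.get(v, []):
            top_down(c, choice[(v, r, c)], assignment)
        return assignment

    bottom_up(root)                                    # pass 1: minimal energies
    r_root = min(range(len(best[root])), key=best[root].__getitem__)
    return best[root][r_root], top_down(root, r_root, {})   # pass 2: assignment

def clash(u, ru, v, rv):
    """Hypothetical penalty: rotamer 0 of one residue clashes with rotamer 0 of another."""
    return 10.0 if ru == 0 and rv == 0 else 0.0

children = {"a": ["b", "c"]}
singleton = {"a": [0.0, 1.0], "b": [0.0, 0.5], "c": [0.2, 0.0]}
energy, assignment = pack_on_tree(children, "a", singleton, clash)
print(energy, assignment)
```

Each residue is visited once per pass and each edge costs one scan over rotamer pairs, which mirrors the "linear in graph size" claim; in the general case the per-component cost grows exponentially with the component size.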

A tree decomposition rooted at Xr
The score of component Xi
The scores of subtree rooted at Xl
The score of subtree rooted at Xi
The scores of subtree rooted at Xj
76
Theoretical Treewidth Bounds
  • For a general graph, it is NP-hard to determine
    its optimal treewidth.
  • A residue interaction graph has treewidth
    $O(N^{2/3} \log N)$
  • Such a decomposition can be found by a low-degree
    polynomial-time algorithm based on the Sphere
    Separator Theorem [G.L. Miller et al., 1997], a
    generalization of the Planar Separator Theorem
  • The treewidth has a lower bound of
    $\Omega(N^{2/3})$
  • Example: the residue interaction graph is a cube
  • Each residue is a grid point

77
Result (1)
Method: tree decomposition of the protein
structure to take advantage of its geometric
characteristics.
[Plot: CPU time (seconds)]
  • Five times faster on average, tested on 180
    proteins used by SCWRL
  • Same prediction accuracy as SCWRL 3.0

Theoretical time complexity is exponential in the
tree width, which is much smaller (<<) than the
number of residues; the base of the exponent is
the average number of rotamers for each residue.
78
Result (2)
An optimization problem admits a PTAS if, given an
error e (0 < e < 1), there is a polynomial-time
algorithm that obtains a solution within a factor
of (1 + e) of the optimum.
  • Side-chain packing has a PTAS if one of the
    following conditions is satisfied
  • All the energy items are non-positive
  • All the pairwise energy items have the same sign,
    and the lowest system energy is away from 0 by a
    certain amount

Chazelle et al. have proved that it is
NP-complete to approximate this problem within a
factor of O(N) without considering the geometric
characteristics of a protein structure.
79
Outline
  • Introduction to Protein Structures
  • Introduction to Protein Structure Prediction
  • Ab initio folding
  • Protein threading
  • Homology modeling
  • CASP structure prediction competition
  • My Work (shameless promotion)
  • RAPTOR protein threading
  • SVM approach to fold recognition
  • TreePack protein side-chain packing
  • Prediction examples

80
CASP6 Example
T0267 (PDB ID 1wk4)
T0268_1 (PDB ID 1wg8)
MaxSub score: 0.9
MaxSub score: 0.6
Sequence identity: 19%
Sequence identity: 50%
81
CASP6 Example
T0224 (PDB ID 1rhx or 1x9a)
T0228_1 (PDB ID 1vlp)
MaxSub score: 0.5
MaxSub score: 0.3
Sequence identity: 8%
Sequence identity: 15%
82
CASP6 Example
T0238 (PDB ID 1w33)
T0242 (PDB ID 2blk)
MaxSub score: 0.2
MaxSub score: 0.17
Sequence identity: 9%
Sequence identity: 10%
83
Summary
  • Protein Structures
  • Primary structure, secondary structure, tertiary
    structure
  • Protein classification
  • Experimental structure determination
  • Protein Structure Prediction
  • Ab initio folding
  • Protein threading
  • Homology modeling
  • CASP structure prediction competition
  • My Work
  • RAPTOR protein threading
  • SVM approach to fold recognition
  • TreePack protein side-chain packing

84
Reading List
  • CASP1, CASP2, CASP3, CASP4, CASP5 and CASP6
    Special Issues, Proteins: Structure, Function and
    Genetics, 1995, 1997, 1999, 2001, 2003, 2005
  • Jinbo Xu. Rapid Protein Side-Chain Packing via
    Tree Decomposition. RECOMB 2005.
  • Email to j3xu_at_tti-c.org

85
Acknowledgements
  • Bonnie Berger, Introduction to Computational
    Molecular Biology, course notes, 2001
  • Bin Ma, Bioinformatics, course notes, 2004

86
Growth of Protein Sequences and Structures
Data from http://www.dna.affrc.go.jp
87
Three Structure Levels
alpha-helix
  • Primary structure: sequence of amino acids
  • e.g., DRVYIHPF
  • Secondary structure: local folding patterns
  • e.g., alpha-helix, beta-sheet, loop
  • Tertiary structure: complete 3D fold

beta-sheet
loop
88
Linear Integer Program
  • Linear programs can be solved in polynomial
    time
  • No polynomial-time algorithm is known for integer
    programs
  • Relax to a linear program and solve the linear
    version
  • Branch-and-bound or branch-and-cut (may cost
    exponential time)

89
Linear Integer Program
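maximize
$z = 6x + 5y$  (linear function)
subject to
$3x + y \le 11$, $-x + 2y \le 5$, $x, y \ge 0$  (linear constraints)
$x, y$ integer  (integral constraints, nonlinear)
Linear program: the objective with the linear constraints only.
Integer program: the linear program plus the integral constraints.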
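
The gap between the linear program and the integer program can be checked on this small example. The sketch assumes the example reads: maximize z = 6x + 5y subject to 3x + y <= 11, -x + 2y <= 5, x, y >= 0 (the operators are reconstructed from the garbled transcript). It enumerates integer points by brute force and evaluates the LP relaxation at the vertices of the feasible polygon, which were worked out by hand; neither step is how a real solver works.

```python
from fractions import Fraction

def feasible(x, y):
    """Linear constraints of the example (reconstructed, see the assumptions above)."""
    return 3 * x + y <= 11 and -x + 2 * y <= 5 and x >= 0 and y >= 0

def objective(x, y):
    return 6 * x + 5 * y

# Integer program: enumerate a small grid that covers the feasible region.
integer_best = max(
    ((x, y) for x in range(12) for y in range(12) if feasible(x, y)),
    key=lambda p: objective(*p),
)

# LP relaxation: the optimum lies at a vertex of the feasible polygon.
# The last vertex solves 3x + y = 11 and -x + 2y = 5 exactly.
vertices = [
    (Fraction(0), Fraction(0)),
    (Fraction(11, 3), Fraction(0)),
    (Fraction(0), Fraction(5, 2)),
    (Fraction(17, 7), Fraction(26, 7)),
]
lp_best = max(vertices, key=lambda p: objective(*p))

print(integer_best, objective(*integer_best))
print(lp_best, objective(*lp_best))
```

The relaxed optimum is fractional, so rounding is not enough in general; this is why branch-and-bound or branch-and-cut is needed when the relaxation does not return an integral solution.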
90
Variables
  • x(i,l) denotes core i is aligned to sequence
    position l
  • y(i,l,j,k) denotes that core i is aligned to
    position l and core j is aligned to position k at
    the same time.

91
LP Formulation
a: singleton score parameter; b: pairwise score
parameter
Each y variable is 1 if and only if its two x
variables are 1
Each core has only one alignment position
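
The formulas for this slide did not survive in the transcript. One standard way to write an integer program that satisfies the two stated conditions is sketched below, using the x and y variables and the a and b parameters defined above; the index notation and the exact constraint set are assumptions, not necessarily RAPTOR's own formulation.

```latex
\begin{aligned}
\min \quad & \sum_{i}\sum_{l} a_{i,l}\, x_{i,l}
  \;+\; \sum_{(i,j)}\sum_{l,k} b_{(i,l),(j,k)}\, y_{(i,l),(j,k)} \\
\text{s.t.} \quad
& \sum_{l} x_{i,l} = 1
  && \text{for every core } i \\
& \sum_{k} y_{(i,l),(j,k)} = x_{i,l}
  && \text{for every interacting pair } (i,j) \text{ and position } l \\
& \sum_{l} y_{(i,l),(j,k)} = x_{j,k}
  && \text{for every interacting pair } (i,j) \text{ and position } k \\
& x_{i,l},\; y_{(i,l),(j,k)} \in \{0, 1\}
\end{aligned}
```

The first constraint gives each core exactly one alignment position. With binary variables, the two summation constraints force a y variable to be 1 exactly when both of its x variables are 1, without multiplying variables, which keeps the program linear.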
92
Protein Threading
  • Make a structure prediction through finding an
    optimal alignment (placement) of a protein
    sequence onto each known structure (structural
    template)
  • alignment quality is measured by some
    statistics-based scoring function
  • best overall alignment among all templates may
    give a structure prediction

93
Threading Model
  • Each template is parsed as a chain of cores. Two
    adjacent cores are connected by a loop. Cores are
    the most conserved segments in a protein.
  • No gap allowed within a core.
  • Only pairwise contacts between two core residues
    are considered, because contacts involving loop
    residues are not well conserved.
  • Global alignment employed

94
CASP5/CAFASP3 targets
Hard
Easy
Prediction Difficulty
CM: Comparative Modelling, HM: Homology
Modelling, FR: Fold Recognition, NF: New Fold
95
RAPTOR's sensitivity on FR targets
1. RAPTOR is weak at recognizing FR(A) targets
   (needs improvement)
2. RAPTOR could not deal with NF targets at all
   (normal)
96
Support Vector Machine (SVM) Regression (A.J.
Smola et al)
Notation
Data
Linear regression
If the relationship between f and x is not linear:
SVM regression is linear regression in a
high-dimensional space
(1)
Condition
97
Side Chain Properties
  • Hydrophobic residues stay inside, while hydrophilic
    residues stay close to water
  • Oppositely charged amino acids can form salt
    bridges.
  • Polar amino acids can participate in hydrogen
    bonding

The amino acid names are colored according to
their type: positively charged, negatively
charged, polar but not charged, aliphatic
(nonpolar), and aromatic. Amino acids that are
essential to mammals are marked with an asterisk
(*).