Protein Structure Prediction

1
Protein Structure Prediction

Jinbo Xu
j3xu_at_tti-c.org
Toyota Technological Institute at Chicago

2
Outline

Introduction to Protein Structures
Introduction to Protein Structure Prediction
Ab initio folding
Protein threading
Homology modeling
CASP structure prediction competition
My Work
RAPTOR protein threading
SVM approach to fold recognition
TreePack protein side-chain packing
Prediction examples

3
Biology in One Slide
Organism
Protein
4
Proteins
Proteins are the building blocks of life. In a
cell, 70% is water and 15-20% is protein.
Examples:
hormones: regulate metabolism
structural proteins: hair, wool, muscle
antibodies: immune response
enzymes: chemical reactions
5
Amino Acids
A protein is composed of a central backbone and a
collection of (typically) 50-2000 amino acids
(a.k.a. residues). There are 20 different kinds
of amino acids, each consisting of up to 18
atoms, e.g.,
Name            3-letter code   1-letter code
Leucine         Leu             L
Alanine         Ala             A
Serine          Ser             S
Glycine         Gly             G
Valine          Val             V
Glutamic acid   Glu             E
Threonine       Thr             T
6
Amino Acids
Side chain
Each amino acid is identified by its side chain,
which determines the properties of this amino
acid.
7
Side Chain Properties

Hydrophobic amino acids stay inside, while hydrophilic
ones stay close to water
Oppositely charged amino acids can form salt
bridges.
Polar amino acids can participate in hydrogen bonding

The amino acid names are colored according to
their type: positively charged, negatively
charged, polar but not charged, aliphatic
(nonpolar), and aromatic. Amino acids that are
essential to mammals are marked with an asterisk
(*).
8
Protein Folding

Proteins must fold to function
Some diseases are caused by misfolding
e.g., mad cow disease

9
Protein Structure
[Figure: polypeptide backbone drawn from the N-terminus (H3N) to the C-terminus (COO-), with repeating N-CH-C units and their attached O and H atoms]
10
Three Structure Levels

Primary structure: sequence of amino acids
e.g., DRVYIHPF
Secondary structure: local folding patterns
e.g., alpha-helix, beta-sheet, loop
Tertiary structure: complete 3D fold

[Figure: structure of PDB ID 12as with a helix, a beta sheet and a loop labeled]
11
Beta Sheet Examples
Anti-parallel beta sheet
Parallel beta sheet
12
Beta Sheet Examples (Contd)
13
Helix Examples
14
Domain, Fold, Motif

A protein chain can have several domains
A domain is a discrete portion of a protein that can
fold independently and possess its own function
The overall shape of a domain is called a fold.
There are only a few thousand possible folds.
Sequence motif: highly conserved protein
subsequence
Structure motif: highly conserved substructure

15
Protein Data Bank

About 30,000 protein structures, solved using
experimental techniques; 800 are unique
structural folds

Same structural folds
Different structural folds
16
Protein Structure Determination

High-resolution structure determination
X-ray crystallography (1Å)
Nuclear magnetic resonance (NMR) (1-2.5Å)
Low-resolution structure determination
Cryo-EM (electron microscopy) (10-15Å)

17
X-ray crystallography

Most accurate
An extremely pure protein sample is needed.
The protein sample must form crystals that are
relatively large and without flaws. This is generally the
biggest problem.
Many proteins aren't amenable to crystallization
at all (e.g., proteins that do their work inside
of a cell membrane).
$100K per structure

18
Nuclear Magnetic Resonance

Fairly accurate
No need for crystals
limited to small, soluble proteins only.

19
Protein Classification

Family: homologous, same ancestor, high sequence
identity, similar structures
Superfamily: distantly homologous, same ancestor,
sequence identity around 25-30%, similar
structures.
Fold: only shapes are similar, no homologous
relationship, low sequence identity.
Protein classification databases: Pfam, SCOP,
CATH, FSSP

20
Pfam

http://www.sanger.ac.uk/Software/Pfam/
Protein sequence classification database
As of December 2005, 8183 families
Most sequences are covered
Multiple sequence alignment for each family, then
modeled by an HMM

21
SCOP

http://scop.mrc-lmb.cam.ac.uk/scop/
Protein structure classification database,
manually curated
70859 domains, 25973 PDB entries

22
The Problem

Protein functions are determined by 3D structures
30,000 protein structures in PDB (Protein Data
Bank)
Experimental determination of protein structures is
time-consuming and expensive
Many protein sequences are available

protein structure
medicine
sequence
function
24
Protein Structure Prediction

In theory, a protein structure can be solved
computationally
A protein folds into a 3D structure that minimizes
its free potential energy
The problem can be formulated as a search problem
for minimum energy
the search space is enormous
the number of local minima increases exponentially

Computationally it is an exceedingly difficult
problem
25
Who Cares?

Long history: more than 30 years
Listed as a grand challenge problem
IBM's Big Blue
Competitions: CASP (1992-2006)
Useful for:
Drug design
Function annotation
Rational protein engineering
Target selection

26
Observations

Sequences determine structures
Proteins fold into minimum energy state.
Structures are more conserved than sequences. Two
proteins with 30% sequence identity likely share the same
fold.

27
What determines structures?

Hydrogen bonds: essential in stabilizing the
basic secondary structures
Hydrophobic effects: strongest determinants of
protein structures
Van der Waals forces: stabilizing the hydrophobic
cores
Electrostatic forces: oppositely charged side
chains form salt bridges

28
Protein Structure Prediction

Stage 1: Backbone Prediction
Ab initio folding
Homology modeling
Protein threading
Stage 2: Loop Modeling
Stage 3: Side-Chain Packing
Stage 4: Structure Refinement

The picture is adapted from
http://www.cs.ucdavis.edu/koehl/ProModel/fillgap.html
29
State of The Art

Ab initio folding (simulation-based method)
1998, Duan and Kollman:
36 residues, 1000 ns, 256 processors, 2 months
Did not find the native structure
Template-based (or knowledge-based) methods
Homology modeling: sequence-sequence alignment,
works if sequence identity > 25%
Protein threading: sequence-structure alignment,
can go beyond the 25% limit

30
Ab Initio Folding

Based on first principles: build structures
purely from protein sequences
An energy function describes the protein
bond energy
bond angle energy
dihedral angle energy
van der Waals energy
electrostatic energy
The structure is calculated by minimizing the
energy function
Not practical in general
Computationally very expensive
Accuracy is poor
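
The energy terms listed on this slide can be combined into a single function of the atomic coordinates. The sketch below is a toy illustration, not a real force field: the functional forms (harmonic bonds, Lennard-Jones, Coulomb) are common textbook choices, every constant and name (`toy_energy`, `k_bond`, `r0`, `epsilon`, `sigma`, `coulomb_k`) is an assumption made for the example, and the bond-angle and dihedral terms are left out for brevity.

```python
import math


def toy_energy(coords, charges, bonds, k_bond=100.0, r0=1.5,
               epsilon=0.2, sigma=3.4, coulomb_k=332.0):
    """Sum of harmonic bond, Lennard-Jones and Coulomb terms (toy constants)."""
    def dist(i, j):
        return math.dist(coords[i], coords[j])

    bonded = {frozenset(b) for b in bonds}
    energy = 0.0
    # Bond energy: harmonic penalty for deviating from the ideal length r0.
    for i, j in bonds:
        energy += k_bond * (dist(i, j) - r0) ** 2
    # Non-bonded terms over all atom pairs that are not directly bonded.
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            if frozenset((i, j)) in bonded:
                continue
            r = dist(i, j)
            sr6 = (sigma / r) ** 6
            energy += 4.0 * epsilon * (sr6 * sr6 - sr6)        # van der Waals
            energy += coulomb_k * charges[i] * charges[j] / r  # electrostatics
    return energy
```

Ab initio folding then amounts to searching for coordinates that minimize such a function. The double loop over atom pairs gives a sense of why each evaluation is costly for a full protein.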

31
Lattice Model

A simplified representation of proteins
Arrange atoms at grid points by Monte Carlo
simulation
NMR restraints can speed up simulation
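
As a rough illustration of lattice Monte Carlo, the sketch below uses the 2D HP model (H = hydrophobic, P = polar), which is a much simpler stand-in for the lattice models referred to on this slide. The energy rule (-1 per non-bonded H-H contact), the end-move set, and all names (`hp_energy`, `metropolis_accept`, `monte_carlo`) are assumptions chosen for the example.

```python
import math
import random


def hp_energy(sequence, path):
    """Energy of a 2D lattice conformation: -1 per non-bonded H-H contact."""
    position = {p: i for i, p in enumerate(path)}
    energy = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        for neighbour in ((x + 1, y), (x, y + 1)):  # counts each contact once
            j = position.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy


def metropolis_accept(delta_e, temperature, rng=random):
    """Metropolis criterion: always accept downhill, sometimes accept uphill."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def monte_carlo(sequence, path, steps=1000, temperature=1.0, seed=0):
    """Random end moves on a self-avoiding chain; returns (path, energy)."""
    rng = random.Random(seed)
    path = list(path)
    energy = hp_energy(sequence, path)
    for _ in range(steps):
        # End move: re-place one chain end next to its bonded neighbour.
        end, anchor = rng.choice(((0, 1), (len(path) - 1, len(path) - 2)))
        ax, ay = path[anchor]
        dx, dy = rng.choice(((1, 0), (-1, 0), (0, 1), (0, -1)))
        candidate = (ax + dx, ay + dy)
        if candidate in path:
            continue  # site occupied: the move would break self-avoidance
        trial = list(path)
        trial[end] = candidate
        trial_energy = hp_energy(sequence, trial)
        if metropolis_accept(trial_energy - energy, temperature, rng):
            path, energy = trial, trial_energy
    return path, energy
```

End moves alone cannot reach every conformation, so a practical simulation would add corner and crankshaft moves; they are omitted here to keep the sketch short.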

taken from Jeff. Skolnick et al.
32
Fragment Assembly

Also called Mini-Threading
David Baker's method
Construct a library of small structure fragments
with 9 residues.
Correlate each small structure fragment with
small sequence segments
Cut a target sequence into many small segments.
Each segment has some potential structural
fragments.
Assemble these fragments by Monte Carlo
simulation
Run thousands of simulations. Simulated structures
are clustered and ranked by energy.

Target sequence
33
Comparative Modeling

Homology modeling
identification of homologous proteins through
sequence alignment
structure prediction through placing residues
into corresponding positions of homologous
structure models
Protein threading
make structure prediction through identification
of good sequence-structure fit

34
PDB New Fold Growth
Old fold
New fold

Only a few thousand unique folds in nature
90% of new structures deposited to PDB in the
past three years have similar structural folds

35
Comparative Modeling

Find homologous proteins
Identify cores and loops: conserved segments are
cores, the rest are loops
Core modeling: copy backbone coordinates from the
homologous protein with known structure
Loop modeling
Side chain modeling
Refinement

36
Homology Modeling
Query Sequence
DRVYIHPFADRVYIHPFA

PSI-BLAST
HMM
Smith-Waterman algorithm

Protein sequence classification database
The Best Match
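
The Smith-Waterman algorithm named above finds the best-scoring local alignment between two sequences by dynamic programming. The sketch below computes only the score, and its scoring scheme (match +2, mismatch -1, linear gap penalty -2) is an assumption for illustration; real searches typically use a substitution matrix such as BLOSUM and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 option lets an alignment restart anywhere (local alignment).
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best
```

With these toy scores, aligning the query `DRVYIHPFADRVYIHPFA` from the slide against `DRVYIHPF` gives 16, that is, eight matched residues at +2 each.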
37
Protein Threading
38
Threading Example
39
Comparative Modeling Procedures

Step 1: Construction of Template Library
Step 2: Design of Scoring Function
Step 3: Sequence-Sequence or Sequence-Structure
Alignment
Step 4: Template Selection and Model Construction

40
Template Database

A representative set of protein structures from
PDB.
Representative structures should have high
resolution
X-ray structures are better than NMR structures
Non-redundant
Some tools: PDB_SELECT, SCOP, FSSP

PDB: Protein Data Bank at http://www.rcsb.org/pdb/Welcome.do
41
Alignment Model

Sequence-sequence alignment
Sequence-profile alignment
Sequence-HMM model alignment
e.g. SAM-T02 (K. Karplus et al.)
Profile-sequence alignment
e.g. PDB-Blast (A. Godzik et al.)
Profile-profile alignment
e.g. PROSPECT-II (Y. Xu et al.)
Sequence-structure alignment (threading)

42
Fold Recognition and Model Building

Alignment algorithms can only generate alignments
The best alignments should be chosen
Z-score
Machine learning approaches such as neural networks,
SVM, boosting
Some tools to generate a 3D structure from the
alignment
MODELLER (http://salilab.org/modeller/modeller.html)
MaxSprout (http://www.ebi.ac.uk/maxsprout/)
Jackal (http://honiglab.cpmc.columbia.edu/programs/jackal/intro.html)

43
CASP/CAFASP

CASP: Critical Assessment of Structure Prediction
CAFASP: Critical Assessment of Fully Automated
Structure Prediction

CASP Predictor
CAFASP Predictor

Won't get tired
High-throughput

44
CASP/CAFASP (contd)

Public
Organized by structure community
Evaluated by an unbiased third party
Held every two years
Blind
Experimental structures to be determined by
structure centers after competition
Drawback: < 100 targets
Blindness
Some centers are reluctant to release their
structures

45
CAFASP/CASP (contd)

Time for each target
Individual servers: 48 hours
Meta servers: 96 hours
CASP5 predictors: 3 to 4 months
Resources for predictors
No X-ray, NMR machines (of course)
CAFASP3 predictors: no manual intervention
CASP5 predictors: anything (servers, Google, ...)
Evaluation
CASP5: assessed by experts and computer
CAFASP3: evaluated by a computer program.

46
Test Protein Category

Homology Modeling (HM) targets
Easy HM: has a homologous protein in PDB
Hard HM: has a distant homologous protein in PDB
Also called Comparative Modeling (CM) targets
Fold Recognition (FR) targets
Has a similar fold in PDB
New Fold (NF) targets
No similar fold in PDB

47
Online Servers
http://www.bioinformatics.uwaterloo.ca/j3xu/raptor/index.php
http://robetta.bakerlab.org/index.html
http://www.sbg.bio.ic.ac.uk/phyre/
50
Scoring Function
How well a residue fits a structural
environment: E_s (fitness score)
How preferable it is to put two particular residues
nearby: E_p (pairwise potential)
Sequence similarity between query and template
proteins: E_m (mutation score)
Alignment gap penalty: E_g (gap score)
How consistent the secondary structures are: E_ss
$E = E_p + E_s + E_m + E_g + E_{ss}$
Minimize E to find a sequence-template alignment
51
Scoring Sequence Similarity
Position-specific scores for each amino acid (columns) at each sequence position (rows):
Pos AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1  D  -3  -2  -1   5  -5   5   4  -3  -2  -5  -5  -1  -3  -5  -3  -2  -3  -5  -4  -4
  2  K  -2  -3   0  -3  -4   1   1  -4  -3  -3  -3   0   1  -4   3   0   5  -5  -4  -3
  3  A   5  -4  -4  -4  -3  -3   0  -3  -4   0  -3  -1  -3  -4  -2   0   2  -5  -4   2
  4  T   3  -4   1  -4  -3  -3  -3  -2  -4  -3  -3  -2   2  -4  -4   2   4  -5  -4   1
  5  I   4  -4  -4  -5  -3   0  -4  -2  -5   0   1   0  -2   0  -3  -1  -3  -5  -4   4
  6  P   4   0  -4  -4  -1  -3  -2  -3  -4   0  -4  -3  -3  -3   2   0   2  -5  -4   1
  7  S   2  -4  -3  -3  -4  -1   2   3  -1  -4  -2  -2  -2   0  -4   2  -3  -5  -4   1
  8  E   3  -4  -2  -4  -4  -2   3   0  -1   0  -1   0  -2   2   0  -2  -2   0  -3   0
  9  S   2   0  -2  -4  -4  -3   1   1  -1  -3   0  -2   2  -3   1   0   2  -5  -4  -1
 10  P   1  -4  -2  -2  -4  -3   0   0  -1   0   1  -2  -3  -4   4   1   2  -5  -4   1
 11  F   3  -4  -4  -3  -4  -2  -1  -3  -5  -3  -1  -2  -2   0   2   1   2   2  -4   2
 12  A   4  -2  -4  -4  -4   0  -1   3  -4  -3  -1  -1  -2  -3  -3   1  -1  -5  -2  -2
 13  A   3   0   2  -1  -4   2   0  -1  -4  -3  -4  -3  -2   0   2   2  -2  -5  -4   0
 14  A   4  -3  -3   0  -4   1  -3   0  -4  -4  -2   0  -2  -5   2   0   2  -5  -5  -2
 15  E   2  -2   1   0  -4  -2   3  -1  -4  -3  -3  -1   3  -5   3   0  -1  -5  -5  -1
 16  V   3  -4  -1   2  -4  -1  -1  -2  -4  -1  -2  -1  -3  -5  -4   0   3  -6  -5   1
 17  A   2   1  -2  -2   0   2  -1  -2   0   1  -1  -1  -1  -1   1   0  -1  -5  -2   2

Observed percentage of each amino acid at each position, followed by the two trailing per-position values from the source table:
Pos AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1  D   0   0   0  33   0  30  36   0   0   0   0   0   0   0   0   0   0   0   0   0  1.09  0.67
  2  K   0   0   3   0   0   7   7   0   0   0   0   6   4   0  15   4  53   0   0   0  0.81  0.87
  3  A  50   0   0   0   0   0   8   0   0   5   0   4   0   0   2   3  12   0   0  17  0.67  1.07
  4  T  29   0   6   0   0   0   0   1   0   0   0   1   5   0   0  13  32   0   0  12  0.68  1.17
  5  I  34   0   0   0   0   5   0   2   0   3  13   5   0   4   1   3   0   0   0  31  0.56  1.19
  6  P  46   7   0   0   1   0   1   0   0   5   0   0   0   1  11   6  14   0   0   8  0.67  1.28
  7  S  18   0   0   0   0   2  15  25   1   0   4   1   1   4   0  17   0   0   0  12  0.47  1.33
  8  E  29   0   1   0   0   1  19   6   2   4   7   6   1   9   4   1   1   1   0   8  0.34  1.34
  9  S  21   4   1   0   0   0   8  13   1   0   9   2   5   1   6   7  15   0   0   4  0.29  1.37
 10  P  11   0   2   1   0   0   5   7   1   5  13   2   0   0  20  12  12   0   0   9  0.36  1.35
 11  F  25   0   0   1   0   1   3   2   0   0   6   1   1   5   9  11  15   3   0  16  0.42  1.41
 12  A  37   2   0   0   0   4   4  24   0   1   6   3   1   1   1  11   2   0   2   2  0.57  1.42
 13  A  22   5  10   2   0   9   6   4   0   1   1   0   1   4   9  15   2   0   0   9  0.32  1.40
 14  A  43   1   0   5   0   6   0   6   0   0   3   5   1   0   9   7  12   0   0   3  0.58  1.41
 15  E  21   1   6   4   0   1  19   4   0   1   2   4   8   0  16   5   3   0   0   5  0.43  1.42
 16  V  31   0   3  11   0   2   4   2   0   4   4   4   0   0   0   4  21   0   0  11  0.48  1.46
 17  A  17   9   1   2   2  10   3   2   2   9   4   3   1   2   6   8   3   0   1  13  0.18  1.40
Sequence similarity profile similarity
52
Scoring Fitness Score
Probability of amino acid a occurring with solvent accessibility s
Probability of amino acid a occurring
Probability of solvent accessibility s occurring
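
The slide's formula did not survive extraction, but the three probabilities listed are the usual ingredients of a log-odds score. The sketch below assumes the common form score(a, s) = -log( P(a, s) / (P(a) P(s)) ), estimated from counts in known structures; this form, the function name `fitness_scores`, and the input layout are assumptions for illustration, not taken from the slides.

```python
import math
from collections import Counter


def fitness_scores(observations):
    """Log-odds fitness scores from (amino_acid, environment) observations."""
    observations = list(observations)
    total = len(observations)
    pair_counts = Counter(observations)
    aa_counts = Counter(a for a, _ in observations)
    env_counts = Counter(s for _, s in observations)
    scores = {}
    for (a, s), n in pair_counts.items():
        p_joint = n / total           # P(a, s)
        p_a = aa_counts[a] / total    # P(a)
        p_s = env_counts[s] / total   # P(s)
        # Negative when a occurs in s more often than chance (favourable).
        scores[(a, s)] = -math.log(p_joint / (p_a * p_s))
    return scores
```

Under this sign convention a lower score means a better fit, which matches the later goal of minimizing the total alignment energy.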
53
Scoring Pairwise Potential
Probability of a and b occurring with distance <
cutoff
Probability of amino acid a occurring
Probability of amino acid b occurring
54
Scoring Secondary Structure

1. Difference between predicted secondary structure
and template secondary structure
2. PSIPRED for secondary structure prediction

55
Contact Graph

Each residue is a vertex
One edge between two residues if their spatial
distance is within a given cutoff.
Cores are the most conserved segments in the
template
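
A contact graph as described above can be built directly from residue coordinates. In the sketch below, the choice of one representative point per residue, the 7.0 Å cutoff, and the name `contact_graph` are assumptions for illustration.

```python
import math
from itertools import combinations


def contact_graph(coords, cutoff=7.0):
    """Return the set of edges (i, j), i < j, whose residues lie within cutoff."""
    return {(i, j)
            for (i, p), (j, q) in combinations(enumerate(coords), 2)
            if math.dist(p, q) <= cutoff}
```

For example, four residues placed on a line at x = 0, 3.8, 7.6 and 20 yield the edges (0, 1) and (1, 2) only.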

template
56
Simplified Contact Graph
57
Alignment Example
58
Alignment Example
59
Calculation of Alignment Score
60
Hardness of Protein Threading

Protein threading is NP-hard
Reduction from Max-Cut
Given a graph, number its nodes in a certain order.
Assume there are M nodes.
Consider a sequence of length 2M like this:
PHPHPH
The pairwise score is 1 only if two different types
of residues are mapped to the two ends of one graph
edge, otherwise 0

61
Threading Algorithms

Approximation Algorithm
Interaction-frozen algorithm (A. Godzik et al.)
Monte Carlo sampling (S.H. Bryant et al.)
Double dynamic programming (D. Jones et al.)
Exact Algorithm
Branch-and-bound (R.H. Lathrop and T.F. Smith)
PROSPECT-I uses divide-and-conquer (Y. Xu et al.)
Linear programming by RAPTOR (J. Xu et al.)

63
Template Ranking

Alignment score
residue composition bias
template length bias
Z-score (S.H. Bryant et al., 1995)
statistical test, cancels out bias
time-consuming to calculate: sequences must be
shuffled and threaded many times (100 times)
Classification-based methods
A threading pair is positive if the two proteins have similar
structures
Noise caused by bad alignments
Regression-based method
Predict alignment accuracy
Rank templates based on predicted accuracy
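
The shuffle-based Z-score described above can be sketched as follows. The scoring function is passed in as a parameter, so `score_fn`, `z_score`, and the default of 100 shuffles are placeholders for illustration; the sketch also assumes that higher scores are better, so the sign would flip for an energy that is minimized.

```python
import random
import statistics


def z_score(sequence, score_fn, shuffles=100, seed=0):
    """Z-score of score_fn(sequence) against scores of shuffled sequences."""
    rng = random.Random(seed)
    background = []
    for _ in range(shuffles):
        residues = list(sequence)
        rng.shuffle(residues)  # keeps residue composition, destroys order
        background.append(score_fn("".join(residues)))
    mean = statistics.mean(background)
    stdev = statistics.stdev(background)  # zero if all shuffles score alike
    return (score_fn(sequence) - mean) / stdev
```

Because every shuffle has to be scored (in threading, re-aligned) again, the cost is `shuffles` times that of a single alignment, which is the expense the slide points out.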

64
Feature Extraction

topology of a predicted structure
sizes
pairwise contacts
independent domain?
sequence-template alignment
alignment scores
sequence identity
gaps

features
65
CAFASP3 result
Servers with names in italics are meta servers
(http://www.cs.bgu.ac.il/dfischer/CAFASP3,
released in December 2002.)
67
Protein Side-Chain Packing

Problem: given the backbone coordinates of a
protein, predict the coordinates of the
side-chain atoms
Insight: a protein structure is a geometric
object with special features
Method: decompose a protein structure into some
very small blocks

68
Side-Chain Packing
[Figure: candidate side-chain positions labeled 0.3, 0.2, 0.3, 0.7, 0.1, 0.4, 0.1, 0.1 and 0.6, with one clash marked]
Each residue has many possible side-chain
positions. Each possible position is called a
rotamer. Need to avoid atomic clashes.
69
Energy Function
Assume rotamer A(i) is assigned to residue i. The
side-chain packing quality is measured by an energy
function with two kinds of terms:
Occurring preference: the higher the occurring
probability, the smaller the value
Clash penalty: depends on the distance between two
atoms and the atom radii
[Figure: clash penalty plot; axis ticks 0.82 and 1, maximum penalty 10]
Minimize the energy function to obtain the best
side-chain packing.
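
The exact formulas on this slide were lost, so the sketch below is only one reading of the surviving values (0.82, 1 and 10): it assumes the clash penalty depends on the ratio of inter-atomic distance to the sum of the atom radii, is 0 at a ratio of 1 or more, is 10 at 0.82 or less, and ramps linearly in between. The preference term is assumed to be -log(probability). All names (`clash_penalty`, `packing_energy`) are hypothetical.

```python
import math


def clash_penalty(distance, radius_sum, ramp_start=0.82, max_penalty=10.0):
    """Assumed piecewise-linear penalty on distance / (sum of atom radii)."""
    ratio = distance / radius_sum
    if ratio >= 1.0:
        return 0.0            # atoms do not overlap
    if ratio <= ramp_start:
        return max_penalty    # severe overlap: capped penalty
    return max_penalty * (1.0 - ratio) / (1.0 - ramp_start)


def packing_energy(probabilities, atom_pairs):
    """probabilities: probability of the chosen rotamer for each residue.
    atom_pairs: (distance, radius_sum) for atom pairs in different residues."""
    # Higher probability gives a smaller (better) preference value.
    preference = sum(-math.log(p) for p in probabilities)
    clashes = sum(clash_penalty(d, r) for d, r in atom_pairs)
    return preference + clashes
```

Side-chain packing then means choosing one rotamer per residue so that this total is as small as possible.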
70
Related Work

NP-hard (Akutsu, 1997; Pierce et al., 2002) and
NP-complete to achieve an approximation ratio
O(N) (Chazelle et al., 2004)
Dead-End Elimination: eliminate rotamers
one-by-one
SCWRL: biconnected decomposition of a protein
structure (Dunbrack et al., 2003)
One of the most popular side-chain packing
programs
Linear integer programming (Althaus et al., 2000;
Eriksson et al., 2001; Kingsford et al., 2004)
The formulation is similar to that used in RAPTOR

71
Residue Interaction Graph
Each residue is a vertex
Two residues interact if there is a potential
clash between their rotamer atoms
Add one edge between two residues that interact.

[Figure: residue interaction graph with vertices a, b, c, d, e, f, h, i, j, k, l, m, s]
72
Key Observations

A residue interaction graph is a geometric
neighborhood graph
Each rotamer is within a constant distance of its
backbone position
There is no interaction edge between two residues
if their distance is beyond D. D is a constant
depending on rotamer diameter.
Residue interaction graphs are sparse!
Any two residue centers cannot be too close.
Their distance is at least a constant C.

No previous algorithms exploit these features!
73
Tree Decomposition (Robertson & Seymour, 1986)
Greedy minimum degree heuristic

Choose the vertex with minimal degree
The chosen vertex and its neighbors form a
component
Add one edge to any two neighbors of the chosen
vertex
Remove the chosen vertex
Repeat the above steps until the graph is empty
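
The greedy steps above translate almost line for line into code. This sketch returns only the components (bags) and the resulting width; it does not link the bags into a tree, and the name `min_degree_decomposition`, the adjacency-dict input, and the tie-breaking by vertex name are choices made for the example.

```python
def min_degree_decomposition(adjacency):
    """Greedy minimum-degree elimination.

    adjacency: dict mapping each vertex to the set of its neighbours
    (undirected, no self-loops). Returns (bags, width).
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}  # work on a copy
    bags = []
    while graph:
        # Choose the vertex with minimal degree (ties broken by vertex name).
        v = min(graph, key=lambda u: (len(graph[u]), u))
        neighbours = graph[v]
        # The chosen vertex and its neighbours form a component.
        bags.append({v} | neighbours)
        # Add an edge between every two neighbours of the chosen vertex.
        for a in neighbours:
            graph[a] |= neighbours - {a}
            graph[a].discard(v)
        # Remove the chosen vertex.
        del graph[v]
    width = max(len(bag) for bag in bags) - 1
    return bags, width
```

The heuristic gives an upper bound on the treewidth rather than the optimum in general; on a simple path it returns width 1 and on a four-vertex cycle width 2, which are the true treewidths of those graphs.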

74
Tree Decomposition (Contd)
Tree Decomposition
Tree width is the maximal component size minus 1.
75
Side-Chain Packing Algorithm

1. Bottom-to-top: calculate the minimal energy
function
2. Top-to-bottom: extract the optimal assignment
3. Time complexity: exponential in tree width,
linear in graph size

A tree decomposition rooted at Xr
The score of component Xi
The scores of subtree rooted at Xl
The score of subtree rooted at Xi
The scores of subtree rooted at Xj
76
Theoretical Treewidth Bounds

For a general graph, it is NP-hard to determine
its optimal treewidth.
Has a treewidth
Can be found by a low-degree polynomial-time
algorithm, based on the Sphere Separator Theorem
(G.L. Miller et al., 1997), a generalization of
the Planar Separator Theorem
Has a treewidth lower bound
The residue interaction graph is a cube
Each residue is a grid point

77
Result (1)
Method tree-decomposition of the protein
structure to take advantage of its geometric
characteristics.
CPU time (seconds)

Five times faster on average, tested on 180
proteins used by SCWRL
Same prediction accuracy as SCWRL 3.0

[Formula: theoretical time complexity, a "<<" comparison between two bounds, expressed in terms of the average number of rotamers for each residue]
78
Result (2)
An optimization problem admits a PTAS if, given an
error $\epsilon$ ($0 < \epsilon < 1$), there is a polynomial-time
algorithm to obtain a solution within a factor of
$(1 + \epsilon)$ of the optimum.

Has a PTAS if one of the following conditions is
satisfied
All the energy items are non-positive
All the pairwise energy items have the same sign,
and the lowest system energy is away from 0 by a
certain amount

Chazelle et al. have proved that it is
NP-complete to approximate this problem within a
factor of O(N), without considering the geometric
characteristics of a protein structure.
79
Outline

Introduction to Protein Structures
Introduction to Protein Structure Prediction
Ab initio folding
Protein threading
Homology modeling
CASP structure prediction competition
My Work (shameless promotion)
RAPTOR protein threading
SVM approach to fold recognition
TreePack protein side-chain packing
Prediction examples

80
CASP6 Example
T0267 (PDB id 1wk4)
T0268_1 (PDB id 1wg8)
MaxSub score 0.9
MaxSub score 0.6
Sequence identity 19%
Sequence identity 50%
81
CASP6 Example
T0224 (PDB id 1rhx or 1x9a)
T0228_1 (PDB id 1vlp)
MaxSub score 0.5
MaxSub score 0.3
Sequence identity 8%
Sequence identity 15%
82
CASP6 Example
T0238 (PDB id 1w33)
T0242 (PDB id 2blk)
MaxSub score 0.2
MaxSub score 0.17
Sequence identity 9%
Sequence identity 10%
83
Summary

Protein Structures
Primary structure, secondary structure, tertiary
structure
Protein classification
Experimental structure determination
Protein Structure Prediction
Ab initio folding
Protein threading
Homology modeling
CASP structure prediction competition
My Work
RAPTOR protein threading
SVM approach to fold recognition
TreePack protein side-chain packing

84
Reading List

CASP1, CASP2, CASP3, CASP4, CASP5 and CASP6
Special Issues, Proteins: Structure, Function and
Genetics, 1995, 1997, 1999, 2001, 2003, 2005
Jinbo Xu. Rapid Protein Side-Chain Packing via
Tree Decomposition. RECOMB 2005.
Email to j3xu_at_tti-c.org

85
Acknowledgements

Bonnie Berger, Introduction to Computational
Molecular Biology, course notes, 2001
Bin Ma, Bioinformatics, course notes, 2004

86
Growth of Protein Sequences and Structures
Data from http://www.dna.affrc.go.jp
87
Three Structure Levels
Primary structure: sequence of amino acids
e.g., DRVYIHPF
Secondary structure: local folding patterns
e.g., alpha-helix, beta-sheet, loop
Tertiary structure: complete 3D fold

[Figure labels: alpha-helix, beta-sheet, loop]
88
Linear Integer Program

Linear programs can be solved in polynomial
time
No polynomial-time algorithm is known for integer programs
Relax to a linear program and solve the linear
version
Branch-and-bound or branch-and-cut (may cost
exponential time)

89
Linear Integer Program
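maximize $z = 6x + 5y$ (linear function)
subject to
$3x + y \le 11$, $-x + 2y \le 5$, $x, y \ge 0$ (linear constraints)
$x, y$ integer (integral constraints, nonlinear)
Linear program: the objective with the linear constraints only
Integer program: the linear program plus the integral constraints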
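
To make the gap between the two problems concrete, the sketch below solves this small example both ways, reading it as: maximize z = 6x + 5y subject to 3x + y ≤ 11, -x + 2y ≤ 5, x, y ≥ 0. The linear program is solved by checking every vertex of the feasible polygon and the integer program by brute force; both shortcuts only work because the example is tiny, and the function names are made up for the illustration.

```python
from fractions import Fraction
from itertools import combinations

# Constraints written as a*x + b*y <= c, including x >= 0 and y >= 0.
CONSTRAINTS = [(3, 1, 11), (-1, 2, 5), (-1, 0, 0), (0, -1, 0)]


def objective(x, y):
    return 6 * x + 5 * y


def feasible(x, y):
    return all(a * x + b * y <= c for a, b, c in CONSTRAINTS)


def solve_lp():
    """Best vertex of the feasible polygon (an LP optimum lies at a vertex)."""
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(CONSTRAINTS, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundary lines never intersect
        # Cramer's rule for the intersection of the two boundary lines.
        x = Fraction(c1 * b2 - c2 * b1, det)
        y = Fraction(a1 * c2 - a2 * c1, det)
        if feasible(x, y) and (best is None or objective(x, y) > best[0]):
            best = (objective(x, y), x, y)
    return best


def solve_ip(limit=12):
    """Brute-force search over non-negative integer points below limit."""
    points = [(x, y) for x in range(limit) for y in range(limit) if feasible(x, y)]
    x, y = max(points, key=lambda p: objective(*p))
    return objective(x, y), x, y
```

The LP optimum is z = 232/7 at (17/7, 26/7), while the integer optimum is z = 28 at (3, 2). Rounding the LP solution down gives (2, 3) with z = 27, which is why relaxation alone is not enough and branch-and-bound or branch-and-cut is needed.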
90
Variables

x(i,l) denotes that core i is aligned to sequence
position l
y(i,l,j,k) denotes that core i is aligned to
position l and core j is aligned to position k at
the same time.

91
LP Formulation
a: singleton score parameter; b: pairwise score
parameter
Each y variable is 1 if and only if its two x
variables are 1
Each core has only one alignment position
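
The equations on this slide did not survive extraction. One standard way to write an integer program that satisfies the two stated conditions, using the x(i,l) and y(i,l,j,k) variables defined earlier, is sketched below. The index conventions and the exact form of the linking constraints are assumptions, and the constraints that keep the cores in sequence order are left out.

```latex
\begin{aligned}
\min\ & \sum_{i,l} a_{i,l}\, x_{i,l} \;+\; \sum_{i,l,j,k} b_{i,l,j,k}\, y_{i,l,j,k} \\
\text{s.t.}\ & \sum_{l} x_{i,l} = 1
  && \text{for every core } i \\
& \sum_{k} y_{i,l,j,k} = x_{i,l}
  && \text{for every interacting core pair } (i,j) \text{ and position } l \\
& \sum_{l} y_{i,l,j,k} = x_{j,k}
  && \text{for every interacting core pair } (i,j) \text{ and position } k \\
& x_{i,l},\; y_{i,l,j,k} \in \{0, 1\}
\end{aligned}
```

Together with the first constraint, the two linking constraints force y(i,l,j,k) = 1 exactly when x(i,l) = 1 and x(j,k) = 1. Dropping the integrality requirement gives the linear programming relaxation.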
92
Protein Threading

Make a structure prediction through finding an
optimal alignment (placement) of a protein
sequence onto each known structure (structural
template)
alignment quality is measured by some
statistics-based scoring function
best overall alignment among all templates may
give a structure prediction

93
Threading Model

Each template is parsed as a chain of cores. Two
adjacent cores are connected by a loop. Cores are
the most conserved segments in a protein.
No gap allowed within a core.
Only pairwise contacts between two core
residues are considered, because contacts involving
loop residues are not well conserved.
Global alignment employed

94
CASP5/CAFASP3 targets
Hard
Easy
Prediction Difficulty
CM: Comparative Modelling, HM: Homology
Modelling, FR: Fold Recognition, NF: New Fold
95
RAPTOR's sensitivity on FR targets
1. RAPTOR is weak at recognizing FR(A) targets
(needs improvement)
2. RAPTOR could not deal with
NF targets at all (normal)
96
Support Vector Machine (SVM) Regression (A.J.
Smola et al.)
Notation
Data
Linear regression
If the relationship between f and x is not linear
SVM regression: linear regression in a
high-dimensional space
Condition
97
Side Chain Properties