Title: Protein Structure Prediction
1Protein Structure Prediction
- Jinbo Xu
- j3xu_at_tti-c.org
- Toyota Technological Institute at Chicago
2Outline
- Introduction to Protein Structures
- Introduction to Protein Structure Prediction
- Ab initio folding
- Protein threading
- Homology modeling
- CASP structure prediction competition
- My Work
- RAPTOR protein threading
- SVM approach to fold recognition
- TreePack protein side-chain packing
- Prediction examples
3Biology in One Slide
Organism
Protein
4Proteins
Proteins are the building blocks of life. In a cell, 70% is water and 15-20% is protein.
Examples:
- hormones: regulate metabolism
- structural proteins: hair, wool, muscle
- antibodies: immune response
- enzymes: catalyze chemical reactions
5Amino Acids
A protein is composed of a central backbone and a collection of (typically) 50-2000 amino acids (a.k.a. residues). There are 20 different kinds of amino acids, each consisting of up to 18 atoms, e.g.,
Name            3-letter code   1-letter code
Leucine         Leu             L
Alanine         Ala             A
Serine          Ser             S
Glycine         Gly             G
Valine          Val             V
Glutamic acid   Glu             E
Threonine       Thr             T
6Amino Acids
Side chain
Each amino acid is identified by its side chain,
which determines the properties of this amino
acid.
7Side Chain Properties
- Hydrophobic residues stay inside, while hydrophilic residues stay close to water
- Oppositely charged amino acids can form salt bridges
- Polar amino acids can participate in hydrogen bonding
The amino acid names are colored according to their type: positively charged, negatively charged, polar but not charged, aliphatic (nonpolar), and aromatic. Amino acids that are essential to mammals are marked with an asterisk (*).
8Protein Folding
- Proteins must fold to function
- Some diseases are caused by misfolding
- e.g., mad cow disease
9Protein Structure
[Figure: polypeptide backbone drawn as a chain of repeating N-CH-C(=O) units, running from the H3N terminus to the COO- terminus]
10Three Structure Levels
- Primary structure: sequence of amino acids
- e.g., DRVYIHPF
- Secondary structure: local folding patterns
- e.g., alpha-helix, beta-sheet, loop
- Tertiary structure: complete 3D fold
[Figure: structure of PDB ID 12as with a helix, a beta sheet, and a loop labeled]
11Beta Sheet Examples
Anti-parallel beta sheet
Parallel beta sheet
12Beta Sheet Examples (Contd)
13Helix Examples
14Domain, Fold, Motif
- A protein chain can have several domains
- A domain is a discrete portion of a protein that can fold independently and possesses its own function
- The overall shape of a domain is called a fold. There are only a few thousand possible folds.
- Sequence motif: highly conserved protein subsequence
- Structure motif: highly conserved substructure
15Protein Data Bank
- About 30,000 protein structures, solved using experimental techniques
- 800 are unique structural folds
[Figure: example proteins with the same structural fold and with different structural folds]
16Protein Structure Determination
- High-resolution structure determination
- X-ray crystallography (1 Å)
- Nuclear magnetic resonance (NMR) (1-2.5 Å)
- Low-resolution structure determination
- Cryo-EM (electron microscopy): 10-15 Å
17X-ray crystallography
- Most accurate
- An extremely pure protein sample is needed.
- The protein sample must form crystals that are relatively large and without flaws. This is generally the biggest problem.
- Many proteins aren't amenable to crystallization at all (e.g., proteins that do their work inside a cell membrane).
- $100K per structure
18Nuclear Magnetic Resonance
- Fairly accurate
- No need for crystals
- limited to small, soluble proteins only.
19Protein Classification
- Family: homologous, same ancestor, high sequence identity, similar structures
- Superfamily: distantly homologous, same ancestor, sequence identity around 25-30%, similar structures
- Fold: only shapes are similar, no homologous relationship, low sequence identity
- Protein classification databases: Pfam, SCOP, CATH, FSSP
20Pfam
- http://www.sanger.ac.uk/Software/Pfam/
- Protein sequence classification database
- As of December 2005, 8183 families
- Most sequences are covered
- Multiple sequence alignment for each family, then modeled by an HMM
21SCOP
- http://scop.mrc-lmb.cam.ac.uk/scop/
- Protein structure classification database, manually curated
- 70859 domains, 25973 PDB entries
22The Problem
- Protein functions are determined by 3D structures
- 30,000 protein structures in PDB (Protein Data Bank)
- Experimental determination of protein structures is time-consuming and expensive
- Many protein sequences are available
[Figure: diagram relating sequence, protein structure, function, and medicine]
24Protein Structure Prediction
- In theory, a protein structure can be solved computationally
- A protein folds into the 3D structure that minimizes its free energy
- The problem can be formulated as a search for the minimum energy
- the search space is enormous
- the number of local minima increases exponentially with chain length
Computationally, it is an exceedingly difficult problem.
25Who Cares?
- Long history: more than 30 years
- Listed as a grand challenge problem
- IBM's Big Blue
- Competitions: CASP (1992-2006)
- Useful for
- Drug design
- Function annotation
- Rational protein engineering
- Target selection
26Observations
- Sequences determine structures
- Proteins fold into minimum energy state.
- Structures are more conserved than sequences. Two proteins with 30% sequence identity likely share the same fold.
27What determines structures?
- Hydrogen bonds: essential in stabilizing the basic secondary structures
- Hydrophobic effects: strongest determinants of protein structures
- Van der Waals forces: stabilize the hydrophobic cores
- Electrostatic forces: oppositely charged side chains form salt bridges
28Protein Structure Prediction
- Stage 1: Backbone Prediction
- Ab initio folding
- Homology modeling
- Protein threading
- Stage 2: Loop Modeling
- Stage 3: Side-Chain Packing
- Stage 4: Structure Refinement
The picture is adapted from http://www.cs.ucdavis.edu/koehl/ProModel/fillgap.html
29State of The Art
- Ab initio folding (simulation-based method)
- 1998: Duan and Kollman
- 36 residues, 1000 ns, 256 processors, 2 months
- Did not find the native structure
- Template-based (or knowledge-based) methods
- Homology modeling: sequence-sequence alignment, works if sequence identity > 25%
- Protein threading: sequence-structure alignment, can go beyond the 25% limit
30Ab Initio Folding
- Based on first principles: build structures purely from protein sequences
- An energy function describes the protein
- bond energy
- bond angle energy
- dihedral angle energy
- van der Waals energy
- electrostatic energy
- Calculate the structure by minimizing the energy function
- Not practical in general
- Computationally very expensive
- Accuracy is poor
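The five energy terms listed on this slide are commonly combined in a molecular-mechanics energy of roughly the following form. This is a generic textbook sketch, not necessarily the function used by any particular program; the force constants k, the reference values with subscript 0, the dihedral multiplicity n and phase delta, and the Lennard-Jones and charge parameters are all assumed symbols.

```latex
E = \sum_{\text{bonds}} k_b (b - b_0)^2
  + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2
  + \sum_{\text{dihedrals}} k_\phi \left[ 1 + \cos(n\phi - \delta) \right]
  + \sum_{i<j} 4\varepsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12}
      - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right]
  + \sum_{i<j} \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}}
```

The last two sums run over all non-bonded atom pairs, which is one reason evaluating and minimizing such a function is computationally expensive.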
31Lattice Model
- A simplified representation of proteins
- Arrange atoms at grid points by Monte Carlo simulation
- NMR restraints can speed up the simulation
Taken from Jeff Skolnick et al.
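To make the lattice idea concrete, here is a minimal sketch using the 2D HP model, which is an assumption for illustration (the slide does not say which lattice model is meant). Each residue is H or P, sits on a grid point, and every non-bonded H-H contact contributes -1 to the energy. The `metropolis_accept` helper shows the acceptance rule a Monte Carlo simulation would typically use; both function names are hypothetical.

```python
import math
import random


def hp_energy(sequence, coords):
    """Energy of a 2D lattice conformation: -1 per non-bonded H-H contact."""
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    index_at = {xy: i for i, xy in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look right and up only, so each contact is counted exactly once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                energy -= 1
    return energy


def metropolis_accept(delta_e, temperature, rng=random):
    """Accept downhill moves always, uphill moves with probability exp(-dE/T)."""
    return delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature)


# A 4-residue chain folded into a square brings residues 0 and 3 into contact.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
line = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(hp_energy("HPPH", square))  # -1
print(hp_energy("HPPH", line))    # 0
```

A full simulation would repeatedly propose a local move, recompute the energy, and keep the move when `metropolis_accept` returns True.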
32Fragment Assembly
- Also called mini-threading
- David Baker's method
- Construct a library of small structure fragments with 9 residues.
- Correlate each small structure fragment with some small sequence segments
- Cut a target sequence into many small segments. Each segment has some potential structural fragments.
- Assemble these fragments by Monte Carlo simulation
- Run thousands of simulations. Simulated structures are clustered and ranked by energy.
Target sequence
33Comparative Modeling
- Homology modeling
- identification of homologous proteins through sequence alignment
- structure prediction through placing residues into corresponding positions of homologous structure models
- Protein threading
- structure prediction through identification of a good sequence-structure fit
34PDB New Fold Growth
Old fold
New fold
- Only a few thousand unique folds in nature
- 90% of new structures deposited to PDB in the past three years have structural folds similar to known ones
35Comparative Modeling
- Find homologous proteins
- Identify cores and loops: conserved segments are cores, the rest are loops
- Core modeling: copy backbone coordinates from the homologous protein with known structure
- Loop modeling
- Side chain modeling
- Refinement
36Homology Modeling
Query Sequence
DRVYIHPFADRVYIHPFA
- PSI-BLAST
- HMM
- Smith-Waterman algorithm
Protein sequence classification database
The Best Match
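Of the tools named on this slide, the Smith-Waterman algorithm is compact enough to sketch. The version below is a minimal local alignment with a linear gap penalty; the scoring values (match +2, mismatch -1, gap -1) are assumptions for illustration, whereas real searches use a substitution matrix such as BLOSUM and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best local score, aligned piece of a, aligned piece of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_a, aligned_b = smith_waterman("ACACACTA", "AGCACACA")
print(score)  # 12
```

In homology modeling the query would be compared against every sequence in the database this way, and the highest-scoring hit with a known structure becomes the template.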
37Protein Threading
38Threading Example
39Comparative Modeling Procedures
- Step 1: Construction of Template Library
- Step 2: Design of Scoring Function
- Step 3: Sequence-Sequence or Sequence-Structure Alignment
- Step 4: Template Selection and Model Construction
40Template Database
- A representative set of protein structures from PDB.
- Representative structures should have high resolution
- X-ray structures are better than NMR structures
- Non-redundant
- Some tools: PDB_SELECT, SCOP, FSSP
PDB: Protein Data Bank at http://www.rcsb.org/pdb/Welcome.do
41Alignment Model
- Sequence-sequence alignment
- Sequence-profile alignment
- Sequence-HMM alignment
- e.g. SAM-T02 (K. Karplus et al.)
- Profile-sequence alignment
- e.g. PDB-Blast (A. Godzik et al.)
- Profile-profile alignment
- e.g. PROSPECT-II (Y. Xu et al.)
- Sequence-structure alignment: threading
42Fold Recognition and Model Building
- Alignment algorithms can only generate alignments
- The best alignments should be chosen
- Z-score
- Machine learning approaches such as neural networks, SVM, and boosting
- Some tools to generate a 3D structure from the alignment
- MODELLER (http://salilab.org/modeller/modeller.html)
- MaxSprout (http://www.ebi.ac.uk/maxsprout/)
- Jackal (http://honiglab.cpmc.columbia.edu/programs/jackal/intro.html)
43CASP/CAFASP
- CASP: Critical Assessment of Structure Prediction
- CAFASP: Critical Assessment of Fully Automated Structure Prediction
CASP Predictor
CAFASP Predictor
- Won't get tired
- High-throughput
44CASP/CAFASP (contd)
- Public
- Organized by the structure community
- Evaluated by an unbiased third party
- Held every two years
- Blind
- Experimental structures are determined by structure centers after the competition
- Drawback: < 100 targets
- Blindness
- Some centers are reluctant to release their structures
45CAFASP/CASP (contd)
- Time for each target
- Individual servers: 48 hours
- Meta servers: 96 hours
- CASP5 predictors: 3 to 4 months
- Resources for predictors
- No X-ray, NMR machines (of course)
- CAFASP3 predictors: no manual intervention
- CASP5 predictors: anything (servers, Google, ...)
- Evaluation
- CASP5: assessed by experts and computer programs
- CAFASP3: evaluated by a computer program
46Test Protein Category
- Homology Modeling (HM) targets
- Easy HM: has a homologous protein in PDB
- Hard HM: has a distantly homologous protein in PDB
- Also called Comparative Modeling (CM) targets
- Fold Recognition (FR) targets
- Has a similar fold in PDB
- New Fold (NF) targets
- No similar fold in PDB
47Online Servers
http://www.bioinformatics.uwaterloo.ca/j3xu/raptor/index.php
http://robetta.bakerlab.org/index.html
http://www.sbg.bio.ic.ac.uk/phyre/
50Scoring Function
- E_s (fitness score): how well a residue fits a structural environment
- E_p (pairwise potential): how preferable it is to put two particular residues nearby
- E_m (mutation score): sequence similarity between query and template proteins
- E_g (gap score): alignment gap penalty
- E_ss (secondary structure score): how consistent the secondary structures are
E = E_p + E_s + E_m + E_g + E_ss
Minimize E to find a sequence-template alignment
51Scoring Sequence Similarity
Position-specific profile for query positions 1-17. First block: substitution scores for each of the 20 amino acids.
Pos AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1  D  -3  -2  -1   5  -5   5   4  -3  -2  -5  -5  -1  -3  -5  -3  -2  -3  -5  -4  -4
  2  K  -2  -3   0  -3  -4   1   1  -4  -3  -3  -3   0   1  -4   3   0   5  -5  -4  -3
  3  A   5  -4  -4  -4  -3  -3   0  -3  -4   0  -3  -1  -3  -4  -2   0   2  -5  -4   2
  4  T   3  -4   1  -4  -3  -3  -3  -2  -4  -3  -3  -2   2  -4  -4   2   4  -5  -4   1
  5  I   4  -4  -4  -5  -3   0  -4  -2  -5   0   1   0  -2   0  -3  -1  -3  -5  -4   4
  6  P   4   0  -4  -4  -1  -3  -2  -3  -4   0  -4  -3  -3  -3   2   0   2  -5  -4   1
  7  S   2  -4  -3  -3  -4  -1   2   3  -1  -4  -2  -2  -2   0  -4   2  -3  -5  -4   1
  8  E   3  -4  -2  -4  -4  -2   3   0  -1   0  -1   0  -2   2   0  -2  -2   0  -3   0
  9  S   2   0  -2  -4  -4  -3   1   1  -1  -3   0  -2   2  -3   1   0   2  -5  -4  -1
 10  P   1  -4  -2  -2  -4  -3   0   0  -1   0   1  -2  -3  -4   4   1   2  -5  -4   1
 11  F   3  -4  -4  -3  -4  -2  -1  -3  -5  -3  -1  -2  -2   0   2   1   2   2  -4   2
 12  A   4  -2  -4  -4  -4   0  -1   3  -4  -3  -1  -1  -2  -3  -3   1  -1  -5  -2  -2
 13  A   3   0   2  -1  -4   2   0  -1  -4  -3  -4  -3  -2   0   2   2  -2  -5  -4   0
 14  A   4  -3  -3   0  -4   1  -3   0  -4  -4  -2   0  -2  -5   2   0   2  -5  -5  -2
 15  E   2  -2   1   0  -4  -2   3  -1  -4  -3  -3  -1   3  -5   3   0  -1  -5  -5  -1
 16  V   3  -4  -1   2  -4  -1  -1  -2  -4  -1  -2  -1  -3  -5  -4   0   3  -6  -5   1
 17  A   2   1  -2  -2   0   2  -1  -2   0   1  -1  -1  -1  -1   1   0  -1  -5  -2   2
Second block: observed percentages for each amino acid, followed by the two per-position values that end each row of the listing.
Pos AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1  D   0   0   0  33   0  30  36   0   0   0   0   0   0   0   0   0   0   0   0   0  1.09  0.67
  2  K   0   0   3   0   0   7   7   0   0   0   0   6   4   0  15   4  53   0   0   0  0.81  0.87
  3  A  50   0   0   0   0   0   8   0   0   5   0   4   0   0   2   3  12   0   0  17  0.67  1.07
  4  T  29   0   6   0   0   0   0   1   0   0   0   1   5   0   0  13  32   0   0  12  0.68  1.17
  5  I  34   0   0   0   0   5   0   2   0   3  13   5   0   4   1   3   0   0   0  31  0.56  1.19
  6  P  46   7   0   0   1   0   1   0   0   5   0   0   0   1  11   6  14   0   0   8  0.67  1.28
  7  S  18   0   0   0   0   2  15  25   1   0   4   1   1   4   0  17   0   0   0  12  0.47  1.33
  8  E  29   0   1   0   0   1  19   6   2   4   7   6   1   9   4   1   1   1   0   8  0.34  1.34
  9  S  21   4   1   0   0   0   8  13   1   0   9   2   5   1   6   7  15   0   0   4  0.29  1.37
 10  P  11   0   2   1   0   0   5   7   1   5  13   2   0   0  20  12  12   0   0   9  0.36  1.35
 11  F  25   0   0   1   0   1   3   2   0   0   6   1   1   5   9  11  15   3   0  16  0.42  1.41
 12  A  37   2   0   0   0   4   4  24   0   1   6   3   1   1   1  11   2   0   2   2  0.57  1.42
 13  A  22   5  10   2   0   9   6   4   0   1   1   0   1   4   9  15   2   0   0   9  0.32  1.40
 14  A  43   1   0   5   0   6   0   6   0   0   3   5   1   0   9   7  12   0   0   3  0.58  1.41
 15  E  21   1   6   4   0   1  19   4   0   1   2   4   8   0  16   5   3   0   0   5  0.43  1.42
 16  V  31   0   3  11   0   2   4   2   0   4   4   4   0   0   0   4  21   0   0  11  0.48  1.46
 17  A  17   9   1   2   2  10   3   2   2   9   4   3   1   2   6   8   3   0   1  13  0.18  1.40
Sequence similarity: profile similarity
52Scoring Fitness Score
- probability of amino acid a occurring together with solvent accessibility s
- probability of amino acid a occurring
- probability of solvent accessibility s occurring
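The three probabilities on this slide are the ingredients of a log-odds score. The slide's own formula did not survive, so the sketch below assumes the common form E_s(a, s) = -log( P(a, s) / (P(a) P(s)) ), with the minus sign chosen so that favorable pairings get low energy. The function name and the toy counts are hypothetical.

```python
import math
from collections import Counter


def fitness_scores(observations):
    """Log-odds fitness score for each (amino acid, environment) pair.

    observations: list of (amino_acid, environment) pairs counted from
    known structures. Assumed form: -log(P(a, s) / (P(a) * P(s))).
    """
    pair_counts = Counter(observations)
    total = sum(pair_counts.values())
    aa_counts, env_counts = Counter(), Counter()
    for (aa, env), n in pair_counts.items():
        aa_counts[aa] += n
        env_counts[env] += n

    scores = {}
    for (aa, env), n in pair_counts.items():
        p_joint = n / total
        p_aa = aa_counts[aa] / total
        p_env = env_counts[env] / total
        scores[(aa, env)] = -math.log(p_joint / (p_aa * p_env))
    return scores


# Toy data: leucine is mostly buried, serine is mostly exposed.
toy = ([("L", "buried")] * 3 + [("L", "exposed")]
       + [("S", "buried")] + [("S", "exposed")] * 3)
scores = fitness_scores(toy)
print(scores[("L", "buried")] < 0 < scores[("L", "exposed")])  # True
```

The same counting scheme would apply to the pairwise potential, with pairs of residues within the distance cutoff taking the place of (amino acid, environment) pairs.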
53Scoring Pairwise Potential
- probability of amino acids a and b occurring with distance < cutoff
- probability of amino acid a occurring
- probability of amino acid b occurring
54Scoring Secondary Structure
1. Difference between predicted secondary structure and template secondary structure
2. PSIPRED for secondary structure prediction
55Contact Graph
- Each residue as a vertex
- One edge between two residues if their spatial distance is within a given cutoff.
- Cores are the most conserved segments in the template
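The contact graph definition above translates directly into code. This sketch assumes one representative coordinate per residue (for example the C-alpha or C-beta atom) and a 7.0 Å default cutoff; both choices are illustrative assumptions, not values taken from the slides.

```python
import math
from itertools import combinations


def contact_graph(coords, cutoff=7.0):
    """Return the set of edges (i, j), i < j, between residues within cutoff.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    """
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            edges.add((i, j))
    return edges


# Four residues: three in a row 3.8 apart, plus one placed above residue 1.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.8, 0.0)]
print(sorted(contact_graph(coords, cutoff=4.0)))  # [(0, 1), (1, 2), (1, 3)]
```

Threading programs often drop edges between residues that are adjacent in sequence, since those are always close; that filter is omitted here to stay with the definition as stated.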
56Simplified Contact Graph
57Alignment Example
58Alignment Example
59Calculation of Alignment Score
60Hardness of Protein Threading
- Protein threading is NP-hard
- Reduction from Max-Cut
- Given a graph, number its nodes in a certain order. Assume there are M nodes.
- Consider a sequence of length 2M like this: PHPHPH...
- The pairwise score is 1 only if two different types of residues are mapped to the two ends of one graph edge, otherwise 0
61Threading Algorithms
- Approximation Algorithm
- Interaction-frozen algorithm (A. Godzik et al.)
- Monte Carlo sampling (S.H. Bryant et al.)
- Double dynamic programming (D. Jones et al.)
- Exact Algorithm
- Branch-and-bound (R.H. Lathrop and T.F. Smith)
- PROSPECT-I uses divide-and-conquer (Y. Xu et al.)
- Linear programming by RAPTOR (J. Xu et al.)
63Template Ranking
- Alignment score
- residue composition bias
- template length bias
- Z-score (S.H. Bryant et al., 1995)
- statistical test, cancels out bias
- time-consuming to calculate: sequences must be shuffled and threaded many times (100 times)
- Classification-based methods
- A threading pair is positive if the two proteins have similar structures
- Noise caused by bad alignments
- Regression-based method
- Predict alignment accuracy
- Rank templates based on predicted accuracy
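The Z-score procedure described on this slide (shuffle the query, re-thread, compare) can be sketched as follows. `score_fn` is a hypothetical stand-in for threading a sequence onto one fixed template and returning the alignment score; the toy scorer in the usage example simply counts matching positions and is not a real threading score.

```python
import random
import statistics


def z_score(query, score_fn, n_shuffles=100, seed=0):
    """Z-score of the query's score against scores of shuffled sequences.

    Shuffling preserves residue composition and length, so those two
    biases cancel out. score_fn(sequence) -> float is assumed to thread
    the sequence onto a fixed template.
    """
    rng = random.Random(seed)
    real_score = score_fn(query)
    residues = list(query)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        shuffled_scores.append(score_fn("".join(residues)))
    mean = statistics.mean(shuffled_scores)
    std = statistics.pstdev(shuffled_scores)
    if std == 0:
        return 0.0
    return (real_score - mean) / std


# Toy scorer: number of positions identical to a fixed "template" sequence.
template = "DRVYIHPF"
toy_score = lambda seq: sum(x == y for x, y in zip(seq, template))
print(z_score("DRVYIHPF", toy_score) > 0)  # True
```

With an energy-like score where lower is better, a good query-template pair gives a strongly negative Z-score instead. The 100 re-threadings per template are what make this ranking method slow.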
64Feature Extraction
- topology of a predicted structure
- sizes
- pairwise contacts
- independent domain?
- sequence-template alignment
- alignment scores
- sequence identity
- gaps
features
65CAFASP3 result
Servers with names in italics are meta servers (http://www.cs.bgu.ac.il/dfischer/CAFASP3, released in December 2002).
67Protein Side-Chain Packing
- Problem: given the backbone coordinates of a protein, predict the coordinates of the side-chain atoms
- Insight: a protein structure is a geometric object with special features
- Method: decompose a protein structure into some very small blocks
68Side-Chain Packing
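[Figure: candidate side-chain positions for neighboring residues, each labeled with a probability between 0.1 and 0.7; one pair of positions is marked as a clash]
Each residue has many possible side-chain positions. Each possible position is called a rotamer. Atomic clashes must be avoided.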
69Energy Function
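Assume rotamer A(i) is assigned to residue i. The side-chain packing quality is measured by an energy function with two kinds of terms:
- Occurring preference: the higher the occurring probability of a rotamer, the smaller the value
- Clash penalty: depends on the distance between two atoms relative to their atom radii
[Figure: clash penalty plotted against (distance between two atoms) / (atom radii), with axis ticks 10, 0.82, and 1]
Minimize the energy function to obtain the best side-chain packing.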
70Related Work
- NP-hard [Akutsu, 1997; Pierce et al., 2002] and NP-complete to achieve an approximation ratio O(N) [Chazelle et al., 2004]
- Dead-End Elimination: eliminate rotamers one by one
- SCWRL: biconnected decomposition of a protein structure [Dunbrack et al., 2003]
- One of the most popular side-chain packing programs
- Linear integer programming [Althaus et al., 2000; Eriksson et al., 2001; Kingsford et al., 2004]
- The formulation is similar to that used in RAPTOR
71Residue Interaction Graph
- Each residue as a vertex
- Two residues interact if there is a potential clash between their rotamer atoms
- Add one edge between two residues that interact.
[Figure: residue interaction graph with vertices a, b, c, d, e, f, h, i, j, k, l, m, s]
72Key Observations
- A residue interaction graph is a geometric neighborhood graph
- Each rotamer is bound to its backbone position within a constant distance
- There is no interaction edge between two residues if their distance is beyond D. D is a constant depending on rotamer diameter.
- Residue interaction graphs are sparse!
- Any two residue centers cannot be too close. Their distance is at least a constant C.
No previous algorithms exploit these features!
73Tree Decomposition [Robertson & Seymour, 1986]
Greedy minimum degree heuristic
- Choose the vertex with minimal degree
- The chosen vertex and its neighbors form a component
- Add an edge between every pair of neighbors of the chosen vertex
- Remove the chosen vertex
- Repeat the above steps until the graph is empty
74Tree Decomposition (Contd)
Tree Decomposition
Tree width is the maximal component size minus 1.
75Side-Chain Packing Algorithm
1. Bottom-to-top: calculate the minimal energy function
2. Top-to-bottom: extract the optimal assignment
3. Time complexity: exponential in tree width, linear in graph size
A tree decomposition rooted at Xr
The score of component Xi
The scores of subtree rooted at Xl
The score of subtree rooted at Xi
The scores of subtree rooted at Xj
76Theoretical Treewidth Bounds
- For a general graph, it is NP-hard to determine its optimal treewidth.
- The residue interaction graph has a treewidth upper bound
- A decomposition within this bound can be found by a low-degree polynomial-time algorithm, based on the Sphere Separator Theorem [G.L. Miller et al., 1997], a generalization of the Planar Separator Theorem
- The residue interaction graph has a treewidth lower bound
- The residue interaction graph is a cube
- Each residue is a grid point
Method: tree decomposition of the protein structure to take advantage of its geometric characteristics.
[Figure: CPU time (seconds)]
- Five times faster on average, tested on 180 proteins used by SCWRL
- Same prediction accuracy as SCWRL 3.0
Theoretical time complexity ltlt is
the average number rotamers for each residue.
78Result (2)
An optimization problem admits a PTAS if, given an error ε (0 < ε < 1), there is a polynomial-time algorithm that obtains a solution within a factor of (1 + ε) of the optimal.
- Side-chain packing has a PTAS if one of the following conditions is satisfied
- All the energy items are non-positive
- All the pairwise energy items have the same sign, and the lowest system energy is away from 0 by a certain amount
Chazelle et al. have proved that it is
NP-complete to approximate this problem within a
factor of O(N), without considering the geometric
characteristics of a protein structure.
79Outline
- Introduction to Protein Structures
- Introduction to Protein Structure Prediction
- Ab initio folding
- Protein threading
- Homology modeling
- CASP structure prediction competition
- My Work (shameless promotion)
- RAPTOR protein threading
- SVM approach to fold recognition
- TreePack protein side-chain packing
- Prediction examples
80CASP6 Example
T0267 (PDB id 1wk4)
T0268_1 (PDB id 1wg8)
MaxSub score: 0.9
MaxSub score: 0.6
Sequence identity: 19%
Sequence identity: 50%
81CASP6 Example
T0224 (PDB id 1rhx or 1x9a)
T0228_1 (PDB id 1vlp)
MaxSub score: 0.5
MaxSub score: 0.3
Sequence identity: 8%
Sequence identity: 15%
82CASP6 Example
T0238 (PDB id 1w33)
T0242 (PDB id 2blk)
MaxSub score: 0.2
MaxSub score: 0.17
Sequence identity: 9%
Sequence identity: 10%
83Summary
- Protein Structures
- Primary structure, secondary structure, tertiary
structure - Protein classification
- Experimental structure determination
- Protein Structure Prediction
- Ab initio folding
- Protein threading
- Homology modeling
- CASP structure prediction competition
- My Work
- RAPTOR protein threading
- SVM approach to fold recognition
- TreePack protein side-chain packing
84Reading List
- CASP1, CASP2, CASP3, CASP4, CASP5 and CASP6 Special Issues, Proteins: Structure, Function and Genetics, 1995, 1997, 1999, 2001, 2003, 2005
- Jinbo Xu. Rapid Protein Side-Chain Packing via Tree Decomposition. RECOMB 2005.
- Email to j3xu_at_tti-c.org
85Acknowledgements
- Bonnie Berger, Introduction to Computational Molecular Biology, course notes, 2001
- Bin Ma, Bioinformatics, course notes, 2004
86Growth of Protein Sequences and Structures
Data from http://www.dna.affrc.go.jp
88Linear Integer Program
- Linear programs can be solved in polynomial time
- No polynomial-time algorithm is known for integer programs so far
- Relax to a linear program and solve the linear version
- Branch-and-bound or branch-and-cut (may cost exponential time)
89Linear Integer Program
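maximize z = 6x + 5y (linear function)
subject to
3x + y <= 11
-x + 2y <= 5
x, y >= 0 (linear constraints)
x, y integer (integral constraints, nonlinear)
The objective with the linear constraints alone is a linear program; adding the integral constraints makes it an integer program.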
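The example on this slide is small enough to solve by enumeration, which makes the gap between the linear relaxation and the integer program visible. This is a brute-force sketch for illustration only, not how an LP or IP solver works; the function name is hypothetical.

```python
from fractions import Fraction


def solve_integer_program(limit=11):
    """Enumerate integer points; 3x + y <= 11 with x, y >= 0 bounds both by 11."""
    best = None
    for x in range(limit + 1):
        for y in range(limit + 1):
            if 3 * x + y <= 11 and -x + 2 * y <= 5:
                z = 6 * x + 5 * y
                if best is None or z > best[0]:
                    best = (z, x, y)
    return best


# Linear relaxation: the optimum sits where both constraints are tight,
# 3x + y = 11 and -x + 2y = 5, which gives x = 17/7, y = 26/7.
x_lp, y_lp = Fraction(17, 7), Fraction(26, 7)
assert 3 * x_lp + y_lp == 11 and -x_lp + 2 * y_lp == 5
z_lp = 6 * x_lp + 5 * y_lp

print(float(z_lp))              # about 33.14, at a fractional point
print(solve_integer_program())  # (28, 3, 2)
```

The relaxed optimum is fractional, so rounding or branch-and-bound is needed to reach the integer optimum z = 28 at x = 3, y = 2.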
90Variables
- x(i,l) denotes that core i is aligned to sequence position l
- y(i,l,j,k) denotes that core i is aligned to position l and core j is aligned to position k at the same time.
91LP Formulation
a: singleton score parameter; b: pairwise score parameter
Each y variable is 1 if and only if its two x variables are 1
Each core has only one alignment position
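The slide's formula did not survive, so the following is a reconstruction sketch from the variable definitions and the two stated conditions, not necessarily the exact formulation shown. The index set P of interacting core pairs is an assumed symbol, and the constraints that keep cores in sequence order are omitted.

```latex
\begin{aligned}
\min \quad & \sum_{i} \sum_{l} a_{i,l}\, x(i,l)
  \;+\; \sum_{(i,j) \in P} \sum_{l} \sum_{k} b_{i,l,j,k}\, y(i,l,j,k) \\
\text{s.t.} \quad
  & \sum_{l} x(i,l) = 1 && \text{for every core } i \\
  & \sum_{k} y(i,l,j,k) = x(i,l) && \text{for every } (i,j) \in P \text{ and position } l \\
  & \sum_{l} y(i,l,j,k) = x(j,k) && \text{for every } (i,j) \in P \text{ and position } k \\
  & x(i,l),\; y(i,l,j,k) \in \{0, 1\}
\end{aligned}
```

The first constraint gives each core exactly one alignment position. The two marginal constraints together force y(i,l,j,k) = 1 exactly when x(i,l) = 1 and x(j,k) = 1, without needing a nonlinear product of x variables.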
92Protein Threading
- Make a structure prediction by finding an optimal alignment (placement) of a protein sequence onto each known structure (structural template)
- alignment quality is measured by some statistics-based scoring function
- the best overall alignment among all templates may give a structure prediction
93Threading Model
- Each template is parsed as a chain of cores. Two adjacent cores are connected by a loop. Cores are the most conserved segments in a protein.
- No gap is allowed within a core.
- Only the pairwise contacts between two core residues are considered, because contacts involving loop residues are not well conserved.
- Global alignment is employed
94CASP5/CAFASP3 targets
[Figure: CASP5/CAFASP3 targets ordered by prediction difficulty, from easy to hard]
CM: Comparative Modelling, HM: Homology Modelling, FR: Fold Recognition, NF: New Fold
95RAPTOR's sensitivity on FR targets
1. RAPTOR is weak at recognizing FR(A) targets (needs improvement)
2. RAPTOR could not deal with NF targets at all (normal)
96Support Vector Machine (SVM) Regression (A.J. Smola et al.)
Notation
Data
Linear regression
If the relationship between f and x is not linear
SVM regression: linear regression in a high-dimensional space
(1)
Condition