Title: In Silico Protein Structure Prediction
1 In Silico Protein Structure Prediction
New data creates new opportunity
New drug / biological data
New sequence
New structure
Novel drug development
2 The promise of genome projects
Human Genome Watch, 05 Feb. 2003: Draft 3.0%, Finished 95.8%, Total 98.8%
http://www.ncbi.nlm.nih.gov/genome/seq/
3 The promise of genome projects
The genomes of some 800 organisms have now been
either completely or partially sequenced, and
this number will double within the next year.
4 The promise of genome projects
- Diagnosing and predicting disease and disease susceptibility
- Disease intervention
- Solutions at the DNA level
- Gene therapy: although promising, it is not straightforward
5 Proteins: The workhorses of living organisms
- Proteins are the primary components of the networks that conduct the flows of mass, energy and information.
- Protein function
- Biological (phenotypic, cellular)
- Biochemical (molecular)
- The amino acid sequence and the 3D structure confer unique functionality.
6 Disease prevention and management at the protein level
- Diseases can be linked to the aberrant activity of proteins (enzymes / receptors).
- Pharmaceutical research is based on the search for molecules that will interact in a desirable manner with the therapeutic target.
7 Drug Discovery and Development
8 Protein Sequence-Structure-Function
sequence: ALA - PHE - CYS - LYS - GLU - GLN - PRO - MET - TRP - TYR - GLY - ARG
structure
function:
- Reaction / substrate
- Interactions
- Metabolic pathway
9 Protein Structure Initiative (PSI)
- The Structural Genomics Project aims to determine the 3D structure of all proteins. This aim can be achieved in four steps:
- Organize known protein sequences into families.
- Select family representatives as targets.
- Solve the 3D structure of targets by X-ray crystallography or NMR spectroscopy.
- Build models for other proteins by homology to solved 3D structures.
- www.structuralgenomics.org
10 Protein Structure Initiative (PSI)
- Unfortunately, only a few hundred protein structures can be determined experimentally every year.
- There are possibly hundreds of thousands of distinct proteins in humans alone.
- There is a need for computational/bioinformatics strategies to accelerate protein structure prediction.
11 Computer simulations to fold proteins
- Under physiological conditions, biomolecules undergo several movements and changes.
- The time scales of the motions are diverse, ranging from a few femtoseconds to a few seconds.
12 Computer simulations to fold proteins
- Newton's second law of motion
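Molecular dynamics advances atomic positions by numerically integrating Newton's second law, F = m a. The sketch below is a minimal illustration, not the simulation code behind these slides: it uses the velocity Verlet scheme on a single particle in one dimension, and the harmonic force, unit mass, and time step are all toy assumptions.

```python
import math

def harmonic_force(x, k=1.0, x0=0.0):
    """Toy force field (assumption): F = -dU/dx for U = 0.5 * k * (x - x0)**2."""
    return -k * (x - x0)

def velocity_verlet(x, v, mass, force, dt, n_steps):
    """Integrate m * a = F(x) for one particle in 1D with velocity Verlet."""
    a = force(x) / mass
    trajectory = [x]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(x) / mass              # acceleration at the new position
        v = v + 0.5 * (a + a_new) * dt       # velocity update with averaged acceleration
        a = a_new
        trajectory.append(x)
    return trajectory, v

# Start at x = 1 with zero velocity; the exact solution is x(t) = cos(t).
traj, v_final = velocity_verlet(x=1.0, v=0.0, mass=1.0,
                                force=harmonic_force, dt=0.01, n_steps=1000)
total_energy = 0.5 * v_final ** 2 + 0.5 * traj[-1] ** 2
```

For a real protein the same update is applied to every atom in three dimensions with femtosecond time steps, which is why reaching the slowest motions is so expensive.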
13 We can fold small peptides using Molecular Dynamics (MD)
14 We cannot fold proteins using MD
PDB structure of CDK-2
20 ns simulation result
15 Sequence-Structure Theoretical Relationship
Folding time scales. Physicochemical principles.
Simulations
ALA - PHE - CYS - LYS - GLU - GLN - PRO - MET-
TRP - TYR - GLY - ARG
Evolutionary time scales. Knowledge-based.
Structural bioinformatics
16 Protein Structure Prediction Using Bioinformatics
Secondary structure prediction is accurate
ALA - PHE - CYS - LYS - GLU - GLN - PRO - MET-
TRP - TYR - GLY - ARG
17 Protein Structure Prediction Using Bioinformatics
It is not straightforward to fold the secondary structural elements in 3D space
18 Protein Structure Prediction Using Bioinformatics
Can we predict pairs of amino acids that are distant in sequence but proximal in structure?
19 Correlated Mutations
- Goal: Manipulate multiple sequence alignments of protein families to identify residues that are close in 3D space.
- Hypothesis: During evolution, residues that are in proximity in 3D space mutate in a covariant fashion, so as to retain the structural and functional properties of the protein.
20 Covariant Mutations
Seq.  57  132
1     MENFQAVEKI..GEGTYGVWY
2     KERNKATGEV..VALKKIRWM
3     TETEGAPSTA..IREISAFWR
4     MEGEGAYRNE..VVATAIIWA
5     MENFIALDPV..PSTAIREWI
6     REPSTFIREI..SFALPRFHI
7     MENGHFTNKH..FCDIGEGHI
8     MEALKFVRLT..ETRCVGPHT
21 Measure of Covariance
- Correlation coefficient between positions i and j,
- where q_il and q_jl are the values of some amino acid physicochemical property (volume, hydrophobicity, etc.) for sequence l at positions i and j; m_i, m_j, s_i and s_j are the corresponding mean values and standard deviations.
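The formula itself did not survive in this transcript. Given the symbols defined on this slide, a plausible reconstruction is the standard Pearson correlation over the N sequences of the alignment. The symbol r_ij, the 1/N normalization, and the use of population standard deviations are assumptions here, not taken from the slide:

```latex
r_{ij} = \frac{1}{N} \sum_{l=1}^{N} \frac{(q_{il} - m_i)\,(q_{jl} - m_j)}{s_i \, s_j}
```

With this form, r_ij is +1 or -1 when the property values at the two positions vary in perfect lockstep across the family, and close to 0 when they vary independently.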
22 Covariant Mutations
Seq.  57  132
1     MENFQAVEKI..GEGTYGVWY
2     KERNKATGEV..VALKKIRWM
3     TETEGAPSTA..IREISAFWR
4     MEGEGAYRNE..VVATAIIWA
5     MENFIALDPV..PSTAIREWI
6     REPSTFIREI..SFALPRFHI
7     MENGHFTNKH..FCDIGEGHI
8     MEALKFVRLT..ETRCVGPHT

Seq.  57     132
1     52.60  135.4
2     52.60  135.4
3     52.60  135.4
4     52.60  135.4
5     52.60  135.4
6     113.9  91.90
7     113.9  91.90
8     113.9  91.90

rc(57,132) = 1.0
23 Covariant Mutations
Seq.  57  132
1     MENFQAVEKI..GEGTYGVWY
2     KERNKATGEV..VALKKIRWM
3     TETEGAPSTA..IREISAFWR
4     MEGEGAYRNE..VVATAIIWA
5     MENFIALDPV..PSTAIREWI
6     REPSTFIREI..SFALPRFHI
7     MENGHFTNKH..FCDIGEGHI
8     MEALKFVRLT..ETRCVGPHT
9     TETEGFPSTA..IREISAFTR

Seq.  57     132
1     52.60  135.4
2     52.60  135.4
3     52.60  135.4
4     52.60  135.4
5     52.60  135.4
6     113.9  91.90
7     113.9  91.90
8     113.9  91.90
9     113.9  71.20

rc(57,132) = 0.97
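The two coefficients quoted on these slides can be checked with a few lines of code. The sketch below assumes the measure is the ordinary Pearson correlation (the original formula is not in the transcript) and uses the property values listed for positions 57 and 132; the function and variable names are illustrative. Under that assumption the coefficient comes out negative, because the value at position 132 drops when the value at position 57 rises, so the slide values match the magnitude.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (population statistics)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return cov / (sx * sy)

# Property values at positions 57 and 132 for sequences 1-8 (from the slide).
pos57_8 = [52.60] * 5 + [113.9] * 3
pos132_8 = [135.4] * 5 + [91.90] * 3

# Adding sequence 9 breaks the perfect pattern at position 132.
pos57_9 = pos57_8 + [113.9]
pos132_9 = pos132_8 + [71.20]

r8 = pearson(pos57_8, pos132_8)
r9 = pearson(pos57_9, pos132_9)
print(round(abs(r8), 2), round(abs(r9), 2))  # prints: 1.0 0.97
```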
24 Physicochemical Properties
- AAindex: collection of published amino acid properties.
- Kawashima S. et al. Nucleic Acids Research, 27, 368, 1999
25 Physicochemical Descriptors
- 142 descriptors used. Redundancy eliminated by calculating principal components.
- 12 principal components explain 97% of the variance. PCA_1 is associated with hydrophobicity, PCA_2 represents a measure of size, PCA_3 is related to AA electronic properties, etc.
- Four different measures of distance: Cα-Cα, Cβ-Cβ, minimum distance, COM-COM. Proximity if < 6 Å.
- Which correlation coefficient best predicts proximity?
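The four distance measures can give different answers for the same residue pair. The sketch below is a minimal illustration under stated assumptions: each residue is a hypothetical dict of atom name to (x, y, z) coordinates in Å, the center of mass is approximated by the unweighted centroid of the listed atoms, and the toy coordinates are invented for the example.

```python
import math

CUTOFF = 6.0  # Å, proximity threshold quoted on the slide

def centroid(points):
    """Unweighted centroid, used here as a stand-in for the center of mass."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def residue_distances(res_a, res_b):
    """Return the four inter-residue distances for two atom-name -> xyz dicts."""
    atoms_a = list(res_a.values())
    atoms_b = list(res_b.values())
    return {
        "CA-CA": math.dist(res_a["CA"], res_b["CA"]),
        "CB-CB": math.dist(res_a["CB"], res_b["CB"]),
        "min": min(math.dist(p, q) for p in atoms_a for q in atoms_b),
        "COM-COM": math.dist(centroid(atoms_a), centroid(atoms_b)),
    }

# Toy residues (hypothetical coordinates): side chains point toward each other.
res_a = {"CA": (0.0, 0.0, 0.0), "CB": (1.5, 0.0, 0.0)}
res_b = {"CA": (8.0, 0.0, 0.0), "CB": (6.5, 0.0, 0.0)}

distances = residue_distances(res_a, res_b)
proximal = {name: d < CUTOFF for name, d in distances.items()}
```

In this toy case the pair counts as proximal by the Cβ-Cβ and minimum-distance measures but not by Cα-Cα or COM-COM, which is why the choice of measure matters when scoring predictions.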
26 Validation Model System: CDK-2
- Cyclin-dependent kinase 2
- CDKs are the switches that regulate the cell cycle.
- CDKs control gene transcription and coordinate proliferation. They are important therapeutic targets.
27 CDK-2 Multiple Sequence Alignment
28 Correlation Coefficient as Diagnostic Test
29 Accuracy of Diagnostic Tests
Accuracy = TP / (TP + FP)
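Treating the correlation coefficient as a diagnostic test means choosing a threshold, calling every residue pair above it "proximal", and counting how many of those calls are correct. The sketch below is one way to compute the accuracy defined on this slide (a quantity also known as precision or positive predictive value). The scores, the contact set, and the threshold are hypothetical toy values, not results from the talk.

```python
def contact_prediction_accuracy(scores, true_contacts, threshold):
    """Accuracy = TP / (TP + FP) for pairs whose |r| is at or above the threshold.

    scores: dict mapping a residue pair (i, j) to its correlation coefficient.
    true_contacts: set of pairs that are proximal in the known structure.
    Returns None when no pair is predicted.
    """
    predicted = {pair for pair, r in scores.items() if abs(r) >= threshold}
    tp = len(predicted & true_contacts)   # predicted and truly proximal
    fp = len(predicted - true_contacts)   # predicted but not proximal
    if tp + fp == 0:
        return None
    return tp / (tp + fp)

# Hypothetical toy data for illustration only.
scores = {(57, 132): 0.97, (10, 80): 0.91, (22, 45): 0.85, (5, 60): 0.40}
true_contacts = {(57, 132), (22, 45), (5, 60)}

accuracy = contact_prediction_accuracy(scores, true_contacts, threshold=0.8)
```

Here three pairs pass the threshold and two of them are true contacts, so the accuracy is 2/3. Raising the threshold typically trades fewer predictions for a higher fraction of correct ones.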
34 CMA-based constraints for MD
Start with a stretched protein conformation
35 CMA-based constraints for MD
Predict secondary structural elements
Add spring forces between pairs predicted to be proximal
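A spring force between two residues predicted to be proximal can be sketched as a distance restraint added to the force field. The slides do not give the functional form, so everything below is an assumption: a flat-bottom harmonic restraint that is zero inside a target distance d0 and pulls the pair together beyond it, with an illustrative force constant k.

```python
import math

def spring_restraint(pos_i, pos_j, k=1.0, d0=6.0):
    """Flat-bottom harmonic restraint between two points (assumed form).

    Energy is 0.5 * k * (d - d0)**2 when the distance d exceeds d0, else 0.
    Returns (energy, force_on_i, force_on_j); the forces are equal and opposite.
    """
    delta = [b - a for a, b in zip(pos_i, pos_j)]   # vector from i to j
    d = math.sqrt(sum(c * c for c in delta))
    if d <= d0:
        zero = (0.0, 0.0, 0.0)
        return 0.0, zero, zero
    stretch = d - d0
    energy = 0.5 * k * stretch ** 2
    unit = [c / d for c in delta]
    force_i = tuple(k * stretch * u for u in unit)  # pulls i toward j
    force_j = tuple(-f for f in force_i)            # pulls j toward i
    return energy, force_i, force_j

# Two residues 10 Å apart in a stretched chain, restrained toward 6 Å.
energy, f_i, f_j = spring_restraint((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), k=2.0, d0=6.0)
```

In an MD run these forces would simply be added to the physical forces on the two atoms at every step, so a false-positive pair pulls together residues that should stay apart.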
36 CMA-based constraints for MD
37 Fold CDK-2 with 10 constraints
- With 6 TP and 4 FP: RMSD 14 Å, no secondary structure
- With 10 true positives: RMSD 5 Å, no secondary structure
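The RMSD values above measure how far the simulated fold is from the reference structure. As a reminder of what the number means, here is a minimal sketch of the coordinate RMSD between two equally sized sets of atoms. It assumes the two structures have already been superimposed; the optimal superposition step (for example the Kabsch algorithm) is deliberately left out, and the toy coordinates are invented.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-superimposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Toy example: every atom of the model is displaced by 3 Å from the reference.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 3.0, 0.0), (1.0, 3.0, 0.0)]
deviation = rmsd(reference, model)
```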
38 Summary
- We need protein structures to harness the promise of genome projects.
- Further, we need computational tools to accelerate protein structure determination.
- Use bioinformatics to guide computer simulations.
- Proximal residues do mutate in a covariant fashion.
39 Next steps
- Improve the predictive ability of correlated mutation analysis (CMA).
- Systematically study the influence of constraints on folding.
- Develop additional methods for non-local contact prediction (free-energy-based techniques).
40 Acknowledgements
Spyros Vicatos (CEMS)
Himanshu Khandelia (CEMS)
Eric Fauman (Pfizer)
Sangtae Kim (Eli Lilly)
Biotechnology Institute
Digital Technology Center
Minnesota Supercomputing Institute