Title: Protein structure prediction
1 Protein structure prediction
Einat Granot Liron Atedgi
2 Protein folding
- Protein folding is determined by the amino acid (AA) sequence
- Why is knowing the fold important?
- Determine the protein's function
- Find distant evolutionary relationships
- Design drugs
3 Protein structures
- Primary structure
- Secondary structure
- Tertiary structure
4 Two prediction methods
- PSI-PRED: secondary structure prediction based on PSI-BLAST
- GenTHREADER: tertiary structure prediction
Both were developed by the group of David T. Jones, University of Warwick
5 Methods: general format
- Sequence
- Alignment
- Additional data
- Neural networks
- Structure prediction
6 Neural networks
7 Neural networks
[Figure: numerical inputs feed into units, which produce the output]
Why do we call it a neural network?
Every unit computes a weighted sum of its inputs
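As a minimal sketch of the weighted calculation inside one unit: each numerical input is multiplied by a weight, the products are summed with a bias, and the sum is squashed into the range 0 to 1. The logistic squashing function and the names `unit_output`, `weights` and `bias` are assumptions made for illustration; the slides do not specify them.

```python
import math

def unit_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, passed through a logistic function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    total = bias + sum(x * w for x, w in zip(inputs, weights))
    # Assumed squashing function: maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-total))
```

A network is built by feeding the outputs of such units into further units.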
8 Neural network: hidden layer
As more hidden layers are added, the mean squared error becomes lower
[Figure: network with a hidden layer]
9 Neural networks: training
- Network connection weights are determined by a training process
- Training is performed with samples of input and expected output
- The learning algorithm is called backpropagation
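The training process can be sketched on a toy problem. The code below is a hedged illustration, not the networks used by PSI-PRED or GenTHREADER: it trains a tiny network with one hidden layer on the logical OR function by backpropagation. The layer sizes, learning rate, epoch count and all function names are assumptions chosen for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Return the hidden activations and the network output for input x."""
    # Each weight row ends with a bias weight, hence the appended 1.0.
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output

def mean_squared_error(samples, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in samples) / len(samples)

def train(samples, n_hidden=2, epochs=2000, rate=0.5, seed=0):
    """Train by backpropagation; return weights and the error before and after."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    before = mean_squared_error(samples, w_hidden, w_out)
    for _ in range(epochs):
        for x, target in samples:
            hidden, output = forward(x, w_hidden, w_out)
            # Error signal at the output unit.
            delta_out = (output - target) * output * (1.0 - output)
            # Propagate the error signal back to each hidden unit.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            # Gradient-descent weight updates.
            for j, h in enumerate(hidden + [1.0]):
                w_out[j] -= rate * delta_out * h
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] -= rate * delta_hidden[j] * v
    after = mean_squared_error(samples, w_hidden, w_out)
    return w_hidden, w_out, before, after

# Toy samples of (input, expected output): logical OR.
OR_SAMPLES = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
```

Calling `train(OR_SAMPLES)` returns the trained weights together with the mean squared error before and after training, so the effect of training can be inspected directly.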
10 Network training and testing
- After training, we perform testing
- Training and testing groups must be chosen very carefully
- What problems can arise?
- Insufficient training or testing
- Testing group may be biased
11 A neural network is a black box
- The specific algorithm of a working neural network is not known
- It is hard to deduce new biological principles about the solved problem
12 PSI-PRED: secondary structure prediction
13 Secondary structure prediction
- DSSP defines 8 secondary structure categories
- In PSI-PRED these are merged into 3: Strand (E), Helix (H) and Coil (C)
AA   RLMPHIKRSAIPVNHGQCRWEDNVDERTNCMIQYVLIMRD
Pred CCCCCHHHCCCCCCEEEEEECCCCCCHHHHEEEEEECCCC
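The merge from 8 categories to 3 can be sketched as a lookup table. Several reductions are in use and the slides do not say which one PSI-PRED applies, so the mapping below (H, G, I to helix; E, B to strand; everything else to coil) is an assumption, as are the names `DSSP_TO_3` and `reduce_dssp`.

```python
# Assumed reduction: H, G, I -> H (helix); E, B -> E (strand); all others -> C (coil).
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

def reduce_dssp(states):
    """Collapse a DSSP 8-state string into the 3 states H, E and C."""
    return "".join(DSSP_TO_3.get(s, "C") for s in states)
```

Any symbol not listed in the table, including turns, bends and unassigned positions, falls through to coil.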
14 PSI-PRED
- Sequence alignment (find homologs)
- Create protein profile
- Feed into the first neural network
- Feed into the second neural network
- Final prediction
15 Sequence alignment
- Find homologs of the target protein using PSI-BLAST
- Reminder: what is PSI-BLAST?
- Position-Specific Iterated BLAST, which outputs a PSSM (position-specific scoring matrix)
16 PSI-BLAST pros and cons
- Pros
- Sensitive to distant homologs
- Reliable
- Accessible from every workstation
- Cons
- Sensitive to distant homologs
- Results might be biased
- Sensitive to repetitive sequences
17 Solving PSI-BLAST problems
- A special DB of 340,000 sequences was constructed for PSI-PRED
- This DB contains only unique, non-repetitive sequences
19 Create protein profile
- PSI-PRED uses the PSSM produced by PSI-BLAST after 3 iterations
- This matrix is processed by a transformation f(x), so that the final values are between 0 and 1
20 PSSM: output of PSI-BLAST
Transformation
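The slides name the transformation only as f(x). As a hedged sketch, the code below assumes a logistic function, which maps any score into the range 0 to 1; the names `logistic` and `scale_pssm` are illustrative.

```python
import math

def logistic(x):
    """Assumed form of f(x): squashes any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def scale_pssm(pssm):
    """Apply the transformation to every score of an M x 20 matrix."""
    return [[logistic(score) for score in row] for row in pssm]
```

With this choice a score of 0 maps to 0.5, strongly positive scores approach 1 and strongly negative scores approach 0.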
21 Create protein profile
- The matrix size is M x 20, where M is the sequence length
- An additional column is added that marks the N/C terminus -> M x 21 matrix
23 Network training and testing
- 187 proteins were selected according to CATH and PSI-BLAST
- CATH filters proteins according to their folding domain configuration (T-level)
- This is considered a strict selection
24 First neural network
Each time, a window of 15 AA is fed into the first network
E H C
0.2 0.5 0.8
0.1 0.2 0.9
0.8 0.4 0.9
0.2 0.5 0.3
0.7 0.8 0.3
0.2 0.1 0.5
0.3 0.8 0.4
0.9 0.8 0.1
0.2 0.2 0.6
0.7 0.3 0.6
0.3 0.6 0.8
0.3 0.1 0.8
0.7 0.4 0.4
0.6 0.6 0.1
0.4 0.3 0.6
The output is a 15 x 3 matrix
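The M x 21 profile and the 15-residue window can be sketched together. The code below is an illustration under stated assumptions: positions of the window that fall beyond either end of the sequence are filled with zeros and flagged with 1 in the extra terminus column. This padding convention and the name `windows` are assumptions, not details given in the slides.

```python
def windows(profile, size=15):
    """Yield one size x 21 window per residue of an M x 20 profile."""
    half = size // 2
    n_cols = len(profile[0])
    # Assumed padding row: zero scores, terminus flag set to 1.0.
    padding = [0.0] * n_cols + [1.0]
    for centre in range(len(profile)):
        window = []
        for pos in range(centre - half, centre + half + 1):
            if 0 <= pos < len(profile):
                # Real residue: its 20 scaled scores, terminus flag 0.0.
                window.append(list(profile[pos]) + [0.0])
            else:
                window.append(list(padding))
        yield window
```

Each window holds 15 x 21 = 315 numbers, which would be the numerical inputs of the first network under these assumptions.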
26 Second neural network
The input to the 2nd network is the output of the 1st one
N/C E H C
0 0.2 0.5 0.8
0 0.1 0.2 0.9
0 0.8 0.4 0.9
0 0.2 0.5 0.3
0 0.7 0.8 0.3
0 0.2 0.1 0.5
0 0.3 0.8 0.4
0 0.9 0.8 0.1
0 0.2 0.2 0.6
0 0.7 0.3 0.6
0 0.3 0.6 0.8
0 0.3 0.1 0.8
0 0.7 0.4 0.4
0 0.6 0.6 0.1
1 0.4 0.3 0.6
E H C
0.1 0.5 0.9
0.2 0.1 0.9
0.3 0.4 0.8
0.2 0.7 0.1
0.5 0.8 0.2
0.2 0.7 0.3
0.2 0.9 0.3
0.6 0.8 0.3
0.1 0.3 0.7
0.4 0.2 0.8
0.3 0.5 0.9
0.3 0.2 0.8
0.9 0.2 0.1
0.8 0.3 0.1
0.8 0.3 0.4
Again, an extra column is added to the input to indicate the N/C terminus
27 Why do we need a second network?
- Let's examine a possible prediction from the 1st network
- What is the problem with this prediction?
Seq  VLFLNDNLDDVVIGRPKRTYTAITL
Pred EEEECCCCHHHCCCHCCCEEEECC
A helix consisting of a single AA does not exist
The 2nd network maintains coherence between adjacent AA and improves the accuracy
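The problem in the example prediction can be located with a few lines of code. This sketch only detects implausibly short segments; it is not what the 2nd network does, which learns the correction from data. The function name `short_segments` and the `min_length` parameter are illustrative.

```python
from itertools import groupby

def short_segments(prediction, state="H", min_length=2):
    """Return (start, length) for every run of `state` shorter than min_length."""
    found = []
    position = 0
    for symbol, run in groupby(prediction):
        length = len(list(run))
        if symbol == state and length < min_length:
            found.append((position, length))
        position += length
    return found

# The prediction from the slide contains one isolated helix residue.
print(short_segments("EEEECCCCHHHCCCHCCCEEEECC"))  # prints [(14, 1)]
```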
29 Final prediction
[Figure: image of the prediction, showing the degree of confidence, the target sequence and the secondary structure]
30 PSI-PRED evaluation
- CASP: Critical Assessment of Techniques for Protein Structure Prediction experiments
- At CASP3, PSI-PRED achieved the best results of all participating methods
31 PSI-PRED evaluation
Average Q3: PSI-PRED 76.3%, JPRED 72.4%, DSC 67.3%
Q3 score: percentage of AA predicted correctly
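The Q3 definition translates directly into code. A minimal sketch, assuming the predicted and observed structures are given as equal-length strings over H, E and C; the function name `q3_score` is illustrative.

```python
def q3_score(predicted, observed):
    """Percentage of residues whose 3-state prediction matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For example, a prediction that matches 5 of 6 residues scores about 83.3.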
32 Reasons for success
- The use of PSI-BLAST
- More sensitive (iterative algorithm)
- More accurate (pairwise local alignments)
- Use of neural networks
- Strict selection for training and testing
33 Possible improvements
- Larger databases (training and alignment)
- Combination with other methods (JPRED)
- Predict more than 3 secondary structure categories
34 Bring out the food
35 GenTHREADER: tertiary structure prediction
36 Threading methods
- Try to thread a target AA sequence onto a template 3D structure
[Figure: the residues N S Q M V D L I R E R A Q T V L C N K placed on a template structure]
37 Template collection
- The target sequence is compared against a collection of sequences with known folds
- The collection was taken from the Brookhaven Protein Data Bank and includes unique sequences
38 GenTHREADER
- Sequence alignment
- Calculate threading potential
- Feed into neural network
- Final prediction
39 Sequence alignment
- The target sequence is aligned against each of the templates twice
- Target profile against template sequence
- Target sequence against template profile
- The best result is taken
40 Creating a profile
- Steps for creating a profile
- Alignment against the OWL DB (a DB of coding sequences)
- Selection of sequences with an E-value lower than 0.01
- Construction of a profile using BLOSUM50
41 Creating a profile
A L M P H I K R S A I P V N H G Y V I M Q C R W E D N S T K V
43 Calculate threading potential
- The threading potential includes
- pairwise potential
- solvation potential
44 Pairwise potential
- Potential for the interaction between two AA
- Based on analysis of known structures and favorable energy configurations
- A lower pairwise potential indicates a more favorable state
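The slides do not give a formula for the pairwise potential. One common way to derive such knowledge-based potentials from known structures is the inverse Boltzmann relation, sketched below as an assumption: a contact seen more often than expected receives a lower (more favorable) value. The function name, the frequency arguments and the `kT` scaling are all illustrative and not taken from GenTHREADER.

```python
import math

def pairwise_potential(observed, expected, kT=1.0):
    """Energy-like score from observed and expected contact frequencies.

    Negative when the contact occurs more often than expected,
    positive when it occurs less often.
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("frequencies must be positive")
    return -kT * math.log(observed / expected)
```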
45 Solvation potential
- Calculated per AA and proportional to its degree of burial
- Degree of burial (DOB): the number of other AA located within a radius of 10 Å
- Hydrophobic amino acids: a high DOB is preferred
- Hydrophilic amino acids: a low DOB is preferred
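The degree of burial can be sketched as a neighbor count. The code assumes each AA is represented by a single 3D point (for example one atom per residue), which the slides do not specify; the function name `degree_of_burial` is illustrative.

```python
import math

def degree_of_burial(coords, radius=10.0):
    """For each residue, count the other residues within `radius` angstroms."""
    counts = []
    for i, a in enumerate(coords):
        neighbours = sum(
            1 for j, b in enumerate(coords)
            if i != j and math.dist(a, b) <= radius
        )
        counts.append(neighbours)
    return counts

# Three residues close together and one far away.
print(degree_of_burial([(0, 0, 0), (3, 0, 0), (6, 0, 0), (30, 0, 0)]))  # prints [2, 2, 2, 0]
```

This direct version compares every pair of residues, which is adequate for a single protein.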
47 Feed into neural network
- The prediction is very complex, therefore a neural network is used
48 Neural network
- Again, the 6 input parameters are converted to values between 0 and 1 using the function f(x)
- The output is a value between 0 and 1 showing the confidence of the match
49 Network training and testing
- The network was trained using pairs of proteins with known folding patterns
- Again, the training and testing sets were kept separate to avoid bias
51 Final prediction
- Example of GenTHREADER results
52 GenTHREADER evaluation
- Evaluated using the Fischer benchmark: 68 protein pairs that are hard to predict
- 73.5% were properly detected (best method: 76.5%, other methods: 50-60%)
- For the unrecognized proteins the scores were less than 0.5
53 Genome analysis example
- As an example of GenTHREADER's efficiency, the Mycoplasma genitalium genome was analyzed
- M. genitalium has the smallest known bacterial genome, containing 468 ORFs
54 Genome analysis example
The whole genome was analyzed in one day
Confidence categories
Distribution of protein domain architecture
55 Genome analysis example
GenTHREADER succeeded in predicting 46% of the ORFs. For comparison, other methods:
- Fischer & Eisenberg: 22%
- Using PSI-BLAST and OWL: 30%
- Huynen: 38%
56 Genome analysis example
- While analyzing the M. genitalium genome, ORF MG353 was assigned to 1HUE (histone-like protein)
- The function of ORF MG353 was not known
57 Genome analysis example
- The results show high similarity in both secondary structure and functional AA
58 GenTHREADER advantages
- Fast
- No need for human intervention
- Can identify false positives with high accuracy
59 Possible improvements
- Improvement of sequence alignment (PSI-BLAST)
- Additional input of sequence features (for example, secondary structure)
- Larger DB of proteins with known folds
- Combination with other methods
Already exists (mGenTHREADER)
60 Don't try this at home
- www.psipred.net
- The website serves both GenTHREADER and PSI-PRED
61 References
- Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999; 292: 195-202.
- Jones DT. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol. 1999; 287: 797-815.