Title: Bioinformatics CS 40400
1Bioinformatics CS 40400
- Gianluca Pollastri
- office CS A1.07
- email gianluca.pollastri_at_ucd.ie
2Lecture notes
- http://gruyere.ucd.ie/2007_courses/40400/
- confidential..
3Recommended/useful readings
- No book is actually required
- Introduction to Computational Molecular Biology
- Setubal, Meidanis
- Introduction to Bioinformatics
- Lesk
- Bioinformatics: The Machine Learning Approach
- Baldi, Brunak
- Biological Sequence Analysis (but this is a tough one)
- Eddy, Durbin, Krogh, Mitchison
5The course so far..
- Introduction: proteins, RNA, DNA. GenBank, SWISS-PROT, PDB, ExPASy.
- Sequence comparison. Needleman-Wunsch and Smith-Waterman algorithms. Semiglobal comparison. Variations to basic algorithms. Approximate algorithms: BLAST. Multiple sequence alignments. Approximate algorithms: ClustalW.
- Molecular phylogenetics. Distance-based algorithms: UPGMA, Neighbour Joining. Maximum parsimony. Maximum likelihood. Rooting trees. Estimating times. Bootstrapping.
6What's next?
- Protein structure prediction: comparative modelling, threading, de novo. Introduction to artificial neural networks. Prediction of protein structural features by machine learning algorithms: secondary structure, solvent accessibility, contact maps.
7Protein Structure Prediction and Structural
Genomics
Loosely based on
- David Baker and Andrej Sali
8
- MKTLVHVASV EKGRSYEDFQ KVYNAIALKL REDDEYENYI GYGDDLVRLA
- WHISGTWDKH DNTGGSYGGT YRFKKEFNDP SNAGLQNGFK FLEPIHKEFP
- WISSGDLFSL GGVTAVQEMQ GPKIPWRCGR VDTPEDTTPD NGRLPDADKD
- AGYVRTFFQR LNMNDREVVA LMGAHALGKT HLKNSGYEGP WGAANNVFTN
- EFYLNLLNED WKLEKNDANN EQWDSKSGYM MLPTDYSLIQ DPKYLSIVKE
- YANDQDKFFK DFSKAFEKLL ENGITFPKDA PSPFIFKTLE EQGL
10Why?
- 1.7M protein sequences, cheap product of genome sequencing projects.
- 29k high-resolution protein structures. Determined by X-ray crystallography, NMR: painful, costly and time consuming.
- Sequence determines Structure, Structure determines Function. This is why we want to know the structure..
12
ATOM      1  N   LEU A   4      12.803  88.583  75.298  1.00 25.27           N
ATOM      2  CA  LEU A   4      12.284  89.166  74.064  1.00 25.73           C
ATOM      3  C   LEU A   4      11.896  88.062  73.094  1.00 24.57           C
ATOM      4  O   LEU A   4      12.740  87.441  72.459  1.00 21.83           O
ATOM      5  CB  LEU A   4      13.283  90.098  73.387  1.00 25.28           C
ATOM      6  CG  LEU A   4      12.714  91.348  72.710  1.00 30.06           C
ATOM      7  CD1 LEU A   4      13.446  91.644  71.405  1.00 23.16           C
ATOM      8  CD2 LEU A   4      11.221  91.221  72.456  1.00 27.13           C
ATOM      9  N   VAL A   5      10.588  87.839  72.988  1.00 19.24           N
ATOM     10  CA  VAL A   5      10.180  86.742  72.108  1.00 22.98           C
ATOM     11  C   VAL A   5       9.286  87.293  71.005  1.00 25.18           C
ATOM     12  O   VAL A   5       8.388  88.103  71.215  1.00 19.54           O
ATOM     13  CB  VAL A   5       9.527  85.607  72.915  1.00 36.54           C
ATOM     14  CG1 VAL A   5       8.876  86.145  74.185  1.00 68.19           C
ATOM     15  CG2 VAL A   5       8.518  84.844  72.075  1.00 42.84           C
ATOM     16  N   HIS A   6       9.594  86.832  69.801  1.00 19.98           N
ATOM     17  CA  HIS A   6       8.898  87.164  68.570  1.00 14.76           C
ATOM     18  C   HIS A   6       8.153  85.933  68.072  1.00 13.19           C
ATOM     19  O   HIS A   6       8.794  85.029  67.536  1.00 12.12           O
ATOM     20  CB  HIS A   6       9.900  87.636  67.521  1.00 15.61           C
ATOM     21  CG  HIS A   6      10.488  88.969  67.851  1.00 16.42           C
ATOM     22  ND1 HIS A   6      11.808  89.287  67.631  1.00 17.91           N
ATOM     23  CD2 HIS A   6       9.916  90.073  68.382  1.00 12.99           C
ATOM     24  CE1 HIS A   6      12.036  90.531  68.009  1.00 10.96           C
ATOM     25  NE2 HIS A   6      10.904  91.032  68.472  1.00 17.40           N
ATOM     26  N   VAL A   7       6.839  85.922  68.277  1.00 10.72           N
ATOM     27  CA  VAL A   7       6.048  84.781  67.851  1.00 11.90           C
ATOM     28  C   VAL A   7       5.539  85.014  66.423  1.00 17.11           C
ATOM     29  O   VAL A   7       4.938  86.053  66.131  1.00  8.14           O
ATOM     30  CB  VAL A   7       4.833  84.488  68.746  1.00 12.98           C
ATOM     31  CG1 VAL A   7       4.223  83.146  68.336  1.00 11.94           C
ATOM     32  CG2 VAL A   7       5.188  84.475  70.218  1.00 14.19           C
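The ATOM records on this slide can be read programmatically. The following is a minimal sketch, assuming whitespace-separated fields as they appear in these notes; real PDB files use fixed columns, where fields can run together, so a robust reader would slice by column instead. The names `Atom` and `parse_atoms` are illustrative and not part of any library.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str       # atom name, e.g. "CA" for the alpha carbon
    res_name: str   # three-letter residue name
    chain: str
    res_seq: int    # residue number
    x: float
    y: float
    z: float


# A few records copied from the slide, one per line.
PDB_TEXT = """\
ATOM 1 N LEU A 4 12.803 88.583 75.298 1.00 25.27 N
ATOM 2 CA LEU A 4 12.284 89.166 74.064 1.00 25.73 C
ATOM 3 C LEU A 4 11.896 88.062 73.094 1.00 24.57 C
ATOM 4 O LEU A 4 12.740 87.441 72.459 1.00 21.83 O
ATOM 9 N VAL A 5 10.588 87.839 72.988 1.00 19.24 N
ATOM 10 CA VAL A 5 10.180 86.742 72.108 1.00 22.98 C
"""


def parse_atoms(text: str) -> list[Atom]:
    """Parse ATOM lines, assuming whitespace-separated fields (a simplification)."""
    atoms = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9 or fields[0] != "ATOM":
            continue  # skip anything that is not a complete ATOM record
        atoms.append(Atom(
            serial=int(fields[1]), name=fields[2], res_name=fields[3],
            chain=fields[4], res_seq=int(fields[5]),
            x=float(fields[6]), y=float(fields[7]), z=float(fields[8]),
        ))
    return atoms


atoms = parse_atoms(PDB_TEXT)
# Keep one point per residue: the alpha-carbon trace.
ca_trace = [(a.res_name, a.res_seq, (a.x, a.y, a.z)) for a in atoms if a.name == "CA"]
print(ca_trace)
```

Keeping only the CA atoms gives one point per residue, which is a common reduced representation of a chain.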
14Simulating nature?
- We probably don't know the physics well enough (or rather, we know it well on an intractably small scale)
- Ugly landscapes to search.
- An enormous number of time steps needed.
- Computationally intractable.
- We need to:
- start close to the solution
- approximate/simplify
15Methods for 3D prediction
- If there are proteins of known structure that look like the one I want to model, comparative modelling (CM) or threading/fold recognition methods are available (starting close to the solution and possibly simplifying).
- If there aren't, we use de novo (or ab initio) methods (can't start close to the solution: we need to simplify/approximate).
16Comparative Modelling (CM)
- Find proteins of known structure whose sequence looks like my sequence (templates).
- Align sequence and template(s)
- Build a model
- Figure out if the model makes sense
17Structure more conserved than sequence
- If two sequences are more than 30% similar, strong structural similarity is almost guaranteed.
- (Average similarity of unrelated sequences is around 7%)
18CM
- Find proteins of known structure whose sequence looks at least 30% like my sequence (templates).
- Align sequence and template(s)
- ..
19Sequence similarity
Query  3    FEFHGYARSGVIMNDSGASTKSGAYITPAGETGGAIGRLGNQADTYVEMNLEHKQTLDNG  62
            FEFH YAR  V MND  A  K  AY  PA E   A  RL NQAD YVEMNLEHKQ LDN
Sbjct  3    FEFHHYARCHVHMNDCHACCKCHAYHCPAHECHHAHHRLHNQADCYVEMNLEHKQCLDNH  62

Query  63   ATTRFKVMVADGQTSYNDWTASTSDLNVRQAFVELGNLPTFAGPFKGSTLWAGKRFDRDN  122
            A  RFKVMVAD Q  YNDW A   DLNVRQAFVEL NLP FA PFK   LWA KRFDRDN
Sbjct  63   ACCRFKVMVADHQCCYNDWCACCCDLNVRQAFVELHNLPCFAHPFKH--LWAHKRFDRDN  120

Query  123  FDIHWIDSDVVFLAGTGGGIYDVKWNDGLRSNFSLYGRNFGDIDDSSNSVQNYILTMNHF  182
            FD HW D DVVFLA      YDVKWND LR NF LY RNF D DD  N VQNY L MNHF
Sbjct  121  FDHHWHDCDVVFLAHCHHHHYDVKWNDHLRCNFCLYHRNFHDHDDCCNCVQNYHLCMNHF  180
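The 30% rule of thumb can be checked directly on an alignment like the one on this slide. Below is a small sketch that counts identical columns in two aligned rows. It assumes percent identity is taken over alignment columns, gaps included, which is one common convention among several. The function name `percent_identity` is illustrative; the two strings are the first Query/Sbjct rows of the slide.

```python
def percent_identity(row_a: str, row_b: str) -> tuple[int, int, float]:
    """Return (identical columns, alignment columns, percent identity).

    Both rows must come from the same alignment, with '-' marking gaps.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = 0
    identical = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a column of two gaps carries no information
        columns += 1
        if a == b:
            identical += 1
    return identical, columns, 100.0 * identical / columns


# First block of the alignment (Query 3-62 against Sbjct 3-62).
QUERY = "FEFHGYARSGVIMNDSGASTKSGAYITPAGETGGAIGRL" "GNQADTYVEMNLEHKQTLDNG"
SBJCT = "FEFHHYARCHVHMNDCHACCKCHAYHCPAHECHHAHHRL" "HNQADCYVEMNLEHKQCLDNH"

identical, columns, pct = percent_identity(QUERY, SBJCT)
print(f"{identical}/{columns} identical columns ({pct:.1f}%)")
```

For a template search, the same count would be computed over the full alignment and compared with the 30% threshold.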
20CM: finding templates, aligning sequence and templates
- Sequence comparison methods
- Exact version (complexity O(nm), where n and m are sequence lengths): a bit demanding.
- Linear approximations (BLAST, PSI-BLAST): aligning a sequence vs 1M sequences takes tens of seconds.
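To see where the O(nm) cost of the exact version comes from, here is a minimal sketch of the Needleman-Wunsch global alignment score covered earlier in the course. The scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration, not values from the lecture, and the function returns only the score, not the alignment itself.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score by dynamic programming.

    The two nested loops fill an (n+1) x (m+1) table, hence O(nm) time.
    Only the previous row is kept, so memory is O(m).
    """
    n, m = len(a), len(b)
    prev = [j * gap for j in range(m + 1)]  # row 0: b's prefix aligned to gaps
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m          # column 0: a's prefix aligned to gaps
        for j in range(1, m + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap              # a[i-1] aligned to a gap
            left = curr[j - 1] + gap        # b[j-1] aligned to a gap
            curr[j] = max(diagonal, up, left)
        prev = curr
    return prev[m]


print(nw_score("GATTACA", "GATTACA"))
print(nw_score("ACGT", "AGT"))
```

Running this table fill against every entry of a database with around a million sequences is what makes the exact method demanding, which is the motivation for heuristics such as BLAST.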
21CM: building/assessing model
- Copy template or parts thereof (to start close to the solution)..
- Fondle it a bit and assess fondling by physical energy or pseudo-energy.
22Threading/Fold Recognition
- If no sequence similarity is detected:
- Find proteins of known structure (templates) by some other method (not sequence comparison)
- Align sequence and template(s)
- Build a model
- Figure out if the model makes sense
23Threading: finding templates
- Libraries of folds.
- Thread the sequence into each of the folds and
check if it has low energy in one or more of them.
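As a toy illustration of threading a sequence through a library of folds, the sketch below scores every gapless placement of the sequence on each fold and ranks folds by their lowest pseudo-energy. Everything here is an assumption made for illustration: the folds are reduced to strings of buried (`B`) and exposed (`E`) sites, the fold names are hypothetical, the hydrophobic set is a rough choice, and the energy is a simple count. Real threading methods use far richer energy functions and allow gaps.

```python
# Rough, illustrative hydrophobic set (an assumption, not a standard scale).
HYDROPHOBIC = set("AVLIMFWC")


def thread_energy(seq: str, env: str) -> int:
    """Toy pseudo-energy: -1 per residue that suits its site, +1 otherwise.

    A residue suits its site if it is hydrophobic and buried ('B'),
    or non-hydrophobic and exposed ('E').
    """
    energy = 0
    for residue, site in zip(seq, env):
        fits = (residue in HYDROPHOBIC) == (site == "B")
        energy += -1 if fits else 1
    return energy


def best_threading(seq: str, fold_env: str):
    """Try every gapless offset; return (lowest energy, offset) or None if the fold is too short."""
    best = None
    for offset in range(len(fold_env) - len(seq) + 1):
        energy = thread_energy(seq, fold_env[offset:offset + len(seq)])
        if best is None or energy < best[0]:
            best = (energy, offset)
    return best


def rank_folds(seq: str, library: dict[str, str]) -> list[tuple[int, str, int]]:
    """Return (energy, fold name, offset) for each usable fold, lowest energy first."""
    results = []
    for name, env in library.items():
        hit = best_threading(seq, env)
        if hit is not None:
            results.append((hit[0], name, hit[1]))
    return sorted(results)


# Hypothetical fold library; names and patterns are made up.
library = {"fold_a": "EBBEBE", "fold_b": "EEEEEE", "fold_c": "BEB"}
ranking = rank_folds("LVHV", library)
print(ranking)
```

The point of the sketch is the shape of the computation: the search is restricted to placements on known folds, so it is much smaller than an unconstrained search over 3D configurations.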
24Threading
- Energy computations constrained by folds.
- This is a lot simpler (quicker) than unconstrained search.
- Still 1-2k folds
25Threading: building/assessing model
- Copy template or parts thereof (to start close to the solution)..
- Fondle it a bit more than in CM and assess fondling by physical energy or pseudo-energy.
26De novo prediction
- No sequence similarity with proteins of known structure detected.
- No fold where threading is possible at acceptable energy levels.
27De novo prediction: note
- Sequence similarity methods are not perfect.
- Threading methods are far from perfect (especially energy functions).
- Often de novo methods are used for proteins whose structure does resemble a known one.
28De novo: how does it work?
- It usually does not.
- (neither does threading)
- Search for a minimum of some energy function. Key actors:
- How we search the space of 3D configurations.
- The energy function we use.
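The two key actors can be made concrete with a small sketch. The slides do not say which search method is used, so Metropolis Monte Carlo is swapped in here as one common choice. The energy is a toy rugged function of a few angles with no physical meaning, and all names and parameter values are assumptions.

```python
import math
import random


def toy_energy(angles):
    """Toy rugged landscape: global minimum 0 when every angle is 0, with local minima elsewhere."""
    return sum((1 - math.cos(a)) + 0.5 * (1 - math.cos(3 * a)) for a in angles)


def metropolis(energy, start, steps=5000, step_size=0.3, temperature=0.5, seed=0):
    """Metropolis Monte Carlo: perturb one coordinate at a time.

    Downhill moves are always accepted; uphill moves are accepted with
    probability exp(-dE / temperature), which lets the search leave local minima.
    Returns the lowest-energy configuration seen and its energy.
    """
    rng = random.Random(seed)
    current = list(start)
    e_current = energy(current)
    best, e_best = list(current), e_current
    for _ in range(steps):
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.gauss(0.0, step_size)
        e_candidate = energy(candidate)
        delta = e_candidate - e_current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best


start = [2.0] * 5
e_start = toy_energy(start)
best, e_best = metropolis(toy_energy, start)
print(f"start energy {e_start:.3f}, best energy found {e_best:.3f}")
```

Swapping `toy_energy` for a different function changes the landscape, while changing the move set or the acceptance rule changes how it is searched; these are the two independent design choices named on the slide.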
29De novo: simplify
- An all-atom model is computationally heavy
- Only some atoms are modelled (e.g. backbone
atoms).
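As a small illustration of modelling only some atoms, the sketch below keeps the backbone atoms (N, CA, C, O) of each residue and drops the side chains. The atom list is copied from the LEU 4 and VAL 5 ATOM records shown earlier in these notes; the helper name `keep_backbone` is illustrative.

```python
BACKBONE = {"N", "CA", "C", "O"}

# (atom name, residue name, residue number) for LEU 4 and VAL 5.
atoms = [
    ("N", "LEU", 4), ("CA", "LEU", 4), ("C", "LEU", 4), ("O", "LEU", 4),
    ("CB", "LEU", 4), ("CG", "LEU", 4), ("CD1", "LEU", 4), ("CD2", "LEU", 4),
    ("N", "VAL", 5), ("CA", "VAL", 5), ("C", "VAL", 5), ("O", "VAL", 5),
    ("CB", "VAL", 5), ("CG1", "VAL", 5), ("CG2", "VAL", 5),
]


def keep_backbone(atom_list):
    """Drop side-chain atoms, keeping the four backbone atoms of each residue."""
    return [atom for atom in atom_list if atom[0] in BACKBONE]


backbone = keep_backbone(atoms)
print(f"{len(atoms)} atoms reduced to {len(backbone)} backbone atoms")
```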
30De novo: simplify more
- An all-atom model is computationally heavy
- Whole stretches of atoms are modelled together.
31Choosing the stretches
- Regular local structures (helices, strands) are a natural modelling unit.
- We don't know where they are.
- Machine learning.
- Huge field.
34The energy function
- Purely physical functions are not accurate enough for all-atom models, are very inaccurate for coarser models, and generally don't provide decent landscapes to search.
- Pseudo-energies: huge room for machine learning, e.g. contact map prediction.
35Contact maps
- Amino acid adjacency map.
- Invariant to rotations and translations, unlike xyz coordinates.
- Maps with 50-60% uniform random noise are still compatible with the correct 3D structure.
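A contact map can be computed from coordinates in a few lines, and the invariance claim checked directly. This is a sketch under stated assumptions: contacts are defined between alpha carbons closer than 8 Å, a common but not universal choice. The four coordinates are the CA atoms of residues 4 to 7 from the ATOM records shown earlier in these notes.

```python
import math


def contact_map(coords, threshold=8.0):
    """Boolean matrix: entry [i][j] is True when residues i and j are closer than threshold (in Å)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) < threshold for j in range(n)]
            for i in range(n)]


# CA coordinates of LEU 4, VAL 5, HIS 6, VAL 7.
ca = [
    (12.284, 89.166, 74.064),
    (10.180, 86.742, 72.108),
    (8.898, 87.164, 68.570),
    (6.048, 84.781, 67.851),
]

# Rigid motions of the same structure: a translation and a 90-degree rotation about z.
translated = [(x + 10.0, y - 5.0, z + 2.5) for x, y, z in ca]
rotated = [(-y, x, z) for x, y, z in ca]

cmap = contact_map(ca)
print(cmap == contact_map(translated) == contact_map(rotated))
for row in cmap:
    print("".join("#" if contact else "." for contact in row))
```

The coordinates change under both motions, but the map does not, because it depends only on pairwise distances.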
36Do CM, threading, de novo work?
- CM works, so long as the template is correct (it often is)
- Threading works, so long as the template is correct (it never is)
- De novo in some cases produces one model out of 5 that is correct over a short stretch of amino acids (80?).