Title: Improving Sequence Alignment For Protein Modelling
1Improving Sequence Alignment For Protein Modelling
- Danielle Talbot
- Supervisor Dr. Andrew Martin
- University of Reading
- E-mail d.talbot_at_rdg.ac.uk
2Overview
- Why model?
- The problems of comparative modelling
- The importance of correct alignments
- Analyzing misaligned sequences
- Scoring models to select the best
- Current and Future work
3Why Model Protein Sequences?
- GenPept: >1.53M protein sequences
- PDB: 22,500 structures
- Both increasing exponentially
- Experimental determination slow
- Some structures cannot be determined experimentally
- Comparative Modelling offers an alternative
4Identify Parents
Align target and parent(s)
Identify SCRs/SVRs
Build SCRs from parent(s)
Build SVRs ab initio or database
Build sidechains
Refine model
Evaluate model
OK? NO: return to an earlier step; YES: Final model
5Major Problems in CM
- Correct sequence alignment
- Modelling SVRs
- Decided to concentrate on sequence alignment
6Importance of correct alignment
7Difficulties with Alignments
- Low sequence identity
- Insertions / deletions
- Number of indels
- Length of indels
8Why Sequence Alignment Doesn't Work
- Needleman-Wunsch finds optimum global alignment between two sequences
- Optimum depends simply on a similarity matrix
- Does not account for factors in 3D
- secondary structures
- charges
- hydrophobicity
- distance constraints affecting indels
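
As a minimal sketch of the Needleman-Wunsch idea named on this slide, the function below fills a dynamic-programming table and traces back one optimal global alignment. The scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions standing in for a real similarity matrix and gap penalty, and the function name is hypothetical. Nothing in it knows about 3D structure, which is the limitation the slide points out.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    The match/mismatch/gap values are placeholders for a similarity matrix
    and a linear gap penalty; they are assumptions, not the talk's settings.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The optimum here depends only on the scoring parameters, so two residues that score well against each other are aligned regardless of secondary structure, charge, hydrophobicity, or distance constraints.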
9Improving Alignment
- Only successful scoring will be in 3D
- Two problems
- Generating (sensible) alternative alignments
- Scoring the resulting structures
10Generating Alternative Alignments
11Misleading Local Sequence Alignments
- MLSAs: extreme case of misalignment
- regions where apparently obvious sequence alignment not observed in the structure
- not restricted to protein pairs with particularly high or low global sequence similarity
- Analysed to suggest causes of error
12Saqi, Russell & Sternberg
Cytochrome b Reductase
Pig    VDLVIKVYFKDTHP
E.coli FDLLVKVYFKNEHP
9 out of 14 identities
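
The identity count quoted above can be checked with a few lines of Python. This sketch assumes the two 14-residue fragments are compared position by position with no gaps, which is what reproduces the figure on the slide. The function name is hypothetical.

```python
def count_identities(row_a, row_b):
    """Count columns where two equal-length aligned rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Gap characters never count as identities.
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")


pig = "VDLVIKVYFKDTHP"
ecoli = "FDLLVKVYFKNEHP"
identities = count_identities(pig, ecoli)
print(f"{identities} out of {len(pig)} identities")
```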
13Finding MLSAs
- Using CATH H-families, select all pairs of NREPs (56,000 pairs)
- For each pair
- Perform NW sequence alignment
- Perform SSAP structural alignment
- MLSA: window of 10aa has ≥5aa mis-aligned
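
The window criterion can be sketched as a sliding scan over the two alignments. This is only an illustration under stated assumptions: each alignment is represented as a list giving, for every target residue, the index of the parent residue it is paired with (or `None` for a gap), and the threshold is taken as at least 5 of 10 residues paired differently. The representation and the function name are hypothetical and are not the method used in the analysis.

```python
def misaligned_windows(seq_map, struc_map, window=10, min_misaligned=5):
    """Return (start, n_misaligned) for each window meeting the threshold.

    seq_map[i] and struc_map[i] give the parent residue index paired with
    target residue i in the sequence and structural alignments (None = gap).
    """
    if len(seq_map) != len(struc_map):
        raise ValueError("both maps must cover the same target residues")
    # A residue counts as misaligned when the two alignments disagree on it.
    differs = [s != t for s, t in zip(seq_map, struc_map)]
    hits = []
    for start in range(len(differs) - window + 1):
        n_bad = sum(differs[start:start + window])
        if n_bad >= min_misaligned:
            hits.append((start, n_bad))
    return hits


# Toy example: residues 8-13 are shifted by one position in the structure.
seq_map = list(range(20))
struc_map = list(range(8)) + [i + 1 for i in range(8, 14)] + list(range(14, 20))
print(misaligned_windows(seq_map, struc_map))
```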
14Finding MLSAs
- 8% (4,500) of pairs had an MLSA
- 0.13% (82) pairs had ≥6aa misaligned
- 31 of these had Sseq ≥ 2Sstruc
- 9 of these were genuine. Others were:
- errors in CATH domain assignments
- SSAP errors
- arbitrary structural alignments (non-globular proteins, highly flexible regions)
15Genuine MLSAs
- One because of a hinge region
16Genuine MLSAs
17Genuine MLSAs
- Four minimized exposure of hydrophobics
18Sequence Structure MisAlignments
- SSMAs are less extreme examples
- Simply local regions where sequence and structure alignments do not match
- Have identified these
- Currently being analysed
- Analysis will give rules for generating alternative alignments
19Evaluating Alternative Models
20Evaluating Alternative Models
- Ultimate evaluation: comparison with crystal structure
- Not applicable in real-life modelling!
- Successful evaluation: empirically based
- Empirical potentials (PROSA-II, RAM, etc.)
- Rule-based (neural nets, etc.)
21The RAM Potential
- Atom-level empirical potential
- Developed by Ram Samudrala and John Moult
22Evaluating the RAM Potential
- Using CATH H-families, select all pairs of NREPs (56,000 pairs)
- For each pair
- Perform NW sequence alignment
- Perform SSAP structural alignment
- Select one as parent and one as target
- Create model from each alignment with MODELLER
- Calculate RMSd (model vs. parent) and RAM
potential
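
For the RMSd step, a minimal sketch of the calculation is given below. It assumes the two coordinate sets are already superposed and that equivalent atoms (for example Cα atoms) are listed in the same order. The superposition itself is omitted, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) points.

    Assumes the structures are already superposed and atoms are paired by index.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy example: every atom displaced by 1 unit along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model, reference))
```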
23Results of Large Scale Analysis
Values for the structurally aligned model
24How can Models Based on Sequence Alignments be
Better Than Those Based on Structural Alignment?
- Structural alignment is the gold standard we aim to obtain with sequence alignment
- HOWEVER
- RMS(seq) < RMS(struc) in 9.3% of cases
Why?
25How can Models Based on Sequence Alignments be
Better Than Those Based on Structural Alignment?
- All cases examined so far have a large indel
Parent GSVIQMRLVNYIPLADLPSSVWY
Seq    ATVLNMR-----------STLWY
Struc  ATVLNMRS-----------TLWY
- The ΔRMSd is usually very small (<0.5Å)
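
To make the example concrete, the sketch below compares the two alignments residue by residue and reports which target residues sit in different columns. The three rows are transcribed from the slide (each gap written as 11 dashes, so all rows are 23 columns long). The function names are hypothetical. Because the parent row has no gaps here, a column index is also a parent residue index.

```python
def target_columns(target_row):
    """Return, for each target residue in order, its column in the alignment."""
    return [col for col, ch in enumerate(target_row) if ch != "-"]


def shifted_residues(row_a, row_b):
    """List (target_index, column_in_a, column_in_b) where the two rows disagree."""
    cols_a, cols_b = target_columns(row_a), target_columns(row_b)
    if len(cols_a) != len(cols_b):
        raise ValueError("rows must contain the same number of residues")
    return [(i, ca, cb) for i, (ca, cb) in enumerate(zip(cols_a, cols_b)) if ca != cb]


parent = "GSVIQMRLVNYIPLADLPSSVWY"
seq_row = "ATVLNMR" + "-" * 11 + "STLWY"     # sequence alignment
struc_row = "ATVLNMRS" + "-" * 11 + "TLWY"   # structural alignment

for idx, c_seq, c_struc in shifted_residues(seq_row, struc_row):
    print(f"target residue {idx}: parent {parent[c_seq]}{c_seq} (seq) "
          f"vs parent {parent[c_struc]}{c_struc} (struc)")
```

In this example only one target residue changes place: the sequence alignment puts the S after the indel, opposite a parent S, while the structural alignment puts it before the indel. A difference this small is consistent with the two models being nearly equivalent.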
26Conclusions
- Causes of MLSAs have been identified
- Terminal regions
- Hydrophobicity / accessibility
- Structurally flexible regions
- SSMAs identified
- Performance of RAM potential evaluated
27Future Work
- Re-run MLSA analysis with latest dataset
- Analyse trends seen in SSMAs (Seq-Str Mis-Alignments)
- Create rules for generating alternative alignments
- Use analysis to train a neural net to predict correct alignments
- Compare results with RAM and PROSA-II potentials
28Acknowledgements
- I'd like to thank
- Dr. Andrew Martin
- Everyone in the Bioinformatics Lab. at Reading
- MRC