Bootstrap and jackknife calculation resampling - PowerPoint PPT Presentation


Slides: 28
Provided by: jamesh78

Transcript and Presenter's Notes

1
Bootstrap and jack-knife calculation (resampling)
  • The general principle is to stress the data
    repeatedly and recompute the tree each time,
    looking for robust features.
  • Jack-knife: drop each of the N sequences in turn
    from the alignment and recompute the tree, giving
    N trees; test whether they are compatible with the
    original.
  • Bootstrap: recompute pairwise distances from a
    random sample of alignment columns (drawn with
    replacement). Recompute the tree for each new
    distance set and see how often a particular tree
    branch is positioned the same way.

(given in the distance method context, but the same
methods work for any tree construction method)
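The two resampling schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment is a hypothetical dict mapping sequence names to equal-length aligned strings, the toy sequences are made up, and the distance is a plain p-distance (fraction of differing sites), which the slides do not specify.

```python
import random

def jackknife_alignments(alignment):
    """Yield (dropped_name, reduced_alignment), dropping each sequence in turn."""
    for name in alignment:
        yield name, {k: v for k, v in alignment.items() if k != name}

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def p_distance(a, b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns shared by the two sequences")
    return sum(x != y for x, y in pairs) / len(pairs)

# Toy alignment (made-up sequences, for illustration only).
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACGA", "s4": "TCGAACGA"}

rng = random.Random(0)                    # fixed seed so the run is repeatable
replicate = bootstrap_alignment(toy, rng)
reduced = list(jackknife_alignments(toy))

print("bootstrap replicate:", replicate)
print("jack-knife sets:", [sorted(r) for _, r in reduced])
print("p-distance s1 vs s4 in replicate:", p_distance(replicate["s1"], replicate["s4"]))
```

Each bootstrap replicate or jack-knife set would then be passed to whatever tree-building method is in use; that step is left out here.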
2
Bootstrap (cont.)
real alignment
Similar resampling and distance derivation is done
for each sequence pair, then a new tree is
calculated. Repeat ad nauseam. Compare the bootstrap
trees with each other.
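Comparing the bootstrap trees usually comes down to counting how often a given grouping reappears. The sketch below assumes a hypothetical representation in which each tree is a set of clades and each clade is a frozenset of leaf names; the trees themselves are invented for illustration and would normally come from the tree-building step.

```python
from collections import Counter

def clade_support(reference_clades, replicate_trees):
    """Return {clade: fraction of replicate trees that contain that clade}."""
    if not replicate_trees:
        raise ValueError("need at least one replicate tree")
    counts = Counter()
    for tree in replicate_trees:
        for clade in reference_clades:
            if clade in tree:
                counts[clade] += 1
    n = len(replicate_trees)
    return {clade: counts[clade] / n for clade in reference_clades}

# Hypothetical clades of the original tree and of four bootstrap trees.
AB, ABC = frozenset("AB"), frozenset("ABC")
reference = [AB, ABC]
replicates = [
    {AB, ABC},
    {AB, ABC},
    {AB, frozenset("ABD")},
    {frozenset("AC"), frozenset("ACD")},
]

support = clade_support(reference, replicates)
for clade, frac in support.items():
    print(sorted(clade), frac)
```

A branch that appears in most replicates is the kind of robust feature the resampling is meant to reveal.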
3
The molecular clock concept
  • divergence (distance) and time are NOT the same!
  • the molecular clock is the common default hypothesis
    for translating distance into time (it assumes that
    divergence is proportional to time).
  • known to be violated whenever sufficiently broad
    groups are considered.
  • however, frequently approximately valid for a
    particular sequence family over relatively short
    times (e.g. most proteins in primates).
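As a worked illustration of translating distance into time, the sketch below assumes a strict clock with a single constant rate. The factor of two reflects that both lineages accumulate changes after the split. The function name and the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def divergence_time(distance, rate_per_lineage):
    """Time since two sequences diverged, assuming a strict molecular clock.

    distance: substitutions per site separating the two sequences.
    rate_per_lineage: substitutions per site per unit time along ONE lineage.
    """
    if rate_per_lineage <= 0:
        raise ValueError("rate must be positive")
    # Both lineages evolve independently, so distance = 2 * rate * time.
    return distance / (2.0 * rate_per_lineage)

# Illustrative values only: 0.02 substitutions/site, 1e-9 per site per year.
years = divergence_time(0.02, 1e-9)
print(f"estimated divergence time: {years:.3g} years")
```

If the rate differs between branches, this single-rate conversion no longer holds.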

4
Things that tend to invalidate the molecular clock
  • differential changes in generation time on some
    branches.
  • differential changes in selective constraints on
    some branches (extreme example would be positive
    selection).
  • depth of divergence (multiple changes at the same
    site hide earlier ones), though this is correctable
    unless distances are too long.
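The last point can be made concrete with a distance correction. The slides do not name one; the sketch below assumes the simple Poisson correction d = -ln(1 - p), where p is the observed fraction of differing sites. It shows why the correction works at moderate depth and breaks down as p approaches 1.

```python
import math

def poisson_corrected_distance(p):
    """Estimated substitutions per site from observed difference fraction p.

    Assumes the Poisson correction d = -ln(1 - p); undefined once p reaches 1.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1 (sequences are saturated at p = 1)")
    return -math.log(1.0 - p)

# The corrected distance grows faster and faster as p approaches saturation,
# so small errors in p become large errors in d for very divergent sequences.
for p in (0.05, 0.30, 0.60, 0.90):
    print(f"observed p = {p:.2f}   corrected d = {poisson_corrected_distance(p):.3f}")
```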

5
Protein structure and alignment
  • When you see a protein sequence alignment,
    notice the blocks with higher and lower
    similarity (they are almost always there).
  • Most of the time, these are not simply
    stochastic variation; they represent regions
    under more or less strong purifying selection.
  • These blocks can vary from rather small segments
    to rather long domains (or both depending on your
    window).
  • Longer blocks usually correspond to different
    protein domains (which can vary as a unit in
    selective pressure).
  • Shorter blocks usually correspond to
    intra-domain structural features.
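The remark that block size depends on your window can be illustrated with a sliding-window conservation score. The sketch below uses a simple all-identical column score (an assumption; many other scores exist) and, as input, the first 30 columns of one rCaMKII / rCaMKI alignment block from this presentation.

```python
def column_identity(rows):
    """1.0 for columns where every sequence has the same residue (no gaps), else 0.0."""
    return [1.0 if len(set(col)) == 1 and "-" not in col else 0.0
            for col in zip(*rows)]

def windowed_conservation(scores, window):
    """Mean score over each run of `window` consecutive columns."""
    if window < 1 or window > len(scores):
        raise ValueError("window must be between 1 and the number of columns")
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# First 30 aligned columns of one block (rCaMKII vs rCaMKI).
rcamkii = "HCIQQILEAVLHCHQMGVVHRDLKPENLLL"
rcamki  = "RLIFQVLDAVKYLHDLGIVHRDLKPENLLY"

scores = column_identity([rcamkii, rcamki])
windows = windowed_conservation(scores, window=10)
for start, value in enumerate(windows, start=1):
    print(f"columns {start:2d}-{start + 9:2d}: {value:.1f}")
```

Rerunning with a smaller or larger `window` shifts attention between short structural features and longer domain-sized blocks.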

6
Kinases form a complex, diverse family
Example from a particular enzyme
(many subtypes)
(many subtypes)
7
CaM Kinase I (CaMKI) and CaM Kinase II (CaMKII) (CaM
stands for calcium-calmodulin)
  • Very similar in the kinase and calmodulin
    regulatory domains.
  • CaMKI is monomeric, whereas CaMKII is a 10-12
    subunit multimer.
  • CaMKII most likely arose after the CaM Kinase
    domain by fusing a multimer formation domain to
    the C-terminus.

8
CaM Kinase II structure
[Figure: domain layout from N-terminus to C-terminus: serine-threonine protein kinase, calmodulin regulation, multimer formation]
12 subunits with the catalytic domains facing out
9
unc-43   --------------------MQLQQINSGAFSVVRRCVHKTTGLEFAAKIINTKKLSARD
rCaMKII  -------MATITCTRFTEEYQLFEELGKGAFSVVRRCVKVLAGQEYPAKIINTKKLSARD
hCaMKI   MLGAVEGPRWKQAEDIRDIYDFRDVLGTGAFSEVILAEDKRTQKLVAIKCIAKEALEGKE
rCaMKI   MPGAVEGPRWKQAEDIRDIYDFRDVLGTGAFSEVILAEDKRTQKLVAIKCIAKKALEGKE

unc-43   FQKLEREARICRKLQHPNIVRLHDSIQEESFHYLVFDLVTGGELFEDIVAREFYSEADAS
rCaMKII  HQKLEREARICRLLKHPNIVRLHDSISEEGHHYLIFDLVTGGELFEDIVAREYYSEADAS
hCaMKI   GS-MENEIAVLHKIKHPNIVALDDIYESGGHLYLIMQLVSGGELFDRIVEKGFYTERDAS
rCaMKI   GS-MENEIAVLHKIKHPNIVALDDIYESGGHLYLIMQLVSGGELFDRIVEKGFYTERDAS

unc-43   HCIQQILESIAYCHSNGIVHRDLKPENLLLASKAKGAAVKLADFGLAIEVN-DSEAWHGF
rCaMKII  HCIQQILEAVLHCHQMGVVHRDLKPENLLLASKLKGAAVKLADFGLAIEVEGEQQRWFGF
hCaMKI   RLIFQVLDAVKYLHDLGIVHRDLKPENLLYYSLDEDSKIMISDFGLSKMED-PGSVLSTA
rCaMKI   RLIFQVLDAVKYLHDLGIVHRDLKPENLLYYSLDEDSKIMISDFGLSKMED-PGSVLSTA

unc-43   AGTPGYLSPEVLKKDPYSKPVDIWACGVILYILLVGYPPFWDEDQHRLYAQIKAGAYDYP
rCaMKII  AGTPGYLSPEVLRKDPYGKPVDLWACGVILYILLVGYPPFWDEDQHRLYQQIKARAYDFP
hCaMKI   CGTPGYVAPEVLAQKPYSKAVDCWSIGVIAYILLCGYPPFYDENDAKLFEQILKAEYEFD
rCaMKI   CGTPGYVAPEVLAQKPYSKAVDCWSIGVIAYILLCGYPPFYDENDAKLFEQILKAEYEFD

unc-43   SPEWDTVTPEAKSLIDSMLTVNPKKRITADQALKVPWICNRERVASAIHRQDTVDCLKKF
rCaMKII  SPEWDTVTPEAKDLINKMLTINPSKRITAAEALKHPWISHRSTVASCMHRQETVDCLKKF
hCaMKI   SPYWDDISDSAKDFIRHLMEKDPEKRFTCEQALQHPWIAGDTALDKNIH-QSVSEQIKKN
rCaMKI   SPYWDDISDSAKDFIRHLMEKDPEKRFTCEQALQHPWIAGDTALDKNIH-QSVSEQIKKN

unc-43   NARRKLKGAILTTMIATRNLSSKRSYRLTLGAEKLVISMKNIEYWQVLLNKIFATYKIKM
rCaMKII  NARRKLKGAILTTMLATRNFSGG-----------------------------------KS
hCaMKI   FAKSKWKQAFNATAVVRHMR----------------------------------------
rCaMKI   FAKSKWKQAFNATAVVRHMR----------------------------------------

continued
10
continued (overlapped)
unc-43   SPEWDTVTPEAKSLIDSMLTVNPKKRITADQALKVPWICNRERVASAIHRQDTVDCLKKF
rCaMKII  SPEWDTVTPEAKDLINKMLTINPSKRITAAEALKHPWISHRSTVASCMHRQETVDCLKKF
hCaMKI   SPYWDDISDSAKDFIRHLMEKDPEKRFTCEQALQHPWIAGDTALDKNIH-QSVSEQIKKN
rCaMKI   SPYWDDISDSAKDFIRHLMEKDPEKRFTCEQALQHPWIAGDTALDKNIH-QSVSEQIKKN

unc-43   NARRKLKGAILTTMIATRNLSSKRSYRLTLGAEKLVISMKNIEYWQVLLNKIFATYKIKM
rCaMKII  NARRKLKGAILTTMLATRNFSGG-----------------------------------KS
hCaMKI   FAKSKWKQAFNATAVVRHMR----------------------------------------
rCaMKI   FAKSKWKQAFNATAVVRHMR----------------------------------------

unc-43   KQCRNLLNKKEQGPPSTIKESSESS-QTIDDNDSEKGGGQLKHENTVVRADGATGIVSSS
rCaMKII  G--G---NKKNDG----VKESSESTNTTIEDED---------------------------

unc-43   NSSTASKSSSTNLSAQKQDIVRVTQTLLDAISCKDFETYTRLCDTSMTCFEPEALGNLIE
rCaMKII  ------------TKVRKQEIIKVTEQLIEAISNGDFESYTKMCDPGMTAFEPEALGNLVE

unc-43   GIEFHRFYFD--GNRKNQ-VHTTMLNPNVHIIGEDAACVAYVKLTQFLDRNGEAHTRQSQ
rCaMKII  GLDFHRFYFENLWSRNSKPVHTTILNPHIHLMGDESACIAYIRITQYLDAGGIPRTAQSE

unc-43   ESRVWSKKQGRWVCVHVHRSTQPSTNTTVSEF
rCaMKII  ETRVWHRRDGKWQIVHFHRSGAPSVLPH----
(note both inter- and intra-domain differences in
conservation)
11
Protein structure basics
  • proteins consist mostly of α-helices, β-sheets,
    and turns.
  • the α-helices and β-sheets typically form the
    framework of the protein.
  • the turns and other atypical structures often
    play important binding and catalytic roles.
  • the core of the protein is hydrophobic, whereas
    the surface is usually polar or charged.
  • most sharp turns (kinks) have glycine or proline.

12
alpha helix
13
three-stranded antiparallel β-sheet
14
three-stranded antiparallel β-sheet, space filled
15
substrate binding cleft
rCaMKII  SPEWDTVTPEAKDLINKMLTINPSKRITAAEALKHPWISHRSTVASCMHRQETVDCLKKF
rCaMKI   SPYWDDISDSAKDFIRHLMEKDPEKRFTCEQALQHPWIAGDTALDKNIH-QSVSEQIKKN 297

rCaMKII  NARRKLKGAILTTMLATRN
rCaMKI   FAKSKWKQAFNATAVVRHM 316
16
sliced half-way through the protein
red: charged; blue: polar; green: hydrophobic
17
(No Transcript)
18
rCaMKII  HQKLEREARICRLLKHPNIVRLHDSISEEGHHYLIFDLVTGGELFEDIVAREYYSEADAS
rCaMKI   GS-MENEIAVLHKIKHPNIVALDDIYESGGHLYLIMQLVSGGELFDRIVEKGFYTERDAS 119

rCaMKII  HCIQQILEAVLHCHQMGVVHRDLKPENLLLASKLKGAAVKLADFGLAIEVEGEQQRWFGF
rCaMKI   RLIFQVLDAVKYLHDLGIVHRDLKPENLLYYSLDEDSKIMISDFGLSKMED-PGSVLSTA 178
19
rCaMKII  HCIQQILEAVLHCHQMGVVHRDLKPENLLLASKLKGAAVKLADFGLAIEVEGEQQRWFGF
rCaMKI   RLIFQVLDAVKYLHDLGIVHRDLKPENLLYYSLDEDSKIMISDFGLSKMED-PGSVLSTA 178

rCaMKII  AGTPGYLSPEVLRKDPYGKPVDLWACGVILYILLVGYPPFWDEDQHRLYQQIKARAYDFP
rCaMKI   CGTPGYVAPEVLAQKPYSKAVDCWSIGVIAYILLCGYPPFYDENDAKLFEQILKAEYEFD 238
20
(No Transcript)
21
Measuring structural similarity
  • Structural similarity can persist after sequence
    similarity has reached noise levels.
  • More generally, how do you measure the degree
    of similarity between two structures?
  • A commonly used approach is the root mean square
    deviation (RMSD) between the positions of matched
    backbone atoms.
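A minimal sketch of the RMSD calculation follows. It assumes the two structures have already been superimposed and that the atoms are given in matched order; finding the best superposition is a separate step that is not shown. The coordinates are made up for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between matched (x, y, z) positions.

    Assumes the two coordinate lists are in the same frame (already
    superimposed) and that element i of one matches element i of the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

# Made-up backbone positions (in Angstroms) for three matched atoms.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 2.0, 0.0)]

print(f"RMSD = {rmsd(structure_a, structure_b):.3f} Angstroms")
```

Because deviations are squared before averaging, a few badly matched atoms can dominate the value, which is one reason RMSD is often reported for shared regions only.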

22
No statistically significant sequence similarity
RMSD for shared regions: 3.5 Angstroms
23
Illustration of three points on a structure of
poorly known function
  • gaps in alignments tend to be on surface loops
  • areas of highest conservation tend to be at key
    sites (e.g. active sites of enzymes) and in core
    structural elements
  • BUT when positive selection acts, binding faces
    may tend to be the parts that vary.

24
MATH domain-containing genes: a mystery family
in C. elegans
25
(No Transcript)
26
(No Transcript)
27
No assignment for Thursday. Final assignment
will be posted by this evening.