Title: Nothing in computational biology makes
1Using (and abusing) sequence analysis to make
biological discoveries
Nothing in (computational) biology makes sense
except in the light of evolution
after Theodosius Dobzhansky (1970)
2Significant sequence similarity is evidence of
homology
Only a small fraction of amino acid residues is
directly involved in protein function (including
enzymatic activity); the rest of the protein serves
largely as a structural scaffold
Conserved sequence motifs are determinants
of conserved ancestral functions
3The evolving roles of computational analysis in
biology
5Sequence complexity
Measure of the randomness of a sequence
Random sequence - highest complexity (entropy) - globular protein domains
Homopolymer - lowest complexity (entropy) - non-globular structures
Algorithmic complexity
QQQQQQQQQQQQQ = (Q)n
KRKRKRKRKRKR = (KR)n
ASDFGHKLCVNM - random sequence - no algorithm to derive from a simpler one
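The entropy view of complexity can be sketched in a few lines of Python. This is a minimal illustration, assuming plain Shannon entropy of the residue composition; it is not the SEG algorithm, and the function names are made up for this example. Note that compositional entropy ignores residue order, so it only approximates the algorithmic notion of complexity.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq.upper())
    # p * log2(1/p) summed over the residue types present in the sequence
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def windowed_entropy(seq: str, window: int = 12) -> list:
    """Entropy of every window of the given length, sliding one residue at a time."""
    return [shannon_entropy(seq[i:i + window])
            for i in range(len(seq) - window + 1)]


# The three example sequences from the slide
for s in ("QQQQQQQQQQQQQ", "KRKRKRKRKRKR", "ASDFGHKLCVNM"):
    print(s, round(shannon_entropy(s), 3))
```

The homopolymer scores lowest, the two-letter repeat scores one bit per residue, and the sequence with twelve distinct residues scores highest, matching the ordering on the slide.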
6seg BRCA1 45 3.4 3.7 > BRCA1.seg
>gi|728984|sp|P38398|BRC1_HUMAN Breast cancer type 1 susceptibility protein
1-388
MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLKLLNQKKGPSQ
CPLCKNDITKRSLQESTRFSQLVEELLKIICAFQLDTGLEYANSYNFAKKENNSPEHLKD
EVSIIQSMGYRNRAKRLLQSEPENPSLQETSLSVQLSNLGTVRTLRTKQRIQPQKTSVYI
ELGSDSSEDTVNKATYCSVGDQELLQITPQGTRDEISLDSAKKAACEFSETDVTNTEHHQ
PSNNDLNTTEKRAAERHPEKYQGSSVSNLHVEPCGTNTHASSLQHENSSLLLTKDRMNVE
KAEFCNKSKQPGLARSQHNRWAGSKETCNDRRTPSTEKKVDLNADPLCERKEWNKQKLPC
SENPRDTEDVPWITLNSSIQKVNEWFSR
389-458
sdellgsddshdgesesnakvadvldvlnevdeysgssekidllasdphealickservh
sksvesnied
459-526
KIFGKTYRKKASLPNLSHVTENLIIGAFVTEPQIIQERPLTNKLKRKRRPTSGLHPEDFI
KKADLAVQ
527-635
ktpeminqgtnqteqngqvmnitnsghenktkgdsiqneknpnpieslekesafktkaep
isssisnmelelnihnskapkknrlrrksstrhihalelvvsrnlsppn
636-995
CTELQIDSCSSSEEIKKKKYNQMPVRHSRNLQLMEGKEPATGAKKSNKPNEQTSKRHDSD
TFPELKLTNAPGSFTKCSNTSELKEFVNPSLPREEKEEKLETVKVSNNAEDPKDLMLSGE
RVLQTERSVESSSISLVPGTDYGTQESISLLEVSTLGKAKTEPNKCVSQCAAFENPKGLI
HGCSKDNRNDTEGFKYPLGHEVNHSRETSIEMEESELDAQYLQNTFKVSKRQSFAPFSNP
GNAEEECATFSAHSGSLKKQSPKVTFECEQKEENQGKNESNIKPVQTVNITAGFPVVGQK
DKPVDNAKCSIKGGSRFCLSSQFRGNETGLITPNKHGLLQNPYRIPPLFPIKSFVKTKCK
996-1089
knlleenfeehsmsperemgnenipstvstisrnnirenvfkeasssninevgsstnevg
ssineigssdeniqaelgrnrgpklnamlrlgvl
1090-1238
QPEVYKQSLPGSNCKHPEIKKQEYEEVVQTVNTDFSPYLISDNLEQPMGSSHASQVCSET
PDDLLDDGEIKEDTSFAENDIKESSAVFSKSVQKGELSRSPSPFTHTHLAQGYRRGAKKL
ESSEENLSSEDEELPCFQHLLFGKVNNIP
1239-1312
sqstrhstvateclsknteenllslknslndcsnqvilakasqehhlseetkcsaslfss
qcseledltantnt
1313-1316
QDPF
Non-globular regions (lowercase)
Globular domains (uppercase)
7
1422-1513
GSQPSNSYPSIISDSSALEDLRNPEQSTSEKAVLTSQKSSEYPISQNPEGLSADKFEVSA
DSSTSKNKEPGVERSSPSKCPSLDDRWYMHSC
1514-1616
sgslqnrnypsqeelikvvdveeqqleesgphdltetsylprqdlegtpylesgislfsd
dpesdpsedrapesarvgnipsstsalkvpqlkvaesaqspaa
1617-1863
AHTTDTAGYNAMEESVSREKPELTASTERVNKRMSMVVSGLTPEEFMLVYKFARKHHITL
TNLITEETTHVVMKTDAEFVCERTLKYFLGIAGGKWVVSYFWVTQSIKERKMLNEHDFEV
RGDVVNGRNHQGPKRARESQDRKIFRGLEICCYGPFTNMPTDQLEWMVQLCGASVVKELS
SFTLGTGVHPIVVVQPDAWTEDNGFHAIGQMCEAPVVTREWVLDSVALYQCQELDTYLIP
QIPHSHY
16Paradigm shift in database searching
Traditional (PSI-BLAST): Query sequence -> Sequence database -> Set of homologs -> PSSM
New (RPS-BLAST): Query sequence -> PSSM database -> Domain architecture
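To make the PSSM idea concrete, here is a minimal sketch of building a position-specific scoring matrix from a set of aligned homologs and scanning a query with it. It assumes ungapped sequences of equal length, a uniform background, and a flat pseudocount; PSI-BLAST and RPS-BLAST use more elaborate weighting and statistics, so this shows the concept only. All function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score (bits)."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same aligned length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score(pssm, segment):
    """Sum of column scores; residues outside the 20-letter alphabet score 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, segment))


def best_hit(pssm, query):
    """Return (score, 0-based start) of the best-scoring ungapped window in query."""
    width = len(pssm)
    if len(query) < width:
        raise ValueError("query is shorter than the PSSM")
    return max((score(pssm, query[i:i + width]), i)
               for i in range(len(query) - width + 1))


# Toy alignment (hypothetical, for illustration only)
pssm = build_pssm(["CPICL", "CPLCK", "CPVCL"])
print(best_hit(pssm, "AAACPICLAAA"))
```

In the traditional scheme the matrix is derived from the hits to one query; in the new scheme a library of such matrices is precomputed and every query is scanned against all of them.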
25DOMAIN ARCHITECTURE OF SELECTED BRCT PROTEINS
Proteins shown: BRCA1; BARD1; BRCA1/BARD homolog (plant); REV1 (yeast); DPB11 (yeast); PARP (vertebrates); DNA ligase III (human); TdT (eukaryotes); RFC1 (eukaryotes); DNA ligase (bacteria)
Domain labels: BRCT; RING; PHD-l; CMP-trans; AZF; PARP; ATP-dep ligase; HhH; polX; ATP and PCNA-binding; NAD-dep ligase
41Use of profile libraries to examine domain representation in individual proteomes
Proteomes: yeast (6,200), worm (20,000)
Profile library -> Detect domains using PSI-BLAST, IMPALA -> Compare domain distributions
Chervitz SA, Aravind L, Sherlock G, Ball CA, Koonin EV, Dwight SS, Harris MA, Dolinski K, Mohr S, Smith T, Weng S, Cherry JM, Botstein D. 1998. Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science 282: 2022-8.
42Normalized domain counts in worm and yeast
1. Hormone receptor; 2. POZ; 3. EGF; 4. MATH; 5. PTPase;
6. Cation channels; 7. PDZ; 8. SH2; 9. FNIII; 10. Homeodomain;
11. LRR; 12. EF hands; 13. Ankyrin; 14. RING finger; 15. C2H2 finger;
16. Small GTPase; 17. RRM; 18. AAA; 19. C6 finger
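One plausible way to normalize such counts is to divide by proteome size, sketched below. The slide does not state which normalization was used, so this is an assumption. The function names are hypothetical, and the domain counts in the usage lines are made-up placeholders, not data from the study; only the proteome sizes come from the preceding slide.

```python
def normalized_counts(domain_counts, proteome_size, per=1000):
    """Domain occurrences per `per` proteins (assumed normalization)."""
    return {domain: per * count / proteome_size
            for domain, count in domain_counts.items()}


def count_ratio(counts_a, size_a, counts_b, size_b):
    """Ratio of normalized counts (a over b) for domains found in both proteomes."""
    norm_a = normalized_counts(counts_a, size_a)
    norm_b = normalized_counts(counts_b, size_b)
    return {d: norm_a[d] / norm_b[d]
            for d in norm_a if d in norm_b and norm_b[d] > 0}


# Placeholder counts for a hypothetical domain (NOT real data)
worm_counts = {"domain_X": 100}
yeast_counts = {"domain_X": 31}
print(count_ratio(worm_counts, 20000, yeast_counts, 6200))
```

A ratio well above 1 would suggest a domain that is over-represented in the first proteome even after accounting for its larger size.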
43- Searching a domain library is often easier and more informative than searching the entire sequence database. However, the latter yields complementary information and should not be skipped if details are of interest.
- Varying the search parameters, e.g. switching composition-based statistics on and off, can make a difference.
- Using subsequences, preferably chosen according to objective criteria, e.g. separation from the rest of the protein by a low-complexity linker, may improve search performance.
- Trying different queries is a must when analyzing protein (super)families.
- Even hits below the threshold of statistical significance often are worth analyzing, albeit with extreme care.
- Transferring functional information between homologs on the basis of a database description alone is dangerous. Conservation of domain architectures, active sites and other features needs to be analyzed (hence automated identification of protein families is difficult and automated prediction of functions is extremely error-prone).
- Always do a reality check!
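The advice on choosing subsequences by objective criteria can be sketched as follows. This is a minimal example, assuming the input is a sequence in which low-complexity regions are already marked in lowercase (as in the seg output shown earlier); the function names and the length cutoff are hypothetical choices for illustration.

```python
import re


def split_by_case(masked_seq):
    """Split a case-masked sequence into (start, end, kind, sequence) segments.

    Coordinates are 1-based and inclusive. Uppercase runs are treated as
    putative globular regions, lowercase runs as low-complexity linkers.
    Non-letter characters are skipped.
    """
    segments = []
    for m in re.finditer(r"[A-Z]+|[a-z]+", masked_seq):
        kind = "globular" if m.group().isupper() else "low-complexity"
        segments.append((m.start() + 1, m.end(), kind, m.group()))
    return segments


def candidate_queries(masked_seq, min_length=30):
    """Uppercase segments long enough to be worth running as separate queries."""
    return [(start, end, seq)
            for start, end, kind, seq in split_by_case(masked_seq)
            if kind == "globular" and end - start + 1 >= min_length]


# Short toy string assembled from fragments of the seg output above
toy = "MDLSALRV" + "sdellgsd" + "KIFGKTYRKK"
print(split_by_case(toy))
print(candidate_queries(toy, min_length=10))
```

Each returned segment can then be searched on its own, and the results compared with those for the full-length protein as a reality check.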