Title: Identification and Quantification of Polypeptide Similarity Tim
1 Sequence Similarity and Homology
2 The Importance of Sequence Similarity
- Laboratory alignments
- Sequence identification
- Conservation analyses
- Experimental design
- Predict structure
- Functional relationship
- Drug discovery
- Protein design
- Large scale analyses
- Annotation of sequencing projects
- Comparative genomics
3 Sequence Conservation in Evolution
Amino acid replacements in a protein that survive during the course of evolution tend to conserve structure (and hence function). To be accepted, the new amino acid side-chain usually functions in a similar way to the old one.
Known Family: GKV--NVDEVGGEA, GKV--NEEEVGGEA, GKV--NVADCGAEA, GKVEADIPGHGETV (Known 3D structure, Known Function)
Protein: GKVDVDVVGAQA (3D structure, Function)
4 Protein Domains
FIBRONECTIN
RECEPTOR TYROSINE KINASE EPH
PROTEIN KINASE SRC
- Recurring evolutionary unit
- Compact, spatially distinct
- Fold in isolation
- Functional units
Doolittle and Bork (1993)
5 Glossary
Homology: Alike because of shared ancestry (do not confuse with sequence similarity).
Homologue Family: Group of evolutionarily related proteins.
Superfamily: A homologue family where membership might only be evident from the structure; ancestry is not always inferable from sequence.
Fold: A particular spatial arrangement and connectivity of major secondary structure elements.
Domain: An autonomously folding protein region. A protein region with distinct evolutionary history.
6 Different Types of Database Searching
7 Why compare protein sequences rather than DNA sequences?
- DNA bases match randomly 1-in-4
- DNA sequence changes more often than protein, due to codons (synonymous changes leave the protein unchanged)
- With DNA, random matches to insertions can be high!
- Amino acids can be matched by similarity
CATGA------------------ACGTATCCCAGTAACTC
CATGAGTCAGATGAGCAAAGTCAACGTATCCCAGTAACTC
Protein sequences can greatly improve the signal-to-noise ratio.
8 Sequence likeness: Amino Acid Identity
Alignment of α and β chains of human haemoglobin
HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
          G   VK HGKKV  A     AH D        LS LH  KL
HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
Many identities -> highly similar
Identical at 18 from 41 positions: 18/41 = 44%
HBA_HUMAN GSAQVKG-HGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
              V G  G  V
P0        TDREVYGAVGSQVTLHCSFWSSEWVSDDISFTWRYQPEGGRD
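The identity count for the haemoglobin alignment can be checked with a few lines of Python. This is a minimal sketch: the function name `percent_identity` is made up for illustration, and it assumes the denominator is the number of alignment columns (other conventions, such as the length of the shorter sequence, also exist).

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Return (identities, columns, percent) for two already-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as an identity when both residues match and are not gaps.
    identities = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != gap)
    columns = len(seq_a)  # assumption: every alignment column is counted
    return identities, columns, 100.0 * identities / columns


hba = "GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL"
hbb = "GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL"
identities, columns, percent = percent_identity(hba, hbb)
print(f"{identities}/{columns} = {percent:.0f}%")  # 18/41 = 44%
```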
9 Correlation between sequence and structural similarity
Chothia and Lesk (1986)
10 Chance Sequence Matches
Brenner et al. (1998)
No structural similarity but 39% identity!
11 How Proteins Change During Evolution
1. Single amino acid replacements
   a) Result of random mutation in the gene
   b) Acceptance by natural selection
2. Small-scale deletions and insertions (the gap problem)
3. Larger-scale duplications and rearrangements
Non-random substitutions
12 How to quantify sequence likeness: Amino Acid Similarity
Alignment of α and β chains of human haemoglobin
HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
          G   VK HGKKV  A     AH D        LS LH  KL
HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
Alignment of human haemoglobin and leghaemoglobin from lupin
Biologically meaningful alignment, but low identity.
HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
                  H  KV      A               L  L   H  K
LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
13 Sequence likeness: Similarity Score
Similarity score is the summation of scores between pairs of aligned residue positions.
HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
          G   VK HGKKV  A     AH D        LS LH  KL
HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
Pair scores for the first five aligned positions:
G . . G    5
N . . S    1
P . . A   -1
K . . Q    1
V . . V    4
5 + 1 - 1 + 1 + 4 + ... = Total Score
Amino acid pair scores come from a scoring scheme
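The summation can be sketched in Python using only the five pair scores listed on this slide. The slide does not name the matrix these values come from, so the `PAIR_SCORES` table is an illustrative stand-in for a full scoring scheme, and the function names are hypothetical.

```python
# Pair scores as listed on the slide for the first five aligned positions.
# Assumption: the scheme is symmetric, so (N, S) and (S, N) score the same.
PAIR_SCORES = {
    ("G", "G"): 5,
    ("N", "S"): 1,
    ("P", "A"): -1,
    ("K", "Q"): 1,
    ("V", "V"): 4,
}


def pair_score(a, b, scores=PAIR_SCORES):
    """Look up the score of one aligned residue pair in either order."""
    if (a, b) in scores:
        return scores[(a, b)]
    return scores[(b, a)]


def similarity_score(seq_a, seq_b, scores=PAIR_SCORES):
    """Sum the pair scores over all aligned positions (no gaps handled here)."""
    return sum(pair_score(a, b, scores) for a, b in zip(seq_a, seq_b))


# First five positions of the HBA_HUMAN / HBB_HUMAN alignment.
partial_score = similarity_score("GSAQV", "GNPKV")
print(partial_score)  # 5 + 1 - 1 + 1 + 4 = 10
```

Scoring the whole alignment would need a complete 20 x 20 table in place of the five entries used here.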
14 Substitution Matrices
PAM250
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   0   0   0  -8
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2  -1   0  -1  -8
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   2   1   0  -8
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   3   3  -1  -8
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -4  -5  -3  -8
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   1   3  -1  -8
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   3   3  -1  -8
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   0   0  -1  -8
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   1   2  -1  -8
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -2  -2  -1  -8
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -3  -3  -1  -8
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   1   0  -1  -8
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -2  -2  -1  -8
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -4  -5  -2  -8
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1  -1   0  -1  -8
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   0   0   0  -8
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   0  -1   0  -8
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -5  -6  -4  -8
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -3  -4  -2  -8
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4  -2  -2  -1  -8
B   0  -1   2   3  -4   1   3   0   1  -2  -3   1  -2  -4  -1   0   0  -5  -3  -2   3   2  -1  -8
Z   0   0   1   3  -5   3   3   0   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3  -1  -8
X   0  -1   0  -1  -3  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1   0   0  -4  -2  -1  -1  -1  -1  -8
*  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8   1
15 Gaps for Optimal Alignments
HBA_HUMAN -VLSPADKTNVKAAWGKVGAHAGEYGAEALERM
HBB_HUMAN VHLTPEEKSAVTALWGKVN--VDEVGGEALGRL
(the "-" positions are gaps)
Score = Σ(substitution scores) - Σ(gap penalties)
Gap penalties: insertion penalty, extension penalties
Best alignment = Maximum Score
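As a sketch of how such a score is put together, the function below scores an existing gapped alignment as the sum of substitution scores minus gap penalties. The identity scoring function and the penalty values (2 to open a gap, 1 per extension) are arbitrary assumptions chosen for illustration, not values from the slide.

```python
def gapped_alignment_score(seq_a, seq_b, substitution, gap_open, gap_extend, gap="-"):
    """Score = sum(substitution scores) - sum(gap penalties).

    The first position of each gap run costs gap_open; every further
    position in the same run costs gap_extend.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    previous_gap_in = None  # which sequence held a gap in the previous column
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two gaps")
        if a == gap or b == gap:
            gap_in = "a" if a == gap else "b"
            total -= gap_extend if previous_gap_in == gap_in else gap_open
            previous_gap_in = gap_in
        else:
            total += substitution(a, b)
            previous_gap_in = None
    return total


def identity_score(a, b):
    """Toy substitution scheme (assumption): 1 for identical residues, else 0."""
    return 1 if a == b else 0


hba = "-VLSPADKTNVKAAWGKVGAHAGEYGAEALERM"
hbb = "VHLTPEEKSAVTALWGKVN--VDEVGGEALGRL"
score = gapped_alignment_score(hba, hbb, identity_score, gap_open=2, gap_extend=1)
print(score)  # 15 identities - (2 + 2 + 1) gap penalties = 10
```

Charging less for extending a gap than for opening one favours a few long gaps over many short ones.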
16 Dynamic Programming
Sequences to align: misspelled, mispeld
misspelled
mis-pel--d
At each point, keep only the best route to that point. Other routes are cut: they cannot possibly be better for an alignment that goes through this point.
Guaranteed optimum for two sequences
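A minimal sketch of global alignment by dynamic programming (in the style of Needleman and Wunsch) for the two words on this slide. The scores (match +1, mismatch -1, gap -1) and the linear gap penalty are assumptions made for illustration. Where several routes tie, the traceback picks one, so the gap placement may differ from the slide while the score is the same.

```python
def global_alignment(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Keep only the best of the three routes into this cell.
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace one best route back from the bottom-right corner.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


best_score, top, bottom = global_alignment("misspelled", "mispeld")
print(best_score)  # 7 matches - 3 gaps = 4
print(top)
print(bottom)
```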
17 Calculation of Accepted Point Mutations
- Accepted point mutation (PAM)
- An exchange of one amino acid for another, accepted by natural selection. Dayhoff et al. (1978)
Tree sequences: ADGH, ECGH, AEIJ, DCIK (tips); ACGH, ACIK (internal nodes)
Substitution Counts: C↔D, C↔E, A↔E, A↔D, I↔G, K↔H
Matrix of accepted point mutations derived from the tree
Probabilities of residue substitution in a specified unit of evolutionary distance (PAM1)
Extrapolate to larger distances (PAM120, 250)
Log-odds scores
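The extrapolation step can be sketched as repeated matrix multiplication: a PAM-n probability matrix is the PAM1 matrix multiplied by itself n times, and scores are then taken as log-odds against background frequencies. The two-letter alphabet, the 0.99/0.01 probabilities and the equal background frequencies below are invented for illustration; the 10 x log10 scaling follows the usual convention for Dayhoff-style matrices.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, n):
    """Multiply a matrix by itself n times (n = 0 gives the identity)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


def log_odds(probability, background):
    """Score a substitution probability against chance (10 * log10 scaling)."""
    return 10 * math.log10(probability / background)


# Hypothetical two-letter alphabet: 1% of residues change per PAM unit.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
background = [0.5, 0.5]  # assumed equal residue frequencies

pam250 = mat_power(pam1, 250)
scores = [[log_odds(pam250[i][j], background[j]) for j in range(2)]
          for i in range(2)]
print(pam250[0])
print(scores[0])
```

At 250 PAMs the toy matrix is close to the background frequencies, which is why matrices for large distances carry weak per-residue signal.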
18 BLOSUM substitution matrices
- Developed for distantly related proteins
- Substitutions only from multiple alignments of conserved regions of protein families
- Identity threshold to define conserved blocks can be varied, e.g. 62% identity gives BLOSUM62
- Scores calculated from frequency of amino acids in aligned pairs compared to what would be expected due to abundance alone, given all sequences
Henikoff and Henikoff (1992)
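The last bullet describes a log-odds calculation: observed pair frequency divided by the frequency expected from residue abundance alone. A minimal sketch follows, using invented pair counts for a two-letter alphabet. The scaling (2 x log2, rounded to the nearest integer) follows the usual BLOSUM convention; the function name and the counts are hypothetical.

```python
import math
from collections import defaultdict


def blosum_style_scores(pair_counts):
    """Turn counts of unordered aligned pairs into rounded log-odds scores."""
    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}
    # Residue abundance: identical pairs count fully for that residue,
    # mixed pairs are shared equally between the two residues.
    abundance = defaultdict(float)
    for (a, b), freq in observed.items():
        if a == b:
            abundance[a] += freq
        else:
            abundance[a] += freq / 2
            abundance[b] += freq / 2
    scores = {}
    for (a, b), freq in observed.items():
        # Expected frequency of the pair if residues paired purely by abundance.
        if a == b:
            expected = abundance[a] * abundance[b]
        else:
            expected = 2 * abundance[a] * abundance[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Invented counts of aligned pairs taken from conserved blocks.
toy_counts = {("A", "A"): 60, ("A", "S"): 20, ("S", "S"): 20}
toy_scores = blosum_style_scores(toy_counts)
print(toy_scores)  # {('A', 'A'): 1, ('A', 'S'): -2, ('S', 'S'): 2}
```

Pairs seen more often than abundance predicts score positive; pairs seen less often score negative. A pair with a zero count would need special handling that this sketch omits.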
19 Series of substitution matrices
Less divergent -> More divergent
BLOSUM80  BLOSUM62  BLOSUM45
PAM1      PAM120    PAM250
- BLOSUM62 is a general purpose matrix and the default choice in many programs; normally fine for most searches
- Different matrices could be chosen if distant or close homologues are specifically being sought.
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html
20 Local versus global alignment methods
Global: Best alignment over the length of two sequences.
- Accurate but slow.
GSAQVKGHGKKVADALTNAV
G   VK HGKKV  A
GNPKVKAHGKKVLGAFSDGL
Local: Best alignment fragment.
- Fast.
GSAQVKGHGKKVADALTNAV
    VK HGKKV
GNPKVKAHGKKVLGAFSDGL
Structure-based methods
BLAST, FASTA, SSEARCH
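Local alignment can be sketched with the Smith-Waterman variant of dynamic programming: cell scores are never allowed to fall below zero, and the traceback starts from the highest-scoring cell instead of the corner. The scores used here (match +2, mismatch -1, gap -2) are assumptions for illustration, not values from the slide.

```python
def local_alignment(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return (score, fragment_a, fragment_b) for a best local alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere instead of going negative.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    frag_a, frag_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            frag_a.append(seq_a[i - 1])
            frag_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            frag_a.append(seq_a[i - 1])
            frag_b.append("-")
            i -= 1
        else:
            frag_a.append("-")
            frag_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(frag_a)), "".join(reversed(frag_b))


local_score, frag_a, frag_b = local_alignment("GSAQVKGHGKKVADALTNAV",
                                              "GNPKVKAHGKKVLGAFSDGL")
print(local_score)  # 7 matches * 2 - 1 mismatch = 13
print(frag_a)       # VKGHGKKV
print(frag_b)       # VKAHGKKV
```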
21 Fast Database Search Methods
Dynamic programming: Rigorous but slow.
Heuristic methods: Based on rules. Not guaranteed to find the optimal alignment but can give a good identification of sequence similarity. Fast enough to use on large databases.
BLAST, FASTA
22 The BLAST Search Algorithm
Query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDL
Query Word: PQG
Neighbourhood Words, Score Threshold (13):
PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Below threshold: PQA 12, PQN 12
High-scoring Segment Pair:
Query 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365
           LA  L    TP G R    W   P  D     ER     A
Sbjct 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/BLAST_algorithm.html
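The neighbourhood of the query word PQG can be sketched by scoring every possible three-letter word against it and keeping those at or above the threshold. The three matrix rows below are written from memory of BLOSUM62 and should be treated as assumptions to check against a published copy; the function name is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows for P, Q and G only (assumed BLOSUM62 values, column order as above).
MATRIX_ROWS = {
    "P": dict(zip(AMINO_ACIDS, [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3,
                                -3, -1, -2, -4, 7, -1, -1, -4, -3, -2])),
    "Q": dict(zip(AMINO_ACIDS, [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3,
                                -2, 1, 0, -3, -1, 0, -1, -2, -1, -2])),
    "G": dict(zip(AMINO_ACIDS, [0, -2, 0, -1, -3, -2, -2, 6, -2, -4,
                                -4, -2, -3, -3, -2, 0, -2, -2, -3, -3])),
}


def neighbourhood_words(word, rows, threshold):
    """Return {candidate: score} for all words scoring >= threshold vs word."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(rows[q][c] for q, c in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    # Highest-scoring words first.
    return dict(sorted(hits.items(), key=lambda item: -item[1]))


words = neighbourhood_words("PQG", MATRIX_ROWS, threshold=13)
for candidate, score in words.items():
    print(candidate, score)
```

With these rows the nine words at or above 13 match the list on the slide, and PQA and PQN fall just below the threshold at 12. A real implementation avoids enumerating all 8000 words per query position.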
23 Statistical Significance of Alignment Scores
hba_human.fa, 141 aa vs hbb_human.fa
        s-w   est
< 22      0     0
  24      0     0
  26      0     1
  28      4     2
  30      6     6
  32     15    11
  34     31    19
  36     45    29
  38     42    39
  40     54    49
  42     64    58
  44     53    64
  46     57    67
  48     65    68
  50     68    67
  52     61    63
  54     61    59
  56     49    54
  58     48    48
  60     39    43
  62     40    38
  64     26    33
  66     22    28
  68     28    24
  70     15    21
  72     19    17
  74     12    15
  76     14    12
  78     10    10
  80      9     8
  82     12     8
  84      5     6
  86      6     5
  88      4     4
  90      5     4
  92      1     3
  94      4     3
  96      0     2
  98      2     2
 100      1     1
 102      1     1
 104      2     1
 106      0     1
 108      0     1
 110      0     1
 112      0     0
>114      0     0   O
146000 residues in 1000 sequences, BLOSUM50 matrix, gap penalties: -12,-2
local shuffle, window size: 10
unshuffled s-w score: 381; shuffled score range: 29 - 106
Lambda: 0.092651  K: 0.004191  P(381) = 4.2203e-14
For 1000 sequences, a score >=381 is expected 4.22e-11 times
- Calculate scores for a set of random sequences
- Obtain a distribution
- Calculate the probability that a given score would be obtained by chance by a random sequence
E value (expectation value): The expected number of times that a score equal to or above S would be obtained by chance in a database search.
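The step from a fitted score distribution to P and E values can be sketched with the extreme-value formula commonly used for local alignment scores, P(S >= x) = 1 - exp(-K m n e^(-Lambda x)). Lambda, K and the score are taken from the output above. Using m = 141 (query length) and n = 146 (146000 residues / 1000 sequences) as the search-space lengths is an assumption, so the result is only expected to land near the reported P(381), not to reproduce it exactly.

```python
import math


def gumbel_p_value(score, lam, k, m, n):
    """Probability of a chance score >= `score` between lengths m and n."""
    expected_hits = k * m * n * math.exp(-lam * score)
    # 1 - exp(-x), computed with expm1 to stay accurate for tiny x.
    return -math.expm1(-expected_hits)


def e_value(p_value, n_sequences):
    """Expected number of chance hits when searching n_sequences sequences."""
    return p_value * n_sequences


p_estimate = gumbel_p_value(381, lam=0.092651, k=0.004191, m=141, n=146)
print(f"{p_estimate:.2e}")  # same order of magnitude as the reported 4.2203e-14

# Scaling the reported P value by the number of sequences gives the E value.
reported_e = e_value(4.2203e-14, 1000)
print(f"{reported_e:.2e}")  # 4.22e-11
```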
24 Information contained in a group of sequences helps to align them more accurately
Ambiguity in pairwise alignment:
Hb? GKVDVDVVGAQA
?
VGGNAPAY
Multiple-Sequence Alignment:
GKV--DVDVVGAQA
GKV--NVDEVGGEA
GKV--NEEEVGGEA
GKV--NVADCGAEA
GKVEADIPGHGETV
VGGNAPAY
25 Standard Sequence Alignment
Target   KINE-NYVLTVTQPGAYLVKITPHYAMGMIAL...
Template PMMDKEQAYSLTFTEAGTYDYHITPHP--GFM...
BLOSUM matrix
    A   C   D   E   F ...
A   4   0  -2   0  -2 ...
C       9  -3  -4  -2 ...
D           6   2  -3 ...
E               5  -3 ...
.                   . ...
Gap opening penalty, Gap extension penalty
Scoring table/gap penalties are general
Alignment Score = Σ(Similarity score) - Σ(Gap penalties)
26 Family-specific Scoring Scheme
Multiple Sequence Alignment
Family Profile
FHP_CANNO  (  35) TSTMYKYMFQTYPEVRSYFNMT   20
GLB1_ARTSX (  32) GKATFGKLFAAHPEYQQMFRFF   18
GLB1_CALSO (  50) SGIAMKRQALVFGAILQEFVAN   52
GLB1_GLYDI (  27) GKDCLIKFLSAHPQMAAVFGFS   14
GLB1_LUCPE (  14) WAKASAAWGTAGPEFFMALFDA   88
GLB1_LUMTE (  28) GLELWKGILREHPEIKAPFSRV   18
GLB1_PHESE (  28) SLHFWKEFLHDHPDLVSLFKRV   24
GLB1_SCAIN (  33) GVALMTTLFADNQETIGYFKRL   14
GLB1_TYLHE (  69) GFDILISVLDDKPVLDQALAHY   58
GLB2_ASCSU (  91) VDPHLRMSVHLEPKLWSEFWPI   64
GLB2_CALSO ( 104) LNELVKFIGNQQPAWKNVTAVI  100
GLB2_LUMTE (  30) SQAIWRATFAQVPESRSLFKRV   19
(See Schneider and Stephens, 1990)
27 Sequence Profile for Src Homology Domain 3
(Peptide) Length 53
ID SH3
AC PS90004
DE Src homology domain SH3.
Position Specific Score Matrix (PSSM): rows are alignment positions, columns are residue types
Cons    A    C    D    E    F    G    H    I    K    L    M    N    P    Q    Y  Gap  Len ..
F     -20  -30  -30  -40   20  -30  -20   10  -20    0  -10  -20  -30  -30   20  260   30
I     -10  -50  -20  -30  -20  -30    0   10   10  -10   10  -10  -20  -10  -40  260   30
A      20  -30   10    0  -50   20  -20  -10  -10  -30  -20   10   10    0  -50  260   30
L     -30  -80  -50  -40   20  -60  -20   20  -40   60   40  -30  -30  -20    0  260   30
Y     -40  -20  -60  -60   90  -70    0  -10  -50  -10  -30  -30  -60  -50  110  260   30
D      10  -60   30   30  -70    0    0  -20  -10  -40  -30   20    0   10  -60  260   30
Y     -50  -30  -60  -60  100  -70  -10  -10  -20  -10  -20  -30  -60  -50  110  260   30
K     -10  -60   10   10  -40  -20    0  -20   20  -30  -10   10  -10   10  -60  260   30
A      10  -40   10    0  -50   10  -10  -10    0  -30  -10   10    0    0  -60  260   30
R       0  -50    0    0  -50  -10    0  -10   10  -30  -10   10    0   10  -50  260   30
...
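A position-specific score matrix can be sketched from a handful of aligned family members: count residues per column, add pseudocounts, and take log-odds against background frequencies. The uniform background, the pseudocount of 1 and the log2 scaling are assumptions made for illustration, so the numbers will not match the profile shown on the slide. The five sequences are ungapped globin segments from the family alignment in this deck.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Score an ungapped segment of the same length against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


family = [
    "GKATFGKLFAAHPEYQQMFRFF",
    "GKDCLIKFLSAHPQMAAVFGFS",
    "GLELWKGILREHPEIKAPFSRV",
    "SLHFWKEFLHDHPDLVSLFKRV",
    "GFDILISVLDDKPVLDQALAHY",
]
pssm = build_pssm(family)

member_score = score_segment(pssm, "GKATFGKLFAAHPEYQQMFRFF")
other_score = score_segment(pssm, "TDREVYGAVGSQVTLHCSFWSS")  # unrelated segment
print(round(member_score, 1), round(other_score, 1))
```

Column 13 holds a proline in all five sequences, so proline scores well there while unseen residues score below zero. That position-specific weighting is what a single general matrix cannot express.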
28 Iterative Profile Searches: PSI-BLAST
Query: GKATFGKLFAAHPEYQQMFRFF
Search
Initial Matches:
GKATFGKLFAAHPEYQQMFRFF
GKDCLIKFLSAHPQMAAVFGFS
GLELWKGILREHPEIKAPFSRV
SLHFWKEFLHDHPDLVSLFKRV
GFDILISVLDDKPVLDQALAHY
PSSM Profile:
Pos  Ala  Cys  Glu  Asp  Phe  Gly  His ...
1    -20  -30  -30  -40   20  -30  -20
2    -10  -50  -20  -30  -20  -30    0
3     20  -30   10    0  -50   20  -20
4    -30  -80  -50  -40   20  -60  -20
5    -50  -30  -60  -60  100  -70  -10
...
Search again
Refined Matches:
TSTMYKYMFQTYPEVRSYFNMT
GKATFGKLFAAHPEYQQMFRFF
SGIAMKRQALVFGAILQEFVAN
GKDCLIKFLSAHPQMAAVFGFS
WAKASAAWGTAGPEFFMALFDA
GLELWKGILREHPEIKAPFSRV
SLHFWKEFLHDHPDLVSLFKRV
GVALMTTLFADNQETIGYFKRL
GFDILISVLDDKPVLDQALAHY
VDPHLRMSVHLEPKLWSEFWPI
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
http://www.ebi.ac.uk/blastpgp/
29 Pair-wise vs. Profile Searches
Lindahl and Elofsson (1999)
30 Biological Programming
- Programming Languages
  - Python: upcoming, my personal recommendation
  - Perl: popular
  - Java: web apps
  - C/C++: fast
- Courses
  - http://www.cam.ac.uk/cs/courses/
  - http://www.biomed.cam.ac.uk/gradschool/current/courses/bioinformatics.html
- Books
  - Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. R. Durbin, S.R. Eddy, A. Krogh, G. Mitchison. ISBN-13: 9780521629713, ISBN-10: 0521629713