Identification and Quantification of Polypeptide Similarity Tim

Slides: 31

Provided by: timst7

1
Sequence Similarity and Homology
2
The Importance of Sequence Similarity

Laboratory alignments
Sequence identification
Conservation analyses
Experimental design
Predict structure
Functional relationship
Drug discovery
Protein design
Large scale analyses
Annotation of sequencing projects
Comparative genomics

3
Sequence Conservation in Evolution
Amino acid replacements in a protein that survive
during the course of evolution tend to conserve
structure (and hence function). To be accepted,
the new amino acid side-chain usually functions
in a similar way to the old one.
Protein (3D structure, function):
GKVDVDVVGAQA
Known family (known 3D structure, known function):
GKV--NVDEVGGEA
GKV--NEEEVGGEA
GKV--NVADCGAEA
GKVEADIPGHGETV
4
Protein Domains
FIBRONECTIN
RECEPTOR TYROSINE KINASE EPH
PROTEIN KINASE SRC

Recurring evolutionary unit
Compact, spatially distinct
Fold in isolation
Functional units

Doolittle and Bork (1993)
5
Glossary
Homology: Alike because of shared ancestry (do not confuse with sequence similarity).
Homologue family: Group of evolutionarily related proteins.
Superfamily: A homologue family where membership might only be evident from the structure; ancestry is not always inferable from sequence.
Fold: A particular spatial arrangement and connectivity of major secondary structure elements.
Domain: An autonomously folding protein region; a protein region with distinct evolutionary history.
6
Different Types of Database Searching
7
Why compare protein sequences rather than DNA
sequences?

DNA bases match at random 1 time in 4
DNA sequences change more often than protein sequences because of codon degeneracy
With DNA, random matches to insertions can be high!
Amino acids can be matched by similarity

CATGA------------------ACGTATCCCAGTAACTC
CATGAGTCAGATGAGCAAAGTCAACGTATCCCAGTAACTC
Protein sequences can greatly improve the
signal-to-noise ratio.
8
Sequence likeness: Amino Acid Identity
Alignment of α and β chains of human haemoglobin
HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
           G   VK HGKKV  A     AH D        LS LH  KL
HBB_HUMAN  GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
Many identities -> highly similar
Identical at 18 of 41 positions: 18/41 = 44%
HBA_HUMAN  GSAQVKG-HGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
               V G  G  V
P0         TDREVYGAVGSQVTLHCSFWSSEWVSDDISFTWRYQPEGGRD
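The identity figure on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `percent_identity` is hypothetical, and it assumes the two rows are already aligned to equal length, with `-` marking gaps.

```python
def percent_identity(row_a, row_b):
    """Count identical aligned residues and express them as a percentage.

    Assumes both rows come from the same alignment (equal length) and that
    '-' marks a gap; gap columns never count as identities.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return identical, 100.0 * identical / len(row_a)


# Haemoglobin fragments as shown on the slide.
hba = "GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL"
hbb = "GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL"
count, pct = percent_identity(hba, hbb)
print(f"{count} identical positions of {len(hba)} ({pct:.0f}%)")
```

Dividing by the alignment length is one convention among several; dividing by the shorter sequence length or by the number of ungapped columns gives slightly different percentages.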
9
Correlation between sequence and structural
similarity
Chothia and Lesk (1986)
10
Chance Sequence Matches
Brenner et al. (1998)
No structural similarity but 39% identity!
11
How Proteins Change During Evolution
1. Single amino acid replacements
a) Result of random mutation in the gene
b) Acceptance by natural selection
2. Small-scale deletions and insertions: the gap problem
3. Larger-scale duplications and rearrangements
Non-random substitutions
12
How to quantify sequence likeness: Amino Acid Similarity
Alignment of α and β chains of human haemoglobin
HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
            G   VK HGKKV  A     AH D        LS LH  KL
HBB_HUMAN   GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
Alignment of human haemoglobin and leghaemoglobin from lupin
Biologically meaningful alignment, but low identity.
HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
                   H  KV      A               L  L   H  K
LGB2_LUPLU  NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
13
Sequence likeness: Similarity Score
Similarity score is the summation of scores
between pairs of aligned residue positions.
HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
           G   VK HGKKV  A     AH D        LS LH  KL
HBB_HUMAN  GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
Per-position scores: 5 1 -1 1 4 ... = Total Score
G . . G    5
N . . S    1
P . . A   -1
K . . Q    1
V . . V    4
Amino acid pair scores come from a scoring scheme
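The summation over aligned pairs can be sketched as below. The pair scores are only the five values listed on this slide, not a complete scoring scheme, and the names `PAIR_SCORES` and `similarity_score` are hypothetical.

```python
# Pair scores copied from the slide (HBB residue, HBA residue): score.
# This is a deliberately tiny table; a real scoring scheme covers all pairs.
PAIR_SCORES = {
    ("G", "G"): 5,
    ("N", "S"): 1,
    ("P", "A"): -1,
    ("K", "Q"): 1,
    ("V", "V"): 4,
}


def similarity_score(row_a, row_b, pair_scores):
    """Sum the score of each aligned residue pair (ungapped rows only)."""
    total = 0
    for x, y in zip(row_a, row_b):
        # Substitution scores are symmetric, so try both orderings.
        if (x, y) in pair_scores:
            total += pair_scores[(x, y)]
        else:
            total += pair_scores[(y, x)]
    return total


# First five aligned positions of HBB_HUMAN and HBA_HUMAN.
partial = similarity_score("GNPKV", "GSAQV", PAIR_SCORES)
print(partial)
```

Extending the table to every residue pair turns the same loop into the total score for the whole alignment.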
14
Substitution Matrices
PAM250
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   0   0   0  -8
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2  -1   0  -1  -8
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   2   1   0  -8
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   3   3  -1  -8
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -4  -5  -3  -8
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   1   3  -1  -8
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   3   3  -1  -8
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   0   0  -1  -8
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   1   2  -1  -8
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -2  -2  -1  -8
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -3  -3  -1  -8
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   1   0  -1  -8
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -2  -2  -1  -8
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -4  -5  -2  -8
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1  -1   0  -1  -8
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   0   0   0  -8
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   0  -1   0  -8
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -5  -6  -4  -8
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -3  -4  -2  -8
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4  -2  -2  -1  -8
B   0  -1   2   3  -4   1   3   0   1  -2  -3   1  -2  -4  -1   0   0  -5  -3  -2   3   2  -1  -8
Z   0   0   1   3  -5   3   3   0   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3  -1  -8
X   0  -1   0  -1  -3  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1   0   0  -4  -2  -1  -1  -1  -1  -8
*  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8   1
15
Gaps for Optimal Alignments
HBA_HUMAN  -VLSPADKTNVKAAWGKVGAHAGEYGAEALERM
HBB_HUMAN  VHLTPEEKSAVTALWGKVN--VDEVGGEALGRL
Gap: position 1 of HBA_HUMAN
Gap: positions 20-21 of HBB_HUMAN

Score = Σ(substitution scores) - Σ(gap penalties)
Gap penalties: insertion penalty, extension penalties
Best alignment = Maximum Score
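The score formula on this slide can be sketched as follows. The convention here is an assumption: the first position of each gap costs the insertion penalty and every further position costs the extension penalty. The default values 12 and 2 are borrowed from the FASTA run shown later in the talk, and the identity scoring function is a toy stand-in for a real substitution matrix.

```python
def score_alignment(row_a, row_b, substitution, gap_open=12, gap_extend=2):
    """Score = sum(substitution scores) - sum(gap penalties).

    Assumed convention: the first column of a gap costs gap_open and each
    additional column of the same gap costs gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    previous_gap = None  # which row was gapped in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            current_gap = "a" if x == "-" else "b"
            total -= gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            total += substitution(x, y)
            previous_gap = None
    return total


def identity(x, y):
    """Toy substitution score: 1 for a match, 0 otherwise (an assumption)."""
    return 1 if x == y else 0


hba = "-VLSPADKTNVKAAWGKVGAHAGEYGAEALERM"
hbb = "VHLTPEEKSAVTALWGKVN--VDEVGGEALGRL"
print(score_alignment(hba, hbb, identity))
```

With a real matrix such as BLOSUM62 the matched columns contribute much larger positive scores, so the two gaps become affordable; the toy identity scoring is used here only to keep the sketch self-contained.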
16
Dynamic Programming
misspelled
mis-pel--d
(misspelled vs. mispeld)
At each point, keep only the best route to that point. Other routes
are cut: they cannot possibly be better for an
alignment that goes through this point.
Guaranteed optimum for two sequences
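The slide's spelling example can be aligned with a small dynamic programming routine. This is a sketch of the standard Needleman-Wunsch global alignment, not necessarily the exact variant used for the slide; the scoring values (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming with a linear gap penalty."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back one optimal route from the bottom-right corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = global_align("misspelled", "mispeld")
print(best)
print(row_a)
print(row_b)
```

Several alignments can share the optimal score, so the gap placement printed by the traceback may differ from the one drawn on the slide while still being optimal.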
17
Calculation of Accepted Point Mutations
Observed sequences: ADGH, ECGH, AEIJ, DCIK

Accepted point mutation (PAM)
An exchange of one amino acid for another,
accepted by natural selection. Dayhoff et al. (1978)

Inferred ancestral sequences: ACGH, ACIK
Substitution Counts: C<->D, C<->E, A<->E, A<->D, I<->G, K<->H
Probabilities of residue substitution in a
specified unit of evolutionary distance (PAM1)
Extrapolate to larger distances (PAM120, 250)
Matrix of accepted point mutations derived from
the tree
Log-odds scores
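A log-odds score compares how often a pair is observed aligned with how often it would be expected by chance. The sketch below uses the 10 x log10 scaling usually quoted for PAM matrices; the frequencies are made-up illustrative numbers, not values from Dayhoff's data, and the function name `log_odds` is hypothetical.

```python
import math


def log_odds(pair_frequency, freq_a, freq_b, scale=10.0):
    """Rounded log-odds score for an aligned residue pair.

    pair_frequency: how often the pair is seen aligned (observed)
    freq_a, freq_b: background frequencies of the two residues
    scale: 10 * log10 is the scaling assumed here
    """
    odds = pair_frequency / (freq_a * freq_b)
    return round(scale * math.log10(odds))


# Illustrative numbers only: a pair seen twice as often as chance predicts
# scores positive, a pair seen half as often scores negative.
print(log_odds(0.02, 0.1, 0.1))
print(log_odds(0.005, 0.1, 0.1))
```

Positive entries in a matrix such as PAM250 therefore mark substitutions that are accepted more often than chance alone would predict.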
18
BLOSUM substitution matrices

Developed for distantly related proteins
Substitutions only from multiple alignments of
conserved regions of protein families
Identity threshold to define conserved blocks can
be varied, e.g. 62% identity gives BLOSUM62
Scores calculated from frequency of amino acids
in aligned pairs compared to what would be
expected due to abundance alone, given all
sequences

Henikoff and Henikoff (1992)
19
Series of substitution matrices
Less divergent <-> More divergent
BLOSUM80  BLOSUM62  BLOSUM45
PAM1      PAM120    PAM250

BLOSUM62 is a general-purpose matrix and the
default choice in many programs; it is normally fine
for most searches.
Different matrices could be chosen if distant or
close homologues are specifically being sought.

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html
20
Local versus global alignment methods
Global: Best alignment over the length of two sequences. Accurate but slow.
GSAQVKGHGKKVADALTNAV
G   VK HGKKV  A
GNPKVKAHGKKVLGAFSDGL
Local: Best alignment fragment. Fast.
GSAQVKGHGKKVADALTNAV
    VK HGKKV
GNPKVKAHGKKVLGAFSDGL
Structure-based methods
BLAST, FASTA, SSEARCH
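The difference between global and local alignment shows up as a single change in the dynamic programming recurrence: a local method lets the score restart from zero. The sketch below is a plain Smith-Waterman scorer, the rigorous local algorithm (SSEARCH is an implementation of it). The scoring values and the function name are assumptions for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                         # restart: drop a poor prefix
                previous[j - 1] + pair,    # extend along the diagonal
                previous[j] + gap,         # gap in b
                current[j - 1] + gap,      # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# The two fragments from the slide; the conserved core drives the score.
print(local_alignment_score("GSAQVKGHGKKVADALTNAV", "GNPKVKAHGKKVLGAFSDGL"))
```

Only two rows of the table are kept because the sketch reports the score alone; recovering the aligned fragment itself would need the full table and a traceback from the best cell.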
21
Fast Database Search Methods
Dynamic programming: Rigorous but slow.
Heuristic methods: Based on rules. Not
guaranteed to find the optimal alignment but can
give a good identification of sequence
similarity. Fast enough to use on large databases.
BLAST, FASTA
22
The BLAST Search Algorithm
Query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDL
Query Word: PQG

Neighbourhood Words
PQG 18
PEG 15
PRG 14
PKG 14
PNG 13
PDG 13
PHG 13
PMG 13
PSG 13
Score Threshold (13)
PQA 12
PQN 12

High-scoring Segment Pair
Query 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365
           LA  L    TP G R    W   P  D     ER     A
Sbjct 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/BLAST_algorithm.html
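The neighbourhood word list can be reproduced by scoring every possible three-letter word against the query word. The sketch below assumes the slide's scores come from BLOSUM62 (the values 18, 15, 14, 13 and 12 are consistent with that) and includes only the three matrix rows needed for the word PQG; the names `ROWS` and `neighbourhood_words` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for P, Q and G, in the column order of AMINO_ACIDS.
# Assumption: the slide used BLOSUM62; only these three rows are needed.
ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def neighbourhood_words(word, threshold):
    """All words of the same length scoring >= threshold against `word`."""
    per_position = [dict(zip(AMINO_ACIDS, ROWS[residue])) for residue in word]
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(per_position[i][aa] for i, aa in enumerate(candidate))
        if total >= threshold:
            hits.append(("".join(candidate), total))
    # Highest score first, then alphabetical for a stable order.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


for neighbour, word_score in neighbourhood_words("PQG", 13):
    print(neighbour, word_score)
```

Exhaustive enumeration is fine for three-letter words (8000 candidates). The sketch only illustrates how the threshold selects the list and says nothing about how BLAST organises the lookup internally.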
23
Statistical Significance of Alignment Scores
hba_human.fa, 141 aa vs hbb_human.fa
 s-w  obs  est
< 22    0    0
  24    0    0
  26    0    1
  28    4    2
  30    6    6
  32   15   11
  34   31   19
  36   45   29
  38   42   39
  40   54   49
  42   64   58
  44   53   64
  46   57   67
  48   65   68
  50   68   67
  52   61   63
  54   61   59
  56   49   54
  58   48   48
  60   39   43
  62   40   38
  64   26   33
  66   22   28
  68   28   24
  70   15   21
  72   19   17
  74   12   15
  76   14   12
  78   10   10
  80    9    8
  82   12    8
  84    5    6
  86    6    5
  88    4    4
  90    5    4
  92    1    3
  94    4    3
  96    0    2
  98    2    2
 100    1    1
 102    1    1
 104    2    1
 106    0    1
 108    0    1
 110    0    1
 112    0    0
>114    0    0
146000 residues in 1000 sequences
BLOSUM50 matrix, gap penalties: -12,-2
local shuffle, window size: 10
unshuffled s-w score: 381; shuffled score range: 29 - 106
Lambda: 0.092651 K: 0.004191; P(381) = 4.2203e-14
For 1000 sequences, a score >=381 is expected 4.22e-11 times

Calculate scores for a set of random sequences
Obtain a distribution
Calculate the probability that a given score
would be obtained by chance by a random sequence

E value (expectation value): The expected number
of times that a score equal to or above S would
be obtained by chance in a database search.
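The relationship between a score, its chance probability and the E value can be sketched with the Karlin-Altschul form E = K·m·n·exp(-Lambda·S), where Lambda and K are the fitted parameters reported in the output above. This is a hedged illustration: the exact length corrections a given program applies are not shown on the slide, so the sketch is not expected to reproduce the program's P(381) digit for digit. All function names are hypothetical.

```python
import math


def expected_chance_alignments(score, query_len, subject_len, lam, k):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * subject_len * math.exp(-lam * score)


def probability_from_expectation(expectation):
    """P(at least one chance hit) = 1 - exp(-E), assuming Poisson counts."""
    return -math.expm1(-expectation)


def database_e_value(p_single, n_sequences):
    """Expected number of chance hits when n_sequences are searched."""
    return p_single * n_sequences


# Scaling the slide's single-comparison probability to 1000 sequences
# gives the expectation quoted in the program output.
print(database_e_value(4.2203e-14, 1000))
```

For very small values E and P are almost equal, which is why the two are often quoted interchangeably for strong hits.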
24
Information contained in a group of sequences
helps to align them more accurately
Multiple-Sequence Alignment
Hb? GKVDVDVVGAQA
Ambiguity in pairwise alignment
?
VGGNAPAY
GKV--DVDVVGAQA
GKV--NVDEVGGEA
GKV--NEEEVGGEA
GKV--NVADCGAEA
GKVEADIPGHGETV
VGGNAPAY
25
Standard Sequence Alignment
Target    KINE-NYVLTVTQPGAYLVKITPHYAMGMIAL...
Template  PMMDKEQAYSLTFTEAGTYDYHITPHP--GFM...
BLOSUM matrix
    A   C   D   E   F ...
A   4   0  -2   0  -2 ...
C       9  -3  -4  -2 ...
D           6   2  -3 ...
E               5  -3 ...
.                   . ...
Gap opening penalty
Gap extension penalty
Scoring table/gap penalties are general
Alignment Score = Σ(Similarity score) - Σ(Gap penalties)
26
Family-specific Scoring Scheme
Multiple Sequence Alignment
Family Profile
FHP_CANNO  (  35) TSTMYKYMFQTYPEVRSYFNMT  20
GLB1_ARTSX (  32) GKATFGKLFAAHPEYQQMFRFF  18
GLB1_CALSO (  50) SGIAMKRQALVFGAILQEFVAN  52
GLB1_GLYDI (  27) GKDCLIKFLSAHPQMAAVFGFS  14
GLB1_LUCPE (  14) WAKASAAWGTAGPEFFMALFDA  88
GLB1_LUMTE (  28) GLELWKGILREHPEIKAPFSRV  18
GLB1_PHESE (  28) SLHFWKEFLHDHPDLVSLFKRV  24
GLB1_SCAIN (  33) GVALMTTLFADNQETIGYFKRL  14
GLB1_TYLHE (  69) GFDILISVLDDKPVLDQALAHY  58
GLB2_ASCSU (  91) VDPHLRMSVHLEPKLWSEFWPI  64
GLB2_CALSO ( 104) LNELVKFIGNQQPAWKNVTAVI 100
GLB2_LUMTE (  30) SQAIWRATFAQVPESRSLFKRV  19
(See Schneider and Stephens, 1990)
27
Sequence Profile for Src Homology Domain 3
(Peptide) Length: 53
ID  SH3
AC  PS90004
DE  Src homology domain SH3.
Cons    A    C    D    E    F    G    H    I    K    L    M    N    P    Q    Y  Gap  Len ..
F     -20  -30  -30  -40   20  -30  -20   10  -20    0  -10  -20  -30  -30   20  260   30
I     -10  -50  -20  -30  -20  -30    0   10   10  -10   10  -10  -20  -10  -40  260   30
A      20  -30   10    0  -50   20  -20  -10  -10  -30  -20   10   10    0  -50  260   30
L     -30  -80  -50  -40   20  -60  -20   20  -40   60   40  -30  -30  -20    0  260   30
Y     -40  -20  -60  -60   90  -70    0  -10  -50  -10  -30  -30  -60  -50  110  260   30
D      10  -60   30   30  -70    0    0  -20  -10  -40  -30   20    0   10  -60  260   30
Y     -50  -30  -60  -60  100  -70  -10  -10  -20  -10  -20  -30  -60  -50  110  260   30
K     -10  -60   10   10  -40  -20    0  -20   20  -30  -10   10  -10   10  -60  260   30
A      10  -40   10    0  -50   10  -10  -10    0  -30  -10   10    0    0  -60  260   30
R       0  -50    0    0  -50  -10    0  -10   10  -30  -10   10    0   10  -50  260   30
Residue Types
...

Position Specific Score Matrix (PSSM)
Alignment Position
28
Iterative Profile Searches: PSI-BLAST
Query: GKATFGKLFAAHPEYQQMFRFF
Search
Initial Matches
GKATFGKLFAAHPEYQQMFRFF
GKDCLIKFLSAHPQMAAVFGFS
GLELWKGILREHPEIKAPFSRV
SLHFWKEFLHDHPDLVSLFKRV
GFDILISVLDDKPVLDQALAHY
PSSM Profile
Pos  Ala  Cys  Glu  Asp  Phe  Gly  His ...
1    -20  -30  -30  -40   20  -30  -20
2    -10  -50  -20  -30  -20  -30    0
3     20  -30   10    0  -50   20  -20
4    -30  -80  -50  -40   20  -60  -20
5    -50  -30  -60  -60  100  -70  -10
...
Search again
Refined Matches
TSTMYKYMFQTYPEVRSYFNMT
GKATFGKLFAAHPEYQQMFRFF
SGIAMKRQALVFGAILQEFVAN
GKDCLIKFLSAHPQMAAVFGFS
WAKASAAWGTAGPEFFMALFDA
GLELWKGILREHPEIKAPFSRV
SLHFWKEFLHDHPDLVSLFKRV
GVALMTTLFADNQETIGYFKRL
GFDILISVLDDKPVLDQALAHY
VDPHLRMSVHLEPKLWSEFWPI
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
http://www.ebi.ac.uk/blastpgp/
29
Pair-wise vs. Profile Searches
Lindahl and Elofsson (1999)
30
Biological Programming

Programming Languages
Python: upcoming, my personal recommendation
Perl: popular
Java: web apps
C/C++: fast
Courses
http://www.cam.ac.uk/cs/courses/
http://www.biomed.cam.ac.uk/gradschool/current/courses/bioinformatics.html
Books
Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids. R. Durbin,
S.R. Eddy, A. Krogh, G. Mitchison. ISBN-13:
9780521629713; ISBN-10: 0521629713