Title: Day 2: Homology prediction
1Day 2: Homology prediction
- (Parts of) proteins are homologous when they share a common ancestor.
- Why are we interested?
  - Function prediction -> homologous proteins tend to have similar functions
  - Evolutionary dynamics -> tracing the evolution of protein families
2How do we know what proteins do?
Sequence similarity
Functional similarity
Evolutionary origin
Similar 3D structure
Similar sequences have a similar 3D structure and
similar functions
3The importance of sequence similarity
Compare with a database of proteins
new, unknown protein
similar sequence with known function, e.g. a
protein kinase
Extrapolate the function
4Similar function
What is function? There are various levels of description.
Sequence similarity (homology) has the largest relevance for molecular function.
This is the aspect of protein function that is best conserved; protein sequence and structure can often be interpreted in terms of molecular function.
5How do we detect homology?
- Similarity of 3D structure -> the most conserved aspect, yet few structures are available. Structures are compared and classified by eye (A. Murzin, SCOP) and by software packages (Dali). More info on 3D in Bioinf II.
- Sequence -> less conserved, but many sequences are available. Homology determination is mainly based on theoretical models of sequence evolution and on the likelihood that, when you compare a sequence to a database, you will find a sequence of at least that similarity by chance.
- 3D structure similarity is used as a benchmark for the detection of homology by sequence similarity.
6Models of protein sequence evolution -> ways of aligning sequences and of judging sequence similarity.
- The simplest model: all amino acids are equally dissimilar and are replaced at equal rates, independent of the position in the sequence (aligning sequences purely on matches, i.e. identity matrices).
- A more complicated model: some amino acids are more equal than others, independent of the position in the sequence (aligning sequences based on similarity matrices).
- Bootstrapping in deriving the substitution/similarity matrices: the matrices themselves are based on some sequence alignments, and hence on some model of evolution.
7BLOSUM (62, 80 etc.) (BLOcks SUbstitution Matrix): based on (gapless) alignments of sequences that are at most 62%, 80% etc. identical. By choosing gapless alignments (Blocks), the matrices are relatively independent of the model used to make the alignment.
PAM (Percentage Accepted Mutation) matrices: PAM(80, 250 etc.). Extrapolation from the PAM1 matrix (the substitution frequencies for sequences that are 99% identical) by multiplying the matrix with itself N times -> PAM N. Independence from the alignment is obtained by extrapolating from PAM1.
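The extrapolation from PAM1 to PAM N can be sketched as repeated matrix multiplication. This is only an illustration: the 3-letter alphabet and the values in `TOY_PAM1` are invented (real PAM1 matrices are 20x20 and derived from alignments), and `mat_mul` / `mat_power` are helper names made up for this sketch.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Return m to the n-th power (identity multiplied by m, n times)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1-like" matrix for a toy 3-letter alphabet (assumption,
# not real data). Each row holds the probabilities that a residue stays the
# same (diagonal, about 99%) or is replaced by another residue.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print(["%.3f" % p for p in row])
```

In this toy case the diagonal shrinks and the off-diagonal entries grow as N increases, while every row still sums to 1.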
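8The scores are the log-odds of the observed substitution frequencies divided by the frequencies of the individual amino acids:
$S_{ij} = \ln \frac{q_{ij}}{p_i p_j}$
where $q_{ij}$ is the observed frequency of the substitution pair $i, j$, and $p_i$, $p_j$ are the frequencies of the individual amino acids.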
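As a small worked example of the log-odds score, the function below computes it from the three frequencies. The frequencies used are hypothetical numbers chosen for illustration, and `log_odds_score` is a name invented for this sketch.

```python
import math


def log_odds_score(q_ij, p_i, p_j):
    """Log-odds score: ln of observed pair frequency over chance expectation."""
    return math.log(q_ij / (p_i * p_j))


# Hypothetical frequencies (assumptions, not taken from a real matrix):
# both amino acids have a background frequency of 0.1, so a pair is
# expected by chance with frequency 0.1 * 0.1 = 0.01.
print(log_odds_score(0.02, 0.1, 0.1))   # pair seen twice as often as chance: positive
print(log_odds_score(0.005, 0.1, 0.1))  # pair seen half as often as chance: negative
```

Pairs observed more often than expected by chance get a positive score, pairs observed less often get a negative one.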
9PAM matrices become flatter when N increases.
Disadvantage of PAM and, to a lesser extent, BLOSUM matrices: in the long run all sequences will become equally similar to each other, and because the model of evolution is the same for all sites, there are no truly conserved positions.
10[Figure panels: PAM10, PAM100, PAM200, PAM490]
11Sequence alignment
Give a cost to gaps (opening and extension costs), just like you give costs to mismatches in the matrix. Dynamic programming (Needleman & Wunsch, 1970) obtains the best alignment between pairs of sequences.
Filling in the matrix
Initialization step
12Filling in the matrix
Traceback
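The three steps named on these slides (initialization, filling in the matrix, traceback) can be sketched as follows. This is a minimal version under simplifying assumptions: a single linear gap cost instead of separate opening and extension costs, and a plain match/mismatch score instead of a substitution matrix. The parameter values are arbitrary.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a linear gap cost."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialization step: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Filling in the matrix: best of diagonal (match/mismatch), up or left (gap).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback: walk back from the bottom-right corner to the top-left.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best_score, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
print(best_score)
print(aligned_a)
print(aligned_b)
```

Replacing the match/mismatch rule with a lookup in a BLOSUM or PAM matrix, and the linear gap cost with opening and extension costs, gives the scheme described on the slide.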
13Global alignment vs. local alignment
Often only parts of the sequences are homologous (e.g. because of gene fusion or recombination) -> one would like to detect and align the homologous parts and eliminate the noise from the non-homologous parts (Smith & Waterman, 1981).
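A minimal sketch of local alignment, assuming a linear gap cost and a simple match/mismatch score (both simplifications with arbitrary values). The difference from global alignment is that cell scores are never allowed to drop below zero and the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment: returns (score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment start anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback from the best cell until a zero-scoring cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Two made-up sequences that share only the segment AAGIW.
print(smith_waterman("MKVLAAGIWPT", "QQAAGIWRR"))
```

Only the shared segment is reported; the unrelated flanking residues do not contribute to the score.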
14- One can in principle also use Needleman-Wunsch to align multiple sequences, but computationally this is infeasible.
- Heuristics: most programs do progressive alignments, starting with all pairwise alignments (recent inventions like MUSCLE skip that step) and proceeding from there. Manual checking/refinement is generally required.
- Multiple sequence alignment is intimately connected to phylogeny and tree reconstruction.
15- For database searches Smith-Waterman is generally too expensive; heuristic approaches like BLAST and FASTA have been developed. They approximate the sensitivity of Smith-Waterman.
- BLAST and FASTA work by first detecting short (nearly) identical pieces between sequences, and then filling in the gaps.
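The first stage, detecting short identical pieces, can be sketched with a word index. This is a simplification and not the actual BLAST or FASTA implementation: it only finds exactly identical words of length `k` (an arbitrary choice here), whereas the real programs also accept nearly identical words and then extend the hits.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where identical k-letter words occur."""
    # Index every word of length k in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Scan the subject and look each of its words up in the index.
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


# Made-up sequences sharing the segment AAGIW.
seeds = find_seeds("MKVLAAGIWPT", "QQAAGIWRR")
print(seeds)
# Seeds with the same value of query_pos - subject_pos lie on one diagonal
# and are candidates for being joined into a longer alignment.
print(sorted({i - j for i, j in seeds}))
```

Because only the seeded regions are examined further, most of the dynamic programming matrix never has to be filled in.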
16E-values
- Theory based on extreme value distributions: comparing two random sequences with each other will not tend to give you a high similarity, but when you compare one sequence with a large set of sequences you will always end up with some high-scoring hits -> the extreme values. For your hit to be significant it has to be better than those expected extreme values.
- E-value: the expected number of hits of that similarity if the sequence had been compared to a database of random sequences.
- Experimental benchmarking of E-values by comparisons of 3D structures (e.g. Brenner et al., 1996), where we know what is homologous and what is not.
17How many hits of a certain quality/score (e.g. the Smith-Waterman score) do you expect if you were to compare your sequence to a random database?
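One common way to answer this question is the Karlin-Altschul formula used in BLAST statistics, E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. The sketch below assumes that formula; the default values for `lam` and `k` are placeholders chosen for illustration, since the real values depend on the scoring matrix and gap costs.

```python
import math


def expected_hits(score, query_len, db_len, lam=0.3, k=0.1):
    """Expected number of chance hits scoring at least `score`.

    lam and k are placeholder values (assumptions); in practice they are
    estimated for a specific substitution matrix and gap penalty.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# A 300-residue query against a database of 1 million and 10 million residues.
for db_len in (1_000_000, 10_000_000):
    for score in (40, 60, 80):
        print(db_len, score, expected_hits(score, 300, db_len))
```

The expected number of hits drops exponentially with the score and grows linearly with the database size, which is why the same score is less significant in a larger database.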
18How do we know for sure that significantly similar sequences are truly homologous (aside from the statistical argument)?
EGF
TGFa
3D structural similarity evolves at a lower rate than sequence similarity and is being used to test the quality of the statistics
19Benchmarking homology detection with the Smith-Waterman algorithm, using 3D structures (PDB40) as the gold standard for what is homologous and what is not.
Use those E-values.
20Sequences that are not significantly similar do
not have to be non-homologous
BolA (red) and OsmC (green) have no significant
similarity at the sequence level, but are
significantly similar at the 3D level.
21Practical and theoretical considerations in pairwise sequence comparisons
- One of the assumptions behind the statistics is that the sequences are random -> no low-complexity areas (SEG masks them: XXXXXXX).
- Convergence in sequence and in structure space occurs, e.g. in transmembrane and coiled-coil areas -> no homology, but convergence in structure and in sequence space.
- Databases are assumed to be non-redundant -> in redundant databases E-values are too high. Solution: compare against non-redundant databases.
- E-values are based on the whole sequence -> search with separate domains if you have indications that there are such.
- Databases are full of indirectly annotated proteins -> there is no solution, except manually checking which annotations are reliable.
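The idea of masking low-complexity areas with X can be illustrated with a sliding-window Shannon entropy filter. This is a simplified stand-in and not the SEG algorithm itself; the window size and entropy threshold are arbitrary assumptions.

```python
import math
from collections import Counter


def window_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    total = len(segment)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(segment).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue inside a low-entropy window by 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < min_entropy:
            for pos in range(start, start + window):
                masked[pos] = "X"
    return "".join(masked)


# Made-up sequence with a serine run between two compositionally diverse parts.
sequence = "MKVLAAGIWPTD" + "S" * 12 + "QERHNYFCMKVL"
print(mask_low_complexity(sequence))
```

Because whole windows are masked, a few residues flanking the repeat are masked as well; searching with the masked sequence keeps the repeat from producing spuriously significant hits.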
22- The main increase in sensitivity (2- to 3-fold) comes from profile-based searches, like Position-Specific Iterated BLAST (PSI-BLAST), and from Hidden Markov Models (HMMs).
- PSI-BLAST and HMMs allow more complicated models of sequence evolution: rather than one substitution matrix that is equal for all positions, we have one for each position, as well as position-specific gap penalties (positions are regarded as independent, though).
- Again, as with gene prediction, building a mathematical, probabilistic model that generates our protein domain allows us to assess the probability that any sequence of interest has been generated by any specific model.
23A very simple Hidden Markov Model (no insertions/deletions)
Pos. 1: P(A)=0.01 P(C)=0.8 P(E)=0.1 etc.
Pos. 2: P(A)=0.3 P(C)=0.01 P(E)=0.02 etc.
Pos. 3: P(A)=0.05 P(C)=0.01 P(E)=0.4 etc.
Pos. 4: P(A)=0.01 P(C)=0.01 P(E)=0.3 etc.
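For this model without insertions or deletions, the probability that a sequence was generated by the model is simply the product of the position-specific probabilities. The sketch below uses the values listed on the slide for A, C and E only; the uniform background of 1/20 per residue in `log_odds` is an assumption added for illustration.

```python
import math

# Emission probabilities as listed on the slide; only A, C and E are given
# there, so this sketch can only score sequences made of those residues.
MODEL = [
    {"A": 0.01, "C": 0.8, "E": 0.1},    # Pos. 1
    {"A": 0.3, "C": 0.01, "E": 0.02},   # Pos. 2
    {"A": 0.05, "C": 0.01, "E": 0.4},   # Pos. 3
    {"A": 0.01, "C": 0.01, "E": 0.3},   # Pos. 4
]


def sequence_probability(seq, model=MODEL):
    """Probability that the model emits seq (positions are independent)."""
    if len(seq) != len(model):
        raise ValueError("sequence length must equal the number of positions")
    prob = 1.0
    for residue, emissions in zip(seq, model):
        prob *= emissions[residue]
    return prob


def log_odds(seq, model=MODEL, background=1.0 / 20):
    """Log of model probability over a uniform random background (assumption)."""
    return math.log(sequence_probability(seq, model) / background ** len(seq))


for candidate in ("CAEE", "ACCA"):
    print(candidate, sequence_probability(candidate), log_odds(candidate))
```

A sequence that follows the preferred residue at each position gets a positive log-odds score, and one that uses the disfavoured residues gets a negative one.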
24A slightly more complicated Hidden Markov Model (with insertions (I) / deletions (D))
[Figure: model with match (M), insert (I) and delete (D) states]
25- You can get all the obvious homologs, align them, make an HMM using HMMER, and run it somewhere on a big computer (dynamic programming, slow).
- Or you can use PSI-BLAST! (Altschul et al., 1997)
  - Relatively fast and easy, and a bit less accurate
  - (the alignment never exceeds the length of the seed protein)
- An example of how making powerful bioinformatics tools easily accessible leads to increased usage and a speed-up of research.
26Comparison of various homology search techniques in terms of sensitivity (number of homologs detected) and selectivity (number of non-homologs detected).
SAM-T98: HMM. ISS: Intermediate Sequence Search.
27After 1) sequence vs. sequence and 2) sequence vs. profile -> 3) profile vs. profile, e.g. COMPASS, HHsearch
28Using profile-to-profile comparisons, BolA is
predicted to be homologous to OsmC
29Protein domains
Proteins often consist of multiple domains.
- Separate in structure -> structural definition (needs a 3D structure)
- Separate in evolution -> comparative sequence analysis definition
30The multidomain architecture of proteins is one
of the reasons why protein function prediction by
best hit homology search is often incorrect.
[Figure labels: domains 1 and 2; proteins A and B]
Protein B is wrongly annotated as having the
function of domain 1, based on homology with the
multidomain protein A, but not with domain 1
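One way to guard against this error is to transfer only the functions of those domains that the alignment actually covers. The sketch below is a hypothetical check, not an existing tool: the function name, the domain coordinates, the function labels and the 80% coverage threshold are all invented for illustration.

```python
def transferable_functions(aligned_start, aligned_end, domains, min_coverage=0.8):
    """Functions of protein A that may be transferred to protein B.

    aligned_start/aligned_end: region of A covered by the alignment with B
    (1-based, inclusive). domains: list of (name, start, end, function) on A.
    A function is only returned if its domain is covered for at least
    min_coverage of its length.
    """
    supported = []
    for name, start, end, function in domains:
        overlap = min(aligned_end, end) - max(aligned_start, start) + 1
        coverage = max(0, overlap) / (end - start + 1)
        if coverage >= min_coverage:
            supported.append(function)
    return supported


# Hypothetical two-domain protein A.
protein_a_domains = [
    ("domain 1", 1, 150, "kinase activity"),
    ("domain 2", 180, 400, "DNA binding"),
]

# Protein B aligns only to residues 185-395 of A, i.e. to domain 2.
print(transferable_functions(185, 395, protein_a_domains))
```

With this check, protein B inherits only the function of the domain it is actually homologous to.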
31[Figure labels: domains 1 and 2; proteins A and B]
Protein B is incompletely annotated as having the function of domain 2, based on homology with the single-domain protein A; the second domain is missed in the annotation.
32Other reasons for mis- or underannotation
[Figure labels: A, B] Undetected homology
[Figure labels: C, A, B] A and B are homologous, but the annotation of A itself is wrong
33An example of the spreading of misannotation
The Methanococcus jannaschii protein MJ1612 is homologous to a protein annotated as sp|Q54271|BCPC_STRHY 3-PHOSPHONOPYRUVATE DECARBOXYLASE (Expect = 1e-15, Identities = 89/315 (28%), Positives = 138/315 (43%), Gaps = 24/315 (7%)), whose function is based on an unpublished observation/direct submission. It is also homologous to phosphoglycerate mutase (part of glycolysis), an enzyme that appears to be missing from archaeal genomes. Because of this, and because the results were unpublished, some researchers have annotated MJ1612 as phosphoglycerate mutase in some pathway papers...
Meanwhile the true 3-PHOSPHONOPYRUVATE DECARBOXYLASE has been found and published, and it is not sp|Q54271|BCPC_STRHY. Still, sequences from newly published genomes are being annotated as 3-PHOSPHONOPYRUVATE DECARBOXYLASE, and the old, likely wrongly annotated enzyme is still listed as such in the database. Recent experimental results indicate that the archaeal enzymes are indeed, as suspected, phosphoglycerate mutases.
34A
B
B
A and B are homologous, but some critical
residues in B are missing. B has e.g. lost its
activity, or is catalyzing the conversion of
another substrate.
35Levels of sequence conservation correlate with
levels of structural and functional conservation.
Catalytic activity is better conserved than
substrate specificity.
36Homology does not imply that proteins have the
same function
Oxoglutarate carrier
Uncoupling protein (H+)
37However, aspects of the molecular function, like
a transporter function, are conserved between
homologs
Oxoglutarate carrier
Uncoupling protein
malate
H+
Oxoglutarate
Mitochondrion
Mitochondrion
The oxoglutarate carrier and the uncoupling protein do similar things: transport across the mitochondrial membrane. They have, however, different substrates (oxoglutarate/malate versus H+).
38The human protein kinases JNK and ERK are homologous
JNK
ERK
39But phosphorylate different transcription factors
[Figure labels: CFOS, CJUN, ELK, JNK, ERK; P marks phosphorylation]
40Malate dehydrogenase is homologous to lactate
dehydrogenase
Malate dehydrogenase
Lactate dehydrogenase
41Lactate dehydrogenase (LDH) and Malate
Dehydrogenase (MDH) do the same thing, but on
different substrates
LDH
MDH
42Detection of new domains
Find pieces of proteins that occur in the context of different other pieces/domains.
Be careful, though: when you do not detect homology, that does not necessarily indicate that it is not there -> having a separate 3D structure is often regarded as proof of being a separate domain.
43Determining domain boundaries
- When domains are either N- or C-terminal, that basically gives the N- or C-terminal boundary of the domain.
- Domains do not overlap: knowledge of existing domains can provide maximum boundaries for new ones.
- Domains do not cross membranes.
TM
TM
TM
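These rules can be turned into a small helper that lists the stretches of a protein still available for a new domain. This is a sketch under stated assumptions: known domains and transmembrane (TM) segments are given as residue ranges, and the minimum region length of 30 residues is an arbitrary cut-off. The function name and the example coordinates are invented.

```python
def unassigned_regions(protein_length, occupied, min_length=30):
    """Maximum boundaries for new domains.

    occupied: list of (start, end) ranges (1-based, inclusive) for known
    domains and TM segments. Returns the gaps between them, and at the
    termini, that are at least min_length residues long.
    """
    regions = []
    cursor = 1  # first residue not yet covered by an occupied range
    for start, end in sorted(occupied):
        if start - cursor >= min_length:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if protein_length - cursor + 1 >= min_length:
        regions.append((cursor, protein_length))
    return regions


# Hypothetical 500-residue protein: a known N-terminal domain (1-120),
# a TM segment (200-222) and a second known domain (300-420).
print(unassigned_regions(500, [(1, 120), (200, 222), (300, 420)]))
```

Each returned range is an upper limit: a new domain can be shorter than the range, but it cannot extend into a known domain or across a TM segment.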
44Domains are not necessarily contiguous in the
sequence
Pyruvate kinase contains an eight-fold alpha/beta
barrel, interrupted by a beta barrel.
45Sequence domain databases: prefab sets of HMMs against which sequences can be scanned.
- PFAM -> alignments generated automatically. Large coverage, lower quality.
- SMART -> curated alignments, less coverage (focus on signalling domains and on mobile domains).
46- Homology detection: Bork P, Koonin EV (1998) Predicting functions from protein sequences--where are the bottlenecks? Nat Genet. 18(4):313-8.