Sequence Alignment and Phylogeny - PowerPoint PPT Presentation

1
Sequence Alignment and Phylogeny
B I O I N F O R M A T I C S
B I O L O G Y - M A T H - S
Dr Peter Smooker, peter.smooker_at_rmit.edu.au
2
Uses of alignments

To determine the relationship (i.e. distance)
between two sequences (pairwise alignment)
To search databanks for the presence of
homologues
To look for sequence conservation in families of
proteins
To use molecular approaches to phylogeny

3
Comments/Caveats

When sequences are aligned, we assume they share
a common ancestor
Protein fold is more conserved than protein
sequence
DNA sequences are less informative than protein
sequences
Two sequences can always be aligned- we need to
determine what is a meaningful result

4
Homology

Proteins or genes are defined as homologous if
they can be said to have shared an ancestor
Genes or proteins are either homologs or they are
not- there is no such thing as percent homology.
There is percent identity or similarity of the
sequences

5
Ologies

Homology - descent from a common ancestor
Orthology - descent from a speciation event
Paralogy - descent from a duplication event
Xenology - descent from a horizontal transfer
event

6
When Is Homology Real?

As a general rule, in a pairwise alignment:
>25% identical aas: proteins will have a similar
folding pattern, most likely homologous
18-25% identical: twilight zone, tantalizing
<18% identical: cannot determine from alignment
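
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and counting identity over non-gap columns is one of several conventions in use, assumed here for simplicity.

```python
def percent_identity(a, b):
    """Percent of aligned, non-gap columns that hold the same residue.

    Assumes a and b are two rows of one alignment (equal length, '-' for gaps).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def homology_zone(identity):
    """Map a percent identity onto the three bands from the slide."""
    if identity > 25:
        return "likely homologous"
    if identity >= 18:
        return "twilight zone"
    return "cannot determine from alignment"


# Two short fragments taken from the cathepsin L alignment later in the talk
print(homology_zone(percent_identity("GNMGCSGG", "GNYGCMGG")))
```

Real fragments this short would not support any conclusion; the thresholds are meant for full-length pairwise alignments.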

7
Measuring Sequence Similarity

Two measures of the distance between two strings:
Hamming distance: strings of equal length; the
number of positions with mismatches
Levenshtein distance: strings need not be of equal
length; the number of edit operations needed to
change one string into the other

agtc     Hamming distance = 2
cgta
ag-tcc   Levenshtein distance = 3
cgctca
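
Both distances are easy to compute. The sketch below is a minimal version (function names are my own) using the standard dynamic-programming recurrence for Levenshtein distance, with insertions, deletions and substitutions each costing 1. It reproduces the two examples on the slide.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    # prev[j] holds the distance between the first i-1 chars of a and first j chars of b
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning i characters into the empty string costs i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(hamming("agtc", "cgta"))         # 2
print(levenshtein("agtcc", "cgctca"))  # 3
```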

9
Protein Alignments-Substitution Matrices

When sequences diverge over time, they accumulate
mutations- some are deleterious, some are
neutral, some are advantageous
Some changes are more likely than others
This can be examined and the relative probability
of a change occurring calculated
Substitution matrices have been developed

10
Matrices.

PAM: Percent Accepted Mutation
Matrices are derived from families of proteins
with a set level of identity.
PAM matrices were proposed by Margaret Dayhoff,
based on sequences with > 85% identity. The PAM 1
matrix was computed, then extrapolated for larger
evolutionary distances

11
PAM Matrices

PAM          0   30   80  110  200  250
% identity 100   75   50   40   25   20
The PAM250 matrix corresponds to proteins of
average 20% identity (the lowest we can reasonably
be confident about). It was derived by the
extrapolation of observed substitution
frequencies. PAM250 refers to 250 substitutions
per 100 amino acids.

12
Definition of PAM from BLAST literature

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
One "PAM" corresponds to an average change in 1%
of all amino acid positions. After 100 PAMs of
evolution, not every residue will have changed:
some will have mutated several times, perhaps
returning to their original state, and others not
at all. Thus it is possible to recognize as
homologous proteins separated by much more than
100 PAMs. Note that there is no general
correspondence between PAM distance and
evolutionary time, as different protein families
evolve at different rates.
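
A toy calculation shows why 100 PAMs does not mean that every residue has changed. The model below is deliberately simplified and is NOT Dayhoff's matrix: it assumes every amino acid is equally likely to mutate and equally likely to be the replacement, which real proteins do not obey. The function name and parameters are hypothetical.

```python
def identity_after(n_pams, rate=0.01, alphabet=20):
    """Expected fraction of positions identical to the ancestor after n_pams steps.

    Assumed toy model: at each step a position changes with probability `rate`
    to one of the other (alphabet - 1) residues chosen uniformly at random.
    """
    same = 1.0
    for _ in range(n_pams):
        # Stay identical, or mutate back to the original residue by chance
        same = same * (1 - rate) + (1 - same) * rate / (alphabet - 1)
    return same


for n in (1, 30, 100, 250):
    print(n, round(100 * identity_after(n), 1))
```

Even under this crude model a sizeable share of positions is still identical after 100 steps, because repeated and reverse mutations are counted in the PAM distance. The percentages will not match the table on the previous slide, which is based on observed, non-uniform substitution frequencies.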

13
BLOSUM Matrices

Developed by S and JG Henikoff
Made use of a much larger amount of data
Based on the BLOCKS database of aligned protein
domains
http://www.blocks.fhcrc.org/
Closely related sequences with identities higher
than a threshold are clustered and contribute as
a weighted average. For example, for the common
BLOSUM62 matrix, sequences with 62% or greater
identity are clustered

14
BLOCKS

The substitutions in each aligned column are
identified and a score for each substitution
calculated and inserted into the matrix.

15
Which Matrix to use?

In BLASTP, the following matrices are offered:
PAM 30
PAM 70
BLOSUM 80
BLOSUM 62 (default)
BLOSUM 45
In PAM, greater numbers mean greater evolutionary
distance. The reverse holds for BLOSUM

16
Which Matrix to use?

Generally, BLOSUM matrices perform better than PAM
matrices for local alignment searches
Use the matrix appropriate for the task- if you
expect a close match, use a low PAM or high
BLOSUM number
If you use the default (usually BLOSUM 62) and
find nothing, go to a matrix derived from a more
evolutionarily distant dataset
17
Scoring

Score of mutation i -> j:
$S(i \to j) = \log \dfrac{\text{observed}(i \to j)}{\text{expected}(i \to j)}$
Expected i -> j is calculated simply from the
frequencies of the amino acids
The result is multiplied by 10. Scores are added.
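
As a minimal sketch of this calculation, the function below computes a single log-odds entry. The slide does not state the base of the logarithm; base 10 is assumed here, and the frequencies in the example are invented purely for illustration.

```python
import math


def log_odds_score(observed, expected, scale=10):
    """Scaled, rounded log-odds score for one substitution.

    observed: frequency with which the pair is seen in aligned sequences
    expected: frequency predicted from amino acid frequencies alone
    Base-10 logarithm and scale=10 are assumptions for this sketch.
    """
    return round(scale * math.log10(observed / expected))


# Hypothetical frequencies: a pair seen twice as often as chance predicts
print(log_odds_score(0.02, 0.01))
# ... and a pair seen half as often as chance predicts
print(log_odds_score(0.005, 0.01))
```

Because the entries are logarithms, adding scores along an alignment corresponds to multiplying the underlying odds ratios.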

18
PAM250
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    2
R   -2   6
N    0   0   2
D    0  -1   2   4
C   -2  -4  -4  -5   4
Q    0   1   1   2  -5   4
E    0  -1   1   3  -5   2   4
G    1  -3   0   1  -3  -1   0   5
H   -1   2   2   1  -3   3   1  -2   6
I   -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L   -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K   -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M   -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F   -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P    1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S    1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T    1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W   -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y   -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V    0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
19

Scores below 0 indicate substitutions that are
rarely observed
Different aas that give a high +ve score are
usually functionally equivalent
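
To make the lookup concrete, here is a small sketch that scores an ungapped alignment by adding matrix entries. Only a handful of entries are included, copied from the PAM250 table in this presentation; the dictionary layout and function names are my own choices, and a real tool would load the full matrix and handle gaps.

```python
# A few entries copied from the PAM250 table in this presentation.
# Keys are residue pairs in alphabetical order, since the matrix is symmetric.
PAM250_SUBSET = {
    ("A", "A"): 2,
    ("W", "W"): 17,
    ("K", "R"): 3,
    ("D", "E"): 3,
    ("I", "L"): 2,
    ("F", "Y"): 7,
    ("C", "W"): -8,
}


def pair_score(x, y):
    """Look up the score for residues x and y, in either order."""
    return PAM250_SUBSET[tuple(sorted((x, y)))]


def alignment_score(a, b):
    """Sum of substitution scores over an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles ungapped, equal-length input")
    return sum(pair_score(x, y) for x, y in zip(a, b))


print(alignment_score("WKD", "WRE"))  # conservative changes: 17 + 3 + 3
print(alignment_score("LFC", "IYW"))  # the rare C/W pair pulls the total down
```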

Hydrophilic

These aas are hydrophobic (except glycine, often
put in a class by itself).

22
Interpreting scores- BLAST output
24
Significance

Two values are given- the bit score and the
E-value.
The E-value is the number of matches with at
least that score that would be expected by chance
in a database of that size. The lower the
E-value, the more significant the match
The bit score is derived from the raw score
(calculated from the BLOSUM or PAM lookup matrix)
but is normalised

25
Bit Score

Bit scores are normalised with respect to the
scoring system. Hence they can be compared across
different searches (using different matrices)
In particular:
To convert a raw score S into a normalized score
S' expressed in bits, one uses the formula
$S' = (\lambda S - \ln K) / \ln 2$
where lambda and K are parameters dependent upon
the scoring system (substitution matrix and gap
costs) employed
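
The conversion is a one-liner, sketched below together with the usual relation between bit score and E-value, E = m n 2^(-S'), where m is the query length and n the database length. The numeric values of lambda and K in the example are assumptions chosen for illustration; the values that apply to a given search are reported by BLAST in its output.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2 ** (-bits)


# Illustrative parameters only (assumed, not taken from the slides)
LAMBDA, K = 0.267, 0.041

bits = bit_score(100, LAMBDA, K)
print(round(bits, 1), e_value(bits, query_len=300, db_len=1_000_000))
```

Since lambda and K are folded into the bit score, the E-value depends only on the bit score and the size of the search space, which is why bit scores from different scoring systems are comparable.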

26
Multiple Sequence Alignment

To quote Lesk:
"One amino acid sequence plays coy; a pair of
homologous sequences whisper; many aligned
sequences shout out loud"

27
Multiple Sequence Alignment

Multiple sequence alignments can offer
considerably more information than a
pairwise alignment.
Regions of similarity (especially distant
similarity) can be detected
Regions of functional significance can often be
detected
Evolutionary relationships can be examined, and
trees drawn.

28
MSAs are computationally expensive

If we use dynamic programming, rather than a 2D
array as for pairwise comparison, we have an
n-dimensional array. Computational time grows as
M^n, where M is the sequence length and n is the
number of sequences. Difficult for n = 4,
impossible for higher values.
Use a heuristic approach. The most common is the
CLUSTAL algorithm
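
A quick calculation makes the growth tangible. The sketch below simply counts the cells of the n-dimensional dynamic-programming array for sequences of a given length; the length of 300 residues is an arbitrary example, and the true running time is larger still because each cell examines several neighbours.

```python
def dp_cells(length, n_sequences):
    """Number of cells in an n-dimensional DP array for sequences of equal length."""
    return length ** n_sequences


for n in range(2, 6):
    print(n, f"{dp_cells(300, n):,}")
```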

29
Progressive Alignment

Iterative pairwise alignment
Two most similar sequences aligned first, then
next most similar to that pair, etc.
A very popular progressive alignment algorithm is
CLUSTAL W

30
CLUSTAL W- Steps

A matrix of pairwise distances between all
sequences is constructed. This determines the
similarity between all sequences to be aligned.
A guide tree (dendrogram), or inferred phylogeny,
is built
The alignment is constructed based on the guide
tree.
Generally results in a near-optimal alignment
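
The first two steps can be sketched as follows. This is not CLUSTAL W itself: distances here are simple mismatch fractions between equal-length sequences (CLUSTAL W derives them from pairwise alignments), and clusters are joined by average linkage, a simpler stand-in for its tree-building method. All names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Return a nested-tuple guide tree for a dict of name -> sequence."""
    names = list(seqs)
    # Step 1: matrix of pairwise distances between all sequences
    dist = {(a, b): p_distance(seqs[a], seqs[b]) for a in names for b in names}
    # Each cluster is (subtree, list of member names)
    clusters = [(name, [name]) for name in names]

    def average_distance(i, j):
        members_i, members_j = clusters[i][1], clusters[j][1]
        total = sum(dist[a, b] for a in members_i for b in members_j)
        return total / (len(members_i) * len(members_j))

    # Step 2: repeatedly join the two closest clusters
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: average_distance(*pair))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


toy = {"A": "AAAA", "B": "AAAT", "C": "GGGG", "D": "GGGC"}
print(guide_tree(toy))
```

The nesting of the result gives the order of alignment: the innermost pairs are aligned first, and the resulting groups are then aligned to each other.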

31
CLUSTAL W

A major problem in MSA is the selection of an
appropriate matrix for alignments consisting of
divergent and closely related sequences
CLUSTAL W (weighted) assigns weights to a
sequence dependent on how divergent it is from
the two most closely related sequences
Adapts gap penalties and scoring matrix to suit

32
An example (from our research)

Some definitions
Phylogeny: evolutionary history (tree of life)
Molecular phylogeny: determined using sequence
data
Bootstrapping: a statistical process to evaluate
phylogenetic trees. The data are resampled
(generally 1000 times) and the support for each
branch determined
Homology modelling: predicting the structure of a
protein based on the experimentally derived
structure of a homologue
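
The resampling step in bootstrapping can be sketched briefly. Columns of the alignment are drawn with replacement to make each pseudo-replicate. To keep the example short, "support" here is measured for a much simpler statement than a tree branch, namely that two given sequences are each other's closest pair; a real analysis would rebuild the whole tree for every replicate. Function names and the toy alignment are hypothetical.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def closest_pair(alignment):
    """The pair of sequence names with the fewest mismatches."""
    def mismatches(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return frozenset(min(combinations(alignment, 2), key=lambda p: mismatches(*p)))


def bootstrap_support(alignment, pair, replicates=1000, seed=0):
    """Fraction of replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(closest_pair(resample_columns(alignment, rng)) == frozenset(pair)
               for _ in range(replicates))
    return hits / replicates


toy = {"A": "GNMGCSGG", "B": "GNMGCSGL", "C": "GNHGCGGW"}
print(bootstrap_support(toy, ("A", "B")))
```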

33
Fasciola- Liver Fluke
NEJ Adult
34
Liver fluke (Fasciola spp.)

Trematode (flatworm) parasite
Infects ruminants, humans
Has a complex life-cycle
Secretes proteins (excretory/secretory material)
Major secreted protein is cathepsin L in adults

35
Cysteine proteases

Digest proteins: cleave between adjacent amino
acids.
Not random cleavage, different proteases show a
preference for different targets.

36
There are a number of Fasciola cathepsin L
sequences known.

At least 30 full sequences now known
Only one contains an indel
Protein sequences 46-99% identical

37
What are the differences between the two classes
of CatL that account for the substrate
specificity?

Presumed to be due to changes affecting the S2
subsite of the enzyme.

38
Homology Modelling

FhCatL modelled on the known crystal structure of
human CatL.
Models of CatL2 and CatL5 (functional equivalent
of CatL1) compared, especially around the S2
subsite of the enzyme.

39
Homology Modelling

Three substitutions in residues lining the S2
subsite were observed (L5 -> L2)
L69Y: makes substantial contacts with the P2 Phe
N161T: side chain points away from pocket
G163A: bottom of pocket, no substantial contact
with P2 Phe

40
L2
L5
GRASP electrostatic surface potential
The architecture around the S2 pocket is
substantially influenced by a Y or L at position
69. A mutant was made, expressed in yeast, and
subjected to kinetic analysis.
41
Conclusions

The L69Y change does affect the substrate
specificity
69Y allows increased catalysis of substrates with
a P2 proline
There are other, more subtle changes between L5
and L2

42
What about the other enzymes- CLUSTAL W
43
What amino acid is at 69?
44
Residues 61-120
FgCatL1-a  GNMGCSGGLMENAYEYLKQFGLETESSYPYTAVEGQCRYNRQLGVAKVTDYYTVHSGSEV
FgCatL1-b  GNYGCMGGLMENAYEYLKQFGLETESSYPYTAVEGQCRYNRQLGVAKVTDYYTVHSGSEV
FgCatL1-c  GNFGCNGGLMENACEYLKRFGLETESSYPYRAVEGPCRYNKQLGVAKVTGYYMVHSGDEV
FgCatL1-d  GNHGCGGGYMENAYEYLKHSGLETDSYYPYQAVEGPCQYDGRLAYAKVTDYYTVHSGDEV
FgCatL1-e  GNYGCMGGLMENAYEYLKQFGLETESSYPYTAVEDQCRYNRQLGVAKVTDYYTVHSGSEV
FgCatL1-f  GNNGCRGGLMEIAYEYLRRFGLEIESTYPYRAVEGPCRYDRRLGVAKVTGYYIVHSGDEV
FgCatL2    GNMGCSGGLMENAYEYLKQFGLETESSYPYTAVEGQCRYNRQLGVAKVTDYYTVHSGSEV
FgCatL3    GNINCMGGLMENAYEYLKQFGLETESSYPYTAVEGQCRYNRQLGVAKVTDYYTVHSGSEV
FhCatL1    GNNGCGGGLMENAYQYLKQFGLETESSYPYTAVGGQCRYNKQLGVAKVTGYYTVQSGSEV
FhCatL2    GNYGCGGGYMENAYEYLKHNGLETESYYPYQAVEGPCQYDGRLAYAKVTGYYTVHSGDEI
FhCatL3    GNNGCSGGLMENAYQYLKQFGLETESSYPYTAVEGQCRYNKQLGVAKVTGYYTVHSGSEV
FhCatL4    GNYGCNGGLMENAYEYLKRFGLETESSYPYRAVEGQCRYNEQLGVAKVTGYYTVHSGDEV
FhCatL5    GNYGCNGGLMENAYEYLKRFGLETESSYPYRAVEGQCRYNEQLGVAKVTGYYTVHSGDEV
FhCatL6    GNYGCMGGLMENAYEYLKQFGLETESSYPYTAVEGQCRYNRQLGVAKVTDYYTVHSGSEV
FhCatL7    GNYGCGGGYMENAYEYLKHNGLETESYYPYQAVEGPCQYDGRLAYAKVTGYYTVHSGDEI
FhCatL8    GNHGCGGGWMENAYKYLKNSGLETASYYPYQAVEYQCQYRKELGVAKVTGAYTVHSGDEM
FhCatL9    GNNGCSGGLMENAYEYLKRFGLETESSYPYRAVEGQCRYNEQLGVAKVTGYYTVHSGSEV
FhCatL10   GNHGCGGGWMENAYKYLKNSGLETASDYPYQGWEYQCQYRKELGVAKVTGAYTVHSGDEM
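
The question "What amino acid is at 69?" can be answered by indexing into the alignment block. The sketch below uses three of the sequences shown above and assumes that the numbering 61-120 given for the first row applies to every row, which holds only because this block contains no gaps. The helper name is hypothetical.

```python
BLOCK_START = 61  # first residue number of the block, as labelled on the slide

# Three rows copied from the alignment block above
ALIGNED = {
    "FgCatL1-a": "GNMGCSGGLMENAYEYLKQFGLETESSYPYTAVEGQCRYNRQLGVAKVTDYYTVHSGSEV",
    "FhCatL2":   "GNYGCGGGYMENAYEYLKHNGLETESYYPYQAVEGPCQYDGRLAYAKVTGYYTVHSGDEI",
    "FhCatL8":   "GNHGCGGGWMENAYKYLKNSGLETASYYPYQAVEYQCQYRKELGVAKVTGAYTVHSGDEM",
}


def residue_at(sequence, position, start=BLOCK_START):
    """Residue at a numbered position, given the number of the first residue."""
    index = position - start
    if not 0 <= index < len(sequence):
        raise IndexError(f"position {position} is outside this block")
    return sequence[index]


for name, seq in ALIGNED.items():
    print(name, residue_at(seq, 69))
```

The three sequences chosen carry L, Y and W at position 69 respectively.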
45
Fasciola CatLs form a monophyletic clade

Fasciola sequences aligned to the family of
papain-like cysteine proteases
100% bootstrap support for clade
All Fasciola sequences arose after divergence
from Schistosoma
Probably all parasitic catLs have diverged after
speciation (Sajid and McKerrow)

47
Relationship of Fasciola enzymes

Tree constructed using 18 full-length sequences
Resolved into 4 distinct clades

48
AA69 Predicted Substrate
L69 Phe-Arg
L69 Phe-Arg
Y69 Pro-Arg
W69 ??-Arg
50
Evolutionary Timeframe

First observed divergence (clade A): 135 MYA
F. hepatica and F. gigantica predicted to diverge
approx. 19-25 MYA
Confirmed by constructing a neighbour-joining
tree using glutathione S-transferase sequences:
19 +/- 5.2 MYA

51
Practice runs- 1. Blast

Go to the BLAST server at NCBI
http://www.ncbi.nlm.nih.gov/BLAST/
Note the different flavours of BLAST that can
be performed.
Go to Protein-Protein BLAST. Look at the format
and the searching parameters.
Paste in sequence 1 and run the BLAST

52
Sequence 1

What is it? (note that a conserved domain is
detected)
From what organism (should be a 100% match)?
What is the organism that has the closest
relative?
What is meant by positives?

For interest, use sequence 2 to run a BLAST. This
is the mRNA sequence from which the protein
sequence is translated. (note- choose your BLAST
flavour carefully!)
Is the same result obtained?

54
Practice runs- 2. CLUSTAL W