Title: From Pairwise To
1From Pairwise To Multiple Sequence Alignments
Workshop, January 2008 Maya Schushan
2Outline
1. Introduction
2. Applications
3. General Alignment Methodology
4. Pairwise Alignment
- Smith-Waterman
- Needleman-Wunsch
5. Multiple Sequence Alignment
- ClustalW
- MUSCLE
- T-Coffee
3What Is An Alignment?
Introduction
- Comparing 2 (pairwise) or more (multiple) sequences.
- Searching for a series of identical or similar characters in the sequences.
4Introduction
What Is An Alignment?
A process of lining up 2 or more sequences to achieve the maximum level of identity, in order to find homologies.
[Figure: T C A T G and C A T T G, lined up in two alternative ways]
5Introduction
Basic Terms
- Homology
- Relation of sequences which is a result of
divergence from a common ancestor.
- Identity
- Sequences or Sub-sequences that are invariant.
- Similarity
- Sequences or Sub-sequences that are related.
6Introduction
Homologues: Orthology vs. Paralogy
Reproduced from NCBI education website
7Introduction
The Limits of Sequence Similarity
9Applications
Why Sequence Alignment?
- 1. Predict characteristics of a protein
- Use the structure or function information on a homologous protein to predict the structure/function of an unknown protein.
- Similar sequences → similar proteins (?)
- Conserved vs. variable regions → functional site
10Applications
Why Sequence Alignment?
- 1. Predict characteristics of a protein
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG
11Applications
Why Sequence Alignment?
A model is generated according to a template
structure of a homologous protein
12Applications
Why Sequence Alignment?
- 2. Learn about evolutionary relationships
- Two sequences from different organisms are similar → they may have a common ancestor.
- Needed for construction of phylogenetic trees
13Applications
Why Sequence Alignment?
- 3. Research of disease
- Comparison of sequences between individuals can detect changes that are related to diseases.
- Analysis of residue substitutions: mutation or polymorphism?
14Applications
Why Sequence Alignment?
- Examples of specific applications
- Evolutionary conservation analysis (ConSeq/ConSurf)
- Motif and domain prediction (Prosite/InterPro/Pfam)
- Phylogenetic trees
[Figure: ConSurf analysis of PDB entry 1hyt (hydrolase)]
16Sequence Modifications
General Alignment Methodology
- 1. Insertion: inserting one or more letters into the sequence.
  AAGA → AAGTA
- 2. Deletion: deleting one or more letters from the sequence.
  AAGA → AGA
- 3. Substitution: replacing a sequence letter with another.
  AAGA → AACA
InDel: insertion or deletion
17General Alignment Methodology
Measuring An Alignment
S: ACTG
T: AGT
Candidate alignments:
S: AC_TG    S: ACTG    S: ACTG
T: A_GT_    T: AGT_    T: _AGT
Good: identical characters (match). Bad: different characters (mismatch) or a gap (InDel).
- Each pair of characters gets a value, depending on its identity.
- The similarity score of the alignment is the sum of the pair values.
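The sum-of-pair-values idea can be sketched in a few lines of Python. The values used here (match +1, mismatch -1, gap -2) and the function name `score_alignment` are assumptions for illustration only; the slides do not fix any particular values.

```python
def score_alignment(s, t, match=1, mismatch=-1, gap=-2):
    """Sum the per-column values of two aligned, equal-length strings ('_' marks a gap)."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(s, t):
        if a == "_" or b == "_":
            total += gap          # gap (InDel) column
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # different characters
    return total


# The three candidate alignments of ACTG and AGT shown on the slide.
candidates = [("AC_TG", "A_GT_"), ("ACTG", "AGT_"), ("ACTG", "_AGT")]
scores = {pair: score_alignment(*pair) for pair in candidates}
best = max(scores, key=scores.get)
```

Under these assumed values, the second candidate (ACTG over AGT_) scores highest; different values could rank the candidates differently.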
18Example Aligning Two Globins
General Alignment Methodology
- Human Hemoglobin (HH)
- VLSPADKTNVKAAWGKVGAHAGYEG
- Sperm Whale Myoglobin (SWM)
- VLSEGEWQLVLHVWAKVEADVAGHG
19Example Aligning Two Globins
General Alignment Methodology
- No gaps
- Percent identity: 36%
- Percent similarity: 40%
- (HH)  VLSPADKTNVKAAWGKVGAHAGYEG
- (SWM) VLSEGEWQLVLHVWAKVEADVAGHG
20Example Aligning Two Globins
General Alignment Methodology
- With gaps
- Gaps: 2
- Percent identity: 45.833% (instead of 36% without gaps)
- Percent similarity: 54.167% (instead of 40% without gaps)
- (HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
- (SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
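The identity figures for the two globin fragments can be reproduced with a short Python sketch. It assumes that percent identity is taken over the columns that contain no gap; that convention is not stated on the slides, but it matches both numbers. The function name `percent_identity` is hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Identical columns as a percentage of the columns that contain no gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    ungapped = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    matches = sum(1 for x, y in ungapped if x == y)
    return 100.0 * matches / len(ungapped)


# Human hemoglobin vs. sperm whale myoglobin fragments from the slides.
no_gaps = percent_identity("VLSPADKTNVKAAWGKVGAHAGYEG",
                           "VLSEGEWQLVLHVWAKVEADVAGHG")
with_gaps = percent_identity("VLSPADKTNVKAAWGKVGAH-AGYEG",
                             "VLSEGEWQLVLHVWAKVEADVAGH-G")
```

Without gaps this gives 9 identical positions out of 25; with the two gaps it gives 11 out of 24.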
21General Alignment Methodology
Types of Gap Penalties
(insertions or deletions)
- InDels are rare in evolution, but once created they are easy to extend.
- Gap open penalty: for the first residue in a gap.
- Gap extension penalty: for each additional residue in a gap.
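A minimal sketch of this open/extend scheme in Python follows. The penalty values (10 to open, 0.5 to extend) and the function names are assumptions for illustration; real programs let the user choose them.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of `length` residues: open once, then extend."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def total_gap_penalty(aligned, gap_open=10.0, gap_extend=0.5, gap="-"):
    """Sum the penalties of every run of gap characters in one aligned row."""
    total = 0.0
    run = 0
    for ch in aligned:
        if ch == gap:
            run += 1
        else:
            total += affine_gap_penalty(run, gap_open, gap_extend)
            run = 0
    # Account for a gap that runs to the end of the row.
    return total + affine_gap_penalty(run, gap_open, gap_extend)


one_long_gap = total_gap_penalty("AC----GT")      # 10 + 3 * 0.5 = 11.5
four_short_gaps = total_gap_penalty("A-C-G-T-")   # 4 * 10 = 40.0
```

Both rows contain four gap characters, yet the single long gap costs far less than four separate ones, which is the behaviour the slide motivates.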
22General Alignment Methodology
Types of Gap Penalties
Motivation: aligning cDNAs to genomic DNA
[Figure: a cDNA query aligned against genomic DNA]
Conclusion: gap opening and gap extension should be penalized differently to align the sequences properly.
23Alignment Scoring
General Alignment Methodology
- 1. Assume an independent mutation model
- 2. Score at each position
- Positive if the same/similar
- Negative if different or a gap
- 3. The score of an alignment is the sum of the position scores
24Scoring Matrix
General Alignment Methodology
- An n × n matrix: n = 4 for DNA, n = 20 for proteins
- Each matrix entry defines the score for observing the two letters aligned to each other
- Positive if the change is likely
- Negative otherwise
25Alignment Scoring
General Alignment Methodology
- Different scoring → different best alignments
- Scoring systems implicitly represent a particular theory of evolution
- Some mismatches are more plausible than others
- Transition vs. transversion
- Lys→Arg ≠ Lys→Cys
- Gap extension vs. gap opening
26DNA scoring matrices
General Alignment Methodology
27DNA scoring matrices
General Alignment Methodology
- Transitions: purine to purine or pyrimidine to pyrimidine (4 possibilities)
- Transversions: purine to pyrimidine or pyrimidine to purine (8 possibilities)
- By chance alone, transversions should occur twice as often as transitions.
- In practice, transitions are more frequent than transversions.
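The 4-versus-8 count can be checked by enumerating all ordered substitutions between the four bases. This is a small Python sketch; the function name `classify` is hypothetical.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Label a base pair as a match, a transition, or a transversion."""
    if a == b:
        return "match"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Count every ordered substitution a -> b with a != b (12 in total).
counts = {"transition": 0, "transversion": 0}
for a, b in permutations("ACGT", 2):
    counts[classify(a, b)] += 1
```

A DNA scoring matrix can then give transitions a milder penalty than transversions, reflecting that they are observed more often.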
28General Alignment Methodology
DNA scoring matrices
[Figure: DNA scoring matrix with separate scores for match, transition, and transversion]
29General Alignment Methodology
Proteins scoring matrices
30General Alignment Methodology
Proteins scoring matrices
- Observation: some substitutions are more frequent than others, e.g., between chemically similar amino acids
- As for DNA, protein matrices define the probabilities of change between the different amino acids
- Popular matrices are based on empirical data
- PAM
- BLOSUM
TLYDK
TLYDK
TLYEK
TLYDK
TLYQK
TLYDK
In the fourth column, E and D are found in 7 / 8
31PAM Matrices
General Alignment Methodology
- PAM matrices are based on sequences with 85% identity.
- The changes are those accepted by natural selection.
- 1 PAM unit: the probability of 1 point mutation per 100 residues.
- Multiplying PAM1 by itself gives higher PAM matrices that are suitable for larger evolutionary distances.
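The "multiply PAM1 by itself" step can be sketched with a toy mutation-probability matrix. The 3-letter alphabet and the probabilities below are invented for illustration and are not Dayhoff's data; the real PAM1 is a 20 × 20 matrix, and the published scoring matrices are log-odds values derived from such probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each letter stays unchanged with probability 0.99 (1% change).
toy_pam1 = [[0.990, 0.005, 0.005],
            [0.005, 0.990, 0.005],
            [0.005, 0.005, 0.990]]

toy_pam2 = mat_pow(toy_pam1, 2)
toy_pam250 = mat_pow(toy_pam1, 250)
```

Each row still sums to 1 after multiplication, while the diagonal shrinks as the evolutionary distance grows.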
32General Alignment Methodology
PAM250
33General Alignment Methodology
PAM Matrices
34General Alignment Methodology
BLOSUM Matrices
- Based on the BLOCKS database
- 2000 blocks from 500 families of related proteins
- Blocks: short conserved patterns of 3-60 aa without gaps
- Different BLOSUMn matrices are calculated independently from BLOCKS
- BLOSUMn is calculated after clustering sequences that share at least n percent identity
35General Alignment Methodology
BLOSUM Matrices
- Low BLOSUM numbers for distant sequences
- High BLOSUM numbers for similar sequences
- Generally
- BLOSUM62 for general use
- BLOSUM80 for close relations
- BLOSUM45 for distant relations
36General Alignment Methodology
Proteins scoring matrices
Closer sequences
- PAM100 BLOSUM90
- PAM120 BLOSUM80
- PAM160 BLOSUM60
- PAM200 BLOSUM52
- PAM250 BLOSUM45
Distant sequences
37General Alignment Methodology
Proteins scoring matrices
- BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of the sequences, without gaps.
- PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions.
39Local vs. Global
Pairwise Alignment
- Global alignment: finds the best alignment across the whole of the two sequences.
ADLGAVFALCDRYFQ
ADLGRTQN-CDRYYQ
- Local alignment: finds regions of similarity in parts of the sequences.
ADLG CDRYFQ
ADLG CDRYYQ
40Pairwise Alignment
Global: Needleman-Wunsch (1970)
- Involves an iterative matrix method of calculation
- All possible pairs of residues are represented in an array
- All possible alignments are represented as pathways through this array
Needleman, S. B. and Wunsch, C. D., 1970
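The matrix method can be sketched as a small dynamic-programming routine in Python. This is a simplified illustration, not the original 1970 formulation: it assumes a linear gap penalty and toy values (match +1, mismatch -1, gap -2), and the function name is hypothetical.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for a global alignment."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap              # s prefix aligned to gaps only
    for j in range(1, cols):
        score[0][j] = j * gap              # t prefix aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # pair s[i-1] with t[j-1]
                              score[i - 1][j] + gap,       # gap in t
                              score[i][j - 1] + gap)       # gap in s
    # Trace one optimal pathway back from the bottom-right corner.
    a, b = [], []
    i, j = len(s), len(t)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(a)), "".join(reversed(b))


result = needleman_wunsch("ACTG", "AGT")
```

Every cell of `score` corresponds to one pair of residues, and the traceback follows one of the highest-scoring pathways through the array.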
41Pairwise Alignment
Local: Smith-Waterman (1981)
- Finds an optimal alignment of the best segment of similarity between two sequences
- Suited to sequences that differ overall but contain regions that are highly similar
- Use when one sequence is short and the other is very long
- Can return a number of well-aligned segments
Smith, T.F. and Waterman, M.S., 1981
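A local alignment differs from the global case in two ways: cell values never drop below zero, and the traceback starts at the highest-scoring cell and stops at a zero. The Python sketch below assumes a linear gap penalty and toy values (match +2, mismatch -1, gap -2), returns only the single best segment, and uses a hypothetical function name.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Return (score, segment_of_s, segment_of_t) for the best local alignment."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new segment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


# A shared segment embedded in otherwise unrelated flanking letters.
local = smith_waterman("XXXXGATTACAYYYY", "ZZGATTACAWW")
```

Only the shared segment is reported; the unrelated flanks do not lower the score as they would in a global alignment.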
42Alignment in reality
Pairwise Alignment
- Main use of pairwise alignment: finding a sequence in a large database
- Databases are huge → heuristic procedures are used in the search
- Exact searches take a lot of time
- Exact searches take a lot of memory
- The heuristics find local alignments
- In use in BLAST, FASTA
43User Input
Pairwise Alignment
- Pair of sequences
- Local or global alignment
- Scoring
- Gap penalties: opening/extension
- Scoring matrix
44Scoring
Pairwise Alignment
- The final score of the alignment is the sum of the positive scores and the penalty scores:
  + Number of identities (scoring matrix)
  + Number of similarities (scoring matrix)
  - Number of gap insertions (gap penalties)
  - Number of gap extensions (gap penalties)
  = Alignment score
46Multiple Sequence Alignment
Pairwise Vs. Multiple Sequence Alignment
Alignments help to organize, visualize, and analyze sequence data.
MSA: for more than 2 sequences
F G K ? G K G
F G K F G K G
- G K Q G K G
- - K F G K G
Pairwise: for 2 sequences
F G K ? G K G
F G K F G K G
47Multiple Sequence Alignment
Rules For Choosing Sequences
- Very similar sequences carry little information.
- Very different sequences cause trouble: <30% identical with more than half of the other sequences in the set.
- Choose sequences as distantly related as possible.
- Sequences should be 30-80% identical with more than half of the sequences in the set.
- The more sequences the better.
48Multiple Sequence Alignment
Similarity Score of MSA
- Each position gets a value, depending on its identity.
- The similarity score of the alignment is the sum of all position values.
- A popular way to compute position values:
- SP (Sum of Pairs): each pair gets its score from the similarity matrix (PAM, BLOSUM).
Goal: find the MSA with the maximum similarity score.
Bad news: this problem is NP-hard.
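The sum-of-pairs score of a given MSA is cheap to compute; it is the search for the best MSA that is hard. The Python sketch below replaces the PAM/BLOSUM lookup with toy values (match +1, mismatch -1, residue against gap -2, gap against gap 0); those values and the function names are assumptions for illustration.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy stand-in for a similarity-matrix lookup."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(msa):
    """Add up the scores of all residue pairs in every column of the MSA."""
    total = 0
    for column in zip(*msa):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


# Three of the aligned rows shown on the pairwise vs. MSA slide.
sp = sum_of_pairs(["FGKFGKG",
                   "-GKQGKG",
                   "--KFGKG"])
```

With k sequences, each column contributes k(k-1)/2 pair scores, so fully conserved columns dominate the total.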
49Multiple Sequence Alignment
ClustalW- Introduction
- This heuristic approach works because it uses the biological meaning of MSA
- Based on the idea that the sequences we usually want to align are phylogenetically related
- The first program to implement progressive MSA
- Was introduced in 1994 and is still used today.
Thompson, J.D. et al, 1994
50Multiple Sequence Alignment
ClustalW- Progressive Alignment
Sequences: 1 Hbb_Human, 2 Hbb_Horse, 3 Hba_Human, 4 Hba_Horse, 5 Myg_Whale
1. Quick pairwise alignment: calculate the distance matrix
2. Build a guide tree using the NJ (neighbor-joining) phylogenetic method
3. Progressive alignment following the guide tree
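Step 1 can be illustrated with a very simplified distance matrix in Python. The sketch assumes equal-length toy DNA strings (invented here, not the globin sequences) and uses the fraction of differing positions as the distance, whereas ClustalW derives its distances from quick pairwise alignments. The function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Map each unordered pair of sequence names to its distance."""
    return {(m, n): p_distance(seqs[m], seqs[n])
            for m, n in combinations(seqs, 2)}


# Toy sequences, made up for this example.
toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA", "D": "TCGAACTT"}
distances = distance_matrix(toy)
closest = min(distances, key=distances.get)   # the pair a guide tree would join first
```

A tree-building method such as NJ then uses these distances to decide the order in which sequences are aligned.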
51Multiple Sequence Alignment
ClustalW- Progressive Alignment
[Figure: progressive alignment of sequences A, B, C and D]
52Multiple Sequence Alignment
ClustalW- Additional Features
- Sequence weighting
- Each sequence gets a weight derived from the guide tree
- Close sequences are down-weighted
- Distant sequences receive high weights
- The weights are normalized so that the highest is 1
$$W(\text{Hbb\_Human}) = 0.081 + \tfrac{1}{2}(0.226) + \tfrac{1}{4}(0.061) + \tfrac{1}{5}(0.015) + \tfrac{1}{6}(0.062) = 0.221$$
[Figure: guide tree with sequence weights w1 to w7]
53Multiple Sequence Alignment
ClustalW- Additional Features
Without weights:
$$\text{Score} = \frac{M(t,v) + M(t,i) + M(l,v) + M(l,i) + M(k,v) + M(k,i) + M(k,v) + M(k,i)}{8}$$
With weights:
$$\text{Score} = \frac{M(t,v)W_1W_5 + M(t,i)W_1W_6 + M(l,v)W_2W_5 + M(l,i)W_2W_6 + M(k,v)W_3W_5 + M(k,i)W_3W_6 + M(k,v)W_4W_5 + M(k,i)W_4W_6}{8}$$
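The two scores can be computed with one small Python function. The column residues (t, l, k, k against v, i) follow the slide; the matrix values in `TOY_M` and the weights are illustrative assumptions, and the function name is hypothetical.

```python
# Illustrative pair scores, standing in for a real substitution matrix M.
TOY_M = {("t", "v"): 0, ("t", "i"): -1,
         ("l", "v"): 1, ("l", "i"): 2,
         ("k", "v"): -2, ("k", "i"): -3}


def profile_column_score(col_a, col_b, weights_a=None, weights_b=None, matrix=TOY_M):
    """Average the (optionally weighted) scores of all residue pairs across two columns."""
    weights_a = weights_a or [1.0] * len(col_a)
    weights_b = weights_b or [1.0] * len(col_b)
    total = 0.0
    for res_a, w_a in zip(col_a, weights_a):
        for res_b, w_b in zip(col_b, weights_b):
            total += matrix[(res_a, res_b)] * w_a * w_b
    return total / (len(col_a) * len(col_b))


unweighted = profile_column_score("tlkk", "vi")
# Hypothetical weights W1..W4 for the first group and W5, W6 for the second.
weighted = profile_column_score("tlkk", "vi", [1.0, 0.5, 0.5, 0.5], [1.0, 0.5])
```

Down-weighting the two near-duplicate k rows reduces their influence on the column score, which is the purpose of the sequence weights.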
54Multiple Sequence Alignment
ClustalW- Additional Features
- Chooses different scoring matrices as the alignment proceeds, depending on the estimated divergence of the sequences at each stage.
- Position-specific gap penalties: depend on the position in the alignment
  - Small penalty:
    1. In free loops (stretches of hydrophilic residues)
    2. At existing gaps
  - Big penalty: near existing gaps
- Residue-specific gap penalties
55Multiple Sequence Alignment
ClustalW- Problems
- Sequences that are similar only in some smaller regions → ClustalW tries to find global alignments, not local.
- A sequence that contains a large insertion compared to the rest → global, not local.
- A sequence that contains a repetitive element, while another sequence contains only one copy.
56Multiple Sequence Alignment
MUSCLE- Introduction
- The most recent popular MSA software
- Considered to be the most accurate MSA software available today
- The basic idea: progressive alignment
Edgar, R.C., 2004
57Multiple Sequence Alignment
MUSCLE Innovations
- Faster distance estimation between the input sequences
- Faster construction of an evolutionary tree (UPGMA instead of NJ in ClustalW)
→ faster
- Applying a new score function to the profile alignments
- Refinement of the initial results
→ more accurate
Edgar, R.C., 2004
58Multiple Sequence Alignment
MUSCLE Innovations- Refinement Step
- An edge is chosen from the progressive alignment tree.
- The tree is divided into two subtrees by deleting this edge.
- The MSA of each subtree is computed by progressive alignment.
- The two MSAs are aligned, generating an entirely new MSA.
- If the new MSA achieves a higher score than the previous one → keep it.
[Figure: MSA1 and MSA2 are aligned into a new MSA, which is compared with the old MSA]
59Multiple Sequence Alignment
MUSCLE- It's Even More Complicated
60Multiple Sequence Alignment
T-Coffee- Introduction
- Heuristic methods based on progressive alignment use a phylogenetic guide tree to gradually build the alignment.
- → Errors made in the first alignments cannot be corrected later on, as more sequences are added.
- T-Coffee attempts to minimize this effect by using data from all the sequences while building the alignment.
- T-Coffee combines the best properties of global and local multiple alignments.
Notredame et al., 2000
61Multiple Sequence Alignment
T-Coffee-Alignment Stages
- Create a pairwise library
  - Primary library: global pairwise alignments (ClustalW)
  - Extended library: the 10 top-scoring non-intersecting local alignments
- Assign weights to each residue pair using the extended library
- Use the weights, instead of regular matrices, to produce pairwise alignments
- Produce a distance matrix and a tree using the NJ method
- Build the MSA using the guide tree
62Multiple Sequence Alignment
T-Coffee- Applications
- Disadvantages
  - Except for the distance matrix, there is no possibility to choose parameters (such as gap opening or gap extension).
- Advantages
  - Combination of local and global alignments
  - Data from all the sequences is used while building the alignment
  - Makes the results more accurate; especially good for alignments with low identity and big insertions → identification of motifs and domains
63Multiple Sequence Alignment
All Against All- SH2 domains
T-coffee
MUSCLE
Edgar, R.C., 2004
64Multiple Sequence Alignment
All Against All- BaliBase 2005
MUSCLE is superior!
Edgar, R.C., 2004
65Thanks
This presentation was partly based on
presentations of Dr. Meytal Landau, Dr. Metsada
Pasmanik-Chor and the Pupko lab