Title: Pairwise and Multiple Sequence Alignment Lesson 2
1Pairwise and Multiple Sequence
AlignmentLesson 2
2Motivation
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAA
CAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTAT
GGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGA
GACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTG
CTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTT
CTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATG
CTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGT
ATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCT
GCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAAT
ATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGG
ACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCA
TCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTAT
AGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTC
TCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATT
TTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACT
TTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTT
GCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGA
ATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTG
CATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGG
GCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAA
ACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCA
GCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGG
AATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTG
GCTAATGCCCTGGCTCACAAGTACCATTGA
MV
HLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE
3What is sequence alignment?
- Alignment Comparing two (pairwise) or more
(multiple) sequences. Searching for a series of
identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
MVHLTPEEKTAVNALWGKVNVDAVGGE
4Why perform a pairwise sequence alignment?
Finding homology between two sequences
- e.g., predicting characteristics of a protein
-
- premised on
- similar sequence (or structure)
- similar function
5Local vs. Global
- Local alignment finds regions of high
similarity in parts of the sequences - Global alignment finds the best alignment
across the entire two sequences
ADLGAVFALCDRYFQ ADLGRTQN CDRYYQ
ADLGAVFALCDRYFQ ADLGRTQN-CDRYYQ
6Evolutionary changes in sequences
- Three types of nucleotide changes
- Substitution a replacement of one (or more)
sequence characters by another - Insertion - an insertion of one (or more)
sequence characters - Deletion a deletion of one (or more) sequence
characters -
AAGA
AACA
?
T
A
AAG
GA
A
A
Insertion Deletion ? Indel
7Choosing an alignment
- Many different alignments between two sequences
are possible
AAGCTGAATTCGAA AGGCTCATTTCTGA
A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA-
AAGCTGAATT-C-GAA AGGCT-CATTTCTGA-
. . .
How do we determine which is the best alignment?
8Toy exercise
Compute the scores of each of the following
alignments using this naïve scoring scheme
Scoring scheme
- Match 1
- Mismatch -2
- Indel -1
-2 -2 -2 1
-2 -2 1 -2
-2 1 -2 -2
1 -2 -2 -2
Substitution matrix
Gap penalty (opening extending)
AAGCTGAATT-C-GAA AGGCT-CATTTCTGA-
A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA-
9Substitution matrices accounting for biological
context
- Which best reflects the biological reality
regarding nucleotide mismatch penalty? - Tr gt Tv gt 0
- Tv gt Tr gt 0
- 0 gt Tr gt Tv
- 0 gt Tv gt Tr
Tr Transition Tv Transversion
10Scoring schemes accounting for biological context
- Which best reflects the biological reality
regarding these mismatch penalties? - Arg-gtLys gt Ala-gtPhe
- Arg-gtLys gt Thr-gtAsp
- Asp-gtVal gt Asp-gtGlu
11PAM matrices
- Family of matrices PAM 80, PAM 120, PAM 250,
- The number with a PAM matrix (the n in PAMn)
represents the evolutionary distance between the
sequences on which the matrix is based - The (ith,jth) cell in a PAMn matrix denotes the
probability that amino-acid i will be replaced by
amino-acid j in time n Pi?j,n - Greater n numbers denote greater distances
12PAM - limitations
- Based on only one original dataset
- Examines proteins with few differences (85
identity) - Based mainly on small globular proteins so the
matrix is biased
13BLOSUM matrices
- Different BLOSUMn matrices are calculated
independently from BLOCKS (ungapped, manually
created local alignments) - BLOSUMn is based on a cluster of BLOCKS of
sequences that share at least n percent identity - The (ith,jth) cell in a BLOSUM matrix denotes the
log of odds of the observed frequency and
expected frequency of amino acids i and j in the
same position in the data log(Pij/qiqj) - Higher n numbers denote higher identity between
the sequences on which the matrix is based
14PAM Vs. BLOSUM
- PAM100 BLOSUM90
- PAM120 BLOSUM80
- PAM160 BLOSUM60
- PAM200 BLOSUM52
- PAM250 BLOSUM45
More distant sequences
- BLOSUM62 for general use
- BLOSUM80 for close relations
- BLOSUM45 for distant relations
- PAM120 for general use
- PAM60 for close relations
- PAM250 for distant relations
15Substitution matrices exercise
- Pick the best substitution matrix (PAM and
BLOSUM) for each pairwise alignment - Human chimp
- Human - yeast
- Human fish
PAM options PAM60 PAM120 PAM250
BLOSUM options BLOSUM45 BLOSUM62 BLOSUM80
16Substitution matrices
- Nucleic acids
- Transition-transversion
- Amino acids
- Evolutionary (empirical data) based (PAM,
BLOSUM) - Physico-chemical properties based (Grantham,
McLachlan)
17Gap penalty
AAGCGAAATTCGAAC A-G-GAA-CTCGAAC
AAGCGAAATTCGAAC AGG---AACTCGAAC
- Which alignment has a higher score?
- Which alignment is more likely?
18Pairwise alignment algorithm matrix
representation formulation
2 sequences S1 and S2 and a Scoring scheme
match 1, mismatch -1, gap -2
Vi,j value of the optimal alignment between
S11i and S21j
Vi,j S(S1i1,S2j1) Vi1,j1 max
Vi1,j S(gap) Vi,j1 S(gap)
Vi,j Vi1,j
Vi,j1 Vi1,j1
19Pairwise alignment algorithm matrix
representation initialization
S1
S2
20Pairwise alignment algorithm matrix
representation filling the matrix
S1
S2
21Pairwise alignment algorithm matrix
representation trace back
S1
S2
22Pairwise alignment algorithm matrix
representation trace back
S1
S2
AAAC AG-C
23Assessing the significance of an alignment score
True
AAGCTGAATTC-GAA AGGCTCATTTCTGA-
AAGCTGAATTCGAA AGGCTCATTTCTGA
28.0
Random
AGATCAGTAGACTA GAGTAGCTATCTCT
AGATCAGTAGACTA----- ----GAGTAG-CTATCTCT
26.0
. .
CGATAGATAGCATA GCATGTCATGATTC
CGATAGATAGCATA--------- ---------GCATGTCATGATTC
16.0
24Web servers for pairwise alignment
25BLAST 2 sequences (bl2Seq) at NCBI
- Produces the local alignment of two given
sequences using BLAST (Basic Local Alignment
Search Tool) engine for local alignment - Does not use an exact algorithm but a heuristic
26Back to NCBI
27BLAST bl2seq
28Bl2Seq - query
blastn nucleotide blastp protein
29Bl2seq results
30Bl2seq results
Dissimilarity
Low complexity
Similarity
Gaps
Match
31Query type AA or DNA?
- For coding sequences, AA (protein) data are
better - Selection operates most strongly at the protein
level ? the homology is more evident - AA 20 char alphabet DNA - 4 char alphabet
- lower chance of random homology for AA
?
32BLAST programs
33BLAST Blastp
34Blastp - results
35Blastp results (cont)
36Blast scores
- Bits score A score for the alignment according
to the number of similarities, identities, etc.
It has a standard set of units and is thus
independent of the scoring scheme - Expected-score (E-value) The number of
alignments with the same or higher score one can
expect to see by chance when searching a random
database with a random sequence of particular
sizes. The closer the e-value is to zero, the
greater the confidence that the hit is really a
homolog
37Multiple Sequence Alignment (MSA)
38Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI
GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S
eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE
DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se
q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD
--IAVEWWSNG--
Similar to pairwise alignment BUT n sequences are
aligned instead of just 2
Each row represents an individual sequence Each
column represents the same position
39Why perform an MSA?
MSAs are at the heart of comparative genomics
studies which seek to study evolutionary
histories, functional and structural aspects of
sequences, and to understand phenotypic
differences between species
40Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI
GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S
eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE
DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se
q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD
--IAVEWWSNG--
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI
GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S
eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE
DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se
q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD
--IAVEWWSNG--
41Alignment methods
- There is no available optimal solution for MSA
all methods are heuristics - Progressive/hierarchical alignment (ClustalX)
- Iterative alignment (MAFFT, MUSCLE)
-
42Progressive alignment
A B C D E
First step compute pairwise distances
Compute the pairwise alignments for all against
all (10 pairwise alignments). The similarities
are converted to distances and stored in a table
E D C B A
A
8 B
17 15 C
10 14 16 D
32 31 31 32 E
43E D C B A
A
8 B
17 15 C
10 14 16 D
32 31 31 32 E
Second step build a guide tree
- Cluster the sequences to create a tree (guide
tree) - represents the order in which pairs of
sequences are to be aligned - similar sequences are neighbors in the tree
- distant sequences are distant from each other in
the tree
The guide tree is imprecise and is NOT the tree
which truly describes the evolutionary
relationship between the sequences!
44Third step align sequences in a bottom up order
- Align the most similar (neighboring) pairs
- Align pairs of pairs
- Align sequences clustered to pairs of pairs
deeper in the tree
45Main disadvantages of progressive alignments
Guide-tree topology may be considerably wrong
Globally aligning pairs of sequences may create
errors that will propagate through to the final
result
46Iterative alignment
A B C DE
Pairwise distance table
Iterate until the MSA does not change
(convergence)
Guide tree
47Blastp acquiring sequences
48blastp acquiring sequences
49blastp acquiring sequences
50MSA input multiple sequence Fasta file
gtgi4504351refNP_000510.1 delta globin Homo
sapiens MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQR
FFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLS
ELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH gtgi4504349refNP_000509.1 beta
globin Homo sapiens MVHLTPEEKSAVTALWGKVNVDEVGGEA
LGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLA
HLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPV
QAAYQKVVAGVAN ALAHKYH gtgi4885393refNP_005321.1
epsilon globin Homo sapiens MVHFTAEEKAAVTSLWSK
MNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKV
LT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILAT
HFGKEFTPEVQAAWQKLVSAVAI ALAHKYH gtgi6715607refN
P_000175.1 G-gamma globin Homo
sapiens MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQR
FFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLS
ELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH gtgi28302131refNP_000550.2 A-gamma
globin Homo sapiens MGHFTEEDKATITSLWGKVNVEDAGGET
LGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDATK
HLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEV
QASWQKMVTAVAS ALSSRYH gtgi4885397refNP_005323.1
hemoglobin, zeta Homo sapiens MSLTKTERTIIVSMWA
KISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAA
VGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFP
ADFTAEAHAAWDKFLSVVSSVLTEK YR
51MSA using ClustalX
52Step1 Load the sequences
53A little unclear
54Edit Fasta headers
gtgi4504351refNP_000510.1 delta globin Homo
sapiens MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQR
FFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLS
ELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH gtgi4504349refNP_000509.1 beta
globin Homo sapiens MVHLTPEEKSAVTALWGKVNVDEVGGEA
LGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLA
HLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPV
QAAYQKVVAGVAN ALAHKYH gtgi4885393refNP_005321.1
epsilon globin Homo sapiens MVHFTAEEKAAVTSLWSK
MNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKV
LT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILAT
HFGKEFTPEVQAAWQKLVSAVAI ALAHKYH gtgi6715607refN
P_000175.1 G-gamma globin Homo
sapiens MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQR
FFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLS
ELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH gtgi28302131refNP_000550.2 A-gamma
globin Homo sapiens MGHFTEEDKATITSLWGKVNVEDAGGET
LGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDATK
HLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEV
QASWQKMVTAVAS ALSSRYH gtgi4885397refNP_005323.1
hemoglobin, zeta Homo sapiens MSLTKTERTIIVSMWA
KISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAA
VGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFP
ADFTAEAHAAWDKFLSVVSSVLTEK YR
gt delta globin
gt beta globin
gt epsilon globin
gt G-gamma globin
gt A-gamma globin
gt hemoglobin zeta
55Step2 Perform alignment
56MSA and conservation view
57(No Transcript)
58Messing-up alignment of HIV-1 env
59MSA tools
- Progressive
- CLUSTALX/CLUSTALW
- Iterative
- MUSCLE, MAFFT, PRANK