Pairwise and Multiple Sequence Alignment Lesson 2 - PowerPoint PPT Presentation

1 / 59

About This Presentation

Title:

Pairwise and Multiple Sequence Alignment Lesson 2

Description:

Pairwise and Multiple Sequence Alignment Lesson 2 Motivation What is sequence alignment? Alignment: Comparing two (pairwise) or more (multiple) sequences. – PowerPoint PPT presentation

Number of Views:577

Avg rating:3.0/5.0

Slides: 60

Provided by: ibisTauA

Category:

more less

Transcript and Presenter's Notes

Title: Pairwise and Multiple Sequence Alignment Lesson 2

1
Pairwise and Multiple Sequence
AlignmentLesson 2
2
Motivation
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAA
CAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTAT
GGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGA
GACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTG
CTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTT
CTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATG
CTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGT
ATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCT
GCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAAT
ATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGG
ACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCA
TCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTAT
AGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTC
TCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATT
TTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACT
TTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTT
GCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGA
ATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTG
CATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGG
GCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAA
ACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCA
GCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGG
AATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTG
GCTAATGCCCTGGCTCACAAGTACCATTGA
MV
HLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE
3
What is sequence alignment?

Alignment Comparing two (pairwise) or more
(multiple) sequences. Searching for a series of
identical or similar characters in the sequences.

MVNLTSDEKTAVLALWNKVDVEDCGGE
MVHLTPEEKTAVNALWGKVNVDAVGGE
4
Why perform a pairwise sequence alignment?
Finding homology between two sequences

e.g., predicting characteristics of a protein
premised on
similar sequence (or structure)
similar function

5
Local vs. Global

Local alignment finds regions of high
similarity in parts of the sequences
Global alignment finds the best alignment
across the entire two sequences

ADLGAVFALCDRYFQ ADLGRTQN CDRYYQ
ADLGAVFALCDRYFQ ADLGRTQN-CDRYYQ
6
Evolutionary changes in sequences

Three types of nucleotide changes
Substitution a replacement of one (or more)
sequence characters by another
Insertion - an insertion of one (or more)
sequence characters
Deletion a deletion of one (or more) sequence
characters

AAGA
AACA
?
T
A
AAG
GA
A
A
Insertion Deletion ? Indel
7
Choosing an alignment

Many different alignments between two sequences
are possible

AAGCTGAATTCGAA AGGCTCATTTCTGA
A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA-
AAGCTGAATT-C-GAA AGGCT-CATTTCTGA-
. . .
How do we determine which is the best alignment?
8
Toy exercise
Compute the scores of each of the following
alignments using this naïve scoring scheme
Scoring scheme

Match 1
Mismatch -2
Indel -1

-2 -2 -2 1
-2 -2 1 -2
-2 1 -2 -2
1 -2 -2 -2
Substitution matrix
Gap penalty (opening extending)
AAGCTGAATT-C-GAA AGGCT-CATTTCTGA-
A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA-
9
Substitution matrices accounting for biological
context

Which best reflects the biological reality
regarding nucleotide mismatch penalty?
Tr gt Tv gt 0
Tv gt Tr gt 0
0 gt Tr gt Tv
0 gt Tv gt Tr

Tr Transition Tv Transversion
10
Scoring schemes accounting for biological context

Which best reflects the biological reality
regarding these mismatch penalties?
Arg-gtLys gt Ala-gtPhe
Arg-gtLys gt Thr-gtAsp
Asp-gtVal gt Asp-gtGlu

11
PAM matrices

Family of matrices PAM 80, PAM 120, PAM 250,
The number with a PAM matrix (the n in PAMn)
represents the evolutionary distance between the
sequences on which the matrix is based
The (ith,jth) cell in a PAMn matrix denotes the
probability that amino-acid i will be replaced by
amino-acid j in time n Pi?j,n
Greater n numbers denote greater distances

12
PAM - limitations

Based on only one original dataset
Examines proteins with few differences (85
identity)
Based mainly on small globular proteins so the
matrix is biased

13
BLOSUM matrices

Different BLOSUMn matrices are calculated
independently from BLOCKS (ungapped, manually
created local alignments)
BLOSUMn is based on a cluster of BLOCKS of
sequences that share at least n percent identity
The (ith,jth) cell in a BLOSUM matrix denotes the
log of odds of the observed frequency and
expected frequency of amino acids i and j in the
same position in the data log(Pij/qiqj)
Higher n numbers denote higher identity between
the sequences on which the matrix is based

14
PAM Vs. BLOSUM

PAM100 BLOSUM90
PAM120 BLOSUM80
PAM160 BLOSUM60
PAM200 BLOSUM52
PAM250 BLOSUM45

More distant sequences

BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations

PAM120 for general use
PAM60 for close relations
PAM250 for distant relations

15
Substitution matrices exercise

Pick the best substitution matrix (PAM and
BLOSUM) for each pairwise alignment
Human chimp
Human - yeast
Human fish

PAM options PAM60 PAM120 PAM250
BLOSUM options BLOSUM45 BLOSUM62 BLOSUM80
16
Substitution matrices

Nucleic acids
Transition-transversion
Amino acids
Evolutionary (empirical data) based (PAM,
BLOSUM)
Physico-chemical properties based (Grantham,
McLachlan)

17
Gap penalty
AAGCGAAATTCGAAC A-G-GAA-CTCGAAC
AAGCGAAATTCGAAC AGG---AACTCGAAC

Which alignment has a higher score?
Which alignment is more likely?

18
Pairwise alignment algorithm matrix
representation formulation
2 sequences S1 and S2 and a Scoring scheme
match 1, mismatch -1, gap -2
Vi,j value of the optimal alignment between
S11i and S21j
Vi,j S(S1i1,S2j1) Vi1,j1 max
Vi1,j S(gap) Vi,j1 S(gap)
Vi,j Vi1,j
Vi,j1 Vi1,j1
19
Pairwise alignment algorithm matrix
representation initialization
S1
S2
20
Pairwise alignment algorithm matrix
representation filling the matrix
S1
S2
21
Pairwise alignment algorithm matrix
representation trace back
S1
S2
22
Pairwise alignment algorithm matrix
representation trace back
S1
S2
AAAC AG-C
23
Assessing the significance of an alignment score
True
AAGCTGAATTC-GAA AGGCTCATTTCTGA-
AAGCTGAATTCGAA AGGCTCATTTCTGA
28.0
Random
AGATCAGTAGACTA GAGTAGCTATCTCT
AGATCAGTAGACTA----- ----GAGTAG-CTATCTCT
26.0
. .
CGATAGATAGCATA GCATGTCATGATTC
CGATAGATAGCATA--------- ---------GCATGTCATGATTC
16.0
24
Web servers for pairwise alignment
25
BLAST 2 sequences (bl2Seq) at NCBI

Produces the local alignment of two given
sequences using BLAST (Basic Local Alignment
Search Tool) engine for local alignment
Does not use an exact algorithm but a heuristic

26
Back to NCBI
27
BLAST bl2seq
28
Bl2Seq - query
blastn nucleotide blastp protein
29
Bl2seq results
30
Bl2seq results
Dissimilarity
Low complexity
Similarity
Gaps
Match
31
Query type AA or DNA?

For coding sequences, AA (protein) data are
better
Selection operates most strongly at the protein
level ? the homology is more evident
AA 20 char alphabet DNA - 4 char alphabet
lower chance of random homology for AA

?
32
BLAST programs
33
BLAST Blastp
34
Blastp - results
35
Blastp results (cont)
36
Blast scores

Bits score A score for the alignment according
to the number of similarities, identities, etc.
It has a standard set of units and is thus
independent of the scoring scheme
Expected-score (E-value) The number of
alignments with the same or higher score one can
expect to see by chance when searching a random
database with a random sequence of particular
sizes. The closer the e-value is to zero, the
greater the confidence that the hit is really a
homolog

37
Multiple Sequence Alignment (MSA)
38
Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI
GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S
eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE
DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se
q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD
--IAVEWWSNG--
Similar to pairwise alignment BUT n sequences are
aligned instead of just 2
Each row represents an individual sequence Each
column represents the same position
39
Why perform an MSA?
MSAs are at the heart of comparative genomics
studies which seek to study evolutionary
histories, functional and structural aspects of
sequences, and to understand phenotypic
differences between species
40
Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI
GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S
eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE
DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se
q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD
--IAVEWWSNG--
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI
GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S
eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE
DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se
q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD
--IAVEWWSNG--
41
Alignment methods

There is no available optimal solution for MSA
all methods are heuristics
Progressive/hierarchical alignment (ClustalX)
Iterative alignment (MAFFT, MUSCLE)

42
Progressive alignment
A B C D E
First step compute pairwise distances
Compute the pairwise alignments for all against
all (10 pairwise alignments). The similarities
are converted to distances and stored in a table
E D C B A
A
8 B
17 15 C
10 14 16 D
32 31 31 32 E
43
E D C B A
A
8 B
17 15 C
10 14 16 D
32 31 31 32 E
Second step build a guide tree

Cluster the sequences to create a tree (guide
tree)
represents the order in which pairs of
sequences are to be aligned
similar sequences are neighbors in the tree
distant sequences are distant from each other in
the tree

The guide tree is imprecise and is NOT the tree
which truly describes the evolutionary
relationship between the sequences!
44
Third step align sequences in a bottom up order

Align the most similar (neighboring) pairs
Align pairs of pairs
Align sequences clustered to pairs of pairs
deeper in the tree

45
Main disadvantages of progressive alignments
Guide-tree topology may be considerably wrong
Globally aligning pairs of sequences may create
errors that will propagate through to the final
result
46
Iterative alignment
A B C DE
Pairwise distance table
Iterate until the MSA does not change
(convergence)
Guide tree
47
Blastp acquiring sequences
48
blastp acquiring sequences
49
blastp acquiring sequences
50
MSA input multiple sequence Fasta file
gtgi4504351refNP_000510.1 delta globin Homo
sapiens MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQR
FFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLS
ELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH gtgi4504349refNP_000509.1 beta
globin Homo sapiens MVHLTPEEKSAVTALWGKVNVDEVGGEA
LGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLA
HLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPV
QAAYQKVVAGVAN ALAHKYH gtgi4885393refNP_005321.1
epsilon globin Homo sapiens MVHFTAEEKAAVTSLWSK
MNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKV
LT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILAT
HFGKEFTPEVQAAWQKLVSAVAI ALAHKYH gtgi6715607refN
P_000175.1 G-gamma globin Homo
sapiens MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQR
FFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLS
ELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH gtgi28302131refNP_000550.2 A-gamma
globin Homo sapiens MGHFTEEDKATITSLWGKVNVEDAGGET
LGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDATK
HLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEV
QASWQKMVTAVAS ALSSRYH gtgi4885397refNP_005323.1
hemoglobin, zeta Homo sapiens MSLTKTERTIIVSMWA
KISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAA
VGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFP
ADFTAEAHAAWDKFLSVVSSVLTEK YR
51
MSA using ClustalX
52
Step1 Load the sequences
53
A little unclear
54
Edit Fasta headers
gtgi4504351refNP_000510.1 delta globin Homo
sapiens MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQR
FFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLS
ELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH gtgi4504349refNP_000509.1 beta
globin Homo sapiens MVHLTPEEKSAVTALWGKVNVDEVGGEA
LGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLA
HLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPV
QAAYQKVVAGVAN ALAHKYH gtgi4885393refNP_005321.1
epsilon globin Homo sapiens MVHFTAEEKAAVTSLWSK
MNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKV
LT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILAT
HFGKEFTPEVQAAWQKLVSAVAI ALAHKYH gtgi6715607refN
P_000175.1 G-gamma globin Homo
sapiens MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQR
FFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLS
ELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH gtgi28302131refNP_000550.2 A-gamma
globin Homo sapiens MGHFTEEDKATITSLWGKVNVEDAGGET
LGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDATK
HLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEV
QASWQKMVTAVAS ALSSRYH gtgi4885397refNP_005323.1
hemoglobin, zeta Homo sapiens MSLTKTERTIIVSMWA
KISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAA
VGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFP
ADFTAEAHAAWDKFLSVVSSVLTEK YR
gt delta globin
gt beta globin
gt epsilon globin
gt G-gamma globin
gt A-gamma globin
gt hemoglobin zeta
55
Step2 Perform alignment
56
MSA and conservation view
57
(No Transcript)
58
Messing-up alignment of HIV-1 env
59
MSA tools