Title: Multiple Sequence Alignments
1Multiple Sequence Alignments
- Craig A. Struble, Ph.D.
- Department of Mathematics, Statistics, and Computer Science
- Marquette University
2Overview
- Background
- Applications
- Algorithms
- Dynamic Programming
- Star Alignments
- Progressive Approaches
- Localized Alignments
- Profile analysis
- Blocks analysis
- Statistical Methods
- Expectation Maximization
- Hidden Markov Models
- Position Specific Scoring
- Visualization and Editing
3Example
Multiple sequence alignment of 7 neuroglobins using ClustalX
4Example
- Searching for domains with RPS-BLAST
5Why do we do multiple sequence alignments?
- Infer phylogenetic relationships
- Understand evolutionary pressures acting on a gene
- Formulate and test hypotheses about protein 3-D structure (based on conserved regions)
- Formulate and test hypotheses about protein function
- Understand how protein function has changed
- Identify primers and probes to search for homologous sequences in other organisms
6The relationship of MSA to phylogenetics
- The goal of phylogenetics is to reconstruct evolutionary history using shared, derived characters
- Characters that have a common evolutionary history (are homologous)
- For example, eyes of humans and rats (but not humans and octopi)
- Traditionally, morphological characters
- DNA and amino acid sequence alignments are very common
- It is assumed that properly aligned sequences represent homology
7The relationship of MSA to phylogenetics
AHFGEPDFTV WNAGQFPANL HTQ-DMSSKS TIEINFKAME MIILGTEYAG
ENFGEPDFTV WNAGQFPANT HTS-GMTSKT TVEINFKQME MVILGTEYAG
KNFGEPDFTI YNAGQFPANI HTK-GMTSAT SVEINFKDME MVILGTEYAG
EDFGTPDFTI YNAGQFPCNR YTH-YMTSST SIDLNLARRE MVIMGTQYAG
ESFGTPDFTI YNAGQFPCNR YTH-YMTSST SVDLNLARRE MVILGTQYAG
LVGFKPDFVV MNGSKVTNPN WKEQGLNSEN FVAFNLTEGV QLIGGTWYGG
LKNFEPDFVV MNGSKVTNPN WKEQGLNSEN FVAFNLTERI QLIGGTWYGG
LAHFKPDFVV MNGAKCTNAK WKEHGLNSEN FTVFNLTERM QLIGGTWYGG
LKGFEPDFVV LNASKAKVEN FKELGLNSET AVVFNLAEKM QIILNTWYGG
LANFKPDFVV YNASKAKVEN YKELGLHSET AVVFNLTSRE QVIINTWYGG
LENFKADFIV YNACKCINED YKQDGLNSEV FVIFNVEENI AVIGGTWYGG
ATKIKPNFTI VSAPHFKADP EVD-GTKSET FVIISFKHKV ILIGGTEYAG
KTVEQP-FTI LSAPHFKADP KTD-GTHSET FIIVSFEKRT ILIGGTEYAG
-PAGKDEWQV LNVANFECVP ERD-GTNSDG CVILNFAQKK VLIAGMRYAG
LPSFQPKLTI IDLPSFKADP VRH-GCRSET VIACDLTNGL VLIGGTSYAG
LASFLPKLTI IDLPSFKANP ERH-GCRGET IIACDLTKGL VLIGGTSYAG
LGQFVPEMTI IDLPSFRADP ARH-GSRTET VIAVDLTRQI VLIGGTSYAG
LENFVPELTL IDLPSFRADP KRH-GCRSEN VVAIDFARKI VLIGGTQYAG
----SYDMVT IDVP------ -----SYSDV WMLVERRSNS TLVLGSDYYG
Phosphoenolpyruvate carboxylase kinase (PPCK)
gene in 19 species
PPCK_AERPE
8- Phosphoenolpyruvate carboxylase kinase (PPCK) gene in 19 species, 720 sites.
- Standard Neighbor-Joining tree constructed by ClustalW
- Tree will differ with varying tree-building and distance-estimation methods: how do we know which to use?
- Different methods will provide significantly different estimates of branch lengths, especially for the long branch.
9The relationship between MSA and evolutionary
history of a group of genes or organisms
[Figure: tree relating the sequences NFS, NFLS, NYLS, and NKYLS, with branch labels "-L" and "K" and a second NYLS label]
10Using known evolutionary relationship for
sequence alignment
[Figure: the sequences NFS, NFLS, NYLS, and NKYLS placed on a known tree, with node labels NFL/-S, NK/-YLS, and NK/-Y/FL/-S]
11What happens when a sequence alignment is wrong?
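[Figure: three candidate alignments (I, II, III) of the unaligned sequences, drawn beside three-taxon trees with leaf orders A-B-C, A-C-B, and B-C-A]
Unaligned: A AGT, B AT, C ATC
Alignments shown:
A AGT-   B A-T-   C A-TC
A AGT    B AT-    C ATC
A AGT-   B A-T-   C A-TC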
12Parameter considerations and consequences: transitions, transversions, and gaps
4 possible alignments of AATCGCG and AACCCGG

    Alignment    Gaps, tv, ts
A.  AATCGCG      0, 2, 1
    AACCCGG
B.  AATCGCG-     2, 0, 1
    AACC-CGG
C.  AATCGCG-     2, 1, 0
    AA-CCCGG
D.  AATCGC-G-    4, 0, 0
    AA-C-CCGG

- transition rate
- transversion rate
- These are treated the same for long divergence times.
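The (gaps, tv, ts) counts for the four alignments of AATCGCG and AACCCGG can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `classify_alignment` is hypothetical, and it assumes the usual definition that a transition stays within purines (A, G) or within pyrimidines (C, T), while a transversion crosses between the two classes.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_alignment(top, bottom):
    """Return (gaps, transversions, transitions) for two aligned DNA strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    gaps = tv = ts = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1                      # column containing a gap
        elif a != b:
            pair = {a, b}
            if pair <= PURINES or pair <= PYRIMIDINES:
                ts += 1                    # same chemical class: transition
            else:
                tv += 1                    # different class: transversion
    return gaps, tv, ts


# The four candidate alignments from the slide.
alignments = {
    "A": ("AATCGCG", "AACCCGG"),
    "B": ("AATCGCG-", "AACC-CGG"),
    "C": ("AATCGCG-", "AA-CCCGG"),
    "D": ("AATCGC-G-", "AA-C-CCGG"),
}
for label, (top, bottom) in alignments.items():
    print(label, classify_alignment(top, bottom))
```

Which of the four alignments is preferred then depends entirely on the relative penalties assigned to gaps, transversions, and transitions.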
13Parameter considerations and consequences: transitions, transversions, and gaps
4 possible alignments of AATCGCG and AACCCGG

    Alignment    Indels, tvs, tss
A.  AATCGCG      0, 2, 1
    AACCCGG
B.  AATCGCG-     2, 0, 1
    AACC-CGG
C.  AATCGCG-     2, 1, 0
    AA-CCCGG
D.  AATCGC-G-    4, 0, 0
    AA-C-CCGG
14Tools for MSA
- Clustal: web server or run locally
- Web server: http://www.ebi.ac.uk/clustalw/index.html
- Manuscript with details: http://www.csc.fi/molbio/progs/clustalw/ms.html
- Goal: Find an optimal multiple alignment
15ClustalW
- CLUSTAL has a number of variations; the most commonly used is CLUSTALW
- Generates pairwise alignments of all input sequences, then ranks the pairs of sequences by identity score.
- High scoring pairs of sequences align most readily to each other.
- More divergent (less related) pairs are then added to the alignment.
- Generates a phylogenetic tree of relationships to determine the steps in constructing the alignment.
- One can view the phylogenetic tree used to generate the alignment.
- Individual pairs in the alignment are aligned using either a FASTA-type (word-based, fast) method or a dynamic programming algorithm, which is slower but produces optimal pairwise alignments.
16Unix version of ClustalX, the graphical interface
to ClustalW, run locally. Note colors for amino
acid qualities and score indicator.
17 Web ClustalW options
18FOSB_MOUSE Protein fosB
MFQAFPGDYD SGSRCSSSPS
AESQYLSSVD SFGSPPTAAA SQECAGLGEM PGSFVPTVTA
ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA
RPRRPREETL TPEEEEKRRV RRERNKLAAA KCRNRRRELT
DRLQAETDQL EEEKAELESE IAELQKEKER LEFVLVAHKP
GCKIPYEEGP GPGPLAEVRD LPGSTSAKED GFGWLLPPPP
PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL
NSPSLLAL
FOSB_HUMAN Protein fosB
MFQAFPGDYD
SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ
PPVVDPYDMP GTSYSTPGMS GYSSGGASGS GGPSTSGTTS
GPGPARPARA RPRRPREETL TPEEEEKRRV RRERNKLAAA
KCRNRRRELT DRLQAETDQL EEEKAELESE IAELQKEKER
LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSAPAKED
GFSWLLPPPP PPPLPFQTSQ DAPPNLTASL FTHSEVQVLG
DPFPVVNPSY TSSFVLTCPE VSAFAGAQRT SGSDQPSDPL
NSPSLLAL
FOS_CHICK Proto-oncogene protein c-fos
MMYQGFAGEY EAPSSRCSSA SPAGDSLTYY PSPADSFSSM
GSPVNSQDFC TDLAVSSANF VPTVTAISTS PDLQWLVQPT
LISSVAPSQN RGHPYGVPAP APPAAYSRPA VLKAPGGRGQ
SIGRRGKVEQ LSPEEEEKRR IRRERNKMAA AKCRNRRREL
TDTLQAETDQ LEEEKSALQA EIANLLKEKE KLEFILAAHR
PACKMPEELR FSEELAAATA LDLGAPSPAA AEEAFALPLM
TEAPPAVPPK EPSGSGLELK AEPFDELLFS AGPREASRSV
PDMDLPGASS FYASDWEPLG AGSGGELEPL CTPVVTCTPC
PSTYTSTFVF TYPEADAFPS CAAAHRKGSS
SNEPSSDSLS
FOS_RAT Proto-oncogene protein c-fos
MMFSGFNADY EASSSRCSSA SPAGDSLSYY
GSPVNTQDFC ADLSVSSANF IPTVTAISTS PDLQWLVQPT
LVSSVAPSQT RAPHPYGLPT PSTGAYARAG VVKTMSGGRA
QSIGRRGKVE QLSPEEEEKR RIRRERNKMA AAKCRNRRRE
LTDTLQAETD QLEDEKSALQ TEIANLLKEK EKLEFILAAH
RPACKIPNDL GFPEEMSVTS LDLTGGLPEA TTPESEEAFT
LPLLNDPEPK PSLEPVKNIS NMELKAEPFD DFLFPASSRP
SGSETARSVP DVDLSGSFYA ADWEPLHSSS LGMGPMVTEL
EPLCTPVVTC TPSCTTYTSS FVFTYPEADS FPSCAAAHRK
GSSSNEPSSD SLSSPTLLAL
FOS_MOUSE Proto-oncogene protein c-fos
MMFSGFNADY EASSSRCSSA SPAGDSLSYY
HSPADSFSSM GSPVNTQDFC ADLSVSSANF IPTVTAISTS
PDLQWLVQPT LVSSVAPSQT RAPHPYGLPT QSAGAYARAG
MVKTVSGGRA QSIGRRGKVE QLSPEEEEKR RIRRERNKMA
AAKCRNRRRE LTDTLQAETD QLEDEKSALQ TEIANLLKEK
EKLEFILAAH RPACKIPDDL GFPEEMSVAS LDLTGGLPEA
STPESEEAFT LPLLNDPEPK PSLEPVKSIS NVELKAEPFD
DFLFPASSRP SGSETSRSVP DVDLSGSFYA ADWEPLHSNS
LGMGPMVTEL EPLCTPVVTC TPGCTTYTSS FVFTYPEADS
FPSCAAAHRK GSSSNEPSSD SLSSPTLLAL
Sequence data for two related genes: fosB from mouse and human; c-fos from chicken, mouse, and rat.
19- Significant differences between FosB and c-Fos.
- Rat and mouse c-Fos sequences differ from chicken c-Fos.
- Long conserved region between 130 and 225.
- Symbols
- * Identity across all sequences
- : Conservation of amino acid characteristics
- . Semi-conserved substitutions
20- To better visualize conservation, colors can be used.
- Color code
- AVFPMILW: RED, Small (small + hydrophobic (incl. aromatic - Y))
- DE: BLUE, Acidic
- RHK: MAGENTA, Basic
- STYHCNGQ: GREEN, Hydroxyl + Amine + Basic - Q
- Others: Grey
- Differences between the genes and species are more apparent.
21Neighbor Joining tree constructed with a web-ClustalW applet (Jalview). FosB and c-Fos can be distinguished. Rat and mouse cluster apart from chicken, with respect to c-Fos.
22A Highly conserved region
B Rather dissimilar region
23Threonyl-tRNA synthetase (thrS2) gene w/
consensus sequence
24Threonyl-tRNA synthetase (thrS2) gene in 6 species
25MALIGN
- construction of pairwise MOTIFS (conserved regions of similarity without gaps)
- construction of MULTIPLE MOTIFS (of thickness exceeding 2)
- forming of SUPERMOTIFS (groupings of motifs that are near each other) from MULTIPLE MOTIFS
- construction of MULTIPLE ALIGNMENTS from previously obtained MOTIFS and SUPERMOTIFS and consequent selection of the best alignment.
- http://www.genebee.msu.su/services/malign_full.html
27Each motif or supermotif has a score and involves a pair of sequences
28Malign
Likelihood of obtaining a sum of mismatch weights along an alignment for random sequences of the same length; includes parameters for gaps and mismatches
29Definition
- A multiple alignment of strings $S_1, \dots, S_k$ is a series of strings with spaces $S_1', \dots, S_k'$ such that
- $|S_1'| = \dots = |S_k'|$
- $S_j'$ is an extension of $S_j$ by insertion of spaces
- Goal: Find an optimal multiple alignment.
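The two conditions in the definition translate directly into a check. This is a small illustrative sketch; the function name `is_multiple_alignment` and the use of "-" for a space are assumptions, not taken from the slides.

```python
def is_multiple_alignment(originals, aligned, gap="-"):
    """Check the two conditions of the definition.

    1. All aligned rows have the same length.
    2. Removing the gap characters from row j gives back original sequence j.
    """
    if len(originals) != len(aligned):
        return False
    if len({len(row) for row in aligned}) > 1:
        return False  # rows of unequal length
    return all(row.replace(gap, "") == seq
               for seq, row in zip(originals, aligned))


seqs = ["MPE", "MKE", "MSKE", "SKE"]
print(is_multiple_alignment(seqs, ["-MPE", "-MKE", "MSKE", "-SKE"]))
print(is_multiple_alignment(seqs, seqs))
```

The definition says nothing about quality: any padding that satisfies both conditions is a valid multiple alignment, which is why a scoring scheme is needed to pick an optimal one.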
30Scoring Alignments
- In order to find an optimal alignment, we need to be able to measure how good an alignment is
- Sum of pairs (SP) method: in a column, score each pair of letters and total the scores. Pairs of gaps score 0.
- Total up scores for each column
31SP Method Example
- Using BLOSUM62 matrix, gap penalty -8
- In column 1, we have pairs
- -,S
- -,S
- S,S
- k(k-1)/2 pairs per column
-8 - 8 + 4 = -12
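The column score in this example can be reproduced with a few lines of code. This is a sketch under stated assumptions: only the single BLOSUM62 entry needed here (S/S = 4) is included, in a hypothetical `BLOSUM62_EXCERPT` table, so any other residue pair raises a `KeyError`; a full matrix would be needed for real alignments.

```python
from itertools import combinations

GAP_PENALTY = -8                      # gap penalty from the slide
BLOSUM62_EXCERPT = {("S", "S"): 4}    # only the entry needed for this example


def pair_score(a, b):
    """Score one pair of symbols in a column."""
    if a == "-" and b == "-":
        return 0                      # pairs of gaps score 0
    if a == "-" or b == "-":
        return GAP_PENALTY
    key = (a, b) if (a, b) in BLOSUM62_EXCERPT else (b, a)
    return BLOSUM62_EXCERPT[key]      # KeyError if the pair is not in the excerpt


def sp_column(column):
    """Sum-of-pairs score of one column: all k(k-1)/2 pairs."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows):
    """Total SP score: sum of the column scores."""
    return sum(sp_column(col) for col in zip(*rows))


print(sp_column(["-", "S", "S"]))     # pairs (-,S), (-,S), (S,S)
```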
32Dynamic Programming
- The dynamic programming approach can be adapted to MSA
- For simplicity, assume k sequences of length n
- The dynamic programming array F is k-dimensional, of length n+1 in each dimension (including initial gaps)
- The entry $F(i_1, \dots, i_k)$ represents the score of the optimal alignment for $s_1[1..i_1], \dots, s_k[1..i_k]$
33Dynamic Programming
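- Letting i represent the vector $(i_1, \dots, i_k)$ and b represent a nonzero binary vector of length k, we fill in the array with the formula
  $F(i) = \max_{b} \{ F(i - b) + \text{SP-score}(\text{Column}(s, i, b)) \}$, taken over all b with $i - b \ge 0$
- where (selecting a column to score)
  $\text{Column}(s, i, b) = (c_1, \dots, c_k)$, with $c_j = s_j[i_j]$ if $b_j = 1$ and $c_j = -$ (a gap) if $b_j = 0$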
34Example
s1 = MPE, s2 = MKE, s3 = MSKE, s4 = SKE
- Let i = (1,1,1,1), b = (1,0,0,0)
- Checking F(0,1,1,1) (i - b)
- Column(s,i,b) is (M, -, -, -)
- SP-score is -24 (assuming gap penalty of -8)
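The recurrence can be sketched as a memoized function over index vectors. This is an illustrative sketch only: it uses a toy substitution score (match +1, mismatch -1), which is an assumption standing in for a real matrix such as BLOSUM62, together with the gap penalty of -8 from the example. The names `column`, `sp_column`, and `msa_score` are hypothetical.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = -8                       # gap penalty used in the example
MATCH, MISMATCH = 1, -1        # assumed toy substitution scores


def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sp_column(col):
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def column(seqs, i, b):
    """Column(s, i, b): residue s_j[i_j] where b_j = 1, a gap where b_j = 0."""
    return tuple(s[ij - 1] if bj else "-" for s, ij, bj in zip(seqs, i, b))


def msa_score(seqs):
    """Optimal SP score by exact k-dimensional dynamic programming."""
    k = len(seqs)
    moves = [b for b in product((0, 1), repeat=k) if any(b)]  # nonzero binary vectors

    @lru_cache(maxsize=None)
    def F(i):
        if not any(i):
            return 0                                   # F(0, ..., 0)
        best = None
        for b in moves:
            prev = tuple(ij - bj for ij, bj in zip(i, b))
            if min(prev) < 0:
                continue                               # i - b must stay inside the array
            cand = F(prev) + sp_column(column(seqs, i, b))
            if best is None or cand > best:
                best = cand
        return best

    return F(tuple(len(s) for s in seqs))


seqs = ("MPE", "MKE", "MSKE", "SKE")
print(column(seqs, (1, 1, 1, 1), (1, 0, 0, 0)))   # the column checked on the slide
print(msa_score(seqs))
```

Even for these four short sequences each entry examines up to 15 predecessor cells, which hints at why the approach does not scale to many sequences.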
35Analysis
- $O(n^k)$ entries to fill
- Each entry combines $O(2^k)$ other entries
- Costs $O(k^2)$ to calculate each SP score
- Overall cost is $O(k^2 2^k n^k)$, or exponential in the number of sequences!
- MSA with SP-score has been shown to be NP-complete
36Star Alignments
- Heuristic method for multiple sequence alignments
- Select a sequence $s_c$ as the center of the star
- For each sequence $s_i$ among $s_1, \dots, s_k$ with index $i \ne c$, perform a Needleman-Wunsch global alignment of $s_i$ with $s_c$
- Aggregate alignments with the principle "once a gap, always a gap."
37Star Alignments Example
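s1 = MPE, s2 = MKE, s3 = MSKE, s4 = SKE
[Figure: star with s2 at the center and s1, s3, s4 around it]

Pairwise alignments with s2:
s1  MPE     s3  MSKE    s4  SKE
s2  MKE     s2  -MKE    s2  MKE

Aggregating the alignments:
MPE     -MPE    -MPE
MKE     -MKE    -MKE
        MSKE    MSKE
                -SKE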
38Choosing a center
- Try them all and pick the one with the best score
- Calculate all $O(k^2)$ pairwise alignments, and pick the sequence $s_c$ that maximizes the total pairwise score $\sum_{i \ne c} \text{score}(s_i, s_c)$
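One way to carry out this choice is sketched below. It assumes the quantity being maximized is the sum of pairwise global alignment scores against the candidate center, and it uses a toy scoring scheme (match +1, mismatch -1, gap -2) rather than a real substitution matrix; the names `nw_score` and `choose_center` are hypothetical.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping only two rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def choose_center(seqs):
    """Return (index of the center, list of summed pairwise scores)."""
    totals = [
        sum(nw_score(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    center = max(range(len(seqs)), key=totals.__getitem__)
    return center, totals


center, totals = choose_center(["MPE", "MKE", "MSKE", "SKE"])
print(center, totals)
```

With this toy scoring the four example sequences select index 1 (MKE) as the center; a different scoring scheme could select a different sequence.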
39Analysis
- Assuming all sequences have length n
- $O(n^2)$ to calculate each global alignment
- O(k) global alignments to calculate
- Using a reasonable data structure for joining alignments, no worse than O(kl), where l is an upper bound on alignment lengths
- $O(kn^2 + k^2 l)$ overall cost
40Progressive Approaches
- CLUSTALW
- Perform pairwise alignments
- Construct a tree, joining most similar sequences first (guide tree)
- Align sequences sequentially, using the phylogenetic tree
- PILEUP
- Similar to CLUSTALW
- Uses UPGMA to produce tree (chapter 6)
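The guide-tree step can be illustrated with a small UPGMA sketch. The distance values below are made up for illustration, and the function name `upgma` and the nested-tuple tree representation are assumptions; real tools derive the distances from the pairwise alignment scores.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree by UPGMA.

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns nested tuples; the innermost pairs are joined first.
    """
    sizes = {name: 1 for name in names}   # cluster -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the new cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical distances between four sequences.
example = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6,
    frozenset(("B", "C")): 6, frozenset(("A", "D")): 10,
    frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
print(upgma(["A", "B", "C", "D"], example))
```

A progressive aligner would then align A with B first, add C to that alignment, and add D last, following the tree from the leaves inward.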
41Progressive Approaches
42Problems with Progressive Alignments
- MSA depends on pairwise alignments
- If sequences are very distantly related, much higher likelihood of errors
- Care must be taken in choosing scoring matrices and penalties
- Other approaches use Bayesian methods such as hidden Markov models
43Localized Analysis
44Profile Analysis
- A profile is a scoring matrix defined from a multiple sequence alignment of related sequences
- Profiles are used to score unknown sequences to estimate whether or not the unknown sequence is related
45Profile Example (http://www.sdsc.edu/projects/profile/profile_desc.html)
- ATP Binding RNA helicase (DEAD Box)
rhle_ecoli  GVDVLVA TPG
dbp2_schpo  GVEICIA TPG
dbp2_yeast  GSEIVIA TPG
dbpa_ecoli  APHIIVA TPG
46Using Profiles
- Basically, perform a pairwise sequence alignment using the profile as a scoring matrix
- Several sequences can be aligned to the same profile, yielding a multiple sequence alignment
- Generating profiles is computationally intensive
- Comparing a sequence to many profiles is computationally intensive
- Which profile is my unknown sequence most like?
47Blocks
- "Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins." (from the BLOCKS WWW Server, http://www.blocks.fhcrc.org/)
48Statistical Methods
- Expectation Maximization
- Gibbs Sampling
- Hidden Markov Models
49Position Specific Scoring
- We saw these with profile analysis
- Essentially, derive a scoring matrix from similar sequences
- Use the scoring matrix to score residues based on their alignment position
- Scores can also be derived from a log transformation of the frequency of each amino acid
- Log likelihood of seeing the amino acid in the position vs. just at random
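The log-odds idea can be sketched using the DEAD-box alignment from the profile example. Several choices here are assumptions made for illustration and are not from the slides: a uniform background frequency of 1/20 for every amino acid, a pseudocount of 1 per residue to avoid taking the log of zero, base-2 logarithms, and the names `build_pssm` and `score`.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Ungapped DEAD-box segments from the profile example slide.
ALIGNMENT = ["GVDVLVATPG", "GVEICIATPG", "GSEIVIATPG", "APHIIVATPG"]


def build_pssm(rows, pseudocount=1.0):
    """One dict per column: residue -> log2(observed frequency / background)."""
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    pssm = []
    for col in zip(*rows):
        counts = Counter(col)
        total = len(col) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Score an ungapped sequence of the same length as the matrix."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must match the number of columns")
    return sum(col[aa] for col, aa in zip(pssm, seq))


pssm = build_pssm(ALIGNMENT)
print(round(score(pssm, "GVDVLVATPG"), 2))   # a member of the family
print(round(score(pssm, "WWWWWWWWWW"), 2))   # an unrelated-looking sequence
```

Positive scores mean a residue is seen at that position more often than the background predicts; negative scores mean less often.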
50Visualization and Editing
- Visualization should present researchers with information
- Point out key information
- See patterns that might otherwise be missed
- Editing
- Sequence alignments might still violate biological principles
- Our models are not perfect
- Editors allow researchers to modify alignments, based on relevant principles
- Principles are not built into most software
51Visualization
- Sequence alignment results
52Visualization
- Sequence logos for consensus sequences
Consensus read off the top
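The quantities behind a sequence logo can be sketched as follows: the height of each column is its information content, and the consensus is the most frequent residue per column. This sketch assumes a 20-letter protein alphabet, applies no small-sample correction, and treats every character (including gaps) as a symbol; the names `column_information` and `consensus` are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def consensus(rows):
    """Most frequent symbol in each column (ties broken arbitrarily)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


rows = ["GVDVLVATPG", "GVEICIATPG", "GSEIVIATPG", "APHIIVATPG"]
print(consensus(rows))
for position, col in enumerate(zip(*rows), start=1):
    print(position, round(column_information(col), 2))
```

Fully conserved columns reach the maximum of log2(20) bits, while a column with four different residues drops by two bits.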
53Visualization
54Visualization
55Editing
- CINEMA
- http://bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
- Applet
- GDE (Genetic Data Environment)
- http://www.tigr.org/jeisen/GDE/GDE.html
- Unix based, requires GCG
- GeneDoc
- Windows based
- MACAW
- Mac and PC
56Editing
- In addition to moving the motifs, etc., editors are used to format alignments
- Colors, highlighting, etc.
- Typically have to reformat the data
- Can use the SEQIO program
- Or use READSEQ on the web
57Summary
- There is a strong relationship between multiple sequence alignment and phylogenetics
- Many different approaches to MSA
- Dynamic programming
- Progressive
- Star, ClustalW, PILEUP
- Iterative methods (not covered)
- Hidden Markov Models, genetic algorithms
- Localized alignments
- Profiles, blocks
58Summary
- Basic probabilistic approaches
- Expectation maximization
- Gibbs sampling
- Hidden Markov Models
- Visualizations and Editing
- Sequence logos