Title: Practical Aspects of Multiple Sequence Alignments
1 Practical Aspects of Multiple Sequence Alignments
- Mike Thomas, Ph.D.
- Bioinformatics Research Center, Medical College of Wisconsin
2 Outline: Practical aspects of MSA
- Why it is important to accurately assess alignments
- How to conduct MSAs
- What we do with MSAs
3 Why do we do multiple sequence alignments?
- Infer phylogenetic relationships
- Understand evolutionary pressures acting on a gene
- Formulate and test hypotheses about protein 3-D structure (based on conserved regions)
- Formulate and test hypotheses about protein function
- Understand how protein function has changed
- Identify primers and probes to search for homologous sequences in other organisms
4 The relationship of MSA to phylogenetics
- The goal of phylogenetics is to reconstruct evolutionary history using shared, derived characters
- Characters that have a common evolutionary history (are homologous)
- For example, eyes of humans and rats (but not humans and octopi)
- Traditionally, morphological characters were used.
- Now, DNA and amino acid sequence alignments are very common for phylogenetic reconstruction
- It is assumed that properly aligned sequences represent homology
5 The relationship between MSA and evolutionary history of a group of genes or organisms
[Figure: tree relating the sequences NFS, NFLS, NYLS, and NKYLS; labels on the tree: -L, K, NYLS]
6 Using known evolutionary relationship for sequence alignment
[Figure: tree with the sequences NFS, NFLS, NYLS, and NKYLS; node labels: NFL/-S, NK/-YLS, NK/-Y/FL/-S]
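The node labels on slide 6 (NFL/-S, NK/-YLS, NK/-Y/FL/-S) can be read as column-by-column summaries of aligned sequences. The Python sketch below reproduces them under one assumption: that the four sequences are gapped as N-F-S, N-FLS, N-YLS, and NKYLS. That gapping is inferred from the labels, not given on the slide, and the helper name `column_summary` is hypothetical.

```python
def column_summary(rows):
    """Summarize aligned rows column by column.

    Columns with one state print that state; variable columns print their
    states as X/Y (gap last). Columns that contain only gaps are dropped.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    parts = []
    for column in zip(*rows):
        states = []
        for ch in column:
            if ch not in states:          # keep order of first appearance
                states.append(ch)
        if states == ["-"]:
            continue                      # all-gap column: nothing to report
        states.sort(key=lambda ch: ch == "-")   # stable sort, gap goes last
        parts.append("/".join(states))
    return "".join(parts)


# Assumed gapping of the four sequences shown on the slide.
print(column_summary(["N-F-S", "N-FLS"]))                    # NFL/-S
print(column_summary(["NKYLS", "N-YLS"]))                    # NK/-YLS
print(column_summary(["NKYLS", "N-YLS", "N-FLS", "N-F-S"]))  # NK/-Y/FL/-S
```

The order of states inside a variable column follows the order of the input rows, which is why the last call lists the Y-containing sequences first.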
7 What happens when a sequence alignment is wrong?
[Figure: three tree topologies for taxa A, B, and C, shown alongside alignments labeled I, II, and III]
Unaligned: A AGT, B AT, C ATC
Alignments shown: (A AGT-, B A-T-, C A-TC) and (A AGT, B AT-, C ATC)
8 Parameter considerations and consequences: transitions, transversions, and gaps
4 possible alignments of AATCGCG and AACCCGG, with counts of (gaps, tv, ts):
A. AATCGCG / AACCCGG: 0, 2, 1
B. AATCGCG- / AACC-CGG: 2, 0, 1
C. AATCGCG- / AA-CCCGG: 2, 1, 0
D. AATCGC-G- / AA-C-CCGG: 4, 0, 0
- transition rate
- transversion rate
- These are treated the same for long divergence times.
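The (gaps, tv, ts) counts for alignments A to D can be checked mechanically, and doing so shows why the choice of weights matters. The following Python sketch counts the three event types and then picks the cheapest alignment under a given set of weights. The weight values and the function names `count_events` and `best_alignment` are illustrative assumptions, not parameters taken from any alignment program.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# The four candidate alignments from the slide.
ALIGNMENTS = {
    "A": ("AATCGCG", "AACCCGG"),
    "B": ("AATCGCG-", "AACC-CGG"),
    "C": ("AATCGCG-", "AA-CCCGG"),
    "D": ("AATCGC-G-", "AA-C-CCGG"),
}


def count_events(top, bottom):
    """Return (gaps, transversions, transitions) for two aligned DNA rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    gaps = tv = ts = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1
        elif x != y:
            # Purine<->purine or pyrimidine<->pyrimidine is a transition.
            if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
                ts += 1
            else:
                tv += 1
    return gaps, tv, ts


def best_alignment(gap_w, tv_w, ts_w):
    """Return the label of the alignment with the lowest weighted cost."""
    def cost(label):
        gaps, tv, ts = count_events(*ALIGNMENTS[label])
        return gap_w * gaps + tv_w * tv + ts_w * ts
    return min(sorted(ALIGNMENTS), key=cost)


for label, rows in ALIGNMENTS.items():
    print(label, count_events(*rows))

print(best_alignment(gap_w=2.0, tv_w=1.0, ts_w=0.5))  # expensive gaps -> A
print(best_alignment(gap_w=0.4, tv_w=2.0, ts_w=2.0))  # cheap gaps -> D
```

With equal weights for all three event types, alignments A, B, and C tie at a cost of 3, so the weights alone decide which alignment is reported as optimal.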
10 Tools for MSA
- Clustal: web server or run locally
- Web server: http://www.ebi.ac.uk/clustalw/index.html
- Manuscript with details: http://www.csc.fi/molbio/progs/clustalw/ms.html
- Goal: Find an optimal multiple alignment
11 ClustalW
- CLUSTAL has a number of variations; the most commonly used is CLUSTALW
- Generates pairwise alignments of all input sequences, then ranks scores of identities among pairs of sequences.
- High-scoring pairs of sequences align most readily to each other.
- More divergent (less related) pairs are then added to the alignment.
- Generates a phylogenetic tree of relationships to determine steps in constructing the alignment.
- One can view the phylogenetic tree used to generate the alignment.
- Individual pairs in the alignment are aligned using a FASTA-type method (word-based, fast alignment) or by a dynamic programming algorithm, which is slower but produces optimal pairwise alignments.
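The first stage described on this slide, scoring every pair of sequences by dynamic programming and ranking the pairs, can be sketched in a few lines of Python. This is a minimal illustration, not ClustalW's implementation: it assumes a flat scoring scheme (match +1, mismatch -1, gap -1) where ClustalW uses substitution matrices and separate gap-opening and gap-extension penalties, and the names `nw_score` and `rank_pairs` are hypothetical. The input is the four short sequences from slide 5.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score (Needleman-Wunsch dynamic programming)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def rank_pairs(seqs):
    """Score all pairs and return (score, seq_a, seq_b), best pair first."""
    scored = [(nw_score(x, y), x, y) for x, y in combinations(seqs, 2)]
    return sorted(scored, key=lambda item: -item[0])


for score, x, y in rank_pairs(["NFS", "NFLS", "NYLS", "NKYLS"]):
    print(score, x, y)
```

The highest-scoring pair (NYLS and NKYLS, score 3) would be aligned first, and the more divergent pairs would be added afterwards in the order given by the guide tree.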
12 Unix version of ClustalX, the graphical interface to ClustalW, run locally. Note colors for amino acid qualities and score indicator.
13 Web ClustalW options
14 FOSB_MOUSE Protein fosB
MFQAFPGDYD SGSRCSSSPS
AESQYLSSVD SFGSPPTAAA SQECAGLGEM PGSFVPTVTA
ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA
RPRRPREETL TPEEEEKRRV RRERNKLAAA KCRNRRRELT
DRLQAETDQL EEEKAELESE IAELQKEKER LEFVLVAHKP
GCKIPYEEGP GPGPLAEVRD LPGSTSAKED GFGWLLPPPP
PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL
NSPSLLAL
FOSB_HUMAN Protein fosB
MFQAFPGDYD
SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ
PPVVDPYDMP GTSYSTPGMS GYSSGGASGS GGPSTSGTTS
GPGPARPARA RPRRPREETL TPEEEEKRRV RRERNKLAAA
KCRNRRRELT DRLQAETDQL EEEKAELESE IAELQKEKER
LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSAPAKED
GFSWLLPPPP PPPLPFQTSQ DAPPNLTASL FTHSEVQVLG
DPFPVVNPSY TSSFVLTCPE VSAFAGAQRT SGSDQPSDPL
NSPSLLAL
FOS_CHICK Proto-oncogene protein c-fos
MMYQGFAGEY EAPSSRCSSA SPAGDSLTYY PSPADSFSSM
GSPVNSQDFC TDLAVSSANF VPTVTAISTS PDLQWLVQPT
LISSVAPSQN RGHPYGVPAP APPAAYSRPA VLKAPGGRGQ
SIGRRGKVEQ LSPEEEEKRR IRRERNKMAA AKCRNRRREL
TDTLQAETDQ LEEEKSALQA EIANLLKEKE KLEFILAAHR
PACKMPEELR FSEELAAATA LDLGAPSPAA AEEAFALPLM
TEAPPAVPPK EPSGSGLELK AEPFDELLFS AGPREASRSV
PDMDLPGASS FYASDWEPLG AGSGGELEPL CTPVVTCTPC
PSTYTSTFVF TYPEADAFPS CAAAHRKGSS
SNEPSSDSLS
FOS_RAT Proto-oncogene protein c-fos
MMFSGFNADY EASSSRCSSA SPAGDSLSYY HSPADSFSSM
GSPVNTQDFC ADLSVSSANF IPTVTAISTS PDLQWLVQPT
LVSSVAPSQT RAPHPYGLPT PSTGAYARAG VVKTMSGGRA
QSIGRRGKVE QLSPEEEEKR RIRRERNKMA AAKCRNRRRE
LTDTLQAETD QLEDEKSALQ TEIANLLKEK EKLEFILAAH
RPACKIPNDL GFPEEMSVTS LDLTGGLPEA TTPESEEAFT
LPLLNDPEPK PSLEPVKNIS NMELKAEPFD DFLFPASSRP
SGSETARSVP DVDLSGSFYA ADWEPLHSSS LGMGPMVTEL
EPLCTPVVTC TPSCTTYTSS FVFTYPEADS FPSCAAAHRK
GSSSNEPSSD SLSSPTLLAL
FOS_MOUSE Proto-oncogene protein c-fos
MMFSGFNADY EASSSRCSSA SPAGDSLSYY
HSPADSFSSM GSPVNTQDFC ADLSVSSANF IPTVTAISTS
PDLQWLVQPT LVSSVAPSQT RAPHPYGLPT QSAGAYARAG
MVKTVSGGRA QSIGRRGKVE QLSPEEEEKR RIRRERNKMA
AAKCRNRRRE LTDTLQAETD QLEDEKSALQ TEIANLLKEK
EKLEFILAAH RPACKIPDDL GFPEEMSVAS LDLTGGLPEA
STPESEEAFT LPLLNDPEPK PSLEPVKSIS NVELKAEPFD
DFLFPASSRP SGSETSRSVP DVDLSGSFYA ADWEPLHSNS
LGMGPMVTEL EPLCTPVVTC TPGCTTYTSS FVFTYPEADS
FPSCAAAHRK GSSSNEPSSD SLSSPTLLAL
Sequence data for two related genes: fosB from mouse and human; c-fos from chicken, mouse, and rat.
15
- Significant differences between FosB and c-Fos.
- Rat and mouse c-Fos sequences differ from chicken c-Fos.
- Long conserved region between 130 and 225.
- Symbols
- * Identity across all sequences
- : Conservation of amino acid characteristics
- . Semi-conserved substitutions
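The identity symbol and the idea of a "long conserved region" can be illustrated with a short Python sketch. It marks only fully identical columns with `*`; the `:` and `.` symbols depend on ClustalW's residue groupings and are deliberately not reproduced here. The function names are hypothetical, and the three rows are 10-residue fragments taken from the PPCK alignment shown later in these slides.

```python
def identity_line(rows):
    """Return '*' under each column where all rows share one residue, else ' '."""
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != "-" else " "
        for col in zip(*rows)
    )


def longest_conserved_run(rows):
    """Return (start, end), 1-based inclusive, of the longest run of '*' columns."""
    best = None
    start = None
    # The trailing space is a sentinel that closes a run ending at the last column.
    for i, ch in enumerate(identity_line(rows) + " "):
        if ch == "*" and start is None:
            start = i
        elif ch != "*" and start is not None:
            if best is None or i - start > best[1] - best[0] + 1:
                best = (start + 1, i)
            start = None
    return best


rows = ["MIILGTEYAG", "MVILGTEYAG", "MVILGTEYAG"]
print(repr(identity_line(rows)))    # '* ********'
print(longest_conserved_run(rows))  # (3, 10)
```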
16
- To better visualize conservation, colors can be used.
- Color code
- AVFPMILW: RED, Small (small + hydrophobic (incl. aromatic -Y))
- DE: BLUE, Acidic
- RHK: MAGENTA, Basic
- STYHCNGQ: GREEN, Hydroxyl + Amine + Basic - Q
- Others: Grey
- Differences between the genes and species are more apparent.
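The color code is a lookup from residue to group, which a small Python sketch makes concrete. Note that H is listed in both the magenta and the green group on the slide; the sketch assumes the first listed group wins, which is a choice made here for illustration and not something the slide specifies. The name `residue_color` is hypothetical.

```python
# Groups in the order they appear on the slide; the first match wins.
COLOR_GROUPS = [
    ("AVFPMILW", "red"),
    ("DE", "blue"),
    ("RHK", "magenta"),
    ("STYHCNGQ", "green"),
]


def residue_color(residue):
    """Return the slide's color name for a one-letter amino acid code."""
    if len(residue) != 1:
        raise ValueError("expected a single one-letter residue code")
    for letters, color in COLOR_GROUPS:
        if residue.upper() in letters:
            return color
    return "grey"  # gaps and any residue not listed above


# First five residues of FOSB_MOUSE from the sequence data slide.
print([residue_color(r) for r in "MFQAF"])
```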
17 Neighbor-Joining tree constructed with a web-ClustalW applet (Jalview). FosB and c-Fos can be distinguished. Rat and mouse cluster apart from chicken, with respect to c-Fos.
18 A: Highly conserved region
B: Rather dissimilar region
19 Threonyl-tRNA synthetase (thrS2) gene w/ consensus sequence
20 Threonyl-tRNA synthetase (thrS2) gene in 6 species
21 MALIGN
- construction of pairwise MOTIFS (conserved regions of similarity without gaps)
- construction of MULTIPLE MOTIFS (of thickness exceeding 2)
- forming of SUPERMOTIFS (groupings of motifs that are near each other) from MULTIPLE MOTIFS
- construction of MULTIPLE ALIGNMENTS from previously obtained MOTIFS and SUPERMOTIFS and consequent selection of the best alignment.
- http://www.genebee.msu.su/services/malign_full.html
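The first step, finding pairwise motifs as ungapped conserved regions, can be illustrated with a simplified Python sketch. It is not MALIGN's algorithm: it assumes a motif is a run of exactly identical residues along one diagonal of the comparison matrix, whereas MALIGN scores similarity with weights. The function name `ungapped_matches` and the `min_len` threshold are hypothetical. The two inputs are 20-residue fragments from the PPCK alignment shown later in these slides.

```python
def ungapped_matches(a, b, min_len=5):
    """Find maximal identical ungapped runs shared by two sequences.

    Returns a list of (start_in_a, start_in_b, segment) with 0-based starts.
    """
    hits = []
    # offset = i - j identifies one diagonal of the comparison matrix.
    for offset in range(-(len(b) - 1), len(a)):
        i = max(offset, 0)
        j = i - offset
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    hits.append((i - run, j - run, a[i - run:i]))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # a run that reaches the end of the diagonal
            hits.append((i - run, j - run, a[i - run:i]))
    return hits


seq1 = "AHFGEPDFTVWNAGQFPANL"
seq2 = "KNFGEPDFTIYNAGQFPANI"
for hit in ungapped_matches(seq1, seq2):
    print(hit)
```

On these fragments the sketch reports two motifs on the main diagonal, FGEPDFT and NAGQFPAN, separated by a short region of mismatches.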
23 Each motif and supermotif has a score and involves a pair of sequences
24 Malign
Possibility of receiving a sum of mismatch weights along an alignment for random sequences of the same length; includes parameters for gaps and mismatches
26 The relationship of MSA to phylogenetics
AHFGEPDFTV WNAGQFPANL HTQ-DMSSKS TIEINFKAME MIILGTEYAG
ENFGEPDFTV WNAGQFPANT HTS-GMTSKT TVEINFKQME MVILGTEYAG
KNFGEPDFTI YNAGQFPANI HTK-GMTSAT SVEINFKDME MVILGTEYAG
EDFGTPDFTI YNAGQFPCNR YTH-YMTSST SIDLNLARRE MVIMGTQYAG
ESFGTPDFTI YNAGQFPCNR YTH-YMTSST SVDLNLARRE MVILGTQYAG
LVGFKPDFVV MNGSKVTNPN WKEQGLNSEN FVAFNLTEGV QLIGGTWYGG
LKNFEPDFVV MNGSKVTNPN WKEQGLNSEN FVAFNLTERI QLIGGTWYGG
LAHFKPDFVV MNGAKCTNAK WKEHGLNSEN FTVFNLTERM QLIGGTWYGG
LKGFEPDFVV LNASKAKVEN FKELGLNSET AVVFNLAEKM QIILNTWYGG
LANFKPDFVV YNASKAKVEN YKELGLHSET AVVFNLTSRE QVIINTWYGG
LENFKADFIV YNACKCINED YKQDGLNSEV FVIFNVEENI AVIGGTWYGG
ATKIKPNFTI VSAPHFKADP EVD-GTKSET FVIISFKHKV ILIGGTEYAG
KTVEQP-FTI LSAPHFKADP KTD-GTHSET FIIVSFEKRT ILIGGTEYAG
-PAGKDEWQV LNVANFECVP ERD-GTNSDG CVILNFAQKK VLIAGMRYAG
LPSFQPKLTI IDLPSFKADP VRH-GCRSET VIACDLTNGL VLIGGTSYAG
LASFLPKLTI IDLPSFKANP ERH-GCRGET IIACDLTKGL VLIGGTSYAG
LGQFVPEMTI IDLPSFRADP ARH-GSRTET VIAVDLTRQI VLIGGTSYAG
LENFVPELTL IDLPSFRADP KRH-GCRSEN VVAIDFARKI VLIGGTQYAG
----SYDMVT IDVP------ -----SYSDV WMLVERRSNS TLVLGSDYYG
Phosphoenolpyruvate carboxylase kinase (PPCK) gene in 19 species
PPCK_AERPE
27
- Phosphoenolpyruvate carboxylase kinase (PPCK) gene in 19 species, 720 sites.
- Standard Neighbor-Joining tree constructed by ClustalW
- Tree will differ with varying tree-building and distance-estimation methods: how do we know which to use?
- Different methods will provide significantly different estimates of branch lengths, especially for the long branch.
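One reason distance-estimation methods disagree most on long branches is that a correction for multiple substitutions grows faster than the raw difference as sequences diverge. The Python sketch below compares the uncorrected proportion of differing sites (p-distance) with the Poisson-corrected distance d = -ln(1 - p). The slide does not say which methods were compared, so the Poisson correction is used here only as one common example, and the function names are hypothetical. The two rows are the first 20 columns of the first two sequences in the PPCK alignment.

```python
import math


def p_distance(row_a, row_b):
    """Proportion of differing sites, skipping columns with a gap in either row."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def poisson_distance(p):
    """Poisson-corrected distance d = -ln(1 - p) for a p-distance in [0, 1)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p)


p = p_distance("AHFGEPDFTVWNAGQFPANL", "ENFGEPDFTVWNAGQFPANT")
print(round(p, 3), round(poisson_distance(p), 3))

# The gap between raw and corrected distance widens as p grows.
for p in (0.1, 0.5, 0.8):
    print(p, round(poisson_distance(p), 3))
```

For closely related sequences the two estimates nearly coincide, while at p = 0.8 the corrected distance is about twice the raw value, so the longest branch is where the choice of method shows most.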
28 Next time: inferring evolutionary history from DNA and amino acid sequence alignments
- Introduction to phylogenetic approaches
- Maximum Parsimony
- Minimum Evolution
- Maximum Likelihood
- Introduction to tree-building methods
- Assessing phylogenetic reconstructions
- Practical uses of phylogenies