Multiple Sequence Alignments

1
Multiple Sequence Alignments

Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and
Computer Science
Marquette University

2
Overview

Background
Applications
Algorithms
Dynamic Programming
Star Alignments
Progressive Approaches
Localized Alignments
Profile analysis
Blocks analysis

Statistical Methods
Expectation Maximization
Hidden Markov Models
Position Specific Scoring
Visualization and Editing

3
Example
Multiple sequence alignment of 7 neuroglobins
using clustalx
4
Example

Searching for domains with RPS-BLAST

5
Why do we do multiple sequence alignments?

Infer phylogenetic relationships
Understand evolutionary pressures acting on a
gene
Formulate and test hypotheses about protein 3-D
structure (based on conserved regions)
Formulate and test hypotheses about protein
function
Understand how protein function has changed
Identify primers and probes to search for
homologous sequences in other organisms

6
The relationship of MSA to phylogenetics

The goal of phylogenetics is to reconstruct
evolutionary history using shared, derived
characters
Characters that have a common evolutionary
history (are homologous)
For example, eyes of humans and rats (but not
humans and octopi)
Traditionally, morphological characters
DNA and amino acid sequence alignments are very
common
It is assumed that properly aligned sequences
represent homology

7
The relationship of MSA to phylogenetics
AHFGEPDFTV WNAGQFPANL HTQ-DMSSKS TIEINFKAME MIILGTEYAG
ENFGEPDFTV WNAGQFPANT HTS-GMTSKT TVEINFKQME MVILGTEYAG
KNFGEPDFTI YNAGQFPANI HTK-GMTSAT SVEINFKDME MVILGTEYAG
EDFGTPDFTI YNAGQFPCNR YTH-YMTSST SIDLNLARRE MVIMGTQYAG
ESFGTPDFTI YNAGQFPCNR YTH-YMTSST SVDLNLARRE MVILGTQYAG
LVGFKPDFVV MNGSKVTNPN WKEQGLNSEN FVAFNLTEGV QLIGGTWYGG
LKNFEPDFVV MNGSKVTNPN WKEQGLNSEN FVAFNLTERI QLIGGTWYGG
LAHFKPDFVV MNGAKCTNAK WKEHGLNSEN FTVFNLTERM QLIGGTWYGG
LKGFEPDFVV LNASKAKVEN FKELGLNSET AVVFNLAEKM QIILNTWYGG
LANFKPDFVV YNASKAKVEN YKELGLHSET AVVFNLTSRE QVIINTWYGG
LENFKADFIV YNACKCINED YKQDGLNSEV FVIFNVEENI AVIGGTWYGG
ATKIKPNFTI VSAPHFKADP EVD-GTKSET FVIISFKHKV ILIGGTEYAG
KTVEQP-FTI LSAPHFKADP KTD-GTHSET FIIVSFEKRT ILIGGTEYAG
-PAGKDEWQV LNVANFECVP ERD-GTNSDG CVILNFAQKK VLIAGMRYAG
LPSFQPKLTI IDLPSFKADP VRH-GCRSET VIACDLTNGL VLIGGTSYAG
LASFLPKLTI IDLPSFKANP ERH-GCRGET IIACDLTKGL VLIGGTSYAG
LGQFVPEMTI IDLPSFRADP ARH-GSRTET VIAVDLTRQI VLIGGTSYAG
LENFVPELTL IDLPSFRADP KRH-GCRSEN VVAIDFARKI VLIGGTQYAG
----SYDMVT IDVP------ -----SYSDV WMLVERRSNS TLVLGSDYYG
Phosphoenolpyruvate carboxylase kinase (PPCK) gene in 19 species
PPCK_AERPE
8

Phosphoenolpyruvate carboxylase kinase (PPCK)
gene in 19 species, 720 sites.
Standard Neighbor-Joining tree constructed by
ClustalW
The tree will differ with varying tree-building and
distance-estimation methods; how do we know
which to use?
Different methods will provide significantly
different estimates of branch lengths, especially
for the long branch.

9
The relationship between MSA and evolutionary
history of a group of genes or organisms
NFS
NFLS
NYLS
NKYLS
-L
K
NYLS
10
Using known evolutionary relationship for
sequence alignment
NFS
NFLS
NYLS
NKYLS
NFL/-S
NK/-YLS
NK/-Y/FL/-S
11
What happens when a sequence alignment is wrong?
A
B
C
A
C
B
B
C
A
A AGT B AT C ATC
A AGT- B A-T- C A-TC
A AGT B AT- C ATC
A AGT- B A-T- C A-TC
III
II
I
Unaligned
12
Parameter considerations and consequences:
transitions, transversions, and gaps
4 possible alignments of AATCGCG and AACCCGG
Alignment        Gaps, tv, ts   (tv = transversions, ts = transitions)
A. AATCGCG       0, 2, 1
   AACCCGG
B. AATCGCG-      2, 0, 1
   AACC-CGG
C. AATCGCG-      2, 1, 0
   AA-CCCGG
D. AATCGC-G-     4, 0, 0
   AA-C-CCGG

Transition rate and transversion rate: these are treated the same for long divergence
times.
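
The counts in the table above can be reproduced mechanically. The following is a minimal Python sketch, assuming DNA letters A, C, G, T and "-" for a gap; the function name `classify_columns` is hypothetical and not part of the original slides. A substitution within purines (A, G) or within pyrimidines (C, T) is counted as a transition, anything else as a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_columns(top, bottom):
    """Return (gaps, transversions, transitions) for a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    gaps = tv = ts = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1
        elif x != y:
            pair = {x, y}
            # Same chemical class means transition, otherwise transversion.
            if pair <= PURINES or pair <= PYRIMIDINES:
                ts += 1
            else:
                tv += 1
    return gaps, tv, ts


if __name__ == "__main__":
    for top, bottom in [
        ("AATCGCG", "AACCCGG"),
        ("AATCGCG-", "AACC-CGG"),
        ("AATCGCG-", "AA-CCCGG"),
        ("AATCGC-G-", "AA-C-CCGG"),
    ]:
        print(top, bottom, classify_columns(top, bottom))
```

Which of the four alignments is preferred then depends only on the relative costs assigned to gaps, transversions, and transitions.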
14
Tools for MSA

Clustal: web server or run locally
Web server: http://www.ebi.ac.uk/clustalw/index.html
Manuscript with details: http://www.csc.fi/molbio/progs/clustalw/ms.html
Goal: Find an optimal multiple alignment

15
ClustalW

CLUSTAL has a number of variations; the most
commonly used is CLUSTALW
Generates pairwise alignments of all input
sequences, then ranks the pairs by their identity
scores.
High-scoring pairs of sequences align most
readily to each other.
More divergent (less related) sequences are then
added to the alignment.
Generates a phylogenetic tree of relationships to
determine the order of steps in constructing the alignment.
One can view the phylogenetic tree used to
generate the alignment.
Individual pairs in the alignment are aligned
using either a FASTA-type method (word-based, fast
alignment) or a dynamic programming algorithm, which is
slower but produces optimal pairwise alignments.

16
Unix version of ClustalX, the graphical interface
to ClustalW, run locally. Note colors for amino
acid qualities and score indicator.
17
Web ClustalW options
18
FOSB_MOUSE Protein fosB
MFQAFPGDYD SGSRCSSSPS
AESQYLSSVD SFGSPPTAAA SQECAGLGEM PGSFVPTVTA
ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA
RPRRPREETL TPEEEEKRRV RRERNKLAAA KCRNRRRELT
DRLQAETDQL EEEKAELESE IAELQKEKER LEFVLVAHKP
GCKIPYEEGP GPGPLAEVRD LPGSTSAKED GFGWLLPPPP
PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL
NSPSLLAL

FOSB_HUMAN Protein fosB
MFQAFPGDYD
SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ
PPVVDPYDMP GTSYSTPGMS GYSSGGASGS GGPSTSGTTS
GPGPARPARA RPRRPREETL TPEEEEKRRV RRERNKLAAA
KCRNRRRELT DRLQAETDQL EEEKAELESE IAELQKEKER
LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSAPAKED
GFSWLLPPPP PPPLPFQTSQ DAPPNLTASL FTHSEVQVLG
DPFPVVNPSY TSSFVLTCPE VSAFAGAQRT SGSDQPSDPL
NSPSLLAL

FOS_CHICK Proto-oncogene protein c-fos
MMYQGFAGEY EAPSSRCSSA SPAGDSLTYY PSPADSFSSM
GSPVNSQDFC TDLAVSSANF VPTVTAISTS PDLQWLVQPT
LISSVAPSQN RGHPYGVPAP APPAAYSRPA VLKAPGGRGQ
SIGRRGKVEQ LSPEEEEKRR IRRERNKMAA AKCRNRRREL
TDTLQAETDQ LEEEKSALQA EIANLLKEKE KLEFILAAHR
PACKMPEELR FSEELAAATA LDLGAPSPAA AEEAFALPLM
TEAPPAVPPK EPSGSGLELK AEPFDELLFS AGPREASRSV
PDMDLPGASS FYASDWEPLG AGSGGELEPL CTPVVTCTPC
PSTYTSTFVF TYPEADAFPS CAAAHRKGSS
SNEPSSDSLS

FOS_RAT Proto-oncogene protein c-fos
MMFSGFNADY EASSSRCSSA SPAGDSLSYY HSPADSFSSM
GSPVNTQDFC ADLSVSSANF IPTVTAISTS PDLQWLVQPT
LVSSVAPSQT RAPHPYGLPT PSTGAYARAG VVKTMSGGRA
QSIGRRGKVE QLSPEEEEKR RIRRERNKMA AAKCRNRRRE
LTDTLQAETD QLEDEKSALQ TEIANLLKEK EKLEFILAAH
RPACKIPNDL GFPEEMSVTS LDLTGGLPEA TTPESEEAFT
LPLLNDPEPK PSLEPVKNIS NMELKAEPFD DFLFPASSRP
SGSETARSVP DVDLSGSFYA ADWEPLHSSS LGMGPMVTEL
EPLCTPVVTC TPSCTTYTSS FVFTYPEADS FPSCAAAHRK
GSSSNEPSSD SLSSPTLLAL

FOS_MOUSE Proto-oncogene protein c-fos
MMFSGFNADY EASSSRCSSA SPAGDSLSYY
HSPADSFSSM GSPVNTQDFC ADLSVSSANF IPTVTAISTS
PDLQWLVQPT LVSSVAPSQT RAPHPYGLPT QSAGAYARAG
MVKTVSGGRA QSIGRRGKVE QLSPEEEEKR RIRRERNKMA
AAKCRNRRRE LTDTLQAETD QLEDEKSALQ TEIANLLKEK
EKLEFILAAH RPACKIPDDL GFPEEMSVAS LDLTGGLPEA
STPESEEAFT LPLLNDPEPK PSLEPVKSIS NVELKAEPFD
DFLFPASSRP SGSETSRSVP DVDLSGSFYA ADWEPLHSNS
LGMGPMVTEL EPLCTPVVTC TPGCTTYTSS FVFTYPEADS
FPSCAAAHRK GSSSNEPSSD SLSSPTLLAL
Sequence data for two related genes: fosB from
mouse and human; c-fos from chicken, mouse, and
rat.
19

Significant differences between FosB and C-Fos.
Rat and mouse C-Fos sequences differ from chicken
C-Fos.
Long conserved region between 130 and 225.
Symbols
* Identity across all sequences
: Conservation of amino acid characteristics
. Semi-conserved substitutions

To better visualize conservation, colors can be
used.
Color code
AVFPMILW: RED, Small (small + hydrophobic (incl. aromatic - Y))
DE: BLUE, Acidic
RHK: MAGENTA, Basic
STYHCNGQ: GREEN, Hydroxyl + Amine + Basic - Q
Others: Grey
Differences between the genes and species are
more apparent.

21
Neighbor-Joining tree constructed with a
web-ClustalW applet (Jalview). FosB and c-Fos can be
distinguished. Rat and mouse cluster apart from
chicken with respect to c-Fos.
22
A Highly conserved region
B Rather dissimilar region
23
Threonyl-tRNA synthetase (thrS2) gene w/
consensus sequence
24
Threonyl-tRNA synthetase (thrS2) gene in 6 species
25
MALIGN

construction of pairwise MOTIFS (conserved
regions of similarity without gaps)
construction of MULTIPLE MOTIFS (of thickness
exceeding 2)
forming of SUPERMOTIFS (groupings of motifs that
are near each other) from MULTIPLE MOTIFS
construction of MULTIPLE ALIGNMENTS from
previously obtained MOTIFS and SUPERMOTIFS and
consequent selection of the best alignment.
http://www.genebee.msu.su/services/malign_full.html

27
Each motif and supermotif has a score and involves
a pair of sequences
28
Malign
Probability of obtaining a given sum of mismatch
weights along an alignment of random sequences
of the same length; includes parameters for gaps
and mismatches
29
Definition

A multiple alignment of strings $S_1, \ldots, S_k$ is a
series of strings with spaces $S_1', \ldots, S_k'$ such
that
$|S_1'| = \cdots = |S_k'|$
$S_j'$ is an extension of $S_j$ by insertion of spaces
Goal: Find an optimal multiple alignment.
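
The two conditions in the definition can be checked directly. This is a minimal Python sketch, assuming "-" stands for an inserted space; the function name `is_multiple_alignment` is hypothetical.

```python
def is_multiple_alignment(originals, aligned, gap="-"):
    """Check the two conditions of the definition for a candidate alignment."""
    if len(originals) != len(aligned):
        return False
    # Condition 1: all aligned rows have the same length.
    same_length = len({len(row) for row in aligned}) <= 1
    # Condition 2: each row is its original string with only gaps inserted.
    extends = all(row.replace(gap, "") == s for s, row in zip(originals, aligned))
    return same_length and extends


if __name__ == "__main__":
    seqs = ["MPE", "MKE", "MSKE", "SKE"]
    print(is_multiple_alignment(seqs, ["-MPE", "-MKE", "MSKE", "-SKE"]))
```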

30
Scoring Alignments

In order to find an optimal alignment, we need to
be able to measure how good an alignment is
Sum of pairs (SP) method: in a column, score each
pair of letters and total the scores. Pairs of
gaps score 0.
Total up the scores over all columns

31
SP Method Example

Using BLOSUM62 matrix, gap penalty -8
In column 1, we have pairs
(-,S)
(-,S)
(S,S)
k(k-1)/2 pairs per column

Column score: (-8) + (-8) + 4 = -12
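
The column arithmetic can be sketched in a few lines of Python. This is an illustrative sketch only: `SUBSTITUTION_EXCERPT` holds just two BLOSUM62 diagonal entries (S/S = 4, M/M = 5) rather than the full matrix, and the function names are hypothetical.

```python
from itertools import combinations

# Tiny excerpt of BLOSUM62 (diagonal entries only); a real run needs the full matrix.
SUBSTITUTION_EXCERPT = {("S", "S"): 4, ("M", "M"): 5}


def lookup(x, y):
    """Symmetric lookup in the excerpt table."""
    return SUBSTITUTION_EXCERPT[tuple(sorted((x, y)))]


def sp_column_score(column, substitution=lookup, gap_penalty=-8):
    """Sum-of-pairs score of one column; gap/gap pairs score 0."""
    total = 0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            total += gap_penalty
        else:
            total += substitution(x, y)
    return total


def sp_score(rows, substitution=lookup, gap_penalty=-8):
    """SP score of a whole alignment: the sum of its column scores."""
    return sum(sp_column_score(col, substitution, gap_penalty) for col in zip(*rows))


if __name__ == "__main__":
    print(sp_column_score(["-", "S", "S"]))
```

Each column is scored over all k(k-1)/2 unordered pairs, which is where the quadratic per-column cost comes from.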
32
Dynamic Programming

The dynamic programming approach can be adapted
to MSA
For simplicity, assume $k$ sequences of length $n$
The dynamic programming array $F$ is $k$-dimensional,
of length $n+1$ in each dimension (including initial gaps)
The entry $F(i_1, \ldots, i_k)$ represents the score of an
optimal alignment for $s_1[1..i_1], \ldots, s_k[1..i_k]$

33
Dynamic Programming

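Letting $i$ represent the vector $(i_1, \ldots, i_k)$ and $b$
represent a nonzero binary vector of length $k$, we
fill in the array with the formula
$F(i) = \max_{b \neq 0} \{ F(i - b) + \mathrm{SP}(\mathrm{Column}(s, i, b)) \}$
where (selecting a column to score)
$\mathrm{Column}(s, i, b) = (c_1, \ldots, c_k)$, with $c_j = s_j[i_j]$ if $b_j = 1$ and $c_j = -$ (a gap) if $b_j = 0$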

34
Example
s1 = MPE, s2 = MKE, s3 = MSKE, s4 = SKE

Let i = (1,1,1,1), b = (1,0,0,0)
Checking F(0,1,1,1), i.e. F(i-b)
Column(s,i,b) is (M, -, -, -)
SP-score is -24 (assuming gap penalty of -8)
35
Analysis

$O(n^k)$ entries to fill
Each entry combines $O(2^k)$ other entries
Costs $O(k^2)$ to calculate each SP score
Overall cost is $O(k^2 2^k n^k)$, which is exponential in
the number of sequences!
MSA with SP-score has been shown to be NP-complete
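
To make the cost concrete, here is a minimal Python sketch of the k-dimensional dynamic program for tiny inputs. It assumes a simple match/mismatch/gap scoring (+1, -1, -2) instead of BLOSUM62, and the function names are hypothetical. It returns only the optimal SP score, not the alignment itself.

```python
from itertools import combinations, product


def sp_column(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column; gap/gap pairs score 0."""
    score = 0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            score += gap
        else:
            score += match if x == y else mismatch
    return score


def msa_optimal_score(seqs):
    """Optimal SP score over all multiple alignments of seqs (exponential in k)."""
    k = len(seqs)
    # All nonzero binary vectors b: which sequences contribute a letter to the column.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]
    origin = (0,) * k
    F = {origin: 0}
    # Lexicographic order guarantees F[i - b] is filled before F[i].
    for i in product(*(range(len(s) + 1) for s in seqs)):
        if i == origin:
            continue
        best = None
        for b in moves:
            prev = tuple(ij - bj for ij, bj in zip(i, b))
            if min(prev) < 0:
                continue
            column = [s[ij - 1] if bj else "-" for s, ij, bj in zip(seqs, i, b)]
            candidate = F[prev] + sp_column(column)
            if best is None or candidate > best:
                best = candidate
        F[i] = best
    return F[tuple(len(s) for s in seqs)]


if __name__ == "__main__":
    print(msa_optimal_score(["MPE", "MKE", "MSKE", "SKE"]))
```

The dictionary holds one entry per index vector and each entry tries every nonzero b, so the sketch is only practical for a handful of short sequences.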

36
Star Alignments

Heuristic method for multiple sequence alignments
Select a sequence $s_c$ as the center of the star
For each sequence $s_i$ among $s_1, \ldots, s_k$ with index $i \neq c$,
perform a Needleman-Wunsch global alignment of $s_i$ with $s_c$
Aggregate the alignments using the principle "once a
gap, always a gap."

37
Star Alignments Example
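s1 = MPE, s2 = MKE, s3 = MSKE, s4 = SKE
Star: s2 at the center, joined to s1, s3, and s4
Pairwise alignments with s2:
s1: MPE / MKE
s3: MSKE / -MKE
s4: SKE / MKE
Aggregating one sequence at a time:
MPE, MKE
-MPE, -MKE, MSKE
-MPE, -MKE, MSKE, -SKE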
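
The aggregation step can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it uses a simple +1/-1/-2 match/mismatch/gap scoring rather than a substitution matrix, the center index is supplied by the caller, and the function names are hypothetical. Because pairwise alignments depend on the scoring, the gap positions it produces need not match the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return the two gapped strings."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:  # traceback from the bottom-right corner
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def star_align(seqs, center):
    """Merge pairwise alignments to seqs[center]: once a gap, always a gap."""
    rows = {center: seqs[center]}  # sequence index -> gapped row
    for idx, seq in enumerate(seqs):
        if idx == center:
            continue
        new_center, new_seq = needleman_wunsch(seqs[center], seq)
        cur = rows[center]
        merged = {key: [] for key in rows}
        added = []
        p = q = 0
        while p < len(cur) or q < len(new_center):
            old_gap = p < len(cur) and cur[p] == "-"
            new_gap = q < len(new_center) and new_center[q] == "-"
            if old_gap and not new_gap:
                # Gap column already present: the new sequence receives a gap.
                for key in rows:
                    merged[key].append(rows[key][p])
                added.append("-")
                p += 1
            elif new_gap and not old_gap:
                # New gap in the center: pad every existing row.
                for key in rows:
                    merged[key].append("-")
                added.append(new_seq[q])
                q += 1
            else:
                for key in rows:
                    merged[key].append(rows[key][p])
                added.append(new_seq[q])
                p, q = p + 1, q + 1
        rows = {key: "".join(chars) for key, chars in merged.items()}
        rows[idx] = "".join(added)
    return [rows[i] for i in range(len(seqs))]


if __name__ == "__main__":
    for row in star_align(["MPE", "MKE", "MSKE", "SKE"], center=1):
        print(row)
```

Rows come back in input order, so each row can be compared directly with its original sequence.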
38
Choosing a center

Try them all and pick the one with the best score
Calculate all $O(k^2)$ pairwise alignments, and pick the
sequence $s_c$ that maximizes $\sum_{i \neq c} \mathrm{score}(s_c, s_i)$
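
A minimal Python sketch of the second option follows. It assumes the quantity being maximized is the sum of pairwise global alignment scores between the candidate center and every other sequence, and it uses an illustrative +1/-1/-2 scoring; the function names are hypothetical.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping only two rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Index of the sequence with the highest total score against all others."""
    k = len(seqs)
    # Each unordered pair is aligned once: k(k-1)/2 alignments in total.
    pair = {(i, j): nw_score(seqs[i], seqs[j]) for i in range(k) for j in range(i + 1, k)}
    totals = [
        sum(pair[min(c, i), max(c, i)] for i in range(k) if i != c)
        for c in range(k)
    ]
    return max(range(k), key=totals.__getitem__)


if __name__ == "__main__":
    print(choose_center(["MPE", "MKE", "MSKE", "SKE"]))
```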

39
Analysis

Assuming all sequences have length $n$
$O(n^2)$ to calculate each global alignment
$O(k)$ global alignments to calculate
Using a reasonable data structure for joining
alignments, each join costs no worse than $O(kl)$, where $l$ is an upper
bound on alignment lengths
$O(kn^2 + k^2 l)$ overall cost

40
Progressive Approaches

CLUSTALW
Perform pairwise alignments
Construct a tree, joining most similar sequences
first (guide tree)
Align sequences sequentially, using the
phylogenetic tree
PILEUP
Similar to CLUSTALW
Uses UPGMA to produce tree (chapter 6)
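
As an illustration of the guide-tree step, here is a minimal UPGMA sketch in Python. It is a sketch of the clustering used by PILEUP as described above, not ClustalW's own tree-building code. The input distances are hypothetical, the function name `upgma` is an assumption, and the tree is returned as nested tuples without branch lengths.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree by repeatedly merging the two closest clusters.

    names: list of sequence labels.
    dist: dict mapping frozenset({a, b}) to the distance between a and b.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of sequences in it
    d = dict(dist)
    while len(sizes) > 1:
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for c in sizes:
            # Average linkage, weighted by cluster sizes.
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


if __name__ == "__main__":
    # Hypothetical pairwise distances, for illustration only.
    distances = {
        frozenset(("A", "B")): 2,
        frozenset(("C", "D")): 4,
        frozenset(("A", "C")): 10,
        frozenset(("A", "D")): 10,
        frozenset(("B", "C")): 10,
        frozenset(("B", "D")): 10,
    }
    print(upgma(["A", "B", "C", "D"], distances))
```

A progressive aligner would then walk this tree from the leaves upward, aligning the most similar sequences first.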

41
Progressive Approaches
42
Problems with Progressive Alignments

MSA depends on pairwise alignments
If sequences are very distantly related, there is a much
higher likelihood of errors
Care must be taken in choosing scoring matrices
and penalties
Other approaches use Bayesian methods such as
hidden Markov models

43
Localized Analysis

Profile Analysis
Blocks

44
Profile Analysis

A profile is a scoring matrix defined from a
multiple sequence alignment of related sequences
Profiles are used to score unknown sequences to
estimate whether or not the unknown sequence is
related

45
Profile Example
http://www.sdsc.edu/projects/profile/profile_desc.html

ATP Binding RNA helicase (DEAD Box)

rhle_ecoli  GVDVLVA TPG
dbp2_schpo  GVEICIA TPG
dbp2_yeast  GSEIVIA TPG
dbpa_ecoli  APHIIVA TPG
46
Using Profiles

Basically, perform a pairwise sequence alignment
using the profile as a scoring matrix
Several sequences can be aligned to the same
profile, yielding a multiple sequence alignment
Generating profiles is computationally intensive
Comparing a sequence to many profiles is
computationally intensive
Which profile is my unknown sequence most like?

47
Blocks

"Blocks are multiply aligned ungapped segments
corresponding to the most highly conserved
regions of proteins." (from the BLOCKS WWW Server,
http://www.blocks.fhcrc.org/)

48
Statistical Methods

Expectation Maximization
Gibbs Sampling
Hidden Markov Models

49
Position Specific Scoring

We saw these with profile analysis
Essentially, derive a scoring matrix from similar
sequences
Use the scoring matrix to score residues based on
their alignment position
Scores can also be derived from a log
transformation of the frequency of each amino
acid
Log likelihood of seeing the amino acid in the
position vs. just at random
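
A log-odds scoring matrix of this kind can be sketched in Python. This is a minimal sketch, assuming a uniform background frequency of 1/20 per amino acid and a pseudocount of 1 so that unseen residues do not give log(0); both choices, and the function names, are assumptions rather than anything stated on the slides. The training rows are the four DEAD box segments from the profile example, with the space removed.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """One dict per column: log2(observed frequency / background frequency)."""
    background = 1 / len(AMINO_ACIDS)  # uniform background (an assumption)
    pssm = []
    for pos in range(len(alignment[0])):
        counts = Counter(seq[pos] for seq in alignment if seq[pos] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Score a segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def best_hit(pssm, sequence):
    """Slide the profile along a longer sequence; return (offset, score)."""
    width = len(pssm)
    hits = [
        (offset, score(pssm, sequence[offset:offset + width]))
        for offset in range(len(sequence) - width + 1)
    ]
    return max(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    dead_box = ["GVDVLVATPG", "GVEICIATPG", "GSEIVIATPG", "APHIIVATPG"]
    profile = build_pssm(dead_box)
    print(best_hit(profile, "MMMMGVEICIATPGKKK"))
```

Positive entries mark residues seen more often at that position than expected at random; negative entries mark residues seen less often.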

50
Visualization and Editing

Visualization should present researchers with
information
Point out key information
See patterns that might otherwise be missed
Editing
Sequence alignments might still violate
biological principles
Our models are not perfect
Editors allow researchers to modify alignments,
based on relevant principles
Principles are not built into most software

51
Visualization

Sequence alignment results

52
Visualization

Sequence logos for consensus sequences

Consensus read off the top
53
Visualization
54
Visualization
55
Editing

CINEMA
http://bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
Applet
GDE (Genetic Data Environment)
http://www.tigr.org/jeisen/GDE/GDE.html
Unix based, requires GCG
GeneDoc
Windows based
MACAW
Mac and PC

56
Editing

In addition to moving motifs, etc., editors
are used to format alignments
Colors, highlighting, etc.
Typically have to reformat the data
Can use the SEQIO program
Or use READSEQ on the web

57
Summary

There is a strong relationship between multiple
sequence alignment and phylogenetics
Many different approaches to MSA
Dynamic programming
Progressive
Star, ClustalW, PILEUP
Iterative methods (not covered)
Hidden Markov Models, genetic algorithms
Localized alignments
Profiles, blocks
