Multiple Sequence Alignments - PowerPoint PPT Presentation

1
Multiple Sequence Alignments
  • Craig A. Struble, Ph.D.
  • Department of Mathematics, Statistics, and
    Computer Science
  • Marquette University

2
Overview
  • Background
  • Applications
  • Algorithms
  • Dynamic Programming
  • Star Alignments
  • Progressive Approaches
  • Localized Alignments
  • Profile analysis
  • Blocks analysis
  • Statistical Methods
  • Expectation Maximization
  • Hidden Markov Models
  • Position Specific Scoring
  • Visualization and Editing

3
Example
Multiple sequence alignment of 7 neuroglobins
using clustalx
4
Example
  • Searching for domains with RPS-BLAST

5
Why do we do multiple sequence alignments?
  • Infer phylogenetic relationships
  • Understand evolutionary pressures acting on a
    gene
  • Formulate test hypotheses about protein 3-D
    structure (based on conserved regions)
  • Formulate test hypotheses about protein
    function
  • Understand how protein function has changed
  • Identify primers and probes to search for
    homologous sequences in other organisms

6
The relationship of MSA to phylogenetics
  • The goal of phylogenetics is to reconstruct
    evolutionary history using shared, derived
    characters
  • Characters that have a common evolutionary
    history (are homologous)
  • For example, eyes of humans and rats (but not
    humans and octopi)
  • Traditionally, morphological characters
  • DNA and amino acid sequence alignments are very
    common
  • It is assumed that properly aligned sequences
    represent homology

7
The relationship of MSA to phylogenetics
AHFGEPDFTV WNAGQFPANL HTQ-DMSSKS TIEINFKAME
MIILGTEYAG ENFGEPDFTV WNAGQFPANT HTS-GMTSKT
TVEINFKQME MVILGTEYAG KNFGEPDFTI YNAGQFPANI
HTK-GMTSAT SVEINFKDME MVILGTEYAG EDFGTPDFTI
YNAGQFPCNR YTH-YMTSST SIDLNLARRE MVIMGTQYAG
ESFGTPDFTI YNAGQFPCNR YTH-YMTSST SVDLNLARRE
MVILGTQYAG LVGFKPDFVV MNGSKVTNPN WKEQGLNSEN
FVAFNLTEGV QLIGGTWYGG LKNFEPDFVV MNGSKVTNPN
WKEQGLNSEN FVAFNLTERI QLIGGTWYGG LAHFKPDFVV
MNGAKCTNAK WKEHGLNSEN FTVFNLTERM QLIGGTWYGG
LKGFEPDFVV LNASKAKVEN FKELGLNSET AVVFNLAEKM
QIILNTWYGG LANFKPDFVV YNASKAKVEN YKELGLHSET
AVVFNLTSRE QVIINTWYGG LENFKADFIV YNACKCINED
YKQDGLNSEV FVIFNVEENI AVIGGTWYGG ATKIKPNFTI
VSAPHFKADP EVD-GTKSET FVIISFKHKV ILIGGTEYAG
KTVEQP-FTI LSAPHFKADP KTD-GTHSET FIIVSFEKRT
ILIGGTEYAG -PAGKDEWQV LNVANFECVP ERD-GTNSDG
CVILNFAQKK VLIAGMRYAG LPSFQPKLTI IDLPSFKADP
VRH-GCRSET VIACDLTNGL VLIGGTSYAG LASFLPKLTI
IDLPSFKANP ERH-GCRGET IIACDLTKGL VLIGGTSYAG
LGQFVPEMTI IDLPSFRADP ARH-GSRTET VIAVDLTRQI
VLIGGTSYAG LENFVPELTL IDLPSFRADP KRH-GCRSEN
VVAIDFARKI VLIGGTQYAG ----SYDMVT IDVP------
-----SYSDV WMLVERRSNS TLVLGSDYYG
Phosphoenolpyruvate carboxylase kinase (PPCK)
gene in 19 species
PPCK_AERPE
8
  • Phosphoenolpyruvate carboxylase kinase (PPCK)
    gene in 19 species, 720 sites.
  • Standard Neighbor-Joining tree constructed by
    ClustalW
  • The tree will differ with varying tree-building and
    distance-estimation methods. How do we know
    which to use?
  • Different methods will provide significantly
    different estimates of branch lengths, especially
    for the long branch.

9
The relationship between MSA and evolutionary
history of a group of genes or organisms
[Figure: tree relating the sequences NFS, NFLS, NYLS, and NKYLS; the tree also carries the labels "-L", "K", and NYLS]
10
Using known evolutionary relationship for
sequence alignment
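[Figure: tree with leaves NFS, NFLS, NYLS, and NKYLS; internal nodes hold the alignments built so far]
  NFS, NFLS         -> NFL/-S
  NYLS, NKYLS       -> NK/-YLS
  NFL/-S, NK/-YLS   -> NK/-Y/FL/-S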
11
What happens when a sequence alignment is wrong?
[Figure: three tree topologies (I, II, III) over the sequences A, B, and C, shown with the alignments below]
Unaligned: A AGT, B AT, C ATC
Alignments shown:
  A AGT-   B A-T-   C A-TC
  A AGT    B AT-    C ATC
12
Parameter considerations: consequences of
transitions, transversions, and gaps
4 possible alignments of AATCGCG and AACCCGG

     Alignment    Gaps  tv  ts
A.   AATCGCG       0    2   1
     AACCCGG
B.   AATCGCG-      2    0   1
     AACC-CGG
C.   AATCGCG-      2    1   0
     AA-CCCGG
D.   AATCGC-G-     4    0   0
     AA-C-CCGG

  • transition rate
  • transversion rate
  • These are treated the same for long divergence
    times.
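
The gap, transversion (tv), and transition (ts) counts for the four alignments can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name `classify_columns` is made up for illustration, and a column containing any gap is counted as one gap.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_columns(top, bottom):
    """Return (gaps, transversions, transitions) for two aligned DNA strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    gaps = tv = ts = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1  # any column with a gap character counts as one gap
        elif a != b:
            # purine<->purine or pyrimidine<->pyrimidine is a transition
            if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
                ts += 1
            else:
                tv += 1
    return gaps, tv, ts


alignments = {
    "A": ("AATCGCG", "AACCCGG"),
    "B": ("AATCGCG-", "AACC-CGG"),
    "C": ("AATCGCG-", "AA-CCCGG"),
    "D": ("AATCGC-G-", "AA-C-CCGG"),
}
for label, (top, bottom) in alignments.items():
    print(label, classify_columns(top, bottom))
```

Which alignment is preferred then depends on how gaps, transversions, and transitions are weighted against each other.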
13
Parameter considerations: consequences of
transitions, transversions, and gaps
4 possible alignments of AATCGCG and AACCCGG

     Alignment    Indels  tvs  tss
A.   AATCGCG        0      2    1
     AACCCGG
B.   AATCGCG-       2      0    1
     AACC-CGG
C.   AATCGCG-       2      1    0
     AA-CCCGG
D.   AATCGC-G-      4      0    0
     AA-C-CCGG
14
Tools for MSA
  • Clustal: web server or run locally
  • Web server: http://www.ebi.ac.uk/clustalw/index.html
  • Manuscript with details: http://www.csc.fi/molbio/progs/clustalw/ms.html
  • Goal: find an optimal multiple alignment

15
ClustalW
  • CLUSTAL has a number of variations; the most
    commonly used is CLUSTALW
  • Generates pairwise alignments of all input
    sequences, then ranks the pairs of sequences by
    their identity scores.
  • High-scoring pairs of sequences align most
    readily to each other.
  • More divergent (less related) pairs are then
    added to the alignment.
  • Generates a phylogenetic tree of relationships to
    determine the steps in constructing the alignment.
  • One can view the phylogenetic tree used to
    generate the alignment.
  • Individual pairs in the alignment are aligned
    either with a FASTA-type method (word-based, fast
    alignment) or with a dynamic programming algorithm,
    which is slower but produces optimal pairwise
    alignments.

16
Unix version of ClustalX, the graphical interface
to ClustalW, run locally. Note colors for amino
acid qualities and score indicator.
17
Web ClustalW options
18
FOSB_MOUSE Protein fosB MFQAFPGDYD SGSRCSSSPS
AESQYLSSVD SFGSPPTAAA SQECAGLGEM PGSFVPTVTA
ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA
RPRRPREETL TPEEEEKRRV RRERNKLAAA KCRNRRRELT
DRLQAETDQL EEEKAELESE IAELQKEKER LEFVLVAHKP
GCKIPYEEGP GPGPLAEVRD LPGSTSAKED GFGWLLPPPP
PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL
NSPSLLAL FOSB_HUMAN Protein fosB MFQAFPGDYD
SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ
PPVVDPYDMP GTSYSTPGMS GYSSGGASGS GGPSTSGTTS
GPGPARPARA RPRRPREETL TPEEEEKRRV RRERNKLAAA
KCRNRRRELT DRLQAETDQL EEEKAELESE IAELQKEKER
LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSAPAKED
GFSWLLPPPP PPPLPFQTSQ DAPPNLTASL FTHSEVQVLG
DPFPVVNPSY TSSFVLTCPE VSAFAGAQRT SGSDQPSDPL
NSPSLLAL FOS_CHICK Proto-oncogene protein
c-fos MMYQGFAGEY EAPSSRCSSA SPAGDSLTYY PSPADSFSSM
GSPVNSQDFC TDLAVSSANF VPTVTAISTS PDLQWLVQPT
LISSVAPSQN RGHPYGVPAP APPAAYSRPA VLKAPGGRGQ
SIGRRGKVEQ LSPEEEEKRR IRRERNKMAA AKCRNRRREL
TDTLQAETDQ LEEEKSALQA EIANLLKEKE KLEFILAAHR
PACKMPEELR FSEELAAATA LDLGAPSPAA AEEAFALPLM
TEAPPAVPPK EPSGSGLELK AEPFDELLFS AGPREASRSV
PDMDLPGASS FYASDWEPLG AGSGGELEPL CTPVVTCTPC
PSTYTSTFVF TYPEADAFPS CAAAHRKGSS
SNEPSSDSLS FOS_RAT Proto-oncogene protein
c-fos MMFSGFNADY EASSSRCSSA SPAGDSLSYY HSPADSFSSM
GSPVNTQDFC ADLSVSSANF IPTVTAISTS PDLQWLVQPT
LVSSVAPSQT RAPHPYGLPT PSTGAYARAG VVKTMSGGRA
QSIGRRGKVE QLSPEEEEKR RIRRERNKMA AAKCRNRRRE
LTDTLQAETD QLEDEKSALQ TEIANLLKEK EKLEFILAAH
RPACKIPNDL GFPEEMSVTS LDLTGGLPEA TTPESEEAFT
LPLLNDPEPK PSLEPVKNIS NMELKAEPFD DFLFPASSRP
SGSETARSVP DVDLSGSFYA ADWEPLHSSS LGMGPMVTEL
EPLCTPVVTC TPSCTTYTSS FVFTYPEADS FPSCAAAHRK
GSSSNEPSSD SLSSPTLLAL FOS_MOUSE Proto-oncogene
protein c-fos MMFSGFNADY EASSSRCSSA SPAGDSLSYY
HSPADSFSSM GSPVNTQDFC ADLSVSSANF IPTVTAISTS
PDLQWLVQPT LVSSVAPSQT RAPHPYGLPT QSAGAYARAG
MVKTVSGGRA QSIGRRGKVE QLSPEEEEKR RIRRERNKMA
AAKCRNRRRE LTDTLQAETD QLEDEKSALQ TEIANLLKEK
EKLEFILAAH RPACKIPDDL GFPEEMSVAS LDLTGGLPEA
STPESEEAFT LPLLNDPEPK PSLEPVKSIS NVELKAEPFD
DFLFPASSRP SGSETSRSVP DVDLSGSFYA ADWEPLHSNS
LGMGPMVTEL EPLCTPVVTC TPGCTTYTSS FVFTYPEADS
FPSCAAAHRK GSSSNEPSSD SLSSPTLLAL
Sequence data for two related genes: fosB from
mouse and human; c-fos from chicken, mouse, and
rat.
19
  • Significant differences between FosB and c-Fos.
  • Rat and mouse c-Fos sequences differ from chicken
    c-Fos.
  • Long conserved region between 130 and 225.
  • Symbols
  • "*" Identity across all sequences
  • ":" Conservation of amino acid characteristics
  • "." Semi-conserved substitutions

20
  • To better visualize conservation, colors can be
    used.
  • Color code
  • AVFPMILW: RED, Small (small + hydrophobic
    (incl. aromatic - Y))
  • DE: BLUE, Acidic
  • RHK: MAGENTA, Basic
  • STYHCNGQ: GREEN, Hydroxyl + Amine + Basic - Q
  • Others: Grey
  • Differences between the genes and species are
    more apparent.

21
Neighbor-Joining tree constructed with a
web ClustalW applet (Jalview). FosB and c-Fos can be
distinguished. For c-Fos, rat and mouse cluster
apart from chicken.
22
A: Highly conserved region
B: Rather dissimilar region
23
Threonyl-tRNA synthetase (thrS2) gene w/
consensus sequence
24
Threonyl-tRNA synthetase (thrS2) gene in 6 species
25
MALIGN
  • construction of pairwise MOTIFS (conserved
    regions of similarity without gaps)
  • construction of MULTIPLE MOTIFS (of thickness
    exceeding 2)
  • forming of SUPERMOTIFS (groupings of motifs that
    are near each other) from MULTIPLE MOTIFS
  • construction of MULTIPLE ALIGNMENTS from
    previously obtained MOTIFS and SUPERMOTIFS and
    subsequent selection of the best alignment.
  • http://www.genebee.msu.su/services/malign_full.html

26
(No Transcript)
27
Each motif or supermotif has a score and involves
a pair of sequences
28
Malign
Possibility of receiving a sum of mismatch
weights along an alignment for random sequences
of the same length; includes parameters for gaps
and mismatches
29
Definition
  • A multiple alignment of strings $S_1, \dots, S_k$ is a
    series of strings with spaces $S_1', \dots, S_k'$ such
    that
  • $|S_1'| = \dots = |S_k'|$
  • $S_j'$ is an extension of $S_j$ by insertion of spaces
  • Goal: find an optimal multiple alignment.

30
Scoring Alignments
  • In order to find an optimal alignment, we need to
    be able to measure how good an alignment is
  • Sum of pairs (SP) method: in a column, score each
    pair of letters and total the scores. Pairs of
    gaps score 0.
  • Total up scores for each column

31
SP Method Example
  • Using BLOSUM62 matrix, gap penalty -8
  • In column 1, we have pairs
  • (-, S)
  • (-, S)
  • (S, S)
  • k(k-1)/2 pairs per column

-8 - 8 + 4 = -12
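
The column score above can be reproduced with a few lines of code. This is a minimal sketch, assuming a gap penalty of -8 and a tiny hand-filled score table that contains only the S/S entry (4 in BLOSUM62); the names `SUBSTITUTION`, `sp_column_score`, and `sp_alignment_score` are illustrative, and a real scorer would load the full matrix.

```python
from itertools import combinations

GAP_PENALTY = -8
# Illustrative excerpt only: the S/S score from BLOSUM62. Other residue
# pairs would raise KeyError until the table is filled in.
SUBSTITUTION = {("S", "S"): 4}


def pair_score(a, b):
    """Score one pair of symbols; a pair of gaps scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP_PENALTY
    return SUBSTITUTION[tuple(sorted((a, b)))]


def sp_column_score(column):
    """Sum the scores of all k(k-1)/2 pairs in one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment_score(rows):
    """Total the column scores over equal-length aligned rows."""
    return sum(sp_column_score(col) for col in zip(*rows))


print(sp_column_score("-SS"))  # -8 - 8 + 4 = -12
```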
32
Dynamic Programming
  • The dynamic programming approach can be adapted
    to MSA
  • For simplicity, assume k sequences of length n
  • The dynamic programming array F is k-dimensional
    of length n+1 (including initial gaps)
  • The entry $F(i_1, \dots, i_k)$ represents the score of
    an optimal alignment for $s_1[1..i_1], \dots, s_k[1..i_k]$

33
Dynamic Programming
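  • Letting $\mathbf{i}$ represent the vector $(i_1, \dots, i_k)$ and $\mathbf{b}$
    represent a nonzero binary vector of length k, we
    fill in the array with the formula
    $F(\mathbf{i}) = \max_{\mathbf{b}} \{ F(\mathbf{i} - \mathbf{b}) + \text{SP-score}(\text{Column}(s, \mathbf{i}, \mathbf{b})) \}$,
    taking the maximum over all $\mathbf{b}$ with $b_j \le i_j$
  • where (selecting a column to score)
    $\text{Column}(s, \mathbf{i}, \mathbf{b}) = (c_1, \dots, c_k)$ with $c_j = s_j[i_j]$ if $b_j = 1$
    and $c_j$ a gap if $b_j = 0$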

34
Example
s1 = MPE, s2 = MKE, s3 = MSKE, s4 = SKE
  • Let i = (1,1,1,1), b = (1,0,0,0)
  • Checking F(0,1,1,1) (i - b)
  • Column(s,i,b) is (M, -, -, -)
  • SP-score is -24 (assuming gap penalty of -8)
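
The recurrence can be sketched directly as a memoized function over index vectors. This is an illustrative sketch, not the presenter's code: `pair_score` is a made-up stand-in for a real substitution matrix (match +1, mismatch -1, gap -8), and the function returns only the optimal score, not the alignment itself.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = -8  # gap penalty used in the example


def pair_score(a, b):
    """Hypothetical stand-in for a substitution matrix such as BLOSUM62."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return 1 if a == b else -1


def sp_score(column):
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def column(seqs, i, b):
    """Take s_j[i_j] where b_j is 1 and a gap where b_j is 0 (1-based i)."""
    return tuple(s[ij - 1] if bj else "-" for s, ij, bj in zip(seqs, i, b))


def msa_score(seqs):
    """Optimal SP score for aligning all of seqs; exponential in len(seqs)."""
    moves = [b for b in product((0, 1), repeat=len(seqs)) if any(b)]

    @lru_cache(maxsize=None)
    def F(i):
        if not any(i):
            return 0  # all prefixes empty
        best = None
        for b in moves:
            if any(bj > ij for bj, ij in zip(b, i)):
                continue  # cannot consume a letter from an empty prefix
            prev = tuple(ij - bj for ij, bj in zip(i, b))
            cand = F(prev) + sp_score(column(seqs, i, b))
            if best is None or cand > best:
                best = cand
        return best

    return F(tuple(len(s) for s in seqs))


seqs = ("MPE", "MKE", "MSKE", "SKE")
col = column(seqs, (1, 1, 1, 1), (1, 0, 0, 0))
print(col, sp_score(col))
```

Even for these four short sequences the function visits every index vector, which is why this approach is practical only for a handful of sequences.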
35
Analysis
  • $O(n^k)$ entries to fill
  • Each entry combines $O(2^k)$ other entries
  • Costs $O(k^2)$ to calculate each SP score
  • Overall cost is $O(k^2 2^k n^k)$, or exponential in
    the number of sequences!
  • MSA with SP-score has been shown to be NP-complete

36
Star Alignments
  • Heuristic method for multiple sequence alignments
  • Select a sequence sc as the center of the star
  • For each sequence $s_i$ among $s_1, \dots, s_k$ with
    $i \neq c$, perform a Needleman-Wunsch global
    alignment of $s_i$ with $s_c$
  • Aggregate the alignments using the principle "once a
    gap, always a gap."

37
Star Alignments Example
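s1 = MPE, s2 = MKE, s3 = MSKE, s4 = SKE (center of the star: s2)
Pairwise alignments with the center:
  s1: MPE / MKE
  s3: MSKE / -MKE
  s4: SKE / MKE
Merged alignment, adding one sequence at a time:
  MPE      -MPE      -MPE
  MKE  ->  -MKE  ->  -MKE
           MSKE      MSKE
                     -SKE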
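
A star alignment can be sketched in a few dozen lines. The code below is an illustration under stated assumptions, not the presenter's implementation: it uses a simple match/mismatch/gap scheme (+1, -1, -2) instead of a substitution matrix, and the helper names `needleman_wunsch`, `merge`, and `star_align` are made up. Because of the scoring scheme and tie-breaking, it may place gaps differently from a hand-drawn example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns the two gapped strings."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    ra, rb, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace back, preferring the diagonal move
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return "".join(reversed(ra)), "".join(reversed(rb))


def merge(rows, center_aln, other_aln):
    """Add other_aln to rows (rows[0] is the gapped center): once a gap, always a gap."""
    old_center = rows[0]
    new_rows, new_seq = [[] for _ in rows], []
    p = q = 0
    while p < len(old_center) or q < len(center_aln):
        old_gap = p < len(old_center) and old_center[p] == "-"
        new_gap = q < len(center_aln) and center_aln[q] == "-"
        if old_gap and not new_gap:
            # gap already present in the center: keep the column, pad the newcomer
            for r, row in zip(new_rows, rows):
                r.append(row[p])
            new_seq.append("-"); p += 1
        elif new_gap and not old_gap:
            # new gap in the center: open a gap column in every existing row
            for r in new_rows:
                r.append("-")
            new_seq.append(other_aln[q]); q += 1
        else:
            for r, row in zip(new_rows, rows):
                r.append(row[p])
            new_seq.append(other_aln[q]); p += 1; q += 1
    return ["".join(r) for r in new_rows] + ["".join(new_seq)]


def star_align(seqs, center):
    """Align every sequence to seqs[center]; the center is the first output row."""
    rows = [seqs[center]]
    for idx, s in enumerate(seqs):
        if idx != center:
            rows = merge(rows, *needleman_wunsch(seqs[center], s))
    return rows


for row in star_align(["MPE", "MKE", "MSKE", "SKE"], center=1):
    print(row)
```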
38
Choosing a center
  • Try them all and pick the one with the best score
  • Calculate all $O(k^2)$ pairwise alignments, and pick the
    sequence $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$,
    where $S$ is the pairwise alignment score
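
One way to pick the center is to total each sequence's pairwise global alignment scores against all the others. The sketch below is illustrative: the scoring scheme (match +1, mismatch -1, gap -2) and the names `nw_score` and `choose_center` are assumptions, not taken from the slides.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping only two rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Return (index of the best center, total score of each candidate)."""
    k = len(seqs)
    score = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            score[i][j] = score[j][i] = nw_score(seqs[i], seqs[j])
    totals = [sum(row) for row in score]
    return max(range(k), key=totals.__getitem__), totals


center, totals = choose_center(["MPE", "MKE", "MSKE", "SKE"])
print(center, totals)
```

With this scoring scheme the second sequence (index 1, MKE) has the highest total.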

39
Analysis
  • Assuming all sequences have length n
  • $O(n^2)$ to calculate a global alignment
  • $O(k)$ global alignments to calculate
  • Using a reasonable data structure for joining
    alignments, no worse than $O(kl)$, where l is an upper
    bound on alignment lengths
  • $O(kn^2 + k^2 l)$ overall cost

40
Progressive Approaches
  • CLUSTALW
  • Perform pairwise alignments
  • Construct a tree, joining most similar sequences
    first (guide tree)
  • Align sequences sequentially, using the
    phylogenetic tree
  • PILEUP
  • Similar to CLUSTALW
  • Uses UPGMA to produce tree (chapter 6)
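
The tree-building step can be illustrated with UPGMA, which repeatedly joins the two closest clusters and averages their distances to the rest. This is a minimal sketch under stated assumptions: the input is a precomputed distance table keyed by `frozenset` pairs, the toy distances are invented, and the function name `upgma` is illustrative. It returns only the tree topology as nested tuples, without branch lengths.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree; dist maps frozenset({a, b}) to a distance."""
    sizes = {name: 1 for name in names}  # active cluster -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # join the closest pair of active clusters
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        for c in sizes:
            # size-weighted average distance from the new cluster to c
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


# Invented toy distances: A/B and C/D are close, everything else is far.
names = ["A", "B", "C", "D"]
toy = {frozenset(p): 10.0 for p in combinations(names, 2)}
toy[frozenset(("A", "B"))] = 2.0
toy[frozenset(("C", "D"))] = 4.0
print(upgma(names, toy))
```

In a progressive aligner the resulting tree then fixes the order in which sequences and sub-alignments are merged, closest pairs first.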

41
Progressive Approaches
42
Problems with Progressive Alignments
  • MSA depends on pairwise alignments
  • If sequences are very distantly related, much
    higher likelihood of errors
  • Care must be taken in choosing scoring matrices
    and penalties
  • Other approaches use Bayesian methods such as
    hidden Markov models

43
Localized Analysis
  • Profile Analysis
  • Blocks

44
Profile Analysis
  • A profile is a scoring matrix defined from a
    multiple sequence alignment of related sequences
  • Profiles are used to score unknown sequences to
    estimate whether or not the unknown sequence is
    related

45
Profile Example (http://www.sdsc.edu/projects/profile/profile_desc.html)
  • ATP Binding RNA helicase (DEAD Box)

rhle_ecoli  GVDVLVA TPG
dbp2_schpo  GVEICIA TPG
dbp2_yeast  GSEIVIA TPG
dbpa_ecoli  APHIIVA TPG
46
Using Profiles
  • Basically, perform a pairwise sequence alignment
    using the profile as a scoring matrix
  • Several sequences can be aligned to the same
    profile, yielding a multiple sequence alignment
  • Generating profiles is computationally intensive
  • Comparing a sequence to many profiles is
    computationally intensive
  • Which profile is my unknown sequence most like?

47
Blocks
  • "Blocks are multiply aligned ungapped segments
    corresponding to the most highly conserved
    regions of proteins." (from the BLOCKS WWW Server,
    http://www.blocks.fhcrc.org/)

48
Statistical Methods
  • Expectation Maximization
  • Gibbs Sampling
  • Hidden Markov Models

49
Position Specific Scoring
  • We saw these with profile analysis
  • Essentially, derive a scoring matrix from similar
    sequences
  • Use the scoring matrix to score residues based on
    their alignment position
  • Scores can also be derived from a log
    transformation of the frequency of each amino
    acid
  • Log likelihood of seeing the amino acid in the
    position vs. just at random

50
Visualization and Editing
  • Visualization should present researchers with
    information
  • Point out key information
  • See patterns that might otherwise be missed
  • Editing
  • Sequence alignments might still violate
    biological principles
  • Our models are not perfect
  • Editors allow researchers to modify alignments,
    based on relevant principles
  • Principles are not built into most software

51
Visualization
  • Sequence alignment results

52
Visualization
  • Sequence logos for consensus sequences

Consensus read off the top
53
Visualization
54
Visualization
55
Editing
  • CINEMA
  • http://bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
  • Applet
  • GDE (Genetic Data Environment)
  • http://www.tigr.org/jeisen/GDE/GDE.html
  • Unix based, requires GCG
  • GeneDoc
  • Windows based
  • MACAW
  • Mac and PC

56
Editing
  • In addition to moving motifs and the like, editors
    are used to format alignments
  • Colors, highlighting, etc.
  • Typically have to reformat the data
  • Can use the SEQIO program
  • Or use READSEQ on the web

57
Summary
  • There is a strong relationship between multiple
    sequence alignment and phylogenetics
  • Many different approaches to MSA
  • Dynamic programming
  • Progressive
  • Star, ClustalW, PILEUP
  • Iterative methods (not covered)
  • Hidden Markov Models, genetic algorithms
  • Localized alignments
  • Profiles, blocks

58
Summary
  • Basic probabilistic approaches
  • Expectation maximization
  • Gibbs sampling
  • Hidden Markov Models
  • Visualizations and Editing
  • Sequence logos