Title: Multiple Sequence Alignment
1Multiple Sequence Alignment
2Definition
- Homology: related by descent
- Homologous sequence positions
[Figure: the sequences ATTGCGC and ATCCGC and their unknown ancestor (?), with the candidate alignment ATTGCGC / AT-CCGC]
3Reasons for aligning sets of sequences
- Organise data to reflect sequence homology
- Estimate evolutionary distance
- Infer phylogenetic trees from homologous sites
- Highlight conserved sites/regions
- Highlight variable sites/regions
- Uncover changes in gene structure
- Look for evidence of selection
- Summarise information
4Alignments help to
- Organise sequence data
- Visualise sequence data
- Analyse sequence data
5- Aligning sequences is a game of playing off gaps against mismatches
6Ways of aligning multiple sequences
- By hand
- Automated
- Combination
7Definition
- Optimality criterion: some kind of rule or scoring scheme that helps you decide which alignment you consider to be the best
8Pairwise vs Multiple Sequences
- Pairs of sequences are typically aligned using exhaustive algorithms (dynamic programming)
- The complexity of exhaustive methods is $O(2^n m^n)$, where n is the number of sequences and m is the sequence length
- Multiple sequence alignment therefore uses heuristic methods
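The dynamic programming approach for a single pair can be sketched as below. This is a minimal Needleman-Wunsch global alignment with a linear gap penalty; the function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not values taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global pairwise alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ATTGCGC", "ATCCGC")
print(best)
print(top)
print(bottom)
```

The table has one cell per pair of prefixes, so the work grows with the product of the two lengths. Extending the same table to n sequences needs one dimension per sequence, which is why the exhaustive approach does not scale.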
9The Correct Alignment
[Figure: ATTGCGC and ATCCGC descending from an unknown ancestral sequence (?)]
10The Correct Alignment
11- Sequence alignment is easy with sufficiently closely related sequences
- Below a certain level of identity, sequence alignment may become meaningless
- The twilight zone for amino acid sequences is around 30% identity
- In the twilight zone it is good to make use of additional information if possible (e.g. structure)
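To see where a pair of sequences sits relative to the twilight zone, percent identity can be computed from an existing pairwise alignment. The sketch below assumes one common convention (columns containing a gap are excluded from the count); other conventions exist, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns (an assumed convention)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Example alignment from the earlier slide: 5 identical columns out of 6 compared.
print(round(percent_identity("ATTGCGC", "AT-CCGC"), 1))
```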
12Consensus Sequences
- Simplest form: a single sequence which represents the most common amino acid/base in each position
    Y D D G A   V   - E A L
    Y D G G -   -   - E A L
    F E G G I   L   V E A L
    F D - G I   L   V Q A V
    Y E G G A   V   V Q A L
    Y D G G A/I V/L V E A L   (consensus)
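The simplest form of consensus can be sketched as a column-by-column majority vote. The function below is an illustrative assumption about how ties and gaps are handled: gaps are ignored, and tied residues are joined with "/" in order of first appearance.

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Return the most common residue per column; ties are joined with '/'."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != gap)
        if not counts:
            result.append(gap)  # column made up entirely of gaps
            continue
        top = max(counts.values())
        # Counter keeps first-appearance order, so ties are listed in that order.
        result.append("/".join(c for c, n in counts.items() if n == top))
    return result


rows = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(" ".join(consensus(rows)))  # prints: Y D G G A/I V/L V E A L
```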
13Multiple Alignment Formats
- e.g. Clustal, Phylip, MSF, MEGA, etc.
14Clustal Format
    CLUSTAL X (1.81) multiple sequence alignment

    CAS1_BOVIN   MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
    CAS1_SHEEP   MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
    CAS1_PIG     MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
    CAS1_HUMAN   MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
    CAS1_RABBIT  MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
    CAS1_MOUSE   MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
    CAS1_RAT     MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
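A Clustal file is a header line followed by blocks of "name, sequence chunk" rows, so reading one can be sketched in a few lines. This is a minimal reader written for illustration, not the parser used by ClustalX. It assumes names contain no spaces and that conservation lines start with whitespace.

```python
def parse_clustal(text):
    """Return {name: full aligned sequence} from Clustal-formatted text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("missing CLUSTAL header line")
    sequences = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (which begin with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        # Blocks for the same name are concatenated in file order.
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


example = """CLUSTAL X (1.81) multiple sequence alignment

CAS1_BOVIN   MKLLILTCLVAVALARPKHP
CAS1_SHEEP   MKLLILTCLVAVALARPKHP
             ********************

CAS1_BOVIN   IKHQGLPQ
CAS1_SHEEP   IKHQGLSP
             ******
"""
for name, seq in parse_clustal(example).items():
    print(name, seq)
```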
15Phylip Format (Interleaved)
    7 100
    SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
    SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
    SOMA_RAT_P -MAADSQTPW LLTFSLLCLL WPQEAGAFPA MPLSSLFANA VLRAQHLHQL
    SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA MPLSSLFSNA VLRAQHLHQL
    SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA MPLSSLFANA VLRAQHLHQL
    SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA MPLSSLFANA VLRAQHLHQL
    SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT IPLSRLFDNA MLRAHRLHQL

               AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
               AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
               AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
               AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
               AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDM
               AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDV
               AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC FSESIPTPSN REETQQKSNL
16Phylip Format (Sequential)
- 3 100
- Rat
- ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCG
- TGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
- Mouse
- ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCT
- TGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
- Rabbit
- ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGC
- TGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG
17Mega Format
- mega
- TITLE No title
- Rat ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
- Mouse ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
- Rabbit ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
- Human ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
- Oppossum ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
- Chicken ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
- Frog ---ATGGGTTTGACAGCACATGATCGT---CAGCT
18Progressive Multiple Alignment
- Heuristic
- Perform pairwise alignments
- Align sequences to alignments or alignments to existing alignments (profile alignments)
- Do the alignments in some sensible order
19Iterative methods
- Several progressive alignment methods can be iterated
- e.g. Barton-Sternberg, ClustalX
20ClustalX Algorithm
- Perform pairwise alignments and calculate distances for all pairs of sequences
- Construct a guide tree (dendrogram) joining the most similar sequences using Neighbour Joining
- Align sequences, starting at the leaves of the guide tree. This involves pairwise comparisons as well as comparison of a single sequence with a group of sequences (profile)
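The first two steps (pairwise distances, then a guide tree) can be sketched as below. Two simplifications are assumed and should not be read as what ClustalX does: the distance is taken from Python's `difflib.SequenceMatcher` similarity ratio rather than from real pairwise alignments, and the tree is built by average-linkage (UPGMA-style) clustering rather than Neighbour Joining. The function names are hypothetical, and the final progressive alignment step is not shown.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distances(seqs):
    """Distance = 1 - similarity ratio; a crude stand-in for alignment-based distances."""
    return {
        frozenset((a, b)): 1.0
        - SequenceMatcher(None, seqs[a], seqs[b], autojunk=False).ratio()
        for a, b in combinations(seqs, 2)
    }


def guide_tree(seqs):
    """Join the closest clusters first; returns a nested tuple of sequence names."""
    dist = pairwise_distances(seqs)
    # Each cluster is (subtree, list of leaf names under it).
    clusters = [(name, [name]) for name in seqs]

    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


seqs = {
    "Rat": "ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT",
    "Mouse": "ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT",
    "Rabbit": "ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGC",
}
print(guide_tree(seqs))  # the innermost pair is the one aligned first
```

The nested tuple gives the order of alignment: the innermost pair is aligned first, and each enclosing level adds a sequence or a group to the growing alignment.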
21- ClustalX is not optimal
- There are known areas in which ClustalX performs badly, e.g.
- errors introduced early cannot be corrected by subsequent information
- alignments of sequences of differing lengths cause strange guide trees and unpredictable effects
- edges: ClustalX does not penalise gaps at edges
- There are alternatives to ClustalX available
22Using ClustalX
- Start with sequences in FASTA format (or an existing alignment in Clustal format)
- Do Alignment on the Alignment menu
24ClustalX Parameters
- Scoring Matrix
- Gap opening penalty
- Gap extension penalty
- Protein gap parameters
- Additional algorithm parameters
- Secondary structure penalties
25Score Matrices
- Pairwise matrices and multiple alignment matrix series
- PAM (Dayhoff), BLOSUM (Henikoff), Gonnet (default), user defined
- Transition (A<->G, C<->T) / transversion (purine<->pyrimidine) ratio: low for distantly related sequences
26Gap Penalties
- Linear gap penalties; affine gap penalties
- $p = o + l \cdot e$ for a gap of length l
- o: gap opening penalty
- e: gap extension penalty
- Protein specific penalties (on by default)
- Increase the probability of gaps associated with certain residues
- Increase the chances of gaps in loop regions (> 5 hydrophilic residues)
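The difference between the two gap models can be shown with a short calculation. The penalty values below are illustrative assumptions, not ClustalX defaults.

```python
def linear_gap_penalty(length, per_residue):
    """Every gapped position costs the same."""
    return length * per_residue


def affine_gap_penalty(length, opening, extension):
    """p = o + l*e: a one-off opening cost plus a per-position extension cost."""
    if length <= 0:
        return 0.0
    return opening + length * extension


# Under the affine model one gap of length 4 is cheaper than two gaps of length 2,
# so the aligner prefers a few long gaps to many short ones.
one_long = affine_gap_penalty(4, opening=10, extension=0.5)
two_short = 2 * affine_gap_penalty(2, opening=10, extension=0.5)
print(one_long, two_short)  # prints: 12.0 22.0
```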
27Algorithm parameters
- Slow-accurate pair-wise alignment
- Do alignment from guide tree
- Reset gaps before aligning (iteration)
- Delay divergent sequences (%)
28Additional displays
- Column Scores
- Low quality regions
- Exceptional residues
29Multiple Alignment Strategies
- Align pairs of sequences using an optimal method
- Choose representative sequences to align carefully
- Choose sequences of comparable lengths
- Use progressive alignment programs such as ClustalX for multiple alignment
- Progressive alignment programs may be combined
- Review alignment by eye and edit
30Alignment of coding regions
- Nucleotide sequences are much harder to align accurately than proteins
- Protein coding sequences can be aligned using the protein sequences
- e.g. in BioEdit: toggle translation to amino acid, call ClustalW to align, edit alignment by hand, toggle back to nucleotide
- In-frame nucleotide alignments can be used, e.g. to determine non-synonymous and synonymous distances separately
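The "toggle back to nucleotide" step amounts to threading the original codons through the gapped protein alignment. A minimal sketch follows, assuming the coding sequence is in frame and has exactly three bases per residue. The function name is hypothetical, and the code does not check that each codon actually encodes its residue.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Place codons under an aligned protein, inserting three-base gaps for each protein gap."""
    residues = len(aligned_protein.replace(gap, ""))
    if len(cds) != 3 * residues:
        raise ValueError("coding sequence must contain exactly 3 bases per residue")
    codons = iter(cds[i:i + 3] for i in range(0, len(cds), 3))
    return "".join(gap * 3 if aa == gap else next(codons) for aa in aligned_protein)


# The rabbit sequence from the MEGA slide, translated (MVHLSSEE) and aligned with one gap.
print(thread_codons("MVHLSS-EE", "ATGGTGCATCTGTCCAGTGAGGAG"))
# prints: ATGGTGCATCTGTCCAGT---GAGGAG
```

Because gaps are inserted only in multiples of three, the reading frame is preserved, and that is what makes codon-based distance estimates possible.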
31Multiple Alignments and Phylogenetic Trees
- You can make a more accurate multiple sequence alignment if you know the tree already
- A good multiple sequence alignment is an important starting point for drawing a tree
- The process of constructing a multiple alignment (unlike pairwise) needs to take account of phylogenetic relationships
32Editing a multiple sequence alignment
- It is NOT fraud to edit a multiple sequence alignment
- Incorporate additional knowledge if possible
- Alignment editors help to keep the data organised and help to prevent unwanted mistakes
33Alignment Editors
- e.g. GDE, BioEdit, Seaview, Jalview etc.
- Alignment editors can function as an organisational tool (analysis tools in BioEdit)
- Construct sub-sequences (GDE, Seaview)
- Annotate sequences (Seaview)
34Aligning weakly similar sequences
35Sequence contains conserved regions
- e.g. DIALIGN (Morgenstern, Dress, Werner)
- re-aligns regions between conserved blocks
- http://bibiserv.techfak.uni-bielefeld.de/
- useful if sequences contain consistent conserved blocks
- Block Maker searches for conserved words that may be inconsistent: http://blocks.fhcrc.org/
36Profile Alignment
- Gribskov et al. 1987
- Position specific scores
- Allows alignment of alignments
- Gaps introduced as whole columns in the separate alignments
- Optimal alignment in time $O(a^2 l^2)$
- a: alphabet size, l: sequence length
- Information about the degree of conservation of sequence positions is included
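Position-specific scoring can be illustrated by turning an alignment into per-column residue frequencies and scoring a sequence against them. This sketch covers only the scoring idea, with no gaps and no dynamic programming. It is not the Gribskov method itself: the pseudocount, the uniform background and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def build_profile(rows, alphabet, pseudocount=1.0):
    """Per-column residue frequencies (gaps ignored), smoothed with a pseudocount."""
    profile = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({r: (counts[r] + pseudocount) / total for r in alphabet})
    return profile


def profile_score(profile, seq, alphabet):
    """Sum of log-odds against a uniform background, one residue per column."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    background = 1.0 / len(alphabet)
    return sum(math.log(col[r] / background) for col, r in zip(profile, seq))


dna = "ACGT"
profile = build_profile(["ATTGCGC", "ATTGCGC", "AT-CCGC"], dna)
# A sequence matching the conserved columns scores higher than an unrelated one.
print(profile_score(profile, "ATTGCGC", dna) > profile_score(profile, "CCCCCCC", dna))
```

Conserved columns give a large reward for the expected residue and a large penalty for anything else, while variable columns contribute little either way.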
37Good reasons to use profile alignments
- Adding a new sequence to an existing multiple alignment that you want to keep the same (align sequence to profile)
- Searching a database for new members of your protein family (pfsearch)
- Searching a database of profiles to find out which one your sequence belongs to (pfscan)
- Combining two multiple sequence alignments (profile to profile)
38Profile Alignment Using ClustalX
- Profile Alignment Mode
- Align sequence to profile
- Align profile 1 to profile 2
- Secondary structure parameters
40Profile searching using PSI-BLAST
- Position Specific Iterative
- Perform search -> construct profile -> perform search
- Convergence (hopefully)
- Increased sensitivity for distantly related sequences
- Available on-line (NCBI)
41Databases of Aligned Sequences
- Hovergen http://pbil.univ-lyon1.fr/databases/hovergen.html (vertebrate alignments)
- Pfam http://www.sanger.ac.uk/Software/Pfam/ (protein domain alignments and profile HMMs)
- BLOCKS http://blocks.fhcrc.org/
- Ribosomal Database Project http://rdp.cme.msu.edu/html/ (alignments and trees derived from rRNA sequences)
- InterPro combines information from other sources
- Many more
42Probabilistic Models of Sequence Alignment
- Hidden Markov Models
- sequence of states and associated symbol probabilities
- Produces a probabilistic model of a sequence alignment
- Align a sequence to a Profile Hidden Markov Model
- Algorithms exist to find the most probable path through the model (e.g. the Viterbi algorithm)
43- Markov Chain: a chain of things. The probability of the next thing depends only on the current thing
- Hidden Markov Model: a sequence of states which form a Markov Chain. The states are not observable. The observable characters have emission probabilities which depend on the current state.
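These definitions can be made concrete with a toy model. The sketch below runs the Viterbi algorithm, which recovers the most probable sequence of hidden states for an observed sequence. The two states ("GC_rich", "AT_rich") and all probabilities are hypothetical values chosen for illustration; a profile HMM would instead have match, insert and delete states per alignment column.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log probability). All probabilities must be > 0."""
    # best[t][s] = log probability of the best path that ends in state s at position t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = best[-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return list(reversed(path)), best[-1][last]


states = ["GC_rich", "AT_rich"]
start_p = {"GC_rich": 0.5, "AT_rich": 0.5}
trans_p = {
    "GC_rich": {"GC_rich": 0.9, "AT_rich": 0.1},
    "AT_rich": {"GC_rich": 0.1, "AT_rich": 0.9},
}
emit_p = {
    "GC_rich": {"G": 0.4, "C": 0.4, "A": 0.1, "T": 0.1},
    "AT_rich": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4},
}
path, log_prob = viterbi("GGGGAAAA", states, start_p, trans_p, emit_p)
print(path)
```

Only the bases are observed. The states are inferred from the emission probabilities, and the next state depends only on the current one, which is the Markov property described above.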