Title: Bioinformatics
1Lecture 9
Bioinformatics
- Multiple sequence alignments
- Scoring multiple sequence alignments
- Progressive methods
- ClustalW
- Other methods
- Hidden Markov Models
2Multiple sequence alignments (MSA)
- Comparing a pair of sequences is not sufficient for many research purposes, mainly evolutionary reconstruction and the study of functional similarities.
- MSA is much more demanding computationally. For two protein sequences, each 300 aa in length and excluding gaps, the number of comparisons to be made by dynamic programming is 300^2 = 9 x 10^4. For 3 sequences of the same length this number is 300^3 = 2.7 x 10^7. For 10 sequences it becomes staggering.
- Fortunately, in the late 1980s and mid-1990s, methods that dramatically reduce the number of comparisons were invented.
- An MSA is usually built in three consecutive steps:
  1. Find alignments between each pair of sequences.
  2. Produce a trial MSA by predicting a phylogenetic tree for the sequences (for instance by the neighbor-joining method).
  3. Multiply align the sequences in the order of their relationship on the tree.
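The growth in the number of comparisons quoted above can be checked with a few lines of Python. This is only an illustration of the arithmetic (sequence length raised to the power of the number of sequences, gaps excluded); the function name `dp_comparisons` is made up for this sketch.

```python
def dp_comparisons(length: int, n_sequences: int) -> int:
    """Comparisons needed by full dynamic programming, gaps excluded.

    Assumes every sequence has the same length, as in the 300 aa example.
    """
    return length ** n_sequences


if __name__ == "__main__":
    for n in (2, 3, 10):
        # Scientific notation makes the growth easier to read.
        print(f"{n} sequences: {dp_comparisons(300, n):.1e} comparisons")
```

For 2 and 3 sequences this reproduces 9 x 10^4 and 2.7 x 10^7; for 10 sequences the count is of the order of 10^24.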
3Scoring Multiple sequence alignments
4Progressive methods of MSA
- The most closely related sequences are aligned first by dynamic programming, and the MSA is then built up starting from these most related sequences.
- The tree is based on pairwise comparisons of the sequences using one of the phylogenetic methods.
- Unfortunately, uncertainty grows at the lower levels of the tree, because deletions and insertions are not easy to recognise.
- The challenge is to choose an appropriate combination of sequence weighting, scoring matrix and gap penalties; a poor choice prevents an optimal MSA.
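The idea of "most closely related sequences first" can be sketched as follows. This is a minimal illustration, not the tree-building method of any particular program: it uses simple average-linkage clustering as a stand-in for a phylogenetic method such as neighbor-joining, and the function name `merge_order` and the distances are hypothetical.

```python
from itertools import combinations


def merge_order(names, dist):
    """Return the order in which clusters are joined, closest pair first.

    dist maps frozenset({name1, name2}) to a pairwise distance.
    Average linkage is an assumption made for this sketch.
    """
    clusters = [(name,) for name in names]
    order = []

    def cluster_distance(c1, c2):
        # Mean distance over all sequence pairs drawn from the two clusters.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        order.append((c1, c2))
    return order


# Hypothetical pairwise distances between four sequences.
dist = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.4,
    frozenset(("A", "D")): 0.7,
    frozenset(("B", "C")): 0.5,
    frozenset(("B", "D")): 0.8,
    frozenset(("C", "D")): 0.6,
}
order = merge_order(["A", "B", "C", "D"], dist)
```

With these distances A and B are joined first, C is then added to the A/B group, and D is added last, which is the order in which a progressive method would align them.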
5ClustalW
- ClustalW is one of the advanced versions of a popular and powerful program; the W stands for weighting. ClustalW provides more realistic alignments that should reflect evolutionary changes, with a more appropriate distribution of gaps between conserved domains.
- ClustalW performs a global multiple sequence alignment by a different method than the MSA program, although the initial global multiple sequence alignment is calculated similarly.
- The steps involved are:
  1. Pairwise alignment of all sequences.
  2. Use of the alignment scores to build a phylogenetic tree.
  3. Progressive alignment guided by the phylogenetic relationships indicated by the tree.
- The most closely related (similar) sequences are aligned first, and then additional sequences are added.
- The initial alignments used to produce the guide tree may be obtained by a fast k-tuple approach (similar to FASTA) or a slower dynamic programming method.
- To build the tree, genetic distances between sequences are calculated as the number of mismatched positions in an alignment divided by the total number of matched positions.
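The distance described in the last point can be sketched in Python. One assumption is made here and should be read as such: "matched positions" is taken to mean positions where both sequences have a residue, so columns containing a gap are skipped. The function name and the two aligned fragments are hypothetical.

```python
def alignment_distance(seq1: str, seq2: str, gap: str = "-") -> float:
    """Mismatched positions divided by compared (gap-free) positions.

    Assumption for this sketch: columns with a gap in either sequence
    are not counted.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatched = 0
    for x, y in zip(seq1, seq2):
        if x == gap or y == gap:
            continue  # skip columns opposite a gap
        compared += 1
        if x != y:
            mismatched += 1
    if compared == 0:
        raise ValueError("no gap-free positions to compare")
    return mismatched / compared


# Hypothetical aligned fragments: 9 gap-free columns, 2 mismatches.
d = alignment_distance("MSTAGKV-IK", "MGTKGKVAIK")
```

Distances computed this way for every pair of sequences form the matrix from which the guide tree is built.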
6ClustalW
Sequence A (weight a)   ..K
Sequence C (weight c)   ..L

Sequence B (weight b)   ..I.
Sequence C (weight c)   ..L

The same procedure applies to other columns in all pairwise alignments.

Score for matching these two columns in an MSA:
  [a x c x score(K,L) + b x c x score(I,L)] / (n columns, m pairwise comparisons)
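A minimal sketch of this weighted column score is given below. It assumes the normalising term is simply the number of residue pairs compared, so the result is a weighted average; the weights are hypothetical, and the two substitution scores are illustrative values only.

```python
def weighted_column_score(column1, column2, score):
    """Average of weight1 * weight2 * score(residue1, residue2) over all pairs.

    column1, column2: lists of (residue, sequence_weight) tuples.
    Dividing by the number of pairs is an assumption of this sketch.
    """
    total = 0.0
    for res1, w1 in column1:
        for res2, w2 in column2:
            total += w1 * w2 * score(res1, res2)
    return total / (len(column1) * len(column2))


# Illustrative substitution scores for the two residue pairs on the slide.
toy_scores = {frozenset(("K", "L")): -2, frozenset(("I", "L")): 2}


def toy_score(x, y):
    return toy_scores[frozenset((x, y))]


# Hypothetical weights: a = 1.0 (sequence A), b = 0.5 (B), c = 0.8 (C).
column_ab = [("K", 1.0), ("I", 0.5)]
column_c = [("L", 0.8)]
value = weighted_column_score(column_ab, column_c, toy_score)
```

With these numbers the score is (1.0 x 0.8 x -2 + 0.5 x 0.8 x 2) / 2 = -0.4; down-weighting sequence A further would move the column score toward the I/L match.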
7An output from ClustalW: sequences have significant similarity

CLUSTAL W (1.82) multiple sequence alignment

gi42542791gbAAH66228.1   MSTAGKVIKCKAAVLWELKKPFSIEEVEVAPPKAHEVRIKMVAAGICRS- 49
gi825623embCAA39813.1    MGTKGKVIKCKAAIAWEAGKPLCIEEVEVAPPKAHEVRIQIIATSLCHT- 49
gi42738724gbAAS42652.1   --MQNFVFRNPTKLIFGKGQ---LEQLKTEIPQFGKKVLLVYGGGSIKRN 45

gi42542791gbAAH66228.1   ---DEHVVSGNLV-TPLPVILGHEAAGIVESVGEGVTTVKPG--DKVIPL 93
gi825623embCAA39813.1    ---DASVIDSKFEGLAFPVIVGHEAAGIVESIGPGVTNVKPG--DKVIPL 94
gi42738724gbAAS42652.1   GIYDNVISILKDINAEVFELTGVEPNPRVSTVKKGIQICKDNGVEFILAV 95

gi42542791gbAAH66228.1   FTPQCGKCRICKNPESNYCLKN-DLGNPRG-------------------T 123
gi825623embCAA39813.1    YAPLCRKCKFCLSPLTNLCGKISNLKSPASDQ----------------QL 128
gi42738724gbAAS42652.1   GGGSVIDCTKAIAAGSKYDGDVWDIVTKKAFASEALPFGTVLTLAATGSE 145

gi42542791gbAAH66228.1   LQDGTRRFTCSGKPIHHFVGVSTFSQYTVVDENAVAKIDAASPLEKVCLI 173
gi825623embCAA39813.1    MEDKTSRFTCKGKPVYHFFGTSTFSQYTVVSDINLAKIDDDANLERVCLL 178
gi42738724gbAAS42652.1   MNAGSVITNWETNEKYGWGSPVTFPQFSILDPVHTASVPRDQTIYGMVDI 195

Sequences shown:
- Alcohol dehydrogenase, iron-containing (Bacillus cereus)
- Class I alcohol dehydrogenase, gamma subunit (Homo sapiens)
- Different form of alcohol dehydrogenase (Homo sapiens)
8Localised alignments in sequences
- MSA programs discussed so far are based on global alignments, which include all available parts of the sequences.
- However, many sequences may have blocks of similarity separated by regions of low similarity.
- Three approaches were used to develop methods more oriented toward this structural feature:
  1. Profile analysis
  2. Block analysis
  3. Pattern searching
- Profiles are found by performing a global MSA of a group of sequences and then choosing the more highly conserved regions. A score matrix for such an MSA, called a profile, is then made. Once produced, the profile is used to search a target sequence for possible matches, using the scores in the table to evaluate the likelihood at each position.
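Searching a target sequence with a profile can be sketched as a sliding window over the target. This simplified version ignores gaps entirely (an assumption of the sketch; a real profile search also uses the gap penalties stored in the profile), and the three-position profile, its scores and the target sequence are all hypothetical.

```python
def best_profile_match(profile, target, default=0):
    """Slide an ungapped profile along the target; return (start, score).

    profile: list of dicts, one per position, mapping residue -> score.
    Residues missing from a position get the default score.
    """
    width = len(profile)
    best_start, best_score = None, float("-inf")
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        window_score = sum(position.get(residue, default)
                           for position, residue in zip(profile, window))
        if window_score > best_score:
            best_start, best_score = start, window_score
    return best_start, best_score


# Hypothetical three-position profile and target sequence.
toy_profile = [{"T": 5, "S": 2}, {"P": 6}, {"G": 6, "A": 1}]
start, score = best_profile_match(toy_profile, "MKATPGLV")
```

Here the window TPG starting at position 3 (counting from 0) gets the highest score, 17.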
9Profile analysis pattern identification
Cons    A    B    C    D  ...    V    W    Y    Z   Gap  Len
I       8    3   -2    5  ...   21  -18   -6    4   100  100
T      13   19   -5   24  ...    3  -28   14   15   100  100
L       5    5   -5    3  ...   10   -1    5    2    22   22
S      17   14   17   13  ...    1   -8  -15    4   100  100
T      15    3   22    0  ...    9  -22    6   -4   100  100
T       8   -1   12   -2  ...   19  -15    4   -3   100  100
C      17    0   24   -1  ...    9   -5   14   -7   100  100
V      11    0   18   -1  ...   31  -19    5   -5   100  100
C      10   -8   15  -11  ...   15   22   14  -11   100  100
V       7    7   -3    8  ...   26  -24   -6    8   100  100
The profile represents the specific motif pattern found at the chosen location in a set of hsp70 proteins. It is used to search a target sequence for matches to the profile. The values are log-odds scores, i.e. the logarithm of the probability of finding the amino acid in the target sequence at that position in the profile divided by the probability of aligning the two aa by random chance. There are 23 columns, representing 20 aa, 1 unknown aa (Z), and the gap opening and extension penalties. Gaps are costly unless the profile itself includes gaps, as in row 3.
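The log-odds idea can be written out in a couple of lines. This is a sketch of the ratio only: base-2 logarithms are an assumption, the frequencies are hypothetical, and any scaling used to turn such values into the integers of a real profile is not shown.

```python
import math


def log_odds(observed_freq: float, background_freq: float) -> float:
    """log2 of (frequency at this profile position / chance frequency)."""
    return math.log2(observed_freq / background_freq)


# Hypothetical numbers: a residue seen in 40% of the sequences at one
# position, with a background (chance) frequency of 5%.
enriched = log_odds(0.4, 0.05)
# A residue seen less often than expected by chance scores below zero.
depleted = log_odds(0.01, 0.05)
```

Positive values mark residues favoured at that position, negative values mark residues that are rarer there than expected by chance.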
10Profile analysis pattern identification
11Block Analysis
- This method is very similar to the profile search. The major difference is that insertions and deletions are not considered. As a result, the patterns found contain regions of high similarity separated by loosely similar or dissimilar sequences.
- These ungapped patterns may be extracted from the aligned regions and used to produce blocks. Profile matrices are built in the same way as in the previous method.
Seq1  GVDVLVATPG  RLLDLEHQNA  ..VKLDQV  EILVLDEADR
Seq2  GPDALVSTPG  RYLTLEHRNV  ..LKPDIV  TIRVLDEADR
Seq3  AVEVIVSTPG  RLWDLHHQNA  ..VQLSQD  ELLDLDEADK
Seqn  GCDKLNATPG  RLMDLKHQGA  ..VKLLFV  SILVMDEADR
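Because a block has no gaps, a first step toward a matrix is simply counting residues column by column. The sketch below does this for the first block of the four sequences shown; the function name `column_counts` is made up, and converting the counts into scores is left out.

```python
from collections import Counter


def column_counts(block):
    """Count residues in each column of an ungapped block."""
    if len({len(segment) for segment in block}) != 1:
        raise ValueError("all segments in a block must have the same length")
    # zip(*block) walks the block column by column.
    return [Counter(column) for column in zip(*block)]


# First block from the four sequences above.
first_block = ["GVDVLVATPG", "GPDALVSTPG", "AVEVIVSTPG", "GCDKLNATPG"]
counts = column_counts(first_block)
```

Column 0 holds G three times and A once, while the last three columns (T, P, G) are identical in all four sequences, which is what makes this stretch a good block.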
12Hidden Markov Models
13Hidden Markov Model for sequence alignment
14Hidden Markov Models calculation of transition
probabilities
15GeneDoc a multiple sequence alignment editor