Bioinformatics - PowerPoint PPT Presentation

Slides: 16
Provided by: anatolyr

Transcript and Presenter's Notes

1
Lecture 9
Bioinformatics
  • Multiple sequence alignments
  • Scoring multiple sequence alignments
  • Progressive methods
  • ClustalW
  • Other methods
  • Hidden Markov Models

2
Multiple sequence alignments (MSA)
  • Comparing a pair of sequences is not sufficient
    for many research purposes, mainly evolutionary
    reconstruction and the study of functional
    similarities.
  • MSA is much more demanding computationally. For
    two protein sequences, each 300 aa in length and
    excluding gaps, the number of comparisons to be
    made using the dynamic programming approach is
    300^2 = 9 x 10^4. For 3 sequences of the same
    length this number is 300^3 = 2.7 x 10^7. For 10
    sequences it becomes staggering.
  • Fortunately, in the late 1980s and mid-1990s,
    methods that dramatically reduce the number of
    comparisons were invented.
  • An MSA is usually built in three consecutive
    steps:
    1. Alignments are found between each pair of
       sequences.
    2. A trial MSA is then produced by predicting a
       phylogenetic tree for the sequences (for
       instance by the neighbor-joining method).
    3. The sequences are then multiply aligned in the
       order of their relationship on the tree.
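
The comparison counts quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes equal-length sequences and no gaps, as on the slide, and the function name dp_comparisons is made up for the illustration.

```python
def dp_comparisons(seq_len: int, n_seqs: int) -> int:
    """Comparisons needed for full dynamic programming over n_seqs
    sequences of equal length, ignoring gaps: seq_len ** n_seqs."""
    return seq_len ** n_seqs


for n in (2, 3, 10):
    print(f"{n} sequences: {dp_comparisons(300, n):.1e} comparisons")
```

The count grows exponentially with the number of sequences, which is why full dynamic programming is impractical beyond a handful of sequences.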

3
Scoring Multiple sequence alignments
4
Progressive methods of MSA
  • The most closely related sequences are aligned
    first by dynamic programming, and the MSA is
    built up starting from them
  • The tree is based on pairwise comparisons of the
    sequences using one of the phylogenetic methods
  • Unfortunately, uncertainty grows in the lower
    levels of the tree, as deletions and insertions
    are not easy to recognise
  • The challenge is to choose an appropriate
    combination of sequence weighting, scoring matrix
    and gap penalties; a poor choice prevents an
    optimal MSA
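
The "most related first" order can be illustrated with a small sketch. It uses simple average-linkage clustering of pairwise distances to decide which sequences or groups are joined next. This is a stand-in chosen for brevity, not the neighbor-joining method and not the code of any particular program. The function names and the distance values are hypothetical.

```python
from itertools import combinations


def average_distance(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
    return sum(pairs) / len(pairs)


def merge_order(names, dist):
    """Return the cluster pairs in the order they would be joined,
    closest pair first (average linkage)."""
    clusters = [(name,) for name in names]
    order = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], dist),
        )
        order.append((c1, c2))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return order


# Hypothetical pairwise distances for four sequences.
dist = {
    frozenset(("A", "B")): 0.10,
    frozenset(("A", "C")): 0.40,
    frozenset(("A", "D")): 0.50,
    frozenset(("B", "C")): 0.45,
    frozenset(("B", "D")): 0.55,
    frozenset(("C", "D")): 0.20,
}

for left, right in merge_order(["A", "B", "C", "D"], dist):
    print(left, "+", right)
```

With these toy distances A and B are joined first, then C and D, and finally the two pairs. In a progressive method each join corresponds to one alignment step.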

5
ClustalW
  • This is an advanced version of the popular and
    powerful Clustal program, where W stands for
    weighting. ClustalW provides more realistic
    alignments that should reflect evolutionary
    changes and a more appropriate distribution of
    gaps between conserved domains
  • ClustalW performs a global multiple sequence
    alignment by a different method than the program
    MSA, although the initial global multiple
    sequence alignment is calculated similarly
  • The steps involved are:
    1. Pairwise alignment of all sequences
    2. Use of the alignment scores to build a
       phylogenetic tree
    3. Progressive alignment guided by the
       phylogenetic relationships indicated by the
       tree
  • The most closely related (similar) sequences are
    aligned first, and then additional sequences are
    added
  • The initial alignments used to produce the guide
    tree may be obtained by a fast k-tuple approach
    (similar to FASTA) or a slower dynamic
    programming method
  • To build the tree, the genetic distance between
    two sequences is calculated as the number of
    mismatched positions in their alignment divided
    by the total number of matched positions
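
The distance described in the last bullet can be sketched as follows. The sketch assumes that "matched positions" means alignment columns in which both sequences have a residue, so columns containing a gap are skipped. That reading, the function name and the toy sequences are assumptions made for illustration.

```python
def pairwise_distance(aligned_a: str, aligned_b: str) -> float:
    """Mismatched positions divided by compared (ungapped) positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatched = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x != y:
            mismatched += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatched / compared


# Toy pairwise alignment: 9 ungapped columns, 2 of them mismatched.
print(pairwise_distance("MSTAGKV-IK", "MGTKGKVAIK"))
```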

6
ClustalW
Sequence A (weight a)   ..K..
Sequence C (weight c)   ..L..

Sequence B (weight b)   ..I..
Sequence C (weight c)   ..L..

The same procedure applies to other columns in all
pairwise alignments.

Score for matching these two columns in an MSA
  = [a x c x score(K,L) + b x c x score(I,L)] / n

[Figure labels: columns; m pairwise comparisons]
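
The weighted column score from this slide can be written out as a short function. This is a sketch under two assumptions: n is taken to be the number of pairwise residue comparisons between the two columns, and the weights and substitution scores below are illustrative values rather than ClustalW's actual numbers. All names are hypothetical.

```python
def column_score(col1, col2, weights, score):
    """Average weighted score for matching two alignment columns.

    col1, col2: dicts mapping sequence name -> residue in that column.
    weights:    dict mapping sequence name -> sequence weight.
    score:      function (residue, residue) -> substitution score.
    """
    total = 0.0
    n_pairs = 0
    for name1, res1 in col1.items():
        for name2, res2 in col2.items():
            total += weights[name1] * weights[name2] * score(res1, res2)
            n_pairs += 1
    return total / n_pairs


# Illustrative substitution scores and weights (assumed values).
toy_scores = {frozenset("KL"): -2, frozenset("IL"): 2}
weights = {"A": 0.5, "B": 0.3, "C": 0.8}

result = column_score(
    {"A": "K", "B": "I"},  # column from the alignment of A and B
    {"C": "L"},            # column from sequence C
    weights,
    lambda x, y: toy_scores[frozenset((x, y))],
)
print(result)
```

Here the two terms are 0.5 x 0.8 x (-2) and 0.3 x 0.8 x 2, and their sum is divided by the two comparisons made.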
7
An output from ClustalW: sequences have
significant similarity

CLUSTAL W (1.82) multiple sequence alignment

gi|42542791|gb|AAH66228.1  MSTAGKVIKCKAAVLWELKKPFSIEEVEVAPPKAHEVRIKMVAAGICRS- 49
gi|825623|emb|CAA39813.1   MGTKGKVIKCKAAIAWEAGKPLCIEEVEVAPPKAHEVRIQIIATSLCHT- 49
gi|42738724|gb|AAS42652.1  --MQNFVFRNPTKLIFGKGQ---LEQLKTEIPQFGKKVLLVYGGGSIKRN 45

gi|42542791|gb|AAH66228.1  ---DEHVVSGNLV-TPLPVILGHEAAGIVESVGEGVTTVKPG--DKVIPL 93
gi|825623|emb|CAA39813.1   ---DASVIDSKFEGLAFPVIVGHEAAGIVESIGPGVTNVKPG--DKVIPL 94
gi|42738724|gb|AAS42652.1  GIYDNVISILKDINAEVFELTGVEPNPRVSTVKKGIQICKDNGVEFILAV 95

gi|42542791|gb|AAH66228.1  FTPQCGKCRICKNPESNYCLKN-DLGNPRG-------------------T 123
gi|825623|emb|CAA39813.1   YAPLCRKCKFCLSPLTNLCGKISNLKSPASDQ----------------QL 128
gi|42738724|gb|AAS42652.1  GGGSVIDCTKAIAAGSKYDGDVWDIVTKKAFASEALPFGTVLTLAATGSE 145

gi|42542791|gb|AAH66228.1  LQDGTRRFTCSGKPIHHFVGVSTFSQYTVVDENAVAKIDAASPLEKVCLI 173
gi|825623|emb|CAA39813.1   MEDKTSRFTCKGKPVYHFFGTSTFSQYTVVSDINLAKIDDDANLERVCLL 178
gi|42738724|gb|AAS42652.1  MNAGSVITNWETNEKYGWGSPVTFPQFSILDPVHTASVPRDQTIYGMVDI 195

[Conservation marker lines under each block are not
recoverable from the transcript.]

Sequence labels in the figure:
  • Alcohol dehydrogenase, iron-containing
    (Bacillus cereus)
  • Class I alcohol dehydrogenase, gamma subunit
    (Homo sapiens)
  • Different form of alcohol dehydrogenase
    (Homo sapiens)
8
Localised alignments in sequences
  • The MSA programs discussed so far are based on
    global alignments, which include all available
    parts of the sequences
  • However, many sequences have blocks of
    similarity separated by regions of low
    similarity
  • Three approaches were used to develop methods
    more oriented toward this structural feature:
    1. Profile analysis
    2. Block analysis
    3. Pattern searching
  • Profiles are found by performing a global MSA of
    a group of sequences and then choosing the more
    highly conserved regions. A score matrix for such
    an MSA, called a profile, is then made. Once
    produced, the profile is used to search a target
    sequence for possible matches, using the scores
    in the table to evaluate the likelihood at each
    position.

9
Profile analysis: pattern identification

CONS    A    B    C    D  ...    V    W    Y    Z   Gap  Len
I       8    3   -2    5  ...   21  -18   -6    4   100  100
T      13   19   -5   24  ...    3  -28   14   15   100  100
L       5    5   -5    3  ...   10   -1    5    2    22   22
S      17   14   17   13  ...    1   -8  -15    4   100  100
T      15    3   22    0  ...    9  -22    6   -4   100  100
T       8   -1   12   [-219]*       -15    4   -3   100  100
C      17    0   24   -1  ...    9   -5   14   -7   100  100
V      11    0   18   -1  ...   31  -19    5   -5   100  100
C      10   -8   15   [-1115]*       22   14  -11   100  100
V       7    7   -3    8  ...   26  -24   -6    8   100  100

* The D and V entries of this row are fused in the
  transcript and cannot be separated reliably.
The profile represents the specific motif pattern
found at the chosen location for a set of hsp70
proteins. It is used to search a target sequence
for matches to the profile. The values are log-odds
scores: the logarithm of the probability of finding
the amino acid at that position in the profile
divided by the probability of aligning the two
amino acids by random chance. There are 23 columns,
representing 20 aa, 1 unknown aa (Z), and the gap
opening and extension penalties. Gaps are costly
unless the profile itself includes gaps, as in
row 3.
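
A minimal sketch of both ideas, the log-odds value and the profile search, is given below. It makes several assumptions that are not on the slide: the logarithm is taken to base 2, the scan is ungapped, residues missing from a profile row receive an arbitrary default score, and the three-row profile holds illustrative toy values for six residue columns only. The function names are hypothetical.

```python
import math


def log_odds(p_position: float, p_background: float) -> float:
    """Log-odds score (base 2 assumed) for one residue at one position."""
    return math.log2(p_position / p_background)


def scan_profile(profile, target, default=-10):
    """Slide an ungapped profile along target.

    profile: list of dicts, one per position, residue -> score.
    Returns (best_score, start_index) of the best-scoring window.
    """
    width = len(profile)
    if len(target) < width:
        raise ValueError("target is shorter than the profile")
    best = None
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        total = sum(row.get(res, default) for row, res in zip(profile, window))
        if best is None or total > best[0]:
            best = (total, start)
    return best


# Illustrative three-position profile (toy values).
profile = [
    {"A": 8, "C": -2, "D": 5, "V": 21, "W": -18, "Y": -6},
    {"A": 13, "C": -5, "D": 24, "V": 3, "W": -28, "Y": 14},
    {"A": 5, "C": -5, "D": 3, "V": 10, "W": -1, "Y": 5},
]

print(log_odds(0.5, 0.0625))              # residue 8 times more likely than chance
print(scan_profile(profile, "WCAVDVYW"))  # best window is "VDV" at index 3
```

A real profile search also allows gaps, charged with the per-position opening and extension penalties in the last two columns of the table. The sketch leaves that out.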
10
Profile analysis: pattern identification
11
Block Analysis
  • This method is very similar to the profile
    search. The major difference is that insertions
    and deletions are not considered. As a result,
    the patterns found are regions of high similarity
    separated by loosely similar or dissimilar
    sequences
  • These ungapped patterns are extracted from the
    aligned regions and used to produce blocks.
    Profile matrices are then built in the same way
    as in the previous method.

Seq1  GVDVLVATPG  RLLDLEHQNA  ..VKLDQV  EILVLDEADR
Seq2  GPDALVSTPG  RYLTLEHRNV  ..LKPDIV  TIRVLDEADR
Seq3  AVEVIVSTPG  RLWDLHHQNA  ..VQLSQD  ELLDLDEADK
Seqn  GCDKLNATPG  RLMDLKHQGA  ..VKLLFV  SILVMDEADR
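
As a rough illustration of how an ungapped block feeds a position-specific matrix, the sketch below tabulates residue frequencies per column for the first block shown above. It stops at raw frequencies. Turning them into log-odds scores would need background frequencies and pseudocounts, which the slides do not specify. The function name is hypothetical.

```python
from collections import Counter


def column_frequencies(block):
    """Per-column residue frequencies for an ungapped block of
    equal-length sequences."""
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all sequences in a block must have equal length")
    freqs = []
    for column in zip(*block):
        counts = Counter(column)
        freqs.append({res: n / len(block) for res, n in counts.items()})
    return freqs


# First block of the four sequences shown on the slide.
block = ["GVDVLVATPG", "GPDALVSTPG", "AVEVIVSTPG", "GCDKLNATPG"]
freqs = column_frequencies(block)

# Columns in which a single residue is present in every sequence.
conserved = [i for i, col in enumerate(freqs) if len(col) == 1]
print(conserved)
print(freqs[0])
```

For this block only the last three columns (T, P, G) are fully conserved, and the first column is G in three of the four sequences.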
12
Hidden Markov Models
13
Hidden Markov Model for sequence alignment
14
Hidden Markov Models: calculation of transition
probabilities
15
GeneDoc: a multiple sequence alignment editor