Bioinformatic PhD. course - PowerPoint PPT Presentation

About This Presentation

Title:

Bioinformatic PhD. course

Description:

which is an affine-gap function. How is the best alignment found?. C T A C T A C T A C G T ... 2.2 Affine-gap model score. In both cases we know which cell contributes ... – PowerPoint PPT presentation

Number of Views:74

Avg rating:3.0/5.0

Slides: 36

Provided by: lcl2

Category:

more less

Transcript and Presenter's Notes

Title: Bioinformatic PhD. course

1
Bioinformatic PhD. course
Bioinformatics Xavier Messeguer Peypoch
(http//www.lsi.upc.es/alggen) LSI Dep. de
Llenguatges i Sistemes Informàtics BSC Barcelona
Supercomputing Center Universitat Politècnica de
Catalunya
2
Contents
1. Biological introduction
2. Comparison of short sequences ( up to
10.000bps)
Dot Matrix Pairwise align.
Multiple align. Hash alg.
3. Comparison of large sequences ( more that
10.000bps)
Data structures Suffix trees MUMs
4. String matching
Exact Extended Approximate
5. Sequence assembly
6. Projects PROMO, MREPATT,
3
2. Comparison of short sequences (lt10.000 bps)
Summary (more or less)

2.1 Dot matrix
2.2 Pairwise alignment.
2.3 Hash algorithms.
2.4 Multiple alignment.

4
2.2 Pairwise alignment
Given two DNA sequences A (a1a2...an)
and B (b1b2...bm) from the alphabet a,c,t,g we
say that A and B from a,c,t,g,- are aligned
iff

A and B become A and B if gaps ( ) are
removed.
AB
For all i, it is not possible that ai bi -

MALIG (an example)
How many alignments of two sequences exist?
Which is the best alignment?
5
2.2 Number of alignments
Given two DNA sequences A (a1a2...an)
and B (b1b2...bm) there are
(a1a2...an ,b1b2...bm) (a1a2...an-1
,b1b2...bm) those that end with
(an,-) (a1a2...an ,b1b2...bm-1) those
that end with (-,bm) (a1a2...an-1
,b1b2...bm-1) those that end with (an,bm)
(a1,b1)
6
2.2 Number of alignments
Given two DNA sequences A (a1a2...an)
and B (b1b2...bm) there are
(a1a2...an ,b1b2...bm) (a1a2...an-1
,b1b2...bm) those that end with
(an,-) (a1a2...an ,b1b2...bm-1) those
that end with (-,bm) (a1a2...an-1
,b1b2...bm-1) those that end with (an,bm)
1
1
1
1
1 1 1
7
2.2 Number of alignments
Given two DNA sequences A (a1a2...an)
and B (b1b2...bm) there are
(a1a2...an ,b1b2...bm) (a1a2...an-1
,b1b2...bm) those that end with
(an,-) (a1a2...an ,b1b2...bm-1) those
that end with (-,bm) (a1a2...an-1
,b1b2...bm-1) those that end with (an,bm)
1
1
1
1
1 1 1
3
? ?
8
2.2 Number of alignments
Given two DNA sequences A (a1a2...an)
and B (b1b2...bm) there are
(a1a2...an ,b1b2...bm) (a1a2...an-1
,b1b2...bm) those that end with
(an,-) (a1a2...an ,b1b2...bm-1) those
that end with (-,bm) (a1a2...an-1
,b1b2...bm-1) those that end with (an,bm)
1
1
1
1
1 1 1
3
5 7
5 7
?
9
2.2 Number of alignments
Given two DNA sequences A (a1a2...an)
and B (b1b2...bm) then
(a1a2...an ,b1b2...bm) (a1a2...an-1
,b1b2...bm) those that end with ( an ,
-) (a1a2...an ,b1b2...bm-1) those
that end with ( - , bm) (a1a2...an-1
,b1b2...bm-1) those that end with ( an , bm)
1
1
1
1
1 1 1
3
5 7

25
25 63

5 7
But, what is the assymptotic value?
10
2.2 Assymptotic value
As
(a1a2...an ,b1b2...bn)
and
n! nn e-n (Stirling approximation)
then
(a1a2...an ,b1b2...bn) gt 22n
11
2.2 Best alignment
How can an alignment be scored?
catcactactgacgactatcgtagcgcggctatacatctacgccaa-
ctac-t-gtgtagatcgccgg c- tgactgc--acgactatcgt-
attgcggctacacactacgcacaactactgtatgtcgc-cgg----

Then we assign a score for each case, for example
1,-1,-2.
How can the best alignment be found?
12
2.2 Edit distance and alignment of strings
The best alignment of two strings
is related with the edit distance, first
discussed in 1966...
The most efficient algorithm was proposed in
1968 and in 1970
using the technique called Dynamic programming
13
2.2 Best alignment
C T A C T A C T A C G T A C T G A
14
2.2 Best alignment
C T A C T A C T A C G T A C T G A
15
2.2 Best alignment
C T A C T A C T A C G T A C T G A
The cell contains the score of the best
alignment of AC and
CTACT.
16
2.2 Best alignment
C T A C T A C T A C G T 0 A C T
G A
?
17
2.2 Best alignment
C T A C T A C T A C G T 0 -2 A C T
G A
?
- C
18
2.2 Best alignment
C T A C T A C T A C G T 0 -2 -4 A C
T G A
?
- - CT
19
2.2 Best alignment
C T A C T A C T A C G T 0 -2-4-6 -8
A C T G A
- - - - - - CTACTA
20
2.2 Best alignment
C T A C T A C T A C G T 0 -2-4-6 -8
A ? C ? T ? G A
21
2.2 Best alignment
C T A C T A C T A C G T 0 -2-4-6 -8
A-2 C-4 T -6 G A
ACT - - -
22
2.2 Best alignment
C T A C T A C T A C G T

A
C

T G A
C T A C T A C T A C G T 0 -2 -4-6 -8
A-2 C-4 T -6 G A
s(AC,CTA)-2
s(A,CTA)1
BA(AC,CTAC) best
s(AC,CTAC)max
s(A,CTAC)-2
23
Best alignment
Given the maximum score, how can the best
alignment be found?

Quadratic cost in space and time

Up to 10,000 bps sequences in length

Download alggen tool
24
2.2 Some slides revisited

We have developed the theory according to the
following principles
1) Both sequences have a similar length
(global).
2) The model of gaps is linear

If there are k consecutive gaps the penalty
scores k(-2).
25
2.2 Semiglobal pairwise alignment

Assume that we have sequences with different
length
S1
S2

It is meaningless to introduce gaps until both
sequences have similar length .
The most probable alignment should be
Final gaps
Initial gaps
How can these alignments be found?
26
2.2 Semiglobal pairwise alignment

Note that

Initial gaps
Final gaps
C T A C T A C T A C G T A C T
27
2.2 Semiglobal pairwise alignment
Given a cell
C T A C T A C T A C G T A C T
0 0 0 0 0 0 0 0 0
0 0 0
0
The cell contains the score of the best
alignment of CTA with the empty sequence.
28
2.2 Semiglobal pairwise alignment
C T A C T A C T A C G T 0 0 0 0 0 0
0 A C T
The contribution of the initial gaps is
disregarded, then
but, what happens with the final gaps?
29
2.2 Semiglobal pairwise alignment
C T A C T A C T A C G T 0 0 0 0 0 0
0 A 1 C 2 T
3
How does the algorithm search for the best
alignment?
Practice with the alggen tool.
30
2.2 Affine-gap model score
Given the following alignments
that have the same score
a g t a c c c c g t a g a g t - c c - - g t a -
a g t a c c c c g t a g a g t - c - c - g t a -
a g t a c c c c g t a g a g t - c - - c g t a -
a g t a c c c c g t a g a g t - - c c - g t a -
a g t a c c c c g t a g a g t - - - c c g t a -
a g t a c c c c g t a g a g t - - c - c g t a -
Which is the most reliable case from a
biological point of view?
31
2.2 Affine-gap model score
Then, how can we distinguish between consecutive
gaps and separated gaps?
a g t a c c c c g t a g a g t - - - c c g t a -
a g t a c c c c g t a g a g t - - c - c g t a -
Then, the penalty of k consecutive gaps becomes
OG (k-1) EG which is an
affine-gap function.
How is the best alignment found?.
32
2.2 Affine-gap model score
C T A C T A C T A C G T A C T G A
Smallest arrows refer to the introduction of an
opening gap. Largest arrows refer to the
introduction of an extension gap.
But from which cell do the largest arrows
originate?
33
2.2 Affine-gap model score
C T A C T A C T A C G T A C T G A
Acces to clustalW http//www.ebi.ac.uk/clustalw
34
2.2 Local alignment
Given two sequences, we can consider the
alignments of all their substrings
how can the
best of them be found?
Two questions arise - how can the alignments
be compared? - how can the best one be selected?
35
2.2 Local alignment
Given a path
Imagine the graph of the scores can the best
subalignments be detected?

It suffices to compare the value of each cell
with zero!

Write a Comment

User Comments (0)