Title: Algorithms for Generalized Comparison of Minisatellites
1 Algorithms for Generalized Comparison of Minisatellites
- Behshad Behzadi and Jean-Marc Steyaert
- LIX, Ecole Polytechnique
- France
2 Outline
- Biology
- Evolutionary Model
- Problem description
- Previous works
- Algorithms
- Results
3 Biology
- Minisatellites consist of tandem arrays of short repeat units found in the genomes of most higher eukaryotes.
- The high degree of polymorphism at minisatellites has applications ranging from forensic studies to the investigation of the origin of modern humans.
4 Biology
- These repeats are called variants.
- MVR-PCR is designed to find the variants.
- As an example, MSY1 is the minisatellite on the human Y chromosome. There are five different repeats (variants) in MSY1.
5 Different Repeat Types (Variants) of MSY1
6 Graphical representations of minisatellite maps of 13 males
7 Summary
- Biologists are able to compute minisatellite maps: sequences in which each repeat is replaced by its symbol.
- The study of the evolution of minisatellites is an important problem in human genetics.
8 Computer Science Model
- Each variant is a symbol of an alphabet.
- A minisatellite is a string on this alphabet.
- We need to compare these strings.
9 Evolutionary Operations
- Insertion
- Deletion
- Mutation
- Amplification (p-plication)
- Contraction (p-contraction)
10 Examples of operations (1)
- Insertion of d: abbc -> abbdc
- Deletion of c: abbcb -> abbb
- Mutation of c into d: caab -> daab
11 Examples of operations (2)
- 4-plication of c: abcb -> abccccb
- 2-contraction of b: abbc -> abc
- Amplification and contraction act on single symbols only: no subword is replicated or contracted.
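The five operations can be sketched as plain string functions. This is only an illustration: the function names and the 0-based `pos` argument are assumptions made for the sketch, not notation from the slides.

```python
def insert(s: str, pos: int, x: str) -> str:
    """Insertion: the new symbol x ends up at index pos."""
    return s[:pos] + x + s[pos:]


def delete(s: str, pos: int) -> str:
    """Deletion of the symbol at index pos."""
    return s[:pos] + s[pos + 1:]


def mutate(s: str, pos: int, y: str) -> str:
    """Mutation of the symbol at index pos into y."""
    return s[:pos] + y + s[pos + 1:]


def amplify(s: str, pos: int, p: int) -> str:
    """p-plication: the symbol at index pos is replaced by p copies of itself."""
    return s[:pos] + s[pos] * p + s[pos + 1:]


def contract(s: str, pos: int, p: int) -> str:
    """p-contraction: p adjacent copies starting at index pos collapse into one."""
    x = s[pos]
    if s[pos:pos + p] != x * p:
        raise ValueError("no run of p identical symbols at this position")
    return s[:pos] + x + s[pos + p:]


# The examples from the two slides above.
print(insert("abbc", 3, "d"))    # abbdc
print(delete("abbcb", 3))        # abbb
print(mutate("caab", 0, "d"))    # daab
print(amplify("abcb", 2, 4))     # abccccb
print(contract("abbc", 1, 2))    # abc
```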
12 Cost Functions
- I(x): insertion of letter x
- D(x): deletion of letter x
- M(x,y): mutation of x into y
- A_p(x): p-plication of letter x
- C_p(x): p-contraction of letter x
13 Hypotheses
- All the costs are positive.
- The cost of duplications (and contractions) is less than that of all other operations.
- Distance: M(x,x) = 0
- The triangle inequality holds: M(x,y) + M(y,z) ≥ M(x,z)
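A small check of these hypotheses on a concrete cost model might look as follows. It is a sketch under assumptions: the costs are passed as Python functions, only 2-plications and 2-contractions are considered (so `A` and `C` take just the letter), and the example cost values are made up for illustration.

```python
from itertools import product


def violated_hypotheses(alphabet, I, D, M, A, C):
    """Return the names of the hypotheses that the given cost functions violate."""
    problems = []
    pairs = [(x, y) for x, y in product(alphabet, repeat=2) if x != y]

    positive = (all(f(x) > 0 for f in (I, D, A, C) for x in alphabet)
                and all(M(x, y) > 0 for x, y in pairs))
    if not positive:
        problems.append("positivity")

    if any(M(x, x) != 0 for x in alphabet):
        problems.append("M(x,x) = 0")

    # Duplications and contractions must be cheaper than every other operation.
    others = [f(x) for f in (I, D) for x in alphabet] + [M(x, y) for x, y in pairs]
    if max(f(x) for f in (A, C) for x in alphabet) >= min(others):
        problems.append("duplication cheapest")

    if any(M(x, y) + M(y, z) < M(x, z) for x, y, z in product(alphabet, repeat=3)):
        problems.append("triangle inequality")
    return problems


def constant(c):
    """Hypothetical helper: a cost function that ignores the letter."""
    return lambda x: c


def good_M(x, y):
    return 0 if x == y else 2


def bad_M(x, y):
    # Mutating a into c directly costs more than going through b.
    return 10 if (x, y) == ("a", "c") else good_M(x, y)


print(violated_hypotheses("abc", constant(3), constant(3), good_M, constant(1), constant(1)))
print(violated_hypotheses("abc", constant(3), constant(3), bad_M, constant(1), constant(1)))
```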
14 Transformation of s into t
- Applying a sequence of operations on s transforms it into t.
- An example: transforming xyy into xbcbxzc
- xyy -> xy -> xxy -> xxz -> xbxz -> xbbxz -> xbbbxz -> xbcbxz -> xbcbxzc
- The cost of a transformation is the sum of the costs of its operations.
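To make the cost of this example concrete, the sketch below labels each step of the chain with the operation that explains it and sums the costs. The numeric costs in `COSTS` are assumptions chosen for illustration (the slides give none), and the classifier only recognises 2-plications and 2-contractions, which is all this chain needs.

```python
# Hypothetical uniform costs; duplications and contractions are the cheapest.
COSTS = {"insertion": 3, "deletion": 3, "mutation": 2,
         "amplification": 1, "contraction": 1}


def duplicated(u: str, v: str) -> bool:
    """True if v is u with one symbol duplicated in place (a 2-plication)."""
    return any(u[:i + 1] + u[i] + u[i + 1:] == v for i in range(len(u)))


def inserted(u: str, v: str) -> bool:
    """True if v is u with one extra symbol inserted somewhere."""
    return any(v[:i] + v[i + 1:] == u for i in range(len(v)))


def classify(u: str, v: str) -> str:
    """Name the single operation turning u into v, preferring the cheaper reading."""
    if len(u) == len(v) and sum(a != b for a, b in zip(u, v)) == 1:
        return "mutation"
    if len(v) == len(u) + 1:
        if duplicated(u, v):
            return "amplification"
        if inserted(u, v):
            return "insertion"
    if len(u) == len(v) + 1:
        if duplicated(v, u):
            return "contraction"
        if inserted(v, u):
            return "deletion"
    raise ValueError(f"{u} -> {v} is not a single operation")


chain = ["xyy", "xy", "xxy", "xxz", "xbxz", "xbbxz", "xbbbxz", "xbcbxz", "xbcbxzc"]
steps = [classify(u, v) for u, v in zip(chain, chain[1:])]
total = sum(COSTS[name] for name in steps)

for (u, v), name in zip(zip(chain, chain[1:]), steps):
    print(f"{u} -> {v}: {name}")
print("total cost:", total)  # 14 under the assumed costs
```

The total is the cost of this particular transformation only; a cheaper one may exist.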
15 Transformation distance between s and t
- TD: the minimum cost over all possible transformations of s into t.
- A transformation that achieves this minimum is called an optimal transformation.
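For very short strings the definition can be evaluated directly by a uniform-cost (Dijkstra) search over strings. This is a brute-force illustration of the definition, not the dynamic programming algorithm of the talk. It rests on several assumptions: the costs are letter-independent and made up, only 2-plications and 2-contractions are allowed, and intermediate strings are capped at `max_len`, so a transformation passing through longer strings would be missed.

```python
import heapq

# Hypothetical letter-independent costs.
COSTS = {"I": 3, "D": 3, "M": 2, "A": 1, "C": 1}


def neighbours(s, alphabet, costs, max_len):
    """Yield (cost, string) for every single operation applicable to s."""
    n = len(s)
    if n < max_len:
        for i in range(n + 1):
            for x in alphabet:
                yield costs["I"], s[:i] + x + s[i:]          # insertion
    for i in range(n):
        yield costs["D"], s[:i] + s[i + 1:]                  # deletion
        for y in alphabet:
            if y != s[i]:
                yield costs["M"], s[:i] + y + s[i + 1:]      # mutation
        if n < max_len:
            yield costs["A"], s[:i + 1] + s[i] + s[i + 1:]   # 2-plication
        if i + 1 < n and s[i] == s[i + 1]:
            yield costs["C"], s[:i] + s[i + 1:]              # 2-contraction


def transformation_distance(s, t, alphabet, costs=COSTS, max_len=None):
    """Cheapest cost of turning s into t within the length cap."""
    if max_len is None:
        max_len = max(len(s), len(t)) + 1
    best = {s: 0}
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == t:
            return d
        if d > best[u]:
            continue  # stale heap entry
        for c, v in neighbours(u, alphabet, costs, max_len):
            if d + c < best.get(v, float("inf")):
                best[v] = d + c
                heapq.heappush(heap, (d + c, v))
    return float("inf")


print(transformation_distance("abbc", "abc", "abc"))      # one contraction
print(transformation_distance("abcb", "abccccb", "abc"))  # three duplications
print(transformation_distance("caab", "daab", "abcd"))    # one mutation
```

The search space grows exponentially with the length cap, which is why the talk relies on dynamic programming instead.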
16 Previous Works
- Jobling et al.
- Bérard & Rivals (RECOMB 2002)
- B.B. & J.M.S. (CPM 2003, WABI 2004): this work
17 Optimal Transformation between s and t
- In any transformation of s into t, each symbol of s is of one of two types.
- Generative vs. vanishing letters of s: a letter either
- creates a substring in t (generation), or
- disappears (reduction).
18 Basic lemmas
- The optimal generation of a non-empty string s from a symbol x can be achieved by a non-decreasing generation.
- In an optimal transformation of a string s into a string t, any contraction operation can be done before any generation.
19 The schema of the proof
- Sequence u is eliminated at some point during the process.
- The right-hand side transformation is equivalent and less expensive w.r.t. evolution.
20 Optimal Transformation
- Generative and vanishing symbols can be transformed in two distinct optimized phases.
21 The Algorithm
- Preprocessing (substring generation costs) by dynamic programming
- Main part (transformation distance) by dynamic programming
22 Substring generation costs
- G[i,j,x] is the minimum generation cost of t[i..j] from symbol x among all generations which do not start with a mutation.
- T[i,j,x] is the minimum generation cost of substring t[i..j] from symbol x.
- mc[i,j,p,x] is the minimum cost of generating t[i..j] from symbol x among all possible generations starting with a p-plication.
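The recurrences themselves did not survive in this transcript, so the following is only a sketch in the spirit of the G and T tables, under a deliberately restricted model: generation uses mutations and 2-plications only (no insertions, no p-plications with p > 2, so no mc table), `M(x,x) = 0`, and the triangle inequality holds. Indices are 0-based here, whereas the slides use 1-based positions, and the function names and cost values are assumptions.

```python
from functools import lru_cache

INF = float("inf")


def generation_tables(t, alphabet, M, A):
    """Return memoised functions G(i, j, x) and T(i, j, x) for the target string t."""

    @lru_cache(maxsize=None)
    def G(i, j, x):
        # Cheapest generation of t[i..j] from x that does not start with a mutation.
        if i == j:
            return 0 if t[i] == x else INF
        # First step is a duplication x -> xx; the two copies generate
        # t[i..k] and t[k+1..j] independently.
        return A(x) + min(T(i, k, x) + T(k + 1, j, x) for k in range(i, j))

    @lru_cache(maxsize=None)
    def T(i, j, x):
        # Cheapest generation overall: optionally mutate x into y first.
        # M(x, x) = 0 covers the case without an initial mutation.
        return min(M(x, y) + G(i, j, y) for y in alphabet)

    return G, T


def M(x, y):
    return 0 if x == y else 2  # hypothetical mutation cost


def A(x):
    return 1                   # hypothetical duplication cost


G, T = generation_tables("ab", "abc", M, A)
print(T(0, 1, "a"))    # duplicate a, then mutate the right copy into b

G3, T3 = generation_tables("aaa", "abc", M, A)
print(T3(0, 2, "a"))   # two duplications
print(T3(0, 2, "b"))   # mutate b into a, then two duplications
```

Each of the O(n²) intervals and each symbol is solved once, with a linear scan over split points, so this restricted version is cubic in the length of t for a fixed alphabet.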
23 mc[i,j,p,x]
24 Substring generation costs
25 Substring Reduction cost
- S[i,j] is the minimum cost of reducing the substring s[i..j] into s[i].
- S[i,j] is determined in the same way.
26 Complexity
- The time complexity is
- The space complexity is
- The maximum possible p for a p-plication is denoted by
27 Transformation Distance
- TD[i,j] is the transformation distance between s[1..i] and t[1..j].
28 Complexity
- The main algorithm runs in O(n³) time and O(n²) space.
- The total time complexity is
- The total space complexity is
29 Further improvements
- Improving the complexity using the run-length encoded (RLE) string representation.
- The RLE of aaaabbbbcccabbbbcc is a4b4c3a1b4c2, also written a4b4c3ab4c2.
- The lengths of the encoded strings with original lengths m and n are denoted by m' and n'.
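The encoding in the example can be reproduced with a few lines of Python. The function names and the `keep_ones` flag are assumptions made for this sketch.

```python
from itertools import groupby


def rle(s: str) -> list[tuple[str, int]]:
    """Run-length encode s as a list of (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]


def rle_string(s: str, keep_ones: bool = True) -> str:
    """Render the encoding as text; keep_ones=False drops the exponent 1."""
    return "".join(
        symbol + (str(k) if keep_ones or k > 1 else "")
        for symbol, k in rle(s)
    )


s = "aaaabbbbcccabbbbcc"
print(rle_string(s))                   # a4b4c3a1b4c2
print(rle_string(s, keep_ones=False))  # a4b4c3ab4c2
print(len(s), len(rle(s)))             # original length 18, encoded length 6
```

The encoded length is the number of runs, which is the quantity written m' or n' above.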
30 Generation of Runs
- There exists an optimal generation of a non-empty string t from a single symbol x in which, for every run of size k > 1 in t, the k-1 rightmost symbols of the run are generated by duplications of the leftmost symbol of the run.
31 New configurations in the transformation
Generations could split runs into several parts. Similarly for reductions. The different configurations are illustrated on examples.
35 Preprocessing: Generation Costs
- Compute the generation cost of all substrings of the target string t from any symbol x of the alphabet.
- Fill a table G_t[x,i,j] by recurrence.
38 Core Algorithm
- The transformation distances TD between s[1..i] and t[1..j] are computed by recurrence, according to lemmas derived for these situations.
- Generalized dynamic programming is used again.
- Complexity: O(n'³ + m'³ + mn'² + nm'² + mn)
39 BS algorithms vs. BR algorithms
- Complexity improvement: O(n⁴) to O(n³), and more with RLE (O(n²) experimentally)
- Generalization 1: amplifications and contractions of order > 2
- Generalization 2: symbol-dependent cost functions
- The triangle hypotheses on cost functions are not restrictive and can be relaxed by some preprocessing.
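The slides do not say which preprocessing is meant. One plausible reading, shown here purely as an assumption, is to replace every mutation cost by the cost of the cheapest chain of mutations, which is an all-pairs shortest-path computation (Floyd-Warshall) over the alphabet; the resulting costs satisfy the triangle inequality by construction.

```python
def metric_closure(alphabet, M):
    """Return mutation costs closed under chaining.

    M maps (x, y) with x != y to a positive cost. The result maps every
    pair (x, y), including x == y, to the cheapest chain of mutations.
    """
    closed = {(x, y): (0 if x == y else M[x, y])
              for x in alphabet for y in alphabet}
    for k in alphabet:            # allowed intermediate symbol
        for x in alphabet:
            for y in alphabet:
                via_k = closed[x, k] + closed[k, y]
                if via_k < closed[x, y]:
                    closed[x, y] = via_k
    return closed


alphabet = "abc"
# Hypothetical costs: mutating a into c directly is overpriced.
M = {(x, y): 2 for x in alphabet for y in alphabet if x != y}
M["a", "c"] = 10

closed = metric_closure(alphabet, M)
print(closed["a", "c"])  # 4, via b
print(closed["c", "a"])  # 2, unchanged
```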
40 Dataset
- Provided by Prof. M. Jobling
- Minisatellite maps of 690 Y chromosomes from the worldwide population.
- The length of the sequences is between 48 and 118.
- Distances were computed for 690 × 690 pairs.
42 Conclusion
- A more efficient algorithm computes the distances, and thus the phylogenetic trees, faster.
- A more general framework that can be used for modelling more complicated biological evolutions.
43 Thank you