Algorithms for Generalized Comparison of Minisatellites

1
Algorithms for Generalized Comparison of
Minisatellites
  • Behshad Behzadi and Jean-Marc Steyaert
  • LIX, Ecole Polytechnique
  • France

2
Outline
  • Biology
  • Evolutionary Model
  • Problem description
  • Previous works
  • Algorithms
  • Results

3
Biology
  • Minisatellites consist of tandem arrays of short
    repeat units found in the genomes of most higher
    eukaryotes.
  • The high degree of polymorphism at minisatellites
    has applications ranging from forensic studies to
    investigating the origin of modern humans.

4
Biology
  • These repeats are called variants.
  • MVR-PCR is designed to find the variants.
  • As an example, MSY1 is a minisatellite on the
    human Y chromosome. There are five different
    repeats (variants) in MSY1.

5
Different Repeat Types (Variants) of MSY1

6
Graphical representations of Minisatellite Maps of 13 males

7
Summary
  • Biologists are able to compute minisatellite
    maps: sequences in which each repeat is
    replaced by its symbol.
  • Studying the evolution of minisatellites is an
    important problem in human genetics.

8
Computer Science Model
  • Each variant is a symbol of an alphabet.
  • A minisatellite is a string over this alphabet.
  • We need to compare these strings.

9
Evolutionary Operations
  • Insertion
  • Deletion
  • Mutation
  • Amplification (p-plication)
  • Contraction (p-contraction)

10
Examples of operations (1)
  • Insertion of d
  • abbc → abbdc
  • Deletion of c
  • abbcb → abbb
  • Mutation of c into d
  • caab → daab
11
Examples of operations (2)
  • 4-plication of c
  • abcb → abccccb
  • 2-contraction of b
  • abbc → abc
  • Only single symbols are replicated or contracted,
    never subwords.
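
The five operations can be sketched as small string functions. This is only an illustration of the examples on the two slides above: the function names and the use of 0-based positions are assumptions, not notation from the talk.

```python
def insert(s: str, i: int, x: str) -> str:
    """Insert symbol x so that it ends up at position i."""
    return s[:i] + x + s[i:]

def delete(s: str, i: int) -> str:
    """Delete the symbol at position i."""
    return s[:i] + s[i + 1:]

def mutate(s: str, i: int, y: str) -> str:
    """Replace the symbol at position i by y."""
    return s[:i] + y + s[i + 1:]

def amplify(s: str, i: int, p: int) -> str:
    """p-plication: replace the symbol at position i by p copies of it."""
    return s[:i] + s[i] * p + s[i + 1:]

def contract(s: str, i: int, p: int) -> str:
    """p-contraction: replace p equal adjacent symbols starting at i by one."""
    if s[i:i + p] != s[i] * p:
        raise ValueError("need p equal adjacent symbols")
    return s[:i] + s[i] + s[i + p:]

# The examples from the slides (positions are 0-based).
print(insert("abbc", 3, "d"))    # abbdc
print(delete("abbcb", 3))        # abbb
print(mutate("caab", 0, "d"))    # daab
print(amplify("abcb", 2, 4))     # abccccb
print(contract("abbc", 1, 2))    # abc
```

Amplification and contraction act on a single symbol only, which matches the restriction that subwords are never replicated or contracted.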

12
Cost Functions
  • I(x): cost of inserting letter x
  • D(x): cost of deleting letter x
  • M(x,y): cost of mutating x into y
  • A_p(x): cost of the p-plication of letter x
  • C_p(x): cost of the p-contraction of letter x

13
Hypotheses
  • All the costs are positive.
  • The cost of duplications (and contractions) is
    less than that of all other operations.
  • Distance: M(x,x) = 0
  • The triangle inequality holds:
  • M(x,y) + M(y,z) ≥ M(x,z)
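
The hypotheses on the mutation cost can be checked mechanically for a given cost table. The sketch below is a minimal illustration; the function name and both cost tables are hypothetical and do not come from the talk.

```python
from itertools import product

def check_mutation_costs(alphabet, M):
    """Return a list of violated hypotheses for the mutation cost M(x, y)."""
    problems = []
    for x in alphabet:
        if M(x, x) != 0:
            problems.append(f"M({x},{x}) != 0")
    for x, y in product(alphabet, repeat=2):
        if x != y and M(x, y) <= 0:
            problems.append(f"M({x},{y}) is not positive")
    for x, y, z in product(alphabet, repeat=3):
        if M(x, y) + M(y, z) < M(x, z):
            problems.append(f"triangle inequality fails for {x},{y},{z}")
    return problems

# Hypothetical cost tables, for illustration only.
def uniform_cost(x, y):
    return 0 if x == y else 3

SKEWED = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 5}

def skewed_cost(x, y):
    if x == y:
        return 0
    return SKEWED.get((x, y), SKEWED.get((y, x)))

print(check_mutation_costs("abc", uniform_cost))  # []
print(check_mutation_costs("abc", skewed_cost))   # two triangle violations
```

In the skewed table, mutating a into c directly costs 5, while going through b costs 1 + 1 = 2, so the triangle inequality fails in both directions.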

14
Transformation of s into t
  • Applying a sequence of operations to s
    transforms it into t.
  • An example: transforming xyy into xbcbxzc
  • xyy → xy → xxy → xxz → xbxz → xbbxz →
    xbbbxz → xbcbxz → xbcbxzc
  • The cost of a transformation is the sum of costs
    of its operations.
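
As a worked example, the cost of the transformation of xyy into xbcbxzc shown above can be added up step by step. The cost values below are hypothetical, symbol-independent choices made only for this illustration; the talk does not give numeric costs.

```python
# Hypothetical costs, chosen only for illustration.
def I(x):
    return 4                      # insertion of x

def D(x):
    return 4                      # deletion of x (unused in this example)

def M(x, y):
    return 0 if x == y else 3     # mutation of x into y

def A(p, x):
    return p - 1                  # p-plication of x

def C(p, x):
    return p - 1                  # p-contraction of x

# (resulting string, cost of the operation that produced it), starting from xyy
steps = [
    ("xy",      C(2, "y")),       # 2-contraction of y
    ("xxy",     A(2, "x")),       # duplication of x
    ("xxz",     M("y", "z")),     # mutation of y into z
    ("xbxz",    I("b")),          # insertion of b
    ("xbbxz",   A(2, "b")),       # duplication of b
    ("xbbbxz",  A(2, "b")),       # duplication of b
    ("xbcbxz",  M("b", "c")),     # mutation of b into c
    ("xbcbxzc", I("c")),          # insertion of c
]

total_cost = sum(cost for _, cost in steps)
print(total_cost)  # 18
```

With these costs the sum is 1 + 1 + 3 + 4 + 1 + 1 + 3 + 4 = 18. Nothing here shows that this particular sequence is the cheapest one.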

15
Transformation distance between s and t
  • TD(s, t): the minimum cost over all possible
    transformations of s into t.
  • A transformation that achieves this minimum is
    called an optimal transformation.
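
For very short strings, the definition can be explored with a brute-force shortest-path search over strings. This is not the dynamic programming algorithm of the talk, only a sketch of the definition. The cost constants, the cap MAX_P, the length cap on intermediate strings, and the restriction to symbols occurring in s or t are all assumptions made for this illustration.

```python
import heapq

# Hypothetical costs and limits, chosen only for illustration.
INSERT_COST = 4
DELETE_COST = 4
MUTATE_COST = 3
MAX_P = 4          # largest amplification/contraction order tried

def neighbours(s, alphabet, max_len):
    """Yield (cost, string) for every single operation applicable to s."""
    n = len(s)
    if n < max_len:
        for i in range(n + 1):                       # insertions
            for x in alphabet:
                yield INSERT_COST, s[:i] + x + s[i:]
    for i in range(n):
        yield DELETE_COST, s[:i] + s[i + 1:]         # deletion
        for y in alphabet:                           # mutations
            if y != s[i]:
                yield MUTATE_COST, s[:i] + y + s[i + 1:]
        for p in range(2, MAX_P + 1):
            if n + p - 1 <= max_len:                 # p-plication, cost p - 1
                yield p - 1, s[:i] + s[i] * p + s[i + 1:]
            if s[i:i + p] == s[i] * p:               # p-contraction, cost p - 1
                yield p - 1, s[:i] + s[i] + s[i + p:]

def transformation_distance(s, t, slack=2):
    """Cheapest operation sequence from s to t, found by Dijkstra's algorithm.

    Intermediate strings are capped at max(len(s), len(t)) + slack symbols,
    so the result is exact only if an optimal transformation respects the cap.
    """
    alphabet = sorted(set(s) | set(t))
    max_len = max(len(s), len(t)) + slack
    best = {s: 0}
    queue = [(0, s)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == t:
            return d
        if d > best[u]:
            continue                                 # stale queue entry
        for cost, v in neighbours(u, alphabet, max_len):
            if d + cost < best.get(v, float("inf")):
                best[v] = d + cost
                heapq.heappush(queue, (d + cost, v))
    return None

print(transformation_distance("abbc", "abc"))        # 1
print(transformation_distance("abcb", "abccccb"))    # 3
print(transformation_distance("caab", "daab"))       # 3
```

The number of strings visited grows exponentially with the distance, so this search is only practical for toy inputs.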

16
Previous Works
  • Jobling et al.
  • Bérard & Rivals (RECOMB 2002)
  • B.B. & J.M.S. (CPM 2003, WABI 2004): this work

17
Optimal Transformation between s and t
  • In any transformation of s into t, each symbol
    of s is of one of two types.
  • Generative vs. vanishing letters of s:
  • generative letters create a substring of t
    (generation);
  • vanishing letters disappear (reduction).

18
Basic lemmas
  • The optimal generation of a non-empty string s
    from a symbol x can be achieved by a
    non-decreasing generation.
  • In an optimal transformation of a string s to a
    string t any contraction operation can be done
    before any generation.

19
The schema of the proof
  • Sequence u is eliminated sometime during the
    process
  • The right-hand side transformation is equivalent
    and less expensive w.r.t. evolution.

20
Optimal Transformation
  • Generative and vanishing symbols can be
    transformed in two distinct optimized phases.

21
The Algorithm
  • Preprocessing (substring generation costs) by
    dynamic programming
  • Main part (transformation distance) by dynamic
    programming

22
Substring generation costs
  • G[i,j,x] is the minimum generation cost of
    t[i..j] from symbol x among all generations
    which do not start with a mutation.
  • T[i,j,x] is the minimum generation cost of
    substring t[i..j] from symbol x.
  • mc[i,j,p,x] is the minimum generation cost for
    generating t[i..j] from symbol x among all
    possible generations starting with a
    p-plication.

23
mc[i,j,p,x]
24
Substring generation costs
25
Substring Reduction cost
  • S[i,j] is the minimum cost of reducing the
    substring s[i..j] to the single symbol s[i].
  • S[i,j] is computed in the same way as the
    generation costs.

26
Complexity
  • The time complexity is
  • The space complexity is
  • The maximum possible p for a p-plication is noted
    by

27
Transformation Distance
  • TD[i,j] is the transformation distance between
    s[1..i] and t[1..j].

28
Complexity
  • The main algorithm complexity is O(n³) in time
    and O(n²) in space.
  • The total time complexity is
  • The total space complexity is

29
Further improvements
  • Improving the complexity using the Run Length
    Encoded (RLE) string representation.
  • The RLE of aaaabbbbcccabbbbcc
    is a4b4c3a1b4c2,
    also written a4b4c3ab4c2.
  • The lengths of the encoded strings with original
    lengths m and n are denoted by m' and n'.
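
A minimal sketch of the encoding, reproducing the example on this slide. The function names are hypothetical; only the standard library is used.

```python
from itertools import groupby

def run_length_encode(s):
    """Return the runs of s as (symbol, length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]

def format_runs(runs, compact=False):
    """Write runs as text; compact=True omits the length of runs of size 1."""
    return "".join(
        symbol if compact and length == 1 else f"{symbol}{length}"
        for symbol, length in runs
    )

runs = run_length_encode("aaaabbbbcccabbbbcc")
print(format_runs(runs))                 # a4b4c3a1b4c2
print(format_runs(runs, compact=True))   # a4b4c3ab4c2
print(len(runs))                         # 6
```

For this string the original length is m = 18 and the encoded length is m' = 6 runs.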

30
Generation of Runs
  • There exists an optimal generation of a non-empty
    string t from a single symbol x in which, for
    every run of size k > 1 in t, the k−1 rightmost
    symbols of the run are generated by duplications
    of the leftmost symbol of the run.

31
New configurations in the transformation
  • Generations can split runs into several parts.
  • The same holds for reductions.
  • The different configurations are best seen on
    examples.
35
PreProcessing Generation Costs
  • Compute the generation cost of all substrings of
    the target string t from any symbol x of the
    alphabet.
  • Fill a table G_t[x,i,j] by recurrence.

38
Core Algorithm
  • The transformation distances TD between s[1..i]
    and t[1..j] are computed by recurrence, following
    lemmas derived for each of these configurations.
  • Generalized dynamic programming is used again.
  • Complexity: O(n'³ + m'³ + m·n'² + n·m'² + m·n)

39
BS algorithms vs. BR algorithms
  • Complexity improvement: from O(n⁴) to O(n³), and
    further with RLE (O(n²) experimentally)
  • Generalization 1: amplifications and contractions
    of order > 2
  • Generalization 2: symbol-dependent cost functions
  • The triangle inequality hypotheses on the cost
    functions are not restrictive: they can be
    relaxed with some preprocessing.

40
Dataset
  • Provided by Prof. M. Jobling
  • Minisatellite maps of 690 Y chromosomes from
    worldwide populations.
  • The length of the sequences is between 48 and
    118.
  • Distances were computed for 690 × 690 pairs.

42
Conclusion
  • A more efficient algorithm that computes the
    distances, and thus the phylogenetic trees, faster.
  • A more general framework which can be used for
    modelling more complicated biological evolutions.

43
Thank you