CSE 746 - PowerPoint PPT Presentation

Slides: 23
Provided by: DicleO
Transcript and Presenter's Notes

Title: CSE 746


1
  • CSE 746 Introduction to Bioinformatics
  • Research Project
  • Two methods of DNA Sequencing: Comparing and
    Intertwining Suffix Trees and De Bruijn Graphs
    for Sequence Assembly
  • Dicle Öztürk
  • 540110004

2
Suffix trees - Definition
  • Definition (Gusfield)

3
Suffix trees: uses and complexity
  • Useful in string search and text-processing tasks
  • Act as a bridge between exact and inexact
    pattern matching problems
  • Storing suffix trees requires more space than
    storing the string itself
  • It was Ukkonen who was the first to provide a
    linear time online construction of suffix trees
  • Can be used for the sorting stage of the
    Burrows-Wheeler transform (BWT)

4
Suffix trees: Naïve algorithm
  • Assuming a bounded alphabet, this algorithm runs
    in O(m^2) time.
  • N_1: root (initially a leaf)
  • N_i: inductive assumption
  • N_{i+1}: inductive step, the tree constructed
    using N_i

5
Suffix trees: Naïve algorithm
  • find the longest path from the root whose label
    matches a prefix of S[i+1..m]
  • (The matching path is unique because no two edges
    out of a node can have labels that begin with
    the same char)
  • if no further match is possible
  •   if in the middle of an edge (u,v)
  •     split the edge into two
  •     insert a new node w just after the last
        char on the edge that matched a char in
        S[i+1..m] (before the mismatched char)
  •     label the edges (u,w) and (w,v)
        accordingly
  •   endif
  •   create a new edge (w,i+1), thus creating a
      new leaf (i+1)
  •   label (w,i+1) with the unmatched part of the
      suffix S[i+1..m]
  • endif
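
The insertion procedure above can be sketched in Python as follows. This is a minimal, quadratic-time illustration rather than an optimized implementation. The names `Node`, `build_naive_suffix_tree`, `leaf_labels` and `contains` are chosen here for illustration, and the `$` terminator is assumed not to occur in the input. Edge labels are stored as explicit strings for readability.

```python
class Node:
    def __init__(self):
        # first char of the edge label -> (edge label, child node)
        self.children = {}
        # 1-based start position of the suffix for leaves, None otherwise
        self.leaf_id = None


def build_naive_suffix_tree(s, terminator="$"):
    """Insert the suffixes of s + terminator one by one, longest first."""
    text = s + terminator  # assumption: terminator does not occur in s
    root = Node()
    for i in range(len(text)):
        suffix = text[i:]
        node, pos = root, 0
        while True:
            first = suffix[pos]
            if first not in node.children:
                # Mismatch at a node: hang a new leaf directly below it.
                leaf = Node()
                leaf.leaf_id = i + 1
                node.children[first] = (suffix[pos:], leaf)
                break
            label, child = node.children[first]
            k = 0  # number of matching chars along this edge
            while k < len(label) and label[k] == suffix[pos + k]:
                k += 1
            if k == len(label):
                # Whole edge matched: continue from the child node.
                node, pos = child, pos + k
                continue
            # Mismatch in the middle of edge (u, v): split it at new node w.
            w = Node()
            w.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], w)
            leaf = Node()
            leaf.leaf_id = i + 1
            w.children[suffix[pos + k]] = (suffix[pos + k:], leaf)
            break
    return root


def leaf_labels(node, prefix=""):
    """Map each leaf id to the string spelled on its root-to-leaf path."""
    if not node.children:
        return {node.leaf_id: prefix}
    out = {}
    for label, child in node.children.values():
        out.update(leaf_labels(child, prefix + label))
    return out


def contains(root, pattern):
    """Exact substring search: follow the pattern down from the root."""
    node, pos = root, 0
    while pos < len(pattern):
        entry = node.children.get(pattern[pos])
        if entry is None:
            return False
        label, child = entry
        n = min(len(label), len(pattern) - pos)
        if label[:n] != pattern[pos:pos + n]:
            return False
        node, pos = child, pos + n
    return True


tree = build_naive_suffix_tree("banana")
print(sorted(leaf_labels(tree).items()))
print(contains(tree, "nan"), contains(tree, "nab"))
```

Because the terminator is unique, no suffix is a prefix of another, so every insertion ends in a mismatch and creates exactly one new leaf.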

6
Suffix trees: Ukkonen's algorithm
  • Ukkonen takes McCreight's work further, keeping
    space complexity linear and giving more
    comprehensible definitions.
  • The tree is constructed online and in a
    left-to-right fashion, as opposed to Weiner's
    method.

7
Suffix trees: Ukkonen's algorithm
  • In Ukkonen's algorithm, substrings are kept by
    their indices.
  • The trick is that the last index of a suffix
    represented by a leaf is left undefined (open).
  • If w is a substring of the string s, w(i,j) is
    actually w = s_i...s_j.
  • Thus, the suffix tree for s will have at most |s|
    leaves, guaranteeing linear complexity in space.
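
The index-based edge labels can be illustrated with a small Python sketch. It shows only the representation, not Ukkonen's construction itself. The classes `OpenEnd` and `Edge` are hypothetical names introduced here, and indices are 1-based and inclusive to match the w(i,j) notation.

```python
class OpenEnd:
    """A single end index shared by all leaf edges (assumed helper)."""

    def __init__(self, value=0):
        self.value = value


class Edge:
    def __init__(self, start, end):
        self.start = start  # 1-based index of the first char of the label
        self.end = end      # int for internal edges, OpenEnd for leaf edges

    def label(self, s):
        stop = self.end.value if isinstance(self.end, OpenEnd) else self.end
        return s[self.start - 1:stop]


s = "abcab"
leaf_end = OpenEnd(3)            # only s_1..s_3 have been read so far
leaf_edge = Edge(2, leaf_end)    # leaf edge starting at s_2
internal_edge = Edge(1, 2)       # fixed label s_1..s_2

print(leaf_edge.label(s))        # label while the end index is 3
leaf_end.value = 5               # reading two more chars updates every leaf
print(leaf_edge.label(s))
print(internal_edge.label(s))
```

Each edge costs two integers regardless of the label length, which is what keeps the space linear in |s|.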

8
Suffix trees - Applications
  • The notes of (Lewis, usask.ca) mention some general
    applications of suffix trees in computational
    biology:
  • Genome alignment
  • Signature selection
  • Finding a short sequence that is specific to
    individual genes
  • Searches for non-repeating segments
  • Finding and representing all tandem repeats

9
Suffix trees - Applications
  • (Riedl, 1994) gives a more detailed list of
    applications,
  • Suffix trees are useful for search, single
    sequence analysis and multiple sequence analysis
  • With the method they use, which is called Gestalt
    tree matching, homology-search applications are
    believed to outperform FASTP and FASTA

10
Suffix trees - Applications
  • Detecting the occurrences of any number of short
    subsequences can be useful in enzyme cut-site
    determination
  • Generalised suffix tree of a set of sequences
    allows all of the sequences to be analysed
    simultaneously
  • Detection of common subsequences within a set of
    sequences can be applied to contig reassembly
    (Riedl, 1994)

11
Suffix trees - Applications
  • Finding the best match between the suffix of one
    read and the prefix of another can also be a
    fruitful task
  • Suffix-prefix overlaps can help in finding the
    shortest common superstring of reads, especially
    in genome assembly
  • Suffix trees can be used to remove redundancies
    in string containment problems

12
De Bruijn Graphs
  • There exist strings called De Bruijn strings,
    which might have given some inspiration to the
    development of De Bruijn graphs, and vice versa.
  • A De Bruijn string of order k is a non-empty
    string x over an alphabet A (x ∈ A+) such that
    each string over A of length k occurs once and
    only once in x. For example, x = 11001 with
    k = 2, where A = {0,1}.
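
The definition can be checked mechanically. The sketch below is one straightforward way to do so in Python; the function name `is_de_bruijn_string` is chosen here for illustration.

```python
from itertools import product


def is_de_bruijn_string(x, alphabet, k):
    """True if every length-k string over alphabet occurs exactly once in x."""
    windows = [x[i:i + k] for i in range(len(x) - k + 1)]
    expected = {"".join(p) for p in product(alphabet, repeat=k)}
    # Same number of windows as expected strings, and the same set,
    # means each expected string occurs exactly once.
    return len(windows) == len(expected) and set(windows) == expected


print(is_de_bruijn_string("11001", "01", 2))  # windows: 11, 10, 00, 01
print(is_de_bruijn_string("11011", "01", 2))  # 11 repeats, 00 is missing
```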

13
De Bruijn Graphs
  • De Bruijn graphs model those kinds of strings:
    the nodes hold the substrings of length k-1
    and each edge carries one character (the leftover
    of the k-length substring). If two nodes are
    connected by an edge, the target node follows the
    source node (it is a directed graph).
  • Building De Bruijn graphs is not trivial, but
    they have many applications in genome assembly
14
De Bruijn Graphs
  • De Bruijn graphs are useful in
  • Handling sequence variants like duplications,
    inversions and transpositions
  • Combining sequences of different lengths
  • Effective data compression even when the data has
    many redundant parts
  • Detecting and analysing structural variants from
    unassembled data

15
De Bruijn Graphs and Affix Trees
  • Conceptually, the De Bruijn graph of a sequence
    can be considered as a simplification of that
    sequence's affix tree.
  • Each non-empty substring of a given sequence is
    mapped onto a separate node
  • Each node is connected by an edge to its longest
    prefix and by a suffix link to its longest suffix
  • Nodes corresponding to sequences of length 1 are
    directly connected to the root node,
    corresponding to the empty string.

16
De Bruijn Graphs and Affix Trees
  • Root represents empty string. The first children
    are the sequences of length 1.
  • The analogy is built upon the atomic tree
    representation of (Giegerich, 1997) and the idea
    is mostly from (Maaß, 2003).
  • It has been pointed out in (Zerbino, 2009) that
    traversing the De Bruijn graph is equivalent to
    traversing the affix tree across its breadth

17
De Bruijn Graphs and Affix Trees
  • Furthermore, it says:
  • If we rank the nodes by distance from the root,
    the k-mer nodes of the De Bruijn graph correspond
    to the nodes of rank k in the affix tree
  • It is easy to demonstrate that two k-mers are
    connected in the De Bruijn graph iff the
    corresponding nodes in the affix tree are
    connected by a path composed of an edge and a
    suffix link, going through a node of rank k+1
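
The correspondence can be checked on a small example. The Python sketch below makes simplifying assumptions: the tree is modelled as the set of all non-empty substrings (one node per substring), a node's tree edge leads to its longest proper prefix, and its suffix link leads to its longest proper suffix. The function names are hypothetical.

```python
def substrings(s):
    """All non-empty substrings of s; each one is a node in this model."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def de_bruijn_edges(s, k):
    """Pairs of consecutive k-mers in s (k-mer nodes, as in the statement)."""
    return {(s[i:i + k], s[i + 1:i + k + 1]) for i in range(len(s) - k)}


def edge_plus_suffix_link_paths(s, k):
    """Pairs (a, b) of rank-k nodes joined through a rank-(k+1) node n:
    a is n's parent (longest proper prefix), b is n's suffix-link target."""
    paths = set()
    for n in substrings(s):
        if len(n) == k + 1:
            paths.add((n[:-1], n[1:]))
    return paths


s, k = "ACGTACGA", 3
print(de_bruijn_edges(s, k) == edge_plus_suffix_link_paths(s, k))
```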

18
De Bruijn Graphs and Affix Trees
  • (Giegerich, 1997) gives the definitions of
    suffix and prefix trees together with their
    special relationship.
  • Active suffixes and prefixes for the string t:
  • The active suffix of t is its longest nested
    suffix, denoted as a(t)
  • The active prefix of t is its longest nested
    prefix, denoted as a^-1(t)
  • Then, a(t^-1) = (a^-1(t))^-1, where t^-1 is the
    reverse of t
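
The relationship between active suffix and active prefix can be tried out with a brute-force Python sketch. It assumes that a suffix (or prefix) is "nested" when it occurs at least twice in t, and that the empty string counts as nested. The function names are chosen here for illustration.

```python
def count_occurrences(t, w):
    """Number of (possibly overlapping) occurrences of w in t."""
    return sum(1 for i in range(len(t) - len(w) + 1) if t[i:i + len(w)] == w)


def active_suffix(t):
    """Longest suffix of t that occurs at least twice in t."""
    for i in range(len(t) + 1):
        w = t[i:]
        if w == "" or count_occurrences(t, w) >= 2:
            return w


def active_prefix(t):
    """Longest prefix of t that occurs at least twice in t."""
    for j in range(len(t), -1, -1):
        w = t[:j]
        if w == "" or count_occurrences(t, w) >= 2:
            return w


for t in ["abcab", "banana"]:
    reversed_t = t[::-1]
    # Identity from the slide: a(t^-1) = (a^-1(t))^-1
    print(t, active_suffix(t), active_prefix(t),
          active_suffix(reversed_t) == active_prefix(t)[::-1])
```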

19
De Bruijn Graphs and Affix Trees
  • The tree is atomic if each of its edges is marked
    by a single char
  • So every node is explicit
  • The tree is actually a trie

20
De Bruijn Graphs and Affix Trees
  • (Maaß, 2003) furthers this analogy-based idea of
    affix trees and gives some more insight into the
    issue
  • It says,
  • A suffix link is an auxiliary edge from node n to
    node m where m is the node such that path(m) is
    the longest proper suffix of path(n) represented
    by a node in the tree.
  • Suffix links are used to move from one node to
    another so that the represented string is
    shortened at the front.
  • It mentions also that in essence, it was actually
    (Blummer, 1998) who observed the dual structure
    of suffix trees.

21
Redundancies in the sequences
  • And finally, we can say that the redundancy in
    the suffix tree of some string should be as low
    as possible so that a De Bruijn graph can be
    built efficiently out of that tree.
  • To reduce redundancy, some compression methods
    can be applied, but they must be lossless and
    reversible. The algorithm of Lempel and Ziv is
    suggested as an efficient tool for this task,
    running in O(n) time with suffix trees.
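
To illustrate the lossless and reversible requirement, here is a small Lempel-Ziv style factorization in Python. It is a brute-force sketch, assuming an LZ77-like scheme in which each factor is either a new character or a copy of the longest earlier match; it runs in quadratic time or worse, whereas the O(n) bound mentioned above relies on a suffix tree to find the matches. The function names `lz_factorize` and `lz_expand` are hypothetical.

```python
def lz_factorize(s):
    """Split s into factors: a single new char, or (position, length) of
    the longest match starting earlier in s (the copy may overlap itself)."""
    factors = []
    i = 0
    while i < len(s):
        best_len, best_pos = 0, 0
        for j in range(i):
            length = 0
            while i + length < len(s) and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_pos = length, j
        if best_len == 0:
            factors.append(s[i])
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def lz_expand(factors):
    """Invert lz_factorize, copying char by char so overlaps work."""
    out = []
    for factor in factors:
        if isinstance(factor, str):
            out.append(factor)
        else:
            pos, length = factor
            for n in range(length):
                out.append(out[pos + n])
    return "".join(out)


factors = lz_factorize("abababab")
print(factors)
print(lz_expand(factors) == "abababab")
```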

22
References
  • [1] Algorithms on Strings, Trees, and
    Sequences: Computer Science and Computational
    Biology, Dan Gusfield, Cambridge University
    Press, Jan 15, 1997.
  • [2] Ukkonen E., On-line Construction of
    Suffix-Trees, Algorithmica vol. 14(3), 1995.
  • [3] Algorithms on Strings, Maxime Crochemore
    and Christophe Hancart, Cambridge University
    Press, June 2007.
  • [4] Genome assembly and comparison using de
    Bruijn graphs, D.R. Zerbino, PhD Thesis, European
    Bioinformatics Institute, Darwin College,
    September, 2009.
  • [5] Giegerich R., and Kurtz S., From Ukkonen to
    McCreight and Weiner: A unifying view of
    linear-time suffix tree construction, Algorithmica
    19:331-353, 1997.
  • [6] Maaß, M. G., Linear bidirectional on-line
    construction of affix trees, Algorithmica vol.
    37(1), 2003.
  • [7] Bieganski, P., Riedl, J., Carlis, J.V.,
    Retzel, E.F., Generalized Suffix Trees for
    Biological Sequence Data: Implementations and
    Applications, HICSS (5), 1994.