Title: CSE 746
1 CSE 746 Introduction to Bioinformatics
- Research Project
- Two Methods of DNA Sequencing: Comparing and Intertwining Suffix Trees and De Bruijn Graphs for Sequence Assembly
- Dicle Öztürk
- 540110004
2 Suffix trees - Definition
3 Suffix trees - Uses and complexity
- Useful in string search and text-processing tasks
- Act as a bridge between exact and inexact pattern matching problems
- Storing a suffix tree requires more space than storing the string itself
- Ukkonen was the first to provide a linear-time online construction of suffix trees
- Can be used for the sorting stage of the BWT
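To make the BWT remark concrete, here is a minimal sketch of the transform in which the sorting stage is done by plainly sorting suffixes. This is an illustration only: a suffix tree would deliver the same order by visiting its leaves lexicographically, without the quadratic cost of comparing suffixes. The function name and the `$` sentinel are assumptions for the example.

```python
def bwt_via_sorted_suffixes(s, sentinel="$"):
    # Assumption: the sentinel does not occur in s and sorts before all letters.
    assert sentinel not in s
    text = s + sentinel
    # Sorting stage: order the suffix start positions lexicographically.
    # A suffix tree yields this order from a lexicographic leaf traversal.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT takes the character that precedes each sorted suffix.
    return "".join(text[i - 1] for i in order)


print(bwt_via_sorted_suffixes("banana"))  # annb$aa
```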
4 Suffix trees - Naïve algorithm
- Assuming a bounded alphabet, this algorithm runs in O(m^2) time for a string S of length m.
- N_1: the root (initially a leaf)
- N_i: the intermediate tree assumed to be built (inductive assumption)
- N_{i+1}: the tree constructed inductively from N_i
5 Suffix trees - Naïve algorithm
- find the longest path from the root whose label matches a prefix of S[i+1..m]
- (the matching path is unique because no two edges out of a node can have labels that begin with the same char)
- if no further match is possible
  - if in the middle of an edge (u,v)
    - split the edge into two
    - insert a new node w just after the last char on the edge that matched a char in S[i+1..m] (before the char that mismatched)
    - label the edges (u,w) and (w,v) accordingly
  - endif
  - create a new edge (w,i+1), thus creating a new leaf (i+1); if the match ended at a node, w is that node
  - label (w,i+1) with the unmatched part of the suffix S[i+1..m]
- endif
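The steps above can be sketched in Python as follows. This is a minimal illustration, not an optimized implementation: the names `Node`, `build_naive_suffix_tree` and `leaf_paths` are made up for the example, edge labels are stored as string copies (so space is quadratic), and a terminal character `$` is assumed to be absent from the input. Leaves are numbered from 1, as on the slide.

```python
class Node:
    """One suffix-tree node; names here are illustrative, not from the slides."""

    def __init__(self):
        # Maps the first character of an outgoing edge label to (label, child).
        self.children = {}
        # 1-based start position of the suffix, set on leaves only.
        self.leaf_id = None


def build_naive_suffix_tree(s, terminal="$"):
    # Assumption: the terminal does not occur in s, so no suffix is a prefix
    # of another suffix and every suffix ends at its own leaf.
    assert terminal not in s
    text = s + terminal
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No further match: hang a new leaf off the current node.
                leaf = Node()
                leaf.leaf_id = i + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # Whole edge matched: continue from the child node.
                node, rest = child, rest[k:]
            else:
                # Mismatch in the middle of edge (u, v): split it at a new node w.
                w = Node()
                w.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], w)
                node, rest = w, rest[k:]
    return root


def leaf_paths(node, prefix=""):
    # Collect {leaf number: root-to-leaf label} to inspect the result.
    if node.leaf_id is not None:
        return {node.leaf_id: prefix}
    paths = {}
    for label, child in node.children.values():
        paths.update(leaf_paths(child, prefix + label))
    return paths
```

Calling `leaf_paths(build_naive_suffix_tree("xabxac"))` should map each leaf number to the suffix of `xabxac$` that starts at that position, which is a quick sanity check that every suffix was inserted.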
6 Suffix trees - Ukkonen's algorithm
- Ukkonen takes McCreight's work further, keeping its linear time and space bounds while giving more comprehensible definitions.
- The tree is constructed online and in a left-to-right fashion, as opposed to Weiner's method.
7 Suffix trees - Ukkonen's algorithm
- In Ukkonen's algorithm, substrings are kept by their indices.
- The trick is that the last index of a suffix, which is represented by a leaf, is left undefined (open).
- If w is a substring of the string s, then (i,j) actually stands for w = s_i...s_j.
- Thus, the suffix tree for s will have at most |s| leaves, guaranteeing linear complexity in space.
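The index trick can be illustrated with a small sketch. It shows only the edge representation, not Ukkonen's construction itself; the `Edge` class and the use of `None` for an open end are assumptions made for this example.

```python
class Edge:
    """An edge label stored as a pair of indices into the text (0-based here)."""

    def __init__(self, start, end=None):
        self.start = start
        # None marks an open end: the label runs to the current end of the text.
        self.end = end

    def label(self, text):
        end = len(text) - 1 if self.end is None else self.end
        return text[self.start:end + 1]


leaf_edge = Edge(1)          # open-ended edge leading to a leaf
inner_edge = Edge(0, 1)      # closed edge with fixed indices

print(leaf_edge.label("xab"))     # ab
print(leaf_edge.label("xabxa"))   # abxa (grew without touching the edge)
print(inner_edge.label("xabxa"))  # xa
```

Each edge costs two integers regardless of the label length, which is what keeps the total space linear in the length of the string.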
8 Suffix trees - Applications
- In the notes of (Lewis, usask.ca), some general applications of suffix trees in computational biology are mentioned:
- Genome alignment
- Signature selection
- Finding a short sequence that is specific to individual genes
- Searches for non-repeating segments
- Finding and representing all tandem repeats
9 Suffix trees - Applications
- (Riedl, 1994) gives a more detailed list of applications:
- Suffix trees are useful in search, single sequence analysis and multiple sequence analysis
- With the method they use, which is called Gestalt tree matching, homology-search applications are believed to outperform FASTP and FASTA
10 Suffix trees - Applications
- Detecting the occurrences of any number of short subsequences can be useful in enzyme cut-site determination
- A generalised suffix tree of a set of sequences allows all of the sequences to be analysed simultaneously
- Detection of common subsequences within a set of sequences can be applied to contig reassembly (Riedl, 1994)
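As a rough illustration of analysing several sequences at once, the sketch below builds a generalised suffix trie (uncompressed, one character per edge, so quadratic in size) and uses it to find the longest substring common to all sequences. It is a simplified stand-in for a real generalised suffix tree; the function names and the example reads are invented for the illustration.

```python
def build_generalized_suffix_trie(sequences):
    # Each node records which input sequences pass through it.
    root = {"children": {}, "ids": set()}
    for seq_id, seq in enumerate(sequences):
        root["ids"].add(seq_id)
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "ids": set()}
                )
                node["ids"].add(seq_id)
    return root


def longest_common_substring(sequences):
    root = build_generalized_suffix_trie(sequences)
    everyone = set(range(len(sequences)))
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        # Only descend into nodes that are shared by every sequence.
        for ch, child in node["children"].items():
            if child["ids"] == everyone:
                stack.append((child, path + ch))
    return best


print(longest_common_substring(["GATTACA", "TTACAG", "ATTAC"]))  # TTAC
```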
11 Suffix trees - Applications
- Finding the best match between the suffix of one read and the prefix of another can also be a fruitful task
- Suffix-prefix overlaps can help in finding the shortest common superstrings of reads, especially in genome assembly
- Suffix trees can be used to remove redundancies in string containment problems
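The suffix-prefix overlap itself is easy to state in code. The sketch below checks candidate lengths directly rather than using a suffix tree, so it only illustrates what is being computed, not the efficient method; the function name, the `min_length` parameter and the example reads are assumptions.

```python
def longest_suffix_prefix_overlap(a, b, min_length=1):
    # Try the longest possible overlap first and shrink until one fits.
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


print(longest_suffix_prefix_overlap("ACGTTG", "TTGCA"))  # 3, the overlap is TTG
```

Merging two reads on their longest overlap (here giving `ACGTTGCA`) is the basic step behind greedy approximations of the shortest common superstring.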
12 De Bruijn Graphs
- There exist strings called De Bruijn strings, which might have given some inspiration to the development of De Bruijn graphs, and vice versa.
- A De Bruijn string of order k is a non-empty string x over an alphabet A (x ∈ A+) such that each string over A of length k occurs once and only once in x. For example, x = 11001 for k = 2, where A = {0,1}.
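The definition can be checked mechanically. The following sketch counts every window of length k in x and verifies that each possible string of length k over the alphabet occurs exactly once; the function name is an assumption for the example.

```python
from itertools import product


def is_de_bruijn_string(x, alphabet, k):
    # x must be non-empty and use only characters from the alphabet.
    if not x or not set(x) <= set(alphabet):
        return False
    counts = {}
    for i in range(len(x) - k + 1):
        kmer = x[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    # Every string of length k over the alphabet must occur exactly once.
    return all(
        counts.get("".join(p), 0) == 1 for p in product(alphabet, repeat=k)
    )


print(is_de_bruijn_string("11001", "01", 2))   # True: 11, 10, 00, 01
print(is_de_bruijn_string("110011", "01", 2))  # False: 11 occurs twice
```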
13 De Bruijn Graphs
- De Bruijn graphs model those kinds of strings: the nodes hold the substrings of length k-1 and each edge carries one character (the leftover of the k-length substring). If two nodes are connected by an edge, the substring at the target node follows the one at the source node (it is a directed graph).
- Building De Bruijn graphs is not a piece of cake, but they have many applications in genome assembly
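A minimal sketch of this node and edge layout, assuming error-free reads and ignoring reverse complements: every k-length substring contributes one directed edge from its first k-1 characters to its last k-1 characters, labelled with the leftover character. The function name and the example read are made up for the illustration.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    # Maps a (k-1)-length node to a list of (edge character, target node).
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            source, target = kmer[:-1], kmer[1:]
            graph[source].append((kmer[-1], target))
    return dict(graph)


print(build_de_bruijn_graph(["ACGT"], 3))
# {'AC': [('G', 'CG')], 'CG': [('T', 'GT')]}
```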
14 De Bruijn Graphs
- De Bruijn graphs are useful in
- Handling sequence variants like duplications, inversions and transpositions
- Combining sequences of different lengths
- Effective data compression even when the data has many redundant parts
- Detecting and analysing structural variants from unassembled data
15 De Bruijn Graphs and Affix Trees
- Conceptually, the De Bruijn graph of a sequence can be considered as a simplification of that sequence's affix tree.
- Each non-empty substring of a given sequence is mapped onto a separate node
- Each node is connected by an edge to its longest prefix and by a suffix link to its longest suffix
- Nodes corresponding to sequences of length 1 are directly connected to the root node, which corresponds to the empty string.
16 De Bruijn Graphs and Affix Trees
- The root represents the empty string. Its first children are the sequences of length 1.
- The analogy is built upon the atomic tree representation of (Giegerich, 1997) and the idea is mostly from (Maaß, 2003).
- It has been pointed out in (Zerbino, 2009) that traversing the De Bruijn graph is equivalent to traversing the affix tree across its breadth
17 De Bruijn Graphs and Affix Trees
- Furthermore, (Zerbino, 2009) says:
- If we rank the nodes by distance from the root, the k-mer nodes of the De Bruijn graph correspond to the nodes of rank k in the affix tree
- It is easy to demonstrate that two k-mers are connected in the De Bruijn graph iff the corresponding nodes in the affix tree are connected by a path composed of an edge and a suffix link, going through a node of rank k+1
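The equivalence can be checked on a small example. The sketch below is a simplified model rather than a real affix tree: it treats every non-empty substring as a node, takes an edge from a rank-k node to any rank-(k+1) node that extends it by one character, and follows the suffix link by dropping the first character. It then compares the resulting pairs with the k-mer adjacencies read directly off the sequence. All names are invented for the illustration.

```python
def substrings(seq):
    return {seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}


def pairs_via_affix_path(seq, k):
    nodes = substrings(seq)
    pairs = set()
    for longer in (w for w in nodes if len(w) == k + 1):
        source = longer[:-1]   # edge: rank k node up to its rank k+1 extension
        target = longer[1:]    # suffix link: drop the first character
        pairs.add((source, target))
    return pairs


def pairs_via_de_bruijn(seq, k):
    # Consecutive k-mers in the sequence overlap by k-1 characters.
    return {(seq[i:i + k], seq[i + 1:i + k + 1]) for i in range(len(seq) - k)}


seq = "ACGTACGA"
print(pairs_via_affix_path(seq, 3) == pairs_via_de_bruijn(seq, 3))  # True
```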
18 De Bruijn Graphs and Affix Trees
- (Giegerich, 1997) gives the definitions of suffix and prefix trees together with their special relationship.
- Active suffixes and prefixes for the string t (a suffix or prefix is nested if it also occurs elsewhere in t)
- The active suffix of t: its longest nested suffix, denoted α(t)
- The active prefix of t: its longest nested prefix, denoted α^{-1}(t)
- Then, α(t^{-1}) = (α^{-1}(t))^{-1}, where t^{-1} is the reverse of t
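These definitions can be tried out with a brute-force sketch. It assumes that "nested" means occurring at least twice in t (overlaps allowed), with the empty string counting as nested, and it checks the relationship between the active suffix of the reversed string and the active prefix of the original. The function names are assumptions for the example.

```python
def is_nested(t, w):
    # Assumption: w is nested in t if it occurs at least twice (overlaps allowed).
    if w == "":
        return True
    first = t.find(w)
    return first != -1 and t.find(w, first + 1) != -1


def active_suffix(t):
    # Longest nested suffix: try suffixes from longest to shortest.
    for i in range(len(t) + 1):
        if is_nested(t, t[i:]):
            return t[i:]


def active_prefix(t):
    # Longest nested prefix: try prefixes from longest to shortest.
    for j in range(len(t), -1, -1):
        if is_nested(t, t[:j]):
            return t[:j]


t = "abab"
print(active_suffix(t))                                 # ab
print(active_prefix(t))                                 # ab
print(active_suffix(t[::-1]) == active_prefix(t)[::-1])  # True
```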
19 De Bruijn Graphs and Affix Trees
- The tree is atomic if each of its edges is marked by a single char
- So every node is explicit
- The tree is actually a trie
20 De Bruijn Graphs and Affix Trees
- (Maaß, 2003) furthers this analogy-based idea of affix trees and gives some more insight into the issue
- It states:
- A suffix link is an auxiliary edge from node n to node m, where m is the node such that path(m) is the longest proper suffix of path(n) represented by a node in the tree.
- Suffix links are used to move from one node to another so that the represented string is shortened at the front.
- It also mentions that, in essence, it was actually (Blummer, 1998) who observed the dual structure of suffix trees.
21 Redundancies in the sequences
- And finally, we can say that the redundancy in the suffix tree of a string should be as low as possible, so that a De Bruijn graph can be built efficiently out of that tree.
- To reduce redundancy, some compression methods can be applied, but no loss should take place and reversal should be possible. The algorithm of Lempel and Ziv is suggested as an efficient tool for this task, running in O(n) time with suffix trees.
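To show what lossless and reversible mean here, the sketch below gives a naive Lempel-Ziv style factorization and its decoder. It searches for the longest earlier match by brute force, so it runs in quadratic time or worse; the O(n) bound mentioned above relies on suffix trees, which this illustration does not use. The function names and the factor format (a literal character, or a pair of start position and length) are assumptions.

```python
def lz_factorize(s):
    factors, i = [], 0
    while i < len(s):
        best_len, best_pos = 0, -1
        # Brute-force search for the longest match starting before position i.
        for j in range(i):
            length = 0
            while i + length < len(s) and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_pos = length, j
        if best_len == 0:
            factors.append(s[i])            # new character: emit a literal
            i += 1
        else:
            factors.append((best_pos, best_len))  # copy from earlier text
            i += best_len
    return factors


def lz_decode(factors):
    out = []
    for factor in factors:
        if isinstance(factor, str):
            out.append(factor)
        else:
            pos, length = factor
            # Copy one character at a time so overlapping copies work.
            for k in range(length):
                out.append(out[pos + k])
    return "".join(out)


print(lz_factorize("aaaa"))                     # ['a', (0, 3)]
print(lz_decode(lz_factorize("ACGTACGTTT")))    # ACGTACGTTT
```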
22 References
- [1] Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Jan 15, 1997.
- [2] Ukkonen, E., On-line Construction of Suffix Trees, Algorithmica vol. 14(3), 1995.
- [3] Crochemore, M., and Hancart, C., Algorithms on Strings, Cambridge University Press, June 2007.
- [4] Zerbino, D.R., Genome assembly and comparison using de Bruijn graphs, PhD Thesis, European Bioinformatics Institute, Darwin College, September 2009.
- [5] Giegerich, R., and Kurtz, S., From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction, Algorithmica 19, 331-353, 1997.
- [6] Maaß, M. G., Linear bidirectional on-line construction of affix trees, Algorithmica vol. 37(1), 2003.
- [7] Bieganski, P., Riedl, J., Carlis, J.V., Retzel, E.F., Generalized Suffix Trees for Biological Sequence Data: Implementations and Applications, HICSS (5), 1994.