CSE 746 - PowerPoint PPT Presentation

1

CSE 746 Introduction to Bioinformatics
Research Project
Two methods of DNA Sequencing: Comparing and
Intertwining Suffix Trees and De Bruijn Graphs
for Sequence Assembly
Dicle Öztürk
540110004

2
Suffix trees - Definition

Definition (Gusfield)

3
Suffix trees - Uses and complexity

Useful in string search and text processing tasks
Act as a bridge between exact and inexact
pattern matching problems
Storing a suffix tree requires more space than
storing the string itself
Ukkonen was the first to provide a
linear-time online construction of suffix trees
Can be used for the sorting stage of the
Burrows-Wheeler transform (BWT)
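
The sorting stage mentioned in the last point can be illustrated with a minimal sketch. It sorts the suffixes directly instead of reading them off a suffix tree, so it is a simple stand-in rather than a linear-time method. The function name and the `$` sentinel are assumptions made for this illustration.

```python
def bwt_via_sorted_suffixes(text, sentinel="$"):
    """Burrows-Wheeler transform computed by explicitly sorting suffixes.

    Assumption: `sentinel` does not occur in `text` and sorts before
    every other character (true for '$' versus letters in ASCII).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # This sort is the stage a suffix tree could replace: visiting the
    # leaves depth-first, children in alphabetical order, yields the
    # suffixes in the same sorted order.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character of a suffix is the character preceding it (cyclically).
    return "".join(s[i - 1] for i in order)
```

For the input `banana` the sorted suffixes are `$`, `a$`, `ana$`, `anana$`, `banana$`, `na$`, `nana$`, and the characters preceding them spell the transform.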

4
Suffix trees - Naïve algorithm

Assuming a bounded alphabet, this algorithm runs
in O(m^2) time.
--
N_1: root (initially a leaf)
N_i: inductive assumption
N_{i+1}: constructed inductively using N_i

5
Suffix trees - Naïve algorithm

find the longest path from root whose label
matches a prefix of S[i+1..m]
(the matching path is unique because no two edges
out of a node can have labels that begin with the
same char)
if no further match is possible
    if in the middle of an edge (u,v)
        split the edge into two
        insert a new node w just after the last
        char on the edge that matched a char in
        S[i+1..m] (before the mismatched char)
        label the edges (u,w) and (w,v)
        accordingly
    else
        let w be the node where the match ended
    endif
    create a new edge (w,i+1), thus creating a
    new leaf (i+1)
    label (w,i+1) with the unmatched part of the
    suffix S[i+1..m]
endif
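
The steps above can be sketched in Python as follows. This is a minimal illustration rather than a tuned implementation: edge labels are stored as plain strings, leaves are numbered from 0 instead of 1, and the names `Node`, `build_naive_suffix_tree` and `leaf_starts` as well as the `$` terminator are assumptions made for the example.

```python
class Node:
    def __init__(self):
        # Maps the first char of an edge label to (label, child node).
        self.children = {}
        # 0-based start of the suffix, set for leaves only.
        self.suffix_start = None


def build_naive_suffix_tree(s, terminator="$"):
    """Insert every suffix of s + terminator, one after another."""
    if terminator in s:
        raise ValueError("terminator must not occur in the string")
    text = s + terminator  # unique terminator: no suffix is a prefix of another
    root = Node()
    for i in range(len(text)):
        suffix = text[i:]
        node, pos = root, 0
        while True:
            first = suffix[pos]
            if first not in node.children:
                # Match ended at a node: hang a new leaf below it.
                leaf = Node()
                leaf.suffix_start = i
                node.children[first] = (suffix[pos:], leaf)
                break
            label, child = node.children[first]
            k = 0  # number of label chars matched by the suffix
            while k < len(label) and label[k] == suffix[pos + k]:
                k += 1
            if k == len(label):
                node, pos = child, pos + k  # whole edge matched, keep walking
                continue
            # Mismatch in the middle of edge (u, v): split it at new node w.
            w = Node()
            w.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], w)
            leaf = Node()
            leaf.suffix_start = i
            w.children[suffix[pos + k]] = (suffix[pos + k:], leaf)
            break
    return root


def leaf_starts(node):
    """Collect the suffix start positions stored in the leaves."""
    if not node.children:
        return [node.suffix_start]
    starts = []
    for _, child in node.children.values():
        starts.extend(leaf_starts(child))
    return starts
```

Each of the m insertions may compare up to m characters, which is where the quadratic bound comes from. A quick sanity check is that the tree for `banana` has one leaf per suffix of `banana$`.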

6
Suffix trees - Ukkonen's algorithm

Ukkonen takes McCreight's work further,
keeping time and space complexity linear and giving
more comprehensible definitions.
The tree is constructed online and in a
left-to-right fashion, as opposed to Weiner's
method.

7
Suffix trees - Ukkonen's algorithm

In Ukkonen's algorithm, substrings are represented by
their indices.
The trick is that the end index of a suffix
represented by a leaf is left undefined (open).
If w is a substring of the string s, w(i,j)
denotes w = s_i...s_j.
Thus, the suffix tree for s will have at most |s|
leaves, guaranteeing linear complexity in space.
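
A small sketch of the index-based edge labels may help here. It shows only the representation, not Ukkonen's construction itself: each edge stores a pair of indices into s, and all leaf edges share one open end that grows as the string is read. The class names `OpenEnd` and `Edge` are hypothetical, and indices are 0-based and inclusive.

```python
class OpenEnd:
    """Shared end index for all leaf edges (assumed name, for illustration)."""

    def __init__(self):
        self.value = -1  # nothing read yet


class Edge:
    def __init__(self, start, end):
        self.start = start
        self.end = end  # an int for internal edges, an OpenEnd for leaf edges

    def label(self, s):
        """Recover the edge label from the index pair; no substring is stored."""
        last = self.end.value if isinstance(self.end, OpenEnd) else self.end
        return s[self.start:last + 1]


s = "abcab"
leaf_end = OpenEnd()
leaf_edge = Edge(0, leaf_end)   # leaf edge starting at position 0
internal_edge = Edge(3, 4)      # fixed label s[3..4]

labels_seen = []
for j in range(len(s)):
    leaf_end.value = j          # one update extends every leaf edge at once
    labels_seen.append(leaf_edge.label(s))
```

Because every edge costs two integers regardless of label length, the space used depends only on the number of nodes, which is linear in |s|.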

8
Suffix trees - Applications

In the notes of (Lewis, usask.ca), some general
applications of suffix trees in computational
biology are mentioned:
Genome alignment
Signature selection
Finding a short sequence that is specific to
individual genes
Searches for non-repeating segments
Finding and representing all tandem repeats

9
Suffix trees - Applications

(Riedl, 1994) gives a more detailed list of
applications:
Suffix trees are useful for search, single
sequence analysis and multiple sequence analysis
With the method they use, which is called Gestalt
tree matching, homology-search applications are
believed to outperform FASTP and FASTA

10
Suffix trees - Applications

Detecting the occurrences of any number of short
subsequences can be useful in enzyme cut-site
determination
Generalised suffix tree of a set of sequences
allows all of the sequences to be analysed
simultaneously
Detection of common subsequences within a set of
sequences can be applied to contig reassembly
(Riedl, 1994)

11
Suffix trees - Applications

Finding the best match between the suffix of one
read and the prefix of another can also be a
fruitful task
Suffix-prefix overlaps can help in finding the
shortest common superstrings of reads, especially
in genome assembly
Suffix trees can be used to remove redundancies
in string containment problems
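
The suffix-prefix overlap idea can be illustrated with a short sketch. It uses direct string comparison rather than a suffix tree, so it shows what is being computed, not the efficient way to compute it. The function names and the `min_length` parameter are assumptions made for this example.

```python
def longest_overlap(a, b, min_length=1):
    """Length of the longest suffix of read a that equals a prefix of read b.

    Returns 0 when no overlap of at least min_length characters exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_length - 1, -1):
        if length > 0 and a[len(a) - length:] == b[:length]:
            return length
    return 0


def merge_reads(a, b):
    """Superstring of a and b obtained by gluing them at their best overlap."""
    return a + b[longest_overlap(a, b):]
```

For the reads `ACGTAC` and `TACGGA` the best overlap is `TAC`, and merging them gives a common superstring of length 9 instead of 12.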

12
De Bruijn Graphs

There exist strings which are called De Bruijn
strings, which might have given some inspiration
to the development of De Bruijn graphs and vice
versa.
A De Bruijn string of order k is a non-empty
string x defined over an alphabet A
(x ∈ A*) such that each string on A of length k
occurs once and only once in x. For example,
x = 11001 where A = {0,1} and k = 2.
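
The definition can be checked mechanically. The sketch below counts every window of length k in x and compares the counts against all strings of length k over the alphabet. The function name is an assumption made for this example.

```python
from itertools import product


def is_de_bruijn_string(x, alphabet, k):
    """True if every length-k string over `alphabet` occurs exactly once in x."""
    if not x or not set(x) <= set(alphabet):
        return False
    counts = {}
    for i in range(len(x) - k + 1):
        kmer = x[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    # Every possible k-mer must be present exactly once.
    return all(
        counts.get("".join(p), 0) == 1 for p in product(alphabet, repeat=k)
    )
```

With the alphabet {0,1} and k = 2, the string 11001 contains the windows 11, 10, 00 and 01, each exactly once, so it passes the check.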

13
De Bruijn Graphs

De Bruijn graphs model these kinds of strings:
the nodes hold the substrings of length k-1
and each edge carries one character (the leftover
of the k-length substring). If two nodes are
connected by an edge, the target node follows
the source node in the string (it is a directed graph).
Building De Bruijn graphs is not trivial,
but they have many applications in genome assembly
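
A minimal sketch of this construction for a single sequence follows. It ignores practical concerns of real assemblers such as reverse complements, sequencing errors and memory use. The function name and the adjacency-dictionary layout are assumptions made for this example.

```python
from collections import defaultdict


def de_bruijn_graph(sequence, k):
    """Map each (k-1)-mer to the set of (edge character, next (k-1)-mer) pairs."""
    if k < 2 or len(sequence) < k:
        raise ValueError("need k >= 2 and a sequence of at least k characters")
    graph = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        source, target = kmer[:-1], kmer[1:]
        # The edge carries the one character of the k-mer not held by the source.
        graph[source].add((kmer[-1], target))
    return dict(graph)
```

For `ACGTACG` with k = 3, the repeated k-mer `ACG` contributes a single edge, which is how the graph collapses redundant parts of the data.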

14
De Bruijn Graphs

De Bruijn graphs are useful in:
Handling sequence variants like duplications,
inversions and transpositions
Combining sequences of different lengths
Effective data compression even when the data has
many redundant parts
Detecting and analysing structural variants from
unassembled data

15
De Bruijn Graphs and Affix Trees

Conceptually, the De Bruijn graph of a sequence
can be considered as a simplification of that
sequence's affix tree.
Each non-empty substring of a given sequence is
mapped onto a separate node
Each node is connected by an edge to its longest
prefix and by a suffix link to its longest suffix
Nodes corresponding to sequences of length 1 are
directly connected to the root node,
corresponding to the empty string.

16
De Bruijn Graphs and Affix Trees

Root represents empty string. The first children
are the sequences of length 1.
The analogy is built upon the atomic tree
representation of (Giegerich, 1997) and the idea
is mostly from (Maaß, 2003).
It has been pointed out in (Zerbino, 2009) that
traversing the De Bruijn graph is equivalent to
traversing the affix tree across its breadth

17
De Bruijn Graphs and Affix Trees

Furthermore, it states:
If we rank the nodes by distance from the root,
the k-mer nodes of the De Bruijn graph correspond
to the nodes of rank k in the affix tree
It is easy to demonstrate that two k-mers are
connected in the De Bruijn graph iff the
corresponding nodes in the affix tree are
connected by a path composed of an edge and a
suffix link, going through a node of rank k+1
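
The correspondence can be made concrete with a small sketch. It assumes the atomic view in which every substring is a node, so a node w of rank k+1 is joined by an edge to its longest prefix w[:-1] and by a suffix link to its longest suffix w[1:]. Both function names are assumptions made for this example.

```python
def kmer_edges_from_sequence(t, k):
    """De Bruijn edges between k-mers that occur consecutively in t."""
    return {(t[i:i + k], t[i + 1:i + k + 1]) for i in range(len(t) - k)}


def kmer_edges_from_rank(t, k):
    """The same edges, read off the substrings (nodes) of rank k+1.

    Each rank-(k+1) node w links its longest prefix (via an edge)
    to its longest suffix (via a suffix link).
    """
    rank_k_plus_1 = {t[i:i + k + 1] for i in range(len(t) - k)}
    return {(w[:-1], w[1:]) for w in rank_k_plus_1}
```

Both functions return the same set for any input, which is the equivalence stated above written out for a single sequence.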

18
De Bruijn Graphs and Affix Trees

(Giegerich, 1997) gives the definitions of
suffix and prefix trees together with their
special relationship.
Active suffixes and prefixes for the string t:
The active suffix of t is its longest nested
suffix, denoted α(t)
The active prefix of t is its longest nested
prefix, denoted α^-1(t)
Then, α(t^-1) = (α^-1(t))^-1, where t^-1 is the
reverse of t
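
The relationship between active suffix and active prefix can be checked on small inputs. The sketch below assumes that a suffix or prefix is nested when it occurs more than once in t, with the empty string always counting as nested. The function names are assumptions, and the brute-force search is meant for illustration only.

```python
def count_occurrences(t, w):
    """Number of (possibly overlapping) occurrences of w in t."""
    return sum(1 for i in range(len(t) - len(w) + 1) if t.startswith(w, i))


def active_suffix(t):
    """Longest suffix of t that occurs more than once in t (assumed definition)."""
    for i in range(len(t) + 1):  # longest suffix first
        suffix = t[i:]
        if suffix == "" or count_occurrences(t, suffix) > 1:
            return suffix
    return ""


def active_prefix(t):
    """Longest prefix of t that occurs more than once in t (assumed definition)."""
    for j in range(len(t), -1, -1):  # longest prefix first
        prefix = t[:j]
        if prefix == "" or count_occurrences(t, prefix) > 1:
            return prefix
    return ""
```

Reversing a string swaps the two notions: the active suffix of the reversed string is the reversed active prefix of the original.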

19
De Bruijn Graphs and Affix Trees

The tree is atomic if each of its edges is marked
by a single char
So every node is explicit
The tree is actually a trie
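
A minimal sketch of such an atomic tree, built as a trie of all suffixes with nested dictionaries, is given below. The function names are assumptions made for this example, and the quadratic size makes it suitable for short strings only.

```python
def build_suffix_trie(t):
    """Atomic suffix tree of t: every edge is labelled with a single char."""
    root = {}
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            # Follow the edge for ch, creating the child node if it is missing.
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Number of nodes in the trie, including the root."""
    return 1 + sum(count_nodes(child) for child in node.values())
```

Since every node is explicit, the trie has one node per distinct substring plus the root. For `aba` the distinct substrings are a, b, ab, ba and aba.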

20
De Bruijn Graphs and Affix Trees

(Maaß, 2003) furthers this analogy-based idea of
affix trees and gives some more insight into the
issue
It states:
A suffix link is an auxiliary edge from node n to
node m where m is the node such that path(m) is
the longest proper suffix of path(n) represented
by a node in the tree.
Suffix links are used to move from one node to
another so that the represented string is
shortened at the front.
It also mentions that, in essence, it was
(Blumer, 1998) who observed the dual structure
of suffix trees.
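
How suffix links shorten a string at the front can be shown on an atomic tree, where every substring is a node. In that setting the link target of a node is simply its string with the first character dropped. The function name and the use of plain strings as node identifiers are assumptions made for this sketch.

```python
def suffix_links(t):
    """Suffix link of every non-root node in the atomic suffix tree of t.

    Nodes are identified by the string they represent; the root is "".
    """
    nodes = {t[i:j] for i in range(len(t)) for j in range(i + 1, len(t) + 1)}
    # Dropping the first char of a substring always gives another substring
    # (or the empty string), so the target node is guaranteed to exist.
    return {w: w[1:] for w in nodes}


def follow_links(links, w):
    """Strings visited when following suffix links from w down to the root."""
    path = [w]
    while w:
        w = links[w]
        path.append(w)
    return path
```

Following the links from `aba` visits `ba`, then `a`, then the root, removing one leading character at each step.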

21
Redundancies in the sequences

Finally, the redundancy in
the suffix tree of a string should be as low
as possible so that a De Bruijn graph can be
built efficiently out of that tree.
To reduce redundancy, some compression methods
can be applied, but no loss should take place and
reversal should be possible. The algorithm of
Lempel and Ziv is suggested as an efficient tool
for this task, running in O(n) time with suffix
trees.
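
A lossless, reversible factorization in the Lempel-Ziv style can be sketched as follows. This version searches for earlier occurrences by brute force, so it does not achieve the O(n) bound that a suffix tree allows. The function names and the factor encoding, a literal character or a (position, length) pair, are assumptions made for this illustration.

```python
def lz_factorize(t):
    """Split t into factors: a new char, or (pos, length) of an earlier copy."""
    factors = []
    i = 0
    while i < len(t):
        best_len, best_pos = 0, -1
        for start in range(i):  # brute-force search, not linear time
            length = 0
            while i + length < len(t) and t[start + length] == t[i + length]:
                length += 1
            if length > best_len:
                best_len, best_pos = length, start
        if best_len == 0:
            factors.append(t[i])  # character not seen before
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def lz_decode(factors):
    """Rebuild the original string; copies may overlap their own output."""
    out = []
    for factor in factors:
        if isinstance(factor, str):
            out.append(factor)
        else:
            pos, length = factor
            for k in range(length):
                out.append(out[pos + k])
    return "".join(out)
```

A highly repetitive input such as `abababab` collapses to two literals and one copy, and decoding the factors restores the input exactly.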

22
References

[1] Algorithms on Strings, Trees, and
Sequences: Computer Science and Computational
Biology, Dan Gusfield, Cambridge University
Press, Jan 15, 1997.
[2] Ukkonen E., On-line Construction of
Suffix-Trees, Algorithmica vol. 14(3), 1995.
[3] Algorithms on Strings, Maxime Crochemore
and Christophe Hancart, Cambridge University
Press, June 2007.
[4] Genome assembly and comparison using de
Bruijn graphs, D.R. Zerbino, PhD Thesis, European
Bioinformatics Institute, Darwin College,
September, 2009.
[5] Giegerich R., and Kurtz S., From Ukkonen to
McCreight and Weiner: A unifying view of
linear-time suffix tree construction, Algorithmica
vol. 19(3), 331-353, 1997.
[6] Maaß, M. G., Linear bidirectional on-line
construction of affix trees, Algorithmica vol.
37(1), 2003.
[7] Bieganski, P., Riedl, J., Carlis, J.V.,
Retzel, E.F., Generalized Suffix Trees for
Biological Sequence Data: Implementations and
Applications, HICSS (5), 1994.