Title: Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching
1 Introduction to Bioinformatics: Lecture III
Genome Assembly and String Matching
- Jarek Meller
- Division of Biomedical Informatics,
- Children's Hospital Research Foundation
- Department of Biomedical Engineering, UC
2 Outline of the lecture
- Physical mapping problem and the resulting
computational challenges
- Ordering clone libraries: from the consecutive
ones to global optimization methods
- Applications of exact string matching methods
- Towards the shortest superstring problem and the
shotgun assembly problem
3 Literature watch
Aloy et al., Structure-Based Assembly of
Protein Complexes in Yeast, Science 303.
Suggested as a way of getting acquainted with
protein pathways and their intersection with
structural studies.
4 Assembling physical maps of a genome
[Figure: markers placed along the DNA]
Physical mapping problem: create and locate in
the genome of interest a set of markers (e.g.
stretches of DNA that hybridize to a given
probe). With a sufficiently dense and ordered set
of markers, any newly sequenced DNA fragment that
is long enough to cover at least one marker can be
mapped to a rough location on the genome. One
of the early goals of the Human Genome Project
was to select and map a set of STS markers such
that there would be at least one STS in each
100 kb stretch of the genome.
5 Physical mapping and the problem of ordering
clone libraries with STS markers
[Figure: DNA with STSs 1 to 5 and clones 1 to 4]
Definition: A clone library consists of a set of
short DNA fragments, called clones, that
originated in a stretch of the studied DNA.
Definition: A sequence tagged site (STS) is
a DNA substring which occurs only once in the
DNA of interest. One may think of STSs as a set
of indices to which new DNA sequences can be
referenced.
Problem: What is the minimum length
of the STSs that could (at least in principle)
provide the requested coverage for the human
genome?
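A back-of-the-envelope sketch for the problem above, under a strong
simplifying assumption that is not in the slides: the genome is treated
as a random string over a 4-letter alphabet, so a word of length L can
be unique only if the number of distinct words, 4^L, is at least the
genome length. The genome size of about 3 * 10^9 bases is an assumed
round figure, and the function name is hypothetical.

```python
def min_unique_length(genome_length: int, alphabet_size: int = 4) -> int:
    """Smallest L such that alphabet_size**L >= genome_length."""
    length = 0
    distinct_words = 1  # equals alphabet_size ** length
    while distinct_words < genome_length:
        distinct_words *= alphabet_size
        length += 1
    return length


# Assumed round figure for the human genome: about 3 * 10**9 bases.
print(min_unique_length(3_000_000_000))  # prints 16
```

Real genomes contain repeats, so this is only a lower bound in
principle; STSs used in practice are considerably longer.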
6 The problem of ordering clone libraries with STS
markers can be cast (and solved) as the
consecutive ones problem
The true locations of the STSs and clones are not
known. However, for each clone the list of STSs
hybridizing to it is given.
[Figure: DNA with STSs 1 to 5 and clones 1 to 4]
Our task is to reconstruct the original order of
the STSs (and thus order the clone library) given
this data. Assuming that the STS probes are
unique and that there are no hybridization
errors, the problem can be cast as the consecutive
ones problem and solved efficiently using the
PQ-tree algorithm (Booth and Lueker, 1976).
7 The consecutive ones problem and its solution
For a binary hybridization matrix, find a
permutation of its columns such that in each row
all ones are located in a block of consecutive
entries.
[Figure: DNA with STSs 1 to 5 and clones 1 to 4,
and the corresponding clone-by-STS hybridization
matrix]
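The definition above can be illustrated with a minimal brute-force
sketch. It is not the PQ-tree algorithm: it simply tries every column
permutation, which is only feasible for a handful of STSs. The matrix
below is hypothetical data (rows are clones, columns are STS probes in
scrambled order), and the function names are illustrative.

```python
from itertools import permutations


def has_consecutive_ones(row):
    """True if all 1s in the row form one contiguous block (or there are none)."""
    ones = [i for i, value in enumerate(row) if value == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def find_consecutive_ones_order(matrix):
    """Return a column order giving every row a single block of 1s, or None."""
    n_cols = len(matrix[0])
    for order in permutations(range(n_cols)):
        if all(has_consecutive_ones([row[c] for c in order]) for row in matrix):
            return list(order)
    return None


# Hypothetical hybridization matrix: 4 clones (rows) x 5 STS probes (columns).
clones = [
    [1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
]
print(find_consecutive_ones_order(clones))  # prints [0, 2, 3, 1, 4]
```

The search takes factorial time in the number of columns, which is why
a dedicated algorithm is needed for realistic clone libraries.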
8 Fortunately, errors make life more interesting
In the presence of experimental errors the
problem becomes a global optimization problem (see
Pevzner, Chapter 3).
[Figure: DNA with STSs 1 to 5 and clones 1 to 4,
and the clone-by-STS hybridization matrix]
9 Heuristic solutions may still provide good probe
ordering
The number of gaps (blocks of zeros in rows) in
the hybridization matrix may be used as a cost
function, since hybridization errors typically
split a block of ones (false negative) or split a
gap into two gaps (false positive). The problem
of finding a permutation that minimizes the
number of gaps can be cast as a Traveling
Salesman Problem (TSP), in which the cities are
the columns of the hybridization matrix (plus an
additional column of zeros) and the distance
between two cities is the number of positions in
which the two columns differ (Hamming
distance). Thus, an efficient algorithm is
unlikely in the general case (unless P = NP), and
heuristic solutions are sought that provide good
probe ordering, at least for most cases (e.g.
Alizadeh et al., 1995).
Problem: Does the correct order of the STSs in
the example from the previous slide give the
shortest cycle for the corresponding TSP?
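The TSP formulation above can be checked on a small case. The sketch
below computes the length of the cycle through the columns (plus the
extra all-zero column) for a given column order, and compares it with
the number of blocks of ones in the rows. Each block of ones
contributes one 0-to-1 and one 1-to-0 transition along the cycle, so
the cycle length is twice the number of blocks, and every extra block
in a row means one extra interior gap. The matrix is hypothetical data
and the function names are illustrative.

```python
def hamming(col_a, col_b):
    """Number of positions in which two columns differ."""
    return sum(a != b for a, b in zip(col_a, col_b))


def tour_length(matrix, order):
    """Length of the cycle: zero column -> columns in the given order -> zero column."""
    columns = [[row[c] for row in matrix] for c in order]
    cities = [[0] * len(matrix)] + columns  # extra all-zero city closes the cycle
    return sum(
        hamming(cities[i], cities[(i + 1) % len(cities)])
        for i in range(len(cities))
    )


def blocks_of_ones(matrix, order):
    """Total number of maximal runs of 1s over all rows, for the given column order."""
    total = 0
    for row in matrix:
        previous = 0
        for c in order:
            if row[c] == 1 and previous == 0:
                total += 1
            previous = row[c]
    return total


# Hypothetical hybridization matrix: 4 clones (rows) x 5 STS probes (columns).
clones = [
    [1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
]
scrambled = [0, 1, 2, 3, 4]
ordered = [0, 2, 3, 1, 4]  # an order with the consecutive ones property
print(tour_length(clones, scrambled), blocks_of_ones(clones, scrambled))  # prints 14 7
print(tour_length(clones, ordered), blocks_of_ones(clones, ordered))      # prints 8 4
```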
10 Map location of anonymous DNA as a string
matching problem
A sufficiently long string of anonymous yet
sequenced DNA can be placed on the physical map
by finding which STSs are contained in this
sequence. Due to the size of the problem,
efficiency is very important. Millions of STSs
are available at present, and their total length
is typically much larger than the length of the
DNA sequence to be mapped. Assuming no
sequencing errors, the problem can be cast as
exact set matching and solved efficiently using,
for example, suffix trees. Generalized suffix
trees or inexact string matching methods need to
be used when some errors are allowed.
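To make the exact set matching problem concrete, here is a naive
baseline sketch. It is not the suffix tree method: it rescans the
sequence once per probe using Python's built-in str.find, so its cost
grows with the number of probes times the sequence length. The probe
names and sequences are made up for illustration.

```python
def locate_sts(sequence, sts_probes):
    """Map each STS name to the 0-based start positions where it occurs in sequence."""
    hits = {}
    for name, probe in sts_probes.items():
        positions = []
        start = sequence.find(probe)
        while start != -1:
            positions.append(start)
            start = sequence.find(probe, start + 1)  # allow overlapping matches
        if positions:
            hits[name] = positions
    return hits


# Hypothetical probes and fragment, far shorter than real STSs.
probes = {"sts_a": "ACGT", "sts_b": "TTGA", "sts_c": "GGGG"}
fragment = "CCACGTAATTGACC"
print(locate_sts(fragment, probes))  # prints {'sts_a': [2], 'sts_b': [8]}
```

With millions of probes, rescanning per probe becomes the bottleneck,
which is the motivation for index structures such as suffix trees.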
11 Strings, sequences and string operations
12 String exact matching problem
13 Solving the exact matching problem: conceptual
simplicity vs. computational complexity
14 Computationally efficient and elegant solutions
15 The idea of the suffix tree method
A string with m characters has m suffixes, which
can be represented as m leaves of a rooted
directed tree. Consider for example T = cabca.
[Figure: suffix tree for the example string, with
edges labeled by the characters a, b, c and
leaves numbered 1 to 5]
For simplicity, the leaf due to the terminal
character is not included.
Problem: What is the
reason for adding the terminal character?
16 Why does it work?
A substring of a string is a prefix of a suffix
of that string. For example, the substring P = ab
is a prefix of the suffix abca of T = cabca. Thus,
if P occurs in T, there is a leaf in the suffix
tree whose path label from the root starts with P.
[Figure: the suffix tree from the previous slide,
repeated]
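The prefix-of-a-suffix argument can be tried out with a minimal sketch.
For brevity it builds an uncompressed suffix trie from nested
dictionaries rather than a true suffix tree, so it uses quadratic space;
a real suffix tree compresses single-child paths and can be built in
linear time. The terminal character "$" and all names below are
illustrative choices, and leaves are numbered by 1-based suffix start
position.

```python
LEAF = "leaf"  # key that cannot collide with single-character edge labels


def build_suffix_trie(text, terminal="$"):
    """Insert every suffix of text + terminal into a trie of nested dicts."""
    padded = text + terminal
    root = {}
    for start in range(len(padded)):
        node = root
        for ch in padded[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1  # 1-based start position of this suffix
    return root


def occurrences(trie, pattern):
    """Start positions of pattern: the leaves below the node reached by spelling pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == LEAF:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("cabca")
print(occurrences(trie, "ab"))  # prints [2]
print(occurrences(trie, "ca"))  # prints [1, 4]
print(occurrences(trie, "cc"))  # prints []
```

Once the structure is built, a query walks at most len(pattern) edges
before collecting leaves, independent of where in the text the
pattern occurs.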
Problem: As a related problem, consider motif
search as implemented in PROSITE. Explain how the
finite automata formalism is used for motif search.
17 General idea: ordered fingerprints and the notion
of closeness between DNA fragments
Hierarchical sequencing: physical maps, clone
libraries and shotgun.
Definition: The algorithmic
problem of shotgun sequence assembly is to deduce
the sequence of the DNA string from a set of
sequenced and partially overlapping short
substrings derived from that string.
Analogy to physical map assembly: the DNA sequence
of a substring may be viewed as a precise ordered
fingerprint (in analogy to STSs), and
the suffix-prefix match determines whether two
substrings would be assembled together.
In general, the shortest superstring problem (find
the shortest string that contains each string
from a given set of strings as its substring)
is NP-hard, and heuristics are being developed to
address the problem.
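One common heuristic for the shortest superstring problem is greedy
merging: repeatedly join the two fragments with the longest
suffix-prefix match. The sketch below is a minimal illustration under
idealized assumptions (error-free reads, no repeats); it is not the
method of any particular assembler and is not guaranteed to return the
shortest superstring. The fragments and function names are
hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Merge fragments greedily by largest suffix-prefix overlap."""
    unique = set(fragments)
    # Fragments contained in another fragment add nothing to the superstring.
    frags = sorted(f for f in unique if not any(f != g and f in g for g in unique))
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


# Hypothetical error-free reads sampled from the string ATGCGTAC.
reads = ["GCGTA", "ATGCG", "GTAC"]
print(greedy_superstring(reads))  # prints ATGCGTAC
```

Each round compares all ordered pairs of fragments, so this sketch is
only meant for toy inputs.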
18 Get the relevant sequences to compare them:
conservation and differences
Problem -> Algorithms -> Programs
Sequencing -> fragment assembly problem -> the
Shortest Superstring Problem -> Phrap (Green, 1994)
Gene finding -> Hidden Markov Models, pattern
recognition methods -> GenScan (Burge and Karlin,
1997)
Sequence comparison -> pairwise and multiple
sequence alignments -> dynamic programming,
heuristic methods -> BLAST (Altschul et al., 1990)