Title: Peter Adams
1 Bioinformatics in action. Using pure mathematics, computer science and molecular biology to sequence problematic regions of genomes
Peter Adams, Department of Mathematics, The University of Queensland, pa_at_maths.uq.edu.au
2 Marketing versus science.
- The International Human Genome Sequencing
Consortium .. today announced the successful
completion of the Human Genome Project more than
two years ahead of schedule
.. with the only remaining gaps corresponding to
regions whose sequence cannot be reliably
resolved with current technology.
3 DNA Sequencing
- Sequence analysis is the process of determining, for a given organism, the correct order of the four bases C, G, A, T.
- This is a major exercise:
- The human genome contains 3 billion base pairs
- Plant genomes can be up to 100 times larger
- Major focus of international scientific and commercial effort
- Pharmaceutical companies have great interest!
- Enormous progress has been made, but total international sequencing effort is still increasing
4 Impediments to current sequencing
- Most techniques are based on Sanger sequencing and shotgun sequencing
- Many patterns of bases create difficulties for sequencing
- For example:
- Direct repeats cause problems with chemical processes
- Inverted repeats cause hairpin and other complex structures
- Repeated motifs create ambiguities in reconstruction
- Problematic regions are left till last because they are difficult
- Alternative labour-intensive methods have to be used
- There are still many gaps to be closed
5 SBH: An alternate technology
- Sequencing by hybridization (SBH) was proposed as an alternate technology
- Ideally, SBH involves:
- Finding the SBH spectrum of the target fragment (that is, all substrings of a given length p which occur in the target), and
- Reconstructing the target from its SBH spectrum
- SBH probes:
- One probe for each possible sequence of p bases
- Probe sites light up, revealing all subsequences of length p in the target
- p may be 10, 11 or more
- Enabled via pixels on microarrays called SBH chips
- Need 4^p distinct regions on the chip
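A minimal sketch of the ideal spectrum described on this slide, assuming an error-free chip (no false positives or false negatives) and no multiplicity information. The names sbh_spectrum and probe_count are illustrative and do not come from the talk.

```python
def sbh_spectrum(target, p):
    """Return the set of all length-p substrings occurring in target."""
    return {target[i:i + p] for i in range(len(target) - p + 1)}


def probe_count(p):
    """Number of distinct probes (chip regions) for probe length p: one per possible p-mer."""
    return 4 ** p


if __name__ == "__main__":
    print(sorted(sbh_spectrum("TCACCGTCGCCACTGTCCT", 4)))
    print(probe_count(10))
```

Because the spectrum is a set, a substring that occurs several times in the target is recorded only once, and the order of the substrings is not retained.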
6 An example with p = 4.
- Reveals all subsequences of length 4 in the target fragment
- May (probably won't) reveal repeated occurrences
- May (probably won't) know the total length of the target
- May (probably will) have false positives and false negatives
- Certainly won't reveal the order of the subsequences
- The target fragment is then reconstructed from its SBH spectrum, by aligning overlapping subsequences in the correct order.
7 Sequence reconstruction from an SBH spectrum with p = 4.
T C A C C G T C G C C A C T G T C C T
T C A C
  C A C C
    A C C G
      C C G T
        C G T C
          G T C G
            T C G C
              C G C C
                G C C A
                  C C A C
                    C A C T
                      A C T G
                        C T G T
                          T G T C
                            G T C C
                              T C C T
Try reconstructing this (without knowing the order)
8 TCAC
CACC
ACCG
CACC
CCGT
TCAC
CCGT
ACCG
9 Graphical representation of SBH
- Can represent the SBH spectrum as a combinatorial graph:
- Vertices are subsequences of length (p-1) from the spectrum
- Draw a directed edge from vertex u to vertex v whenever there is a subsequence in the SBH spectrum containing the corresponding vertex labels as its prefix and suffix
- Can represent the SBH reconstruction problem graphically:
- Find a path passing through every edge in the SBH graph exactly once
- Reconstruct the sequence of the target by reading vertex labels in turn
- A well-known problem in pure mathematics: Eulerian trails!
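The graph construction above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the talk: the spectrum is error-free, each probe occurs exactly once in the target, and the trail is found with Hierholzer's algorithm (a standard method for Eulerian trails, chosen here for brevity). The function name reconstruct is hypothetical.

```python
from collections import defaultdict


def reconstruct(spectrum):
    """Return one sequence whose p-mers are exactly the probes in spectrum."""
    graph = defaultdict(list)    # (p-1)-mer -> list of successor (p-1)-mers
    balance = defaultdict(int)   # out-degree minus in-degree per vertex
    for probe in sorted(spectrum):
        u, v = probe[:-1], probe[1:]   # prefix and suffix of length p-1
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # An Eulerian trail starts at the vertex with one surplus outgoing edge;
    # if every vertex is balanced, any vertex with an outgoing edge will do.
    starts = [u for u, b in balance.items() if b == 1]
    start = starts[0] if starts else min(graph)
    stack, trail = [start], []
    while stack:                       # Hierholzer's algorithm
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            trail.append(stack.pop())
    trail.reverse()
    if len(trail) != len(spectrum) + 1:
        raise ValueError("spectrum has no Eulerian trail")
    # Read vertex labels in turn: first label, then the last base of each later label.
    return trail[0] + "".join(v[-1] for v in trail[1:])


if __name__ == "__main__":
    target = "TCACCGTCGCCACTGTCCT"
    probes = {target[i:i + 4] for i in range(len(target) - 3)}
    print(reconstruct(probes))
```

The function returns one valid reconstruction. When the graph admits several Eulerian trails, the returned sequence need not be the original target, which is exactly the ambiguity discussed in the following slides.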
10 Previous example revisited
11 An alternate reconstruction
12 Evaluating the effectiveness of SBH
- Unambiguous reconstruction of the target is not always possible
- Two or more distinct DNA fragments can have identical subsequence constituents (SBH spectra)
- Repeated subsequences of length (p-1) or more may cause ambiguities
- Important questions are:
- How effective is SBH?
- What is the likelihood that an unknown target fragment of length L will have unambiguous reconstruction if probes of length p are used?
- Can investigate this by simulating SBH reconstruction.
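A small check of the claim that distinct fragments can share a spectrum. The first fragment is the 19-base target from the earlier p = 4 example; the second was derived by hand for this sketch (it swaps the two segments lying between the repeated 3-mers CAC and GTC) and is not taken from the slides.

```python
def spectrum(seq, p):
    """Set of all length-p substrings of seq."""
    return {seq[i:i + p] for i in range(len(seq) - p + 1)}


original = "TCACCGTCGCCACTGTCCT"
alternate = "TCACTGTCGCCACCGTCCT"   # hand-derived for illustration

# Different sequences, identical spectra for p = 4 ...
same_at_4 = original != alternate and spectrum(original, 4) == spectrum(alternate, 4)
# ... but a longer probe separates them, since no 4-mer repeats in the original.
same_at_5 = spectrum(original, 5) == spectrum(alternate, 5)

if __name__ == "__main__":
    print(same_at_4, same_at_5)
```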
13 Simulating SBH reconstruction
- Obtain a large database of previously sequenced genomic DNA
- For various values of p and L:
- Select a DNA fragment of length L at random from the database
- Determine all subsequences of length p which are present in the selected fragment
- Calculate the number of reconstructions of the fragment
- Repeat this process sufficiently many times to allow statistical predictions of the proportion of fragments of length L that have unique reconstruction from their subsequences of length p
- Ideally suited to grid computing!
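The simulation loop above might be sketched as below. Several simplifications are assumptions of this sketch rather than details from the talk: uniformly random fragments stand in for fragments drawn from a genomic database, the multiplicity of each probe and the first (p-1) bases of the target are taken as known, and reconstructions are counted by a plain backtracking search that stops at cap. The names count_reconstructions and unique_fraction are hypothetical.

```python
import random
from collections import Counter


def count_reconstructions(target, p, cap=2):
    """Count distinct sequences (at most cap) with the same p-mer multiset and start."""
    k = p - 1
    succ = {}   # (p-1)-mer -> Counter of successor (p-1)-mers with edge multiplicity
    for i in range(len(target) - p + 1):
        u, v = target[i:i + k], target[i + 1:i + p]
        succ.setdefault(u, Counter())[v] += 1
    found = 0

    def walk(u, left):
        nonlocal found
        if left == 0:            # every edge used: one complete reconstruction
            found += 1
            return
        for v, n in list(succ.get(u, {}).items()):
            if n and found < cap:
                succ[u][v] -= 1  # use one copy of edge u -> v
                walk(v, left - 1)
                succ[u][v] += 1  # backtrack

    walk(target[:k], len(target) - p + 1)
    return found


def unique_fraction(L, p, trials, rng):
    """Estimate the proportion of random length-L fragments with a unique reconstruction."""
    hits = 0
    for _ in range(trials):
        fragment = "".join(rng.choice("ACGT") for _ in range(L))
        hits += count_reconstructions(fragment, p) == 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (5, 6, 7):
        print(p, unique_fraction(L=100, p=p, trials=200, rng=rng))
```

The search is recursive and exponential in the worst case, so this is suitable only for short fragments (a few hundred bases at most). Each trial is independent, which is why the full-scale version of the experiment distributes well across many machines.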
14 Simulating SBH for various probe lengths
15 Failure of SBH
- Reconstruction ambiguities are a serious and fundamental problem with SBH.
- Only very short targets have a high probability of unambiguous reconstruction
- SBH has never become a competitive sequencing technology.
16 Can Bioinformatics help?
Coming from the viewpoint of mathematics and information technology:
- We observed that the repeat structure of DNA causes problems for SBH.
- Can we reduce the impact of repeats in the target fragment, by making the target more random?
- Does this idea make any sense at all?
- Using techniques from molecular biology, mathematics and information technology, the answer is yes!
17 Sequence Analysis via Mutagenesis (SAM)
- Deliberately introduce random pointwise mutations into some number of copies of a target DNA (Biochemistry)
- Hence (in some copies) disrupt the sequence structure which made standard technologies fail
- Sequence (some) mutants using standard technologies (Molecular Biology)
- Infer the original target from the mutants (Mathematics and Information Technology)
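A toy sketch of the mutate-then-infer idea, under strong assumptions that are not from the talk: mutations are substitutions only and uniform over the three other bases (real mutagenic chemistry is likely biased), every mutant is sequenced without error and is already aligned to the others, and the target is inferred by a simple per-position majority vote. The names mutate and infer_by_majority are hypothetical, and the vote is a stand-in for, not a description of, the author's inference method.

```python
import random
from collections import Counter


def mutate(target, rate, rng):
    """Return a copy of target with each base substituted with probability rate."""
    out = []
    for base in target:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def infer_by_majority(mutants):
    """Infer the target as the most common base in each column of the aligned mutants."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*mutants))


if __name__ == "__main__":
    rng = random.Random(0)
    target = "ACTCTGTCTC" + "A" * 20 + "GTGGACTTGGATTG"
    mutants = [mutate(target, 0.1, rng) for _ in range(7)]
    print(infer_by_majority(mutants) == target)
```

With independent mutations, a position is inferred wrongly only if the same wrong base appears in more copies than the correct one, which becomes unlikely as the number of mutants grows. Ties are broken by first occurrence in this sketch.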
18 An overview of SAM
19 Key points
- Mutation:
- Achieved via use of certain mutagenic chemicals
- Achieves mutation rates of 1-30%
- Destroys problematic features (for example, repeat structures)
- Information is not lost, but instead is distributed across multiple mutated variants
- Algorithms:
- May require 5-10 (or more) mutant copies
- Reconstruct the original sequence, even with high mutation rates
- Resolve assembly ambiguities, using similar total sequencing coverage to standard methods
20 Example: Stem-and-Loop structures
21 Example: Improving reads from a sequencer
A genomic poly-A sequence ambiguity. A poly-A region in a human genome clone is described as having an undefined number of A bases as a result of poor sequencing reads (GenBank AC006367). We tried using the latest cycle sequencing chemistry (first figure). The result was poor: clones contained varying numbers of As and downstream sequences were unreliable. The experiment was repeated using SAM. The new traces (second and third figures) clearly demonstrate that the mutated fragments are more readily sequenced.
23 Inferring the target
- Three approaches to inferring the target:
- Alignment approach
- Minimum distance approach
- Bayesian approach
24 Previous example revisited
A section of the multiple alignment of six mutant copies (M1 to M6) is shown. The inferred sequence is shown as Inf. The published sequence is shown as Pub. Output from a standard sequencing experiment is shown as Seq.
M1   ACTCTGTCTCAAAAACAAAAAAAAAAAAA-----------------GTGGACTTGGATGG
M2   ACTCTGTCTCAAAAAAAAAAAAAACAAAAA----------------GTGGACTTGGATTG
M3   ACGCTGTCTCAAAAACAAAAAAACAAAAAA----------------GTGGACGTGGATTG
M4   ACTCTGTCTCAAAAAAAAAAAAACAAAAAA----------------GTGGACTTGGATTG
M5   ACTCTGTCTCAACAAAAAAAACAAAACAAAA---------------GTGGACTTGGATTG
M6   ACTCTGTCTCAAAAAACAAAAAACACAAAC----------------GTGGCCTTGGATTG
Inf  ACTCTGTCTCAAAAAAAAAAAAAAAAAAAA----------------GTGGACTTGGATTG
Pub  ACTCTGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTGGACTTGGATTG
Seq  ACTCTGTCTCAAAAAAAAAAAAAAAAAAANNANA------------GTGGACTTGGATTG
25 SAM and SBH.
Primary benefits of applying SAM with SBH:
- Allows information from multiple variants to be combined in order to infer the target
- If an ambiguity is resolved in any mutant variant, this can be used to resolve that ambiguity in other variants and the target.
26 An example.
Original fragment
Repeated subsequences
Mutant
Repeated subsequences
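The sequences shown on this slide did not survive in the transcript, so the comparison is illustrated here with the 19-base fragment from the earlier p = 4 example and a hypothetical single-base mutant chosen by hand for this sketch. The function name repeated_kmers is also an assumption. It lists the (p-1)-mers that occur more than once, which are the subsequences that can make SBH reconstruction ambiguous.

```python
from collections import Counter


def repeated_kmers(seq, k):
    """Return the set of length-k substrings that occur more than once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}


original = "TCACCGTCGCCACTGTCCT"
mutant = "TCACCGTCGCCATTGTCCT"    # hypothetical: base 13 changed from C to T

if __name__ == "__main__":
    # For probes of length p = 4 the relevant repeats have length p - 1 = 3.
    print(sorted(repeated_kmers(original, 3)))
    print(sorted(repeated_kmers(mutant, 3)))
```

In the original, CAC and GTC both repeat and interleave, which permits two reconstructions. In this mutant only GTC repeats, and its second occurrence leads straight to the end of the fragment, so the mutant can be reconstructed unambiguously.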
28 Performance
- Simulations of SAM with SBH and 10 mutant variants, with reasonable mutation intensities.
- Compare with standard SBH (results in parentheses)
29 Thank you!