Title: Sequence Assembly for Single Molecule Methods
1Sequence Assembly for Single Molecule Methods
- Steven Skiena, Alexey Smirnov
- Department of Computer Science
- SUNY at Stony Brook
- skiena, alexey_at_cs.sunysb.edu
2The State of Sequence Assembly
- The success of full genome sequencing implies that shotgun sequence assembly with current technologies is largely a solved problem
- With conventional sequencing technologies:
  - read length about 500 base pairs
  - error rate under 2%
  - coverage about 10 times for bacteria, about 30 times for humans
- But single molecule sequencing methods promise to change these parameters significantly
3Single Molecule Sequencing Methods
- Single molecule sequencing methods, such as those being developed by U.S. Genomics, promise much longer read lengths
  - read length: hundreds of thousands of bases?
  - error rate: ?
- "No free lunch hypothesis": we anticipate that the new technologies will (at least initially) have significantly higher error rates than current sequencing machines.
- Our assumption: long lousy reads.
4Our Problems
- What levels of coverage will be needed to get accurate sequence information from long noisy reads?
- How do we efficiently assemble such long noisy reads?
5Sequencing from Subsequences
- Why subsequences?
- We anticipate that certain single molecule sequencing technologies will be prone to having many base deletion errors
- Example: in the U.S. Genomics technology, sequence bases are replaced by tagged bases. Untagged bases are invisible, so each read is a subsequence of the original.
- We study the effect of per-base deletion frequencies on our ability to accurately reconstruct long sequences. Our study revolves around this theoretical error model, but our algorithm can be easily generalized.
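The deletion-only error model above can be simulated in a few lines. This is a minimal sketch, assuming each base is dropped independently with probability p; the names `sample_read` and `is_subsequence` are illustrative and do not come from the slides.

```python
import random


def sample_read(sequence: str, p: float, rng: random.Random) -> str:
    """Return a read in which each base of `sequence` is deleted
    independently with probability p (assumed error model)."""
    return "".join(base for base in sequence if rng.random() >= p)


def is_subsequence(read: str, sequence: str) -> bool:
    """True if `read` can be obtained from `sequence` by deletions only."""
    remaining = iter(sequence)
    # `base in remaining` advances the iterator past the first match.
    return all(base in remaining for base in read)


rng = random.Random(0)
target = "".join(rng.choice("ACGT") for _ in range(60))
reads = [sample_read(target, 0.4, rng) for _ in range(5)]
```

Every read produced this way is a subsequence of `target`, which is the property the rest of the talk relies on.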
6Notation
- n: length of the original sequence
- p: base deletion rate
- k: number of reads
- R_i: a read of the original sequence
7Quality of Reconstruction Metric
- The score is computed from ED(s, s'), where ED is the edit distance, s is the target sequence of length n, and s' is the sequence reconstructed from the reads.
- An empty string has a score of 0
- The target string has a score of 1
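One definition consistent with the properties listed above is score = 1 - ED(s, s')/n. This is an assumption, since the formula itself is not reproduced here. The sketch below computes it with a standard dynamic-programming (Levenshtein) edit distance; the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete ca
                            curr[j - 1] + 1,           # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def reconstruction_score(target: str, reconstructed: str) -> float:
    """Assumed metric: 1 - ED(target, reconstructed) / len(target)."""
    return 1.0 - edit_distance(target, reconstructed) / len(target)
```

Under this definition the empty string scores 0 and the target itself scores 1, matching the two endpoints stated on the slide.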
8Lower Bounds on Reconstruction Quality
- k = 0 -> report a random string of some length. Computational experiments showed that reporting a string of length 0.6n gives the best results (score = 0.37)
- k = 1 -> report this read; score = 1 - p (because (1-p)n characters will be matched and the rest will be inserted).
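The single-read bound can be written out in one line, assuming the score is 1 - ED(s, s')/n (an assumption consistent with the endpoint scores of 0 and 1). A read R is a subsequence of s, so the cheapest edit script simply reinserts the deleted bases:

```latex
\mathrm{ED}(s, R) = n - |R|, \qquad \mathbb{E}\,|R| = (1-p)\,n
\;\Longrightarrow\;
\mathbb{E}\bigl[\mathrm{score}(R)\bigr] = 1 - \frac{n - (1-p)\,n}{n} = 1 - p .
```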
9Lower Bounds on Reconstruction Quality
10Information Theory Bounds
- What is the minimal number of reads that we need to reconstruct the sequence?
- First, we need to know the number of sequences of length n in which a given read of length k occurs
- Each of the reads gives us at most this number of bits of information
- Therefore, we will need at least this many reads
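The counting argument above can be sketched numerically. The code below is one way to do it, not necessarily the authors' exact derivation. It uses the standard result that the number of length-n strings over an alphabet of size q containing a fixed length-m string as a subsequence is the sum over j from m to n of C(n, j)(q-1)^(n-j), whatever the string is. It assumes q = 4, a read of expected length m = (1-p)n, and that a read carries at most log2(q^n) minus log2 of that count in bits. The read length is called `m` here to avoid clashing with k, the number of reads; all function names are illustrative.

```python
from itertools import product
from math import ceil, comb, inf, log2


def supersequence_count(n: int, m: int, alphabet_size: int = 4) -> int:
    """Number of length-n strings containing a fixed length-m subsequence."""
    return sum(comb(n, j) * (alphabet_size - 1) ** (n - j)
               for j in range(m, n + 1))


def brute_force_count(n: int, word: str, alphabet: str = "ACGT") -> int:
    """Same count by enumeration; only practical for very small n."""
    def contains(text):
        remaining = iter(text)
        return all(ch in remaining for ch in word)
    return sum(contains(t) for t in product(alphabet, repeat=n))


def bits_per_read(n: int, m: int, alphabet_size: int = 4) -> float:
    """Upper bound on the information one read of length m can carry."""
    return n * log2(alphabet_size) - log2(supersequence_count(n, m, alphabet_size))


def min_reads(n: int, p: float, alphabet_size: int = 4):
    """Lower bound on the number of reads at deletion rate p (assumed model)."""
    m = round((1 - p) * n)
    bits = bits_per_read(n, m, alphabet_size)
    if bits <= 0:
        return inf  # a read this short rules nothing out
    return ceil(n * log2(alphabet_size) / bits)
```

For q = 4 the count equals 4^n times the probability that a Binomial(n, 1/4) variable is at least m, so once reads are shorter than about n/4 each one carries roughly one bit or less. That is in line with the 75% threshold stated on the next slide.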
11Bounds on the Number of Reads
Conclusion: reconstruction becomes impossible for error rates higher than 75%, but is possible for 50%
12Sequence Assembly Algorithm
- We use a two-phase procedure
- Insertion: align a read R_i with consensus sequence C_{i-2} and build a new consensus C_{i-1}
- Refining and Cleanup: delete/reorder characters from the current consensus to better reflect the reads, and delete unused characters
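[Diagram: reads R1 and R2 are combined, then refined and cleaned up, to give C1; R3 is inserted into C1 and the result is refined and cleaned up to give C2; R4 is inserted into C2 and the result is refined and cleaned up to give C3.]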
13Read Insertion
- How do we choose the optimal alignment to insert a new read into the current consensus C_i?
- Pairwise align all reads against C_i and, for each position of C_i, compute the number of times each particular character was inserted at this position.
- Align the read being inserted against the weighted consensus sequence using the insertion weights generated before.
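As a simplified illustration of the insertion step, the sketch below merges a read into a consensus by computing a shortest common supersequence with an LCS dynamic program. It deliberately omits the insertion weights described above, so it is an unweighted stand-in rather than the procedure on the slide; `merge_read` and `is_subsequence` are illustrative names.

```python
def is_subsequence(read: str, sequence: str) -> bool:
    """True if `read` can be obtained from `sequence` by deletions only."""
    remaining = iter(sequence)
    return all(ch in remaining for ch in read)


def merge_read(consensus: str, read: str) -> str:
    """Return a shortest string containing both inputs as subsequences."""
    n, m = len(consensus), len(read)
    # lcs[i][j] = length of the LCS of consensus[i:] and read[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if consensus[i] == read[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    merged = []
    i = j = 0
    while i < n and j < m:
        if consensus[i] == read[j]:          # shared character: emit once
            merged.append(consensus[i])
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:  # keep the consensus character
            merged.append(consensus[i])
            i += 1
        else:                                 # insert the read's character
            merged.append(read[j])
            j += 1
    merged.extend(consensus[i:])
    merged.extend(read[j:])
    return "".join(merged)
```

For example, `merge_read("ATA", "ACA")` returns `"ATCA"`, a string that contains both reads as subsequences. Without weights from the other reads, nothing tells the merge whether T or C should come first.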
14Consensus Refining
- Pairwise alignment from reads is prone to two types of errors: inserting a pair of characters in the wrong order, and undersampling
[Figure: reads R1 = ATA and R2 = ACA of the sequence ACTAA; the consensus is ATCA before refining and ACTAA after.]
Solution: try a swap and a character doubling at each position, and keep the change if it improves the alignment score for some reads.
15Clean up Procedure
- Pairwise align all reads against the target to weight the positions of S by frequency of use.
- Update weights after each alignment to bias matches toward frequently used positions.
- Delete all characters matched fewer than a certain number of times.
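The cleanup idea can be sketched as follows, under simplifying assumptions: every read is a subsequence of the current consensus, an "alignment" is an embedding of the read into the consensus, and the preferred embedding is the one that uses the positions with the highest usage counts so far. The names `best_embedding`, `cleanup` and `min_uses` are illustrative, not taken from the slides.

```python
def best_embedding(consensus: str, read: str, weights: list):
    """Positions of `consensus` matched by `read`, maximizing summed weights.

    Returns None if `read` is not a subsequence of `consensus`.
    Weights are non-negative integers, so -1 marks "impossible".
    """
    n, m = len(consensus), len(read)
    # best[i][j] = max weight of embedding read[j:] into consensus[i:]
    best = [[-1] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        best[i][m] = 0
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            use = -1
            if consensus[i] == read[j] and best[i + 1][j + 1] >= 0:
                use = weights[i] + best[i + 1][j + 1]
            best[i][j] = max(best[i + 1][j], use)
    if best[0][0] < 0:
        return None
    positions, i, j = [], 0, 0
    while j < m:
        if (consensus[i] == read[j] and best[i + 1][j + 1] >= 0
                and best[i][j] == weights[i] + best[i + 1][j + 1]):
            positions.append(i)  # match read[j] at consensus position i
            j += 1
        i += 1
    return positions


def cleanup(consensus: str, reads: list, min_uses: int = 1) -> str:
    """Drop consensus characters matched by fewer than `min_uses` reads."""
    weights = [0] * len(consensus)
    for read in reads:
        positions = best_embedding(consensus, read, weights)
        if positions is None:
            continue  # read does not embed; ignored in this sketch
        for pos in positions:
            weights[pos] += 1  # bias later reads toward used positions
    return "".join(ch for ch, w in zip(consensus, weights) if w >= min_uses)
```

For instance, `cleanup("ATGCA", ["ATA", "ACA"])` returns `"ATCA"`, because no read ever uses the G.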
16Complexity Analysis
- Each insertion step takes O(k n^2) time
- Each refining step takes O(k n^2) time
- Each cleanup step takes O(k n^2) time
- Total: O(n_iter k n^2), where n_iter is the number of iterations
17Results
- For base deletion rates as high as 40%, we can completely reconstruct sequences given high enough coverage (50 times coverage)
- For larger error rates, our algorithm finds shorter supersequences, i.e. there are multiple answers, so exact reconstruction is impossible.
- Here we ignored the possibility of insertion/substitution errors, but it is clear our methods can adapt to different error models at lower error rates.
18Results
19Future Work
- We want to build your single molecule sequence assembler!
- Our Stroll shotgun sequence assembler (Chen and Skiena) was used by Brookhaven National Laboratory to sequence the bacterium Borrelia burgdorferi.
- We are particularly interested in identifying better error models for sequencing technologies under current development.
20The End