Title: DNA Sequence Compression using the Burrows-Wheeler Transform
1 DNA Sequence Compression using the Burrows-Wheeler Transform
- Don Adjeroh, Yong Zhang
- Computer Science and Electrical Engineering
- West Virginia University, Morgantown,
- USA
- Amar Mukherjee
- Computer Science
- University of Central Florida, Orlando
- USA
- Matt Powell and Tim Bell
- Computer Science
- University of Canterbury, Christchurch
- New Zealand
- August 16, 2002
2 Outline
- Introduction and The Problem
- Background
- Overview of Approach
- BWT and Repeat Analysis
- Parsing Strategies
- Results
- Conclusion
3 Introduction
- DNA as an information storage medium
- Draft sequence of the human genome now available
- Complete genomes available for many other
organisms
- Implications and Possibilities
- Genome-wide analysis of entire genomes
- Cross-genome analysis with complete genomes
- Drug discovery and medical science
- Potential cure for gene-related diseases, such as
sickle-cell anemia, Huntington's disease,
Fragile-X mental retardation syndrome, cancer
4 The problem
5 The problem of size...
- Some important genomes are on the order of
billions of base pairs
- Exponential growth in the number of complete
genomes
- Exponential growth in the size of available DNA
sequence data
Source: GenBank website
- We need ...
- Efficient and effective algorithms for sequence
analysis and interpretation
- Efficient techniques for management,
organization, and distribution of huge-volume
sequence data
6 Nature of DNA Sequences
- Four types of nucleotide bases
- A - adenine, C - cytosine, G - guanine, T -
thymine
- Various forms of repetitions
- Direct and tandem repeats
- Reverse repeats, complemented repeats,
palindromes
- Combinations
- Approximate repeats
- Introns and Exons
- Coding areas - generally less repetitive
- Non-coding areas - generally more redundant
- Non-coding areas make up > 50% of genomes
- DNA - only part of the whole story
7 DNA Sequence Compression
Example
- General Data Compression
- Symbol-wise substitution (Huffman codes)
- Dictionary-based (LZ family)
- Context-based (BWT and PPM)
- DNA Sequence Compression
- 4 symbols (A,C,G,T) => at most 2 bpc on average
- Generally dictionary-based
- Exploit the different forms of repeat
- Key Issues
- Speedy identification of the repeats
- Parsing with the repeats
- Representation of the parsed sequence
- Encoding the results
On-line dictionary
Off-line dictionary
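The 2 bpc bound for a four-symbol alphabet can be made concrete with a fixed 2-bit code. The following is a minimal sketch only: the bit assignment (A=00, C=01, G=10, T=11) and the function names `pack` and `unpack` are illustrative assumptions, not taken from the slides.

```python
# Fixed 2-bit code for the four bases: the 2 bpc baseline.
# The particular bit assignment is an assumption made for illustration.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte.

    A short final chunk is zero-padded on the right.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad an incomplete last byte
        out.append(byte)
    return bytes(out)


def unpack(data, n):
    """Recover the first n bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n])
```

Any DNA-specific compressor has to beat this baseline to be worthwhile, which is why results are reported in bits per character.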
8 Overview of Approach
- Off-line dictionary based
- Repeat analysis via BWT
- Variable-length parsing
- Integer encoding using a simple Huffman code
- Compression using the BWT compression pipeline
[Figure: block diagram with the stages Repeat Analysis, BWT, MTF, VLC between input and output]
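The MTF stage named in the pipeline can be sketched as follows. This is the textbook move-to-front transform; the initial table "ACGT" and the function names are assumptions, since the slides do not give implementation details.

```python
def mtf_encode(text, alphabet="ACGT"):
    """Move-to-front: emit each symbol's current table position,
    then move that symbol to the front of the table."""
    table = list(alphabet)
    out = []
    for ch in text:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out


def mtf_decode(codes, alphabet="ACGT"):
    """Invert mtf_encode by replaying the same table updates."""
    table = list(alphabet)
    out = []
    for idx in codes:
        ch = table.pop(idx)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)
```

Runs of identical symbols, which the BWT tends to produce, turn into runs of zeros, and these are cheap for the final variable-length coder.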
9 The Burrows-Wheeler Transform
- Forward Transform
- Let T be the input sequence, T = t1, t2, ..., tu.
- Form u permutations of T by cyclic rotations of
the characters in T. These form a u x u matrix M'.
Each row represents one permutation of T.
- Sort the rows of M' lexicographically to form
another matrix M.
- Record L, the last column of M, and id, the row
number for the row in M that corresponds to T.
The BWT output is the pair (L, id).
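The forward transform described above can be sketched directly from the definition. This is a naive version that materializes all rotations, so it is only suitable for short inputs; it is not the authors' implementation. The sample string and the 0-based row index are assumptions made for illustration.

```python
def bwt_forward(t):
    """Naive forward BWT: build all cyclic rotations, sort them,
    and return (L, id) with id as a 0-based row index."""
    u = len(t)
    rotations = [t[i:] + t[:i] for i in range(u)]  # rows of the unsorted matrix
    m = sorted(rotations)                          # rows of the sorted matrix M
    last = "".join(row[-1] for row in m)           # L: last column of M
    row_id = m.index(t)                            # row of M equal to T
    return last, row_id


print(bwt_forward("ACTAGA"))  # ('GATAAC', 1)
```

Practical implementations sort suffixes instead of building the full matrix, which avoids the quadratic memory use of this sketch.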
10 The BWT
- Inverse Transform
- Given only the pair (L, id)
- Sort L to produce F, the array of first
characters
- Compute V, which provides a 1-1 mapping between L and
F: F[V[j]] = L[j].
- Generate the original sequence T: the symbol L[j]
cyclically precedes the symbol F[j] in T, that
is, L[V[j]] cyclically precedes the symbol L[j]
in T.
- BWT Compression Pipeline
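The inverse steps above can be sketched as follows, using the relation F[V[j]] = L[j] and walking backwards from row id. The stable sort used to build V and the 0-based indexing are implementation choices assumed here, not details from the slides.

```python
def bwt_inverse(last, row_id):
    """Invert the BWT from (L, id), with id a 0-based row index."""
    u = len(last)
    # A stable sort of positions by symbol pairs the k-th occurrence of a
    # symbol in L with its k-th occurrence in F, so that F[V[j]] = L[j].
    order = sorted(range(u), key=lambda j: last[j])
    v = [0] * u
    for i, j in enumerate(order):
        v[j] = i
    # L[id] is the last symbol of T; L[V[j]] cyclically precedes L[j],
    # so T is filled in from right to left.
    out = [""] * u
    j = row_id
    for k in range(u - 1, -1, -1):
        out[k] = last[j]
        j = v[j]
    return "".join(out)


print(bwt_inverse("GATAAC", 1))  # ACTAGA
```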
11 BWT - Auxiliary Arrays
- From the inverse BWT relation between F[j] and L[V[j]],
we generate the mapping between F and T
- Let Hr = reverse(H); we have [equation not preserved]
- Define Hrs as the index vector to Hr
- Hr and Hrs can be obtained in linear time from L, V
and F.
- Example: TACTAGA
12 BWT, Suffix Trees and Repeats
- The BWT provides a lexicographic sorting of the
contexts
- The BWT is closely related to the suffix tree /
suffix array
- The suffix array corresponds exactly to our
auxiliary array Hrs!
- Repetition analysis as is done on suffix trees
can now be done on the BWT output!
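To illustrate why a sorted-context structure exposes repeats, the sketch below builds a suffix array by plain sorting and reports the common prefixes of lexicographically adjacent suffixes. It does not derive the array from the BWT auxiliary arrays as the slides describe; the function names and the sample string are assumptions.

```python
def suffix_array(t):
    """Start positions of the suffixes of t in lexicographic order (naive)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeats(t, min_len=2):
    """Direct repeats found between adjacent suffixes in sorted order.

    Returns a list of (substring, first_position, second_position).
    """
    sa = suffix_array(t)
    found = []
    for i, j in zip(sa, sa[1:]):
        k = lcp(t[i:], t[j:])
        if k >= min_len:
            found.append((t[i:i + k], min(i, j), max(i, j)))
    return found


print(repeats("ACGTACGA"))  # [('ACG', 0, 4), ('CG', 1, 5)]
```

Because sorting places suffixes with shared prefixes next to each other, every direct repeat shows up as a common prefix of neighbouring entries.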
13 Parsing Strategy I -- VPS1
- Off-line dictionary with pointers in the
dictionary
- Repetition codes
- 1 - direct repeat, 2 - reverse, 3 - palindrome, 4 -
complement, 5 - complemented palindrome, 6 -
reverse complement
- General Dictionary structure
- Example
Dictionary for example
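Some of the repeat types listed above have standard definitions that can be sketched in a few lines. The code below covers only codes 1, 2, 4 and 6 (direct, reverse, complement, reverse complement); codes 3 and 5 are left out because the slides do not define how "palindrome" differs from "reverse". The function names are assumptions.

```python
# Watson-Crick base pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse(s):
    return s[::-1]


def complement(s):
    return s.translate(COMPLEMENT)


def reverse_complement(s):
    return complement(s)[::-1]


# Codes follow the numbering on the slide; 3 and 5 are intentionally absent.
VARIANTS = [
    (1, lambda s: s),
    (2, reverse),
    (4, complement),
    (6, reverse_complement),
]


def repeat_code(source, target):
    """Return the first repetition code under which target repeats source,
    or None if no covered variant matches."""
    for code, transform in VARIANTS:
        if transform(source) == target:
            return code
    return None
```

For example, `repeat_code("ACG", "CGT")` returns 6, since CGT is the reverse complement of ACG.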
14 Analysis of VPS1
- Costs (in bits)
- Original sequence, S
- Parsed sequence
- Vocabulary
- Positions
- Dictionary
- Compression gain, G(S)
- With [condition not preserved], we underestimate the gain
15 Parsing Strategy II -- VPS2
- Off-line dictionary with pointers in the sequence
- Fixed-length versus variable-length parsing
- Parsing schema 1: <reference index>, i.e. <rI>
- Parsing schema 2: <reference index, repeat type>,
i.e. <rI, rT>
- Parsing schema 3: <index, repeat type, range>,
i.e. <rI, rT, sP, nP>
- Example
- With fixed-length parsing
- Parse(S) = x1<1,1>x2<3,1>x3<2,1>x4<4,1>x5<2,2>x6<1,5>x7<3,1>
- With variable-length parsing
- Parse(S) = x1<1>x2<3>x3<2>x4<4>x5<1,1,1,5>x6<1,5>x7<3>
Dictionary
16 Analysis of VPS2
- Parsing
- ParsePart1 = x1x2x3x4x5x6x7x8x9
- ParsePart1 lengths: sk1, sk2, sk3, sk4, sk5, sk6,
sk7, sk8, sk9, where ski = length of xi
- ParsePart2 = <1>0<3>0<2>0<4>0<1,1,1,5>0<1,5>0<3>0<4>0
- Costs (per-occurrence costs for each repeat)
- ParsePart1
- Vocabulary
- Total cost
- Per-replacement cost
- Compression gain
- Constraints on l(r),n(r), and rI(r)
ParsePart2
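The cost formulas on this slide did not survive, so the sketch below is not the authors' analysis. It uses a deliberately simple, assumed cost model to show why a repeat needs a minimum length before replacing it pays off: each base costs a fixed number of bits (2 by default), each replaced occurrence costs a fixed `pointer_bits`, and the first occurrence stays in the sequence. All names and parameters are hypothetical.

```python
import math


def replacement_gain(l, n, pointer_bits, bits_per_base=2.0):
    """Bits saved when occurrences 2..n of a length-l repeat are each
    replaced by a pointer of pointer_bits bits (assumed cost model)."""
    if n < 2:
        return 0.0
    saved_per_copy = bits_per_base * l - pointer_bits
    return (n - 1) * saved_per_copy


def min_useful_length(pointer_bits, bits_per_base=2.0):
    """Smallest repeat length with a strictly positive saving per copy."""
    return math.floor(pointer_bits / bits_per_base) + 1
```

Under this model a 12-bit pointer only helps for repeats of at least 7 bases: `replacement_gain(10, 3, 12)` is 16.0 bits, while `replacement_gain(4, 5, 12)` is -16.0 bits, a net loss.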
17 Encoding the Integers
- 1st order Fibonacci codes (FK1)
- 2nd order Fibonacci codes (AF1)
- Simple Huffman codes (H1)
F - 00, 1 - 010, 2 - 011, 0 - 1000, 3 - 1001, 4 - 1010,
5 - 1011, 6 - 1100, 7 - 1101, 8 - 1110, 9 - 1111
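The codeword table on this slide is a prefix code over the ten decimal digits plus a symbol F. The sketch below reads it as the H1 code and treats F as an end-of-integer marker; both readings are assumptions, since the slide does not say how F is used. Integers are written in decimal and encoded digit by digit.

```python
# Codewords copied from the slide; the role of "F" as a terminator is assumed.
H1 = {
    "F": "00", "1": "010", "2": "011", "0": "1000", "3": "1001",
    "4": "1010", "5": "1011", "6": "1100", "7": "1101", "8": "1110",
    "9": "1111",
}
H1_INV = {bits: sym for sym, bits in H1.items()}


def h1_encode(values):
    """Encode non-negative integers as a bit string: decimal digits,
    each integer followed by the F codeword."""
    parts = []
    for v in values:
        for digit in str(v):
            parts.append(H1[digit])
        parts.append(H1["F"])
    return "".join(parts)


def h1_decode(bits):
    """Decode a bit string produced by h1_encode."""
    values, digits, word = [], [], ""
    for b in bits:
        word += b
        if word in H1_INV:  # prefix-free, so the first match is the codeword
            sym = H1_INV[word]
            word = ""
            if sym == "F":
                values.append(int("".join(digits)))
                digits = []
            else:
                digits.append(sym)
    return values


print(h1_encode([12]))  # 01001100
```

The shortest codewords go to F, 1 and 2, which fits a setting where small integers and separators are the most frequent symbols.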
18 Results
- Compressed file size (bytes)
19 Results
- Compressed file size (bits per character)
20 Results
[Figures: Size (bytes) and Size (bpc) by method index]
Method index: 1 - gzip, 2 - arith, 3 - BWT, 4 - vps,
5 - vpsH1a, 6 - vpsH1b, 7 - vpsFK1a, 8 - vpsFK1b,
9 - vpsAF1a, 10 - vpsAF1b
21 Results
- Computation Time (H1 versus AF1 and FK1)
- (Time in seconds)
22 Conclusion
- BWT presents an effective mechanism for repeat
analysis
- Not all repeats can lead to compression
- Off-line dictionaries are effective for DNA
sequence compression
- Performance depends on the representation of the
parsed sequence and on the type, length, and number of
repeats.
- What next?
- Variable-length coding
- More rigorous theoretical analysis
- Further testing and comparative study
(approximate repeats, palindromes, speeding up
the process, etc.)
- DNA sequence entropy estimation
- Towards compressed domain DNA sequence analysis