DNA Sequence Compression using the Burrows-Wheeler Transform

Slides: 24

Provided by: Adr110

1
DNA Sequence Compression using the
Burrows-Wheeler Transform

Don Adjeroh, Yong Zhang
Computer Science and Electrical Engineering
West Virginia University, Morgantown,
USA
Amar Mukherjee
Computer Science
University of Central Florida, Orlando
USA
Matt Powell and Tim Bell
Computer Science
University of Canterbury, Christchurch
New Zealand
August 16, 2002

2
Outline

Introduction and The Problem
Background
Overview of Approach
BWT and Repeat Analysis
Parsing Strategies
Results
Conclusion

3
Introduction

DNA as an information storage medium
Draft sequence of the human genome now available
Complete genomes available for many other
organisms
Implications and Possibilities
Genome-wide analysis of entire genomes
Cross-genome analysis with complete genomes
Drug discovery and medical science
Potential cure for gene-related diseases, such as
sickle-cell anemia, Huntington's disease,
Fragile-X mental retardation syndrome, cancer

4
The problem

Size!

5
The problem of size...

Some important genomes are in the order of
billions of base pairs
Exponential growth in the number of complete
genomes
Exponential growth in the size of available DNA
sequence data

Source: GenBank website

We need ...
Efficient and effective algorithms for sequence
analysis and interpretation
Efficient techniques for management,
organization, and distribution of huge-volume
sequence data

6
Nature of DNA Sequences

Four types of nucleotide bases
A - adenine, C - cytosine, G - guanine, T -
thymine
Various forms of repetitions
Direct and tandem repeats
Reverse repeats, complemented repeats,
palindromes
Combinations
Approximate repeats
Introns and Exons
Coding areas - generally less repetitive
Non-coding areas - generally more redundant
Non-coding areas make up > 50% of genomes
DNA - only part of the whole story

7
DNA Sequence Compression
Example

General Data Compression
Symbol-wise substitution (Huffman codes)
Dictionary-based (LZ family)
Context-based (BWT and PPM)
DNA Sequence Compression
4 symbols (A, C, G, T) => at most 2 bits per character (bpc) on average
Generally dictionary-based
Exploit the different forms of repeat
Key Issues
Speedy identification of the repeats
Parsing with the repeats
Representation of the parsed sequence
Encoding the results
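The 2 bpc bound mentioned above comes from giving each of the four bases a fixed 2-bit code. A minimal sketch of that baseline follows; the names pack, unpack, CODE and BASES are hypothetical, and this is only the trivial reference point, not the method the slides describe.

```python
# Fixed 2-bit code per base: the trivial baseline a DNA compressor must beat.
# All names here are illustrative assumptions, not taken from the slides.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string over {A, C, G, T} into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpack can read from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n])
```

The sequence length n has to be stored separately, because the last byte may contain padding.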

Figure panels: On-line dictionary, Off-line dictionary

8
Overview of Approach

Off-line dictionary based
Repeat analysis via BWT
Variable-length parsing
Integer encoding using a simple Huffman code
Compression using the BWT compression pipeline

Figure: pipeline stages as labeled on the slide: input, Repeat Analysis, BWT, MTF, VLC, output

9
The Burrows-Wheeler Transform

Forward Transform
Let T be the input sequence, T = t1, t2, ..., tu.
Form u permutations of T by cyclic rotations of
the characters in T. These form a u × u matrix M'.
Each row represents one permutation of T.
Sort the rows of M' lexicographically to form
another matrix M.
Record L, the last column of M, and id, the row
number for the row in M that corresponds to T.
The BWT output is the pair (L, id).
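The forward transform can be sketched directly from the steps above. This is a naive version that sorts all rotations explicitly (roughly quadratic space and time), and practical implementations use suffix sorting instead. The function name bwt_forward is an assumption, and the row number it returns is 0-based.

```python
def bwt_forward(t: str):
    """Return (L, id) for t by sorting all cyclic rotations of t.

    id is the 0-based row of the sorted matrix that holds t itself.
    """
    u = len(t)
    # rotations[k] = start offset in t of the k-th smallest rotation
    rotations = sorted(range(u), key=lambda i: t[i:] + t[:i])
    # The last character of the rotation starting at i is t[i - 1] (cyclically).
    last = "".join(t[(i - 1) % u] for i in rotations)
    row = rotations.index(0)
    return last, row


print(bwt_forward("ACTAGA"))  # ('GATAAC', 1)
```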

10
The BWT

Inverse Transform
Given only the pair (L, id):
Sort L to produce F, the array of first
characters.
Compute V, which provides a 1-1 mapping between L and
F: F[V[j]] = L[j].
Generate the original sequence T: the symbol L[j]
cyclically precedes the symbol F[j] in T, that
is, L[V[j]] cyclically precedes the symbol L[j]
in T.
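A minimal sketch of the inverse transform, assuming V matches equal characters in order of occurrence (the k-th occurrence of a symbol in L maps to its k-th occurrence in F) and a 0-based row number. The name bwt_inverse is hypothetical.

```python
def bwt_inverse(last: str, row: int) -> str:
    """Rebuild T from (L, id), where id is a 0-based row number."""
    u = len(last)
    # A stable sort of the positions of L by character yields F:
    # order[k] = j means F[k] = L[j].
    order = sorted(range(u), key=lambda j: last[j])
    v = [0] * u
    for k, j in enumerate(order):
        v[j] = k  # so that F[V[j]] = L[j]
    # L[id] is the last symbol of T; following V walks backwards through T.
    out = []
    j = row
    for _ in range(u):
        out.append(last[j])
        j = v[j]
    return "".join(reversed(out))


print(bwt_inverse("GATAAC", 1))  # ACTAGA
```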
BWT Compression Pipeline
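The MTF (move-to-front) stage of such a pipeline can be sketched as follows. The starting symbol table "ACGT" and the function names are assumptions made for illustration, since the slides do not give these details.

```python
def mtf_encode(s: str, alphabet: str = "ACGT") -> list:
    """Replace each symbol by its current index in a move-to-front table."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))  # move the symbol to the front
    return out


def mtf_decode(codes: list, alphabet: str = "ACGT") -> str:
    """Invert mtf_encode using the same starting table."""
    table = list(alphabet)
    out = []
    for i in codes:
        ch = table.pop(i)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


print(mtf_encode("GATAAC"))  # [2, 1, 3, 1, 0, 3]
```

Runs of identical symbols in L become runs of zeros, which the final variable-length coder can then represent cheaply.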

11
BWT - Auxiliary Arrays

From the inverse BWT relation between F and L (via V), we generate
H, the F → T mapping.
Let Hr = reverse(H).
Define Hrs as the index vector to Hr.
Hr and Hrs can be obtained in linear time from L, V
and F.
Example TACTAGA

12
BWT, Suffix Trees and Repeats

The BWT provides a lexicographic sorting of the
contexts
The BWT is closely related to the suffix tree
and suffix array
The suffix array corresponds exactly to our
auxiliary array Hrs!
Repetition analysis as is done on suffix trees
can now be done on the BWT output!
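To illustrate why a sorted-suffix view exposes repeats, here is a small sketch that builds a suffix array by plain sorting and reports direct repeats shared by adjacent suffixes. It sorts the suffixes directly instead of deriving the array from the BWT auxiliary arrays as the slides describe, and the names suffix_array, lcp and direct_repeats are hypothetical.

```python
def suffix_array(t: str) -> list:
    """Start positions of the suffixes of t in lexicographic order (naive sort)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def direct_repeats(t: str, min_len: int = 2) -> list:
    """List (substring, pos1, pos2) for adjacent sorted suffixes that share
    a common prefix of at least min_len characters."""
    sa = suffix_array(t)
    found = []
    for p, q in zip(sa, sa[1:]):
        n = lcp(t[p:], t[q:])
        if n >= min_len:
            found.append((t[p:p + n], min(p, q), max(p, q)))
    return found


print(direct_repeats("ACGTACGA"))  # [('ACG', 0, 4), ('CG', 1, 5)]
```

Because repeated substrings start suffixes that sort next to each other, one pass over neighbouring entries is enough to find candidate repeats.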

13
Parsing Strategy I -- VPS1

Off-line dictionary with pointers in the
dictionary
Repetition codes
1 - direct repeat, 2 - reverse, 3 - palindrome, 4 -
complement, 5 - complemented palindrome, 6 -
reverse complement
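The basic string transformations behind these repeat types can be sketched as below. The definitions are the usual ones for DNA (A pairs with T, C with G) and are assumed here, since the slides do not spell them out. The function names are hypothetical, and no mapping to the numeric codes above is implied.

```python
# Watson-Crick complement table: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse(s: str) -> str:
    """The sequence read backwards."""
    return s[::-1]


def complement(s: str) -> str:
    """Each base replaced by its complementary base."""
    return s.translate(COMPLEMENT)


def reverse_complement(s: str) -> str:
    """The complement read backwards."""
    return complement(s)[::-1]


def equals_own_reverse_complement(s: str) -> bool:
    """True when s reads the same as its reverse complement."""
    return s == reverse_complement(s)


print(reverse("AACG"), complement("AACG"), reverse_complement("AACG"))
# GCAA TTGC CGTT
```

A repeat finder can apply one of these transformations to a candidate substring and then search for the result as an ordinary direct repeat.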
General Dictionary structure
Example

Dictionary for example

14
Analysis of VPS1

Costs (in bits)
Original sequence, S
Parsed sequence
Vocabulary
Positions
Dictionary
Compression gain, G(S)
Under the condition given on the slide (not preserved in this transcript), the gain is underestimated

15
Parsing Strategy II -- VPS2

Off-line dictionary with pointers in the sequence
Fixed-length versus variable-length parsing
Parsing schema 1: <reference index> = <rI>
Parsing schema 2: <reference index, repeat type> =
<rI, rT>
Parsing schema 3: <index, repeat type, range> =
<rI, rT, sP, nP>
Example
With fixed-length parsing
Parse(S) = x1<1,1>x2<3,1>x3<2,1>x4<4,1>x5<2,2>x6<1,5>x7<3,1>
With variable-length parsing
Parse(S) = x1<1>x2<3>x3<2>x4<4>x5<1,1,1,5>x6<1,5>x7<3>

Dictionary

16
Analysis of VPS2

Parsing
ParsePart1 = x1 x2 x3 x4 x5 x6 x7 x8 x9
ParsePart1 lengths: sk1, sk2, sk3, sk4, sk5, sk6,
sk7, sk8, sk9 (ski = length of xi)
ParsePart2 = <1>0<3>0<2>0<4>0<1,1,15>0<1,5>0<3>0<4>0
Costs (per occurrence costs for each repeat)
ParsePart1
Vocabulary
Total cost
Per-replacement cost
Compression gain
Constraints on l(r),n(r), and rI(r)

ParsePart2

17
Encoding the Integers

1st order Fibonacci codes (FK1)
2nd order Fibonacci codes (AF1)
Simple Huffman codes (H1)

Huffman code table (H1):
F - 00
1 - 010
2 - 011
0 - 1000
3 - 1001
4 - 1010
5 - 1011
6 - 1100
7 - 1101
8 - 1110
9 - 1111
18
Results

Compressed file size (bytes)

19
Results

Compressed file size (bits per character)

20
Results
Size (bytes) and size (bpc) by method index
Method index: 1 - gzip, 2 - arith, 3 - BWT, 4 - vps,
5 - vpsH1a, 6 - vpsH1b, 7 - vpsFK1a, 8 - vpsFK1b,
9 - vpsAF1a, 10 - vpsAF1b

21
Results

Computation Time (H1 versus AF1 and FK1)
(Time in seconds)

22
Conclusion

BWT presents an effective mechanism for repeat
analysis
Not all repeats can lead to compression
Off-line dictionaries are effective for DNA
sequence compression
Performance depends on the representation of the
parsed sequence, type, length, and number of
repeats.
What next?
Variable-length coding
More rigorous theoretical analysis
Further testing and comparative study
(approximate repeats, palindromes, speeding up
the process, etc.)
DNA sequence entropy estimation
Towards compressed domain DNA sequence analysis

The End
Thank You
