DNA Sequence Compression using the BurrowsWheeler Transform

1 / 23
About This Presentation
Title:

DNA Sequence Compression using the BurrowsWheeler Transform

Description:

Let Hr=reverse(H). we have. Define Hrs as the index vector to Hr: ... Off-line dictionary with pointers in the sequence. Fixed-length versus variable-length parsing ... – PowerPoint PPT presentation

Number of Views:147
Avg rating:0.0/5.0
Slides: 24
Provided by: Adr110

less

Transcript and Presenter's Notes

Title: DNA Sequence Compression using the BurrowsWheeler Transform


1
DNA Sequence Compression using the
Burrows-Wheeler Transform
  • Don Adjeroh, Yong Zhang
  • Computer Science and Electrical Engineering
  • West Virginia University, Morgantown,
  • USA
  • Amar Mukherjee
  • Computer Science
  • University of Central Florida, Orlando
  • USA
  • Matt Powell and Tim Bell
  • Computer Science
  • University of Canterbury, Christchurch
  • New Zealand
  • August 16, 2002

2
Outline
  • Introduction and The Problem
  • Background
  • Overview of Approach
  • BWT and Repeat Analysis
  • Parsing Strategies
  • Results
  • Conclusion

3
Introduction
  • DNA as an information storage medium
  • Draft sequence of the human genome now available
  • Complete genomes available for many other
    organisms
  • Implications and Possibilities
  • Genome-wide analysis of entire genomes
  • Cross-genome analysis with complete genomes
  • Drug discovery and medical science
  • Potential cure for gene-related diseases, such as
    sickle-cell anemia, Huntingtons disease,
    Fragile-X mental retardation syndrome, cancer

4
The problem
  • Size !

5
The problem of size...
  • Some important genomes are in the order of
    billions of base pairs
  • Exponential growth in the number of complete
    genomes
  • Exponential growth in the size of available DNA
    sequence data

Source Genbank website
  • We need ...
  • Efficient and effective algorithms for sequence
    analysis and interpretation
  • Efficient techniques for management,
    organization, and distribution of huge-volume
    sequence data

6
Nature of DNA Sequences
  • Four types of nucleotide bases
  • A - adenine, C - cytosine, G - guanine, T -
    thymine
  • Various forms of repetitions
  • Direct and tandem repeats
  • Reverse repeats, complimented repeats,
    palindromes
  • Combinations
  • Approximate repeats
  • Introns and Exons
  • Coding areas - generally less repetitive
  • Non-coding areas - generally more redundant
  • Non-coding areas make up gt 50 of genomes
  • DNA - only part of the whole story

7
DNA Sequence Compression
Example
  • General Data Compression
  • Symbol-wise substitution (Huffman codes)
  • Dictionary-based (LZ family)
  • Context-based (BWT and PPM)
  • DNA Sequence Compression
  • 4 symbols (A,C,G,T) ? at most 2 bpc on average
  • Generally dictionary-based
  • Exploit the different forms of repeat
  • Key Issues
  • Speedy identification of the repeats
  • Parsing with the repeats
  • Representation of the parsed sequence
  • Encoding the results

On-line dictionary
Off-line dictionary
8
Overview of Approach
  • Off-line dictionary based
  • Repeat analysis via BWT
  • Variable-length parsing
  • Integer encoding using a simple Huffman code
  • Compression using the BWT compression pipeline

Repeat Analysis
BWT
MTF
VLC
input
output
Repeat Analysis
BWT
MTF
VLC
input
output
9
The Burrows-Wheeler Transform
  • Forward Transform
  • Let T be the input sequence, Tt1,t2,, tu.
  • Form u permutations of T by cyclic rotations of
    the characters in T. These form a u?u matrix M.
    Each row represents one permutation of T
  • Sort the rows of M' lexicographically to form
    another matrix M.
  • Record L, the last column of M, and id, the row
    number for the row in M that corresponds to T.
    The BWT output is the pair (L, 2).

10
The BWT
  • Inverse Transform
  • Given only the pair (L,id)
  • Sort L to produce F, the array of first
    characters
  • Compute V, provides 1-1 mapping between the L and
    F. FVjLj.
  • Generate original sequence T, the symbol Lj
    cyclically precedes the symbol Fj in T, that
    is, LV j cyclically precedes the symbol Lj
    in T.
  • BWT Compression Pipeline

11
BWT - Auxiliary Arrays
  • From inverse BWT, Fj LV j. We generate
    F??T mapping
  • Let Hrreverse(H). we have
  • Define Hrs as the index vector to Hr
  • Hr Hrs can be obtained in linear time from L, V
    and F.
  • Example TACTAGA

12
BWT, Suffix Trees and Repeats
  • The BWT provides a lexicographic sorting of the
    contexts
  • The BWT is closely related to the suffix tree
    suffix array
  • The suffix array corresponds exactly to our
    auxiliary array Hrs !
  • Repetition analysis as is done on suffix trees
    can now be done on the BWT output !

13
Parsing Strategy I -- VPS1
  • Off-line dictionary with pointers in the
    dictionary
  • Repetition codes
  • 1- direct repeat 2 - reverse 3 palindrome 4 -
    compliment 5 - complimented palindrome 6-
    reverse compliment
  • General Dictionary structure
  • Example

Dictionary for example
14
Analysis of VPS1
  • Costs (in bits)
  • Original sequence, S
  • Parsed sequence
  • Vocabulary
  • Positions
  • Dictionary
  • Compression gain
  • With , we underestimate the gain

,


G(S)
15
Parsing Strategy II -- VPS2
  • Off-line dictionary with pointers in the sequence
  • Fixed-length versus variable-length parsing
  • Parsing schema1 ltreference indexgt ltrIgt
  • Parsing schema2 ltreference index, repeat type gt
    ltrI, rTgt
  • Parsing schema3 ltindex, repeat type, rangegt
    ltrI, rT, sP, nPgt
  • Example
  • With fixed length parsing
  • Parse(S) x1lt1,1gtx2 lt3,1gtx3lt2,1gtx4lt4,1gtx5lt2,2gtx6lt
    1,5gtx7lt3,1gt
  • With variable-length parsing
  • Parse(S) x1lt1gtx2lt3gtx3lt2gtx4lt4gtx5lt1,1,1,5gtx6lt1,5gtx7
    lt3gt

Dictionary
16
Analysis of VPS2
  • Parsing
  • ParsePart1 x1x2x3x4x5x6x7x8x9
  • ParsePart1 lengths sk1, sk2, sk3 sk4, sk5, sk6
    sk7, sk8, sk9 ski length of xi
  • ParsePart2 lt1gt0lt3gt0lt2gt0lt4gt0lt1,1,15gt0lt1,5gt0lt3gt0lt4
    gt0
  • Costs (per occurrence costs for each repeat)
  • ParsePart1
  • Vocabulary
  • Total cost
  • Per-replacement cost
  • Compression gain
  • Constraints on l(r),n(r), and rI(r)

ParsePart2


17
Encoding the Integers
  • 1st order Fibonacci codes (FK1)
  • 2nd order Fibonacci codes (AF1)
  • Simple Huffman codes (H1)

5 - 1011 6 - 1100 7 - 1101 8 - 1110 9 - 1111
F - 00 1 - 010 2 - 011 0 - 1000 3 - 1001 4 -
1010
18
Results
  • Compressed file size (bytes)

19
Results
  • Compressed file size (bits per character)

20
Results
Size (bytes)
Method index 1 - gzip 2 - arith 3 - BWT 4 -
vps 5- vpsH1a 6 - vpsH1b 7 - vpsFK1a 8 -
vpsFK1b 9 - vpsAF1a 10 - vpsAF1b
Size (bpc)
21
Results
  • Computation Time (H1 versus AF1 and KF1)
  • (Time in seconds)

22
Conclusion
  • BWT presents an effective mechanism for repeat
    analysis
  • Not all repeats can lead to compression
  • Off-line dictionaries are effective for DNA
    sequence compression
  • Performance depends on the representation of the
    parsed sequence, type, length, and number of
    repeats.
  • What next ?
  • Variable-length coding
  • More rigorous theoretical analysis
  • Further testing and comparative study
    (approximate repeats, palindromes, speeding up
    the process, etc.)
  • DNA sequence entropy estimation
  • Towards compressed domain DNA sequence analysis

23
  • The End
  • Thank You
Write a Comment
User Comments (0)