CRAM: reference-based compression format

About This Presentation

Description:

CRAM: reference-based compression format developed by Vadim Zalunin – PowerPoint PPT presentation

Slides: 25

Provided by: VadimZ9

Transcript and Presenter's Notes

1
CRAM: reference-based compression format
Developed by Vadim Zalunin
2
Data horror
EMBL-EBI: 10 petabytes
SRA: 1 petabyte
Over 2 million DVDs, or 2.5 km
Complete Genomics: 0.5 TB for a single file
3
The need for compression
Red alert
4
Compression, what is it?
LOSSLESS: BMP, 190 kb; PNG, 100 kb
LOSSY: JPG, 21 kb; JPG, 4 kb
5
Compression, when we know what to expect.
LOSSLESS: BMP, 145 kb; PNG, 2 kb
LOSSY: JPG, 6 kb; JPG, 3 kb
But the actual message is only 40 characters (bytes) long!
6
Compression at its best
"Five little ducks went swimming one day"
IMAGE, 145 kb -> compress -> TEXT, 40 b -> uncompress -> IMAGE, 145 kb
3500 times more efficient
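The arithmetic behind the "3500 times" figure can be sketched as follows. This is only an illustration: the sizes are the ones quoted on the slide, and the assumption that 1 kb means 1000 bytes is ours, so the result is approximate.

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Return how many times smaller the compressed form is."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


# Sizes quoted on the slide. Assumption: 1 kb = 1000 bytes.
IMAGE_BYTES = 145 * 1000
TEXT_BYTES = 40

ratio = compression_ratio(IMAGE_BYTES, TEXT_BYTES)
print(f"text is about {ratio:.0f} times smaller than the image")
```

With these inputs the ratio comes out a little above the rounded figure on the slide, which is consistent with the slide quoting an order of magnitude rather than an exact value.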
7
What are we talking about
bug
The bug's DNA is hidden somewhere
sample
sequencing machines
8
Looking closer at the data
It boils down to a long list of reads
read 1, read 2, read 3, ..., read bazillion
Each read represents a short nucleotide sequence
from the genome. Additional information may be
attached to it, for example error estimates.
9
What is a Read?
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI
An excerpt from a FASTQ file.
10
What is a Read?
read name
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI
An excerpt from a FASTQ file.
11
What is a Read?
read name
read bases
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI
An excerpt from a FASTQ file. Bases: ACGTN
12
What is a Read?
read name
read bases
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI
read quality scores
An excerpt from a FASTQ file. Bases: ACGTN. Quality scores: from ! (ASCII 33) to ~ (ASCII 126).
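To make the three parts of a read concrete, here is a minimal sketch that splits one FASTQ record into name, bases and quality scores. The `Read` type and `parse_fastq_record` function are hypothetical names chosen for illustration. The "+" separator line is part of the FASTQ format but does not appear in the excerpt above, so its presence in the sample record is an assumption.

```python
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    bases: str
    qualities: str


def parse_fastq_record(text: str) -> Read:
    """Parse a single four-line FASTQ record (hypothetical helper)."""
    lines = text.strip().splitlines()
    if len(lines) != 4:
        raise ValueError("expected exactly 4 lines in a FASTQ record")
    header, bases, separator, qualities = lines
    if not header.startswith("@"):
        raise ValueError("read name line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line must start with '+'")
    if len(bases) != len(qualities):
        raise ValueError("one quality score is required per base")
    return Read(header[1:], bases, qualities)


# The read from the slide; the '+' line is assumed, not taken from the slide.
RECORD = """@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI
"""

read = parse_fastq_record(RECORD)
print(read.name, len(read.bases), len(read.qualities))
```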
13
What is a quality score?

The quality score is a Phred quality score encoded as an ASCII symbol in the range 33-126.
Basically, higher scores are better, so ! is bad and I is good.
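A short sketch of the decoding may help here. It assumes the usual Phred definition, Q = -10 * log10(P), where P is the estimated probability that the base call is wrong, and an ASCII offset of 33 as stated on the slide. The function names are hypothetical.

```python
from typing import List


def phred_scores(quality_string: str, offset: int = 33) -> List[int]:
    """Convert ASCII quality symbols to integer Phred scores."""
    scores = [ord(symbol) - offset for symbol in quality_string]
    if any(score < 0 for score in scores):
        raise ValueError("symbol below the ASCII offset")
    return scores


def error_probability(score: int) -> float:
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-score / 10)


for symbol in "!I":
    q = phred_scores(symbol)[0]
    print(symbol, q, error_probability(q))
```

Under these assumptions "!" decodes to a score of 0 (the call is no better than a guess) and "I" decodes to 40 (roughly one error in ten thousand calls).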
14
Reference based encoding
Reference sequence  T G A G C T C T A A G T A C C C G C G G T C T G T C C G
read 1              T G A G C T C T T A G T A G C
read 2                    G C T C T A A G T A G C C G C
read 3                      C T C T A A G T A G C C G C G
read 4                                  G T A G C C G C G G A C T G T
read 5                                                C G G T C T G T C C G
Read start position
Read end position
15
Reference based encoding
Reference sequence  T G A G C T C T A A G T A C C C G C G G T C T G T C C G
read 1              . . . . . . . . T . . . . . .
read 2                    . . . . . . . . . . . . . . .
read 3                      . . . . . . . . . . . . . . .
read 4                                  . . . . . . . . . . A . . . .
read 5                                                . . . . . . . . . . .
16
Reference based encoding
Reference sequence  T G A G C T C T A A G T A C C C G C G G T C T G T C C G
read 1              . . . . . . . . T . . . . . .
read 2                    . . . . . . . . . . . . . . .
read 3                      . . . . . . . . . . . . . . .
read 4                                  . . . . . . . . . . A . . . .
read 5                                                . . . . . . . . . . .
Mismatching bases
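The idea on these slides, storing a read as its start position, its length and only the bases that differ from the reference, can be sketched as below. This is a toy illustration and not the CRAM encoding itself: it handles substitutions only, the function names are hypothetical, and the second example read is invented for the demonstration. The reference string and read 5 are taken from the slide with the spaces removed.

```python
from typing import List, Tuple

Mismatch = Tuple[int, str]  # (offset within the read, base in the read)

# Reference sequence from the slide, spaces removed.
REFERENCE = "TGAGCTCTAAGTACCCGCGGTCTGTCCG"


def encode_read(reference: str, read: str, start: int) -> Tuple[int, int, List[Mismatch]]:
    """Describe a read as (start, length, mismatches) relative to the reference."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return start, len(read), mismatches


def decode_read(reference: str, start: int, length: int, mismatches: List[Mismatch]) -> str:
    """Rebuild the read bases from the reference and the stored differences."""
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


read_5 = "CGGTCTGTCCG"         # read 5 from the slide, matches the reference exactly
toy_read = "CGGACTGTCCG"       # invented read with a single substitution

encoded_5 = encode_read(REFERENCE, read_5, 17)
encoded_toy = encode_read(REFERENCE, toy_read, 17)
print(encoded_5)
print(encoded_toy)
print(decode_read(REFERENCE, *encoded_toy) == toy_read)
```

A perfectly matching read costs only a position and a length, which is where most of the saving comes from when reads agree with the reference.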
17
Lossy quality scores
horizontal
Approach 1: Quality scores are usually values from 0 to 39. Let's shrink them so that they range from 0 to 7.
Approach 2: Let's treat quality scores using alignment information. For example, preserve quality scores only for mismatching bases.
vertical
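Both approaches can be sketched in a few lines. The slide does not say how the 0 to 39 range is mapped onto 0 to 7, so the uniform binning used here (five original values per bin) is an assumption, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List


def bin_quality(score: int, levels: int = 8, max_score: int = 39) -> int:
    """Approach 1: map a score in 0..max_score onto 0..levels-1 (uniform bins assumed)."""
    if not 0 <= score <= max_score:
        raise ValueError("score out of range")
    width = (max_score + 1) // levels  # 5 original values per bin by default
    return min(score // width, levels - 1)


def keep_mismatch_qualities(scores: List[int], mismatch_offsets: Iterable[int]) -> Dict[int, int]:
    """Approach 2: keep scores only at positions that differ from the reference."""
    return {offset: scores[offset] for offset in sorted(set(mismatch_offsets))}


scores = [0, 4, 5, 20, 39]
binned = [bin_quality(s) for s in scores]
kept = keep_mismatch_qualities(scores, [3])
print(binned)
print(kept)
```

Binning shrinks every score from roughly six bits to three, while the second approach drops most scores entirely and keeps full precision only where a read disagrees with the reference.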
18
Comparison study: 1K Genomes exomes
BAM -> compress -> CRAM -> uncompress -> BAM
19
Comparison study: 1K Genomes exomes
BAM -> compress -> CRAM -> uncompress -> BAM
20
Comparison study: 1K Genomes exomes
BAM -> compress -> CRAM -> uncompress -> BAM
Original SNPs
Restored SNPs
21
Comparison study: 1K Genomes exomes
22
CRAM NGS data compression
Chart: bits/base, from bad to good, for untreated data (do nothing), CRAM lossless (lossless), and CRAM lossy and CRAM very lossy (lossy).
23
Progressive application of compression
Chart: sample accessibility (hard to easy) against sample value (high to low).
24
References

More information
http://www.ebi.ac.uk/ena/about/cram_toolkit
Mailing list
http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev
Publications
Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21(5), 734-40
Cochrane G., Cook C.E. and Birney E. (2012) The future of DNA sequence archiving. Gigascience 1