Title: CRAM: reference-based compression format
1 CRAM: reference-based compression format
Developed by Vadim Zalunin
2 Data horror
EMBL-EBI: 10 petabytes
SRA: 1 petabyte
Over 2 million DVDs, or 2.5 km
Complete Genomics: 0.5 TB for a single file
3 The need for compression
Red alert
4 Compression, what is it?
LOSSLESS: BMP, 190 kb; PNG, 100 kb
LOSSY: JPG, 21 kb; JPG, 4 kb
5 Compression, when we know what to expect
LOSSLESS: BMP, 145 kb; PNG, 2 kb
LOSSY: JPG, 6 kb; JPG, 3 kb
But the actual message is only 40 characters (bytes) long!
6 Compression at its best
"Five little ducks went swimming one day"
IMAGE, 145 kb -> compress -> TEXT, 40 b -> uncompress -> IMAGE, 145 kb
3500 times more efficient
7 What are we talking about?
bug
The bug's DNA is hidden somewhere
sample
sequencing machines
8 Looking closer at the data
It boils down to a long list of reads:
read 1, read 2, read 3, ..., read bazillion
Each read represents a short nucleotide sequence from the genome. Additional information may be attached to it, for example error estimates.
9 What is a Read?
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI
An excerpt from a FASTQ file.
12 What is a Read?
read name: @SRR081241.20758946
read bases: CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG
read quality scores: IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI
An excerpt from a FASTQ file. Bases: A, C, G, T, N. Quality scores: from ! (ASCII 33) to ~ (ASCII 126).
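A minimal Python sketch of how the three parts of a read could be pulled out of one record. It assumes the standard four-line FASTQ layout, in which a line starting with "+" separates the bases from the quality scores; that separator line is not visible in the excerpt. The function name parse_fastq_record is made up for this illustration.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (name, bases, qualities).

    Assumes the standard layout: '@name', bases, '+' separator, qualities.
    """
    if len(lines) != 4:
        raise ValueError("expected exactly four lines per record")
    header, bases, separator, qualities = (line.strip() for line in lines)
    if not header.startswith("@"):
        raise ValueError("read name line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line must start with '+'")
    if len(bases) != len(qualities):
        raise ValueError("bases and quality scores differ in length")
    if set(bases) - set("ACGTN"):
        raise ValueError("unexpected base symbol")
    # Drop the leading '@' so only the read name itself is returned.
    return header[1:], bases, qualities


record = [
    "@SRR081241.20758946",
    "CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG",
    "+",  # assumed separator line, not shown on the slide
    "IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI",
]
name, bases, qualities = parse_fastq_record(record)
print(name, len(bases), len(qualities))
```

The length check reflects the fact that every base carries exactly one quality symbol.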
13 What is a quality score?
The quality score is a Phred quality score encoded as ASCII symbols 33-126.
Basically, higher scores are better, so ! is bad and I is good.
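As a rough illustration, the encoding can be decoded in a few lines of Python. The offset of 33 follows from "!" (ASCII 33) being the lowest symbol; the relation between a Phred score Q and the error probability, P = 10^(-Q/10), is the standard Phred definition and is not stated on the slide. The function names are made up for this sketch.

```python
def phred_from_char(symbol, offset=33):
    """Return the Phred quality score encoded by one ASCII symbol."""
    code = ord(symbol)
    if not 33 <= code <= 126:
        raise ValueError("quality symbol outside ASCII 33-126")
    return code - offset


def error_probability(score):
    """Convert a Phred score Q to the estimated error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


for symbol in "!I":
    q = phred_from_char(symbol)
    print(symbol, q, error_probability(q))
```

With this decoding, "!" corresponds to score 0 (the base call is essentially a guess) and "I" to score 40 (roughly one error in 10,000 calls).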
14 Reference based encoding
Reference sequence  T G A G C T C T A A G T A G C C G C G G T C T G T C C G
read 1              T G A G C T C T T A G T A G C
read 2                    G C T C T A A G T A G C C G C
read 3                      C T C T A A G T A G C C G C G
read 4                                  G T A G C C G C G G A C T G T
read 5                                                C G G T C T G T C C G
Read start position
Read end position
15 Reference based encoding
Reference sequence  T G A G C T C T A A G T A G C C G C G G T C T G T C C G
read 1              . . . . . . . . T . . . . . .
read 2                    . . . . . . . . . . . . . . .
read 3                      . . . . . . . . . . . . . . .
read 4                                  . . . . . . . . . . A . . . .
read 5                                                . . . . . . . . . . .
(. = base identical to the reference)
Mismatching bases
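The idea can be sketched in Python: each read is stored as its start position, its length, and only the bases that differ from the reference. This is an illustration under stated assumptions, not the actual CRAM encoding. The start positions (1, 4, 5, 11, 18) are inferred by sliding each read along the reference; the reference string uses G at position 14, the base that all covering reads carry and that the dotted view treats as matching; only substitutions are handled. The function names are hypothetical.

```python
# Assumption: position 14 is G, in agreement with the reads and the dotted view.
REFERENCE = "TGAGCTCTAAGTAGCCGCGGTCTGTCCG"

# (1-based start position, read bases); start positions are inferred.
READS = [
    (1, "TGAGCTCTTAGTAGC"),
    (4, "GCTCTAAGTAGCCGC"),
    (5, "CTCTAAGTAGCCGCG"),
    (11, "GTAGCCGCGGACTGT"),
    (18, "CGGTCTGTCCG"),
]


def encode_read(reference, start, read):
    """Encode a read as (start, length, mismatches).

    mismatches is a list of (0-based offset in read, read base) pairs.
    Only substitutions are handled, not insertions or deletions.
    """
    window = reference[start - 1:start - 1 + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return start, len(read), mismatches


def decode_read(reference, start, length, mismatches):
    """Rebuild the read from the reference and the stored differences."""
    bases = list(reference[start - 1:start - 1 + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


encoded = [encode_read(REFERENCE, start, read) for start, read in READS]
for item in encoded:
    print(item)
```

Reads 2, 3 and 5 need no bases at all; reads 1 and 4 each store a single mismatching base.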
17 Lossy quality scores
Approach 1 (horizontal): Quality scores are usually values from 0 to 39. Let's shrink them so that they range from 0 to 7.
Approach 2 (vertical): Let's treat quality scores using alignment information. For example, preserve quality scores only for mismatching bases.
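Both approaches can be sketched in a few lines of Python. The slide does not say how the 0-39 range is shrunk to 0-7, so uniform bins of width 5 are an assumption here, as are the function names and the sample scores.

```python
def bin_quality(score, levels=8, max_score=39):
    """Approach 1: map a score in 0..max_score onto 0..levels-1.

    Uses uniform bins (an assumption); with the defaults each bin spans 5 scores.
    """
    if not 0 <= score <= max_score:
        raise ValueError("score outside the expected range")
    return score * levels // (max_score + 1)


def keep_mismatch_qualities(qualities, mismatch_offsets):
    """Approach 2: keep scores only where the read differs from the reference.

    Positions that match the reference are replaced by None (score dropped).
    """
    keep = set(mismatch_offsets)
    return [q if i in keep else None for i, q in enumerate(qualities)]


scores = [2, 17, 39, 30, 8]  # hypothetical quality scores for a 5-base read
print([bin_quality(q) for q in scores])
print(keep_mismatch_qualities(scores, [2]))  # mismatch at offset 2 only
```

Eight levels fit in 3 bits per score, whereas the full 0-39 range needs 6 bits before any further compression.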
18 Comparison study: 1K Genomes exomes
BAM -> compress -> CRAM -> uncompress -> BAM
20 Comparison study: 1K Genomes exomes
BAM -> compress -> CRAM -> uncompress -> BAM
Original SNPs (from the original BAM)
Restored SNPs (from the restored BAM)
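One simple way to compare the two SNP sets is to count the calls they share and the calls unique to each. The sketch below is only an illustration: the slide does not say how SNPs were called or compared, the function name is made up, and the SNP tuples (chromosome, position, reference base, alternative base) are hypothetical values, not results from the study.

```python
def compare_snp_calls(original, restored):
    """Count SNP calls shared by both sets, lost after the round trip, and newly gained."""
    original, restored = set(original), set(restored)
    return {
        "shared": len(original & restored),
        "lost": len(original - restored),
        "gained": len(restored - original),
    }


# Hypothetical SNP calls: (chromosome, position, reference base, alternative base).
original_snps = [("1", 1005, "A", "T"), ("1", 2050, "G", "C"), ("2", 77, "C", "T")]
restored_snps = [("1", 1005, "A", "T"), ("2", 77, "C", "T"), ("2", 901, "T", "G")]

print(compare_snp_calls(original_snps, restored_snps))
```

For a lossless round trip the "lost" and "gained" counts would both be expected to be zero; lossy quality treatment may shift a few calls either way.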
21 Comparison study: 1K Genomes exomes
22 CRAM NGS data compression
[Figure: bits per base, on a scale from bad to good, for four cases: untreated (do nothing), CRAM lossless, CRAM lossy and CRAM very lossy.]
23 Progressive application of compression
[Figure: axes are sample accessibility (hard to easy) and sample value (high to low).]
24 References
- More information
- http://www.ebi.ac.uk/ena/about/cram_toolkit
- Mailing list
- http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev
- Publications
- Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21(5), 734-40
- Cochrane G., Cook C.E. and Birney E. (2012) The future of DNA sequence archiving. Gigascience 1