Title: Supporting Content-Addressable Caching with CZIP Compression
1Supporting Content-Addressable Caching with CZIP Compression
- KyoungSoo Park, Sunghwan Ihm, Mic Bowman, and Vivek Pai
- Princeton University
- Intel Research
2Content-Based Naming (CBN)
- Naming scheme based on an object's content
- Name = one-way hash(content)
- Hashing function: MD5, SHA-1, etc.
- Rabin's fingerprint for chunk detection
- Redundancy elimination
- Network-traffic/storage systems
- Research/commercial systems
- Special-purpose systems
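As a minimal sketch of the naming idea above (not CZIP's actual code), the Python below names each chunk by the SHA-1 hash of its content and stores identical chunks only once. The helper names, the file names, and the fixed 4096-byte chunk size are assumptions made for illustration; the slides use Rabin's fingerprint for chunk detection.

```python
import hashlib


def chunk_name(data: bytes) -> str:
    """Name a chunk by the SHA-1 hash of its content."""
    return hashlib.sha1(data).hexdigest()


def fixed_size_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def store_unique(files: dict[str, bytes], size: int) -> dict[str, bytes]:
    """Keep every distinct chunk once, keyed by its content-based name."""
    store: dict[str, bytes] = {}
    for content in files.values():
        for chunk in fixed_size_chunks(content, size):
            store[chunk_name(chunk)] = chunk
    return store


# Two hypothetical files; the second contains all of the first.
base = b"A" * 4096 + b"B" * 4096
files = {"cd.iso": base, "dvd.iso": base + b"C" * 4096}

store = store_unique(files, 4096)
total_bytes = sum(len(content) for content in files.values())
stored_bytes = sum(len(chunk) for chunk in store.values())
```

Here the two files add up to 20,480 bytes, but only three distinct chunks (12,288 bytes) need to be stored.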
3Where Can CBN be Applied?
- Similar file distribution
- Linux distribution mirror
- DVD ISO contains all CD ISOs
- Virtual machine image migration
- Base OS takes up majority of content
- httpd VM vs. httpd + mysqld VM
- Uncacheable Web content
- Some dynamic content doesn't change
4Contribution of This Work
- Generic CBN tool
- Easy to build new systems
- Easy to upgrade existing non-CBN systems
- CZIP compression + CZIP-aware apps
- Can be used on existing platforms
- Provides benefit to non-CZIP apps
- Demonstrate sample systems
- Reduces FC6 mirror memory footprint by half
- Compression speed comparable to GZIP's
- 2x throughput for CZIP-aware Apache
- 4x origin server BW reduction for CZIP-aware CDN
5CZIP Compression
- Compression scheme like GZIP, BZIP2
- Export CBN information in the header
[Figure: CZIP / UNCZIP diagram; labels: CZIP, UNCZIP, CZIP Header]
6CZIP Header
- Header = global attributes + chunk info
- Global attributes
- One-way hash function (SHA-1/MD5)
- Chunk data compression (GZIP/BZIP2)
- Convergent encryption (on/off)
- Header CRC, File Hash, etc.
- Chunk information
- Content hash, start offset, chunk size
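The fields listed above could be modeled roughly as follows. This is only an illustrative in-memory layout, not the on-disk CZIP format: the class and field names are hypothetical, the header CRC is left out, and offsets are assumed to refer to the uncompressed data.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ChunkInfo:
    content_hash: str   # one-way hash of the chunk content
    start_offset: int   # assumed: offset within the uncompressed file
    chunk_size: int     # chunk length in bytes


@dataclass
class CzipHeaderSketch:
    # Global attributes (values here are illustrative defaults).
    hash_function: str = "sha1"
    chunk_compression: str = "gzip"
    convergent_encryption: bool = False
    file_hash: str = ""
    # Per-chunk information.
    chunks: list[ChunkInfo] = field(default_factory=list)


def build_header(data: bytes, chunk_size: int) -> CzipHeaderSketch:
    """Build a header for fixed-size chunks of `data` (a simplification)."""
    header = CzipHeaderSketch(file_hash=hashlib.sha1(data).hexdigest())
    for offset in range(0, len(data), chunk_size):
        piece = data[offset:offset + chunk_size]
        header.chunks.append(
            ChunkInfo(hashlib.sha1(piece).hexdigest(), offset, len(piece))
        )
    return header


header = build_header(b"x" * 10000, 4096)
```

With the header exported this way, an application can learn every chunk's name without decompressing the file.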
7Deployment Scenario
[Figure: Client A, Client B, Server, and a CBN cache. file1.cz has a header and Chunks A and B; file2.cz has a header and Chunks A, B, and C. Chunk hash labels: xyzlo5g, asdfghk, qoiertty.]
8Deployment Scenario
- CZIP-aware client-side proxy
[Figure: Client A and Client B reach the server through a proxy with a CBN cache. file1.cz has a header and Chunks A and B; file2.cz has a header and Chunks A, B, and C. Chunk hash labels: xyzlo5g, asdfghk, qoiertty.]
1. X-SHA-1 field helps CZIP-aware server
2. Browser cache can support CBN too!
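The deployment scenario can be sketched as a cache keyed by chunk hash: once file1.cz has been fetched, fetching file2.cz needs only the one chunk that is new. The class below is a hypothetical stand-in for a CZIP-aware proxy, and the in-memory `origin` dictionary stands in for the server.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class ChunkCache:
    """Content-addressed cache that counts fetches from the origin."""

    def __init__(self, origin: dict[str, bytes]):
        self.origin = origin              # chunk hash -> bytes on the server
        self.cache: dict[str, bytes] = {}
        self.origin_fetches = 0

    def get(self, name: str) -> bytes:
        if name not in self.cache:        # miss: go to the origin once
            self.cache[name] = self.origin[name]
            self.origin_fetches += 1
        return self.cache[name]

    def fetch_file(self, chunk_names: list[str]) -> bytes:
        """Reassemble a file from the chunk names listed in its header."""
        return b"".join(self.get(name) for name in chunk_names)


chunk_a, chunk_b, chunk_c = b"a" * 1000, b"b" * 1000, b"c" * 1000
origin = {sha1_hex(c): c for c in (chunk_a, chunk_b, chunk_c)}

file1 = [sha1_hex(chunk_a), sha1_hex(chunk_b)]   # file1.cz: Chunks A, B
file2 = file1 + [sha1_hex(chunk_c)]              # file2.cz: Chunks A, B, C

cache = ChunkCache(origin)
data1 = cache.fetch_file(file1)
fetches_after_file1 = cache.origin_fetches
data2 = cache.fetch_file(file2)
fetches_after_file2 = cache.origin_fetches
```

In this sketch the second download causes a single origin fetch, for Chunk C, even though the file names differ.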
9Compressibility
- Fedora Core 6 ISOs / All files / Wikipedia DB
[Figure: data compression ratio (y-axis, 0 to 1) for CZIPplain, CZIPgzip, CZIPbzip2, GZIP, and BZIP2 on FC6_i386_ISOs.tar (6.7 GB), FC6_All_files.tar (49.7 GB), and Wikipedia_DB.tar (7.9 GB)]
10Compression speed
- On Pentium D 2.8GHz with 4GB memory
[Figure: compression time; labeled values: 29,004 secs, 3,151 secs, 3,964 secs]
11Virtual Machine Images
- Server consolidation/management
- Much redundancy among similar VMs
- Xen FC4 base image (X)
- X + httpd (Y) / Y + mysqld (Z)
- Investigating how content overlap varies with
- Chunk size
- Chunking methods
- Rabin's fingerprint vs. fixed-size
- After extensive use
12Chunk Size / Chunking Methods
- Compare three VM images
- Base: Xen FC4 image / Apache: Base + httpd
- Both: Apache + mysqld
[Figure panels: Rabin's fingerprint; fixed-size chunking]
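To illustrate why the chunking method matters, the sketch below compares fixed-size chunking with a simple content-defined chunker after one byte is inserted at the front of a file. The content-defined chunker uses a CRC32 over a sliding window as a stand-in for Rabin's fingerprint; the window size, divisor, and test data are arbitrary assumptions.

```python
import hashlib
import random
import zlib


def fixed_size_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Cut at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16,
                           divisor: int = 64) -> list[bytes]:
    """Cut wherever a hash of the last `window` bytes is 0 mod `divisor`."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):                 # trailing bytes after last cut
        chunks.append(data[start:])
    return chunks


def names(chunks: list[bytes]) -> set[str]:
    """Content-based names of a list of chunks."""
    return {hashlib.sha1(c).hexdigest() for c in chunks}


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20000))
shifted = b"!" + original                 # one byte inserted at the front

fixed_shared = names(fixed_size_chunks(original)) & names(fixed_size_chunks(shifted))
cdc_original = names(content_defined_chunks(original))
cdc_shared = cdc_original & names(content_defined_chunks(shifted))
```

Because each cut point depends only on the bytes in its window, the insertion can change at most the first content-defined chunk, while every fixed-size chunk boundary shifts by one byte.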
13Real VM Images
- EC1 to EC5: VMs based on Xen FC-4 + standard tools
- Used daily by five different engineers for three weeks
14Dynamic Web Pages
- Observed the front page of these sites
- Google News
- CNN
- Slashdot
- Digg.com
- Fark.com
- New York Times
- All of them non-cacheable
- no-cache, no-store or private
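A conventional shared cache would decide these pages are uncacheable from the `Cache-Control` response header alone. A rough check, using a hypothetical helper that ignores other headers such as `Pragma` and `Expires`, might look like this:

```python
UNCACHEABLE = {"no-cache", "no-store", "private"}


def is_uncacheable(cache_control: str) -> bool:
    """True if a Cache-Control value carries no-cache, no-store, or private."""
    directives = {
        part.strip().split("=", 1)[0].lower()   # drop any "=value" suffix
        for part in cache_control.split(",")
    }
    return bool(directives & UNCACHEABLE)
```

Content-based naming sidesteps this decision: chunks are matched by hash, so unchanged parts of such a page can still be reused.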
15Average Content Overlap
Downloaded pages every 10 minutes for 18 days
16Potential Data Savings via CZIP
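[Figure: potential data savings per site; labeled values (percent, site labels not recoverable): 37, 39, 61, 24, 57, 90]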
17Summary So far
- CZIP is comparable to GZIP in speed and performance
- CZIP is far better on files with much redundancy
- Redundancy decreases as chunk size increases
- Rabin's fingerprint exposes a good deal of redundancy regardless of chunk size
- Optimal chunk size varies with workload
- Bigger chunk size is better for network transfer
- Dynamic content also exposes redundancy
- CZIP can save 24-90% of BW when used instead of GZIP
18Server Performance
- CZIP Apache Module
- Test scenario (FC mirror simulation)
- 1.5 GB from FC6 DVD
- 1.5 GB is split into three 0.5 GB images
- Each file is requested in round-robin fashion
- 100-300 clients simulated by six machines in LAN
- Server is a 2.8 GHz Pentium D w/ 2 GB physical memory and 2 Gbps NICs
19CZIP Apache Module
- 90%: 2.56 times
- Median: 2.07 times
20CBN-Aware Content Distribution
- CoBlitz: large-file CDN (NSDI '06)
- Serving 1-2 TB every day on PlanetLab
- http://coblitz.codeen.org/URL
- University channel podcast/vodcast
- Fedora Core mirror, Citeseer etc.
- Chunk is basic caching unit
- Parallel chunk requests/responses
- Chunk request in HTTP byte-range query
21Making CoBlitz CZIP-Aware
- CoBlitz's chunk request
- GET /coblitz.codeen.org/www.cs.princeton.edu/bigfile.cz,start1000,end1999 HTTP/1.0
- Host: coblitz.codeen.org
- CZIP-aware CoBlitz (C-CoBlitz) request
- GET /czip.codeen.org/Chunk_SHA-1_Hash HTTP/1.0
- Host: czip.codeen.org
- X-URL: www.cs.princeton.edu/bigfile.cz
- X-Range: byte1000-1999
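A C-CoBlitz-style request could be assembled as below. This sketch follows the shape of the request on the slide, but the exact header value syntax is an assumption (in particular the `bytes=start-end` form of `X-Range`), and `czip_chunk_request` is a hypothetical helper, not part of CoBlitz.

```python
import hashlib


def czip_chunk_request(chunk: bytes, file_url: str, start: int, end: int) -> str:
    """Build a chunk request whose path is the chunk's SHA-1 hash."""
    chunk_hash = hashlib.sha1(chunk).hexdigest()
    lines = [
        f"GET /czip.codeen.org/{chunk_hash} HTTP/1.0",
        "Host: czip.codeen.org",
        f"X-URL: {file_url}",                # where to fetch on a cache miss
        f"X-Range: bytes={start}-{end}",     # assumed value syntax
    ]
    return "\r\n".join(lines) + "\r\n\r\n"


same_chunk = b"shared content" * 100
req_a = czip_chunk_request(same_chunk, "www.cs.princeton.edu/bigfile.cz", 1000, 1999)
req_b = czip_chunk_request(same_chunk, "example.org/other.cz", 0, 999)
```

The design point is that the request line depends only on the chunk's content, so `req_a` and `req_b` map to the same cache entry even though they come from different files; the `X-URL` and `X-Range` headers tell the CDN how to fetch the chunk if it is not cached.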
22CZIP-Aware CoBlitz Testing
- Two content-overlapping files
- Simultaneously fetch from 100 PlanetLab nodes
- Origin server is at Princeton
- Testing cases
- Regular: Download original files by regular CoBlitz
- File-CZIP: Download CZIPed files by regular CoBlitz
- CZIP-CDN: Download CZIPed files by C-CoBlitz
23 100 MB File Downloading
[Figure: comparison of Regular, File-CZIP, and CZIP-CDN; labeled value: 388 MB]
24 50 MB File Downloading
[Figure: comparison of Regular, File-CZIP, and CZIP-CDN; labeled value: 183 MB]
25Conclusion
- CZIP is a generic compression tool providing CBN benefits
- CZIP is comparable to GZIP in compression performance
- CZIP helps greatly reduce memory footprint in serving similar files
- It is very easy to support CZIP and the benefit is transparent
26Thank you!
- More information can be found at http://codeen.cs.princeton.edu/czip/
- CZIP code will be released soon!
27 200/300 Clients
[Figure: two panels, 200 clients and 300 clients. Labeled values in order of appearance: 90%: 2.27 times and 2.11 times; Median: 1.95 times and 1.84 times; other labels: 80, 65]