Title: End-to-end data deduplication for the mobile Web
Ricardo Filipe and João Barreto
Distributed Systems Group, INESC-ID / Instituto Superior Técnico
rfilipe_at_gsd.inesc-id.pt
Motivation
Problem?
Solution: Eliminate Redundant Data
Basic Deduplication Techniques
- Classical Caching
  - Detects only fully redundant resources
- Gzip Compression
  - Detects redundant chunks only within the resource
- Delta Encoding
  - Detects redundant data only between pairs of resources
  - Slow delta computation/offline
Advanced Deduplication Techniques
- Value-Based Web Cache [Rhea, WWW'03]
  - An ISP proxy is not suitable for roaming clients
  - The ISP proxy is useless for encrypted data (HTTPS)
  - High resource usage on the client
- DedupHTTP solves all these limitations!
  - And some more
DedupHTTP in one slide
Server Algorithm: Chunk division
- Karp-Rabin rolling hash
- Winnowing
- MurmurHash (or MD5, SHA1, etc.)
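The three ingredients listed above can be combined roughly as follows. This is only a minimal sketch, not the DedupHTTP implementation: the window sizes `k` and `w`, the hash base and modulus, the rule that every winnowed position becomes a chunk boundary, and the helper names are all assumptions made for illustration. MD5 from the Python standard library stands in for MurmurHash, which the slide lists as interchangeable.

```python
import hashlib

BASE = 257             # assumed polynomial base for the rolling hash
MOD = (1 << 61) - 1    # assumed large prime modulus


def rolling_hashes(data: bytes, k: int) -> list[int]:
    """Karp-Rabin hash of every k-byte window of data, computed incrementally."""
    if len(data) < k:
        return []
    high = pow(BASE, k - 1, MOD)  # weight of the byte that leaves the window
    h = 0
    for b in data[:k]:
        h = (h * BASE + b) % MOD
    hashes = [h]
    for i in range(k, len(data)):
        # Drop the oldest byte, shift, and append the newest byte.
        h = ((h - data[i - k] * high) * BASE + data[i]) % MOD
        hashes.append(h)
    return hashes


def winnow(hashes: list[int], w: int) -> list[int]:
    """Positions of the minimum hash in each window of w hashes (rightmost on ties)."""
    selected: list[int] = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        pos = start + (w - 1 - window[::-1].index(smallest))
        if not selected or selected[-1] != pos:
            selected.append(pos)
    return selected


def chunk(data: bytes, k: int = 8, w: int = 16) -> list[bytes]:
    """Split data at the winnowed positions, so boundaries depend on content only."""
    cuts = [p for p in winnow(rolling_hashes(data, k), w) if p > 0]
    bounds = [0] + cuts + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


def chunk_id(piece: bytes) -> str:
    """Identity of a chunk; MD5 is used here in place of MurmurHash."""
    return hashlib.md5(piece).hexdigest()
```

Because boundaries are chosen from local content, inserting a few bytes at the start of a resource only disturbs the first chunks; the remaining chunks keep the same identifiers and can be recognized as redundant.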
Server Algorithm: Metadata Storage
- Chunk array
- Chunk Hash Table
Server Algorithm: Chunk Search
Server Algorithm: Encoding
Optimizations
- Metadata Coalescing
  - Join contiguous chunk metadata blocks into one response metadata block
  - Especially relevant for (almost) fully redundant resources
- Old resource versions only need to store their metadata on the server
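Metadata coalescing, as described above, can be illustrated with a short sketch. The `Ref` record and its fields are hypothetical; the assumption here is that each response metadata block points at a byte range of a resource the client already holds, so two blocks can be joined when they reference adjacent ranges of the same resource.

```python
from typing import NamedTuple


class Ref(NamedTuple):
    """Hypothetical metadata block: a byte range of a resource the client has."""
    resource_id: str
    offset: int
    length: int


def coalesce(refs: list[Ref]) -> list[Ref]:
    """Join runs of contiguous references into single, longer references."""
    merged: list[Ref] = []
    for ref in refs:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.resource_id == ref.resource_id
            and last.offset + last.length == ref.offset
        ):
            # The new block starts exactly where the previous one ends: extend it.
            merged[-1] = Ref(last.resource_id, last.offset, last.length + ref.length)
        else:
            merged.append(ref)
    return merged
```

For a fully redundant resource every chunk reference is contiguous with the previous one, so the whole response collapses to a single metadata block.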
DedupHTTP Advantages
- Online Deduplication
- Detects redundancy between different resources and versions of resources
- High detail in redundancy detection
Evaluation
- Implemented in proxies on a Web browser machine and a Web server machine, connected through the Internet rather than a LAN
- Compared Systems
  - DedupHTTP
  - Gzip
  - DedupHTTP + Gzip (Hybrid)
  - Delta-Encoding
  - Classical Caching

Workload Name        Number of resources   Total Size
Cnn.com              337                   42 MB
Engadget.com         335                   36 MB
Huffingtonpost.com   401                   62 MB
Evaluation: Average Chunk Size
Evaluation: Comparison of redundancy detected
Evaluation: Time To Display Over Internet
Conclusions
- HTTP transfers can be greatly improved
- DedupHTTP: a solution that takes most of the good points of previous solutions and discards the bad ones
- DedupHTTP was evaluated against reference solutions for Web deduplication when accessing relevant Web sites
- Traffic savings of up to 94.5% without deterioration of Time To Display
Future Work
- Reorganize and improve response metadata
- Create a resource storage heuristic that works for sites that can be accessed through several Web Servers
Questions?
technology from seed
http://www.gsd.inesc-id.pt/
rfilipe_at_gsd.inesc-id.pt