1
Fast, Inexpensive Content-Addressed Storage in Foundation

Sean Rhea (Meraki, Inc.)
Russ Cox, Alex Pesterev (MIT CSAIL)
Work done while at Intel Research, Berkeley.

2
Digital Dark Ages?

Users increasingly store their most valuable data digitally
Wedding/baby photographs
Letters (now called email)
Diaries, scrapbooks, tax returns
Yet digital information remains especially vulnerable
Terry Kuny: "We are living in the midst of digital Dark Ages"
Hard drives crash
Removable media evolve (e.g., 5¼" floppies)
File formats become obsolete (e.g., WordStar, Lotus 1-2-3)
What will the world remember of the late 20th century?

3
As a community, we're not bad at storing important data over the long term.
We've only just begun to think about how we'll interpret that data 30 years from now.

4
For Example

Viewing an old PowerPoint presentation
Do we still have PowerPoint at all? And Windows?
Does the presentation use non-standard fonts/codecs?
Has some newer application overwritten a shared library with an incompatible version ("DLL Hell")?
Not just a Microsoft problem: consider a web page
Even current IE/Safari/Firefox don't agree on formatting
All kinds of plugins necessary: sound, video, Flash

5
The Foundation Idea

Make daily backups of entire software stack
Archives user's applications, OS, and configuration state
Don't worry about identifying dependencies
Just save it all: every byte, every night
To recover an obscure file, boot the relevant stack in an emulator
View file with the application that created it

6
Foundation FAQ

Why preserve the entire disk?
Preserve software stack dependencies: preserve the data with the right application, libraries, and operating system as a single unit
Works for all applications, not just ones designed for preservation
Why daily images?
Want to preserve machine state as close as possible to last write of user's data (i.e., preserve image before something changes)
Also allows recovery from user errors
Why emulate hardware?
Much better track record than emulating software
Software example: OpenOffice emulating Microsoft Word (yikes)
Hardware emulators available today for Amiga, PDP-11, Nintendo

7
I would love to give a talk about why Foundation is a great solution to the digital preservation problem.
Really, though, I think it's just a pretty good start.
Instead, I'm going to talk about a fun problem we had to solve to make it work.

8
Every Byte, Every Night? Indefinitely? Really?

Plan 9 did exactly that
Archive changed blocks every night to optical jukebox
Found that storage capacity grew faster than usage
Later with Content-Addressable Storage (Venti)
Automatically coalesces duplicate data to save space
Required multiple, high-speed disks for performance
Challenge for Foundation: provide similar storage efficiency on consumer hardware
"Time Machine" model: one external USB drive

9
Talk Outline

Introduction
What is Foundation?
Review of Content-Addressed Storage (Venti)
Contributions
Making Cheap Content-Addressed Storage Fast
Avoiding Concerns over Hash Collisions
Related Work
Conclusions

10
Venti Review

Plan 9 file system was two-level
Spinning storage, mostly a normal file system
Archival storage, optical write-once jukebox
Venti replaced optical jukebox
Still write-once
Chunks of data named by their SHA-1 hashes
Content-Addressable Storage (CAS)
Automatically coalesces duplicate writes
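
The naming-by-hash idea on this slide can be sketched in a few lines. This is a toy illustration, not Venti's actual implementation: the class name `TinyCAS` and its methods are made up here, the log and index live in memory, and Python's `hashlib` supplies the SHA-1.

```python
import hashlib


class TinyCAS:
    """Toy content-addressed store: append-only log plus hash -> offset index."""

    def __init__(self):
        self.log = bytearray()   # stands in for the write-once data log
        self.index = {}          # SHA-1 digest -> (offset, length) in the log

    def put(self, block: bytes) -> bytes:
        """Store a block and return its name (the SHA-1 of its contents)."""
        name = hashlib.sha1(block).digest()
        if name not in self.index:            # new contents: append to log
            self.index[name] = (len(self.log), len(block))
            self.log += block
        return name                           # duplicate: no log write

    def get(self, name: bytes) -> bytes:
        """Map a hash back to its log offset and read the block."""
        offset, length = self.index[name]
        return bytes(self.log[offset:offset + length])


cas = TinyCAS()
a = cas.put(b"wedding photo")
b = cas.put(b"wedding photo")   # same contents, e.g. from a second backup
```

Both calls return the same name, and the log holds the contents only once, which is the coalescing behavior the slide describes.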

11
Venti Review

(Diagram: archiving with Venti. As each block is read from the user's hard drive, Venti asks "seen it before?" by looking up the block's hash in the Hash → Offset index. A new block is appended to the data log and the index is updated; a duplicate causes no log write. In both cases the hash is appended to the summary. Regions shown: user's hard drive, external USB drive, RAM.)

12
Venti Review

(Diagram: restoring after a crash. For each hash in the summary, Venti looks up the hash in the Hash → Offset index to map it to a log offset, reads the block from the data log, and restores the block to the user's hard drive. Regions shown: user's hard drive, external USB drive, RAM.)
Final step (not shown): archive summary in data log as well

13
Notes on Venti

The Good News
CAS stores each block with particular contents only once
Changing any one block and re-archiving uses only one more block in archive
Adding a duplicate file from a different source uses no additional storage
The Bad News
Synchronous, random reads to on-disk index

14
Venti Review

(Diagram: the archiving path again, highlighting the "seen it before?" check. The lookup has to seek to the right bucket of the Hash → Offset index.)

15
Venti Review

(Diagram: the restore path again, highlighting the lookup of the 1st block's hash. Mapping the hash to a log offset has to seek to the right bucket of the Hash → Offset index.)

16
Notes on Venti

The Good News
CAS stores each block with particular contents only once
Changing any one block and re-archiving uses only one more block in archive
Adding a duplicate file from a different source uses no additional storage
The Bad News
Synchronous, random reads to on-disk index
Best case, one-disk performance for 512-byte blocks:
One 5 ms seek per 512 bytes archived ≈ 100 kB/s
That's 12 days to archive a 100 GB disk!
Larger blocks give better throughput, but less sharing
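
The arithmetic behind these figures can be checked with a short script. It is only a back-of-the-envelope sketch: it assumes decimal gigabytes and exactly one 5 ms seek per block, and the extra block sizes (4 KB, 64 KB) are illustrative choices, not numbers from the talk.

```python
SEEK_TIME_S = 0.005          # one 5 ms seek per index lookup (from the slide)
DISK_BYTES = 100 * 10**9     # 100 GB disk, assuming decimal gigabytes


def archive_days(block_bytes: int) -> float:
    """Days to archive the disk if every block costs one index seek."""
    throughput = block_bytes / SEEK_TIME_S      # bytes per second
    return DISK_BYTES / throughput / 86_400     # 86,400 seconds per day


for size in (512, 4096, 65536):
    print(f"{size:>6}-byte blocks: {size / SEEK_TIME_S / 1000:8.1f} kB/s, "
          f"{archive_days(size):5.2f} days")
```

For 512-byte blocks this gives roughly 100 kB/s and between 11 and 12 days, in line with the slide. Throughput scales linearly with block size, which is why larger blocks help speed at the cost of finding fewer duplicates.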

17
Notes on Venti (cont.)

Venti's solution: use 8 high-speed disks for index
Untenable in consumer space
Wears disks out pretty quickly, too
The compare-by-hash controversy
Fear of hash collisions: two different blocks with same hash breaks Venti
May be very unlikely, but cost (data corruption) is huge
Does CAS really require a cryptographically strong hash?

18
Talk Outline

Introduction
What is Foundation?
Review of Content-Addressed Storage (Venti)
Contributions
Making Cheap Content-Addressed Storage Fast
Avoiding Concerns over Hash Collisions
Related Work
Conclusions

19
Making Inexpensive CAS Fast

The problem: disk seeks
Secure hash randomizes an otherwise sequential disk-to-disk transfer
To reduce seeks, must reduce hash table lookups
When do hash table lookups occur?
1. When writing data, to determine if we've seen it before
2. When writing data, to update the index
3. When reading data, to map hashes to disk locations

20
2. Updating the Index

After appending a block to the data log, must update the index
Pseudorandom hash causes a seek

21
Updating the Index

(Diagram: after a block read from the user's hard drive is appended to the data log, updating the Hash → Offset index has to seek to the right bucket.)

22
2. Updating the Index

After appending a block to the data log, must update the index
Pseudorandom hash causes a seek
Easy to fix: use a write-back index cache
Store index writes in memory
Flush to disk sequentially in large batches
On crash, reconstruct index from the data log
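
A minimal sketch of such a write-back cache follows. It is an illustration under assumptions, not Foundation's code: `WriteBackIndex` and `rebuild_from_log` are hypothetical names, the "on-disk" buckets are plain Python dicts, and the bucket is chosen from the first four bytes of the hash.

```python
import hashlib


class WriteBackIndex:
    """Toy index: buffers updates in RAM, flushes them in bucket order."""

    def __init__(self, num_buckets: int, flush_threshold: int):
        self.buckets = [dict() for _ in range(num_buckets)]  # "on-disk" index
        self.pending = {}            # hash -> log offset, not yet on disk
        self.flush_threshold = flush_threshold
        self.flushes = 0

    def _bucket(self, digest: bytes) -> int:
        return int.from_bytes(digest[:4], "big") % len(self.buckets)

    def insert(self, digest: bytes, offset: int) -> None:
        self.pending[digest] = offset          # no disk seek here
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Visit buckets in ascending order so the disk is swept sequentially.
        for digest in sorted(self.pending, key=self._bucket):
            self.buckets[self._bucket(digest)][digest] = self.pending[digest]
        self.pending.clear()
        self.flushes += 1

    def lookup(self, digest: bytes):
        if digest in self.pending:             # check the RAM buffer first
            return self.pending[digest]
        return self.buckets[self._bucket(digest)].get(digest)


def rebuild_from_log(blocks, num_buckets=16):
    """Crash recovery: rescan the data log and re-insert every block."""
    index = WriteBackIndex(num_buckets, flush_threshold=1 << 30)
    offset = 0
    for block in blocks:
        index.insert(hashlib.sha1(block).digest(), offset)
        offset += len(block)
    index.flush()                              # one sequential batch
    return index


log_blocks = [b"alpha", b"beta", b"gamma"]
recovered = rebuild_from_log(log_blocks)
```

Because the data log already contains every block, losing the buffered updates in a crash costs only a rescan, never data.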

23
3. Mapping Hashes to Disk Locations During Reads

To restore disk:
Start with the list of original blocks' hashes
Look up each block in index
Read block from data log and restore to disk

24
(Diagram: the restore path. Looking up the hash of the 1st block, to map the hash to a log offset, has to seek to the right bucket of the Hash → Offset index.)

25
3. Mapping Hashes to Disk Locations During Reads

To restore disk:
Start with the list of original blocks' hashes
Look up each block in index
Read block from data log and restore to disk
Observation: data log is mostly ordered
Duplicate blocks often occur as part of duplicate files

26
Ordering in Data Log

(Diagram: the summary, the Hash → Offset index, and the data log, shown to illustrate the ordering of blocks in the data log.)

27
3. Mapping Hashes to Disk Locations During Reads

To restore disk:
Start with the list of original blocks' hashes
Look up each block in index
Read block from data log and restore to disk
Observation: data log is mostly ordered
Duplicate blocks often occur as part of duplicate files
Idea: add another index, ordered by log offset
Read-ahead in this index to eliminate future lookups in original index
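
The read-ahead idea can be sketched as follows. This is a simplified model rather than Foundation's implementation: `PrefetchingIndex` is a hypothetical name, blocks are assumed fixed-size so a log position stands in for an offset, and a counter stands in for real disk seeks.

```python
class PrefetchingIndex:
    """Toy primary index plus a secondary index ordered by log position."""

    def __init__(self, log_hashes, readahead=4):
        # Secondary index: hashes in log order (entry i = i-th block in log).
        self.by_offset = list(log_hashes)
        # Primary index: hash -> log position; each lookup models a disk seek.
        self.by_hash = {h: i for i, h in enumerate(self.by_offset)}
        self.readahead = readahead
        self.cache = {}          # prefetched hash -> position, held in RAM
        self.seeks = 0

    def lookup(self, h):
        if h in self.cache:                    # prefetched earlier: no seek
            return self.cache[h]
        self.seeks += 1                        # seek into the primary index
        pos = self.by_hash.get(h)
        if pos is not None:
            self.seeks += 1                    # seek into the secondary index
            window = self.by_offset[pos:pos + self.readahead]
            for i, nxt in enumerate(window):
                self.cache[nxt] = pos + i      # remember the neighbors
        return pos


log_hashes = [f"h{i}" for i in range(8)]       # placeholder hash values
idx = PrefetchingIndex(log_hashes, readahead=8)
positions = [idx.lookup(h) for h in log_hashes]
```

When the hashes are requested in the same order they sit in the log, only the first lookup touches the disk indexes; the rest are answered from the prefetched window.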

28
Index by Offset

(Diagram: restoring after a crash with a new secondary index, sorted by offset. Looking up the hash of the 1st block maps the hash to a log offset in the primary index (seek!), prefetches hashes for the next few offsets from the secondary index (seek!), and reads the block from the log (seek!). Looking up the hash of the 2nd block finds its log offset in the prefetched secondary index entries (no seek!) and reads the block from the log (no seek!).)

29
1. Is a Block New, or Duplicate?

Optimization for reads also helps duplicate writes
Index misses on first duplicate block
Hits on subsequent blocks rewritten in same order
Doesn't help for new data
Every lookup in primary index fails
Still suffer a seek for every new block

30
1. Is a Block New, or Duplicate?

Idea: use a Bloom filter to identify new blocks
Lossy representation of the primary index
Uses much less memory than index itself
For any given block, Bloom filter tells us:
It's definitely new → append to log, update index
It might be duplicate → look up in index
If it really is a duplicate, we get the prefetch benefit
Otherwise, called a false positive
Using enough memory keeps false positives at 1%
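
A small Bloom filter shows how the "definitely new" versus "might be duplicate" answer works. The sketch below is generic, not Foundation's filter: the class name, the use of SHA-256 to derive bit positions, and the sizing parameters are all assumptions. The sizing uses the standard formula, which comes to about 9.6 bits per block for a 1% false-positive rate.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hashes.
        self.num_bits = math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2)
        self.num_hashes = max(
            1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely new"; True means "might be a duplicate".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter(expected_items=10_000, false_positive_rate=0.01)
for i in range(10_000):
    seen.add(f"block-{i}".encode())
false_hits = sum(seen.might_contain(f"new-{i}".encode())
                 for i in range(10_000))
```

A block that was added is never reported as new, so the filter can only cause an unnecessary index lookup, never a missed duplicate.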

31
Results

Do these optimizations pay off?
Buffering index writes is an obvious win
Bloom filter is, too: removes 99% of seeks when writing new data
Both trade RAM for seeks
Benefit of secondary index less clear
If duplicate data comes in long sequences, it reduces index seeks to two per sequence
If duplicate data comes in little fragments, it doubles the number of index seeks
Need traces of real data to answer this question
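
The secondary-index trade-off can be put as a back-of-the-envelope model. It rests on a strong assumption that is not from the talk: read-ahead always covers the whole run of duplicates. The function name `index_seeks` is made up for illustration.

```python
def index_seeks(run_lengths, secondary_index: bool) -> int:
    """Index seeks needed to look up duplicate data arriving in runs.

    run_lengths lists how many consecutive duplicate blocks each run has.
    """
    if secondary_index:
        # One primary-index seek plus one secondary-index prefetch per run.
        return 2 * len(run_lengths)
    # Without the secondary index, every block costs a primary-index seek.
    return sum(run_lengths)


long_run = [1000]            # one sequence of 1000 duplicate blocks
fragments = [1] * 1000       # 1000 isolated duplicate blocks

with_long = index_seeks(long_run, secondary_index=True)
without_long = index_seeks(long_run, secondary_index=False)
with_frag = index_seeks(fragments, secondary_index=True)
without_frag = index_seeks(fragments, secondary_index=False)
```

Under this model the secondary index wins whenever runs average more than two blocks, which is why the answer depends on traces of real data.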

32
Results (cont.)

Research group at MIT has been running Venti as its backup server for two years
We looked at 400 nightly snapshots
Simulated archiving and restoring these in both Venti and Foundation

33
Talk Outline

Introduction
What is Foundation?
Review of Content-Addressed Storage (Venti)
Contributions
Making Cheap Content-Addressed Storage Fast
Avoiding Concerns over Hash Collisions
Related Work
Conclusions

34
Eliminating Compare by Hash

Some worried that same SHA-1 doesn't imply same contents (i.e., hash collisions are possible)
Even if very rare, consequences (corruption) too great
Stepping back a bit: CAS as a black box
Give it a data block, get back an opaque ID
Give it an opaque ID, get back the data block
Do we care that the ID is a SHA-1 hash?
What if the opaque ID was just the block's location in the data log?

35
Using Locations As IDs

Pros
Reads require no index lookups at all
System can still find potential duplicates using hashing (with a weaker, faster hash function)
Cons
Need another mechanism to check integrity
Since hash untrusted, must compare suspected duplicates byte-by-byte
Others have claimed these byte-by-byte comparisons are a non-starter
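
A sketch of location-based IDs with byte-by-byte confirmation follows. It is illustrative only: `LocationCAS` is a hypothetical name, CRC-32 from Python's `zlib` is chosen here as an example of a weak, fast hash (the talk does not name one), the log is held in memory, and the separate integrity check listed under Cons is not shown.

```python
import zlib


class LocationCAS:
    """Toy store whose block IDs are log offsets, not hashes."""

    def __init__(self, block_size: int = 512):
        self.block_size = block_size
        self.log = bytearray()
        self.candidates = {}     # weak hash -> list of log offsets

    def put(self, block: bytes) -> int:
        if len(block) != self.block_size:
            raise ValueError("block has the wrong size")
        weak = zlib.crc32(block)               # fast, not collision-resistant
        for offset in self.candidates.get(weak, []):
            # The hash is untrusted, so confirm the match byte by byte.
            if self.log[offset:offset + self.block_size] == block:
                return offset                  # true duplicate: reuse its ID
        offset = len(self.log)
        self.log += block                      # new block: append to log
        self.candidates.setdefault(weak, []).append(offset)
        return offset

    def get(self, offset: int) -> bytes:
        # No index lookup: the ID is the location.
        return bytes(self.log[offset:offset + self.block_size])


store = LocationCAS(block_size=4)
first = store.put(b"AAAA")
second = store.put(b"BBBB")
again = store.put(b"AAAA")     # duplicate, confirmed by comparison
```

A hash collision here only costs an extra comparison: two different blocks that share a weak hash both end up in the log under their own offsets.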

36
2nd Disk Arm to the Rescue

Once we eliminate most index reads (via our previous optimizations), the backup disk is otherwise idle while backing up duplicate data
Can instead put it to work doing byte-by-byte comparisons of suspected duplicates

37
Talk Outline

Introduction
What is Foundation?
Review of Content-Addressed Storage (Venti)
Contributions
Making Cheap Content-Addressed Storage Fast
Avoiding Concerns over Hash Collisions
Related Work
Conclusions

38
Related Work

Apple Time Machine
Duplicates coalesced at file level via hard links
NetApp WAFL, ZFS
Copy-on-write coalesces blocks at the FS level
Misses duplicates that come into system separately
Data Domain Deduplication FS
Very similar to Foundation, in enterprise context
Depends on collision-freeness of hash function
Lots of other Content-Addressed Storage work
LBFS, SUNDR, Peabody

39
Conclusions

Consumer-grade CAS works now
A single, external USB drive is enough
Just have to be crafty about avoiding seeks
Lots of uses other than preservation
E.g., inexpensive household backup server that automatically coalesces duplicate media collections
Doesn't require a collision-free hash function
