Title: Fast, Inexpensive Content-Addressed Storage in Foundation
Slide 1: Fast, Inexpensive Content-Addressed Storage in Foundation
- Sean Rhea (Meraki, Inc.)
- Russ Cox, Alex Pesterev (MIT CSAIL)
- Work done while at Intel Research, Berkeley.
Slide 2: Digital Dark Ages?
- Users increasingly store their most valuable data digitally
  - Wedding/baby photographs
  - Letters (now called email)
  - Diaries, scrapbooks, tax returns
- Yet digital information remains especially vulnerable
  - Terry Kuny: "We are living in the midst of digital Dark Ages"
  - Hard drives crash
  - Removable media evolve (e.g., 5¼" floppies)
  - File formats become obsolete (e.g., WordStar, Lotus 1-2-3)
- What will the world remember of the late 20th century?
Slide 3
- As a community, we're not bad at storing important data over the long term.
- We've only just begun to think about how we'll interpret that data 30 years from now.
Slide 4: For Example
- Viewing an old PowerPoint presentation
  - Do we still have PowerPoint at all? And Windows?
  - Does the presentation use non-standard fonts/codecs?
  - Has some newer application overwritten a shared library with an incompatible version ("DLL Hell")?
- Not just a Microsoft problem: consider a web page
  - Even current IE/Safari/Firefox don't agree on formatting
  - All kinds of plugins necessary: sound, video, Flash
Slide 5: The Foundation Idea
- Make daily backups of entire software stack
  - Archives user's applications, OS, and configuration state
- Don't worry about identifying dependencies
  - Just save it all: every byte, every night
- To recover an obscure file, boot the relevant stack in an emulator
  - View file with the application that created it
Slide 6: Foundation FAQ
- Why preserve the entire disk?
  - Preserve software stack dependencies: preserve the data with the right application, libraries, and operating system as a single unit
  - Works for all applications, not just ones designed for preservation
- Why daily images?
  - Want to preserve machine state as close as possible to last write of user's data (i.e., preserve image before something changes)
  - Also allows recovery from user errors
- Why emulate hardware?
  - Much better track record than emulating software
  - Software example: OpenOffice emulating Microsoft Word (yikes)
  - Hardware emulators available today for Amiga, PDP-11, Nintendo
Slide 7
- I would love to give a talk about why Foundation is a great solution to the digital preservation problem.
- Really, though, I think it's just a pretty good start.
- Instead, I'm going to talk about a fun problem we had to solve to make it work.
Slide 8: Every Byte, Every Night? Indefinitely? Really?
- Plan 9 did exactly that
  - Archive changed blocks every night to optical jukebox
  - Found that storage capacity grew faster than usage
- Later with Content-Addressable Storage (Venti)
  - Automatically coalesces duplicate data to save space
  - Required multiple, high-speed disks for performance
- Challenge for Foundation: provide similar storage efficiency on consumer hardware
  - "Time Machine model": one external USB drive
Slide 9: Talk Outline
- Introduction
  - What is Foundation?
  - Review of Content-Addressed Storage (Venti)
- Contributions
  - Making Cheap Content-Addressed Storage Fast
  - Avoiding Concerns over Hash Collisions
- Related Work
- Conclusions
Slide 10: Venti Review
- Plan 9 file system was two-level
  - Spinning storage, mostly a normal file system
  - Archival storage, optical write-once jukebox
- Venti replaced optical jukebox
  - Still write-once
  - Chunks of data named by their SHA-1 hashes
  - Content-Addressable Storage (CAS)
  - Automatically coalesces duplicate writes
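The idea of naming chunks by their SHA-1 hashes can be sketched in a few lines. This is a minimal in-memory illustration, not Venti's actual code: the class name `TinyCAS` and its `log` and `index` fields are hypothetical, and a real system keeps both structures on disk.

```python
import hashlib


class TinyCAS:
    """Toy write-once, content-addressed store (illustrative only)."""

    def __init__(self):
        self.log = []    # append-only data log: a list of blocks
        self.index = {}  # SHA-1 digest -> offset (position) in the log

    def write(self, block: bytes) -> bytes:
        """Store a block and return its SHA-1 name; duplicates are coalesced."""
        name = hashlib.sha1(block).digest()
        if name not in self.index:        # seen it before?
            self.index[name] = len(self.log)
            self.log.append(block)        # append to the log only if new
        return name

    def read(self, name: bytes) -> bytes:
        """Map the hash to a log offset, then read the block from the log."""
        return self.log[self.index[name]]


cas = TinyCAS()
a = cas.write(b"hello")
b = cas.write(b"world")
c = cas.write(b"hello")  # duplicate write: returns the same name, no new log entry
```

Writing the same contents twice yields the same name and consumes no extra space in the log, which is the coalescing property the slide refers to.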
Slide 11: Venti Review
[Diagram: archiving blocks from the user's hard drive. Steps shown: read a block; "seen it before?" (look up its hash in the Hash → Offset index); append to log; update index; append hash to summary. For a block that has been seen before: no log write! Components shown: User's Hard Drive, External USB Drive, RAM, Data Log, Summary.]
Slide 12: Venti Review
[Diagram: restoring after a crash. Steps shown: look up hash of 1st block in the summary; map hash to log offset via the Hash → Offset index; read block from log; restore block. Components shown: User's Hard Drive, External USB Drive, RAM, Data Log, Summary.]
- Final step (not shown): archive summary in data log as well
Slide 13: Notes on Venti
- The Good News
  - CAS stores each block with particular contents only once
  - Changing any one block and re-archiving uses only one more block in archive
  - Adding a duplicate file from a different source uses no additional storage
- The Bad News
  - Synchronous, random reads to on-disk index
Slide 14: Venti Review
[Diagram: archiving again, at the "seen it before?" step for the 4th block. Callout: the lookup in the Hash → Offset index has to seek to the right bucket. Components shown: User's Hard Drive, External USB Drive, RAM, Data Log, Summary.]
Slide 15: Venti Review
[Diagram: restoring again, at the "map hash to log offset" step for the 1st block. Callout: the lookup in the Hash → Offset index has to seek to the right bucket. Components shown: User's Hard Drive, External USB Drive, RAM, Data Log, Summary.]
Slide 16: Notes on Venti
- The Good News
  - CAS stores each block with particular contents only once
  - Changing any one block and re-archiving uses only one more block in archive
  - Adding a duplicate file from a different source uses no additional storage
- The Bad News
  - Synchronous, random reads to on-disk index
  - Best case, one-disk performance for 512-byte blocks: one 5 ms seek per 512 bytes archived ≈ 100 kB/s
  - That's 12 days to archive a 100 GB disk!
  - Larger blocks give better throughput, less sharing
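The throughput figure follows directly from the seek time and block size on the slide. A quick check of the arithmetic, assuming decimal units (1 kB = 1000 bytes, 1 GB = 10^9 bytes); the variable names are only for illustration:

```python
SEEK_TIME_S = 0.005          # one 5 ms seek per index lookup (from the slide)
BLOCK_SIZE_B = 512           # bytes archived per seek (from the slide)
DISK_SIZE_B = 100 * 10**9    # 100 GB disk, assuming decimal gigabytes

throughput_bps = BLOCK_SIZE_B / SEEK_TIME_S    # bytes per second
days = DISK_SIZE_B / throughput_bps / 86_400   # seconds -> days

print(f"{throughput_bps / 1000:.1f} kB/s, {days:.1f} days")
```

The exact quotient is a little over 100 kB/s, so the result lands between 11 and 12 days; rounding the rate down to 100 kB/s gives the slide's figure of roughly 12 days.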
Slide 17: Notes on Venti (cont.)
- Venti's solution: use 8 high-speed disks for index
  - Untenable in consumer space
  - Wears disks out pretty quickly, too
- The compare-by-hash controversy
  - Fear of hash collisions: two different blocks with same hash breaks Venti
  - May be very unlikely, but cost (data corruption) is huge
  - Does CAS really require a cryptographically strong hash?
Slide 18: Talk Outline
- Introduction
  - What is Foundation?
  - Review of Content-Addressed Storage (Venti)
- Contributions
  - Making Cheap Content-Addressed Storage Fast
  - Avoiding Concerns over Hash Collisions
- Related Work
- Conclusions
Slide 19: Making Inexpensive CAS Fast
- The problem: disk seeks
  - Secure hash randomizes an otherwise sequential disk-to-disk transfer
  - To reduce seeks, must reduce hash table lookups
- When do hash table lookups occur?
  1. When writing data, to determine if we've seen it before
  2. When writing data, to update the index
  3. When reading data, to map hashes to disk locations
Slide 20: 2. Updating the Index
- After appending a block to the data log, must update the index
  - Pseudorandom hash causes a seek
Slide 21: Updating the Index
[Diagram: archiving the 2nd block. Steps shown: read the block; append to log; update index. Callout: the update to the Hash → Offset index has to seek to the right bucket. Components shown: User's Hard Drive, External USB Drive, RAM, Data Log, Summary.]
Slide 22: 2. Updating the Index
- After appending a block to the data log, must update the index
  - Pseudorandom hash causes a seek
- Easy to fix: use a write-back index cache
  - Store index writes in memory
  - Flush to disk sequentially in large batches
  - On crash, reconstruct index from the data log
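A write-back index cache along these lines might look like the sketch below. It is an in-memory model under stated assumptions, not Foundation's implementation: `WriteBackIndex`, `flush_threshold`, and `rebuild_index` are hypothetical names, and the `on_disk` dictionary merely stands in for the on-disk index.

```python
import hashlib


def sha1(block: bytes) -> bytes:
    return hashlib.sha1(block).digest()


class WriteBackIndex:
    """Toy index that buffers updates in RAM and flushes them in sorted batches."""

    def __init__(self, flush_threshold=4):
        self.on_disk = {}   # stands in for the on-disk hash -> offset index
        self.pending = {}   # index writes held in memory
        self.flush_threshold = flush_threshold
        self.flushes = 0    # number of batch writes performed

    def update(self, h, offset):
        self.pending[h] = offset
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Apply buffered entries in hash (bucket) order, modelling one
        # sequential pass over the index instead of one seek per entry.
        for h in sorted(self.pending):
            self.on_disk[h] = self.pending[h]
        self.pending.clear()
        self.flushes += 1

    def lookup(self, h):
        if h in self.pending:
            return self.pending[h]
        return self.on_disk.get(h)


def rebuild_index(data_log):
    """After a crash, reconstruct hash -> offset by rescanning the data log."""
    return {sha1(block): offset for offset, block in enumerate(data_log)}


log = [b"a", b"b", b"c", b"d", b"e"]
idx = WriteBackIndex(flush_threshold=4)
for off, blk in enumerate(log):
    idx.update(sha1(blk), off)
```

Entries still in `pending` would be lost in a crash, which is why the slide pairs the cache with reconstruction from the data log: the log itself holds everything needed to recompute the mapping.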
Slide 23: 3. Mapping Hashes to Disk Locations During Reads
- To restore disk
  - Start with the list of original blocks' hashes
  - Look up each block in index
  - Read block from data log and restore to disk
Slide 24
[Diagram: restoring the 1st block. Steps shown: look up hash of 1st block in the summary; map hash to log offset via the Hash → Offset index. Callout: have to seek to the right bucket. Components shown: User's Hard Drive, External USB Drive, RAM, Data Log, Summary.]
Slide 25: 3. Mapping Hashes to Disk Locations During Reads
- To restore disk
  - Start with the list of original blocks' hashes
  - Look up each block in index
  - Read block from data log and restore to disk
- Observation: data log is mostly ordered
  - Duplicate blocks often occur as part of duplicate files
Slide 26: Ordering in Data Log
[Diagram: the Hash → Offset index, the summary, and the data log, illustrating the ordering of blocks in the data log. Components shown: User's Hard Drive, External USB Drive, RAM, Data Log, Summary.]
Slide 27: 3. Mapping Hashes to Disk Locations During Reads
- To restore disk
  - Start with the list of original blocks' hashes
  - Look up each block in index
  - Read block from data log and restore to disk
- Observation: data log is mostly ordered
  - Duplicate blocks often occur as part of duplicate files
- Idea: add another index, ordered by log offset
  - Read ahead in this index to eliminate future lookups in original index
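The read-ahead idea can be modelled as follows. This is a sketch under simplifying assumptions, with hypothetical names throughout (`PrefetchingReader`, `readahead`, `primary_lookups`): both indexes live in memory here, the prefetch cache is unbounded, and each miss in the cache is counted as one random seek in the primary index.

```python
import hashlib


def sha1(block: bytes) -> bytes:
    return hashlib.sha1(block).digest()


class PrefetchingReader:
    """Toy restore path with a secondary index ordered by log offset."""

    def __init__(self, data_log, readahead=4):
        self.log = data_log
        self.primary = {sha1(b): off for off, b in enumerate(data_log)}  # hash -> offset
        self.by_offset = [sha1(b) for b in data_log]                     # offset -> hash
        self.readahead = readahead
        self.cache = {}            # in-RAM hash -> offset entries from prefetching
        self.primary_lookups = 0   # each one models a random seek

    def locate(self, h):
        if h in self.cache:
            return self.cache[h]   # found via earlier read-ahead: no primary lookup
        self.primary_lookups += 1  # seek to the right bucket in the primary index
        off = self.primary[h]
        # Read ahead in the offset-ordered index: remember the hashes stored
        # at the next few log offsets so that later lookups hit in RAM.
        end = min(off + self.readahead, len(self.by_offset))
        for o in range(off, end):
            self.cache[self.by_offset[o]] = o
        return off

    def read(self, h):
        return self.log[self.locate(h)]


blocks = [bytes([i]) for i in range(8)]
reader = PrefetchingReader(blocks, readahead=4)
restored = [reader.read(sha1(b)) for b in blocks]
```

When blocks are restored in the order they were logged, only the first lookup in each read-ahead window touches the primary index; the rest are answered from the prefetched entries.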
Slide 28: Index by Offset
[Diagram: restoring after a crash with a new index, sorted by offset, alongside the Hash → Offset index. Steps shown for the 1st block: look up hash of 1st block; map hash to log offset (seek!); read block from log (seek!); prefetch hashes for next few offsets from secondary index (seek!); restore block. Steps shown for the 2nd block: look up hash of 2nd block; find log offset in secondary index (no seek!); read block from log (no seek!). Components shown: User's Hard Drive, External USB Drive, RAM, Data Log, Summary.]
Slide 29: 1. Is a Block New, or Duplicate?
- Optimization for reads also helps duplicate writes
  - Index misses on first duplicate block
  - Hits on subsequent blocks rewritten in same order
- Doesn't help for new data
  - Every lookup in primary index fails
  - Still suffer a seek for every new block
Slide 30: 1. Is a Block New, or Duplicate?
- Idea: use a Bloom filter to identify new blocks
  - Lossy representation of the primary index
  - Uses much less memory than index itself
- For any given block, Bloom filter tells us
  - It's definitely new → append to log, update index
  - It might be duplicate → look up in index
    - If it really is a duplicate, we get the prefetch benefit
    - Otherwise, called a false positive
- Using enough memory keeps false positives at 1%
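For readers unfamiliar with Bloom filters, here is a minimal sketch of the "definitely new" versus "might be duplicate" test. The sizing (`num_bits`, `num_hashes`) and the way bit positions are derived from a SHA-256 digest are assumptions chosen for brevity, not Foundation's parameters; `num_bits` must be a multiple of 8 and `num_hashes` at most 8 for this particular construction.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions from 4-byte slices of one SHA-256 digest
        # (an illustrative choice; any set of independent hashes would do).
        digest = hashlib.sha256(key).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely new"; True means "might be a duplicate".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


bf = BloomFilter()
new_before = not bf.might_contain(b"block-1")  # empty filter: definitely new
bf.add(b"block-1")
maybe_dup_after = bf.might_contain(b"block-1")  # added keys always answer True
```

A "definitely new" answer lets the writer skip the index lookup entirely and append straight to the log; only a "might be duplicate" answer costs a lookup, and the false-positive rate falls as more memory is given to the bit array.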
Slide 31: Results
- Do these optimizations pay off?
  - Buffering index writes is an obvious win
  - Bloom filter is, too: removes 99% of seeks when writing new data
  - Both trade RAM for seeks
- Benefit of secondary index less clear
  - If duplicate data comes in long sequences, it reduces index seeks to two per sequence
  - If duplicate data comes in little fragments, it doubles the number of index seeks
  - Need traces of real data to answer this question
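The trade-off for the secondary index can be made concrete with a simple counting model. It assumes, as a simplification, that one read-ahead covers an entire run of duplicate blocks; the function `index_seeks` and the two sample workloads are hypothetical.

```python
def index_seeks(run_lengths, use_secondary_index):
    """Count index seeks for duplicate data arriving in runs of the given lengths."""
    if use_secondary_index:
        # Per run: one seek in the primary index for the first block,
        # plus one seek in the secondary index to prefetch the rest.
        return 2 * len(run_lengths)
    # Without the secondary index, every duplicate block costs one primary seek.
    return sum(run_lengths)


long_runs = [1000] * 10    # 10 sequences of 1000 duplicate blocks each
fragments = [1] * 10_000   # the same 10,000 blocks as isolated fragments

seeks_long_with = index_seeks(long_runs, True)
seeks_long_without = index_seeks(long_runs, False)
seeks_frag_with = index_seeks(fragments, True)
seeks_frag_without = index_seeks(fragments, False)
```

Under this model, long runs drop from one seek per block to two per run, while single-block fragments pay two seeks where one would have sufficed, which is why the slide calls for traces of real data.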
Slide 32: Results (cont.)
- Research group at MIT has been running Venti as its backup server for two years
- We looked at 400 nightly snapshots
- Simulated archiving and restoring these in both Venti and Foundation
Slide 33: Talk Outline
- Introduction
  - What is Foundation?
  - Review of Content-Addressed Storage (Venti)
- Contributions
  - Making Cheap Content-Addressed Storage Fast
  - Avoiding Concerns over Hash Collisions
- Related Work
- Conclusions
Slide 34: Eliminating Compare by Hash
- Some worried that same SHA-1 doesn't imply same contents (i.e., hash collisions are possible)
  - Even if very rare, consequences (corruption) too great
- Stepping back a bit, CAS as a black box
  - Give it a data block, get back an opaque ID
  - Give it an opaque ID, get back the data block
- Do we care that the ID is a SHA-1 hash?
  - What if the opaque ID was just the block's location in the data log?
Slide 35: Using Locations As IDs
- Pros
  - Reads require no index lookups at all
  - System can still find potential duplicates using hashing (with a weaker, faster hash function)
- Cons
  - Need another mechanism to check integrity
  - Since hash untrusted, must compare suspected duplicates byte-by-byte
  - Others have claimed these byte-by-byte comparisons are a non-starter
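Using log locations as IDs, with a weak hash only as a hint, can be sketched as below. This is an illustrative model rather than Foundation's code: `LocationCAS` and its fields are hypothetical, CRC-32 is an arbitrary stand-in for "a weaker, faster hash function", and the log is held in memory.

```python
import zlib


class LocationCAS:
    """Toy store whose block IDs are log offsets, not hashes."""

    def __init__(self):
        self.log = []          # data log; a block's ID is its position here
        self.candidates = {}   # weak hash -> offsets of blocks with that hash
        self.compares = 0      # number of byte-by-byte comparisons performed

    def write(self, block: bytes) -> int:
        weak = zlib.crc32(block)
        for off in self.candidates.get(weak, []):
            self.compares += 1
            if self.log[off] == block:  # hash is only a hint: verify the bytes
                return off              # true duplicate, reuse its location
        self.log.append(block)
        off = len(self.log) - 1
        self.candidates.setdefault(weak, []).append(off)
        return off

    def read(self, block_id: int) -> bytes:
        return self.log[block_id]       # no index lookup on the read path


store = LocationCAS()
x = store.write(b"aaa")
y = store.write(b"bbb")
z = store.write(b"aaa")  # suspected duplicate, confirmed by comparison
```

Because a match on the weak hash is always confirmed against the stored bytes, a collision can at worst cost an extra comparison; it cannot cause two different blocks to share an ID.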
Slide 36: 2nd Disk Arm to the Rescue
- Once we eliminate most index reads (via our previous optimizations), the backup disk is otherwise idle while backing up duplicate data
- Can instead put it to work doing byte-by-byte comparisons of suspected duplicates
Slide 37: Talk Outline
- Introduction
  - What is Foundation?
  - Review of Content-Addressed Storage (Venti)
- Contributions
  - Making Cheap Content-Addressed Storage Fast
  - Avoiding Concerns over Hash Collisions
- Related Work
- Conclusions
Slide 38: Related Work
- Apple Time Machine
  - Duplicates coalesced at file level via hard links
- NetApp WAFL, ZFS
  - Copy-on-write coalesces blocks at the FS level
  - Misses duplicates that come into system separately
- Data Domain Deduplication FS
  - Very similar to Foundation, in enterprise context
  - Depends on collision-freeness of hash function
- Lots of other Content-Addressed Storage work
  - LBFS, SUNDR, Peabody
Slide 39: Conclusions
- Consumer-grade CAS works now
  - A single, external USB drive is enough
  - Just have to be crafty about avoiding seeks
- Lots of uses other than preservation
  - E.g., inexpensive household backup server that automatically coalesces duplicate media collections
- Doesn't require a collision-free hash function