Title: OceanStore Global-Scale Persistent Storage
1 OceanStore: Global-Scale Persistent Storage
- John Kubiatowicz
- University of California at Berkeley
2 Context: Project Endeavour
Interdisciplinary, Technology-Centered Team
- Alex Aiken, PL
- Eric Brewer, OS
- John Canny, AI
- David Culler, OS/Arch
- Joseph Hellerstein, DB
- Michael Jordan, Learning
- Anthony Joseph, OS
- Randy Katz, Nets
- John Kubiatowicz, Arch
- James Landay, UI
- Jitendra Malik, Vision
- George Necula, PL
- Christos Papadimitriou, Theory
- David Patterson, Arch
- Kris Pister, MEMS
- Larry Rowe, MM
- Alberto Sangiovanni-Vincentelli, CAD
- Doug Tygar, Security
- Robert Wilensky, DL/AI
3 Endeavour Goals
- Enhancing human understanding
- Help people to interact with information, devices, and people
- Exploit Moore's law growth in everything
- Enable new approaches for problem solving and learning
- Figure of merit: how effectively we amplify and leverage human intellect
- Enabling and exploiting ubiquitous computing
- Small devices, sensors, smart materials, cars, etc.
- New methods for design, construction, and administration of ultra-scale systems
- Planetary-scale Information Utilities
- Infrastructure is transparent and always active
- Extensive use of redundancy of hardware and data
- Devices that negotiate their interfaces automatically
- Elements that tune, repair, and maintain themselves
4 Endeavour Maxims
- Exploit Moore's law growth for better behavior
- Use of excess capacity for better human interface
- Personal Information Management is the Killer App
- Not corporate processing but management, analysis, aggregation, dissemination, filtering for the individual
- Automated extraction and organization of daily activities to assist people
- Time to move beyond the Desktop
- Community computing: infer relationships among information, delegate control, establish authority
- Information Technology as a Utility
- Continuous service delivery, on a planetary scale, on top of a highly dynamic information base
5 Endeavour Approach
- Information Devices
- Beyond desktop computers to MEMS sensors/actuators with capture/display to yield enhanced activity spaces
- Information Utility
- Information Applications
- High-Speed/Collaborative Decision Making and Learning
- Augmented Smart Spaces: Rooms and Vehicles
- Design Methodology
- User-centric Design with HW/SW Co-design
- Formal methods for safe and trustworthy decomposable and reusable components
- Fluid, Network-Centric System Software
- Partitioning and management of state between soft and persistent state
- Data processing placement and movement
- Component discovery and negotiation
- Flexible capture, self-organization, and re-use of information
6 OceanStore Context: Ubiquitous Computing
- Computing everywhere
- Desktop, Laptop, Palmtop
- Cars, Cellphones
- Shoes? Clothing? Walls?
- Connectivity everywhere
- Rapid growth of bandwidth in the interior of the net
- Broadband to the home and office
- Wireless technologies such as CDMA, satellite, laser
- Rise of the thin-client metaphor
- Services provided by interior of network
- Incredibly thin clients on the leaves
- MEMS devices -- sensors + CPU + wireless net in 1 mm^3
- Mobile society: people move and devices are disposable
7 Questions about information
- Where is persistent information stored?
- 20th-century tie between location and content is outdated (we all survived the Feb 29th bug -- let's move on!)
- In a world-scale system, locality is key
- How is it protected?
- Can a disgruntled employee of an ISP sell your secrets?
- Can't trust anyone (how paranoid are you?)
- Can we make it indestructible?
- Want our data to survive the big one!
- Highly resistant to hackers (denial of service)
- Wide-scale disaster recovery
- Is it hard to manage?
- Worst failures are human-related
- Want automatic (introspective) diagnosis and repair
8 First Observation: Want Utility Infrastructure
- Mark Weiser from Xerox: transparent computing is the ultimate goal
- Computers should disappear into the background
- In storage context
- Don't want to worry about backup
- Don't want to worry about obsolescence
- Need lots of resources to make data secure and highly available, BUT don't want to own them
- Outsourcing of storage already becoming popular
- Pay monthly fee and your data is out there
- Simple payment interface: one bill from one company
9 Second Observation: Need wide-scale deployment
- Many components with geographic separation
- System not disabled by natural disasters
- Can adapt to changes in demand and regional outages
- Gain in stability through statistics
- Difference between thermodynamics and mechanics? Surprising stability of temperature and pressure given 10^30 molecules with highly variable behavior!
- Wide-scale use and sharing also requires wide-scale deployment
- Bandwidth increasing rapidly, but latency bounded by speed of light
- Handling many people with same system leads to economies of scale
10 OceanStore: Everyone's Data, One Big Utility
- The data is just out there
- Separate information from location
- Locality is only an optimization (an important one!)
- Wide-scale coding and replication for durability
- All information is globally identified
- Unique identifiers are hashes over names and keys
- Single uniform lookup interface replaces DNS, server location, data location
- No centralized namespace required (as in SDSI)
11 Basic Structure: Irregular Mesh of Pools
12 Amusing back-of-the-envelope calculation (courtesy Bill Bolosky, Microsoft)
- How many files in the OceanStore?
- Assume 10^10 people in world
- Say 10,000 files/person (very conservative?)
- So 10^14 files in OceanStore!
- If 1 gig files (not likely), get about 10^23 bytes, on the order of 1 mole of bytes!
- Truly impressive number of elements
- ... but small relative to physical constants
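The back-of-the-envelope numbers can be checked in a few lines. This is only a sketch of the slide's arithmetic: the population, files-per-person, and file-size figures are the slide's own assumptions, and the comparison with Avogadro's number is added here to make the "mole" remark concrete.

```python
# Rough check of the slide's estimate; all inputs are the slide's assumptions.
AVOGADRO = 6.022e23          # entities per mole

people = 10**10              # assumed number of people in the world
files_per_person = 10**4     # 10,000 files per person
bytes_per_file = 10**9       # 1 GB per file (the slide itself calls this unlikely)

total_files = people * files_per_person
total_bytes = total_files * bytes_per_file
moles_of_bytes = total_bytes / AVOGADRO

print(f"files in the OceanStore: {total_files:.0e}")
print(f"bytes in the OceanStore: {total_bytes:.0e}")
print(f"moles of bytes:          {moles_of_bytes:.2f}")
```

With these inputs the total is 10^14 files and 10^23 bytes, which is about one sixth of a mole of bytes.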
13 Utility-based Infrastructure
[Figure: provider network with nodes labeled Canadian OceanStore, Sprint, AT&T, IBM, Pac Bell, IBM]
- Service provided by confederation of companies
- Monthly fee paid to one service provider
- Companies buy and sell capacity from each other
14 Outline
- Motivation
- Properties of the OceanStore and Assumptions
- Specific Technologies and approaches
- Conflict resolution on encrypted data
- Replication and Deep archival storage
- Naming and Data Location
- Introspective computing for optimization and repair
- Economic models
- Conclusion
15 Ubiquitous Devices ⇒ Ubiquitous Storage
- Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc.
- Properties REQUIRED for OceanStore storage substrate
- Strong Security: data encrypted in the infrastructure; resistance to monitoring and denial of service attacks
- Coherence: too much data for naïve users to keep coherent by hand
- Automatic replica management and optimization: huge quantities of data cannot be managed manually
- Simple and automatic recovery from disasters: probability of failure increases with size of system
- Utility model: world-scale system requires cooperation across administrative boundaries
16 State of the Art?
- Widely deployed systems: NFS, AFS (/DFS)
- Single regions of failure, caching only at endpoints
- Cleartext exposed at various levels of system
- Compromised server ⇒ all data on server compromised
- Mobile computing community: Coda, Ficus, Bayou
- Small scale, fixed coherence mechanism
- Not optimized to take advantage of high-bandwidth connections between server components
- Cleartext also exposed at various levels of system
- Web caching community: Inktomi, Akamai
- Specialized, incremental solutions
- Caching along client/server path, various bottlenecks
- Database Community
- Interfaces not usable by legacy applications
- ACID update semantics not always appropriate
17 OceanStore Assumptions
- Untrusted Infrastructure
- The OceanStore is composed of untrusted components
- Only ciphertext within the infrastructure
- Information must not be leaked over time
- Principal Party
- There is one organization that is financially responsible for the integrity of your data
- Mostly Well-Connected
- Data producers and consumers are connected to a high-bandwidth network most of the time
- Exploit multicast for quicker consistency when possible
- Promiscuous Caching
- Data may be cached anywhere, anytime
- Operations Interface with Conflict Resolution
- Applications employ an operations-oriented interface, rather than a file-systems interface
- Coherence is centered around conflict resolution
18 OceanStore Technologies I: Naming and Data Location
- Requirements
- System-level names should help to authenticate data
- Route to nearby data without global communication
- Don't inhibit rapid relocation of data
- OceanStore approach: two-level search with embedded routing
- Underlying namespace is flat and built from secure cryptographic hashes (160-bit SHA-1)
- Search process combines quick, probabilistic search with slower guaranteed search
- Long-distance data location and routing are integrated
- Every source/destination pair has multiple routing paths
- Continuous, on-line optimization adapts for hot spots, denial of service, and inefficiencies in routing
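A minimal sketch of how a flat, self-authenticating namespace could be built from 160-bit SHA-1 hashes. The slides only say that identifiers are hashes over names and keys; the exact byte layout, the function names `make_guid`, `block_guid`, and `verify_block`, and the length prefix are assumptions made for illustration.

```python
import hashlib


def make_guid(owner_public_key: bytes, name: str) -> bytes:
    """Return a 160-bit GUID as SHA-1 over (key length || key || name).

    The layout is hypothetical; the length prefix just keeps the
    concatenation of key and name unambiguous.
    """
    h = hashlib.sha1()
    h.update(len(owner_public_key).to_bytes(4, "big"))
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.digest()


def block_guid(content: bytes) -> bytes:
    """Name a read-only block by the SHA-1 hash of its content."""
    return hashlib.sha1(content).digest()


def verify_block(guid: bytes, content: bytes) -> bool:
    """A reader can check data from an untrusted server against its name."""
    return block_guid(content) == guid


guid = make_guid(b"example-public-key-bytes", "/home/alice/thesis.tex")
print(len(guid) * 8, "bit GUID:", guid.hex())
```

Because the name is derived from the content (or from the owner's key), a client can detect a substituted block without trusting the server that returned it, which is one way a system-level name can help authenticate data.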
19 Universal Location Facility
- Takes 160-bit unique identifier (GUID) and returns the nearest object that matches
20 Some current results
- Have a working algorithm for local search
- Uses attenuated Bloom filters
- Performs search by passing messages from node to node. All state kept in messages!
- Updates filters through semi-chaotic passing of information between neighbors
- Resembles compiler dataflow algorithm
- Can be shown to converge
- Have candidate for backing store index
- Randomized data structure with locality properties
- Every document has multiple roots in the OceanStore
- Searches close to copy tend to find copy quickly
- Redundant, insensitive to faults, and repairable
- Investigating algorithms to continually adapt routing structure to adjust for faults and denial of service
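The local search idea can be illustrated with a small attenuated Bloom filter. This is a sketch under stated assumptions, not the project's algorithm: it takes an attenuated filter to be an array of ordinary Bloom filters kept per outgoing link, where level i summarizes objects believed to be i+1 hops away through that link. The sizes `M` and `K`, the class names, and `choose_next_hop` are all illustrative.

```python
import hashlib

M = 256  # bits per Bloom filter (illustrative)
K = 3    # bit positions set per key (illustrative)


def bit_positions(key: bytes):
    """Derive K bit positions from salted SHA-1 hashes of the key."""
    return [
        int.from_bytes(hashlib.sha1(bytes([i]) + key).digest()[:4], "big") % M
        for i in range(K)
    ]


class BloomFilter:
    def __init__(self):
        self.bits = 0  # a Python int used as a bit vector

    def add(self, key: bytes):
        for p in bit_positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: bytes) -> bool:
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> p) & 1 for p in bit_positions(key))


class AttenuatedBloomFilter:
    """levels[i] summarizes objects believed to be i+1 hops away via one link."""

    def __init__(self, depth: int):
        self.levels = [BloomFilter() for _ in range(depth)]

    def first_match(self, key: bytes):
        """Smallest hop count at which the key may be found, or None."""
        for i, level in enumerate(self.levels):
            if level.might_contain(key):
                return i + 1
        return None


def choose_next_hop(link_filters: dict, key: bytes):
    """Pick the neighbor whose filter reports the key at the fewest hops."""
    best = None
    for neighbor, abf in link_filters.items():
        hops = abf.first_match(key)
        if hops is not None and (best is None or hops < best[1]):
            best = (neighbor, hops)
    return best  # None means: fall back to the slower guaranteed search


links = {"A": AttenuatedBloomFilter(3), "B": AttenuatedBloomFilter(3)}
links["A"].levels[1].add(b"guid-42")  # reachable in 2 hops via A
links["B"].levels[0].add(b"guid-42")  # reachable in 1 hop via B
print(choose_next_hop(links, b"guid-42"))
```

A query message would carry the GUID and be forwarded along the chosen link at each node, which is consistent with keeping all search state in the message; a miss at every level hands the query to the guaranteed search.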
21 OceanStore Technologies II: Rapid Update in an Untrusted Infrastructure
- Requirements
- Scalable coherence mechanism which can operate directly on encrypted data without revealing information
- Handle Byzantine failures
- Rapid dissemination of committed information
- OceanStore Approach
- Operations-based interface using conflict resolution
- Modeled after Xerox Bayou: update packets include predicate/update pairs which operate on encrypted data
- Use of oblivious function techniques to perform this update
- Use of incremental cryptographic techniques
- User signs updates and principal party signs commits
- Committed data multicast to clients
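The predicate/update structure can be sketched on plaintext. This is a deliberate simplification: the slides describe predicates and updates that operate on encrypted data, which this example does not attempt. It also assumes that the first pair whose predicate holds is the one applied and that an update with no true predicate aborts; `UpdatePacket`, `apply_update`, and the calendar example are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Predicate = Callable[[State], bool]
Action = Callable[[State], None]


@dataclass
class UpdatePacket:
    """An update is an ordered list of (predicate, action) pairs."""
    pairs: List[Tuple[Predicate, Action]]


def apply_update(state: State, packet: UpdatePacket) -> bool:
    """Run the action of the first pair whose predicate holds.

    Returns False if no predicate holds, i.e. the update conflicts
    with the current state and is not applied.
    """
    for predicate, action in packet.pairs:
        if predicate(state):
            action(state)
            return True
    return False


def slot_free(slot: str) -> Predicate:
    return lambda s: slot not in s


def book(slot: str, who: str) -> Action:
    return lambda s: s.__setitem__(slot, who)


calendar: State = {}
alice = UpdatePacket([(slot_free("10am"), book("10am", "alice"))])
# Bob prefers 10am but names 11am as an acceptable alternative.
bob = UpdatePacket([
    (slot_free("10am"), book("10am", "bob")),
    (slot_free("11am"), book("11am", "bob")),
])
alice_ok = apply_update(calendar, alice)
bob_ok = apply_update(calendar, bob)
print(alice_ok, bob_ok, calendar)
```

The point of the design is that the conflict-resolution policy travels with the update, so any replica that applies the same updates in the same order reaches the same result without consulting the client again.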
22 Tentative Updates: Epidemic Dissemination
23 Committed Updates: Multicast Dissemination
24 Our State of the Art
- Have techniques for protecting metadata
- Uses encryption and signatures to provide protection against substitution attacks
- Provides secure pointer technology
- Have a working scheme that can do some forms of conflict resolution directly on encrypted data
- Uses new technique for searching on encrypted data
- Can be generalized to perform optimistic concurrency, but at cost in performance and possibly privacy
- Byzantine assumptions for update commitment
- Signatures on update requests from clients
- Compromised servers are unable to produce valid updates
- Uncompromised second-tier servers can make consistent ordering decision with respect to tentative commits
- Use of threshold cryptography in inner tier of servers
- Signatures on update stream from inner tier
- Use of chained MACs to reduce overhead
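One possible reading of "chained MACs" is sketched below: each tag authenticates the previous tag together with the next update, so checking the most recent tag vouches for the whole ordered prefix of the stream. The slides do not specify the construction. The shared key, the use of HMAC-SHA-256, and the names `chain_tags` and `verify_stream` are assumptions for illustration only.

```python
import hashlib
import hmac

ZERO_TAG = b"\x00" * 32  # fixed starting value for the chain (assumption)


def chain_tags(key: bytes, updates):
    """Return one tag per update; tag i covers updates 0..i in order."""
    tags = []
    prev = ZERO_TAG
    for update in updates:
        # prev has a fixed length, so prev + update is unambiguous.
        prev = hmac.new(key, prev + update, hashlib.sha256).digest()
        tags.append(prev)
    return tags


def verify_stream(key: bytes, updates, final_tag: bytes) -> bool:
    """Recompute the chain and compare only the last tag."""
    tags = chain_tags(key, updates)
    return bool(tags) and hmac.compare_digest(tags[-1], final_tag)


key = b"key shared by inner tier and replica (hypothetical)"
stream = [b"update-1", b"update-2", b"update-3"]
tags = chain_tags(key, stream)
print(verify_stream(key, stream, tags[-1]))
```

A symmetric MAC is far cheaper than a public-key signature, so a scheme along these lines could let the inner tier sign only occasionally and cover the updates in between with chained tags.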
25 OceanStore Technologies III: High-Availability and Disaster Recovery
- Requirements
- Handle diverse, unstable participants in OceanStore
- Mitigate denial of service attacks
- Eliminate backup as independent (and fallible) technology
- Flexible disaster recovery for everyone
- OceanStore Approach
- Use of erasure codes to provide stable storage for archival copies and snapshots of live data
- Mobile replicas are self-contained centers for logging and conflict resolution
- Version-based update for painless recovery
- Continuous introspection repairs data structures and degree of redundancy
26 Floating Replicas and Deep Archival Coding
- Floating Replicas are per-object virtual servers
- Complete copy of data
- Logging for updates/conflict resolution
- Interaction with other centers to keep data consistent
- May appear and disappear like bubbles
- Erasure-coded fragments provide very stable store
- Multi-level codes spread over 1000s of nodes
- Could lose 1/2 of nodes and still recover data
- Archive old versions of data and checkpoints
- Inactive data may only be in erasure-coded form
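The claim that half the fragments can be lost without losing data can be illustrated with a toy rate-1/2 erasure code. The slides do not say which code is used; this sketch swaps in a small Reed-Solomon-style code over the prime field GF(257), built from polynomial interpolation, and handles a single short stripe rather than multi-level codes across thousands of nodes. It needs Python 3.8+ for `pow(x, -1, p)`.

```python
P = 257  # prime just above the byte range; a parity symbol may equal 256


def _lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) points, with arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, n: int):
    """Return n fragments (x, y). Fragments 0..k-1 hold the data bytes;
    the rest are extra evaluations of the same polynomial."""
    k = len(data)
    if not 0 < k <= n <= P:
        raise ValueError("need 0 < len(data) <= n <= 257")
    points = list(enumerate(data))
    return [(x, data[x] if x < k else _lagrange_eval(points, x))
            for x in range(n)]


def decode(fragments, k: int) -> bytes:
    """Recover the k data bytes from any k distinct surviving fragments."""
    points = list(dict(fragments).items())  # drop duplicate x positions
    if len(points) < k:
        raise ValueError("need at least k distinct fragments")
    points = points[:k]
    return bytes(_lagrange_eval(points, x) for x in range(k))


data = b"ocean"
fragments = encode(data, 2 * len(data))   # rate 1/2: 10 fragments for 5 bytes
survivors = fragments[len(data):]         # lose every original data fragment
recovered = decode(survivors, len(data))
print(recovered)
```

Any 5 of the 10 fragments suffice here, so which half of the nodes fails does not matter; production codes get the same property far more efficiently.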
27 Floating Replica and Deep Archival Coding
28 Structure of Archival Checkpoints
[Figure labels: Checkpoint Reference (GUID); Blocks; Unit of Archival Storage]
- NOTE: Each block needs a GUID
- All blocks and fragments signed
- Copy-on-write behavior
- Older metablocks fragmented also
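A minimal sketch of a checkpoint structure in which every block has a GUID and updates are copy-on-write. It assumes a GUID is the SHA-1 hash of the block's bytes and that a metablock is simply a list of child GUIDs; signing and erasure-coding of blocks are omitted, and `BlockStore`, `write_checkpoint`, `read_checkpoint`, and `update_block` are illustrative names.

```python
import hashlib


def guid(data: bytes) -> str:
    """Name a block by the SHA-1 hash of its bytes (assumption)."""
    return hashlib.sha1(data).hexdigest()


class BlockStore:
    """Immutable blocks keyed by GUID; identical blocks are stored once."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        g = guid(data)
        self.blocks[g] = data
        return g

    def get(self, g: str) -> bytes:
        return self.blocks[g]


def write_checkpoint(store: BlockStore, data_blocks) -> str:
    """Store the data blocks and a metablock listing their GUIDs.
    The metablock's GUID serves as the checkpoint reference."""
    child_guids = [store.put(b) for b in data_blocks]
    metablock = "\n".join(child_guids).encode("ascii")
    return store.put(metablock)


def read_checkpoint(store: BlockStore, checkpoint_guid: str):
    metablock = store.get(checkpoint_guid).decode("ascii")
    return [store.get(g) for g in metablock.split("\n")] if metablock else []


def update_block(store: BlockStore, checkpoint_guid: str, index: int,
                 new_data: bytes) -> str:
    """Copy-on-write: write one new block and a new metablock; the old
    checkpoint stays readable and unchanged blocks are shared."""
    blocks = read_checkpoint(store, checkpoint_guid)
    blocks[index] = new_data
    return write_checkpoint(store, blocks)


store = BlockStore()
v1 = write_checkpoint(store, [b"alpha", b"beta", b"gamma"])
v2 = update_block(store, v1, 1, b"BETA")
print(read_checkpoint(store, v1), read_checkpoint(store, v2))
```

Because old blocks are never overwritten, every earlier checkpoint reference remains a valid, self-verifying snapshot, which fits the version-based recovery described earlier in the talk.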
29 Proactive Self-Maintenance
- Continuous testing and repair of information
- Slow sweep through all information to make sure there are sufficient erasure-coded fragments
- Continuously reevaluate risk and redistribute data
- Slow sweep and repair of metadata/search trees
- Continuous online self-testing of HW and SW
- Detects flaky, failing, or buggy components via:
- Fault injection: triggering hardware and software error handling paths to verify their integrity/existence
- Stress testing: pushing HW/SW components past normal operating parameters
- Scrubbing: periodic restoration of potentially decaying hardware or software state
- Automates preventive maintenance
30 OceanStore Technologies IV: Introspective Optimization
- Requirements
- Reasonable job on global-scale optimization problem
- Take advantage of locality whenever possible
- Sensitivity to limited storage and bandwidth at endpoints
- Repair of data structures, increasing of redundancy
- Stability in chaotic environment ⇒ Active Feedback
- OceanStore Approach
- Introspective monitoring and analysis of relationships to cluster information by relatedness
- Time-series analysis of user and data motion
- Rearrangement and replication in response to monitoring
- Clustered prefetching: fetch related objects
- Proactive prefetching: get data there before needed
- Rearrangement in response to overload and attack
31 Example: Client Introspection
- Client observer and optimizer components
- Greedy agents working on behalf of the client
- Watches client activity/combines with historical info
- Performs clustering and time-series analysis
- Forwards results to infrastructure (privacy issues!)
- Monitoring of state of network to adapt behavior
- Typical Actions
- Cluster related files together
- Prefetch files that will be needed soon
- Create/destroy floating replicas
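A very small sketch of what a client-side observer might compute. The slides mention clustering and time-series analysis without giving details, so this example substitutes a much simpler technique: counting which file tends to be opened right after which, and proposing the most frequent followers as prefetch candidates. `AccessObserver` and its methods are hypothetical names.

```python
from collections import Counter, defaultdict


class AccessObserver:
    """Counts which file tends to follow which in a client's access stream."""

    def __init__(self):
        self.followers = defaultdict(Counter)
        self.last = None

    def record(self, name: str):
        """Note one file access and update the follower counts."""
        if self.last is not None and self.last != name:
            self.followers[self.last][name] += 1
        self.last = name

    def prefetch_candidates(self, name: str, limit: int = 2):
        """Files most often accessed right after `name`, most frequent first."""
        counts = self.followers.get(name, Counter())
        return [f for f, _ in counts.most_common(limit)]


observer = AccessObserver()
trace = ["paper.tex", "refs.bib", "paper.tex", "refs.bib",
         "paper.tex", "fig1.eps"]
for access in trace:
    observer.record(access)

print(observer.prefetch_candidates("paper.tex"))
```

Files that repeatedly follow one another are also natural candidates to be clustered together, and only the aggregated counts (not the raw trace) would need to leave the client, which bears on the privacy concern the slide raises.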
32 OceanStore Technologies V: The Oceanic Data Market
- Properties
- Utility providers have resources (storage and bandwidth)
- Clients use resources both directly and indirectly
- Use of data storage and bandwidth on demand
- Data movement on behalf of users
- Some customers are more important than others
- Techniques that we are exploring (very preliminary)
- Data market driven by principal party
- Tradeoff between performance (replication) and cost
- Secure signatures on data packets permit:
- Accounting of bandwidth and CPU utilization
- Access control policies (Bays in OceanStore nomenclature)
- Use of challenge-response protocols (similar to zero-knowledge proofs) to demonstrate possession of data
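A minimal sketch of a challenge-response check that a provider still holds some data. This is not a zero-knowledge proof and not necessarily the protocol the project has in mind: it is a plain nonce-based keyed hash, and it assumes the verifier keeps a reference copy (or has precomputed some challenge/response pairs). The function names are illustrative.

```python
import hashlib
import hmac
import os


def make_challenge() -> bytes:
    """Verifier picks a fresh random nonce for every audit."""
    return os.urandom(16)


def respond(nonce: bytes, stored_data: bytes) -> bytes:
    """Prover hashes the data it holds, keyed by the nonce. It cannot
    precompute the answer because the nonce is not known in advance."""
    return hmac.new(nonce, stored_data, hashlib.sha256).digest()


def check(nonce: bytes, response: bytes, reference_copy: bytes) -> bool:
    """Verifier recomputes the expected response from its own copy."""
    expected = respond(nonce, reference_copy)
    return hmac.compare_digest(response, expected)


data = b"fragment 17 of some archived object"
nonce = make_challenge()
honest_ok = check(nonce, respond(nonce, data), data)
cheat_ok = check(nonce, respond(nonce, data[:-1]), data)
print(honest_ok, cheat_ok)
```

A provider that has discarded or truncated the data cannot answer a fresh challenge, and an answer recorded for an earlier nonce does not verify under a new one.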
33 Two-Phase Implementation
- This term: Read-Mostly Prototype
- Construction of data location facility
- Initial introspective gathering of tacit info and adaptation
- Initial archival techniques (use of erasure codes)
- Unix file-system interface under Linux (legacy apps)
- Later: Full Prototype
- Final conflict resolution and encryption techniques
- More sophisticated tacit info gathering and rearrangement
- Final object interface and integration with Endeavour applications
- Wide-scale deployment via NTON and Internet-2
34 OceanStore Conclusion
- The time is now for a Universal Data Utility
- Ubiquitous computing and connectivity is (almost) here!
- Confederation of utility providers is right model
- OceanStore holds all data, everywhere
- Local storage is a cache on global storage
- Provides security in an untrusted infrastructure
- Large-scale system has good statistical properties
- Use of introspection for performance and stability
- Quality of individual servers enhances reliability
- Exploits economies of scale to:
- Provide high availability and extreme survivability
- Lower maintenance cost
- Self-diagnosis and repair
- Insensitivity to technology changes: just unplug one set of servers, plug in others