Title: OceanStore: In Search of Global-Scale, Persistent Storage
1OceanStore: In Search of Global-Scale, Persistent Storage
- John Kubiatowicz
- UC Berkeley
2OceanStore Context: Ubiquitous Computing
- Computing everywhere
- Desktop, Laptop, Palmtop
- Cars, Cellphones
- Shoes? Clothing? Walls?
- Connectivity everywhere
- Rapid growth of bandwidth in the interior of the net
- Broadband to the home and office
- Wireless technologies such as CDMA, satellite, laser
3Utility-based Infrastructure?
- Data service provided by federation of companies
- Cross-administrative domain
- Metric: MOLE OF BYTES (6 × 10^23)
4OceanStore Assumptions
- Untrusted Infrastructure
- The OceanStore is comprised of untrusted components
- Only ciphertext within the infrastructure
- Responsible Party
- Some organization (i.e. service provider) guarantees that your data is consistent and durable
- Not trusted with content of data, merely its integrity
- Mostly Well-Connected
- Data producers and consumers are connected to a high-bandwidth network most of the time
- Exploit multicast for quicker consistency when possible
- Promiscuous Caching
- Data may be cached anywhere, anytime
5Key Observation: Want Automatic Maintenance
- Can't possibly manage billions of servers by hand!
- System should automatically
- Adapt to failure
- Repair itself
- Incorporate new elements
- Introspective Computing/Autonomic Computing
- Can data be accessible for 1000 years?
- New servers added from time to time
- Old servers removed from time to time
- Everything just works
6Outline
- Motivation
- Assumptions of the OceanStore
- Specific Technologies and approaches
- Routing and Data Location
- Naming
- Conflict resolution on encrypted data
- Replication and Deep archival storage
- Introspection for optimization and repair
- Conclusion
7Basic Structure: Irregular Mesh of Pools
8Bringing Order to this Chaos
- How do you find information?
- Must be scalable and provide maximum flexibility
- How do you name information?
- Must provide global uniqueness
- How do you ensure consistency?
- Must scale and handle intermittent connectivity
- Must prevent unauthorized update of information
- How do you protect information?
- Must preserve privacy
- Must provide deep archival storage (continuous repair)
- How do you tune performance?
- Locality very important
- Throughout all of this how do you maintain it???
9Location and Routing
10Locality, Locality, Locality: One of the defining principles
- The ability to exploit local resources over remote ones whenever possible
- "-Centric" approach
- Client-centric, server-centric, data source-centric
- Requirements
- Find data quickly, wherever it might reside
- Locate nearby object without global communication
- Permit rapid object migration
- Verifiable: can't be sidetracked
- Locality yields Performance, Availability, Reliability
11Enabling Technology: DOLR (Decentralized Object Location and Routing)
Tapestry
12Stability under Changes
- Unstable, unreliable, untrusted nodes are the common case!
- Network never fully stabilizes
- What is half-life of a routing node?
- Must provide stable routing in these circumstances
- Redundancy and adaptation fundamental
- Make use of alternative paths when possible
- Incrementally remove faulty nodes
- Route around network faults
- Continuously tune neighbor links
13The Tapestry DOLR
- Routing to Objects, not Locations!
- Replacement for IP?
- Very powerful abstraction
- Built as overlay network, but not fundamental
- Randomized prefix routing + distributed object location index
- Routing nodes have links to nearby neighbors
- Additional state tracks objects
- Massive parallel insert (SPAA 2002)
- Construction of nearest-neighbor mesh links
- log^2 n message complexity for new node
- New nodes integrated, faulty ones removed
- Objects kept available during this process
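The prefix-routing idea on this slide can be illustrated with a toy sketch: each hop moves to a neighbor whose ID shares a longer prefix with the target GUID than the current node does. The node IDs, the `links` table, and the helper names below are hypothetical; real Tapestry also maintains per-digit routing tables, object pointers, and fallback routing, none of which are modeled here.

```python
from typing import Dict, List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two IDs have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def next_hop(current: str, target: str, neighbors: List[str]) -> Optional[str]:
    """Pick a neighbor that matches more digits of the target than `current` does."""
    have = shared_prefix_len(current, target)
    better = [n for n in neighbors if shared_prefix_len(n, target) > have]
    if not better:
        return None
    # Prefer the smallest improvement, so each hop resolves roughly one digit.
    return min(better, key=lambda n: (shared_prefix_len(n, target), n))


def route(start: str, target: str, links: Dict[str, List[str]]) -> List[str]:
    """Follow next_hop until the target is reached or no neighbor improves the match."""
    path = [start]
    while path[-1] != target:
        hop = next_hop(path[-1], target, links.get(path[-1], []))
        if hop is None:
            break  # dead end in this toy mesh
        path.append(hop)
    return path


# Hypothetical four-digit hex IDs and neighbor links.
links = {
    "1D23": ["4A6D", "27AB"],
    "4A6D": ["45B2", "43C9"],
    "45B2": ["4591", "4A6D"],
    "4591": ["4598", "45B2"],
}
print(route("1D23", "4598", links))
# ['1D23', '4A6D', '45B2', '4591', '4598']
```

Because the matched prefix grows on every hop, the walk terminates after at most as many hops as there are digits in the ID.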
14OceanStore Naming
15Model of Data
- Ubiquitous object access from anywhere
- Undifferentiated Bag of Bits
- Versioned Objects
- Every update generates a new version
- Can always go back in time (Time Travel)
- Each Version is Read-Only
- Can have permanent name (SHA-1 Hash)
- Much easier to repair
- An Object is a signed mapping between permanent name and latest version
- Write access control/integrity involves managing these mappings
16Secure Hashing
- Read-only data: GUID is hash over actual information
- Uniqueness and Unforgeability: the data is what it is!
- Verification: check hash over data
- Changeable data: GUID is combined hash over a human-readable name + public key
- Uniqueness: GUID space selected by public key
- Unforgeability: public key is indelibly bound to GUID
- Verification: check signatures with public key
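A minimal sketch of the two GUID constructions, using SHA-1 as named on the Model of Data slide. The function names and the exact way the key and name are combined (length-prefixed key followed by the UTF-8 name) are assumptions for illustration; the slides do not specify the encoding. Signature checking is omitted because it needs a public-key library outside the Python standard library.

```python
import hashlib


def readonly_guid(data: bytes) -> str:
    """GUID of read-only data: a hash over the content itself."""
    return hashlib.sha1(data).hexdigest()


def verify_readonly(guid: str, data: bytes) -> bool:
    """Anyone holding the GUID can check that the bytes are the right ones."""
    return readonly_guid(data) == guid


def changeable_guid(name: str, public_key: bytes) -> str:
    """GUID of changeable data: a hash over the owner's public key and a name.

    Assumed encoding: the key is length-prefixed so that different
    (key, name) pairs cannot produce the same hash input.
    """
    h = hashlib.sha1()
    h.update(len(public_key).to_bytes(4, "big"))
    h.update(public_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()


block = b"some archived block"
guid = readonly_guid(block)
print(verify_readonly(guid, block))          # True
print(verify_readonly(guid, block + b"!"))   # False
```

Two owners using the same human-readable name get different GUIDs because their public keys differ, which is the sense in which the key selects the GUID space.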
17Secure Naming
- Naming hierarchy
- Users map from names to GUIDs via hierarchy of OceanStore objects (à la SDSI)
- Requires set of root keys to be acquired by user
18The Write Path
19The Path of an OceanStore Update
20OceanStore Consistency via Conflict Resolution
- Consistency is a form of optimistic concurrency
- An update packet contains a series of predicate-action pairs which operate on encrypted data
- Each predicate tried in turn
- If none match, the update is aborted
- Otherwise, action of first true predicate is applied
- Inner Ring must securely
- Pick serial order of updates
- Apply them
- Sign result (threshold signature)
- Disseminate results to active users
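The predicate-action rule above can be sketched as follows. This is an illustration only: the object layout (a version number plus a list of opaque ciphertext blocks) and the helper names `compare_version`, `replace_block`, and `append_block` are assumptions, and the sketch ignores serialization, signing, and dissemination by the inner ring.

```python
from typing import Callable, List, Optional, Tuple

Predicate = Callable[[dict], bool]
Action = Callable[[dict], dict]


def apply_update(obj: dict, update: List[Tuple[Predicate, Action]]) -> Optional[dict]:
    """Try each predicate in turn; apply the action of the first one that holds.

    Returns the new version of the object, or None if the update aborts.
    """
    for predicate, action in update:
        if predicate(obj):
            return action(obj)
    return None


def compare_version(expected: int) -> Predicate:
    return lambda obj: obj["version"] == expected


def replace_block(index: int, block: bytes) -> Action:
    def act(obj: dict) -> dict:
        blocks = list(obj["blocks"])  # copy: old versions stay read-only
        blocks[index] = block
        return {"version": obj["version"] + 1, "blocks": blocks}
    return act


def append_block(block: bytes) -> Action:
    return lambda obj: {"version": obj["version"] + 1,
                        "blocks": obj["blocks"] + [block]}


# Blocks are opaque ciphertext; the predicates never look inside them.
current = {"version": 3, "blocks": [b"c0", b"c1"]}
update = [
    (compare_version(3), replace_block(1, b"c1-new")),  # preferred outcome
    (lambda obj: True, append_block(b"c2")),            # fallback if the object moved on
]
print(apply_update(current, update))
# {'version': 4, 'blocks': [b'c0', b'c1-new']}
```

An update whose only predicate is `compare_version(2)` would return `None` against this object, which corresponds to the abort case on the slide.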
21Automatic Maintenance
- Byzantine Commitment for inner ring
- Tolerates up to 1/3 malicious servers in inner ring
- Continuous refresh of set of inner-ring servers
- Proactive threshold signatures
- Use of Tapestry ⇒ membership of inner ring unknown to clients
- Secondary tier self-organized into overlay dissemination tree
- Use of Tapestry routing to suggest placement of replicas in the infrastructure
- Automatic choice between update vs invalidate
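The 1/3 bound matches the classical result that a Byzantine agreement group tolerating f arbitrary faults needs at least 3f + 1 members. A small sketch of that arithmetic, with illustrative function names; the slides do not say which agreement protocol the inner ring runs.

```python
def max_byzantine_faults(ring_size: int) -> int:
    """Largest f such that ring_size >= 3f + 1."""
    return (ring_size - 1) // 3


def min_ring_size(faults: int) -> int:
    """Smallest ring that still reaches agreement with `faults` malicious members."""
    return 3 * faults + 1


for n in (4, 7, 10):
    print(n, "servers tolerate", max_byzantine_faults(n), "malicious")
```

So the fraction of malicious servers has to stay strictly below one third: a ring of 4 tolerates 1, a ring of 7 tolerates 2.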
22Self-Organizing Soft-State Replication
- Simple algorithms for placing replicas on nodes in the interior
- Intuition: locality properties of Tapestry help select positions for replicas
- Tapestry helps associate parents and children to build multicast tree
- Preliminary results show that this is effective
23Deep Archival Storage
24Two Types of OceanStore Data
- Active Data: Floating Replicas
- Per object virtual server
- Logging for updates/conflict resolution
- Interaction with other replicas for consistency
- May appear and disappear like bubbles
- Archival Data: OceanStore's Stable Store
- m-of-n coding: Like hologram
- Data coded into n fragments, any m of which are sufficient to reconstruct (e.g. m=16, n=64)
- Coding overhead is proportional to n/m (e.g. 4)
- Other parameter, rate, is 1/overhead
- Fragments are cryptographically self-verifying
- Most data in the OceanStore is archival!
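To make the "any m of n" property concrete, here is a toy erasure code based on polynomial evaluation over a small prime field, in the spirit of Reed-Solomon coding. It is a sketch under simplifying assumptions (one polynomial per m-byte block, fragments stored as integer pairs, field size 257), not the codec OceanStore uses; `encode`, `decode`, and `overhead` are illustrative names.

```python
from typing import List, Tuple

P = 257  # smallest prime above 255, so every byte value is a field element

Fragment = Tuple[int, int]  # (evaluation point x, polynomial value at x)


def overhead(m: int, n: int) -> float:
    """Storage blow-up of an m-of-n code; the rate is its reciprocal."""
    return n / m


def encode(data: bytes, n: int) -> List[Fragment]:
    """Use the m data bytes as polynomial coefficients; fragment x is the value at x."""
    if not 0 < len(data) <= n < P:
        raise ValueError("need 0 < m <= n < 257")
    fragments = []
    for x in range(1, n + 1):
        y = 0
        for coeff in reversed(data):  # Horner's rule
            y = (y * x + coeff) % P
        fragments.append((x, y))
    return fragments


def decode(fragments: List[Fragment], m: int) -> bytes:
    """Recover the m coefficients from any m fragments by Lagrange interpolation."""
    if len(fragments) < m:
        raise ValueError("not enough fragments to reconstruct")
    points = fragments[:m]
    coeffs = [0] * m
    for i, (xi, yi) in enumerate(points):
        basis = [1]  # coefficients of prod_{j != i} (x - xj), lowest degree first
        denom = 1
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            shifted = [0] * (len(basis) + 1)
            for k, b in enumerate(basis):
                shifted[k] = (shifted[k] - xj * b) % P
                shifted[k + 1] = (shifted[k + 1] + b) % P
            basis = shifted
            denom = denom * (xi - xj) % P
        scale = yi * pow(denom, -1, P) % P  # modular inverse (Python 3.8+)
        for k in range(m):
            coeffs[k] = (coeffs[k] + scale * basis[k]) % P
    return bytes(coeffs)


data = b"OceanStore data!"          # m = 16 bytes
fragments = encode(data, 64)        # n = 64 fragments
survivors = fragments[48:]          # pretend the first 48 fragments were lost
print(decode(survivors, 16) == data)  # True
print(overhead(16, 64))               # 4.0
```

Any other choice of 16 surviving fragments reconstructs the same bytes, which is why the location of individual fragments matters much less than the number that survive.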
25Archival Dissemination of Fragments
26Fraction of Blocks Lost per Year (FBLPY)
- Exploit law of large numbers for durability!
- 6 month repair, FBLPY:
- Replication: 0.03
- Fragmentation: 10^-35
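The law-of-large-numbers argument can be sketched with a binomial model: a block is lost only if fewer than m of its n fragments survive until the next repair. The per-fragment failure probability of 0.1 below is a made-up illustration value, failures are assumed independent, and the sketch does not attempt to reproduce the figures on this slide.

```python
from math import comb


def p_block_lost(n: int, m: int, p_fail: float) -> float:
    """Probability that fewer than m of n fragments survive one repair period.

    Assumes each fragment fails independently with probability p_fail.
    Whole-block replication is the special case m = 1.
    """
    p_ok = 1.0 - p_fail
    return sum(comb(n, k) * p_ok**k * p_fail**(n - k) for k in range(m))


p = 0.1  # hypothetical per-fragment failure probability per repair period
replication = p_block_lost(4, 1, p)      # 4 full copies, 4x storage
fragmentation = p_block_lost(64, 16, p)  # 16-of-64 coding, also 4x storage
print(replication, fragmentation)
```

At the same storage overhead, the coded block is lost only when at least 49 of 64 fragments fail together, which is many orders of magnitude less likely than losing all 4 replicas.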
27The Dissemination Process: Achieving Failure Independence
28Automatic Maintenance
- Continuous Entropy Suppression, i.e. repair!
- Erasure coding gives flexibility in timing repair
- Data continuously transferred from physical medium to physical medium
- No tapes decaying in basement
- Actual Repair
- Recombine fragments, then send out copies again
- DOLR permits efficient heartbeat mechanism
- Permits infrastructure to notice
- Servers going away for a while
- Or, going away forever!
- Continuous sweep through data
29Introspective Tuning
30On the use of Redundancy
- Question: Can we use Moore's law gains for something other than just raw performance?
- Growth in computational performance
- Growth in network bandwidth
- Growth in storage capacity
- Physical systems are unreliable and untrusted
- Can we use multiple faulty elements instead of one?
- Can we devote resources to monitoring and analysis?
- Can we devote resources to repairing systems?
- Complexity of systems growing rapidly
- Can no longer debug systems entirely
- How to handle this?
31The Biological Inspiration
- Biological Systems are built from (extremely) faulty components, yet
- They operate with a variety of component failures ⇒ Redundancy of function and representation
- They have stable behavior ⇒ Negative feedback
- They are self-tuning ⇒ Optimization of common case
- Introspective Computing
- Components for computing
- Components for monitoring and model building
- Components for continuous adaptation
32The Thermodynamic Analogy
- System such as OceanStore has a variety of latent order
- Connections between elements
- Mathematical structure (erasure coding, etc)
- Distributions peaked about some desired behavior
- Permits Stability through Statistics
- Exploit the behavior of aggregates
- Subject to Entropy
- Servers fail, attacks happen, system changes
- Requires continuous repair
- Apply energy (i.e. through servers) to reduce entropy
33Introspective Optimization
- Adaptation of routing substrate
- Optimization of Tapestry Mesh
- Fault-tolerant routing mechanisms
- Adaptation of second-tier multicast tree
- Monitoring of access patterns
- Clustering algorithms to discover object relationships
- Time-series analysis of user and data motion
- Observations of system behavior
- Extracting of failure correlations
- Continuous testing and repair of information
- Slow sweep through all information to make sure there are sufficient erasure-coded fragments
- Continuously reevaluate risk and redistribute data
34PondStore (Java)
- Event-driven state-machine model
- Included Components
- Initial floating replica design
- Conflict resolution and Byzantine agreement
- Routing facility (Tapestry)
- Bloom Filter location algorithm
- Plaxton-based locate and route data structures
- Introspective gathering of tacit info and adaptation
- Language for introspective handler construction
- Clustering, prefetching, adaptation of network routing
- Initial archival facilities
- Interleaved Reed-Solomon codes for fragmentation
- Methods for signing and validating fragments
- Target Applications
- Unix file-system interface under Linux (legacy apps)
- Email application, proxy for web caches, streaming multimedia applications
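For readers unfamiliar with the Bloom filter named in the component list, here is a generic sketch of the data structure: a compact summary that can answer "definitely not here" or "possibly here" for a set of object GUIDs. The class, its parameters, and the way hash positions are derived are assumptions for illustration; this is not the PondStore location algorithm itself.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Compact set summary with false positives but no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, guid: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{guid}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, guid: str) -> None:
        for pos in self._positions(guid):
            self.bits |= 1 << pos

    def might_contain(self, guid: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(guid))


nearby = BloomFilter()
nearby.add("guid-of-some-object")
print(nearby.might_contain("guid-of-some-object"))  # True
```

A node could consult such a summary for its neighborhood before falling back to a wider search; a negative answer is always correct, while a positive answer still has to be confirmed.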
35We have Things Running!
- Latest: it is up to 7 MB/sec
- Still a ways to go, but working
36Update Latency
- Cryptography in critical path (not surprising!)
- New metric: Avoid hashes (like avoid copies)
37OceanStore Goes Global!
- OceanStore components running globally
- Australia, Georgia, Washington, Texas, Boston
- Able to run the Andrew File-System benchmark with inner ring spread throughout US
- Interface: NFS on OceanStore
- Word on the street: it was easy to do
- The components were debugged locally
- Easily set up remotely
- I am currently talking with people in
- England, Maryland, Minnesota, ...
- PlanetLab testbed will give us access to much more
38Reality: Web Caching through OceanStore
39Other Apps
- Better file system support
- NFS (working; reimplementation in progress)
- Windows Installable file system (soon)
- Email through OceanStore
- IMAP and POP proxies
- Let normal mail clients access mailboxes in OS
- Palm-pilot synchronization
- Palm data base as an OceanStore DB
40OceanStore Conclusions
- OceanStore: everyone's data, one big utility
- Global Utility model for persistent data storage
- OceanStore assumptions
- Untrusted infrastructure with a responsible party
- Mostly connected with conflict resolution
- Continuous on-line optimization
- OceanStore properties
- Provides security, privacy, and integrity
- Provides extreme durability
- Lower maintenance cost through redundancy, continuous adaptation, self-diagnosis and repair
- Large scale system has good statistical properties
41For more info
- OceanStore vision paper for ASPLOS 2000
- OceanStore: An Architecture for Global-Scale Persistent Storage
- Tapestry algorithms paper (SPAA 2002)
- Distributed Object Location in a Dynamic Network
- Bloom Filters for Probabilistic Routing (INFOCOM 2002)
- Probabilistic Location and Routing
- OceanStore web site: http://oceanstore.org/