Fault-tolerant replication management in large-scale distributed storage systems

Slides: 16

Provided by: Wre6

Learn more at: http://www-net.cs.umass.edu

Transcript and Presenter's Notes

1
Fault-tolerant replication management
in large-scale distributed storage systems

Richard Golding
Storage Systems Program, Hewlett Packard Labs
golding_at_hpl.hp.com

Elizabeth Borowsky
Computer Science Dept., Boston College
borowsky_at_cs.bc.edu
2
Introduction

Palladio - a solution for detecting, handling, and
recovering from both small- and large-scale
failures in a distributed storage system.
Palladio - provides virtualized data storage
services to applications via a set of virtual
stores, each structured as a logical array of
bytes into which applications can write and read
data. A store's layout maps each byte in its
address space to an address on one or more
devices.
Palladio - storage devices take an active role in
the recovery of the stores they are part of.
Managers keep track of the virtual stores in the
system, coordinating changes to their layout and
handling recovery from failure.
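
The layout mapping described above can be illustrated with a small sketch. It is not Palladio's actual data structure: the Extent and Layout classes, the extent-based representation, and the device names are all assumptions made for illustration. It shows only the stated idea that each byte of a store's address space maps to an address on one or more devices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """Hypothetical: a contiguous range of the store's virtual address space."""
    start: int        # first virtual byte covered by this extent
    length: int       # number of bytes covered
    replicas: tuple   # ((device_id, device_base_offset), ...), one per copy


class Layout:
    """Hypothetical layout: maps a virtual byte to its device addresses."""

    def __init__(self, extents):
        self.extents = sorted(extents, key=lambda e: e.start)

    def locate(self, offset):
        """Return every (device_id, device_offset) that holds this byte."""
        for extent in self.extents:
            if extent.start <= offset < extent.start + extent.length:
                delta = offset - extent.start
                return [(dev, base + delta) for dev, base in extent.replicas]
        raise ValueError(f"offset {offset} is outside the store")


# Bytes 0..99 are mirrored on two devices; bytes 100..149 live on one device.
layout = Layout([
    Extent(0, 100, (("dev-a", 0), ("dev-b", 500))),
    Extent(100, 50, (("dev-c", 0),)),
])
```

With this layout, a byte in the mirrored range resolves to two device addresses and a byte in the second range resolves to one.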

Provide robust read and write access to data in
virtual stores.
Atomic and serialized read and write access.
Detect and recover from failure.
Accommodate layout changes.

Entities: Hosts, Stores, Managers, Management policies
Protocols: Layout Retrieval protocol, Data Access protocol,
Reconciliation protocol, Layout Control protocol
4
Protocols
Access protocol: allows hosts to read and write
data on a storage device as long as there are no
failures or layout changes for the virtual store.
It must provide serialized, atomic writes that
can span multiple devices.
Layout retrieval protocol: allows hosts to obtain
the current layout of a virtual store, that is,
the mapping from the virtual store's address
space onto the devices that store parts of it.
Reconciliation protocol: runs between pairs of
devices to bring them back to consistency after
a failure.
Layout control protocol: runs between managers
and devices. It maintains consensus about the
layout and failure status of the devices, and in
doing so coordinates the other three protocols.
5
Layout Control Protocol

The layout control protocol tries to maintain
agreement between a store's manager and the
storage devices that hold the store on:
The layout of data onto storage devices
The identity of the store's active manager

The notion of epochs
The layout and manager are fixed during each epoch
Epochs are numbered
Epoch transitions
Device lease acquisition and renewal
Device leases are used to detect possible failure.
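
The epoch idea can be sketched as follows. This is a minimal illustration, not the protocol itself: the Epoch and Device classes are hypothetical, and the rule that a device refuses stale or repeated epoch numbers is an assumption added here. The slide states only that epochs are numbered and that the layout and manager are fixed within each epoch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Epoch:
    """Frozen, so the manager and layout cannot change within an epoch."""
    number: int
    manager: str     # identity of the store's active manager (hypothetical id)
    layout: tuple    # placeholder for the layout, e.g. a tuple of device ids


class Device:
    """Hypothetical device-side view of the current epoch."""

    def __init__(self):
        self.epoch = None

    def begin_epoch(self, epoch):
        # Assumption: a transition is accepted only if it moves to a
        # strictly higher epoch number.
        if self.epoch is not None and epoch.number <= self.epoch.number:
            return False
        self.epoch = epoch
        return True

    def accepts(self, epoch_number):
        # Assumption: requests tagged with any other epoch are refused.
        return self.epoch is not None and epoch_number == self.epoch.number


device = Device()
device.begin_epoch(Epoch(1, "manager-1", ("dev-a", "dev-b")))
```

Here a later transition to epoch 2 would replace the manager and layout together, and requests still tagged with epoch 1 would be refused.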

6
Operation during an epoch

The manager has quorum and coverage of devices.
Periodic lease renewal
If a device fails to report and renew its lease,
the manager considers it failed.
If the manager fails to renew the lease, the
device considers the manager failed and starts a
manager recovery sequence.
When the manager loses quorum or coverage, the
epoch ends and a state of epoch transition is
entered.
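
The manager-side checks above can be sketched as follows, under stated assumptions. The slides do not define quorum or coverage, so this sketch assumes quorum means a strict majority of the store's devices hold unexpired leases, and coverage means every part of the store has at least one such device. The Manager class and all of its names are hypothetical. Time is passed in explicitly to keep the example deterministic.

```python
class Manager:
    """Hypothetical manager that tracks device leases for one store."""

    def __init__(self, replicas_by_extent, lease_seconds):
        # replicas_by_extent: {extent_name: set of device ids holding it}
        self.replicas_by_extent = replicas_by_extent
        self.devices = set().union(*replicas_by_extent.values())
        self.lease_seconds = lease_seconds
        self.expiry = {}  # device id -> time at which its lease expires

    def renew(self, device, now):
        """Record a lease renewal request from a device."""
        self.expiry[device] = now + self.lease_seconds

    def live(self, now):
        """Devices with unexpired leases; the rest are considered failed."""
        return {d for d in self.devices
                if self.expiry.get(d, float("-inf")) > now}

    def has_quorum(self, now):
        # Assumption: quorum is a strict majority of the store's devices.
        return 2 * len(self.live(now)) > len(self.devices)

    def has_coverage(self, now):
        # Assumption: coverage means each extent has a live replica.
        live = self.live(now)
        return all(replicas & live
                   for replicas in self.replicas_by_extent.values())

    def epoch_continues(self, now):
        """False means the epoch ends and an epoch transition begins."""
        return self.has_quorum(now) and self.has_coverage(now)


manager = Manager({"x": {"dev-a", "dev-b"}, "y": {"dev-b", "dev-c"}},
                  lease_seconds=10)
for dev in ("dev-a", "dev-b", "dev-c"):
    manager.renew(dev, now=0)
```

The device-side rule (treating the manager as failed when its renewal does not arrive) would mirror this with a single expiry time per device; it is omitted here for brevity.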

7
Epoch transition