ISTORE: A Platform for Scalable, Available, Maintainable Storage-Intensive Applications



1
ISTORE: A Platform for Scalable, Available,
Maintainable Storage-Intensive Applications
  • Aaron Brown, David Oppenheimer, Jim Beck,
  • Rich Martin, Randi Thomas, David Patterson,
  • and Kathy Yelick
  • Computer Science Division
  • University of California, Berkeley
  • http://iram.cs.berkeley.edu/istore/

2
ISTORE Philosophy: SAM
  • The ISTORE project is researching techniques for
    bringing scalability, availability, and
    maintainability (SAM) to large server systems
  • ISTORE vision: a self-testing HW/SW platform that
    automatically reacts to situations requiring an
    administrative response
  • brings self-maintenance to applications and
    storage
  • ISTORE target: high-end servers for
    data-intensive infrastructure services
  • single-purpose systems managing large amounts of
    data for large numbers of active network users
  • e.g., TBs of data, 10,000s of requests/sec, millions
    of users

3
Motivation: Service Demands
  • Emergence of a true information infrastructure
  • today: e-commerce, online database services,
    online backup, search engines, and web servers
  • tomorrow: more of the above (with ever-growing
    datasets), plus thin-client/PDA infrastructure
    support
  • these services have different needs than
    traditionally fault-tolerant services (ATMs,
    telephone switch, ...)
  • rapid software evolution
  • unpredictable, wildly fluctuating demand and user
    base
  • often must incorporate low-cost, off-the-shelf HW
    and SW components

4
Service Demands (2)
  • Infrastructure users expect always-on service
    and constant quality of service
  • infrastructure must provide scalable
    fault-tolerance and performance-tolerance
  • to a rapidly growing and evolving application
    base
  • failures and slowdowns have major business impact
  • e.g., recent eBay, E*Trade, and Schwab outages

5
The Need for 24x7 Availability
  • Today's widely deployed systems can't provide
    24x7 fault- and performance-tolerance
  • they rely on manual administration
  • static data and application partitioning
  • human detection of and response to most anomalous
    behaviors and changes in system environment
  • human administrators are too expensive, too slow,
    too prone to mistakes
  • Jim Gray reports 42% of Tandem failures due to
    administrator error (in 1985)
  • Tomorrow's ever-growing infrastructure systems
    need to be self-maintaining
  • self-maintaining systems anticipate problems and
    handle them as they arise, automatically

6
Self-Maintaining Systems
  • Self-maintaining systems require
  • a robust platform that provides online
    self-testing of its hardware and software
  • easy incremental scalability when existing
    resources stop providing desired quality of
    service
  • rapid detection of anomalous behavior and changes
    in system environment
  • failures, load spikes, changing access patterns,
    ...
  • fast and flexible reaction to detected conditions
  • flexible specification of conditions that trigger
    adaptation
  • Systems deployed on the ISTORE platform will be
    self-maintaining

7
Target Application Model
  • Scalable applications for data storage and access
  • e.g., bottom (data) tier of three-tier systems
  • Desired properties
  • ability to manage replicated/distributed state
  • including distribution of workload across
    replicas
  • ability to create and destroy replicas on the fly
  • persistence model that can tolerate node failure
    without loss of data
  • logging of writes, soft-state, etc.
  • ability to migrate service between nodes
  • e.g., checkpoint and restore, or kill and restart
  • built-in application self-testing

8
Target Application Model (2)
  • What existing application architectures come
    close to fitting this model?
  • parallel shared-nothing DBMSs
  • IBM DB2, Teradata, Tandem SQL/MX
  • distributed server applications
  • Lotus Notes/Domino
  • traditional distributed filesystems/fileservers
  • cluster-aware applications (with small mods?)
  • LARD cluster web server (Rice)
  • Microsoft Cluster Server Phase 2 (?)
  • What doesn't fit?
  • simple 2-node hot standby failover clusters
  • Microsoft Cluster Server Phase 1

9
The ISTORE Approach
  • Divides self-maintenance into two components
  • 1) reactive self-maintenance: dynamic reaction
    to exceptional system events
  • self-diagnosing, self-monitoring hardware
  • software monitoring and problem detection
  • automatic reaction to detected problems
  • 2) proactive self-maintenance: continuous online
    self-testing and self-analysis
  • automatic characterization of system components
  • in situ fault injection, self-testing, and
    scrubbing to detect flaky hardware components and
    to exercise rarely-taken application code paths
    before they're used

10
Reactive Self-Maintenance
  • ISTORE defines a layered system model for
    monitoring and reaction
  • ISTORE API defines the interface between the runtime
    system and application reaction mechanisms
  • Policies define the system's monitoring, detection,
    and reaction behavior

11
  • Hardware architecture: plug-and-play intelligent
    devices with integrated self-monitoring,
    diagnostics, and fault injection hardware
  • intelligence used to collect and filter
    monitoring data
  • diagnostics and fault injection enhance
    robustness
  • networked to create a scalable shared-nothing
    cluster

12
ISTORE-II Hardware Vision
  • System-on-a-chip enables computer, memory,
    redundant network interfaces without
    significantly increasing size of disk
  • Target for 5-7 years
  • 1999 IBM MicroDrive
  • 1.7" x 1.4" x 0.2" (43 mm x 36 mm x 5 mm)
  • 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
  • 2006 MicroDrive?
  • 9 GB, 50 MB/s (1.6X/yr capacity, 1.4X/yr BW)
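
  These 2006 numbers follow from the 1999 baseline and the quoted
  growth rates; a quick sanity check in Python (illustrative only):

    # Compound the 1999 MicroDrive figures forward 7 years at the
    # quoted rates (1.6X/yr capacity, 1.4X/yr bandwidth).
    years = 7
    capacity_mb = 340 * 1.6 ** years     # ~9,100 MB, i.e. ~9 GB
    bandwidth = 5 * 1.4 ** years         # ~53 MB/s, i.e. ~50 MB/s
    print(f"{capacity_mb / 1000:.1f} GB, {bandwidth:.0f} MB/s")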

13
2006 ISTORE
  • ISTORE node
  • Add 20% pad to MicroDrive size for packaging,
    connectors
  • Then double thickness to add IRAM
  • 2.0" x 1.7" x 0.5" (51 mm x 43 mm x 13 mm)
  • Crossbar switches growing by Moore's Law
  • 2x/1.5 yrs → 4X transistors/3 yrs
  • Crossbars grow by N² → 2X switch/3 yrs
  • 16 x 16 in 1999 → 64 x 64 in 2005
  • ISTORE rack (19" x 33" x 84") (480 mm x 840 mm
    x 2130 mm)
  • 1 tray (3" high) → 16 x 32 → 512 ISTORE nodes
  • 20 trays + switches + UPS → 10,240 ISTORE nodes (!)
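
  A quick check of the packing arithmetic (illustrative only):

    # 16 x 32 nodes fit in one 3"-high tray; 20 trays fill the rack.
    nodes_per_tray = 16 * 32                # 512
    nodes_per_rack = 20 * nodes_per_tray    # 10,240
    print(nodes_per_tray, nodes_per_rack)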

14
  • Each node includes extra diagnostic support
  • diagnostic processor: independent hardware
    running monitoring and control software
  • monitors hardware and environmental state not
    normally visible to system software
  • control:
  • reboot/power-cycle main CPU
  • inject simulated faults: power, bus transients,
    memory errors, network interface failure, ...
  • separate diagnostic network connects the
    diagnostic processors of each brick
  • provides independent network path to diagnostic
    CPU
  • works when brick CPU is powered off or has failed

15
  • Software collects and filters monitoring data
  • hardware monitors device health, environmental
    conditions, and indicators that software is
    working
  • some information is processed locally to provide
    fail-fast behavior when higher-level software is
    deemed potentially untrustworthy
  • most information passed on to software monitoring
  • software monitoring layer also collects
    higher-level performance data, access patterns,
    app. heartbeats

16
  • The data is collected in a virtual database
  • desired monitoring data is selected and
    aggregated by specifying views over the
    database
  • the database schema and views hide differences in
    monitoring implementation across heterogeneous HW
    and SW
  • Running example:
  • If ambient temperature of a shelf is rising
    significantly faster than that of other shelves,
  • reduce power consumption on those nodes, then
  • if necessary, migrate non-redundant data replicas
    off some nodes on that shelf and shut them down
  • view: for each shelf, the average temperature across
    all temperature sensors on that shelf
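
  A minimal Python sketch of this per-shelf view, assuming a
  hypothetical record format for sensor readings (the actual ISTORE
  schema is not given here):

    # Aggregate raw temperature readings into a per-shelf average,
    # i.e., the "view" described above. The record format is assumed.
    from collections import defaultdict
    from statistics import mean

    def shelf_temperature_view(readings):
        """Map each shelf to the average of its temperature sensors."""
        by_shelf = defaultdict(list)
        for r in readings:               # r = {"shelf": id, "temp_c": value}
            by_shelf[r["shelf"]].append(r["temp_c"])
        return {shelf: mean(temps) for shelf, temps in by_shelf.items()}

    readings = [{"shelf": 1, "temp_c": 31.0}, {"shelf": 1, "temp_c": 33.0},
                {"shelf": 2, "temp_c": 27.5}]
    print(shelf_temperature_view(readings))   # {1: 32.0, 2: 27.5}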

17
  • Conditions requiring administrative response are
    detected by observing values and/or patterns in
    the monitoring data
  • triggers specify these patterns and invoke
    appropriate adaptation algorithms
  • input to a trigger is a view of the monitoring
    data
  • views and triggers can be specified separately to
    allow
  • easy selection of desired reaction algorithm
  • easy redefinition of conditions that invoke a
    particular reaction
  • Running example:
  • trigger: change in temperature of one shelf > 0
    and more than twice the change in temperature of
    any other shelf, averaged over a one-minute period
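
  The trigger can be read as a predicate over that view; a minimal
  sketch, assuming the view supplies each shelf's one-minute average
  temperature change (names are illustrative, not the ISTORE API):

    def temperature_trigger(delta_by_shelf):
        """Return shelves whose 1-minute temperature rise is positive
        and more than twice that of every other shelf."""
        hot = []
        for shelf, delta in delta_by_shelf.items():
            others = [d for s, d in delta_by_shelf.items() if s != shelf]
            if delta > 0 and all(delta > 2 * d for d in others):
                hot.append(shelf)
        return hot

    print(temperature_trigger({1: 4.0, 2: 0.5, 3: 1.0}))   # [1]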

18
  • Adaptation algorithms coordinate
    application-level reaction mechanisms
  • adaptation algorithms define a sequence of
    operations that address the anomaly detected by
    the associated trigger
  • adaptation algorithms call application-implemented
    mechanisms via a standard API
  • but are independent of application mechanism
    details
  • Running example: coordination of reaction
  • 1) identify nodes with non-redundant data
  • 2) invoke application mechanism to migrate that
    data off n of those nodes
  • 3) reduce power consumption by those n nodes
  • 4) install trigger to monitor temperature change
    and shut down nodes if power reduction is
    ineffective
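
  The four steps might be coordinated roughly as follows; the api
  object stands in for the application-implemented mechanisms, and
  none of these method names come from the actual ISTORE API:

    def react_to_hot_shelf(api, shelf, n):
        # 1) identify nodes on the shelf holding non-redundant data
        at_risk = [node for node in api.nodes_on_shelf(shelf)
                   if api.has_nonredundant_data(node)]
        targets = at_risk[:n]
        # 2) invoke the application mechanism to migrate that data off
        for node in targets:
            api.migrate_data(source=node)
        # 3) reduce power consumption by those n nodes
        for node in targets:
            api.reduce_power(node)
        # 4) install a follow-up trigger: shut the nodes down if the
        #    shelf keeps heating despite the power reduction
        api.install_trigger(
            condition=lambda deltas: deltas[shelf] > 0,
            action=lambda: [api.shutdown(node) for node in targets])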

19
  • ISTORE expects reaction mechanisms to be
    implemented by the application
  • these reaction mechanisms are application-specific
  • e.g., moving data requires knowledge of data
    semantics, consistency policies, ...
  • a research goal of ISTORE is to provide a
    standard API to these mechanisms
  • initially, try to leverage and extend existing
    mechanisms to avoid wholesale rewriting of
    applications
  • many data-intensive applications already support
    functionality similar to the needed mechanisms
  • eventually, generalize and extend API to
    encompass mechanisms and needs of future
    applications

20
Policies
  • Programmer or administrator specifies policies to
    control the system's adaptive behavior
  • the policy compiler turns a high-level
    declarative specification of desired behavior
    into the appropriate
  • adaptation algorithms (that invoke application
    mechanisms through the ISTORE API)
  • triggers (to invoke the adaptation algorithms
    when the appropriate conditions are detected)
  • views (that enable monitoring needed by the
    triggers)
  • Running example:
  • policy: if ambient temperature of a shelf is
    rising significantly faster than that of other
    shelves, reduce power and prepare to shut down
    nodes
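
  The slides do not define the policy language itself; one plausible
  shape for such a declarative policy before compilation (purely
  illustrative):

    # A declarative statement the policy compiler would expand into
    # the view, trigger, and adaptation algorithm of the preceding
    # slides. The dict format here is an assumption.
    hot_shelf_policy = {
        "view":      "average temperature per shelf, 1-minute window",
        "condition": "one shelf's rise > 0 and > 2x every other shelf's",
        "reaction":  ["migrate non-redundant data off n nodes",
                      "reduce power on those nodes",
                      "shut down if temperature keeps rising"],
    }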

21
Summary: Layered System Model
  • Layered system model for monitoring and reaction
    provides reactive self-maintenance
  • Self-maintenance in ISTORE also consists of
    proactive, continuous self-testing and analysis

22
The ISTORE Approach
  • Divides self-maintenance into two components
  • 1) reactive self-maintenance: dynamic reaction
    to exceptional system events
  • self-diagnosing, self-monitoring hardware
  • software monitoring and problem detection
  • automatic reaction to detected problems
  • 2) proactive self-maintenance: continuous online
    self-testing and self-analysis
  • in situ fault injection, self-testing, and
    scrubbing to detect flaky hardware components and
    to exercise rarely-taken application code paths
    before they're used
  • automatic characterization of system components

23
Continuous Online Self-Testing
  • Self-maintaining systems should automatically
    carry out preventative maintenance
  • need aggressive in situ component testing via
  • fault injection: triggering hardware and software
    error-handling paths to verify their
    integrity/existence
  • stress testing: pushing HW/SW components past
    normal operating parameters
  • scrubbing: periodic restoration of potentially
    'decaying' hardware or software state
  • ISTORE periodically isolates nodes from the
    system and performs extensive self-tests
  • nodes can be easily isolated due to ISTORE's
    built-in redundancy
  • even in a deployed, running system
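
  A sketch of this periodic isolate-and-test cycle, assuming
  hypothetical cluster-management calls:

    def self_test_cycle(cluster, tests):
        """Rotate through nodes, pulling each out of service to test."""
        for node in cluster.nodes():
            cluster.isolate(node)        # redundancy keeps the service up
            if all(test(node) for test in tests):
                cluster.reintegrate(node)
            else:
                cluster.flag_for_service(node)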

24
Self-Testing Hardware
  • The goal of hardware self-testing is to detect flaky
    components and preserve data integrity
  • Examples
  • fault injection: power-cycle a disk to check for
    stiction
  • stress testing: run a disk controller at 100%
    utilization to test behavior under load
  • scrubbing: read all disk sectors and rewrite any
    that suffer soft errors; "fire" the disk if too many
    errors
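
  The scrubbing example might look like the following sketch;
  read_with_retry and write_sector stand in for driver-level
  primitives not specified in the slides:

    def scrub_disk(disk, max_soft_errors=10):
        """Read every sector, rewriting those that needed a retry.
        Returns False if the disk should be taken out of service."""
        soft_errors = 0
        for sector in range(disk.num_sectors):
            data, first_try_ok = disk.read_with_retry(sector)
            if not first_try_ok:                 # soft error, recovered on retry
                disk.write_sector(sector, data)  # restore decaying state
                soft_errors += 1
        return soft_errors <= max_soft_errors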

25
Self-Testing Software
  • Software self-testing proactively identifies
    weaknesses in software before they cause a
    visible failure
  • helps prevent failure due to bugs that only
    appear in certain hardware/software
    configurations
  • helps identify bugs that occur when software is
    driven into an untested state only reachable in a
    live system
  • e.g., long uptimes, heavy load, unexpected
    requests
  • Examples
  • fault injection (includes HW- and SW-induced
    faults that the SW is expected to handle): SCSI
    parity errors, invalid return codes from the
    operating system
  • stress testing: heavy load, pathological requests
  • scrubbing: restart/reboot long-running software
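
  In the spirit of the invalid-return-code example, software fault
  injection can wrap an I/O routine so that a small fraction of calls
  fail, exercising the caller's error-handling path; a minimal sketch
  (names illustrative):

    import random

    def inject_faults(fn, failure_rate=0.01):
        """Wrap fn so a small fraction of calls raise an injected error."""
        def wrapped(*args, **kwargs):
            if random.random() < failure_rate:
                raise OSError("injected fault")   # simulated failing call
            return fn(*args, **kwargs)
        return wrapped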

26
Online Self-Analysis
  • Self-maintaining systems require knowledge of
    their components' dynamic runtime behavior
  • current plug-and-play hardware approaches are
    not sufficient
  • need more than just discovery of new devices'
    functional capabilities and supported APIs
  • also need dynamic component characterization

27
Characterizing HW/SW Behavior
  • An ISTORE may contain black-box components
  • heterogeneous hardware devices
  • application-supplied reaction mechanisms whose
    implementations are hidden
  • To select and tune adaptation algorithms, the
    ISTORE system needs to understand the behavior of
    these components
  • in the context of a complex, live system
  • examples
  • characterize performance of disks in the system, use
    that data to select destination disks for replica
    creation
  • isolate two nodes, invoke replication from one to
    the other, monitor actions taken by application
    (e.g., how long it takes, how much data is moved)
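
  The disk-selection example reduces to timing a probe operation on
  each candidate disk and ranking the results; a sketch with an
  assumed probe callback:

    import time

    def rank_disks(disks, probe):
        """Return disks ordered fastest-first under the given probe
        (e.g., writing and syncing a fixed-size test file)."""
        timings = {}
        for disk in disks:
            start = time.perf_counter()
            probe(disk)
            timings[disk] = time.perf_counter() - start
        return sorted(disks, key=timings.get)

    # replica destinations = the k fastest disks:
    # targets = rank_disks(all_disks, probe)[:k]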

28
Support for Application Self-tuning
  • ISTORE's characterization mechanisms can also
    help applications tune themselves
  • current systems require manual tuning to meet
    scalability and performance goals
  • especially true for shared-nothing systems in
    which computational and storage resources aren't
    pooled
  • a possible research direction is to expose
    characterization information to application via
    an extension of the ISTORE API
  • this would allow 'aware' applications to
    automatically adapt their behavior based on
    system conditions

29
ISTORE API
  • The ISTORE API defines interfaces for
  • adaptation algorithms to invoke application
    reaction mechanisms
  • e.g., migrate data, replicate data, checkpoint,
    shutdown, ...
  • applications to provide hints to the runtime
    system so it can optimize adaptation algorithms and
    data storage
  • e.g., application tags data whose unavailability
    can be temporarily tolerated
  • runtime system to invoke application self-testing
    and fault injection, and for application to
    report results
  • runtime system to inform application about
    current state of system, hardware capabilities,
    ...
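
  The API itself was a research goal rather than a finished
  specification; a speculative Python rendering of the four interface
  groups above (every name here is an assumption):

    from abc import ABC, abstractmethod

    class ApplicationInterface(ABC):
        """Implemented by the application; invoked by the runtime."""
        # reaction mechanisms used by adaptation algorithms
        @abstractmethod
        def migrate_data(self, source, dest): ...
        @abstractmethod
        def replicate_data(self, source, dest): ...
        @abstractmethod
        def checkpoint(self, node): ...
        @abstractmethod
        def shutdown(self, node): ...
        # self-testing and fault injection hooks
        @abstractmethod
        def run_self_test(self, fault_spec): ...
        # notifications about current system state, HW capabilities, ...
        @abstractmethod
        def notify_system_state(self, state): ...

    class RuntimeInterface(ABC):
        """Implemented by the runtime; invoked by the application."""
        # hints, e.g., tagging data whose unavailability is tolerable
        @abstractmethod
        def tag_data(self, data_id, tolerates_unavailability): ...
        # reporting self-test results back to the runtime
        @abstractmethod
        def report_self_test(self, results): ...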

30
Summary
  • ISTORE focuses on Scalability, Availability, and
    Maintainability for emerging data-intensive
    network applications
  • ISTORE provides a platform for deploying
    self-maintaining systems that are up 24x7
  • ISTORE will achieve self-maintenance via
  • hardware platform with integrated diagnostic
    support
  • reactive self-maintenance: a layered,
    policy-driven runtime system that provides a
    framework for monitoring and reaction
  • proactive self-maintenance: support for
    continuous on-line self-testing and component
    characterization
  • and a standard API for interfacing applications
    to the runtime system