ISTORE: A Platform for Scalable, Available, Maintainable Storage-Intensive Applications



1
ISTORE: A Platform for Scalable, Available,
Maintainable Storage-Intensive Applications
  • Aaron Brown, David Oppenheimer, Jim Beck,
  • Rich Martin, Randi Thomas, David Patterson,
  • and Kathy Yelick
  • Computer Science Division
  • University of California, Berkeley
  • http://iram.cs.berkeley.edu/istore/

2
ISTORE Philosophy: SAM
  • The ISTORE project is researching techniques for
    bringing scalability, availability, and
    maintainability (SAM) to large server systems
  • ISTORE vision: a self-testing HW/SW platform that
    automatically reacts to situations requiring an
    administrative response
  • brings self-maintenance to applications and
    storage
  • ISTORE target: high-end servers for
    data-intensive infrastructure services
  • single-purpose systems managing large amounts of
    data for large numbers of active network users
  • e.g., TBs of data, 10,000s of requests/sec, millions
    of users

3
Motivation: Service Demands
  • Emergence of a true information infrastructure
  • today: e-commerce, online database services,
    online backup, search engines, and web servers
  • tomorrow: more of the above (with ever-growing
    datasets), plus thin-client/PDA infrastructure
    support
  • these services have different needs than
    traditionally fault-tolerant services (ATMs,
    telephone switch, ...)
  • rapid software evolution
  • unpredictable, wildly fluctuating demand and user
    base
  • often must incorporate low-cost, off-the-shelf HW
    and SW components

4
Service Demands (2)
  • Infrastructure users expect always-on service
    and constant quality of service
  • infrastructure must provide scalable
    fault-tolerance and performance-tolerance
  • to a rapidly growing and evolving application
    base
  • failures and slowdowns have major business impact
  • e.g., recent eBay, E*Trade, and Schwab outages

5
The Need for 24x7 Availability
  • Today's widely deployed systems can't provide
    24x7 fault- and performance-tolerance
  • they rely on manual administration
  • static data and application partitioning
  • human detection of and response to most anomalous
    behaviors and changes in system environment
  • human administrators are too expensive, too slow,
    too prone to mistakes
  • Jim Gray reports 42% of Tandem failures due to
    administrator error (in 1985)
  • Tomorrow's ever-growing infrastructure systems
    need to be self-maintaining
  • self-maintaining systems anticipate problems and
    handle them as they arise, automatically

6
Self-Maintaining Systems
  • Self-maintaining systems require
  • a robust platform that provides online
    self-testing of its hardware and software
  • easy incremental scalability when existing
    resources stop providing desired quality of
    service
  • rapid detection of anomalous behavior and changes
    in system environment
  • failures, load spikes, changing access patterns,
    ...
  • fast and flexible reaction to detected conditions
  • flexible specification of conditions that trigger
    adaptation
  • Systems deployed on the ISTORE platform will be
    self-maintaining

7
Target Application Model
  • Scalable applications for data storage and access
  • e.g., bottom (data) tier of three-tier systems
  • Desired properties
  • ability to manage replicated/distributed state
  • including distribution of workload across
    replicas
  • ability to create and destroy replicas on the fly
  • persistence model that can tolerate node failure
    without loss of data
  • logging of writes, soft-state, etc.
  • ability to migrate service between nodes
  • e.g., checkpoint and restore, or kill and restart
  • built-in application self-testing

8
Target Application Model (2)
  • What existing application architectures come
    close to fitting this model?
  • parallel shared-nothing DBMSs
  • IBM DB2, Teradata, Tandem SQL/MX
  • distributed server applications
  • Lotus Notes/Domino
  • traditional distributed filesystems/fileservers
  • cluster-aware applications (with small mods?)
  • LARD cluster web server (Rice)
  • Microsoft Cluster Server Phase 2 (?)
  • What doesn't fit?
  • simple 2-node hot standby failover clusters
  • Microsoft Cluster Server Phase 1

9
The ISTORE Approach
  • Divides self-maintenance into two components
  • 1) reactive self-maintenance: dynamic reaction
    to exceptional system events
  • self-diagnosing, self-monitoring hardware
  • software monitoring and problem detection
  • automatic reaction to detected problems
  • 2) proactive self-maintenance: continuous online
    self-testing and self-analysis
  • automatic characterization of system components
  • in situ fault injection, self-testing, and
    scrubbing to detect flaky hardware components and
    to exercise rarely-taken application code paths
    before they're used

10
Reactive Self-Maintenance
  • ISTORE defines a layered system model for
    monitoring and reaction
  • ISTORE API defines the interface between the runtime
    system and application reaction mechanisms
  • Policies define the system's monitoring, detection,
    and reaction behavior

11
  • Hardware architecture: plug-and-play intelligent
    devices with integrated self-monitoring,
    diagnostics, and fault injection hardware
  • intelligence used to collect and filter
    monitoring data
  • diagnostics and fault injection enhance
    robustness
  • networked to create a scalable shared-nothing
    cluster

12
ISTORE-II Hardware Vision
  • System-on-a-chip enables computer, memory,
    redundant network interfaces without
    significantly increasing size of disk
  • Target for 5-7 years
  • 1999 IBM MicroDrive
  • 1.7" x 1.4" x 0.2" (43 mm x 36 mm x 5 mm)
  • 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
  • 2006 MicroDrive?
  • 9 GB, 50 MB/s (1.6X/yr capacity, 1.4X/yr BW)
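
  These 2006 numbers follow from the 1999 baseline and the quoted
  growth rates; a quick sanity check in Python (illustrative only):

    # Compound the 1999 MicroDrive figures forward 7 years at the
    # quoted rates (1.6X/yr capacity, 1.4X/yr bandwidth).
    years = 7
    capacity_mb = 340 * 1.6 ** years     # ~9,100 MB, i.e. ~9 GB
    bandwidth = 5 * 1.4 ** years         # ~53 MB/s, i.e. ~50 MB/s
    print(f"{capacity_mb / 1000:.1f} GB, {bandwidth:.0f} MB/s")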

13
2006 ISTORE
  • ISTORE node
  • Add 20% pad to MicroDrive size for packaging,
    connectors
  • Then double thickness to add IRAM
  • 2.0" x 1.7" x 0.5" (51 mm x 43 mm x 13 mm)
  • Crossbar switches growing by Moore's Law
  • 2x/1.5 yrs → 4X transistors/3 yrs
  • Crossbars grow by N² → 2X switch/3 yrs
  • 16 x 16 in 1999 → 64 x 64 in 2005
  • ISTORE rack (19" x 33" x 84") (480 mm x 840 mm
    x 2130 mm)
  • 1 tray (3" high) → 16 x 32 → 512 ISTORE nodes
  • 20 trays + switches + UPS → 10,240 ISTORE nodes (!)
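
  A quick check of the packing arithmetic (illustrative only):

    # 16 x 32 nodes fit in one 3"-high tray; 20 trays fill the rack.
    nodes_per_tray = 16 * 32                # 512
    nodes_per_rack = 20 * nodes_per_tray    # 10,240
    print(nodes_per_tray, nodes_per_rack)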

14
  • Each node includes extra diagnostic support
  • diagnostic processor: independent hardware
    running monitoring and control software
  • monitors hardware and environmental state not
    normally visible to system software
  • control:
  • reboot/power-cycle main CPU
  • inject simulated faults: power, bus transients,
    memory errors, network interface failure, ...
  • separate diagnostic network connects the
    diagnostic processors of each brick
  • provides independent network path to diagnostic
    CPU
  • works when brick CPU is powered off or has failed

15
  • Software collects and filters monitoring data
  • hardware monitors device health, environmental
    conditions, and indicators that software is
    working
  • some information is processed locally to provide
    fail-fast behavior when higher-level software is
    deemed potentially untrustworthy
  • most information passed on to software monitoring
  • software monitoring layer also collects
    higher-level performance data, access patterns,
    app. heartbeats

16
  • The data is collected in a virtual database
  • desired monitoring data is selected and
    aggregated by specifying views over the
    database
  • the database schema and views hide differences in
    monitoring implementation across heterogeneous HW
    and SW
  • Running example:
  • If ambient temperature of a shelf is rising
    significantly faster than that of other shelves,
  • reduce power consumption on those nodes, then
  • if necessary, migrate non-redundant data replicas
    off some nodes on that shelf and shut them down
  • view: for each shelf, the average temperature across
    all temperature sensors on that shelf
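
  A minimal Python sketch of this per-shelf view, assuming a
  hypothetical record format for sensor readings (the actual ISTORE
  schema is not given here):

    # Aggregate raw temperature readings into a per-shelf average,
    # i.e., the "view" described above. The record format is assumed.
    from collections import defaultdict
    from statistics import mean

    def shelf_temperature_view(readings):
        """Map each shelf to the average of its temperature sensors."""
        by_shelf = defaultdict(list)
        for r in readings:               # r = {"shelf": id, "temp_c": value}
            by_shelf[r["shelf"]].append(r["temp_c"])
        return {shelf: mean(temps) for shelf, temps in by_shelf.items()}

    readings = [{"shelf": 1, "temp_c": 31.0}, {"shelf": 1, "temp_c": 33.0},
                {"shelf": 2, "temp_c": 27.5}]
    print(shelf_temperature_view(readings))   # {1: 32.0, 2: 27.5}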

17
  • Conditions requiring administrative response are
    detected by observing values and/or patterns in
    the monitoring data
  • triggers specify these patterns and invoke
    appropriate adaptation algorithms
  • input to a trigger is a view of the monitoring
    data
  • views and triggers can be specified separately to
    allow
  • easy selection of desired reaction algorithm
  • easy redefinition of conditions that invoke a
    particular reaction
  • Running example:
  • trigger: change in temperature of one shelf > 0
    and more than twice the change in temperature of
    any other shelf, averaged over a one-minute period
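
  The trigger can be read as a predicate over that view; a minimal
  sketch, assuming the view supplies each shelf's one-minute average
  temperature change (names are illustrative, not the ISTORE API):

    def temperature_trigger(delta_by_shelf):
        """Return shelves whose 1-minute temperature rise is positive
        and more than twice that of every other shelf."""
        hot = []
        for shelf, delta in delta_by_shelf.items():
            others = [d for s, d in delta_by_shelf.items() if s != shelf]
            if delta > 0 and all(delta > 2 * d for d in others):
                hot.append(shelf)
        return hot

    print(temperature_trigger({1: 4.0, 2: 0.5, 3: 1.0}))   # [1]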

18
  • Adaptation algorithms coordinate
    application-level reaction mechanisms
  • adaptation algorithms define a sequence of
    operations that address the anomaly detected by
    the associated trigger
  • adaptation algorithms call application-implemented
    mechanisms via a standard API
  • but are independent of application mechanism
    details
  • Running example: coordination of reaction
  • 1) identify nodes with non-redundant data
  • 2) invoke application mechanism to migrate that
    data off n of those nodes
  • 3) reduce power consumption by those n nodes
  • 4) install trigger to monitor temperature change
    and shut down nodes if power reduction is
    ineffective
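
  The four steps might be coordinated roughly as follows; the api
  object stands in for the application-implemented mechanisms, and
  none of these method names come from the actual ISTORE API:

    def react_to_hot_shelf(api, shelf, n):
        # 1) identify nodes on the shelf holding non-redundant data
        at_risk = [node for node in api.nodes_on_shelf(shelf)
                   if api.has_nonredundant_data(node)]
        targets = at_risk[:n]
        # 2) invoke the application mechanism to migrate that data off
        for node in targets:
            api.migrate_data(source=node)
        # 3) reduce power consumption by those n nodes
        for node in targets:
            api.reduce_power(node)
        # 4) install a follow-up trigger: shut the nodes down if the
        #    shelf keeps heating despite the power reduction
        api.install_trigger(
            condition=lambda deltas: deltas[shelf] > 0,
            action=lambda: [api.shutdown(node) for node in targets])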

19
  • ISTORE expects reaction mechanisms to be
    implemented by the application
  • these reaction mechanisms are application-specific
  • e.g., moving data requires knowledge of data
    semantics, consistency policies, ...
  • a research goal of ISTORE is to provide a
    standard API to these mechanisms
  • initially, try to leverage and extend existing
    mechanisms to avoid wholesale rewriting of
    applications
  • many data-intensive applications already support
    functionality similar to the needed mechanisms
  • eventually, generalize and extend API to
    encompass mechanisms and needs of future
    applications

20
Policies
  • Programmer or administrator specifies policies to
    control the system's adaptive behavior
  • the policy compiler turns a high-level
    declarative specification of desired behavior
    into the appropriate
  • adaptation algorithms (that invoke application
    mechanisms through the ISTORE API)
  • triggers (to invoke the adaptation algorithms
    when the appropriate conditions are detected)
  • views (that enable monitoring needed by the
    triggers)
  • Running example:
  • policy: if ambient temperature of a shelf is
    rising significantly faster than that of other
    shelves, reduce power and prepare to shut down
    nodes
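
  The slides do not define the policy language itself; one plausible
  shape for such a declarative policy before compilation (purely
  illustrative):

    # A declarative statement the policy compiler would expand into
    # the view, trigger, and adaptation algorithm of the preceding
    # slides. The dict format here is an assumption.
    hot_shelf_policy = {
        "view":      "average temperature per shelf, 1-minute window",
        "condition": "one shelf's rise > 0 and > 2x every other shelf's",
        "reaction":  ["migrate non-redundant data off n nodes",
                      "reduce power on those nodes",
                      "shut down if temperature keeps rising"],
    }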

21
Summary: Layered System Model
  • Layered system model for monitoring and reaction
    provides reactive self-maintenance
  • Self-maintenance in ISTORE also consists of
    proactive, continuous self-testing and analysis

22
The ISTORE Approach
  • Divides self-maintenance into two components
  • 1) reactive self-maintenance: dynamic reaction
    to exceptional system events
  • self-diagnosing, self-monitoring hardware
  • software monitoring and problem detection
  • automatic reaction to detected problems
  • 2) proactive self-maintenance: continuous online
    self-testing and self-analysis
  • in situ fault injection, self-testing, and
    scrubbing to detect flaky hardware components and
    to exercise rarely-taken application code paths
    before they're used
  • automatic characterization of system components

23
Continuous Online Self-Testing
  • Self-maintaining systems should automatically
    carry out preventative maintenance
  • need aggressive in situ component testing via
  • fault injection: triggering hardware and software
    error-handling paths to verify their
    integrity/existence
  • stress testing: pushing HW/SW components past
    normal operating parameters
  • scrubbing: periodic restoration of potentially
    'decaying' hardware or software state
  • ISTORE periodically isolates nodes from the
    system and performs extensive self-tests
  • nodes can be easily isolated due to ISTORE's
    built-in redundancy
  • even in a deployed, running system
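
  A sketch of this periodic isolate-and-test cycle, assuming
  hypothetical cluster-management calls:

    def self_test_cycle(cluster, tests):
        """Rotate through nodes, pulling each out of service to test."""
        for node in cluster.nodes():
            cluster.isolate(node)        # redundancy keeps the service up
            if all(test(node) for test in tests):
                cluster.reintegrate(node)
            else:
                cluster.flag_for_service(node)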

24
Self-Testing Hardware
  • The goal of hardware self-testing is to detect flaky
    components and preserve data integrity
  • Examples
  • fault injection: power-cycle a disk to check for
    stiction
  • stress testing: run a disk controller at 100%
    utilization to test behavior under load
  • scrubbing: read all disk sectors and rewrite any
    that suffer soft errors; "fire" the disk if too many
    errors
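
  The scrubbing example might look like the following sketch;
  read_with_retry and write_sector stand in for driver-level
  primitives not specified in the slides:

    def scrub_disk(disk, max_soft_errors=10):
        """Read every sector, rewriting those that needed a retry.
        Returns False if the disk should be taken out of service."""
        soft_errors = 0
        for sector in range(disk.num_sectors):
            data, first_try_ok = disk.read_with_retry(sector)
            if not first_try_ok:                 # soft error, recovered on retry
                disk.write_sector(sector, data)  # restore decaying state
                soft_errors += 1
        return soft_errors <= max_soft_errors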

25
Self-Testing Software
  • Software self-testing proactively identifies
    weaknesses in software before they cause a
    visible failure
  • helps prevent failure due to bugs that only
    appear in certain hardware/software
    configurations
  • helps identify bugs that occur when software is
    driven into an untested state only reachable in a
    live system
  • e.g., long uptimes, heavy load, unexpected
    requests
  • Examples
  • fault injection (includes HW- and SW-induced
    faults that the SW is expected to handle): SCSI
    parity errors, invalid return codes from the
    operating system
  • stress testing: heavy load, pathological requests
  • scrubbing: restart/reboot long-running software
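
  In the spirit of the invalid-return-code example, software fault
  injection can wrap an I/O routine so that a small fraction of calls
  fail, exercising the caller's error-handling path; a minimal sketch
  (names illustrative):

    import random

    def inject_faults(fn, failure_rate=0.01):
        """Wrap fn so a small fraction of calls raise an injected error."""
        def wrapped(*args, **kwargs):
            if random.random() < failure_rate:
                raise OSError("injected fault")   # simulated failing call
            return fn(*args, **kwargs)
        return wrapped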

26
Online Self-Analysis
  • Self-maintaining systems require knowledge of
    their components' dynamic runtime behavior
  • current plug-and-play hardware approaches are
    not sufficient
  • need more than just discovery of new devices'
    functional capabilities and supported APIs
  • also need dynamic component characterization

27
Characterizing HW/SW Behavior
  • An ISTORE may contain black-box components
  • heterogeneous hardware devices
  • application-supplied reaction mechanisms whose
    implementations are hidden
  • To select and tune adaptation algorithms, the
    ISTORE system needs to understand the behavior of
    these components
  • in the context of a complex, live system
  • examples
  • characterize performance of disks in the system, use
    that data to select destination disks for replica
    creation
  • isolate two nodes, invoke replication from one to
    the other, monitor actions taken by application
    (e.g., how long it takes, how much data is moved)
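
  The disk-selection example reduces to timing a probe operation on
  each candidate disk and ranking the results; a sketch with an
  assumed probe callback:

    import time

    def rank_disks(disks, probe):
        """Return disks ordered fastest-first under the given probe
        (e.g., writing and syncing a fixed-size test file)."""
        timings = {}
        for disk in disks:
            start = time.perf_counter()
            probe(disk)
            timings[disk] = time.perf_counter() - start
        return sorted(disks, key=timings.get)

    # replica destinations = the k fastest disks:
    # targets = rank_disks(all_disks, probe)[:k]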

28
Support for Application Self-tuning
  • ISTORE's characterization mechanisms can also
    help applications tune themselves
  • current systems require manual tuning to meet
    scalability and performance goals
  • especially true for shared-nothing systems in
    which computational and storage resources aren't
    pooled
  • a possible research direction is to expose
    characterization information to application via
    an extension of the ISTORE API
  • this would allow 'aware' applications to
    automatically adapt their behavior based on
    system conditions

29
ISTORE API
  • The ISTORE API defines interfaces for
  • adaptation algorithms to invoke application
    reaction mechanisms
  • e.g., migrate data, replicate data, checkpoint,
    shutdown, ...
  • applications to provide hints to the runtime
    system so it can optimize adaptation algorithms and
    data storage
  • e.g., application tags data whose unavailability
    can be temporarily tolerated
  • runtime system to invoke application self-testing
    and fault injection, and for application to
    report results
  • runtime system to inform application about
    current state of system, hardware capabilities,
    ...
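
  The API itself was a research goal rather than a finished
  specification; a speculative Python rendering of the four interface
  groups above (every name here is an assumption):

    from abc import ABC, abstractmethod

    class ApplicationInterface(ABC):
        """Implemented by the application; invoked by the runtime."""
        # reaction mechanisms used by adaptation algorithms
        @abstractmethod
        def migrate_data(self, source, dest): ...
        @abstractmethod
        def replicate_data(self, source, dest): ...
        @abstractmethod
        def checkpoint(self, node): ...
        @abstractmethod
        def shutdown(self, node): ...
        # self-testing and fault injection hooks
        @abstractmethod
        def run_self_test(self, fault_spec): ...
        # notifications about current system state, HW capabilities, ...
        @abstractmethod
        def notify_system_state(self, state): ...

    class RuntimeInterface(ABC):
        """Implemented by the runtime; invoked by the application."""
        # hints, e.g., tagging data whose unavailability is tolerable
        @abstractmethod
        def tag_data(self, data_id, tolerates_unavailability): ...
        # reporting self-test results back to the runtime
        @abstractmethod
        def report_self_test(self, results): ...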

30
Summary
  • ISTORE focuses on Scalability, Availability, and
    Maintainability for emerging data-intensive
    network applications
  • ISTORE provides a platform for deploying
    self-maintaining systems that are up 24x7
  • ISTORE will achieve self-maintenance via
  • hardware platform with integrated diagnostic
    support
  • reactive self-maintenance: a layered,
    policy-driven runtime system that provides a
    framework for monitoring and reaction
  • proactive self-maintenance: support for
    continuous on-line self-testing and component
    characterization
  • and a standard API for interfacing applications
    to the runtime system