Deconstructing Commodity Storage Clusters - PowerPoint PPT Presentation

About This Presentation
Title:

Deconstructing Commodity Storage Clusters

Description:

Title: Slide 1 Author: Haryadi Gunawi Last modified by: Haryadi Gunawi Created Date: 1/20/2005 2:53:55 AM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 30
Provided by: Haryadi2

Transcript and Presenter's Notes

1
Deconstructing Commodity Storage Clusters
  • Haryadi S. Gunawi, Nitin Agrawal,
  • Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

Univ. of Wisconsin - Madison
Jiri Schindler
EMC Corporation
2
Storage system
  • Storage system
  • Important components of large-scale systems
  • Multi-billion dollar industry
  • Often composed of high-end storage servers
  • A big box with lots of disks inside
  • The simple question
  • How does a storage server work?
  • Simple but hard: closed storage subsystem design

3
Why do we need to know?
  • Better modeling
  • How the system behaves under different workloads
  • Example in the storage industry: capacity models for
    capacity planning
  • Model is limited if the information is limited
  • Product validation
  • Validate what product specs say
  • Performance numbers alone cannot confirm them
  • Critical evaluation of design and implementation
    choices
  • Control what is occurring inside

4
Traditionally black box
  • Highly customized and proprietary hardware and OS
  • Hitachi Lightning, NetApp Filers, EMC Symmetrix
  • EMC Symmetrix: disk/cache manager, proprietary OS
  • Internal information is hidden behind standard
    interfaces

[Figure: a client and a black-box storage system marked "?"; the client sees only acks]
5
Modern graybox storage system
  • Cluster of commodity PCs running commodity OS
  • Google FS cluster, HP FAB, EMC Centera
  • Advantages of commodity storage clusters
  • Direct internal observation: visible probe
    points
  • Leverage existing standardized tools

[Figure: a storage system built from commodity PCs connected by switches, serving a client; internal message label: Update DB]
6
Intra-box Techniques
  • Two Intra-box techniques
  • Observation
  • System perturbation
  • Two components of analysis
  • Deduce structure of main communication protocol
  • Object Read and Write protocol
  • Internal policy decisions
  • Caching, prefetching, write buffering, load
    balancing, etc.

7
Goal and EMC Author
  • Objectives
  • Feasibility of deconstructing commodity storage
    clusters without access to source code
  • Results achieved without EMC assistance
  • EMC Author
  • Evaluate correctness of our findings
  • Give insights behind their design decisions

8
Outline
  • Introduction
  • EMC Centera Overview
  • Intra-box tools
  • Deducing Protocol
  • Observation and Delay Perturbation
  • Inferring Policies
  • System Perturbation
  • Conclusion

9
Centera Topology
[Figure: Centera topology. Labels: clients, WAN, access nodes (AN 1, AN 2), LAN, storage nodes (SN 1 to SN 6)]
10
Commodity OS
[Figure: software stacks. Storage node and access node: Centera software on Linux with Reiserfs, TCP/UDP, and IDE driver. Client: Client SDK over TCP. Network labels: WAN, LAN]
11
Probe Points: Observation
[Figure: storage node, access node, and client stacks with probe points. Labels: three tcpdump instances, a pseudo device driver, IDE drives]
  • Internal probe points
  • Trace traffic using standardized tools
  • tcpdump: trace network traffic
  • Pseudo device driver: trace disk traffic

12
Probe Points: Perturbation
[Figure: storage node, access node, and client stacks with perturbation points. Labels: a user-level process on the storage node that adds CPU load (while(1) ..) or disk load (cp fX fY), three modified NistNet instances, a pseudo device driver adding delay, three tcpdump instances, IDE drives]
  • Perturbing system at probe points
  • Modified NistNet: delay particular messages
  • Pseudo device driver: delay disk I/O traffic
  • Additional load
  • CPU load: high-priority while loop
  • Disk load: file copy
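
The two load generators named on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the function names cpu_load and disk_load are hypothetical, the loops are bounded so the sketch terminates, and process priority is left unchanged (the slide's loop ran at high priority).

```python
import shutil
import tempfile
import time
from pathlib import Path


def cpu_load(duration_s: float) -> int:
    """Spin in a tight loop for duration_s seconds; return the iteration count."""
    deadline = time.monotonic() + duration_s
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def disk_load(src: Path, dst: Path, copies: int) -> int:
    """Copy src to dst repeatedly (like `cp fX fY`); return total bytes copied."""
    total = 0
    for _ in range(copies):
        shutil.copyfile(src, dst)
        total += dst.stat().st_size
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        f_x = Path(tmp) / "fX"
        f_y = Path(tmp) / "fY"
        f_x.write_bytes(b"\0" * 1_000_000)  # 1 MB source file for the demo
        print("loop iterations:", cpu_load(0.1))
        print("bytes copied:", disk_load(f_x, f_y, copies=3))
```

In an experiment of this kind the generators would run on one storage node while the write workload runs, so that the node's selection rate can be compared with an unloaded run.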

13
Outline
  • Introduction
  • EMC Centera Overview
  • Deducing Protocol
  • Observation and Delay Perturbation
  • Inferring Policies
  • System Perturbation
  • Conclusion

14
Understanding the protocol
  • Understanding Read/Write protocol
  • Read and Write implementations in big distributed
    storage systems are not simple
  • Deconstruct the protocol structure
  • Which pieces are involved?
  • Where is data sent?
  • Is data reliably stored, mirrored, striped?

15
Observing Write Protocol
  • Deconstruct protocol using passive observation
  • Run a series of write workloads
  • Observe network and disk traffic
  • Correlation tools convert traces into protocol
    structure
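
The correlation step can be pictured as merging the network and disk traces into a single timeline. The sketch below is an assumption about how such a tool might look, not the authors' implementation: the Event record, its fields, and the sample events are hypothetical, and the node clocks are assumed to be synchronized.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    ts: float    # seconds since the start of the experiment
    node: str    # where the event was captured, e.g. "an1" or "sn1"
    kind: str    # "net" (packet trace) or "disk" (disk trace)
    detail: str  # free-form description of the event


def correlate(net: Iterable[Event], disk: Iterable[Event]) -> List[Event]:
    """Merge both traces into one timeline ordered by timestamp."""
    return sorted([*net, *disk], key=lambda e: e.ts)


def render(timeline: List[Event]) -> List[str]:
    """Format the timeline as one text line per event."""
    return [f"{e.ts:8.3f}  {e.node:<4} {e.kind:<4} {e.detail}" for e in timeline]


if __name__ == "__main__":
    # Made-up sample events, for illustration only.
    net_trace = [
        Event(0.000, "an1", "net", "an1 -> sn1 write request"),
        Event(0.004, "sn1", "net", "sn1 -> sn2 data transfer"),
        Event(0.012, "sn1", "net", "sn1 -> an1 write complete"),
    ]
    disk_trace = [Event(0.007, "sn1", "disk", "disk write")]
    print("\n".join(render(correlate(net_trace, disk_trace))))
```

Reading the merged timeline shows which disk writes fall between which messages, which is the raw material for the protocol phases.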

[Figure: EMC Centera with a client, access nodes (an1, an2), and storage nodes (sn1 to sn6)]
16
Observation Results
  • Object Write Protocol findings
  • Phase 1: Write request establishment
  • Phase 2: Data transfer
  • Phase 3: Disk write, notify other SNs, commit
  • Phase 4: Series of acknowledgements
  • Determine general properties
  • Primary SN handles generation of 2nd copy
  • Two new TCP connections per object write

[Figure: timeline of an object write among the client, access node, primary SN, secondary SN, and other SNs (SNx, SNy, SNv, SNw). Message labels in order: write request, TCP setup, request ack, data transfer, transfer ack, write-commit, write complete]
17
Resolving Dependencies
  • Cannot conclude dependencies from observation
    only
  • B occurring after A does not imply that B depends on A
  • Must delay A, and see if B is delayed

[Figure: message timeline among AN, primary SN, and secondary SN, with the primary commit (pc) marked]
From observation only: primary commit appears to depend on
the secondary commit and the synchronous disk write
  • Conclude causality by delaying
  • disk write traffic and
  • secondary commit
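
The delay test on this slide reduces to a simple comparison: if delaying A by some amount shifts B by about the same amount, B depends on A. The function below is a sketch under that reading. Its name, the tolerance parameter, and the example timings are assumptions, not values from the study.

```python
def depends_on(baseline_b: float, delayed_b: float, injected_delay: float,
               tolerance: float = 0.2) -> bool:
    """Return True if B shifted by roughly the delay injected into A.

    baseline_b:     time of B (relative to request start) with no perturbation
    delayed_b:      time of B when A was delayed
    injected_delay: delay added to A, in the same time unit
    tolerance:      accepted relative deviation from the injected delay
    """
    if injected_delay <= 0:
        raise ValueError("injected_delay must be positive")
    shift = delayed_b - baseline_b
    return abs(shift - injected_delay) <= tolerance * injected_delay


if __name__ == "__main__":
    # Made-up timings in milliseconds.
    print(depends_on(baseline_b=12, delayed_b=111, injected_delay=100))  # B moved with A
    print(depends_on(baseline_b=12, delayed_b=13, injected_delay=100))   # B did not move
```

A shift much smaller than the injected delay suggests that B does not wait for A. Intermediate shifts are ambiguous and would call for repeated runs.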

18
Delaying a Particular Message
  • Need to delay a particular message
  • Leverage packet sizes
  • Modify NistNet
  • Delay specific message, not link
  • Ex: delay sc, the secondary commit (90 bytes)
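
Selecting a message by its size can be sketched as a lookup from packet length to extra delay. This is an illustration of the idea only and does not reproduce NistNet's interface. The names delay_for and apply_delays are hypothetical, times are in milliseconds, and the 321 and 539 byte packets are sizes borrowed from the slide's figure as stand-ins for other messages.

```python
from typing import Dict, List, Tuple


def delay_for(packet_len: int, rules: Dict[int, int]) -> int:
    """Return the extra delay (ms) for a packet of the given length."""
    return rules.get(packet_len, 0)


def apply_delays(packets: List[Tuple[int, int]],
                 rules: Dict[int, int]) -> List[Tuple[int, int]]:
    """Map (arrival_ms, length) pairs to (release_ms, length), sorted by release time."""
    released = [(t + delay_for(n, rules), n) for t, n in packets]
    return sorted(released)


if __name__ == "__main__":
    rules = {90: 500}  # hold 90-byte messages (sc on the slide) for 500 ms
    packets = [(0, 321), (10, 90), (20, 539)]
    print(apply_delays(packets, rules))
```

A size-based rule delays every message of that length, so it only isolates one message type when no other message on the link shares the size.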

[Figure: write-protocol timeline among the client, access node, primary SN, and secondary SN, annotated with message sizes in bytes (299, 509, 161, 289, 375, 321, 539, 4); sc is 90 bytes; the primary commit is also marked]
19
Delaying secondary-commit
  • Resolving first dependency
  • Delay secondary commit → primary commit also gets
    delayed
  • Primary commit depends on the receipt of
    secondary commit

[Figure: timeline among AN, primary SN, and secondary SN with the secondary commit delayed]
20
Delaying disk I/O traffic
  • Delay disk writes at primary storage node

[Figure: primary SN timeline showing the secondary-commit message and the delayed disk write]
From observation and delay: primary commit
depends on the secondary commit message and the synchronous
disk write
21
Ability to analyze internal designs
  • Intra-box techniques: observation and
    perturbation by delay
  • Able to deduce Object Write protocol
  • Gives ability to analyze internal design decisions
  • Serial vs. parallel
  • Primary SN handles the generation of 2nd copy
    (serial)
  • vs.
  • AN handles both 1st and 2nd copies (parallel)
  • EMC Centera: write throughput is more important
  • Decreasing load on access nodes increases write
    throughput
  • New TCP connections (internally) per object write
  • vs. using persistent connections to remove TCP
    setup cost
  • Prefer simplicity: no need to manage persistent
    connections for all requests

22
Outline
  • Introduction
  • EMC Centera Overview
  • Deducing Protocol
  • Inferring Policies
  • Various system perturbation
  • Conclusion

23
Inferring internal policies
  • Write policies
  • Level of replication, Load balancing,
    Caching/buffering
  • Read policies
  • Caching, Prefetching, Load balancing
  • Try to infer
  • Is a particular policy implemented?
  • At which level is it implemented?
  • Ex: read caching at client, access node, or storage
    node?
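
One way to locate a read cache with the probe points is to read the same object twice and note which traffic the second read still produces. The decision rule below is a sketch of that reasoning and not a procedure taken from the slides. The RepeatReadObservation record and its fields are hypothetical, and it assumes the first read reached the disk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatReadObservation:
    client_to_an_traffic: bool  # second read produced a request to the access node
    an_to_sn_traffic: bool      # access node forwarded the request to a storage node
    sn_disk_read: bool          # storage node issued a disk read


def cache_level(obs: RepeatReadObservation) -> str:
    """Name the first level that absorbed the repeated read, or "none"."""
    if not obs.client_to_an_traffic:
        return "client"
    if not obs.an_to_sn_traffic:
        return "access node"
    if not obs.sn_disk_read:
        return "storage node"
    return "none"


if __name__ == "__main__":
    # Network traffic at both hops but no disk read on the repeat.
    print(cache_level(RepeatReadObservation(True, True, False)))
```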

24
System Perturbation
  • Perturb the system
  • Delay and extra load
  • 4 common load-balancing factors
  • CPU load
  • High priority while loop
  • Disk load
  • Background file copy
  • Active TCP connection
  • Network delay

25
Write Load Balancing
  • What factors determine which storage nodes are
    selected?
  • Experiment
  • Observe which primary storage nodes are selected
  • Without load: writes are balanced
  • With load: writes skew toward unloaded nodes
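
The experiment comes down to counting how often each storage node is chosen as primary and comparing runs. The helper below is a minimal sketch with a hypothetical name, selection_share, and the node sequences in the demo are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def selection_share(primaries: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of writes for which each node was the primary SN."""
    counts = Counter(primaries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {node: n / total for node, n in counts.items()}


if __name__ == "__main__":
    # Made-up observations: one primary SN name per object write.
    unloaded_run = ["sn1", "sn2"] * 50
    loaded_run = ["sn1"] * 80 + ["sn2"] * 20  # sn2 carries extra load
    print(selection_share(unloaded_run))
    print(selection_share(loaded_run))
```

Comparing the shares from a perturbed run with those from an unperturbed run indicates whether the perturbed factor influences node selection.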

[Figure: the AN choosing between sn1 (unloaded) and sn2 (unloaded, then loaded)]
26
Write Load Balancing Results
[Figure: results for sn1 and sn2 under five conditions: normal (no perturbation), additional CPU load, disk load, network load, and incoming network delay; the load applied to sn2 is labeled CPU, disk, TCP, or delay]
27
Summary of findings
Write Policies
  • Replication: two copies in two nodes attached to different power sources (reliability)
  • Load balancing: CPU usage (locally observable status); network status is not incorporated
  • Write buffering: storage nodes write synchronously
  • EMC Centera: simplicity and reliability
Read Policies
  • Caching: storage node only (commodity filesystem); access node and client do not cache
  • Prefetching: storage node only (commodity filesystem); access node and client do not prefetch
  • Load balancing: not implemented in earlier version; still reads from busy nodes
28
Conclusion
  • Intra-box
  • Observe and perturb
  • Deconstruct protocol and infer policies
  • No access to source code
  • Power of probe points
  • More observation places
  • Ability to control the system
  • Systems built with more externally visible probe
    points
  • Systems more readily understood, analyzed, and
    debugged
  • Higher-performing, more robust and reliable
    computer systems

29
Questions?