Title: Explicit Control in a Batch-aware Distributed File System
1. Explicit Control in a Batch-aware Distributed File System
- John Bent
- Douglas Thain
- Andrea Arpaci-Dusseau
- Remzi Arpaci-Dusseau
- Miron Livny
- University of Wisconsin, Madison
2. Grid computing
Physicists invent distributed computing!
Astronomers develop virtual supercomputers!
3. Grid computing
[Diagram: remote clusters connected to home storage across the Internet.]
If it looks like a duck . . .
4. Are existing distributed file systems adequate for batch computing workloads?
- NO. Internal decisions are inappropriate
- Caching, consistency, replication
- A solution: the Batch-Aware Distributed File System (BAD-FS)
- Combines knowledge with external storage control
- Detailed information about the workload is known
- Storage layer allows external control
- External scheduler makes informed storage decisions
- Combining information and control results in
- Improved performance
- More robust failure handling
- Simplified implementation
5. Outline
- Introduction
- Batch computing
- Systems
- Workloads
- Environment
- Why not DFS?
- Our answer: BAD-FS
- Design
- Experimental evaluation
- Conclusion
6. Batch computing
- Not interactive computing
- Job description languages
- Users submit jobs
- The system itself executes them
- Many different batch systems
- Condor
- LSF
- PBS
- Sun Grid Engine
7. Batch computing
[Diagram: the scheduler dispatches jobs 1-4 across the Internet from home storage to remote compute resources.]
8. Batch workloads
Pipeline and Batch Sharing in Grid Workloads. Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC-12, 2003.
- General properties
- Large number of processes
- Process and data dependencies
- I/O intensive
- Different types of I/O
- Endpoint
- Batch
- Pipeline
- Our focus: scientific workloads
- More generally applicable
- Many others use batch computing
- video production, data mining, electronic design, financial services, graphic rendering
9. Batch workloads
[Diagram: a batch of pipelines; pipeline I/O flows between stages within each pipeline, while endpoint I/O enters and leaves at the ends.]
10. Cluster-to-cluster (c2c)
- Not quite p2p
- More organized
- Less hostile
- More homogeneity
- Correlated failures
- Each cluster is autonomous
- Run and managed by different entities
- An obvious bottleneck is the wide-area link
[Diagram: home store connected to multiple remote clusters across the Internet.]
How to manage the flow of data into, within, and out of these clusters?
11. Why not DFS?
[Diagram: home store and remote clusters connected across the Internet.]
- A distributed file system would be ideal
- Easy to use
- Uniform name space
- Designed for wide-area networks
- But . . .
- Not practical
- Embedded decisions are wrong
12. DFSs make bad decisions
- Caching
- Must guess what and how to cache
- Consistency
- Output: must guess when to commit
- Input: needs a mechanism to invalidate caches
- Replication
- Must guess what to replicate
13. BAD-FS makes good decisions
- Removes the guesswork
- Scheduler has detailed workload knowledge
- Storage layer allows external control
- Scheduler makes informed storage decisions
- Retains simplicity and elegance of DFS
- Practical and deployable
14. Outline
- Introduction
- Batch computing
- Systems
- Workloads
- Environment
- Why not DFS?
- Our answer: BAD-FS
- Design
- Experimental evaluation
- Conclusion
15. Practical and deployable
- User-level: requires no privilege
- Packaged as a modified batch system
- A new batch system which includes BAD-FS
- General: will work on all batch systems
- Tested thus far on multiple batch systems
[Diagram: BAD-FS storage servers deployed alongside SGE on remote clusters, all reachable from the home store across the Internet.]
16. Contributions of BAD-FS
[Diagram: each compute node runs a CPU manager and a BAD-FS storage server; the BAD-FS scheduler at the home site manages the job queue and home storage.]
1) Storage managers
2) Batch-Aware Distributed File System
3) Expanded job description language
4) BAD-FS scheduler
17. BAD-FS knowledge
- Remote cluster knowledge
- Storage availability
- Failure rates
- Workload knowledge
- Data type (batch, pipeline, or endpoint)
- Data quantity
- Job dependencies
18. Control through volumes
- Guaranteed storage allocations
- Containers for job I/O
- Scheduler (a sketch follows this list)
- Creates volumes to cache input data
- Subsequent jobs can reuse this data
- Creates volumes to buffer output data
- Destroys pipeline data, copies endpoint data home
- Configures workload to access containers
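A minimal Python sketch, assuming hypothetical Volume and Job types rather than BAD-FS code, of how the scheduler might drive this volume lifecycle: one cache volume for shared batch input, one scratch volume per pipeline, endpoint output copied home, everything else discarded.

from dataclasses import dataclass, field

@dataclass
class Volume:
    kind: str                       # "cache" (read-only batch data) or "scratch" (pipeline data)
    files: dict = field(default_factory=dict)

@dataclass
class Job:
    pipeline: int
    run: callable                   # run(batch_vol, scratch_vol) writes into the mounted volumes

def run_batch(jobs, batch_inputs, home_store):
    # One cache volume holds the shared batch input; subsequent jobs reuse it.
    batch_vol = Volume("cache", dict(batch_inputs))
    scratch = {}                    # one private scratch volume per pipeline
    for job in jobs:
        vol = scratch.setdefault(job.pipeline, Volume("scratch"))
        job.run(batch_vol, vol)     # job reads cached batch data, buffers pipeline output
    for vol in scratch.values():
        # Only endpoint output (here marked by an illustrative "endpoint/" prefix)
        # is copied home; pipeline data is simply discarded.
        home_store.update({k: v for k, v in vol.files.items() if k.startswith("endpoint/")})
    # Dropping the volume objects is the "destroy" step; no write-back is needed.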
19. Knowledge plus control
- Enhanced performance
- I/O scoping
- Capacity-aware scheduling
- Improved failure handling
- Cost-benefit replication
- Simplified implementation
- No cache consistency protocol
20. I/O scoping
- Technique to minimize wide-area traffic
- Allocate storage to cache batch data
- Allocate storage for pipeline and endpoint data
- Extract endpoint data
[Diagram: compute nodes, the BAD-FS scheduler, and the wide-area link; AMANDA moves 200 MB pipeline, 500 MB batch, and 5 MB endpoint I/O per job.]
Steady state: only 5 of 705 MB traverse the wide-area (see the sketch below).
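A back-of-the-envelope sketch of that figure, assuming one AMANDA job with the I/O mix above and a warm cooperative cache:

# Per-job AMANDA I/O mix from the slide (MB).
pipeline_mb, batch_mb, endpoint_mb = 200, 500, 5
total_mb = pipeline_mb + batch_mb + endpoint_mb       # 705 MB of I/O per job

# Remote I/O: every byte crosses the wide-area link.
remote_io_traffic_mb = total_mb                       # 705 MB

# BAD-FS steady state: batch data is already cached at the cluster, pipeline
# data stays in scratch volumes, only endpoint data goes home.
badfs_traffic_mb = endpoint_mb                        # 5 MB

print(remote_io_traffic_mb, badfs_traffic_mb)         # 705 5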
21. Capacity-aware scheduling
- Technique to avoid over-allocations
- Scheduler runs only as many jobs as fit
22. Capacity-aware scheduling
[Diagram: pipelines and their endpoint data competing for limited cluster storage.]
23. Capacity-aware scheduling
- 64 batch-intensive synthetic pipelines
- Vary size of batch data
- 16 compute nodes
24. Improved failure handling
- Scheduler understands data semantics
- Data is not just a collection of bytes
- Losing data is not catastrophic
- Output can be regenerated by rerunning jobs
- Cost-benefit replication
- Replicates only data whose replication cost is cheaper than the cost to rerun the job
- Results in paper
25. Simplified implementation
- Data dependencies known
- Scheduler ensures proper ordering (see the ordering sketch below)
- No need for a cache consistency protocol in the cooperative cache
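A small sketch, in Python with made-up job names, of why explicit ordering makes a consistency protocol unnecessary: the scheduler never dispatches a job until every job that produces its inputs has finished, so readers can never observe stale cached data.

def dependency_order(jobs, parents):
    # jobs: iterable of job names; parents[j]: set of jobs whose output j reads.
    done, order = set(), []
    remaining = set(jobs)
    while remaining:
        ready = {j for j in remaining if parents.get(j, set()) <= done}
        if not ready:
            raise ValueError("cycle in job dependencies")
        for j in sorted(ready):          # deterministic order for the sketch
            order.append(j)
            done.add(j)
        remaining -= ready
    return order

# Example: B reads A's pipeline output, D reads C's.
print(dependency_order("ABCD", {"B": {"A"}, "D": {"C"}}))   # ['A', 'C', 'B', 'D']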
26. Real workloads
- AMANDA
- Astrophysics: study of cosmic events such as gamma-ray bursts
- BLAST
- Biology: search for proteins within a genome
- CMS
- Physics: simulation of large particle colliders
- HF
- Chemistry: study of non-relativistic interactions between atomic nuclei and electrons
- IBIS
- Ecology: global-scale simulation of earth's climate used to study effects of human activity (e.g. global warming)
27. Real workload experience
- Setup
- 16 jobs
- 16 compute nodes
- Emulated wide-area
- Configuration
- Remote I/O
- AFS-like with /tmp
- BAD-FS
- Result: an order-of-magnitude improvement
28. BAD Conclusions
- Existing DFSs insufficient
- Schedulers have workload knowledge
- Schedulers need storage control
- Caching
- Consistency
- Replication
- Combining this control with knowledge
- Enhanced performance
- Improved failure handling
- Simplified implementation
29. For more information
- http://www.cs.wisc.edu/adsl
- http://www.cs.wisc.edu/condor
- Questions?
30. Why not the BAD scheduler and a traditional DFS?
- Cooperative caching
- Data sharing
- Traditional DFSs
- assume sharing is the exception
- provision for arbitrary, unplanned sharing
- In batch workloads, sharing is the rule
- Sharing behavior is completely known
- Data committal
- Traditional DFSs must guess when to commit
- AFS uses close, NFS uses 30 seconds
- Batch workloads precisely define when
31. Is capacity-aware scheduling important in the real world?
- Heterogeneity of remote resources
- Shared disk
- Workloads are changing; some are very, very large.
32. User burden
- Additional info is needed in the declarative language
- User probably already knows this info
- Or can readily obtain it
- Typically, this info already exists
- Scattered across a collection of scripts, Makefiles, etc.
- BAD-FS improves the current situation by collecting this info into one central location
33. Enhanced performance
- I/O scoping
- Scheduler knows I/O types
- Creates storage volumes accordingly
- Only endpoint I/O traverses wide-area
- Capacity-aware scheduling
- Scheduler knows I/O quantities
- Throttles workloads, avoids over-allocations
34. Improved failure handling
- Scheduler understands data semantics
- Lost data is not catastrophic
- Pipe data can be regenerated
- Batch data can be refetched
- Enables cost-benefit replication
- Measure
- replication cost
- data generation cost
- failure rate
- Replicate only data whose replication cost is cheaper than the expected cost to reproduce it
- Improves workload throughput
35. Capacity-aware scheduling
- Goal
- Avoid over-allocations
- Cache thrashing
- Write failures
- Method
- Breadth-first
- Depth-first
- Idleness
36. Capacity-aware scheduling evaluation
- Workload
- 64 synthetic pipelines
- Varied pipe size
- Environment
- 16 compute nodes
- Configuration
- Breadth-first
- Depth-first
- BAD-FS
Failures directly correlate with workload throughput.
37. Workload example: AMANDA
- Astrophysics: study of cosmic events such as gamma-ray bursts
- Four-stage pipeline
- 200 MB pipeline I/O
- 500 MB batch I/O
- 5 MB endpoint I/O
- Focus
- Scientific workloads
- Many others use batch computing
- video production, data mining, electronic design, financial services, graphic rendering
38. BAD-FS and the scheduler
- BAD-FS
- Allows external decisions via volumes
- A guaranteed storage allocation
- With a size, a lifetime, and a type
- Cache volumes
- Read-only view of an external server
- Can be bound together into cooperative cache
- Scratch volumes
- Private read-write name space
- Batch-aware scheduler
- Rendezvous of control and information
- Understands storage needs and availability
- Controls storage decisions
39. Scheduler controls storage decisions
- What and how to cache?
- Answer: batch data, cooperatively
- Technique: I/O scoping and capacity-aware scheduling
- What and when to commit?
- Answer: endpoint data, when ready
- Technique: I/O scoping and capacity-aware scheduling
- What and when to replicate?
- Answer: data whose cost to regenerate is high
- Technique: cost-benefit replication
40. I/O scoping
- Goal
- Minimize wide-area traffic
- Means
- Information about data type
- Storage volumes
- Method
- Create cooperative cache volumes for batch data
- Create scratch volumes to contain pipe data
- Result
- Only endpoint data traverses wide-area
- Improved workload throughput
41. I/O scoping evaluation
- Workload
- 64 synthetic pipelines
- 100 MB of I/O each
- Varied data mix
- Environment
- 32 compute nodes
- Emulated wide-area
- Configuration
- Remote I/O
- Cache volumes
- Scratch volumes
- BAD-FS
Wide-area traffic directly correlates with workload throughput.
42. Capacity-aware scheduling
- Goal
- Avoid over-allocations of storage
- Means
- Information about data quantities
- Information about storage availability
- Storage volumes
- Method (see the sketch after this list)
- Use depth-first scheduling to free pipe volumes
- Use breadth-first scheduling to free batch volumes
- Result
- No thrashing due to over-allocations of batch data
- No failures due to over-allocations of pipe data
- Improved throughput
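A minimal sketch of the admission-control side of this idea, in Python with illustrative numbers that are not the paper's configuration: the scheduler only starts as many pipelines as can hold their scratch volumes alongside the shared batch volume.

def max_concurrent_pipelines(storage_mb, batch_mb, pipe_mb, n_pipelines):
    # The shared batch volume is allocated once; each running pipeline also
    # needs its own scratch (pipe) volume. Depth-first completion frees pipe
    # volumes; breadth-first completion lets the batch volume be freed.
    if batch_mb > storage_mb:
        return 0                                    # even the batch volume will not fit
    free_mb = storage_mb - batch_mb
    return min(n_pipelines, free_mb // pipe_mb)     # throttle to what actually fits

# Made-up example: 16 GB of cluster storage, a 500 MB batch volume,
# a 200 MB pipe volume per pipeline, 64 pipelines submitted.
print(max_concurrent_pipelines(storage_mb=16 * 1024, batch_mb=500,
                               pipe_mb=200, n_pipelines=64))   # 64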
43. Capacity-aware scheduling evaluation
- Workload
- 64 synthetic pipelines
- Pipe-intensive
- Environment
- 16 compute nodes
- Configuration
- Breadth-first
- Depth-first
- BAD-FS
44. Capacity-aware scheduling evaluation
- Workload
- 64 synthetic pipelines
- Pipe-intensive
- Environment
- 16 compute nodes
- Configuration
- Breadth-first
- Depth-first
- BAD-FS
Failures directly correlate with workload throughput.
45. Cost-benefit replication
- Goal
- Avoid wasted replication overhead
- Means
- Knowledge of data semantics
- Data loss is not catastrophic
- Can be regenerated or refetched
- Method
- Measure
- Failure rate, f, within each cluster
- Cost, p, to reproduce data
- Time to rerun jobs to regenerate pipe data
- Time to refetch batch data from home
- Cost, r, to replicate data
- Replicate only when p × f > r (see the sketch after this list)
- Result
- Data is replicated only when it should be
- Can improve throughput
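A minimal sketch of that test in Python; the function and its example numbers are illustrative, not from the paper.

def should_replicate(failure_rate, reproduce_cost_s, replicate_cost_s):
    # failure_rate (f): chance the node holding the data fails before it is consumed.
    # reproduce_cost_s (p): time to rerun jobs (pipe data) or refetch from home (batch data).
    # replicate_cost_s (r): time to copy the data to another node.
    return failure_rate * reproduce_cost_s > replicate_cost_s

# A 10% failure rate and a 600 s regeneration cost justify a 30 s copy,
# but not a 90 s copy.
print(should_replicate(0.10, 600, 30))   # True
print(should_replicate(0.10, 600, 90))   # False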
46. Cost-benefit replication evaluation
- Workload
- Synthetic pipelines of depth 3
- Runtime: 60 seconds
- Environment
- Artificially injected failures
- Configuration
- Always-copy
- Never-copy
- BAD-FS
Trade off overhead in an environment without failures to gain throughput in an environment with failures.
47. Real workloads
- Workload
- Real workloads
- 64 pipelines
- Environment
- 16 compute nodes
- Emulated wide-area
- Cold and warm
- First 16 are cold
- Subsequent 48 warm
- Configuration
- Remote I/O
- AFS-like
- BAD-FS
48. Experimental results not shown here
- I/O scoping
- Capacity planning
- Cost-benefit replication
- Other real workload results
- Large in-the-wild demonstration
- Works in c2c
- Works across multiple batch systems
49. Existing approaches
- Remote I/O
- Interpose and redirect all I/O home
- Con: quickly saturates the wide-area connection
- Pre-staging
- Manually push all input (endpoint and batch)
- Manually pull all endpoint output
- Manually configure the workload to find pre-staged data
- Con: repetitive, error-prone, laborious
- Traditional distributed file systems
- Locate remote compute nodes within the same name space as home (e.g. AFS)
- Not an existing approach: impractical to deploy
50. Declarative language
- Existing languages express process
- specification
- requirements
- dependencies
- Add primitives to describe I/O behavior
- Modified language can express data
- dependencies
- type (i.e. endpoint, batch, pipe)
- quantities
51. Example: AMANDA on AFS
- Caching
- Batch data redundantly fetched
- Callback overhead
- Consistency
- Pipeline data committed on close
- Replication
- No idea which data is important
[Diagram: eight AMANDA jobs on AFS, each moving 200 MB of pipeline I/O and 500 MB of batch I/O across the wide-area.]
AMANDA: 200 MB pipeline I/O, 500 MB batch I/O, 5 MB endpoint I/O
This is the slide in which I'm most interested in feedback.
52. Overview
53. I/O scoping
54. Capacity-aware scheduling, batch-intensive
55. Capacity-aware scheduling evaluation
- Workload
- 64 synthetic pipelines
- Pipe-intensive
- Environment
- 16 compute nodes
56. Failure handling
57. Workload experience
58. In the wild
59. Example workflow language: Condor DAGMan
- Keyword job names a file with execute instructions
- Keywords parent, child express relations
- no declaration of data
job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
60. Adding data primitives to a workflow language
- New keywords for container operations
- volume: create a container
- scratch: specify the container type
- mount: how the app addresses the container
- extract: the desired endpoint output
- User must provide complete, exact I/O information to the scheduler
- Specify which processes use which data
- Specify the size of data read and written
61. Extended workflow language
job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
volume B1 ftp://home/data 1 GB
volume P1 scratch 500 MB
volume P2 scratch 500 MB
A mount B1 /data
C mount B1 /data
A mount P1 /tmp
B mount P1 /tmp
C mount P2 /tmp
D mount P2 /tmp
extract P1/out ftp://home/out.1
extract P2/out ftp://home/out.2
62. Terminology
- Application
- Process
- Workload
- Pipeline I/O
- Batch I/O
- Endpoint I/O
- Pipe-depth
- Batch-width
- Scheduler
- Home storage
- Catalogue
63. Remote resources
64. Example scenario
- Workload
- Width 100, depth 2
- 1 GB batch
- 1 GB pipe
- 1 KB endpoint
- Environment
- Batch data archived at home
- Remote compute cluster available
[Diagram: the example workload, with 1 KB endpoint inputs and outputs, 1 GB batch and pipe data, and the batch data archived at the home store. A traffic sketch follows.]
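A rough worked comparison of wide-area traffic for this scenario, under illustrative assumptions (every pipeline reads the full 1 GB batch set, pipe data is written home by one stage and read back by the next under remote I/O, and BAD-FS caches the batch data at the cluster):

GB, KB = 1024**3, 1024
width, depth = 100, 2                       # 100 pipelines, 2 stages each
batch, pipe, endpoint = 1 * GB, 1 * GB, 1 * KB

# Remote I/O: every read and write crosses the wide-area link.
remote = (width * batch                     # batch data fetched redundantly per pipeline
          + width * 2 * pipe                # pipe data written home, then read back
          + width * endpoint)               # endpoint output
# BAD-FS: the batch data crosses once into the cooperative cache,
# pipe data never leaves the cluster, only endpoint data goes home.
badfs = batch + width * endpoint

print(f"remote I/O: {remote / GB:.0f} GB   BAD-FS: {badfs / GB:.4f} GB")   # ~300 GB vs ~1 GB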
65. Ideal utilization of remote storage
- Minimize wide-area traffic by scoping I/O
- Transfer batch data once and cache it
- Contain pipe data within the compute cluster
- Only endpoint data should traverse the wide-area
- Improve throughput through space management
- Avoid thrashing due to excessive batch
- Avoid failure due to excessive pipe
- Cost-benefit checkpointing and replication
- Track data generation and replication costs
- Measure failure rates
- Use cost-benefit checkpointing algorithm
- Apply independent policy for each pipeline
66. Remote I/O
- Conceptually the simplest
- Requires the least amount of remote privilege
- But . . .
- Batch data fetched redundantly
- Pipe I/O unnecessarily crosses wide-area
- Wide-area bottleneck quickly saturates
67. Pre-staging
- Requires a large user burden
- Needs access to the local file system of each cluster
- Manually pushes batch data
- Manually configures the workload to use /tmp
- Must manually pull endpoint outputs
- Good performance through I/O scoping, but . . .
- Tedious, repetitive, mistake-prone
- Availability of /tmp can't be guaranteed
- Scheduler lacks the knowledge to checkpoint