Title: Explicit Control in a Batch-aware Distributed File System
1. Explicit Control in a Batch-aware Distributed File System
- John Bent
- Douglas Thain
- Andrea Arpaci-Dusseau
- Remzi Arpaci-Dusseau
- Miron Livny
- University of Wisconsin, Madison
2. Grid computing
Physicists invent distributed computing!
Astronomers develop virtual supercomputers!
3. Grid computing
[Diagram: remote clusters connected to home storage across the Internet.]
If it looks like a duck . . .
4. Are existing distributed file systems adequate for batch computing workloads?
- NO. Internal decisions are inappropriate
- Caching, consistency, replication
- A solution: the Batch-Aware Distributed File System (BAD-FS)
- Combines knowledge with external storage control
- Detailed information about the workload is known
- Storage layer allows external control
- External scheduler makes informed storage decisions
- Combining information and control results in
- Improved performance
- More robust failure handling
- Simplified implementation
5. Outline
- Introduction
- Batch computing
- Systems
- Workloads
- Environment
- Why not DFS?
- Our answer: BAD-FS
- Design
- Experimental evaluation
- Conclusion
6. Batch computing
- Not interactive computing
- Job description languages
- Users submit jobs
- The system itself executes them
- Many different batch systems
- Condor
- LSF
- PBS
- Sun Grid Engine
7. Batch computing
[Diagram: the scheduler dispatches jobs 1-4 across the Internet from home storage to remote compute resources.]
8. Batch workloads
Pipeline and Batch Sharing in Grid Workloads. Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC-12, 2003.
- General properties
- Large number of processes
- Process and data dependencies
- I/O intensive
- Different types of I/O
- Endpoint
- Batch
- Pipeline
- Our focus: scientific workloads
- More generally applicable
- Many others use batch computing
- video production, data mining, electronic design, financial services, graphic rendering
9. Batch workloads
[Diagram: a batch of pipelines; pipeline I/O flows between stages within each pipeline, while endpoint I/O enters and leaves at the ends.]
10. Cluster-to-cluster (c2c)
- Not quite p2p
- More organized
- Less hostile
- More homogeneity
- Correlated failures
- Each cluster is autonomous
- Run and managed by different entities
- An obvious bottleneck is the wide-area link
[Diagram: home store connected to multiple remote clusters across the Internet.]
How to manage the flow of data into, within, and out of these clusters?
11. Why not DFS?
[Diagram: home store and remote clusters connected across the Internet.]
- A distributed file system would be ideal
- Easy to use
- Uniform name space
- Designed for wide-area networks
- But . . .
- Not practical
- Embedded decisions are wrong
12. DFSs make bad decisions
- Caching
- Must guess what and how to cache
- Consistency
- Output: must guess when to commit
- Input: needs a mechanism to invalidate caches
- Replication
- Must guess what to replicate
13. BAD-FS makes good decisions
- Removes the guesswork
- Scheduler has detailed workload knowledge
- Storage layer allows external control
- Scheduler makes informed storage decisions
- Retains simplicity and elegance of DFS
- Practical and deployable
14. Outline
- Introduction
- Batch computing
- Systems
- Workloads
- Environment
- Why not DFS?
- Our answer: BAD-FS
- Design
- Experimental evaluation
- Conclusion
15. Practical and deployable
- User-level: requires no privilege
- Packaged as a modified batch system
- A new batch system which includes BAD-FS
- General: will work on all batch systems
- Tested thus far on multiple batch systems
[Diagram: BAD-FS storage servers deployed alongside SGE on remote clusters, all reachable from the home store across the Internet.]
16. Contributions of BAD-FS
[Diagram: each compute node runs a CPU manager and a BAD-FS storage server; the BAD-FS scheduler at the home site manages the job queue and home storage.]
1) Storage managers
2) Batch-Aware Distributed File System
3) Expanded job description language
4) BAD-FS scheduler
17. BAD-FS knowledge
- Remote cluster knowledge
- Storage availability
- Failure rates
- Workload knowledge
- Data type (batch, pipeline, or endpoint)
- Data quantity
- Job dependencies
18. Control through volumes
- Guaranteed storage allocations
- Containers for job I/O
- Scheduler (a sketch follows this list)
- Creates volumes to cache input data
- Subsequent jobs can reuse this data
- Creates volumes to buffer output data
- Destroys pipeline data, copies endpoint data home
- Configures workload to access containers
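A minimal Python sketch, assuming hypothetical Volume and Job types rather than BAD-FS code, of how the scheduler might drive this volume lifecycle: one cache volume for shared batch input, one scratch volume per pipeline, endpoint output copied home, everything else discarded.

from dataclasses import dataclass, field

@dataclass
class Volume:
    kind: str                       # "cache" (read-only batch data) or "scratch" (pipeline data)
    files: dict = field(default_factory=dict)

@dataclass
class Job:
    pipeline: int
    run: callable                   # run(batch_vol, scratch_vol) writes into the mounted volumes

def run_batch(jobs, batch_inputs, home_store):
    # One cache volume holds the shared batch input; subsequent jobs reuse it.
    batch_vol = Volume("cache", dict(batch_inputs))
    scratch = {}                    # one private scratch volume per pipeline
    for job in jobs:
        vol = scratch.setdefault(job.pipeline, Volume("scratch"))
        job.run(batch_vol, vol)     # job reads cached batch data, buffers pipeline output
    for vol in scratch.values():
        # Only endpoint output (here marked by an illustrative "endpoint/" prefix)
        # is copied home; pipeline data is simply discarded.
        home_store.update({k: v for k, v in vol.files.items() if k.startswith("endpoint/")})
    # Dropping the volume objects is the "destroy" step; no write-back is needed.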
19. Knowledge plus control
- Enhanced performance
- I/O scoping
- Capacity-aware scheduling
- Improved failure handling
- Cost-benefit replication
- Simplified implementation
- No cache consistency protocol
20. I/O scoping
- Technique to minimize wide-area traffic
- Allocate storage to cache batch data
- Allocate storage for pipeline and endpoint data
- Extract endpoint data
[Diagram: compute nodes, the BAD-FS scheduler, and the wide-area link; AMANDA moves 200 MB pipeline, 500 MB batch, and 5 MB endpoint I/O per job.]
Steady state: only 5 of 705 MB traverse the wide-area (see the sketch below).
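A back-of-the-envelope sketch of that figure, assuming one AMANDA job with the I/O mix above and a warm cooperative cache:

# Per-job AMANDA I/O mix from the slide (MB).
pipeline_mb, batch_mb, endpoint_mb = 200, 500, 5
total_mb = pipeline_mb + batch_mb + endpoint_mb       # 705 MB of I/O per job

# Remote I/O: every byte crosses the wide-area link.
remote_io_traffic_mb = total_mb                       # 705 MB

# BAD-FS steady state: batch data is already cached at the cluster, pipeline
# data stays in scratch volumes, only endpoint data goes home.
badfs_traffic_mb = endpoint_mb                        # 5 MB

print(remote_io_traffic_mb, badfs_traffic_mb)         # 705 5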
21. Capacity-aware scheduling
- Technique to avoid over-allocations
- Scheduler runs only as many jobs as fit
22. Capacity-aware scheduling
[Diagram: pipelines and their endpoint data competing for limited cluster storage.]
23. Capacity-aware scheduling
- 64 batch-intensive synthetic pipelines
- Vary size of batch data
- 16 compute nodes
24. Improved failure handling
- Scheduler understands data semantics
- Data is not just a collection of bytes
- Losing data is not catastrophic
- Output can be regenerated by rerunning jobs
- Cost-benefit replication
- Replicates only data whose replication cost is cheaper than the cost to rerun the job
- Results in paper
25. Simplified implementation
- Data dependencies known
- Scheduler ensures proper ordering (see the ordering sketch below)
- No need for a cache consistency protocol in the cooperative cache
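A small sketch, in Python with made-up job names, of why explicit ordering makes a consistency protocol unnecessary: the scheduler never dispatches a job until every job that produces its inputs has finished, so readers can never observe stale cached data.

def dependency_order(jobs, parents):
    # jobs: iterable of job names; parents[j]: set of jobs whose output j reads.
    done, order = set(), []
    remaining = set(jobs)
    while remaining:
        ready = {j for j in remaining if parents.get(j, set()) <= done}
        if not ready:
            raise ValueError("cycle in job dependencies")
        for j in sorted(ready):          # deterministic order for the sketch
            order.append(j)
            done.add(j)
        remaining -= ready
    return order

# Example: B reads A's pipeline output, D reads C's.
print(dependency_order("ABCD", {"B": {"A"}, "D": {"C"}}))   # ['A', 'C', 'B', 'D']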
26. Real workloads
- AMANDA
- Astrophysics: study of cosmic events such as gamma-ray bursts
- BLAST
- Biology: search for proteins within a genome
- CMS
- Physics: simulation of large particle colliders
- HF
- Chemistry: study of non-relativistic interactions between atomic nuclei and electrons
- IBIS
- Ecology: global-scale simulation of earth's climate used to study effects of human activity (e.g. global warming)
27. Real workload experience
- Setup
- 16 jobs
- 16 compute nodes
- Emulated wide-area
- Configuration
- Remote I/O
- AFS-like with /tmp
- BAD-FS
- Result: an order-of-magnitude improvement
28. BAD Conclusions
- Existing DFSs insufficient
- Schedulers have workload knowledge
- Schedulers need storage control
- Caching
- Consistency
- Replication
- Combining this control with knowledge
- Enhanced performance
- Improved failure handling
- Simplified implementation
29. For more information
- http://www.cs.wisc.edu/adsl
- http://www.cs.wisc.edu/condor
- Questions?
30. Why not the BAD scheduler and a traditional DFS?
- Cooperative caching
- Data sharing
- Traditional DFSs
- assume sharing is the exception
- provision for arbitrary, unplanned sharing
- In batch workloads, sharing is the rule
- Sharing behavior is completely known
- Data committal
- Traditional DFSs must guess when to commit
- AFS uses close, NFS uses 30 seconds
- Batch workloads precisely define when
31. Is capacity-aware scheduling important in the real world?
- Heterogeneity of remote resources
- Shared disk
- Workloads are changing; some are very, very large.
32. User burden
- Additional info is needed in the declarative language
- User probably already knows this info
- Or can readily obtain it
- Typically, this info already exists
- Scattered across a collection of scripts, Makefiles, etc.
- BAD-FS improves the current situation by collecting this info into one central location
33. Enhanced performance
- I/O scoping
- Scheduler knows I/O types
- Creates storage volumes accordingly
- Only endpoint I/O traverses wide-area
- Capacity-aware scheduling
- Scheduler knows I/O quantities
- Throttles workloads, avoids over-allocations
34. Improved failure handling
- Scheduler understands data semantics
- Lost data is not catastrophic
- Pipe data can be regenerated
- Batch data can be refetched
- Enables cost-benefit replication
- Measure
- replication cost
- data generation cost
- failure rate
- Replicate only data whose replication cost is cheaper than the expected cost to reproduce it
- Improves workload throughput
35. Capacity-aware scheduling
- Goal
- Avoid over-allocations
- Cache thrashing
- Write failures
- Method
- Breadth-first
- Depth-first
- Idleness
36. Capacity-aware scheduling evaluation
- Workload
- 64 synthetic pipelines
- Varied pipe size
- Environment
- 16 compute nodes
- Configuration
- Breadth-first
- Depth-first
- BAD-FS
Failures directly correlate with workload throughput.
37. Workload example: AMANDA
- Astrophysics: study of cosmic events such as gamma-ray bursts
- Four-stage pipeline
- 200 MB pipeline I/O
- 500 MB batch I/O
- 5 MB endpoint I/O
- Focus
- Scientific workloads
- Many others use batch computing
- video production, data mining, electronic design, financial services, graphic rendering
38. BAD-FS and the scheduler
- BAD-FS
- Allows external decisions via volumes
- A guaranteed storage allocation
- With a size, a lifetime, and a type
- Cache volumes
- Read-only view of an external server
- Can be bound together into cooperative cache
- Scratch volumes
- Private read-write name space
- Batch-aware scheduler
- Rendezvous of control and information
- Understands storage needs and availability
- Controls storage decisions
39. Scheduler controls storage decisions
- What and how to cache?
- Answer: batch data, cooperatively
- Technique: I/O scoping and capacity-aware scheduling
- What and when to commit?
- Answer: endpoint data, when ready
- Technique: I/O scoping and capacity-aware scheduling
- What and when to replicate?
- Answer: data whose cost to regenerate is high
- Technique: cost-benefit replication
40. I/O scoping
- Goal
- Minimize wide-area traffic
- Means
- Information about data type
- Storage volumes
- Method
- Create cooperative cache volumes for batch data
- Create scratch volumes to contain pipe data
- Result
- Only endpoint data traverses wide-area
- Improved workload throughput
41. I/O scoping evaluation
- Workload
- 64 synthetic pipelines
- 100 MB of I/O each
- Varied data mix
- Environment
- 32 compute nodes
- Emulated wide-area
- Configuration
- Remote I/O
- Cache volumes
- Scratch volumes
- BAD-FS
Wide-area traffic directly correlates with workload throughput.
42. Capacity-aware scheduling
- Goal
- Avoid over-allocations of storage
- Means
- Information about data quantities
- Information about storage availability
- Storage volumes
- Method (see the sketch after this list)
- Use depth-first scheduling to free pipe volumes
- Use breadth-first scheduling to free batch volumes
- Result
- No thrashing due to over-allocations of batch data
- No failures due to over-allocations of pipe data
- Improved throughput
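A minimal sketch of the admission-control side of this idea, in Python with illustrative numbers that are not the paper's configuration: the scheduler only starts as many pipelines as can hold their scratch volumes alongside the shared batch volume.

def max_concurrent_pipelines(storage_mb, batch_mb, pipe_mb, n_pipelines):
    # The shared batch volume is allocated once; each running pipeline also
    # needs its own scratch (pipe) volume. Depth-first completion frees pipe
    # volumes; breadth-first completion lets the batch volume be freed.
    if batch_mb > storage_mb:
        return 0                                    # even the batch volume will not fit
    free_mb = storage_mb - batch_mb
    return min(n_pipelines, free_mb // pipe_mb)     # throttle to what actually fits

# Made-up example: 16 GB of cluster storage, a 500 MB batch volume,
# a 200 MB pipe volume per pipeline, 64 pipelines submitted.
print(max_concurrent_pipelines(storage_mb=16 * 1024, batch_mb=500,
                               pipe_mb=200, n_pipelines=64))   # 64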
43. Capacity-aware scheduling evaluation
- Workload
- 64 synthetic pipelines
- Pipe-intensive
- Environment
- 16 compute nodes
- Configuration
- Breadth-first
- Depth-first
- BAD-FS
44. Capacity-aware scheduling evaluation
- Workload
- 64 synthetic pipelines
- Pipe-intensive
- Environment
- 16 compute nodes
- Configuration
- Breadth-first
- Depth-first
- BAD-FS
Failures directly correlate with workload throughput.
45. Cost-benefit replication
- Goal
- Avoid wasted replication overhead
- Means
- Knowledge of data semantics
- Data loss is not catastrophic
- Can be regenerated or refetched
- Method
- Measure
- Failure rate, f, within each cluster
- Cost, p, to reproduce data
- Time to rerun jobs to regenerate pipe data
- Time to refetch batch data from home
- Cost, r, to replicate data
- Replicate only when p × f > r (see the sketch after this list)
- Result
- Data is replicated only when it should be
- Can improve throughput
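A minimal sketch of that test in Python; the function and its example numbers are illustrative, not from the paper.

def should_replicate(failure_rate, reproduce_cost_s, replicate_cost_s):
    # failure_rate (f): chance the node holding the data fails before it is consumed.
    # reproduce_cost_s (p): time to rerun jobs (pipe data) or refetch from home (batch data).
    # replicate_cost_s (r): time to copy the data to another node.
    return failure_rate * reproduce_cost_s > replicate_cost_s

# A 10% failure rate and a 600 s regeneration cost justify a 30 s copy,
# but not a 90 s copy.
print(should_replicate(0.10, 600, 30))   # True
print(should_replicate(0.10, 600, 90))   # False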
46. Cost-benefit replication evaluation
- Workload
- Synthetic pipelines of depth 3
- Runtime: 60 seconds
- Environment
- Artificially injected failures
- Configuration
- Always-copy
- Never-copy
- BAD-FS
Trade off overhead in an environment without failures to gain throughput in an environment with failures.
47. Real workloads
- Workload
- Real workloads
- 64 pipelines
- Environment
- 16 compute nodes
- Emulated wide-area
- Cold and warm
- First 16 are cold
- Subsequent 48 warm
- Configuration
- Remote I/O
- AFS-like
- BAD-FS
48. Experimental results not shown here
- I/O scoping
- Capacity planning
- Cost-benefit replication
- Other real workload results
- Large in-the-wild demonstration
- Works in c2c
- Works across multiple batch systems
49. Existing approaches
- Remote I/O
- Interpose and redirect all I/O home
- Con: quickly saturates the wide-area connection
- Pre-staging
- Manually push all input (endpoint and batch)
- Manually pull all endpoint output
- Manually configure the workload to find pre-staged data
- Con: repetitive, error-prone, laborious
- Traditional distributed file systems
- Locate remote compute nodes within the same name space as home (e.g. AFS)
- Not an existing approach: impractical to deploy
50. Declarative language
- Existing languages express process
- specification
- requirements
- dependencies
- Add primitives to describe I/O behavior
- Modified language can express data
- dependencies
- type (i.e. endpoint, batch, pipe)
- quantities
51. Example: AMANDA on AFS
- Caching
- Batch data redundantly fetched
- Callback overhead
- Consistency
- Pipeline data committed on close
- Replication
- No idea which data is important
[Diagram: eight AMANDA jobs on AFS, each moving 200 MB of pipeline I/O and 500 MB of batch I/O across the wide-area.]
AMANDA: 200 MB pipeline I/O, 500 MB batch I/O, 5 MB endpoint I/O
This is the slide in which I'm most interested in feedback.
52. Overview
53. I/O scoping
54. Capacity-aware scheduling, batch-intensive
55. Capacity-aware scheduling evaluation
- Workload
- 64 synthetic pipelines
- Pipe-intensive
- Environment
- 16 compute nodes
56. Failure handling
57. Workload experience
58. In the wild
59. Example workflow language: Condor DAGMan
- Keyword job names a file with execute instructions
- Keywords parent, child express relations
- no declaration of data
job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
60. Adding data primitives to a workflow language
- New keywords for container operations
- volume: create a container
- scratch: specify the container type
- mount: how the app addresses the container
- extract: the desired endpoint output
- User must provide complete, exact I/O information to the scheduler
- Specify which processes use which data
- Specify the size of data read and written
61. Extended workflow language
job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
volume B1 ftp://home/data 1 GB
volume P1 scratch 500 MB
volume P2 scratch 500 MB
A mount B1 /data
C mount B1 /data
A mount P1 /tmp
B mount P1 /tmp
C mount P2 /tmp
D mount P2 /tmp
extract P1/out ftp://home/out.1
extract P2/out ftp://home/out.2
62. Terminology
- Application
- Process
- Workload
- Pipeline I/O
- Batch I/O
- Endpoint I/O
- Pipe-depth
- Batch-width
- Scheduler
- Home storage
- Catalogue
63. Remote resources
64. Example scenario
- Workload
- Width 100, depth 2
- 1 GB batch
- 1 GB pipe
- 1 KB endpoint
- Environment
- Batch data archived at home
- Remote compute cluster available
[Diagram: the example workload, with 1 KB endpoint inputs and outputs, 1 GB batch and pipe data, and the batch data archived at the home store. A traffic sketch follows.]
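A rough worked comparison of wide-area traffic for this scenario, under illustrative assumptions (every pipeline reads the full 1 GB batch set, pipe data is written home by one stage and read back by the next under remote I/O, and BAD-FS caches the batch data at the cluster):

GB, KB = 1024**3, 1024
width, depth = 100, 2                       # 100 pipelines, 2 stages each
batch, pipe, endpoint = 1 * GB, 1 * GB, 1 * KB

# Remote I/O: every read and write crosses the wide-area link.
remote = (width * batch                     # batch data fetched redundantly per pipeline
          + width * 2 * pipe                # pipe data written home, then read back
          + width * endpoint)               # endpoint output
# BAD-FS: the batch data crosses once into the cooperative cache,
# pipe data never leaves the cluster, only endpoint data goes home.
badfs = batch + width * endpoint

print(f"remote I/O: {remote / GB:.0f} GB   BAD-FS: {badfs / GB:.4f} GB")   # ~300 GB vs ~1 GB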
65. Ideal utilization of remote storage
- Minimize wide-area traffic by scoping I/O
- Transfer batch data once and cache it
- Contain pipe data within the compute cluster
- Only endpoint data should traverse the wide-area
- Improve throughput through space management
- Avoid thrashing due to excessive batch
- Avoid failure due to excessive pipe
- Cost-benefit checkpointing and replication
- Track data generation and replication costs
- Measure failure rates
- Use cost-benefit checkpointing algorithm
- Apply independent policy for each pipeline
66. Remote I/O
- Conceptually the simplest
- Requires the least amount of remote privilege
- But . . .
- Batch data fetched redundantly
- Pipe I/O unnecessarily crosses wide-area
- Wide-area bottleneck quickly saturates
67. Pre-staging
- Requires a large user burden
- Needs access to the local file system of each cluster
- Manually pushes batch data
- Manually configures the workload to use /tmp
- Must manually pull endpoint outputs
- Good performance through I/O scoping, but . . .
- Tedious, repetitive, mistake-prone
- Availability of /tmp can't be guaranteed
- Scheduler lacks the knowledge to checkpoint