Pipeline and Batch Sharing in Grid Workloads

1
Pipeline and Batch Sharing in Grid Workloads
2
Goals
  • Study a diverse range of scientific apps
  • Measure CPU, memory and I/O demands
  • Understand relationships between apps
  • Focus is on I/O sharing

3
Batch-Pipelined workloads
  • Behavior of single applications has been well
    studied
    • Sequential and parallel
  • But many apps are not run in isolation
    • End result is the product of a group of apps
    • Commonly found in batch systems
    • Run 100s or 1000s of times
  • Key is sharing behavior between apps

4
Batch-Pipelined Sharing
5
3 types of I/O
  • Endpoint: unique input and output
  • Pipeline: ephemeral data
  • Batch: shared input data
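
The three roles above can be made concrete by tagging each file according to who shares it. The sketch below is illustrative only: the `IORole` enum and the `classify` function are hypothetical names, not part of the study, and the example borrows the labels from the Nautilus slide (a shared physics dataset, an intermediate file, a final visualization).

```python
from enum import Enum


class IORole(Enum):
    ENDPOINT = "endpoint"  # unique input or final output of one pipeline
    PIPELINE = "pipeline"  # ephemeral data produced and consumed inside a pipeline
    BATCH = "batch"        # input data shared by every pipeline in the batch


def classify(shared_across_batch: bool,
             written_in_pipeline: bool,
             read_in_pipeline: bool) -> IORole:
    """Assign one of the three I/O roles to a file (assumed decision rule)."""
    if shared_across_batch:
        return IORole.BATCH
    if written_in_pipeline and read_in_pipeline:
        return IORole.PIPELINE
    return IORole.ENDPOINT


# Labels taken from the Nautilus slide; the flags are our reading of its text.
nautilus_files = {
    "physics": classify(True, False, True),        # shared dataset
    "intermediate": classify(False, True, True),   # passed between stages
    "visualization": classify(False, True, False), # final output
}

for name, role in nautilus_files.items():
    print(f"{name}: {role.value}")
```

Under this rule a checkpoint file that a single stage writes and later re-reads also counts as pipeline data, which matches the SETI@home slide.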

6
Outline
  • Goals and intro
  • Applications
  • Methodology
  • Results
  • Implications

7
Six (plus one) target scientific applications
  • BLAST - biology
  • IBIS - ecology
  • CMS - physics
  • Hartree-Fock - chemistry
  • Nautilus - molecular dynamics
  • AMANDA - astrophysics
  • SETI@home - astronomy

8
Common characteristics
  • Diamond-shaped storage profile
  • Multi-level working sets
    • Logical collection may be larger than what
      the app actually uses
  • Significant data sharing
  • Commonly submitted in large batches

9
BLAST
BLAST searches for matching proteins and
nucleotides in a genomic database. It has only a
single executable and thus no pipeline sharing.
Diagram labels: search string, genomic database,
blastp, matches
10
IBIS
IBIS is a global-scale simulation of Earth's
climate used to study the effects of human
activity (e.g. global warming). It has only one
app and thus no pipeline sharing.
Diagram labels: inputs, climate data, analyze,
forecast
11
CMS
CMS is a two-stage pipeline in which the first
stage models accelerated particles and the second
simulates the response of a detector. This is
actually just the first half of a bigger pipeline.
Diagram labels: configuration, cmkin, raw events,
cmsim, geometry, configuration, triggered events
12
Hartree-Fock
HF is a three-stage simulation of the
non-relativistic interactions between atomic
nuclei and electrons. Aside from the executable
files, HF has no batch sharing.
Diagram labels: problem, setup, initial state,
argos, integral, scf, solutions
13
Nautilus
Nautilus is a three-stage pipeline that solves
Newton's equation for each molecular particle in
a three-dimensional space. The physics that
governs molecular interactions is expressed in a
shared dataset. The first stage is often
repeated multiple times.
Diagram labels: initial state, physics, nautilus,
intermediate, bin2coord, coordinates, rasmol,
visualization
14
AMANDA
AMANDA is a four-stage astrophysics pipeline
designed to observe cosmic events such as
gamma-ray bursts. The first stage simulates
neutrino production and the creation of muon
showers. The second transforms the events into a
standard format, and the third and fourth stages
follow the muons' paths through earth and ice.
Diagram labels: inputs, physics, corsika, raw
events, corama, standard events, ice tables, mmc,
noisy events, geometry, mmc, triggered events
15
SETI@home
SETI@home is a single-stage pipeline that
downloads a work unit of radio telescope noise
and analyzes it for any signs that would
indicate extraterrestrial intelligent life. It
has no batch data but does have pipeline data, as
it performs its own checkpointing.
Diagram labels: work unit, setiathome, analysis
16
Methodology
  • CPU behavior tracked with HW counters
  • Memory tracked with usage statistics
  • I/O behavior tracked with interposition
    • mmap was a little tricky
  • Data collection was easy.
  • Running the apps was a challenge.
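
To illustrate what tracking I/O by interposition means, here is a minimal in-process sketch. The slides do not say which interposition mechanism the study used, so this is an analogy rather than the authors' tool: `TracedFile` is a hypothetical wrapper that sits between the caller and a file object, counts bytes, and labels each operation as sequential or random depending on whether it starts where the previous one ended.

```python
import io


class TracedFile:
    """Wraps a binary file object and records every read and write."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0
        self.bytes_written = 0
        self.sequential_ops = 0
        self.random_ops = 0
        self._next_offset = 0  # where a sequential access would start

    def _note(self, offset, nbytes):
        # An operation is sequential if it begins where the last one ended.
        if offset == self._next_offset:
            self.sequential_ops += 1
        else:
            self.random_ops += 1
        self._next_offset = offset + nbytes

    def read(self, size=-1):
        offset = self._raw.tell()
        data = self._raw.read(size)
        self._note(offset, len(data))
        self.bytes_read += len(data)
        return data

    def write(self, data):
        offset = self._raw.tell()
        written = self._raw.write(data)
        self._note(offset, written)
        self.bytes_written += written
        return written

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)


# Usage on an in-memory file: two back-to-back reads, then a jump backwards.
f = TracedFile(io.BytesIO(b"0123456789"))
f.read(4)
f.read(2)
f.seek(0)
f.read(3)
print(f.bytes_read, f.sequential_ops, f.random_ops)
```

Accesses made through a memory mapping are plain loads and stores rather than calls, so a call-level wrapper like this one would not see them. That is one plausible reason mmap needs special handling.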

17
Resources Consumed
  • Relatively modest. Max BW is 7 MB/s for HF.

18
I/O Mix
  • Only IBIS has a significant ratio of endpoint I/O.

19
Observations about individual applications
  • Modest buffer cache sizes sufficient
    • Max is AMANDA, which needs 500 MB
  • Large proportion of random access
    • IBIS and CMS close to 100%, HF 80%
  • Amdahl and Gray balances skewed
    • Drastically overprovisioned in terms of I/O
      bandwidth and memory capacity
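
As a rough illustration of the balance argument, the sketch below compares a workload against Amdahl's rules of thumb: one bit of I/O per second and one byte of memory for each instruction per second, which works out to 8 MIPS per MB/s and 1 MB per MIPS. The `balance` function is a hypothetical helper and the workload figures are made up for the example; they are not measurements from the study, and Gray's revised ratios are not modeled.

```python
AMDAHL_MIPS_PER_MBPS = 8.0  # 1 bit of I/O per instruction -> 8 MIPS per MB/s
AMDAHL_MB_PER_MIPS = 1.0    # 1 byte of memory per instruction per second


def balance(mips, io_mb_per_s, memory_mb):
    """Compare a workload's CPU/I-O and memory/CPU ratios with Amdahl's."""
    cpu_io = mips / io_mb_per_s   # MIPS consumed per MB/s of I/O
    mem_cpu = memory_mb / mips    # MB of memory used per MIPS
    return {
        "cpu_io_ratio": cpu_io,
        # > 1 means an Amdahl-balanced machine has more I/O than needed
        "io_overprovision": cpu_io / AMDAHL_MIPS_PER_MBPS,
        "mem_cpu_ratio": mem_cpu,
        # > 1 means an Amdahl-balanced machine has more memory than needed
        "memory_overprovision": AMDAHL_MB_PER_MIPS / mem_cpu,
    }


# Hypothetical workload: 1000 MIPS, 5 MB/s of I/O, 100 MB of memory.
for key, value in balance(1000, 5, 100).items():
    print(f"{key}: {value:.2f}")
```

A workload that computes heavily for each byte it moves yields factors well above 1 on both lines, which is the sense in which a machine balanced by these rules is overprovisioned for it.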

20
Observations about workloads
  • These apps are NOT run in isolation
  • Submitted in batches of 100s to 1000s
  • Large degree of I/O sharing
  • Significant scalability implications

21
Scalability of batch width
Storage center (1500 MB/s)
Commodity disk (15 MB/s)
22
Batch elimination
Storage center (1500 MB/s)
Commodity disk (15 MB/s)
23
Pipeline elimination
Storage center (1500 MB/s)
Commodity disk (15 MB/s)
24
Endpoint only
Storage center (1500 MB/s)
Commodity disk (15 MB/s)
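
The four scalability slides above can be read as a simple capacity calculation: the batch width a storage server can sustain is its bandwidth divided by the traffic each pipeline sends to it, and removing a traffic type from the server raises that limit. The sketch below works through this arithmetic for the two bandwidths shown (1500 MB/s and 15 MB/s). The `max_batch_width` function and the per-pipeline rates are hypothetical, chosen only to show the shape of the result.

```python
def max_batch_width(server_mb_per_s, endpoint, pipeline, batch, eliminate=()):
    """Number of concurrent pipelines a server can feed.

    endpoint, pipeline, batch: MB/s of each traffic type per pipeline.
    eliminate: traffic types assumed to be served elsewhere (for example
    cached or kept on local disk) and so removed from the server's load.
    """
    traffic = {"endpoint": endpoint, "pipeline": pipeline, "batch": batch}
    load = sum(rate for kind, rate in traffic.items() if kind not in eliminate)
    if load <= 0:
        raise ValueError("at least one traffic type must reach the server")
    return int(server_mb_per_s // load)


# Hypothetical per-pipeline rates in MB/s (not taken from the study).
rates = {"endpoint": 0.25, "pipeline": 0.5, "batch": 1.0}

scenarios = {
    "all traffic": (),
    "batch elimination": ("batch",),
    "pipeline elimination": ("pipeline",),
    "endpoint only": ("batch", "pipeline"),
}

for server, bandwidth in (("storage center", 1500), ("commodity disk", 15)):
    for name, removed in scenarios.items():
        width = max_batch_width(bandwidth, eliminate=removed, **rates)
        print(f"{server:15s} {name:22s} {width}")
```

With these made-up rates the endpoint-only case supports several times more pipelines than the all-traffic case on either server, because endpoint I/O is the smallest share of the mix.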
25
Conclusions
  • Grid applications do not run in isolation
  • Relationships between apps must be understood
  • Scalability depends on semantic information
    • Relationships between apps
    • Understanding different types of I/O

26
Questions?
  • For more information
  • Douglas Thain, John Bent, Andrea Arpaci-Dusseau,
    Remzi Arpaci-Dusseau and Miron Livny, Pipeline
    and Batch Sharing in Grid Workloads, in
    Proceedings of High Performance Distributed
    Computing (HPDC-12).
  • http://www.cs.wisc.edu/condor/doc/profiling.pdf
  • http://www.cs.wisc.edu/condor/doc/profiling.ps