VO and Application Centric Approaches - PowerPoint PPT Presentation


Slides: 39

Provided by: gridCy



1

VO- and Application- Centric Approaches
to Service Level Agreement
Marian Bubak, Jakub Moscicki,
Marcin Radecki, and Tomasz Szepieniec
Cyfronet AGH, Krakow, PL
CERN / IT

2
Contents

VO-centric approach to SLA
Motivation
Basic requirements
SLA metrics
SLA execution
Bazaar tool
QoS from a user perspective
User-level vs system-level techniques
Tools Ganga/DIANE
Examples of QoS metrics
Case-study Lattice QCD 2008
Summary

VO-centric Approach to SLA

4
Motivation

Large number of VOs/users and resources
Dynamic management is a must
Remote interactions
Limitation of automation
Policy managers want to decide about their
resources
Start with human-in-the-loop SLA process

5
Aim: SLA-based Resource Allocation
6
What is needed

Definition of meaningful and measurable SLA
metrics
Communication patterns
(Re-)negotiation
Configuration validation
Tracking demands/policy changes
Complexity management and process traceability
SLA execution monitoring (including feedback from
users)
So, we should
define the SLA process
and build a collaboration tool

7
EGEE Grid and Bazaar

Starting point
No standard QoS metrics
No procedures to express requirements
Resources become available in the infrastructure
even if not agreed with VO
Resource Allocation in Central Europe ROC (Bazaar
Project)
A procedure of tracking requests and responses to
them
Registration and monitoring of SLAs between VOs
and Resources Providers
Collaboration tool for tracking the process

8
Central European Region in EGEE

8 countries,
25 sites,
8000 cores,
850 TB storage
30 VOs

9
SLA Metrics

Common language for users and providers
Users: "I need to use x CPUs"
Providers prefer to speak about aggregated
wall-clock time in a specific period, without a
guarantee that resources will be available at
(any) defined time
Expressive enough to satisfy users' important
requests
Aggregated time, parallel use, waiting time
(queues), condition of environment
Configurable
providers need to have the technical possibility to
configure the resources according to the SLA
(the fabric layer needs to support those requirements)
Measurable in execution time

10
Examples of SLA Metrics

Computational Resources
Guaranteed number of job slots in Local Batch
System
CPUs or cores?
Total wall-clock time to be used in specified
time period (in hours)
weekly, monthly
Access period (range of dates)
Maximum wall-clock and CPU time of a single job
(hours)
Maximum waiting time from job submission until
the job starts running (in minutes)
Average power of a single core (benchmark results
like SPECint)
Capacity available for temporary use by a job (GB)
Memory available per core/CPU (GB)
Maximum latency between nodes in the cluster (ms)
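A few of the computational metrics above could be captured in a small record and compared against observed usage. This is only a minimal Python sketch: the class name `ComputeSLA`, its fields, and the `violations` helper are hypothetical and are not taken from Bazaar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeSLA:
    """Hypothetical record holding a few computational SLA metrics."""
    job_slots: int              # guaranteed job slots in the local batch system
    wallclock_hours: float      # total wall-clock hours agreed for the period
    max_job_hours: float        # maximum wall-clock time of a single job
    max_wait_minutes: float     # maximum wait from submission to running
    memory_per_core_gb: float   # memory available per core


def violations(sla, used_hours, longest_wait_minutes):
    """Return descriptions of the metrics that were not respected."""
    problems = []
    if used_hours > sla.wallclock_hours:
        problems.append("VO used more wall-clock time than agreed")
    if longest_wait_minutes > sla.max_wait_minutes:
        problems.append("provider exceeded the maximum waiting time")
    return problems
```

Keeping each metric as a plain number with an explicit unit follows the slide's requirement that metrics be measurable at execution time.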

11
Examples of SLA Metrics

Grid Storage Resources
Storage quota guaranteed (GB)
Maximum latency in accessing files (optional, in
ms)
Minimum bandwidth in accessing files (optional,
Gb/s)
Storage quota for temporary use (optional, GB)
Time limit for temporary use of storage (optional,
hours)
Period of using storage (dates from-to)
General Resource QoS
Minimum resource availability (optional, in %)
Minimum resource reliability (optional, in %)
Maximum time to acknowledge trouble ticket (days)
Maximum time to resolve trouble ticket (days)

12
SLA Execution Stages in Bazaar
The process is initiated by a VO through a call for
resources. Next, resource providers define
their proposals for the SLA.
13
States Transition Details

Each state transition must be confirmed by
both sides

Proper configuration is controlled by a separate
set of states
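The rule that both sides must confirm every transition can be illustrated with a toy state machine. The stage names and the `SLAContract` class below are assumptions made for this sketch; the actual Bazaar states are not listed in the transcript and may differ.

```python
class SLAContract:
    """Toy model: a state change takes effect only after both sides confirm."""

    # Assumed, simplified stage order (hypothetical, not Bazaar's real states).
    STAGES = ["call", "proposal", "negotiation", "agreed", "completed"]
    PARTIES = {"vo", "provider"}

    def __init__(self):
        self.state = self.STAGES[0]
        self._confirmed = set()

    def confirm(self, party):
        """Record one side's confirmation; advance once both have confirmed."""
        if party not in self.PARTIES:
            raise ValueError(f"unknown party: {party}")
        if self.state == self.STAGES[-1]:
            raise RuntimeError("contract already completed")
        self._confirmed.add(party)
        if self._confirmed == self.PARTIES:
            self.state = self.STAGES[self.STAGES.index(self.state) + 1]
            self._confirmed = set()
        return self.state
```

A repeated confirmation from the same side has no effect, so neither party can move the contract forward alone.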
14
Bazaar Functionality

Call management - the user can perform call
creation, editing and management.
SLA management including negotiation - site
managers can create a contract as a response to a
call. Both partners can negotiate contract
conditions and track contract changes.
Notification management - system notifies a user
via e-mail and user interface about actions like
resource reconfiguration etc.
Feedback - VO managers can assess site's
configuration and both partners can provide a
general assessment of the collaboration when the
contract has been completed.
Accounting and statistics - users can generate
reports with resources usage statistics. In the
next prototype, a tool shall enable obtaining
data from EGEE accounting tools.

15
Bazaar in operation

Bazaar: a tool supporting resource allocation,
including SLA negotiation
Integrated with EGEE Operation Portal (CIC
Portal)
No cost of entry: data obtained from GOCDB and
CIC-Portal VO-cards
Introduced into operations in Central European
Region
Main features of Bazaar
Clear view on VOs' demands for resources
Management of calls and SLAs between VOs and RCs
SLA negotiation support
E-mail notifications
Tracking of SLA changes

16
SLA in PL-Grid

PL-Grid Project
Grid operations center in Poland
3 different infrastructures: EGI-compliant
(currently gLite-based), DEISA, cloud-like
research grid
SLA Management in PL-Grid
We take ideas from the Bazaar Project as a starting
point
Develop an SLA-centric model including:
Impact on resources available at the technical
level
Notifications on missing resources
Improvement on SLA monitoring and accounting
Integration with computational grants system

17
PL-Grid Operation Tools Architecture
18
Conclusions

Human in a negotiation loop seems to be
unavoidable
SLAs should support VO and resource managers
Complexity management should be supported by Web
2.0 tools (collaboration tools with traceable
processes)

QoS on the Grid
with User-Level Scheduling

20
Some Grid applications

Data Analysis
extraction of (statistical) parameters from data
using event loop
ATLAS experiment at LHC
Monte Carlo simulation
creation of statistical objects (e.g. histograms)
or building images by generating large number of
independent events
Geant4 simulations for radiotherapy in medical
physics
Parameter sweep
running a large number of independent jobs in
various configurations
Geant4 regression tests
High-throughput activities
autonomous computing over long periods of time
Avian Flu Drug Search (bio-informatics)
Lattice QCD (theoretical physics)
High-performance, short-deadline activities
short-deadline performance peak
ITU frequency analysis for RRC06

21
QoS for scientific applications

In the Grid the basic interaction of a user is
sending jobs
efficient job/workload management plays central
role
efficient scheduling often requires
application-specific knowledge
which may be difficult at the system level
The system provides an appropriate QoS if it
responds in an acceptable way to the user and is
capable of automatically maintaining the
processing goals defined by the user (measured by
metrics)
Some QoS metrics (measures of user-defined goals)
turnaround time
typically minimize the total execution time of
the job
reliability / failure rate
response latency: time to obtain initial results
feedback from the execution
filling histograms with events -> significance of
individual partial results decreases with time
prioritization/scheduling of the tasks
predictability/stability of the execution
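Three of these metrics can be computed directly from per-task timestamps. The function names below are illustrative only, and all times are assumed to be in the same unit (for example, seconds since the first submission).

```python
def turnaround(submit_times, finish_times):
    """Total turnaround: first submission to last completion."""
    return max(finish_times) - min(submit_times)


def response_latency(submit_times, finish_times):
    """Time until the first partial result becomes available."""
    return min(finish_times) - min(submit_times)


def failure_rate(n_failed, n_total):
    """Fraction of tasks that failed."""
    return n_failed / n_total
```

For example, with submissions at times 0, 1, 2 and completions at 10, 5, 20, the turnaround is 20 and the response latency is 5.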

22
Mechanisms for better QoS

In general, QoS is NOT implemented on the Grid
Techniques for performance related metrics
dedication of resources (wasteful)
advanced reservations
difficult for some users who do not plan ahead
(interactive work)
better scheduling: fast/slow queues (site
configuration)
preemption: suspend lower priority job
migration: suspend and migrate elsewhere
better brokering: forecasting using monitoring
systems (e.g. NWS)
Techniques for failure related metrics
metascheduling (JDL retry count, Condor)
Techniques for application-specific metrics
metascheduling (not generally implemented, e.g.
out of scope of DAGs)
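The retry-count idea behind failure-related metascheduling can be sketched in a few lines. This is a generic illustration, not the JDL or Condor mechanism itself: `run_with_retries` is a hypothetical helper, and `job` stands for any callable that raises an exception on failure.

```python
def run_with_retries(job, retry_count):
    """Run `job`, resubmitting up to `retry_count` times after a failure.

    Returns the job's result together with the number of attempts used.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return job(), attempts
        except Exception:
            # Give up once the original attempt plus all retries have failed.
            if attempts > retry_count:
                raise
```

A retry count handles transient failures only; it has no knowledge of the application, which is why application-specific metrics need a different mechanism.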

23
QoS Implementation Choices

QoS implementation
site service modifications
faster queues, scheduler modifications e.g.
virtualization schemes with MAUI
middleware modification
checkpointing/migration, special services (e.g.
GARA), Virtual Machines
system level modifications (unix kernel modules,
special I/O)
user-level overlay schedulers (pilot jobs,
agents,...)
Boundary conditions in a large Grid (e.g. EGEE)
acceptance/deployment of middleware changes very
slow due to organizational constraints
resource providers' constraints (site changes)
many sites cannot freely change their software
(serving also non-grid users)
sysadmins do not like sudo-like programs
interfacing legacy applications

24
User-level overlay

Overlays are the only option if we talk about
using existing Grid infrastructure at the large
scale

LCG and EGEE Grid
the largest Grid infrastructure to date
over 250 sites
over 80K WNs
over 15 PB of storage

25
User-level tools

DIANE: helps smaller scientific communities use
distributed (Grid) resources more efficiently
reduce the application execution time
reduce the manual work overhead by providing
fully automatic execution and failure management
efficiently integrate local and Grid resources
part of EGEE Respect suite
http://cern.ch/diane
Ganga: Job Management Interface
Submission gateway to many distributed systems
Easy job management and application configuration
http://cern.ch/ganga

26
User-level Overlay

User-level overlay
each user uses a (temporary) overlay which is
created for the duration of the computations

(drawing courtesy of ThIS collaboration)
27
Master/Worker backbone

Master/Worker processing of tasks
RunMaster executes on a local host
WorkerAgents execute as Grid jobs
TaskScheduler is a software component (python
module) which may be arbitrarily customized or
replaced
application plugins
ApplicationWorker
ApplicationManager
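The master/worker pattern with a replaceable scheduler can be simulated in a single process. The sketch below does not use the DIANE API: `FifoScheduler` and `run_master` are hypothetical names, and workers are modelled as plain callables that raise an exception when their node fails.

```python
from collections import deque


class FifoScheduler:
    """Default policy: hand out tasks in arrival order. Meant to be replaced."""

    def __init__(self, tasks):
        self._queue = deque(tasks)

    def next_task(self):
        return self._queue.popleft() if self._queue else None

    def task_failed(self, task):
        self._queue.append(task)  # requeue so another worker can pick it up


def run_master(scheduler, workers):
    """Workers pull tasks one at a time until the scheduler has none left."""
    results = {}
    idle = deque(workers)
    while idle:
        worker = idle.popleft()
        task = scheduler.next_task()
        if task is None:
            continue  # nothing left to do: this worker retires
        try:
            results[task] = worker(task)
        except Exception:
            scheduler.task_failed(task)  # the failed worker is dropped
            continue
        idle.append(worker)
    return results
```

Because tasks are assigned only when a worker asks for one, a task lost on a failing worker is simply handed to the next healthy worker.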

28
Flexible architecture

3 functional parts
Submitter: selection and acquisition of the
resources
M/W: scheduling and execution control
Directory Service: late binding of resources
System is easily customized by plugins

29
Examples of QoS Metrics

Selected examples of QoS metrics for different
applications

30
QoS Metric: predictability of execution

Comparison of G4 Production on LCG: DIANE vs.
direct submission
6 sites / 173 CPUs / 100 VO-shared, 70
VO-dedicated
207 tasks, direct 1 task 1 job, DIANE workers

31
QoS Metric: reliability

Summary of ITU RRC06 runs
200K jobs in less than 6 hours
worst-case reliability: 0.0003 of jobs lost

run  jobs  tasks  turnaround  CPUh  WN   comment
1    243K  26K    6.40h       425h  190  lost <10 tasks (3e-04)
2    237K  23K    6.30h       332h  125  lost 1 task (4e-05)
3    224K  40K    3.05h       192h  210  OK
4    218K  39K    1.05h       151h  320  OK
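The loss fractions quoted for runs 1 and 2 can be checked against the task counts. The counts are given rounded to thousands, and "fewer than 10" is treated here as an upper bound of 10, so the results are approximate.

```python
def loss_fraction(tasks_lost, tasks_total):
    """Fraction of tasks lost in a run."""
    return tasks_lost / tasks_total


# Run 1: fewer than 10 of about 26K tasks lost, so this is an upper bound.
run1_upper_bound = loss_fraction(10, 26_000)

# Run 2: 1 of about 23K tasks lost.
run2 = loss_fraction(1, 23_000)
```

Run 2 comes out between 4e-05 and 5e-05, and the run 1 bound stays below 4e-04, in line with the quoted 4e-05 and 3e-04.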

ITU RRC-06 (15 May - 16 June 2006)
120 countries (1200 delegates) negotiated
the new digital frequency plan
a part of a new international agreement
introduction of digital broadcasting
UHF (470-862 MHz)
VHF (174-230 MHz)
preceded by RRC-04 and other international
meetings

32
QoS Metric: low latency on the Grid

RRC06 ITU job
116 LCG workers
3470 tasks
130 CPU h
large span of task length
not a priori known!

33
QoS Metric: stability of execution
34

Case study: high-throughput Lattice QCD
simulation
application-aware scheduler: prioritize tasks
based on the simulation parameters
active resource selection via Submitter
(WorkerFactory)
dynamically select resources based on their
fitness for the application
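An application-aware scheduler of this kind can be approximated with a priority queue keyed on a function of the task parameters. The `PriorityScheduler` class and the priority function are hypothetical; the transcript does not describe how the LQCD scheduler actually ranks tasks.

```python
import heapq
import itertools


class PriorityScheduler:
    """Hands out the task with the highest application-defined priority first."""

    def __init__(self, priority):
        self._priority = priority          # callable: task -> number
        self._heap = []
        self._counter = itertools.count()  # tie-breaker, keeps insertion order

    def add(self, task):
        # heapq is a min-heap, so the priority is negated.
        entry = (-self._priority(task), next(self._counter), task)
        heapq.heappush(self._heap, entry)

    def next_task(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

The priority function is the only application-specific part, so the same scheduler can serve different simulations by swapping that one callable.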

35
Lattice QCD 2008 @ Grid

Study the behaviour of the critical point of
quark-gluon plasma
The scientific results obtained by the LQCD
project were published in the paper P. de Forcrand
et al., "The chiral critical point of Nf = 3 QCD
at finite density to the order (µ/T)^4", and are
available at http://arxiv.org/pdf/0808.1096
Monte-Carlo simulation of discrete space-time
lattice
need a lot of CPU
relatively small data (GBs)

36
LQCD execution history

ongoing since May 2008
several phases (application and system upgrades,
power-cuts, etc...)
routine production since September 2008
runs unattended for months
operated by a single, not-a-Grid-expert user
large-scale
1000 running jobs at any time
700 CPU-years since May 2008
18 TB of data

37
Routine LQCD production

700 CPU years since May 2008
18 TB of data transferred
800 simultaneous workers

38
Summary

User-level overlay is a technique enhancing the
QoS parameters for scientific applications in the
EGEE Grid
Pros and cons
Existing infrastructure may be used as is
Application-specific optimizations (impossible at
the system level)
Hard QoS not possible (infrastructure
unreliable)
Fair-share implemented by the underlying
infrastructure and respected by the overlay (if
used appropriately)
Used successfully for diverse applications
Overlays are a complementary approach to SLAs
More on tools