Title: Transparent Cross-Border Migration of Parallel Multi Node Applications
1Transparent Cross-Border Migration of Parallel
Multi Node Applications
- Dominic Battré, Matthias Hovestadt, Odej Kao, Axel Keller, Kerstin Voss
- Cracow Grid Workshop 2007
2Outline
- Motivation
- The Software Stack
- Cross-Border Migration
- Summary
3The Gap between Grid and RMS
- User asks for SLA
- Grid Middleware realizes job by means of local RMS
- BUT: RMS offer Best Effort
- Need SLA-aware RMS
4HPC4U Highly Predictable Clusters for
Internet-Grids
- Objective
- Software-only solution for an SLA-aware, fault-tolerant infrastructure, offering reliability and QoS, and acting as active Grid component
- Key Features
- System level checkpointing
- Job migration
- Job types sequential and MPI-parallel
- Planning based scheduling
5HPC4U Planning Based Scheduling
                                   queuing systems     planning systems
planned time frame                 present             present and future
new job requests                   insert in queues    re-planning
assignment of planned start time   no                  all requests
runtime estimation                 not necessary       mandatory
backfilling                        optional            yes, implicit
advance reservations               not possible        yes, trivial
[Figure labels: queues, new jobs, Machine, time]
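The planning-system column can be illustrated with a small sketch. This is not the HPC4U scheduler; the names `Job`, `free_cpus`, and `plan_jobs`, the hourly time grid, and the first-fit policy are assumptions made for illustration. It shows why a runtime estimate is mandatory (the planner needs it to place a job on the time axis), why every request gets a planned start time, and why backfilling and advance reservations fall out of the same mechanism.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int          # CPUs requested
    runtime: int       # mandatory runtime estimate, in hours
    earliest: int = 0  # earliest allowed start, e.g. for an advance reservation


def free_cpus(plan, total_cpus, t):
    """CPUs not yet planned for hour t."""
    used = sum(j.cpus for j, start in plan if start <= t < start + j.runtime)
    return total_cpus - used


def plan_jobs(jobs, total_cpus, horizon=1000):
    """Assign a planned start time to every request (first fit on the time axis)."""
    plan = []
    for job in jobs:
        for start in range(job.earliest, horizon):
            hours = range(start, start + job.runtime)
            if all(free_cpus(plan, total_cpus, t) >= job.cpus for t in hours):
                plan.append((job, start))
                break
        else:
            raise ValueError(f"no slot found for {job.name}")
    return {job.name: start for job, start in plan}


jobs = [
    Job("A", cpus=16, runtime=4),
    Job("B", cpus=32, runtime=2),
    Job("C", cpus=16, runtime=3),               # fits next to A: implicit backfilling
    Job("R", cpus=8, runtime=2, earliest=10),   # advance reservation
]
print(plan_jobs(jobs, total_cpus=32))
```

In this example job C is planned at hour 0 beside job A without delaying job B, which keeps its planned start at hour 4; the reservation R is simply a request whose earliest start lies in the future.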
6HPC4U Software Stack
User- / Broker- Interface
CLI
Negotiation
RMS
Scheduler
SSC
Process
Network
Storage
Cluster
7HPC4U Checkpointing Cycle
[Figure: checkpointing cycle involving the RMS and the Process, Network, and Storage subsystems; recoverable step labels:]
2. In-transit packets
4. Snapshot!
5. Link to snapshot
7. Job running again
8Cross Border Migration Intra Domain
[Figure: two software stacks, each consisting of User-/Broker-Interface, CLI, CRM, Negotiation, PP, RMS, Scheduler, SSC, Process, Network, Storage, and Cluster]
9Cross Border Migration Target Retrieval
[Figure: two software stacks, each consisting of User-/Broker-Interface, CLI, CRM, Negotiation, PP, RMS, Scheduler, SSC, Process, Network, Storage, and Cluster]
10Cross Border Migration Checkpoint Migration
[Figure: two software stacks, each consisting of User-/Broker-Interface, CLI, CRM, Negotiation, PP, RMS, Scheduler, SSC, Process, Network, Storage, and Cluster]
11Cross Border Migration Remote Execution
[Figure: two software stacks, each consisting of User-/Broker-Interface, CLI, CRM, Negotiation, PP, RMS, Scheduler, SSC, Process, Network, Storage, and Cluster]
12Cross Border Migration Result Migration
[Figure: two software stacks, each consisting of User-/Broker-Interface, CLI, CRM, Negotiation, PP, RMS, Scheduler, SSC, Process, Network, Storage, and Cluster]
13Cross-Border Migration Using Globus
- WS-AG implementation based on GT4
- Developed in EU project AssessGrid
- Source specifies SLA / file staging parameters
- Subset of JSDL (POSIX Jobs)
- Resource determination via broker
- Source directly contacts destination
- Destination pulls migration data via GridFTP
- Destination pushes result data back to source
- Source uses WSRF event notification
[Figure: software stack consisting of User-/Broker-Interface, CLI, WS-AG, CRM, Broker, Negotiation, PP, RMS, Scheduler, SSC, Process, Network, Storage, and Cluster]
14Ongoing Work Introducing Risk Management
- Topic of EU project AssessGrid
- Incorporated in SLA
- Provider
- Estimates risk for agreeing to an SLA
- Considers probability of failure in schedule
- Assessment based on historical data
[Figure: software stack consisting of User-/Broker-Interface, CLI, CRM, Broker, WS-AG, Risk Assessor, Negotiation, PP, RMS, Scheduler, Consultant Service, SSC, Monitoring, Process, Network, Storage, and Cluster]
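A provider-side estimate of the probability of failure (PoF) from historical data could look like the sketch below. The slides do not state which model the Risk Assessor uses; the constant per-node failure rate and the independence of node failures are assumptions made here, and the function names are hypothetical.

```python
import math


def failure_rate(observed_failures, observed_node_hours):
    """Failures per node-hour, estimated from historical monitoring data."""
    return observed_failures / observed_node_hours


def probability_of_failure(rate, nodes, hours):
    """PoF for a job on `nodes` nodes for `hours` hours.

    Assumption: node failures are independent with a constant rate
    (exponential model), and any single node failure fails the job.
    """
    return 1.0 - math.exp(-rate * nodes * hours)


rate = failure_rate(observed_failures=5, observed_node_hours=100_000)
pof = probability_of_failure(rate, nodes=16, hours=10)
print(f"estimated PoF: {pof:.4%}")
```

Under these assumptions the estimate grows with both the number of nodes and the runtime, which is one reason a planned schedule (known start time, known duration, known node set) is a useful input for risk assessment.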
15Summary Best Effort is not Enough
Cross border migration and Risk assessment
provide new means to increase the reliability
of Grid Computing.
16More information
- Read the paper
- AssessGrid www.assessgrid.eu
- HPC4U www.hpc4u.eu
- OpenCCS www.openccs.eu
Thanks for your attention!
17Contents
18Scheduling Aspects
- Execution Time
- Exact start time
- Earliest start time, latest finish time
- User provides stage-in files by time X
- Provider keeps stage-out files until time Y
- Provisional Reservations
- Job Priorities
- Job Suspension
19HPC4U Planning Based Scheduler
- Run-time estimation → Start time assignment
[Figure: planned schedule, CPUs (16, 32) over Time (2 to 16); blocks include Reservation 2 and a reservation for a Grid job according to its SLA; duration labels as extracted: l3h, l2h, l6h]
20 21Motivation Fault Tolerance
- Commercial Grid users need SLAs
- Providers cautious on adoption
- Reason: business case risk
- Missed deadlines due to system failures
- → Penalties to be paid
- Solution: prevention with fault tolerance
- Fault tolerance mechanisms available, but
- Application modification mandatory
- Overall solution (system software, process, storage, file system, network) required
- Combination with Grid migration missing
22HPC4U Objective
- Software-only solution for an SLA-aware, fault-tolerant infrastructure, offering reliability and QoS, acting as active Grid component
- Key features
- Definition and implementation of SLAs
- Resource reservation for guaranteed QoS
- Application-transparent fault tolerance
23HPC4U Concept
- SLA negotiation as an explicit statement of expectations and obligations in a business relationship between provider and customer
- Reservation of CPU, storage and network for desired time interval
- Job start in checkpointing environment
- In case of system failure
- → Job migration / restart with respect to SLA
24HPC4U Project Outcomes
25Phases of Operation
- Negotiation of SLA
- Pre-Runtime: configuration of resources
- e.g. network, storage, compute nodes
- Runtime: Stage-In, Computation, Stage-Out
- Post-Runtime: re-configuration
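The phases listed above form a fixed sequence, which can be written down as a small state enumeration. This is only a sketch of the ordering given on the slide; the names `Phase` and `next_phase` are hypothetical and do not come from the HPC4U code base.

```python
from enum import Enum


class Phase(Enum):
    NEGOTIATION = "negotiation of SLA"
    PRE_RUNTIME = "configuration of resources"
    STAGE_IN = "runtime: stage-in"
    COMPUTATION = "runtime: computation"
    STAGE_OUT = "runtime: stage-out"
    POST_RUNTIME = "re-configuration"


ORDER = list(Phase)  # Enum members iterate in definition order


def next_phase(phase):
    """Return the phase that follows `phase`, or None after Post-Runtime."""
    i = ORDER.index(phase)
    return ORDER[i + 1] if i + 1 < len(ORDER) else None


phase = Phase.NEGOTIATION
while phase is not None:
    print(phase.name)
    phase = next_phase(phase)
```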
26Phase Pre-Runtime
- Task of Pre-Runtime Phase
- Configuration of all allocated resources
- Goal: fulfill requirements of SLA
- Reconfiguration affects all HPC4U elements
- Resource Management System
- e.g. configuration of assigned compute nodes
- Storage Subsystem
- e.g. initialization of a new data partition
- Network Subsystem
- e.g. configuration of network infrastructure
27Phase Runtime
- Runtime Phase: lifetime of job in system
- adherence to SLA has to be assured
- FT mechanisms have to be utilized
- Phase consists of three distinct steps
- Stage-In
- transmission of required input data from Grid customer to compute resource
- Computation
- execution of application
- Stage-Out
- transmission of generated output data from compute resource back to Grid customer
28Phase Post-Runtime
- Task of Post-Runtime Phase
- Re-Configuration of all resources
- e.g. re-configuration of network
- e.g. deletion of checkpoint datasets
- e.g. deletion of temporary data
- Counterpart to Pre-Runtime Phase
- Allocation of resources ends
- Update of schedules in RMS and storage
- Resources are available for new jobs
29Motivation Cross Border Migration
[Figure labels: Customer, HPC4U]
30 31Subsystems
- Process Subsystem
- checkpointing of network
- cooperative checkpointing protocol (CCP)
- Network Subsystem
- checkpoint network state
- Storage Subsystem
- provision of storage
- provision of snapshot
32Metacluster Checkpointing Subsystem
- Virtualization of Resources
- Capture of full application context
- resources, states, process hierarchy
- Non-intrusive
- → Virtual Bubble
33 34Storage subsystem
Virtual Storage Manager
- Functionalities
- Negotiates the storage part of the SLA
- Provides storage capacity at a given QoS level
- Provides FT mechanisms
- Requirement: manage multiple jobs running on the same SR
35Data Container concept
- Idea
- create storage environment for applications at a desired QoS level with abstraction of physical devices
- Components
[Figure labels: File I/O (read, write, open, ...); Data Container; Block I/O (read, write, ioctl); Logical space; Block I/O; Storage Resource]
36Data container properties
- Storage part of the SLA
- Data container section
- Size
- File system type
- Number of nodes that need to access the data container (private/shared)
- Performance section
- Application I/O profile → Benchmark
- Bandwidth (in MB/s or IO/s)
- Or: default configuration
- Dependability section
- Data redundancy type (within a cluster)
- Snapshot needed or not
- Data replication or not (between clusters)
- Job specific section
- Job's time to schedule and time to finish
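The four sections of the storage part of the SLA map naturally onto a nested record. The sketch below is illustrative only: all class and field names, the units, and the consistency rules in `problems` (for example that a private container is accessed by exactly one node) are assumptions, not the schema used by the Virtual Storage Manager.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataContainerSection:
    size_gb: float
    fs_type: str
    nodes: int      # number of nodes that need access
    shared: bool    # False = private container


@dataclass
class PerformanceSection:
    bandwidth_mb_s: Optional[float] = None  # None = default configuration


@dataclass
class DependabilitySection:
    redundancy: str    # data redundancy type within a cluster
    snapshot: bool     # snapshot needed or not
    replication: bool  # replication between clusters or not


@dataclass
class JobSection:
    time_to_schedule: float  # hours from now
    time_to_finish: float    # hours from now


@dataclass
class StorageSla:
    container: DataContainerSection
    performance: PerformanceSection
    dependability: DependabilitySection
    job: JobSection

    def problems(self) -> List[str]:
        """Return a list of inconsistencies (empty if none were found)."""
        found = []
        if self.container.size_gb <= 0:
            found.append("size must be positive")
        if not self.container.shared and self.container.nodes != 1:
            found.append("private container must be accessed by exactly one node")
        if self.job.time_to_finish <= self.job.time_to_schedule:
            found.append("time to finish must be after time to schedule")
        return found


sla = StorageSla(
    DataContainerSection(size_gb=100, fs_type="ext3", nodes=1, shared=False),
    PerformanceSection(),
    DependabilitySection(redundancy="RAID5", snapshot=True, replication=False),
    JobSection(time_to_schedule=0, time_to_finish=6),
)
print(sla.problems())
```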
37Fault Tolerance Mechanisms
- RAID
- Tolerate the failure of one or more disks
- RAIN
- Tolerate the failure of one or more nodes
- Implementation
- Hardware
- Software
- Storage FT mechanisms rely on special data
layouts
38Data container snapshot
- Provide instantaneous copy of data containers
- Technique used: Copy-On-Write (COW)
- create multiple copies of data without duplicating all the data blocks
- With checkpoint, it allows application restart from a previous running stage
- Impact on SR performance
- Taken into account at negotiation time
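The copy-on-write idea behind such snapshots can be shown with a toy block device. This is a generic COW sketch, not the HPC4U storage implementation; the class `DataContainer` and its methods are hypothetical. Taking a snapshot copies nothing, and a block is saved only the first time it is overwritten after the snapshot.

```python
class DataContainer:
    """Toy block store with copy-on-write snapshots."""

    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks
        self.snapshots = []  # one dict per snapshot: block index -> preserved content

    def snapshot(self):
        """Create a snapshot instantly; no data blocks are copied here."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        # Preserve the old content for every snapshot that has not saved it yet.
        for saved in self.snapshots:
            if index not in saved:
                saved[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, snapshot_id, index):
        """Read a block as it was when the snapshot was taken."""
        return self.snapshots[snapshot_id].get(index, self.blocks[index])


dc = DataContainer(nblocks=4)
dc.write(0, b"state at checkpoint")
snap = dc.snapshot()
dc.write(0, b"state after checkpoint")
print(dc.read_snapshot(snap, 0))  # content at snapshot time
print(dc.blocks[0])               # current content
```

The extra copy on the first write after a snapshot is the kind of performance impact on the storage resource that the slide says is taken into account at negotiation time.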
39Snapshot single node job restart after node
failure
- Characteristics
- The job is running on a single node
- The data container is private to that node
- Data container snapshot resides on the same
storage resource
40Interfaces with other components
[Figure labels: RMS; Interface VSM - RMS; VSM; Interface VSM - SR; Storage Resource (SR); Storage Subsystem; Network (socket, RDMA, ...)]
41 42Motivation AssessGrid
Checkpoint
Accept this job?
Node crashes
Restart
43Grid Fabric Layer with Risk Assessor
- NegotiationManager
- Agr./Agr.Fact. WS
- checks whether offer complies with template
- initiation of file transfers
- Scheduler
- creates tentative schedules for offers
- Risk Assessor
- Consultant Service
- records data
- Monitoring
- runtime behavior
44Motivation AssessGrid
- Aim of AssessGrid
- Introduce risk awareness in Grid technology
- Risk awareness incorporated across three layers
- End-user
- Broker
- Service Provider
45AssessGrid - Architectural Overview
- End-user
- Portal
- Broker
- Risk Assessor
- Confidence Service
- Workflow Assessor
- Provider
- Negotiator
- Scheduler
- Risk Assessor
- Consultant Service
46Precautionary Fault-Tolerance
- Use of planning based scheduler
- How many spare resources are available at execution time?
47Estimating Risk for a Job Execution
- Use of planning based scheduler
- How much slack time is available for fault tolerance?
- How much effort do I undertake for fault tolerance?
- What is the considered risk of resource failure?
[Figure labels: Earliest Start Time, Execution Time, Slack Time, Latest Finish Time]
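The first question can be made concrete with a short calculation. The relation between the four timeline labels is read off the figure labels (slack is what remains of the window between earliest start and latest finish once the execution time is subtracted); the second function, including the cost model of one checkpoint interval of lost work plus a restart overhead per failure, is an assumption added for illustration.

```python
def slack_time(earliest_start, latest_finish, execution_time):
    """Slack left in the SLA window after the planned execution, in hours."""
    slack = latest_finish - earliest_start - execution_time
    if slack < 0:
        raise ValueError("job does not fit into the SLA window")
    return slack


def failures_covered(slack, checkpoint_interval, restart_overhead):
    """Number of failures the slack can absorb.

    Assumption: each failure costs at most one checkpoint interval of
    lost work plus a fixed restart overhead.
    """
    return int(slack // (checkpoint_interval + restart_overhead))


slack = slack_time(earliest_start=2, latest_finish=14, execution_time=8)
print(slack, failures_covered(slack, checkpoint_interval=1, restart_overhead=0.5))
```

With these example values the window offers 4 hours of slack, enough to absorb two failures under the stated cost model; shorter checkpoint intervals raise that number at the price of more checkpointing effort.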
48Risk Assessment
- Estimate risk for agreeing an SLA
- consider risk of resource failure
- estimate risk for a job execution
- initiate precautionary FT mechanisms
low risk middle risk high risk
49Risk Management at Job Execution
[Figure labels:]
- Events
- Risk Management
- Decisions, Actions
- Risk Assessment
- Business Model (price, penalty)
- Weekend/Holiday/Workday
- Schedule (SLAs, best effort)
- Redundancy Measures
50Detection of Bottlenecks
- Consultant Service
- Analysis of SLA violation
- Estimated risk for the job
- Planned FT mechanisms
- Monitoring Information
- Job
- Resources
- Data Mining
- Find connections between SLA violations
- Detect weak points in the provider's infrastructure
51 52Components
53Implementation with Globus Toolkit 4
- Why Globus?
- Utility: Authentication, Authorization, Delegation, RFT, MDS, WS-Notification
- Impact
- Problem 1: GRAM (Grid Resource Allocation and Management)
- State machine, incl. File-Staging, Delegation of Credentials, RSL
- Cannot use it: written for batch schedulers, not for planning schedulers
- Problem 2: Deviations from WS-AG spec.
- Different namespaces: WS-A, WS-RF
54Implementation with Globus Toolkit 4
- Technical Challenges
- xs:anyType
- Wrote custom serializers/deserializers
- Substitution groups
- Used in ItemConstraint (Creation Constraints)
- Cannot be mapped to Java by Axis
- Replaced by xs:anyType, used as DOM tree
- CreationConstraints
- Namespace prefixes in XPaths meaningless
- Need for WSDL and interpretation for xs:all, xs:choice, and friends
55Context
<wsag:Context>
  ...
  <wsag:AgreementInitiator>
    <AG:DistinguishedName>
      /C=DE/O=...
    </AG:DistinguishedName>
  </wsag:AgreementInitiator>
  <wsag:AgreementResponder>EPR</...>
  <AG:ServiceUsers>
    <AG:ServiceUser>DN</...>
  </AG:ServiceUsers>
  ...
</wsag:Context>
Context
Terms
Creation Constraints
56Terms, SDTs
- Conjunction of terms
- Common structure of templates
- WS-AG too powerful/difficult to fully support
- Service Description Term (one)
- assessgrid:ServiceDescription (extension of abstract ServiceTermType)
- jsdl:POSIXExecutable (executable, arguments, environment)
- jsdl:Application (mis-)used for libraries
- jsdl:Resources
- jsdl:DataStaging
- assessgrid:PoF (upper bound)
57Terms, GuaranteeTerms
- No hierarchy but two meta guarantees
- ProviderFulfillsAllObligations
- e.g. Reward 1000 EUR, Penalty 1000 EUR
- ConsumerFulfillsAllObligations
- e.g. Reward 0 EUR, Penalty 1000 EUR
- First violation is responsible for failure
- No hardware problem, then User fault
- Other Guarantees
- Execution Time
- Any start time (best effort)
- Exact start time
- Earliest start time, latest finish time
- User provides StageIn files by time X
- Provider keeps StageOut files until time Y
No timely execution
No stage-out
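The rule "first violation is responsible for failure" can be sketched as follows. The amounts are the example values from this slide; everything else (the `Violation` record, the function names, and the reading that only the party with the earliest violation pays its penalty) is an interpretation made for illustration, not the AssessGrid settlement logic.

```python
from dataclasses import dataclass
from typing import List, Optional

# Example penalties from the slide, in EUR.
PENALTY = {"provider": 1000, "consumer": 1000}


@dataclass
class Violation:
    time: float   # when the violation occurred (hours since SLA start)
    party: str    # "provider" or "consumer"
    what: str     # e.g. "no timely execution", "stage-in files late"


def responsible_party(violations: List[Violation]) -> Optional[str]:
    """The party with the earliest violation is held responsible."""
    if not violations:
        return None
    return min(violations, key=lambda v: v.time).party


def penalty_due(violations: List[Violation]) -> int:
    """Penalty owed by the responsible party (0 if nobody violated the SLA)."""
    party = responsible_party(violations)
    return 0 if party is None else PENALTY[party]


history = [
    Violation(time=5.0, party="provider", what="no timely execution"),
    Violation(time=1.0, party="consumer", what="stage-in files late"),
]
print(responsible_party(history), penalty_due(history))
```

In this example the late stage-in precedes the missed execution, so the consumer is held responsible even though the provider also failed to execute on time.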
58Terms
- SLA does not contain requirements of fault tolerance mechanisms
- Covered by asserted PoF, penalty and loss of reputation
- Compulsory Assessment Intervals not really useful for us
- How often do you assess that job was allocated for asserted time?
- Preferences too complicated
59CreationConstraints
- Difficult to support Namespaces
- //wsag/assessgrid - prefixes are just strings
- Very difficult to support structural information
- xs:group, xs:all, xs:choice, xs:sequence
- Possible but difficult to support xs:restriction
- xs:simple
- Check for enumeration (xs:restriction of xs:string)
- Check for valid dates (xs:restriction of xs:date)
- Everything else close to impossible
- minInclusive, maxInclusive, minExclusive, maxExclusive
- totalDigits, fractionDigits, length, probably useless
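The remark that prefixes are just strings can be demonstrated with Python's standard `xml.etree.ElementTree` module (the project itself used Java and Axis; Python is used here only for brevity). The namespace URI below is a placeholder, not the real WS-Agreement namespace, and `find_initiator` is a hypothetical helper. An XPath such as `wsag:AgreementInitiator` only means something together with a prefix-to-URI mapping supplied by whoever evaluates it.

```python
import xml.etree.ElementTree as ET

WSAG = "http://example.org/wsag"  # placeholder URI, chosen for this sketch

# Two documents that are identical except for the prefix they declare.
doc_a = (f'<wsag:Context xmlns:wsag="{WSAG}">'
         '<wsag:AgreementInitiator>DN</wsag:AgreementInitiator>'
         '</wsag:Context>')
doc_b = (f'<p:Context xmlns:p="{WSAG}">'
         '<p:AgreementInitiator>DN</p:AgreementInitiator>'
         '</p:Context>')


def find_initiator(xml_text, prefix_map):
    """Evaluate the same path under a caller-supplied prefix mapping."""
    root = ET.fromstring(xml_text)
    node = root.find("wsag:AgreementInitiator", prefix_map)
    return None if node is None else node.text


good_map = {"wsag": WSAG}
wrong_map = {"wsag": "http://example.org/other"}

print(find_initiator(doc_a, good_map))   # matches
print(find_initiator(doc_b, good_map))   # also matches: the document's prefix is irrelevant
print(find_initiator(doc_a, wrong_map))  # same prefix string, different URI: no match
```

A creation constraint that carries only the path string therefore cannot be evaluated reliably unless the prefix mapping travels with it.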