Title: Virtualization in MetaSystems
1. Virtualization in MetaSystems
- Vaidy Sunderam
- Emory University, Atlanta, USA
- vss_at_emory.edu
2. Credits and Acknowledgements
- Distributed Computing Laboratory, Emory University
  - Dawid Kurzyniec, Piotr Wendykier, David DeWolfs, Dirk Gorissen, Maciej Malawski, Vaidy Sunderam
- Collaborators
  - Oak Ridge Labs (A. Geist, C. Engelmann, J. Kohl)
  - Univ. Tennessee (J. Dongarra, G. Fagg, E. Gabriel)
- Sponsors
- U. S. Department of Energy
- National Science Foundation
- Emory University
3. Virtualization
- Fundamental and universal concept in CS, but receiving renewed, explicit recognition
- Machine level
  - Single OS image: Virtuozzo, VServers, Zones
  - Full virtualization: VMware, VirtualPC, QEMU
  - Para-virtualization: UML, Xen (Ian Pratt et al., cl.cam.uk)
  - Consolidate under-utilized resources, avoid downtime, load-balance, enforce security policy
- Parallel distributed computing
  - Software systems: PVM, MPICH, grid toolkits and systems
  - Consolidate under-utilized resources, avoid downtime, load-balance, enforce security policy, aggregate resources
4. Virtualization in PVM
- Historical perspective: PVM 1.0, 1989
5. Key PVM Abstractions
- Programming model
  - Timeshared, multiprogrammed virtual machine
  - Two-level process space
    - Functional name + ordinal number
  - Flat, open, reliable messaging substrate
  - Heterogeneous messages and data representation
- Multiprocessor emulation
  - Processor/process decoupling
  - Dynamic addition/deletion of processors
- Raw nodes projected
  - Transparently
  - Or with exposure of heterogeneous attributes
6. Parallel Distributed Computing
- Multiprocessor systems
  - Parallel distributed memory computing
  - Stable and mainstream: SPMD, MPI
  - Issues relatively clear: performance
- Platforms and applications
  - Correspondingly tightly coupled
7. Parallel Distributed Computing
- Metacomputing and grids
  - Platforms
  - Parallelism
    - Possibly within components, but mostly loose concurrency or pipelining between components (PVM 2-level model)
- Grids: resource virtualization across multiple admin domains
  - Moved to an explicit focus on service orientation
  - Wrap applications as services, compose applications into workflows, deploy on service-oriented infrastructure
- Motivation: service/resource coupling
  - Provider supplies both resource and service; access is virtualized
8. Virtualization in PDC
- What can/should be virtualized?
  - Raw resource
    - CPU: process/task instantiation -> staging, security, etc.
    - Storage: e.g., a network file system over GMail
  - Data: value added or processed
  - Service
    - Define interface and input-output behavior
    - Service provider must operate the service
  - Communication
    - Interaction paradigm with strong/adequate semantics
- Key capability
  - Configurable/reconfigurable resources, services, and communication
9. The Harness II Project
- Theme
  - Virtualized abstractions for critical aspects of parallel distributed computing, implemented as pluggable modules (including programming systems)
- Major project components
  - Fault-tolerant MPI specification and libraries
  - Container/component infrastructure: C-kernel, H2O
  - Communication framework: RMIX
  - Programming systems
    - FT-MPI over H2O, MOCCA (CCA over H2O), PVM
10. Harness II
- Aggregation for Concurrent High Performance Computing
- Hosting layer
  - Collection of H2O kernels
  - Flexible/lightweight middleware
- Equivalent to a Distributed Virtual Machine
  - But only on the client side
- DVM pluglets responsible for
  - (Co-)allocation/brokering
  - Naming/discovery
  - Failures/migration/persistence
- Programming environments: FT-MPI, CCA, paradigm frameworks, distributed numerical libraries
11. H2O Middleware Abstraction
- Providers own resources
  - Independently make them available over the network
- Clients discover, locate, and utilize resources
- Resource sharing occurs between a single provider and a single client
  - Relationships may be tailored as appropriate
    - Including identity formats, resource allocation, compensation agreements
- Clients can themselves be providers
  - Cascading pairwise relationships may be formed
12. H2O Framework
- Resources provided as services
  - Service: an active software component exposing the functionality of the resource
  - May represent added value
  - Runs within a provider's container (execution context)
  - May be deployed by any authorized party: provider, client, or third-party reseller (see the sketch below)
- Provider specifies policies
  - Authentication/authorization
  - Actors -> kernel/pluglet
- Decoupling
  - Providers/providers/clients
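
A minimal Java sketch of the deployment flow above. The types (Kernel, Session, PlugletHandle) are stand-ins that mirror the provider/client/reseller roles; they are assumptions for illustration, not the actual H2O API:

  import java.net.URL;

  // Stand-in interfaces mirroring the roles described on this slide.
  interface Kernel {
      Session login(String credentials);                       // provider-enforced authentication
  }
  interface Session {
      PlugletHandle deploy(URL codebase, String plugletClass); // any authorized party may deploy
  }
  interface PlugletHandle {
      void start();                                            // kernel drives the pluglet lifecycle
  }

  class DeploySketch {
      static void run(Kernel providerKernel) throws Exception {
          Session s = providerKernel.login("client-credentials");  // subject to provider policy
          PlugletHandle h = s.deploy(
                  new URL("http://reseller.example.org/codebase/"), // reseller-hosted code base
                  "org.example.StockQuotePluglet");
          h.start();
      }
  }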
13. Example usage scenarios
- Resource: computational service
  - Reseller deploys a software component into the provider's container
  - Reseller notifies the client about the offered computational service
  - Client utilizes the service
- Resource: raw CPU power
  - Client gathers application components
  - Client deploys components into providers' containers
  - Client executes a distributed application utilizing the providers' CPU power
- Resource: legacy application
  - Provider deploys the service
  - Provider stores the information about the service in a registry
  - Client discovers the service
  - Client accesses the legacy application through the service
14. Model and Implementation
- H2O nomenclature
  - container = kernel
  - component = pluglet
- Object-oriented model; Java- and C-based implementations
- Pluglet: a remotely accessible object
  - Must implement the Pluglet interface; may implement the Suspendible interface
  - Used by the kernel to signal/trigger pluglet state changes
- Model
  - Implement (or wrap) a service as a pluglet to be deployed on kernel(s)
- Diagram: clients invoke functional interfaces (e.g. StockQuote); the kernel drives the Pluglet and Suspendible interfaces:

  interface StockQuote {
      double getStockQuote();
  }

  interface Pluglet {
      void init(ExecutionContext cxt);
      void start();
      void stop();
      void destroy();
  }

  interface Suspendible {
      void suspend();
      void resume();
  }
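
A minimal sketch of a pluglet implementing the interfaces above. The quote value and lifecycle bodies are placeholders, and ExecutionContext is stubbed since its members are not shown on the slide:

  interface ExecutionContext { }                    // stub; real type not shown here

  class StockQuotePluglet implements Pluglet, StockQuote {
      private volatile boolean running;

      public void init(ExecutionContext cxt) { /* read config, acquire resources */ }
      public void start()   { running = true; }     // kernel triggers state changes
      public void stop()    { running = false; }
      public void destroy() { /* release resources */ }

      // Functional interface that remote clients invoke (via RMIX in H2O).
      public double getStockQuote() {
          if (!running) throw new IllegalStateException("pluglet not started");
          return 42.0;                              // placeholder value
      }
  }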
15. Accessing Virtualized Services
- Request-response is ideally suited, but
  - Stateful service access must be supported
  - Efficiency issues, concurrent access
  - Asynchronous access for compute-intensive services
  - Semantics of cancellation and error handling
  - Many approaches focus on performance alone and ignore semantic issues
- Solution
  - Enhanced procedure call/method invocation
  - A well-understood paradigm, extended to be more appropriate for accessing metacomputing services
16. The RMIX layer
- H2O is built on top of the RMIX communication substrate
  - Provides a flexible p2p communication layer for H2O applications
- Enables various message-layer protocols within a single, provider-based framework library
  - Adopting common RMI semantics
  - Enables high performance and interoperability
  - Easy porting between protocols, dynamic protocol negotiation
- Offers a flexible communication model, but retains RMI simplicity
  - Extended with asynchronous and one-way calls
  - Issues: consistency, ordering, exceptions, cancellation
[Diagram: RPC clients, SOAP/Web Services clients, and Java clients reach the H2O kernel through RMIX over the networking layer; supported protocols include RPC, IIOP, JRMP, SOAP, ...]
17. RMIX Overview
- Extensible RMI framework
- Client and provider APIs
  - Uniform access to communication capabilities
  - Supplied by pluggable provider implementations
- Multiple protocols supported
  - JRMPX, ONC-RPC, SOAP
- Configurable and flexible
  - Protocol switching
  - Asynchronous invocation
18. RMIX Abstractions
- Uniform interface and API
  - Protocol switching
  - Protocol negotiation
- Various protocol stacks for different situations
  - SOAP: interoperability
  - SSL: security
  - ARPC, custom (Myrinet, Quadrics): efficiency
- Asynchronous access to virtualized remote resources (see the sketch below)
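
A toy sketch of per-situation protocol selection. The Rmix.lookup helper and provider names are assumptions standing in for the real RMIX API; a dynamic proxy fakes the remote stub:

  import java.lang.reflect.Proxy;

  class Rmix {
      // Stand-in for RMIX lookup: the real framework would negotiate the
      // protocol with the server and return a stub speaking that stack.
      @SuppressWarnings("unchecked")
      static <T> T lookup(String endpoint, Class<T> iface, String provider) {
          return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                  new Class<?>[] { iface },
                  (p, m, a) -> {                    // fake result; suits only the
                      System.out.println(provider + " call to " + endpoint + ": " + m.getName());
                      return 0.0;                   // double-returning method below
                  });
      }
  }

  class ProtocolSwitchSketch {
      interface StockQuote { double getStockQuote(); }

      static StockQuote bind(String endpoint, boolean interop, boolean secure) {
          // Pick a stack per situation: SOAP for interoperability,
          // SSL-wrapped JRMPX for security, a custom provider for speed.
          String provider = interop ? "SOAP" : (secure ? "JRMPX+SSL" : "JRMPX");
          return Rmix.lookup(endpoint, StockQuote.class, provider);
      }
  }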
19. Asynchronous RMIX
- Parameter marshalling
  - Data consistency
  - Also an issue in PVM, MPI, etc.
- Exceptions/cancellation
  - Critical for stateful servers
  - Conservative vs. best effort
- Other issues
  - Execution order
  - Security
- Virtualizing communications
  - Performance/familiarity vs. semantic issues (see the sketch below)
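
A sketch of the asynchronous-call semantics at issue, using a plain ExecutorService as a stand-in for RMIX's asynchronous invocation machinery; the service, symbol, and timeout are illustrative:

  import java.util.concurrent.*;

  class AsyncCallSketch {
      interface StockQuote { double getStockQuote(String symbol); }

      // Parameters are captured ("marshalled") before the call returns, so
      // later mutation by the caller cannot change what the callee sees.
      static Future<Double> asyncQuote(StockQuote svc, String symbol, ExecutorService ex) {
          final String marshalled = symbol;   // immutable; mutable args would be deep-copied here
          return ex.submit(() -> svc.getStockQuote(marshalled));
      }

      public static void main(String[] args) throws Exception {
          ExecutorService ex = Executors.newSingleThreadExecutor();
          Future<Double> f = asyncQuote(s -> 42.0, "XYZ", ex);
          try {
              System.out.println(f.get(1, TimeUnit.SECONDS));  // remote exceptions surface at get()
          } catch (TimeoutException e) {
              // Cancellation semantics matter for stateful servers: the call may
              // already have mutated server state ("conservative vs. best effort").
              f.cancel(true);
          } finally {
              ex.shutdown();
          }
      }
  }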
20. Programming Models: CCA and H2O
- Common Component Architecture (CCA)
  - Component standard for HPC
  - Uses and provides ports, described in SIDL
  - Support for scientific data types
  - Existing tightly coupled (CCAFFEINE) and loosely coupled, distributed (XCAT) frameworks
- H2O
  - Well matched to the CCA model
21. MOCCA implementation in H2O
- Each component runs in a separate pluglet
  - Thanks to H2O kernel security mechanisms, multiple components may run without interfering
- Two-level builder hierarchy (see the sketch below)
  - ComponentID = pluglet URI
- MOCCA_Light: pure-Java implementation (no SIDL)
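
A trimmed sketch of builder-style wiring, with stand-in interfaces loosely modeled on CCA's BuilderService port; these are assumptions, not the actual MOCCA classes, and all names are illustrative:

  // Stand-ins loosely modeled on CCA's BuilderService; not the MOCCA API.
  interface ComponentID { String getSerialization(); }      // e.g. the pluglet URI

  interface BuilderService {
      ComponentID createInstance(String instanceName, String className);
      void connect(ComponentID user, String usesPort,
                   ComponentID provider, String providesPort);
  }

  class MoccaWiringSketch {
      static void assemble(BuilderService builder) {
          // Each createInstance lands a component in its own pluglet, so the
          // kernel's security mechanisms isolate components from one another.
          ComponentID driver = builder.createInstance("driver", "org.example.Driver");
          ComponentID solver = builder.createInstance("solver", "org.example.Solver");
          builder.connect(driver, "solverPort", solver, "solverPort");
          System.out.println("solver pluglet at " + solver.getSerialization());
      }
  }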
22. Performance: Small Data Packets
- Factors
  - SOAP header overhead in XCAT
  - Connection pools in RMIX
23. Large Data Packets
- Encoding (binary vs. base64)
- CPU saturation on Gigabit LAN (serialization)
- Variance caused by Java garbage collection
24. Use Case 2: H2O + FT-MPI
- Overall scheme
  - H2O framework installed on computational nodes, or on cluster front-ends
  - Pluglet for startup, event notification, node discovery
  - FT-MPI native communication (also MPICH)
- Major value added
  - FT-MPI need not be installed anywhere on the computing nodes
    - It is staged just-in-time before program execution
  - Likewise, application binaries and data need not be present on the computing nodes
  - The system must be able to stage them in a secure manner
25. Staging FT-MPI runtime with H2O
- FT-MPI runtime library and daemons
  - Staged from a repository (e.g. a Web server) to the computational node upon the user's request
  - Automatic platform-type detection: the appropriate binary files are downloaded from the repository as needed (see the sketch below)
- Allows users to run fault-tolerant MPI programs on machines where FT-MPI is not pre-installed
  - No login account is needed to do so; H2O credentials are used instead
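
A minimal sketch of the platform-detection step described above, assuming a hypothetical repository layout keyed by OS and architecture (the URL scheme is an assumption):

  import java.net.URL;

  class StagingSketch {
      // Map this node's platform to a binary in the repository. The
      // repository URL layout below is illustrative only.
      static URL binaryFor(String repoBase, String artifact) throws Exception {
          String os   = System.getProperty("os.name").toLowerCase().replace(' ', '-');
          String arch = System.getProperty("os.arch");
          // e.g. http://repo.example.org/ftmpi/linux-amd64/ftmpi-runtime.tar.gz
          return new URL(repoBase + "/" + os + "-" + arch + "/" + artifact);
      }

      public static void main(String[] args) throws Exception {
          System.out.println(binaryFor("http://repo.example.org/ftmpi", "ftmpi-runtime.tar.gz"));
      }
  }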
26. Launching FT-MPI applications with H2O
- Staging applications from a network repository
  - Uses a URL code base to refer to a remotely stored application
  - The platform-specific binary is transparently uploaded to a computational node upon client request
- Separation of roles
  - The application developer bundles the application and puts it into a repository
  - The end user launches the application, unaware of heterogeneity
27. Interconnecting heterogeneous clusters
- Private, non-routable networks
  - Communication proxies on cluster front-ends route data streams
  - Local (intra-cluster) channels are not affected
  - Nodes use virtual addresses at the IP level, resolved by the proxy (see the sketch below)
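
A toy sketch of the addressing idea: each virtual address maps to a (front-end, real address) pair, so cross-cluster traffic is relayed via the destination proxy while intra-cluster traffic goes direct. All names and addresses are made up:

  import java.util.Map;

  class VirtualAddressSketch {
      record Route(String frontEnd, String realAddr) { }

      // Proxy routing table: virtual address -> (front-end, real address).
      static final Map<String, Route> table = Map.of(
              "10.200.0.1", new Route("clusterA-fe.example.org", "192.168.1.11"),
              "10.200.1.7", new Route("clusterB-fe.example.org", "192.168.7.23"));

      // Intra-cluster traffic goes direct; cross-cluster traffic is relayed
      // through the destination cluster's front-end proxy.
      static String nextHop(String srcVirtual, String dstVirtual) {
          Route src = table.get(srcVirtual), dst = table.get(dstVirtual);
          return src.frontEnd().equals(dst.frontEnd())
                  ? dst.realAddr()                  // local channel, unaffected
                  : dst.frontEnd();                 // route via the proxy
      }
  }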
28. Initial experimental results
- Proxied connection versus direct connection
  - The standard FT-MPI throughput benchmark was used
  - Within a Gig-Ethernet cluster, proxies retain 65% of throughput
29. Summary
- Virtualization in PDC
  - Devising appropriate abstractions
  - Balancing pragmatics and performance vs. model cleanness
- The Harness II Project
  - H2O kernel
    - Reconfigurability by clients/third parties is very valuable
  - RMIX communications framework
    - High-level abstractions for control communications (native data communications)
  - Multiple programming-model overlays
    - CCA, FT-MPI, PVM
- Concurrent computing environments on demand