Javier Jaen Martinez - PowerPoint PPT Presentation

About This Presentation

Title:

Javier Jaen Martinez

Description:

Javier Jaen Martinez CERN IT/PDP – PowerPoint PPT presentation

Slides: 47

Provided by: Frederi124

Learn more at: https://www.jlab.org

Transcript and Presenter's Notes

Title: Javier Jaen Martinez

1
Farm Computing Issues and Examples

Javier Jaen Martinez
CERN IT/PDP

2
Table of Contents

Motivation and Goals
Types of Farms
Core Issues
Examples
JMX: A Management Technology
Summary

3
Study Goals

How are farms evolving in non-HEP environments?
Do generic PC farms and filter farms share requirements for system/application monitoring, control and management?
Will we benefit from future developments in other domains?
Which are the emerging technologies for farm computing?

4
Introduction

According to Pfister, there are three ways to improve performance.
In terms of computing technologies:
Work harder: use faster hardware
Work smarter: use more efficient algorithms and techniques
Get help: depending on how processors, memory and interconnect are laid out: MPP, SMP, distributed systems and farms

5
Motivation

IT/PDP is already using commodity farms
All 4 LHC experiments will use Event Filter Farms
Commodity Farms are also becoming very popular
for non HEP applications

6
Motivation
Thousands of tasks and thousands of nodes to be controlled, monitored and managed (a system and application management challenge).
7
Types of Farms

In our domain:
Event Filter Farms
Filter the data acquired in the previous levels of a DAQ
Reduce the aggregated throughput by rejecting uninteresting events or by compressing them
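
The two reduction mechanisms named above (rejecting events and compressing the accepted ones) can be sketched as follows. This is only an illustration: the `Event` fields, the `energy` selection variable and the threshold are hypothetical and do not come from any experiment's filter code.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Event:
    event_id: int
    energy: float   # hypothetical selection variable
    payload: bytes  # raw data from the previous DAQ level


def filter_events(events, min_energy):
    """Yield (event_id, compressed payload) for events that pass the cut."""
    for ev in events:
        if ev.energy < min_energy:
            continue  # reject the uninteresting event
        yield ev.event_id, zlib.compress(ev.payload)


events = [
    Event(1, 5.0, b"\x00" * 1000),
    Event(2, 42.0, b"\x00" * 1000),
    Event(3, 60.0, b"\x01" * 1000),
]
accepted = list(filter_events(events, min_energy=10.0))
bytes_in = sum(len(ev.payload) for ev in events)
bytes_out = sum(len(data) for _, data in accepted)
print(f"kept {len(accepted)} of {len(events)} events, {bytes_in} -> {bytes_out} bytes")
```

Each event is handled independently of the others, which is why this kind of workload spreads naturally over the nodes of a farm.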

8
Types of Farms

Batch Data Processing
Each job reads data from tape, processes the information and writes data back
Each job runs on a separate node
Job management performed by a batch scheduler
Nodes with good CPU performance and large disks
Good connectivity to mass storage
Inter-node communication not critical
(independent jobs)
Interactive Data Analysis
Analysis and data mining
Traverse large databases as fast as possible
Programs may run in parallel
Nodes with great CPU performance and large disks
High performance inter-process communication

9
Types of farms

Monte Carlo Simulation
Used to simulate detectors
Simulation jobs run independently on each node
Similar to a batch data processing system (maybe with lower disk requirements)
Others
Workgroup Services
Central Data Recording Farms
Disk server Farms,
...

10
Types of farms

In non-HEP environments
High Performance Farms (Parallel)
A collection of interconnected stand-alone computers working cooperatively as a single, integrated computing resource
The farm is seen as a computer architecture for parallel computation
High Availability Farms
Mission Critical Applications
Hot Standby
Failover and Failback

11
Key Issues in Farm Computing

Size Scalability (physical and application)
Enhanced Availability (failure management)
Single System Image (look-and-feel of one system)
Fast Communication (networks and protocols)
Load Balancing (CPU, Net, Memory, Disk)
Security and Encryption (farm of farms)
Distributed Environment (Social issues)
Manageability (admin. and control)
Programmability (offered API)
Applicability (farm-aware and non-aware app.)

12
Core Issues (Maturity)
[Figure: the core issues placed on a maturity scale running from "Mature Development" to "Future Challenge": Monitoring, Load Balancing, Failure Management, SSI, Fast Communication, Manageability]
13
Monitoring why?

Performance Tuning
The environment changes dynamically due to the variable load on the system and the network
Improve or maintain the quality of the services as those changes occur
Reactive control monitoring acts on farm parameters to obtain the desired performance
Fault Recovery
Identify the source of any failure in order to improve robustness and reliability
An automatic fault recovery service is needed in farms with hundreds of nodes (migration, ...)
Security
Detect and report security violation events

14
Monitoring Why?

Performance Evaluation
Evaluate application/system performance at run-time
Evaluation is performed off-line with data monitored on-line
Testing
Check the correctness of new applications running in a farm by:
detecting erroneous or incorrect operations
obtaining activity reports of certain functions of the farm
obtaining a complete history of the farm in a given period of time

15
Monitoring Types
Generation: instrumentation, collection, traces generation
Generation options: pull/push, distributed/centralized, time/event driven, collection format
Processing: traces merging, database updating, correlation, filtering
Processing options: online/offline, on demand/automatic, storage format
Dissemination: users, managers, control systems
Dissemination options: dissemination format, access type, access control, on demand/automatic
Presentation options: presentation format
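
As an illustration of the pull/push and time/event choices in the generation stage, here is a minimal sketch. The class names, the `load` metric and the node names are hypothetical; real collectors would talk to the nodes over the network instead of calling local functions.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str, float]  # (node, metric, value)


class PullCollector:
    """Time-driven collection: the collector asks every node for a reading."""

    def __init__(self, sensors: Dict[str, Callable[[], float]]):
        self.sensors = sensors  # node name -> function returning the metric

    def poll(self, metric: str) -> List[Sample]:
        return [(node, metric, read()) for node, read in self.sensors.items()]


class PushCollector:
    """Event-driven collection: nodes report on their own initiative."""

    def __init__(self) -> None:
        self.samples: List[Sample] = []

    def report(self, node: str, metric: str, value: float) -> None:
        self.samples.append((node, metric, value))


def overloaded(samples: List[Sample], limit: float) -> List[str]:
    """Processing step: filter the merged trace down to nodes above a limit."""
    return sorted({node for node, _, value in samples if value > limit})


pull = PullCollector({"node01": lambda: 0.4, "node02": lambda: 3.2})
push = PushCollector()
push.report("node03", "load", 5.0)  # node03 reports a load spike itself

trace = pull.poll("load") + push.samples  # merge both traces
print(overloaded(trace, limit=2.0))
```

Pulling gives the collector control over the sampling rate; pushing avoids polling thousands of idle nodes but relies on each node to report.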
How many monitoring tools are available?
16
Monitoring Tools
Maple.
SAS.
Cheops
NextPoint
NetLogger
Ganymede
MTR
ResponseNetworks
MeasureNet
Network health
http://www.slac.stanford.edu/cottrell/tcom/nmtf.html
No integrated tools for service, application, device and network monitoring
17
Monitoring Strategies?

Define common strategies
What to be monitored?
Collection strategies
Processing alternatives
Displaying techniques
Obtain Modular implementations
Good example: ATLAS Back-End Software
IT Division has started a monitoring project
Integrated monitoring
Service Oriented

18
Fast Communication
[Figure: "Killer Platform" and "Killer Switch", with time scales of ns, µs and ms]

Fast processors and fast networks
The time is spent in crossing between them

19
Fast Communication

Remove the kernel from the critical path
Offer user applications fully protected, virtual, direct (zero-copy message sends), user-level access to the network interface
This idea has been specified in VIA (Virtual Interface Architecture)

[Figure: VIA layers: Application; High-Level Communication Library (MPI, ShM Put/Get, PVM); Send/Recv/RDMA; Buffer Management/Synchronization; VI Kernel Agent; VI Network Adapter]
20
Fast Communication

VIA's predecessors
Active Messages (Berkeley NOW project, Fast Sockets)
Fast Messages (UCSD MPI, Shmem Put/Get, Global Arrays)
Applications using sockets, MPI or ShMem can benefit from these fast communication layers
Several farms (HPVM (FM), NERSC PC cluster (M-VIA), ...) already benefit from this technology

21
Fast Communication (Fast Messages)
[Figure: FM bandwidth (MB/s) and latency (µs) versus message size (bytes), for message sizes up to 64K, with the FM packet size marked; annotated values: 77.1 MB/s and 11.1 µs]
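
The two values annotated on the Fast Messages plot (77.1 MB/s and 11.1 µs) can be combined in a simple first-order model to see how bandwidth depends on message size. The model is an assumption, not taken from the slides: it treats 11.1 µs as a fixed per-message cost, 77.1 MB/s as the asymptotic bandwidth, and 1 MB as 10^6 bytes.

```python
LATENCY_S = 11.1e-6  # per-message latency annotated on the plot (11.1 µs)
PEAK_BW = 77.1e6     # bandwidth annotated on the plot, assuming 1 MB = 10**6 bytes


def transfer_time(n_bytes, latency=LATENCY_S, peak_bw=PEAK_BW):
    """Assumed linear model: fixed start-up cost plus size / peak bandwidth."""
    return latency + n_bytes / peak_bw


def effective_bandwidth(n_bytes, latency=LATENCY_S, peak_bw=PEAK_BW):
    """Bytes per second actually delivered for messages of n_bytes."""
    return n_bytes / transfer_time(n_bytes, latency, peak_bw)


def half_power_size(latency=LATENCY_S, peak_bw=PEAK_BW):
    """Message size at which half of the peak bandwidth is reached.

    Solving n / (latency + n / peak_bw) = peak_bw / 2 gives n = latency * peak_bw.
    """
    return latency * peak_bw


for n in (16, 256, 4096, 65536):
    print(f"{n:6d} bytes: {effective_bandwidth(n) / 1e6:5.1f} MB/s")
print(f"half-power message size: {half_power_size():.0f} bytes")
```

Under these assumptions a message must be a little under 1 KB before half of the peak bandwidth is reached, which is why low per-message overhead matters for applications that exchange many small messages.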
22
Fast Communication
[Figure: comparison of HPVM with Power Challenge, SP-2, T3E, Origin 2K and Beowulf]
23
Single System Image

A single system image is the illusion, created by
software or hardware, that presents a collection
of resources as one, more powerful resource.
Strong SSI results in farms appearing like a
single machine to the user, to applications, and
to the network.
The SSI level is a good measure of how tightly coupled the nodes in a farm are
Every farm has a certain degree of SSI (A farm
with no SSI at all is not a farm).

24
Benefits of Single System Image

Usage of system resources transparently
Transparent process migration and load balancing
across nodes.
Improved reliability and higher availability
Improved system response time and performance
Simplified system management
Reduction in the risk of operator errors
Users need not be aware of the underlying system architecture to use these machines effectively
(C) from Jain

25
SSI Services

Single Entry Point
Single File Hierarchy: xFS, AFS, ...
Single Control Point: management from a single GUI
Single Memory Space
Single Job Management: Glunix, Codine, LSF
Single User Interface: like a workstation/PC windowing environment
Single I/O Space (SIO)
Any node can access any peripheral or disk device without knowing its physical location.

26
SSI Services

Single Process Space (SPS)
Any process on any node can create processes with cluster-wide process IDs, and they communicate through signals, pipes, etc., as if they were on a single node.

Every SSI has a boundary
Single system support can exist at different levels
OS level: MOSIX
Middleware: Codine, PVM
Application level: Monitoring App, Back-End SW

27
Scheduling Software

Goal: enable the scheduling of system activities and the execution of applications while offering high availability services transparently
Usually works completely outside the kernel, on top of the machines' existing operating system
Advantages
Load balancing
Use of spare CPU cycles
Fault tolerance
In practice, increased and more reliable throughput of user applications

28
SS Generalities

The workings of a typical SS
Create a job description file: job name, resources, desired platform, ...
The client software sends the job description file to a master scheduler
The master scheduler has an overall view: the queues that have been configured plus the computational load of the nodes in the farm
The master ensures that the resources being used are load balanced and that jobs complete successfully
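
The workflow above can be sketched in a few lines. This is a toy model, not the design of any of the packages discussed in this talk: the `JobDescription` fields, the CPU-based load measure and the "least loaded eligible node" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobDescription:
    name: str
    cpus: int       # resources requested
    platform: str   # desired platform


@dataclass
class Node:
    name: str
    platform: str
    total_cpus: int
    used_cpus: int = 0

    @property
    def load(self) -> float:
        return self.used_cpus / self.total_cpus


class MasterScheduler:
    """Keeps the overall view: one queue of pending jobs plus the node loads."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.queue = []
        self.placements = {}  # job name -> node name

    def submit(self, job: JobDescription) -> None:
        self.queue.append(job)

    def dispatch(self) -> None:
        """Place each queued job on the least loaded node that can run it."""
        still_pending = []
        for job in self.queue:
            candidates = [
                n for n in self.nodes
                if n.platform == job.platform
                and n.total_cpus - n.used_cpus >= job.cpus
            ]
            if not candidates:
                still_pending.append(job)  # stays queued until resources free up
                continue
            node = min(candidates, key=lambda n: n.load)
            node.used_cpus += job.cpus
            self.placements[job.name] = node.name
        self.queue = still_pending


master = MasterScheduler([
    Node("node01", "linux", total_cpus=4, used_cpus=3),
    Node("node02", "linux", total_cpus=4),
    Node("node03", "nt", total_cpus=2),
])
master.submit(JobDescription("reco", cpus=2, platform="linux"))
master.submit(JobDescription("sim", cpus=2, platform="linux"))
master.submit(JobDescription("viz", cpus=4, platform="nt"))
master.dispatch()
print(master.placements, [job.name for job in master.queue])
```

A real package adds what the following slides list: multiple queues, allocation policies, checkpointing, migration and fault tolerance.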

29
SS Main features

Application Support
Are batch, interactive and parallel jobs supported?
Multiple configurable queues?
Job Scheduling and Allocation
Allocation policy: takes into account system load, CPU type, computational load, memory, disk space, ...
Checkpointing: save state at regular intervals during job execution; a job can be restarted from the last checkpoint
Migration: move a job to another node in the farm to achieve dynamic load balancing or to perform a sequence of activities on different specialized nodes
Monitoring / Suspension / Resumption

30
SS Main features

Dynamics of resources
Resources, queues and nodes can be reconfigured dynamically
Existence of single points of failure
Fault tolerance: re-run a job if the system crashes and check for needed resources

31
SS Packages
Research: CCS, Condor, Dynamic Network Queueing System (DNQS), Distributed Queueing System (DQS), Generic NQS, Portable Batch System (PBS), Prospero Resource Manager, MOSIX, Far, Dynamite
Commercial: Codine (Genias), LoadBalancer (Tivoli), LSF (Platform), Network Queueing Environment (NQE) (SGI), TaskBroker (HP)
[Figure: diagram relating Condor, DNQS, Utopia, NQS, DQS, NQE, PBS, Codine and LSF]
32
SS Some examples

CODINE and LSF
Designed for large heterogeneous networked environments
Dynamic and static load balancing
Batch, interactive and parallel jobs
Checkpointing and migration
Offer an API for new distributed applications
No single point of failure
Job accounting data and analysis tools
Modification of resource reservations for started jobs and specification of releasable shared resources (LSF)
MPI (LSF) vs. MPI, PVM, Express, Linda (Codine)
Reporting tools (LSF)
C API (LSF), ?? (Codine)
No checkpointing of forked or signaled jobs

33
Failure Management

Traditionally associated with scheduling software and oriented to long-running (CPU-intensive) processes
If a CPU-intensive process crashes -> wasted CPU
Solution
Save the state of the process periodically
In case of failure, the process is restarted from the last checkpoint
Strategies
Store checkpoints in files using a distributed file system (slows down computation; NFS performs poorly; AFS caching of checkpoints may flush other useful data)
Checkpoint servers (a dedicated node with disk storage and management functions for checkpointing)
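
A minimal sketch of the "save periodically, restart from the last checkpoint" idea follows. It is an application-level illustration only: the `run_job` function, the checkpoint interval and the simulated crash are hypothetical, and the checkpoint goes to a local temporary file rather than to a distributed file system or a checkpoint server.

```python
import os
import pickle
import tempfile


def run_job(n_steps, ckpt_path, every=100, crash_at=None):
    """Sum 0..n_steps-1, checkpointing the state every `every` steps."""
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)  # resume from the last checkpoint
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < n_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated node crash")
        state["total"] += state["step"]  # stands in for the CPU-intensive work
        state["step"] += 1
        if state["step"] % every == 0:
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            # Rename only after the write finished, so a crash during the
            # write never leaves a truncated checkpoint behind.
            os.replace(tmp_path, ckpt_path)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "job.ckpt")
    try:
        run_job(1000, ckpt, every=100, crash_at=250)
    except RuntimeError:
        pass  # work done since step 200 is lost, earlier work is not
    result = run_job(1000, ckpt, every=100)  # restarts from step 200

print(result == sum(range(1000)))
```

The checkpoint interval is the trade-off hinted at in the strategies above: frequent checkpoints lose less work after a crash but spend more time writing state.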

34
Failure Management

Levels
Transparent checkpointing checkpointing library
linked against an executable binary. The library
checkpoints transparently the process (condor,
libckpt, Hector)
User directed Checkpointing (directives included
in the applications code to perform specific
checkpoints of particular memory segments)
Future challenges
Decoupling Failure management and scheduling
Define strategies for System failure recovery (at
kernel level?)
Define strategies for task failure recovery

35
Example: MOSIX Farms

MOSIX: Multicomputer OS for UNIX
An OS module (layer) that provides the
applications with the illusion of working on a
single system
Remote operations are performed like local
operations
Strong SSI at kernel level

36
Example: MOSIX Farms
Preemptive process migration that can migrate any process, anywhere, anytime

Supervised by distributed algorithms that respond on-line to global resource availability, transparently
Load balancing: migrate processes from over-loaded to under-loaded nodes
Memory ushering: migrate processes from a node that has exhausted its memory, to prevent paging/swapping
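
The load-balancing rule can be illustrated with a deliberately simplified, centralized sketch. This is not MOSIX's algorithm, which is distributed and runs inside the kernel: here the load is just a process count per node, every process is assumed to cost the same, and a single function sees all nodes.

```python
def pick_migration(loads):
    """Return (source, target) node names, or None if the farm is balanced.

    A migration only helps if the gap is at least two processes; moving one
    process across a gap of one would just swap the roles of the two nodes.
    """
    source = max(loads, key=loads.get)
    target = min(loads, key=loads.get)
    if loads[source] - loads[target] < 2:
        return None
    return source, target


def balance(loads):
    """Migrate one process at a time until no migration helps."""
    loads = dict(loads)  # work on a copy
    moves = []
    move = pick_migration(loads)
    while move is not None:
        source, target = move
        loads[source] -= 1
        loads[target] += 1
        moves.append(move)
        move = pick_migration(loads)
    return loads, moves


final_loads, moves = balance({"node01": 6, "node02": 1, "node03": 2})
print(final_loads, len(moves))
```

Memory ushering follows the same pattern with free memory instead of process count as the trigger.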

37
Example MOSIX Farms

A scalable cluster configuration
50 Pentium-II 300 MHz
38 Pentium-Pro 200 MHz (some are SMPs)
16 Pentium-II 400 MHz (some are SMPs)
Over 12 GB cluster-wide RAM
Connected by a Myrinet 2.56 Gb/s LAN
Runs Red Hat 6.0, based on kernel 2.2.7
Download MOSIX: http://www.mosix.cs.huji.ac.il/

38
Example: HPVM Farms

Goal: obtain supercomputing performance from a pile of PCs
Scalability: 256 processors demonstrated
Networking: over Myrinet interconnect
OS: Linux and NT (going NT)

[Figure: HPVM software stack built on Illinois Fast Messages (FM), with APIs marked "available now" or "under development": MPI, SHMEM, Global Arrays, CORBA, Winsock 2, HPF]
39
Example: HPVM Farms

SSI: at middleware level (MPI and LSF)
Fast communication: Fast Messages
Monitoring: none yet
Manageability: still poor
HPVM front-end (Java applet, LSF features)
Symera (under development at NCSA)
DCOM-based management tool (only for NT)
Add/remove nodes from a cluster
Logical cluster definition
Distributed process control and monitoring
Other: NERSC PC Cluster and Beowulf

40
Example: Disk Server Farms

Purpose: transfer data sets between disks and applications
IT/PDP
RFIO package (optimized for large sequential data transfers)
Each disk server system runs one master RFIO daemon in the background, and new requests lead to the spawning of further RFIO daemons
Memory space is used for caching
SSI: weak
Load balancing of RFIO daemons across the nodes of the farm
A single memory space and a single I/O space could be useful in a disk server farm with heterogeneous machines

41
Example: Disk Server Farms

Monitoring: RFIO daemon status, load of farm nodes, memory usage, cache hit rates, ...
Fast messaging: RFIO techniques using TCP sockets
Manageability: storage, daemon and caching management
Linux-based disk server performance is now comparable to UNIX disk servers (benchmarking study by Bernd Panzer, IT/PDP)
DPSS (Distributed Parallel Storage Server)
A collection of disk servers that operate in parallel over a wide area network to provide logical block-level access to large data sets
SSI: applications are not aware of declustered data; load balancing if data is replicated
Monitoring: Java agents for monitoring and management
Fast messaging: dynamic TCP buffer size adjustment
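
The slides do not say how DPSS adjusts its TCP buffers. A common sizing rule, shown here only as an assumed illustration, is the bandwidth-delay product: the buffer must hold all the data that can be in flight on the path. The link speed and round-trip times below are made-up example values.

```python
def tcp_buffer_bytes(bandwidth_bits_per_s, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to keep the path full."""
    return int(round(bandwidth_bits_per_s * rtt_s / 8))


# Same hypothetical 100 Mbit/s link, local versus wide-area round-trip time.
lan_buffer = tcp_buffer_bytes(100e6, 0.001)  # 1 ms round trip
wan_buffer = tcp_buffer_bytes(100e6, 0.050)  # 50 ms round trip
print(lan_buffer, wan_buffer)
```

Because the required buffer grows with the round-trip time, a size that works on a LAN can throttle a wide-area transfer, which is one reason to adjust it per connection.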

42
JMX: A Management Technology

JMX: Java Management Extensions (basics)
Defines a management architecture, APIs and management services, all under a single specification
Resources can be made manageable without regard to how their manager is implemented (SNMP, CORBA, Java manager)
Based on dynamic agents
Platform and protocol independent
JDMK 3.2

[Figure: JMX levels: Management Application; Manager Level (JMX Manager); Agent Level (JMX Agent); Instrumentation Level (JMX Resource); Managed Resource]
43
JMX Components
44
JMX Applications

Implement distributed SNMP monitoring infrastructures
Management of heterogeneous farms (NT + Linux)
Environments where management intelligence or requirements change over time
Environments where management clients may be implemented using different technologies

45
Summary

The scale and intended use of farms will grow in the coming years
We presented a set of factors for comparing different farm computing approaches
Developments from non-HEP domains can be used in HEP farms:
Fast networking
Monitoring
System management
However, application and task management is very dependent on particular domains

46
Summary

The EFF community should:
Share common experiences (specific subfields in future meetings)
Define common monitoring requirements and mechanisms, SSI requirements and management procedures (filtering, reconstruction, compression, ...)
Follow developments in the management of high performance computing farms (the same challenge of managing thousands of processes/threads)
Obtain, if possible, modular implementations of these requirements that constitute an EFF management approach