Title: Current and Emerging Trends in Cluster Computing
1. Current and Emerging Trends in Cluster Computing
University of Portsmouth, UK
Daresbury Laboratory Seminar, 9th January 2001
http://www.dcs.port.ac.uk/mab/Talks/
2. Talk Content
- Background and Overview
- Cluster Architectures
- Cluster Networking
- SSI
- Cluster Tools
- Conclusions
3. Commodity Cluster Systems
- Bringing high-end computing to a broader problem
domain - new markets.
- Order of magnitude price/performance advantage.
- Commodity enabled - no long development lead
times.
- Low vulnerability to vendor-specific decisions -
companies are ephemeral, clusters are forever!!!
4. Commodity Cluster Systems
- Rapid response to technology tracking.
- User-driven configuration - potential for
application-specific ones.
- Industry-wide, non-proprietary software
environment.
5. Cluster Architecture
6. Beowulf-class Systems
- Cluster of PCs
- Intel x86
- DEC Alpha
- Mac PowerPC.
- Pure Mass-Market COTS.
- Unix-like O/S with source
- Linux, BSD, Solaris.
- Message passing programming model
- MPI, PVM, BSP, homebrews
- Single user environments.
- Large science and engineering applications.
7. Decline of Heavy Metal
- No market for high-end computers
- minimal growth in the last five years.
- Extinction
- KSR, TMC, Intel, Meiko, Cray!?, Maspar, BBN,
Convex.
- Must use COTS
- Fabrication costs skyrocketing
- Development lead times too short.
- US Federal Agencies fleeing
- NSF, DARPA, DOE, NIST.
- Currently no good new IDEAS.
8. HPC Architectures in the Top 500
[Chart: architecture breakdown of the Top 500 systems]
9. Clusters on the Top 500
10. A Definition!?
- A cluster is a type of parallel or distributed
system that consists of a collection of
interconnected whole computers used as a single,
unified computing resource.
- Where "whole computer" is meant to indicate a
normal, whole computer system that can be used on
its own: processor(s), memory, I/O, OS, software
sub-systems, applications.
11. But
- There is still a lot of discussion about what a
cluster is.
- Clusters and Constellations!
- Is it based on commodity or proprietary
hardware/software?
- It seems that there are commodity and commercial
camps.
- Both are potentially correct!!
12. Taxonomy
[Diagram: a taxonomy of cluster computing]
13. Cluster Technology Drivers
- Reduced recurring costs - approximately 10% of
MPPs.
- Rapid response to technology advances.
- Just-in-place configuration and reconfigurable.
- High reliability if the system is designed
properly.
- Easily maintained through low-cost replacement.
- Consistent portable programming model
- Unix, C, Fortran, message passing.
- Applicable to a wide range of problems and
algorithms.
14. Operating Systems
- Little work on OSs specifically for clusters.
- Turnkey clusters are provided with versions of a
company's mainline products.
- Typically there may be some form of SSI
integrated into a conventional OS.
- Two variants are encountered:
- System administration/job-scheduling purposes -
middleware that enables each node to deliver the
required services.
- Kernel-level - e.g., transparent remote device
usage, or the use of a distributed storage
facility that is seen by users as a single
standard file system.
15. Linux
- The most popular OS for clusters is Linux.
- It is free.
- It is open source - anyone is free to customize
the kernel to suit one's needs.
- It is easy - a large community of users and
developers has created an abundance of tools, web
sites, and documentation, so that Linux
installation and administration is straightforward
enough for a typical cluster user.
16. Examples: Solaris MC
- Sun has built a multi-computer version of its
Solaris OS, called Solaris MC.
- It incorporates some advances made by Sun,
including an object-oriented methodology and the
use of CORBA IDL in the kernel.
- It consists of a small set of kernel extensions
and a middleware library - it provides SSI to the
level of the device.
- Processes running on one node can access remote
devices as if they were local; it also provides a
global file system and process space.
17. Examples: Micro-kernels
- Another, minimalist, approach is to use
micro-kernels - Exokernel is such a system.
- With this approach, only the minimal amount of
system functionality is built into the kernel,
allowing the services that are needed to be
loaded.
- It maximizes the available physical memory by
removing undesirable functionality.
- The user can alter the characteristics of a
service; e.g., a scheduler specific to a cluster
application may be loaded that helps it run more
efficiently.
18. How Much of the OS is Needed?
- This raises the issue of OS configuration - in
particular, why provide a node OS with the ability
to provide more services to applications than they
are ever likely to use?
- E.g., a user may want to alter the personality of
the local OS - "strip down" to a minimalist kernel
to maximise the available physical memory and
remove undesired functionality.
- Mechanisms to achieve this range from
- Use of a new kernel
- Dynamically linking service modules into the
kernel (see the sketch below).
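As a concrete illustration of the second mechanism, here is a minimal
sketch of a dynamically loadable service module for a 2.4-era Linux
kernel; the module's name and messages are hypothetical.

    /* cluster_mod.c - minimal sketch of a loadable kernel service
     * module (Linux 2.4-era API); name and messages are hypothetical. */
    #include <linux/module.h>
    #include <linux/kernel.h>

    MODULE_LICENSE("GPL");

    /* Called when the module is linked into the running kernel (insmod). */
    int init_module(void)
    {
        printk(KERN_INFO "cluster service module loaded\n");
        return 0;               /* 0 = success */
    }

    /* Called when the module is removed again (rmmod). */
    void cleanup_module(void)
    {
        printk(KERN_INFO "cluster service module unloaded\n");
    }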
19. Networking - Introduction
- One of the key enabling technologies that has
established clusters as a dominant force is
networking technology.
- High-performance parallel applications need
low-latency, high-bandwidth and reliable
interconnects.
- Existing LAN/WAN technologies/protocols
(10/100 Mbps Ethernet, ATM) are not well suited to
supporting clusters.
- Hence the birth of System Area Networks (SANs).
20. Comparison
[Table comparing SAN interconnect technologies - not reproduced]
21. Why Buy a SAN?
- Well, it depends on your application.
- For scientific HPC, Myrinet seems to offer good
MBytes/s per dollar, lots of software and proven
scalability.
- Synfinity, with its best-in-class 1.6 GBytes/s,
could be a valuable alternative for
small/medium-sized clusters, though it is untried
at the moment.
- Windows-based users should give Giganet a try.
- QsNet and ServerNet II are likely the most
expensive solutions, but an Alpha-based cluster
from Compaq with one of these should be a good
number cruncher.
22. Some Emerging Technologies
23. Communications Concerns
- New Physical Networking Technologies are Fast
- Gigabit Ethernet, ServerNet, Myrinet
- Legacy Network Protocol Implementations are Slow
- System Calls
- Multiple Data Copies.
- Communications Gap
- Systems use a fraction of the available
performance.
24. Communications Solutions
- User-level (no kernel) networking.
- Several Existing Efforts
- Active Messages (UCB)
- Fast Messages (UIUC)
- U-Net (Cornell)
- BIP (Univ. Lyon, France)
- Standardization: VIA
- Industry Involvement
- Killer Clusters.
[Diagram: user process communicating directly with the NIC, bypassing
the OS]
25. VIA
- VIA is a standard that combines many of the best
features of various academic projects, and will
strongly influence the evolution of cluster
computing.
- Although VIA can be used directly for application
programming, it is considered by many systems
designers to be at too low a level for application
programming.
- With VIA, the application must be responsible for
allocating some portion of physical memory and
using it effectively.
26. VIA
- It is expected that most OS and middleware
vendors will provide an interface to VIA that is
suitable for application programming.
- Generally, this interface comes in the form of a
message-passing interface for scientific or
parallel programming.
27. What is VIA?
- Use the kernel for set-up, and get it out of the
way for send/receive!
- The Virtual Interface (VI)
- Protected application-to-application channel
- Memory directly accessible by the user process.
- Target environment
- LANs and SANs at gigabit speeds
- No reliability of the underlying media assumed
(unlike MPP fabrics)
- Errors/drops are assumed to be rare - and are
generally fatal to the VI.
28. InfiniBand - Introduction
- System bus technologies are beginning to reach
their limits in terms of speed.
- Common PCI buses can only support up to 133 MBps
across all PCI slots, and even with the 64-bit,
66 MHz buses available in high-end PC servers,
566 MBps of shared bandwidth is the most a user
can hope for.
29. InfiniBand - Introduction
- To counter this, a new standard based on switched
serial links to device groups and devices is
currently in development.
- Called InfiniBand, the standard is actually a
merged proposal from two earlier groups: Next
Generation I/O (NGIO), led by Intel, Microsoft,
and Sun, and Future I/O, supported by Compaq, IBM,
and Hewlett-Packard.
30. InfiniBand Hardware
31. InfiniBand - Performance
- A single InfiniBand link operates at 2.5 Gbps,
point-to-point in a single direction.
- Bi-directional links offer twice the throughput
and can be aggregated into larger pipes of
1 GBytes/s (four co-joined links) or 3 GBytes/s
(12 links).
- Higher aggregations of links will be possible in
the future.
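As a rough consistency check of those aggregate figures (assuming
InfiniBand's 8b/10b encoding, under which each 2.5 Gbps link carries
2 Gbps, i.e. 250 MBytes/s, of data):

    4 \times 250\ \text{MB/s} = 1\ \text{GB/s}, \qquad
    12 \times 250\ \text{MB/s} = 3\ \text{GB/s}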
32. Recap
- This whistle-stop tour has looked at some of the
existing and emerging network technologies that
are being used with current clusters.
- The hardware is more advanced than the software
that comes with it.
- Software is starting to catch up - VIA and
InfiniBand are providing the performance and
functionality that today's sophisticated
applications require.
33. Single System Image (SSI)
- The illusion, created by software or hardware,
that presents a collection of computing resources
as one whole, unified resource.
- The cluster appears like a single machine to
users, applications, and the network.
- Use of system resources is transparent.
- Transparent process migration and load balancing.
- Potentially improved reliability and higher
availability, system response time and
performance.
- Simplified system management.
- No need to be aware of the underlying system
architecture to use these machines effectively.
34. Desired SSI Services
- Single entry point
- telnet cluster.my_institute.edu rather than
telnet node1.cluster.my_institute.edu
- Single file hierarchy: /proc, NFS, xFS, AFS, etc.
- Single control point: management GUI
- Single memory space - Network RAM/DSM
- Single job management: Codine, LSF
- Single GUI: like a workstation/PC windowing
environment; it may be Web technology
35. Cluster Tools
- Introduction
- Management Tools
- Application Tools
- MPI/OpenMP
- Debuggers
- ATLAS
- PAPI
36. Cluster Tools: Introduction
- For cluster computing to be effective within the
community, it is essential that numerical
libraries and programming tools are available to
application developers and system maintainers.
- Cluster systems present very different software
environments on which to build tools, libraries
and applications.
37. Introduction
- There are basically two categories of cluster
tools:
- Management
- Application
- Management tools are used to configure, install,
manage, and maintain clusters.
- Application tools are used to help design,
develop, implement, debug and profile user
applications.
38. Management Tools
- Clusters will only be taken up effectively if
there are not only computer scientists providing
programming environments and tools, but also
developers producing useful and successful
applications, AND systems engineers to configure
and manage these machines.
- System management is often neglected when the
price/performance of clusters is compared against
that of other, more traditional machines.
- System management is vital for the successful
deployment of all types of clusters.
39. System Tools
- The lack of good administration and user-level
application management tools represents a hidden
operation cost that is often overlooked.
- While there are numerous tools and techniques
available for the administration of clusters, few
of these tools ever see the outside of their
developer's cluster - basically, they are
developed for specific in-house uses.
- This results in a great deal of duplicated effort
among cluster administrators and software
developers.
40. System Tools: Criteria
- One of the main criteria for these tools is that
they provide the look and feel of commands issued
to a single machine.
- This is accomplished by using lists, or
configuration files, to represent the group of
machines on which a command will operate (see the
sketch below).
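A minimal sketch of this pattern in C; the "machines" file name, the
use of rsh, and the error handling are illustrative assumptions, not
taken from any particular tool.

    /* clusterrun.c - run one command on every node listed in a
     * "machines" configuration file (one hostname per line). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        FILE *f;
        char node[256], cmd[1024];

        if (argc < 2) {
            fprintf(stderr, "usage: %s \"command\"\n", argv[0]);
            return 1;
        }
        if ((f = fopen("machines", "r")) == NULL) {
            perror("machines");
            return 1;
        }
        while (fgets(node, sizeof node, f) != NULL) {
            node[strcspn(node, "\n")] = '\0';   /* strip newline */
            if (node[0] == '\0')
                continue;                        /* skip blank lines */
            snprintf(cmd, sizeof cmd, "rsh %s %s", node, argv[1]);
            if (system(cmd) != 0)
                fprintf(stderr, "command failed on %s\n", node);
        }
        fclose(f);
        return 0;
    }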
41. System Tools: Security
- Generally, security inside a cluster - between
cluster nodes - is somewhat relaxed for a number
of practical reasons.
- Some of these include
- Improved performance
- Ease of programming
- All nodes are generally compromised if one
cluster node's security is compromised.
- Thus, security from outside the cluster into the
cluster is of utmost concern.
42. System Tools: Scalability
- A user may tolerate an inefficient tool that
takes minutes to perform an operation across a
cluster of 8 machines, as it is faster than
performing the operation manually 8 times.
- However, that user will most likely find it
intolerable to wait over an hour for the same
operation to take effect across 128 cluster nodes.
- A further complication is federated clusters -
extending even further to wide-area
administration.
43. System Tools: Some Areas
- Move disk images from an image server to clients.
- Copy/move/remove client files.
- Build a bootable diskette to initially boot a new
cluster node prior to installation.
- Secure shell (ssh).
- Cluster-wide ps - manipulate cluster-wide
processes.
- DHCP, used to allocate IP addresses to machines
on a given network - lease an IP to a node.
- Shutdown/reboot individual nodes.
44. Cluster Tools
- There have been many advances and developments in
the creation of parallel code and tools for
distributed memory machines, and likewise for
SMP-based parallelism.
- In most cases the parallel, MPI-based libraries
and tools will operate on cluster systems, but
they may not achieve an acceptable level of
efficiency or effectiveness on clusters that
comprise SMP nodes.
45. MPI
- The underlying technology with which distributed
memory machines are programmed is MPI.
- MPI provides the communication layer of the
library or package, which may or may not be
revealed to the user.
- The large number of implementations of MPI
ensures portability of codes across platforms and,
in general, the use of MPI-based software on
clusters (see the example below).
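As a flavour of the message-passing model, here is a minimal,
standard MPI program in C; the ring-passing logic is illustrative
only.

    /* ring.c - each rank sends its rank number to the next rank in a
     * ring. Compile with mpicc, run with mpirun. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, token;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Send our rank to the right neighbour; receive from the left. */
        MPI_Sendrecv(&rank,  1, MPI_INT, (rank + 1) % size,        0,
                     &token, 1, MPI_INT, (rank + size - 1) % size, 0,
                     MPI_COMM_WORLD, &status);

        printf("rank %d of %d received %d\n", rank, size, token);
        MPI_Finalize();
        return 0;
    }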
46. HPF and OpenMP
- Two standards for parallel programming without
explicit message passing.
- HPF targets data-parallel applications for
distributed memory MIMD systems (RIP).
- OpenMP targets (scalable) shared memory
multiprocessors.
47. OpenMP
- The emerging OpenMP standard is providing a
portable base for the development of libraries for
shared memory machines.
- Although most cluster environments do not support
this paradigm globally across the cluster, it is
still an essential tool for clusters that have SMP
nodes.
48. OpenMP
- Portable shared-memory multiprocessing API.
- Fortran 77, Fortran 90, and C/C++.
- Multi-vendor support, for both UNIX and NT.
- Standardizes fine-grained (loop) parallelism.
- Also supports coarse-grained algorithms.
- Based on compiler directives and runtime library
calls.
49. OpenMP
- Specifies
- Work-sharing constructs
- Data environment constructs
- Synchronization constructs
- Library routines and environment variables.
- Example directives (see the sketch below):
- PARALLEL
- DO
- SECTION
- SINGLE
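A minimal C sketch of the directive style - a parallel loop with a
reduction; the array contents and problem size are illustrative.

    /* omp_sum.c - OpenMP work-sharing example: parallel loop with a
     * reduction. Compile with an OpenMP-capable compiler. */
    #include <stdio.h>

    #define N 1000000

    static double a[N];

    int main(void)
    {
        double sum = 0.0;
        int i;

        for (i = 0; i < N; i++)
            a[i] = 1.0;

        /* Iterations are split across threads; the partial sums are
         * combined by the reduction clause. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);   /* expect 1000000.0 */
        return 0;
    }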
50. Some Debuggers
- TotalView (http://www.etnus.com/)
- Cray - TotalView
- SGI - dbx, cvd
- IBM - pdbx, pedb
- Sun - Prism (ex-TMC debugger)
51. Where Does the Performance Go? Or: Why Should I
Care About the Memory Hierarchy?
[Chart: performance plotted against time]
52. Computation and Memory Use
- Computational optimizations
- Theoretical peak = (# fpus) x (flops/cycle) x (MHz)
- PIII: (1 fpu) x (1 flop/cycle) x (650 MHz) = 650 Mflop/s
- Athlon: (2 fpus) x (1 flop/cycle) x (600 MHz) = 1200 Mflop/s
- Power3: (2 fpus) x (2 flops/cycle) x (375 MHz) = 1500 Mflop/s
- Memory optimization
- Theoretical peak = (bus width) x (bus speed)
- PIII: (32 bits) x (133 MHz) = 532 MB/s = 66.5 MW/s
- Athlon: (64 bits) x (200 MHz) = 1600 MB/s = 200 MW/s
- Power3: (128 bits) x (100 MHz) = 1600 MB/s = 200 MW/s
- Memory is about an order of magnitude slower.
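To see why that last line follows, compare the bandwidth needed to
feed the floating-point units with what the bus delivers - a rough
check, assuming one 8-byte operand fetched per flop:

    \frac{1500\ \text{Mflop/s} \times 8\ \text{bytes/flop}}
         {1600\ \text{MB/s}} \approx 7.5

i.e. at peak, the Power3's floating-point units demand roughly 7.5
times the bandwidth its memory bus can supply.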
53. Memory Hierarchy
- By taking advantage of the principle of locality:
- Present the user with as much memory as is
available in the cheapest technology.
- Provide access at the speed offered by the
fastest technology.
[Diagram: the memory hierarchy seen by the processor (control,
datapath, registers) - from fastest/smallest to slowest/largest:
on-chip cache, level 2 and 3 cache (SRAM), main memory (DRAM), remote
cluster memory / distributed memory, secondary storage (disk),
tertiary storage (disk/tape)]
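The practical consequence is that memory access order matters as much
as operation count. A small C illustration (the array size is
arbitrary): the first loop nest walks memory with stride 1 and
typically runs several times faster than the second, purely because
of cache behaviour.

    /* locality.c - identical arithmetic, very different memory
     * behaviour. */
    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    int main(void)
    {
        double sum = 0.0;
        int i, j;

        /* Good locality: stride-1 walk, cache lines fully used. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];

        /* Poor locality: stride-N walk, one element per cache line. */
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                sum += a[i][j];

        printf("sum = %f\n", sum);
        return 0;
    }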
54. How To Get Performance From Commodity
Processors?
- Today's processors can achieve high performance,
but this requires extensive machine-specific hand
tuning.
- H/W and S/W have a large design space with many
parameters
- Blocking sizes, loop nesting permutations, loop
unrolling depths, software pipelining strategies,
register allocations, and instruction schedules.
- Complicated interactions with the increasingly
sophisticated micro-architectures of new
microprocessors.
- Until recently, no tuned BLAS for the Pentium
under Linux.
- Need for quick/dynamic deployment of optimized
routines.
- ATLAS - Automatically Tuned Linear Algebra
Software.
- PhiPAC from Berkeley, FFTW from MIT
(http://www.fftw.org).
55. ATLAS
- An adaptive software architecture
- High performance
- Portability
- Elegance.
- ATLAS is faster than all other portable BLAS
implementations, and it is comparable with the
machine-specific libraries provided by vendors
(see the usage sketch below).
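For the user, the point is that all this tuning hides behind the
standard BLAS interface. Here is a minimal sketch of calling DGEMM
through the CBLAS interface that ATLAS provides; the matrix sizes and
contents are illustrative.

    /* dgemm_demo.c - C = alpha*A*B + beta*C via CBLAS; link with
     * e.g. -lcblas -latlas. */
    #include <stdio.h>
    #include <cblas.h>

    #define N 4

    int main(void)
    {
        double A[N * N], B[N * N], C[N * N];
        int i;

        for (i = 0; i < N * N; i++) {
            A[i] = 1.0;
            B[i] = 2.0;
            C[i] = 0.0;
        }

        /* Row-major, no transposes: C = 1.0*A*B + 0.0*C. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);

        printf("C[0][0] = %f\n", C[0]);   /* expect 8.0 here */
        return 0;
    }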
56. ATLAS Across Various Architectures (DGEMM, n=500)
[Chart: DGEMM performance of ATLAS versus other BLAS implementations
across various architectures]
57. Code Generation Strategy
- The on-chip multiply optimizes for
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization.
- Takes 30 minutes to an hour to run.
- A new model of high-performance programming,
where critical code is machine-generated using
parameter optimisation.
- Code is iteratively generated and timed until the
optimal case is found. We try
- Differing sized blocks
- Breaking false dependencies
- M, N and K loop unrolling.
- Designed for RISC architectures
- Super scalar
- Needs a reasonable C compiler (see the blocking
sketch below).
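To give a feel for the kind of code being generated, here is a
hand-written sketch of cache blocking for matrix multiply; the block
size NB = 64 is an illustrative value, which ATLAS would instead find
empirically for each machine.

    /* blocked_mm.c - sketch of a cache-blocked matrix multiply, the
     * core idea behind ATLAS's generated kernels. */
    #include <stdio.h>

    #define N  512
    #define NB 64                /* illustrative block size */

    static double A[N][N], B[N][N], C[N][N];

    void blocked_multiply(void)
    {
        int i0, j0, k0, i, j, k;

        /* Outer loops walk over NB x NB blocks that fit in cache... */
        for (i0 = 0; i0 < N; i0 += NB)
            for (j0 = 0; j0 < N; j0 += NB)
                for (k0 = 0; k0 < N; k0 += NB)
                    /* ...inner loops reuse each block many times. */
                    for (i = i0; i < i0 + NB; i++)
                        for (j = j0; j < j0 + NB; j++)
                            for (k = k0; k < k0 + NB; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }

    int main(void)
    {
        int i, j;

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                A[i][j] = 1.0;
                B[i][j] = 1.0;
                C[i][j] = 0.0;
            }
        blocked_multiply();
        printf("C[0][0] = %f\n", C[0][0]);   /* expect 512.0 */
        return 0;
    }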
58. Plans for ATLAS
- Software release, available today
- Level 1, 2, and 3 BLAS implementations
- See http://www.netlib.org/atlas/
- Next version
- Multi-threading and a Java generator.
- Futures
- Optimized message-passing system
- Runtime adaptation
- Sparsity analysis
- Iterative code improvement
- Specialization for user applications
- Adaptive libraries.
59. Tools for Performance Evaluation
- Timing and performance evaluation has been an art
- Resolution of the clock
- Issues about cache effects
- Different systems.
- The situation is about to change
- Today's processors have internal counters.
60. Performance Counters
- Almost all high-performance processors include
hardware performance counters.
- Some are easy to access, others are not available
to users.
- On most platforms the APIs, if they exist, are
not appropriate for a common user, not functional,
or not well documented.
- Existing performance counter APIs
- Cray T3E; SGI MIPS R10000; IBM Power series;
- DEC Alpha pfm pseudo-device interface;
- Windows 95, NT and Linux.
61. Performance Data That May Be Available
- Pipeline stalls due to memory subsystem.
- Pipeline stalls due to resource conflicts.
- I/D cache misses for different levels.
- Cache invalidations.
- TLB misses.
- TLB invalidations.
- Cycle count.
- Floating point instruction count.
- Integer instruction count.
- Instruction count.
- Load/store count.
- Branch taken/not taken count.
- Branch mispredictions.
62. PAPI Implementation
- PAPI: the Performance Application Programming
Interface.
- The purpose of PAPI is to design, standardize and
implement a portable and efficient API to access
the hardware performance monitor counters found on
most modern microprocessors (see the sketch
below).
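A hedged sketch of counting cycles and floating-point instructions
with PAPI's high-level counter API; which events are actually
available varies by platform.

    /* papi_demo.c - count cycles and FP instructions around a loop
     * using PAPI's high-level API; link with -lpapi. */
    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int events[2] = { PAPI_TOT_CYC, PAPI_FP_INS };
        long_long values[2];
        double sum = 0.0;
        int i;

        if (PAPI_start_counters(events, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_start_counters failed\n");
            return 1;
        }

        for (i = 1; i <= 1000000; i++)   /* the code being measured */
            sum += 1.0 / i;

        if (PAPI_stop_counters(values, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_stop_counters failed\n");
            return 1;
        }

        printf("sum = %f\n", sum);
        printf("cycles = %lld, fp instructions = %lld\n",
               values[0], values[1]);
        return 0;
    }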
63. Graphical Tools: Perfometer Usage
- The application is instrumented with PAPI
- call perfometer()
- It will be layered over the best existing
vendor-specific APIs for these platforms.
- Sections of code that are of interest are
designated with specific colors
- using a call to set_perfometer(color).
- When the application is started, at the call to
perfometer a task is spawned to collect the
information and send it to a Java applet
containing the graphical view.
64. Perfometer
[Screenshot: the Perfometer display for a code section marked with
call perfometer(red)]
65. PAPI 1.0 Release
- Mailing list
- send "subscribe ptools-perfapi" to
majordomo@ptools.org
- ptools-perfapi@ptools.org is the reflector.
- Platforms
- Linux/x86
- Solaris/Ultra
- AIX/Power
- Tru64/Alpha
- IRIX/MIPS.
- May require a patch to the kernel.
- C and Fortran bindings.
- To download the software, see
- http://icl.cs.utk.edu/projects/papi/
66. Conclusions
- One of the principal challenges facing the future
of HPC software development is the encapsulation
of architectural complexity.
67. Where are We?
- The distinction between PC and workstation
hardware/software has evaporated.
- Beowulf-class systems and other PC clusters are
firmly established as a mainstream
compute-performance resource strategy.
- Linux and NT are established as the dominant O/S
platforms.
- Integrating COTS network technology capable of
supporting many applications/algorithms.
- Both business/commerce and science/engineering
are exploiting Beowulfs for price/performance and
flexibility.
68. Where are We? (2)
- Thousand-processor Beowulfs.
- Gflops/s processors.
- MPI and PVM standards.
- The Extreme Linux effort providing robust and
scalable resource management.
- SMP support (on a node).
- First-generation middleware components for
distributed-resource, multi-user environments.
- Books on Linux, Beowulfs, and general clustering
available.
- Vendor acceptance into market strategy.
69. $Million per Tflops/s
- Today: $3M per peak Tflops/s.
- Before year 2002: $1M per peak Tflops/s.
- Performance efficiency is a serious challenge.
- System integration
- does vendor support of massive parallelism have
to mean massive markup?
- System administration - boring but necessary.
- Maintenance without vendors - how?
- New kinds of vendors for support!
- Heterogeneity will become a major aspect.
70. Summary of Immediate Challenges
- There are more costs than capital costs.
- A higher level of in-house expertise is required.
- Software environments are behind vendor
offerings.
- Tightly coupled systems are easier to exploit in
some cases.
- The Linux model of development scares people.
- Not yet for everyone.
- PC clusters have not achieved maturity.
71. Future Technology Enablers
- SOCs - systems-on-a-chip.
- GHz processor clock rates.
- VLIW.
- 64-bit processors
- scientific/engineering applications
- address spaces.
- Gbit DRAMs.
- Micro-disks on a board.
- Optical fiber and wave division multiplexing
communications (also free space?).
72. Future Technology Enablers (2)
- Very high bandwidth backplanes/switches.
- SMP on a chip
- multiple processors with multi-layered caches.
- Processor in Memory (PIM).
- Standardized dense packaging.
- Lower cost per node.
73. Software Stumbling Blocks
- Linux cruftiness
- Heterogeneity.
- Scheduling and protection in time and space
- Task migration.
- Checkpointing and restarting.
- Effective, scalable parallel file system.
- Parallel debugging and performance optimization.
- System software development frameworks and
conventions.
74. Accomplishments
- Many Beowulf-class systems installed.
- Experience gained in their implementation and
application.
- Many applications, some large, routinely executed
on Beowulfs.
- Basic software is sophisticated and robust.
- Supports the dominant programming/execution
paradigm.
- The single most rapidly growing area in HPC.
- Ever larger systems in development.
- Recognised as mainstream.
75. Towards the Future - what can we expect?
- 2 Gflops/s peak processors.
- $1000 per processor.
- 1 Gbps at < $250 per port.
- New backplane performance, e.g. PCI.
- Light-weight communications, < 10 µs latency.
- Optimized math libraries.
- 1 GByte main memory per node.
- 24 GByte disk storage per node.
- De facto standardised middleware.
76. The Future
- Common standards and Open Source software.
- Better tools, utilities and libraries.
- Designs with minimal risk to accepted standards.
- A higher degree of portability (standards).
- A wider range and scope of HPC applications.
- Wider acceptance of HPC technologies and
techniques in commerce and industry.
- Emerging Grid-based environments.
77. Ending
- I would like to thank
- Jack Dongarra and Thomas Sterling for the use of
some of the materials presented.
- I recommend you monitor TFCC activities
- http://www.ieeetfcc.org
- Join TFCC's mailing list.
- Send me a reference to your projects.
- Join in TFCC's efforts (sponsorship, organising
meetings, contributing to publications).
- The cluster white paper preprint is on the Web.
78. IEEE Computer Society
- Task Force on Cluster Computing (TFCC)
- http://www.ieeetfcc.org