Title: Current and Emerging Trends in Cluster Computing
1. Current and Emerging Trends in Cluster Computing
University of Portsmouth, UK
Daresbury Laboratory Seminar, 9th January 2001
http://www.dcs.port.ac.uk/mab/Talks/
2. Talk Content
- Background and Overview
- Cluster Architectures
- Cluster Networking
- SSI
- Cluster Tools
- Conclusions
3. Commodity Cluster Systems
- Bringing high-end computing to a broader problem
domain - new markets.
- Order of magnitude price/performance advantage.
- Commodity enabled - no long development lead
times.
- Low vulnerability to vendor-specific decisions -
companies are ephemeral, clusters are forever!!!
4. Commodity Cluster Systems
- Rapid response to technology tracking.
- User-driven configuration - potential for
application-specific ones.
- Industry-wide, non-proprietary software
environment.
5. Cluster Architecture
6. Beowulf-class Systems
- Cluster of PCs
- Intel x86
- DEC Alpha
- Mac PowerPC.
- Pure Mass-Market COTS.
- Unix-like O/S with source
- Linux, BSD, Solaris.
- Message passing programming model
- MPI, PVM, BSP, homebrews
- Single user environments.
- Large science and engineering applications.
7. Decline of Heavy Metal
- No market for high-end computers
- minimal growth in the last five years.
- Extinction
- KSR, TMC, Intel, Meiko, Cray!?, Maspar, BBN,
Convex.
- Must use COTS
- Fabrication costs skyrocketing
- Development lead times too short.
- US Federal Agencies fleeing
- NSF, DARPA, DOE, NIST.
- Currently no good new IDEAS.
8. HPC Architectures in the Top 500
[Chart: architecture breakdown of the Top 500 systems]
9. Clusters on the Top 500
10. A Definition!?
- A cluster is a type of parallel or distributed
system that consists of a collection of
interconnected whole computers used as a single,
unified computing resource.
- Where "whole computer" is meant to indicate a
normal, whole computer system that can be used on
its own: processor(s), memory, I/O, OS, software
sub-systems, applications.
11. But
- There is still a lot of discussion about what a
cluster is.
- Clusters and Constellations!
- Is it based on commodity or proprietary
hardware/software?
- It seems that there are commodity and commercial
camps.
- Both are potentially correct!!
12. Taxonomy
[Diagram: a taxonomy of cluster computing]
13. Cluster Technology Drivers
- Reduced recurring costs - approximately 10% of
MPPs.
- Rapid response to technology advances.
- Just-in-place configuration and reconfigurable.
- High reliability if the system is designed
properly.
- Easily maintained through low-cost replacement.
- Consistent portable programming model
- Unix, C, Fortran, message passing.
- Applicable to a wide range of problems and
algorithms.
14. Operating Systems
- Little work on OSs specifically for clusters.
- Turnkey clusters are provided with versions of a
company's mainline products.
- Typically there may be some form of SSI
integrated into a conventional OS.
- Two variants are encountered:
- System administration/job-scheduling purposes -
middleware that enables each node to deliver the
required services.
- Kernel-level - e.g., transparent remote device
usage, or the use of a distributed storage
facility that is seen by users as a single
standard file system.
15. Linux
- The most popular OS for clusters is Linux.
- It is free.
- It is open source - anyone is free to customize
the kernel to suit one's needs.
- It is easy - a large community of users and
developers has created an abundance of tools, web
sites, and documentation, so that Linux
installation and administration is straightforward
enough for a typical cluster user.
16. Examples: Solaris MC
- Sun has built a multi-computer version of its
Solaris OS, called Solaris MC.
- It incorporates some advances made by Sun,
including an object-oriented methodology and the
use of CORBA IDL in the kernel.
- It consists of a small set of kernel extensions
and a middleware library - it provides SSI to the
level of the device.
- Processes running on one node can access remote
devices as if they were local; it also provides a
global file system and process space.
17. Examples: Micro-kernels
- Another, minimalist, approach is to use
micro-kernels - Exokernel is such a system.
- With this approach, only the minimal amount of
system functionality is built into the kernel,
allowing the services that are needed to be
loaded.
- It maximizes the available physical memory by
removing undesirable functionality.
- The user can alter the characteristics of a
service; e.g., a scheduler specific to a cluster
application may be loaded that helps it run more
efficiently.
18. How Much of the OS is Needed?
- This raises the issue of OS configuration - in
particular, why provide a node OS with the ability
to provide more services to applications than they
are ever likely to use?
- E.g., a user may want to alter the personality of
the local OS - "strip down" to a minimalist kernel
to maximise the available physical memory and
remove undesired functionality.
- Mechanisms to achieve this range from
- Use of a new kernel
- Dynamically linking service modules into the
kernel (see the sketch below).
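As a concrete illustration of the second mechanism, here is a minimal
sketch of a dynamically loadable service module for a 2.4-era Linux
kernel; the module's name and messages are hypothetical.

    /* cluster_mod.c - minimal sketch of a loadable kernel service
     * module (Linux 2.4-era API); name and messages are hypothetical. */
    #include <linux/module.h>
    #include <linux/kernel.h>

    MODULE_LICENSE("GPL");

    /* Called when the module is linked into the running kernel (insmod). */
    int init_module(void)
    {
        printk(KERN_INFO "cluster service module loaded\n");
        return 0;               /* 0 = success */
    }

    /* Called when the module is removed again (rmmod). */
    void cleanup_module(void)
    {
        printk(KERN_INFO "cluster service module unloaded\n");
    }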
19. Networking - Introduction
- One of the key enabling technologies that has
established clusters as a dominant force is
networking technology.
- High-performance parallel applications need
low-latency, high-bandwidth and reliable
interconnects.
- Existing LAN/WAN technologies/protocols
(10/100 Mbps Ethernet, ATM) are not well suited to
supporting clusters.
- Hence the birth of System Area Networks (SANs).
20. Comparison
[Table comparing SAN interconnect technologies - not reproduced]
21. Why Buy a SAN?
- Well, it depends on your application.
- For scientific HPC, Myrinet seems to offer good
MBytes/s per dollar, lots of software and proven
scalability.
- Synfinity, with its best-in-class 1.6 GBytes/s,
could be a valuable alternative for
small/medium-sized clusters, though it is untried
at the moment.
- Windows-based users should give Giganet a try.
- QsNet and ServerNet II are likely the most
expensive solutions, but an Alpha-based cluster
from Compaq with one of these should be a good
number cruncher.
22. Some Emerging Technologies
23. Communications Concerns
- New Physical Networking Technologies are Fast
- Gigabit Ethernet, ServerNet, Myrinet
- Legacy Network Protocol Implementations are Slow
- System Calls
- Multiple Data Copies.
- Communications Gap
- Systems use a fraction of the available
performance.
24. Communications Solutions
- User-level (no kernel) networking.
- Several Existing Efforts
- Active Messages (UCB)
- Fast Messages (UIUC)
- U-Net (Cornell)
- BIP (Univ. Lyon, France)
- Standardization: VIA
- Industry Involvement
- Killer Clusters.
[Diagram: user process communicating directly with the NIC, bypassing
the OS]
25. VIA
- VIA is a standard that combines many of the best
features of various academic projects, and will
strongly influence the evolution of cluster
computing.
- Although VIA can be used directly for application
programming, it is considered by many systems
designers to be at too low a level for application
programming.
- With VIA, the application must be responsible for
allocating some portion of physical memory and
using it effectively.
26. VIA
- It is expected that most OS and middleware
vendors will provide an interface to VIA that is
suitable for application programming.
- Generally, this interface comes in the form of a
message-passing interface for scientific or
parallel programming.
27. What is VIA?
- Use the kernel for set-up, and get it out of the
way for send/receive!
- The Virtual Interface (VI)
- Protected application-to-application channel
- Memory directly accessible by the user process.
- Target environment
- LANs and SANs at gigabit speeds
- No reliability of the underlying media assumed
(unlike MPP fabrics)
- Errors/drops are assumed to be rare - and are
generally fatal to the VI.
28. InfiniBand - Introduction
- System bus technologies are beginning to reach
their limits in terms of speed.
- Common PCI buses can only support up to 133 MBps
across all PCI slots, and even with the 64-bit,
66 MHz buses available in high-end PC servers,
566 MBps of shared bandwidth is the most a user
can hope for.
29. InfiniBand - Introduction
- To counter this, a new standard based on switched
serial links to device groups and devices is
currently in development.
- Called InfiniBand, the standard is actually a
merged proposal from two earlier groups: Next
Generation I/O (NGIO), led by Intel, Microsoft,
and Sun, and Future I/O, supported by Compaq, IBM,
and Hewlett-Packard.
30. InfiniBand Hardware
31. InfiniBand - Performance
- A single InfiniBand link operates at 2.5 Gbps,
point-to-point in a single direction.
- Bi-directional links offer twice the throughput
and can be aggregated into larger pipes of
1 GBytes/s (four co-joined links) or 3 GBytes/s
(12 links).
- Higher aggregations of links will be possible in
the future.
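As a rough consistency check of those aggregate figures (assuming
InfiniBand's 8b/10b encoding, under which each 2.5 Gbps link carries
2 Gbps, i.e. 250 MBytes/s, of data):

    4 \times 250\ \text{MB/s} = 1\ \text{GB/s}, \qquad
    12 \times 250\ \text{MB/s} = 3\ \text{GB/s}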
32. Recap
- This whistle-stop tour has looked at some of the
existing and emerging network technologies that
are being used with current clusters.
- The hardware is more advanced than the software
that comes with it.
- Software is starting to catch up - VIA and
InfiniBand are providing the performance and
functionality that today's sophisticated
applications require.
33. Single System Image (SSI)
- The illusion, created by software or hardware,
that presents a collection of computing resources
as one whole, unified resource.
- The cluster appears like a single machine to
users, applications, and the network.
- Use of system resources is transparent.
- Transparent process migration and load balancing.
- Potentially improved reliability and higher
availability, system response time and
performance.
- Simplified system management.
- No need to be aware of the underlying system
architecture to use these machines effectively.
34. Desired SSI Services
- Single entry point
- telnet cluster.my_institute.edu rather than
telnet node1.cluster.my_institute.edu
- Single file hierarchy: /proc, NFS, xFS, AFS, etc.
- Single control point: management GUI
- Single memory space - Network RAM/DSM
- Single job management: Codine, LSF
- Single GUI: like a workstation/PC windowing
environment; it may be Web technology
35. Cluster Tools
- Introduction
- Management Tools
- Application Tools
- MPI/OpenMP
- Debuggers
- ATLAS
- PAPI
36. Cluster Tools: Introduction
- For cluster computing to be effective within the
community, it is essential that numerical
libraries and programming tools are available to
application developers and system maintainers.
- Cluster systems present very different software
environments on which to build tools, libraries
and applications.
37. Introduction
- There are basically two categories of cluster
tools:
- Management
- Application
- Management tools are used to configure, install,
manage, and maintain clusters.
- Application tools are used to help design,
develop, implement, debug and profile user
applications.
38. Management Tools
- Clusters will only be taken up effectively if
there are not only computer scientists providing
programming environments and tools, but also
developers producing useful and successful
applications, AND systems engineers to configure
and manage these machines.
- System management is often neglected when the
price/performance of clusters is compared against
that of other, more traditional machines.
- System management is vital for the successful
deployment of all types of clusters.
39. System Tools
- The lack of good administration and user-level
application management tools represents a hidden
operation cost that is often overlooked.
- While there are numerous tools and techniques
available for the administration of clusters, few
of these tools ever see the outside of their
developer's cluster - basically, they are
developed for specific in-house uses.
- This results in a great deal of duplicated effort
among cluster administrators and software
developers.
40. System Tools: Criteria
- One of the main criteria for these tools is that
they provide the look and feel of commands issued
to a single machine.
- This is accomplished by using lists, or
configuration files, to represent the group of
machines on which a command will operate (see the
sketch below).
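A minimal sketch of this pattern in C; the "machines" file name, the
use of rsh, and the error handling are illustrative assumptions, not
taken from any particular tool.

    /* clusterrun.c - run one command on every node listed in a
     * "machines" configuration file (one hostname per line). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        FILE *f;
        char node[256], cmd[1024];

        if (argc < 2) {
            fprintf(stderr, "usage: %s \"command\"\n", argv[0]);
            return 1;
        }
        if ((f = fopen("machines", "r")) == NULL) {
            perror("machines");
            return 1;
        }
        while (fgets(node, sizeof node, f) != NULL) {
            node[strcspn(node, "\n")] = '\0';   /* strip newline */
            if (node[0] == '\0')
                continue;                        /* skip blank lines */
            snprintf(cmd, sizeof cmd, "rsh %s %s", node, argv[1]);
            if (system(cmd) != 0)
                fprintf(stderr, "command failed on %s\n", node);
        }
        fclose(f);
        return 0;
    }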
41. System Tools: Security
- Generally, security inside a cluster - between
cluster nodes - is somewhat relaxed for a number
of practical reasons.
- Some of these include
- Improved performance
- Ease of programming
- All nodes are generally compromised if one
cluster node's security is compromised.
- Thus, security from outside the cluster into the
cluster is of utmost concern.
42. System Tools: Scalability
- A user may tolerate an inefficient tool that
takes minutes to perform an operation across a
cluster of 8 machines, as it is faster than
performing the operation manually 8 times.
- However, that user will most likely find it
intolerable to wait over an hour for the same
operation to take effect across 128 cluster nodes.
- A further complication is federated clusters -
extending even further to wide-area
administration.
43. System Tools: Some Areas
- Move disk images from an image server to clients.
- Copy/move/remove client files.
- Build a bootable diskette to initially boot a new
cluster node prior to installation.
- Secure shell (ssh).
- Cluster-wide ps - manipulate cluster-wide
processes.
- DHCP, used to allocate IP addresses to machines
on a given network - lease an IP to a node.
- Shutdown/reboot individual nodes.
44. Cluster Tools
- There have been many advances and developments in
the creation of parallel code and tools for
distributed memory machines, and likewise for
SMP-based parallelism.
- In most cases the parallel, MPI-based libraries
and tools will operate on cluster systems, but
they may not achieve an acceptable level of
efficiency or effectiveness on clusters that
comprise SMP nodes.
45. MPI
- The underlying technology with which distributed
memory machines are programmed is MPI.
- MPI provides the communication layer of the
library or package, which may or may not be
revealed to the user.
- The large number of implementations of MPI
ensures portability of codes across platforms and,
in general, the use of MPI-based software on
clusters (see the example below).
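As a flavour of the message-passing model, here is a minimal,
standard MPI program in C; the ring-passing logic is illustrative
only.

    /* ring.c - each rank sends its rank number to the next rank in a
     * ring. Compile with mpicc, run with mpirun. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, token;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Send our rank to the right neighbour; receive from the left. */
        MPI_Sendrecv(&rank,  1, MPI_INT, (rank + 1) % size,        0,
                     &token, 1, MPI_INT, (rank + size - 1) % size, 0,
                     MPI_COMM_WORLD, &status);

        printf("rank %d of %d received %d\n", rank, size, token);
        MPI_Finalize();
        return 0;
    }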
46. HPF and OpenMP
- Two standards for parallel programming without
explicit message passing.
- HPF targets data-parallel applications for
distributed memory MIMD systems (RIP).
- OpenMP targets (scalable) shared memory
multiprocessors.
47. OpenMP
- The emerging OpenMP standard is providing a
portable base for the development of libraries for
shared memory machines.
- Although most cluster environments do not support
this paradigm globally across the cluster, it is
still an essential tool for clusters that have SMP
nodes.
48. OpenMP
- Portable shared-memory multiprocessing API.
- Fortran 77, Fortran 90, and C/C++.
- Multi-vendor support, for both UNIX and NT.
- Standardizes fine-grained (loop) parallelism.
- Also supports coarse-grained algorithms.
- Based on compiler directives and runtime library
calls.
49. OpenMP
- Specifies
- Work-sharing constructs
- Data environment constructs
- Synchronization constructs
- Library routines and environment variables.
- Example directives (see the sketch below):
- PARALLEL
- DO
- SECTION
- SINGLE
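A minimal C sketch of the directive style - a parallel loop with a
reduction; the array contents and problem size are illustrative.

    /* omp_sum.c - OpenMP work-sharing example: parallel loop with a
     * reduction. Compile with an OpenMP-capable compiler. */
    #include <stdio.h>

    #define N 1000000

    static double a[N];

    int main(void)
    {
        double sum = 0.0;
        int i;

        for (i = 0; i < N; i++)
            a[i] = 1.0;

        /* Iterations are split across threads; the partial sums are
         * combined by the reduction clause. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);   /* expect 1000000.0 */
        return 0;
    }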
50. Some Debuggers
- TotalView (http://www.etnus.com/)
- Cray - TotalView
- SGI - dbx, cvd
- IBM - pdbx, pedb
- Sun - Prism (ex-TMC debugger)
51. Where Does the Performance Go? Or: Why Should I
Care About the Memory Hierarchy?
[Chart: performance plotted against time]
52. Computation and Memory Use
- Computational optimizations
- Theoretical peak = (# fpus) x (flops/cycle) x (MHz)
- PIII: (1 fpu) x (1 flop/cycle) x (650 MHz) = 650 Mflop/s
- Athlon: (2 fpus) x (1 flop/cycle) x (600 MHz) = 1200 Mflop/s
- Power3: (2 fpus) x (2 flops/cycle) x (375 MHz) = 1500 Mflop/s
- Memory optimization
- Theoretical peak = (bus width) x (bus speed)
- PIII: (32 bits) x (133 MHz) = 532 MB/s = 66.5 MW/s
- Athlon: (64 bits) x (200 MHz) = 1600 MB/s = 200 MW/s
- Power3: (128 bits) x (100 MHz) = 1600 MB/s = 200 MW/s
- Memory is about an order of magnitude slower.
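To see why that last line follows, compare the bandwidth needed to
feed the floating-point units with what the bus delivers - a rough
check, assuming one 8-byte operand fetched per flop:

    \frac{1500\ \text{Mflop/s} \times 8\ \text{bytes/flop}}
         {1600\ \text{MB/s}} \approx 7.5

i.e. at peak, the Power3's floating-point units demand roughly 7.5
times the bandwidth its memory bus can supply.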
53. Memory Hierarchy
- By taking advantage of the principle of locality:
- Present the user with as much memory as is
available in the cheapest technology.
- Provide access at the speed offered by the
fastest technology.
[Diagram: the memory hierarchy seen by the processor (control,
datapath, registers) - from fastest/smallest to slowest/largest:
on-chip cache, level 2 and 3 cache (SRAM), main memory (DRAM), remote
cluster memory / distributed memory, secondary storage (disk),
tertiary storage (disk/tape)]
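The practical consequence is that memory access order matters as much
as operation count. A small C illustration (the array size is
arbitrary): the first loop nest walks memory with stride 1 and
typically runs several times faster than the second, purely because
of cache behaviour.

    /* locality.c - identical arithmetic, very different memory
     * behaviour. */
    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    int main(void)
    {
        double sum = 0.0;
        int i, j;

        /* Good locality: stride-1 walk, cache lines fully used. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];

        /* Poor locality: stride-N walk, one element per cache line. */
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                sum += a[i][j];

        printf("sum = %f\n", sum);
        return 0;
    }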
54. How To Get Performance From Commodity
Processors?
- Today's processors can achieve high performance,
but this requires extensive machine-specific hand
tuning.
- H/W and S/W have a large design space with many
parameters
- Blocking sizes, loop nesting permutations, loop
unrolling depths, software pipelining strategies,
register allocations, and instruction schedules.
- Complicated interactions with the increasingly
sophisticated micro-architectures of new
microprocessors.
- Until recently, no tuned BLAS for the Pentium
under Linux.
- Need for quick/dynamic deployment of optimized
routines.
- ATLAS - Automatically Tuned Linear Algebra
Software.
- PhiPAC from Berkeley, FFTW from MIT
(http://www.fftw.org).
55. ATLAS
- An adaptive software architecture
- High performance
- Portability
- Elegance.
- ATLAS is faster than all other portable BLAS
implementations, and it is comparable with the
machine-specific libraries provided by vendors
(see the usage sketch below).
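For the user, the point is that all this tuning hides behind the
standard BLAS interface. Here is a minimal sketch of calling DGEMM
through the CBLAS interface that ATLAS provides; the matrix sizes and
contents are illustrative.

    /* dgemm_demo.c - C = alpha*A*B + beta*C via CBLAS; link with
     * e.g. -lcblas -latlas. */
    #include <stdio.h>
    #include <cblas.h>

    #define N 4

    int main(void)
    {
        double A[N * N], B[N * N], C[N * N];
        int i;

        for (i = 0; i < N * N; i++) {
            A[i] = 1.0;
            B[i] = 2.0;
            C[i] = 0.0;
        }

        /* Row-major, no transposes: C = 1.0*A*B + 0.0*C. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);

        printf("C[0][0] = %f\n", C[0]);   /* expect 8.0 here */
        return 0;
    }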
56. ATLAS Across Various Architectures (DGEMM, n=500)
[Chart: DGEMM performance of ATLAS versus other BLAS implementations
across various architectures]
57. Code Generation Strategy
- The on-chip multiply optimizes for
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization.
- Takes 30 minutes to an hour to run.
- A new model of high-performance programming,
where critical code is machine-generated using
parameter optimisation.
- Code is iteratively generated and timed until the
optimal case is found. We try
- Differing sized blocks
- Breaking false dependencies
- M, N and K loop unrolling.
- Designed for RISC architectures
- Super scalar
- Needs a reasonable C compiler (see the blocking
sketch below).
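To give a feel for the kind of code being generated, here is a
hand-written sketch of cache blocking for matrix multiply; the block
size NB = 64 is an illustrative value, which ATLAS would instead find
empirically for each machine.

    /* blocked_mm.c - sketch of a cache-blocked matrix multiply, the
     * core idea behind ATLAS's generated kernels. */
    #include <stdio.h>

    #define N  512
    #define NB 64                /* illustrative block size */

    static double A[N][N], B[N][N], C[N][N];

    void blocked_multiply(void)
    {
        int i0, j0, k0, i, j, k;

        /* Outer loops walk over NB x NB blocks that fit in cache... */
        for (i0 = 0; i0 < N; i0 += NB)
            for (j0 = 0; j0 < N; j0 += NB)
                for (k0 = 0; k0 < N; k0 += NB)
                    /* ...inner loops reuse each block many times. */
                    for (i = i0; i < i0 + NB; i++)
                        for (j = j0; j < j0 + NB; j++)
                            for (k = k0; k < k0 + NB; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }

    int main(void)
    {
        int i, j;

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                A[i][j] = 1.0;
                B[i][j] = 1.0;
                C[i][j] = 0.0;
            }
        blocked_multiply();
        printf("C[0][0] = %f\n", C[0][0]);   /* expect 512.0 */
        return 0;
    }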
58. Plans for ATLAS
- Software release, available today
- Level 1, 2, and 3 BLAS implementations
- See http://www.netlib.org/atlas/
- Next version
- Multi-threading and a Java generator.
- Futures
- Optimized message-passing system
- Runtime adaptation
- Sparsity analysis
- Iterative code improvement
- Specialization for user applications
- Adaptive libraries.
59. Tools for Performance Evaluation
- Timing and performance evaluation has been an art
- Resolution of the clock
- Issues about cache effects
- Different systems.
- The situation is about to change
- Today's processors have internal counters.
60. Performance Counters
- Almost all high-performance processors include
hardware performance counters.
- Some are easy to access, others are not available
to users.
- On most platforms the APIs, if they exist, are
not appropriate for a common user, not functional,
or not well documented.
- Existing performance counter APIs
- Cray T3E; SGI MIPS R10000; IBM Power series;
- DEC Alpha pfm pseudo-device interface;
- Windows 95, NT and Linux.
61. Performance Data That May Be Available
- Pipeline stalls due to memory subsystem.
- Pipeline stalls due to resource conflicts.
- I/D cache misses for different levels.
- Cache invalidations.
- TLB misses.
- TLB invalidations.
- Cycle count.
- Floating point instruction count.
- Integer instruction count.
- Instruction count.
- Load/store count.
- Branch taken/not taken count.
- Branch mispredictions.
62. PAPI Implementation
- PAPI: the Performance Application Programming
Interface.
- The purpose of PAPI is to design, standardize and
implement a portable and efficient API to access
the hardware performance monitor counters found on
most modern microprocessors (see the sketch
below).
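A hedged sketch of counting cycles and floating-point instructions
with PAPI's high-level counter API; which events are actually
available varies by platform.

    /* papi_demo.c - count cycles and FP instructions around a loop
     * using PAPI's high-level API; link with -lpapi. */
    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int events[2] = { PAPI_TOT_CYC, PAPI_FP_INS };
        long_long values[2];
        double sum = 0.0;
        int i;

        if (PAPI_start_counters(events, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_start_counters failed\n");
            return 1;
        }

        for (i = 1; i <= 1000000; i++)   /* the code being measured */
            sum += 1.0 / i;

        if (PAPI_stop_counters(values, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_stop_counters failed\n");
            return 1;
        }

        printf("sum = %f\n", sum);
        printf("cycles = %lld, fp instructions = %lld\n",
               values[0], values[1]);
        return 0;
    }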
63. Graphical Tools: Perfometer Usage
- The application is instrumented with PAPI
- call perfometer()
- It will be layered over the best existing
vendor-specific APIs for these platforms.
- Sections of code that are of interest are
designated with specific colors
- using a call to set_perfometer(color).
- When the application is started, at the call to
perfometer a task is spawned to collect the
information and send it to a Java applet
containing the graphical view.
64. Perfometer
[Screenshot: the Perfometer display for a code section marked with
call perfometer(red)]
65. PAPI 1.0 Release
- Mailing list
- send "subscribe ptools-perfapi" to
majordomo@ptools.org
- ptools-perfapi@ptools.org is the reflector.
- Platforms
- Linux/x86
- Solaris/Ultra
- AIX/Power
- Tru64/Alpha
- IRIX/MIPS.
- May require a patch to the kernel.
- C and Fortran bindings.
- To download the software, see
- http://icl.cs.utk.edu/projects/papi/
66. Conclusions
- One of the principal challenges facing the future
of HPC software development is the encapsulation
of architectural complexity.
67. Where are We?
- The distinction between PC and workstation
hardware/software has evaporated.
- Beowulf-class systems and other PC clusters are
firmly established as a mainstream
compute-performance resource strategy.
- Linux and NT are established as the dominant O/S
platforms.
- Integrating COTS network technology capable of
supporting many applications/algorithms.
- Both business/commerce and science/engineering
are exploiting Beowulfs for price/performance and
flexibility.
68. Where are We? (2)
- Thousand-processor Beowulfs.
- Gflops/s processors.
- MPI and PVM standards.
- The Extreme Linux effort providing robust and
scalable resource management.
- SMP support (on a node).
- First-generation middleware components for
distributed-resource, multi-user environments.
- Books on Linux, Beowulfs, and general clustering
available.
- Vendor acceptance into market strategy.
69. $Million per Tflops/s
- Today: $3M per peak Tflops/s.
- Before year 2002: $1M per peak Tflops/s.
- Performance efficiency is a serious challenge.
- System integration
- does vendor support of massive parallelism have
to mean massive markup?
- System administration - boring but necessary.
- Maintenance without vendors - how?
- New kinds of vendors for support!
- Heterogeneity will become a major aspect.
70. Summary of Immediate Challenges
- There are more costs than capital costs.
- A higher level of in-house expertise is required.
- Software environments are behind vendor
offerings.
- Tightly coupled systems are easier to exploit in
some cases.
- The Linux model of development scares people.
- Not yet for everyone.
- PC clusters have not achieved maturity.
71. Future Technology Enablers
- SOCs - systems-on-a-chip.
- GHz processor clock rates.
- VLIW.
- 64-bit processors
- scientific/engineering applications
- address spaces.
- Gbit DRAMs.
- Micro-disks on a board.
- Optical fiber and wave division multiplexing
communications (also free space?).
72. Future Technology Enablers (2)
- Very high bandwidth backplanes/switches.
- SMP on a chip
- multiple processors with multi-layered caches.
- Processor in Memory (PIM).
- Standardized dense packaging.
- Lower cost per node.
73. Software Stumbling Blocks
- Linux cruftiness
- Heterogeneity.
- Scheduling and protection in time and space
- Task migration.
- Checkpointing and restarting.
- Effective, scalable parallel file system.
- Parallel debugging and performance optimization.
- System software development frameworks and
conventions.
74. Accomplishments
- Many Beowulf-class systems installed.
- Experience gained in their implementation and
application.
- Many applications, some large, routinely executed
on Beowulfs.
- Basic software is sophisticated and robust.
- Supports the dominant programming/execution
paradigm.
- The single most rapidly growing area in HPC.
- Ever larger systems in development.
- Recognised as mainstream.
75. Towards the Future - what can we expect?
- 2 Gflops/s peak processors.
- $1000 per processor.
- 1 Gbps at < $250 per port.
- New backplane performance, e.g. PCI.
- Light-weight communications, < 10 µs latency.
- Optimized math libraries.
- 1 GByte main memory per node.
- 24 GByte disk storage per node.
- De facto standardised middleware.
76. The Future
- Common standards and Open Source software.
- Better tools, utilities and libraries.
- Designs with minimal risk to accepted standards.
- A higher degree of portability (standards).
- A wider range and scope of HPC applications.
- Wider acceptance of HPC technologies and
techniques in commerce and industry.
- Emerging Grid-based environments.
77. Ending
- I would like to thank
- Jack Dongarra and Thomas Sterling for the use of
some of the materials presented.
- I recommend you monitor TFCC activities
- http://www.ieeetfcc.org
- Join TFCC's mailing list.
- Send me a reference to your projects.
- Join in TFCC's efforts (sponsorship, organising
meetings, contributing to publications).
- The cluster white paper preprint is on the Web.
78. IEEE Computer Society
- Task Force on Cluster Computing (TFCC)
- http://www.ieeetfcc.org