Current and Emerging Trends in Cluster Computing
1
Current and Emerging Trends in Cluster Computing
  • Mark Baker

University of Portsmouth, UK
Daresbury Laboratory Seminar, 9th January 2001
http://www.dcs.port.ac.uk/mab/Talks/
2
Talk Content
  • Background and Overview
  • Cluster Architectures
  • Cluster Networking
  • SSI
  • Cluster Tools
  • Conclusions

3
Commodity Cluster Systems
  • Bringing high-end computing to a broader problem
    domain - new markets.
  • Order of magnitude price/performance advantage.
  • Commodity enabled - no long development lead
    times.
  • Low vulnerability to vendor-specific decisions -
    companies are ephemeral, clusters are forever!

4
Commodity Cluster Systems
  • Rapid response to technology tracking.
  • User-driven configuration - potential for
    application-specific configurations.
  • Industry-wide, non-proprietary software
    environment.

5
Cluster Architecture
6
Beowulf-class Systems
  • Cluster of PCs
  • Intel x86
  • DEC Alpha
  • Mac PowerPC.
  • Pure Mass-Market COTS.
  • Unix-like O/S with source
  • Linux, BSD, Solaris.
  • Message passing programming model
  • MPI, PVM, BSP, homebrews
  • Single user environments.
  • Large science and engineering applications.

7
Decline of Heavy Metal
  • No market for high-end computers
  • minimal growth in last five years.
  • Extinction
  • KSR, TMC, Intel, Meiko, Cray!?, Maspar, BBN,
    Convex
  • Must use COTS
  • Fabrication costs skyrocketing
  • Development lead times too short
  • US Federal Agencies Fleeing
  • NSF, DARPA, DOE, NIST
  • Currently no new good IDEAS.

8
HPC Architectures Top 500
9
Clusters on the Top500
10
A Definition !?
  • A cluster is a type of parallel or distributed
    system that consists of a collection of
    interconnected whole computers used as a single,
    unified computing resource.
  • Where "whole computer" is meant to indicate a
    normal, whole computer system that can be used on
    its own: processor(s), memory, I/O, OS, software
    sub-systems, applications.

11
But
  • There is a lot of discussion, still, about what a
    cluster is.
  • Clusters and Constellations!
  • Is it based on commodity or proprietary
    hardware/software?
  • It seems that there are commodity and commercial
    camps.
  • Both are potentially correct!!

12
Taxonomy
Cluster Computing
13
Cluster Technology Drivers
  • Reduced recurring costs - approx 10% that of MPPs.
  • Rapid response to technology advances.
  • Just-in-place configuration and reconfigurable.
  • High reliability if system designed properly.
  • Easily maintained through low cost replacement.
  • Consistent portable programming model
  • Unix, C, Fortran, message passing.
  • Applicable to wide range of problems and
    algorithms.

14
Operating Systems
  • Little work on OSs specifically for clusters.
  • Turnkey clusters are provided with versions of a
    company's mainline products.
  • Typically there may be some form of SSI
    integrated into a conventional OS.
  • Two variants are encountered:
  • System administration/job-scheduling purposes -
    middleware that enables each node to deliver the
    required services.
  • Kernel-level, e.g., transparent remote device
    usage or use of a distributed storage facility
    that is seen by users as a single standard file
    system.

15
Linux
  • The most popular OS for clusters is Linux.
  • It is free
  • It is open source - anyone is free to
    customize the kernel to suit one's needs
  • It is easy - a large community of users and
    developers has created an abundance of
    tools, web sites, and documentation, so that
    Linux installation and administration is
    straightforward enough for a typical cluster user.

16
Examples - Solaris MC
  • Sun has a multi-computer version of its Solaris
    OS called Solaris MC.
  • Incorporates some advances made by Sun, including
    an object-oriented methodology and the use of
    CORBA IDL in the kernel.
  • Consists of a small set of kernel extensions and
    a middleware library - provides SSI to the level
    of the device
  • Processes running on one node can access remote
    devices as if they were local, also provides a
    global file system and process space.

17
Examples - micro-kernels
  • Another approach is minimalist, using
    micro-kernels - Exokernel is one such system.
  • With this approach, only the minimal amount of
    system functionality is built into the kernel -
    allowing services that are needed to be loaded.
  • It maximizes the available physical memory by
    removing undesirable functionality
  • The user can alter the characteristics of the
    service, e.g., a scheduler specific to a cluster
    application may be loaded that helps it run more
    efficiently.

18
How Much of the OS is Needed?
  • This brings up the issue of OS configuration - in
    particular why provide a node OS with the ability
    to provide more services to applications than
    they are ever likely to use?
  • e.g. a user may want to alter the personality of
    the local OS, e.g. "strip down" to a minimalist
    kernel to maximise the available physical memory
    and remove undesired functionality.
  • Mechanisms to achieve this range from:
  • Use of a new kernel
  • Dynamically linking service modules into the
    kernel (a minimal module sketch follows below).
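To illustrate the second mechanism, here is a minimal sketch of a dynamically loadable Linux kernel module - the usual way a service is linked into a running kernel without rebuilding it. This skeleton is not from the talk; it uses the module_init/module_exit conventions (2.4-era modules could also define init_module/cleanup_module directly) and is built against the kernel headers with a standard module Makefile rather than compiled standalone.

    /* hello_service.c - skeleton of a dynamically loaded kernel service module */
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/kernel.h>

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Skeleton of a cluster service linked into the kernel");

    /* Called when the module is inserted (insmod/modprobe). */
    static int __init service_init(void)
    {
        printk(KERN_INFO "cluster service module loaded\n");
        /* A real service would register devices, file systems, etc. here. */
        return 0;
    }

    /* Called when the module is removed (rmmod). */
    static void __exit service_exit(void)
    {
        printk(KERN_INFO "cluster service module unloaded\n");
    }

    module_init(service_init);
    module_exit(service_exit);

Loading and unloading the module with insmod and rmmod is what "dynamically linking service modules into the kernel" amounts to in practice on a Linux cluster node.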

19
Networking - Introduction
  • One of the key enabling technologies that has
    established clusters as a dominant force has been
    networking technologies.
  • High performance parallel applications need
    low-latency, high-bandwidth and reliable
    interconnects.
  • Existing LAN/WAN technologies/protocols
    (10/100Mbps Ethernet, ATM) are not well suited to
    support Clusters.
  • Hence the birth of SANs.

20
Comparison
21
Why Buy a SAN?
  • Well, it depends on your application:
  • For scientific HPC, Myrinet seems to offer a good
    MBytes/$ ratio, lots of software and proven
    scalability.
  • Synfinity with its best-in-class 1.6 GBytes/s
    could be a valuable alternative for
    small/medium-sized clusters, untried at the
    moment.
  • Windows-based users should give Giganet a try
  • QsNet and ServerNet II are likely the most
    expensive solutions, but an Alpha-based cluster
    from Compaq with one of these should be a good
    number cruncher.

22
Some Emerging Technologies
  • VIA
  • Infiniband

23
Communications Concerns
  • New Physical Networking Technologies are Fast
  • Gigabit Ethernet, ServerNet, Myrinet
  • Legacy Network Protocol Implementations are Slow
  • System Calls
  • Multiple Data Copies.
  • Communications Gap
  • Systems use a fraction of the available
    performance.

24
Communications Solutions
  • User-level (no kernel) networking.
  • Several Existing Efforts
  • Active Messages (UCB)
  • Fast Messages (UIUC)
  • U-Net (Cornell)
  • BIP (Univ. Lyon, France)
  • Standardization VIA
  • Industry Involvement
  • Killer Clusters.

(Diagram: User Process, OS, NIC - user-level
networking bypasses the OS kernel.)
25
VIA
  • VIA is a standard that combines many of the best
    features of various academic projects, and will
    strongly influence the evolution of cluster
    computing.
  • Although VIA can be used directly for application
    programming, it is considered by many systems
    designers to be at too low a level for
    application programming.
  • With VIA, the application must be responsible for
    allocating some portion of physical memory and
    using it effectively.

26
VIA
  • It is expected that most OS and middleware
    vendors will provide an interface to VIA that is
    suitable for application programming.
  • Generally, the interface to VIA that is suitable
    for application programming comes in the form of
    a message-passing interface for scientific or
    parallel programming.

27
What is VIA?
  • Use the kernel for set-up and get it out of the
    way for send/receive!
  • The Virtual Interface (VI)
  • Protected application-application channel
  • Memory directly accessible by user process.
  • Target Environment
  • LANs and SANs at Gigabit speeds
  • No reliability of underlying media assumed
    (unlike MP fabrics)
  • Errors/drops assumed to be rare - generally fatal
    to the VI.

28
InfiniBand - Introduction
  • System bus technologies are beginning to reach
    their limits in terms of speed.
  • Common PCI buses can only support up to 133
    MBytes/s across all PCI slots, and even with the
    64-bit, 66 MHz buses available in high-end PC
    servers, 533 MBytes/s of shared bandwidth is the
    most a user can hope for.

29
InfiniBand - Introduction
  • To counter this, a new standard based on switched
    serial links to device groups and devices is
    currently in development.
  • Called InfiniBand, the standard is actually a
    merged proposal from two earlier groups: Next
    Generation I/O (NGIO), led by Intel, Microsoft,
    and Sun, and Future I/O, supported by Compaq,
    IBM, and Hewlett-Packard.

30
Infiniband Hardware
31
Infiniband - Performance
  • A single InfiniBand link operates at 2.5 Gbps,
    point-to-point in a single direction.
  • Bi-directional links offer twice the throughput,
    and links can be aggregated into larger pipes of
    1 GByte/s (four co-joined links) or 3 GBytes/s
    (12 links) - 2.5 Gbps signalling carries roughly
    2 Gbps of data per link after 8b/10b encoding.
  • Higher aggregations of links will be possible in
    the future.

32
Recap
  • A whistle-stop tour has looked at some of the
    existing and emerging network technologies being
    used with current clusters.
  • The hardware is more advanced than the software
    that comes with it.
  • Software is starting to catch up - VIA and
    InfiniBand are providing the performance and
    functionality that today's sophisticated
    applications require.

33
Single System Image (SSI)
  • Illusion, created by software or hardware, that
    presents a collection of computing resources as
    one whole unified resource.
  • Cluster appears like a single machine to users,
    applications, and to the network.
  • Use of system resources transparent.
  • Transparent process migration and load balancing.
  • Potentially improved reliability and higher
    availability, system response time and
    performance.
  • Simplified system management.
  • No need to be aware of the underlying system
    architecture to use these machines effectively.

34
Desired SSI Services
  • Single Entry Point
  • telnet cluster.my_institute.edu rather than
    telnet node1.cluster.my_institute.edu
  • Single File Hierarchy - /proc, NFS, xFS, AFS, etc.
  • Single Control Point - management GUI.
  • Single memory space - Network RAM/DSM.
  • Single Job Management - Codine, LSF.
  • Single GUI - like a workstation/PC windowing
    environment; it may use Web technology.

35
Cluster Tools
  • Introduction
  • Management Tools
  • Application Tools
  • MPI/OpenMP
  • Debuggers
  • ATLAS
  • PAPI

36
Cluster Tools Introduction
  • For cluster computing to be effective within the
    community it is essential that there are
    numerical libraries and programming tools
    available to application developers and system
    maintainers.
  • Cluster systems present very different software
    environments on which to build tools, libraries
    and applications.

37
Introduction
  • Basically two categories of cluster tools
  • Management
  • Application
  • Management tools are used to configure, install,
    manage, and maintain clusters.
  • Application tools are used to help design,
    develop, implement, debug and profile user
    applications.

38
Management Tools
  • Clusters will only be effectively taken up if
    there are not only computer scientists providing
    programming environments and tools, but also
    developers producing useful and successful
    applications, AND systems engineers to
    configure and manage these machines.
  • System management is often neglected when the
    price/performance of clusters is compared
    against that of more traditional machines.
  • System management is vital for the successful
    deployment of all types of clusters.

39
System Tools
  • The lack of good administration and user-level
    application management tools represents a hidden
    operation cost that is often overlooked.
  • While there are numerous tools and techniques
    available for the administration of clusters, few
    of these tools ever see the outside of their
    developer's cluster - basically they are
    developed for specific in-house uses.
  • This results in a great deal of duplicated effort
    among cluster administrators and software
    developers.

40
System Tools Criteria
  • One of the main criteria for these tools is that
    they provide the look and feel of commands issued
    to a single machine.
  • This is accomplished by using lists, or
    configuration files, to represent the group of
    machines on which a command will operate (a
    simple sketch of this idea follows below).
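To make the node-list idea concrete, the following small C sketch (hypothetical, not a tool from the talk) reads host names from an assumed nodes.conf file and runs the same command on each node over ssh, giving a cluster-wide operation the look and feel of a single command.

    /* crun.c - run one command on every node listed in nodes.conf (illustrative only) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s \"command\"\n", argv[0]);
            return 1;
        }

        FILE *fp = fopen("nodes.conf", "r");    /* one host name per line */
        if (!fp) {
            perror("nodes.conf");
            return 1;
        }

        char node[256], cmd[1024];
        while (fgets(node, sizeof(node), fp)) {
            node[strcspn(node, "\n")] = '\0';   /* strip the trailing newline */
            if (node[0] == '\0' || node[0] == '#')
                continue;                       /* skip blank lines and comments */
            /* password-less ssh between nodes is assumed, as on most clusters */
            snprintf(cmd, sizeof(cmd), "ssh %s '%s'", node, argv[1]);
            printf("== %s ==\n", node);
            if (system(cmd) != 0)
                fprintf(stderr, "command failed on %s\n", node);
        }
        fclose(fp);
        return 0;
    }

Used as, say, ./crun "uptime", this gives the single-machine feel described above; a cluster-wide ps of the kind mentioned two slides below can be built the same way.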

41
System Tools Security
  • Generally security inside a cluster, between
    cluster nodes, is somewhat relaxed for a number
    of practical reasons.
  • Some of these include:
  • Improved performance
  • Easier programming
  • All nodes are generally compromised if one
    cluster node's security is compromised.
  • Thus, security from outside the cluster into the
    cluster is of utmost concern.

42
System Tools Scalability
  • A user may tolerate an inefficient tool that
    takes minutes to perform an operation across a
    cluster of 8 machines as it is faster than
    performing the operation manually 8 times.
  • However, that user will most likely find it
    intolerable to wait over an hour for the same
    operation to take effect across 128 cluster
    nodes.
  • A further complication is federated clusters -
    extending even further to wide-area
    administration.

43
System Tools Some Areas
  • Move disk images from image server to clients.
  • Copy/Move/remove client files.
  • Build a bootable diskette to initially boot a new
    cluster node prior to installation.
  • Secure shell ssh.
  • Cluster-wide ps - manipulate cluster-wide
    processes.
  • DHCP, used to allocate IP addresses to machines
    on a given network - lease IP to node.
  • Shutdown/reboot individual nodes.

44
Cluster Tools
  • There have been many advances and developments in
    the creation of parallel code and tools for
    distributed memory machines and likewise for
    SMP-based parallelism.
  • In most cases the parallel, MPI-based libraries
    and tools will operate on cluster systems, but
    they may not achieve an acceptable level of
    efficiency or effectiveness on clusters that
    comprise SMP nodes.

45
MPI
  • The underlying technology with which
    distributed memory machines are programmed is
    MPI.
  • MPI provides the communication layer of the
    library or package, which may or may not be
    revealed to the user.
  • The large number of implementations of MPI
    ensures portability of codes across platforms
    and, in general, the use of MPI-based software on
    clusters (a minimal example follows below).
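For readers new to MPI, here is a minimal, self-contained C example (not taken from the talk) of the message-passing style used on clusters: each process learns its rank, and rank 0 sends an integer to rank 1. It would typically be compiled with an MPI wrapper such as mpicc and launched with mpirun -np 2.

    /* mpi_hello.c - minimal MPI point-to-point example */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token;
        MPI_Status status;

        MPI_Init(&argc, &argv);                 /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?   */

        if (size >= 2) {
            if (rank == 0) {
                token = 42;
                MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
                printf("rank 0 sent %d to rank 1\n", token);
            } else if (rank == 1) {
                MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
                printf("rank 1 received %d from rank 0\n", token);
            }
        }

        MPI_Finalize();                         /* shut the runtime down */
        return 0;
    }

The same source runs unchanged on top of MPICH, LAM/MPI or a vendor MPI, which is the portability point made above.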

46
HPF and OpenMP
  • Two standards for parallel programming without
    explicit message passing.
  • HPF targets data-parallel applications for
    distributed memory MIMD systems (RIP).
  • OpenMP targets (scalable) shared memory
    multiprocessors.

47
OpenMP
  • The emerging standard of OpenMP is providing a
    portable base for the development of libraries
    for shared memory machines.
  • Although most cluster environments do not support
    this paradigm globally across the cluster, it is
    still an essential tool for clusters that may
    have SMP nodes.

48
OpenMP
  • Portable, shared memory multiprocessing API.
  • Fortran 77 / Fortran 90 and C / C++.
  • Multi-vendor support, for both UNIX and NT.
  • Standardizes fine-grained (loop) parallelism.
  • Also supports coarse-grained algorithms.
  • Based on compiler directives and runtime library
    calls.

49
OpenMP
  • Specifies
  • Work-sharing constructs
  • Data environment constructs
  • Synchronization constructs
  • Library routines and environment variables.
  • Directives include PARALLEL, DO, SECTION and
    SINGLE (a short C example follows below).
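As a brief C illustration of the directive style (the names above are the Fortran spellings), here is a self-contained loop parallelized across the threads of an SMP node with a work-sharing construct and a reduction; it is compiled with the compiler's OpenMP option.

    /* omp_sum.c - fine-grained loop parallelism with OpenMP directives */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;
        int i;

        /* Work-sharing construct: loop iterations are divided among threads. */
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = 0.5 * i;

        /* The reduction clause handles the data environment and synchronization. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("up to %d threads, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }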

50
Some Debuggers
  • TotalView (http://www.etnus.com/)
  • Cray - TotalView
  • SGI - dbx, cvd
  • IBM - pdbx, pedb
  • Sun Prism (ex TMC debugger)

51
Where Does the Performance Go? or Why Should I
Care About the Memory Hierarchy?
(Chart: Performance vs. Time.)
52
Computation and Memory Use
  • Computational optimizations
  • Theoretical peak = (# FPUs) x (flops/cycle) x MHz
  • PIII: (1 FPU) x (1 flop/cycle) x (650 MHz) = 650
    Mflop/s
  • Athlon: (2 FPUs) x (1 flop/cycle) x (600 MHz) =
    1200 Mflop/s
  • Power3: (2 FPUs) x (2 flops/cycle) x (375 MHz) =
    1500 Mflop/s
  • Memory optimization
  • Theoretical peak = (bus width) x (bus speed)
  • PIII: (32 bits) x (133 MHz) = 532 MB/s = 66.5 MW/s
  • Athlon: (64 bits) x (200 MHz) = 1600 MB/s = 200
    MW/s
  • Power3: (128 bits) x (100 MHz) = 1600 MB/s = 200
    MW/s
  • Memory is about an order of magnitude slower
    (a measurement sketch follows below).
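To make the memory-side numbers concrete, the following self-contained C sketch (not from the talk) times a simple triad loop over arrays chosen to be much larger than cache and reports the sustained bandwidth for comparison with the theoretical bus peaks above. The array size and the use of gettimeofday are assumptions for illustration.

    /* bw_probe.c - crude sustained memory bandwidth probe (STREAM-style triad) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define N (2 * 1024 * 1024)   /* 2M doubles = 16 MB per array, well beyond cache */

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void)
    {
        long i;
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        if (!a || !b || !c) { fprintf(stderr, "out of memory\n"); return 1; }

        for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        double t0 = now();
        for (i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];           /* 2 loads + 1 store per element */
        double t1 = now();

        double bytes = 3.0 * N * sizeof(double);
        printf("triad: %.4f s, %.1f MB/s sustained (a[1] = %f)\n",
               t1 - t0, bytes / (t1 - t0) / 1e6, a[1]);

        free(a); free(b); free(c);
        return 0;
    }

Run on the machines listed above, the sustained figure would typically fall well below the theoretical bus peak, which is exactly the gap the memory hierarchy on the next slide is designed to hide.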

53
Memory Hierarchy
  • By taking advantage of the principle of locality
  • Present the user with as much memory as is
    available in the cheapest technology.
  • Provide access at the speed offered by the
    fastest technology.

(Diagram - the memory hierarchy, from the processor's
registers and on-chip cache through level 2 and 3
cache (SRAM), main memory (DRAM), remote cluster /
distributed memory, secondary storage (disk), and
tertiary storage (disk/tape).)
54
How To Get Performance From Commodity
Processors?
  • Today's processors can achieve high performance,
    but this requires extensive machine-specific hand
    tuning.
  • H/w and S/w have a large design space with many
    parameters
  • Blocking sizes, loop nesting permutations, loop
    unrolling depths, software pipelining strategies,
    register allocations, and instruction schedules.
  • Complicated interactions with the increasingly
    sophisticated micro-architectures of new
    microprocessors.
  • Until recently, no tuned BLAS for Pentium for
    Linux.
  • Need for quick/dynamic deployment of optimized
    routines.
  • ATLAS - Automatic Tuned Linear Algebra Software
  • PhiPAC from Berkeley, FFTW from MIT
    (http://www.fftw.org)

55
ATLAS
  • An adaptive software architecture
  • High-performance
  • Portability
  • Elegance.
  • ATLAS is faster than all other portable BLAS
    implementations and it is comparable with
    machine-specific libraries provided by the vendor.

56
ATLAS Across Various Architectures (DGEMM, n=500)

57
Code Generation Strategy
  • On-chip multiply optimizes for
  • TLB access
  • L1 cache reuse
  • FP unit usage
  • Memory fetch
  • Register reuse
  • Loop overhead minimization.
  • Takes 30 minutes to an hour to run.
  • New model of high-performance programming where
    critical code is machine generated using
    parameter optimisation.
  • Code is iteratively generated and timed until the
    optimal case is found. We try
  • Differing sized blocks (see the blocking sketch
    below)
  • Breaking false dependencies
  • M, N and K loop unrolling
  • Designed for RISC architectures
  • Super-scalar
  • Needs a reasonable C compiler.
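To show what "differing sized blocks" means in practice, here is a self-contained, hand-written C sketch of a cache-blocked (tiled) matrix multiply. It is an illustration of the technique, not ATLAS's generated code; the block size NB is exactly the kind of parameter an ATLAS-style empirical search would vary and time.

    /* blocked_mm.c - the loop tiling that ATLAS-style code generators tune */
    #include <stdio.h>
    #include <string.h>

    #define N  256    /* matrix dimension (illustrative; must be a multiple of NB) */
    #define NB 32     /* block size - the parameter the empirical search would tune */

    static double A[N][N], B[N][N], C[N][N];

    static void blocked_multiply(void)
    {
        int ii, jj, kk, i, j, k;
        memset(C, 0, sizeof(C));
        /* Iterate over NB x NB tiles so the working set of each tile stays in cache. */
        for (ii = 0; ii < N; ii += NB)
            for (jj = 0; jj < N; jj += NB)
                for (kk = 0; kk < N; kk += NB)
                    for (i = ii; i < ii + NB; i++)
                        for (j = jj; j < jj + NB; j++) {
                            double sum = C[i][j];
                            for (k = kk; k < kk + NB; k++)
                                sum += A[i][k] * B[k][j];
                            C[i][j] = sum;
                        }
    }

    int main(void)
    {
        int i, j;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }
        blocked_multiply();
        printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0 * N);
        return 0;
    }

ATLAS goes much further (register blocking, unrolling, instruction scheduling), but timing candidate block sizes like NB is the heart of the 30-minute search described above.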

58
Plans for ATLAS
  • Software Release, available today
  • Level 1, 2, and 3 BLAS implementations
  • See http://www.netlib.org/atlas/
  • Next Version
  • Multi-threading and a Java generator.
  • Futures
  • Optimize message passing system
  • Runtime adaptation
  • Sparsity analysis
  • Iterative code improvement
  • Specialization for user applications.
  • Adaptive libraries.

59
Tools for Performance Evaluation
  • Timing and performance evaluation has been an
    art
  • Resolution of the clock
  • Issues about cache effects
  • Different systems.
  • Situation about to change
  • Today's processors have internal counters.

60
Performance Counters
  • Almost all high performance processors include
    hardware performance counters.
  • Some are easy to access, others not available to
    users.
  • On most platforms the APIs, if they exist, are
    not appropriate for a common user, not fully
    functional, or not well documented.
  • Existing performance counter APIs
  • Cray T3E, SGI MIPS R10000, IBM Power series,
  • DEC Alpha pfm pseudo-device interface,
  • Windows 95, NT and Linux.

61
Performance Data That May Be Available
  • Pipeline stalls due to memory subsystem.
  • Pipeline stalls due to resource conflicts.
  • I/D cache misses for different levels.
  • Cache invalidations.
  • TLB misses.
  • TLB invalidations.
  • Cycle count.
  • Floating point instruction count.
  • Integer instruction count.
  • Instruction count.
  • Load/store count.
  • Branch taken/not taken count.
  • Branch mispredictions.

62
PAPI Implementation
  • Performance Application Programming
    Interface
  • The purpose of PAPI is to design, standardize and
    implement a portable and efficient API to access
    the hardware performance monitor counters found
    on most modern microprocessors (a usage sketch
    follows below).
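As a rough sketch of the kind of instrumentation PAPI makes possible, the following self-contained C program uses the classic high-level counter calls (PAPI_start_counters/PAPI_stop_counters) shipped with early PAPI releases to count cycles and floating-point instructions around a loop. Which preset events (e.g. PAPI_TOT_CYC, PAPI_FP_INS) are actually available varies by platform, so treat this as illustrative rather than guaranteed to run unchanged everywhere.

    /* papi_probe.c - count cycles and FP instructions around a section of code */
    #include <stdio.h>
    #include <papi.h>

    #define N 1000000

    int main(void)
    {
        int events[2] = { PAPI_TOT_CYC, PAPI_FP_INS };
        long long counts[2];          /* PAPI's long_long type is a 64-bit integer */
        double x = 0.0;
        int i;

        if (PAPI_start_counters(events, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_start_counters failed\n");
            return 1;
        }

        for (i = 0; i < N; i++)       /* the section of interest */
            x += 0.5 * i;

        if (PAPI_stop_counters(counts, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_stop_counters failed\n");
            return 1;
        }

        printf("cycles = %lld, fp instructions = %lld (x = %f)\n",
               counts[0], counts[1], x);
        return 0;
    }

The Perfometer tool on the next slide layers a graphical, colour-coded view over exactly this sort of counter data.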

63
Graphical Tools - Perfometer Usage
  • Application is instrumented with PAPI
  • call perfometer()
  • Will be layered over the best existing
    vendor-specific APIs for these platforms.
  • Sections of code that are of interest are
    designated with specific colors
  • using a call to set_perfometer(color).
  • When the application is started, at the call to
    perfometer a task is spawned to collect and send
    the information to a Java applet containing the
    graphical view.

64
Perfometer
Call Perfometer(red)
65
PAPI 1.0 Release
  • Mailing list
  • send "subscribe ptools-perfapi" to
    majordomo@ptools.org
  • ptools-perfapi@ptools.org is the reflector
  • Platforms
  • Linux/x86
  • Solaris/Ultra
  • AIX/Power
  • Tru64/Alpha
  • IRIX/MIPS.
  • May require a patch to the kernel.
  • C and Fortran bindings.
  • To download the software see
  • http://icl.cs.utk.edu/projects/papi/

66
Conclusions
  • One of the principal challenges facing the
    future of HPC software development is the
    encapsulation of architectural complexity

67
Where are We
  • Distinction between PCs and workstations
    hardware/software has evaporated.
  • Beowulf-class systems and other PC clusters
    firmly established as a mainstream
    compute-performance resource strategy.
  • Linux and NT established as dominant O/S
    platforms.
  • Integrating COTS network technology capable of
    supporting many applications/algorithms.
  • Both business/commerce and science/engineering
    are exploiting Beowulfs for price-performance and
    flexibility.

68
Where are We (2)
  • Thousand processor Beowulf.
  • Gflops/s processors.
  • MPI and PVM standards.
  • Extreme Linux effort providing robust and
    scalable resource management.
  • SMP support (on a node).
  • First generation middleware components for
    distributed resource, multi-user environments.
  • Books on Linux, Beowulfs, and general clustering
    available.
  • Vendor acceptance into market strategies.

69
Million $ per Tflops/s
  • Today, $3M per peak Tflops/s.
  • Before year 2002, $1M per peak Tflops/s.
  • Performance efficiency is a serious challenge.
  • System integration
  • does vendor support of massive parallelism have
    to mean massive markup?
  • System administration - boring but necessary.
  • Maintenance without vendors - how?
  • New kinds of vendors for support!
  • Heterogeneity will become a major aspect.

70
Summary of Immediate Challenges
  • There are more costs than capital costs.
  • A higher level of expertise is required in-house.
  • Software environments are behind vendor
    offerings.
  • Tightly coupled systems easier to exploit in some
    cases.
  • Linux model of development scares people.
  • Not yet for everyone.
  • PC-clusters have not achieved maturity.

71
Future Technology Enablers
  • SOCs system-on-a-chip
  • GHz processor clock rate.
  • VLIW.
  • 64-bit processors
  • scientific/engineering application
  • address spaces.
  • Gbit DRAMs
  • Micro-disks on a board
  • Optical fiber and wave division multiplexing
    communications (also free space?)

72
Future Technology Enablers (2)
  • Very high bandwidth backplanes/switches.
  • SMP on a chip
  • multiple processors with multi-layered caches.
  • Processor in Memory (PIM).
  • Standardized dense packaging.
  • Lower cost per node.

73
Software Stumbling Blocks
  • Linux cruftiness
  • Heterogeneity.
  • Scheduling and protection in time and space
  • Task migration.
  • Checkpointing and restarting.
  • Effective, scalable parallel file system.
  • Parallel debugging and performance optimization.
  • System software development frameworks and
    conventions.

74
Accomplishments
  • Many Beowulf-class systems installed.
  • Experience gained in the implementation and
    application.
  • Many applications, some large, routinely executed
    on Beowulfs.
  • Basic software sophisticated and robust.
  • Supports dominant programming/execution paradigm.
  • Single most rapidly growing area in HPC.
  • Ever larger systems in development.
  • Recognised as mainstream.

75
Towards the Future - what can we expect?
  • 2 Gflops/s peak processors.
  • $1000 per processor.
  • 1 Gbps at < $250 per port.
  • New backplane performance, e.g. PCI.
  • Light-weight communications, < 10 µs latency.
  • Optimized math libraries.
  • 1 GByte main memory per node.
  • 24 GByte disk storage per node.
  • De facto standardised middleware.

76
The Future
  • Common standards and Open Source software.
  • Better
  • Tools, utilities and libraries
  • Design with minimal risk to accepted standards.
  • Higher degree of portability (standards).
  • Wider range and scope of HPC applications.
  • Wider acceptance of HPC technologies and
    techniques in commerce and industry.
  • Emerging Grid-based environments.

77
Ending
  • I would like to thank
  • John Dongarra and Thomas Sterling for the use of
    some of the materials presented.
  • I recommend you monitor TFCC activities
  • http://www.ieeetfcc.org
  • Join TFCC's mailing list.
  • Send me a reference to your projects.
  • Join in TFCC's efforts (sponsorship, organising
    meetings, contributing to publications).
  • Cluster white paper preprint is on the Web.

78
IEEE Computer Society
  • Task Force on Cluster Computing
  • (TFCC)
  • http://www.ieeetfcc.org