A Tutorial - PowerPoint PPT Presentation

A Tutorial: Designing Cluster Computers and High Performance Storage Architectures. At HPC ASIA 2002, Bangalore, INDIA, December 16, 2002. By Dheeraj Bhardwaj

Slides: 284
Transcript and Presenter's Notes

1
A Tutorial
  • Designing Cluster Computers and High Performance Storage Architectures
  • At HPC ASIA 2002, Bangalore, INDIA
  • December 16, 2002
  • By

Dheeraj Bhardwaj, Department of Computer Science and Engineering, Indian Institute of Technology, Delhi, INDIA
e-mail: dheerajb_at_cse.iits.ac.in, http://www.cse.iitd.ac.in/dheerajb
N. Seetharama Krishna, Centre for Development of Advanced Computing, Pune University Campus, Pune, INDIA
e-mail: krishna_at_cdacindia.com, http://www.cdacindia.com
2
Acknowledgments
  • All the contributors of LINUX
  • All the contributors of Cluster Technology
  • All the contributors in the art and science of
    parallel computing
  • Department of Computer Science and Engineering, IIT
    Delhi
  • Centre for Development of Advanced Computing,
    (C-DAC) and collaborators

3
Disclaimer
  • The information and examples provided are based
    on the Red Hat Linux 7.2 installation on
    Intel PC platforms (our specific hardware
    specifications)
  • Much of it should be applicable to other
    versions of Linux
  • There is no warranty that the materials are error
    free
  • Authors will not be held responsible for any
    direct, indirect, special, incidental or
    consequential damages related to any use of these
    materials

4
Part I
  • Designing Cluster Computers

5
Outline
  • Introduction
  • Classification of Parallel Computers
  • Introduction to Clusters
  • Classification of Clusters
  • Cluster Components and Issues
  • Hardware
  • Interconnection Network
  • System Software
  • Design and Build a Cluster Computer
  • Principles of Cluster Design
  • Cluster Building Blocks
  • Networking Under Linux
  • PBS
  • PVFS
  • Single System Image

6
Outline
  • Tools for Installation and Management
  • Issues related to Installation, configuration,
    Monitoring and Management
  • NPACI Rocks
  • OSCAR
  • EU DataGrid WP4 Project
  • Other Tools: Scyld Beowulf, OpenMosix, Cplant,
    SCore
  • HPC Applications and Parallel Programming
  • HPC Applications
  • Issues related to parallel Programming
  • Parallel Algorithms
  • Parallel Programming Paradigms
  • Parallel Programming Models
  • Message Passing
  • Applications I/O and Parallel File System
  • Performance metrics of Parallel Systems
  • Conclusion

7
Introduction
8
What do We Want to Achieve ?
  • Develop High Performance Computing (HPC)
    Infrastructure which is
  • Scalable (Parallel to MPP to Grid)
  • User Friendly
  • Based on Open Source
  • Efficient in Problem Solving
  • Able to Achieve High Performance
  • Able to Handle Large Data Volumes
  • Cost Effective
  • Develop HPC Applications which are
  • Portable (Desktop to Supercomputers
    to Grid)
  • Future Proof
  • Grid Ready

9
Who Uses HPC ?
  • Scientific and Engineering Applications
  • Simulation of physical phenomena
  • Virtual Prototyping (Modeling)
  • Data analysis
  • Business/ Industry Applications
  • Data warehousing for financial sectors
  • E-governance
  • Medical Imaging
  • Web servers, Search Engines, Digital libraries
  • etc ..
  • All face similar problems
  • Not enough computational resources
  • Remote facilities: the network becomes the
    bottleneck
  • Heterogeneous and fast changing systems

10
HPC Applications
  • Three Types
  • High-Capacity: Grand Challenge Applications
  • Throughput: running hundreds/thousands of jobs,
    doing parameter studies,
    statistical analysis, etc.
  • Data: genome analysis, particle physics,
    astronomical observations, seismic data
    processing, etc.
  • We are seeing a fundamental change in HPC
    applications
  • They have become multidisciplinary
  • Require an incredible mix of various technologies and
    expertise

11
Why Parallel Computing ?
  • What if your application requires more computing power
    than a sequential computer can provide?
  • You might suggest improving the operating speed
    of the processor and other components
  • We do not disagree with your suggestion, but how
    far can you go?
  • We always have desire and prospects for greater
    performance
  • Parallel Computing is the right answer

12
Serial and Parallel Computing
A parallel computer is a Collection of
processing elements that communicate and
co-operate to solve large problems fast.
  • PARALLEL COMPUTING
  • Fetch/Store
  • Compute/communicate
  • Cooperative game
  • SERIAL COMPUTING
  • Fetch/Store
  • Compute

13
Classification of Parallel Computers
14
Classification of Parallel Computers
Flynn Classification: Number of Instruction and Data Streams
  • SISD: conventional
  • SIMD: data parallel, vector computing
  • MISD: systolic arrays
  • MIMD: very general, multiple approaches
15
MIMD Architecture Classification
Current focus is on MIMD model, using general
purpose processors or multicomputers.
16
MIMD Shared Memory Architecture
  • Source PE writes data to global memory and the
    destination PE retrieves it
  • Easy to build
  • Limitations: reliability and expandability. A
    memory component or any processor failure affects
    the whole system.
  • Increasing the number of processors leads to memory
    contention.
  • Example: Silicon Graphics supercomputers

17
MIMD Distributed Memory Architecture
[Figure: Processors 1-3, each attached by a memory bus to its own memory (Memory 1-3), joined by a high-speed interconnection network]
  • Inter Process Communication using High Speed
    Network.
  • Network can be configured to various topologies
    e.g. Tree, Mesh, Cube..
  • Unlike shared-memory MIMD
  • Easily/readily expandable
  • Highly reliable (any CPU failure does not affect
    the whole system)

18
MIMD Features
  • MIMD architecture is more general purpose
  • MIMD needs clever use of synchronization that
    comes from message passing to prevent the race
    condition
  • Designing efficient message passing algorithm is
    hard because the data must be distributed in a
    way that minimizes communication traffic
  • Cost of message passing is very high
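
As a minimal sketch of the synchronization point above (not part of the original slides), the Python fragment below has several workers send partial results as messages through a queue instead of updating a shared variable, so no race condition can arise. Threads stand in for processing elements, and the names `worker` and `parallel_sum` are illustrative assumptions.

```python
import queue
import threading


def worker(chunk, outbox):
    # Each worker computes on its private slice and sends one message.
    outbox.put(sum(chunk))


def parallel_sum(data, n_workers=4):
    outbox = queue.Queue()
    # Distribute the data so every worker owns a disjoint slice.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(c, outbox)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The receiver combines the messages; no shared counter is ever mutated.
    return sum(outbox.get() for _ in range(n_workers))
```

Only one message per worker crosses the queue, which reflects the slide's advice to distribute data so that communication traffic stays small.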

19
Shared Memory (Address-Space) Architecture
  • Non-Uniform memory access (NUMA) shared address
    space computer with local and global memories
  • Time to access a remote memory bank is longer
    than the time to access a local word
  • Shared address space computers have a local cache
    at each processor to increase their effective
    processor-bandwidth.
  • The cache can also be used to provide fast access
    to remotely located shared data
  • Mechanisms have been developed for handling the
    cache coherence problem

20
Shared Memory (Address-Space) Architecture
[Figure: memory modules (M) attached to an interconnection network. Caption: Non-uniform memory access (NUMA) shared-address-space computer with local and global memories]
21
Shared Memory (Address-Space) Architecture
[Figure: interconnection network. Caption: Non-uniform memory access (NUMA) shared-address-space computer with local memory only]
22
Shared Memory (Address-Space) Architecture
  • Provides hardware support for read and write
    access by all processors to a shared address
    space.
  • Processors interact by modifying data objects
    stored in a shared address space.
  • MIMD shared-address-space computers are referred to as
    multiprocessors
  • Uniform memory access (UMA) shared address space
    computer with local and global memories
  • Time taken by processor to access any memory word
    in the system is identical

23
Shared Memory (Address-Space) Architecture
[Figure: processors (P) and memory modules (M) joined by an interconnection network. Caption: Uniform Memory Access (UMA) shared-address-space computer]
24
Uniform Memory Access (UMA)
  • Parallel Vector Processors (PVPs)
  • Symmetric Multiprocessors (SMPs)

25
Parallel Vector Processor
[Figure legend: VP = Vector Processor, SM = Shared Memory]
26
Parallel Vector Processor
  • Works well only for vector codes
  • Scalar codes may not perform well
  • Need to completely rethink and re-express
    algorithms so that vector instructions are
    performed almost exclusively
  • Special purpose hardware is necessary
  • Fastest systems are no longer vector
    uniprocessors.

27
Parallel Vector Processor
  • Small number of powerful custom-designed vector
    processors used
  • Each processor is capable of at least 1
    Gflop/s performance
  • A custom-designed, high bandwidth crossbar switch
    networks these vector processors.
  • Most machines do not use caches, rather they use
    a large number of vector registers and an
    instruction buffer
  • Examples: Cray C-90, Cray T-90, Cray T-3D

28
Symmetric Multiprocessors (SMPs)
[Figure legend: P/C = Microprocessor and cache, SM = Shared Memory]
29
Symmetric Multiprocessors (SMPs): characteristics
  • Uses commodity microprocessors with on-chip and
    off-chip caches.
  • Processors are connected to a shared memory
    through a high-speed snoopy bus
  • On Some SMPs, a crossbar switch is used in
    addition to the bus.
  • Scalable up to
  • 4-8 processors (non-backplane based)
  • a few tens of processors (backplane based)

30
Symmetric Multiprocessors (SMPs): characteristics
  • All processors see same image of all system
    resources
  • Equal priority for all processors (except for
    master or boot CPU)
  • Memory coherency maintained by HW
  • Multiple I/O Buses for greater Input / Output

31
Symmetric Multiprocessors (SMPs)
[Figure: SMP block diagram. Components shown: four processors, each with an L1 cache; DIR controller; memory; I/O bridge; I/O bus]
32
Symmetric Multiprocessors (SMPs)
  • Issues
  • Bus based architecture
  • Inadequate beyond 8-16 processors
  • Crossbar based architecture
  • multistage approach considering I/Os required in
    hardware
  • Clock distribution and HF design issues for
    backplanes
  • Limitation is mainly caused by using a
    centralized shared memory and a bus or cross bar
    interconnect which are both difficult to scale
    once built.

33
Commercial Symmetric Multiprocessors (SMPs)
  • Sun Ultra Enterprise 10000 (high end, expandable
    up to 64 processors), Sun Fire
  • DEC Alpha server 8400
  • HP 9000
  • SGI Origin
  • IBM RS 6000
  • IBM P690, P630
  • Intel Xeon, Itanium, IA-64(McKinley)

34
Symmetric Multiprocessors (SMPs)
  • Heavily used in commercial applications (data
    bases, on-line transaction systems)
  • System is symmetric (every processor has
    equal access to the shared memory, the I/O
    devices, and the operating system)
  • Being symmetric, a higher degree of parallelism
    can be achieved.

35
Massively Parallel Processors (MPPs)
[Figure legend: P/C = Microprocessor and cache, LM = Local memory, NIC = Network interface circuitry, MB = Memory bus]
36
Massively Parallel Processors (MPPs)
  • Commodity microprocessors in processing nodes
  • Physically distributed memory over processing
    nodes
  • High communication bandwidth and low latency as
    an interconnect. (High-speed, proprietary
    communication network)
  • Tightly coupled network interface which is
    connected to the memory bus of a processing node

37
Massively Parallel Processors (MPPs)
  • Provide proprietary communication software to
    realize the high performance
  • Processors are connected by a high-speed memory
    bus to a local memory and to network
    interface circuitry (NIC)
  • Scale up to hundreds or even thousands of
    processors
  • Each process has its private address space, and
    processes interact by passing messages

38
Massively Parallel Processors (MPPs)
  • MPPs support asynchronous MIMD modes
  • MPPs support single system image at different
    levels
  • Microkernel operating system on compute nodes
  • Provide high-speed I/O system
  • Examples: Cray T3D, T3E, Intel Paragon, IBM
    SP2

39
Cluster ?
  • cluster n.
  • A group of the same or similar elements gathered
    or occurring closely together; a bunch: "She held
    out her hand, a small tight cluster of fingers"
    (Anne Tyler).
  • Linguistics. Two or more successive consonants in
    a word, as cl and st in the word cluster.

A Cluster is a type of parallel or distributed
processing system, which consists of a collection
of interconnected stand alone/complete computers
cooperatively working together as a single,
integrated computing resource.
40
Cluster System Architecture
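[Figure: layered cluster architecture. Top layer: Programming Environment (Java, C, Fortran, MPI, PVM); Web, Windows User Interface; Other Subsystems (Database, OLTP). Middle layers: Single System Image Infrastructure, Availability Infrastructure. Bottom layer: nodes, each running its own OS, joined by the Interconnect]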
41
Clusters ?
  • A set of
  • Nodes physically connected over commodity/
    proprietary network
  • Gluing Software
  • Other than this definition no Official Standard
    exists
  • Depends on the user requirements
  • Commercial
  • Academic
  • Good way to sell old wine in a new bottle
  • Budget
  • Etc ..
  • Designing clusters is not obvious, but it is a
    critical issue.

42
Why Clusters NOW?
  • Clusters gained momentum when three technologies
    converged
  • Very high performance microprocessors
  • Today's workstation performance = yesterday's
    supercomputers
  • High speed communication
  • Standard tools for parallel/distributed
    computing and their growing popularity
  • Time to market > performance
  • Internet services huge demands for scalable,
    available, dedicated internet servers
  • big I/O, big computing power

43
How should we Design them ?
  • Components
  • Should they be off-the-shelf and low cost?
  • Should they be specially built?
  • Is a mixture a possibility?
  • Structure
  • Should each node be in a different box
    (workstation)?
  • Should everything be in a box?
  • Should everything be in a chip?
  • Kind of nodes
  • Should it be homogeneous?
  • Can it be heterogeneous?

44
What Should it offer ?
  • Identity
  • Should each node maintain its identity (and
    owner)?
  • Should it be a pool of nodes?
  • Availability
  • How far should it go?
  • Single-system Image
  • How far should it go?

45
Place for Clusters in HPC world ?
[Figure: axis "Distance between nodes" with the steps a chip, a box, a room, a building, the world, spanning from SM parallel computing to distributed computing]
Source: Toni Cortes (toni_at_ac.upc.es)
46
Where Do Clusters Fit?
[Figure: spectrum running between distributed systems and MP systems, annotated "1 TF/s delivered" and "15 TF/s delivered". Systems shown: Internet, SETI_at_home, Condor, Legion\Globus, Beowulf, Berkeley NOW, Superclusters, ASCI Red Tflops]
  • MP systems
  • Bounded set of resources
  • Apps grow to consume all cycles
  • Application manages resources
  • System SW gets in the way
  • 5% overhead is maximum
  • Apps drive purchase of equipment
  • Real-time constraints
  • Space-shared
  • Distributed systems
  • Gather (unused) resources
  • System SW manages resources
  • System SW adds value
  • 10-20% overhead is OK
  • Resources drive applications
  • Time to completion is not critical
  • Time-shared
  • Commercial: PopularPower, United Devices,
    Centrata, ProcessTree, Applied Meta, etc.

Source: B. Maccabe, UNM; R. Pennington, NCSA
47
Top 500 Supercomputers
Rank | Computer | Procs | Peak performance | Site, Country / Year
1 | Earth Simulator (NEC) | 5120 | 40960 GF | Japan / 2002
2 | ASCI Q (HP) AlphaServer SC ES45/1.25 GHz | 4096 | 10240 GF | LANL, USA / 2002
3 | ASCI Q (HP) AlphaServer SC ES45/1.25 GHz | 4096 | 10240 GF | LANL, USA / 2002
4 | ASCI White (IBM) SP Power3 375 MHz | 8192 | 12288 GF | LLNL, USA / 2000
5 | MCR Linux Cluster Xeon 2.4 GHz, Quadrics | 2304 | 11060 GF | LLNL, USA / 2002
  • From www.top500.org

48
What makes the Clusters ?
  • The same hardware used for
  • Distributed computing
  • Cluster computing
  • Grid computing
  • Software converts the hardware into a cluster
  • Ties everything together

49
Task Distribution
  • The hardware is responsible for
  • High-performance
  • High-availability
  • Scalability (network)
  • The software is responsible for
  • Gluing the hardware
  • Single-system image
  • Scalability
  • High-availability
  • High-performance

50
Classification of Cluster Computers
51
Clusters Classification 1
  • Based on Focus (in Market)
  • High performance (HP) clusters
  • Grand challenge applications
  • High availability (HA) clusters
  • Mission critical applications
  • Web/e-mail
  • Search engines

52
HA Clusters
53
Clusters Classification 2
  • Based on Workstation/PC Ownership
  • Dedicated clusters
  • Non-dedicated clusters
  • Adaptive parallel computing
  • Can be used for CPU cycle stealing

54
Clusters Classification 3
  • Based on Node Architecture
  • Clusters of PCs (CoPs)
  • Clusters of Workstations (COWs)
  • Clusters of SMPs (CLUMPs)

55
Clusters Classification 4
  • Based on Node Component Architecture and
    Configuration
  • Homogeneous clusters
  • All nodes have similar configuration
  • Heterogeneous clusters
  • Nodes based on different processors and running
    different OS

56
Clusters Classification 5
  • Based on Node OS Type
  • Linux Clusters (Beowulf)
  • Solaris Clusters (Berkeley NOW)
  • NT Clusters (HPVM)
  • AIX Clusters (IBM SP2)
  • SCO/Compaq Clusters (Unixware)
  • Digital VMS Clusters, HP clusters, ...

57
Clusters Classification 6
  • Based on Levels of Clustering
  • Group clusters (number of nodes: 2-99)
  • A set of dedicated/non-dedicated computers,
    mainly connected by a SAN like Myrinet
  • Departmental clusters (number of nodes: 99-999)
  • Organizational clusters (number of nodes: many 100s)
  • Internet-wide clusters = Global clusters
    (number of nodes: 1000s to many millions)
  • Computational Grid

58
Clustering Evolution
[Figure: clustering evolution over time (1990 to 2005), plotted against cost and complexity: 1st Gen. MPP Super Computers; 2nd Gen. Beowulf Clusters; 3rd Gen. Commercial Grade Clusters; 4th Gen. Network Transparent Clusters]
59
Cluster Components
  • Hardware
  • System Software

60
Hardware
61
Nodes
  • The idea is to use standard off-the-shelf
    processors
  • Pentium-like: Intel, AMD K-series
  • Sun
  • HP
  • IBM
  • SGI
  • No special development for clusters

62
Interconnection Network
63
Interconnection Network
  • One of the key points in clusters
  • Technical objectives
  • High bandwidth
  • Low latency
  • Reliability
  • Scalability

64
Network Design Issues
  • Plenty of work has been done to improve networks
    for clusters
  • Main design issues
  • Physical layer
  • Routing
  • Switching
  • Error detection and correction
  • Collective operations

65
Physical Layer
  • Trade-off between
  • Raw data transfer rate and cable cost
  • Bit width
  • Serial mediums (Ethernet, Fiber Channel)
  • Moderate bandwidth
  • 64-bit wide cable (HIPPI)
  • Pin count limits the implementation of switches
  • 8-bit wide cable (Myrinet, ServerNet)
  • Good compromise

66
Routing
  • Source-path
  • The entire path is attached to the message at its
    source location
  • Each switch deletes the current head of the path
  • Table-based routing
  • The header only contains the destination node
  • Each switch has a table to help in the decision
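
The two routing schemes above can be sketched as follows. This is an illustrative Python model, not code from the slides: `source_path_route`, `table_route` and the dictionary layout of the routing tables are assumptions made for the example.

```python
def source_path_route(path, payload):
    """Source-path routing: the header carries the full list of output ports."""
    header = list(path)
    ports_taken = []
    while header:
        port = header.pop(0)  # each switch deletes the current head of the path
        ports_taken.append(port)
    return ports_taken, payload


def table_route(tables, start, dest):
    """Table-based routing: the header holds only the destination node."""
    node = start
    visited = [start]
    while node != dest:
        node = tables[node][dest]  # each switch looks up the next hop
        visited.append(node)
    return visited
```

In the first function the sender must know the topology in advance; in the second that knowledge lives in the per-switch tables.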

67
Switching
  • Packet switching
  • Packets are buffered in the switch before being resent
  • Implies an upper-bound packet size
  • Needs buffers in the switch
  • Used by traditional LAN/WAN networks
  • Wormhole switching
  • Data is immediately forwarded to the next stage
  • Low latency
  • No buffers are needed
  • Error correction is more difficult
  • Used by SANs such as Myrinet, PARAMNet
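
To see why wormhole switching gives lower latency, a simplified textbook model helps. The sketch below is an assumption-laden illustration (no switch processing delay, no contention, equal bandwidth on every link); the function names and the `flit_bytes` parameter are not from the slides.

```python
def store_and_forward_latency(packet_bytes, hops, bandwidth):
    # The whole packet is received and buffered at every hop before resending.
    return hops * packet_bytes / bandwidth


def wormhole_latency(packet_bytes, hops, bandwidth, flit_bytes):
    # Only the leading flit pays the per-hop cost; the rest follows in a pipeline.
    return ((hops - 1) * flit_bytes + packet_bytes) / bandwidth
```

Under this model the packet-switched latency grows with hops times packet size, while the wormhole latency grows only with hops times flit size, so the gap widens as paths get longer.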

68
Flow Control
  • Credit-based design
  • The receiver grants credit to the sender
  • The sender can only send if it has enough credit
  • On-Off
  • The receiver informs whether it can or cannot
    accept new packets
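
A minimal sketch of the credit-based design, assuming one credit corresponds to one free buffer slot at the receiver. The class `CreditLink` and its methods are hypothetical names for illustration only.

```python
from collections import deque


class CreditLink:
    def __init__(self, credits):
        self.credits = credits  # buffer slots granted by the receiver
        self.rx_buffer = deque()

    def send(self, packet):
        if self.credits == 0:
            return False  # the sender must wait for more credit
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def consume(self):
        packet = self.rx_buffer.popleft()
        self.credits += 1  # freeing a slot returns one credit to the sender
        return packet
```

Because the sender never transmits without credit, the receiver's buffer cannot overflow and no packet has to be dropped.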

69
Error Detection
  • It has to be done at hardware level
  • Performance reasons
  • e.g. CRC checking is done by the network
    interface
  • Networks are very reliable
  • Only erroneous messages should see overhead
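
The slides place CRC checking in the network interface hardware. Purely to illustrate the principle, the sketch below does the same check in software with Python's standard `zlib.crc32`; the 4-byte trailer layout and the names `frame` and `check` are assumptions for this example.

```python
import zlib


def frame(payload: bytes) -> bytes:
    # Append a 32-bit CRC of the payload as a trailer.
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")


def check(framed: bytes):
    # Recompute the CRC on arrival and compare it with the trailer.
    payload, trailer = framed[:-4], framed[-4:]
    ok = zlib.crc32(payload) == int.from_bytes(trailer, "big")
    return ok, payload
```

A correct message passes with a single comparison, so only erroneous messages trigger the expensive path (retransmission).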

70
Collective Operations
  • These operations are mainly
  • Barrier
  • Multicast
  • Few interconnects offer this characteristic
  • Synfinity is a good example
  • Normally offered by software
  • Easy to achieve in bus-based networks like Ethernet
  • Difficult to achieve in point-to-point networks like
    Myrinet
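
As an illustration of a barrier offered by software, the sketch below uses Python's standard `threading.Barrier`, with threads standing in for cluster nodes. The function `run_with_barrier` and the log format are assumptions for the example.

```python
import threading


def run_with_barrier(n_workers):
    barrier = threading.Barrier(n_workers)
    lock = threading.Lock()
    log = []

    def worker(rank):
        with lock:
            log.append(("before", rank))
        barrier.wait()  # nobody proceeds until every worker has arrived
        with lock:
            log.append(("after", rank))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Every "before" entry is logged ahead of any "after" entry, whatever order the threads happen to run in.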

71
Examples of Network
  • The most common networks used are
  • Ethernet
  • SCI
  • Myrinet
  • PARAMNet
  • HIPPI
  • ATM
  • Fiber Channel
  • AmpNet
  • Etc.

72
Ethernet
  • Most widely used for LAN
  • Affordable
  • Serial transmission
  • Packet switching and table-based routing
  • Types of Ethernet
  • Ethernet and Fast Ethernet
  • Based on collision domain (Buses)
  • Switched hubs can make different collision
    domains
  • Gigabit Ethernet
  • Based on high-speed point-to-point switches
  • Each node is in its own collision domain

73
ATM
  • Standard designed for telecommunication industry
  • Relatively expensive
  • Serial
  • Packet switching and table-based routing
  • Designed around the concept of fixed-size packets
  • Special characteristics
  • Well designed for real-time systems

74
Scalable Coherent Interface (SCI)
  • First standard specifically designed for cluster computing (CC)
  • Low layer
  • Point-to-point architecture but maintains
    bus-functionality
  • Packet switching and table-based routing
  • Split transactions
  • Dolphin Interconnect Solutions, Sun SPARC Sbus
  • High Layer
  • Defines a distributed cache-coherent scheme
  • Allows transparent shared memory programming
  • Sequent NUMA-Q, Data general AViiON NUMA

75
PARAMNet and Myrinet
  • Low-latency and High-bandwidth network
  • Characteristics
  • Byte-wise links
  • Wormhole switching and source-path routing
  • Low-latency cut-through routing switches
  • Automatic mapping, which favors fault tolerance
  • Zero-copying is not possible
  • Programmable on-board processor
  • Allows experimentation with new protocols

76
Comparison
77
Communication Protocols
  • Traditional protocols
  • TCP and UDP
  • Specially designed
  • Active messages
  • VMMC
  • BIP
  • VIA
  • Etc.

78
Data Transfer
  • User-level lightweight communication
  • Avoid OS calls
  • Avoid data copying
  • Examples
  • Fast messages, BIP, ...
  • Kernel-level lightweight communication
  • Simplified protocols
  • Avoid data copying
  • Examples
  • GAMMA, PM, ...

79
TCP and UDP
  • First messaging libraries used
  • TCP is reliable
  • UDP is not reliable
  • Advantages
  • Standard and well known
  • Disadvantages
  • Too much overhead (especially for fast networks)
  • Plenty of OS interaction
  • Many copies

80
Active Messages
  • Low-latency communication library
  • Main issues
  • Zero-copying protocol
  • Messages copied directly
  • to/from the network
  • to/from the user-address space
  • Receiver memory has to be pinned
  • There is no need of a receive operation

81
VMMC
  • Virtual-Memory Mapped Communication
  • View messages as reads and writes to memory
  • Similar to distributed shared memory
  • Makes a correspondence between
  • A virtual page at the receiving side
  • A virtual page at the sending side

82
BIP
  • Basic Interface for Parallelism
  • Low-level message-layer for Myrinet
  • Uses various protocols for various message sizes
  • Tries to achieve zero copies (one at most)
  • Used via MPI by programmers
  • 7.6 µs latency
  • 107 Mbytes/s bandwidth
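
The two BIP figures above can be combined in the usual linear cost model, time = latency + size / bandwidth. The sketch below is an illustration under that assumed model, also assuming 1 Mbyte = 10**6 bytes; the function names are not from the slides.

```python
LATENCY_S = 7.6e-6  # 7.6 microseconds, from the slide
BANDWIDTH_BYTES_PER_S = 107e6  # 107 Mbytes/s, assuming 1 Mbyte = 10**6 bytes


def transfer_time(n_bytes):
    # Linear model: fixed start-up cost plus streaming time.
    return LATENCY_S + n_bytes / BANDWIDTH_BYTES_PER_S


def effective_bandwidth(n_bytes):
    return n_bytes / transfer_time(n_bytes)


def half_performance_size():
    # Message size at which effective bandwidth reaches half of the peak.
    return LATENCY_S * BANDWIDTH_BYTES_PER_S
```

Under these assumptions a message of roughly 813 bytes achieves half of the peak bandwidth; shorter messages are dominated by latency.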

83
VIA
  • Virtual Interface Architecture
  • First standard promoted by the industry
  • Combines the best features of academic projects
  • Interface
  • Designed to be used by programmers directly
  • Many programmers believe it to be too low level
  • Higher-level APIs are expected
  • NICs with VIA implemented in hardware
  • This is the proposed path

84
Potential and Limitations
  • High bandwidth
  • Can be achieved at low cost
  • Low latency
  • Can be achieved, but at high cost
  • The lower the latency, the closer we get to a
    traditional supercomputer
  • Reliability
  • Can be achieved at low cost
  • Scalability
  • Easy to achieve for the size of clusters

85
System Software
  • Operating system vs. middleware
  • Processor management
  • Memory management
  • I/O management
  • Single-system image
  • Monitoring clusters
  • High Availability
  • Potential and limitations

86
Operating system vs. Middleware
  • Operating system
  • Hardware-control layer
  • Middleware
  • Gluing layer
  • The barrier is not always clear
  • Similar
  • User level
  • Kernel level

87
System Software
  • We will not distinguish between
  • Operating system
  • Middleware
  • The middleware related to the operating system
  • Objectives
  • Performance/Scalability
  • Robustness
  • Single-system image
  • Extendibility
  • Scalability
  • Heterogeneity

88
Processor Management
  • Schedule jobs onto the nodes
  • Scheduling policies should take into account
  • Needed vs. available resources
  • Processors
  • Memory
  • I/O requirements
  • Execution-time limits
  • Priorities
  • Different kinds of jobs
  • Sequential and parallel jobs

89
Load Balancing
  • Problem
  • A perfect static balance is not possible
  • Execution time of jobs is unknown
  • Unbalanced systems may not be efficient
  • Solution
  • Process migration
  • Prior to execution
  • Granularity must be small
  • During execution
  • Cost must be evaluated
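
A minimal sketch of placing jobs prior to execution: each job goes to the node with the smallest load so far. It assumes a cost estimate is available for every job, which the slide notes is generally not the case, so treat the costs as rough guesses. `assign_jobs` is a hypothetical name.

```python
import heapq


def assign_jobs(job_costs, n_nodes):
    # Min-heap of (current load, node id); the least-loaded node is on top.
    heap = [(0.0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    placement = {}
    for job, cost in enumerate(job_costs):
        load, node = heapq.heappop(heap)
        placement[job] = node
        heapq.heappush(heap, (load + cost, node))
    loads = [0.0] * n_nodes
    for load, node in heap:
        loads[node] = load
    return placement, loads
```

The smaller the jobs, the closer this greedy placement gets to an even balance, which is why the slide asks for small granularity.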

90
Fault Tolerance
  • Large clusters must be fault tolerant
  • The probability of a fault is quite high
  • Solution
  • Re-execution of the applications that ran on the failed node
  • Not always possible or acceptable
  • Checkpointing and migration
  • It may have a high overhead
  • Difficult with some kinds of applications
  • Applications that modify the environment
  • Transactional behavior may be a solution
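
Checkpointing can be sketched at application level as below. This is an illustrative example, not the mechanism of any particular cluster system: the state is a small dictionary saved with `pickle`, and `fail_at` is a hypothetical parameter that simulates a node failure.

```python
import os
import pickle


def run_with_checkpoints(n_steps, path, fail_at=None):
    # Resume from the last checkpoint if one exists.
    if os.path.exists(path):
        with open(path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        # Write to a temporary file, then rename, so a crash never
        # leaves a half-written checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, path)
    return state["total"]
```

Saving after every step keeps lost work minimal but shows the overhead the slide warns about; a real application would checkpoint less often. If the checkpoint file lives on shared storage, the restart can happen on a different node.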

91
Managing Heterogeneous Systems
  • Compatible nodes but different characteristics
  • It becomes a load balancing problem
  • Incompatible nodes
  • Binaries for each kind of node are needed
  • Shared data has to be in a compatible format
  • Migration becomes nearly impossible

92
Scheduling Systems
  • Kernel level
  • Very few take care of cluster scheduling
  • High-level applications do the scheduling
  • Distribute the work
  • Migrate processes
  • Balance the load
  • Interact with the users
  • Examples
  • CODINE, CONDOR, NQS, etc

93
Memory Management
  • Objective
  • Use all the memory available in the cluster
  • Basic approaches
  • Software distributed-shared memory
  • General purpose
  • Specific usage of idle remote memory
  • Specific purpose
  • Remote memory paging
  • File-system caches or RAMdisks (described later)

94
Software Distributed Shared Memory
  • Software layer
  • Allows applications running on different nodes to
    share memory regions
  • Relatively transparent to the programmer
  • Address-space structure
  • Single address space
  • Completely transparent to the programmer
  • Shared areas
  • Applications have to mark a given region as
    shared
  • Not completely transparent
  • Approach mostly used due to its simplicity

95
Main Data Problems to be Solved
  • Data consistency vs. Performance
  • A strict semantic is very inefficient
  • Current systems offer relaxed semantics
  • Data location (finding the data)
  • The most common solution is the owner node
  • This node may be fixed or vary dynamically
  • Granularity
  • Usually a fixed block size is implemented
  • Hardware MMU restrictions
  • Leads to false sharing
  • Variable granularity being studied

96
Other Problems to be Solved
  • Synchronization
  • Test-and-set-like mechanisms cannot be used
  • SDSM systems have to offer new mechanisms
  • e.g. semaphores (message passing implementation)
  • Fault tolerance
  • Very important and very seldom implemented
  • Multiple copies
  • Heterogeneity
  • Different page sizes
  • Different data-type implementations
  • Use tags

97
Remote-Memory Paging
  • Keep swapped-out pages in idle memory
  • Assumptions
  • Many workstations are idle
  • Disks are much slower than Remote memory
  • Idea
  • Send swapped-out pages to idle workstations
  • When no remote memory space is available, use disks
  • Replicate copies to increase fault tolerance
  • Examples
  • The global memory service (GMS)
  • Remote memory pager

98
I/O Management
  • Advances closely track those in parallel I/O
  • There are two major differences
  • Network latency
  • Heterogeneity
  • Interesting issues
  • Network configurations
  • Data distribution
  • Name resolution
  • Memory to increase I/O performance

99
Network Configurations
  • Device location
  • Attached to nodes
  • Very easy to have (use the disks in the nodes)
  • Network attached devices
  • I/O bandwidth is not limited by memory bandwidth
  • Number of networks
  • Only one network for everything
  • One special network for I/O traffic (SAN)
  • Becoming very popular

100
Data Distribution
  • Distribution per files
  • Each node has its own independent file system
  • Like in distributed file systems (NFS, Andrew,
    CODA, ...)
  • Each node keeps a set of files locally
  • It allows remote access to its files
  • Performance
  • Maximum performance = device performance
  • Parallel access only to different files
  • Remote file access depends on the network
  • Caches help but increase complexity (coherence)
  • Tolerance
  • File replication in different nodes

101
Data Distribution
  • Distribution per blocks
  • Also known as Software/Parallel RAIDs
  • xFS, Zebra, RAMA, ...
  • Blocks are interleaved among all disks
  • Performance
  • Parallel access to blocks in the same file
  • Parallel access to different files
  • Requires a fast network
  • Usually solved with a SAN
  • Especially good for large requests (multimedia)
  • Fault tolerance
  • RAID levels (3, 4 and 5)
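
Block interleaving and parity-based fault tolerance can be sketched as follows. Round-robin placement is an assumption for the example, and the functions `locate_block`, `parity` and `rebuild` are illustrative; the XOR parity itself is the scheme used by RAID levels 3 to 5.

```python
def locate_block(block_no, n_disks):
    # Round-robin interleaving: returns (disk index, offset on that disk).
    return block_no % n_disks, block_no // n_disks


def parity(blocks):
    # Byte-wise XOR of equally sized blocks.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def rebuild(surviving_blocks, parity_block):
    # XOR of the survivors and the parity recovers the single lost block.
    return parity(surviving_blocks + [parity_block])
```

Consecutive blocks of one file land on different disks, which is what allows parallel access within a single file.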

102
Name Resolution
  • Same as in distributed systems
  • Mounting remote file systems
  • Useful when the distribution is per files
  • Distributed name resolution
  • Useful when the distribution is per files
  • Returns the node where the file resides
  • Useful when the distribution is per blocks
  • Returns the node where the file's metadata is
    located

103
Caching
  • Caching can be done at multiple levels
  • Disk controller
  • Disk servers
  • Client nodes
  • I/O libraries
  • etc.
  • Good to have several levels of cache
  • High levels decrease hit ratio of low levels
  • Higher level caches absorb most of the locality

104
Cooperative Caching
  • Problem of traditional caches
  • Each node caches the data it needs
  • Plenty of replication
  • Memory space not well used
  • Increase the coordination of the caches
  • Clients know what other clients are caching
  • Clients can access cached data in remote nodes
  • Replication in the cache is reduced
  • Better use of the memory
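
The cooperative lookup order (local cache, then a peer's cache, then disk) can be sketched as below. This is a simplified illustration with no eviction or coherence handling; the class `CoopCache` and its central `directory` are assumptions, not a description of any specific system.

```python
class CoopCache:
    def __init__(self, disk):
        self.disk = disk  # block id -> data; stands in for the file server
        self.caches = {}  # node -> {block id: data}
        self.directory = {}  # block id -> node that caches it

    def read(self, node, block):
        local = self.caches.setdefault(node, {})
        if block in local:
            return local[block], "local"
        holder = self.directory.get(block)
        if holder is not None:
            # Fetch from the peer's memory instead of replicating the block.
            return self.caches[holder][block], "remote"
        local[block] = self.disk[block]
        self.directory[block] = node
        return local[block], "disk"
```

Each block is held in at most one client cache here, so the combined memory of the clients behaves like one larger cache.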

105
RAMdisks
  • Assumptions
  • Disks are slow; memory + network is fast
  • Disks are persistent and memory is not
  • Build a disk by unifying idle remote RAM
  • Only used for non-persistent data
  • Temporary data
  • Useful in many applications
  • Compilations
  • Web proxies
  • ...

106
Single-System Image
  • SSI offers the idea that the cluster is a single
    machine
  • It can be done at several levels
  • Hardware
  • Hardware DSM
  • System software
  • It can offer a unified view to applications
  • Application
  • It can offer a unified view to the user
  • All SSI have a boundary

107
Key Services of SSI
  • Main services offered by SSI
  • Single point of entry
  • Single file hierarchy
  • Single I/O Space
  • Single point of management
  • Single virtual networking
  • Single job/resource management system
  • Single process space
  • Single user interface
  • Not all of them are always available

108
Monitoring Clusters
  • Clusters need tools to be monitored
  • Administrators have many things to check
  • The cluster must be visible from a single point
  • Subjects of monitoring
  • Physical environment
  • Temperature, power, ..
  • Logical services
  • RPCs, NFS, ...
  • Performance meters
  • Paging, CPU load, ...

109
Monitoring Heterogeneous Clusters
  • Monitoring is especially necessary in
    heterogeneous clusters
  • Several node types
  • Several operating systems
  • The tool should hide the differences
  • The real characteristics are only needed to solve
    some problems
  • Closely related to Single-System Image

110
Auto-Administration
  • Monitors know how to perform self-diagnosis
  • Next step is to run corrective procedures
  • Some systems start to do so (NetSaint)
  • Difficult because tools do not have common sense
  • This step is necessary
  • Many nodes
  • Many devices
  • Many possible problems
  • High probability of error

111
High Availability
  • One of the key points for clusters
  • Especially needed for commercial applications
  • 7 days a week and 24 hours a day
  • Not necessarily very scalable (32 nodes)
  • Based on many issues already described
  • Single-system image
  • Hide any possible change in the configuration
  • Monitoring tools
  • Detect the errors to be able to correct them
  • Process migration
  • Restart/continue applications in running nodes

112
  • Design and Build a Cluster Computer

113
Cluster Design
  • Clusters are good as personal supercomputers
  • Clusters are not often good as general purpose
    multi-user production machines
  • Building such a cluster requires planning and
    understanding design tradeoffs

114
Scalable Cluster Design Principles
  • Principle of Independence
  • Principle of Balanced Design
  • Principle of design for Scalability
  • Principle of Latency hiding

115
Principle of independence
  • Components (hardware and software) of the system
    should be independent of one another
  • Incremental scaling - Scaling up a system along
    one dimension by improving one component,
    independent of others
  • For example, after upgrading the processor to the
    next generation, the system should operate at higher
    performance without upgrading other components.
  • Should enable heterogeneity and scalability

116
Principle of independence
  • Component independence can result in cost
    cutting
  • The component becomes a commodity, with following
    features
  • Open architecture with standard interfaces to the
    rest of the system
  • Off-the-shelf product, public domain
  • Multiple vendors in the open market with large
    volume
  • Relatively mature
  • For all these reasons the commodity component
    has low cost, high availability and reliability

117
Principle of independence
  • Independence principle and application examples
  • The algorithm should be independent of the
    architecture
  • The application should be independent of platform
  • The programming language should be independent of
    the machine
  • The language should be modular and have
    orthogonal features
  • The node should be independent of the network,
    and the network interface should be independent
    of the network topology
  • Caveat
  • In any parallel system, there is usually some key
    component/technique that is novel
  • We cannot build an efficient system by simply
    scaling up one or a few components
  • Design should be balanced

118
Principle of Balanced Design
  • Minimize any performance bottleneck
  • Should avoid an unbalanced system design, where a
    slow component degrades the performance of the
    entire system
  • Should avoid single point of failure
  • Example
  • The PetaFLOP project: the memory requirement for a
    wide range of scientific/engineering applications
  • Memory (GB) ≈ Speed^(3/4) (Gflop/s)
  • 30 TB of memory is appropriate for a Pflop/s
    machine.
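The 30 TB figure can be checked directly, reading the rule as memory in GB being roughly the speed in Gflop/s raised to the power 3/4. The function name below is an assumption for illustration.

```python
def memory_gb(speed_gflops: float) -> float:
    """Rule of thumb from the slide: memory (GB) ~ speed (Gflop/s) ** (3/4)."""
    return speed_gflops ** 0.75


pflops_in_gflops = 1e6                 # 1 Pflop/s = 10^6 Gflop/s
needed_gb = memory_gb(pflops_in_gflops)
print(f"{needed_gb / 1000:.1f} TB")    # on the order of 30 TB
```

The same rule gives about 1 GB for a 1 Gflop/s machine, so under this rule memory grows more slowly than speed.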

119
Principle of Design for Scalability
  • Provision must be made so that System can either
    scale up to provide higher performance
  • Or scale down to allow affordability or greater
    cost-effectiveness
  • Two approaches
  • Overdesign
  • Example: Modern processors support a 64-bit
    address space. This huge address space may not be fully
    utilized by a Unix supporting a 32-bit address space.
    This overdesign will make the transition of the
    OS from 32-bit to 64-bit much easier
  • Backward compatibility
  • Example: A parallel program designed to run on n
    nodes should be able to run on a single node,
    perhaps with reduced input data.

120
Principle of Latency Hiding
  • Future scalable systems are most likely to use a
    distributed shared-memory architecture.
  • Access to remote memory may experience long
    latencies
  • Example: GRID
  • Scalable multiprocessors and clusters must rely on
    the use of
  • Latency hiding
  • Latency avoiding
  • Latency reduction
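One common form of latency hiding is prefetching: request the next block while computing on the current one. The sketch below assumes a slow `remote_fetch` (simulated with a short sleep) and uses a background thread; all names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def remote_fetch(block_id):
    """Stand-in for a slow remote memory or network access (hypothetical)."""
    time.sleep(0.01)
    return list(range(block_id * 4, block_id * 4 + 4))


def process_with_prefetch(block_ids):
    """Sum all blocks, requesting block i+1 while block i is being processed."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(remote_fetch, block_ids[0])
        for next_id in block_ids[1:]:
            data = pending.result()                       # wait for current block
            pending = pool.submit(remote_fetch, next_id)  # start fetching the next
            total += sum(data)                            # compute overlaps the fetch
        total += sum(pending.result())                    # last block
    return total


result = process_with_prefetch([0, 1, 2, 3])
```

Latency avoiding and latency reduction attack the same problem differently (keeping data local, or using a faster interconnect) and are not shown here.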

121
Cluster Building
  • Conventional wisdom: Building a cluster is easy
  • Recipe
  • Buy hardware from Computer Shop
  • Install Linux, Connect them via network
  • Configure NFS, NIS
  • Install your application, run and be happy
  • Building it right is a little more difficult
  • Multi user cluster, security, performance tools
  • Basic question - what works reliably?
  • Building it to be compatible with Grid
  • Compilers, libraries
  • Accounts, file storage, reproducibility
  • Hardware configuration may be an issue

122
  • How do people think of parallel programming and
    using clusters ..

123
Panther Cluster
  • Picked 8 PCs and named them after the Panther family.
  • Connected them by network and setup this cluster
    in a small lab.
  • Using Panther Cluster
  • Select a PC and log in
  • Edit and Compile the code
  • Execute the program
  • Analyze the results

[Figure: the eight PCs of the cluster: Cheeta, Tiger, Kitten, Cat, Jaguar, Leopard, Panther, Lion]
124
Panther Cluster - Programming
  • Explicit parallel programming isn't easy. You
    really have to do it yourself
  • Network bandwidth and Latency matter
  • There are good reasons for security patches
  • Oops... Lion does not have floating point

125
Panther Cluster Attacks Users
  • Grad Students wanted to use the cool cluster.
    They each need only half (a half other than Lion)
  • Grad Students discover that using the same PC at
    the same time is incredibly bad
  • A solution would be to use parts of the cluster
    exclusively for one job at a time.
  • And so...

126
We Discover Scheduling
  • We tried
  • A sign up sheet
  • Yelling across the yard
  • A mailing list
  • finger schedule
  • A scheduler
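A minimal sketch of what "a scheduler" might do here, assuming plain first-come-first-served allocation of whole nodes to one job at a time. The function name and the job sizes are made up for illustration; Lion is left out because it has no floating point.

```python
from collections import deque


def fifo_schedule(free_nodes, jobs):
    """Give each queued job an exclusive set of nodes, in arrival order.

    jobs is a list of (name, nodes_needed). Returns the placements made and
    the jobs still waiting because not enough nodes are free.
    """
    free = list(free_nodes)
    queue = deque(jobs)
    placements = {}
    # Strict FIFO: stop at the first job that does not fit.
    while queue and queue[0][1] <= len(free):
        name, needed = queue.popleft()
        placements[name] = [free.pop(0) for _ in range(needed)]
    return placements, list(queue)


nodes = ["Cheeta", "Tiger", "Kitten", "Cat", "Jaguar", "Leopard", "Panther"]
placements, waiting = fifo_schedule(nodes, [("job1", 2), ("job2", 2), ("job3", 4)])
```

With seven usable nodes, the first two jobs start at once and the third waits until four nodes are free. Real schedulers add priorities, backfilling and time limits on top of this.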

[Figure: a Queue holding Job 1, Job 2 and Job 3, dispatched onto the nodes Cheeta, Tiger, Kitten, Cat, Jaguar, Leopard, Panther, Lion]
127
Panther Expands
  • Panther expands, adding more users and more
    systems
  • Use Panther Node for
  • Login
  • File services
  • Scheduling services
  • ...
  • All other nodes are compute nodes

[Figure: Panther as the service node, with compute nodes Cheeta, Tiger, Kitten, Cat, Jaguar, Leopard, Lion and PC1 to PC9]
128
Evolution of Cluster Services
  • Basic goal
  • Improve computing performance
  • Improve system reliability and manageability
[Figure: The Cluster grows. Login, File service, Scheduling, Management and I/O services start on a single node and are progressively split onto separate nodes]

129
Compute @ Panther
  • Usage Model
  • Login to login node
  • Compile and test code
  • Schedule a test run
  • Schedule a serious run
  • Carry out I/O through I/O node
  • Management Model
  • The compute nodes are identical
  • Users use Login, I/O and compute nodes
  • All I/O requests are managed by Metadata server

[Figure: service nodes Login, File, Sched, Mgmt and I/O, with compute nodes Cheeta, Tiger, Kitten, Cat, Jaguar, Leopard, Lion and PC1 to PC9]
130
Cluster Building Block
131
Building Blocks - Hardware
  • Processor
  • Complex Instruction Set Computer (CISC)
  • x86, Pentium Pro, Pentium II, III, IV
  • Reduced Instruction Set Computer (RISC)
  • SPARC, RS6000, PA-RISC, PPC, Power PC
  • Explicitly Parallel Instruction Computer (EPIC)
  • IA-64 (McKinley), Itanium

132
Building Blocks - Hardware
  • Memory
  • Extended Data Out (EDO)
  • pipelining by loading next call to or from memory
  • 50 - 60 ns
  • DRAM and SDRAM
  • Dynamic Access and Synchronous (no pairs)
  • 13 ns
  • PC100 and PC133
  • 7ns and less

133
Building Blocks - Hardware
  • Cache
  • L1 - 4 ns
  • L2 - 5 ns
  • L3 (off the chip) - 30 ns
  • Celeron
  • 0 - 512 KB
  • Intel Xeon chips
  • 512 KB - 2MB L2 Cache
  • Intel Itanium
  • 512KB -
  • Most processors have at least 256 KB

134
Building Blocks - Hardware
  • Disks and I/O
  • IDE and EIDE
  • IBM 75 GB 7200 rpm disk w/ 2MB onboard cache
  • SCSI I, II, III and SCA
  • 5400, 7200, and 10000 rpm
  • 20 MB/s, 40 MB/s, 80 MB/s, 160 MB/s
  • Can chain from 6-15 disks
  • RAID Sets
  • software and hardware
  • best for dealing with parallel I/O
  • reserved cache for buffering before flushing to disks

135
Building Blocks - Hardware
  • System Bus
  • ISA
  • 5 MHz - 13 MHz
  • 32 bit PCI
  • 33 MHz
  • 133 MB/s
  • 64 bit PCI
  • 66 MHz
  • 533 MB/s (266 MB/s at 33 MHz)
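Peak PCI bandwidth is simply bus width times clock rate. The sketch below assumes one transfer per clock and the nominal 33.33 and 66.66 MHz clocks; it shows that 64-bit PCI at 66 MHz peaks at about 533 MB/s, while 266 MB/s corresponds to 64-bit at 33 MHz (or 32-bit at 66 MHz). The function name is an assumption.

```python
def peak_bandwidth_mb_s(width_bits: int, clock_mhz: float) -> float:
    """Peak bus bandwidth: bytes per transfer times million transfers per second."""
    return width_bits / 8 * clock_mhz


for width, clock in [(32, 33.33), (64, 33.33), (64, 66.66)]:
    rate = peak_bandwidth_mb_s(width, clock)
    print(f"{width}-bit PCI @ {clock} MHz: about {rate:.0f} MB/s")
```

These are theoretical peaks. Sustained rates on a shared bus are lower, which matters when a fast NIC and a disk controller sit on the same PCI bus.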

136
Building Blocks - Hardware
  • Network Interface Cards (NICs)
  • Ethernet - 10 Mbps, 100 Mbps, 1 Gbps
  • ATM - 155 Mbps and higher
  • Quality of Service (QoS)
  • Scalable Coherent Interface (SCI)
  • 12 microseconds latency
  • Myrinet - 1.28 Gbps
  • 120 MB/s
  • 5 microseconds latency
  • PARAMNet - 2.5 Gbps

137
Building Blocks - Operating System
  • Solaris - Sun
  • AIX - IBM
  • HPUX - HP
  • IRIX - SGI
  • Linux - everyone!
  • Is architecture independent
  • Windows NT/2000

138
Building Blocks - Compilers
  • Commercial
  • Portland Group Incorporated (PGI)
  • C, C++, F77, F90
  • Not as expensive as vendor-specific compilers, and
    compiles most applications
  • GNU
  • gcc, g++, g77, vast f90
  • free!

139
Building Blocks - Scheduler
  • Cron, at (NT/2000)
  • Condor
  • IBM Loadleveler
  • LSF
  • Portable Batch System (PBS)
  • Maui Scheduler
  • GLOBUS
  • All free, run on more than one OS!

140
Building Blocks - Message Passing
  • Commercial and free
  • Naturally Parallel, Highly Parallel
  • Condor
  • High Throughput Computing (HTC)
  • Parallel Virtual Machine (PVM)
  • Oak Ridge National Laboratory
  • Message Passing Interface (MPI)
  • MPICH from Argonne National Laboratory (ANL)

141
Building Blocks - Debugging and Analysis
  • Parallel Debuggers
  • TotalView
  • GUI based
  • Performance Analysis Tools
  • monitoring library calls and runtime analysis
  • AIMS, MPE, Pablo,
  • Paradyn - from Wisconsin,
  • SvPablo, Vampir, Dimemas, Paraver

142
Building Blocks - Other
  • Cluster Administration Tools
  • Cluster Monitoring Tools
  • These tools are part of the Single System Image
    aspects

143
Scalability of Parallel Processors
[Figure: Performance versus number of Processors for SMP, Cluster of Uniprocessors, and Cluster of SMPs]
144
Installing the Operating System
  • Which package ?
  • Which Services ?
  • Do I need a graphical environment ?

145
Identifying the hardware bottlenecks
  • Is my hardware optimal ?
  • Can I improve my hardware choices ?
  • How can I identify where the problem is ?
  • Common hardware bottlenecks !!

146
Benchmarks
  • Synthetic Benchmarks
  • Bonnie
  • Stream
  • NetPerf
  • NetPipe
  • Applications Benchmarks
  • High Performance Linpack
  • NAS

147
Networking under Linux
148
Network Terminology Overview
  • IP address: the unique machine address on the
    net (e.g., 128.169.92.195)
  • netmask: determines which portion of the IP
    address specifies the subnetwork number, and
    which portion specifies the host on that subnet
    (e.g., 255.255.255.0)
  • network address: the IP address bitwise-ANDed
    with the netmask (e.g., 128.169.92.0)
  • broadcast address: the network address ORed with the
    negation of the netmask (e.g., 128.169.92.255)
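The two definitions above can be checked with plain integer arithmetic on the example addresses. This is a small sketch using Python's standard `ipaddress` module only to convert between dotted notation and integers; the variable names are illustrative.

```python
import ipaddress

ip = int(ipaddress.IPv4Address("128.169.92.195"))
mask = int(ipaddress.IPv4Address("255.255.255.0"))

# network address = IP AND netmask
network = ip & mask
# broadcast address = network OR (NOT netmask), kept to 32 bits
broadcast = network | (~mask & 0xFFFFFFFF)

network_str = str(ipaddress.IPv4Address(network))
broadcast_str = str(ipaddress.IPv4Address(broadcast))
print(network_str, broadcast_str)
```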

149
Network Terminology Overview
  • gateway address: the address of the gateway
    machine that lives on two different networks and
    routes packets between them
  • name server address: the address of the name
    server that translates host names into IP
    addresses

150
A Cluster Network
151
Network Configuration
  • IP Address
  • Three private IP address ranges
  • 10.0.0.0 to 10.255.255.255, 172.16.0.0 to
    172.31.255.255, 192.168.0.0 to 192.168.255.255
  • Information on private intranets is available in
    RFC 1918
  • Warning: Should not use IP address 10.0.0.0 or
    172.16.0.0 or 192.168.0.0 for the server
  • Netmask 255.255.255.0 should be sufficient for
    most clusters
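A short check of the RFC 1918 ranges and of how many hosts a 255.255.255.0 netmask allows, using Python's standard `ipaddress` module. The helper name and the 192.168.1.0 cluster network are assumptions for illustration.

```python
import ipaddress

# The three private blocks defined in RFC 1918.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(addr: str) -> bool:
    """True if addr falls inside one of the three private ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)


# A cluster network with netmask 255.255.255.0 (hypothetical addresses).
cluster_net = ipaddress.ip_network("192.168.1.0/255.255.255.0")
# Exclude the network address and the broadcast address.
usable_hosts = cluster_net.num_addresses - 2
print(usable_hosts, "usable host addresses")
```

A cluster with more nodes than this needs a shorter netmask, for example 255.255.0.0.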

152
Network Configuration
  • DHCP: Dynamic Host Configuration Protocol
  • Advantages
  • You can simplify network setup
  • Disadvantages
  • It is a centralized solution (is it scalable ?)
  • IP addresses are linked to ethernet address, and
    that can be a problem if you change the NIC or
    want to change the hostname routinely

153
Network Configuration Files
  • /etc/resolv.conf -- configures the name resolver
    specifying the following fields
  • search (a list of alternate domain names to
    search for a hostname)
  • nameserver (IP addresses of DNS used for name
    resolutions)
  • search cse.iitd.ac.in
  • nameserver 128.169.93.2
  • nameserver 128.169.201.2

154
Network Configuration Files
  • /etc/hosts -- contains a list of IP addresses and
    their corresponding hostnames. Used for faster
    name resolution process (no need to query the
    domain name server to get the IP address)
  • 127.0.0.1 localhost localhost.localdomain
  • 128.169.92.195 galaxy galaxy.cse.iitd.ac.in
  • 192.168.1.100 galaxy galaxy
  • 192.168.1.1 star1 star1
  • /etc/host.conf -- specifies the order of queries
    to resolve host names. Example:
  • order hosts, bind (check the /etc.../hosts
    first and then the DNS)
  • multi on (allow a host to have multiple IP
    addresses)
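On a cluster the /etc/hosts entries for compute nodes are usually generated rather than typed. A minimal sketch, assuming nodes named star1, star2, ... on the 192.168.1.x private network as in the example above; `hosts_entries` is a hypothetical helper.

```python
def hosts_entries(prefix: str, base: str, count: int) -> list[str]:
    """Return /etc/hosts lines for nodes prefix1..prefixN on one /24 network."""
    if not 1 <= count <= 254:
        raise ValueError("a 255.255.255.0 network holds at most 254 hosts")
    return [f"{base}.{i}\t{prefix}{i}" for i in range(1, count + 1)]


lines = hosts_entries("star", "192.168.1", 3)
print("\n".join(lines))
```

The output would be appended to /etc/hosts on every node so that node names resolve without a DNS query.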

155
Host-specific Configuration Files
  • /etc/conf.modules -- specifies the list of
    modules (drivers) that have to be loaded by the
    kerneld (see /lib/modules for a full list)
  • alias eth0 tulip
  • /etc/HOSTNAME -- specifies your system hostname,
    e.g. galaxy1.cse.iitd.ac.in
  • /etc/sysconfig/network -- specifies a gateway
    host, gateway device
  • NETWORKING=yes
  • HOSTNAME=galaxy.cse.iitd.ac.in
  • GATEWAY=128.169.92.1
  • GATEWAYDEV=eth0
  • NISDOMAIN=workshop

156
Configure Ethernet Interface
  • Loadable ethernet drivers -
  • Loadable modules are pieces of object code that
    can be loaded into a running kernel. This allows
    device drivers to be added to a running Linux
    system in real time. The loadable Ethernet
    drivers are found in the /lib/modules/release/net
    directory