Title: A Tutorial
1A Tutorial
- Designing Cluster Computers and High Performance Storage Architectures
- At HPC ASIA 2002, Bangalore, INDIA, December 16, 2002
- By
Dheeraj Bhardwaj, Department of Computer Science and Engineering, Indian Institute of Technology, Delhi, INDIA
e-mail: dheerajb_at_cse.iitd.ac.in, http://www.cse.iitd.ac.in/dheerajb
N. Seetharama Krishna, Centre for Development of Advanced Computing, Pune University Campus, Pune, INDIA
e-mail: krishna_at_cdacindia.com, http://www.cdacindia.com
2Acknowledgments
- All the contributors of LINUX
- All the contributors of Cluster Technology
- All the contributors in the art and science of parallel computing
- Department of Computer Science and Engineering, IIT Delhi
- Centre for Development of Advanced Computing (C-DAC) and collaborators
3Disclaimer
- The information and examples provided are based on a Red Hat Linux 7.2 installation on Intel PC platforms (our specific hardware specifications)
- Much of it should be applicable to other versions of Linux
- There is no warranty that the materials are error free
- Authors will not be held responsible for any direct, indirect, special, incidental or consequential damages related to any use of these materials
4Part I
- Designing Cluster Computers
5Outline
- Introduction
- Classification of Parallel Computers
- Introduction to Clusters
- Classification of Clusters
- Cluster Components and Issues
- Hardware
- Interconnection Network
- System Software
- Design and Build a Cluster Computer
- Principles of Cluster Design
- Cluster Building Blocks
- Networking Under Linux
- PBS
- PVFS
- Single System Image
6Outline
- Tools for Installation and Management
- Issues related to Installation, Configuration, Monitoring and Management
- NPACI Rocks
- OSCAR
- EU DataGrid WP4 Project
- Other Tools: Scyld Beowulf, OpenMosix, Cplant, SCore
- HPC Applications and Parallel Programming
- HPC Applications
- Issues related to parallel Programming
- Parallel Algorithms
- Parallel Programming Paradigms
- Parallel Programming Models
- Message Passing
- Applications I/O and Parallel File System
- Performance metrics of Parallel Systems
- Conclusion
7Introduction
8What do We Want to Achieve ?
- Develop High Performance Computing (HPC) Infrastructure which is
- Scalable (Parallel, MPP, Grid)
- User Friendly
- Based on Open Source
- Efficient in Problem Solving
- Able to Achieve High Performance
- Able to Handle Large Data Volumes
- Cost Effective
- Develop HPC Applications which are
- Portable (Desktop, Supercomputers, Grid)
- Future Proof
- Grid Ready
9Who Uses HPC ?
- Scientific and Engineering Applications
- Simulation of physical phenomena
- Virtual Prototyping (Modeling)
- Data analysis
- Business/ Industry Applications
- Data warehousing for financial sectors
- E-governance
- Medical Imaging
- Web servers, Search Engines, Digital libraries
- etc ..
- All face similar problems
- Not enough computational resources
- Remote facilities: the network becomes the bottleneck
- Heterogeneous and fast changing systems
10HPC Applications
- Three Types
- High-Capacity: Grand Challenge Applications
- Throughput: running hundreds/thousands of jobs, doing parameter studies, statistical analysis, etc.
- Data: genome analysis, particle physics, astronomical observations, seismic data processing, etc.
- We are seeing a Fundamental Change in HPC Applications
- They have become multidisciplinary
- Require an incredible mix of various technologies and expertise
11Why Parallel Computing ?
- If your application requires more computing power than a sequential computer can provide?!
- You might suggest improving the operating speed of the processor and other components
- We do not disagree with your suggestion, BUT how long can you go?
- We always have the desire and prospects for greater performance
- Parallel Computing is the right answer
12Serial and Parallel Computing
A parallel computer is a collection of processing elements that communicate and co-operate to solve large problems fast.
- PARALLEL COMPUTING
- Fetch/Store
- Compute/communicate
- Cooperative game
- SERIAL COMPUTING
- Fetch/Store
- Compute
13Classification of Parallel Computers
14Classification of Parallel Computers
Flynn Classification: based on the number of instruction and data streams
- SISD (Single Instruction, Single Data): Conventional
- SIMD (Single Instruction, Multiple Data): Data Parallel, Vector Computing
- MISD (Multiple Instruction, Single Data): Systolic Arrays
- MIMD (Multiple Instruction, Multiple Data): very general, multiple approaches
15MIMD Architecture Classification
Current focus is on MIMD model, using general
purpose processors or multicomputers.
16MIMD Shared Memory Architecture
- Source PE writes data to Global Memory; the destination PE retrieves it
- Easy to build
- Limitations: reliability and expandability. A memory component or any processor failure affects the whole system
- Increase of processors leads to memory contention
- Ex.: Silicon Graphics supercomputers
17MIMD Distributed Memory Architecture
[Diagram: three processors, each with its own local memory attached over a memory bus, connected by a high-speed interconnection network]
- Inter-process communication using a high-speed network
- Network can be configured to various topologies, e.g. Tree, Mesh, Cube
- Unlike shared-memory MIMD
- easily/readily expandable
- highly reliable (any CPU failure does not affect the whole system)
18MIMD Features
- MIMD architecture is more general purpose
- MIMD needs clever use of synchronization, which comes from message passing, to prevent race conditions
- Designing efficient message passing algorithms is hard because the data must be distributed in a way that minimizes communication traffic
- Cost of message passing is very high
19Shared Memory (Address-Space) Architecture
- Non-Uniform Memory Access (NUMA): shared address space computer with local and global memories
- Time to access a remote memory bank is longer than the time to access a local word
- Shared address space computers have a local cache at each processor to increase their effective processor bandwidth
- The cache can also be used to provide fast access to remotely located shared data
- Mechanisms have been developed for handling the cache coherence problem
20Shared Memory (Address-Space) Architecture
[Figure: Non-uniform memory access (NUMA) shared-address-space computer with local and global memories (M) connected by an interconnection network]
21Shared Memory (Address-Space) Architecture
[Figure: Non-uniform-memory-access (NUMA) shared-address-space computer with local memory only]
22Shared Memory (Address-Space) Architecture
- Provides hardware support for read and write access by all processors to a shared address space
- Processors interact by modifying data objects stored in the shared address space
- MIMD shared-address-space computers are referred to as multiprocessors
- Uniform Memory Access (UMA): shared address space computer with local and global memories
- Time taken by a processor to access any memory word in the system is identical
23Shared Memory (Address-Space) Architecture
[Figure: Uniform Memory Access (UMA) shared-address-space computer, with processors (P) connected to memories (M) through an interconnection network]
24Uniform Memory Access (UMA)
- Parallel Vector Processors (PVPs)
- Symmetric Multiple Processors (SMPs)
25Parallel Vector Processor
VP: Vector Processor; SM: Shared Memory
26Parallel Vector Processor
- Works well only for vector codes
- Scalar codes may not perform well
- Need to completely rethink and re-express algorithms so that vector instructions are performed almost exclusively
- Special purpose hardware is necessary
- Fastest systems are no longer vector uniprocessors
27Parallel Vector Processor
- Small number of powerful custom-designed vector processors used
- Each processor is capable of at least 1 Gflop/s performance
- A custom-designed, high-bandwidth crossbar switch networks these vector processors
- Most machines do not use caches; rather they use a large number of vector registers and an instruction buffer
- Examples: Cray C-90, Cray T-90, Cray T-3D
28Symmetric Multiprocessors (SMPs)
P/C: Microprocessor and cache; SM: Shared Memory
29Symmetric Multiprocessors (SMPs) characteristics
Symmetric Multiprocessors (SMPs)
- Uses commodity microprocessors with on-chip and off-chip caches
- Processors are connected to a shared memory through a high-speed snoopy bus
- On some SMPs, a crossbar switch is used in addition to the bus
- Scalable up to
- 4-8 processors (non-backplane based)
- a few tens of processors (backplane based)
30Symmetric Multiprocessors (SMPs)
Symmetric Multiprocessors (SMPs) characteristics
- All processors see the same image of all system resources
- Equal priority for all processors (except for the master or boot CPU)
- Memory coherency maintained by HW
- Multiple I/O buses for greater Input/Output
31Symmetric Multiprocessors (SMPs)
[Figure: SMP block diagram, with four processors (each with an L1 cache) sharing memory through a DIR controller, plus an I/O bridge and I/O bus]
32Symmetric Multiprocessors (SMPs)
- Issues
- Bus-based architecture
- Inadequate beyond 8-16 processors
- Crossbar-based architecture
- Multistage approach, considering the I/Os required in hardware
- Clock distribution and HF design issues for backplanes
- Limitation is mainly caused by using a centralized shared memory and a bus or crossbar interconnect, which are both difficult to scale once built
33Commercial Symmetric Multiprocessors (SMPs)
- Sun Ultra Enterprise 10000 (high end, expandable up to 64 processors), Sun Fire
- DEC AlphaServer 8400
- HP 9000
- SGI Origin
- IBM RS 6000
- IBM P690, P630
- Intel Xeon, Itanium, IA-64(McKinley)
34Symmetric Multiprocessors (SMPs)
- Heavily used in commercial applications (databases, on-line transaction systems)
- System is symmetric (every processor has equal access to the shared memory, the I/O devices, and the operating system)
- Being symmetric, a higher degree of parallelism can be achieved
35Massively Parallel Processors (MPPs)
P/C: Microprocessor and cache; LM: Local memory; NIC: Network interface circuitry; MB: Memory bus
36Massively Parallel Processors (MPPs)
- Commodity microprocessors in processing nodes
- Physically distributed memory over processing nodes
- High communication bandwidth and low latency interconnect (high-speed, proprietary communication network)
- Tightly coupled network interface which is connected to the memory bus of a processing node
37Massively Parallel Processors (MPPs)
- Provide proprietary communication software to realize the high performance
- Each processor is connected by a high-speed memory bus to its local memory and to a network interface circuitry (NIC)
- Scaled up to hundreds or even thousands of processors
- Each process has its private address space; processes interact by passing messages
38Massively Parallel Processors (MPPs)
- MPPs support asynchronous MIMD modes
- MPPs support a single system image at different levels
- Microkernel operating system on compute nodes
- Provide a high-speed I/O system
- Examples: Cray T3D, T3E, Intel Paragon, IBM SP2
39Cluster ?
- cluster n.
- A group of the same or similar elements gathered or occurring closely together; a bunch: "She held out her hand, a small tight cluster of fingers" (Anne Tyler)
- Linguistics: two or more successive consonants in a word, as cl and st in the word cluster
A Cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource.
40Cluster System Architecture
[Diagram: cluster system architecture. A programming environment (Java, C, Fortran, MPI, PVM), web windows and other subsystems, and a user interface (Database, OLTP) sit on top of a Single System Image infrastructure and an Availability infrastructure, which run across multiple OS nodes connected by an interconnect]
41Clusters ?
- A set of
- Nodes physically connected over a commodity/proprietary network
- Gluing software
- Other than this definition, no official standard exists
- Depends on the user requirements
- Commercial
- Academic
- Good way to sell old wine in a new bottle
- Budget
- Etc.
- Designing clusters is not obvious, but a critical issue
42Why Clusters NOW?
- Clusters gained momentum when three technologies converged
- Very high performance microprocessors
- workstation performance = yesterday's supercomputers
- High speed communication
- Standard tools for parallel/distributed computing and their growing popularity
- Time to market > performance
- Internet services: huge demands for scalable, available, dedicated internet servers
- big I/O, big computing power
43How should we Design them ?
- Components
- Should they be off-the-shelf and low cost?
- Should they be specially built?
- Is a mixture a possibility?
- Structure
- Should each node be in a different box (workstation)?
- Should everything be in a box?
- Should everything be in a chip?
- Kind of nodes
- Should it be homogeneous?
- Can it be heterogeneous?
44What Should it offer ?
- Identity
- Should each node maintain its identity (and owner)?
- Should it be a pool of nodes?
- Availability
- How far should it go?
- Single-system Image
- How far should it go?
45Place for Clusters in HPC world ?
[Figure: spectrum of distance between nodes, from a chip (shared-memory parallel computing), to a box, a room, a building, and the world (distributed computing). Source: Toni Cortes (toni_at_ac.upc.es)]
46Where Do Clusters Fit?
[Figure: where clusters fit, on a spectrum from MP systems (ASCI Red Tflops, Superclusters, Beowulf) to distributed systems (Berkeley NOW, Condor, Legion/Globus, SETI_at_home, the Internet), with roughly 1 TF/s and 15 TF/s delivered at the two ends]
- MP systems
- Bounded set of resources
- Apps grow to consume all cycles
- Application manages resources
- System SW gets in the way
- 5% overhead is maximum
- Apps drive purchase of equipment
- Real-time constraints
- Space-shared
- Distributed systems
- Gather (unused) resources
- System SW manages resources
- System SW adds value
- 10-20% overhead is OK
- Resources drive applications
- Time to completion is not critical
- Time-shared
- Commercial: PopularPower, United Devices, Centrata, ProcessTree, Applied Meta, etc.
Src: B. Maccabe (UNM), R. Pennington (NCSA)
47Top 500 Supercomputers
Rank | Computer / Procs | Peak performance | Site, Country / Year
1 | Earth Simulator (NEC) / 5120 | 40960 GF | Japan / 2002
2 | ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096 | 10240 GF | LANL, USA / 2002
3 | ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096 | 10240 GF | LANL, USA / 2002
4 | ASCI White (IBM) SP Power3 375 MHz / 8192 | 12288 GF | LLNL, USA / 2000
5 | MCR Linux Cluster, Xeon 2.4 GHz, Quadrics / 2304 | 11060 GF | LLNL, USA / 2002
48What makes the Clusters ?
- The same hardware used for
- Distributed computing
- Cluster computing
- Grid computing
- Software converts the hardware into a cluster
- Ties everything together
49Task Distribution
- The hardware is responsible for
- High-performance
- High-availability
- Scalability (network)
- The software is responsible for
- Gluing the hardware
- Single-system image
- Scalability
- High-availability
- High-performance
50Classification of Cluster Computers
51Clusters Classification 1
- Based on Focus (in Market)
- High performance (HP) clusters
- Grand challenge applications
- High availability (HA) clusters
- Mission critical applications
- Web/e-mail
- Search engines
52HA Clusters
53Clusters Classification 2
- Based on Workstation/PC Ownership
- Dedicated clusters
- Non-dedicated clusters
- Adaptive parallel computing
- Can be used for CPU cycle stealing
54Clusters Classification 3
- Based on Node Architecture
- Clusters of PCs (CoPs)
- Clusters of Workstations (COWs)
- Clusters of SMPs (CLUMPs)
55Clusters Classification 4
- Based on Node Components' Architecture and Configuration
- Homogeneous clusters
- All nodes have similar configuration
- Heterogeneous clusters
- Nodes based on different processors and running
different OS
56Clusters Classification 5
- Based on Node OS Type..
- Linux Clusters (Beowulf)
- Solaris Clusters (Berkeley NOW)
- NT Clusters (HPVM)
- AIX Clusters (IBM SP2)
- SCO/Compaq Clusters (Unixware)
- Digital VMS Clusters, HP Clusters, ...
57Clusters Classification 6
- Based on Levels of Clustering
- Group clusters (nodes: 2-99)
- A set of dedicated/non-dedicated computers, mainly connected by a SAN like Myrinet
- Departmental clusters (nodes: 99-999)
- Organizational clusters (nodes: many 100s)
- Internet-wide clusters / Global clusters (nodes: 1000s to many millions)
- Computational Grid
58Clustering Evolution
[Figure: clustering evolution over time (1990 to 2005), with cost and complexity increasing across generations: 1st Gen. MPP Supercomputers, 2nd Gen. Beowulf Clusters, 3rd Gen. Commercial Grade Clusters, 4th Gen. Network Transparent Clusters]
59Cluster Components
60Hardware
61Nodes
- The idea is to use standard off-the-shelf processors
- Pentium class (Intel, AMD)
- Sun
- HP
- IBM
- SGI
- No special development for clusters
62Interconnection Network
63Interconnection Network
- One of the key points in clusters
- Technical objectives
- High bandwidth
- Low latency
- Reliability
- Scalability
64Network Design Issues
- Plenty of work has been done to improve networks for clusters
- Main design issues
- Physical layer
- Routing
- Switching
- Error detection and correction
- Collective operations
65Physical Layer
- Trade-off between
- Raw data transfer rate and cable cost
- Bit width
- Serial media (Ethernet, Fiber Channel)
- Moderate bandwidth
- 64-bit wide cable (HIPPI)
- Pin count limits the implementation of switches
- 8-bit wide cable (Myrinet, ServerNet)
- Good compromise
66Routing
- Source-path
- The entire path is attached to the message at its source location (a tiny sketch follows below)
- Each switch deletes the current head of the path
- Table-based routing
- The header only contains the destination node
- Each switch has a table to help in the decision
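To make source-path routing concrete, here is a toy C sketch (purely illustrative; the struct and function names are made up and do not come from any real switch API). The sender attaches the whole path, a list of output ports, to the message, and every switch consumes the current head entry:

#include <stdio.h>

#define MAX_HOPS 8

struct message {
    int path[MAX_HOPS];   /* output port to take at each switch */
    int hops_left;        /* how many path entries remain       */
    const char *payload;
};

/* What one switch does: pop the head of the path and forward on it. */
static void switch_forward(struct message *m, int switch_id)
{
    int out_port = m->path[0];
    int i;

    for (i = 1; i < m->hops_left; i++)   /* delete the current head */
        m->path[i - 1] = m->path[i];
    m->hops_left--;

    printf("switch %d forwards \"%s\" on port %d\n",
           switch_id, m->payload, out_port);
}

int main(void)
{
    struct message m = { {2, 0, 3}, 3, "hello" };  /* path chosen at the source */
    int s;

    for (s = 0; m.hops_left > 0; s++)
        switch_forward(&m, s);
    return 0;
}

Table-based routing would instead carry only the destination ID in the header and let each switch look the output port up in its own table.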
67Switching
- Packet switching
- Packets are buffered in the switch before being resent
- Implies an upper-bound packet size
- Needs buffers in the switch
- Used by traditional LAN/WAN networks
- Wormhole switching
- Data is immediately forwarded to the next stage
- Low latency
- No buffers are needed
- Error correction is more difficult
- Used by SANs such as Myrinet, PARAMNet
68Flow Control
- Credit-based design
- The receiver grants credit to the sender
- The sender can only send if it has enough credit
- On-Off
- The receiver informs the sender whether it can or cannot accept new packets (a toy sketch of the credit-based scheme follows below)
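A toy C sketch of the credit-based scheme described above (illustrative only; the names and the single-threaded simulation are assumptions, not real NIC code). The receiver grants one credit per receive buffer; the sender consumes a credit per packet and stalls at zero until credits come back:

#include <stdio.h>

#define RX_BUFFERS 4

static int credits = RX_BUFFERS;   /* granted by the receiver at start-up */

static int try_send(int packet_id)
{
    if (credits == 0) {
        printf("packet %d stalled: no credit\n", packet_id);
        return 0;                  /* sender must wait */
    }
    credits--;                     /* one credit consumed per packet */
    printf("packet %d sent (credits left %d)\n", packet_id, credits);
    return 1;
}

static void receiver_drains(int n)  /* receiver empties n buffers */
{
    credits += n;                  /* credits returned to the sender */
}

int main(void)
{
    int i;
    for (i = 0; i < 6; i++)
        if (!try_send(i)) {        /* stalled: wait for the receiver */
            receiver_drains(2);
            try_send(i);
        }
    return 0;
}

An On-Off scheme replaces the credit counter with a single stop/go signal from the receiver.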
69Error Detection
- It has to be done at hardware level
- Performance reasons
- e.g., CRC checking is done by the network interface
- Networks are very reliable
- Only erroneous messages should see overhead
70Collective Operations
- These operations are mainly
- Barrier
- Multicast
- Few interconnects offer this characteristic
- Synfinity is a good example
- Normally offered by software
- Easy to achieve in bus-based networks like Ethernet
- Difficult to achieve in point-to-point networks like Myrinet
71Examples of Network
- The most common networks used are
- Ethernet
- SCI
- Myrinet
- PARAMNet
- HIPPI
- ATM
- Fiber Channel
- AmpNet
- Etc.
72Ethernet
- Most widely used for LAN
- Affordable
- Serial transmission
- Packet switching and table-based routing
- Types of Ethernet
- Ethernet and Fast Ethernet
- Based on collision domains (buses)
- Switched hubs can make different collision domains
- Gigabit Ethernet
- Based on high-speed point-to-point switches
- Each node is in its own collision domain
73ATM
- Standard designed for the telecommunication industry
- Relatively expensive
- Serial
- Packet switching and table-based routing
- Designed around the concept of fixed-size packets
- Special characteristics
- Well designed for real-time systems
74Scalable Coherent Interface (SCI)
- First standard specially designed for CC
- Low layer
- Point-to-point architecture, but maintains bus functionality
- Packet switching and table-based routing
- Split transactions
- Dolphin Interconnect Solutions, Sun SPARC Sbus
- High Layer
- Defines a distributed cache-coherent scheme
- Allows transparent shared memory programming
- Sequent NUMA-Q, Data General AViiON NUMA
75PARAMNet / Myrinet
- Low-latency and High-bandwidth network
- Characteristics
- Byte-wise links
- Wormhole switching and source-path routing
- Low-latency cut-through routing switches
- Automatic mapping, which favors fault tolerance
- Zero-copying is not possible
- Programmable on-board processor
- Allows experimentation with new protocols
76Comparison
77Communication Protocols
- Traditional protocols
- TCP and UDP
- Specially designed
- Active messages
- VMMC
- BIP
- VIA
- Etc.
78Data Transfer
- User-level lightweight communication
- Avoid OS calls
- Avoid data copying
- Examples
- Fast messages, BIP, ...
- Kernel-level lightweight communication
- Simplified protocols
- Avoid data copying
- Examples
- GAMMA, PM, ...
79TCP and UDP
- First messaging libraries used
- TCP is reliable
- UDP is not reliable
- Advantages
- Standard and well known
- Disadvantages
- Too much overhead (especially for fast networks)
- Plenty of OS interaction
- Many copies
80Active Messages
- Low-latency communication library
- Main issues
- Zero-copying protocol
- Messages copied directly
- to/from the network
- to/from the user-address space
- Receiver memory has to be pinned
- There is no need for a receive operation
81VMMC
- Virtual-Memory Mapped Communication
- View messages as read and writes to memory
- Similar to distributed shared memory
- Makes a correspondence between
- A virtual page at the receiving side
- A virtual page at the sending side
82BIP
- Basic Interface for Parallelism
- Low-level message-layer for Myrinet
- Uses various protocols for various message sizes
- Tries to achieve zero copies (one at most)
- Used via MPI by programmers
- 7.6us latency
- 107 Mbytes/s bandwidth
83VIA
- Virtual Interface Architecture
- First standard promoted by the industry
- Combines the best features of academic projects
- Interface
- Designed to be used by programmers directly
- Many programmers believe it to be too low level
- Higher-level APIs are expected
- NICs with VIA implemented in hardware
- This is the proposed path
84Potential and Limitations
- High bandwidth
- Can be achieved at low cost
- Low latency
- Can be achieved, but at high cost
- The lower the latency, the closer we get to a traditional supercomputer
- Reliability
- Can be achieved at low cost
- Scalability
- Easy to achieve for the size of clusters
85System Software
- Operating system vs. middleware
- Processor management
- Memory management
- I/O management
- Single-system image
- Monitoring clusters
- High Availability
- Potential and limitations
86Operating system vs. Middleware
- Operating system
- Hardware-control layer
- Middleware
- Gluing layer
- The barrier is not always clear
- Similar
- User level
- Kernel level
87System Software
- We will not distinguish between
- Operating system
- Middleware
- The middleware related to the operating system
- Objectives
- Performance/Scalability
- Robustness
- Single-system image
- Extendibility
- Scalability
- Heterogeneity
88Processor Management
- Schedule jobs onto the nodes
- Scheduling policies should take into account
- Needed vs. available resources
- Processors
- Memory
- I/O requirements
- Execution-time limits
- Priorities
- Different kind of jobs
- Sequential and parallel jobs
89Load Balancing
- Problem
- A perfect static balance is not possible
- Execution time of jobs is unknown
- Unbalanced systems may not be efficient
- Solution
- Process migration
- Prior to execution
- Granularity must be small
- During execution
- Cost must be evaluated
90Fault Tolerance
- Large clusters must be fault tolerant
- The probability of a fault is quite high
- Solution
- Re-execution of applications in the failed node
- Not always possible or acceptable
- Checkpointing and migration
- It may have a high overhead
- Difficult with some kinds of applications
- Applications that modify the environment
- Transactional behavior may be a solution
91Managing Heterogeneous Systems
- Compatible nodes but different characteristics
- It becomes a load balancing problem
- Non compatible nodes
- Binaries for each kind of node are needed
- Shared data has to be in a compatible format
- Migration becomes nearly impossible
92Scheduling Systems
- Kernel level
- Very few take care of cluster scheduling
- High-level applications do the scheduling
- Distribute the work
- Migrate processes
- Balance the load
- Interact with the users
- Examples
- CODINE, CONDOR, NQS, etc
93Memory Management
- Objective
- Use all the memory available in the cluster
- Basic approaches
- Software distributed-shared memory
- General purpose
- Specific usage of idle remote memory
- Specific purpose
- Remote memory paging
- File-system caches or RAMdisks (described later)
94Software Distributed Shared Memory
- Software layer
- Allows applications running on different nodes to share memory regions
- Relatively transparent to the programmer
- Address-space structure
- Single address space
- Completely transparent to the programmer
- Shared areas
- Applications have to mark a given region as shared
- Not completely transparent
- Approach mostly used due to its simplicity
95Main Data Problems to be Solved
- Data consistency vs. Performance
- Strict semantics are very inefficient
- Current systems offer relaxed semantics
- Data location (finding the data)
- The most common solution is the owner node
- This node may be fixed or vary dynamically
- Granularity
- Usually a fixed block size is implemented
- Hardware MMU restrictions
- Leads to false sharing
- Variable granularity being studied
96Other Problems to be Solved
- Synchronization
- Test-and-set-like mechanisms cannot be used
- SDSM systems have to offer new mechanisms
- i.e. semaphores (message passing implementation)
- Fault tolerance
- Very important and very seldom implemented
- Multiple copies
- Heterogeneity
- Different page sizes
- Different data-type implementations
- Use tags
97Remote-Memory Paging
- Keep swapped-out pages in idle memory
- Assumptions
- Many workstations are idle
- Disks are much slower than Remote memory
- Idea
- Send swapped-out pages to idle workstations
- When there is no remote memory space, use disks
- Replicate copies to increase fault tolerance
- Examples
- The global memory service (GMS)
- Remote memory pager
98I/O Management
- Advances closely track parallel I/O
- There are two major differences
- Network latency
- Heterogeneity
- Interesting issues
- Network configurations
- Data distribution
- Name resolution
- Memory to increase I/O performance
99Network Configurations
- Device location
- Attached to nodes
- Very easy to have (use the disks in the nodes)
- Network attached devices
- I/O bandwidth is not limited by memory bandwidth
- Number of networks
- Only one network for everything
- One special network for I/O traffic (SAN)
- Becoming very popular
100Data Distribution
- Distribution per files
- Each node has its own independent file system
- Like in distributed file systems (NFS, Andrew, CODA, ...)
- Each node keeps a set of files locally
- It allows remote access to its files
- Performance
- Maximum performance = device performance
- Parallel access only to different files
- Remote file access depends on the network
- Caches help but increase complexity (coherence)
- Tolerance
- File replication in different nodes
101Data Distribution
- Distribution per blocks
- Also known as Software/Parallel RAIDs
- xFS, Zebra, RAMA, ...
- Blocks are interleaved among all disks
- Performance
- Parallel access to blocks in the same file
- Parallel access to different files
- Requires a fast network
- Usually solved with a SAN
- Especially good for large requests (multimedia)
- Fault tolerance
- RAID levels (3, 4 and 5)
102Name Resolution
- Same as in distributed systems
- Mounting remote file systems
- Useful when the distribution is per files
- Distributed name resolution
- Useful when the distribution is per files
- Returns the node where the file resides
- Useful when the distribution is per blocks
- Returns the node where the file's meta-data is located
103Caching
- Caching can be done at multiple levels
- Disk controller
- Disk servers
- Client nodes
- I/O libraries
- etc.
- Good to have several levels of cache
- High levels decrease hit ratio of low levels
- Higher level caches absorb most of the locality
104Cooperative Caching
- Problem of traditional caches
- Each node caches the data it needs
- Plenty of replication
- Memory space not well used
- Increase the coordination of the caches
- Clients know what other clients are caching
- Clients can access cached data in remote nodes
- Replication in the cache is reduced
- Better use of the memory
105RAMdisks
- Assumptions
- Disks are slow and memory/network is fast
- Disks are persistent and memory is not
- Build a disk by unifying idle remote RAM
- Only used for non-persistent data
- Temporary data
- Useful in many applications
- Compilations
- Web proxies
- ...
106Single-System Image
- SSI offers the idea that the cluster is a single
machine - It can be done a several levels
- Hardware
- Hardware DSM
- System software
- It can offer a unified view to applications
- Application
- It can offer a unified view to the user
- All SSI have a boundary
107Key Services of SSI
- Main services offered by SSI
- Single point of entry
- Single file hierarchy
- Single I/O Space
- Single point of management
- Single virtual networking
- Single job/resource management system
- Single process space
- Single user interface
- Not all of them are always available
108Monitoring Clusters
- Clusters need tools to be monitored
- Administrators have many things to check
- The cluster must be visible from a single point
- Subjects of monitoring
- Physical environment
- Temperature, power, ..
- Logical services
- RPCs, NFS, ...
- Performance meters
- Paging, CPU load, ...
109Monitoring Heterogeneous Clusters
- Monitoring is especially necessary in heterogeneous clusters
- Several node types
- Several operating systems
- The tool should hide the differences
- The real characteristics are only needed to solve some problems
- Closely related to Single-System Image
110Auto-Administration
- Monitors know how to perform self-diagnosis
- Next step is to run corrective procedures
- Some systems start to do so (NetSaint)
- Difficult because tools do not have common sense
- This step is necessary
- Many nodes
- Many devices
- Many possible problems
- High probability of error
111High Availability
- One of the key points for clusters
- Especially needed for commercial applications
- 7 days a week and 24 hours a day
- Not necessarily very scalable (32 nodes)
- Based on many issues already described
- Single-system image
- Hide any possible change in the configuration
- Monitoring tools
- Detect the errors to be able to correct them
- Process migration
- Restart/continue applications in running nodes
112 - Design and Build a Cluster Computer
113Cluster Design
- Clusters are good as personal supercomputers
- Clusters are often not good as general purpose multi-user production machines
- Building such a cluster requires planning and understanding design tradeoffs
114Scalable Cluster Design Principles
- Principle of Independence
- Principle of Balanced Design
- Principle of design for Scalability
- Principle of Latency hiding
115Principle of independence
- Components (hardware and software) of the system should be independent of one another
- Incremental scaling: scaling up a system along one dimension by improving one component, independent of others
- For example, upgrading the processor to the next generation should let the system operate at higher performance without upgrading other components
- Should enable heterogeneity scalability
116Principle of independence
- The components' independence can result in cost cutting
- The component becomes a commodity, with the following features
- Open architecture with standard interfaces to the rest of the system
- Off-the-shelf product or public domain
- Multiple vendors in the open market with large volume
- Relatively mature
- For all these reasons the commodity component has low cost, high availability and reliability
117Principle of independence
- Independence principle and application examples
- The algorithm should be independent of the architecture
- The application should be independent of the platform
- The programming language should be independent of the machine
- The language should be modular and have orthogonal features
- The node should be independent of the network, and the network interface should be independent of the network topology
- Caveat
- In any parallel system, there is usually some key component/technique that is novel
- We cannot build an efficient system by simply scaling up one or a few components
- Design should be balanced
118Principle of Balanced Design
- Minimize any performance bottleneck
- Should avoid an unbalanced system design, where a slow component degrades the performance of the entire system
- Should avoid single points of failure
- Example
- The PetaFLOP project: the memory requirement for a wide range of scientific/engineering applications
- Memory (GB) = Speed^(3/4) (Gflop/s)
- 30 TB of memory is appropriate for a Pflop/s machine (worked out below)
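A quick check of the 30 TB figure, assuming the three-quarters-power rule quoted above (a sketch in LaTeX, not from the original slides):

\[
  \text{Memory [GB]} \approx \text{Speed [Gflop/s]}^{3/4}
  = (10^{6})^{3/4}
  = 10^{4.5}
  \approx 3.2 \times 10^{4}\ \text{GB}
  \approx 30\ \text{TB}
\]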
119Principle of Design for Scalability
- Provision must be made so that the system can either scale up to provide higher performance
- Or scale down to allow affordability or greater cost-effectiveness
- Two approaches
- Overdesign
- Example: modern processors support a 64-bit address space. This huge address space may not be fully utilized by a Unix supporting 32-bit addresses, but the overdesign makes the transition of the OS from 32-bit to 64-bit much easier
- Backward compatibility
- Example: a parallel program designed to run on n nodes should be able to run on a single node, maybe with reduced input data
120Principle of Latency Hiding
- Future scalable systems are most likely to use a distributed shared-memory architecture
- Access to remote memory may experience long latencies
- Example: GRID
- Scalable multiprocessors/clusters must rely on the use of
- Latency hiding
- Latency avoiding
- Latency reduction
121Cluster Building
- Conventional wisdom: building a cluster is easy
- Recipe
- Buy hardware from a computer shop
- Install Linux, connect the machines via a network
- Configure NFS, NIS
- Install your application, run and be happy
- Building it right is a little more difficult
- Multi-user cluster, security, performance tools
- Basic question: what works reliably?
- Building it to be compatible with the Grid
- Compilers, libraries
- Accounts, file storage, reproducibility
- Hardware configuration may be an issue
122 - How do people think of parallel programming and using clusters ...
123Panther Cluster
- Picked 8 PCs and named them after the panther family
- Connected them by a network and set up this cluster in a small lab
- Using the Panther Cluster
- Select a PC and log in
- Edit and compile the code
- Execute the program
- Analyze the results
Cheeta
Tiger
Kitten
Cat
Jaguar
Leopard
Panther
Lion
124Panther Cluster - Programming
- Explicit parallel programming isn't easy. You really have to do it yourself
- Network bandwidth and latency matter
- There are good reasons for security patches
- Oops! Lion does not have floating point
125Panther Cluster Attacks Users
- Grad students wanted to use the cool cluster. They each need only half (a half other than Lion)
- Grad students discover that using the same PC at the same time is incredibly bad
- A solution would be to use parts of the cluster exclusively, for one job at a time
- And so...
126We Discover Scheduling
- We tried
- A sign up sheet
- Yelling across the yard
- A mailing list
- finger schedule
-
- A scheduler
[Diagram: a scheduler queue dispatching Job 1, Job 2 and Job 3 onto the cluster nodes]
127Panther Expands
- Panther expands, adding more users and more systems
- Use the Panther node for
- Login
- File services
- Scheduling services
- ...
- All other nodes are compute nodes
[Diagram: Panther serves as the head node; the remaining named nodes and the newly added PCs (PC1-PC9) are compute nodes]
128Evolution of Cluster Services
The Cluster grows
[Diagram: as the cluster grows, the login, file service, scheduling, management and I/O services that initially run together on one node are progressively split onto dedicated nodes, improving computing performance as well as system reliability and manageability]
129Compute _at_ Panther
- Usage Model
- Login to login node
- Compile and test code
- Schedule a test run
- Schedule a serious run
- Carry out I/O through I/O node
- Management Model
- The compute nodes are identical
- Users use Login, I/O and compute nodes
- All I/O requests are managed by Metadata server
[Diagram: the cluster with dedicated Login, File, Scheduling, Management and I/O nodes, plus the named compute nodes and PC1-PC9]
130Cluster Building Block
131Building Blocks - Hardware
- Processor
- Complex Instruction Set Computer (CISC)
- x86, Pentium Pro, Pentium II, III, IV
- Reduced Instruction Set Computer (RISC)
- SPARC, RS6000, PA-RISC, PPC, Power PC
- Explicitly Parallel Instruction Computer (EPIC)
- IA-64 (McKinley), Itanium
132Building Blocks - Hardware
- Memory
- Extended Data Out (EDO)
- pipelining by overlapping the next memory access
- 50 - 60 ns
- DRAM and SDRAM
- Dynamic RAM and Synchronous DRAM (no pairs)
- 13 ns
- PC100 and PC133
- 7 ns and less
133Building Blocks - Hardware
- Cache
- L1 - 4 ns
- L2 - 5 ns
- L3 (off the chip) - 30 ns
- Celeron
- 0 - 512 KB
- Intel Xeon chips
- 512 KB - 2MB L2 Cache
- Intel Itanium
- 512KB -
- Most processors have at least 256 KB
134Building Blocks - Hardware
- Disks and I/O
- IDE and EIDE
- IBM 75 GB 7200 rpm disk w/ 2MB onboard cache
- SCSI I, II, III and SCA
- 5400, 7400, and 10000 rpm
- 20 MB/s, 40 MB/s, 80 MB/s, 160 MB/s
- Can chain from 6-15 disks
- RAID Sets
- software and hardware
- best for dealing with parallel I/O
- reserved cache for writes before flushing to disks
135Building Blocks - Hardware
- System Bus
- ISA
- 5 MHz - 13 MHz
- 32-bit PCI
- 33 MHz
- 133 MB/s
- 64-bit PCI
- 66 MHz
- 266 MB/s
136Building Blocks - Hardware
- Network Interface Cards (NICs)
- Ethernet - 10 Mbps, 100 Mbps, 1 Gbps
- ATM - 155 Mbps and higher
- Quality of Service (QoS)
- Scalable Coherent Interface (SCI)
- 12 microseconds latency
- Myrinet - 1.28 Gbps
- 120 MB/s
- 5 microseconds latency
- PARAMNet - 2.5 Gbps
137Building Blocks Operating System
- Solaris - Sun
- AIX - IBM
- HPUX - HP
- IRIX - SGI
- Linux - everyone!
- Is architecture independent
- Windows NT/2000
138Building Blocks - Compilers
- Commercial
- Portland Group Incorporated (PGI)
- C, C++, F77, F90
- Not as expensive as vendor-specific compilers, and compiles most applications
- GNU
- gcc, g++, g77, VAST f90
- free!
139Building Blocks - Scheduler
- Cron, at (NT/2000)
- Condor
- IBM Loadleveler
- LSF
- Portable Batch System (PBS)
- Maui Scheduler
- GLOBUS
- All free, run on more than one OS!
140Building Blocks Message Passing
- Commercial and free
- Naturally Parallel, Highly Parallel
- Condor
- High Throughput Computing (HTC)
- Parallel Virtual Machine (PVM)
- Oak Ridge National Laboratory
- Message Passing Interface (MPI), see the minimal example below
- MPICH from ANL
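A minimal MPI example in C (a sketch, not taken from the tutorial; the file name hello.c is assumed). Rank 0 sends an integer to every other rank, which is the basic send/receive pattern the libraries above provide:

/* hello.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int msg = 42, i;
        for (i = 1; i < size; i++)               /* send to every worker */
            MPI_Send(&msg, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
        printf("Rank 0 of %d sent value %d\n", size, msg);
    } else {
        int msg;
        MPI_Status status;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank %d received %d\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}

With MPICH this would typically be built with mpicc hello.c -o hello and launched with mpirun -np 4 ./hello.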
141Building Blocks Debugging and Analysis
- Parallel Debuggers
- TotalView
- GUI based
- Performance Analysis Tools
- monitoring library calls and runtime analysis
- AIMS, MPE, Pablo
- Paradyn (from Wisconsin)
- SvPablo, Vampir, Dimemas, Paraver
142Building Block Other
- Cluster Administration Tools
- Cluster Monitoring Tools
- These tools are part of the Single System Image aspects
143Scalability of Parallel Processors
[Figure: performance versus number of processors for an SMP, a cluster of uniprocessors, and a cluster of SMPs]
144Installing the Operating System
- Which package ?
-
- Which Services ?
- Do I need a graphical environment ?
145Identifying the hardware bottlenecks
- Is my hardware optimal ?
- Can I improve my hardware choices ?
- How can I identify where the problem is?
- Common hardware bottlenecks !!
146Benchmarks
- Synthetic Benchmarks
- Bonnie
- Stream
- NetPerf
- NetPipe
- Applications Benchmarks
- High Performance Linpack
- NAS
147Networking under Linux
148Network Terminology Overview
- IP address: the unique machine address on the net (e.g., 128.169.92.195)
- netmask: determines which portion of the IP address specifies the subnetwork number, and which portion specifies the host on that subnet (e.g., 255.255.255.0)
- network address: the IP address bitwise-ANDed with the netmask (e.g., 128.169.92.0)
- broadcast address: the network address ORed with the negation of the netmask (e.g., 128.169.92.255); a short worked example follows below
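A small C sketch that reproduces the AND/OR arithmetic above for the example addresses (illustrative only, using the standard inet_aton/inet_ntoa calls):

#include <stdio.h>
#include <arpa/inet.h>

int main(void)
{
    struct in_addr ip, mask, net, bcast;

    /* Example addresses from the slide above */
    inet_aton("128.169.92.195", &ip);
    inet_aton("255.255.255.0", &mask);

    net.s_addr   = ip.s_addr & mask.s_addr;    /* network: bitwise AND        */
    bcast.s_addr = net.s_addr | ~mask.s_addr;  /* broadcast: OR with ~netmask */

    /* inet_ntoa reuses a static buffer, so print in separate calls */
    printf("network   %s\n", inet_ntoa(net));    /* 128.169.92.0   */
    printf("broadcast %s\n", inet_ntoa(bcast));  /* 128.169.92.255 */
    return 0;
}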
149Network Terminology Overview
- gateway address: the address of the gateway machine that lives on two different networks and routes packets between them
- name server address: the address of the name server that translates host names into IP addresses
150A Cluster Network
151Network Configuration
- IP Address
- Three private IP address ranges
- 10.0.0.0 to 10.255.255.255; 172.16.0.0 to 172.31.255.255; 192.168.0.0 to 192.168.255.255
- Information on private intranets is available in RFC 1918
- Warning: should not use the IP address 10.0.0.0, 172.16.0.0 or 192.168.0.0 for the server
- A netmask of 255.255.255.0 should be sufficient for most clusters
152Network Configuration
- DHCP Dynamic Host Configuration Protocol
- Advantages
- You can simplify network setup
- Disadvantages
- It is a centralized solution (is it scalable?)
- IP addresses are linked to the Ethernet (MAC) address, which can be a problem if you change the NIC or want to change the hostname routinely
153Network Configuration Files
- /etc/resolv.conf -- configures the name resolver, specifying the following fields
- search (a list of alternate domain names to search for a hostname)
- nameserver (IP addresses of the DNS servers used for name resolution)
- search cse.iitd.ac.in
- nameserver 128.169.93.2
- nameserver 128.169.201.2
154Network Configuration Files
- /etc/hosts -- contains a list of IP addresses and their corresponding hostnames. Used for faster name resolution (no need to query the domain name server to get the IP address)
- 127.0.0.1 localhost localhost.localdomain
- 128.169.92.195 galaxy galaxy.cse.iitd.ac.in
- 192.168.1.100 galaxy galaxy
- 192.168.1.1 star1 star1
- /etc/host.conf -- specifies the order of queries used to resolve host names. Example:
- order hosts, bind -- check /etc/hosts first and then the DNS
- multi on -- allow a host to have multiple IP addresses
155Host-specific Configuration Files
- /etc/conf.modules -- specifies the list of modules (drivers) that have to be loaded by kerneld (see /lib/modules for a full list)
- alias eth0 tulip
- /etc/HOSTNAME -- specifies your system hostname, e.g. galaxy1.cse.iitd.ac.in
- /etc/sysconfig/network -- specifies a gateway host, gateway device, etc.
- NETWORKING=yes
- HOSTNAME=galaxy.cse.iitd.ac.in
- GATEWAY=128.169.92.1
- GATEWAYDEV=eth0
- NISDOMAIN=workshop
156Configure Ethernet Interface
- Loadable Ethernet drivers
- Loadable modules are pieces of object code that can be loaded into a running kernel. They allow Linux to add device drivers to a running Linux system in real time. The loadable Ethernet drivers are found in the /lib/modules/release/net directory