Title: A Tutorial
1A Tutorial
- Designing Cluster Computers and High Performance Storage Architectures
- At HPC ASIA 2002, Bangalore, INDIA, December 16, 2002
- By
Dheeraj Bhardwaj, Department of Computer Science and Engineering, Indian Institute of Technology, Delhi, INDIA
e-mail: dheerajb_at_cse.iitd.ac.in, http://www.cse.iitd.ac.in/dheerajb
N. Seetharama Krishna, Centre for Development of Advanced Computing, Pune University Campus, Pune, INDIA
e-mail: krishna_at_cdacindia.com, http://www.cdacindia.com
2Acknowledgments
- All the contributors of LINUX
- All the contributors of Cluster Technology
- All the contributors in the art and science of parallel computing
- Department of Computer Science and Engineering, IIT Delhi
- Centre for Development of Advanced Computing (C-DAC) and collaborators
3Disclaimer
- The information and examples provided are based on a Red Hat Linux 7.2 installation on Intel PC platforms (our specific hardware specifications)
- Much of it should be applicable to other versions of Linux
- There is no warranty that the materials are error free
- Authors will not be held responsible for any direct, indirect, special, incidental or consequential damages related to any use of these materials
4Part I
- Designing Cluster Computers
5Outline
- Introduction
- Classification of Parallel Computers
- Introduction to Clusters
- Classification of Clusters
- Cluster Components and Issues
- Hardware
- Interconnection Network
- System Software
- Design and Build a Cluster Computer
- Principles of Cluster Design
- Cluster Building Blocks
- Networking Under Linux
- PBS
- PVFS
- Single System Image
6Outline
- Tools for Installation and Management
- Issues related to Installation, Configuration, Monitoring and Management
- NPACI Rocks
- OSCAR
- EU DataGrid WP4 Project
- Other Tools: Scyld Beowulf, OpenMosix, Cplant, SCore
- HPC Applications and Parallel Programming
- HPC Applications
- Issues related to parallel Programming
- Parallel Algorithms
- Parallel Programming Paradigms
- Parallel Programming Models
- Message Passing
- Applications I/O and Parallel File System
- Performance metrics of Parallel Systems
- Conclusion
7Introduction
8What do We Want to Achieve ?
- Develop High Performance Computing (HPC) Infrastructure which is
- Scalable (Parallel, MPP, Grid)
- User Friendly
- Based on Open Source
- Efficient in Problem Solving
- Able to Achieve High Performance
- Able to Handle Large Data Volumes
- Cost Effective
- Develop HPC Applications which are
- Portable (Desktop, Supercomputers, Grid)
- Future Proof
- Grid Ready
9Who Uses HPC ?
- Scientific and Engineering Applications
- Simulation of physical phenomena
- Virtual Prototyping (Modeling)
- Data analysis
- Business/ Industry Applications
- Data warehousing for financial sectors
- E-governance
- Medical Imaging
- Web servers, Search Engines, Digital libraries
- etc ..
- All face similar problems
- Not enough computational resources
- Remote facilities: the network becomes the bottleneck
- Heterogeneous and fast changing systems
10HPC Applications
- Three Types
- High-Capacity: Grand Challenge Applications
- Throughput: running hundreds/thousands of jobs, doing parameter studies, statistical analysis, etc.
- Data: genome analysis, particle physics, astronomical observations, seismic data processing, etc.
- We are seeing a Fundamental Change in HPC Applications
- They have become multidisciplinary
- Require an incredible mix of various technologies and expertise
11Why Parallel Computing ?
- If your application requires more computing power than a sequential computer can provide?!
- You might suggest improving the operating speed of the processor and other components
- We do not disagree with your suggestion, BUT how long can you go?
- We always have the desire and prospects for greater performance
- Parallel Computing is the right answer
12Serial and Parallel Computing
A parallel computer is a collection of processing elements that communicate and co-operate to solve large problems fast.
- PARALLEL COMPUTING
- Fetch/Store
- Compute/communicate
- Cooperative game
- SERIAL COMPUTING
- Fetch/Store
- Compute
13Classification of Parallel Computers
14Classification of Parallel Computers
Flynn Classification: based on the number of instruction and data streams
- SISD (Single Instruction, Single Data): Conventional
- SIMD (Single Instruction, Multiple Data): Data Parallel, Vector Computing
- MISD (Multiple Instruction, Single Data): Systolic Arrays
- MIMD (Multiple Instruction, Multiple Data): very general, multiple approaches
15MIMD Architecture Classification
Current focus is on MIMD model, using general
purpose processors or multicomputers.
16MIMD Shared Memory Architecture
- Source PE writes data to Global Memory; the destination PE retrieves it
- Easy to build
- Limitations: reliability and expandability. A memory component or any processor failure affects the whole system
- Increase of processors leads to memory contention
- Ex.: Silicon Graphics supercomputers
17MIMD Distributed Memory Architecture
[Diagram: three processors, each with its own local memory attached over a memory bus, connected by a high-speed interconnection network]
- Inter-process communication using a high-speed network
- Network can be configured to various topologies, e.g. Tree, Mesh, Cube
- Unlike shared-memory MIMD
- easily/readily expandable
- highly reliable (any CPU failure does not affect the whole system)
18MIMD Features
- MIMD architecture is more general purpose
- MIMD needs clever use of synchronization, which comes from message passing, to prevent race conditions
- Designing efficient message passing algorithms is hard because the data must be distributed in a way that minimizes communication traffic
- Cost of message passing is very high
19Shared Memory (Address-Space) Architecture
- Non-Uniform Memory Access (NUMA): shared address space computer with local and global memories
- Time to access a remote memory bank is longer than the time to access a local word
- Shared address space computers have a local cache at each processor to increase their effective processor bandwidth
- The cache can also be used to provide fast access to remotely located shared data
- Mechanisms have been developed for handling the cache coherence problem
20Shared Memory (Address-Space) Architecture
[Figure: Non-uniform memory access (NUMA) shared-address-space computer with local and global memories (M) connected by an interconnection network]
21Shared Memory (Address-Space) Architecture
[Figure: Non-uniform-memory-access (NUMA) shared-address-space computer with local memory only]
22Shared Memory (Address-Space) Architecture
- Provides hardware support for read and write access by all processors to a shared address space
- Processors interact by modifying data objects stored in the shared address space
- MIMD shared-address-space computers are referred to as multiprocessors
- Uniform Memory Access (UMA): shared address space computer with local and global memories
- Time taken by a processor to access any memory word in the system is identical
23Shared Memory (Address-Space) Architecture
[Figure: Uniform Memory Access (UMA) shared-address-space computer, with processors (P) connected to memories (M) through an interconnection network]
24Uniform Memory Access (UMA)
- Parallel Vector Processors (PVPs)
- Symmetric Multiple Processors (SMPs)
25Parallel Vector Processor
VP: Vector Processor; SM: Shared Memory
26Parallel Vector Processor
- Works well only for vector codes
- Scalar codes may not perform well
- Need to completely rethink and re-express algorithms so that vector instructions are performed almost exclusively
- Special purpose hardware is necessary
- Fastest systems are no longer vector uniprocessors
27Parallel Vector Processor
- Small number of powerful custom-designed vector processors used
- Each processor is capable of at least 1 Gflop/s performance
- A custom-designed, high-bandwidth crossbar switch networks these vector processors
- Most machines do not use caches; rather they use a large number of vector registers and an instruction buffer
- Examples: Cray C-90, Cray T-90, Cray T-3D
28Symmetric Multiprocessors (SMPs)
P/C: Microprocessor and cache; SM: Shared Memory
29Symmetric Multiprocessors (SMPs) characteristics
Symmetric Multiprocessors (SMPs)
- Uses commodity microprocessors with on-chip and off-chip caches
- Processors are connected to a shared memory through a high-speed snoopy bus
- On some SMPs, a crossbar switch is used in addition to the bus
- Scalable up to
- 4-8 processors (non-backplane based)
- a few tens of processors (backplane based)
30Symmetric Multiprocessors (SMPs)
Symmetric Multiprocessors (SMPs) characteristics
- All processors see the same image of all system resources
- Equal priority for all processors (except for the master or boot CPU)
- Memory coherency maintained by HW
- Multiple I/O buses for greater Input/Output
31Symmetric Multiprocessors (SMPs)
[Figure: SMP block diagram, with four processors (each with an L1 cache) sharing memory through a DIR controller, plus an I/O bridge and I/O bus]
32Symmetric Multiprocessors (SMPs)
- Issues
- Bus-based architecture
- Inadequate beyond 8-16 processors
- Crossbar-based architecture
- Multistage approach, considering the I/Os required in hardware
- Clock distribution and HF design issues for backplanes
- Limitation is mainly caused by using a centralized shared memory and a bus or crossbar interconnect, which are both difficult to scale once built
33Commercial Symmetric Multiprocessors (SMPs)
- Sun Ultra Enterprise 10000 (high end, expandable up to 64 processors), Sun Fire
- DEC AlphaServer 8400
- HP 9000
- SGI Origin
- IBM RS 6000
- IBM P690, P630
- Intel Xeon, Itanium, IA-64(McKinley)
34Symmetric Multiprocessors (SMPs)
- Heavily used in commercial applications (databases, on-line transaction systems)
- System is symmetric (every processor has equal access to the shared memory, the I/O devices, and the operating system)
- Being symmetric, a higher degree of parallelism can be achieved
35Massively Parallel Processors (MPPs)
P/C: Microprocessor and cache; LM: Local memory; NIC: Network interface circuitry; MB: Memory bus
36Massively Parallel Processors (MPPs)
- Commodity microprocessors in processing nodes
- Physically distributed memory over processing nodes
- High communication bandwidth and low latency interconnect (high-speed, proprietary communication network)
- Tightly coupled network interface which is connected to the memory bus of a processing node
37Massively Parallel Processors (MPPs)
- Provide proprietary communication software to realize the high performance
- Each processor is connected by a high-speed memory bus to its local memory and to a network interface circuitry (NIC)
- Scaled up to hundreds or even thousands of processors
- Each process has its private address space; processes interact by passing messages
38Massively Parallel Processors (MPPs)
- MPPs support asynchronous MIMD modes
- MPPs support a single system image at different levels
- Microkernel operating system on compute nodes
- Provide a high-speed I/O system
- Examples: Cray T3D, T3E, Intel Paragon, IBM SP2
39Cluster ?
- cluster n.
- A group of the same or similar elements gathered or occurring closely together; a bunch: "She held out her hand, a small tight cluster of fingers" (Anne Tyler)
- Linguistics: two or more successive consonants in a word, as cl and st in the word cluster
A Cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource.
40Cluster System Architecture
[Diagram: cluster system architecture. A programming environment (Java, C, Fortran, MPI, PVM), web windows and other subsystems, and a user interface (Database, OLTP) sit on top of a Single System Image infrastructure and an Availability infrastructure, which run across multiple OS nodes connected by an interconnect]
41Clusters ?
- A set of
- Nodes physically connected over a commodity/proprietary network
- Gluing software
- Other than this definition, no official standard exists
- Depends on the user requirements
- Commercial
- Academic
- Good way to sell old wine in a new bottle
- Budget
- Etc.
- Designing clusters is not obvious, but a critical issue
42Why Clusters NOW?
- Clusters gained momentum when three technologies converged
- Very high performance microprocessors
- workstation performance = yesterday's supercomputers
- High speed communication
- Standard tools for parallel/distributed computing and their growing popularity
- Time to market > performance
- Internet services: huge demands for scalable, available, dedicated internet servers
- big I/O, big computing power
43How should we Design them ?
- Components
- Should they be off-the-shelf and low cost?
- Should they be specially built?
- Is a mixture a possibility?
- Structure
- Should each node be in a different box (workstation)?
- Should everything be in a box?
- Should everything be in a chip?
- Kind of nodes
- Should it be homogeneous?
- Can it be heterogeneous?
44What Should it offer ?
- Identity
- Should each node maintain its identity (and owner)?
- Should it be a pool of nodes?
- Availability
- How far should it go?
- Single-system Image
- How far should it go?
45Place for Clusters in HPC world ?
[Figure: spectrum of distance between nodes, from a chip (shared-memory parallel computing), to a box, a room, a building, and the world (distributed computing). Source: Toni Cortes (toni_at_ac.upc.es)]
46Where Do Clusters Fit?
[Figure: where clusters fit, on a spectrum from MP systems (ASCI Red Tflops, Superclusters, Beowulf) to distributed systems (Berkeley NOW, Condor, Legion/Globus, SETI_at_home, the Internet), with roughly 1 TF/s and 15 TF/s delivered at the two ends]
- MP systems
- Bounded set of resources
- Apps grow to consume all cycles
- Application manages resources
- System SW gets in the way
- 5% overhead is maximum
- Apps drive purchase of equipment
- Real-time constraints
- Space-shared
- Distributed systems
- Gather (unused) resources
- System SW manages resources
- System SW adds value
- 10-20% overhead is OK
- Resources drive applications
- Time to completion is not critical
- Time-shared
- Commercial: PopularPower, United Devices, Centrata, ProcessTree, Applied Meta, etc.
Src: B. Maccabe (UNM), R. Pennington (NCSA)
47Top 500 Supercomputers
Rank | Computer / Procs | Peak performance | Site, Country / Year
1 | Earth Simulator (NEC) / 5120 | 40960 GF | Japan / 2002
2 | ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096 | 10240 GF | LANL, USA / 2002
3 | ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096 | 10240 GF | LANL, USA / 2002
4 | ASCI White (IBM) SP Power3 375 MHz / 8192 | 12288 GF | LLNL, USA / 2000
5 | MCR Linux Cluster, Xeon 2.4 GHz, Quadrics / 2304 | 11060 GF | LLNL, USA / 2002
48What makes the Clusters ?
- The same hardware used for
- Distributed computing
- Cluster computing
- Grid computing
- Software converts the hardware into a cluster
- Ties everything together
49Task Distribution
- The hardware is responsible for
- High-performance
- High-availability
- Scalability (network)
- The software is responsible for
- Gluing the hardware
- Single-system image
- Scalability
- High-availability
- High-performance
50Classification of Cluster Computers
51Clusters Classification 1
- Based on Focus (in Market)
- High performance (HP) clusters
- Grand challenge applications
- High availability (HA) clusters
- Mission critical applications
- Web/e-mail
- Search engines
52HA Clusters
53Clusters Classification 2
- Based on Workstation/PC Ownership
- Dedicated clusters
- Non-dedicated clusters
- Adaptive parallel computing
- Can be used for CPU cycle stealing
54Clusters Classification 3
- Based on Node Architecture
- Clusters of PCs (CoPs)
- Clusters of Workstations (COWs)
- Clusters of SMPs (CLUMPs)
55Clusters Classification 4
- Based on Node Components' Architecture and Configuration
- Homogeneous clusters
- All nodes have similar configuration
- Heterogeneous clusters
- Nodes based on different processors and running
different OS
56Clusters Classification 5
- Based on Node OS Type..
- Linux Clusters (Beowulf)
- Solaris Clusters (Berkeley NOW)
- NT Clusters (HPVM)
- AIX Clusters (IBM SP2)
- SCO/Compaq Clusters (Unixware)
- Digital VMS Clusters, HP Clusters, ...
57Clusters Classification 6
- Based on Levels of Clustering
- Group clusters (nodes: 2-99)
- A set of dedicated/non-dedicated computers, mainly connected by a SAN like Myrinet
- Departmental clusters (nodes: 99-999)
- Organizational clusters (nodes: many 100s)
- Internet-wide clusters / Global clusters (nodes: 1000s to many millions)
- Computational Grid
58Clustering Evolution
[Figure: clustering evolution over time (1990 to 2005), with cost and complexity increasing across generations: 1st Gen. MPP Supercomputers, 2nd Gen. Beowulf Clusters, 3rd Gen. Commercial Grade Clusters, 4th Gen. Network Transparent Clusters]
59Cluster Components
60Hardware
61Nodes
- The idea is to use standard off-the-shelf processors
- Pentium class (Intel, AMD)
- Sun
- HP
- IBM
- SGI
- No special development for clusters
62Interconnection Network
63Interconnection Network
- One of the key points in clusters
- Technical objectives
- High bandwidth
- Low latency
- Reliability
- Scalability
64Network Design Issues
- Plenty of work has been done to improve networks for clusters
- Main design issues
- Physical layer
- Routing
- Switching
- Error detection and correction
- Collective operations
65Physical Layer
- Trade-off between
- Raw data transfer rate and cable cost
- Bit width
- Serial media (Ethernet, Fiber Channel)
- Moderate bandwidth
- 64-bit wide cable (HIPPI)
- Pin count limits the implementation of switches
- 8-bit wide cable (Myrinet, ServerNet)
- Good compromise
66Routing
- Source-path
- The entire path is attached to the message at its source location (a tiny sketch follows below)
- Each switch deletes the current head of the path
- Table-based routing
- The header only contains the destination node
- Each switch has a table to help in the decision
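To make source-path routing concrete, here is a toy C sketch (purely illustrative; the struct and function names are made up and do not come from any real switch API). The sender attaches the whole path, a list of output ports, to the message, and every switch consumes the current head entry:

#include <stdio.h>

#define MAX_HOPS 8

struct message {
    int path[MAX_HOPS];   /* output port to take at each switch */
    int hops_left;        /* how many path entries remain       */
    const char *payload;
};

/* What one switch does: pop the head of the path and forward on it. */
static void switch_forward(struct message *m, int switch_id)
{
    int out_port = m->path[0];
    int i;

    for (i = 1; i < m->hops_left; i++)   /* delete the current head */
        m->path[i - 1] = m->path[i];
    m->hops_left--;

    printf("switch %d forwards \"%s\" on port %d\n",
           switch_id, m->payload, out_port);
}

int main(void)
{
    struct message m = { {2, 0, 3}, 3, "hello" };  /* path chosen at the source */
    int s;

    for (s = 0; m.hops_left > 0; s++)
        switch_forward(&m, s);
    return 0;
}

Table-based routing would instead carry only the destination ID in the header and let each switch look the output port up in its own table.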
67Switching
- Packet switching
- Packets are buffered in the switch before being resent
- Implies an upper-bound packet size
- Needs buffers in the switch
- Used by traditional LAN/WAN networks
- Wormhole switching
- Data is immediately forwarded to the next stage
- Low latency
- No buffers are needed
- Error correction is more difficult
- Used by SANs such as Myrinet, PARAMNet
68Flow Control
- Credit-based design
- The receiver grants credit to the sender
- The sender can only send if it has enough credit
- On-Off
- The receiver informs the sender whether it can or cannot accept new packets (a toy sketch of the credit-based scheme follows below)
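A toy C sketch of the credit-based scheme described above (illustrative only; the names and the single-threaded simulation are assumptions, not real NIC code). The receiver grants one credit per receive buffer; the sender consumes a credit per packet and stalls at zero until credits come back:

#include <stdio.h>

#define RX_BUFFERS 4

static int credits = RX_BUFFERS;   /* granted by the receiver at start-up */

static int try_send(int packet_id)
{
    if (credits == 0) {
        printf("packet %d stalled: no credit\n", packet_id);
        return 0;                  /* sender must wait */
    }
    credits--;                     /* one credit consumed per packet */
    printf("packet %d sent (credits left %d)\n", packet_id, credits);
    return 1;
}

static void receiver_drains(int n)  /* receiver empties n buffers */
{
    credits += n;                  /* credits returned to the sender */
}

int main(void)
{
    int i;
    for (i = 0; i < 6; i++)
        if (!try_send(i)) {        /* stalled: wait for the receiver */
            receiver_drains(2);
            try_send(i);
        }
    return 0;
}

An On-Off scheme replaces the credit counter with a single stop/go signal from the receiver.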
69Error Detection
- It has to be done at hardware level
- Performance reasons
- e.g., CRC checking is done by the network interface
- Networks are very reliable
- Only erroneous messages should see overhead
70Collective Operations
- These operations are mainly
- Barrier
- Multicast
- Few interconnects offer this characteristic
- Synfinity is a good example
- Normally offered by software
- Easy to achieve in bus-based networks like Ethernet
- Difficult to achieve in point-to-point networks like Myrinet
71Examples of Network
- The most common networks used are
- Ethernet
- SCI
- Myrinet
- PARAMNet
- HIPPI
- ATM
- Fiber Channel
- AmpNet
- Etc.
72Ethernet
- Most widely used for LAN
- Affordable
- Serial transmission
- Packet switching and table-based routing
- Types of Ethernet
- Ethernet and Fast Ethernet
- Based on collision domains (buses)
- Switched hubs can make different collision domains
- Gigabit Ethernet
- Based on high-speed point-to-point switches
- Each node is in its own collision domain
73ATM
- Standard designed for the telecommunication industry
- Relatively expensive
- Serial
- Packet switching and table-based routing
- Designed around the concept of fixed-size packets
- Special characteristics
- Well designed for real-time systems
74Scalable Coherent Interface (SCI)
- First standard specially designed for CC
- Low layer
- Point-to-point architecture, but maintains bus functionality
- Packet switching and table-based routing
- Split transactions
- Dolphin Interconnect Solutions, Sun SPARC Sbus
- High Layer
- Defines a distributed cache-coherent scheme
- Allows transparent shared memory programming
- Sequent NUMA-Q, Data General AViiON NUMA
75PARAMNet / Myrinet
- Low-latency and High-bandwidth network
- Characteristics
- Byte-wise links
- Wormhole switching and source-path routing
- Low-latency cut-through routing switches
- Automatic mapping, which favors fault tolerance
- Zero-copying is not possible
- Programmable on-board processor
- Allows experimentation with new protocols
76Comparison
77Communication Protocols
- Traditional protocols
- TCP and UDP
- Specially designed
- Active messages
- VMMC
- BIP
- VIA
- Etc.
78Data Transfer
- User-level lightweight communication
- Avoid OS calls
- Avoid data copying
- Examples
- Fast messages, BIP, ...
- Kernel-level lightweight communication
- Simplified protocols
- Avoid data copying
- Examples
- GAMMA, PM, ...
79TCP and UDP
- First messaging libraries used
- TCP is reliable
- UDP is not reliable
- Advantages
- Standard and well known
- Disadvantages
- Too much overhead (especially for fast networks)
- Plenty of OS interaction
- Many copies
80Active Messages
- Low-latency communication library
- Main issues
- Zero-copying protocol
- Messages copied directly
- to/from the network
- to/from the user-address space
- Receiver memory has to be pinned
- There is no need for a receive operation
81VMMC
- Virtual-Memory Mapped Communication
- View messages as read and writes to memory
- Similar to distributed shared memory
- Makes a correspondence between
- A virtual page at the receiving side
- A virtual page at the sending side
82BIP
- Basic Interface for Parallelism
- Low-level message-layer for Myrinet
- Uses various protocols for various message sizes
- Tries to achieve zero copies (one at most)
- Used via MPI by programmers
- 7.6us latency
- 107 Mbytes/s bandwidth
83VIA
- Virtual Interface Architecture
- First standard promoted by the industry
- Combines the best features of academic projects
- Interface
- Designed to be used by programmers directly
- Many programmers believe it to be too low level
- Higher-level APIs are expected
- NICs with VIA implemented in hardware
- This is the proposed path
84Potential and Limitations
- High bandwidth
- Can be achieved at low cost
- Low latency
- Can be achieved, but at high cost
- The lower the latency, the closer we get to a traditional supercomputer
- Reliability
- Can be achieved at low cost
- Scalability
- Easy to achieve for the size of clusters
85System Software
- Operating system vs. middleware
- Processor management
- Memory management
- I/O management
- Single-system image
- Monitoring clusters
- High Availability
- Potential and limitations
86Operating system vs. Middleware
- Operating system
- Hardware-control layer
- Middleware
- Gluing layer
- The barrier is not always clear
- Similar
- User level
- Kernel level
87System Software
- We will not distinguish between
- Operating system
- Middleware
- The middleware related to the operating system
- Objectives
- Performance/Scalability
- Robustness
- Single-system image
- Extendibility
- Scalability
- Heterogeneity
88Processor Management
- Schedule jobs onto the nodes
- Scheduling policies should take into account
- Needed vs. available resources
- Processors
- Memory
- I/O requirements
- Execution-time limits
- Priorities
- Different kind of jobs
- Sequential and parallel jobs
89Load Balancing
- Problem
- A perfect static balance is not possible
- Execution time of jobs is unknown
- Unbalanced systems may not be efficient
- Solution
- Process migration
- Prior to execution
- Granularity must be small
- During execution
- Cost must be evaluated
90Fault Tolerance
- Large clusters must be fault tolerant
- The probability of a fault is quite high
- Solution
- Re-execution of applications in the failed node
- Not always possible or acceptable
- Checkpointing and migration
- It may have a high overhead
- Difficult with some kinds of applications
- Applications that modify the environment
- Transactional behavior may be a solution
91Managing Heterogeneous Systems
- Compatible nodes but different characteristics
- It becomes a load balancing problem
- Non compatible nodes
- Binaries for each kind of node are needed
- Shared data has to be in a compatible format
- Migration becomes nearly impossible
92Scheduling Systems
- Kernel level
- Very few take care of cluster scheduling
- High-level applications do the scheduling
- Distribute the work
- Migrate processes
- Balance the load
- Interact with the users
- Examples
- CODINE, CONDOR, NQS, etc
93Memory Management
- Objective
- Use all the memory available in the cluster
- Basic approaches
- Software distributed-shared memory
- General purpose
- Specific usage of idle remote memory
- Specific purpose
- Remote memory paging
- File-system caches or RAMdisks (described later)
94Software Distributed Shared Memory
- Software layer
- Allows applications running on different nodes to share memory regions
- Relatively transparent to the programmer
- Address-space structure
- Single address space
- Completely transparent to the programmer
- Shared areas
- Applications have to mark a given region as shared
- Not completely transparent
- Approach mostly used due to its simplicity
95Main Data Problems to be Solved
- Data consistency vs. Performance
- Strict semantics are very inefficient
- Current systems offer relaxed semantics
- Data location (finding the data)
- The most common solution is the owner node
- This node may be fixed or vary dynamically
- Granularity
- Usually a fixed block size is implemented
- Hardware MMU restrictions
- Leads to false sharing
- Variable granularity being studied
96Other Problems to be Solved
- Synchronization
- Test-and-set-like mechanisms cannot be used
- SDSM systems have to offer new mechanisms
- i.e. semaphores (message passing implementation)
- Fault tolerance
- Very important and very seldom implemented
- Multiple copies
- Heterogeneity
- Different page sizes
- Different data-type implementations
- Use tags
97Remote-Memory Paging
- Keep swapped-out pages in idle memory
- Assumptions
- Many workstations are idle
- Disks are much slower than Remote memory
- Idea
- Send swapped-out pages to idle workstations
- When there is no remote memory space, use disks
- Replicate copies to increase fault tolerance
- Examples
- The global memory service (GMS)
- Remote memory pager
98I/O Management
- Advances closely track parallel I/O
- There are two major differences
- Network latency
- Heterogeneity
- Interesting issues
- Network configurations
- Data distribution
- Name resolution
- Memory to increase I/O performance
99Network Configurations
- Device location
- Attached to nodes
- Very easy to have (use the disks in the nodes)
- Network attached devices
- I/O bandwidth is not limited by memory bandwidth
- Number of networks
- Only one network for everything
- One special network for I/O traffic (SAN)
- Becoming very popular
100Data Distribution
- Distribution per files
- Each node has its own independent file system
- Like in distributed file systems (NFS, Andrew, CODA, ...)
- Each node keeps a set of files locally
- It allows remote access to its files
- Performance
- Maximum performance = device performance
- Parallel access only to different files
- Remote file access depends on the network
- Caches help but increase complexity (coherence)
- Tolerance
- File replication in different nodes
101Data Distribution
- Distribution per blocks
- Also known as Software/Parallel RAIDs
- xFS, Zebra, RAMA, ...
- Blocks are interleaved among all disks
- Performance
- Parallel access to blocks in the same file
- Parallel access to different files
- Requires a fast network
- Usually solved with a SAN
- Especially good for large requests (multimedia)
- Fault tolerance
- RAID levels (3, 4 and 5)
102Name Resolution
- Same as in distributed systems
- Mounting remote file systems
- Useful when the distribution is per files
- Distributed name resolution
- Useful when the distribution is per files
- Returns the node where the file resides
- Useful when the distribution is per blocks
- Returns the node where the file's meta-data is located
103Caching
- Caching can be done at multiple levels
- Disk controller
- Disk servers
- Client nodes
- I/O libraries
- etc.
- Good to have several levels of cache
- High levels decrease hit ratio of low levels
- Higher level caches absorb most of the locality
104Cooperative Caching
- Problem of traditional caches
- Each node caches the data it needs
- Plenty of replication
- Memory space not well used
- Increase the coordination of the caches
- Clients know what other clients are caching
- Clients can access cached data in remote nodes
- Replication in the cache is reduced
- Better use of the memory
105RAMdisks
- Assumptions
- Disks are slow and memory/network is fast
- Disks are persistent and memory is not
- Build a disk by unifying idle remote RAM
- Only used for non-persistent data
- Temporary data
- Useful in many applications
- Compilations
- Web proxies
- ...
106Single-System Image
- SSI offers the idea that the cluster is a single
machine - It can be done a several levels
- Hardware
- Hardware DSM
- System software
- It can offer a unified view to applications
- Application
- It can offer a unified view to the user
- All SSI have a boundary
107Key Services of SSI
- Main services offered by SSI
- Single point of entry
- Single file hierarchy
- Single I/O Space
- Single point of management
- Single virtual networking
- Single job/resource management system
- Single process space
- Single user interface
- Not all of them are always available
108Monitoring Clusters
- Clusters need tools to be monitored
- Administrators have many things to check
- The cluster must be visible from a single point
- Subjects of monitoring
- Physical environment
- Temperature, power, ..
- Logical services
- RPCs, NFS, ...
- Performance meters
- Paging, CPU load, ...
109Monitoring Heterogeneous Clusters
- Monitoring is especially necessary in heterogeneous clusters
- Several node types
- Several operating systems
- The tool should hide the differences
- The real characteristics are only needed to solve some problems
- Closely related to Single-System Image
110Auto-Administration
- Monitors know how to perform self-diagnosis
- Next step is to run corrective procedures
- Some systems start to do so (NetSaint)
- Difficult because tools do not have common sense
- This step is necessary
- Many nodes
- Many devices
- Many possible problems
- High probability of error
111High Availability
- One of the key points for clusters
- Especially needed for commercial applications
- 7 days a week and 24 hours a day
- Not necessarily very scalable (32 nodes)
- Based on many issues already described
- Single-system image
- Hide any possible change in the configuration
- Monitoring tools
- Detect the errors to be able to correct them
- Process migration
- Restart/continue applications in running nodes
112 - Design and Build a Cluster Computer
113Cluster Design
- Clusters are good as personal supercomputers
- Clusters are often not good as general purpose multi-user production machines
- Building such a cluster requires planning and understanding design tradeoffs
114Scalable Cluster Design Principles
- Principle of Independence
- Principle of Balanced Design
- Principle of design for Scalability
- Principle of Latency hiding
115Principle of independence
- Components (hardware and software) of the system should be independent of one another
- Incremental scaling: scaling up a system along one dimension by improving one component, independent of others
- For example, upgrading the processor to the next generation should let the system operate at higher performance without upgrading other components
- Should enable heterogeneity scalability
116Principle of independence
- The components' independence can result in cost cutting
- The component becomes a commodity, with the following features
- Open architecture with standard interfaces to the rest of the system
- Off-the-shelf product or public domain
- Multiple vendors in the open market with large volume
- Relatively mature
- For all these reasons the commodity component has low cost, high availability and reliability
117Principle of independence
- Independence principle and application examples
- The algorithm should be independent of the architecture
- The application should be independent of the platform
- The programming language should be independent of the machine
- The language should be modular and have orthogonal features
- The node should be independent of the network, and the network interface should be independent of the network topology
- Caveat
- In any parallel system, there is usually some key component/technique that is novel
- We cannot build an efficient system by simply scaling up one or a few components
- Design should be balanced
118Principle of Balanced Design
- Minimize any performance bottleneck
- Should avoid an unbalanced system design, where a slow component degrades the performance of the entire system
- Should avoid single points of failure
- Example
- The PetaFLOP project: the memory requirement for a wide range of scientific/engineering applications
- Memory (GB) = Speed^(3/4) (Gflop/s)
- 30 TB of memory is appropriate for a Pflop/s machine (worked out below)
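A quick check of the 30 TB figure, assuming the three-quarters-power rule quoted above (a sketch in LaTeX, not from the original slides):

\[
  \text{Memory [GB]} \approx \text{Speed [Gflop/s]}^{3/4}
  = (10^{6})^{3/4}
  = 10^{4.5}
  \approx 3.2 \times 10^{4}\ \text{GB}
  \approx 30\ \text{TB}
\]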
119Principle of Design for Scalability
- Provision must be made so that the system can either scale up to provide higher performance
- Or scale down to allow affordability or greater cost-effectiveness
- Two approaches
- Overdesign
- Example: modern processors support a 64-bit address space. This huge address space may not be fully utilized by a Unix supporting 32-bit addresses, but the overdesign makes the transition of the OS from 32-bit to 64-bit much easier
- Backward compatibility
- Example: a parallel program designed to run on n nodes should be able to run on a single node, maybe with reduced input data
120Principle of Latency Hiding
- Future scalable systems are most likely to use a distributed shared-memory architecture
- Access to remote memory may experience long latencies
- Example: GRID
- Scalable multiprocessors/clusters must rely on the use of
- Latency hiding
- Latency avoiding
- Latency reduction
121Cluster Building
- Conventional wisdom: building a cluster is easy
- Recipe
- Buy hardware from a computer shop
- Install Linux, connect the machines via a network
- Configure NFS, NIS
- Install your application, run and be happy
- Building it right is a little more difficult
- Multi-user cluster, security, performance tools
- Basic question: what works reliably?
- Building it to be compatible with the Grid
- Compilers, libraries
- Accounts, file storage, reproducibility
- Hardware configuration may be an issue
122 - How do people think of parallel programming and using clusters ...
123Panther Cluster
- Picked 8 PCs and named them after the panther family
- Connected them by a network and set up this cluster in a small lab
- Using the Panther Cluster
- Select a PC and log in
- Edit and compile the code
- Execute the program
- Analyze the results
Cheeta
Tiger
Kitten
Cat
Jaguar
Leopard
Panther
Lion
124Panther Cluster - Programming
- Explicit parallel programming isn't easy. You really have to do it yourself
- Network bandwidth and latency matter
- There are good reasons for security patches
- Oops! Lion does not have floating point
125Panther Cluster Attacks Users
- Grad students wanted to use the cool cluster. They each need only half (a half other than Lion)
- Grad students discover that using the same PC at the same time is incredibly bad
- A solution would be to use parts of the cluster exclusively, for one job at a time
- And so...
126We Discover Scheduling
- We tried
- A sign up sheet
- Yelling across the yard
- A mailing list
- finger schedule
-
- A scheduler
[Diagram: a scheduler queue dispatching Job 1, Job 2 and Job 3 onto the cluster nodes]
127Panther Expands
- Panther expands, adding more users and more systems
- Use the Panther node for
- Login
- File services
- Scheduling services
- ...
- All other nodes are compute nodes
[Diagram: Panther serves as the head node; the remaining named nodes and the newly added PCs (PC1-PC9) are compute nodes]
128Evolution of Cluster Services
The Cluster grows
[Diagram: as the cluster grows, the login, file service, scheduling, management and I/O services that initially run together on one node are progressively split onto dedicated nodes, improving computing performance as well as system reliability and manageability]
129Compute _at_ Panther
- Usage Model
- Login to login node
- Compile and test code
- Schedule a test run
- Schedule a serious run
- Carry out I/O through I/O node
- Management Model
- The compute nodes are identical
- Users use Login, I/O and compute nodes
- All I/O requests are managed by Metadata server
[Diagram: the cluster with dedicated Login, File, Scheduling, Management and I/O nodes, plus the named compute nodes and PC1-PC9]
130Cluster Building Block
131Building Blocks - Hardware
- Processor
- Complex Instruction Set Computer (CISC)
- x86, Pentium Pro, Pentium II, III, IV
- Reduced Instruction Set Computer (RISC)
- SPARC, RS6000, PA-RISC, PPC, Power PC
- Explicitly Parallel Instruction Computer (EPIC)
- IA-64 (McKinley), Itanium
132Building Blocks - Hardware
- Memory
- Extended Data Out (EDO)
- pipelining by overlapping the next memory access
- 50 - 60 ns
- DRAM and SDRAM
- Dynamic RAM and Synchronous DRAM (no pairs)
- 13 ns
- PC100 and PC133
- 7 ns and less
133Building Blocks - Hardware
- Cache
- L1 - 4 ns
- L2 - 5 ns
- L3 (off the chip) - 30 ns
- Celeron
- 0 - 512 KB
- Intel Xeon chips
- 512 KB - 2MB L2 Cache
- Intel Itanium
- 512KB -
- Most processors have at least 256 KB
134Building Blocks - Hardware
- Disks and I/O
- IDE and EIDE
- IBM 75 GB 7200 rpm disk w/ 2MB onboard cache
- SCSI I, II, III and SCA
- 5400, 7400, and 10000 rpm
- 20 MB/s, 40 MB/s, 80 MB/s, 160 MB/s
- Can chain from 6-15 disks
- RAID Sets
- software and hardware
- best for dealing with parallel I/O
- reserved cache for writes before flushing to disks
135Building Blocks - Hardware
- System Bus
- ISA
- 5 MHz - 13 MHz
- 32-bit PCI
- 33 MHz
- 133 MB/s
- 64-bit PCI
- 66 MHz
- 266 MB/s
136Building Blocks - Hardware
- Network Interface Cards (NICs)
- Ethernet - 10 Mbps, 100 Mbps, 1 Gbps
- ATM - 155 Mbps and higher
- Quality of Service (QoS)
- Scalable Coherent Interface (SCI)
- 12 microseconds latency
- Myrinet - 1.28 Gbps
- 120 MB/s
- 5 microseconds latency
- PARAMNet - 2.5 Gbps
137Building Blocks Operating System
- Solaris - Sun
- AIX - IBM
- HPUX - HP
- IRIX - SGI
- Linux - everyone!
- Is architecture independent
- Windows NT/2000
138Building Blocks - Compilers
- Commercial
- Portland Group Incorporated (PGI)
- C, C++, F77, F90
- Not as expensive as vendor-specific compilers, and compiles most applications
- GNU
- gcc, g++, g77, VAST f90
- free!
139Building Blocks - Scheduler
- Cron, at (NT/2000)
- Condor
- IBM Loadleveler
- LSF
- Portable Batch System (PBS)
- Maui Scheduler
- GLOBUS
- All free, run on more than one OS!
140Building Blocks Message Passing
- Commercial and free
- Naturally Parallel, Highly Parallel
- Condor
- High Throughput Computing (HTC)
- Parallel Virtual Machine (PVM)
- Oak Ridge National Laboratory
- Message Passing Interface (MPI), see the minimal example below
- MPICH from ANL
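A minimal MPI example in C (a sketch, not taken from the tutorial; the file name hello.c is assumed). Rank 0 sends an integer to every other rank, which is the basic send/receive pattern the libraries above provide:

/* hello.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int msg = 42, i;
        for (i = 1; i < size; i++)               /* send to every worker */
            MPI_Send(&msg, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
        printf("Rank 0 of %d sent value %d\n", size, msg);
    } else {
        int msg;
        MPI_Status status;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank %d received %d\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}

With MPICH this would typically be built with mpicc hello.c -o hello and launched with mpirun -np 4 ./hello.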
141Building Blocks Debugging and Analysis
- Parallel Debuggers
- TotalView
- GUI based
- Performance Analysis Tools
- monitoring library calls and runtime analysis
- AIMS, MPE, Pablo
- Paradyn (from Wisconsin)
- SvPablo, Vampir, Dimemas, Paraver
142Building Block Other
- Cluster Administration Tools
- Cluster Monitoring Tools
- These tools are part of the Single System Image aspects
143Scalability of Parallel Processors
[Figure: performance versus number of processors for an SMP, a cluster of uniprocessors, and a cluster of SMPs]
144Installing the Operating System
- Which package ?
-
- Which Services ?
- Do I need a graphical environment ?
145Identifying the hardware bottlenecks
- Is my hardware optimal ?
- Can I improve my hardware choices ?
- How can I identify where the problem is?
- Common hardware bottlenecks !!
146Benchmarks
- Synthetic Benchmarks
- Bonnie
- Stream
- NetPerf
- NetPipe
- Applications Benchmarks
- High Performance Linpack
- NAS
147Networking under Linux
148Network Terminology Overview
- IP address: the unique machine address on the net (e.g., 128.169.92.195)
- netmask: determines which portion of the IP address specifies the subnetwork number, and which portion specifies the host on that subnet (e.g., 255.255.255.0)
- network address: the IP address bitwise-ANDed with the netmask (e.g., 128.169.92.0)
- broadcast address: the network address ORed with the negation of the netmask (e.g., 128.169.92.255); a short worked example follows below
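A small C sketch that reproduces the AND/OR arithmetic above for the example addresses (illustrative only, using the standard inet_aton/inet_ntoa calls):

#include <stdio.h>
#include <arpa/inet.h>

int main(void)
{
    struct in_addr ip, mask, net, bcast;

    /* Example addresses from the slide above */
    inet_aton("128.169.92.195", &ip);
    inet_aton("255.255.255.0", &mask);

    net.s_addr   = ip.s_addr & mask.s_addr;    /* network: bitwise AND        */
    bcast.s_addr = net.s_addr | ~mask.s_addr;  /* broadcast: OR with ~netmask */

    /* inet_ntoa reuses a static buffer, so print in separate calls */
    printf("network   %s\n", inet_ntoa(net));    /* 128.169.92.0   */
    printf("broadcast %s\n", inet_ntoa(bcast));  /* 128.169.92.255 */
    return 0;
}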
149Network Terminology Overview
- gateway address: the address of the gateway machine that lives on two different networks and routes packets between them
- name server address: the address of the name server that translates host names into IP addresses
150A Cluster Network
151Network Configuration
- IP Address
- Three private IP address ranges
- 10.0.0.0 to 10.255.255.255; 172.16.0.0 to 172.31.255.255; 192.168.0.0 to 192.168.255.255
- Information on private intranets is available in RFC 1918
- Warning: should not use the IP address 10.0.0.0, 172.16.0.0 or 192.168.0.0 for the server
- A netmask of 255.255.255.0 should be sufficient for most clusters
152Network Configuration
- DHCP Dynamic Host Configuration Protocol
- Advantages
- You can simplify network setup
- Disadvantages
- It is a centralized solution (is it scalable?)
- IP addresses are linked to the Ethernet (MAC) address, which can be a problem if you change the NIC or want to change the hostname routinely
153Network Configuration Files
- /etc/resolv.conf -- configures the name resolver, specifying the following fields
- search (a list of alternate domain names to search for a hostname)
- nameserver (IP addresses of the DNS servers used for name resolution)
- search cse.iitd.ac.in
- nameserver 128.169.93.2
- nameserver 128.169.201.2
154Network Configuration Files
- /etc/hosts -- contains a list of IP addresses and their corresponding hostnames. Used for faster name resolution (no need to query the domain name server to get the IP address)
- 127.0.0.1 localhost localhost.localdomain
- 128.169.92.195 galaxy galaxy.cse.iitd.ac.in
- 192.168.1.100 galaxy galaxy
- 192.168.1.1 star1 star1
- /etc/host.conf -- specifies the order of queries used to resolve host names. Example:
- order hosts, bind -- check /etc/hosts first and then the DNS
- multi on -- allow a host to have multiple IP addresses
155Host-specific Configuration Files
- /etc/conf.modules -- specifies the list of modules (drivers) that have to be loaded by kerneld (see /lib/modules for a full list)
- alias eth0 tulip
- /etc/HOSTNAME -- specifies your system hostname, e.g. galaxy1.cse.iitd.ac.in
- /etc/sysconfig/network -- specifies a gateway host, gateway device, etc.
- NETWORKING=yes
- HOSTNAME=galaxy.cse.iitd.ac.in
- GATEWAY=128.169.92.1
- GATEWAYDEV=eth0
- NISDOMAIN=workshop
156Configure Ethernet Interface
- Loadable Ethernet drivers
- Loadable modules are pieces of object code that can be loaded into a running kernel. They allow Linux to add device drivers to a running Linux system in real time. The loadable Ethernet drivers are found in the /lib/modules/release/net directory