Title: Commodity Computing Clusters - next generation supercomputers?
- Pawel Pisarczyk, ATM S. A.
- pawel.pisarczyk_at_atm.com.pl
2. Agenda
- Introduction
- Supercomputer classification
- Architecture and implementations
- Commodity clusters
- Processors
- Operating systems
- Summary
3. Supercomputer
- "A supercomputer is a device for turning compute-bound problems into I/O-bound problems." (Seymour Cray)
- "A supercomputer is a computer system that leads the world in terms of processing capacity, particularly speed of calculations, at the time of its introduction." (source: http://en.wikipedia.org)
4. Supercomputer History (1)
- 1945-50 - Manchester Mark I
- 1950-55 - MIT Whirlwind
- 1955-60 - IBM 7090 - 210 KFLOPS
- 1960-65 - CDC 6600 - 10.24 MFLOPS
- 1965-70 - CDC 7600 - 32.27 MFLOPS
- 1970-75 - CDC Cyber 76
5. Supercomputer History (2)
- 1975-80 - Cray-1 - 160 MFLOPS
- 1980-85 - Cray X-MP - 500 MFLOPS
- 1985-90 - Cray Y-MP - 1.3 GFLOPS
- 1990-95 - Fujitsu Numerical Wind Tunnel - 236 GFLOPS
- 1995-00 - Intel ASCI Red - 2.150 TFLOPS
- 2000-02 - IBM ASCI White, SP Power3 375 MHz - 7.226 TFLOPS
- 2002-03 - NEC Earth Simulator - 35 TFLOPS
6. Supercomputer Classes (1)
- General-purpose supercomputers:
- vector processing machines: the same operation is carried out on a large amount of data simultaneously
- tightly connected cluster computers (NUMA): communication-oriented architectures engineered from the ground up, based on high-speed interconnects and a large number of processors
- commodity clusters: a large number of commodity (COTS) PCs interconnected by a high-bandwidth, low-latency network
7. Supercomputer Classes (2)
- Special-purpose supercomputers: high-performance computing devices whose hardware architecture is dedicated to solving a single problem (equipped with custom ASICs or FPGA chips)
- Examples:
- Deep Blue (chess)
- GRAPE for astrophysics
8. Flynn taxonomy - 1972 (1)
- SISD - Single Instruction Single Data (DEC, Sun Microsystems, PC)
- SIMD - Single Instruction Multiple Data:
- computers with a large number of processing units (ALUs): CPP DAP Gamma II, Quadrics Apemille
- vector processing machines: NEC SX-6, IA32 MMX
- MISD - Multiple Instruction Single Data:
- theoretical model, no practical implementation
9. Flynn taxonomy - 1972 (2)
- MIMD - Multiple Instruction Multiple Data
- SM-MIMD - Shared Memory MIMD:
- global address space
- SMP systems and ccNUMA systems
- DM-MIMD - Distributed Memory MIMD:
- many nodes with local address spaces
- high-bandwidth, low-latency communication
- common NUMA architectures (Non-Uniform Memory Access)
- the operating system has to be communication-oriented (Mach project)
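A minimal Python sketch of the shared-memory (SM-MIMD) model: several threads run their own instruction streams but read and write one list in a single global address space. The function name `parallel_square_shared` is hypothetical, and in CPython the global interpreter lock prevents a real speedup for this pure-Python loop, so the example only illustrates the memory model, not performance.

```python
import threading


def parallel_square_shared(values, n_threads=4):
    """SM-MIMD sketch: all threads update one list in a shared address space."""
    result = [0] * len(values)  # single shared array, visible to every thread

    def worker(tid):
        # Strided index ranges: thread tid handles tid, tid + n_threads, ...
        # so no two threads ever write the same element.
        for i in range(tid, len(values), n_threads):
            result[i] = values[i] * values[i]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every instruction stream has finished
    return result


print(parallel_square_shared([1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```

In a DM-MIMD system the `result` list could not be shared like this: each node would compute into its own local memory and send the values back as messages.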
10. SM-MIMD implementations
- S-COMA - Simple Cache-Only Memory Architecture
- common SMP systems
- ccNUMA - Cache Coherent NUMA
- SGI Origin 3000
- SGI Altix 3000
- HP SuperDome
11. S-COMA (SMP)
[Figure: CPU 0, CPU 1 ... CPU N, each with its own L2 cache, sharing a single RAM]
12. ccNUMA
[Figure: one ccNUMA node with RAM 0, an L3 cache, and CPU 0 and CPU 1 with their own L2 caches]
13. ccNUMA implementation
- SGI Altix 3000 (ccNUMA)
- 64 Itanium 2 (IA64) processors
- C-brick modules with 2 CPUs and ASIC SHUB
- NUMAflex, NUMAlink interconnects (6.4 GB/s, 2.4 GB/s)
- Modified Linux kernel (2.6 NUMA support)
14. DM-MIMD implementations
- Massively parallel systems (NUMA)
- communication-oriented architecture
- low-latency, high-bandwidth interconnects
- topologies: hypercube, torus, tree
- butterfly networks, Omega networks: the communication subsystem is engineered from the ground up
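The hypercube and torus topologies listed above can be described by how a node finds its direct neighbors. A minimal Python sketch, assuming nodes are numbered 0 to 2^d - 1 in a d-dimensional hypercube and addressed by (x, y, z) coordinates in a 3-D torus; both function names are hypothetical.

```python
def hypercube_neighbors(node, dim):
    """In a dim-dimensional hypercube, neighbors differ from node in exactly one bit."""
    return [node ^ (1 << k) for k in range(dim)]


def torus3d_neighbors(coord, dims):
    """In a 3-D torus each node links to +/-1 along every axis, wrapping at the edges."""
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


print(hypercube_neighbors(5, 3))                # [4, 7, 1]
print(torus3d_neighbors((0, 0, 0), (4, 4, 4)))  # [(1, 0, 0), (3, 0, 0), (0, 1, 0), (0, 3, 0), (0, 0, 1), (0, 0, 3)]
```

A hypercube link therefore connects nodes whose binary labels differ in one bit, while the modulo (wrap-around) step is what turns a 3-D mesh into a torus.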
15. DM-MIMD implementations
- Commodity clusters
- a cluster is a collection of connected, independent computers working in unison to solve a problem
- COTS technology
- nodes are interconnected by Ethernet LAN, Myrinet, QsNet ELAN, etc.
- computation can be performed using popular programming toolkits and frameworks: OpenMP, MPI
- clusters require dedicated management software
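MPI programs on commodity clusters commonly follow a scatter / compute / reduce pattern (in MPI, `MPI_Scatter` distributes data and `MPI_Reduce` combines partial results). The Python sketch below only simulates that pattern: threads stand in for cluster nodes, each "node" works on its own copy of a chunk, and the partial results are combined at the end. The function names and the sum-of-squares workload are hypothetical choices for illustration; a real cluster code would call an MPI library instead.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(data, n_nodes):
    """Split data into n_nodes contiguous chunks (a block distribution)."""
    size, rem = divmod(len(data), n_nodes)
    chunks, start = [], 0
    for rank in range(n_nodes):
        end = start + size + (1 if rank < rem else 0)
        chunks.append(list(data[start:end]))  # copy: each node owns its chunk
        start = end
    return chunks


def local_work(chunk):
    """Computation done independently on one node: sum of squares."""
    return sum(x * x for x in chunk)


def cluster_sum_of_squares(data, n_nodes=4):
    chunks = scatter(data, n_nodes)                      # "scatter" step
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_work, chunks))    # per-node compute
    return sum(partials)                                 # "reduce" step on the root


print(cluster_sum_of_squares(list(range(1, 11))))  # 385
```

The nodes never touch each other's chunks; the only data exchanged are the chunks sent out and the partial sums sent back, which is the communication pattern a distributed-memory cluster relies on.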
16. NUMA implementations
- Cray T3E-1350
- Processor: Alpha 21164, 675 MHz
- Number of CPUs: 40-2176
- 3-D torus topology
- Operating system: UNICOS/mk (microkernel-based)
- Peak performance: 3 TFLOPS
17. Commodity cluster implementation (1)
- Linux Networx/Quadrics
- Processor: Intel Xeon 2.4 GHz
- CPUs: 2304
- Interconnect: QsNet ELAN3
- Operating system: Linux, with management tools and the Lustre Cluster File System
- Linpack (TOP500 Rmax) performance: 7.634 TFLOPS
- 3rd computer on the TOP500 list
- Developed for Lawrence Livermore National Laboratory in 2002
18. Commodity cluster implementation (2)
- HP XC6000 Cluster (XC3000 Cluster values in parentheses)
- Processor: Intel Itanium 2 6M, 1.5 GHz (Intel Xeon 3 GHz)
- Node: HP Integrity rx2600 (HP ProLiant DL380)
- Number of processors: 34-512
- Interconnect: QsNet ELAN3 (Myricom Myrinet XP)
- Operating system: Linux, with SSI middleware, management tools and the Lustre Cluster File System
- Peak performance: 204 GFLOPS with 34 CPUs, 3 TFLOPS with 512 CPUs
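The peak figures for the Cray T3E-1350 and the HP XC6000 follow from a simple product: number of CPUs x clock rate x floating-point operations per cycle. A short Python sketch of that arithmetic, assuming 4 double-precision FLOPs per cycle for Itanium 2 (two fused multiply-add units) and 2 per cycle for the Alpha 21164 (one add and one multiply pipeline); these per-cycle values are assumptions stated here, not figures taken from the slides, and `peak_gflops` is a hypothetical helper.

```python
def peak_gflops(n_cpus, clock_ghz, flops_per_cycle):
    """Theoretical peak = CPUs x clock rate (GHz) x floating-point ops per cycle."""
    return n_cpus * clock_ghz * flops_per_cycle


# Assumption: Itanium 2 completes 4 double-precision FLOPs per cycle (two FMA units).
ITANIUM2_FLOPS_PER_CYCLE = 4
# Assumption: Alpha 21164 completes 2 FLOPs per cycle (one FP add + one FP multiply).
ALPHA21164_FLOPS_PER_CYCLE = 2

systems = {
    "HP XC6000, 34 CPUs": peak_gflops(34, 1.5, ITANIUM2_FLOPS_PER_CYCLE),
    "HP XC6000, 512 CPUs": peak_gflops(512, 1.5, ITANIUM2_FLOPS_PER_CYCLE),
    "Cray T3E-1350, 2176 CPUs": peak_gflops(2176, 0.675, ALPHA21164_FLOPS_PER_CYCLE),
}

for name, gflops in systems.items():
    print(f"{name}: {gflops:.1f} GFLOPS")
# HP XC6000, 34 CPUs: 204.0 GFLOPS
# HP XC6000, 512 CPUs: 3072.0 GFLOPS
# Cray T3E-1350, 2176 CPUs: 2937.6 GFLOPS
```

Under these assumptions the results match the 204 GFLOPS and roughly 3 TFLOPS quoted on the slides, and 1.5 GHz x 4 gives the 6 GFLOPS per Itanium 2. Measured (for example Linpack) performance is lower than this theoretical peak.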
19. Commodity Clusters - software
- Operating system: Linux or SSI (Single System Image) Linux
- Platform for specialized applications in science, engineering and business (simulation, modeling, data mining)
- Distributed computation environments (OpenMP, MPI) are used for software development
- Common supercomputer applications must be ported to clusters
20. Performance Scaling
[Figure: "Scale Right": Scale-Up (SMP, ccNUMA) versus Scale-Out (Cluster)]
21. Processors (1)
- Many types of existing processors are used in supercomputers
- Microprocessor development directions:
- increasing clock frequency and instruction stream processing speed
- processing a large collection of data in a single processor instruction (SIMD)
- multiplying control paths (multithreading)
22. Processors (2)
- Vector processors
- NEC SX-6
- Cray (Cray X1)
- RISC processors
- MIPS
- IBM Power4
- Alpha
- CISC processors
- IA32
- AMD x86-64
- VLIW processors
- IA64
23. Intel Itanium 2 features
- State-of-the-art, unconventional 64-bit architecture
- New programming model implementing the VLIW paradigm
- EPIC (Explicitly Parallel Instruction Computing) technology: the compiler determines instruction dependencies and tells the processor which instructions in the stream can be processed in parallel
- Many registers (128 64-bit general registers), register stack management
- 6 GFLOPS peak performance
- Only a dedicated compiler can exploit the processor's full advantages
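To make the EPIC idea concrete: on IA-64 the compiler (or assembler) marks the end of each instruction group with a stop, written `;;`, and instructions inside one group must not have read-after-write or write-after-write register dependencies on each other. The Python sketch below is a toy model of that step under simplifying assumptions: it keeps program order, tracks only registers, and ignores bundle templates and functional-unit limits. `Instr`, `form_groups` and the sample program are hypothetical.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    text: str              # assembly-like text, for display only
    dest: str              # register written
    srcs: Tuple[str, ...]  # registers read


def form_groups(instrs):
    """Split an in-order instruction list into groups with no RAW/WAW
    register dependencies inside a group; a stop (;;) ends each group."""
    groups, current, written = [], [], set()
    for ins in instrs:
        # RAW: reads a register written earlier in this group.
        # WAW: writes a register already written in this group.
        if written & (set(ins.srcs) | {ins.dest}):
            groups.append(current)
            current, written = [], set()
        current.append(ins)
        written.add(ins.dest)
    if current:
        groups.append(current)
    return groups


program = [
    Instr("ld8 r1 = [r10]", "r1", ("r10",)),
    Instr("ld8 r2 = [r11]", "r2", ("r11",)),
    Instr("add r3 = r1, r2", "r3", ("r1", "r2")),
    Instr("add r4 = r5, r6", "r4", ("r5", "r6")),
    Instr("sub r7 = r3, r4", "r7", ("r3", "r4")),
]

for group in form_groups(program):
    print("  ".join(i.text for i in group) + " ;;")
# ld8 r1 = [r10]  ld8 r2 = [r11] ;;
# add r3 = r1, r2  add r4 = r5, r6 ;;
# sub r7 = r3, r4 ;;
```

The two loads can issue together, the first `add` must wait for them, and the independent second `add` joins its group. A real EPIC compiler also reorders instructions to build larger groups, which is why so much of the processor's performance depends on the compiler.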
24. Operating systems
- Monolithic-kernel OSs: UNIX (modifications of existing systems)
- BSD
- Solaris
- Irix
- Linux
- Microkernel-based OSs
- Mach
25. Microkernel architecture
[Figure: user tasks (Task A, Task B, Task C) above the Kernel and Hardware layers]
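In a microkernel OS such as Mach, services like file systems run as ordinary tasks and the kernel mainly passes messages between them (Mach delivers messages to ports). The Python toy below models only that idea: a hypothetical `ToyMicrokernel` keeps one message queue per port, and Task A obtains data from a file-server task (Task B) through the kernel instead of calling into a monolithic kernel. It is not Mach's actual API; the port names and the file contents are made up for illustration.

```python
from collections import deque


class ToyMicrokernel:
    """Hypothetical model: the kernel only delivers messages between task ports."""

    def __init__(self):
        self.ports = {}

    def create_port(self, name):
        self.ports[name] = deque()

    def send(self, port, message):
        self.ports[port].append(message)

    def receive(self, port):
        queue = self.ports[port]
        return queue.popleft() if queue else None


def file_server_step(kernel):
    """Task B: a user-space 'file server' that answers one read request."""
    request = kernel.receive("fs")
    if request is not None:
        files = {"/etc/motd": "hello"}  # made-up file contents
        kernel.send(request["reply_to"], files.get(request["path"], ""))


kernel = ToyMicrokernel()
kernel.create_port("fs")      # port owned by the file-server task
kernel.create_port("task_a")  # reply port owned by Task A

# Task A asks the file server for a file; the kernel only carries the messages.
kernel.send("fs", {"path": "/etc/motd", "reply_to": "task_a"})
file_server_step(kernel)
print(kernel.receive("task_a"))  # hello
```

Because every service interaction is a message, the same structure extends naturally to tasks on different machines, which is why a communication-oriented kernel such as Mach suits distributed-memory systems.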
26. Summary
- Today there are many supercomputer architectures
- Both vector processors and common RISC, CISC and VLIW chips are used in supercomputers
- Commodity clusters running Linux are an attractive way to implement a supercomputer
27. TOP500 list (1)
1. Earth Simulator, NEC - 35.86 TFLOPS
2. HP AlphaServer SC, HP - 13.88 TFLOPS
3. Linux Networx / Quadrics IA32 - 7.634 TFLOPS
28. TOP500 list (2)