Title: Commodity processor with commodity interprocessor connection
1. Commercial Parallel Computer Architecture
Loosely coupled to tightly coupled:
- Commodity processor with commodity inter-processor connection
  - Clusters
    - Pentium, Itanium, Opteron, Alpha
    - GigE, Infiniband, Myrinet, Quadrics, SCI
  - NEC TX7
  - HP Alpha
- Commodity processor with custom interconnect
  - SGI Altix
    - Intel Itanium 2
  - Cray Red Storm
    - AMD Opteron
- Custom processor with custom interconnect
  - Cray X1
  - NEC SX-7
  - IBM Regatta
  - IBM Blue Gene/L
2. Supercomputer examples
- SGI Altix
  - The Columbia Supercomputer at NASA's Advanced Supercomputing Facility at Ames Research Center.
  - It consists of a 10,240-processor SGI Altix system made up of 20 nodes, each with 512 Intel Itanium 2 processors, running a Linux operating system.
  - Black hole simulations
- Hitachi SR11000
- NEC SX-7
- Apple
- Cray RedStorm
- Cray BlackWidow
- IBM Blue Gene/L
3.
- IBM Regatta p690
  - 41 SMP nodes with 32 processors each (total 1312)
  - Processor type: Power4, 1.7 GHz
  - Overall peak performance: 8.9 Teraflops
  - Linpack: 5.6 Teraflops
  - Main memory: 41 x 128 Gbytes (aggregate 5.2 TB)
  - Operating system: AIX 5.2
- Fujitsu Primepower
  - 16 SPARC64 processors, 1.35 GHz/1.89 GHz
  - 128 GB memory
  - 16 disks
  - 2 x 8-way system boards
  - Solaris 8, 9, 10
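The IBM Regatta p690 totals above can be cross-checked with a little arithmetic. The sketch below is only illustrative: the node count, clock rate, memory size and Linpack figure come from the slide, while the value of 4 floating-point operations per cycle per processor is an assumption (it is not stated on the slide) chosen because it reproduces the quoted 8.9 Teraflops peak.

```python
# Figures taken from the IBM Regatta p690 slide.
NODES = 41
PROCS_PER_NODE = 32
CLOCK_GHZ = 1.7
MEM_PER_NODE_GB = 128
LINPACK_TFLOPS = 5.6

# Assumption (not on the slide): each processor completes
# 4 floating-point operations per clock cycle.
FLOPS_PER_CYCLE = 4

total_procs = NODES * PROCS_PER_NODE
# GHz * flops/cycle gives Gflop/s per processor; divide by 1000 for Tflop/s.
peak_tflops = total_procs * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0
total_mem_tb = NODES * MEM_PER_NODE_GB / 1000.0
linpack_efficiency = LINPACK_TFLOPS / peak_tflops

print(f"processors: {total_procs}")
print(f"peak: {peak_tflops:.1f} Tflop/s")
print(f"memory: {total_mem_tb:.1f} TB")
print(f"Linpack efficiency: {linpack_efficiency:.0%}")
```

Under that assumption the machine sustains roughly five eighths of its theoretical peak on Linpack.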
4. Processors used in supercomputers and their performance
Linpack: a standard benchmark program that measures how fast a computer runs. It times the solution of a dense system of linear equations; Linpack 100 and Linpack 1000 refer to matrices of order 100 and 1000.
- Intel Pentium Xeon, 3.2 GHz: peak 6.4 Gflop/s, Linpack 100 1.7 Gflop/s, Linpack 1000 3.1 Gflop/s
- AMD Opteron, 2.2 GHz: peak 4.4 Gflop/s, Linpack 100 1.3 Gflop/s, Linpack 1000 3.1 Gflop/s
- Intel Itanium 2, 1.5 GHz: peak 6 Gflop/s, Linpack 100 1.7 Gflop/s, Linpack 1000 5.4 Gflop/s
Gflop/s: one billion floating-point operations per second
- HP PA RISC
- Sun UltraSPARC IV
- HP Alpha EV68, 1.25 GHz, 2.5 Gflop/s
- MIPS R16000
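One way to read the peak and Linpack numbers above is as an efficiency ratio (measured Gflop/s divided by peak Gflop/s) and as an implied number of floating-point operations per clock cycle (peak divided by clock rate). The following is a minimal sketch using only the figures quoted on this slide; the function and field names are made up for illustration.

```python
# (name, clock in GHz, peak, Linpack 100, Linpack 1000), rates in Gflop/s
PROCESSORS = [
    ("Intel Pentium Xeon", 3.2, 6.4, 1.7, 3.1),
    ("AMD Opteron", 2.2, 4.4, 1.3, 3.1),
    ("Intel Itanium 2", 1.5, 6.0, 1.7, 5.4),
]


def summarize(processors):
    """Derive flops per cycle and Linpack efficiency for each processor."""
    rows = []
    for name, clock_ghz, peak, lp100, lp1000 in processors:
        rows.append({
            "name": name,
            "flops_per_cycle": peak / clock_ghz,
            "eff_100": lp100 / peak,    # fraction of peak at order 100
            "eff_1000": lp1000 / peak,  # fraction of peak at order 1000
        })
    return rows


summary = summarize(PROCESSORS)
for row in summary:
    print(f"{row['name']}: {row['flops_per_cycle']:.0f} flops/cycle, "
          f"Linpack 100 at {row['eff_100']:.0%} of peak, "
          f"Linpack 1000 at {row['eff_1000']:.0%} of peak")
```

On these numbers the larger problem size gets noticeably closer to peak on every processor, with the Itanium 2 reaching 90% of peak at order 1000.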
5. Inter-processor connection technologies
- Gigabit Ethernet
- Myrinet
- Infiniband
- QsNet
- SCI
More detail
6. Tree, Fat-tree
- Tree network: there is only one path between any pair of processors.
- Fat-tree network: increases the number of communication links close to the root.
Figure: the root level has more physical connections
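Both properties can be illustrated on a complete binary tree with processors at the leaves. This is a sketch under assumptions that are not in the slides: nodes use heap numbering (root 1, children of node i are 2*i and 2*i + 1), the number of leaves is a power of two, and the fat tree is the idealized kind in which link capacity doubles at each level toward the root.

```python
def tree_path(a, b, num_leaves):
    """Return the unique path between leaf processors a and b.

    Leaf k is node num_leaves + k in heap numbering. Both leaves sit at
    the same depth, so we climb from both ends until the paths meet.
    """
    x, y = num_leaves + a, num_leaves + b
    up, down = [x], [y]
    while x != y:
        x, y = x // 2, y // 2
        up.append(x)
        down.append(y)
    # up and down both end at the common ancestor; keep it only once.
    return up + list(reversed(down[:-1]))


def fat_tree_link_capacities(num_leaves):
    """Capacity of one link at each level, from the leaf level up to the root.

    Assumption: capacity doubles per level, so the total bandwidth
    crossing every level stays equal to num_leaves.
    """
    capacities = []
    width = 1
    while width < num_leaves:
        capacities.append(width)
        width *= 2
    return capacities


print(tree_path(0, 7, 8))           # path through the root
print(tree_path(0, 1, 8))           # siblings meet at their parent
print(fat_tree_link_capacities(8))  # thicker links toward the root
```

In a plain tree every message between the two halves of the machine must cross the root, which is why the fat tree adds capacity there.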
7. Torus topology
- Also known as a wrapped-around mesh topology
Figures: mesh with wraparound; three-dimensional mesh
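The effect of the wraparound links can be shown with a small sketch that lists the neighbors of a node in a torus and compares hop counts with and without wraparound. The 4 x 4 x 4 size and the function names are illustrative assumptions, not taken from the slides.

```python
def torus_neighbors(coord, dims):
    """Neighbors of coord in a torus: one step each way along every axis,
    with indices wrapping around at the edges (sizes of 3 or more assumed)."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            neighbors.append(tuple(n))
    return neighbors


def mesh_distance(a, b):
    """Hop count in a plain mesh (no wraparound links)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def torus_distance(a, b, dims):
    """Hop count in a torus: each axis may go the short way around."""
    return sum(min(abs(x - y), size - abs(x - y))
               for x, y, size in zip(a, b, dims))


dims = (4, 4, 4)
print(torus_neighbors((0, 0, 0), dims))
print(mesh_distance((0, 0, 0), (3, 3, 3)))         # corner to corner, mesh
print(torus_distance((0, 0, 0), (3, 3, 3), dims))  # same pair, torus
```

Opposite corners of the 4 x 4 x 4 mesh are 9 hops apart, but only 3 hops apart once the wraparound links are added.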
8. Clos network
- A kind of multistage switching network
- Three stages, each consisting of a number of crossbars.
- The middle stage has redundant switching boxes to reduce the blocking probability
9. Myrinet
- By Myricom
- First Myrinet in 1994
- An alternative to Ethernet for connecting the nodes in a cluster
- Operates entirely in user space, with no operating system delays
Figure: 10G PCI Express NIC with fiber connectors
Figure: Myrinet switch, 10-Gbps, 12,800 Clos networks up to 128 host ports
10. QsNetII network
- By Quadrics (formed in 1996)
- Uses a 'fat tree' topology
- QsNetII scales up to 4096 nodes
- Each node might have multiple CPUs
- Designed for use within SMP systems
- MPI latency on standard AMD Opteron starts at 1.22 µs
- Bandwidth on Intel Xeon EM64T is 912 Mbytes/s
Figure: QsNetII E-Series 128-way switch
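Latency and bandwidth figures like these are often combined in a simple linear cost model, time = latency + size / bandwidth. The sketch below applies that model to the two QsNetII numbers; note the assumptions: the slide quotes latency and bandwidth on different host platforms, so pairing them is only illustrative, 1 Mbyte is taken as 10^6 bytes, and the model ignores contention and protocol effects.

```python
LATENCY_S = 1.22e-6        # MPI latency quoted for AMD Opteron
BANDWIDTH_BPS = 912e6      # bandwidth quoted for Intel Xeon EM64T, bytes/s


def transfer_time(num_bytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Estimated time in seconds to send one message of num_bytes."""
    return latency + num_bytes / bandwidth


def effective_bandwidth(num_bytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Bytes per second actually achieved for a message of num_bytes."""
    return num_bytes / transfer_time(num_bytes, latency, bandwidth)


# Message size at which half of the peak bandwidth is reached.
half_peak_bytes = LATENCY_S * BANDWIDTH_BPS

for size in (8, 1024, 1_000_000):
    print(f"{size:>9} bytes: {transfer_time(size) * 1e6:8.2f} us, "
          f"{effective_bandwidth(size) / 1e6:6.1f} Mbytes/s")
print(f"half of peak bandwidth at about {half_peak_bytes:.0f} bytes")
```

In this model small messages are dominated by latency and large ones by bandwidth, with the crossover at a little over 1 Kbyte.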
11.
- Each chip contains two nodes
- Each node is a PPC440 processor
- Each node has 512 local memory
- Each node runs lightweight OS with MPI.
- Each node runs one user process
- No context switching at node
12. BlueGene/L Interconnection
- Uses five networks
  - GigE for I/O nodes, to external systems
  - A control network using Fast Ethernet
  - 3-D torus for node-to-node message passing
    - Handles the majority of application traffic (MPI messaging)
    - Longest path: 64 hops
    - MPI software is highly customized
  - A collective network for broadcasting
  - A barrier network
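The 64-hop longest path can be reproduced from the torus dimensions. The sketch below assumes a 64 x 32 x 32 torus; that shape is not given on the slides and is used here because it matches the 65,536 compute nodes quoted for the full machine. In a torus the farthest node along an axis of size s is s // 2 hops away, so the diameter is the sum of those terms; a brute-force check on a small torus is included as a sanity test.

```python
from itertools import product


def torus_diameter(dims):
    """Longest shortest path in a torus: half of each axis, summed."""
    return sum(size // 2 for size in dims)


def brute_force_diameter(dims):
    """Check by enumeration: farthest node from the origin.

    A torus looks the same from every node, so the origin suffices.
    """
    best = 0
    for node in product(*(range(size) for size in dims)):
        hops = sum(min(c, size - c) for c, size in zip(node, dims))
        best = max(best, hops)
    return best


BGL_DIMS = (64, 32, 32)  # assumed shape, 64 * 32 * 32 = 65,536 nodes

print(64 * 32 * 32)
print(torus_diameter(BGL_DIMS))
print(brute_force_diameter((4, 3, 5)), torus_diameter((4, 3, 5)))
```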
13. BlueGene/L Interconnection Networks
Global Tree
- Interconnects all compute and I/O nodes (1024)
- One-to-all broadcast functionality
- Reduction operations functionality
- 2.8 Gb/s of bandwidth per link
- Latency of one-way tree traversal: 2.5 µs
- 23 TB/s total binary tree bandwidth (64k machine)
3-Dimensional Torus
- Interconnects all compute nodes (65,536)
- Virtual cut-through hardware routing
- 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
- 1 µs latency between nearest neighbors, 5 µs to the farthest
- 4 µs latency for one hop with MPI, 10 µs to the farthest
Ethernet
- Incorporated into every node ASIC
- Active in the I/O nodes (164)
- All external comm. (file I/O, control, user interaction, etc.)
Low Latency Global Barrier and Interrupt
- Latency of round trip: 1.3 µs
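The aggregate bandwidth figures on this slide follow from the per-link rates. The sketch below redoes the arithmetic: the per-node torus number is a direct unit conversion (8 bits per byte), while the tree total is reproduced under the assumption, which is only one plausible reading of the slide, that it counts one 2.8 Gb/s tree link per node of the 64k machine.

```python
BITS_PER_BYTE = 8

# 3-D torus: 12 links per node at 1.4 Gb/s each.
TORUS_LINKS_PER_NODE = 12
TORUS_LINK_GBITS = 1.4
torus_node_gbytes = TORUS_LINKS_PER_NODE * TORUS_LINK_GBITS / BITS_PER_BYTE

# Global tree: 2.8 Gb/s per link.
# Assumption: one tree link counted per node of the 65,536-node machine.
TREE_LINK_GBITS = 2.8
NODES_64K = 65536
tree_total_tbytes = NODES_64K * TREE_LINK_GBITS / BITS_PER_BYTE / 1000.0

print(f"torus bandwidth per node: {torus_node_gbytes:.1f} GB/s")
print(f"tree bandwidth, whole machine: {tree_total_tbytes:.1f} TB/s")
```

Both results agree with the slide after rounding: 2.1 GB/s per node for the torus and about 23 TB/s for the tree.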