Commodity processor with commodity interprocessor connection

Slides: 15

Provided by: engi57

Learn more at: http://www.people.vcu.edu

1
Commercial Parallel Computer Architecture
From loosely coupled to tightly coupled:

- Commodity processor with commodity inter-processor connection
  - Clusters
    - Pentium, Itanium, Opteron, Alpha
    - GigE, Infiniband, Myrinet, Quadrics, SCI
  - NEC TX7
  - HP Alpha
- Commodity processor with custom interconnect
  - SGI Altix
    - Intel Itanium 2
  - Cray Red Storm
    - AMD Opteron
- Custom processor with custom interconnect
  - Cray X1
  - NEC SX-7
  - IBM Regatta
  - IBM Blue Gene/L

2
Supercomputer examples

SGI Altix
The Columbia supercomputer at NASA's Advanced
Supercomputing Facility at Ames Research Center
is a 10,240-processor SGI Altix system made up of
20 nodes, each with 512 Intel Itanium 2
processors, running the Linux operating system.
Black Hole Simulations

Hitachi SR11000
NEC SX-7
Apple
Cray Red Storm
Cray BlackWidow
IBM Blue Gene/L

IBM Regatta p690
41 SMP nodes with 32 processors each (1,312 in total)
Processor type: Power4, 1.7 GHz
Overall peak performance: 8.9 Teraflops
Linpack: 5.6 Teraflops
Main memory: 41 x 128 Gbytes (aggregate 5.2 TB)
Operating system: AIX 5.2
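
The Regatta totals above can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not a vendor formula: the value of 4 floating-point operations per cycle per processor is an assumption (it is not stated on the slide) chosen because it reproduces the quoted 8.9 Teraflops peak, and decimal units (1 TB = 1000 GB) are assumed for memory.

```python
# Figures taken from the slide.
NODES = 41
CPUS_PER_NODE = 32
CLOCK_GHZ = 1.7
MEMORY_PER_NODE_GB = 128
LINPACK_TFLOPS = 5.6

# Assumption (not on the slide): 4 floating-point operations per cycle.
FLOPS_PER_CYCLE = 4


def peak_tflops(nodes, cpus_per_node, clock_ghz, flops_per_cycle):
    """Theoretical peak in Teraflops: processors x clock x flops per cycle."""
    gflops = nodes * cpus_per_node * clock_ghz * flops_per_cycle
    return gflops / 1000.0


total_cpus = NODES * CPUS_PER_NODE
peak = peak_tflops(NODES, CPUS_PER_NODE, CLOCK_GHZ, FLOPS_PER_CYCLE)
memory_tb = NODES * MEMORY_PER_NODE_GB / 1000.0
linpack_fraction = LINPACK_TFLOPS / peak

print(f"processors:      {total_cpus}")
print(f"peak:            {peak:.1f} Tflop/s")
print(f"memory:          {memory_tb:.1f} TB")
print(f"Linpack / peak:  {linpack_fraction:.0%}")
```

Under these assumptions the measured Linpack figure comes out at a bit under two thirds of the theoretical peak.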

Fujitsu Primepower
16 SPARC64 processors, 1.35 GHz / 1.89 GHz
128 GB memory
16 disks
2 x 8-way system boards
Solaris 8, 9, 10

4
Processors used in supercomputers and their performance
Linpack: a standard benchmark program that measures
how fast a computer performs floating-point operations.
Gflop/s: one billion floating-point operations
per second.
- Intel Pentium Xeon 3.2 GHz, peak 6.4 Gflop/s
  Linpack 100: 1.7 Gflop/s, Linpack 1000: 3.1 Gflop/s
- AMD Opteron 2.2 GHz, peak 4.4 Gflop/s
  Linpack 100: 1.3 Gflop/s, Linpack 1000: 3.1 Gflop/s
- Intel Itanium 2 1.5 GHz, peak 6 Gflop/s
  Linpack 100: 1.7 Gflop/s, Linpack 1000: 5.4 Gflop/s
- HP PA RISC
- Sun UltraSPARC IV
- HP Alpha EV68 1.25 GHz, 2.5 Gflop/s
- MIPS R16000
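
One way to read the peak and Linpack figures above is as an efficiency, that is, the measured rate divided by the theoretical peak. The short sketch below does this for the three processors that have complete numbers on the slide; the dictionary layout and function names are illustrative choices, not part of the original presentation.

```python
# (peak, Linpack 100, Linpack 1000), all in Gflop/s, copied from the slide.
PROCESSORS = {
    "Intel Pentium Xeon 3.2 GHz": (6.4, 1.7, 3.1),
    "AMD Opteron 2.2 GHz": (4.4, 1.3, 3.1),
    "Intel Itanium 2 1.5 GHz": (6.0, 1.7, 5.4),
}


def efficiency(measured_gflops, peak_gflops):
    """Fraction of the theoretical peak that the benchmark achieved."""
    return measured_gflops / peak_gflops


efficiencies = {
    name: (efficiency(l100, peak), efficiency(l1000, peak))
    for name, (peak, l100, l1000) in PROCESSORS.items()
}

for name, (e100, e1000) in efficiencies.items():
    print(f"{name:28s} Linpack 100: {e100:5.1%}  Linpack 1000: {e1000:5.1%}")
```

On these numbers the larger problem size (Linpack 1000) runs much closer to peak than the small one on every processor listed.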
5

Inter-processor connection technologies

- Gig Ethernet
- Myrinet
- Infiniband
- QsNet
- SCI
More detail
6
Tree, Fat-tree

Tree network: there is only one path between
any pair of processors.
Fat-tree network: the number of communication
links increases toward the root.

Root level has more physical connections
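
The difference between the two can be sketched by counting upward links per level. The model below is an idealized assumption, not something stated on the slide: a binary tree with a power-of-two number of leaves, where the fat tree doubles the links on each edge at every level toward the root.

```python
def level_capacity(leaves, fat):
    """Total number of upward links at each level, leaf level first.

    Assumes a complete binary tree with a power-of-two number of leaves.
    In the plain tree every edge is one link; in the idealized fat tree
    an edge at level k carries 2**k links.
    """
    levels = leaves.bit_length() - 1  # log2(leaves) for a power of two
    capacities = []
    for k in range(levels):
        edges = leaves // 2**k           # upward edges leaving level k
        links_per_edge = 2**k if fat else 1
        capacities.append(edges * links_per_edge)
    return capacities


print("tree:    ", level_capacity(8, fat=False))
print("fat tree:", level_capacity(8, fat=True))
```

In the plain tree the capacity halves at each level, so traffic crossing the root is squeezed through very few links; in this idealized fat tree the total capacity stays constant all the way up.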
7
Torus topology

Also known as a wraparound mesh topology

Mesh with wraparound
Three-dimensional Mesh
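
The wraparound can be illustrated by listing the neighbors of a node. This is a minimal sketch with hypothetical function names: nodes are identified by integer coordinates, and the only difference between the two functions is whether a coordinate that steps off the edge wraps around (torus) or is dropped (mesh).

```python
def torus_neighbors(coord, dims):
    """Neighbors of coord in a torus: every axis wraps around."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size  # modulo gives the wraparound
            neighbors.append(tuple(n))
    return neighbors


def mesh_neighbors(coord, dims):
    """Neighbors of coord in a plain mesh: no links across the edges."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            value = coord[axis] + step
            if 0 <= value < size:
                n = list(coord)
                n[axis] = value
                neighbors.append(tuple(n))
    return neighbors


corner = (0, 0)
print("mesh: ", mesh_neighbors(corner, (4, 4)))
print("torus:", torus_neighbors(corner, (4, 4)))
```

A corner node of a 4 x 4 mesh has two neighbors, while in the torus every node, corners included, has four.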
8
Clos network

A Clos network is a kind of multistage switching
network.
It has three stages, each consisting of a number
of crossbars.
The middle stage has redundant switching boxes to
reduce the blocking probability.
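
How much redundancy the middle stage needs is not given on the slide. The sketch below uses the standard textbook conditions for a symmetric three-stage Clos network, with n inputs per first-stage crossbar and m middle-stage crossbars: m >= 2n - 1 for strict-sense nonblocking and m >= n for rearrangeably nonblocking. The function name and return strings are illustrative.

```python
def classify_clos(n, m):
    """Classify a symmetric three-stage Clos network.

    n: inputs per first-stage crossbar
    m: number of middle-stage crossbars
    """
    if m >= 2 * n - 1:
        # A free middle switch always exists for any new connection.
        return "strict-sense nonblocking"
    if m >= n:
        # A new connection may require rerouting existing ones.
        return "rearrangeably nonblocking"
    return "blocking"


for m in (3, 4, 7):
    print(f"n=4, m={m}: {classify_clos(4, m)}")
```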

9
Myrinet

Developed by Myricom
First Myrinet released in 1994
An alternative to Ethernet for connecting the nodes
in a cluster
Operates entirely in user space, with no operating
system delays

10G PCI Express NIC with fiber connectors
Myrinet switch 10-Gbps, 12,800 Clos networks
up to 128 host ports
10
QsNetII network

By Quadrics (formed in 1996)
Uses a fat-tree topology
QsNetII scales up to 4096 nodes
Each node might have multiple CPUs
Designed for use within SMP systems
MPI latency on a standard AMD Opteron starts at
1.22 µs
Bandwidth on Intel Xeon EM64T is 912 Mbytes/s
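
Latency and bandwidth figures like these are often combined in a simple linear cost model, time = latency + size / bandwidth. The sketch below applies that model under loudly stated assumptions: the slide quotes the latency and the bandwidth for two different host platforms, so combining them is only illustrative, and "Mbytes" is taken to mean 10^6 bytes.

```python
LATENCY_S = 1.22e-6          # MPI latency from the slide (AMD Opteron)
BANDWIDTH_BPS = 912e6        # bandwidth from the slide (Intel Xeon EM64T), bytes/s


def transfer_time(size_bytes, latency_s=LATENCY_S, bandwidth_bps=BANDWIDTH_BPS):
    """Estimated one-way message time under the linear model."""
    return latency_s + size_bytes / bandwidth_bps


def effective_bandwidth(size_bytes):
    """Bytes per second actually achieved for a message of this size."""
    return size_bytes / transfer_time(size_bytes)


# Message size at which half of the peak bandwidth is reached.
half_performance_bytes = LATENCY_S * BANDWIDTH_BPS

for size in (64, 1024, 65536, 1_000_000):
    mb_per_s = effective_bandwidth(size) / 1e6
    print(f"{size:>9} bytes: {transfer_time(size) * 1e6:9.2f} us, {mb_per_s:7.1f} MB/s")
print(f"half-performance message size: {half_performance_bytes:.0f} bytes")
```

In this model, short messages are dominated by latency and only messages well above roughly a kilobyte approach the quoted bandwidth.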

QsNetII E-Series 128-way switch
11

Each chip contains two nodes
Each node is a PPC440 processor
Each node has 512 MB of local memory
Each node runs a lightweight OS with MPI
Each node runs one user process
No context switching at the node

12
BlueGene/L Interconnection

Uses five networks:
- GigE for I/O nodes, to external systems
- A control network using Fast Ethernet
- 3-D torus for node-to-node message passing
  - Handles the majority of application traffic (MPI
    messaging)
  - Longest path: 64 hops
  - MPI software is highly customized
- A collective network for broadcasting
- A barrier network
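
The 64-hop longest path is consistent with a torus of 65,536 compute nodes, the count quoted elsewhere in these slides. The sketch below assumes a 64 x 32 x 32 arrangement, which is not stated on the slide; with wraparound, the farthest node along an axis of size d is d // 2 hops away, and the diameter is the sum over the axes.

```python
from math import prod


def torus_diameter(dims):
    """Longest shortest path in a torus: half of each axis, summed."""
    return sum(size // 2 for size in dims)


def mesh_diameter(dims):
    """Longest shortest path in a mesh without wraparound links."""
    return sum(size - 1 for size in dims)


# Assumption: 64 x 32 x 32 layout (not given on the slide).
dims = (64, 32, 32)
node_count = prod(dims)
longest_path = torus_diameter(dims)

print(f"nodes:          {node_count}")
print(f"torus diameter: {longest_path} hops")
print(f"mesh diameter:  {mesh_diameter(dims)} hops")
```

Under this assumption the wraparound links roughly halve the worst-case hop count compared with a plain mesh of the same size.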

13
BlueGene/L Interconnection Networks

Global Tree
- Interconnects all compute and I/O nodes (1024)
- One-to-all broadcast functionality
- Reduction operations functionality
- 2.8 Gb/s of bandwidth per link
- Latency of one-way tree traversal: 2.5 µs
- 23 TB/s total binary tree bandwidth (64k machine)

3-Dimensional Torus
- Interconnects all compute nodes (65,536)
- Virtual cut-through hardware routing
- 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
- 1 µs latency between nearest neighbors, 5 µs to
  the farthest
- 4 µs latency for one hop with MPI, 10 µs to the
  farthest

Ethernet
- Incorporated into every node ASIC
- Active in the I/O nodes (1:64)
- All external comm. (file I/O, control, user
  interaction, etc.)

Low Latency Global Barrier and Interrupt
- Latency of round trip: 1.3 µs
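
The per-node torus figure follows from the per-link rate by simple unit conversion. The sketch below reproduces it; reading the 12 links as 6 neighbors times 2 directions is an interpretation, not something the slide states.

```python
LINK_GBITS_PER_S = 1.4   # per-link rate from the slide, in gigabits per second
LINKS_PER_NODE = 12      # interpreted here as 6 torus neighbors x 2 directions
BITS_PER_BYTE = 8


def node_bandwidth_gbytes(link_gbits, links):
    """Aggregate bandwidth of one node in gigabytes per second."""
    return link_gbits * links / BITS_PER_BYTE


per_node = node_bandwidth_gbytes(LINK_GBITS_PER_S, LINKS_PER_NODE)
print(f"aggregate torus bandwidth per node: {per_node:.1f} GB/s")
```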