Title: Scaling the Cray MTA
1. Scaling the Cray MTA
2. Overview
- The MTA is a uniform shared memory multiprocessor
  - latency tolerance using fine-grain multithreading
  - no data caches or other hierarchy
  - no memory bandwidth bottleneck, absent hot-spots
- Every 64-bit memory word also has a full/empty bit
  - Load and store can act as receive and send, respectively
  - The bit can also implement locks and atomic updates
- Every processor has 16 protection domains
  - One is used by the operating system
  - The rest can be used to multiprogram the processor
  - We limit the number of big jobs per processor
3. Multithreading on one processor
(Figure: instruction streams interleaved cycle by cycle on one processor, with unused streams shown idle)
4. Multithreading on multiple processors
5. Typical MTA processor utilization
6. Processor features
- Multithreaded VLIW with three operations per 64-bit word
  - The ops are named M(emory), A(rithmetic), and C(ontrol)
- 31 general-purpose 64-bit registers per stream
- Paged program address space (4KB pages)
- Segmented data address space (8KB–256MB segments)
  - Privilege and interleaving are specified in the descriptor
  - Data addressability to the byte level
- Explicit-dependence lookahead
- Multiple orthogonally generated condition codes
- Explicit branch target registers
- Speculative loads
- Unprivileged traps
  - and no interrupts at all
7. Supported data types
- 8-, 16-, 32-, and 64-bit signed and unsigned integers
- 64-bit IEEE and 128-bit doubled-precision floating point
  - conversion to and from 32-bit IEEE is supported
- Bit vectors and matrices of arbitrary shape
- 64-bit pointers with 16 tag bits and 48 address bits
- 64-bit stream status word (SSW)
- 64-bit exception register
8. Bit vector and matrix operations
- The usual logical operations and shifts are available in both A and C versions
- tera_bit_tally(u)
- tera_bit_odd_and,nimp,or,xor(u,v)
- tera_bit_left,right_ones,zeros(y,z)
- tera_shift_pair_left,right(t,u,v,w)
- tera_bit_merge(u,v,w)
- tera_bit_mat_exor,or(u,v)
- tera_bit_mat_transpose(u)
- tera_bit_pack,unpack(u,v)
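For concreteness, here are portable C sketches of two of these intrinsics. The exact MTA semantics are assumed from the names (tally as population count, merge as bitwise select), not taken from the manual:

    #include <stdint.h>

    /* tera_bit_tally(u), assumed to be a population count:
       the number of one bits in a 64-bit word. */
    static uint64_t bit_tally(uint64_t u)
    {
        uint64_t n = 0;
        while (u) {
            u &= u - 1;   /* clear the lowest set bit */
            n++;
        }
        return n;
    }

    /* tera_bit_merge(u,v,w), assumed to be a bitwise select:
       take u where the mask w is 1 and v where it is 0. */
    static uint64_t bit_merge(uint64_t u, uint64_t v, uint64_t w)
    {
        return (u & w) | (v & ~w);
    }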
9. The memory subsystem
- Program addresses are hashed to logical addresses
  - We use an invertible matrix over GF(2)
  - The result is no stride sensitivity at all
- Logical addresses are then interleaved among physical memory unit numbers and offsets
- The number of memory units can be a power of 2 times any factor of 315579
- 1, 2, or 4 GB of memory per processor
- The memory units support 1 memory reference per cycle per processor
  - plus instruction fetches to the local processor's L2 cache
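A minimal sketch of this style of hashing: multiply the address by a 0/1 matrix over GF(2), where each output bit is the parity of the address ANDed with one matrix row. The matrix here is a stand-in, not the MTA's actual hash matrix:

    #include <stdint.h>

    /* Address hashing by an invertible matrix over GF(2): output bit i
       is the parity of (row i AND address). Because the matrix is
       invertible, distinct addresses hash to distinct results, and
       fixed-stride address sequences are scattered across the units. */

    static uint64_t parity64(uint64_t x)
    {
        x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
        x ^= x >> 4;  x ^= x >> 2;  x ^= x >> 1;
        return x & 1;
    }

    static uint64_t hash_address(uint64_t addr, const uint64_t row[64])
    {
        uint64_t h = 0;
        for (int i = 0; i < 64; i++)
            h |= parity64(row[i] & addr) << i;
        return h;
    }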
10. Memory word organization
- 64-bit words
  - with 8 more bits for SECDED
- Big-endian partial word order
  - addressing halfwords, quarterwords, and bytes
- 4 tag bits per word
  - with four more SECDED bits
- The memory implements a 64-bit fetch-and-add operation
(Figure: memory word layout, showing the 4 tag bits alongside the 64-bit data value)
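The fetch-and-add can be modeled on an ordinary machine with a C11 atomic; this sketch shows only the semantics, since on the MTA the memory unit itself performs the update:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Semantics of the memory-side 64-bit fetch-and-add: return the
       old value of the word and add delta to it indivisibly. */
    static int64_t fetch_and_add(_Atomic int64_t *word, int64_t delta)
    {
        return atomic_fetch_add(word, delta);
    }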
11. Synchronized memory operations
- Each word of memory has an associated full/empty bit
- Normal loads ignore this bit, and normal stores set it full
- Sync memory operations are available via data declarations
- Sync loads atomically wait for full, then load and set empty
- Sync stores atomically wait for empty, then store and set full
- Waiting is autonomous, consuming no processor issue cycles
  - After a while, a trap occurs and the thread state is saved
- Sync and normal memory operations usually take the same time because of this optimistic approach
- In any event, synchronization latency is tolerated
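As an illustration, a single-slot producer/consumer pair declared with a sync qualifier, in the Tera C style the bullets above suggest (the exact declaration syntax is an assumption):

    sync int slot;   /* one word; its full/empty bit starts empty */

    void producer(int value)
    {
        slot = value;    /* sync store: wait for empty, write, set full */
    }

    int consumer(void)
    {
        return slot;     /* sync load: wait for full, read, set empty */
    }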
12. I/O Processor (IOP)
- There are as many IOPs as there are processors
- An IOP program describes a sequence of unit-stride block transfers to or from anywhere in memory
- Each IOP drives a 100MB/s (32-bit) HIPPI channel
  - both directions can be driven simultaneously
  - memory-to-memory copies are also possible
- We soon expect to be leveraging off-the-shelf buses and microprocessors as outboard devices
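A hypothetical C rendering of one entry of such an IOP program; the field names and widths are invented for illustration, not the actual IOP format:

    #include <stdint.h>

    /* One unit-stride block transfer in an IOP program (hypothetical). */
    typedef struct {
        uint64_t mem_addr;    /* starting memory address of the block       */
        uint64_t word_count;  /* number of consecutive 64-bit words         */
        unsigned to_memory;   /* 1: channel-to-memory, 0: memory-to-channel */
        unsigned last;        /* 1: final transfer of this IOP program      */
    } iop_transfer;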
13. The memory network
- The current MTA memory network is a 3D toroidal mesh with pure deflection (hot potato) routing
- It must deliver one random memory reference per processor per cycle
  - When this condition is met, the topology is transparent
- The most expensive part of the system is its wires
  - This is a general property of high-bandwidth systems
- Larger systems will need more sophisticated topologies
- Surprisingly, network topology is not a dead subject
- Unlike wires, transistors keep getting faster and cheaper
  - We should use transistors aggressively to save wires
14. Our problem is bandwidth, not latency
- In any memory network, concurrency ≥ latency × bandwidth (Little's law)
  - for example, tolerating 100 cycles of memory latency at one reference per cycle requires at least 100 outstanding references per processor
- Multithreading supplies ample memory network concurrency
  - even to the point of implementing uniform shared memory
- Bandwidth (not latency) limits practical MTA system size
  - and large MTA systems will have expensive memory networks
- In the future, systems will be differentiated by their bandwidths
  - System purchasers will buy the class of bandwidth they need
  - System vendors will make sure their bandwidth scales properly
- The issue is the total cost of a given amount of bandwidth
- How much bandwidth is enough?
  - The answer pretty clearly depends on the application
  - We need a better theoretical understanding of this
15. Reducing the number and cost of wires
- Use on-wafer and on-board wires whenever possible
- Use the highest possible bandwidth per wire
  - Use optics (or superconductors) for long-distance interconnect to avoid skin effect
  - Leverage technologies from other markets
  - DWDM is not quite economical enough yet
- Use direct interconnection network topologies
  - Indirect networks waste wires
- Use symmetric (bidirectional) links for fault tolerance
  - Disabling an entire cycle preserves balance
- Base networks on low-diameter graphs of low degree
  - bandwidth per node ≤ degree / average distance
16. Graph symmetries
- Suppose G(v, e) is a graph with vertex set v and directed edge set e ⊆ v × v
- G is called bidirectional when (x,y) ∈ e implies (y,x) ∈ e
  - Bidirectional links are helpful for fault reconfiguration
- An automorphism of G is a mapping σ: v → v such that (x,y) is in e if and only if (σ(x), σ(y)) is also
- G is vertex-symmetric when for any pair of vertices there is an automorphism mapping one vertex to the other
- G is edge-symmetric when for any pair of edges there is an automorphism mapping one edge to the other
- Edge and vertex symmetries help in balancing network load
17. Specific bandwidth
- Consider an n-node edge-symmetric bidirectional network with (out-)degree Δ and link bandwidth β
  - so the total aggregate link bandwidth available is nΔβ
- Let message destinations be uniformly distributed among the nodes
  - hashing memory addresses helps guarantee this
- Let d be the average distance (in hops) between nodes
- Assume every node generates messages at bandwidth b
  - then nbd ≤ nΔβ and therefore b/β ≤ Δ/d
- The ratio Δ/d of degree to average distance limits the ratio b/β of injection bandwidth to link bandwidth
- We call Δ/d the specific bandwidth of the network
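As a worked check of the formula (my numbers, not the slide's): for a k-ary 3-cube with even k, the average hop count is k/4 per dimension, so d = 3k/4 against degree Δ = 6.

    #include <stdio.h>

    /* Specific bandwidth Delta/d of a k-ary 3-cube (3D torus).
       For even k the average distance per dimension is k/4,
       so d = 3k/4 while the degree is Delta = 6. */
    int main(void)
    {
        for (int k = 4; k <= 16; k += 4) {
            double delta = 6.0;          /* ±1 hop in each of 3 dimensions */
            double d = 3.0 * k / 4.0;    /* average distance in hops       */
            printf("k=%2d: Delta/d = %.3f\n", k, delta / d);
        }
        return 0;
    }

An 8-ary 3-cube comes out at exactly 1, and doubling k halves the specific bandwidth, which is one way to see why larger systems need more sophisticated topologies.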
18. Graphs with average distance ≈ degree
Source: Bermond, Delorme, and Quisquater, JPDC 3 (1986), p. 433
19. Cayley graphs
- Groups are a good source of low-diameter graphs
- The vertices of a Cayley graph are the group elements
- The Δ edges leaving a vertex are generators of the group
  - Generator g goes from node x to node x·g
- Cayley graphs are always vertex-symmetric
  - Premultiplication by y·x⁻¹ is an automorphism taking x to y
- A Cayley graph is edge-symmetric if and only if every pair of generators is related by a group automorphism
- Example: the k-ary n-cube is a Cayley graph of (ℤ_k)ⁿ
  - (ℤ_k)ⁿ is the n-fold direct product of the integers modulo k
  - The 2n generators are (±1,0,…,0), (0,±1,…,0), …, (0,…,0,±1)
  - This graph is clearly edge-symmetric
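A small sketch of that example: enumerating the 2n Cayley-graph neighbors of a vertex of (ℤ_k)ⁿ, where each generator adds ±1 (mod k) to one coordinate. The radix and dimension are arbitrary choices here.

    #include <stdio.h>

    #define N 3   /* dimensions (n) */
    #define K 4   /* radix (k)      */

    /* Print the 2n neighbors of vertex x in the k-ary n-cube,
       one per generator (0,...,±1,...,0). */
    static void neighbors(const int x[N])
    {
        for (int i = 0; i < N; i++) {
            for (int s = -1; s <= 1; s += 2) {
                int y[N];
                for (int j = 0; j < N; j++) y[j] = x[j];
                y[i] = (x[i] + s + K) % K;   /* x -> x·g in coordinate i */
                printf("(%d,%d,%d)\n", y[0], y[1], y[2]);   /* N == 3 here */
            }
        }
    }

    int main(void)
    {
        int origin[N] = {0, 0, 0};
        neighbors(origin);   /* the 2n = 6 neighbors of the origin */
        return 0;
    }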
20. Another example: the Star graph
- The Star graph is an edge-symmetric Cayley graph of the group Sn of permutations on n symbols
- The generators are the exchanges of the rightmost symbol with every other symbol position
- It therefore has n! vertices and degree n-1
- For moderate n, the specific bandwidth is close to 1
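To check the last claim for n = 4, this sketch runs a breadth-first search over the 24-vertex Star graph; vertex symmetry means distances from the identity give the true average, and swapping the first rather than the rightmost position is just a relabeling.

    #include <stdio.h>
    #include <string.h>

    #define N  4    /* symbols */
    #define NV 24   /* 4! vertices */

    static int perms[NV][N], nperms = 0;

    /* Generate all permutations of {0,...,N-1} in lexicographic order,
       so perms[0] is the identity. */
    static void gen(int *p, int used, int depth)
    {
        if (depth == N) { memcpy(perms[nperms++], p, sizeof(int) * N); return; }
        for (int i = 0; i < N; i++)
            if (!(used & (1 << i))) { p[depth] = i; gen(p, used | (1 << i), depth + 1); }
    }

    static int index_of(const int *p)
    {
        for (int i = 0; i < NV; i++)
            if (memcmp(perms[i], p, sizeof(int) * N) == 0) return i;
        return -1;
    }

    int main(void)
    {
        int tmp[N];
        gen(tmp, 0, 0);

        int dist[NV], queue[NV], head = 0, tail = 0;
        for (int i = 0; i < NV; i++) dist[i] = -1;
        dist[0] = 0;               /* BFS from the identity permutation */
        queue[tail++] = 0;
        while (head < tail) {
            int v = queue[head++];
            for (int i = 1; i < N; i++) {   /* generator: swap positions 0 and i */
                int q[N], t, w;
                memcpy(q, perms[v], sizeof(q));
                t = q[0]; q[0] = q[i]; q[i] = t;
                w = index_of(q);
                if (dist[w] < 0) { dist[w] = dist[v] + 1; queue[tail++] = w; }
            }
        }

        double sum = 0;
        for (int i = 0; i < NV; i++) sum += dist[i];
        double d = sum / (NV - 1);   /* average distance to the other vertices */
        printf("d = %.3f, specific bandwidth Delta/d = %.3f\n", d, (N - 1) / d);
        return 0;
    }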
21. The Star graph of size 4! = 24
22. Compiler optimizations
- Automatic loop nest optimization and parallelization
- Parallelization of reductions and recurrences
  - Memory updates are handled automatically
- Heuristic loop scheduling
- Function inlining
- Directives for controlling (nearly) everything
  - enhancing performance, portability and correctness
- Function annotations
- canal describes what the compiler did
23. A program example

    void counting_sort(int src[], int dst[], int num_items, int num_vals)
    {
        int i, j, count[num_vals], start[num_vals];
        for (j = 0; j < num_vals; j++) count[j] = 0;
        for (i = 0; i < num_items; i++) count[src[i]]++;
        start[0] = 0;
        for (j = 1; j < num_vals; j++) start[j] = start[j-1] + count[j-1];
    #pragma tera assert parallel
        for (i = 0; i < num_items; i++) dst[start[src[i]]++] = src[i];
    }
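The pragma asserts that the final loop is safe to run concurrently; the indexed increment start[src[i]]++ can then compile down to the memory's fetch-and-add. A hypothetical driver, just to show the calling convention (the sizes are made up):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        enum { NUM_ITEMS = 1000, NUM_VALS = 16 };
        int *src = malloc(NUM_ITEMS * sizeof(int));
        int *dst = malloc(NUM_ITEMS * sizeof(int));
        for (int i = 0; i < NUM_ITEMS; i++)
            src[i] = rand() % NUM_VALS;        /* random keys in [0, NUM_VALS) */
        counting_sort(src, dst, NUM_ITEMS, NUM_VALS);
        printf("smallest key %d, largest key %d\n", dst[0], dst[NUM_ITEMS - 1]);
        free(src);
        free(dst);
        return 0;
    }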
24. NPB Integer Sort
25. An example using futures
- Futures can be used for parallel divide-and-conquer:

    int task(SomeType *work, int n)
    {
        future int lft;
        if (n < SMALL) return small_task(work, n);
        else {
            future lft() { return task(work, n/2); }
            int rgt = task(&work[n/2], n - n/2);
            return lft + rgt;   /* reading lft waits for the future */
        }
    }
26. Performance tuning
- The compiler generates code to enqueue records containing:
  - The program counter
  - The (globally synchronous) clock
  - Values of some of the hardware resource counters
    - instructions issued
    - memory references
    - floating point operations
    - phantoms
- A daemon dequeues, filters, and writes records to a file
  - Users can (re)define the filters
- traceview displays the sorted, swap-corrected trace
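A hypothetical layout for one such record, following the list above; the field names and widths are guesses for illustration, not the actual runtime format:

    #include <stdint.h>

    typedef struct {
        uint64_t pc;        /* program counter at the event     */
        uint64_t clock;     /* globally synchronous clock value */
        uint64_t issues;    /* instructions issued              */
        uint64_t mem_refs;  /* memory references                */
        uint64_t flops;     /* floating point operations        */
        uint64_t phantoms;  /* phantom (empty) issue slots      */
    } trace_record;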
27. A traceview performance profile
28. Conclusions
- The Cray MTA is a new kind of high performance system
  - scalar multithreaded processors
  - uniform shared memory
  - fine-grain synchronization
  - simple programming
- It will scale to 64 processors in 2001 and 256 in 2002
  - future versions will have thousands of processors
- It extends the capabilities of supercomputers
  - scalar parallelism, e.g. data base
  - fine-grain synchronization, e.g. sparse linear systems