Title: Scaling the Cray MTA
1. Scaling the Cray MTA
2. Overview
- The MTA is a uniform shared memory multiprocessor
  - latency tolerance using fine-grain multithreading
  - no data caches or other hierarchy
  - no memory bandwidth bottleneck, absent hot-spots
- Every 64-bit memory word also has a full/empty bit
  - Load and store can act as receive and send, respectively
  - The bit can also implement locks and atomic updates
- Every processor has 16 protection domains
  - One is used by the operating system
  - The rest can be used to multiprogram the processor
  - We limit the number of big jobs per processor
3. Multithreading on one processor
(Figure: instruction streams interleaved cycle by cycle on one processor, with unused streams shown idle)
4. Multithreading on multiple processors
5. Typical MTA processor utilization
6. Processor features
- Multithreaded VLIW with three operations per 64-bit word
  - The ops are named M(emory), A(rithmetic), and C(ontrol)
- 31 general-purpose 64-bit registers per stream
- Paged program address space (4KB pages)
- Segmented data address space (8KB–256MB segments)
  - Privilege and interleaving are specified in the descriptor
  - Data addressability to the byte level
- Explicit-dependence lookahead
- Multiple orthogonally generated condition codes
- Explicit branch target registers
- Speculative loads
- Unprivileged traps
  - and no interrupts at all
7. Supported data types
- 8-, 16-, 32-, and 64-bit signed and unsigned integers
- 64-bit IEEE and 128-bit doubled-precision floating point
  - conversion to and from 32-bit IEEE is supported
- Bit vectors and matrices of arbitrary shape
- 64-bit pointers with 16 tag bits and 48 address bits
- 64-bit stream status word (SSW)
- 64-bit exception register
8. Bit vector and matrix operations
- The usual logical operations and shifts are available in both A and C versions
- tera_bit_tally(u)
- tera_bit_odd_and,nimp,or,xor(u,v)
- tera_bit_left,right_ones,zeros(y,z)
- tera_shift_pair_left,right(t,u,v,w)
- tera_bit_merge(u,v,w)
- tera_bit_mat_exor,or(u,v)
- tera_bit_mat_transpose(u)
- tera_bit_pack,unpack(u,v)
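For concreteness, here are portable C sketches of two of these intrinsics. The exact MTA semantics are assumed from the names (tally as population count, merge as bitwise select), not taken from the manual:

    #include <stdint.h>

    /* tera_bit_tally(u), assumed to be a population count:
       the number of one bits in a 64-bit word. */
    static uint64_t bit_tally(uint64_t u)
    {
        uint64_t n = 0;
        while (u) {
            u &= u - 1;   /* clear the lowest set bit */
            n++;
        }
        return n;
    }

    /* tera_bit_merge(u,v,w), assumed to be a bitwise select:
       take u where the mask w is 1 and v where it is 0. */
    static uint64_t bit_merge(uint64_t u, uint64_t v, uint64_t w)
    {
        return (u & w) | (v & ~w);
    }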
9. The memory subsystem
- Program addresses are hashed to logical addresses
  - We use an invertible matrix over GF(2)
  - The result is no stride sensitivity at all
- Logical addresses are then interleaved among physical memory unit numbers and offsets
- The number of memory units can be a power of 2 times any factor of 315579
- 1, 2, or 4 GB of memory per processor
- The memory units support 1 memory reference per cycle per processor
  - plus instruction fetches to the local processor's L2 cache
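A minimal sketch of this style of hashing: multiply the address by a 0/1 matrix over GF(2), where each output bit is the parity of the address ANDed with one matrix row. The matrix here is a stand-in, not the MTA's actual hash matrix:

    #include <stdint.h>

    /* Address hashing by an invertible matrix over GF(2): output bit i
       is the parity of (row i AND address). Because the matrix is
       invertible, distinct addresses hash to distinct results, and
       fixed-stride address sequences are scattered across the units. */

    static uint64_t parity64(uint64_t x)
    {
        x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
        x ^= x >> 4;  x ^= x >> 2;  x ^= x >> 1;
        return x & 1;
    }

    static uint64_t hash_address(uint64_t addr, const uint64_t row[64])
    {
        uint64_t h = 0;
        for (int i = 0; i < 64; i++)
            h |= parity64(row[i] & addr) << i;
        return h;
    }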
10. Memory word organization
- 64-bit words
  - with 8 more bits for SECDED
- Big-endian partial word order
  - addressing halfwords, quarterwords, and bytes
- 4 tag bits per word
  - with four more SECDED bits
- The memory implements a 64-bit fetch-and-add operation
(Figure: memory word layout, showing the 4 tag bits alongside the 64-bit data value)
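The fetch-and-add can be modeled on an ordinary machine with a C11 atomic; this sketch shows only the semantics, since on the MTA the memory unit itself performs the update:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Semantics of the memory-side 64-bit fetch-and-add: return the
       old value of the word and add delta to it indivisibly. */
    static int64_t fetch_and_add(_Atomic int64_t *word, int64_t delta)
    {
        return atomic_fetch_add(word, delta);
    }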
11. Synchronized memory operations
- Each word of memory has an associated full/empty bit
- Normal loads ignore this bit, and normal stores set it full
- Sync memory operations are available via data declarations
- Sync loads atomically wait for full, then load and set empty
- Sync stores atomically wait for empty, then store and set full
- Waiting is autonomous, consuming no processor issue cycles
  - After a while, a trap occurs and the thread state is saved
- Sync and normal memory operations usually take the same time because of this optimistic approach
- In any event, synchronization latency is tolerated
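As an illustration, a single-slot producer/consumer pair declared with a sync qualifier, in the Tera C style the bullets above suggest (the exact declaration syntax is an assumption):

    sync int slot;   /* one word; its full/empty bit starts empty */

    void producer(int value)
    {
        slot = value;    /* sync store: wait for empty, write, set full */
    }

    int consumer(void)
    {
        return slot;     /* sync load: wait for full, read, set empty */
    }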
12. I/O Processor (IOP)
- There are as many IOPs as there are processors
- An IOP program describes a sequence of unit-stride block transfers to or from anywhere in memory
- Each IOP drives a 100MB/s (32-bit) HIPPI channel
  - both directions can be driven simultaneously
  - memory-to-memory copies are also possible
- We soon expect to be leveraging off-the-shelf buses and microprocessors as outboard devices
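A hypothetical C rendering of one entry of such an IOP program; the field names and widths are invented for illustration, not the actual IOP format:

    #include <stdint.h>

    /* One unit-stride block transfer in an IOP program (hypothetical). */
    typedef struct {
        uint64_t mem_addr;    /* starting memory address of the block       */
        uint64_t word_count;  /* number of consecutive 64-bit words         */
        unsigned to_memory;   /* 1: channel-to-memory, 0: memory-to-channel */
        unsigned last;        /* 1: final transfer of this IOP program      */
    } iop_transfer;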
13. The memory network
- The current MTA memory network is a 3D toroidal mesh with pure deflection (hot potato) routing
- It must deliver one random memory reference per processor per cycle
  - When this condition is met, the topology is transparent
- The most expensive part of the system is its wires
  - This is a general property of high-bandwidth systems
- Larger systems will need more sophisticated topologies
- Surprisingly, network topology is not a dead subject
- Unlike wires, transistors keep getting faster and cheaper
  - We should use transistors aggressively to save wires
14. Our problem is bandwidth, not latency
- In any memory network, concurrency ≥ latency × bandwidth (Little's law)
  - for example, tolerating 100 cycles of memory latency at one reference per cycle requires at least 100 outstanding references per processor
- Multithreading supplies ample memory network concurrency
  - even to the point of implementing uniform shared memory
- Bandwidth (not latency) limits practical MTA system size
  - and large MTA systems will have expensive memory networks
- In the future, systems will be differentiated by their bandwidths
  - System purchasers will buy the class of bandwidth they need
  - System vendors will make sure their bandwidth scales properly
- The issue is the total cost of a given amount of bandwidth
- How much bandwidth is enough?
  - The answer pretty clearly depends on the application
  - We need a better theoretical understanding of this
15. Reducing the number and cost of wires
- Use on-wafer and on-board wires whenever possible
- Use the highest possible bandwidth per wire
  - Use optics (or superconductors) for long-distance interconnect to avoid skin effect
  - Leverage technologies from other markets
  - DWDM is not quite economical enough yet
- Use direct interconnection network topologies
  - Indirect networks waste wires
- Use symmetric (bidirectional) links for fault tolerance
  - Disabling an entire cycle preserves balance
- Base networks on low-diameter graphs of low degree
  - bandwidth per node ≤ degree / average distance
16. Graph symmetries
- Suppose G(v, e) is a graph with vertex set v and directed edge set e ⊆ v × v
- G is called bidirectional when (x,y) ∈ e implies (y,x) ∈ e
  - Bidirectional links are helpful for fault reconfiguration
- An automorphism of G is a mapping σ: v → v such that (x,y) is in e if and only if (σ(x), σ(y)) is also
- G is vertex-symmetric when for any pair of vertices there is an automorphism mapping one vertex to the other
- G is edge-symmetric when for any pair of edges there is an automorphism mapping one edge to the other
- Edge and vertex symmetries help in balancing network load
17. Specific bandwidth
- Consider an n-node edge-symmetric bidirectional network with (out-)degree Δ and link bandwidth β
  - so the total aggregate link bandwidth available is nΔβ
- Let message destinations be uniformly distributed among the nodes
  - hashing memory addresses helps guarantee this
- Let d be the average distance (in hops) between nodes
- Assume every node generates messages at bandwidth b
  - then nbd ≤ nΔβ and therefore b/β ≤ Δ/d
- The ratio Δ/d of degree to average distance limits the ratio b/β of injection bandwidth to link bandwidth
- We call Δ/d the specific bandwidth of the network
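As a worked check of the formula (my numbers, not the slide's): for a k-ary 3-cube with even k, the average hop count is k/4 per dimension, so d = 3k/4 against degree Δ = 6.

    #include <stdio.h>

    /* Specific bandwidth Delta/d of a k-ary 3-cube (3D torus).
       For even k the average distance per dimension is k/4,
       so d = 3k/4 while the degree is Delta = 6. */
    int main(void)
    {
        for (int k = 4; k <= 16; k += 4) {
            double delta = 6.0;          /* ±1 hop in each of 3 dimensions */
            double d = 3.0 * k / 4.0;    /* average distance in hops       */
            printf("k=%2d: Delta/d = %.3f\n", k, delta / d);
        }
        return 0;
    }

An 8-ary 3-cube comes out at exactly 1, and doubling k halves the specific bandwidth, which is one way to see why larger systems need more sophisticated topologies.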
18. Graphs with average distance ≈ degree
Source: Bermond, Delorme, and Quisquater, JPDC 3 (1986), p. 433
19. Cayley graphs
- Groups are a good source of low-diameter graphs
- The vertices of a Cayley graph are the group elements
- The Δ edges leaving a vertex are generators of the group
  - Generator g goes from node x to node x·g
- Cayley graphs are always vertex-symmetric
  - Premultiplication by y·x⁻¹ is an automorphism taking x to y
- A Cayley graph is edge-symmetric if and only if every pair of generators is related by a group automorphism
- Example: the k-ary n-cube is a Cayley graph of (ℤ_k)ⁿ
  - (ℤ_k)ⁿ is the n-fold direct product of the integers modulo k
  - The 2n generators are (±1,0,…,0), (0,±1,…,0), …, (0,…,0,±1)
  - This graph is clearly edge-symmetric
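A small sketch of that example: enumerating the 2n Cayley-graph neighbors of a vertex of (ℤ_k)ⁿ, where each generator adds ±1 (mod k) to one coordinate. The radix and dimension are arbitrary choices here.

    #include <stdio.h>

    #define N 3   /* dimensions (n) */
    #define K 4   /* radix (k)      */

    /* Print the 2n neighbors of vertex x in the k-ary n-cube,
       one per generator (0,...,±1,...,0). */
    static void neighbors(const int x[N])
    {
        for (int i = 0; i < N; i++) {
            for (int s = -1; s <= 1; s += 2) {
                int y[N];
                for (int j = 0; j < N; j++) y[j] = x[j];
                y[i] = (x[i] + s + K) % K;   /* x -> x·g in coordinate i */
                printf("(%d,%d,%d)\n", y[0], y[1], y[2]);   /* N == 3 here */
            }
        }
    }

    int main(void)
    {
        int origin[N] = {0, 0, 0};
        neighbors(origin);   /* the 2n = 6 neighbors of the origin */
        return 0;
    }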
20. Another example: the Star graph
- The Star graph is an edge-symmetric Cayley graph of the group Sn of permutations on n symbols
- The generators are the exchanges of the rightmost symbol with every other symbol position
- It therefore has n! vertices and degree n-1
- For moderate n, the specific bandwidth is close to 1
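To check the last claim for n = 4, this sketch runs a breadth-first search over the 24-vertex Star graph; vertex symmetry means distances from the identity give the true average, and swapping the first rather than the rightmost position is just a relabeling.

    #include <stdio.h>
    #include <string.h>

    #define N  4    /* symbols */
    #define NV 24   /* 4! vertices */

    static int perms[NV][N], nperms = 0;

    /* Generate all permutations of {0,...,N-1} in lexicographic order,
       so perms[0] is the identity. */
    static void gen(int *p, int used, int depth)
    {
        if (depth == N) { memcpy(perms[nperms++], p, sizeof(int) * N); return; }
        for (int i = 0; i < N; i++)
            if (!(used & (1 << i))) { p[depth] = i; gen(p, used | (1 << i), depth + 1); }
    }

    static int index_of(const int *p)
    {
        for (int i = 0; i < NV; i++)
            if (memcmp(perms[i], p, sizeof(int) * N) == 0) return i;
        return -1;
    }

    int main(void)
    {
        int tmp[N];
        gen(tmp, 0, 0);

        int dist[NV], queue[NV], head = 0, tail = 0;
        for (int i = 0; i < NV; i++) dist[i] = -1;
        dist[0] = 0;               /* BFS from the identity permutation */
        queue[tail++] = 0;
        while (head < tail) {
            int v = queue[head++];
            for (int i = 1; i < N; i++) {   /* generator: swap positions 0 and i */
                int q[N], t, w;
                memcpy(q, perms[v], sizeof(q));
                t = q[0]; q[0] = q[i]; q[i] = t;
                w = index_of(q);
                if (dist[w] < 0) { dist[w] = dist[v] + 1; queue[tail++] = w; }
            }
        }

        double sum = 0;
        for (int i = 0; i < NV; i++) sum += dist[i];
        double d = sum / (NV - 1);   /* average distance to the other vertices */
        printf("d = %.3f, specific bandwidth Delta/d = %.3f\n", d, (N - 1) / d);
        return 0;
    }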
21. The Star graph of size 4! = 24
22. Compiler optimizations
- Automatic loop nest optimization and parallelization
- Parallelization of reductions and recurrences
  - Memory updates are handled automatically
- Heuristic loop scheduling
- Function inlining
- Directives for controlling (nearly) everything
  - enhancing performance, portability and correctness
- Function annotations
- canal describes what the compiler did
23. A program example

    void counting_sort(int src[], int dst[], int num_items, int num_vals)
    {
        int i, j, count[num_vals], start[num_vals];
        for (j = 0; j < num_vals; j++) count[j] = 0;
        for (i = 0; i < num_items; i++) count[src[i]]++;
        start[0] = 0;
        for (j = 1; j < num_vals; j++) start[j] = start[j-1] + count[j-1];
    #pragma tera assert parallel
        for (i = 0; i < num_items; i++) dst[start[src[i]]++] = src[i];
    }
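The pragma asserts that the final loop is safe to run concurrently; the indexed increment start[src[i]]++ can then compile down to the memory's fetch-and-add. A hypothetical driver, just to show the calling convention (the sizes are made up):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        enum { NUM_ITEMS = 1000, NUM_VALS = 16 };
        int *src = malloc(NUM_ITEMS * sizeof(int));
        int *dst = malloc(NUM_ITEMS * sizeof(int));
        for (int i = 0; i < NUM_ITEMS; i++)
            src[i] = rand() % NUM_VALS;        /* random keys in [0, NUM_VALS) */
        counting_sort(src, dst, NUM_ITEMS, NUM_VALS);
        printf("smallest key %d, largest key %d\n", dst[0], dst[NUM_ITEMS - 1]);
        free(src);
        free(dst);
        return 0;
    }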
24. NPB Integer Sort
25. An example using futures
- Futures can be used for parallel divide-and-conquer:

    int task(SomeType *work, int n)
    {
        future int lft;
        if (n < SMALL) return small_task(work, n);
        else {
            future lft() { return task(work, n/2); }
            int rgt = task(&work[n/2], n - n/2);
            return lft + rgt;   /* reading lft waits for the future */
        }
    }
26. Performance tuning
- The compiler generates code to enqueue records containing:
  - The program counter
  - The (globally synchronous) clock
  - Values of some of the hardware resource counters
    - instructions issued
    - memory references
    - floating point operations
    - phantoms
- A daemon dequeues, filters, and writes records to a file
  - Users can (re)define the filters
- traceview displays the sorted, swap-corrected trace
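A hypothetical layout for one such record, following the list above; the field names and widths are guesses for illustration, not the actual runtime format:

    #include <stdint.h>

    typedef struct {
        uint64_t pc;        /* program counter at the event     */
        uint64_t clock;     /* globally synchronous clock value */
        uint64_t issues;    /* instructions issued              */
        uint64_t mem_refs;  /* memory references                */
        uint64_t flops;     /* floating point operations        */
        uint64_t phantoms;  /* phantom (empty) issue slots      */
    } trace_record;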
27. A traceview performance profile
28. Conclusions
- The Cray MTA is a new kind of high performance system
  - scalar multithreaded processors
  - uniform shared memory
  - fine-grain synchronization
  - simple programming
- It will scale to 64 processors in 2001 and 256 in 2002
  - future versions will have thousands of processors
- It extends the capabilities of supercomputers
  - scalar parallelism, e.g. data base
  - fine-grain synchronization, e.g. sparse linear systems