Transcript and Presenter's Notes

Title: Scaling the Cray MTA


1
Scaling the Cray MTA
  • Burton Smith
  • Cray Inc.

2
Overview
  • The MTA is a uniform shared memory multiprocessor
    with
  • latency tolerance using fine-grain multithreading
  • no data caches or other hierarchy
  • no memory bandwidth bottleneck, absent hot-spots
  • Every 64-bit memory word also has a full/empty
    bit
  • Load and store can act as receive and send,
    respectively
  • The bit can also implement locks and atomic
    updates
  • Every processor has 16 protection domains
  • One is used by the operating system
  • The rest can be used to multiprogram the
    processor
  • We limit the number of big jobs per processor

3
Multithreading on one processor
(diagram: unused streams)
4
Multithreading on multiple processors
5
Typical MTA processor utilization
6
Processor features
  • Multithreaded VLIW with three operations per
    64-bit word
  • The ops. are named M(emory), A(rithmetic), and
    C(ontrol)
  • 31 general-purpose 64-bit registers per stream
  • Paged program address space (4KB pages)
  • Segmented data address space (8KB to 256MB segments)
  • Privilege and interleaving are specified in the descriptor
  • Data addressability to the byte level
  • Explicit-dependence lookahead
  • Multiple orthogonally generated condition codes
  • Explicit branch target registers
  • Speculative loads
  • Unprivileged traps
  • and no interrupts at all

7
Supported data types
  • 8, 16, 32, and 64-bit signed and unsigned integer
  • 64-bit IEEE and 128-bit doubled-precision floating point
  • conversion to and from 32-bit IEEE is supported
  • Bit vectors and matrices of arbitrary shape
  • 64-bit pointer with 16 tag bits and 48 address
    bits
  • 64-bit stream status word (SSW)
  • 64-bit exception register

8
Bit vector and matrix operations
  • The usual logical operations and shifts are available in both A and C versions (a portable sketch of two of these intrinsics follows this list)
  • t = tera_bit_tally(u)
  • t = tera_bit_odd_and,nimp,or,xor(u,v)
  • x = tera_bit_left,right_ones,zeros(y,z)
  • t = tera_shift_pair_left,right(t,u,v,w)
  • t = tera_bit_merge(u,v,w)
  • t = tera_bit_mat_exor,or(u,v)
  • t = tera_bit_mat_transpose(u)
  • t = tera_bit_pack,unpack(u,v)
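These intrinsics are specific to the MTA compiler. As a rough illustration only, here is portable C for what two of them plausibly compute, assuming from their names that tera_bit_tally is a 64-bit population count and tera_bit_left_zeros counts leading zero bits (an assumption, not documented here):

    #include <stdint.h>

    /* Portable stand-ins with assumed semantics; not the MTA intrinsics. */
    static int bit_tally(uint64_t u)          /* number of 1 bits in u */
    {
        int n = 0;
        while (u) { u &= u - 1; n++; }        /* clear the lowest set bit */
        return n;
    }

    static int bit_left_zeros(uint64_t u)     /* leading zero bits (64 if u == 0) */
    {
        if (u == 0) return 64;
        int n = 0;
        while (!(u & (UINT64_C(1) << 63))) { u <<= 1; n++; }
        return n;
    }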

9
The memory subsystem
  • Program addresses are hashed to logical addresses
  • We use an invertible matrix over GF(2) (a minimal sketch follows this list)
  • The result is no stride sensitivity at all
  • logical addresses are then interleaved among
    physical memory unit numbers and offsets
  • The number of memory units can be a power of 2 times any factor of 315579
  • 1, 2, or 4 GB of memory per processor
  • The memory units support 1 memory reference per
    cycle per processor
  • plus instruction fetches to the local processor's L2 cache
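A minimal sketch of this kind of address hashing, assuming for illustration a 64x64 invertible bit matrix stored one row per 64-bit word (this shows the technique, not the MTA's actual matrix or address widths):

    #include <stdint.h>

    /* Multiply addr by the bit matrix M over GF(2): output bit i is the
       parity (XOR-reduction) of M[i] & addr.  If M is invertible over
       GF(2), the hash is a permutation of the address space. */
    static uint64_t gf2_hash(uint64_t addr, const uint64_t M[64])
    {
        uint64_t h = 0;
        for (int i = 0; i < 64; i++) {
            uint64_t x = M[i] & addr;
            x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
            x ^= x >> 4;  x ^= x >> 2;  x ^= x >> 1;
            h |= (x & 1) << i;               /* parity of row i with addr */
        }
        return h;
    }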

10
Memory word organization
  • 64-bit words
  • with 8 more bits for SECDED
  • Big-endian partial word order
  • addressing halfwords, quarterwords, and bytes
  • 4 tag bits per word
  • with four more SECDED bits
  • The memory implements a 64-bit fetch-and-add
    operation

(word layout diagram: tag bits, data value)
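The fetch-and-add mentioned above atomically returns a word's old value while adding an increment to it, which is the same semantics that C11's atomic_fetch_add provides. A minimal reference sketch in standard C, not MTA code:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Atomically add inc to *x and return the value *x held beforehand. */
    static int64_t fetch_and_add(_Atomic int64_t *x, int64_t inc)
    {
        return atomic_fetch_add(x, inc);
    }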
11
Synchronized memory operations
  • Each word of memory has an associated full/empty bit (a rough emulation is sketched after this list)
  • Normal loads ignore this bit, and normal stores
    set it full
  • Sync memory operations are available via data
    declarations
  • Sync loads atomically wait for full, then load
    and set empty
  • Sync stores atomically wait for empty, then store
    and set full
  • Waiting is autonomous, consuming no processor
    issue cycles
  • After a while, a trap occurs and the thread state
    is saved
  • Sync and normal memory operations usually take
    the same time because of this optimistic
    approach
  • In any event, synchronization latency is tolerated
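The MTA implements this discipline in hardware, with waiting that consumes no issue cycles. Purely to illustrate the semantics, here is a rough C11 emulation of a full/empty word using an atomic flag (a sketch under that assumption, not how the machine works):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <sched.h>

    typedef struct {
        _Atomic int     full;    /* 0 = empty, 1 = full, -1 = in transition */
        _Atomic int64_t value;
    } fe_word;

    /* Sync store: wait for empty, then write the value and set full. */
    static void sync_store(fe_word *w, int64_t v)
    {
        int expected = 0;
        while (!atomic_compare_exchange_weak(&w->full, &expected, -1)) {
            expected = 0;
            sched_yield();   /* software spin; the MTA waits without issuing */
        }
        atomic_store(&w->value, v);
        atomic_store(&w->full, 1);
    }

    /* Sync load: wait for full, then read the value and set empty. */
    static int64_t sync_load(fe_word *w)
    {
        int expected = 1;
        while (!atomic_compare_exchange_weak(&w->full, &expected, -1)) {
            expected = 1;
            sched_yield();
        }
        int64_t v = atomic_load(&w->value);
        atomic_store(&w->full, 0);
        return v;
    }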

12
I/O Processor (IOP)
  • There are as many IOPs as there are processors
  • An IOP program describes a sequence of
    unit-stride block transfers to or from anywhere
    in memory
  • Each IOP drives a 100MB/s (32-bit) HIPPI channel
  • both directions can be driven simultaneously
  • memory-to-memory copies are also possible
  • We soon expect to be leveraging off-the-shelf
    buses and microprocessors as outboard devices

13
The memory network
  • The current MTA memory network is a 3D toroidal
    mesh with pure deflection (hot potato) routing
  • It must deliver one random memory reference per
    processor per cycle
  • When this condition is met, the topology is
    transparent
  • The most expensive part of the system is its
    wires
  • This is a general property of high bandwidth
    systems
  • Larger systems will need more sophisticated
    topologies
  • Surprisingly, network topology is not a dead
    subject
  • Unlike wires, transistors keep getting faster and
    cheaper
  • We should use transistors aggressively to save
    wires

14
Our problem is bandwidth, not latency
  • In any memory network, concurrency = latency × bandwidth (a worked example follows this list)
  • Multithreading supplies ample memory network
    concurrency
  • even to the point of implementing uniform shared
    memory
  • Bandwidth (not latency) limits practical MTA
    system size
  • and large MTA systems will have expensive memory
    networks
  • In future, systems will be differentiated by
    their bandwidths
  • System purchasers will buy the class of bandwidth
    they need
  • System vendors will make sure their bandwidth
    scales properly
  • The issue is the total cost of a given amount of
    bandwidth
  • How much bandwidth is enough?
  • The answer pretty clearly depends on the
    application
  • We need a better theoretical understanding of this

15
Reducing the number and cost of wires
  • Use on-wafer and on-board wires whenever possible
  • Use the highest possible bandwidth per wire
  • Use optics (or superconductors) for long-distance
    interconnect to avoid skin effect
  • Leverage technologies from other markets
  • DWDM is not quite economical enough yet
  • Use direct interconnection network topologies
  • Indirect networks waste wires
  • Use symmetric (bidirectional) links for fault
    tolerance
  • Disabling an entire cycle preserves balance
  • Base networks on low-diameter graphs of low
    degree
  • bandwidth per node ∝ degree / average distance

16
Graph symmetries
  • Suppose G(v, e) is a graph with vertex set v and directed edge set e ⊆ v × v
  • G is called bidirectional when (x,y) ∈ G implies (y,x) ∈ G
  • Bidirectional links are helpful for fault reconfiguration
  • An automorphism of G is a mapping σ: v → v such that (x,y) is in G if and only if (σ(x), σ(y)) is also
  • G is vertex-symmetric when for any pair of
    vertices there is an automorphism mapping one
    vertex to the other
  • G is edge-symmetric when for any pair of edges
    there is an automorphism mapping one edge to the
    other
  • Edge and vertex symmetries help in balancing
    network load

17
Specific bandwidth
  • Consider an n-node edge-symmetric bidirectional network with (out-)degree δ and link bandwidth ρ
  • so the total aggregate link bandwidth available is nδρ
  • Let message destinations be uniformly distributed
    among the nodes
  • hashing memory addresses helps guarantee this
  • Let d be the average distance (in hops) between
    nodes
  • Assume every node generates messages at bandwidth
    b
  • then nbd ≤ nδρ and therefore b/ρ ≤ δ/d
  • The ratio δ/d of degree to average distance limits the ratio b/ρ of injection bandwidth to link bandwidth
  • We call δ/d the specific bandwidth of the network
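A worked illustration in C, using the facts that a k-ary n-cube has degree 2n and, for even k, average distance of about nk/4 (the example and numbers are illustrative, not from the presentation):

    #include <stdio.h>

    /* Specific bandwidth (degree / average distance) of a k-ary n-cube,
       assuming even k, so each ring of k nodes has average distance k/4. */
    static double kary_ncube_specific_bw(int k, int n)
    {
        double degree   = 2.0 * n;           /* one +/- link per dimension */
        double avg_dist = n * (k / 4.0);     /* n independent rings */
        return degree / avg_dist;            /* simplifies to 8.0 / k */
    }

    int main(void)
    {
        /* An 8-ary 3-cube (8x8x8 torus): specific bandwidth 8/8 = 1,
           so injection bandwidth is at most about one link bandwidth. */
        printf("%g\n", kary_ncube_specific_bw(8, 3));
        return 0;
    }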

18
Graphs with average distance ≈ degree
Source: Bermond, Delorme, and Quisquater, JPDC 3 (1986), p. 433
19
Cayley graphs
  • Groups are a good source of low-diameter graphs
  • The vertices of a Cayley graph are the group
    elements
  • The δ edges leaving a vertex are generators of the group
  • Generator g goes from node x to node x·g
  • Cayley graphs are always vertex-symmetric
  • Premultiplication by y·x⁻¹ is an automorphism taking x to y
  • A Cayley graph is edge-symmetric if and only if
    every pair of generators is related by a group
    automorphism
  • Example: the k-ary n-cube is a Cayley graph of (Z_k)^n
  • (Z_k)^n is the n-fold direct product of the integers modulo k
  • The 2n generators are (±1,0,…,0), …, (0,…,0,±1)
  • This graph is clearly edge-symmetric
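A small C sketch of this example: generating the 2n Cayley-graph neighbors of a node of (Z_k)^n by applying each +1/-1 generator in turn (the function name and data layout are mine, for illustration only):

    /* Generate the 2n neighbors of node x in the Cayley graph of (Z_k)^n
       (the k-ary n-cube).  Generator 2j adds +1 (mod k) to coordinate j;
       generator 2j+1 adds -1 (mod k).  nbrs must hold 2*n*n ints, laid
       out as 2n consecutive n-coordinate nodes. */
    static void cayley_neighbors(const int *x, int n, int k, int *nbrs)
    {
        for (int j = 0; j < n; j++) {
            int *plus  = nbrs + (2 * j) * n;
            int *minus = nbrs + (2 * j + 1) * n;
            for (int i = 0; i < n; i++) {
                plus[i]  = x[i];
                minus[i] = x[i];
            }
            plus[j]  = (x[j] + 1) % k;
            minus[j] = (x[j] - 1 + k) % k;
        }
    }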

20
Another example: the Star graph
  • The Star graph is an edge-symmetric Cayley graph
    of the group Sn of permutations on n symbols
  • The generators are the exchanges of the rightmost
    symbol with every other symbol position
  • It therefore has n! vertices and degree n-1
  • For moderate n, the specific bandwidth is close
    to 1
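A small C sketch of those generators: each Star-graph neighbor of a permutation exchanges the rightmost symbol with one of the other n-1 positions (illustrative code, not from the presentation):

    /* Generate the n-1 Star-graph neighbors of a permutation p of n symbols.
       Neighbor i (i = 0 .. n-2) swaps the rightmost symbol p[n-1] with p[i].
       nbrs must hold (n-1)*n ints, laid out as n-1 consecutive permutations. */
    static void star_neighbors(const int *p, int n, int *nbrs)
    {
        for (int i = 0; i < n - 1; i++) {
            int *q = nbrs + i * n;
            for (int j = 0; j < n; j++) q[j] = p[j];
            int tmp  = q[n - 1];             /* exchange rightmost symbol */
            q[n - 1] = q[i];                 /* with symbol position i    */
            q[i]     = tmp;
        }
    }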

21
The Star graph of size 4! = 24
22
Compiler optimizations
  • Automatic loop nest optimization and
    parallelization
  • Parallelization of reductions and recurrences
  • Memory updates are handled automatically
  • Heuristic loop scheduling
  • Function inlining
  • Directives for controlling (nearly) everything
  • enhancing performance, portability and
    correctness
  • Function annotations
  • The canal tool describes what the compiler did

23
A program example
    void counting_sort(int *src, int *dst, int num_items, int num_vals)
    {
      int i, j, count[num_vals], start[num_vals];
      for (j = 0; j < num_vals; j++) count[j] = 0;
      for (i = 0; i < num_items; i++) count[src[i]]++;
      start[0] = 0;
      for (j = 1; j < num_vals; j++) start[j] = start[j-1] + count[j-1];
      #pragma tera assert parallel
      for (i = 0; i < num_items; i++) dst[start[src[i]]++] = src[i];
    }
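A hypothetical driver, only to show how the routine above might be called (every value in src must lie in the range 0 .. num_vals-1):

    #include <stdio.h>

    int main(void)
    {
        int src[] = { 3, 1, 3, 0, 2, 1, 3, 0 };
        int dst[8];
        counting_sort(src, dst, 8, 4);       /* 8 items, values 0..3 */
        for (int i = 0; i < 8; i++) printf("%d ", dst[i]);
        printf("\n");
        return 0;
    }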

24
NPB Integer Sort
25
An example using futures
  • Futures can be used for parallel
    divide-and-conquer
    int task(SomeType *work, int n)
    {
      future int lft;
      if ( n < SMALL ) return small_task(work, n);
      else {
        lft = future() { return task(work, n/2); };
        int rgt = task(&work[n/2], n - n/2);
        return lft + rgt;
      }
    }

26
Performance tuning
  • The compiler generates code to enqueue records containing (a hypothetical layout is sketched after this list)
  • The program counter
  • The (globally synchronous) clock
  • Values of some of the hardware resource counters
  • instructions issued
  • memory references
  • floating point operations
  • phantoms
  • A daemon dequeues, filters, and writes records to
    a file
  • Users can (re)define the filters
  • traceview displays the sorted, swap-corrected
    trace
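A hypothetical C sketch of such a trace record, based only on the fields listed above (the actual record layout is not given in the presentation):

    #include <stdint.h>

    /* Illustrative layout; field names and widths are assumptions. */
    typedef struct {
        uint64_t pc;             /* program counter at the event */
        uint64_t clock;          /* globally synchronous clock value */
        uint64_t instr_issued;   /* hardware counter: instructions issued */
        uint64_t mem_refs;       /* hardware counter: memory references */
        uint64_t fp_ops;         /* hardware counter: floating point operations */
        uint64_t phantoms;       /* hardware counter: phantoms */
    } trace_record;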

27
A traceview performance profile
28
Conclusions
  • The Cray MTA is a new kind of high performance
    system
  • scalar multithreaded processors
  • uniform shared memory
  • fine-grain synchronization
  • simple programming
  • It will scale to 64 processors in 2001 and 256 in
    2002
  • future versions will have thousands of processors
  • It extends the capabilities of supercomputers
  • scalar parallelism, e.g. data base
  • fine-grain synchronization, e.g. sparse linear
    systems