Title: Microthreaded models for CMPs
1 Microthreaded models for CMPs
- IFIP 10.3 Seminar 6/12/2005
- By Chris Jesshope
- University of Amsterdam
2 Motivation
- Problems and opportunities in designing chip multi-processor architectures
- Memory wall - large memories are slow; need to tolerate long-latency operations
- Global communication - the proportion of the chip reachable in one clock cycle is diminishing exponentially; need asynchrony
- Unscalable support structures - uni-processor issue width does not scale; need distributed structures
- Power barrier - cannot obtain performance indefinitely through frequency scaling; need to fully exploit concurrency in instruction execution
3 The facts of concurrency
- Or how to teach mother to suck eggs
4 Concurrency - real and virtual
- All code has concurrency
- either explicitly or implicitly
- This concurrency can be exploited to
- gain throughput - real
- tolerate latency - virtual
- Lines of iso-concurrency in this space identify tradeoffs between the two
- Schedule invariance allows that tradeoff to be dynamic
[Figure: virtual concurrency (latency tolerance) vs. real concurrency (throughput), with lines of iso-concurrency; schedule invariance allows movement along them]
5 Example
for i = 1, n: sum = sum + a(i)*b(i)
- Take the simplest of examples, the inner-product operation
- Depending on how this is viewed it has either
- No concurrency
- O(n) concurrency in parts
- O(n) concurrency
- What determines how the concurrency is exploited is the schedule of operations (see the sketch below)
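A minimal C sketch (not from the slides) of the first two views, assuming a and b are float arrays of length n: in the first, every operation is chained through sum; in the second, the loads and multiplies are independent and only the accumulation is sequential.

/* Illustrative sketch: the same inner product written two ways,
 * corresponding to two of the views of its concurrency. */
#include <stddef.h>

/* View 1: no concurrency - every operation is chained through 'sum'. */
float inner_sequential(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum = sum + a[i] * b[i];
    return sum;
}

/* View 2: O(n) concurrency in parts - the n loads and multiplies are
 * independent of each other; only the accumulation into 'sum' is sequential. */
float inner_partial(const float *a, const float *b, float *s, size_t n) {
    for (size_t i = 0; i < n; i++)   /* independent: any order, any processor */
        s[i] = a[i] * b[i];
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)   /* sequential dependency chain */
        sum += s[i];
    return sum;
}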
6 Schedules - distribution and interleaving
- Different kinds of schedules are required for
- physical concurrency - distribution
- virtual concurrency - interleaving
- Ideally
- distribution should be static or deterministic - to provide control over locality and communication
- interleaving should be dynamic or non-deterministic - to allow for asynchrony in distributing operations and data
7 Dynamic scheduling
- Dynamic scheduling requires synchronisation, which in turn requires synchronising memory
- the amount of synchronising memory limits the amount of dynamic concurrency
- This is independent of how the concurrency is identified or executed, e.g.
- Out-of-order issue - issue windows or reservation stations and reorder buffers
- Dataflow - matching stores
- Any other approach!
8 Example
for i = 1, n: sum = sum + a(i)*b(i)
- Dynamic scheduling must resolve dependencies - there are dependencies within and between iterations in this example
- within an iteration - the sequence of operations (of non-deterministic delay)
- between iterations - generation of the index and summation of the products
9 Conventional vs. dataflow ISAs
- Synchronisation of dynamic concurrency in different ISAs
- In a dataflow ISA synchronisation is on the nodes
- receive one or more inputs with (almost) identical tags
- schedule the instruction for execution
- write the result to one or more target nodes
- In a conventional ISA synchronisation is on the arcs
- wait for one or two inputs with different tags
- schedule the instruction for execution
- write the result to a single arc
10 Synchronisation memory
- Implemented with either
- associative or pseudo-associative memory
- expensive to implement but supports flexibility in tagging
- explicit token store (ETS) - memory with full/empty bits
- here the tag is used as the address into memory
- ETS is now more widely used
- in an out-of-order issue processor the tags are the renamed, global register specifiers
- in a dataflow architecture the tags identify a processor and a matching location in a distributed memory
11 Dataflow matching
- Two arcs incident on a node have the same tag/address - m
- arc ordering is needed for non-associative operations - hence have an l-r discriminator
- The first token arriving sets a full/empty bit to full (1)
- The second token arriving at the same node schedules the operation (a code sketch follows)
[Figure: tokens a(i) and b(i) with tags m-l and m-r match at location m in the synchronising memory; the full/empty bit goes from empty (0) to full (1), and the matched pair is sent to the ALU with the result tag s-l]
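A hypothetical C sketch of the explicit-token-store matching just described: the tag addresses a slot with a full/empty bit, and an l-r discriminator orders the operands for non-associative operations. The names (match_slot, deliver_token) and types are illustrative, not the hardware's.

/* Hypothetical sketch of an explicit-token-store matching location.
 * The tag is used directly as an address; a full/empty bit records whether
 * one operand has already arrived. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool  full;      /* full/empty bit: empty (0) until the first token arrives */
    float operand;   /* the operand stored by the first arrival */
} match_slot;

/* Deliver a token carrying 'value' to matching-store address 'tag'.
 * 'is_left' is the l-r discriminator needed for non-associative operations;
 * since exactly one left and one right token target a slot, the second
 * arrival's side determines the pairing. Returns true (and writes *lhs/*rhs)
 * when the second token arrives and the instruction can be scheduled. */
bool deliver_token(match_slot *store, uint32_t tag, float value, bool is_left,
                   float *lhs, float *rhs) {
    match_slot *slot = &store[tag];
    if (!slot->full) {               /* first arrival: store and mark full */
        slot->operand = value;
        slot->full = true;
        return false;
    }
    /* second arrival: pair the operands in the right order and fire */
    *lhs = is_left ? value : slot->operand;
    *rhs = is_left ? slot->operand : value;
    slot->full = false;              /* slot is free for reuse */
    return true;
}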
12 Conventional ISA synchronisation
- Each arc corresponds to a unique address in the synchronisation memory
- This is the renamed register specifier in out-of-order issue
- The location is flagged as empty when initialised
- An instruction that requires this location cannot be executed until a value has been written
- The flag is set to full on a write to this location
[Figure: a single location m holding a(i), with its full/empty flag going from empty (0) to full (1) when the value is written]
13 Contextual information
- For reentrant code, contextual information is required to manage the use of synchronisation memory
- General-purpose dataflow - frames, see [1]
- Wavescalar - a wave number identifies and sequentialises contexts (and memory references)
- TRIPS - a small fixed number of contexts exposed by speculatively executing the high-level control-flow graph
- Out-of-order issue - no explicit contextual information
[1] G. Papadopoulos (1996) Implementation of a General-Purpose Dataflow Multiprocessor, MIT Press
14 Summation dependency
All loads can be scheduled simultaneously in this example; the n independent multiplications are each dependent on their respective loads; the summation, however, is sequential - or is it?
15 Reduction operations
for i = 1, n: s(i) = a(i)*b(i); sab = sum(s)
- The code can be transformed using a reduction operation - sum
- however, the schedule below is too specific, as the reduction can be performed in any order
[Figure: pairs of loads of a(i) and b(i) feed multiplies whose results are summed in one fixed order]
16 Dataflow scheduling of reductions
- Sum is implemented most generally using a set of values s(i), where each scheduled operation
- removes a pair of values from the set, sums them and returns the result to the set (sketched below)
- n-1 such operations are required and at most n/2 can be performed concurrently
[Figure: the same loads and multiplies, but the partial sums are combined in arbitrary pairs rather than in a fixed order]
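A short C sketch of the set-based reduction described above: partial sums are kept in a work list, each step removes a pair, adds them and returns the result, taking exactly n-1 steps in whatever order the pairs are chosen.

/* Illustrative sketch (not the hardware mechanism): a reduction modelled as
 * a work-list of partial sums. Each step removes a pair, adds them and
 * returns the result to the set; exactly n-1 steps are needed and the pairs
 * can be chosen in any order (up to n/2 of them concurrently). */
#include <stddef.h>

float reduce_set(float *set, size_t n) {
    while (n > 1) {
        /* Pairs are taken from the end of the array here, but any policy
         * (and any interleaving across processors) gives the same result,
         * up to floating-point rounding. */
        float a = set[--n];
        float b = set[--n];
        set[n++] = a + b;        /* return the partial sum to the set */
    }
    return set[0];
}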
17 Matching for reductions
- Example of the decentralised control of reductions
- Use special reduction tokens addressing special locations
- The location matches pairs of operands and sends results to the same location
- Again, distribution can be deterministic
- but the ordering of operations and the pairs matched is arbitrary
- Termination can use token counting, i.e. tag one token with -(n-1) and all others with 1; terminate when their sum reaches 0
[Figure: reduction token format - tag (special ID of the sum location), Pid (processor id to perform the operation), s(i) (data to be summed)]
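A toy C model of the token-counting termination, assuming the reading above: one token is seeded with -(n-1), all others with 1, and counts are added when a pair is matched, so the token whose count reaches 0 carries the final sum. All names are illustrative.

/* Toy model of token-counted termination for a reduction. The combining
 * rule and seeding are assumptions read from the slide, not its definitive
 * mechanism. */
#include <stdio.h>

typedef struct { float value; int count; } token;

token combine(token x, token y) {
    token r = { x.value + y.value, x.count + y.count };
    return r;
}

int main(void) {
    float s[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    int n = 4;
    token t[4];
    for (int i = 0; i < n; i++) {
        t[i].value = s[i];
        t[i].count = (i == 0) ? -(n - 1) : 1;   /* seed one token with -(n-1) */
    }
    /* Match pairs in an arbitrary order; the token whose count reaches 0
     * has absorbed every input and holds the final result. */
    int m = n;
    while (m > 1) {
        token a = t[--m];
        token b = t[--m];
        t[m++] = combine(a, b);
        if (t[m - 1].count == 0)
            printf("final sum = %f\n", t[m - 1].value);
    }
    return 0;
}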
18 The loop dependency
- The loop provides contextual information
- out-of-order issue - the loop control is predicted and the loop is unrolled into synchronising memory, but no notion of context is retained
- Dataflow - loop-control operations are executed as instructions and identify loop frames
- Vector - the loop is embedded completely in a single instruction
- In general this dependency can be statically or deterministically removed
19 Removal of loop dependency
20 Microthreads
- Fragmentation of sequential code
21 Goals of our work
- Concurrent instruction issue (i.e. the number of instructions issued concurrently on chip, regardless of mechanism) such that
- silicon area is proportional to issue width
- power dissipated is proportional to issue width, and
- performance is proportional to power dissipated for a given clock frequency
- Should be programmable from sequential code
- should be backwards compatible
- unmodified code should run on a single processor
- or be translated once for any number of processors - binary-to-binary translation or by recompilation
22 Concurrency in instruction issue
- How to manage concurrency when departing from sequential instruction issue
- No controls - execute instructions out of programmed order, e.g. superscalar
- Fixed schedules - execute instructions using compiled schedules, e.g. VLIW
- Dataflow - execute instructions dynamically when data is available, e.g. Wavescalar
- Software - thread-level concurrency on multiple processors, e.g. multithreading
Support dynamic concurrency
23 5. Code fragmentation
- Transform sequential code into code fragments by adding explicit concurrency controls to the ISA
- Execute the fragments out of order - concurrently
- Execute instructions within a fragment in-order
- Schedule instructions dynamically - data driven
- interleave fragments on one processor to give tolerance to latency (e.g. memory and other long operations)
- distribute fragments to multiple processors to give scalable performance from wide issue width
- Examples: Tera, Microthreads, Intrathreads
This is an incremental model that adds a handful of additional instructions to a given ISA - for compatibility
24 Microthreads
- Microthreading is instruction-level concurrency
- it uses registers as synchronising memory to give the lowest latency in dependency resolution
- it is a bulk-synchronous model with MIMD, SPMD or mixed blocks of concurrent execution separated by barriers
- it can support shared or distributed memory
- Schedules have deterministic distribution and dynamic interleaving
- the code is schedule invariant and can trade latency tolerance and speedup within resource constraints
- The concurrency controls are efficiently implemented and the support structures are scalable
25 Microthreading
[Figure: the source loop "for i = 1, n" is compiled into binary code containing a "create (i = 1, n)" instruction plus code fragments (SPMD); in hardware (MIMD), a deterministic global schedule distributes the iterations across processors (e.g. i1, i4, i7, i10 on one processor; i2, i5, i8, i11 on another; i3, i6, i9, i12 on a third) into µ-thread queues, where local schedulers interleave them dynamically on the pipelines]
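A trivial C illustration of the deterministic part of this schedule, assuming a modulo distribution of iterations over P processors (as in the figure, where one processor holds i1, i4, i7, i10); the interleaving within each queue remains dynamic and is not modelled here.

/* Illustrative software model: n microthreads distributed deterministically
 * over P processors with a modulo schedule. Constants are hypothetical. */
#include <stdio.h>

#define P 3          /* number of processors in the profile */
#define N 12         /* number of iterations/threads created */

int main(void) {
    /* Deterministic distribution: iteration i goes to processor (i-1) % P,
     * giving the queues shown in the figure above. */
    for (int p = 0; p < P; p++) {
        printf("processor %d queue:", p);
        for (int i = p; i < N; i += P)
            printf(" i%d", i + 1);
        printf("\n");
    }
    /* The order in which each queue is drained is decided dynamically by the
     * local scheduler, as threads' data becomes available. */
    return 0;
}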
26 Microcontexts
- A typical RISC ISA has only a 5-bit register address
- A microthreaded CMP uses a large, distributed register file that must be addressed by instructions
- e.g. thousands of processors by up to a thousand registers per processor
- Microcontexts give a mechanism to bridge this gap
- A microcontext is a window on a given processor, at a given location in the register file, allocated dynamically to a thread
- All registers implement i-structures (sketched below)
- they are allocated empty and implement a blocking read
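A minimal C sketch of the i-structure behaviour stated above (allocated empty, blocking read, a write fills the register and wakes the reader); the state encoding and names are illustrative only.

/* Minimal sketch of an i-structure register. */
#include <stdbool.h>

typedef enum { EMPTY, FULL, WAITING } istate;

typedef struct {
    istate state;
    int    value;        /* the register contents once FULL */
    int    waiter;       /* id of the suspended thread when WAITING */
} ireg;

/* Blocking read: returns true and the value if the register is full,
 * otherwise records the reader and reports that it must suspend. */
bool ireg_read(ireg *r, int thread_id, int *value) {
    if (r->state == FULL) { *value = r->value; return true; }
    r->state = WAITING;
    r->waiter = thread_id;
    return false;                    /* caller suspends this thread */
}

/* Write: fills the register; if a reader was waiting it can now be
 * rescheduled (scheduler not shown). Returns the waiter id, or -1. */
int ireg_write(ireg *r, int value) {
    int wake = (r->state == WAITING) ? r->waiter : -1;
    r->value = value;
    r->state = FULL;
    return wake;
}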
27 Local register files
[Figure: a large local register file (no register renaming, e.g. 1024 entries) holds many microcontexts; the offset from the instruction is combined with a context base from the thread state to address an architectural register set (e.g. 32 entries); microcontexts are shared locally by this address translation and remotely by the network]
28 Different models
- We can identify three models based on the flexibility of inter-context communication
- a) Vector - threads read/write their own context and a subset of the enclosing context (globals)
- b) Fixed dependency - as a) plus a subset of one or more other contexts at the same level and at a fixed distance from it (dependents)
- c) General - as a) plus individual registers from any context at the same level as it
- We are currently focusing on b)
29 Fixed-dependency model
- To implement this model we store and associate with each instruction in the pipeline
- two 10-bit offsets into the register file, and
- two 5-bit boundaries within a context
- The advantages of this model are
- ring connectivity provides all remote communication with modulo schedules
- on the same processor, static mappings enable register bypassing on all local register reads
30 Fixed-dependency model
[Figure: register-file layout for the fixed-dependency model. G, S and, in this case, a dependency distance of 1 are invariants in the threads created. Each thread's context, located by its thread offset, is partitioned into Globals, Shared, Locals and Dependent registers; thread i's Dependent partition maps onto the Shared partition of thread i-1, which may be on the same processor (as for thread i+1 at a local offset) or on another processor; Globals are broadcast between processors from the global offset]
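A hypothetical C sketch of how such a microcontext might be addressed, using a context base from the thread state plus the G and S boundaries and the fixed dependency distance from the two slides above; the partition layout and field names are assumptions, not the actual encoding.

/* Hypothetical address translation for the fixed-dependency model. */
#include <stdint.h>

typedef struct {
    uint16_t base;        /* this thread's context offset in the register file */
    uint16_t dep_base;    /* offset of the context at the fixed dependency distance */
    uint16_t global_base; /* offset of the enclosing context's globals */
    uint8_t  G;           /* boundary: architectural registers below G are globals */
    uint8_t  S;           /* boundary: registers in [G, S) are shareds */
} context;

/* Map a 5-bit architectural register to a physical register-file address. */
uint16_t translate(const context *c, uint8_t reg, int is_dependent_read) {
    if (is_dependent_read)        /* read from the previous context's shareds */
        return c->dep_base + (uint16_t)(reg - c->G);
    if (reg < c->G)               /* globals: shared with the enclosing context */
        return c->global_base + reg;
    /* shareds [G, S) and locals [S, 32) sit contiguously in this thread's window */
    return c->base + (uint16_t)(reg - c->G);
}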
31 CMP Concept
[Figure: CMP concept - processors (may be heterogeneous), each an independently clocked domain, attached to a multi-ported shared-memory system; a reconfigurable broadcast bus distributes create and write-G operations; per processor: create/write G, initialise L0, decoupled Lw, D read; a reconfigurable ring network provides micro-context sharing]
32 Typical structure of loop code
- Define profile - i.e. the number of processors on which to execute the loop
- Define context parameters - i.e. the partitions of the context: L, S, G
- Set any global variables used in the loop
- Create loop - as concurrent microthreads using a control block: start, step, limit, dependency distance, schedule block and pointer(s) to code (see the sketch below)
  Profile  n
  Context  6 1 1
  Mv       G0 L1
  Create   controlblock
  Bsync
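A hypothetical C rendering of the control block named above (start, step, limit, dependency distance, schedule block and code pointer), with a sequential stand-in for create; field names are illustrative and the real create is a hardware instruction, not a C call.

/* Hypothetical control block for loop creation. */
typedef void (*thread_body)(long index, void *env);

typedef struct {
    long        start;      /* first index value */
    long        step;       /* index increment */
    long        limit;      /* last index value */
    long        dep_dist;   /* fixed dependency distance (e.g. 1) */
    long        block;      /* schedule block: iterations per processor allocation */
    thread_body code;       /* pointer(s) to the code fragment(s) */
} control_block;

/* Sequential stand-in: in hardware, the iterations become microthreads that
 * are distributed deterministically and interleaved dynamically. */
void create(const control_block *cb, void *env) {
    for (long i = cb->start; i <= cb->limit; i += cb->step)
        cb->code(i, env);
}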
33 Scalable implementation
- Broadcast and register sharing are implemented by the same asynchronous ring network
- The register file is distributed and has only 5 ports
- independent of the number of processors!
- The scheduler memory is also distributed and is similar in size to the register file
- Both can be scaled in size to adjust the latency tolerance
- Uses a simple in-order processor
- issue is stalled when data is not available - processors go into standby mode clock cycle by clock cycle
- wake-up is asynchronous
34 GALS implementation
- Each processor can be implemented in its own synchronous domain
- with asynchronous interfaces to the ring network, memory and FPU
- Single-cycle operations are statically scheduled; all other operations use the asynchronous ports to the register file
- if a processor has no active threads it is waiting for an asynchronous input and can stop its clocks
35 Power conservation
- Two feedback loops between the hardware scheduler and the fragment processor
- Instruction scheduling: when data is available, work is scheduled to the processor; data availability feeds back from the processor's single-cycle and asynchronous operations (e.g. memory or FPU)
- Power/clock control: a workload measure drives voltage/frequency scaling - adjust voltage and frequency to the relative workload, stop the clocks and go to standby when there is no work
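A toy C model of the power/clock feedback loop described above, assuming frequency and voltage simply track the relative workload and that the clocks stop when there is no work; the policy and constants are illustrative, not the hardware's.

/* Toy model of the power/clock feedback loop. */
typedef struct {
    double f_max;        /* maximum clock frequency */
    double f;            /* current clock frequency (0 = clocks stopped) */
    double v;            /* current supply voltage */
} power_state;

void adjust_power(power_state *p, int active_threads, int max_threads) {
    if (active_threads == 0) {          /* no work: standby, stop the clocks */
        p->f = 0.0;
        return;
    }
    double load = (double)active_threads / (double)max_threads;
    p->f = p->f_max * load;             /* frequency tracks the relative workload */
    p->v = 0.6 + 0.4 * load;            /* illustrative voltage scaling */
}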
36 Vector code performance
Speedup of µ-threaded code is linear over 2.5 orders of magnitude, to within 2%
Speedup is super-linear with respect to non-threaded code, as 20% fewer instructions are executed to complete the computation
Max IPC is 1612 on 2048 processors - 80% of the theoretical maximum
37 Scalable power dissipation
Energy of computation is constant to within 2% over 2.5 orders of magnitude
Using fine-grain control of the clock (dynamic dissipation) and coarse-grain control of power (static dissipation)
38 Performance with and without D-cache
The residual D-cache made no difference to performance
39 Simulation framework
- Based on a cycle-accurate CMP simulator of the Alpha ISA extended with the microthread instructions (ISA-µt)
- The same binary was executed on profiles of 1 to 2048 processors with cold caches (I-cache and D-cache)
- Fixed number of iterations (64K) of the Livermore hydro kernel
40 Adaptive System Environment
- Legacy code executes unchanged on 1 processor
- Microthreaded code executes on n processors
- n can be dynamic (e.g. per loop) to adapt to system or user goals, e.g. performance or power dissipation
- the n processors can be drawn from a pool
- When the concurrency collapses, only the architectural state remains - on the creating processor
- Software concurrency (distributed memory) can also be supported, with I/O mapped to micro-threaded registers
- Need a dynamic model of the resources to get self-adaptive execution of compiled code
41 Summary and future directions
- We have been working on these models and verifying their scalability for some years
- We have just started a 4-year project to
- formalise the models
- develop compilers for the models
- thoroughly investigate performance relative to other approaches
- implement IP