Title: Microthreaded models for CMPs
1 Microthreaded models for CMPs
- IFIP 10.3 Seminar 6/12/2005
- By Chris Jesshope
- University of Amsterdam
2 Motivation
- Problems and opportunities in designing chip multi-processor architectures
- Memory wall - large memories are slow; need to tolerate long-latency operations
- Global communication - the proportion of the chip reachable in one clock cycle is diminishing exponentially; need asynchrony
- Unscalable support structures - uni-processor issue width does not scale; need distributed structures
- Power barrier - cannot obtain performance indefinitely through frequency scaling; need to fully exploit concurrency in instruction execution
3 The facts of concurrency
- Or how to teach mother to suck eggs
4 Concurrency - real and virtual
- All code has concurrency
- either explicitly or implicitly
- This concurrency can be exploited to
- gain throughput - real
- tolerate latency - virtual
- Lines of iso-concurrency in this space identify tradeoffs between the two
- Schedule invariance allows that tradeoff to be dynamic
[Figure: virtual concurrency (latency tolerance) vs. real concurrency (throughput), with lines of iso-concurrency; schedule invariance allows movement along them]
5 Example
for i = 1, n: sum = sum + a(i)*b(i)
- Take the simplest of examples, the inner-product operation
- Depending on how this is viewed it has either
- No concurrency
- O(n) concurrency in parts
- O(n) concurrency
- What determines how the concurrency is exploited is the schedule of operations (see the sketch below)
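A minimal C sketch (not from the slides) of the first two views, assuming a and b are float arrays of length n: in the first, every operation is chained through sum; in the second, the loads and multiplies are independent and only the accumulation is sequential.

/* Illustrative sketch: the same inner product written two ways,
 * corresponding to two of the views of its concurrency. */
#include <stddef.h>

/* View 1: no concurrency - every operation is chained through 'sum'. */
float inner_sequential(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum = sum + a[i] * b[i];
    return sum;
}

/* View 2: O(n) concurrency in parts - the n loads and multiplies are
 * independent of each other; only the accumulation into 'sum' is sequential. */
float inner_partial(const float *a, const float *b, float *s, size_t n) {
    for (size_t i = 0; i < n; i++)   /* independent: any order, any processor */
        s[i] = a[i] * b[i];
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)   /* sequential dependency chain */
        sum += s[i];
    return sum;
}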
6 Schedules - distribution and interleaving
- Different kinds of schedules are required for
- physical concurrency - distribution
- virtual concurrency - interleaving
- Ideally
- distribution should be static or deterministic - to provide control over locality and communication
- interleaving should be dynamic or non-deterministic - to allow for asynchrony in distributing operations and data
7 Dynamic scheduling
- Dynamic scheduling requires synchronisation, which in turn requires synchronising memory
- the amount of synchronising memory limits the amount of dynamic concurrency
- This is independent of how the concurrency is identified or executed, e.g.
- Out-of-order issue - issue windows or reservation stations and reorder buffers
- Dataflow - matching stores
- Any other approach!
8 Example
for i = 1, n: sum = sum + a(i)*b(i)
- Dynamic scheduling must resolve dependencies - there are dependencies within and between iterations in this example
- within an iteration - the sequence of operations (of non-deterministic delay)
- between iterations - generation of the index and summation of the products
9 Conventional vs. dataflow ISAs
- Synchronisation of dynamic concurrency in different ISAs
- In a dataflow ISA synchronisation is on the nodes
- receive one or more inputs with (almost) identical tags
- schedule the instruction for execution
- write the result to one or more target nodes
- In a conventional ISA synchronisation is on the arcs
- wait for one or two inputs with different tags
- schedule the instruction for execution
- write the result to a single arc
10 Synchronisation memory
- Implemented with either
- associative or pseudo-associative memory
- expensive to implement but supports flexibility in tagging
- explicit token store (ETS) - memory with full/empty bits
- here the tag is used as the address into memory
- ETS is now more widely used
- in an out-of-order issue processor the tags are the renamed, global register specifiers
- in a dataflow architecture the tags identify a processor and a matching location in a distributed memory
11 Dataflow matching
- Two arcs incident on a node have the same tag/address - m
- arc ordering is needed for non-associative operations - hence have an l-r discriminator
- The first token arriving sets a full/empty bit to full (1)
- The second token arriving at the same node schedules the operation (a code sketch follows)
[Figure: tokens a(i) and b(i) with tags m-l and m-r match at location m in the synchronising memory; the full/empty bit goes from empty (0) to full (1), and the matched pair is sent to the ALU with the result tag s-l]
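A hypothetical C sketch of the explicit-token-store matching just described: the tag addresses a slot with a full/empty bit, and an l-r discriminator orders the operands for non-associative operations. The names (match_slot, deliver_token) and types are illustrative, not the hardware's.

/* Hypothetical sketch of an explicit-token-store matching location.
 * The tag is used directly as an address; a full/empty bit records whether
 * one operand has already arrived. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool  full;      /* full/empty bit: empty (0) until the first token arrives */
    float operand;   /* the operand stored by the first arrival */
} match_slot;

/* Deliver a token carrying 'value' to matching-store address 'tag'.
 * 'is_left' is the l-r discriminator needed for non-associative operations;
 * since exactly one left and one right token target a slot, the second
 * arrival's side determines the pairing. Returns true (and writes *lhs/*rhs)
 * when the second token arrives and the instruction can be scheduled. */
bool deliver_token(match_slot *store, uint32_t tag, float value, bool is_left,
                   float *lhs, float *rhs) {
    match_slot *slot = &store[tag];
    if (!slot->full) {               /* first arrival: store and mark full */
        slot->operand = value;
        slot->full = true;
        return false;
    }
    /* second arrival: pair the operands in the right order and fire */
    *lhs = is_left ? value : slot->operand;
    *rhs = is_left ? slot->operand : value;
    slot->full = false;              /* slot is free for reuse */
    return true;
}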
12 Conventional ISA synchronisation
- Each arc corresponds to a unique address in the synchronisation memory
- This is the renamed register specifier in out-of-order issue
- The location is flagged as empty when initialised
- An instruction that requires this location cannot be executed until a value has been written
- The flag is set to full on a write to this location
[Figure: a single location m holding a(i), with its full/empty flag going from empty (0) to full (1) when the value is written]
13 Contextual information
- For reentrant code, contextual information is required to manage the use of synchronisation memory
- General-purpose dataflow - frames, see [1]
- Wavescalar - a wave number identifies and sequentialises contexts (and memory references)
- TRIPS - a small fixed number of contexts exposed by speculatively executing the high-level control-flow graph
- Out-of-order issue - no explicit contextual information
[1] G. Papadopoulos (1996) Implementation of a General-Purpose Dataflow Multiprocessor, MIT Press
14 Summation dependency
All loads can be scheduled simultaneously in this example; the n independent multiplications are each dependent on their respective loads; the summation, however, is sequential - or is it?
15 Reduction operations
for i = 1, n: s(i) = a(i)*b(i); sab = sum(s)
- The code can be transformed using a reduction operation - sum
- however, the schedule below is too specific, as the reduction can be performed in any order
[Figure: pairs of loads of a(i) and b(i) feed multiplies whose results are summed in one fixed order]
16 Dataflow scheduling of reductions
- Sum is implemented most generally using a set of values s(i), where each scheduled operation
- removes a pair of values from the set, sums them and returns the result to the set (sketched below)
- n-1 such operations are required and at most n/2 can be performed concurrently
[Figure: the same loads and multiplies, but the partial sums are combined in arbitrary pairs rather than in a fixed order]
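A short C sketch of the set-based reduction described above: partial sums are kept in a work list, each step removes a pair, adds them and returns the result, taking exactly n-1 steps in whatever order the pairs are chosen.

/* Illustrative sketch (not the hardware mechanism): a reduction modelled as
 * a work-list of partial sums. Each step removes a pair, adds them and
 * returns the result to the set; exactly n-1 steps are needed and the pairs
 * can be chosen in any order (up to n/2 of them concurrently). */
#include <stddef.h>

float reduce_set(float *set, size_t n) {
    while (n > 1) {
        /* Pairs are taken from the end of the array here, but any policy
         * (and any interleaving across processors) gives the same result,
         * up to floating-point rounding. */
        float a = set[--n];
        float b = set[--n];
        set[n++] = a + b;        /* return the partial sum to the set */
    }
    return set[0];
}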
17 Matching for reductions
- Example of the decentralised control of reductions
- Use special reduction tokens addressing special locations
- The location matches pairs of operands and sends results to the same location
- Again, distribution can be deterministic
- but the ordering of operations and the pairs matched is arbitrary
- Termination can use token counting, i.e. tag one token with -(n-1) and all others with 1; terminate when their sum reaches 0
[Figure: reduction token format - tag (special ID of the sum location), Pid (processor id to perform the operation), s(i) (data to be summed)]
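A toy C model of the token-counting termination, assuming the reading above: one token is seeded with -(n-1), all others with 1, and counts are added when a pair is matched, so the token whose count reaches 0 carries the final sum. All names are illustrative.

/* Toy model of token-counted termination for a reduction. The combining
 * rule and seeding are assumptions read from the slide, not its definitive
 * mechanism. */
#include <stdio.h>

typedef struct { float value; int count; } token;

token combine(token x, token y) {
    token r = { x.value + y.value, x.count + y.count };
    return r;
}

int main(void) {
    float s[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    int n = 4;
    token t[4];
    for (int i = 0; i < n; i++) {
        t[i].value = s[i];
        t[i].count = (i == 0) ? -(n - 1) : 1;   /* seed one token with -(n-1) */
    }
    /* Match pairs in an arbitrary order; the token whose count reaches 0
     * has absorbed every input and holds the final result. */
    int m = n;
    while (m > 1) {
        token a = t[--m];
        token b = t[--m];
        t[m++] = combine(a, b);
        if (t[m - 1].count == 0)
            printf("final sum = %f\n", t[m - 1].value);
    }
    return 0;
}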
18 The loop dependency
- The loop provides contextual information
- out-of-order issue - the loop control is predicted and the loop is unrolled into synchronising memory, but no notion of context is retained
- Dataflow - loop-control operations are executed as instructions and identify loop frames
- Vector - the loop is embedded completely in a single instruction
- In general this dependency can be statically or deterministically removed
19 Removal of loop dependency
20 Microthreads
- Fragmentation of sequential code
21 Goals of our work
- Concurrent instruction issue (i.e. the number of instructions issued concurrently on chip, regardless of mechanism) such that
- silicon area is proportional to issue width
- power dissipated is proportional to issue width, and
- performance is proportional to power dissipated for a given clock frequency
- Should be programmable from sequential code
- should be backwards compatible
- unmodified code should run on a single processor
- or be translated once for any number of processors - binary-to-binary translation or by recompilation
22 Concurrency in instruction issue
- How to manage concurrency when departing from sequential instruction issue
- No controls - execute instructions out of programmed order, e.g. superscalar
- Fixed schedules - execute instructions using compiled schedules, e.g. VLIW
- Dataflow - execute instructions dynamically when data is available, e.g. Wavescalar
- Software - thread-level concurrency on multiple processors, e.g. multithreading
Support dynamic concurrency
23 5. Code fragmentation
- Transform sequential code into code fragments by adding explicit concurrency controls to the ISA
- Execute the fragments out of order - concurrently
- Execute instructions within a fragment in-order
- Schedule instructions dynamically - data driven
- interleave fragments on one processor to give tolerance to latency (e.g. memory and other long operations)
- distribute fragments to multiple processors to give scalable performance from wide issue width
- Examples: Tera, Microthreads, Intrathreads
This is an incremental model that adds a handful of additional instructions to a given ISA - for compatibility
24 Microthreads
- Microthreading is instruction-level concurrency
- it uses registers as synchronising memory to give the lowest latency in dependency resolution
- it is a bulk-synchronous model with MIMD, SPMD or mixed blocks of concurrent execution separated by barriers
- it can support shared or distributed memory
- Schedules have deterministic distribution and dynamic interleaving
- the code is schedule invariant and can trade latency tolerance and speedup within resource constraints
- The concurrency controls are efficiently implemented and the support structures are scalable
25 Microthreading
[Figure: the source loop "for i = 1, n" is compiled into binary code containing a "create (i = 1, n)" instruction plus code fragments (SPMD); in hardware (MIMD), a deterministic global schedule distributes the iterations across processors (e.g. i1, i4, i7, i10 on one processor; i2, i5, i8, i11 on another; i3, i6, i9, i12 on a third) into µ-thread queues, where local schedulers interleave them dynamically on the pipelines]
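A trivial C illustration of the deterministic part of this schedule, assuming a modulo distribution of iterations over P processors (as in the figure, where one processor holds i1, i4, i7, i10); the interleaving within each queue remains dynamic and is not modelled here.

/* Illustrative software model: n microthreads distributed deterministically
 * over P processors with a modulo schedule. Constants are hypothetical. */
#include <stdio.h>

#define P 3          /* number of processors in the profile */
#define N 12         /* number of iterations/threads created */

int main(void) {
    /* Deterministic distribution: iteration i goes to processor (i-1) % P,
     * giving the queues shown in the figure above. */
    for (int p = 0; p < P; p++) {
        printf("processor %d queue:", p);
        for (int i = p; i < N; i += P)
            printf(" i%d", i + 1);
        printf("\n");
    }
    /* The order in which each queue is drained is decided dynamically by the
     * local scheduler, as threads' data becomes available. */
    return 0;
}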
26 Microcontexts
- A typical RISC ISA has only a 5-bit register address
- A microthreaded CMP uses a large, distributed register file that must be addressed by instructions
- e.g. thousands of processors by up to a thousand registers per processor
- Microcontexts give a mechanism to bridge this gap
- A microcontext is a window on a given processor, at a given location in the register file, allocated dynamically to a thread
- All registers implement i-structures (sketched below)
- they are allocated empty and implement a blocking read
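A minimal C sketch of the i-structure behaviour stated above (allocated empty, blocking read, a write fills the register and wakes the reader); the state encoding and names are illustrative only.

/* Minimal sketch of an i-structure register. */
#include <stdbool.h>

typedef enum { EMPTY, FULL, WAITING } istate;

typedef struct {
    istate state;
    int    value;        /* the register contents once FULL */
    int    waiter;       /* id of the suspended thread when WAITING */
} ireg;

/* Blocking read: returns true and the value if the register is full,
 * otherwise records the reader and reports that it must suspend. */
bool ireg_read(ireg *r, int thread_id, int *value) {
    if (r->state == FULL) { *value = r->value; return true; }
    r->state = WAITING;
    r->waiter = thread_id;
    return false;                    /* caller suspends this thread */
}

/* Write: fills the register; if a reader was waiting it can now be
 * rescheduled (scheduler not shown). Returns the waiter id, or -1. */
int ireg_write(ireg *r, int value) {
    int wake = (r->state == WAITING) ? r->waiter : -1;
    r->value = value;
    r->state = FULL;
    return wake;
}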
27 Local register files
[Figure: a large local register file (no register renaming, e.g. 1024 entries) holds many microcontexts; the offset from the instruction is combined with a context base from the thread state to address an architectural register set (e.g. 32 entries); microcontexts are shared locally by this address translation and remotely by the network]
28 Different models
- We can identify three models based on the flexibility of inter-context communication
- a) Vector - threads read/write their own context and a subset of the enclosing context (globals)
- b) Fixed dependency - as a) plus a subset of one or more other contexts at the same level and at a fixed distance from it (dependents)
- c) General - as a) plus individual registers from any context at the same level as it
- We are currently focusing on b)
29 Fixed-dependency model
- To implement this model we store and associate with each instruction in the pipeline
- two 10-bit offsets into the register file, and
- two 5-bit boundaries within a context
- The advantages of this model are
- ring connectivity provides all remote communication with modulo schedules
- on the same processor, static mappings enable register bypassing on all local register reads
30 Fixed-dependency model
[Figure: register-file layout for the fixed-dependency model. G, S and, in this case, a dependency distance of 1 are invariants in the threads created. Each thread's context, located by its thread offset, is partitioned into Globals, Shared, Locals and Dependent registers; thread i's Dependent partition maps onto the Shared partition of thread i-1, which may be on the same processor (as for thread i+1 at a local offset) or on another processor; Globals are broadcast between processors from the global offset]
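A hypothetical C sketch of how such a microcontext might be addressed, using a context base from the thread state plus the G and S boundaries and the fixed dependency distance from the two slides above; the partition layout and field names are assumptions, not the actual encoding.

/* Hypothetical address translation for the fixed-dependency model. */
#include <stdint.h>

typedef struct {
    uint16_t base;        /* this thread's context offset in the register file */
    uint16_t dep_base;    /* offset of the context at the fixed dependency distance */
    uint16_t global_base; /* offset of the enclosing context's globals */
    uint8_t  G;           /* boundary: architectural registers below G are globals */
    uint8_t  S;           /* boundary: registers in [G, S) are shareds */
} context;

/* Map a 5-bit architectural register to a physical register-file address. */
uint16_t translate(const context *c, uint8_t reg, int is_dependent_read) {
    if (is_dependent_read)        /* read from the previous context's shareds */
        return c->dep_base + (uint16_t)(reg - c->G);
    if (reg < c->G)               /* globals: shared with the enclosing context */
        return c->global_base + reg;
    /* shareds [G, S) and locals [S, 32) sit contiguously in this thread's window */
    return c->base + (uint16_t)(reg - c->G);
}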
31 CMP Concept
[Figure: CMP concept - processors (may be heterogeneous), each an independently clocked domain, attached to a multi-ported shared-memory system; a reconfigurable broadcast bus distributes create and write-G operations; per processor: create/write G, initialise L0, decoupled Lw, D read; a reconfigurable ring network provides micro-context sharing]
32 Typical structure of loop code
- Define profile - i.e. the number of processors on which to execute the loop
- Define context parameters - i.e. the partitions of the context: L, S, G
- Set any global variables used in the loop
- Create loop - as concurrent microthreads using a control block: start, step, limit, dependency distance, schedule block and pointer(s) to code (see the sketch below)
  Profile  n
  Context  6 1 1
  Mv       G0 L1
  Create   controlblock
  Bsync
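A hypothetical C rendering of the control block named above (start, step, limit, dependency distance, schedule block and code pointer), with a sequential stand-in for create; field names are illustrative and the real create is a hardware instruction, not a C call.

/* Hypothetical control block for loop creation. */
typedef void (*thread_body)(long index, void *env);

typedef struct {
    long        start;      /* first index value */
    long        step;       /* index increment */
    long        limit;      /* last index value */
    long        dep_dist;   /* fixed dependency distance (e.g. 1) */
    long        block;      /* schedule block: iterations per processor allocation */
    thread_body code;       /* pointer(s) to the code fragment(s) */
} control_block;

/* Sequential stand-in: in hardware, the iterations become microthreads that
 * are distributed deterministically and interleaved dynamically. */
void create(const control_block *cb, void *env) {
    for (long i = cb->start; i <= cb->limit; i += cb->step)
        cb->code(i, env);
}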
33 Scalable implementation
- Broadcast and register sharing are implemented by the same asynchronous ring network
- The register file is distributed and has only 5 ports
- independent of the number of processors!
- The scheduler memory is also distributed and is similar in size to the register file
- Both can be scaled in size to adjust the latency tolerance
- Uses a simple in-order processor
- issue is stalled when data is not available - processors go into standby mode clock cycle by clock cycle
- wake-up is asynchronous
34 GALS implementation
- Each processor can be implemented in its own synchronous domain
- with asynchronous interfaces to the ring network, memory and FPU
- Single-cycle operations are statically scheduled; all other operations use the asynchronous ports to the register file
- if a processor has no active threads it is waiting for an asynchronous input and can stop its clocks
35 Power conservation
- Two feedback loops between the hardware scheduler and the fragment processor
- Instruction scheduling: when data is available, work is scheduled to the processor; data availability feeds back from the processor's single-cycle and asynchronous operations (e.g. memory or FPU)
- Power/clock control: a workload measure drives voltage/frequency scaling - adjust voltage and frequency to the relative workload, stop the clocks and go to standby when there is no work
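A toy C model of the power/clock feedback loop described above, assuming frequency and voltage simply track the relative workload and that the clocks stop when there is no work; the policy and constants are illustrative, not the hardware's.

/* Toy model of the power/clock feedback loop. */
typedef struct {
    double f_max;        /* maximum clock frequency */
    double f;            /* current clock frequency (0 = clocks stopped) */
    double v;            /* current supply voltage */
} power_state;

void adjust_power(power_state *p, int active_threads, int max_threads) {
    if (active_threads == 0) {          /* no work: standby, stop the clocks */
        p->f = 0.0;
        return;
    }
    double load = (double)active_threads / (double)max_threads;
    p->f = p->f_max * load;             /* frequency tracks the relative workload */
    p->v = 0.6 + 0.4 * load;            /* illustrative voltage scaling */
}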
36 Vector code performance
Speedup of µ-threaded code is linear over 2.5 orders of magnitude, to within 2%
Speedup is super-linear with respect to non-threaded code, as 20% fewer instructions are executed to complete the computation
Max IPC is 1612 on 2048 processors - 80% of the theoretical maximum
37 Scalable power dissipation
Energy of computation is constant to within 2% over 2.5 orders of magnitude
Using fine-grain control of the clock (dynamic dissipation) and coarse-grain control of power (static dissipation)
38 Performance with and without D-cache
The residual D-cache made no difference to performance
39 Simulation framework
- Based on a cycle-accurate CMP simulator of the Alpha ISA extended with the microthread instructions (ISA-µt)
- The same binary was executed on profiles of 1 to 2048 processors with cold caches (I-cache and D-cache)
- Fixed number of iterations (64K) of the Livermore hydro kernel
40 Adaptive System Environment
- Legacy code executes unchanged on 1 processor
- Microthreaded code executes on n processors
- n can be dynamic (e.g. per loop) to adapt to system or user goals, e.g. performance or power dissipation
- the n processors can be drawn from a pool
- When the concurrency collapses, only the architectural state remains - on the creating processor
- Software concurrency (distributed memory) can also be supported, with I/O mapped to micro-threaded registers
- Need a dynamic model of the resources to get self-adaptive execution of compiled code
41 Summary and future directions
- We have been working on these models and verifying their scalability for some years
- We have just started a 4-year project to
- formalise the models
- develop compilers for the models
- thoroughly investigate performance relative to other approaches
- implement IP