Title: Microthreaded model and DRISC processors - managing concurrency dynamically

1. Microthreaded model and DRISC processors - managing concurrency dynamically
- A seminar given to IFIP 10.3 on 9/5/2007
- Chris Jesshope
- Professor of Computer Systems Engineering
- University of Amsterdam
- Jesshope@science.uva.nl
2. Background - 10 years of research

- This work started in 1996 as a latency-tolerant processor architecture called DRISC, designed for executing data-parallel languages on multiprocessors
- It has evolved over 10 years into a self-similar concurrency model called SVP - or Microthreading - with implementations at the ISA and system level

- A. Bolychevsky, C. R. Jesshope and V. B. Muchnick (1996) Dynamic scheduling in RISC architectures, IEE Proc. E, Computers and Digital Techniques, 143, pp. 309-317
- C. R. Jesshope (2006) Microthreading - a model for distributed instruction-level concurrency, Parallel Processing Letters, 16(2), pp. 209-228
- C. R. Jesshope (2007) A model for the design and programming of multicores, submitted to Advances in Parallel Computing, L. Grandinetti (Ed.), IOS Press, Amsterdam, http://staff.science.uva.nl/jesshope/papers/Multicores.pdf

3. Current and proposed projects

- The NWO Microgrids project is evaluating homogeneous reconfigurable multi-cores based on microthreaded microprocessors
  - 4 years from 01/09/05
- SVP has been adopted in the EU AETHER project as a model for self-adaptive computation based on FPGAs
  - 3 years from 01/01/06
- The APPLE-CORE FP7 proposal will target the C and SAC languages to SVP and will implement prototypes of microthreaded microprocessors (we hope)

4. UvA's multi-core mission

- Managing 10^2 - 10^5 processors per chip
- Operands from large, distributed register files
- Processors tolerant to significant latency
  - hundreds of processor cycles
- On-chip COMA distributed shared memory
- Support for a range of architectural paradigms
  - homogeneous / heterogeneous / FPGA / SIMD
- To do all of this we need a programming model supporting concurrency as a core concept

5. Programming models

- Sequential programming has advantages
  - sequential programs are deterministic and safely composable - i.e. using the well-understood concept of hierarchy (calling functions)
  - source code is universally compatible and can be compiled to any sequential ISA without modification
  - binary-code compatibility is important in commodity processors - although this is not scalable in current processors
- Our aim is to gain the same benefits from a concurrent programming model for multi-cores

6. The Microthread or SVP model

- Blocking threads with data-driven instruction execution

7. Concurrency trees - hierarchical composition

- Concurrent composition - build programs concurrently
- nodes represent threads - leaf nodes perform computation
- branching at a node represents concurrent subordinate threads

[Figure: concurrency trees for Program A, Program B, and their concurrent composition Program AB]

8. Blocking threads

[Figure: thread A creating subordinate threads B0 ... Bn]

What does this mean?
- Threads at different levels run concurrently
- A creates Bi for all i in some set
  - dependencies are defined between threads
- A continues until a sync
- The identifiable events are
  - when A creates B
  - when A writes a value used by B, etc.
  - when Bi completes, for all i

9. Terminology and concepts

- Family of threads
  - all threads at one level
- Unit of work
  - a sub-tree, i.e. all of a thread's subordinate threads
  - may be considered as a job or a task
- Place
  - where a unit of work executes - one or more processors, FPGA cells, etc.

10. Safe composition

- A family of threads is created dynamically as an ordered set defined on an index sequence
  - each thread in the family has access to a unique value in the index sequence - its index in the family
- Restrictions are placed on the communication between threads - these are blocking reads
  - the creating thread may write to the first thread in index sequence, and
  - any created thread may write to the thread whose index is next in sequence to its own
- Communication in a family is therefore acyclic, and deadlock cannot be induced by composition - i.e. by one thread creating a subordinate family of threads (see the sketch below)

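The restriction is easiest to see in code. The following is a minimal sketch in plain C that simulates a family sequentially: each "thread" reads only the shared value written by its predecessor and writes one for its successor, so the communication graph is a chain and composition cannot introduce a cycle. The names are illustrative; real SVP threads run concurrently and block on empty registers.

```c
#include <stdio.h>

enum { N = 8 };

/* One thread of the family: reads the shared value from thread i-1,
   combines it with purely local work, and passes a value to thread i+1. */
static int thread_body(int index, int shared_in) {
    int local = index * index;   /* local: private to this thread        */
    return shared_in + local;    /* shared out: readable only by i+1     */
}

int main(void) {
    int shared = 0;              /* written by the creating thread to thread 0 */
    for (int i = 0; i < N; i++)  /* stands in for: create family over 0..N-1   */
        shared = thread_body(i, shared);
    printf("final shared value: %d\n", shared);  /* defined after the sync */
    return 0;
}
```
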
11. Thread distribution

- A create operation distributes a parameterised family of threads to processing resources - deterministically
  - the number of threads and processors is defined at runtime
- Processors may be one or more homogeneous processors, a dedicated unit, configured FPGA cells, etc.
- Communication deadlock is avoided, but resource deadlock can occur
  - there is a finite set of registers for synchronising contexts
  - this can be statically analysed for some programs
  - but not for unbounded recursion of creates - solved by delegating a unit of work to a new place

12. Registers as synchronisers

- Efficient implementations of microthreads synchronise in shared registers (as i-structures, sketched below)
  - this avoids a memory round-trip latency in synchronising
  - single-cycle synchronisation is possible
- Families of threads communicate and synchronise on shared memory
  - a family's output to memory is not defined until the family completes (the synchronising event)
  - i.e. a bulk synchronisation or barrier
- Here we focus on direct implementations of the model at the level of ISA instructions

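A sketch of the i-structure behaviour may help: a register is either empty or full, a read of an empty register suspends the reader in place, and the eventual write both delivers the data and reschedules the waiter. The names and the single-waiter policy here are illustrative assumptions, not the hardware design.

```c
#include <stdio.h>

typedef enum { EMPTY, FULL } istate_t;

typedef struct {
    istate_t state;
    long     value;
    int      waiter;  /* TT index of a thread suspended here, or -1 */
} ireg_t;

/* A blocking read: returns 0 and records the reader if no data is present. */
int ireg_read(ireg_t *r, int reader_tt_index, long *out) {
    if (r->state == FULL) { *out = r->value; return 1; }
    r->waiter = reader_tt_index;  /* the register records the continuation */
    return 0;                     /* the pipeline switches to another thread */
}

/* A write fills the register and releases any waiter in a single step. */
void ireg_write(ireg_t *r, long value) {
    r->state = FULL;
    r->value = value;
    if (r->waiter >= 0) {
        printf("reschedule thread %d\n", r->waiter);  /* stands in for the TT update */
        r->waiter = -1;
    }
}

int main(void) {
    ireg_t r = { EMPTY, 0, -1 };
    long v;
    if (!ireg_read(&r, 7, &v)) printf("thread 7 suspends\n");
    ireg_write(&r, 42);          /* wakes thread 7 */
    return 0;
}
```
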
13. Putting it all together

[Figure: a family of threads - indexed (i = 0, 2, 4, 6) and dynamically defined]

- A squeeze is a preemption or retraction of a concurrent unit of work

14. System issues

- Threads are dynamic, share memory and can be executed anywhere
- Shared and/or distributed memory implementations are possible
- A place can be on the same chip or on another
- Deterministic distribution of families can be used to optimise data locality

15. Implementation of SVP in conventional processors

16. DRISC processor

- Can apply microthreading to any base ISA
  - just add concurrency-control instructions
  - provide control structures for threads and families
  - provide a large synchronising register file
- Have backward compatibility to the base ISA
  - old binaries run as single threads under the model
- New binaries are schedule invariant
  - use from 1 to Nmax processors

17. Synchronous vs asynchronous register updates

- An instruction in a microthreaded pipeline updates registers either
  - synchronously, when the register is set at the writeback stage of the pipeline, or
  - asynchronously, when the register is set to empty at the writeback stage and some activity concurrent to the pipeline's operation writes a value to the register file asynchronously
- Some instructions do one or the other depending on machine state, e.g. load word depends on an L1 cache hit (see the sketch below)

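As a sketch, the writeback-stage decision can be expressed as follows; the types and helper names are assumptions, following the earlier i-structure sketch:

```c
#include <stdbool.h>

typedef enum { EMPTY, FULL } istate_t;
typedef struct { istate_t state; long value; } ireg_t;

void writeback(ireg_t *target, bool result_available, long result) {
    if (result_available) {
        /* synchronous update: the result is set in the writeback stage */
        target->state = FULL;
        target->value = result;
    } else {
        /* asynchronous update: mark the register empty; a concurrent unit
           (memory, FPU, create logic) will fill it later via its own port */
        target->state = EMPTY;
    }
}

int main(void) {
    ireg_t r;
    writeback(&r, true, 5);   /* e.g. an add completing in the pipeline */
    writeback(&r, false, 0);  /* e.g. a load that missed in the L1      */
    return 0;
}
```
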
18. Regular ISA concurrency control

- Add just five new instructions
  - cre - creates a family of microthreads - this is asynchronous and may set more than one register
    - the events are when the family is identified and when it completes
    - a Thread Control Block (TCB) in memory contains the parameters
  - brk - terminates the family of the executing thread
    - a return value is read from the first register specifier
  - kill / sqze - terminate / preempt a family specified by a family id
    - the family identifier is read from the second register specifier

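The slides do not give the TCB layout; the sketch below shows the kind of parameters it would plausibly carry. All field names are assumptions for illustration.

```c
#include <stdint.h>

/* A hypothetical Thread Control Block: the in-memory parameter record
   that cre reads when creating a family. */
typedef struct {
    uintptr_t thread_pc;            /* entry point of the thread body       */
    int64_t   start, limit, step;   /* the index sequence of the family     */
    uint32_t  block;                /* resource limit, e.g. threads/processor */
    uintptr_t return_target;        /* register to receive the family's result */
} tcb_t;

/* cre would then name the TCB and a pair of target registers, leaving
   Ra (return code) and Ra+1 (fid) to be filled asynchronously. */
```
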
19. DRISC pipeline

- Note the potential for power efficiency
  - if a thread is inactive, its TIB line is turned off
  - if the queue is empty, the processor turns off
  - the queue length measures local load and can be used to adjust the local clock rate

[Figure: DRISC pipeline - a thread instruction buffer feeds a queue of active threads; issued instructions read the synchronising memory (the register file) and proceed to fixed-delay or variable-delay operations (e.g. loads)]
2. Instructions are issued from the head of the active queue and read the synchronising memory
3. If data is available it is sent for processing, otherwise the thread suspends on the empty register
4. Suspended threads are rescheduled when data is written and re-execute the blocked instruction

20. Processor control structures required

- A large synchronising register file (RF)
  - also a register-file map for register allocation
- A thread table (TT) to store a thread's state
  - PC, RF base addresses, queue link field, etc.
- A thread instruction buffer (TIB)
  - an active thread is associated with a line in the TIB
- A family table (FT) to store family information
- Thread and family identifiers are indices into the TT and FT respectively - i.e. they are direct-access structures (see the sketch below)
- No branch predictors, large data caches or complex issue logic are required

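A sketch of what TT and FT entries might look like as direct-access (indexed) structures; the field names are illustrative assumptions based on the items listed above:

```c
#include <stdint.h>

typedef struct {            /* thread table (TT) entry, indexed by thread id */
    uintptr_t pc;           /* where the thread resumes                      */
    uint32_t  rf_base_g, rf_base_l, rf_base_s;  /* register-file base addresses
                                                   for globals/locals/shareds */
    int32_t   queue_link;   /* next thread id in whatever queue it is on     */
} tt_entry_t;

typedef struct {            /* family table (FT) entry, indexed by fid       */
    int64_t   start, limit, step;        /* the family's index sequence      */
    int32_t   active_head, active_tail;  /* head/tail of the active queue    */
    uint32_t  threads_allocated;         /* for cleanup on completion        */
} ft_entry_t;
```
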
21. Synchronising memory

- Registers provide the synchronising memory in a microthreaded pipeline
- The state of a register is stored with its data, and its ports adapt according to that state
  - in state T-cont the register contains a TT address (a suspended thread's continuation)
  - in state RR-cont the register contains a remote RF address (a pending remote read)

[Figure: register state machine]
- initialisation -> empty
- empty, local read with no data -> T-cont
- empty, remote read with no data -> RR-cont
- empty, data write -> full
- T-cont, data write -> full (reschedules the suspended thread)
- RR-cont, data write -> full (completes the remote read)
- the asynchronous writes come from operations concurrent to the pipeline

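The write side of this state machine can be written out directly. The sketch below follows the transitions in the diagram, with the printfs standing in for the reschedule and remote-completion actions:

```c
#include <stdio.h>

typedef enum { EMPTY, FULL, T_CONT, RR_CONT } rstate_t;

typedef struct {
    rstate_t state;
    long     value;  /* data when FULL; a TT address in T_CONT;
                        a remote RF address in RR_CONT */
} sreg_t;

void data_write(sreg_t *r, long data) {
    switch (r->state) {
    case T_CONT:    /* a local read had suspended: reschedule that thread */
        printf("reschedule thread at TT address %ld\n", r->value);
        break;
    case RR_CONT:   /* a remote read was pending: complete it now         */
        printf("send data to remote RF address %ld\n", r->value);
        break;
    default:
        break;
    }
    r->state = FULL;  /* every data write leaves the register full */
    r->value = data;
}

int main(void) {
    sreg_t r = { T_CONT, 7 };  /* a thread at TT index 7 suspended here */
    data_write(&r, 99);
    return 0;
}
```
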
22. Memory references

- To provide latency tolerance, loads and stores are decoupled from the pipeline's operation
  - n.b. the datapath cache may be very small, e.g. 1 KByte
- The ISA's load instruction is
  - synchronous on an L1 D-cache hit
  - asynchronous on an L1 D-cache miss
- In the latter case the target register is written empty by the pipeline and overwritten asynchronously by the memory subsystem when it provides the data (see the sketch below)

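A sketch of that load path, with l1_lookup and enqueue_miss as stand-ins for the cache and memory-subsystem interfaces (the toy hit/miss rule is obviously not real):

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { EMPTY, FULL } rstate_t;
typedef struct { rstate_t state; long value; } sreg_t;

static bool l1_lookup(long addr, long *data) {
    *data = 123;                  /* toy model: even addresses hit */
    return addr % 2 == 0;
}

static void enqueue_miss(long addr, sreg_t *target) {
    printf("miss on %ld; register %p will be filled asynchronously\n",
           addr, (void *)target);
}

void load_word(long addr, sreg_t *target) {
    long data;
    if (l1_lookup(addr, &data)) { /* hit: synchronous, register set at WB  */
        target->state = FULL;
        target->value = data;
    } else {                      /* miss: register written empty; the     */
        target->state = EMPTY;    /* memory subsystem overwrites it later  */
        enqueue_miss(addr, target);
    }
}
```
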
23. Register-to-register operations

- Single-cycle operations are synchronous and scheduled every clock cycle using bypassing
- Multi-cycle operations can be either synchronous or asynchronous
- Variable-cycle operations are scheduled asynchronously (e.g. a shared FPU)
  - the writeback sets the register empty and any dependent instruction is blocked

24. Sharing registers between threads

- Each thread has an identified context in the register file (31 registers, R31-0, with the Alpha ISA)
- Registers are shared between threads' contexts to support the distributed shared register file - sharing is restricted
  - on the same processor, sharing is performed by mapping
  - on adjacent processors, sharing is performed by local communication
- Sub-classes of variables are managed in the context (see the sketch below)
  - global - visible to all threads in a family
  - local - to one thread only
  - shared/dependent - written by one thread, read by its neighbour

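The three classes can be pictured as a per-thread context like the following sketch - a plain-C stand-in using an accumulating family as the example; the layout and names are illustrative, not the µTC syntax:

```c
typedef struct {
    /* globals: written by the parent, read-only to every thread in the family */
    const double *a;
    /* shared/dependent: sum_in is written by the previous thread in index
       order (or by the parent for thread 0); sum_out is read by the next */
    double sum_in;
    double sum_out;
    /* locals: private to this thread */
    double tmp;
} thread_context_t;

void thread_body(thread_context_t *ctx, long index) {
    ctx->tmp     = ctx->a[index];           /* local work                  */
    ctx->sum_out = ctx->sum_in + ctx->tmp;  /* pass the dependency forward */
}
```
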
25. Creating threads

[Figure: register contexts for the creating thread and threads 1..n - each context of 31 registers holds global scalars (read only), locals (read/write) and shareds, with each thread's shareds overlapping its neighbour's so that values pass from one thread to the next in index order]

26. Create

- Create performs the following actions autonomously
  - writes the TCB address to the create buffer at the execute stage
  - sets two targets (e.g. Ra and Ra+1) to empty at the WB stage
  - when the family is allocated an FT slot, it (optionally) writes the fid to Ra+1 using the asynchronous port
    - the family may now be killed or squeezed
  - when the family completes, it (optionally) writes the return value to the target specified in the TCB using the asynchronous port
  - finally, when the family's memory writes have completed, it writes the return code to Ra using the asynchronous port and cleans itself up - i.e. releasing the FT slot

27. Squeeze and kill

- kill and squeeze are asynchronous and very powerful!
- To provide security, a pseudo-random number is generated by the processor and kept both in the FT and as a part of the fid
  - the two must match in order to enable the operations (see the sketch below)
- kill and squeeze traverse down through the create tree from the node the signal was sent to
  - for squeeze this is to a user-defined level
- The concurrency tree is captured implicitly by a parent field in the FT
  - i.e. families are located in related FTs that have the same fid as a parent; these children then propagate the signal in turn

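A sketch of the capability check (field names are assumptions): the fid carries the pseudo-random key, and the FT entry holds the copy generated at create time.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t ft_index; uint32_t key; } fid_t;
typedef struct { uint32_t key; /* ... rest of the family state ... */ } ft_entry_t;

/* kill/squeeze proceeds only if the caller's fid carries the pseudo-random
   number stored in the FT when the family was created. */
bool authorise(const ft_entry_t *ft, fid_t fid) {
    return ft[fid.ft_index].key == fid.key;
}
```
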
28. Thread state

- Threads are held in an indexed table
  - the table index is the thread's reference and is used to build queues on that table
- A thread's state in the TT is encoded by the queue the thread is currently in (see the sketch below)
  - empty - not allocated
  - active - head/tail in the family table
  - suspended - degenerate queue (head = tail) stored in the register the thread is suspended on
  - waiting - head/tail in an I-cache line
- N.b. no thread will execute unless its instructions are in the cache

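Because state is encoded by queue membership, the TT entry needs little more than a link field, and each queue is a (head, tail) pair of TT indices held wherever the slide says: the FT for the active queue, a register for a suspended thread, an I-cache line for waiting threads. A sketch with illustrative names:

```c
#include <stdint.h>

typedef struct { int32_t link; /* next TT index in the current queue, or -1 */ } tt_entry_t;
typedef struct { int32_t head, tail; /* both -1 when the queue is empty */ } queue_t;

void enqueue(queue_t *q, tt_entry_t *tt, int32_t tid) {
    tt[tid].link = -1;
    if (q->head < 0) q->head = tid;  /* one entry gives head == tail - the */
    else tt[q->tail].link = tid;     /* degenerate queue of a suspended    */
    q->tail = tid;                   /* thread                             */
}
```
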
29. Thread state transitions

[Figure: state transition diagram]
- active -> active: executes, context switches, reads data successfully
- active -> suspended: executes, context switches, reads data unsuccessfully
- suspended -> active: data written, PC hits the I-cache
- suspended -> waiting: data written, PC misses the I-cache
- waiting -> active: cache line returns

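Written out as code, the diagram reduces to three transition rules (a sketch; the state names follow the slide, the helpers are assumptions):

```c
#include <stdbool.h>

typedef enum { T_EMPTY, T_ACTIVE, T_SUSPENDED, T_WAITING } tstate_t;

/* on a failed blocking read during execution */
tstate_t on_read_miss(void) { return T_SUSPENDED; }

/* when the data a thread suspended on is written */
tstate_t on_data_written(bool pc_in_icache) {
    return pc_in_icache ? T_ACTIVE     /* ready to issue again          */
                        : T_WAITING;   /* must wait for its cache line  */
}

/* when the missing I-cache line returns */
tstate_t on_cacheline_return(void) { return T_ACTIVE; }
```
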
30. Microgrids - of microthreaded microprocessors

31. Family distribution to clusters

- Source code: for i = 1, n
- Binary code: create i = 1, n
- Hardware: a deterministic global schedule distributes the family i = 1..n over the cluster

[Figure: pipelines P0-P3, each with a scheduler and a thread queue (e.g. i3, i6, i9, i12; i2, i5, i8, i11; i1, i4, i7, i10), connected by a register-sharing ring network]

- Microthreads are scheduled to the pipelines dynamically, and instructions are executed according to dataflow constraints (see the sketch below)

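One plausible deterministic global schedule consistent with the figure is a simple modulo distribution of the index range over the pipelines; the real schedule is an implementation parameter. A sketch:

```c
#include <stdio.h>

int main(void) {
    const int  P = 4;   /* pipelines P0..P3 in the cluster */
    const long n = 12;  /* binary code: create i = 1, n    */
    for (long i = 1; i <= n; i++)
        printf("thread i=%ld -> P%ld\n", i, (i - 1) % P);
    return 0;
}
```
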
32. SEP - dynamic processor allocation

- The microgrid concept defines a pool of bare processors, allocated dynamically by the SEP to threads at any level in the concurrency tree in order to delegate units of work
  - a cluster of processors is configured as a ring, known as a place, and identified by the address of the root processor
  - microthreaded binary code can be executed anywhere and on any number of processors

33. Delegation across a CMP

[Figure: µT processors under a coherent shared memory; the SEP partitions them into clusters (Cluster 1 to Cluster 5) of varying sizes, to which units of work are delegated]

34. Example chip architecture

[Figure: hierarchical tiling - level 0 tiles contain four pipelines (Pipe 0 - Pipe 3) sharing FPU pipes; level 1 tiles add data-diffusion memory and configuration switches]
- Coherency network: 64-byte-wide ring / ring of rings
- Register-sharing network: 8-byte-wide ring
- Delegation network: 1-bit-wide grid

35. The big picture - where are we?

[Figure: roadmap across the sequential, data-parallel and streaming paradigms, marking what exists today, what is in development and what is to be developed]

36. Discussion

- Microthreading provides a unified model of concurrency on a scale from CMPs to grids
- The model is composed concurrently, with restrictions that allow safe composition
- It reflects the problems of future silicon implementations
- We have developed a language, µTC, that captures this concurrency

37. Conclusions

- Microthreaded processors are both computationally and power efficient
  - code is schedule invariant and dynamically distributed
  - instructions are dynamically interleaved
- Control structures are distributed and scalable
  - small compared to an FPU
- Can manage code fragments (threads) as small as a few instructions
  - context switch, signal and reschedule a thread on every clock cycle