Title: Microthreaded model and DRISC processors - managing concurrency dynamically

1. Microthreaded model and DRISC processors - managing concurrency dynamically
- A seminar given to IFIP 10.3 on 9/5/2007
- Chris Jesshope
- Professor of Computer Systems Engineering
- University of Amsterdam
- Jesshope@science.uva.nl
2. Background - 10 years of research

- This work started in 1996 as a latency-tolerant processor architecture called DRISC, designed for executing data-parallel languages on multiprocessors
- It has evolved over 10 years into a self-similar concurrency model called SVP - or Microthreading - with implementations at the ISA and system level

- A. Bolychevsky, C. R. Jesshope and V. B. Muchnick (1996) Dynamic scheduling in RISC architectures, IEE Proc. E, Computers and Digital Techniques, 143, pp. 309-317
- C. R. Jesshope (2006) Microthreading - a model for distributed instruction-level concurrency, Parallel Processing Letters, 16(2), pp. 209-228
- C. R. Jesshope (2007) A model for the design and programming of multicores, submitted to Advances in Parallel Computing, L. Grandinetti (Ed.), IOS Press, Amsterdam, http://staff.science.uva.nl/jesshope/papers/Multicores.pdf

3. Current and proposed projects

- The NWO Microgrids project is evaluating homogeneous reconfigurable multi-cores based on microthreaded microprocessors
  - 4 years from 01/09/05
- SVP has been adopted in the EU AETHER project as a model for self-adaptive computation based on FPGAs
  - 3 years from 01/01/06
- The APPLE-CORE FP7 proposal will target the C and SAC languages to SVP and will implement prototypes of microthreaded microprocessors (we hope)

4. UvA's multi-core mission

- Managing 10^2 - 10^5 processors per chip
- Operands from large, distributed register files
- Processors tolerant to significant latency
  - hundreds of processor cycles
- On-chip COMA distributed shared memory
- Support for a range of architectural paradigms
  - homogeneous / heterogeneous / FPGA / SIMD
- To do all of this we need a programming model supporting concurrency as a core concept

5. Programming models

- Sequential programming has advantages
  - sequential programs are deterministic and safely composable - i.e. using the well-understood concept of hierarchy (calling functions)
  - source code is universally compatible and can be compiled to any sequential ISA without modification
  - binary-code compatibility is important in commodity processors - although this is not scalable in current processors
- Our aim is to gain the same benefits from a concurrent programming model for multi-cores

6. The Microthread or SVP model

- Blocking threads with data-driven instruction execution

7. Concurrency trees - hierarchical composition

- Concurrent composition - build programs concurrently
- nodes represent threads - leaf nodes perform computation
- branching at a node represents concurrent subordinate threads

[Figure: concurrency trees for Program A, Program B, and their concurrent composition Program AB]

8. Blocking threads

[Figure: thread A creating subordinate threads B0 ... Bn]

What does this mean?
- Threads at different levels run concurrently
- A creates Bi for all i in some set
  - dependencies are defined between threads
- A continues until a sync
- The identifiable events are
  - when A creates B
  - when A writes a value used by B, etc.
  - when Bi completes, for all i

9. Terminology and concepts

- Family of threads
  - all threads at one level
- Unit of work
  - a sub-tree, i.e. all of a thread's subordinate threads
  - may be considered as a job or a task
- Place
  - where a unit of work executes - one or more processors, FPGA cells, etc.

10. Safe composition

- A family of threads is created dynamically as an ordered set defined on an index sequence
  - each thread in the family has access to a unique value in the index sequence - its index in the family
- Restrictions are placed on the communication between threads - these are blocking reads
  - the creating thread may write to the first thread in index sequence, and
  - any created thread may write to the thread whose index is next in sequence to its own
- Communication in a family is therefore acyclic, and deadlock cannot be induced by composition - i.e. by one thread creating a subordinate family of threads (see the sketch below)

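The restriction is easiest to see in code. The following is a minimal sketch in plain C that simulates a family sequentially: each "thread" reads only the shared value written by its predecessor and writes one for its successor, so the communication graph is a chain and composition cannot introduce a cycle. The names are illustrative; real SVP threads run concurrently and block on empty registers.

```c
#include <stdio.h>

enum { N = 8 };

/* One thread of the family: reads the shared value from thread i-1,
   combines it with purely local work, and passes a value to thread i+1. */
static int thread_body(int index, int shared_in) {
    int local = index * index;   /* local: private to this thread        */
    return shared_in + local;    /* shared out: readable only by i+1     */
}

int main(void) {
    int shared = 0;              /* written by the creating thread to thread 0 */
    for (int i = 0; i < N; i++)  /* stands in for: create family over 0..N-1   */
        shared = thread_body(i, shared);
    printf("final shared value: %d\n", shared);  /* defined after the sync */
    return 0;
}
```
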
11. Thread distribution

- A create operation distributes a parameterised family of threads to processing resources - deterministically
  - the number of threads and processors is defined at runtime
- Processors may be one or more homogeneous processors, a dedicated unit, configured FPGA cells, etc.
- Communication deadlock is avoided, but resource deadlock can occur
  - there is a finite set of registers for synchronising contexts
  - this can be statically analysed for some programs
  - but not for unbounded recursion of creates - solved by delegating a unit of work to a new place

12. Registers as synchronisers

- Efficient implementations of microthreads synchronise in shared registers (as i-structures, sketched below)
  - this avoids a memory round-trip latency in synchronising
  - single-cycle synchronisation is possible
- Families of threads communicate and synchronise on shared memory
  - a family's output to memory is not defined until the family completes (the synchronising event)
  - i.e. a bulk synchronisation or barrier
- Here we focus on direct implementations of the model at the level of ISA instructions

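A sketch of the i-structure behaviour may help: a register is either empty or full, a read of an empty register suspends the reader in place, and the eventual write both delivers the data and reschedules the waiter. The names and the single-waiter policy here are illustrative assumptions, not the hardware design.

```c
#include <stdio.h>

typedef enum { EMPTY, FULL } istate_t;

typedef struct {
    istate_t state;
    long     value;
    int      waiter;  /* TT index of a thread suspended here, or -1 */
} ireg_t;

/* A blocking read: returns 0 and records the reader if no data is present. */
int ireg_read(ireg_t *r, int reader_tt_index, long *out) {
    if (r->state == FULL) { *out = r->value; return 1; }
    r->waiter = reader_tt_index;  /* the register records the continuation */
    return 0;                     /* the pipeline switches to another thread */
}

/* A write fills the register and releases any waiter in a single step. */
void ireg_write(ireg_t *r, long value) {
    r->state = FULL;
    r->value = value;
    if (r->waiter >= 0) {
        printf("reschedule thread %d\n", r->waiter);  /* stands in for the TT update */
        r->waiter = -1;
    }
}

int main(void) {
    ireg_t r = { EMPTY, 0, -1 };
    long v;
    if (!ireg_read(&r, 7, &v)) printf("thread 7 suspends\n");
    ireg_write(&r, 42);          /* wakes thread 7 */
    return 0;
}
```
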
13. Putting it all together

[Figure: a family of threads - indexed (i = 0, 2, 4, 6) and dynamically defined]

- A squeeze is a preemption or retraction of a concurrent unit of work

14. System issues

- Threads are dynamic, share memory and can be executed anywhere
- Shared and/or distributed memory implementations are possible
- A place can be on the same chip or on another
- Deterministic distribution of families can be used to optimise data locality

15. Implementation of SVP in conventional processors

16. DRISC processor

- Can apply microthreading to any base ISA
  - just add concurrency-control instructions
  - provide control structures for threads and families
  - provide a large synchronising register file
- Have backward compatibility to the base ISA
  - old binaries run as single threads under the model
- New binaries are schedule invariant
  - use from 1 to Nmax processors

17. Synchronous vs asynchronous register updates

- An instruction in a microthreaded pipeline updates registers either
  - synchronously, when the register is set at the writeback stage of the pipeline, or
  - asynchronously, when the register is set to empty at the writeback stage and some activity concurrent to the pipeline's operation writes a value to the register file asynchronously
- Some instructions do one or the other depending on machine state, e.g. load word depends on an L1 cache hit (see the sketch below)

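As a sketch, the writeback-stage decision can be expressed as follows; the types and helper names are assumptions, following the earlier i-structure sketch:

```c
#include <stdbool.h>

typedef enum { EMPTY, FULL } istate_t;
typedef struct { istate_t state; long value; } ireg_t;

void writeback(ireg_t *target, bool result_available, long result) {
    if (result_available) {
        /* synchronous update: the result is set in the writeback stage */
        target->state = FULL;
        target->value = result;
    } else {
        /* asynchronous update: mark the register empty; a concurrent unit
           (memory, FPU, create logic) will fill it later via its own port */
        target->state = EMPTY;
    }
}

int main(void) {
    ireg_t r;
    writeback(&r, true, 5);   /* e.g. an add completing in the pipeline */
    writeback(&r, false, 0);  /* e.g. a load that missed in the L1      */
    return 0;
}
```
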
18. Regular ISA concurrency control

- Add just five new instructions
  - cre - creates a family of microthreads - this is asynchronous and may set more than one register
    - the events are when the family is identified and when it completes
    - a Thread Control Block (TCB) in memory contains the parameters
  - brk - terminates the family of the executing thread
    - a return value is read from the first register specifier
  - kill / sqze - terminate / preempt a family specified by a family id
    - the family identifier is read from the second register specifier

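The slides do not give the TCB layout; the sketch below shows the kind of parameters it would plausibly carry. All field names are assumptions for illustration.

```c
#include <stdint.h>

/* A hypothetical Thread Control Block: the in-memory parameter record
   that cre reads when creating a family. */
typedef struct {
    uintptr_t thread_pc;            /* entry point of the thread body       */
    int64_t   start, limit, step;   /* the index sequence of the family     */
    uint32_t  block;                /* resource limit, e.g. threads/processor */
    uintptr_t return_target;        /* register to receive the family's result */
} tcb_t;

/* cre would then name the TCB and a pair of target registers, leaving
   Ra (return code) and Ra+1 (fid) to be filled asynchronously. */
```
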
19. DRISC pipeline

- Note the potential for power efficiency
  - if a thread is inactive, its TIB line is turned off
  - if the queue is empty, the processor turns off
  - the queue length measures local load and can be used to adjust the local clock rate

[Figure: DRISC pipeline - a thread instruction buffer feeds a queue of active threads; issued instructions read the synchronising memory (the register file) and proceed to fixed-delay or variable-delay operations (e.g. loads)]
2. Instructions are issued from the head of the active queue and read the synchronising memory
3. If data is available it is sent for processing, otherwise the thread suspends on the empty register
4. Suspended threads are rescheduled when data is written and re-execute the blocked instruction

20. Processor control structures required

- A large synchronising register file (RF)
  - also a register-file map for register allocation
- A thread table (TT) to store a thread's state
  - PC, RF base addresses, queue link field, etc.
- A thread instruction buffer (TIB)
  - an active thread is associated with a line in the TIB
- A family table (FT) to store family information
- Thread and family identifiers are indices into the TT and FT respectively - i.e. they are direct-access structures (see the sketch below)
- No branch predictors, large data caches or complex issue logic are required

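A sketch of what TT and FT entries might look like as direct-access (indexed) structures; the field names are illustrative assumptions based on the items listed above:

```c
#include <stdint.h>

typedef struct {            /* thread table (TT) entry, indexed by thread id */
    uintptr_t pc;           /* where the thread resumes                      */
    uint32_t  rf_base_g, rf_base_l, rf_base_s;  /* register-file base addresses
                                                   for globals/locals/shareds */
    int32_t   queue_link;   /* next thread id in whatever queue it is on     */
} tt_entry_t;

typedef struct {            /* family table (FT) entry, indexed by fid       */
    int64_t   start, limit, step;        /* the family's index sequence      */
    int32_t   active_head, active_tail;  /* head/tail of the active queue    */
    uint32_t  threads_allocated;         /* for cleanup on completion        */
} ft_entry_t;
```
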
21. Synchronising memory

- Registers provide the synchronising memory in a microthreaded pipeline
- The state of a register is stored with its data, and its ports adapt according to that state
  - in state T-cont the register contains a TT address (a suspended thread's continuation)
  - in state RR-cont the register contains a remote RF address (a pending remote read)

[Figure: register state machine]
- initialisation -> empty
- empty, local read with no data -> T-cont
- empty, remote read with no data -> RR-cont
- empty, data write -> full
- T-cont, data write -> full (reschedules the suspended thread)
- RR-cont, data write -> full (completes the remote read)
- the asynchronous writes come from operations concurrent to the pipeline

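The write side of this state machine can be written out directly. The sketch below follows the transitions in the diagram, with the printfs standing in for the reschedule and remote-completion actions:

```c
#include <stdio.h>

typedef enum { EMPTY, FULL, T_CONT, RR_CONT } rstate_t;

typedef struct {
    rstate_t state;
    long     value;  /* data when FULL; a TT address in T_CONT;
                        a remote RF address in RR_CONT */
} sreg_t;

void data_write(sreg_t *r, long data) {
    switch (r->state) {
    case T_CONT:    /* a local read had suspended: reschedule that thread */
        printf("reschedule thread at TT address %ld\n", r->value);
        break;
    case RR_CONT:   /* a remote read was pending: complete it now         */
        printf("send data to remote RF address %ld\n", r->value);
        break;
    default:
        break;
    }
    r->state = FULL;  /* every data write leaves the register full */
    r->value = data;
}

int main(void) {
    sreg_t r = { T_CONT, 7 };  /* a thread at TT index 7 suspended here */
    data_write(&r, 99);
    return 0;
}
```
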
22. Memory references

- To provide latency tolerance, loads and stores are decoupled from the pipeline's operation
  - n.b. the datapath cache may be very small, e.g. 1 KByte
- The ISA's load instruction is
  - synchronous on an L1 D-cache hit
  - asynchronous on an L1 D-cache miss
- In the latter case the target register is written empty by the pipeline and overwritten asynchronously by the memory subsystem when it provides the data (see the sketch below)

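A sketch of that load path, with l1_lookup and enqueue_miss as stand-ins for the cache and memory-subsystem interfaces (the toy hit/miss rule is obviously not real):

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { EMPTY, FULL } rstate_t;
typedef struct { rstate_t state; long value; } sreg_t;

static bool l1_lookup(long addr, long *data) {
    *data = 123;                  /* toy model: even addresses hit */
    return addr % 2 == 0;
}

static void enqueue_miss(long addr, sreg_t *target) {
    printf("miss on %ld; register %p will be filled asynchronously\n",
           addr, (void *)target);
}

void load_word(long addr, sreg_t *target) {
    long data;
    if (l1_lookup(addr, &data)) { /* hit: synchronous, register set at WB  */
        target->state = FULL;
        target->value = data;
    } else {                      /* miss: register written empty; the     */
        target->state = EMPTY;    /* memory subsystem overwrites it later  */
        enqueue_miss(addr, target);
    }
}
```
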
23. Register-to-register operations

- Single-cycle operations are synchronous and scheduled every clock cycle using bypassing
- Multi-cycle operations can be either synchronous or asynchronous
- Variable-cycle operations are scheduled asynchronously (e.g. a shared FPU)
  - the writeback sets the register empty and any dependent instruction is blocked

24. Sharing registers between threads

- Each thread has an identified context in the register file (31 registers, R31-0, with the Alpha ISA)
- Registers are shared between threads' contexts to support the distributed shared register file - sharing is restricted
  - on the same processor, sharing is performed by mapping
  - on adjacent processors, sharing is performed by local communication
- Sub-classes of variables are managed in the context (see the sketch below)
  - global - visible to all threads in a family
  - local - to one thread only
  - shared/dependent - written by one thread, read by its neighbour

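The three classes can be pictured as a per-thread context like the following sketch - a plain-C stand-in using an accumulating family as the example; the layout and names are illustrative, not the µTC syntax:

```c
typedef struct {
    /* globals: written by the parent, read-only to every thread in the family */
    const double *a;
    /* shared/dependent: sum_in is written by the previous thread in index
       order (or by the parent for thread 0); sum_out is read by the next */
    double sum_in;
    double sum_out;
    /* locals: private to this thread */
    double tmp;
} thread_context_t;

void thread_body(thread_context_t *ctx, long index) {
    ctx->tmp     = ctx->a[index];           /* local work                  */
    ctx->sum_out = ctx->sum_in + ctx->tmp;  /* pass the dependency forward */
}
```
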
25. Creating threads

[Figure: register contexts for the creating thread and threads 1..n - each context of 31 registers holds global scalars (read only), locals (read/write) and shareds, with each thread's shareds overlapping its neighbour's so that values pass from one thread to the next in index order]

26. Create

- Create performs the following actions autonomously
  - writes the TCB address to the create buffer at the execute stage
  - sets two targets (e.g. Ra and Ra+1) to empty at the WB stage
  - when the family is allocated an FT slot, it (optionally) writes the fid to Ra+1 using the asynchronous port
    - the family may now be killed or squeezed
  - when the family completes, it (optionally) writes the return value to the target specified in the TCB using the asynchronous port
  - finally, when the family's memory writes have completed, it writes the return code to Ra using the asynchronous port and cleans itself up - i.e. releasing the FT slot

27. Squeeze and kill

- kill and squeeze are asynchronous and very powerful!
- To provide security, a pseudo-random number is generated by the processor and kept both in the FT and as a part of the fid
  - the two must match in order to enable the operations (see the sketch below)
- kill and squeeze traverse down through the create tree from the node the signal was sent to
  - for squeeze this is to a user-defined level
- The concurrency tree is captured implicitly by a parent field in the FT
  - i.e. families are located in related FTs that have the same fid as a parent; these children then propagate the signal in turn

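A sketch of the capability check (field names are assumptions): the fid carries the pseudo-random key, and the FT entry holds the copy generated at create time.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t ft_index; uint32_t key; } fid_t;
typedef struct { uint32_t key; /* ... rest of the family state ... */ } ft_entry_t;

/* kill/squeeze proceeds only if the caller's fid carries the pseudo-random
   number stored in the FT when the family was created. */
bool authorise(const ft_entry_t *ft, fid_t fid) {
    return ft[fid.ft_index].key == fid.key;
}
```
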
28. Thread state

- Threads are held in an indexed table
  - the table index is the thread's reference and is used to build queues on that table
- A thread's state in the TT is encoded by the queue the thread is currently in (see the sketch below)
  - empty - not allocated
  - active - head/tail in the family table
  - suspended - degenerate queue (head = tail) stored in the register the thread is suspended on
  - waiting - head/tail in an I-cache line
- N.b. no thread will execute unless its instructions are in the cache

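Because state is encoded by queue membership, the TT entry needs little more than a link field, and each queue is a (head, tail) pair of TT indices held wherever the slide says: the FT for the active queue, a register for a suspended thread, an I-cache line for waiting threads. A sketch with illustrative names:

```c
#include <stdint.h>

typedef struct { int32_t link; /* next TT index in the current queue, or -1 */ } tt_entry_t;
typedef struct { int32_t head, tail; /* both -1 when the queue is empty */ } queue_t;

void enqueue(queue_t *q, tt_entry_t *tt, int32_t tid) {
    tt[tid].link = -1;
    if (q->head < 0) q->head = tid;  /* one entry gives head == tail - the */
    else tt[q->tail].link = tid;     /* degenerate queue of a suspended    */
    q->tail = tid;                   /* thread                             */
}
```
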
29. Thread state transitions

[Figure: state transition diagram]
- active -> active: executes, context switches, reads data successfully
- active -> suspended: executes, context switches, reads data unsuccessfully
- suspended -> active: data written, PC hits the I-cache
- suspended -> waiting: data written, PC misses the I-cache
- waiting -> active: cache line returns

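Written out as code, the diagram reduces to three transition rules (a sketch; the state names follow the slide, the helpers are assumptions):

```c
#include <stdbool.h>

typedef enum { T_EMPTY, T_ACTIVE, T_SUSPENDED, T_WAITING } tstate_t;

/* on a failed blocking read during execution */
tstate_t on_read_miss(void) { return T_SUSPENDED; }

/* when the data a thread suspended on is written */
tstate_t on_data_written(bool pc_in_icache) {
    return pc_in_icache ? T_ACTIVE     /* ready to issue again          */
                        : T_WAITING;   /* must wait for its cache line  */
}

/* when the missing I-cache line returns */
tstate_t on_cacheline_return(void) { return T_ACTIVE; }
```
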
30. Microgrids - of microthreaded microprocessors

31. Family distribution to clusters

- Source code: for i = 1, n
- Binary code: create i = 1, n
- Hardware: a deterministic global schedule distributes the family i = 1..n over the cluster

[Figure: pipelines P0-P3, each with a scheduler and a thread queue (e.g. i3, i6, i9, i12; i2, i5, i8, i11; i1, i4, i7, i10), connected by a register-sharing ring network]

- Microthreads are scheduled to the pipelines dynamically, and instructions are executed according to dataflow constraints (see the sketch below)

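One plausible deterministic global schedule consistent with the figure is a simple modulo distribution of the index range over the pipelines; the real schedule is an implementation parameter. A sketch:

```c
#include <stdio.h>

int main(void) {
    const int  P = 4;   /* pipelines P0..P3 in the cluster */
    const long n = 12;  /* binary code: create i = 1, n    */
    for (long i = 1; i <= n; i++)
        printf("thread i=%ld -> P%ld\n", i, (i - 1) % P);
    return 0;
}
```
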
32. SEP - dynamic processor allocation

- The microgrid concept defines a pool of bare processors, allocated dynamically by the SEP to threads at any level in the concurrency tree in order to delegate units of work
  - a cluster of processors is configured as a ring, known as a place, and identified by the address of the root processor
  - microthreaded binary code can be executed anywhere and on any number of processors

33. Delegation across a CMP

[Figure: µT processors under a coherent shared memory; the SEP partitions them into clusters (Cluster 1 to Cluster 5) of varying sizes, to which units of work are delegated]

34. Example chip architecture

[Figure: hierarchical tiling - level 0 tiles contain four pipelines (Pipe 0 - Pipe 3) sharing FPU pipes; level 1 tiles add data-diffusion memory and configuration switches]
- Coherency network: 64-byte-wide ring / ring of rings
- Register-sharing network: 8-byte-wide ring
- Delegation network: 1-bit-wide grid

35. The big picture - where are we?

[Figure: roadmap across the sequential, data-parallel and streaming paradigms, marking what exists today, what is in development and what is to be developed]

36. Discussion

- Microthreading provides a unified model of concurrency on a scale from CMPs to grids
- The model is composed concurrently, with restrictions that allow safe composition
- It reflects the problems of future silicon implementations
- We have developed a language, µTC, that captures this concurrency

37. Conclusions

- Microthreaded processors are both computationally and power efficient
  - code is schedule invariant and dynamically distributed
  - instructions are dynamically interleaved
- Control structures are distributed and scalable
  - small compared to an FPU
- Can manage code fragments (threads) as small as a few instructions
  - context switch, signal and reschedule a thread on every clock cycle