Title: Multi-Threaded Architectures (Sima, Fountain and Kacsuk, Chapter 16)
1 Multi-Threaded Architectures
Sima, Fountain and Kacsuk, Chapter 16
2 Memory and Synchronization Latency
- Scalability of a system is limited by its ability to handle memory latency and algorithmic synchronization delays
- The overall solution is well known: do something else whilst waiting
- Remote memory accesses
  - Much slower than local accesses
  - Varying delay, depending on
    - Network traffic
    - Memory traffic
3 Processor Utilization
- Utilization U = P / T
  - P: time spent processing
  - T: total time
- Since T = P + I + S, this is U = P / (P + I + S) (worked example below)
  - I: time spent waiting on other tasks
  - S: time spent switching tasks
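A quick worked example in Python (the cycle counts are invented for illustration):

    # A thread spends 60 cycles processing, 30 waiting on other tasks,
    # and 10 switching tasks (hypothetical values).
    P, I, S = 60, 30, 10
    U = P / (P + I + S)
    print(U)  # 0.6, i.e. 60% utilization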
4 Basic ideas - Multithreading
- Fine grain: task switch every cycle
- Coarse grain: task switch every n cycles, e.g. when a thread blocks (see the sketch below)
[Figure: thread execution timelines; while one thread is blocked another runs, each switch incurring task-switch overhead]
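A minimal sketch of fine-grain interleaving in Python (thread contents and counts invented; coarse grain would instead keep issuing from one thread until it blocks):

    from collections import deque

    # Each thread is a queue of remaining operations ('c' = compute cycle).
    threads = {"T0": deque("ccc"), "T1": deque("ccc"), "T2": deque("ccc")}

    # Fine grain: issue one operation from a different thread every cycle,
    # so one thread's stall never idles the pipeline.
    ready = deque(threads)
    cycle = 0
    while ready:
        t = ready.popleft()
        op = threads[t].popleft()
        print(f"cycle {cycle}: {t} issues {op}")
        if threads[t]:
            ready.append(t)   # round-robin: rejoin the back of the queue
        cycle += 1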
5 Design Space
6 Classification of multi-threaded architectures
7 Computational Models
8 Sequential control flow (von Neumann)
- Flow of control and data are separated
- Instructions are executed sequentially (or at least with sequential semantics, see Chapter 7)
- Control flow is changed with JUMP/GOTO/CALL instructions
- Data is stored in rewritable memory
- Flow of data does not affect execution order (see the sketch below)
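As a minimal sketch, the example used on the next slide, R := (A - B) * (B + 1), under sequential semantics (input values invented):

    # Instructions execute strictly in program order against shared
    # rewritable memory; the flow of data never reorders them.
    A, B = 10, 4
    m1 = A - B    # L1
    m2 = B + 1    # L2
    R = m1 * m2   # L3 -> 30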
9 Sequential Control Flow Model
[Figure: control-flow graph for R := (A - B) * (B + 1); L1 computes m1 := A - B, control flows to L2, which computes m2 := B + 1, then to L3, which computes R := m1 * m2]
10 Dataflow
- Control is tied to data
- An instruction fires when its data is available
  - Otherwise it is suspended
- Order of instructions in the program has no effect on execution order (cf. von Neumann)
- No shared rewritable memory
  - Write-once semantics
- Code is stored as a dataflow graph
- Data is transported as tokens
- Parallelism occurs if multiple instructions can fire at the same time
  - Needs a parallel processor
- Nodes are self-scheduling (see the sketch below)
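A toy dataflow interpreter for R := (A - B) * (B + 1) (the graph encoding and the dict-based matching store are invented for illustration):

    import operator

    # node -> (operation, list of (consumer node, operand port))
    nodes = {
        "sub": (operator.sub, [("mul", 0)]),
        "add": (operator.add, [("mul", 1)]),
        "mul": (operator.mul, []),
    }
    waiting = {}                                # operands seen so far
    tokens = [("sub", 0, 10), ("sub", 1, 4),    # A = 10, B = 4
              ("add", 0, 4), ("add", 1, 1)]     # B = 4, constant 1

    while tokens:
        node, port, value = tokens.pop(0)
        slots = waiting.setdefault(node, {})
        slots[port] = value
        if len(slots) == 2:                     # both operands present: fire
            op, consumers = nodes[node]
            result = op(slots[0], slots[1])
            print(node, "fires ->", result)     # order is set by data alone
            for dest, dport in consumers:
                tokens.append((dest, dport, result))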
11 Dataflow: arbitrary execution order
[Figure: dataflow graph for R := (A - B) * (B + 1); one possible firing order]
12 Dataflow: arbitrary execution order
[Figure: the same graph with a different firing order; the result is unchanged]
13 Dataflow: Parallel Execution
[Figure: the same graph; the subtract and add nodes fire simultaneously on a parallel processor]
14 Implementation
- The dataflow model requires a very different execution engine
- Data must be stored in a special matching store
- Instructions must be triggered when both operands are available
- Parallel operations must be scheduled to processors dynamically
  - We don't know a priori when they will be available
- Instruction operands are pointers
  - To an instruction
  - And an operand number (see the sketch below)
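A sketch of such an operand pointer as a token tag (field names invented):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Token:
        value: int   # the data being transported
        instr: str   # pointer to the destination instruction
        port: int    # which operand slot of that instruction it fills

    t = Token(value=12, instr="L4", port=1)  # deliver 12 to operand 1 of L4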
15 Dataflow model of execution
[Figure: dataflow execution of R := (A - B) * (B + 1); L1 computes B and sends tokens to slots L2/2 and L3/1; L2 (subtract, using A) and L3 (add, using 1) send their results to L4/1 and L4/2; L4 sends R on to L6/1]
16 Parallel Control flow
- Sometimes called macro dataflow
- Data flows between blocks of sequential code
- Has the advantages of both dataflow and von Neumann
  - Context switch overhead reduced
  - Compiler can schedule instructions statically
  - Don't need a fast matching store
- Requires additional control instructions
  - Fork/Join (see the sketch below)
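A fork/join sketch of the running example using Python threads (standing in for the architecture's FORK/JOIN control instructions):

    import threading

    # R := (A - B) * (B + 1): two sequential blocks run as forked threads;
    # the JOIN waits for both before the final block executes.
    A, B = 10, 4
    results = {}

    def block1():                  # thread 1: m1 := A - B
        results["m1"] = A - B

    def block2():                  # thread 2: m2 := B + 1
        results["m2"] = B + 1

    t1 = threading.Thread(target=block1)    # FORK
    t2 = threading.Thread(target=block2)
    t1.start(); t2.start()
    t1.join(); t2.join()                    # JOIN 2
    R = results["m1"] * results["m2"]       # 30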
17 Macro Dataflow (Hybrid Control/Dataflow)
[Figure: hybrid graph for R := (A - B) * (B + 1); L1: FORK L4 starts a second thread; one thread computes m1 := A - B (L2) and does GOTO L5 (L3) while the other computes m2 := B + 1 (L4); L5: JOIN 2 waits for both; L6 computes R := m1 * m2]
18 Issues for Hybrid dataflow
- Blocks of sequential instructions need to be large enough to absorb the overheads of context switching
- Data memory is the same as in MIMD
  - Can be partitioned or shared
- Synchronization instructions are required
  - Semaphores, test-and-set (see the sketch below)
- Control tokens are required to synchronize threads
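A schematic test-and-set spin lock in Python (illustrative only: Python exposes no raw atomic test-and-set, so a Lock stands in for the hardware atomicity):

    import threading

    class TestAndSetLock:
        def __init__(self):
            self._flag = False
            self._atomic = threading.Lock()   # models the atomic instruction

        def _test_and_set(self):
            with self._atomic:                # atomically: read old, set True
                old, self._flag = self._flag, True
                return old

        def acquire(self):
            while self._test_and_set():       # busy-wait while already held
                pass

        def release(self):
            self._flag = False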
19 Some examples
20 Denelcor HEP
- Designed to tolerate latency in memory
- Fine-grain interleaving of threads
- Processor pipeline contains 8 stages
- Each time step, a new thread enters the pipeline
- Threads are taken from the Process Status Word (PSW) queue
- After a thread is taken from the PSW queue, its instruction and operands are fetched
- When an instruction is executed, another one is placed on the PSW queue
- Threads are interleaved at the instruction level (see the sketch below)
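A toy model of the HEP's instruction-level interleaving (pipeline depth and thread names invented):

    from collections import deque

    # Each cycle the front thread of the PSW queue issues one instruction,
    # then rejoins the back of the queue.
    psw_queue = deque(["T0", "T1", "T2", "T3"])
    for cycle in range(8):
        thread = psw_queue.popleft()
        print(f"cycle {cycle}: issue next instruction of {thread}")
        psw_queue.append(thread)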
21 Denelcor HEP
- Memory latency is tolerated with the Scheduler Function Unit (SFU)
- Memory words are tagged as full or empty
- Attempting to read an empty word suspends the current thread
  - The current PSW entry is moved to the SFU
- When the data is written, the entry is taken from the SFU and placed back on the PSW queue
22 Synchronization on the HEP
- All registers have a Full/Empty/Reserved bit
- Reading an empty register causes the thread to be placed back on the PSW queue without updating its program counter
- Thread synchronization is busy-wait
  - But other threads can run (see the sketch below)
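A minimal sketch of full/empty-bit synchronization (data structures invented; the HEP does this per register/memory word in hardware):

    from collections import deque

    psw_queue = deque()
    cells = {"x": {"full": False, "value": None}}

    def try_read(name, thread):
        cell = cells[name]
        if not cell["full"]:
            psw_queue.append(thread)   # suspend: retry later, PC unchanged
            return None
        cell["full"] = False           # a consuming read empties the cell
        return cell["value"]

    def write(name, value):
        # Fill the cell; previously re-queued readers will now succeed.
        cells[name] = {"full": True, "value": value}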
23 HEP Architecture
[Figure: HEP processor organization; PSW queue, matching unit, program memory and increment control; operand fetch and registers feed operand hands 1 and 2; function units 1 to N; the SFU connects to/from data memory]
24 HEP configuration
- Up to 16 processors
- Up to 128 data memories
- Connected by a high-speed switch
- Limitations
  - Threads can have only 1 outstanding memory request
  - Thread synchronization puts bubbles in the pipeline
  - Maximum of 64 threads, causing problems for software
    - Need to throttle loops
  - If parallelism is lower than 8, full utilisation is not possible
25 MIT Alewife Processor
- 512 processors in a 2-dim mesh
- Sparcle processor
- Physically distributed memory
- Logically shared memory
- Hardware-supported cache coherence
- Hardware-supported user-level message passing
- Multi-threading
26 Threading in Alewife
- Coarse-grained multithreading
- Pipeline works on a single thread as long as no remote memory access or synchronization is required
- Can exploit register optimization in the pipeline
- Integration of multi-threading with hardware-supported cache coherence
27 The Sparcle Processor
- Extension of the Sun SPARC architecture
- Tolerant of memory latency
- Fine-grained synchronisation
- Efficient user-level message passing
28 Fast context switching
- SPARC has 8 overlapping register windows
- Sparcle uses them in pairs to represent 4 independent, non-overlapping contexts
  - Three for user threads
  - One for traps and message handlers
- Each context contains 32 general-purpose registers and
  - PSR (Processor State Register)
  - PC (Program Counter)
  - nPC (next Program Counter)
- Thread states
  - Active
  - Loaded
    - State stored in registers; can become active
  - Ready
    - Not suspended and not loaded
  - Suspended
- Thread switching
  - Is fast if one thread is active and the other is loaded (see the sketch below)
  - Need to flush the pipeline (cf. HEP)
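A toy model of switching between loaded contexts (structure invented; on the real hardware the switch is a handful of instructions that retarget the register window):

    # Four non-overlapping register contexts, each with its own PSR/PC/nPC.
    contexts = [{"regs": [0] * 32, "PSR": 0, "PC": 0, "nPC": 4}
                for _ in range(4)]
    cp = 0                             # context pointer: the active thread

    def context_switch(new_cp):
        """Retarget CP; no register state is spilled to memory."""
        global cp
        cp = new_cp                    # (pipeline flush not modeled here)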
29 Sparcle Architecture
[Figure: four register frames 0R0-0R31 to 3R0-3R31, each with its own PSR, PC and nPC; the CP register points at the active thread's frame]
30 MIT Alewife and Sparcle
[Figure: one Alewife node; the Sparcle processor with FPU, a 64-kbyte cache, main memory, the CMMU and a network router (NR)]
NR: network router; CMMU: communication and memory management unit; FPU: floating-point unit
31 From here on, figures are drawn by Tim
32 Figure 16.10: Thread states in Sparcle
[Figure: process state in memory (global register frames G0-G7, ready queue, suspended queue) and four PC/PSR register frames 0R0-0R31 to 3R0-3R31, each with PSR, PC and nPC; the CP marks the active thread, other frames hold loaded threads, and unloaded threads wait in the queues]
33 Figure 16.11: Structure of a typical static dataflow PE
[Figure: update unit and fetch unit around the activity store and instruction queue, feeding function units 1 to N; tokens to/from other PEs]
34 Figure 16.12: Structure of a typical tagged-token dataflow PE
[Figure: matching unit with matching store, fetch unit with instruction/data memory, token queue, function units 1 to N and update unit; tokens to other PEs]
35 Figure 16.13: Organization of the I-structure storage
[Figure: data storage cells at addresses k, k+1, k+2, k+3, k+4, each with presence bits (A = absent, P = present, W = waiting)]
36 Figure 16.14: Coding in explicit token-store architectures, (a) and (b)
[Figure: tokens <12, <FP, IP>> and <35, <FP, IP>> match at a subtract node and fire, producing result tokens <23, <FP, IP+1>> and <23, <FP, IP+2>>]
37 Figure 16.14: Coding in explicit token-store architectures, (c)
[Figure: instruction memory addressed by IP and frame memory addressed by FP; frame slots FP+2, FP+3, FP+4 each carry a presence bit, and a node fires when the bit shows its partner operand has already arrived]
38 Figure 16.15: Structure of a typical explicit token-store dataflow PE
[Figure: a fetch unit computes an effective address into frame memory, where presence bits control the frame store operation; function units 1 to N feed form-token units; tokens to/from other PEs]
39 Figure 16.16: Scale of von Neumann/dataflow architectures
[Figure: a spectrum running from dataflow through macro dataflow, decoupled hybrid dataflow and RISC-like hybrid to von Neumann]
40 Figure 16.17: Structure of a typical macro dataflow PE
[Figure: matching unit, fetch unit and token queue over an instruction/frame memory; the function unit is an internal control pipeline (program counter-based sequential execution) followed by a form-token unit; tokens to/from other PEs]
41 Figure 16.18: Organization of a PE in the MIT hybrid Machine
[Figure: PC and FBR drive instruction fetch from instruction memory; an enabled continuation queue (token queue) and frame memory feed the decode unit, operand fetch, execution unit and registers; to/from global memory]
42 Figure 16.19: Comparison of (a) SQ and (b) SCB macro nodes
[Figure: the same graph with inputs a, b, c and instructions l1 to l5, partitioned into macro nodes SQ1/SQ2 in (a) and SCB1/SCB2 in (b)]
43 Figure 16.20: Structure of the USC Decoupled Architecture
[Figure: clusters sharing cluster graph memory; each cluster pairs GC/DFGE units with CE/CC units through AQ and RQ queues; to/from the network in graph virtual space and in computation virtual space]
44 Figure 16.21: Structure of a node in the SAM
[Figure: main memory serving the APU, ASU, SEU and LEU; 'fire' and 'done' signals sequence the units; to/from network]
45 Figure 16.22: Structure of the P-RISC processing element
[Figure: an internal control pipeline (conventional RISC processor) of instruction fetch, operand fetch, function unit and operand store; instructions come from local memory, operands from frame memory; a token queue, a load/store path and 'start' messages connect to/from other PEs' memory]
46 Figure 16.23: Transformation of dataflow graphs into control flow graphs: (a) dataflow graph, (b) control flow graph
[Figure: the dataflow graph's implicit synchronization becomes explicit 'fork L1' and 'join' instructions in the control flow graph]
47 Figure 16.24: Structure of a *T node
[Figure: network interface with message formatter and message queues from/to the network; a remote memory request coprocessor, a continuation queue <IP, FP> and local memory]