Title: Multi-Threaded Architectures (Sima, Fountain and Kacsuk, Chapter 16)
1 Multi-Threaded Architectures
Sima, Fountain and Kacsuk, Chapter 16
2 Memory and Synchronization Latency
- Scalability of a system is limited by its ability to handle memory latency and algorithmic synchronization delays
- The overall solution is well known: do something else whilst waiting
- Remote memory accesses
  - Much slower than local accesses
  - Varying delay, depending on
    - Network traffic
    - Memory traffic
3 Processor Utilization
- Utilization U = P / T
  - P: time spent processing
  - T: total time
- Since T = P + I + S, this is U = P / (P + I + S) (worked example below)
  - I: time spent waiting on other tasks
  - S: time spent switching tasks
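A quick worked example in Python (the cycle counts are invented for illustration):

    # A thread spends 60 cycles processing, 30 waiting on other tasks,
    # and 10 switching tasks (hypothetical values).
    P, I, S = 60, 30, 10
    U = P / (P + I + S)
    print(U)  # 0.6, i.e. 60% utilization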
4 Basic ideas - Multithreading
- Fine grain: task switch every cycle
- Coarse grain: task switch every n cycles, e.g. when a thread blocks (see the sketch below)
[Figure: thread execution timelines; while one thread is blocked another runs, each switch incurring task-switch overhead]
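A minimal sketch of fine-grain interleaving in Python (thread contents and counts invented; coarse grain would instead keep issuing from one thread until it blocks):

    from collections import deque

    # Each thread is a queue of remaining operations ('c' = compute cycle).
    threads = {"T0": deque("ccc"), "T1": deque("ccc"), "T2": deque("ccc")}

    # Fine grain: issue one operation from a different thread every cycle,
    # so one thread's stall never idles the pipeline.
    ready = deque(threads)
    cycle = 0
    while ready:
        t = ready.popleft()
        op = threads[t].popleft()
        print(f"cycle {cycle}: {t} issues {op}")
        if threads[t]:
            ready.append(t)   # round-robin: rejoin the back of the queue
        cycle += 1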
5 Design Space
6 Classification of multi-threaded architectures
7 Computational Models
8 Sequential control flow (von Neumann)
- Flow of control and data are separated
- Instructions are executed sequentially (or at least with sequential semantics, see Chapter 7)
- Control flow is changed with JUMP/GOTO/CALL instructions
- Data is stored in rewritable memory
- Flow of data does not affect execution order (see the sketch below)
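As a minimal sketch, the example used on the next slide, R := (A - B) * (B + 1), under sequential semantics (input values invented):

    # Instructions execute strictly in program order against shared
    # rewritable memory; the flow of data never reorders them.
    A, B = 10, 4
    m1 = A - B    # L1
    m2 = B + 1    # L2
    R = m1 * m2   # L3 -> 30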
9 Sequential Control Flow Model
[Figure: control-flow graph for R := (A - B) * (B + 1); L1 computes m1 := A - B, control flows to L2, which computes m2 := B + 1, then to L3, which computes R := m1 * m2]
10 Dataflow
- Control is tied to data
- An instruction fires when its data is available
  - Otherwise it is suspended
- Order of instructions in the program has no effect on execution order (cf. von Neumann)
- No shared rewritable memory
  - Write-once semantics
- Code is stored as a dataflow graph
- Data is transported as tokens
- Parallelism occurs if multiple instructions can fire at the same time
  - Needs a parallel processor
- Nodes are self-scheduling (see the sketch below)
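A toy dataflow interpreter for R := (A - B) * (B + 1) (the graph encoding and the dict-based matching store are invented for illustration):

    import operator

    # node -> (operation, list of (consumer node, operand port))
    nodes = {
        "sub": (operator.sub, [("mul", 0)]),
        "add": (operator.add, [("mul", 1)]),
        "mul": (operator.mul, []),
    }
    waiting = {}                                # operands seen so far
    tokens = [("sub", 0, 10), ("sub", 1, 4),    # A = 10, B = 4
              ("add", 0, 4), ("add", 1, 1)]     # B = 4, constant 1

    while tokens:
        node, port, value = tokens.pop(0)
        slots = waiting.setdefault(node, {})
        slots[port] = value
        if len(slots) == 2:                     # both operands present: fire
            op, consumers = nodes[node]
            result = op(slots[0], slots[1])
            print(node, "fires ->", result)     # order is set by data alone
            for dest, dport in consumers:
                tokens.append((dest, dport, result))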
11 Dataflow: arbitrary execution order
[Figure: dataflow graph for R := (A - B) * (B + 1); one possible firing order]
12 Dataflow: arbitrary execution order
[Figure: the same graph with a different firing order; the result is unchanged]
13 Dataflow: Parallel Execution
[Figure: the same graph; the subtract and add nodes fire simultaneously on a parallel processor]
14 Implementation
- The dataflow model requires a very different execution engine
- Data must be stored in a special matching store
- Instructions must be triggered when both operands are available
- Parallel operations must be scheduled to processors dynamically
  - We don't know a priori when they will be available
- Instruction operands are pointers
  - To an instruction
  - And an operand number (see the sketch below)
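A sketch of such an operand pointer as a token tag (field names invented):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Token:
        value: int   # the data being transported
        instr: str   # pointer to the destination instruction
        port: int    # which operand slot of that instruction it fills

    t = Token(value=12, instr="L4", port=1)  # deliver 12 to operand 1 of L4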
15 Dataflow model of execution
[Figure: dataflow execution of R := (A - B) * (B + 1); L1 computes B and sends tokens to slots L2/2 and L3/1; L2 (subtract, using A) and L3 (add, using 1) send their results to L4/1 and L4/2; L4 sends R on to L6/1]
16 Parallel Control flow
- Sometimes called macro dataflow
- Data flows between blocks of sequential code
- Has the advantages of both dataflow and von Neumann
  - Context switch overhead reduced
  - Compiler can schedule instructions statically
  - Don't need a fast matching store
- Requires additional control instructions
  - Fork/Join (see the sketch below)
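A fork/join sketch of the running example using Python threads (standing in for the architecture's FORK/JOIN control instructions):

    import threading

    # R := (A - B) * (B + 1): two sequential blocks run as forked threads;
    # the JOIN waits for both before the final block executes.
    A, B = 10, 4
    results = {}

    def block1():                  # thread 1: m1 := A - B
        results["m1"] = A - B

    def block2():                  # thread 2: m2 := B + 1
        results["m2"] = B + 1

    t1 = threading.Thread(target=block1)    # FORK
    t2 = threading.Thread(target=block2)
    t1.start(); t2.start()
    t1.join(); t2.join()                    # JOIN 2
    R = results["m1"] * results["m2"]       # 30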
17 Macro Dataflow (Hybrid Control/Dataflow)
[Figure: hybrid graph for R := (A - B) * (B + 1); L1: FORK L4 starts a second thread; one thread computes m1 := A - B (L2) and does GOTO L5 (L3) while the other computes m2 := B + 1 (L4); L5: JOIN 2 waits for both; L6 computes R := m1 * m2]
18 Issues for Hybrid dataflow
- Blocks of sequential instructions need to be large enough to absorb the overheads of context switching
- Data memory is the same as in MIMD
  - Can be partitioned or shared
- Synchronization instructions are required
  - Semaphores, test-and-set (see the sketch below)
- Control tokens are required to synchronize threads
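A schematic test-and-set spin lock in Python (illustrative only: Python exposes no raw atomic test-and-set, so a Lock stands in for the hardware atomicity):

    import threading

    class TestAndSetLock:
        def __init__(self):
            self._flag = False
            self._atomic = threading.Lock()   # models the atomic instruction

        def _test_and_set(self):
            with self._atomic:                # atomically: read old, set True
                old, self._flag = self._flag, True
                return old

        def acquire(self):
            while self._test_and_set():       # busy-wait while already held
                pass

        def release(self):
            self._flag = False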
19 Some examples
20 Denelcor HEP
- Designed to tolerate latency in memory
- Fine-grain interleaving of threads
- Processor pipeline contains 8 stages
- Each time step, a new thread enters the pipeline
- Threads are taken from the Process Status Word (PSW) queue
- After a thread is taken from the PSW queue, its instruction and operands are fetched
- When an instruction is executed, another one is placed on the PSW queue
- Threads are interleaved at the instruction level (see the sketch below)
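A toy model of the HEP's instruction-level interleaving (pipeline depth and thread names invented):

    from collections import deque

    # Each cycle the front thread of the PSW queue issues one instruction,
    # then rejoins the back of the queue.
    psw_queue = deque(["T0", "T1", "T2", "T3"])
    for cycle in range(8):
        thread = psw_queue.popleft()
        print(f"cycle {cycle}: issue next instruction of {thread}")
        psw_queue.append(thread)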
21 Denelcor HEP
- Memory latency is tolerated with the Scheduler Function Unit (SFU)
- Memory words are tagged as full or empty
- Attempting to read an empty word suspends the current thread
  - The current PSW entry is moved to the SFU
- When the data is written, the entry is taken from the SFU and placed back on the PSW queue
22 Synchronization on the HEP
- All registers have a Full/Empty/Reserved bit
- Reading an empty register causes the thread to be placed back on the PSW queue without updating its program counter
- Thread synchronization is busy-wait
  - But other threads can run (see the sketch below)
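A minimal sketch of full/empty-bit synchronization (data structures invented; the HEP does this per register/memory word in hardware):

    from collections import deque

    psw_queue = deque()
    cells = {"x": {"full": False, "value": None}}

    def try_read(name, thread):
        cell = cells[name]
        if not cell["full"]:
            psw_queue.append(thread)   # suspend: retry later, PC unchanged
            return None
        cell["full"] = False           # a consuming read empties the cell
        return cell["value"]

    def write(name, value):
        # Fill the cell; previously re-queued readers will now succeed.
        cells[name] = {"full": True, "value": value}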
23 HEP Architecture
[Figure: HEP processor organization; PSW queue, matching unit, program memory and increment control; operand fetch and registers feed operand hands 1 and 2; function units 1 to N; the SFU connects to/from data memory]
24 HEP configuration
- Up to 16 processors
- Up to 128 data memories
- Connected by a high-speed switch
- Limitations
  - Threads can have only 1 outstanding memory request
  - Thread synchronization puts bubbles in the pipeline
  - Maximum of 64 threads, causing problems for software
    - Need to throttle loops
  - If parallelism is lower than 8, full utilisation is not possible
25 MIT Alewife Processor
- 512 processors in a 2-dim mesh
- Sparcle processor
- Physically distributed memory
- Logically shared memory
- Hardware-supported cache coherence
- Hardware-supported user-level message passing
- Multi-threading
26 Threading in Alewife
- Coarse-grained multithreading
- Pipeline works on a single thread as long as no remote memory access or synchronization is required
- Can exploit register optimization in the pipeline
- Integration of multi-threading with hardware-supported cache coherence
27 The Sparcle Processor
- Extension of the Sun SPARC architecture
- Tolerant of memory latency
- Fine-grained synchronisation
- Efficient user-level message passing
28 Fast context switching
- SPARC has 8 overlapping register windows
- Sparcle uses them in pairs to represent 4 independent, non-overlapping contexts
  - Three for user threads
  - One for traps and message handlers
- Each context contains 32 general-purpose registers and
  - PSR (Processor State Register)
  - PC (Program Counter)
  - nPC (next Program Counter)
- Thread states
  - Active
  - Loaded
    - State stored in registers; can become active
  - Ready
    - Not suspended and not loaded
  - Suspended
- Thread switching
  - Is fast if one thread is active and the other is loaded (see the sketch below)
  - Need to flush the pipeline (cf. HEP)
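A toy model of switching between loaded contexts (structure invented; on the real hardware the switch is a handful of instructions that retarget the register window):

    # Four non-overlapping register contexts, each with its own PSR/PC/nPC.
    contexts = [{"regs": [0] * 32, "PSR": 0, "PC": 0, "nPC": 4}
                for _ in range(4)]
    cp = 0                             # context pointer: the active thread

    def context_switch(new_cp):
        """Retarget CP; no register state is spilled to memory."""
        global cp
        cp = new_cp                    # (pipeline flush not modeled here)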
29 Sparcle Architecture
[Figure: four register frames 0R0-0R31 to 3R0-3R31, each with its own PSR, PC and nPC; the CP register points at the active thread's frame]
30 MIT Alewife and Sparcle
[Figure: one Alewife node; the Sparcle processor with FPU, a 64-kbyte cache, main memory, the CMMU and a network router (NR)]
NR: network router; CMMU: communication and memory management unit; FPU: floating-point unit
31 From here on, figures are drawn by Tim
32 Figure 16.10: Thread states in Sparcle
[Figure: process state in memory (global register frames G0-G7, ready queue, suspended queue) and four PC/PSR register frames 0R0-0R31 to 3R0-3R31, each with PSR, PC and nPC; the CP marks the active thread, other frames hold loaded threads, and unloaded threads wait in the queues]
33 Figure 16.11: Structure of a typical static dataflow PE
[Figure: update unit and fetch unit around the activity store and instruction queue, feeding function units 1 to N; tokens to/from other PEs]
34 Figure 16.12: Structure of a typical tagged-token dataflow PE
[Figure: matching unit with matching store, fetch unit with instruction/data memory, token queue, function units 1 to N and update unit; tokens to other PEs]
35 Figure 16.13: Organization of the I-structure storage
[Figure: data storage cells at addresses k, k+1, k+2, k+3, k+4, each with presence bits (A = absent, P = present, W = waiting)]
36 Figure 16.14: Coding in explicit token-store architectures, (a) and (b)
[Figure: tokens <12, <FP, IP>> and <35, <FP, IP>> match at a subtract node and fire, producing result tokens <23, <FP, IP+1>> and <23, <FP, IP+2>>]
37 Figure 16.14: Coding in explicit token-store architectures, (c)
[Figure: instruction memory addressed by IP and frame memory addressed by FP; frame slots FP+2, FP+3, FP+4 each carry a presence bit, and a node fires when the bit shows its partner operand has already arrived]
38 Figure 16.15: Structure of a typical explicit token-store dataflow PE
[Figure: a fetch unit computes an effective address into frame memory, where presence bits control the frame store operation; function units 1 to N feed form-token units; tokens to/from other PEs]
39 Figure 16.16: Scale of von Neumann/dataflow architectures
[Figure: a spectrum running from dataflow through macro dataflow, decoupled hybrid dataflow and RISC-like hybrid to von Neumann]
40 Figure 16.17: Structure of a typical macro dataflow PE
[Figure: matching unit, fetch unit and token queue over an instruction/frame memory; the function unit is an internal control pipeline (program counter-based sequential execution) followed by a form-token unit; tokens to/from other PEs]
41 Figure 16.18: Organization of a PE in the MIT hybrid Machine
[Figure: PC and FBR drive instruction fetch from instruction memory; an enabled continuation queue (token queue) and frame memory feed the decode unit, operand fetch, execution unit and registers; to/from global memory]
42 Figure 16.19: Comparison of (a) SQ and (b) SCB macro nodes
[Figure: the same graph with inputs a, b, c and instructions l1 to l5, partitioned into macro nodes SQ1/SQ2 in (a) and SCB1/SCB2 in (b)]
43 Figure 16.20: Structure of the USC Decoupled Architecture
[Figure: clusters sharing cluster graph memory; each cluster pairs GC/DFGE units with CE/CC units through AQ and RQ queues; to/from the network in graph virtual space and in computation virtual space]
44 Figure 16.21: Structure of a node in the SAM
[Figure: main memory serving the APU, ASU, SEU and LEU; 'fire' and 'done' signals sequence the units; to/from network]
45 Figure 16.22: Structure of the P-RISC processing element
[Figure: an internal control pipeline (conventional RISC processor) of instruction fetch, operand fetch, function unit and operand store; instructions come from local memory, operands from frame memory; a token queue, a load/store path and 'start' messages connect to/from other PEs' memory]
46 Figure 16.23: Transformation of dataflow graphs into control flow graphs: (a) dataflow graph, (b) control flow graph
[Figure: the dataflow graph's implicit synchronization becomes explicit 'fork L1' and 'join' instructions in the control flow graph]
47 Figure 16.24: Structure of a *T node
[Figure: network interface with message formatter and message queues from/to the network; a remote memory request coprocessor, a continuation queue <IP, FP> and local memory]