Lecture 4 Multithreaded Processors (continued) and an Example

Provided by: juny8

1
Lecture 4 Multithreaded Processors (continued)
and an Example

2
Implicitly Multithreaded Processors

Explicit multithreading (covered in the last class): the programmer or compiler creates multiple threads.
Implicit multithreading: threads are spawned dynamically (with the help of the compiler); the number of active threads varies.
Child threads are relatively short (tens of instructions) and often need to communicate large amounts of state with other threads to resolve data and control dependences.
How are threads created?
How are register data dependences resolved?
How are memory data dependences resolved?

3
Thread Creation

[Figure: diagram with elements labeled A, B, and C; not included in the transcript]
4
Thread Creation

Creation of C and H is out of program order
The reorder buffer has to support out-of-order
insertion of an arbitrary number of instructions
into the middle of a set of already active
instructions.

5
Thread Creation

Disjoint Eager Execution (DEE)
Eager execution: execute both paths following a branch.
The number of paths explodes as multiple branches are traversed.
DEE prunes the decision tree according to the branch prediction rate and spawns threads along the highest-rated branch paths.
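
The pruning rule can be sketched as a greedy search over the decision tree. This is an illustrative sketch only, not the DEE hardware algorithm: it assumes every branch is predicted correctly with the same probability (0.75, matching the 0.75/0.25 labels in the slide's figure), and the function name `dee_spawn_order` and the path encoding are hypothetical.

```python
import heapq

def dee_spawn_order(accuracy, num_paths):
    """Greedily pick the num_paths paths with the highest cumulative probability.

    A path is a string of 'P' (predicted direction) and 'N' (not predicted).
    Returns a list of (cumulative probability rounded to 2 places, path).
    """
    # heapq is a min-heap, so probabilities are negated to pop the largest first.
    heap = [(-accuracy, "P"), (-(1.0 - accuracy), "N")]
    heapq.heapify(heap)
    order = []
    while heap and len(order) < num_paths:
        neg_p, path = heapq.heappop(heap)
        p = -neg_p
        order.append((round(p, 2), path))
        # Following this path to the next branch creates two new candidates.
        heapq.heappush(heap, (-(p * accuracy), path + "P"))
        heapq.heappush(heap, (-(p * (1.0 - accuracy)), path + "N"))
    return order

for rank, (prob, path) in enumerate(dee_spawn_order(0.75, 6), start=1):
    print(rank, prob, path)
```

Under this assumption the first four paths follow the predicted direction, and the fifth is the not-predicted side of the first branch, because 0.25 is larger than the cumulative probability of a fifth consecutive correct prediction.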

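[Figure: DEE decision tree. Edge labels give cumulative path probabilities (0.75, 0.25, 0.56, 0.19, 0.42, 0.14, 0.32, 0.24); the numbers 1 to 5 appear to mark the order in which paths are spawned.]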
6
Where Are They Running

Partition the execution resources: each thread runs on its own partition, and each partition can be less aggressive.
Much like an on-chip multiprocessor.
E.g., thread-level speculation (TLS) creates a thread for each iteration of a loop. Each iteration runs on its own core.
Rely on an SMT-like processor that interleaves implicit threads (instead of explicit threads).
E.g., DMT spawns threads at procedure calls and backward loop branches. Threads share the existing resources.
Multiple processing elements structured as a circular queue. Each element executes one thread.
The tail of the queue is the current thread (non-speculative); the others execute future threads (which can be speculative).
E.g., Multiscalar allows creation of a thread at arbitrary points in the program's control flow.

7
Resolving Register Data Dependences

Intrathread dependences are handled with standard techniques.
Interthread dependences are hard:
A new future thread might need to read a register at its beginning, but the producer may be at the end of the prior thread. The producer instruction might not even have been fetched yet.
Disallow register data dependences. Communicate all shared operands through memory with loads and stores, e.g., TLS.
The compiler states the dependences explicitly: it embeds a write mask in the future thread that tells which registers have pending writes, e.g., Multiscalar.
Speculatively execute as if the operands are ready. Recover if wrong. E.g., DMT.
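
The write-mask idea can be illustrated with a bit vector. This is a minimal sketch under stated assumptions, not Multiscalar's actual mechanism: the class `FutureThread`, its methods, and the bit layout (bit i stands for register i) are all hypothetical.

```python
class FutureThread:
    """Tracks which registers still wait for a value from an earlier thread."""

    def __init__(self, write_mask):
        # Assumption: bit i set means register i has a pending write
        # from an earlier thread and must not be read yet.
        self.pending = write_mask
        self.regs = {}

    def can_read(self, reg):
        # A register is safe to read once its pending bit is clear.
        return (self.pending >> reg) & 1 == 0

    def forward(self, reg, value):
        # Called when the earlier thread produces the register value.
        self.regs[reg] = value
        self.pending &= ~(1 << reg)

thread = FutureThread(write_mask=0b0110)  # r1 and r2 have pending writes
print(thread.can_read(0), thread.can_read(1))
thread.forward(1, 42)
print(thread.can_read(1))
```

An instruction that reads a register with its bit still set would wait, while all other reads proceed immediately.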

8
Resolving Memory Data Depend.

Intrathread memory dependences are resolved with standard techniques.
WAR and WAW interthread dependences: buffer writes from future threads and commit them when the threads retire.
RAW interthread memory dependence resolution (complex!):
A load in a later thread searches for an aliasing store in the earlier threads' store queues (bypass). A store in an earlier thread searches for an aliasing load (already executed) in the later threads' load queues (a violation). E.g., DMT, DEE.
Centralized address resolution buffer (ARB): one entry for each load in a future thread; any aliasing store from an older thread flags a violation (the future thread is squashed and restarted). Each load is also checked against all unretired stores in the ARB. E.g., Multiscalar.
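
The two searches in the distributed load/store queue scheme can be sketched as follows. This is a simplified model, not the DMT or DEE design: threads are held in a list ordered oldest to youngest, addresses are matched exactly, and the names `ThreadQueues`, `execute_load`, and `execute_store` are hypothetical. The violation check is conservative, since it ignores stores in intermediate threads that may already have supplied the load's value.

```python
class ThreadQueues:
    """Per-thread store queue and load queue (simplified to dict and set)."""

    def __init__(self):
        self.stores = {}    # address -> most recent buffered store value
        self.loads = set()  # addresses loaded from outside this thread

def execute_load(threads, tid, addr, memory):
    """Return the value for a load issued by thread tid."""
    if addr in threads[tid].stores:
        return threads[tid].stores[addr]  # intrathread forwarding
    threads[tid].loads.add(addr)          # record for later violation checks
    # Search earlier threads, nearest first, for an aliasing store (bypass).
    for t in range(tid - 1, -1, -1):
        if addr in threads[t].stores:
            return threads[t].stores[addr]
    return memory.get(addr, 0)

def execute_store(threads, tid, addr, value):
    """Buffer a store; return ids of later threads that must be squashed."""
    threads[tid].stores[addr] = value
    # Any later thread that already loaded this address read a stale value.
    return [t for t in range(tid + 1, len(threads)) if addr in threads[t].loads]

threads = [ThreadQueues() for _ in range(3)]
memory = {0x10: 1}
print(execute_load(threads, 2, 0x10, memory))  # thread 2 loads early
print(execute_store(threads, 0, 0x10, 7))      # thread 0 stores later
```

In the usage above, thread 2 reads the old value from memory before thread 0's store executes, so the store reports thread 2 as violated.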

9
Concluding Remarks

10
Case Study

11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
Pipeline Stages
15
Branch Decisions

Branches are classified into 3 classes:
Class 1 branches are resolved in stage P1; the instruction in P0 may be aborted.
Class 2 branches are resolved in either P2 or P1, depending on the condition set by the instruction in P3; instructions in P0 and P1 may be aborted.
Class 3 branches are resolved in P3; instructions in P0, P1, and P2 may be aborted.
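
The relation between the resolving stage and the instructions at risk can be written down directly. This is a small sketch that assumes instructions flow from P0 toward P3, so every stage before the resolving stage holds a younger instruction; the function name `aborted_stages` is hypothetical.

```python
def aborted_stages(resolve_stage):
    """Stages whose instructions may be aborted when a branch resolves in P<resolve_stage>."""
    return [f"P{i}" for i in range(resolve_stage)]

# Worst case per class; class 2 may resolve one stage earlier, in P1.
for branch_class, stage in [(1, 1), (2, 2), (3, 3)]:
    print(f"class {branch_class}: resolved in P{stage}, may abort {aborted_stages(stage)}")
```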

16
Context Switch

Specified explicitly by an instruction.
Overhead is at most 1 cycle: a context switch instruction is a class 1 branch instruction, which is resolved in P1.
If a delayed branch is present, this 1-cycle loss can be removed.
Needs to update event signals.
The context event arbiter wakes up a thread (in round-robin fashion).
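
Round-robin wake-up can be sketched as a search that starts just after the most recently granted context. This is an illustrative model only: the class `ContextEventArbiter`, the `ready` list of per-context event flags, and the choice to start at context 0 are assumptions, not details from the case study.

```python
class ContextEventArbiter:
    """Grants the next ready context after the one granted most recently."""

    def __init__(self, num_contexts):
        self.num_contexts = num_contexts
        self.last = num_contexts - 1  # so that context 0 is checked first

    def pick(self, ready):
        """ready[i] is True when context i's event has arrived; returns an id or None."""
        for offset in range(1, self.num_contexts + 1):
            ctx = (self.last + offset) % self.num_contexts
            if ready[ctx]:
                self.last = ctx
                return ctx
        return None  # no context is ready to run

arbiter = ContextEventArbiter(4)
ready = [True, False, True, True]
print([arbiter.pick(ready) for _ in range(4)])
```

Starting the search after the last grant keeps one always-ready context from starving the others.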
