Title: Multithreading and Dataflow Architectures CPSC 321
1 Multithreading and Dataflow Architectures CPSC 321
2 Plan
- T November 16 Multithreading
- R November 18 Quantum Computing
- T November 23 QC Exam prep
- R November 25 Thanksgiving
- M November 29 Review ???
- T November 30 Exam
- R December 02 Summary and Outlook
- T December 07 move to November 29?
3 Announcements
- Office hours 2:00pm-3:00pm
- Bonfire memorial
4 Parallelism
- Hardware parallelism
- all current architectures
- Instruction-level parallelism
- superscalar processor, VLIW processor
- Thread-level parallelism
- Niagara, Pentium 4, ...
- Process parallelism
- MIMD computer
5 What is a Thread?
- A thread is a sequence of instructions that can be executed in parallel with other sequences.
- Threads typically share the same resources and have a minimal context.
6 Threads
- Definition 1 Different threads within the same process share the same address space, but have separate copies of the register file, PC, and stack.
- Definition 2 Alternatively, different threads have separate copies of the register file, PC, and page table (more relaxed than the previous definition).
- One can use a multiple-issue, out-of-order execution engine.
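Definition 1 maps directly onto what a thread library provides. A minimal sketch in Python (the language choice and all names are mine, not the lecture's): two threads run in one shared address space, while each gets its own stack and program counter from the runtime.

```python
import threading

# Two threads in the same process share the address space: both can
# read and write this list. Each thread still has its own private
# stack and program counter, maintained by the OS/runtime.
shared = []
lock = threading.Lock()

def worker(name, count):
    # Local variables like i live on this thread's private stack.
    for i in range(count):
        with lock:  # shared data still needs synchronization
            shared.append((name, i))

t1 = threading.Thread(target=worker, args=("A", 3))
t2 = threading.Thread(target=worker, args=("B", 3))
t1.start(); t2.start()
t1.join(); t2.join()

# All writes from both threads are visible in the one shared space.
print(len(shared))  # 6
```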
7 Why Thread-Level Parallelism?
- Extracting instruction-level parallelism is non-trivial
- hazards and stalls
- data dependencies
- structural limitations
- static optimization limits
8 Von Neumann Execution Model
Each node is an instruction. The pink arrow
indicates a static scheduling of the
instructions. If an instruction stalls (e.g. due
to a cache miss) then the entire program must
wait for the stalled instruction to resume
execution.
9 The Dataflow Execution Model
Each node represents an instruction. The
instructions are not scheduled until run-time. If
an instruction stalls, other instructions can
still execute, provided their input data is
available.
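This firing rule can be sketched in a few lines. The mini instruction set and all names below are invented for illustration: an instruction executes as soon as all of its operands are available, so a stalled load does not block independent instructions.

```python
# Toy dataflow interpreter: each node fires when its inputs are ready,
# regardless of program order. (Invented instruction set.)
insts = {
    "a":    ("const", 2,    []),
    "b":    ("const", 3,    []),
    "load": ("load",  None, []),          # models a long-latency miss
    "mul":  ("mul",   None, ["a", "b"]),  # independent of the load
    "add":  ("add",   None, ["load", "mul"]),
}
LOAD_READY = 3   # the load's value arrives only in cycle 3

values, order, pending = {}, [], dict(insts)
cycle = 0
while pending:
    cycle += 1
    for name, (op, imm, srcs) in list(pending.items()):
        if op == "const":
            values[name] = imm
        elif op == "load":
            if cycle < LOAD_READY:
                continue                   # still waiting on memory
            values[name] = 10
        elif all(s in values for s in srcs):
            fn = {"mul": lambda p, q: p * q, "add": lambda p, q: p + q}[op]
            values[name] = fn(*(values[s] for s in srcs))
        else:
            continue                       # operands not ready yet
        order.append(name)
        del pending[name]

print(order, values["add"])
```

In the printed firing order, "mul" completes before the stalled "load", exactly the behavior the model promises.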
10 The Multithreaded Execution Model
Each node represents an instruction and each gray
region represents a thread. The instructions
within each thread are statically scheduled while
the threads themselves are dynamically scheduled.
If an instruction stalls, the thread stalls but
other threads can continue execution.
11 Single-Threaded Processors
Memory access latency can dominate the processing time: each time a cache miss occurs, hundreds of clock cycles can be lost while a single-threaded processor waits for memory. Top: Increasing the clock speed improves the processing time, but does not affect the memory access time.
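A back-of-the-envelope calculation makes the point; the instruction count, miss count, and latencies below are illustrative numbers, not from the slides.

```python
# Total time = compute cycles at the core clock + misses * a fixed
# DRAM latency that does not scale with clock speed.
instructions = 1_000_000
cpi = 1.0                 # ideal cycles per instruction
misses = 20_000           # assume 2% of instructions miss the cache
mem_latency_ns = 100.0    # memory access time, clock-independent

def runtime_ms(freq_ghz):
    compute_ns = instructions * cpi / freq_ghz
    memory_ns = misses * mem_latency_ns
    return (compute_ns + memory_ns) / 1e6

t1 = runtime_ms(1.0)  # 1 GHz
t2 = runtime_ms(2.0)  # 2 GHz: compute time halves, memory time doesn't
print(t1, t2)  # 3.0 2.5
```

Doubling the clock here buys only a 1.2x speedup, because the 2 ms spent waiting on memory is untouched.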
12 Multi-Threaded Processors
13 Multithreading Types
- Coarse-grained multithreading
- If a thread faces a costly stall, switch to another thread. Usually flushes the pipe before switching threads.
- Fine-grained multithreading
- Interleave the issue of instructions from multiple threads (cycle-by-cycle), skipping the threads that are stalled. Instructions issued in any given cycle come from the same thread.
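The coarse-grained policy can be put in a toy cycle-count model. The trace format, latencies, and flush penalty below are invented for illustration: on a miss, the core pays a pipe flush and switches, while the miss resolves in the background.

```python
# Toy coarse-grained multithreading: "alu" takes 1 cycle, a "miss"
# resolves after 20 cycles, a thread switch flushes the pipe (3 cycles).
MISS, FLUSH = 20, 3

def blocking(threads):
    # Single-threaded in-order baseline: every miss serializes.
    return sum(1 if op == "alu" else MISS for t in threads for op in t)

def coarse_grained(threads):
    pcs = [0] * len(threads)
    ready_at = [0] * len(threads)   # cycle at which each thread is ready
    cur, cycle = 0, 0
    unfinished = lambda i: pcs[i] < len(threads[i])
    while any(unfinished(i) for i in range(len(threads))):
        runnable = [i for i in range(len(threads))
                    if unfinished(i) and ready_at[i] <= cycle]
        if not runnable:
            # everyone is stalled: wait for the earliest miss to resolve
            cycle = min(ready_at[i] for i in range(len(threads))
                        if unfinished(i))
            continue
        if cur not in runnable:
            cur = runnable[0]
            cycle += FLUSH          # flush the pipe before switching
        op = threads[cur][pcs[cur]]
        pcs[cur] += 1
        if op == "miss":
            ready_at[cur] = cycle + MISS  # resolves in the background
        cycle += 1                        # one cycle to issue the op
    return cycle

trace = [["alu", "alu", "miss", "alu", "alu"],
         ["alu", "alu", "miss", "alu", "alu"]]
print(blocking(trace), coarse_grained(trace))  # 48 33
```

Even with the flush penalty, switching hides most of each miss behind the other thread's work.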
14 Scalar Execution
Dependencies reduce throughput and utilization.
15 Superscalar Execution
16 Chip Multiprocessor
17 Fine-Grained Multithreading
Instructions issued in the same cycle come from the same thread.
18 Fine-Grained Multithreading
- Threads are switched every clock cycle, in round-robin fashion, among active threads
- Throughput is improved, since instructions can be issued every cycle
- Single-thread performance is decreased, because each thread gets only every n-th clock cycle among n active threads
- Fine-grained multithreading requires hardware modifications to keep track of threads (separate register files, renaming tables and commit buffers)
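The round-robin policy above can be sketched directly; thread names are invented and every instruction takes one cycle in this toy model.

```python
# Cycle-by-cycle round-robin interleave of active threads.
def fine_grained(threads):
    pcs = [0] * len(threads)
    schedule, i = [], 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        if pcs[i] < len(threads[i]):      # skip finished threads
            schedule.append((i, threads[i][pcs[i]]))
            pcs[i] += 1
        i = (i + 1) % len(threads)        # switch every clock cycle
    return schedule

sched = fine_grained([["A0", "A1", "A2"], ["B0", "B1", "B2"]])
print(sched)
```

With two active threads the pipeline issues one instruction every cycle (throughput), but each individual thread issues only every other cycle, which is the single-thread slowdown the slide describes.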
19 Multithreading Types
- A single thread cannot effectively use all functional units of a multiple-issue processor
- Simultaneous multithreading
- uses multiple issue slots in each clock cycle for different threads.
- More flexible than fine-grained MT.
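A small slot-counting sketch shows the gain. The per-cycle ILP figures are illustrative, not measured data: a superscalar core can only fill slots from one thread's available instructions, while SMT tops up the remaining slots from a second ready thread.

```python
# Issue-slot utilization in a 4-wide core over 4 cycles (toy numbers).
WIDTH = 4
ilp = [[2, 1, 3, 2],   # instructions thread 0 can issue each cycle
       [2, 2, 1, 3]]   # instructions thread 1 can issue each cycle

superscalar = sum(min(WIDTH, n) for n in ilp[0])              # thread 0 only
smt = sum(min(WIDTH, a + b) for a, b in zip(ilp[0], ilp[1]))  # both threads
print(superscalar, smt)  # 8 15
```

Of the 16 available issue slots, the single-thread superscalar fills 8 (horizontal waste), while SMT fills 15.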
20 Simultaneous Multithreading
21 Comparison
- Superscalar
- looks at multiple instructions from the same process; suffers both horizontal and vertical waste.
- Multithreaded
- minimizes vertical waste; tolerates long-latency operations
- Simultaneous Multithreading
- selects instructions from any "ready" thread
22 Superscalar vs. Multithreaded vs. SMT
Diagram (not reproduced) comparing how the issue slots are filled in each model.
23 SMT Issues
24 A Glance at a Pentium 4 Chip
Picture courtesy of Tom's Hardware Guide
25 The Pipeline
Trace cache
26 Intel's Hyper-Threading Patent
27 Pentium 4 Pipeline
- Trace cache access, predictor: 5 clock cycles
- Microoperation queue
- Reorder buffer allocation, register renaming: 4 clock cycles
- Functional unit queues
- Scheduling and dispatch unit: 5 clock cycles
- Register file access: 2 clock cycles
- Execution: 1 clock cycle
- Reorder buffer
- Commit: 3 clock cycles (total 20 clock cycles)
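As a quick sanity check, the per-stage latencies quoted above do sum to the 20-cycle pipeline depth:

```python
# Stage latencies from the slide; the sum is the quoted pipeline depth.
stages = {
    "trace cache access, predictor": 5,
    "allocation, register renaming": 4,
    "scheduling and dispatch": 5,
    "register file access": 2,
    "execution": 1,
    "commit": 3,
}
depth = sum(stages.values())
print(depth)  # 20
```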
28 PACT XPP
- The XPP processes a stream of data using configurable arithmetic-logic units.
- The architecture owes much to dataflow processing.
29 A Matrix-Vector Multiplication
Graphic courtesy of PACT
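The graphic itself is not reproduced here; the computation it depicts, a matrix-vector product y = A*x, can be written as a stream of multiply-accumulate (MAC) operations of the kind a configured array consumes. The values below are made up for illustration.

```python
# Matrix-vector product as a stream of MACs: one row of A streams past
# each accumulator, one multiply-accumulate per element.
A = [[1, 2],
     [3, 4]]
x = [5, 6]

y = [0] * len(A)
for i, row in enumerate(A):
    for a, xi in zip(row, x):
        y[i] += a * xi   # one MAC as the operands stream by

print(y)  # [17, 39]
```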
30 Basic Idea
- Replace the von Neumann instruction stream with fixed instruction scheduling by a configuration stream.
- Process streams of data as opposed to processing small data entities.
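The "configure once, then stream" idea can be sketched in software: a fixed configuration (here, a chain of stage functions) replaces per-datum instruction fetch. The helper name and stages are invented for this sketch.

```python
# A configuration is fixed up front; data then streams through it
# without any further "instruction fetch" per data item.
def configure(*stages):
    def run(stream):
        for x in stream:
            for stage in stages:
                x = stage(x)
            yield x
    return run

pipeline = configure(lambda x: x * 3, lambda x: x + 1)
out = list(pipeline(range(5)))
print(out)  # [1, 4, 7, 10, 13]
```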
31 von Neumann vs. XPP
32 Basic Components of XPP
- Processing Arrays
- Packet-oriented communication network
- Hierarchical configuration manager tree
- A set of I/O modules
- Supports the execution of multiple data flow
applications running in parallel.
33 Four Processing Arrays
Graphics courtesy of PACT. SCM is short for supervising configuration manager.
34 Data Processing
35 Event Packets
37 XPP 64-A
38 Further Reading
- "PACT XPP: A Reconfigurable Data Processing Architecture" by Baumgarte, May, Nueckel, Vorbach and Weinhardt