Decoupled Architectures for Complexity-Effective General Purpose Processors

Provided by: PAJ

Learn more at: https://people.csail.mit.edu

1
Decoupled Architectures for Complexity-Effective
General Purpose Processors

Ronny Krashinsky and Mike Sung
6.893 Term Project Presentation
MIT Laboratory for Computer Science
12-7-2000

2
Motivation

out-of-order superscalar designs are inefficient and hard to scale
decoupled architectures can provide latency hiding, dynamic scheduling, and ILP in a much more complexity-effective and scalable manner
in previous work, decoupled architectures have been investigated for scientific apps
superscalar architectures are used universally for general purpose computing
why? superscalars provide more flexibility, and decoupled architectures break down when there is a loss of decoupling (LOD)

3
Proposal

use decoupled architectures for complexity-effective general purpose computing
multithreading can be used to hide the latency of a loss of decoupling
potentially get the best of both architectures by providing a superscalar processor with decoupled engines for complexity-effective streaming computations
we will present a survey of prior work and our proposed architectural innovations; unfortunately, a lot of infrastructure (e.g. a compiler) is required for a more detailed investigation

4
Decoupled Access/Execute Architecture

AP and EP process separate instruction streams
EP is used for computation (floating point)
ILP
data values are communicated via queues
slip: AP runs ahead of EP
memory latency hiding
dynamic scheduling
head of AEQ can be used as an instruction operand in EP
blocks if data isn't available
takes the place of register renaming
store addresses wait in WAQ until the corresponding data arrives from EP
loads can bypass stores (check address)

Decoupled Access/Execute Computer Architectures,
Smith, 1982
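
The queue mechanism on this slide can be sketched as a toy cycle loop. This is a minimal illustration under stated assumptions, not the machine described by Smith: the function name `run_dae`, the `ep_period` parameter, and the `eaq` queue for store data are hypothetical, memory is a plain Python list, memory latency is not modeled, and the address check that lets loads bypass stores is left out.

```python
from collections import deque

def run_dae(memory, src, dst, n, scale, ep_period=2):
    """Toy model of dst[i] = scale * src[i], split into access and execute work."""
    aeq = deque()  # load data flowing from AP to EP
    eaq = deque()  # store data flowing from EP back to the access side (assumed name)
    waq = deque()  # store addresses waiting for their data
    issued = 0
    stores_done = 0
    cycle = 0
    max_slip = 0
    while stores_done < n:
        # AP: issues one load and one store address per cycle, never waiting on EP.
        if issued < n:
            aeq.append(memory[src + issued])
            waq.append(dst + issued)
            issued += 1
        # EP: slower than AP; uses the head of AEQ as an operand, blocks if it is empty.
        if cycle % ep_period == 0 and aeq:
            eaq.append(scale * aeq.popleft())
        # A store completes once the oldest waiting address has its data.
        if waq and eaq:
            memory[waq.popleft()] = eaq.popleft()
            stores_done += 1
        max_slip = max(max_slip, len(aeq))  # how far AP has run ahead of EP
        cycle += 1
    return max_slip

memory = [1, 2, 3, 4, 0, 0, 0, 0]
slip = run_dae(memory, src=0, dst=4, n=4, scale=10)
```

Because AP never waits for EP in this model, the AEQ occupancy grows whenever EP is the slower side, which is the slip that hides memory latency on a real machine.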
5
Decoupled Access/Execute Architecture

program control flow is implemented with a corresponding conditional branch in each stream
branch condition queues allow AP to hide branch latency from EP
loss of decoupling if AP depends on a branch condition from EP
not discussed in early works
implemented in the Astronautics ZS-1 processor
single interleaved instruction stream is split to feed instruction queues
control flow instructions are executed in the splitter

Decoupled Access/Execute Computer Architectures,
Smith, 1982
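
As a rough illustration of the branch condition queue, the sketch below has the access stream push loop-branch outcomes ahead of the execute stream, which then branches by popping them. The function names and the `branch_q` queue name are hypothetical, and running the access stream to completion before the execute stream starts is a simplifying assumption that stands in for maximal slip.

```python
from collections import deque

def access_stream(data, aeq, branch_q):
    """AP side: walk the array, pushing operands and loop-branch outcomes."""
    for value in data:
        branch_q.append(True)   # outcome: stay in the loop
        aeq.append(value)       # loaded operand for EP
    branch_q.append(False)      # outcome: leave the loop

def execute_stream(aeq, branch_q):
    """EP side: the corresponding branch pops the outcome AP already computed."""
    total = 0
    while branch_q.popleft():   # no comparison is evaluated in EP
        total += aeq.popleft()
    return total

aeq, branch_q = deque(), deque()
access_stream([3, 4, 5], aeq, branch_q)
result = execute_stream(aeq, branch_q)
```

The reverse case, where the access stream has to wait for an outcome that only the execute stream can compute, is the loss of decoupling named on the slide; in this model it would show up as the access stream blocking on an empty queue.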
6
Simultaneous Multithreading with DAE

observation: functional unit latencies and true data dependencies in EP hinder performance
use SMT and thread-level parallelism to better utilize functional units (same as with SMT in superscalars)
few threads are required
decoupling provides memory latency tolerance; SMT hides functional unit latencies

The Synergy of Multithreading and Access/Execute
Decoupling, Parcerisa and Gonzalez, 1998
7
Decoupled Control/Access/Execute Architecture

further optimization: control decoupling
three instruction streams, dynamic slip
CP processes the control flow graph and sends directives to AP and EP to execute basic blocks
limited control capabilities in AP and EP: loop count and predication
fetch engines fill queues with valid instructions
dynamic loop unrolling
control latency hidden (without speculation)
stream units
CU can operate in stand-alone mode
implemented as a 21064, ran the OS

The Effectiveness of Decoupling, Bird et al.,
1993
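
One way to picture the directive mechanism is the sketch below, in which a control processor emits (block, loop count) directives and a fetch engine per stream expands them into a straight-line instruction queue. The block table, the mnemonics, and the function names are all hypothetical, and predication is not modeled; this is an illustration of the idea rather than the design evaluated by Bird et al.

```python
from collections import deque

# Hypothetical basic blocks: name -> (AP instructions, EP instructions).
BLOCKS = {
    "init": (["ld base"], ["clr acc"]),
    "body": (["ld x[i]", "ld y[i]"], ["fmul", "fadd"]),
    "exit": (["st acc"], ["send acc"]),
}

def control_processor():
    """CP walks the control flow and emits (block, loop_count) directives."""
    yield ("init", 1)
    yield ("body", 3)   # a single directive covers all three loop iterations
    yield ("exit", 1)

def fetch_engine(directives, stream):
    """Expand directives into one stream's instruction queue (0 = AP, 1 = EP)."""
    queue = deque()
    for block, loop_count in directives:
        for _ in range(loop_count):          # dynamic loop unrolling
            queue.extend(BLOCKS[block][stream])
    return queue

directives = list(control_processor())
ap_queue = fetch_engine(directives, 0)
ep_queue = fetch_engine(directives, 1)
```

Neither queue contains a branch: the loop has been unrolled by the fetch engine, so in this model AP and EP only ever see valid instructions and never wait on a control decision.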
8
Decoupled Control/Access/Execute Architecture

loss-of-decoupling (LOD) events cause breakdown

The Performance of Decoupled Architectures,
Parcerisa et al., 1996
9-24
Decoupled Control/Access/Execute Architecture

[Slides 9-24: figure-only slides sharing this title; no body text survives except the label "LOD!" on slide 14.]

25
Decoupled vs. Superscalar Architectures

Dynamic out-of-order execution with less complexity
Allows non-speculative instruction and data prefetching. We can shrink data structures like first-level caches, potentially reducing critical paths as well as power
Inherent tolerance of long memory latencies provides a performance advantage for streaming applications, etc., where lack of locality diminishes the benefit of caches
Simplified issue logic, which can be implemented with small structures/queues (contrast with ROB/IW/bypass structures)
Better resource utilization: by partitioning between CP/AP/DP, processors can have specialized ISAs
Scalability: a direct consequence of simplified logic
For superscalar processors, need to increase the IW, which does not scale (Palacharla/Agarwal papers)
Decoupled machines alleviate centralized resource bottlenecks
Queue-based structure is amenable to tiled architectures with on-chip networks

26
Decoupled Architectures for General Purpose
Computing

So why haven't decoupled machines taken over the world?
Because superscalar architectures took over the world first
Primary drawback of decoupled architectures comes from LOD events: twisty C code can cause severe performance degradation
Inability of compilers to program effectively for separate instruction streams: lack of research/development in the area of programming/compiling analysis
Wheel of Reincarnation: no such thing as a new idea
If we can augment existing decoupled architectures to remove the effects of LOD events, we effectively have an architecture that can feasibly be used for general purpose computing
Leverage existing ideas to augment decoupling: Multithreading and Auxiliary Processing

27
Multithreading on a DCAE Architecture

Multithreading hides latency of LOD events.
LOD events result in very long latencies (>100s of cycles) to reestablish decoupling
Motivation is to hide LOD events to prevent the need to resynchronize
SMT hides functional unit latencies.

28
Multithreading on a DCAE Architecture

Multithreading in access/execute units
Multiple contexts (IP/RF) for fast context switching during an LOD event
Interleaved SMT to hide horizontal as well as vertical waste within the execute processor
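
The effect of switching contexts on an LOD event can be estimated with a toy utilization model. The sketch below rests on assumptions that are not in the slides: every thread alternates a fixed run of useful cycles with a fixed LOD penalty, a context switch costs nothing, and the 50-cycle run and 150-cycle penalty are illustrative numbers only. The function name `utilization` is hypothetical.

```python
def utilization(num_threads, run_cycles, lod_latency, total_cycles):
    """Fraction of cycles doing useful work when switching contexts on each LOD."""
    ready_at = [0] * num_threads            # cycle at which each context is decoupled again
    work_left = [run_cycles] * num_threads  # useful cycles before the next LOD event
    busy = 0
    current = 0
    for cycle in range(total_cycles):
        if ready_at[current] > cycle:
            # Current context is recovering from an LOD: switch to a ready one.
            ready = [t for t in range(num_threads) if ready_at[t] <= cycle]
            if not ready:
                continue                    # every context is stalled: wasted cycle
            current = ready[0]
        busy += 1
        work_left[current] -= 1
        if work_left[current] == 0:         # LOD event ends this thread's run
            ready_at[current] = cycle + 1 + lod_latency
            work_left[current] = run_cycles
    return busy / total_cycles

single = utilization(1, run_cycles=50, lod_latency=150, total_cycles=2000)
four = utilization(4, run_cycles=50, lod_latency=150, total_cycles=2000)
```

Under these assumptions one context keeps the units busy a quarter of the time, while four contexts cover the whole penalty, since three other 50-cycle runs fit inside one 150-cycle recovery.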

29
Multithreading on a DCAE Architecture

With multithreading, utilization of CP/AP/EP by different threads is pipelined
analogous to instruction pipelining in a CPU datapath

30-46
Multithreading on a DCAE Architecture

[Slides 30-46: figure-only slides sharing this title; no body text survives except the label "LOD!" on slide 35.]

47
Auxiliary Decoupled Access/Execute Streaming Units

Implement the control processor as a fully functional high-performance microprocessor. The compiler can avoid decoupling control-intensive code.
When decoupling is possible (e.g. streaming computations), the decoupled access/execute engines provide a high-performance, complexity-effective alternative.
Analogous to vector coprocessors or SIMD array coprocessors. Basic idea is to utilize specialized hardware when possible and have a fallback plan when the Achilles heel is exposed.

48
Extensions for Improved Performance

Wider-issue access/execute processors
Speculative multithreading
Control processor can spawn speculative threads when only a single thread of control is available
Misspeculation detection can be performed by checking accessed memory addresses (in queues) for collisions
Kill a speculative thread by simply flushing its queues/context
Can merge concepts, with multithreaded decoupled execution under the auxiliary access/execute units paradigm.
Use decoupling/multithreading when possible, and fall back on the high-performance control processor otherwise
Tiled architectures: extend decoupled architectures to scalable multiprocessor systems such as RAW.
Queue-based structure is a good fit for incorporating communication from other tiles
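
The collision check and the flush can be sketched as follows. This is a hedged illustration, not the proposed hardware: the function names are hypothetical, the queue labels "LAQ" and "LDQ" are read here as load address and load data queues (an assumption), and a collision is taken to mean that the speculative thread loaded an address that an older thread stores to.

```python
from collections import deque

def find_collisions(spec_load_addrs, older_store_addrs):
    """Addresses read by the speculative thread that an older thread writes."""
    return sorted(set(spec_load_addrs) & set(older_store_addrs))

def commit_or_squash(spec_queues, older_store_addrs):
    """Return True if the speculative thread may commit; otherwise flush it."""
    if find_collisions(spec_queues["LAQ"], older_store_addrs):
        for queue in spec_queues.values():
            queue.clear()   # killing the thread is just a flush of its queues
        return False
    return True

spec = {"LAQ": deque([0x100, 0x108]), "LDQ": deque([7, 9])}
ok = commit_or_squash(spec, older_store_addrs=[0x200])        # no overlap
squashed = not commit_or_squash(spec, older_store_addrs=[0x108])  # overlap: flush
```

Since all of a thread's in-flight state in this model lives in its queues, recovery needs no rollback of committed state, which is what makes the flush cheap.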

49
Summary

Decoupled architectures represent a complexity-effective and scalable way to provide dynamic scheduling, hide latency, and exploit ILP
To enable general purpose computation, we can augment decoupling with multithreading to hide the latency of LODs
By using decoupled access and execute units as auxiliary processors, we can leverage the benefits of both decoupling for streaming computations and out-of-order superscalars for control-flow-intensive computations