Title: Decoupled Architectures for Complexity-Effective General Purpose Processors
1 Decoupled Architectures for Complexity-Effective General Purpose Processors
- Ronny Krashinsky and Mike Sung
- 6.893 Term Project Presentation
- MIT Laboratory for Computer Science
- 12-7-2000
2 Motivation
- out-of-order superscalar designs are inefficient and hard to scale
- decoupled architectures can provide latency hiding, dynamic scheduling, and ILP in a much more complexity-effective and scalable manner
- in previous work, decoupled architectures have been investigated for scientific apps
- superscalar architectures are used universally for general purpose computing requirements
- why? superscalars provide more flexibility, and decoupled architectures break down when there is a loss of decoupling
3 Proposal
- use decoupled architectures for complexity-effective general purpose computing
- multithreading can be used to hide loss of decoupling latency
- potentially get the best out of both architectures by providing a superscalar processor with decoupled engines for complexity-effective streaming computations
- we will present a survey of prior work and our proposed architectural innovations; unfortunately, a lot of infrastructure (e.g. a compiler) is required for a more detailed investigation
4 Decoupled Access/Execute Architecture
- AP (access processor) and EP (execute processor) process separate instruction streams
- EP used for computation (floating point)
- ILP
- data values communicated via queues
- slip: AP runs ahead of EP
- memory latency hiding
- dynamic scheduling
- head of AEQ can be used as instruction operand in EP
- blocks if data isn't available
- takes the place of register renaming
- store addresses wait in WAQ until corresponding data arrives from EP
- loads can bypass stores (check address)
Decoupled Access/Execute Computer Architectures, Smith, 1982
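The latency-hiding effect of slip can be illustrated with a toy timing model. This is a minimal sketch, not the machine described by Smith: the function name, the one-load-per-cycle AP, the one-value-per-cycle EP, and the fixed memory latency are all assumptions made for illustration. The queue depth bounds how far the AP can run ahead.

```python
from collections import deque

def run_decoupled(n_loads, mem_latency, queue_depth):
    """Cycles until the EP has consumed n_loads values (toy model)."""
    aeq = deque()              # arrival cycle of each load the AP has issued
    issued = consumed = cycle = 0
    while consumed < n_loads:
        cycle += 1
        # EP: use the head of the queue as an operand, block if not yet arrived
        if aeq and aeq[0] <= cycle:
            aeq.popleft()
            consumed += 1
        # AP: run ahead, issuing one load per cycle while the queue has room
        if issued < n_loads and len(aeq) < queue_depth:
            aeq.append(cycle + mem_latency)
            issued += 1
    return cycle

print(run_decoupled(4, 10, 8))  # deep queue: latency is paid once
print(run_decoupled(4, 10, 1))  # depth 1: no slip, latency paid per load
```

With a deep queue the model pays the memory latency once and then streams one value per cycle. With a depth of one the AP cannot slip ahead, so every load exposes the full latency.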
5 Decoupled Access/Execute Architecture
- program control flow implemented with corresponding conditional branch in each stream
- branch condition queues allow AP to hide branch latency from EP
- loss of decoupling if AP depends on branch condition from EP
- not discussed in early works
- implemented in the Astronautics ZS-1 Processor
- single interleaved instruction stream is split to feed instruction queues
- control flow instruction executed in the splitter
Decoupled Access/Execute Computer Architectures, Smith, 1982
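The splitter idea in the last two bullets can be sketched as follows. The instruction encoding here is hypothetical and is not the ZS-1 format: each entry carries a tag ("A" for the access stream, "E" for the execute stream, "J" for a jump), and only forward unconditional jumps are modeled so that the sketch always terminates.

```python
from collections import deque

def split_stream(program):
    """Route a tagged, interleaved instruction list into AP and EP queues."""
    ap_queue, ep_queue = deque(), deque()
    pc = 0
    while pc < len(program):
        tag, payload = program[pc]
        if tag == "J":
            # Control flow is executed in the splitter and never queued.
            # payload is the (assumed) index of the jump target.
            pc = payload
            continue
        (ap_queue if tag == "A" else ep_queue).append(payload)
        pc += 1
    return list(ap_queue), list(ep_queue)

program = [
    ("A", "load x[i]"),
    ("E", "mul AEQ, f1"),
    ("J", 4),               # skip the next instruction
    ("E", "skipped"),
    ("A", "store y[i]"),
]
ap, ep = split_stream(program)
print(ap)
print(ep)
```

Conditional branches would additionally need the branch condition queues mentioned above. They are left out to keep the sketch short.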
6 Simultaneous Multithreading with DAE
- observation that functional unit latencies and true data dependencies in EP hinder performance
- use SMT and thread level parallelism to better utilize functional units (same as with SMT in superscalars)
- few threads are required
- decoupling provides memory latency tolerance, SMT hides functional unit latencies
The Synergy of Multithreading and Access/Execute Decoupling, Parcerisa and Gonzalez, 1998
7 Decoupled Control/Access/Execute Architecture
- further optimization: control decoupling
- three instruction streams, dynamic slip
- CP processes control flow graph, sends directives to AP and EP to execute basic blocks
- limited control capabilities in AP and EP: loop count and predication
- fetch engines fill queues with valid instructions
- dynamic loop unrolling
- control latency hidden (without speculation)
- stream units
- CU can operate in stand-alone mode
- implemented as a 21064, ran the OS
The Effectiveness of Decoupling, Bird et al., 1993
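One way to picture the directive mechanism is a fetch engine that expands each directive into a flat run of instructions, which amounts to unrolling loops dynamically. The sketch below assumes a hypothetical directive format of (block name, loop count) and a dictionary of basic blocks. Neither is taken from Bird et al.

```python
def expand_directives(directives, blocks):
    """Fetch-engine sketch: turn CP directives into a flat instruction queue.

    directives: list of (block_name, loop_count) pairs sent by the CP (assumed format)
    blocks:     mapping from block name to its list of instructions
    """
    queue = []
    for block_name, loop_count in directives:
        for _ in range(loop_count):
            # Each iteration is copied into the queue, so the AP or EP
            # sees straight-line code and never waits on a loop branch.
            queue.extend(blocks[block_name])
    return queue

blocks = {"body": ["load", "add"], "exit": ["store"]}
print(expand_directives([("body", 2), ("exit", 1)], blocks))
# → ['load', 'add', 'load', 'add', 'store']
```

Because the queue holds only valid instructions, the consuming processor needs no branch prediction for the loop, which matches the "control latency hidden (without speculation)" bullet.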
8 Decoupled Control/Access/Execute Architecture
- loss of decoupling events cause breakdown
The Performance of Decoupled Architectures, Parcerisa et al., 1996
9-24 Decoupled Control/Access/Execute Architecture
- (figure-only animation slides; the only surviving text is "LOD!" on slide 14)
25 Decoupled vs. Superscalar Architectures
- Dynamic out-of-order execution with less complexity
- Allows non-speculative instruction and data prefetching. We can shrink data structures like first level caches, potentially reducing critical paths as well as reducing power
- Inherent tolerance of long memory latency provides a performance advantage for streaming applications, etc., where lack of locality diminishes the performance advantages of caches
- Simplified issue logic which can be implemented with small structures/queues (contrast with ROB/IW/bypass structures)
- Better resource utilization by partitioning between CP/AP/DP; processors can have specialized ISAs
- Scalability: direct consequence of simplified logic
- For superscalar processors, need to increase IW, which does not scale (Palacharla/Agarwal papers)
- Decoupled machines alleviate centralized resource bottlenecks
- Queue-based structure is amenable to tiled architectures with on-chip networks
26 Decoupled Architectures for General Purpose Computing
- So why haven't decoupled machines taken over the world?
- Because superscalar architectures took over the world first
- Primary drawback of decoupled architectures comes from LOD events
- twisty C code can cause severe performance degradation
- Inability of compilers to program effectively for separate instruction streams: lack of research/development in the area of programming/compiling analysis
- Wheel of Reincarnation: no such thing as a new idea
- If we can augment existing decoupled architectures to remove the effects of LOD events, we effectively have an architecture that can feasibly be used for general purpose computing
- Leverage existing ideas to augment decoupling: Multithreading and Auxiliary Processing
27 Multithreading on a DCAE Architecture
- Multithreading hides latency of LOD events.
- LOD events result in very long latencies (>100s of cycles) to reestablish decoupling
- Motivation is to hide LOD events to prevent need to resynchronize
- SMT hides functional unit latencies.
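A back-of-the-envelope model shows why a few threads can cover a long LOD latency. It is a sketch under stated assumptions: every thread runs for a fixed number of cycles between LOD events, every LOD costs a fixed number of cycles, and context switches are free. The function names and the numbers used are illustrative and are not measurements from this work.

```python
def pipeline_utilization(threads, run_cycles, lod_cycles):
    """Fraction of cycles doing useful work, ignoring context-switch cost."""
    return min(1.0, threads * run_cycles / (run_cycles + lod_cycles))

def threads_to_hide_lod(run_cycles, lod_cycles):
    """Smallest thread count at which the model above reaches 1.0."""
    total = run_cycles + lod_cycles
    return (total + run_cycles - 1) // run_cycles  # integer ceiling of total / run_cycles

# Example: 100 useful cycles between LOD events, 300 cycles to re-decouple.
print(pipeline_utilization(1, 100, 300))  # → 0.25
print(threads_to_hide_lod(100, 300))      # → 4
```

In this model a single thread is idle three quarters of the time, while four contexts are enough to keep the processors busy during every resynchronization.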
28 Multithreading on a DCAE Architecture
- Multithreading in access/execute units
- Multiple contexts (IP/RF) for fast context-switching during LOD event
- Interleaved SMT to hide horizontal as well as vertical waste within execute processor
29 Multithreading on a DCAE Architecture
- With multithreading, utilization of CP/AP/EP by different threads is pipelined
- analogous to instruction pipelining in a CPU datapath
30-46 Multithreading on a DCAE Architecture
- (figure-only animation slides; the only surviving text is "LOD!" on slide 35)
47 Auxiliary Decoupled Access/Execute Streaming Units
- Implement control processor as fully functional high-performance microprocessor. Compiler can avoid decoupling control intensive code.
- When decoupling is possible (e.g. streaming computations), the decoupled access/execute engines provide a high-performance complexity-effective alternative.
- Analogous to vector coprocessors or SIMD array coprocessors. Basic idea is to utilize specialized hardware when possible and have a fallback plan when the Achilles heel is exposed.
48 Extensions for Improved Performance
- Wider issue access/execute processors
- Speculative Multithreading
- Control processor can spawn speculative threads when only a single thread of control is available
- Mis-speculation detection can be performed by checking accessed memory addresses (in queues) for collisions
- Kill speculative thread by simply flushing queues/context
- Can merge concepts, with multithreaded decoupled execution under the auxiliary access/execute units paradigm.
- Use decoupling/multithreading when possible, and fall back on high performance control processor otherwise
- Tiled architectures: extend decoupled architectures to scalable multiprocessor systems such as RAW.
- Queue-based structure is a good fit for incorporating communication from other tiles
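The collision check for speculative threads can be sketched as below. This is only an illustration of the idea in the speculative multithreading bullets: the class, its fields, and the policy of checking each older store against the speculative thread's queued load addresses are assumptions, and addresses are compared for exact equality.

```python
from collections import deque

class SpeculativeThread:
    """Hypothetical speculative context with a queue of load addresses."""

    def __init__(self):
        self.load_addr_queue = deque()
        self.flushed = False

    def load(self, addr):
        # Record every address the speculative thread has read.
        self.load_addr_queue.append(addr)

    def flush(self):
        # Killing the thread only requires discarding its queued state.
        self.load_addr_queue.clear()
        self.flushed = True

def check_store(spec, store_addr):
    """Called when the older thread stores; returns True if spec was killed."""
    if store_addr in spec.load_addr_queue:
        spec.flush()
        return True
    return False

spec = SpeculativeThread()
spec.load(0x100)
spec.load(0x108)
print(check_store(spec, 0x200))  # no overlap, speculation continues
print(check_store(spec, 0x108))  # collision, thread is flushed
```

Because the speculative state lives in queues and a spare context, recovery in this sketch is a flush and needs no rollback of committed state.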
49 Summary
- Decoupled architectures represent a complexity-effective and scalable way to provide dynamic scheduling, hide latency, and exploit ILP
- To enable general purpose computation, we can augment decoupling with multithreading to hide the latency of LODs
- By using decoupled access and execute units as auxiliary processors, we can leverage the benefits of both decoupling for streaming computations, and out-of-order superscalars for control flow intensive computations