Title: The Stanford Hydra Chip Multiprocessor
1 The Stanford Hydra Chip Multiprocessor
- Kunle Olukotun
- The Hydra Team
- Computer Systems Laboratory
- Stanford University
2 Technology → Architecture
- Transistors are cheap, plentiful and fast
- Moore's law
- 100 million transistors by 2000
- Wires are cheap, plentiful and slow
- Wires get slower relative to transistors
- Long cross-chip wires are especially slow
- Architectural implications
- Plenty of room for innovation
- Single-cycle communication requires localized blocks of logic
- High communication bandwidth across the chip is easier to achieve than low latency
3 Exploiting Program Parallelism
4 Hydra Approach
- A single-chip multiprocessor architecture composed of simple fast processors
- Multiple threads of control
- Exploits parallelism at all levels
- Memory renaming and thread-level speculation
- Makes it easy to develop parallel programs
- Keep design simple by taking advantage of single-chip implementation
5 Outline
- Base Hydra Architecture
- Performance of base architecture
- Speculative thread support
- Speculative thread performance
- Improving speculative thread performance
- Hydra prototype design
- Conclusions
6 The Base Hydra Design
- Shared 2nd-level cache
- Low latency interprocessor communication (10 cycles)
- Separate read and write buses
- Single-chip multiprocessor
- Four processors
- Separate primary caches
- Write-through data caches to maintain coherence
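As a rough illustration of how write-through L1 caches in front of a shared L2 can stay coherent, the Python sketch below models four private caches and a shared write bus. The invalidate-on-snoop policy and every class and method name are assumptions made for illustration; the slides state only that the data caches are write-through and that reads and writes use separate buses.

```python
class WriteThroughSystem:
    """Toy model: N private L1 caches in front of one shared L2 (hypothetical)."""

    def __init__(self, num_cpus=4):
        self.l2 = {}                                  # shared second-level cache
        self.l1 = [dict() for _ in range(num_cpus)]   # one private L1 per CPU

    def read(self, cpu, addr):
        # L1 hit: serve locally. L1 miss: fetch from the shared L2 and fill L1.
        if addr not in self.l1[cpu]:
            self.l1[cpu][addr] = self.l2.get(addr, 0)
        return self.l1[cpu][addr]

    def write(self, cpu, addr, value):
        # Write-through: update the local L1 and the shared L2 together.
        self.l1[cpu][addr] = value
        self.l2[addr] = value
        # Assumed policy: other L1s snoop the write bus and drop stale copies.
        for other, cache in enumerate(self.l1):
            if other != cpu:
                cache.pop(addr, None)


system = WriteThroughSystem()
system.write(0, 0x40, 7)
first = system.read(1, 0x40)    # CPU 1 misses in L1 and fills from L2
system.write(2, 0x40, 9)        # invalidates the copies held by CPUs 0 and 1
second = system.read(1, 0x40)   # CPU 1 misses again and sees the new value
print(first, second)
```

Because every write reaches the L2 immediately, a reader that misses never has to query another processor's cache, which is one reason a write-through design keeps the coherence logic simple.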
7 Hydra vs. Superscalar
- ILP only: superscalar (SS) 30-50% better than a single Hydra processor
- ILP and fine thread: SS and Hydra comparable
- ILP and coarse thread: Hydra 1.5-2× better
- "The Case for a CMP", ASPLOS 96
[Figure: speedup (0 to 4) of Hydra (4 x 2-way issue) vs. superscalar (6-way issue) on swim, applu, OLTP, eqntott, MPEG2, tomcatv, compress and m88ksim]
8 Problem: Parallel Software
- Parallel software is limited
- Hand-parallelized applications
- Auto-parallelized dense matrix FORTRAN applications
- Traditional auto-parallelization of C programs is very difficult
- Threads have data dependencies, which require synchronization
- Pointer disambiguation is difficult and expensive
- Compile time analysis is too conservative
- How can hardware help?
- Remove need for pointer disambiguation
- Allow the compiler to be aggressive
9 Solution: Data Speculation
- Data speculation enables parallelization without regard for data dependencies
- Loads and stores follow original sequential semantics
- Speculation hardware ensures correctness
- Add synchronization only for performance
- Loop parallelization is now easily automated
- Other ways to parallelize code
- Break code into arbitrary threads (e.g. speculative subroutines)
- Parallel execution with sequential commits
- Data speculation support
- Wisconsin multiscalar
- Hydra provides low-overhead support for CMP
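The idea of parallel execution with sequential commits can be sketched in software. The Python model below is an illustrative simulation, not Hydra's hardware or runtime: the names `execute` and `run_speculative` are hypothetical, iterations are run one after another against a shared snapshot to stand in for parallel threads, and data forwarding between threads is deliberately left out.

```python
def execute(loop_body, i, base):
    """Run one iteration against `base`, recording exposed reads and buffered writes."""
    reads, writes = set(), {}

    def load(addr):
        if addr in writes:            # a thread sees its own buffered stores
            return writes[addr]
        reads.add(addr)
        return base.get(addr, 0)

    def store(addr, value):
        writes[addr] = value          # buffered until the thread commits

    loop_body(i, load, store)
    return reads, writes


def run_speculative(loop_body, num_iters, memory, num_cpus=4):
    """Execute the loop in batches of num_cpus threads; return the violation count."""
    violations = 0
    for start in range(0, num_iters, num_cpus):
        batch = range(start, min(start + num_cpus, num_iters))
        snapshot = dict(memory)       # state every thread in the batch starts from
        results = [execute(loop_body, i, snapshot) for i in batch]
        committed_writes = set()
        for i, (reads, writes) in zip(batch, results):   # commit in original order
            if reads & committed_writes:
                # The thread read a location too early: discard and re-execute.
                violations += 1
                reads, writes = execute(loop_body, i, memory)
            memory.update(writes)
            committed_writes.update(writes)
    return violations


def running_sum(i, load, store):
    # Loop-carried dependence through "sum".
    store("sum", load("sum") + load(("a", i)))


memory = {("a", i): i + 1 for i in range(8)}
violations = run_speculative(running_sum, 8, memory)
print(memory["sum"], violations)
```

The final value matches sequential execution even though the loop body was written with no regard for the dependence on `sum`; the price is the re-executed threads, which is why the slides suggest adding synchronization only for performance.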
10 Data Speculation Requirements I
- Forward data between parallel threads
- Detect violations when reads occur too early
11 Data Speculation Requirements II
- Safely discard bad state after violation
- Correctly retire speculative state
12 Data Speculation Requirements III
- Maintain multiple views of memory
13 Hydra Speculation Support
- Write bus and L2 buffers provide forwarding
- Read L1 tag bits detect violations
- Dirty L1 tag bits and write buffers provide backup
- Write buffers reorder and retire speculative state
- Separate L1 caches with pre-invalidation and smart L2 forwarding provide multiple views of memory
- Speculation coprocessors to control threads
14 Speculative Reads
- L1 hit
- The read bits are set
- L1 miss
- L2 and write buffers are checked in parallel
- The newest bytes written to a line are pulled in by priority encoders on each byte (priority A-D)
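The per-byte selection on an L1 miss can be pictured as a loop over byte offsets, with a priority search at each offset. This Python sketch is a software stand-in for the priority encoders; the function name `merge_line`, the dictionary representation of a write buffer, and the assumption that the caller passes the buffers from highest to lowest priority are all illustrative choices, since the slides do not spell out which source corresponds to each of A-D.

```python
def merge_line(l2_line, buffers_by_priority):
    """Assemble a line byte by byte from the highest-priority source holding each byte.

    l2_line: bytes of the line as stored in L2 (lowest priority).
    buffers_by_priority: list of {byte_offset: value} dicts, highest priority first.
    """
    merged = bytearray(l2_line)
    for offset in range(len(l2_line)):
        for buf in buffers_by_priority:   # software stand-in for a priority encoder
            if offset in buf:
                merged[offset] = buf[offset]
                break                     # first match wins; lower priorities are ignored
    return bytes(merged)


l2 = bytes([0x00] * 8)
own_buffer = {0: 0xAA}                # bytes this CPU has written itself
earlier_cpu = {0: 0xBB, 1: 0xBB}      # bytes written by the next-earlier thread
oldest_cpu = {1: 0xCC, 2: 0xCC}       # bytes written by an even earlier thread
line = merge_line(l2, [own_buffer, earlier_cpu, oldest_cpu])
print(line.hex())
```

Selecting per byte rather than per line lets a thread pick up a partial write from one buffer while the remaining bytes of the same line come from another buffer or from the L2.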
15 Speculative Writes
- A CPU writes to its L1 cache and write buffer
- Writes from earlier CPUs invalidate our L1 and cause RAW hazard checks
- Writes from later CPUs just pre-invalidate our L1
- Non-speculative write buffer drains out into the L2
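The interaction between read bits and broadcast writes can be modeled in a few lines. The Python sketch below is a simplified, hypothetical model: the `SpecCache` class, the function names, and the convention that list index equals thread order (0 is the earliest thread) are assumptions, and refinements such as a thread overwriting a line before reading it are ignored.

```python
class SpecCache:
    """Per-CPU speculative L1 state (hypothetical model)."""

    def __init__(self):
        self.valid = set()         # lines currently present in this L1
        self.read_bits = set()     # lines read speculatively by the current thread
        self.pre_invalid = set()   # lines to drop once the current thread finishes


def spec_read(caches, cpu, line):
    cache = caches[cpu]
    cache.valid.add(line)
    cache.read_bits.add(line)      # remember that this thread depends on the line


def spec_write(caches, writer, line):
    """Broadcast a write and return the CPUs that suffer a RAW violation."""
    violated = []
    caches[writer].valid.add(line)
    for cpu, cache in enumerate(caches):
        if cpu > writer:
            # Write from an earlier thread: check for a too-early read, then invalidate.
            if line in cache.read_bits:
                violated.append(cpu)
            cache.valid.discard(line)
        elif cpu < writer:
            # Write from a later thread: the current copy stays usable for now.
            cache.pre_invalid.add(line)
    return violated


caches = [SpecCache() for _ in range(4)]
spec_read(caches, 2, 0x10)
hit = spec_write(caches, 1, 0x10)     # CPU 2 already read the line
quiet = spec_write(caches, 3, 0x20)   # nobody later than CPU 3 exists
print(hit, quiet)
```

The asymmetry is the point: only threads later in program order can have consumed a stale value, so only they need the hazard check, while earlier threads merely note that their copy will be out of date for whatever thread runs next on that CPU.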
16 Speculation Runtime System
- Software Handlers
- Control speculative threads through CP2 interface
- Track order of all speculative threads
- Exception routines recover from data dependency violations
- Adds more overhead to speculation than hardware but more flexible and simpler to implement
- Complete description in "Data Speculation Support for a Chip Multiprocessor" (ASPLOS 98) and "Improving the Performance of Speculatively Parallel Applications on the Hydra CMP" (ICS 99)
17 Creating Speculative Threads
- Speculative loops
- for and while loop iterations
- Typically one speculative thread per iteration
- Speculative procedures
- Execute code after procedure speculatively
- Procedure calls generate a speculative thread
- Compiler support
- C source to source translator
- Pfor, pwhile
- Analyze loop body and globalize any local variables that could cause loop-carried dependencies
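One simple way to find the locals that need globalizing is to look for variables that an iteration reads before writing and that the loop body also writes. The slides do not describe the translator's actual analysis, so the Python sketch below is only an assumption about how such a check could work; it handles straight-line loop bodies, ignores control flow, and the function name and operation format are hypothetical.

```python
def loop_carried_locals(body_ops):
    """Return locals that may carry a value from one iteration to the next.

    body_ops: list of ("read" | "write", variable_name) pairs in program order
    for a single iteration of the loop body.
    """
    written_so_far = set()
    exposed_reads = set()
    for op, var in body_ops:
        if op == "read" and var not in written_so_far:
            exposed_reads.add(var)       # value must come from outside this iteration
        elif op == "write":
            written_so_far.add(var)
    # Carried only if some iteration also produces the value that is read early.
    return exposed_reads & written_so_far


# Body of:  tmp = a[i] * scale;  total = total + tmp;
ops = [
    ("read", "scale"),
    ("write", "tmp"),
    ("read", "tmp"),
    ("read", "total"),
    ("write", "total"),
]
carried = loop_carried_locals(ops)
print(carried)
```

Here `tmp` is private to each iteration and `scale` is loop-invariant, so only `total` would have to live in memory where the speculation hardware can track it.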
18 Base Speculative Thread Performance
- Entire applications
- GCC 2.7.2 -O2
- 4 single-issue processors
- Accurate modeling of all aspects of Hydra architecture and real runtime system
[Figure: base speculative speedup (0 to 4) for wc, ear, ijpeg, grep, alvin, eqntott, mpeg2, simplex, m88ksim, cholesky, compress and sparse1.3]
19 Improving Speculative Runtime System
- Procedure support adds overhead to loops
- Threads are not created sequentially
- Dynamic thread scheduling necessary
- Start and end of loop: 75 cycles
- End of iteration: 80 cycles
- Performance
- Best performing speculative applications use loops
- Procedure speculation often lowers performance
- Need to optimize RTS for common case
- Lower speculative overheads
- Start and end of loop: 25 cycles
- End of iteration: 12 cycles (almost a factor of 7)
- Limit procedure speculation to specific procedures
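To get a feel for why the end-of-iteration overhead matters most for fine-grained threads, the cycle counts above can be plugged into a back-of-the-envelope model. The formula in this Python sketch is an assumption introduced here, not something taken from the slides: it ignores violations and memory effects, and it treats the start-and-end-of-loop figure as a single per-loop cost.

```python
import math


def estimated_speedup(iters, work_cycles, loop_overhead, iter_overhead, cpus=4):
    """Best-case speedup of a speculative loop under a simple overhead model."""
    sequential = iters * work_cycles
    parallel = loop_overhead + math.ceil(iters / cpus) * (work_cycles + iter_overhead)
    return sequential / parallel


BASE = dict(loop_overhead=75, iter_overhead=80)        # original runtime system
OPTIMIZED = dict(loop_overhead=25, iter_overhead=12)   # optimized runtime system

print(f"end-of-iteration overhead ratio: {80 / 12:.2f}")
for work in (50, 200, 1000):
    base = estimated_speedup(1000, work, **BASE)
    opt = estimated_speedup(1000, work, **OPTIMIZED)
    print(f"{work:5d} cycles/iteration: base {base:.2f}x, optimized {opt:.2f}x")
```

Under this model the gap between the two runtime systems is widest when each iteration does little work and narrows as iterations grow.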
20 Improved Speculative Performance
- Improves performance of all applications
- Most improvement for applications with fine-grained threads
- Eqntott uses procedure speculation
[Figure: speedup (0 to 4), Base vs. Optimized RTS, for wc, ear, ijpeg, grep, alvin, eqntott, mpeg2, simplex, cholesky, m88ksim, sparse1.3 and compress]
21 Optimizing Parallel Performance
- Cache coherent shared memory
- No explicit data movement
- 100 cycle communication latency
- Need to optimize for data locality
- Look at cache misses (MemSpy, Flashpoint)
- Speculative threads
- No explicit data independence
- Frequent dependence violations limit performance
- Need to optimize to reduce frequency and impact of data violations
- Dependence prediction can help
- Look at violation statistics (requires some hardware support)
22 Feedback and Code Transformations
- Feedback tool
- Collects violation statistics (PCs, frequency, work lost)
- Correlates read and write PC values with source code
- Synchronization
- Synchronize frequently occurring violations
- Use non-violating loads
- Code Motion
- Find dependent load-stores
- Move loads down in thread
- Move stores up in thread
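The kind of summary such a feedback tool produces can be sketched as a small aggregation over violation events. The event format, the function name `summarize_violations`, and the sample PC values in this Python example are hypothetical; the slides say only that PCs, frequency and work lost are collected.

```python
from collections import defaultdict


def summarize_violations(events):
    """Group violation events by (load PC, store PC) and rank by total work lost.

    events: iterable of (read_pc, write_pc, cycles_lost) tuples (hypothetical format).
    """
    stats = defaultdict(lambda: {"count": 0, "cycles_lost": 0})
    for read_pc, write_pc, cycles_lost in events:
        entry = stats[(read_pc, write_pc)]
        entry["count"] += 1
        entry["cycles_lost"] += cycles_lost
    # Most expensive load-store pair first.
    return sorted(stats.items(), key=lambda item: item[1]["cycles_lost"], reverse=True)


events = [
    (0x400120, 0x400180, 300),
    (0x400120, 0x400180, 250),
    (0x400200, 0x400240, 40),
]
ranking = summarize_violations(events)
worst_pair, worst_stats = ranking[0]
print(hex(worst_pair[0]), hex(worst_pair[1]), worst_stats)
```

Ranking by work lost rather than by raw count points the programmer at the load-store pairs where synchronization or code motion is most likely to pay off.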
23 Code Motion
- Rearrange reads and writes to increase parallelism
- Delay reads and advance writes
- Create local copies to allow earlier data forwarding
[Figure: iterations i and i+1 reading and writing x, before and after code motion]
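The effect of delaying reads and advancing writes can be shown with a toy timing model. In this Python sketch the function name, the cycle offsets, and the restart-on-write rule are all assumptions chosen for illustration: iteration i+1 is taken to violate if it loads x before iteration i has stored it, and to lose everything it has executed up to the moment the store arrives.

```python
def wasted_cycles(read_offset, write_offset, stagger=0):
    """Cycles iteration i+1 throws away under a simple restart-on-write model.

    read_offset:  cycles into iteration i+1 at which it loads x.
    write_offset: cycles into iteration i at which it stores x.
    stagger:      how many cycles after iteration i the next iteration starts.
    """
    read_time = stagger + read_offset   # when iteration i+1 loads x
    write_time = write_offset           # when iteration i stores x
    if read_time < write_time:          # load ran too early: violation
        return write_time - stagger     # work done since the thread started is discarded
    return 0                            # store arrived first and can be forwarded


original = wasted_cycles(read_offset=5, write_offset=90)
after_motion = wasted_cycles(read_offset=60, write_offset=20)
print(original, after_motion)
```

Moving the store earlier and the load later flips the order of the two events, so the later thread receives the forwarded value instead of restarting.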
24 Optimized Speculative Performance
- Base performance
- Optimized RTS with no manual intervention
- Violation statistics used to manually transform code
[Figure: speedup (0 to 4) for the three configurations above on wc, ear, grep, alvin, ijpeg, eqntott, mpeg2, simplex, cholesky, m88ksim, compress and sparse1.3]
25 Size of Speculative Write State
- Max size determines size of write buffer for max performance
- Non-head processor stalls when write buffer fills up
- Small write buffers (< 64 lines) will achieve good performance
[Figure: max no. of lines of write state per application, with 32-byte cache lines]
26 Hydra Prototype
- Design based on Integrated Device Technology (IDT) RC32364
- 88 mm² in 0.25 µm with 8 KB I and D caches and 128 KB L2
27 Conclusions
- Hydra offers a new way to design microprocessors
- Single-chip MP exploits parallelism at all levels
- Low overhead support for speculative parallelism
- Provides high performance on applications with medium to large-grain parallelism
- Allows performance optimization migration path for difficult-to-parallelize fine-grain applications
- Prototype Implementation
- Work out implementation details
- Provide platform for application and compiler development
- Realistic performance evaluation
28 Hydra Team
- Team
- Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prabhu, Mike Siu, Melvyn Lim and Maciek Kozyrczak (IDT)
- URL
- http://www-hydra.stanford.edu