Title: WaveScalar
1. WaveScalar
Steven Swanson
Chris Fisher
Sponsored by NSF, Intel, The ARCS Foundation,
Xilinx, and StoreTek
2. Monolithic von Neumann Processors
A phenomenal success today. But in 10 years?
- Communication
- Broadcast networks
- Defect tolerance
- 1 flaw -> paperweight
- Complexity
- 40-60% of design is validation
- Performance
- Deeper pipes unlikely (ISCA '02)
3. The WaveScalar ISA
- A dataflow ISA with imperative language support
The best of both worlds
- Von Neumann
- Normal memory semantics.
- Coarse-grain, von Neumann-style threads
- Dataflow
- Unordered memory.
- Fine-grain, dataflow-style threads
- Use the best tool for the job.
4-9. WaveScalar example
[Figure, built up across slides 4-9: graph with nodes labeled i, j, A, Load, Store, b]
10. WaveScalar Execution Model
- Put an ALU at every word of instruction memory.
- No processor core.
- Instructions communicate directly.
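The execution model above can be sketched as a toy interpreter. This is a minimal illustration, not the WaveCache design: the `Instruction` class, the `run` loop, and the token format `(target, port, value)` are all assumptions made for the example. Each instruction holds its own operand slots, fires once they are full, and sends its result straight to its consumers; nothing resembling a program counter chooses what runs next.

```python
import operator
from collections import deque


class Instruction:
    """One instruction with private operand slots (a stand-in for its own ALU)."""

    def __init__(self, op, arity, consumers):
        self.op = op                # function applied when the instruction fires
        self.arity = arity          # number of input ports
        self.consumers = consumers  # list of (target name, target port)
        self.operands = {}          # port -> value received so far

    def receive(self, port, value):
        """Store one operand; return True once every port holds a value."""
        self.operands[port] = value
        return len(self.operands) == self.arity

    def fire(self):
        """Apply the operation to the operands in port order and reset the slots."""
        args = [self.operands[p] for p in range(self.arity)]
        self.operands.clear()
        return self.op(*args)


def run(program, tokens):
    """Deliver tokens until none remain. Returns (outputs, firing order)."""
    outputs, fired = {}, []
    pending = deque(tokens)  # each token is (target name, port, value)
    while pending:
        target, port, value = pending.popleft()
        if target not in program:   # names outside the program are outputs
            outputs[target] = value
            continue
        inst = program[target]
        if inst.receive(port, value):
            result = inst.fire()
            fired.append(target)
            # The producer sends its result directly to each consumer.
            for consumer, consumer_port in inst.consumers:
                pending.append((consumer, consumer_port, result))
    return outputs, fired


def difference_of_squares_program():
    """Hypothetical three-instruction graph computing (a + b) * (a - b)."""
    return {
        "add": Instruction(operator.add, 2, [("mul", 0)]),
        "sub": Instruction(operator.sub, 2, [("mul", 1)]),
        "mul": Instruction(operator.mul, 2, [("out", 0)]),
    }


tokens = [("add", 0, 7), ("add", 1, 3), ("sub", 0, 7), ("sub", 1, 3)]
outputs, fired = run(difference_of_squares_program(), tokens)
print(outputs, fired)
```

With a = 7 and b = 3 the graph produces 40. Delivering the input tokens in a different order changes which instruction fires first but not the result, since firing depends only on operand arrival.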
11. The WaveCache
- The I-Cache is the processor.
12. The WaveCache
13. The WaveCache
- Long distance communication
- Dynamic routing
- Grid-based network
- 2 cycles/cluster
- Traditional cache coherence
- Normal memory hierarchy
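As a rough worked example of the long-distance cost, the sketch below estimates a best-case latency between two clusters. It assumes that the 2-cycle figure means two cycles per cluster-to-cluster hop and that a message takes a shortest path on the grid; with dynamic routing the real path and latency may differ. The function names and the (row, column) coordinates are hypothetical.

```python
CYCLES_PER_HOP = 2  # slide figure, read here as cycles per cluster-to-cluster hop (assumption)


def min_hops(src, dst):
    """Shortest hop count between two (row, column) clusters on a grid."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def min_latency_cycles(src, dst, cycles_per_hop=CYCLES_PER_HOP):
    """Best-case network latency in cycles, ignoring contention and rerouting."""
    return min_hops(src, dst) * cycles_per_hop


# Three rows down and two columns across is five hops.
print(min_latency_cycles((0, 0), (3, 2)))
```

Under these assumptions a message crossing five clusters costs at least ten cycles, which suggests why keeping communicating instructions close together matters.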
14. WaveCache Performance
15. Multithreaded Performance
16. Fine-grain Performance
17. WaveScalar's Future
Steven Swanson, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Andrew Schwerin
Mark Oskin, Susan Eggers, Tom Anderson, Carl Ebeling, Hank Levy
Ken Michelson, David Sunderland, Jared Wilkens
Chris Fisher
18. Instruction Placement (Martha Mercaldi)
- Status
- Profile-based, two-level static instruction placement
- Cache-aware performance modeling
- Future Questions
- Should instructions be moved once in place?
- How can placement policy manage matching table resources?
19. Compiler (Andrew Petersen)
[Figure: compiler diagram with labels C, C, ???, Compiler]
- Status
- Simple C code works!
- Future Questions
- What optimizations are (not) valuable in WaveScalar?
- How/should predication be applied?
- Should WaveScalar speculate in software? How?
20. Operating System (Andrew Schwerin)
- Status
- Designing a fine-grain interposition system.
- Future Questions
- How can fine-grain, low-overhead interposition make the OS safer, more efficient, etc.?
- How should the OS manage the WaveCache?
21. Conclusions
- WaveScalar ISA
- A unified dataflow and von Neumann execution model
- Mix-and-match parallelism models
- WaveCache Architecture
- >2x the performance/area of OOO
- Excellent multi-threaded performance.
- Over 250x performance for hand-coded apps.
- Enormous opportunities for future research
22. Decentralized Processing
- Communication
- Defect tolerance
- Complexity
- High Performance
23. WaveScalar example
24. Decentralized Processors
- Communication
- Defect tolerance
- Complexity
- Performance
But how do you execute?
25. Von Neumann is Centralized
- PC-driven fetch is the problem
- One program counter
- Dataflow is the solution
26. Dataflow has been done before...
- Operations fire when data is available
- No program counter
- Convert true control dependences to data dependences
- Exposes massive parallelism
- But...
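One way to picture converting a control dependence into a data dependence is a steering operation: the branch condition becomes an ordinary data token that decides which consumer receives a value. The sketch below is an illustration under that assumption; the `steer` helper, its port names, and the `absolute_value` example are hypothetical and not taken from the slides.

```python
import operator


def steer(predicate, value):
    """Forward value on the output port selected by the predicate token."""
    return ("true" if predicate else "false", value)


def absolute_value(x):
    """Dataflow-style version of: if x < 0: y = -x  else: y = x."""
    predicate = x < 0  # the comparison produces an ordinary data token
    port, token = steer(predicate, x)
    # Each arm is wired to one output port; only the arm that receives a token fires.
    arms = {"true": operator.neg, "false": operator.pos}
    return arms[port](token)


print(absolute_value(-5), absolute_value(3))
```

No branch instruction redirects a program counter here. The arm that runs is simply the one whose operand arrives.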
27. ...it had issues
- Scalability
- Dataflow never executed mainstream code
- No total load-store ordering
- Special languages
- Different memory semantics
- No mutable data structures (mostly)
- Functional (mostly)
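The ordering problem can be seen in a two-instruction sketch. Assuming a store and a later load touch the same address and share no data dependence, a pure dataflow machine is free to fire them in either order, so the load may see the old or the new value. The program, the address, and the function names below are made up for illustration.

```python
from itertools import permutations


def execute(order):
    """Run a store and a load to address 0 in the given firing order."""
    memory = {0: 0}  # address 0 initially holds 0
    loaded = None
    for op in order:
        if op == "store":
            memory[0] = 1        # imperative source: A[0] = 1
        elif op == "load":
            loaded = memory[0]   # imperative source: x = A[0]
    return loaded


def possible_load_results():
    """Values the load can return when nothing orders it against the store."""
    return {execute(order) for order in permutations(["store", "load"])}


print(possible_load_results())
```

Program order (store, then load) yields 1, but the unordered machine can also yield 0, which is why imperative code needs some form of load-store ordering.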
28. Things to keep you up at night (2016)
- Opportunities
- 8 billion transistors, 28 GHz
- 4 GB per DRAM chip
- 120 P4s OR 200,000 RISC-1 per die
- Challenges
- Communication
- Defects
- Complexity
- Performance
29. Microarchitecture (Steven Swanson, Andrew Putnam, Ken Michelson)
- Domain
- How to spend wires?
- What are PEs?
- Network topology and routing
- SystemC model
30. Performance
- Cycle-accurate simulator
- Binary translator from Alpha -> WaveScalar assembly
- A selection of SPEC2000 and MediaBench
- WaveCache
- 2000 processing elements
- No speculation
- Compare to a very aggressive superscalar
- 15-stage, 16-wide
- 1024 registers, 1024-entry issue queue
31. FPGA Prototype (Chris Fisher, Jared Wilkens)
- FPGA prototype
- Boards
- 4 FPGA w/ 2 PPC cores
- DDR Memory
- SRAM
- Attached to a PPC Brain