Title: WaveScalar
1WaveScalar
Steven Swanson
Ken Michelson, David Sunderland, Jared Wilkins,
Chris Fisher
Sponsored by NSF, Intel, The ARCS Foundation,
Xilinx, and StoreTek
2Things to keep you up at night 2016
- Opportunities
- 8 billion transistors, 28 GHz
- 4GB per DRAM chip
- 120 P4s OR 200,000 RISC-1 per die
- Challenges
- Communication
- Defects
- Complexity
- Performance
3Monolithic von Neumann Processors
A phenomenal success today. But in 2016?
- ? Communication
- Broadcast networks
- ? Defect tolerance
- 1 flaw -> paperweight
- ? Complexity
- 40-60% of design is validation
- ? Performance
- Deeper pipes unlikely (ISCA02)
4Decentralized Processors
- ? Communication
- ? Defect tolerance
- ? Complexity
- ? Performance
But how do you execute?
5Von Neumann is Centralized
- PC-driven fetch is the problem
- One program counter
- Dataflow is the solution
6Dataflow has been done before...
- Operations fire when data is available
- No program counter
- Convert true control dependences to data dependences
- Exposes massive parallelism
- But...
7...it had issues
- Scalability
- Dataflow never executed mainstream code
- No total load-store ordering
- Special languages
- Different memory semantics
- No mutable data structures (mostly)
- Functional (mostly)
8The WaveScalar ISA
- A dataflow ISA with imperative language support
The best of both worlds
- Von Neumann
- Normal memory semantics.
- Coarse-grain, von Neumann-style threads
- Dataflow
- Unordered memory.
- Fine-grain, dataflow-style threads
- Use the best tool for the job.
9WaveScalar example
[Figure, animated across slides 9-16: dataflow graph with nodes labeled i, j, A, Load, Store, and b]
17Wave-ordered memory
- Compiler annotates memory operations
- Send memory requests in any order
- Hardware reconstructs the correct order
[Figure: dataflow graph of Load and Store operations]
18Wave-ordering Example
[Figure: Load and Store operations annotated with ordering numbers; recoverable labels: Load, Store, and the values 4, 5, 6, 8]
19Wave-ordered Memory
- Waves are loop-free sections of the dataflow graph
- Each dynamic wave has a wave number
- Wave-ordered memory
- Wave-numbers
- Sequence number
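The idea of sending requests in any order and letting hardware rebuild the order can be sketched as follows. This is a minimal illustration, not the WaveCache's actual store buffer: it assumes each memory operation carries a triple (predecessor, own sequence number, successor), which is how the three-number annotations later in this deck appear to read. It handles a single wave of straight-line code only, and the names `MemOp` and `WaveOrderBuffer` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    name: str   # e.g. "St a" or "Ld b" (labels only, no real memory access)
    pred: int   # assumed: sequence number of the previous memory operation
    seq: int    # assumed: this operation's own sequence number
    succ: int   # assumed: sequence number of the next memory operation


class WaveOrderBuffer:
    """Hypothetical buffer that releases requests in annotated order."""

    def __init__(self, first_seq):
        self.expected = first_seq   # sequence number allowed to issue next
        self.pending = {}           # seq -> MemOp that arrived too early
        self.issued = []            # operations released to memory, in order

    def receive(self, op):
        # Requests may arrive in any order; park each one by sequence number.
        self.pending[op.seq] = op
        # Release every operation whose turn has come, following the chain.
        while self.expected in self.pending:
            ready = self.pending.pop(self.expected)
            self.issued.append(ready)
            self.expected = ready.succ


ops = [
    MemOp("St a", 0, 1, 2),
    MemOp("Mem_nop_ack", 1, 2, 3),
    MemOp("Mem_nop_ack", 2, 3, 4),
    MemOp("Ld b", 3, 4, 5),
]
buf = WaveOrderBuffer(first_seq=1)
for op in (ops[2], ops[0], ops[3], ops[1]):   # deliberately out of order
    buf.receive(op)
print([op.seq for op in buf.issued])
```

A fuller model would also key the buffer by wave number and would need some way to handle branches, where the successor is not known statically; both are left out here.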
20WaveScalar Execution Model
- Put an ALU at every word of instruction memory.
- No processor core.
- Instructions communicate directly.
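A small sketch may make the execution model concrete: there is no program counter, each instruction fires once all of its operands have arrived, and it sends its result straight to its consumers. The graph below, computing (i + j) * (i - j), is a hypothetical example and not the program shown on the earlier slides. The `Instruction` class and `run` function are illustrative names, and tags (wave numbers) are ignored, so each instruction handles only one dynamic instance.

```python
import operator
from collections import deque


class Instruction:
    def __init__(self, name, op, arity, consumers):
        self.name = name
        self.op = op
        self.arity = arity
        self.consumers = consumers   # list of (destination name, input port)
        self.inputs = {}             # port -> operand value received so far


def run(instructions, initial_tokens):
    """Deliver tokens until none remain; fire instructions as they fill."""
    queue = deque(initial_tokens)    # each token: (destination, port, value)
    results, fired = {}, []
    while queue:
        dest, port, value = queue.popleft()
        if dest not in instructions:         # token leaves the graph
            results[dest] = value
            continue
        inst = instructions[dest]
        inst.inputs[port] = value
        if len(inst.inputs) == inst.arity:   # firing rule: all operands present
            args = [inst.inputs[p] for p in range(inst.arity)]
            inst.inputs = {}
            out = inst.op(*args)
            fired.append(inst.name)
            for consumer, consumer_port in inst.consumers:
                queue.append((consumer, consumer_port, out))
    return results, fired


graph = {
    "add": Instruction("add", operator.add, 2, [("mul", 0)]),
    "sub": Instruction("sub", operator.sub, 2, [("mul", 1)]),
    "mul": Instruction("mul", operator.mul, 2, [("out", 0)]),
}
# Operands for i = 5, j = 2 arrive in an arbitrary order.
tokens = [("sub", 1, 2), ("add", 0, 5), ("sub", 0, 5), ("add", 1, 2)]
results, fired = run(graph, tokens)
print(results, fired)
```

The firing order depends only on when operands arrive, not on any instruction sequence, which is the property the slide relies on.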
21The WaveCache
- The I-Cache is the processor.
22Processing Element
23Domain
24Cluster
25The WaveCache
- Long distance communication
- Dynamic routing
- Grid-based network
- 1 cycle/cluster
- Traditional cache coherence
- Normal memory hierarchy
- 16K instructions
26The WaveCache in Action!
27Performance
- Cycle-accurate simulator
- Binary translator from Alpha -> WaveScalar assembly
- A selection of Spec2000 and MediaBench
- WaveCache
- 2000 Processing elements
- No speculation
- Compare to a very aggressive superscalar
- 15-stage, 16-wide
- 1024 registers, 1024-entry issue queue
28WaveCache Performance
29Decentralized Processing
- ? Communication
- ? Defect tolerance
- ? Complexity
- ? High Performance
30Multithreading the WaveCache
- What is a thread?
- A flow of control?
- Von Neumann: PC + registers.
- WaveScalar: A memory ordering?
- How do threads work in WaveScalar?
- ISA changes
- Architectural changes
31ISA Support for Threads
- Extend tag with ThreadID
- Instructions for memory ordering management
- Mem_Sequence_Start -- associate ordering with a ThreadID
- Mem_Sequence_Stop -- destroy ordering
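The tag extension in the first bullet above can be illustrated with a sketch of operand matching. It assumes that a token's tag is the pair (ThreadID, wave number) and that an instruction fires only when both operands with the same tag have arrived. The `MatchingTable` class is a hypothetical name, not part of the WaveScalar ISA or hardware.

```python
from collections import defaultdict


class MatchingTable:
    """Hypothetical input queue of one two-operand instruction."""

    def __init__(self, op):
        self.op = op
        self.waiting = defaultdict(dict)   # tag -> {port: value}

    def arrive(self, thread_id, wave, port, value):
        tag = (thread_id, wave)            # assumed tag layout
        slots = self.waiting[tag]
        slots[port] = value
        if len(slots) == 2:                # both operands share this tag: fire
            del self.waiting[tag]
            return tag, self.op(slots[0], slots[1])
        return None                        # still waiting for the partner


add = MatchingTable(lambda a, b: a + b)
print(add.arrive(thread_id=0, wave=7, port=0, value=10))
print(add.arrive(thread_id=1, wave=7, port=1, value=5))   # other thread: no match
print(add.arrive(thread_id=0, wave=7, port=1, value=1))
```

Because the ThreadID is part of the tag, operands from different threads never pair up even when their wave numbers coincide.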
32Thread Synchronization
- Memory-based (Test-And-Set)
- Spin on memory
- Memory-free (Thread_Coordinate)
- A queue lock.
33Hardware support for threads
- Very little must change
- Wider busses
- Wider input queues
- More store buffers
- ThreadIDs control instruction replication
- One copy of each instruction per ThreadID
34Multithreaded Performance
35The WaveScalar ISA
- Von Neumann
- Normal memory semantics.
- Coarse-grain, von Neumann-style threads
- Dataflow
- Unordered memory.
- Fine-grain, dataflow-style threads
36Unordered memory
- Load_Unordered
- A normal load (but not wave-ordered)
- Store_Unordered
- Write to memory and return a value.
- Mem_nop_Ack
- A no-op, but returns a value upon execution.
- Coordination point between ordered and unordered
operations.
37Exploiting Unordered Memory
    typedef struct { int x, y; } Pair;

    int foo(Pair *p, int *a, int *b) {
        Pair r;
        *a = 0;
        r.x = p->x;
        r.y = p->y;
        return *b;
    }
38Exploiting Unordered Memory
For foo() from slide 37:
Ordered
  St a, 0      <0,1,2>
  Mem_nop_ack  <1,2,3>
  Mem_nop_ack  <2,3,4>
  Ld b         <3,4,5>
Unordered
  (no operations shown on this slide)
42Exploiting Unordered Memory
For foo() from slide 37:
Ordered
  St a, 0      <0,1,2>
  Mem_nop_ack  <1,2,3>
  Mem_nop_ack  <2,3,4>
  Ld b         <3,4,5>
Unordered
  Ld p->x
  Ld p->y
  St r.x
  St r.y
47Dataflow performance
48Putting it all together: Equake
- Finite element earthquake simulation
- >90% of execution is in two functions
- Sim()
- Series of data-independent loops
- Initialization and copying
- Thread pool implementation
- Smvp()
- Cross-iteration dependences
- Basically matrix multiplications
- Rewrite in WaveScalar assembly
49Putting it all together: Equake
[Figure: Equake performance chart; recoverable labels: (11), (3.5), Single-threaded]
50WaveScalar's Future
Steven Swanson, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Andrew Schwerin
Mark Oskin, Susan Eggers, Tom Anderson, Carl Ebeling, Hank Levy
Ken Michelson, David Sunderland, Jared Wilkens, Chris Fisher
51Microarchitecture (Steven Swanson, Andrew Putnam, Ken Michelson)
- Domain
- How to spend wires?
- What are PEs?
- Network topology and routing
- SystemC model
52Microarchitecture Status
- HDL done
- PE
- Domain
- Store buffer/cache
- Network switch
- 4x4 WaveCache, 8 PEs/domain
- 160 mm^2 @ 90 nm
- Tools estimate 250-300 MHz
53Instruction Placement (Martha Mercaldi)
- Static vs. Dynamic
- Simulated annealing
- Instruction migration
- Which instruction to evict? How aggressively?
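As one reading of the simulated annealing bullet above, static placement can be framed as assigning instructions to grid slots so that communicating instructions sit close together. The sketch below is a generic annealer under stated assumptions, not the project's placement tool: one instruction per slot, Manhattan distance as the cost, swap moves only, and an arbitrary cooling schedule. The names `wire_cost` and `anneal` are hypothetical.

```python
import math
import random


def wire_cost(placement, edges):
    """Total Manhattan distance between communicating instructions."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in edges)


def anneal(instructions, edges, width, height, steps=5000, seed=0):
    """Return (best placement, its cost, cost of the random starting point)."""
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(slots)
    placement = dict(zip(instructions, slots))   # random initial placement
    cost = initial_cost = wire_cost(placement, edges)
    best, best_cost = dict(placement), cost
    temperature = 2.0                            # assumed starting temperature
    for _ in range(steps):
        a, b = rng.sample(instructions, 2)       # propose swapping two instructions
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wire_cost(placement, edges)
        delta = new_cost - cost
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        else:
            placement[a], placement[b] = placement[b], placement[a]   # undo swap
        temperature = max(0.01, temperature * 0.999)
    return best, best_cost, initial_cost


names = [f"i{k}" for k in range(9)]
chain = [(names[k], names[k + 1]) for k in range(8)]   # i0 -> i1 -> ... -> i8
best, best_cost, initial_cost = anneal(names, chain, width=3, height=3)
print(initial_cost, best_cost)
```

A dynamic scheme would instead move instructions at run time, which raises the eviction questions listed on the slide; this sketch covers only the static case.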
54Compiler (Andrew Petersen, David Sunderland)
[Figure: inputs labeled C, C, and ??? feeding a box labeled Compiler]
- Custom WaveScalar optimizations
- Unordered memory operations
- Alias Analysis
- Re-examine well-known optimizations
- Is software pipelining useful?
- Dataflow Languages
- SISAL, Id, etc.
- ???
55Operating System (Andrew Schwerin)
- Cache and Address organization
- Coherence protocols
- Fine-grained protection domains.
56FPGA Prototype (Chris Fisher, Jared Wilkens)
- FPGA prototype
- Boards
- 4 FPGA w/ 2 PPC cores
- DDR Memory
- SRAM
- Attached to a PPC Brain
57Conclusions
- WaveScalar ISA
- A unified dataflow and von Neumann execution model
- Mix-and-Match parallelism models
- WaveCache Architecture
- Outperforms an OOO superscalar by 2.8x
- Excellent multi-threaded performance.
- Over 300 IPC for hand-coded apps.
- And you can build it today!!!
- Enormous opportunities for future research