WaveScalar

About This Presentation

Title:

WaveScalar

Description:

Coarse-grain, von Neumann-style threads. Dataflow 'Unordered' memory. Fine-grain, dataflow-style threads. Use the best tool for the job. UPC November, 2004 ...

Slides: 58

Provided by: Swa147

Learn more at: https://research.ac.upc.edu

Tags: wavescalar

Transcript and Presenter's Notes

Title: WaveScalar

1
WaveScalar
Steven Swanson
Ken Michelson, David Sunderland, Jared Wilkins, Chris Fisher
Sponsored by NSF, Intel, The ARCS Foundation, Xilinx, and StoreTek
2
Things to keep you up at night (2016)

Opportunities
8 billion transistors, 28 GHz
4 GB per DRAM chip
120 P4s or 200,000 RISC-1s per die
Challenges
Communication
Defects
Complexity
Performance

3
Monolithic von Neumann Processors
A phenomenal success today. But in 2016?

Communication
Broadcast networks
Defect tolerance
1 flaw -> paperweight
Complexity
40-60% of design is validation
Performance
Deeper pipes unlikely (ISCA '02)

4
Decentralized Processors

Communication
Defect tolerance
Complexity
Performance

But how do you execute?
5
Von Neumann is Centralized

PC-driven fetch is the problem
One program counter
Dataflow is the solution

6
Dataflow has been done before...

Operations fire when data is available
No program counter
Convert true control dependences to data dependences
Exposes massive parallelism
But...

7
...it had issues

Scalability
Dataflow never executed mainstream code
No total load-store ordering
Special languages
Different memory semantics
No mutable data structures (mostly)
Functional (mostly)

8
The WaveScalar ISA

A dataflow ISA with imperative language support
The best of both worlds
Von Neumann
Normal memory semantics.
Coarse-grain, von Neumann-style threads
Dataflow
Unordered memory.
Fine-grain, dataflow-style threads
Use the best tool for the job.

9
WaveScalar example

A[j + i*i] = i;
b = A[i*j];

11
WaveScalar example

A[j + i*i] = i;
b = A[i*j];

[Figure: dataflow graph with inputs i, j, and A, Load and Store nodes, and output b]
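
The firing rule behind this example (an operation executes as soon as all of its inputs have arrived, with no program counter) can be sketched in a few lines of Python. This is an illustrative toy, not the WaveScalar ISA: the node names, the `run` function, and the dictionary standing in for memory are all assumptions. The Load is forced to follow the Store here by an explicit dependence edge, whereas WaveScalar relies on wave-ordered memory for that.

```python
import operator

def run(graph, inputs):
    """Fire each node once all of its operands are available (no program counter)."""
    values = dict(inputs)      # tokens produced so far
    fired = []                 # order in which nodes actually fired
    pending = dict(graph)
    while pending:
        ready = [name for name, (_, srcs) in pending.items()
                 if all(s in values for s in srcs)]
        if not ready:
            raise RuntimeError("deadlock: no node has all of its inputs")
        for name in ready:
            op, srcs = pending.pop(name)
            values[name] = op(*(values[s] for s in srcs))
            fired.append(name)
    return values, fired

memory = {}  # hypothetical flat memory: address -> value

def store(addr, value):
    memory[addr] = value
    return value               # token signalling that the store completed

def load(addr, _store_done):
    # The second operand is only a dependence token; its value is ignored.
    return memory.get(addr, 0)

# Graph for:  A[j + i*i] = i;  b = A[i*j];
graph = {
    "i*i":     (operator.mul, ("i", "i")),
    "i*j":     (operator.mul, ("i", "j")),
    "j+i*i":   (operator.add, ("j", "i*i")),
    "st_addr": (operator.add, ("A", "j+i*i")),
    "ld_addr": (operator.add, ("A", "i*j")),
    "Store":   (store, ("st_addr", "i")),
    "b":       (load, ("ld_addr", "Store")),
}

# i=2, j=4 makes both addresses equal, so b reads what the Store wrote.
values, fired = run(graph, {"i": 2, "j": 4, "A": 100})
```

Nodes fire in whatever order their operands become available, so the two multiplies fire together and the load address is computed before the store address, even though the Store comes first in program order.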
17
Wave-ordered memory

Compiler annotates memory operations
Send memory requests in any order
Hardware reconstructs the correct order

[Figure: control-flow graph of Load and Store operations]
18
Wave-ordering Example
[Figure: Load and Store operations annotated with wave-ordering numbers]
19
Wave-ordered Memory

Waves are loop-free sections of the dataflow graph
Each dynamic wave has a wave number
Wave-ordered memory
Wave-numbers
Sequence number
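
The idea that requests may arrive in any order and still be released in program order can be sketched as follows. This is a simplified model under stated assumptions, not the hardware design: it handles only straight-line waves (no branches), each request carries its wave number plus a predecessor, own, and successor sequence number, and `None` is used here as a made-up marker for "first in wave" and "last in wave". The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MemOp:
    wave: int               # dynamic wave number
    pred: Optional[int]     # sequence number of the previous op (None = first in wave)
    seq: int                # this op's sequence number
    succ: Optional[int]     # sequence number of the next op (None = last in wave)
    name: str

class OrderingBuffer:
    """Accepts requests in any order and releases them in wave order."""

    def __init__(self, first_wave=0):
        self.wave = first_wave
        self.last_seq = None    # seq of the last op issued in the current wave
        self.waiting = {}       # (wave, pred) -> MemOp
        self.issued = []

    def submit(self, op):
        self.waiting[(op.wave, op.pred)] = op
        self._drain()

    def _drain(self):
        # Issue every op whose predecessor has already been issued.
        while (self.wave, self.last_seq) in self.waiting:
            op = self.waiting.pop((self.wave, self.last_seq))
            self.issued.append(op.name)
            if op.succ is None:         # wave finished: move on to the next one
                self.wave += 1
                self.last_seq = None
            else:
                self.last_seq = op.seq

buf = OrderingBuffer()
arrivals = [                            # deliberately out of order
    MemOp(1, None, 0, None, "Load z"),
    MemOp(0, 1, 2, None, "Store y"),
    MemOp(0, None, 0, 1, "Load x"),
    MemOp(0, 0, 1, 2, "Store x"),
]
for op in arrivals:
    buf.submit(op)
```

Nothing is released until the first operation of wave 0 arrives; after that the buffer follows the predecessor links and then advances to wave 1.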

20
WaveScalar Execution Model

Put an ALU at every word of instruction memory.
No processor core.
Instructions communicate directly.

21
The WaveCache

The I-Cache is the processor.

22
Processing Element
23
Domain
24
Cluster
25
The WaveCache

Long distance communication
Dynamic routing
Grid-based network
1 cycle/cluster
Traditional cache coherence
Normal memory hierarchy
16K instructions

26
The WaveCache in Action!
27
Performance

Cycle-accurate simulator
Binary translator from Alpha -> WaveScalar assembly
A selection of Spec2000 and MediaBench
WaveCache
2000 Processing elements
No speculation
Compare to a very aggressive superscalar
15-stage, 16-wide
1024 registers, 1024-entry issue queue

28
WaveCache Performance
29
Decentralized Processing

Communication
Defect tolerance
Complexity
High Performance

30
Multithreading the WaveCache

What is a thread?
A flow of control?
Von Neumann: PC + registers.
WaveScalar: A memory ordering?
How do threads work in WaveScalar?
ISA changes
Architectural changes

31
ISA Support for Threads

Extend tag with ThreadID
Instructions for memory ordering management
Mem_Sequence_Start -- associate ordering with a ThreadID
Mem_Sequence_Stop -- destroy ordering
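
One way to picture "extend tag with ThreadID" is as a wider key in an instruction's operand-matching table. The sketch below is an assumption-laden illustration, not the WaveCache design: `MatchingTable`, its `receive` method, and the two-port layout are hypothetical names chosen for the example. It shows only that tokens from different threads with the same wave number are kept apart.

```python
from collections import defaultdict

class MatchingTable:
    """Input queue of one two-operand instruction, keyed by token tag."""

    def __init__(self, op):
        self.op = op
        self.slots = defaultdict(dict)   # (thread_id, wave) -> {port: value}
        self.results = []

    def receive(self, thread_id, wave, port, value):
        tag = (thread_id, wave)
        self.slots[tag][port] = value
        if len(self.slots[tag]) == 2:    # both operands with this tag have arrived
            operands = self.slots.pop(tag)
            self.results.append((tag, self.op(operands[0], operands[1])))

add = MatchingTable(lambda a, b: a + b)
add.receive(0, 7, 0, 10)    # thread 0, wave 7, left operand
add.receive(1, 7, 0, 100)   # thread 1, same wave number: must not pair with thread 0
add.receive(1, 7, 1, 1)     # completes thread 1
add.receive(0, 7, 1, 2)     # completes thread 0
```

Without the ThreadID in the key, the second token would have overwritten the first, and the two threads' operands would have been mixed.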

32
Thread Synchronization

Memory-based (Test-And-Set)
Spin on memory
Memory-free (Thread_Coordinate)
A queue lock.

33
Hardware support for threads

Very little must change
Wider busses
Wider input queues
More store buffers
ThreadIDs control instruction replication
One copy of each instruction per ThreadID

34
Multithreaded Performance
35
The WaveScalar ISA

Von Neumann
Normal memory semantics.
Coarse-grain, von Neumann-style threads
Dataflow
Unordered memory.
Fine-grain, dataflow-style threads

36
Unordered memory

Load_Unordered
A normal load (but not wave-ordered)
Store_Unordered
Write to memory and return a value.
Mem_nop_Ack
A no-op, but returns a value upon execution.
Coordination point between ordered and unordered operations.

37
Exploiting Unordered Memory

Fine-grain intermingling

typedef struct { int x, y; } Pair;

int foo(Pair *p, int *a, int *b) {
    Pair r;
    *a = 0;
    r.x = p->x;
    r.y = p->y;
    return *b;
}
38
Exploiting Unordered Memory
Fine-grain intermingling

typedef struct { int x, y; } Pair;

int foo(Pair *p, int *a, int *b) {
    Pair r;
    *a = 0;
    r.x = p->x;
    r.y = p->y;
    return *b;
}

Ordered
St a, 0 <0,1,2>
Mem_nop_ack <1,2,3>
Mem_nop_ack <2,3,4>
Ld b <3,4,5>

Unordered
42
Exploiting Unordered Memory
Fine-grain intermingling

typedef struct { int x, y; } Pair;

int foo(Pair *p, int *a, int *b) {
    Pair r;
    *a = 0;
    r.x = p->x;
    r.y = p->y;
    return *b;
}

Ordered
St a, 0 <0,1,2>
Mem_nop_ack <1,2,3>
Mem_nop_ack <2,3,4>
Ld b <3,4,5>

Unordered
Ld p->x
Ld p->y
St r.x
St r.y
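
The role of the two Mem_nop_ack operations can be illustrated by listing the dependences they create and enumerating the execution orders those dependences allow. The edge set below is an assumption, one reading of this slide and of the instruction descriptions: the first Mem_nop_ack's result releases the unordered loads, and the values returned by the unordered stores feed the second Mem_nop_ack. The names `deps`, `is_legal`, and `legal` are hypothetical.

```python
from itertools import permutations

# op -> ops that must complete first (assumed edges, see the note above)
deps = {
    "St a":      [],
    "nop_ack_1": ["St a"],                            # wave-ordered after St a
    "Ld p->x":   ["nop_ack_1"],                       # released by the first ack
    "Ld p->y":   ["nop_ack_1"],
    "St r.x":    ["Ld p->x"],                         # needs the loaded value
    "St r.y":    ["Ld p->y"],
    "nop_ack_2": ["nop_ack_1", "St r.x", "St r.y"],   # waits for both store results
    "Ld b":      ["nop_ack_2"],                       # wave-ordered after second ack
}

def is_legal(order):
    """True if every op appears after all of the ops it depends on."""
    pos = {op: i for i, op in enumerate(order)}
    return all(pos[d] < pos[op] for op, ds in deps.items() for d in ds)

legal = [order for order in permutations(deps) if is_legal(order)]
```

Under these assumed edges the ordered operations keep their program order, while the x and y copies may interleave freely between the two acknowledgements.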
47
Dataflow performance
48
Putting it all together: Equake

Finite element earthquake simulation
>90% of execution is in two functions
Sim()
Series of data-independent loops
Initialization and copying
Thread pool implementation
Smvp()
Cross-iteration dependences
Basically matrix multiplications
Rewrite in WaveScalar assembly

49
Putting it all together: Equake
[Figure: Equake performance chart; surviving labels: (11), (3.5), Single-threaded]
50
WaveScalar's Future
Steven Swanson, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Andrew Schwerin
Mark Oskin, Susan Eggers, Tom Anderson, Carl Ebeling, Hank Levy
Ken Michelson, David Sunderland, Jared Wilkens, Chris Fisher
51
Microarchitecture (Steven Swanson, Andrew Putnam, Ken Michelson)