Description: Coarse-grain, von Neumann-style threads. Dataflow 'Unordered' memory. Fine-grain, dataflow-style threads. Use the best tool for the job. UPC, November 2004 ...

Title: WaveScalar


1
WaveScalar
Steven Swanson
Ken Michelson, David Sunderland, Jared Wilkins, Chris Fisher
Sponsored by NSF, Intel, The ARCS Foundation, Xilinx, and StoreTek
2
Things to keep you up at night 2016
  • Opportunities
    • 8 billion transistors, 28 GHz
    • 4 GB per DRAM chip
    • 120 P4s or 200,000 RISC-1s per die
  • Challenges
    • Communication
    • Defects
    • Complexity
    • Performance

3
Monolithic von Neumann Processors
A phenomenal success today. But in 2016?
  • ? Communication
    • Broadcast networks
  • ? Defect tolerance
    • 1 flaw -> paperweight
  • ? Complexity
    • 40-60% of design is validation
  • ? Performance
    • Deeper pipes unlikely (ISCA '02)

4
Decentralized Processors
  • ? Communication
  • ? Defect tolerance
  • ? Complexity
  • ? Performance

But how do you execute?
5
Von Neumann is Centralized
  • PC-driven fetch is the problem
  • One program counter
  • Dataflow is the solution

6
Dataflow has been done before...
  • Operations fire when data is available
  • No program counter
  • Convert true control dependences to data
    dependences
  • Exposes massive parallelism
  • But...

7
...it had issues
  • Scalability
  • Dataflow never executed mainstream code
  • No total load-store ordering
  • Special languages
  • Different memory semantics
  • No mutable data structures (mostly)
  • Functional (mostly)

8
The WaveScalar ISA
  • A dataflow ISA with imperative language support:
    the best of both worlds
  • Von Neumann
    • Normal memory semantics.
    • Coarse-grain, von Neumann-style threads
  • Dataflow
    • Unordered memory.
    • Fine-grain, dataflow-style threads
  • Use the best tool for the job.

9-16
WaveScalar example
  • A[j + i*i] = i;
  • b = A[i*j];

[Figure, built up across slides 9-16: dataflow graph with inputs i, j, and A, a Load node, a Store node, and output b]
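
The firing rule behind this example can be sketched in a few lines. This is a minimal illustration, not the WaveScalar implementation: it reads the slide's two statements as `A[j + i*i] = i` and `b = A[i*j]` (an assumption about the extraction-damaged text), and the names `run_dataflow`, `example`, and the acknowledgement token that makes the Load wait for the Store are all hypothetical.

```python
import operator

def run_dataflow(graph, inputs):
    """Fire every instruction whose operands have all arrived; repeat until done.

    graph:  instruction name -> (function, list of operand source names)
    inputs: initial token values, keyed by name
    """
    values = dict(inputs)
    pending = dict(graph)
    fired = []                      # firing order, kept for inspection
    while pending:
        ready = [name for name, (_, srcs) in pending.items()
                 if all(s in values for s in srcs)]
        if not ready:
            raise ValueError("no instruction has all of its operands")
        for name in ready:          # everything in `ready` could fire in parallel
            fn, srcs = pending.pop(name)
            values[name] = fn(*(values[s] for s in srcs))
            fired.append(name)
    return values, fired

def example(i, j, size=16):
    A = [0] * size                  # the array A from the slide

    def store(addr, value):
        A[addr] = value
        return True                 # acknowledgement token (assumed)

    def load(addr, _ack):
        return A[addr]              # _ack only forces the Store to fire first

    graph = {
        "i*i":   (operator.mul, ["i", "i"]),
        "j+i*i": (operator.add, ["j", "i*i"]),
        "i*j":   (operator.mul, ["i", "j"]),
        "Store": (store, ["j+i*i", "i"]),
        "Load":  (load, ["i*j", "Store"]),
    }
    values, fired = run_dataflow(graph, {"i": i, "j": j})
    return values["Load"], fired

# With i=2, j=4 both addresses are 8, so the Load must see the Store's value.
b, fired = example(2, 4)
```

There is no program counter here: the two multiplies fire together as soon as i and j arrive. The only thing keeping the Load behind the Store is the explicit token, which is the problem wave-ordered memory addresses for imperative code.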

17
Wave-ordered memory
  • Compiler annotates memory operations
  • Send memory requests in any order
  • Hardware reconstructs the correct order

[Figure: graph of Load and Store operations]
18
Wave-ordering Example
[Figure: Load and Store operations annotated with ordering numbers]
19
Wave-ordered Memory
  • Waves are loop-free sections of the dataflow graph
  • Each dynamic wave has a wave number
  • Wave-ordered memory
    • Wave numbers
    • Sequence numbers
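
A minimal sketch of how requests that arrive in any order could be put back into program order. It assumes each request carries its wave number plus a (predecessor, sequence, successor) triple, and that `None` marks the first and last operation of a wave. Branches inside a wave are not modelled, and `MemRequest` and `WaveOrderedBuffer` are hypothetical names, not the hardware's.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MemRequest:
    wave: int              # dynamic wave number
    pred: Optional[int]    # sequence number of the previous memory op (None = first)
    seq: int               # this operation's sequence number
    succ: Optional[int]    # sequence number of the next memory op (None = last)
    op: str                # label, for illustration only

class WaveOrderedBuffer:
    def __init__(self):
        self.pending = {}      # (wave, seq) -> request waiting for its turn
        self.wave = 0          # wave currently being drained
        self.last_seq = None   # last sequence number issued in that wave
        self.issued = []       # operations in the order memory sees them

    def submit(self, req):
        self.pending[(req.wave, req.seq)] = req
        self._drain()

    def _drain(self):
        while True:
            # The next legal request is the one in the current wave whose
            # predecessor is the operation issued most recently.
            nxt = next((r for r in self.pending.values()
                        if r.wave == self.wave and r.pred == self.last_seq), None)
            if nxt is None:
                return
            del self.pending[(nxt.wave, nxt.seq)]
            self.issued.append(nxt.op)
            if nxt.succ is None:       # wave finished, move on to the next one
                self.wave += 1
                self.last_seq = None
            else:
                self.last_seq = nxt.seq

buf = WaveOrderedBuffer()
buf.submit(MemRequest(1, None, 1, None, "Ld (wave 1)"))   # arrives far too early
buf.submit(MemRequest(0, 1, 2, None, "Ld A[i*j]"))        # arrives before its Store
buf.submit(MemRequest(0, None, 1, 2, "St A[j+i*i]"))      # unblocks everything
```

Nothing is issued until the first Store of wave 0 arrives. The buffer then drains all three requests in program order.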

20
WaveScalar Execution Model
  • Put an ALU at every word of instruction memory.
  • No processor core.
  • Instructions communicate directly.

21
The WaveCache
  • The I-Cache is the processor.

22
Processing Element
23
Domain
24
Cluster
25
The WaveCache
  • Long distance communication
    • Dynamic routing
    • Grid-based network
    • 1 cycle/cluster
  • Traditional cache coherence
  • Normal memory hierarchy
  • 16K instructions

26
The WaveCache in Action!
27
Performance
  • Cycle-accurate simulator
  • Binary translator from Alpha -> WaveScalar assembly
  • A selection of SPEC2000 and MediaBench
  • WaveCache
    • 2000 processing elements
    • No speculation
  • Compare to a very aggressive superscalar
    • 15-stage, 16-wide
    • 1024 registers, 1024-entry issue queue

28
WaveCache Performance
29
Decentralized Processing
  • ? Communication
  • ? Defect tolerance
  • ? Complexity
  • ? High Performance

30
Multithreading the WaveCache
  • What is a thread?
    • A flow of control?
    • Von Neumann: PC + registers.
    • WaveScalar: A memory ordering?
  • How do threads work in WaveScalar?
    • ISA changes
    • Architectural changes

31
ISA Support for Threads
  • Extend tag with ThreadID
  • Instructions for memory ordering management
  • Mem_Sequence_Start -- associate ordering with a
    ThreadID
  • Mem_Sequence_Stop -- destroy ordering
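
A small sketch of what extending the tag with a ThreadID means for operand matching. It assumes tokens already carry a wave number as their tag, so the extended tag is the pair (ThreadID, wave number). `MatchingTable` and its methods are hypothetical names.

```python
import operator
from collections import defaultdict

class MatchingTable:
    """Operand matching for one instruction: fire when all operands share a tag."""

    def __init__(self, n_operands, fn):
        self.n_operands = n_operands
        self.fn = fn
        self.slots = defaultdict(dict)   # tag -> {operand index: value}

    def receive(self, thread_id, wave, index, value):
        tag = (thread_id, wave)          # tag extended with the ThreadID
        operands = self.slots[tag]
        operands[index] = value
        if len(operands) < self.n_operands:
            return None                  # still waiting for a matching operand
        del self.slots[tag]
        result = self.fn(*(operands[k] for k in range(self.n_operands)))
        return tag, result               # the result token inherits the tag

add = MatchingTable(2, operator.add)
r1 = add.receive(thread_id=0, wave=0, index=0, value=1)    # waits
r2 = add.receive(thread_id=1, wave=0, index=0, value=10)   # other thread, waits
r3 = add.receive(thread_id=1, wave=0, index=1, value=20)   # thread 1 fires
r4 = add.receive(thread_id=0, wave=0, index=1, value=2)    # thread 0 fires
```

Operands from different threads never match each other even when their wave numbers are equal, so several threads can share the same instruction without interfering.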

32
Thread Synchronization
  • Memory-based (Test-And-Set)
  • Spin on memory
  • Memory-free (Thread_Coordinate)
  • A queue lock.

33
Hardware support for threads
  • Very little must change
    • Wider busses
    • Wider input queues
    • More store buffers
  • ThreadIDs control instruction replication
    • One copy of each instruction per ThreadID

34
Multithreaded Performance
35
The WaveScalar ISA
  • Von Neumann
    • Normal memory semantics.
    • Coarse-grain, von Neumann-style threads
  • Dataflow
    • Unordered memory.
    • Fine-grain, dataflow-style threads

36
Unordered memory
  • Load_Unordered
    • A normal load (but not wave-ordered)
  • Store_Unordered
    • Write to memory and return a value.
  • Mem_nop_Ack
    • A no-op, but returns a value upon execution.
    • Coordination point between ordered and unordered operations.

37-46
Exploiting Unordered Memory
  • Fine-grain intermingling

typedef struct { int x, y; } Pair;
int foo(Pair *p, int *a, int *b) {
    Pair r;
    *a = 0;
    r.x = p->x;
    r.y = p->y;
    return *b;
}

Ordered
St a, 0      <0,1,2>
Mem_nop_ack  <1,2,3>
Mem_nop_ack  <2,3,4>
Ld b         <3,4,5>

Unordered
Ld p->x
Ld p->y
St r.x
St r.y
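
One way to see what the two Mem_nop_ack operations allow in the foo() example is to list every legal execution order. The dependence edges below are an assumption reconstructed from the slide: the first ack releases the unordered loads, and the values returned by the unordered stores feed the second ack. The slide does not spell these edges out.

```python
from itertools import permutations

# operation -> operations that must complete before it
DEPS = {
    "St a":    [],
    "ack1":    ["St a"],               # Mem_nop_ack <1,2,3>, wave-ordered after St a
    "Ld p->x": ["ack1"],               # unordered, released by the first ack
    "Ld p->y": ["ack1"],
    "St r.x":  ["Ld p->x"],            # ordinary data dependence
    "St r.y":  ["Ld p->y"],
    "ack2":    ["St r.x", "St r.y"],   # Mem_nop_ack <2,3,4> waits for both stores
    "Ld b":    ["ack2"],               # wave-ordered after the second ack
}

def is_legal(order):
    position = {op: k for k, op in enumerate(order)}
    return all(position[d] < position[op]
               for op, deps in DEPS.items() for d in deps)

def legal_orders():
    """Brute force over all 8! orderings; fine for an example this small."""
    return [order for order in permutations(DEPS) if is_legal(order)]

orders = legal_orders()
for order in orders:
    print(" -> ".join(order))
```

Under these assumed edges the ordered operations keep their positions at both ends, while the x and y copies can interleave freely between the two acks.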
47
Dataflow performance
48
Putting it all together: Equake
  • Finite element earthquake simulation
  • >90% of execution is in two functions
  • Sim()
    • Series of data-independent loops
    • Initialization and copying
    • Thread pool implementation
  • Smvp()
    • Cross-iteration dependences
    • Basically matrix multiplications
    • Rewrite in WaveScalar assembly

49
Putting it all together: Equake
[Chart: data labels (11) and (3.5), shown alongside a Single-threaded label]
50
WaveScalar's Future
Steven Swanson, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Andrew Schwerin
Mark Oskin, Susan Eggers, Tom Anderson, Carl Ebeling, Hank Levy
Ken Michelson, David Sunderland, Jared Wilkens, Chris Fisher
51
Microarchitecture (Steven Swanson, Andrew Putnam, Ken Michelson)
  • Domain
  • How to spend wires?
  • What are PEs?
  • Network topology and routing
  • SystemC model

52
Microarchitecture Status
  • HDL done
  • PE
  • Domain
  • Store buffer/cache
  • Network switch
  • 4x4 WaveCache, 8PEs/domain
  • 160 mm² @ 90 nm
  • Tools estimate 250-300 MHz

53
Instruction Placement (Martha Mercaldi)
  • Static vs. Dynamic
  • Simulated annealing
  • Instruction migration
  • Which instruction to evict? How aggressively?
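
The slide names simulated annealing without detail, so the following is only a generic sketch of the technique applied to static placement. The cost function (total Manhattan distance between communicating instructions), the one-instruction-per-grid-cell model, the linear cooling schedule, and all names are assumptions, not the project's actual placer.

```python
import math
import random

def wire_cost(placement, edges):
    """Total Manhattan distance between producer and consumer instructions."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in edges)

def anneal(instrs, edges, width, height, steps=5000, t0=2.0, seed=0):
    cells = [(x, y) for x in range(width) for y in range(height)]
    if len(instrs) > len(cells):
        raise ValueError("more instructions than grid cells")
    rng = random.Random(seed)
    rng.shuffle(cells)
    # Pad with placeholders so instructions can also move into empty cells.
    names = list(instrs) + [("empty", k) for k in range(len(cells) - len(instrs))]
    placement = dict(zip(names, cells))
    cost = wire_cost(placement, edges)
    best, best_cost = dict(placement), cost
    for step in range(steps):
        temperature = t0 * (1 - step / steps) + 1e-9      # linear cooling
        a, b = rng.sample(names, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wire_cost(placement, edges)
        # Always accept improvements; accept regressions with Boltzmann probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temperature):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        else:
            placement[a], placement[b] = placement[b], placement[a]   # undo the swap
    return {name: best[name] for name in instrs}, best_cost

instrs = ["i*i", "j+i*i", "i*j", "Store", "Load"]
edges = [("i*i", "j+i*i"), ("j+i*i", "Store"), ("i*j", "Load"), ("Store", "Load")]
placement, cost = anneal(instrs, edges, width=3, height=2)
```

A dynamic scheme would have to make similar trade-offs at run time, which is where the eviction questions on this slide come in.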

54
Compiler (Andrew Petersen, David Sunderland)
  • Custom WaveScalar optimizations
    • Unordered memory operations
    • Alias analysis
  • Re-examine well-known optimizations
    • Is software pipelining useful?
  • Dataflow languages
    • SISAL, Id, etc.
    • ???

[Figure: C, C++, and ??? feeding into the Compiler]
55
Operating System (Andrew Schwerin)
  • Cache and Address organization
  • Coherence protocols
  • Fine-grained protection domains.

56
FPGA Prototype (Chris Fisher, Jared Wilkens)
  • FPGA prototype
  • Boards
  • 4 FPGA w/ 2 PPC cores
  • DDR Memory
  • SRAM
  • Attached to a PPC Brain

57
Conclusions
  • WaveScalar ISA
    • A unified dataflow and von Neumann execution model
    • Mix-and-match parallelism models
  • WaveCache Architecture
    • Outperforms an OOO superscalar by 2.8x
    • Excellent multi-threaded performance.
    • Over 300 IPC for hand-coded apps.
    • And you can build it today!!!
  • Enormous opportunities for future research