1
WaveScalar
Steven Swanson
Chris Fisher
Sponsored by NSF, Intel, The ARCS Foundation,
Xilinx, and StoreTek
2
Monolithic von Neumann Processors
A phenomenal success today. But in 10 years?
  • Communication
    • Broadcast networks
  • Defect tolerance
    • 1 flaw -> paperweight
  • Complexity
    • 40-60% of design is validation
  • Performance
    • Deeper pipes unlikely (ISCA '02)

3
The WaveScalar ISA
  • A dataflow ISA with imperative language support
    The best of both worlds
  • Von Neumann
    • Normal memory semantics
    • Coarse-grain, von Neumann-style threads
  • Dataflow
    • Unordered memory
    • Fine-grain, dataflow-style threads
  • Use the best tool for the job.

4
WaveScalar example
  • A[j + i*i] = i;
  • b = A[i*j];

5
WaveScalar example
  • A[j + i*i] = i;
  • b = A[i*j];
[Figure: dataflow graph with inputs i, j, and A, Load and Store nodes, and output b]
10
WaveScalar Execution Model
  • Put an ALU at every word of instruction memory.
  • No processor core.
  • Instructions communicate directly.
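
The execution model above can be illustrated with a small token-passing interpreter. This is only a sketch, assuming the example statements from the earlier slides read `A[j + i*i] = i; b = A[i*j];`. The names `Node`, `run`, and `example` are hypothetical, and the explicit store-to-load ordering token is a simplification standing in for whatever memory-ordering mechanism WaveScalar actually uses.

```python
from collections import deque


class Node:
    """One instruction: it fires once every input slot holds a value."""

    def __init__(self, name, op, n_inputs):
        self.name, self.op = name, op
        self.inputs = [None] * n_inputs
        self.filled = [False] * n_inputs
        self.consumers = []  # (node, input_slot) pairs that receive the result

    def send_to(self, node, slot):
        self.consumers.append((node, slot))


def run(initial_tokens):
    """Deliver tokens and fire nodes as operands arrive; there is no program counter."""
    pending = deque(initial_tokens)  # (node, slot, value) tokens in flight
    fired = []
    while pending:
        node, slot, value = pending.popleft()
        node.inputs[slot] = value
        node.filled[slot] = True
        if all(node.filled):
            result = node.op(*node.inputs)
            fired.append(node.name)
            for consumer, cslot in node.consumers:
                pending.append((consumer, cslot, result))
    return fired


def example(i, j, A):
    """Assumed example: A[j + i*i] = i; b = A[i*j]."""
    out = {}

    def store(addr, value):
        A[addr] = value
        return True  # ordering token (a simplification, see lead-in)

    def sink(value):
        out["b"] = value
        return value

    sq = Node("i*i", lambda a, b: a * b, 2)
    addr1 = Node("j+i*i", lambda a, b: a + b, 2)
    st = Node("store", store, 2)
    addr2 = Node("i*j", lambda a, b: a * b, 2)
    ld = Node("load", lambda addr, _ordered: A[addr], 2)
    b = Node("b", sink, 1)

    # Wire producers directly to consumers.
    sq.send_to(addr1, 1)
    addr1.send_to(st, 0)
    addr2.send_to(ld, 0)
    st.send_to(ld, 1)  # the load waits for the store to complete
    ld.send_to(b, 0)

    tokens = [(addr2, 0, i), (addr2, 1, j), (sq, 0, i), (sq, 1, i),
              (addr1, 0, j), (st, 1, i)]
    order = run(tokens)
    return out["b"], order
```

With `i = 2` and `j = 4` both addresses are 8, so the load only sees the stored value because it waits for the ordering token. The address computation `i*j` still fires first, since its operands arrive first.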

11
The WaveCache
  • The I-Cache is the processor.

12
The WaveCache
13
The WaveCache
  • Long distance communication
    • Dynamic routing
    • Grid-based network
    • 2 cycles/cluster
  • Traditional cache coherence
  • Normal memory hierarchy
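
As a rough illustration of the long-distance communication cost, the sketch below estimates the latency between two clusters on a grid. It assumes the slide's figure means two cycles for each cluster-to-cluster hop and that a message takes a shortest path; both are assumptions, and the function names and `(row, col)` coordinates are hypothetical.

```python
CYCLES_PER_CLUSTER_HOP = 2  # assumed reading of the slide's per-cluster figure


def hops(src, dst):
    """Shortest-path hop count between two (row, col) clusters on a grid."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def latency_cycles(src, dst):
    """Estimated cycles for a message to travel from src to dst."""
    return CYCLES_PER_CLUSTER_HOP * hops(src, dst)
```

Under these assumptions a message from cluster (0, 0) to cluster (2, 3) crosses five hops and costs ten cycles, which suggests why keeping communicating instructions close together matters.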

14
WaveCache Performance
15
Multithreaded Performance
16
Fine-grain Performance
17
WaveScalar's Future
Steven Swanson, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Andrew Schwerin
Mark Oskin, Susan Eggers, Tom Anderson, Carl Ebeling, Hank Levy
Ken Michelson, David Sunderland, Jared Wilkens, Chris Fisher
18
Instruction Placement (Martha Mercaldi)
  • Status
    • Profile-based, two-level static instruction placement
    • Cache-aware performance modeling
  • Future Questions
    • Should instructions be moved once in place?
    • How can placement policy manage matching table resources?

19
Compiler (Andrew Petersen)
[Figure labels: C, C, ???, Compiler]
  • Status
    • Simple C code works!
  • Future Questions
    • What optimizations are (not) valuable in WaveScalar?
    • Should predication be applied, and if so, how?
    • Should WaveScalar speculate in software? How?

20
Operating System (Andrew Schwerin)
  • Status
    • Designing fine-grain interposition system
  • Future Questions
    • How can fine-grain, low-overhead interposition make the OS safer, more efficient, etc.?
    • How should the OS manage the WaveCache?

21
Conclusions
  • WaveScalar ISA
    • A unified dataflow and von Neumann execution model
    • Mix-and-match parallelism models
  • WaveCache Architecture
    • >2x the performance/area of OOO
    • Excellent multithreaded performance
    • Over 250x performance for hand-coded apps
  • Enormous opportunities for future research

22
Decentralized Processing
  • Communication
  • Defect tolerance
  • Complexity
  • High Performance

23
WaveScalar example
  • A[j + i*i] = i;
  • b = A[i*j];

24
Decentralized Processors
  • Communication
  • Defect tolerance
  • Complexity
  • Performance

But how do you execute?
25
Von Neumann is Centralized
  • PC-driven fetch is the problem
  • One program counter
  • Dataflow is the solution

26
Dataflow has been done before...
  • Operations fire when data is available
  • No program counter
  • Convert true control dependences to data
    dependences
  • Exposes massive parallelism
  • But...
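
The idea of converting control dependences to data dependences can be sketched in a few lines. The snippet below rewrites `if x < 0: y = -x else: y = x` so that the branch condition is just another data value that decides where a token is routed. The names `steer`, `fire`, `merge`, and `NO_TOKEN` are illustrative choices for this sketch and are not taken from the slides.

```python
NO_TOKEN = object()  # marks an output that receives no token


def steer(predicate, value):
    """Send value to the taken side only; returns (true_out, false_out)."""
    return (value, NO_TOKEN) if predicate else (NO_TOKEN, value)


def fire(op, *operands):
    """Apply op only if every operand token is present; otherwise emit no token."""
    if any(t is NO_TOKEN for t in operands):
        return NO_TOKEN
    return op(*operands)


def merge(a, b):
    """Forward whichever side produced a token (exactly one is expected)."""
    return b if a is NO_TOKEN else a


def abs_value(x):
    p = fire(lambda v: v < 0, x)   # the branch condition becomes a data value
    t, f = steer(p, x)             # route x to the 'then' or 'else' side
    neg = fire(lambda v: -v, t)    # 'then' side: fires only if it got a token
    same = fire(lambda v: v, f)    # 'else' side: fires only if it got a token
    return merge(neg, same)
```

No instruction here is selected by a program counter: the untaken side simply never receives its operand, so it never fires.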

27
...it had issues
  • Scalability
  • Dataflow never executed mainstream code
  • No total load-store ordering
  • Special languages
  • Different memory semantics
  • No mutable data structures (mostly)
  • Functional (mostly)

28
Things to keep you up at night (2016)
  • Opportunities
  • 8 billion transistors, 28 GHz
  • 4GB per DRAM chip
  • 120 P4s OR 200,000 RISC-1 per die
  • Challenges
  • Communication
  • Defects
  • Complexity
  • Performance
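
The per-die figures above imply a transistor budget per core, which can be checked by dividing the slide's own numbers. This is plain arithmetic on the quoted figures, not measured transistor counts for either processor; the function name `transistors_per_core` is hypothetical.

```python
TRANSISTORS_PER_DIE = 8_000_000_000  # "8 billion transistors" from the slide


def transistors_per_core(cores_per_die, die_budget=TRANSISTORS_PER_DIE):
    """Average transistor budget per core if the die holds only these cores."""
    return die_budget / cores_per_die


per_p4 = transistors_per_core(120)         # about 66.7 million each
per_risc1 = transistors_per_core(200_000)  # 40,000 each
```

The roughly 1,700-fold gap between the two budgets is the design space the slide is pointing at: a few very large cores or a very large number of tiny ones.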

29
Microarchitecture (Steven Swanson, Andrew Putnam, Ken Michelson)
  • Domain
  • How to spend wires?
  • What are PEs?
  • Network topology and routing
  • SystemC model

30
Performance
  • Cycle-accurate simulator
  • Binary translator from Alpha -> WaveScalar assembly
  • A selection of SPEC2000 and MediaBench
  • WaveCache
    • 2000 processing elements
    • No speculation
  • Compare to a very aggressive superscalar
    • 15-stage, 16-wide
    • 1024 registers, 1024-entry issue queue

31
FPGA Prototype (Chris Fisher, Jared Wilkens)
  • FPGA prototype
  • Boards
  • 4 FPGA w/ 2 PPC cores
  • DDR Memory
  • SRAM
  • Attached to a PPC Brain