1. Stream Caching: Mechanisms for General Purpose Stream Processing
Nat Duca, Jonathan Cohen (Johns Hopkins University)
Peter Kirchner (IBM Research)
2. Talk Outline
- Objective: reconcile current practices of CPU design with stream processing theory
- Part 1: Streaming ideas in current architectures
  - Latency and die space
  - Processor types and tricks
- Part 2: Insights about stream caches
  - Could window-based streaming be the next step in computer architecture?
3. Streaming Architectures
- Graphics processors
- Signal processors
- Network processors
- Scalar/Superscalar processors
- Data stream processors?
- Software architectures?
4. What is a Streaming Computer?
- Two overlapping ideas:
  - A system that executes strict-streaming algorithms (unbounded N, small M); sketched below
  - A general-purpose system that is geared toward general computation, but performs best in the streaming case
- Big motivator: ALU-bound computation!
- To what extent do present computer architectures serve these two views of a streaming computer?
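A minimal sketch of the strict-streaming case, using a running mean as the kernel (the kernel is an illustrative choice, not one named in the talk): each record is seen once and in order, N is unbounded, and the state retained between records stays constant.

```cpp
#include <cstdint>
#include <cstdio>

// Strict-streaming kernel: each input record is read once, in order,
// and the state kept between records is O(1) regardless of stream length N.
struct RunningMean {
    double   sum   = 0.0;
    uint64_t count = 0;

    void consume(double x) {        // one pass, one record at a time
        sum += x;
        ++count;
    }
    double result() const { return count ? sum / count : 0.0; }
};

int main() {
    RunningMean mean;
    double x;
    // Unbounded N: records arrive from stdin; the memory footprint never grows.
    while (std::scanf("%lf", &x) == 1)
        mean.consume(x);
    std::printf("mean = %f\n", mean.result());
    return 0;
}
```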
5. Superscalar Architectures
- Keep memory latency from limiting computation speed
- Solutions:
  - Caches
  - Pipelining
  - Prefetching
  - Eager execution / branch prediction (the "super" in superscalar)
- These are heuristics to locate streaming patterns in unstructured program behavior
6. By the Numbers: Data
- Optimized using caches, pipelines, and eager execution
  - Random: 182 MB/s
  - Sequential: 315 MB/s
- Optimized with prefetching
  - Random: 490 MB/s
  - Sequential: 516 MB/s
- Theoretical maximum: 533 MB/s
7. By the Numbers: Observations
- Achieving full throughput on a scalar CPU requires either:
  - (a) prefetching, which requires advance knowledge of the access pattern
  - (b) sequential access, where no advance knowledge is required
- Vector architectures hide latency in their instruction set using implicit prefetching
- Dataflow machines solve latency using automatic prefetching
- Rule 1: Sequential I/O simplifies control and access to memory, etc. (illustrated by the sketch below)
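The throughput numbers above are the authors' own measurements. The sketch below is only an illustrative way to reproduce the qualitative gap between random and sequential traversal on a superscalar CPU; the buffer size and element type are arbitrary assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Touch every element of a large buffer once, in a given order, and report
// the effective bandwidth of that traversal.
double traverse_mb_per_s(const std::vector<int>& data,
                         const std::vector<size_t>& order) {
    auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (size_t i : order) sum += data[i];          // one load per element
    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    std::printf("(checksum %lld) ", sum);           // keep the loop from being optimized away
    return data.size() * sizeof(int) / (seconds * 1e6);
}

int main() {
    const size_t n = 64u << 20;                     // ~256 MB of ints, far larger than cache
    std::vector<int> data(n, 1);
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);

    double seq = traverse_mb_per_s(data, order);    // hardware prefetchers see this pattern

    std::shuffle(order.begin(), order.end(), std::mt19937_64{7});
    double rnd = traverse_mb_per_s(data, order);    // nearly every access misses the cache

    std::printf("\nsequential: %.0f MB/s  random: %.0f MB/s\n", seq, rnd);
    return 0;
}
```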
8. Superscalar (e.g., P4)
[Block diagram: local memory hierarchy feeding the core through cache and prefetch units]
9. Superscalar (e.g., P4)
[Same block diagram: local memory hierarchy, cache, prefetch]
The P4, by surface area, is about 95% cache, prefetch, and branch-prediction logic. The remaining area is primarily the floating-point ALU.
10. Pure Streaming (e.g., Imagine)
[Block diagram: in streams -> processing core -> out streams, no local memory hierarchy]
11. Can We Build This Machine?
[Block diagram: in streams and out streams attached to the core, plus a local memory hierarchy]
- Rule 2: Small memory footprint allows more room for ALU --> more throughput
12. Part II: Chromium
- Pure stream-processing model
- Deals with the OpenGL command stream
  - Begin(Triangles), Vertex, Vertex, Vertex, End
- Record splits are supported; joins are not
- You perform useful computation in Chromium by joining together stream processors into a DAG (a sketch of one such node follows this list)
- Note: the DAG is constructed across multiple processors (unlike dataflow)
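Chromium's real SPU interface is not reproduced in the slides; the sketch below is only a schematic stream-processor node, with hypothetical Record and Op types, that transforms an OpenGL-like command stream roughly the way a culling node in the DAG might.

```cpp
#include <functional>
#include <vector>

// Hypothetical record types standing in for OpenGL stream commands;
// this is not Chromium's actual SPU dispatch interface.
enum class Op { Begin, Vertex, End };
struct Record { Op op; float x, y, z; };

// A stream-processor node: consumes records one at a time and forwards
// (possibly transformed or filtered) records to its downstream sink.
struct StreamProcessor {
    std::function<void(const Record&)> downstream;

    void consume(const Record& r) {
        // Toy stand-in for visibility culling: drop vertices behind z = 0.
        // (A real culling SPU would drop whole primitives, not single vertices.)
        if (r.op == Op::Vertex && r.z < 0.0f) return;
        downstream(r);                 // pass-through for everything else
    }
};

int main() {
    std::vector<Record> sink;
    StreamProcessor cull{[&](const Record& r) { sink.push_back(r); }};

    // A tiny Begin/Vertex/.../End command stream, processed record by record.
    cull.consume({Op::Begin,  0, 0,  0});
    cull.consume({Op::Vertex, 0, 0,  1});
    cull.consume({Op::Vertex, 1, 0, -1});   // filtered out
    cull.consume({Op::Vertex, 0, 1,  1});
    cull.consume({Op::End,    0, 0,  0});
    return 0;
}
```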
13. Chromium with Stream Caches
- We added join capability to Chromium for the purpose of collapsing multiple records into one (a sketch follows this list)
- Incidentally, this allows windowed computations
- Thought: there seems to be a direct connection between streaming joins and sliding windows
- Because we're in software, the windows can become quite big without too much hassle
- What if we move to hardware?
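A minimal sketch of the join/window connection, assuming a generic operator (not Chromium's implementation) that collapses every M input records into one output record; a sliding window is the same buffer with only the oldest record evicted instead of the whole window cleared.

```cpp
#include <cstddef>
#include <cstdio>
#include <deque>

// A windowed stream operator: buffers up to M input records and collapses
// each full window into a single output record (here, their sum).
class WindowCollapse {
public:
    explicit WindowCollapse(size_t m) : m_(m) {}

    void consume(int record) {
        window_.push_back(record);
        if (window_.size() == m_) {          // window full: collapse M records to one
            int collapsed = 0;
            for (int r : window_) collapsed += r;
            window_.clear();                 // sliding variant: pop_front() instead
            emit(collapsed);
        }
    }

private:
    void emit(int record) { std::printf("out: %d\n", record); }

    size_t m_;                // window size M, with M << N (the stream length)
    std::deque<int> window_;  // the window buffer: the only state kept
};

int main() {
    WindowCollapse op(4);                         // M = 4
    for (int i = 1; i <= 12; ++i) op.consume(i);  // N = 12 records stream through
    return 0;
}
```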
14. Windowed Streaming
[Block diagram: in streams -> window buffer -> out streams]
- Uses for a window buffer of size M:
  - Store program structures of up to size M
  - Cache M input records, where M << N
15. Windowed Streaming
[Same diagram: in streams -> window buffer -> out streams]
Realistic values of M if you stay exclusively on-chip: 128K ... 256K ... 2MB. DRAM-on-chip technology is promising.
16. Impact of Window Size
[Same diagram: in streams -> window buffer -> out streams]
Insight: as M increases, this starts to resemble a superscalar computer.
17. The Continuum Architecture
[Block diagram: in streams and out streams plus a memory hierarchy]
- For too large a value of M:
  - Non-sequential I/O --> caches
  - Caches --> less room for ALU (etc.)
18. Windowed Streaming
[Block diagram: in streams -> window buffer -> out streams, with loopback streams from output back to input]
Thought: can we augment the window-buffer limit with a loopback feature?
19. Windowed Streaming
[Same diagram, with memory inserted on the loopback path]
Thought: what do we gain by allowing a finite delay in the loopback stream? (A toy sketch follows.)
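One possible reading of the loopback thought, as a toy sketch and not something specified in the slides: records that cannot be finished with the current window are re-injected through a bounded FIFO, so they reappear at the input after a finite delay.

```cpp
#include <cstdio>
#include <queue>

// Toy loopback sketch: a windowed operator defers records it cannot handle
// now by sending them around a finite-delay loopback stream, seeing them
// again on a later pass instead of growing its window buffer.
int main() {
    std::queue<int> loopback;            // finite-delay loopback stream
    const int input[] = {5, 12, 7, 20, 3};

    // First pass over the primary input: emit small records immediately,
    // defer large ones by pushing them onto the loopback.
    for (int r : input) {
        if (r < 10) std::printf("emit %d (first pass)\n", r);
        else        loopback.push(r);
    }

    // Drain the loopback stream: the deferred records arrive after a delay,
    // as if they had been appended to the input stream.
    while (!loopback.empty()) {
        std::printf("emit %d (after loopback delay)\n", loopback.front());
        loopback.pop();
    }
    return 0;
}
```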
20. Streaming Networks: Primitive
21. Streaming Networks: 1-to-N (Hanrahan model)
22. Streaming Networks: N-to-1
23. Streaming Networks: The Ugly
24. Versatility of Streaming Networks?
- Question: what algorithms can we support here? How?
  - Both from a theoretical and a practical view
- We have experimented with graphics problems only
  - Stream compression, visibility culling, level of detail
25. New Concepts with Streaming Networks
- An individual processor's cost is small
- Highly flexible: uses high-level ideas from dataflow
  - Multiple streams in and out
  - Interleaved or non-interleaved
  - Scalable window size
- Open to entirely new concepts
  - E.g., how do you add more memory to this system?
26. Summary
- Systems are easily built on the basis of streaming I/O and memory models
- By design, such a system makes maximum use of the hardware: very, very efficient
- Continuum of architectures: pure streaming to superscalar
- Stream processors are trivially chained, even in cycles
- Such a chained architecture may be highly flexible
- Experimental evidence that such systems work:
  - Dataflow literature
  - Streaming literature