1. Stream Caching: Mechanisms for General Purpose Stream Processing
Nat Duca, Jonathan Cohen (Johns Hopkins University)
Peter Kirchner (IBM Research)
2. Talk Outline
- Objective: reconcile current practices of CPU design with stream processing theory
- Part 1: Streaming ideas in current architectures
  - Latency and die space
  - Processor types and tricks
- Part 2: Insights about stream caches
  - Could window-based streaming be the next step in computer architecture?
3. Streaming Architectures
- Graphics processors
- Signal processors
- Network processors
- Scalar/Superscalar processors
- Data stream processors?
- Software architectures?
4. What is a Streaming Computer?
- Two overlapping ideas:
  - A system that executes strict-streaming algorithms (unbounded N, small M); sketched below
  - A general-purpose system that is geared toward general computation, but performs best in the streaming case
- Big motivator: ALU-bound computation!
- To what extent do present computer architectures serve these two views of a streaming computer?
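A minimal sketch of the strict-streaming case, using a running mean as the kernel (the kernel is an illustrative choice, not one named in the talk): each record is seen once and in order, N is unbounded, and the state retained between records stays constant.

```cpp
#include <cstdint>
#include <cstdio>

// Strict-streaming kernel: each input record is read once, in order,
// and the state kept between records is O(1) regardless of stream length N.
struct RunningMean {
    double   sum   = 0.0;
    uint64_t count = 0;

    void consume(double x) {        // one pass, one record at a time
        sum += x;
        ++count;
    }
    double result() const { return count ? sum / count : 0.0; }
};

int main() {
    RunningMean mean;
    double x;
    // Unbounded N: records arrive from stdin; the memory footprint never grows.
    while (std::scanf("%lf", &x) == 1)
        mean.consume(x);
    std::printf("mean = %f\n", mean.result());
    return 0;
}
```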
5. Superscalar Architectures
- Keep memory latency from limiting computation speed
- Solutions:
  - Caches
  - Pipelining
  - Prefetching
  - Eager execution / branch prediction (the "super" in superscalar)
- These are heuristics to locate streaming patterns in unstructured program behavior
6. By the Numbers: Data
- Optimized using caches, pipelines, and eager execution
  - Random: 182 MB/s
  - Sequential: 315 MB/s
- Optimized with prefetching
  - Random: 490 MB/s
  - Sequential: 516 MB/s
- Theoretical maximum: 533 MB/s
7. By the Numbers: Observations
- Achieving full throughput on a scalar CPU requires either:
  - (a) prefetching, which requires advance knowledge of the access pattern
  - (b) sequential access, where no advance knowledge is required
- Vector architectures hide latency in their instruction set using implicit prefetching
- Dataflow machines solve latency using automatic prefetching
- Rule 1: Sequential I/O simplifies control and access to memory, etc. (illustrated by the sketch below)
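The throughput numbers above are the authors' own measurements. The sketch below is only an illustrative way to reproduce the qualitative gap between random and sequential traversal on a superscalar CPU; the buffer size and element type are arbitrary assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Touch every element of a large buffer once, in a given order, and report
// the effective bandwidth of that traversal.
double traverse_mb_per_s(const std::vector<int>& data,
                         const std::vector<size_t>& order) {
    auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (size_t i : order) sum += data[i];          // one load per element
    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    std::printf("(checksum %lld) ", sum);           // keep the loop from being optimized away
    return data.size() * sizeof(int) / (seconds * 1e6);
}

int main() {
    const size_t n = 64u << 20;                     // ~256 MB of ints, far larger than cache
    std::vector<int> data(n, 1);
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);

    double seq = traverse_mb_per_s(data, order);    // hardware prefetchers see this pattern

    std::shuffle(order.begin(), order.end(), std::mt19937_64{7});
    double rnd = traverse_mb_per_s(data, order);    // nearly every access misses the cache

    std::printf("\nsequential: %.0f MB/s  random: %.0f MB/s\n", seq, rnd);
    return 0;
}
```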
8. Superscalar (e.g., P4)
[Block diagram: local memory hierarchy feeding the core through cache and prefetch units]
9. Superscalar (e.g., P4)
[Same block diagram: local memory hierarchy, cache, prefetch]
The P4, by surface area, is about 95% cache, prefetch, and branch-prediction logic. The remaining area is primarily the floating-point ALU.
10. Pure Streaming (e.g., Imagine)
[Block diagram: in streams -> processing core -> out streams, no local memory hierarchy]
11. Can We Build This Machine?
[Block diagram: in streams and out streams attached to the core, plus a local memory hierarchy]
- Rule 2: Small memory footprint allows more room for ALU --> more throughput
12. Part II: Chromium
- Pure stream-processing model
- Deals with the OpenGL command stream
  - Begin(Triangles), Vertex, Vertex, Vertex, End
- Record splits are supported; joins are not
- You perform useful computation in Chromium by joining together stream processors into a DAG (a sketch of one such node follows this list)
- Note: the DAG is constructed across multiple processors (unlike dataflow)
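Chromium's real SPU interface is not reproduced in the slides; the sketch below is only a schematic stream-processor node, with hypothetical Record and Op types, that transforms an OpenGL-like command stream roughly the way a culling node in the DAG might.

```cpp
#include <functional>
#include <vector>

// Hypothetical record types standing in for OpenGL stream commands;
// this is not Chromium's actual SPU dispatch interface.
enum class Op { Begin, Vertex, End };
struct Record { Op op; float x, y, z; };

// A stream-processor node: consumes records one at a time and forwards
// (possibly transformed or filtered) records to its downstream sink.
struct StreamProcessor {
    std::function<void(const Record&)> downstream;

    void consume(const Record& r) {
        // Toy stand-in for visibility culling: drop vertices behind z = 0.
        // (A real culling SPU would drop whole primitives, not single vertices.)
        if (r.op == Op::Vertex && r.z < 0.0f) return;
        downstream(r);                 // pass-through for everything else
    }
};

int main() {
    std::vector<Record> sink;
    StreamProcessor cull{[&](const Record& r) { sink.push_back(r); }};

    // A tiny Begin/Vertex/.../End command stream, processed record by record.
    cull.consume({Op::Begin,  0, 0,  0});
    cull.consume({Op::Vertex, 0, 0,  1});
    cull.consume({Op::Vertex, 1, 0, -1});   // filtered out
    cull.consume({Op::Vertex, 0, 1,  1});
    cull.consume({Op::End,    0, 0,  0});
    return 0;
}
```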
13. Chromium with Stream Caches
- We added join capability to Chromium for the purpose of collapsing multiple records into one (a sketch follows this list)
- Incidentally, this allows windowed computations
- Thought: there seems to be a direct connection between streaming joins and sliding windows
- Because we're in software, the windows can become quite big without too much hassle
- What if we move to hardware?
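A minimal sketch of the join/window connection, assuming a generic operator (not Chromium's implementation) that collapses every M input records into one output record; a sliding window is the same buffer with only the oldest record evicted instead of the whole window cleared.

```cpp
#include <cstddef>
#include <cstdio>
#include <deque>

// A windowed stream operator: buffers up to M input records and collapses
// each full window into a single output record (here, their sum).
class WindowCollapse {
public:
    explicit WindowCollapse(size_t m) : m_(m) {}

    void consume(int record) {
        window_.push_back(record);
        if (window_.size() == m_) {          // window full: collapse M records to one
            int collapsed = 0;
            for (int r : window_) collapsed += r;
            window_.clear();                 // sliding variant: pop_front() instead
            emit(collapsed);
        }
    }

private:
    void emit(int record) { std::printf("out: %d\n", record); }

    size_t m_;                // window size M, with M << N (the stream length)
    std::deque<int> window_;  // the window buffer: the only state kept
};

int main() {
    WindowCollapse op(4);                         // M = 4
    for (int i = 1; i <= 12; ++i) op.consume(i);  // N = 12 records stream through
    return 0;
}
```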
14. Windowed Streaming
[Block diagram: in streams -> window buffer -> out streams]
- Uses for a window buffer of size M:
  - Store program structures of up to size M
  - Cache M input records, where M << N
15. Windowed Streaming
[Same diagram: in streams -> window buffer -> out streams]
Realistic values of M if you stay exclusively on-chip: 128K ... 256K ... 2MB. DRAM-on-chip technology is promising.
16. Impact of Window Size
[Same diagram: in streams -> window buffer -> out streams]
Insight: as M increases, this starts to resemble a superscalar computer.
17. The Continuum Architecture
[Block diagram: in streams and out streams plus a memory hierarchy]
- For too large a value of M:
  - Non-sequential I/O --> caches
  - Caches --> less room for ALU (etc.)
18. Windowed Streaming
[Block diagram: in streams -> window buffer -> out streams, with loopback streams from output back to input]
Thought: can we augment the window-buffer limit with a loopback feature?
19. Windowed Streaming
[Same diagram, with memory inserted on the loopback path]
Thought: what do we gain by allowing a finite delay in the loopback stream? (A toy sketch follows.)
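One possible reading of the loopback thought, as a toy sketch and not something specified in the slides: records that cannot be finished with the current window are re-injected through a bounded FIFO, so they reappear at the input after a finite delay.

```cpp
#include <cstdio>
#include <queue>

// Toy loopback sketch: a windowed operator defers records it cannot handle
// now by sending them around a finite-delay loopback stream, seeing them
// again on a later pass instead of growing its window buffer.
int main() {
    std::queue<int> loopback;            // finite-delay loopback stream
    const int input[] = {5, 12, 7, 20, 3};

    // First pass over the primary input: emit small records immediately,
    // defer large ones by pushing them onto the loopback.
    for (int r : input) {
        if (r < 10) std::printf("emit %d (first pass)\n", r);
        else        loopback.push(r);
    }

    // Drain the loopback stream: the deferred records arrive after a delay,
    // as if they had been appended to the input stream.
    while (!loopback.empty()) {
        std::printf("emit %d (after loopback delay)\n", loopback.front());
        loopback.pop();
    }
    return 0;
}
```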
20. Streaming Networks: Primitive
21. Streaming Networks: 1-to-N (Hanrahan model)
22. Streaming Networks: N-to-1
23. Streaming Networks: The Ugly
24. Versatility of Streaming Networks?
- Question: what algorithms can we support here? How?
  - Both from a theoretical and a practical view
- We have experimented with graphics problems only
  - Stream compression, visibility culling, level of detail
25. New Concepts with Streaming Networks
- An individual processor's cost is small
- Highly flexible: uses high-level ideas from dataflow
  - Multiple streams in and out
  - Interleaved or non-interleaved
  - Scalable window size
- Open to entirely new concepts
  - E.g., how do you add more memory to this system?
26. Summary
- Systems are easily built on the basis of streaming I/O and memory models
- By design, such a system makes maximum use of the hardware: very, very efficient
- Continuum of architectures: pure streaming to superscalar
- Stream processors are trivially chained, even in cycles
- Such a chained architecture may be highly flexible
- Experimental evidence that such systems work:
  - Dataflow literature
  - Streaming literature