Nat Duca, Jonathan Cohen - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Stream Caching: Mechanisms for General Purpose
Stream Processing
Nat Duca, Jonathan Cohen (Johns Hopkins University)
Peter Kirchner (IBM Research)
2
Talk Outline
  • Objective: reconcile current practices of CPU
    design with stream processing theory
  • Part 1: Streaming ideas in current architectures
  • Latency and die space
  • Processor types and tricks
  • Part 2: Insights about stream caches
  • Could window-based streaming be the next step in
    computer architecture?

3
Streaming Architectures
  • Graphics processors
  • Signal processors
  • Network processors
  • Scalar/Superscalar processors
  • Data stream processors?
  • Software architectures?

4
What is a Streaming Computer?
  • Two overlapping ideas:
  • A system that executes strict-streaming
    algorithms: unbounded N, small M
  • A general-purpose system that is geared toward
    general computation, but is best for the
    streaming case
  • Big motivator: ALU-bound computation!
  • To what extent do present computer architectures
    serve these two views of a streaming computer?
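A strict-streaming algorithm of the first kind can be sketched in a few lines (an illustrative example, not from the talk): a running mean touches each of the N input records exactly once and keeps only constant state, so M stays small no matter how large N grows.

```python
def running_mean(stream):
    """Strict streaming: one sequential pass over unbounded N records,
    with O(1) state (small M)."""
    total, count = 0.0, 0
    for record in stream:
        total += record
        count += 1
    return total / count if count else 0.0

print(running_mean(iter(range(1, 101))))  # -> 50.5
```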

5
Superscalar Architectures
  • Keep memory latency from limiting computation
    speed
  • Solutions:
  • Caches
  • Pipelining
  • Prefetching
  • Eager execution / branch prediction (the "super"
    in superscalar)
  • These are heuristics to locate streaming patterns
    in unstructured program behavior

6
By the Numbers, Data
  • Optimized using caches, pipelines, and
    eager execution:
  • Random: 182 MB/s
  • Sequential: 315 MB/s
  • Optimizing with prefetching:
  • Random: 490 MB/s
  • Sequential: 516 MB/s
  • Theoretical maximum: 533 MB/s

7
By the Numbers, Observations
  • Achieving full throughput on a scalar CPU
    requires either:
  • (a) prefetching: requires advance knowledge
  • (b) sequential access: no advance knowledge
    required
  • Vector architectures hide latency in their
    instruction set using implicit prefetching
  • Dataflow machines solve latency using automatic
    prefetching
  • Rule 1: Sequential I/O simplifies control of and
    access to memory, etc.

8
Superscalar (e.g. P4)
Local Memory Hierarchy
Cache
Prefetch
9
Superscalar (e.g. P4)
Local Memory Hierarchy
Cache
Prefetch
The P4, by surface area, is about 95%
cache, prefetch, and branch-prediction
logic. The remaining area is primarily the
floating-point ALU.
10
Pure Streaming (e.g. Imagine)
Out Streams
In Streams
11
Can We Build This Machine?
Local Memory Hierarchy
Out Streams
In Streams
  • Rule 2: Small memory footprint allows more room
    for ALU -> more throughput

12
Part II Chromium
  • Pure stream processing model
  • Deals with the OpenGL command stream:
  • Begin(Triangles), Vertex, Vertex, Vertex, End
  • Record splits are supported; joins are not
  • You perform useful computation in Chromium by
    joining together Stream Processors into a DAG
  • Note: the DAG is constructed across multiple
    processors (unlike dataflow)

13
Chromium w/ Stream Caches
  • We added join capability to Chromium for the
    purpose of collapsing multiple records into one
  • Incidentally, this allows windowed computations
  • Thought: there seems to be a direct connection
    between streaming joins and sliding windows
  • Because we're in software, the windows can become
    quite big without too much hassle
  • What if we move to hardware?
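The join capability described above can be sketched as follows (an illustrative model, not Chromium's actual interface): a join node consumes one record from each input stream and collapses them into a single output record via a user-supplied combine function.

```python
def join(stream_a, stream_b, combine):
    """Collapse one record from each of two input streams
    into a single output record."""
    for a, b in zip(stream_a, stream_b):
        yield combine(a, b)

# Hypothetical usage: merge two record streams by summation.
print(list(join([1, 2, 3], [10, 20, 30], lambda a, b: a + b)))  # -> [11, 22, 33]
```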

14
Windowed Streaming
Window Buffer
Out Streams
In Streams
  • Uses for a Window Buffer of size M:
  • Store program structures of up to size M
  • Cache M input records, where M << N
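The record-caching use of the window buffer can be sketched with a bounded deque (a minimal sketch; the per-window operator, max below, is an assumption, since the talk leaves the operator open):

```python
from collections import deque

def windowed(stream, m, op):
    """Keep at most M records in the window buffer (M << N)
    and apply op to each full window."""
    buf = deque(maxlen=m)  # old records fall out automatically
    for record in stream:
        buf.append(record)
        if len(buf) == m:
            yield op(buf)

print(list(windowed([3, 1, 4, 1, 5, 9], 3, max)))  # -> [4, 4, 5, 9]
```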

15
Windowed Streaming
Window Buffer
In Streams
Out Streams
Realistic values of M, if you stay exclusively on
chip: 128KB ... 256KB ... 2MB. DRAM-on-chip tech
is promising.
16
Impact on Window Size
Window Buffer
Out Streams
In Streams
Insight: As M increases, this starts to resemble
a superscalar computer
17
The Continuum Architecture
Memory Hierarchy
Out Streams
In Streams
  • For too large a value of M:
  • Non-sequential I/O -> caches
  • Caches -> less room for ALU (etc.)

18
Windowed Streaming
Window Buffer
In Streams
Out Streams
Loopback streams
Thought: Can we get around the window-buffer size
limit with a loopback feature?
19
Windowed Streaming
Window Buffer
In Streams
Out Streams
Loopback streams
Memory
Thought: What do we gain by allowing a finite
delay in the loopback stream?
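One possible reading of a finite-delay loopback (a sketch under assumed semantics, not the talk's design): a record emitted on the loopback port re-enters the input queue after at most `delay` other pending records, so iterative computations fit within the streaming model without growing the window buffer.

```python
from collections import deque

def process_with_loopback(inputs, delay=4):
    """Repeatedly halve each record via the loopback until it reaches 1.
    A looped-back record re-enters after at most `delay` pending records."""
    pending = deque(inputs)
    done = []
    while pending:
        x = pending.popleft()
        if x > 1:
            # Not finished: send the halved record around the loopback.
            pending.insert(min(delay, len(pending)), x // 2)
        else:
            done.append(x)
    return done

print(process_with_loopback([8, 3, 1]))  # -> [1, 1, 1]
```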
20
Streaming Networks: Primitive
21
Streaming Networks: 1-to-N
(Hanrahan model)
22
Streaming Networks: N-to-1
23
Streaming Networks: The Ugly
24
Versatility of Streaming Networks?
  • Question: What algorithms can we support here?
    How?
  • Both from a theoretical and a practical view
  • We have experimented with graphics problems only:
  • Stream compression, visibility culling, level
    of detail

25
New Concepts with Streaming Networks
  • An individual processor's cost is small
  • Highly flexible: uses high-level ideas of Dataflow
  • Multiple streams in and out
  • Interleaved or non-interleaved
  • Scalable window size
  • Open to entirely new concepts
  • E.g., how do you add more memory in this system?

26
Summary
  • Systems are easily built on the basis of
    streaming I/O and memory models
  • By design, such a system makes maximum use of
    hardware: very efficient
  • Continuum of architectures: Pure Streaming to
    Superscalar
  • Stream processors are trivially chained, even in
    cycles
  • Such a chained architecture may be highly
    flexible
  • Experimental evidence that such systems work:
  • Dataflow literature
  • Streaming literature