1
Streaming Processors
  • Anselmo Lastra

2
Presentations
  • 10:00-12:00
  • Will schedule noon exam takers early
  • Issues
  • Setting up demos
  • Transitioning from one to another
  • Reports emailed by Monday 8AM
  • One page per person, please

3
Topics
  • What are streaming processors?
  • Arguments about advantages for media
  • How good are they for graphics?
  • To think about
  • People say GPUs are streaming processors; is
    that true?

4
Motivation
  • Authors say that 100s to 1000s of ALUs can fit on
    a chip
  • The problem is getting data to/from them
  • The stream paradigm is a way to manage that by
    exposing I/O and parallelism
  • Media apps have high computational intensity

5
Streaming Example
6
Streaming Media Processors
  • Streams are sets of records
  • Records are fixed-length
  • Kernels are the programs that compute on the
    streams
  • There can be multiple streams in/out (see the
    sketch below)
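  • As a rough sketch of that model in plain C++ (not
    the languages actually used to program such
    chips; the Pixel record and blend kernel are
    invented for illustration):

      #include <cstdio>
      #include <vector>

      // A "record" is a small fixed-length element; a stream is an ordered
      // set of records.
      struct Pixel { float r, g, b; };

      // A "kernel" reads records from its input streams and appends records
      // to its output stream.  Here: two streams in, one stream out.
      void blend_kernel(const std::vector<Pixel>& a,
                        const std::vector<Pixel>& b,
                        std::vector<Pixel>& out) {
          out.clear();
          for (size_t i = 0; i < a.size(); ++i) {
              // All arithmetic on a record happens here, "near the ALUs";
              // the only memory traffic is sequential stream reads/writes.
              out.push_back({0.5f * (a[i].r + b[i].r),
                             0.5f * (a[i].g + b[i].g),
                             0.5f * (a[i].b + b[i].b)});
          }
      }

      int main() {
          std::vector<Pixel> a(4, Pixel{1.f, 0.f, 0.f});
          std::vector<Pixel> b(4, Pixel{0.f, 1.f, 0.f});
          std::vector<Pixel> out;
          blend_kernel(a, b, out);
          std::printf("%.2f %.2f %.2f\n", out[0].r, out[0].g, out[0].b);
      }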

7
Levels of Storage
  • Local Register File (LRF): stores data near the
    ALUs
  • Stream Register File (SRF): stores streams
    locally on chip
  • DRAM: for larger storage when necessary (see the
    staging sketch below)
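  • One way to picture the hierarchy, as a CPU-side
    sketch with invented sizes: data is staged in
    blocks from a large "DRAM" array into a small
    "SRF" buffer, and each value is pulled into a
    register-resident local (standing in for the LRF)
    while it is computed on:

      #include <algorithm>
      #include <cstdio>
      #include <vector>

      int main() {
          const size_t SRF_CAPACITY = 1024;         // illustrative size only
          std::vector<float> dram(1u << 20, 1.0f);  // large off-chip store
          std::vector<float> srf(SRF_CAPACITY);     // on-chip stream register file
          double total = 0.0;

          // Stage the stream through the SRF one block at a time.
          for (size_t base = 0; base < dram.size(); base += SRF_CAPACITY) {
              size_t n = std::min(SRF_CAPACITY, dram.size() - base);
              for (size_t i = 0; i < n; ++i) srf[i] = dram[base + i];  // DRAM -> SRF

              for (size_t i = 0; i < n; ++i) {
                  float x = srf[i];   // SRF -> a local value (the "LRF")
                  total += x * x;     // all arithmetic happens on the local value
              }
          }
          std::printf("%f\n", total);
      }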

8
Types of Parallelism
  • Instruction-level
  • Data parallelism: different ALUs can work on
    different stream elements (see the sketch below)
  • Task parallelism: the stream graph exposes
    dependencies, so different kernels can run in
    parallel
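  • A minimal sketch of the data-parallel case, with
    CPU threads standing in for ALU clusters (the
    four-way split and the kernel itself are
    invented):

      #include <cstdio>
      #include <thread>
      #include <vector>

      int main() {
          const int CLUSTERS = 4;   // stand-ins for the SIMD ALU clusters
          std::vector<float> in(1 << 16, 2.0f), out(in.size());

          auto kernel = [&](size_t begin, size_t end) {
              for (size_t i = begin; i < end; ++i)
                  out[i] = in[i] * in[i] + 1.0f;   // same kernel, different elements
          };

          // Each "cluster" works on a different slice of the stream.
          std::vector<std::thread> workers;
          size_t chunk = in.size() / CLUSTERS;
          for (int c = 0; c < CLUSTERS; ++c) {
              size_t begin = c * chunk;
              size_t end = (c == CLUSTERS - 1) ? in.size() : begin + chunk;
              workers.emplace_back(kernel, begin, end);
          }
          for (auto& w : workers) w.join();

          std::printf("%f\n", out.back());
      }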

9
Program w/ Multiple Kernels
  • Part of an MPEG-2 decoder

10
Basic Stream Processor
11
Imagine
12
Notes
  • 48 ALUs
  • Of different types: 3 adders, 2 multipliers, 1
    divide/sqrt per cluster
  • LRF: 9.7 Kbytes
  • SRF: 128 Kbytes
  • The kernel execution unit is SIMD
  • Presumably to have more computation for a given
    overhead (in circuitry)
  • The paper is recent
  • Imagine is running now

13
Cluster
  • A crosspoint switch sends data to the registers
    of the other units (or off the cluster)
  • Also a scratchpad register file

14
Performance
  • Best performance/power at 2.4 GFlops/watt
  • They compare to a 3 GHz P4 with a peak of
    12 GFlops at 80 watts (about 0.15 GFlops/watt)

15
Comparison to Vectors
  • Similar in that they work on chunks of data
  • They argue that streams are much more efficient
    because kernels do more work than the typical
    single vector instruction (see the toy contrast
    below)
  • Helps if you have a memory hierarchy
  • Recall the Cray-style memory system
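  • A toy contrast of the two styles (both loops are
    hypothetical and only meant to show why fusing
    work into a kernel cuts memory traffic):

      #include <vector>

      // "Vector style": each simple operation is a separate pass over memory.
      void vector_style(std::vector<float>& v) {
          for (auto& x : v) x = x * 2.0f;   // pass 1
          for (auto& x : v) x = x + 1.0f;   // pass 2
          for (auto& x : v) x = x * x;      // pass 3
      }

      // "Kernel style": all the work on an element happens while it sits in
      // a register, so the stream is read and written only once.
      void kernel_style(std::vector<float>& v) {
          for (auto& x : v) {
              float t = x * 2.0f + 1.0f;
              x = t * t;
          }
      }

      int main() {
          std::vector<float> a(1 << 20, 1.0f), b = a;
          vector_style(a);
          kernel_style(b);
          return a[0] == b[0] ? 0 : 1;      // both compute the same result
      }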

16
Comparison to Other Approaches
  • Sidebar to another of their papers
  • VLIW DSPs and SIMD (MMX style)
  • They say that Imagine deals with the memory
    hierarchy/bandwidth issue
  • They cite Cheops (1995), a stream-type processor
    with a more hardwired architecture

17
Kernels for OpenGL
  • Works on batches of primitives at a time
  • Keeps data in SRF

18
Load Balancing
  • Kernels like rasterization take unequal amounts
    of time
  • Larger/smaller triangles, for example
  • They use dynamic load balancing
  • Clusters that still have work continue
  • Those that do not get a new element
  • Since the clusters are SIMD, each must attempt to
    load an element every cycle, even if none is
    actually loaded (see the sketch below)
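  • A scalar simulation of that idea (not actual
    Imagine microcode; the per-element costs are made
    up): every "cycle" all clusters execute the same
    step, so each computes a candidate load, but only
    the idle clusters keep the new element:

      #include <cstdio>
      #include <vector>

      int main() {
          // Work items with unequal cost (e.g. triangles of different sizes).
          std::vector<int> work = {5, 1, 3, 2, 7, 1, 4, 2, 6, 1};
          const int CLUSTERS = 4;

          std::vector<int> remaining(CLUSTERS, 0);  // cycles left on current element
          size_t next = 0;                          // next stream element to hand out
          int cycles = 0;

          while (true) {
              int busy = 0;
              for (int c = 0; c < CLUSTERS; ++c) {
                  // SIMD: every cluster "attempts" the load each cycle...
                  int candidate = (next < work.size()) ? work[next] : 0;
                  if (remaining[c] == 0 && candidate > 0) {
                      remaining[c] = candidate;     // ...but only idle clusters keep it
                      ++next;
                  }
                  if (remaining[c] > 0) { --remaining[c]; ++busy; }
              }
              if (busy == 0) break;   // nothing left to do anywhere
              ++cycles;
          }
          std::printf("finished %zu elements in %d cycles\n", next, cycles);
      }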

19
Textures
  • The stream processor can't read memory directly
  • Must generate a texture address stream
  • That reads texels from memory into another stream
  • Then the texture mapping/shading kernel reads the
    two streams (sketched below)
  • How does this affect access to DRAM?
  • Would we be better off sorting to make memory
    access coherent, then restoring the original
    order? How (in a streaming computer)?
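  • A sketch of that indirection in plain C++ (the
    Fragment type, texture size, and point-sampling
    arithmetic are invented): one kernel emits a
    texture-address stream, a gather stage turns it
    into a texel stream, and the shading kernel walks
    the fragment and texel streams together:

      #include <cstdio>
      #include <vector>

      struct Fragment { int x, y; float u, v; };

      int main() {
          const int TEX_W = 64, TEX_H = 64;
          std::vector<float> texture(TEX_W * TEX_H, 0.5f);   // texels live in "DRAM"
          std::vector<Fragment> frags = { {10, 10, 0.25f, 0.25f},
                                          {11, 10, 0.50f, 0.75f} };

          // Kernel 1: turn each fragment into a texture address (point sampling).
          std::vector<int> addr;
          for (const auto& f : frags)
              addr.push_back(int(f.v * (TEX_H - 1)) * TEX_W + int(f.u * (TEX_W - 1)));

          // Gather: the memory system reads the texels into a new stream.
          std::vector<float> texels;
          for (int a : addr) texels.push_back(texture[a]);

          // Kernel 2: the shading kernel reads the fragment and texel streams.
          for (size_t i = 0; i < frags.size(); ++i)
              std::printf("frag (%d,%d) -> texel %f\n",
                          frags[i].x, frags[i].y, texels[i]);
      }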

20
Multiple Textures
21
Ordering
  • Attach a triangle ID to each fragment
  • Only worry about ordering of fragments with same
    offset (screen location)
  • Hash using the low-order 6 bits of x and y
  • Split the stream into those with conflicts and
    those without
  • How, without making two passes over the input
    stream? (one possibility is sketched below)
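  • One possible single-pass split (my own reading of
    the scheme, not necessarily the paper's exact
    method): a small table indexed by the 6-bit
    hashes of x and y remembers which triangle last
    touched each bucket, and a fragment goes to the
    conflict stream if its bucket was already touched
    by a different triangle in this batch:

      #include <cstdio>
      #include <vector>

      struct Fragment { int tri_id, x, y; };

      int main() {
          std::vector<Fragment> in = { {0, 3, 5}, {1, 3, 5},
                                       {2, 8, 9}, {3, 64 + 3, 5} };
          std::vector<Fragment> no_conflict, conflict;

          // 4096 buckets indexed by the low-order 6 bits of x and y; each
          // bucket remembers the triangle that last touched it.
          std::vector<int> last_tri(1 << 12, -1);

          for (const auto& f : in) {
              int h = ((f.x & 63) << 6) | (f.y & 63);
              if (last_tri[h] != -1 && last_tri[h] != f.tri_id)
                  conflict.push_back(f);    // possible same-pixel ordering hazard
              else
                  no_conflict.push_back(f);
              last_tri[h] = f.tri_id;
          }
          // The last fragment aliases the first in the hash, so it is treated
          // as a conflict even though it is at a different pixel.
          std::printf("%zu without conflicts, %zu with\n",
                      no_conflict.size(), conflict.size());
      }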

22
Z Compare
  • Need to be careful about stale z values
  • If two fragments for the same screen pixel are in
    a batch, the second could get an old Z
  • The same problem occurs with write queues
  • They use the sorted conflict stream from the
    reordering kernel
  • Make as many passes as the depth complexity of
    the batch (sketched below)
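  • A rough sketch of why the pass count tracks the
    depth complexity (a simplification with invented
    types, not the paper's actual kernel): each pass
    commits at most one fragment per pixel against
    the Z buffer and defers the rest, so a pixel
    covered k times in the batch needs k passes:

      #include <cstdio>
      #include <vector>

      struct Fragment { int pixel; float z; };

      int main() {
          // Three fragments land on pixel 7 in one batch: depth complexity 3.
          std::vector<Fragment> pending = { {7, 0.9f}, {3, 0.5f},
                                            {7, 0.4f}, {7, 0.7f} };
          std::vector<float> zbuf(16, 1.0f);   // initialized to "far"
          int passes = 0;

          while (!pending.empty()) {
              std::vector<Fragment> deferred;
              std::vector<char> touched(zbuf.size(), 0);
              for (const auto& f : pending) {
                  if (touched[f.pixel]) {      // a later fragment for this pixel
                      deferred.push_back(f);   // would see a stale Z; defer it
                      continue;
                  }
                  touched[f.pixel] = 1;
                  if (f.z < zbuf[f.pixel]) zbuf[f.pixel] = f.z;   // ordinary Z test
              }
              pending = deferred;
              ++passes;
          }
          std::printf("passes = %d, z at pixel 7 = %f\n", passes, zbuf[7]);
      }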

23
Advantages of Streaming
  • They say that streaming enables:
  • Latency tolerance: textures can be read while
    something else is running
  • Reordering operations: texturing can be moved
    after the Z test if blending is disabled
  • Flexible resource allocation: since all stages
    use the same computational elements, the pipeline
    adapts naturally to cases such as fewer, larger
    triangles

24
Predicted Performance
  • This was before they had Imagine chips
  • Benchmarks
  • Sphere: finely tessellated, 82K tris, 362K
    fragments
  • Advs-1: one of the SPECviewperf benchmarks, 26K
    tris, 70K fragments, point-sampled texture
  • Advs-8: same, except mipmapped textures
  • Fill: 20K mipmapped tris filling the screen
    (720x720)
  • They scaled the NVIDIA chip to the same
    technology

25
Comparison
26
Relative Performance
  • Imagine is worse on rasterization-heavy
    benchmarks

27
Some Observations
  • The hash function is important, versus just
    sorting
  • 21% speedup
  • 14.7% of fragments collided
  • Mipmapped streams take more space so stream
    length is shorter, thus lower efficiency
  • Imagine's off-chip memory bandwidth is lower than
    the Quadro's

28
Possible Hardware Changes
  • An on-chip cache
  • Streams would be tagged cacheable or not cacheable
  • A 3DRAM-like ALU in the memory system
  • Multi-node Imagine system
  • There's a 750 MB/s network to connect them

29
Reyes on Stream Processor
30
Crack Prevention
31
Performance Comparison
  • Note the log scale

32
Where Time was Spent
  • Subdivision is in the geometry stage
  • Reyes shading is in the vertex program
  • The last three have no subdivision
  • Reyes is still half speed even with no
    subdivision
  • Shading many quads that don't cover pixels

33
Conclusions on Reyes
  • Need hardware-accelerated subdivision
  • He provides some specific suggestions

34
Other Readings in Hardware
  • Look at hardware for network routing
  • Also a domain with high performance and high
    bandwidth
  • Lower computational intensity
  • Look at lower-level technologies
  • Embedded DRAM
  • Signaling
  • What problems are open to a university researcher?