1
Streaming Processors
  • Anselmo Lastra

2
Presentations
  • 10:00-12:00
  • Will schedule noon exam takers early
  • Issues
  • Setting up demos
  • Transitioning from one to another
  • Reports emailed by Monday 8AM
  • One page per person, please

3
Topics
  • What are streaming processors?
  • Arguments about advantages for media
  • How good are they for graphics?
  • To think about
  • People say GPUs are streaming processors; is
    that true?

4
Motivation
  • Authors say that 100s to 1000s of ALUs can fit on
    a chip
  • The problem is getting data to/from them
  • The stream paradigm is a way to manage that by
    exposing I/O and parallelism
  • Media apps have high computational intensity

5
Streaming Example
6
Streaming Media Processors
  • Streams are sets of records
  • Records are fixed-length
  • Kernels are the programs that compute on the
    streams
  • There can be multiple streams in/out (see the
    sketch below)
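  • As a rough sketch of that model in plain C++ (not
    the languages actually used to program such
    chips; the Pixel record and blend kernel are
    invented for illustration):

      #include <cstdio>
      #include <vector>

      // A "record" is a small fixed-length element; a stream is an ordered
      // set of records.
      struct Pixel { float r, g, b; };

      // A "kernel" reads records from its input streams and appends records
      // to its output stream.  Here: two streams in, one stream out.
      void blend_kernel(const std::vector<Pixel>& a,
                        const std::vector<Pixel>& b,
                        std::vector<Pixel>& out) {
          out.clear();
          for (size_t i = 0; i < a.size(); ++i) {
              // All arithmetic on a record happens here, "near the ALUs";
              // the only memory traffic is sequential stream reads/writes.
              out.push_back({0.5f * (a[i].r + b[i].r),
                             0.5f * (a[i].g + b[i].g),
                             0.5f * (a[i].b + b[i].b)});
          }
      }

      int main() {
          std::vector<Pixel> a(4, Pixel{1.f, 0.f, 0.f});
          std::vector<Pixel> b(4, Pixel{0.f, 1.f, 0.f});
          std::vector<Pixel> out;
          blend_kernel(a, b, out);
          std::printf("%.2f %.2f %.2f\n", out[0].r, out[0].g, out[0].b);
      }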

7
Levels of Storage
  • Local Register File (LRF): stores data near the
    ALUs
  • Stream Register File (SRF): stores streams
    locally on chip
  • DRAM: for larger storage when necessary (see the
    staging sketch below)
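  • One way to picture the hierarchy, as a CPU-side
    sketch with invented sizes: data is staged in
    blocks from a large "DRAM" array into a small
    "SRF" buffer, and each value is pulled into a
    register-resident local (standing in for the LRF)
    while it is computed on:

      #include <algorithm>
      #include <cstdio>
      #include <vector>

      int main() {
          const size_t SRF_CAPACITY = 1024;         // illustrative size only
          std::vector<float> dram(1u << 20, 1.0f);  // large off-chip store
          std::vector<float> srf(SRF_CAPACITY);     // on-chip stream register file
          double total = 0.0;

          // Stage the stream through the SRF one block at a time.
          for (size_t base = 0; base < dram.size(); base += SRF_CAPACITY) {
              size_t n = std::min(SRF_CAPACITY, dram.size() - base);
              for (size_t i = 0; i < n; ++i) srf[i] = dram[base + i];  // DRAM -> SRF

              for (size_t i = 0; i < n; ++i) {
                  float x = srf[i];   // SRF -> a local value (the "LRF")
                  total += x * x;     // all arithmetic happens on the local value
              }
          }
          std::printf("%f\n", total);
      }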

8
Types of Parallelism
  • Instruction-level
  • Data parallelism: different ALUs can work on
    different stream elements (see the sketch below)
  • Task parallelism: the stream graph exposes
    dependencies, so different kernels can run in
    parallel
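  • A minimal sketch of the data-parallel case, with
    CPU threads standing in for ALU clusters (the
    four-way split and the kernel itself are
    invented):

      #include <cstdio>
      #include <thread>
      #include <vector>

      int main() {
          const int CLUSTERS = 4;   // stand-ins for the SIMD ALU clusters
          std::vector<float> in(1 << 16, 2.0f), out(in.size());

          auto kernel = [&](size_t begin, size_t end) {
              for (size_t i = begin; i < end; ++i)
                  out[i] = in[i] * in[i] + 1.0f;   // same kernel, different elements
          };

          // Each "cluster" works on a different slice of the stream.
          std::vector<std::thread> workers;
          size_t chunk = in.size() / CLUSTERS;
          for (int c = 0; c < CLUSTERS; ++c) {
              size_t begin = c * chunk;
              size_t end = (c == CLUSTERS - 1) ? in.size() : begin + chunk;
              workers.emplace_back(kernel, begin, end);
          }
          for (auto& w : workers) w.join();

          std::printf("%f\n", out.back());
      }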

9
Program w/ Multiple Kernels
  • Part of an MPEG-2 decoder

10
Basic Stream Processor
11
Imagine
12
Notes
  • 48 ALUs
  • Of different types: 3 adders, 2 multipliers, 1
    divide/sqrt per cluster
  • LRF: 9.7 Kbytes
  • SRF: 128 Kbytes
  • The kernel execution unit is SIMD
  • Presumably to have more computation for a given
    overhead (in circuitry)
  • The paper is recent
  • Imagine is running now

13
Cluster
  • A crosspoint switch sends data to the registers
    of the other units (or off the cluster)
  • Also a scratchpad register file

14
Performance
  • Best performance/power at 2.4 GFlops/watt
  • They compare to a 3 GHz P4 with a peak of
    12 GFlops at 80 watts (about 0.15 GFlops/watt)

15
Comparison to Vectors
  • Similar in that they work on chunks of data
  • They argue that streams are much more efficient
    because kernels do more work than the typical
    single vector instruction (see the toy contrast
    below)
  • Helps if you have a memory hierarchy
  • Recall the Cray-style memory system
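  • A toy contrast of the two styles (both loops are
    hypothetical and only meant to show why fusing
    work into a kernel cuts memory traffic):

      #include <vector>

      // "Vector style": each simple operation is a separate pass over memory.
      void vector_style(std::vector<float>& v) {
          for (auto& x : v) x = x * 2.0f;   // pass 1
          for (auto& x : v) x = x + 1.0f;   // pass 2
          for (auto& x : v) x = x * x;      // pass 3
      }

      // "Kernel style": all the work on an element happens while it sits in
      // a register, so the stream is read and written only once.
      void kernel_style(std::vector<float>& v) {
          for (auto& x : v) {
              float t = x * 2.0f + 1.0f;
              x = t * t;
          }
      }

      int main() {
          std::vector<float> a(1 << 20, 1.0f), b = a;
          vector_style(a);
          kernel_style(b);
          return a[0] == b[0] ? 0 : 1;      // both compute the same result
      }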

16
Comparison to Other Approaches
  • Sidebar to another of their papers
  • VLIW DSPs and SIMD (MMX style)
  • They say that Imagine deals with the memory
    hierarchy/bandwidth issue
  • They cite Cheops (1995), a stream-type processor
    with a more hardwired architecture

17
Kernels for OpenGL
  • Works on batches of primitives at a time
  • Keeps data in SRF

18
Load Balancing
  • Kernels like rasterization take unequal amounts
    of time
  • Larger/smaller triangles, for example
  • They use dynamic load balancing
  • Clusters that still have work continue
  • Those that do not get a new element
  • Since the clusters are SIMD, each must attempt to
    load an element every cycle, even if none is
    actually loaded (see the sketch below)
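  • A scalar simulation of that idea (not actual
    Imagine microcode; the per-element costs are made
    up): every "cycle" all clusters execute the same
    step, so each computes a candidate load, but only
    the idle clusters keep the new element:

      #include <cstdio>
      #include <vector>

      int main() {
          // Work items with unequal cost (e.g. triangles of different sizes).
          std::vector<int> work = {5, 1, 3, 2, 7, 1, 4, 2, 6, 1};
          const int CLUSTERS = 4;

          std::vector<int> remaining(CLUSTERS, 0);  // cycles left on current element
          size_t next = 0;                          // next stream element to hand out
          int cycles = 0;

          while (true) {
              int busy = 0;
              for (int c = 0; c < CLUSTERS; ++c) {
                  // SIMD: every cluster "attempts" the load each cycle...
                  int candidate = (next < work.size()) ? work[next] : 0;
                  if (remaining[c] == 0 && candidate > 0) {
                      remaining[c] = candidate;     // ...but only idle clusters keep it
                      ++next;
                  }
                  if (remaining[c] > 0) { --remaining[c]; ++busy; }
              }
              if (busy == 0) break;   // nothing left to do anywhere
              ++cycles;
          }
          std::printf("finished %zu elements in %d cycles\n", next, cycles);
      }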

19
Textures
  • The stream processor can't read memory directly
  • Must generate a texture address stream
  • That reads texels from memory into another stream
  • Then the texture mapping/shading kernel reads the
    two streams (sketched below)
  • How does this affect access to DRAM?
  • Would we be better off sorting to make memory
    access coherent, then restoring the original
    order? How (in a streaming computer)?
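  • A sketch of that indirection in plain C++ (the
    Fragment type, texture size, and point-sampling
    arithmetic are invented): one kernel emits a
    texture-address stream, a gather stage turns it
    into a texel stream, and the shading kernel walks
    the fragment and texel streams together:

      #include <cstdio>
      #include <vector>

      struct Fragment { int x, y; float u, v; };

      int main() {
          const int TEX_W = 64, TEX_H = 64;
          std::vector<float> texture(TEX_W * TEX_H, 0.5f);   // texels live in "DRAM"
          std::vector<Fragment> frags = { {10, 10, 0.25f, 0.25f},
                                          {11, 10, 0.50f, 0.75f} };

          // Kernel 1: turn each fragment into a texture address (point sampling).
          std::vector<int> addr;
          for (const auto& f : frags)
              addr.push_back(int(f.v * (TEX_H - 1)) * TEX_W + int(f.u * (TEX_W - 1)));

          // Gather: the memory system reads the texels into a new stream.
          std::vector<float> texels;
          for (int a : addr) texels.push_back(texture[a]);

          // Kernel 2: the shading kernel reads the fragment and texel streams.
          for (size_t i = 0; i < frags.size(); ++i)
              std::printf("frag (%d,%d) -> texel %f\n",
                          frags[i].x, frags[i].y, texels[i]);
      }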

20
Multiple Textures
21
Ordering
  • Attach a triangle ID to each fragment
  • Only worry about ordering of fragments with same
    offset (screen location)
  • Hash using the low-order 6 bits of x and y
  • Split the stream into those with conflicts and
    those without
  • How, without making two passes over the input
    stream? (one possibility is sketched below)
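  • One possible single-pass split (my own reading of
    the scheme, not necessarily the paper's exact
    method): a small table indexed by the 6-bit
    hashes of x and y remembers which triangle last
    touched each bucket, and a fragment goes to the
    conflict stream if its bucket was already touched
    by a different triangle in this batch:

      #include <cstdio>
      #include <vector>

      struct Fragment { int tri_id, x, y; };

      int main() {
          std::vector<Fragment> in = { {0, 3, 5}, {1, 3, 5},
                                       {2, 8, 9}, {3, 64 + 3, 5} };
          std::vector<Fragment> no_conflict, conflict;

          // 4096 buckets indexed by the low-order 6 bits of x and y; each
          // bucket remembers the triangle that last touched it.
          std::vector<int> last_tri(1 << 12, -1);

          for (const auto& f : in) {
              int h = ((f.x & 63) << 6) | (f.y & 63);
              if (last_tri[h] != -1 && last_tri[h] != f.tri_id)
                  conflict.push_back(f);    // possible same-pixel ordering hazard
              else
                  no_conflict.push_back(f);
              last_tri[h] = f.tri_id;
          }
          // The last fragment aliases the first in the hash, so it is treated
          // as a conflict even though it is at a different pixel.
          std::printf("%zu without conflicts, %zu with\n",
                      no_conflict.size(), conflict.size());
      }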

22
Z Compare
  • Need to be careful about stale z values
  • If two fragments for the same screen pixel are in
    a batch, the second could get an old Z
  • The same problem occurs with write queues
  • They use the sorted conflict stream from the
    reordering kernel
  • Make as many passes as the depth complexity of
    the batch (sketched below)
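  • A rough sketch of why the pass count tracks the
    depth complexity (a simplification with invented
    types, not the paper's actual kernel): each pass
    commits at most one fragment per pixel against
    the Z buffer and defers the rest, so a pixel
    covered k times in the batch needs k passes:

      #include <cstdio>
      #include <vector>

      struct Fragment { int pixel; float z; };

      int main() {
          // Three fragments land on pixel 7 in one batch: depth complexity 3.
          std::vector<Fragment> pending = { {7, 0.9f}, {3, 0.5f},
                                            {7, 0.4f}, {7, 0.7f} };
          std::vector<float> zbuf(16, 1.0f);   // initialized to "far"
          int passes = 0;

          while (!pending.empty()) {
              std::vector<Fragment> deferred;
              std::vector<char> touched(zbuf.size(), 0);
              for (const auto& f : pending) {
                  if (touched[f.pixel]) {      // a later fragment for this pixel
                      deferred.push_back(f);   // would see a stale Z; defer it
                      continue;
                  }
                  touched[f.pixel] = 1;
                  if (f.z < zbuf[f.pixel]) zbuf[f.pixel] = f.z;   // ordinary Z test
              }
              pending = deferred;
              ++passes;
          }
          std::printf("passes = %d, z at pixel 7 = %f\n", passes, zbuf[7]);
      }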

23
Advantages of Streaming
  • They say that streaming enables:
  • Latency tolerance: textures can be read while
    something else is running
  • Reordering operations: texturing can be moved
    after the Z test if blending is disabled
  • Flexible resource allocation: since all stages
    use the same computational elements, the pipeline
    adapts naturally to cases such as fewer, larger
    triangles

24
Predicted Performance
  • This was before they had Imagine chips
  • Benchmarks
  • Sphere: finely tessellated, 82K tris, 362K
    fragments
  • Advs-1: one of the SPECviewperf benchmarks, 26K
    tris, 70K fragments, point-sampled texture
  • Advs-8: same, except mipmapped textures
  • Fill: 20K mipmapped tris filling the screen
    (720x720)
  • They scaled the NVIDIA chip to the same
    technology

25
Comparison
26
Relative Performance
  • Imagine is worse on rasterization-heavy
    benchmarks

27
Some Observations
  • The hash function is important, versus just
    sorting
  • 21% speedup
  • 14.7% of fragments collided
  • Mipmapped streams take more space so stream
    length is shorter, thus lower efficiency
  • Imagine's off-chip memory bandwidth is lower than
    the Quadro's

28
Possible Hardware Changes
  • An on-chip cache
  • Streams would be tagged cacheable or not cacheable
  • A 3DRAM-like ALU in the memory system
  • Multi-node Imagine system
  • There's a 750 MB/s network to connect them

29
Reyes on Stream Processor
30
Crack Prevention
31
Performance Comparison
  • Note the log scale

32
Where Time was Spent
  • Subdivision is in the geometry stage
  • Reyes shading is in the vertex program
  • The last three have no subdivision
  • Reyes is still half speed even with no
    subdivision
  • Shading many quads that don't cover pixels

33
Conclusions on Reyes
  • Need hardware-accelerated subdivision
  • He provides some specific suggestions

34
Other Readings in Hardware
  • Look at hardware for network routing
  • Also a domain with high performance and high
    bandwidth
  • Lower computational intensity
  • Look at lower-level technologies
  • Embedded DRAM
  • Signaling
  • What problems are open to a university researcher?