1
CDA 5155
  • Superscalar, VLIW, Vector, Decoupled
  • Week 4

2
Processor Design Families
  • Superscalar
  • Not an Architectural Specification!
  • Vector Processors
  • Simplest hardware; great for the right problems
  • Statically Scheduled Multiple Issue
  • Better known as Very Long Instruction Word (VLIW)
  • Compiler dominated Scheduling
  • Better known as EPIC (almost VLIW)
  • Decoupled Architectures
  • Tightly interconnected Scalar Processors
  • Relatively unknown area, influencing current
    designs
  • (also my dissertation research)

3
Vector Processors
  • "I'm certainly not inventing vector processors.
    There are three kinds that I know of existing
    today. They are represented by the Illiac-IV,
    the (CDC) Star processor, and the TI (ASC)
    processor. Those three were all pioneering
    processors. One of the problems of being a
    pioneer is you always make mistakes and I never,
    never want to be a pioneer. It's always best to
    come second when you can look at the mistakes the
    pioneers made."
  • Seymour Cray (Cray-1, 1976)

4
Vector Processor Design
  • Early supercomputers
  • Add special instructions (e.g., ADDV) that operate
    on sequences (or vectors) of data
  • A single instruction defines a long sequence of
    operations to be performed (see the C sketch below)
  • Sequences do not have hazards, so no stalling,
    forwarding, etc.
  • Eliminates the need for overhead instructions for
    loop iteration
  • Very simple pipeline organization
  • More constrained memory access: scheduling of LV/SV
    instructions can match memory banking designs
  • This enables very efficient use of the memory bus
    (like caches do, to a smaller extent)
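A minimal C sketch (not from the slides; plain scalar code rather than the vector ISA) of the loop that a single vector add would replace, including the loop-overhead instructions a vector instruction eliminates. The 64-element vector length is an assumption for illustration.

    /* Scalar loop that one vector add (ADDV V3, V1, V2) over 64-element
       vector registers would replace, together with its index/branch overhead. */
    #define VLEN 64   /* assumed maximum vector length */

    void add_scalar(double *z, const double *x, const double *y) {
        for (int i = 0; i < VLEN; i++)   /* per element: add, index update, branch */
            z[i] = x[i] + y[i];
    }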

5
Organization of a Vector Machine
6
Handling Vectors in Memory
  • LV V1 ← Mem[R1]
  • Loads an entire vector of data starting at
    location Mem[R1]
  • This looks a lot like a cache line fill operation
  • Can design the number of memory banks to reflect
    the vector size.
  • What about non-contiguous accesses?
  • Column access on a 2D array, elements out of a
    structure
  • LV V1 ← Mem[R1], R2
  • Loads vector starting at R1, with a stride of R2
    bytes
  • What about more complex accesses?
  • Indexed (scatter/gather) access
  • LV V1 ← Mem[R1], V2
  • V1[1] ← Mem[R1 + V2[1]], V1[2] ← Mem[R1 + V2[2]],
    etc. (see the C sketch below)
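A hedged C sketch of the three load patterns above: unit-stride, strided, and indexed (gather). The function names and the 64-element count are illustrative, and the stride is given in elements here rather than the bytes used by the slide.

    /* Unit-stride:    LV V1 ← Mem[R1] */
    void load_unit(double *v1, const double *base) {
        for (int i = 0; i < 64; i++) v1[i] = base[i];
    }

    /* Strided:        LV V1 ← Mem[R1], R2  (stride in elements, not bytes) */
    void load_strided(double *v1, const double *base, int stride) {
        for (int i = 0; i < 64; i++) v1[i] = base[i * stride];
    }

    /* Indexed/gather: LV V1 ← Mem[R1], V2 */
    void load_indexed(double *v1, const double *base, const int *v2) {
        for (int i = 0; i < 64; i++) v1[i] = base[v2[i]];
    }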

7
Pipelining Vectors
8
Chaining Vectors
  • Enable forwarding of vectors (DAXPY: Z = aX + Y;
    a C version follows this slide)
  • LV V1, R1         ; load X
  • LV V2, R2         ; load Y
  • MULSV V3, F0, V1  ; calculate aX
  • ADDV V4, V3, V2   ; calculate (aX) + Y
  • SV V4, R3         ; store at Z
  • How can we overlap instructions?
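For reference, a C version of the DAXPY loop that the five vector instructions above implement; the scalar a corresponds to F0, and the 64-element trip count is assumed for illustration.

    /* DAXPY: Z = a*X + Y. Chaining lets the ADDV consume MULSV results
       element by element instead of waiting for the whole product vector. */
    void daxpy(double *z, double a, const double *x, const double *y) {
        for (int i = 0; i < 64; i++)
            z[i] = a * x[i] + y[i];
    }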

9
Other Vector Issues
  • Compiler analysis to find vectorizable code
  • Determining vector length
  • Amdahl's law
  • Complexity
  • Code base
  • Image Processing, scientific code (genomes?),
    graphics (MMX)

10
VLIW Processors
  • What happens to hardware complexity if we make
    the microarchitecture (pipeline organization)
    visible to the programmer/compiler?
  • Scheduling is a software problem
  • Hazard detection is a software problem
  • Memory Scheduling is (mostly) a software problem
  • Speculation (branch prediction) is (mostly) a
    software problem
  • Hardware is simpler!
  • Compiler's/programmer's job is much harder (a
    scheduling sketch follows below)
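One way to picture the compiler's job, as a rough C sketch: unroll the loop so independent operations are exposed and could be packed into wide issue slots. The bundle groupings in the comments are hypothetical; real packing depends on the machine's functional units and latencies.

    /* Illustrative only: after unrolling by 2, the two iterations are
       independent, so a VLIW compiler could pack their loads, adds, and
       stores into shared wide instruction words (n assumed even). */
    void vliw_friendly_add(double *z, const double *x, const double *y, int n) {
        for (int i = 0; i < n; i += 2) {
            double x0 = x[i],     y0 = y[i];       /* these four loads could   */
            double x1 = x[i + 1], y1 = y[i + 1];   /* share load issue slots   */
            z[i]     = x0 + y0;                    /* adds and stores likewise */
            z[i + 1] = x1 + y1;
        }
    }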

11
Non-unit latency
  • No hazard detection
  • If we write code that reads R3, it means whatever
    is in R3 at that cycle.
  • Note that a superscalar will get the most recent
    definition (that is what the hazard detector
    checks for)
  • R1 ← 5
  • R1 ← 10
  • R2 ← R1 (5 or 10?)
  • It depends on the structure of the pipeline
    (which is known by the software)
  • Pipeline registers are visible to the compiler
    (but may not be accessed)

12
Decoupled Processors
  • Multiple processors, asynchronous queues (a threaded
    C sketch follows below)
  • P1: LD X[i] → P3
  • P2: LD Y[i] → P4
  • P3: Mul a, Mem → P4
  • P4: Add P3, Mem → Mem
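A rough sketch using POSIX threads (all names invented; not the dissertation architecture itself) of the decoupled idea: one thread plays the access processor, streaming operands into an asynchronous bounded queue, while another plays the execute processor consuming them, so latency on one side is absorbed by the queue rather than stalling the other.

    #include <pthread.h>
    #include <stdio.h>

    #define N 16
    #define QSIZE 4

    static double X[N], Z[N];
    static const double a = 2.0;

    /* Bounded queue: the asynchronous link between the two "processors". */
    static double q[QSIZE];
    static int head, tail, count;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

    static void q_push(double v) {
        pthread_mutex_lock(&lock);
        while (count == QSIZE) pthread_cond_wait(&not_full, &lock);
        q[tail] = v; tail = (tail + 1) % QSIZE; count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }

    static double q_pop(void) {
        pthread_mutex_lock(&lock);
        while (count == 0) pthread_cond_wait(&not_empty, &lock);
        double v = q[head]; head = (head + 1) % QSIZE; count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        return v;
    }

    /* "Access processor": performs the loads, streams X[i] into the queue. */
    static void *access_proc(void *arg) {
        (void)arg;
        for (int i = 0; i < N; i++) q_push(X[i]);
        return NULL;
    }

    /* "Execute processor": pops operands, computes, and stores the result. */
    static void *execute_proc(void *arg) {
        (void)arg;
        for (int i = 0; i < N; i++) Z[i] = a * q_pop();
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) X[i] = i;
        pthread_t t1, t2;
        pthread_create(&t1, NULL, access_proc, NULL);
        pthread_create(&t2, NULL, execute_proc, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("Z[5] = %g\n", Z[5]);   /* expect 10 */
        return 0;
    }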