1
Multiscalar processors
  • Gurindar S. Sohi
  • Scott E. Breach
  • T.N. Vijaykumar
  • University of Wisconsin-Madison

2
Outline
  • Motivation
  • Multiscalar paradigm
  • Multiscalar architecture
  • Software and hardware support
  • Distribution of cycles
  • Results
  • Conclusion

3
Motivation
  • Current architectural techniques are reaching their limits
  • The amount of ILP that a superscalar processor can extract is limited
  • Kunle Olukotun (Stanford University)

4
Limits of ILP
  • The parallelism that can be extracted from a single program is very limited: about 4 or 5 in integer programs
  • Limits of Instruction-Level Parallelism, David W. Wall (1990)

5
Limitations of superscalar
  • Branch prediction accuracy limits ILP
  • Roughly every 5th instruction is a branch
  • Executing an instruction across 5 branches leads to a useful result only about 60% of the time (with 90% branch prediction accuracy); see the arithmetic below
  • Some branches are difficult to predict, so increasing the window size does not always mean executing useful instructions
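As a quick check of that 60% figure, assuming the five branch predictions are independent and each 90% accurate:

    0.90^5 ≈ 0.59, i.e. roughly 60% of such instructions are on the correct path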

6
Limitations of superscalar (contd.)
  • Large window size
  • Issuing more instructions per cycle requires a large window of instructions
  • Each cycle, the whole window must be searched for instructions to issue
  • This increases the pipeline length
  • Issue complexity
  • To issue an instruction, dependence checks must be performed against the other issuing instructions
  • Issuing n instructions thus has O(n²) complexity, as the sketch below illustrates
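A minimal C sketch of why issue checking scales quadratically: each of the n instructions issuing in a cycle is compared against every other for a register dependence. The instruction encoding here is invented for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int dst, src1, src2; } Inst;

    /* true if b reads the register that a writes (a RAW dependence) */
    static bool raw_dep(Inst a, Inst b) {
        return b.src1 == a.dst || b.src2 == a.dst;
    }

    int main(void) {
        /* four instructions trying to issue in the same cycle */
        Inst group[] = { {1, 8, 9}, {2, 1, 8}, {3, 2, 1}, {4, 3, 2} };
        int n = 4, checks = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {  /* every pair is compared */
                if (raw_dep(group[i], group[j]))
                    printf("inst %d depends on inst %d\n", j, i);
                checks++;
            }
        printf("%d pairwise checks for %d instructions (~n^2/2)\n", checks, n);
        return 0;
    }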

7
Limitations of superscalar (contd.)
  • Load and store queue limitations
  • Loads and stores cannot be reordered before their addresses are known
  • One load or store waiting for its address can block the entire processor

8
Superscalar limitation example
  • Consider the following hypothetical loop:

      Iter 1: inst 1, inst 2, ..., inst n
      Iter 2: inst 1, inst 2, ...

  • If the window size is less than n, a superscalar considers only one iteration at a time
  • Possible improvement: overlap the iterations (a concrete instance is shown below)

      Iter 1    Iter 2
      inst 1    inst 1
      inst 2    inst 2
      ...       ...
      inst n    inst n
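A concrete (invented) C instance of the pattern above: the iterations are fully independent and could overlap, but a superscalar whose window is smaller than one iteration's body never sees the second iteration.

    #include <stdio.h>
    #define N 8

    int main(void) {
        int a[N], b[N], sum[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        /* iteration i does not depend on iteration i-1, so iter 1 and
           iter 2 could execute side by side on separate resources */
        for (int i = 0; i < N; i++)
            sum[i] = a[i] + b[i];

        printf("sum[%d] = %d\n", N - 1, sum[N - 1]);
        return 0;
    }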

9
Multiscalar paradigm
  • Divide the program (its CFG) into multiple tasks (not necessarily parallel)
  • Execute the tasks on different processing elements residing on the same die, so communication cost is low
  • Sequential semantics is preserved by hardware and software mechanisms
  • Tasks are re-executed if there are any violations (a toy model follows below)
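A toy software model of that flow, to make it concrete. The PE ring, task granularity, and the violation test are all invented for illustration; in hardware the tasks on different PEs actually run concurrently rather than in this serial loop.

    #include <stdio.h>

    #define NUM_PES   4
    #define NUM_TASKS 8

    /* pretend task 3 violates a memory dependence on its first attempt */
    static int violates(int task, int attempt) {
        return task == 3 && attempt == 0;
    }

    int main(void) {
        for (int task = 0; task < NUM_TASKS; task++) {
            int pe = task % NUM_PES;           /* assign in ring order   */
            int attempt = 0;
            while (violates(task, attempt)) {  /* squash and re-execute  */
                printf("task %d squashed on PE %d\n", task, pe);
                attempt++;
            }
            printf("task %d committed from PE %d\n", task, pe);
        }
        return 0;
    }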

10
Crossing the limits of superscalar
  • Branch prediction
  • Each thread executes independently
  • Each thread is still limited by branch prediction, but the number of useful instructions available is much larger than in a superscalar
  • Window size
  • Each processing element has its own window
  • The total size of the windows on a die can be very large, while each individual window stays moderate in size

11
Crossing the limits of superscalar (contd.)
  • Issue complexity
  • Each processing element issues only a few instructions, which simplifies the logic
  • Loads and stores
  • Loads and stores can execute without waiting for the previous thread's loads or stores

12
Multiscalar architecture
  • A possible microarchitecture (figure not reproduced in this transcript)

13
Multiscalar execution
  • The sequencer walks over the CFG
  • According to hints inserted in the code, it assigns tasks to PEs
  • PEs execute the tasks in parallel
  • Sequential semantics is maintained for:
  • Register dependencies
  • Memory dependencies
  • Tasks are assigned in ring order and committed in ring order

14
Register Dependencies
  • Register dependencies can be identified easily by the compiler
  • Dependencies are always synchronized
  • The registers that a task may write are recorded in a create mask
  • Reservations are created in the successor tasks using the accum mask
  • If a reservation exists (the value has not yet arrived), an instruction reading that register waits; see the bitmask sketch below
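A bitmask sketch of the reservation check in C, assuming 32 architectural registers; the exact encoding of the create and accum masks here is invented, not the paper's.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t RegMask;   /* bit r set = architectural register r */

    int main(void) {
        RegMask create  = (1u << 3) | (1u << 7); /* predecessor may write r3, r7  */
        RegMask arrived = (1u << 7);             /* r7 has already been forwarded */

        for (int r = 0; r < 8; r++) {
            RegMask bit = 1u << r;
            /* a read waits if the register is in the (accumulated) create
               mask and its value has not yet arrived */
            if ((create & bit) && !(arrived & bit))
                printf("read of r%d waits for its predecessor\n", r);
        }
        return 0;
    }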

15
Memory dependencies
  • Cannot be found statically
  • Multiscalar uses an aggressive approach: always speculate
  • Loads do not wait for stores in the predecessor tasks
  • Hardware checks for violations, and a task is re-executed if it violates any memory dependency

16
Task commit
  • Speculative tasks are not allowed to modify memory
  • Store values are buffered in hardware
  • When a processing element becomes the head, it retires its values into memory
  • To maintain sequential semantics, tasks retire in order through the ring arrangement of processing elements, as sketched below
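A minimal sketch of in-order retirement around the ring; the completion pattern below is made up. A finished PE that is not yet the head simply holds its buffered stores.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_PES 4

    int main(void) {
        bool done[NUM_PES] = { true, false, true, true }; /* PE 1 still running */
        int head = 0;

        while (done[head]) {                /* only the head may retire     */
            printf("PE %d retires its buffered stores to memory\n", head);
            done[head] = false;             /* slot is freed for a new task */
            head = (head + 1) % NUM_PES;
        }
        printf("PE %d has not finished; later PEs hold their stores\n", head);
        return 0;
    }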

17
Compiler support
  • Structure of the CFG
  • The sequencer needs information about tasks
  • The compiler or an assembly-code analyzer marks the structure of the CFG (task boundaries)
  • The sequencer walks through this information

18
Compiler support (contd.)
  • Communication information
  • The create mask is given as part of the task header
  • The forward and stop bits are set
  • A register value is forwarded when its forward bit is set
  • A task is done when it reaches a stop bit
  • Release information must also be given

19
Hardware support
  • Speculative values need to be buffered
  • Memory dependence violations need to be detected
  • When a speculative thread loads a value, its address is recorded in the ARB (Address Resolution Buffer)
  • When a thread stores to a location, the ARB is checked to see whether a later thread has already loaded from the same location
  • The speculative values are also buffered; a toy model follows below
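A toy C model of that check; the real ARB is an associative hardware structure, and all names here are invented.

    #include <stdio.h>

    #define MAX_LOADS 64

    static struct { unsigned addr; int task; } loads[MAX_LOADS];
    static int num_loads = 0;

    /* a speculative load records its address and task id */
    static void record_load(unsigned addr, int task) {
        loads[num_loads].addr = addr;
        loads[num_loads].task = task;
        num_loads++;
    }

    /* a store checks for a later task that already loaded this address;
       returns that task's id (it must be squashed), or -1 if none */
    static int check_store(unsigned addr, int task) {
        for (int i = 0; i < num_loads; i++)
            if (loads[i].addr == addr && loads[i].task > task)
                return loads[i].task;
        return -1;
    }

    int main(void) {
        record_load(0x1000, 2);              /* task 2 loads speculatively */
        int victim = check_store(0x1000, 1); /* earlier task 1 then stores */
        if (victim >= 0)
            printf("violation: squash task %d and its successors\n", victim);
        return 0;
    }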

20
Cycle distribution
  • Best scenario: every processing element always does useful work; this never happens
  • Possible wastage:
  • Non-useful computation
  • The task is squashed later due to an incorrect value or incorrect prediction
  • No computation
  • Waiting for some dependency to be resolved
  • Waiting to commit its result
  • Remaining idle
  • No task assigned

21
Non-useful computation
  • Synchronization of memory values
  • Squashes usually occur on global or static data values
  • This dependency is easy to predict
  • Explicit synchronization can be inserted to eliminate squashes due to these dependencies
  • Early validation of prediction
  • For example, loop-exit testing can be done at the beginning of the iteration, as sketched below
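A small C sketch of that loop-exit example: the exit test is placed at the top of the iteration, so a task started on a mispredicted extra iteration is squashed before it does a whole iteration of work. The loop itself is invented.

    #include <stdio.h>

    int main(void) {
        int n = 4;
        for (int i = 0; ; i++) {
            if (i >= n)     /* exit test first: validates the "one more   */
                break;      /* iteration" prediction before the rest of   */
                            /* the task executes                          */
            printf("iteration %d does its work\n", i);
        }
        return 0;
    }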

22
No computation
  • Intra-task dependences
  • These can be eliminated through a variety of hardware and software techniques
  • Inter-task dependences
  • There is scope for scheduling to reduce the wait time
  • Load balancing
  • Tasks retire in order
  • Some tasks finish quickly and then wait a long time to become the head task

23
Differences with other paradigms
  • A major improvement over superscalar
  • VLIW is limited by what static optimization can achieve
  • Multiprocessor
  • Very similar in structure
  • But communication cost is much lower
  • This enables fine-grained thread parallelism

24
Methodology
  • A simulator that executes MIPS code
  • 5-stage pipeline
  • The sequencer has a 1024-entry direct-mapped cache of task descriptors

25
Results
  • (results figure not reproduced in this transcript)
26
Results
  • Compress: long critical path
  • Eqntott and cmppt: parallel loops with good coverage
  • Espresso: one loop has a load-balancing issue
  • Sc: also has load imbalance
  • Tomcatv: good parallel loops
  • Cmp and wc: intra-task dependences

27
Conclusion
  • The multiscalar paradigm has very good potential
  • It tackles the major limits of superscalar
  • There is plenty of scope for compiler and hardware optimizations
  • The paper gives a good introduction to the paradigm and also discusses the major optimization opportunities

28
Discussion
29
BREAK!