The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops

parasol.tamu.edu – PowerPoint PPT presentation

Slides: 28
Provided by: Francis159

1
The R-LRPD Test: Speculative Parallelization of
Partially Parallel Loops
  • Francis Dang, Hao Yu, and Lawrence Rauchwerger
  • Department of Computer Science
  • Texas A&M University

2
Motivation
  • To maximize performance, extract the maximum
    available parallelism from loops.
  • Static compiler methods may be insufficient.
      • Access patterns may be too complex.
      • Required information is only available at
        run time.
  • Run-time methods are needed to extract loop
    parallelism.
      • Inspector/Executor
      • Speculative Parallelization

3
Speculative Parallelization: LRPD Test
  • Main Idea
      • Execute a loop as a DOALL.
      • Record memory references during execution.
      • Check for data dependences.
      • If there was a dependence, re-execute the loop
        sequentially.
  • Disadvantages
      • One data dependence can invalidate speculative
        parallelization.
      • Slowdown is proportional to speculative parallel
        execution time.
      • Partial parallelism is not exploited.
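
The speculate, test, and fall back cycle above can be sketched in Python. This is a simplified simulation, not the authors' implementation: the conflict check is direction-insensitive (it flags any element written in one iteration and read in another), the loop body is assumed to be the read/write pair used in the example loop later in the deck, and the initial values in A0 and C are made up for illustration.

```python
def run_sequential(A, K, L, C):
    """Reference execution: iterations run in order on one copy of A."""
    A = list(A)
    for k, l, c in zip(K, L, C):
        A[l] = A[k] + c
    return A


def run_speculative_doall(A, K, L, C):
    """Simulated DOALL: every iteration reads the pre-loop array, and the
    highest-numbered writer of each element supplies its final value."""
    out = list(A)
    for k, l, c in zip(K, L, C):
        out[l] = A[k] + c  # reads come from the untouched input A
    return out


def has_cross_iteration_conflict(K, L):
    """True if some element is written in one iteration and read in another."""
    readers, writers = {}, {}
    for it, (k, l) in enumerate(zip(K, L)):
        readers.setdefault(k, set()).add(it)
        writers.setdefault(l, set()).add(it)
    return any(r != w
               for elem, ws in writers.items()
               for w in ws
               for r in readers.get(elem, ()))


def lrpd_style_run(A, K, L, C):
    """Speculate, test, and fall back to sequential execution on failure."""
    speculative = run_speculative_doall(A, K, L, C)
    if has_cross_iteration_conflict(K, L):
        return run_sequential(A, K, L, C), False  # speculation discarded
    return speculative, True


A0 = [0, 10, 20, 30, 40, 50]   # assumed data; index 0 unused (1-based indices)
K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 3, 5, 3, 3]
C = [1] * 8                    # assumed data
result, passed = lrpd_style_run(A0, K, L, C)
```

With these index arrays the test fails, so all eight iterations are redone sequentially even though most of them were independent. That wasted work is the disadvantage the slide lists.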

4
Partially Parallel Loop Example
do i = 1, 8
  z = A(K(i))
  A(L(i)) = z + C(i)
end do
K(1:8) = (1,2,3,1,4,2,1,1)
L(1:8) = (4,5,5,4,3,5,3,3)

iter   1  2  3  4  5  6  7  8
A(1)   R        R        R  R
A(2)      R           R
A(3)         R     W     W  W
A(4)   W        W  R
A(5)      W  W        W

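
To make the access table concrete, the sketch below derives the flow (write-then-read) dependences directly from the index arrays K and L. It assumes each iteration reads A(K(i)) before writing A(L(i)); the function name and the 1-based iteration numbering are choices made here for illustration.

```python
def flow_dependences(K, L):
    """Return (source_iter, sink_iter, element) triples, 1-based, where the
    sink iteration reads an element last written by the source iteration."""
    last_writer = {}
    deps = []
    for it, (k, l) in enumerate(zip(K, L), start=1):
        if k in last_writer:          # the read happens before this iteration's write
            deps.append((last_writer[k], it, k))
        last_writer[l] = it
    return deps


K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 3, 5, 3, 3]
deps = flow_dependences(K, L)   # [(4, 5, 4)]: iteration 5 reads A(4) written by iteration 4
```

Only one flow dependence exists in this loop, so iterations 1 to 4 and 5 to 8 could each run in parallel. The remaining conflicts in the table (for example A(3), read in iteration 3 and written later) are anti or output dependences.
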
5
The Recursive LRPD
  • Main Idea
      • Transform a partially parallel loop into a
        sequence of fully parallel, block-scheduled
        loops.
      • Iterations before the first data dependence are
        correct and committed.
      • Re-apply the LRPD test on the remaining
        iterations.
  • Worst case
      • Sequential time plus testing overhead
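
A minimal simulation of the recursive scheme, under stated assumptions: iterations are block-scheduled once onto p processors, a block fails if it reads an element written by an earlier block in the same stage (and not yet written by itself), and everything before the first failed block is committed. Function names and the 0-based, half-open iteration ranges are illustrative, not taken from the presentation.

```python
def split_blocks(n, p):
    """Split iterations 0..n-1 into at most p contiguous, near-equal blocks."""
    base, extra = divmod(n, p)
    blocks, start = [], 0
    for q in range(p):
        size = base + (1 if q < extra else 0)
        if size:
            blocks.append((start, start + size))
            start += size
    return blocks


def first_failed_block(blocks, K, L):
    """Index of the first block that reads data written by an earlier block
    of the same stage, or None if the stage is fully parallel."""
    written = set()
    for b, (s, e) in enumerate(blocks):
        mine = set()
        for i in range(s, e):
            if K[i] in written and K[i] not in mine:
                return b
            mine.add(L[i])
        written |= mine
    return None


def r_lrpd_stages(K, L, p):
    """Return the committed iteration ranges, one (start, end) per stage."""
    n = len(K)
    blocks = split_blocks(n, p)
    stages = []
    while blocks:
        failed = first_failed_block(blocks, K, L)
        if failed is None:
            stages.append((blocks[0][0], n))
            break
        stages.append((blocks[0][0], blocks[failed][0]))
        blocks = blocks[failed:]      # re-test only the uncommitted blocks
    return stages


K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 3, 5, 3, 3]
stages = r_lrpd_stages(K, L, p=4)     # [(0, 4), (4, 8)]
```

The first block of a stage can never fail, so every stage commits at least one block. A loop whose every iteration depends on the previous one degenerates to one block per stage, which is the sequential worst case noted above.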

6
Algorithm
7
Implementation
  • Implemented as a run-time pass in Polaris, with
    additional hand-inserted code.
  • Privatization with copy-in/copy-out for arrays
    under test.
  • Replicated buffers for reductions.
  • Backup arrays for checkpointing.

8
Recursive LRPD Example
do i = 1, 8
  z = A(K(i))
  A(L(i)) = z + C(i)
end do
K(1:8) = (1,2,3,1,4,2,1,1)
L(1:8) = (4,5,5,4,2,5,3,3)

9
Heuristics
  • Work Redistribution
  • Sliding Window Approach
  • Data Dependence Graph Extraction

10
Work Redistribution
  • Redistribute remaining iterations across
    processors.
  • Execution time for each stage will decrease.
  • Disadvantages
      • May uncover new dependences across processors.
      • May incur remote cache misses from data
        redistribution.
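
The trade-off can be seen in a small simulation that runs the recursive test with and without redistribution. It is a sketch under assumptions: a block fails when it reads an element written by an earlier block of the same stage, cache effects are ignored, and the `redistribute` flag and helper names are hypothetical.

```python
def split_blocks(lo, hi, p):
    """Split iterations lo..hi-1 into at most p contiguous, near-equal blocks."""
    base, extra = divmod(hi - lo, p)
    blocks, start = [], lo
    for q in range(p):
        size = base + (1 if q < extra else 0)
        if size:
            blocks.append((start, start + size))
            start += size
    return blocks


def first_failed_block(blocks, K, L):
    """Index of the first block reading data written by an earlier block."""
    written = set()
    for b, (s, e) in enumerate(blocks):
        mine = set()
        for i in range(s, e):
            if K[i] in written and K[i] not in mine:
                return b
            mine.add(L[i])
        written |= mine
    return None


def r_lrpd(K, L, p, redistribute):
    """Return, for each stage, the list of blocks that were executed."""
    n = len(K)
    blocks = split_blocks(0, n, p)
    stages = []
    while blocks:
        stages.append(list(blocks))
        failed = first_failed_block(blocks, K, L)
        if failed is None:
            break
        remaining = blocks[failed:]
        if redistribute:              # spread the uncommitted work over all p
            blocks = split_blocks(remaining[0][0], n, p)
        else:                         # failed processors redo their own blocks
            blocks = remaining
    return stages


def critical_path(stages):
    """Iterations on the longest block of each stage, summed over stages."""
    return sum(max(e - s for s, e in stage) for stage in stages)


K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 2, 5, 3, 3]
no_rd = r_lrpd(K, L, p=4, redistribute=False)   # 2 stages
with_rd = r_lrpd(K, L, p=4, redistribute=True)  # 3 stages
```

On this eight-iteration example redistribution shortens each later stage but separates iterations 5 and 6 onto different processors, which exposes the dependence on A(2) and costs an extra stage. In this model both variants end up with the same critical path of 4 iterations.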

11
Work Redistribution Example
do i = 1, 8
  z = A(K(i))
  A(L(i)) = z + C(i)
end do
K(1:8) = (1,2,3,1,4,2,1,1)
L(1:8) = (4,5,5,4,2,5,3,3)

12
Redistribution Model
  • Redistribution may not always be beneficial.
  • Stop redistribution if the cost of data
    redistribution outweighs the benefit from work
    redistribution.
  • A synthetic loop is used to model this adaptive
    method.
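
One way to picture the stopping rule is a cost comparison made before each stage. The model below is purely hypothetical and is not the model used in the presentation: it assumes uniform work per iteration and a fixed data-movement cost for every iteration a processor receives, and all parameter names are invented for illustration.

```python
from math import ceil


def should_redistribute(remaining, active, total, work_per_iter, move_cost_per_iter):
    """Compare the estimated time of the next stage with and without redistribution.

    remaining          -- uncommitted iterations
    active             -- processors that still hold uncommitted blocks
    total              -- all available processors
    work_per_iter      -- assumed uniform compute cost of one iteration
    move_cost_per_iter -- assumed extra cost of fetching one iteration's data
    """
    time_without = ceil(remaining / active) * work_per_iter
    time_with = ceil(remaining / total) * (work_per_iter + move_cost_per_iter)
    return time_with < time_without


many_left = should_redistribute(1000, 2, 8, 10.0, 1.0)   # large remaining work
few_left = should_redistribute(8, 7, 8, 1.0, 5.0)        # little work, costly moves
```

Under these assumptions redistribution pays off while a lot of work remains on few processors, and stops paying off once the data-movement term dominates.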

13
Redistribution Model
14
Sliding Window R-LRPD
  • R-LRPD can generate a sequential schedule for
    long dependence distributions.
  • Strip-mine the speculative execution.
  • Apply the R-LRPD on a contiguous block of
    iterations.
  • Only dependences within the window cause
    failures.
  • Adds more global synchronizations and test
    overhead.
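
A sketch of the sliding-window idea, assuming a window of p blocks with `chunk` iterations each and the same block-failure rule as in the basic recursive test (a block fails if it reads an element written by an earlier block of the current window). The names and the 0-based, half-open ranges are illustrative.

```python
def first_failed_block(blocks, K, L):
    """Index of the first block reading data written by an earlier block."""
    written = set()
    for b, (s, e) in enumerate(blocks):
        mine = set()
        for i in range(s, e):
            if K[i] in written and K[i] not in mine:
                return b
            mine.add(L[i])
        written |= mine
    return None


def sliding_window_stages(K, L, p, chunk):
    """Return the committed (start, end) iteration range of every stage."""
    n, lo, stages = len(K), 0, []
    while lo < n:
        hi = min(lo + p * chunk, n)                       # current window
        blocks = [(s, min(s + chunk, hi)) for s in range(lo, hi, chunk)]
        failed = first_failed_block(blocks, K, L)
        commit_to = hi if failed is None else blocks[failed][0]
        stages.append((lo, commit_to))
        lo = commit_to                                    # slide the window
    return stages


K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 2, 5, 3, 3]
stages = sliding_window_stages(K, L, p=2, chunk=1)
# [(0, 2), (2, 4), (4, 5), (5, 7), (7, 8)]
```

The dependence from iteration 4 to iteration 5 (1-based) crosses a window boundary and causes no failure, while the one from iteration 5 to 6 falls inside a window and does. The price is five stages, each ending in a global synchronization and a test.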

15
DDG Extraction
  • R-LRPD can generate sequential schedules for
    complex dependence distributions.
  • Use the SW R-LRPD scheme to extract the data
    dependence graph (DDG).
  • Generate an optimized schedule from the DDG.
  • Obtains the DDG for loops from which a proper
    inspector cannot be extracted.
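
Once a dependence graph is available, one common way to turn it into a schedule is wavefront (level) scheduling: each iteration is placed one level after its latest predecessor, and all iterations in a level run in parallel. The sketch below illustrates that general technique and is not necessarily the optimization used by the authors; it assumes the edges are flow dependences pointing from a lower to a higher iteration number.

```python
def wavefronts(n, edges):
    """Group iterations 0..n-1 into levels that can each run as a DOALL.

    edges -- (source, sink) pairs with source < sink
    """
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, []).append(src)
    level = [0] * n
    for i in range(n):   # edges point forward, so index order is topological
        level[i] = 1 + max((level[s] for s in preds.get(i, [])), default=-1)
    fronts = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        fronts[lv].append(i)
    return fronts


# 0-based edges for a loop whose iteration 4 depends on 3 and 5 depends on 4
schedule = wavefronts(8, [(3, 4), (4, 5)])
# [[0, 1, 2, 3, 6, 7], [4], [5]]
```

Here six of the eight iterations run in the first wavefront, whereas a block-scheduled test would stop committing at the first dependence.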

16
Performance Issues
  • Performance issues
      • Blocked scheduling is a potential cause of load
        imbalance.
      • Checkpointing can be expensive.
  • Feedback-guided blocked scheduling
      • Use the timing information from the previous
        instantiation (Bull, EuroPar 98).
      • Estimate the processor chunk sizes for minimal
        load imbalance.
  • On-Demand Checkpointing
      • Checkpoint only data modified during execution.
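
Both remedies can be sketched briefly. The first function cuts the iteration space into contiguous blocks of roughly equal measured time, assuming per-iteration timings from a previous run of the loop are available. The class saves an element's old value only on its first speculative write. All names are hypothetical and the code is an illustration, not the Polaris implementation.

```python
def balanced_blocks(iter_times, p):
    """Cut iterations into at most p contiguous blocks of similar total time."""
    target = sum(iter_times) / p
    blocks, acc, start = [], 0.0, 0
    for i, t in enumerate(iter_times):
        acc += t
        # close a block each time the running total passes the next multiple
        if acc >= target * (len(blocks) + 1) and len(blocks) < p - 1:
            blocks.append((start, i + 1))
            start = i + 1
    blocks.append((start, len(iter_times)))
    return [b for b in blocks if b[0] < b[1]]   # drop an empty tail block


class CheckpointedArray:
    """Array wrapper that backs up an element only when it is first modified."""

    def __init__(self, data):
        self.data = list(data)
        self.saved = {}                          # index -> pre-speculation value

    def write(self, idx, value):
        self.saved.setdefault(idx, self.data[idx])   # keep the oldest value only
        self.data[idx] = value

    def rollback(self):
        for idx, old in self.saved.items():
            self.data[idx] = old
        self.saved.clear()

    def commit(self):
        self.saved.clear()


blocks = balanced_blocks([4, 4, 1, 1, 1, 1, 2, 2], p=2)   # [(0, 2), (2, 8)]

arr = CheckpointedArray([1, 2, 3])
arr.write(1, 20)
arr.write(1, 30)
backed_up = dict(arr.saved)                               # {1: 2}
arr.rollback()
```

An even split of the eight iterations would give blocks costing 10 and 6 time units; the feedback-guided cut gives 8 and 8. The checkpoint holds one entry even though the element was written twice, so the backup cost tracks the data actually modified rather than the array size.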

17
Experiments
  • Setup
      • 16-processor HP V-Class
      • 4 GB memory
      • HP-UX 11.0

18
Experimental Results - Input Profiles
19
Experimental Results - TRACK
20
Experimental Results - TRACK
21
Experimental Results - TRACK
22
Experimental Results - TRACK
23
Experimental Results - Sliding Window
24
Experimental Results - Sliding Window
25
Experimental Results - FMA3D
26
Experimental Results - SPICE 2G6
27
Conclusion
  • Contribution
      • Can speculatively parallelize any loop.
      • The concern is now how to optimize the
        parallelization, not when to parallelize.
  • Future work
      • Use dependence distribution information for
        adaptive redistribution and scheduling.