Title: Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
1. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
- Martin C. Rinard
- University of California, Santa Barbara
2. Goal
- Automatically Parallelize Irregular, Object-Based Computations That Manipulate Dynamic, Linked Data Structures
3. Structure of Talk
- Model of Computation
- Graph Traversal Example
- Commutativity Testing
- Basic Technique
- Practical Extensions
- Advanced Techniques
- Synchronization Optimizations
- Experimental Results
- Future Research
4. Model of Computation

[Figure: a collection of objects and invoked operations; executing an operation takes an initial object state to a new object state.]
5. Example
- Weighted In-Degree Computation
- Weighted Graph With Weights On Edges
- Goal Is to Compute, For Each Node, Sum of Weights on All Incoming Edges
- Serial Algorithm: Marked Depth-First Traversal
6. Serial Code For Example

    class node {
      node *left, *right;
      int left_weight, right_weight;
      int sum;
      boolean marked;
      void traverse(int weight);
    };

    void node::traverse(int weight) {
      sum += weight;
      if (!marked) {
        marked = true;
        if (left != NULL) left->traverse(left_weight);
        if (right != NULL) right->traverse(right_weight);
      }
    }

- Goal: Execute left and right traverse Operations In Parallel
7. Parallel Traversal

[Figure: snapshot of the parallel traversal over the example graph; only the edge-weight labels survive extraction.]
8. Parallel Traversal

[Figure: later stages of the parallel traversal, with multiple operations in flight; only the edge-weight labels survive extraction.]
9. Traditional Approach
- Data Dependence Analysis
- Compiler Analyzes Reads and Writes
- Finds Independent Pieces of Code
- Independent Pieces of Code Execute in Parallel
- Demonstrated Success for Array-Based Programs
- Dense Matrices
- Affine Access Functions
10. Data Dependence Analysis in Example
- For Data Dependence Analysis to Succeed in Example
- left and right traverse Operations Must Be Independent
- left and right Subgraphs Must Be Disjoint
- Graph Must Be a Tree
- Depends on Global Topology of Data Structure
- Analyze Code that Builds Data Structure
- Extract and Propagate Topology Information
- Fails for Graphs - Computations Are Not Independent!
11. Commuting Operations In Parallel Traversal

[Figure: two traversal orders of the same graph reaching the same final node sums; only the edge-weight labels survive extraction.]
12. Commutativity Analysis
- Compiler Computes Extent of the Computation
- Representation of all Operations in Computation
- Algorithm Traverses Call Graph
- In Example: node::traverse
- Do All Pairs of Operations in Extent Commute?
- No - Generate Serial Code
- Yes - Generate Parallel Code
- In Example, All Pairs Commute
13. Generated Code In Example

Class Declaration:

    class node {
      lock mutex;
      node *left, *right;
      int left_weight, right_weight;
      int sum;
      boolean marked;
    };

Driver Version:

    void node::traverse(int weight) {
      parallel_traverse(weight);
      wait();
    }
14. Generated Code In Example

    void node::parallel_traverse(int weight) {
      mutex.acquire();        // critical region
      sum += weight;
      if (!marked) {
        marked = true;
        mutex.release();
        if (left != NULL)
          spawn(left->parallel_traverse(left_weight));
        if (right != NULL)
          spawn(right->parallel_traverse(right_weight));
      } else {
        mutex.release();
      }
    }
15. Properties of Commutativity Analysis
- Oblivious to Data Structure Topology
- Local Analysis
- Simple Analysis
- Suitable for a Wide Range of Programs
- Programs that Manipulate Lists, Trees and Graphs
- Commuting Updates to Central Data Structure
- General Reductions
- Incomplete Programs
- Introduces Synchronization
17. Separable Operations
- Each Operation Consists of Two Sections
- Object Section: Only Accesses Receiver Object
- Invocation Section: Only Invokes Operations
- Both Sections May Access Parameters and Local Variables
18. Commutativity Testing Conditions
- Do Two Operations A and B Commute?
- Compiler Must Consider Two Potential Execution Orders
- A executes before B
- B executes before A
- Compiler Must Check Two Conditions
- Instance Variables: In both execution orders, the new values of the instance variables are the same after the execution of the two object sections
- Invoked Operations: In both execution orders, the two invocation sections together directly invoke the same multiset of operations
19. Commutativity Testing Algorithm
- Symbolic Execution
- Compiler Executes Operations
- Computes with Expressions Instead of Values
- Compiler Symbolically Executes Operations
- In Both Execution Orders
- Expressions for New Values of Instance Variables
- Expressions for Multiset of Invoked Operations
20. Checking Instance Variables Condition
- Compiler Generates Two Symbolic Operations
- n->traverse(w1) and n->traverse(w2)
- In Order n->traverse(w1); n->traverse(w2)
- New Value of sum: (sum+w1)+w2
- New Value of marked: true
- In Order n->traverse(w2); n->traverse(w1)
- New Value of sum: (sum+w2)+w1
- New Value of marked: true
21. Checking Invoked Operations Condition
- In Order n->traverse(w1); n->traverse(w2)
- Multiset of Invoked Operations Is
- if (!marked && left != NULL) left->traverse(left_weight),
- if (!marked && right != NULL) right->traverse(right_weight)
- In Order n->traverse(w2); n->traverse(w1)
- Multiset of Invoked Operations Is
- if (!marked && left != NULL) left->traverse(left_weight),
- if (!marked && right != NULL) right->traverse(right_weight)
22. Expression Simplification and Comparison
- Compiler Applies Rewrite Rules to Simplify Expressions
- b+(a+c) → (a+b+c)
- a*(b+c) → (a*b)+(a*c)
- a+if(b<c,d,e) → if(b<c,a+d,a+e)
- Compiler Compares Corresponding Expressions
- If All Equal - Operations Commute
- If Not All Equal - Operations May Not Commute
23. Practical Extensions
- Exploit Read-Only Data
- Recognize When Computed Values Depend Only On
- Unmodified Instance Variables or Global Variables
- Parameters
- Represent Computed Values Using Opaque Constants
- Increases Set of Programs that Compiler Can Analyze
- Operations Can Freely Access Read-Only Data
- Coarsen Commutativity Testing Granularity
- Integrate Operations into Callers for Analysis Purposes
- Mechanism: Interprocedural Symbolic Execution
- Increases Effectiveness of Commutativity Testing
24. Advanced Techniques
- Relative Commutativity
- Recognize Commuting Operations That Generate Equivalent But Not Identical Data Structures
- Techniques for Operations that Contain Conditionals
- Distribute Conditionals Out of Expressions
- Test for Equivalence By Doing Case Analysis
- Techniques for Operations that Access Arrays
- Use Array Update Expressions to Represent New Values
- Rewrite Rules for Array Update Expressions
- Techniques for Operations that Execute Loops
25. Commutativity Testing for Operations With Loops
- Prerequisite: Represent Values Computed In Loops
- View Body of Loop as an Expression Transformer
- Input Expressions: Values Before Iteration Executes
- Output Expressions: Values After Iteration Executes
- Represent Values Computed In Loop Using Recursively Defined Symbolic Loop Modeling Functions
- int t = sum; for (i) t = t + a[i]; sum = t;
- s(e,0) = e;  s(e,i+1) = s(e,i) + a[i]
- New Value of sum: s(sum,n)
- Use Nested Induction Proofs to Determine Equivalence of Expressions With Symbolic Loop Modeling Functions
26. Important Special Case
- Independent Operations Commute
- Analysis in Current Compiler
- Dependence Analysis
- Operations on Objects of Different Classes
- Independent Operations on Objects of Same Class
- Symbolic Commutativity Testing
- Dependent Operations on Objects of Same Class
- Future
- Integrate Shape Analysis
- Integrate Array Data Dependence Analysis
28. Programming Model Extensions
- Extensions for Read-Only Data
- Allow Operations to Freely Access Read-Only Data
- Enhances Ability of Compiler to Represent Expressions
- Increases Set of Programs that Compiler Can Analyze
- Analysis Granularity Extensions
- Integrate Operations Into Callers for Analysis Purposes
- Coarsens Commutativity Testing Granularity
- Reduces Number of Pairs Tested for Commutativity
- Enhances Effectiveness of Commutativity Testing
29. Optimizations
- Parallel Loop Optimization
- Suppress Exploitation of Excess Concurrency
- Synchronization Optimizations
- Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data
- Lock Coarsening
- Replaces Multiple Mutual Exclusion Regions with Single Larger Mutual Exclusion Region
30. Synchronization Optimizations
31. Default Code Generation Strategy
- Each Object Has its Own Mutual Exclusion Lock
- Each Operation Acquires and Releases Lock
- Simple Lock Optimization: Eliminate Lock Constructs In Operations That Only Access Read-Only Data
32. Data Lock Coarsening Transformation
- Compiler Gives Multiple Objects the Same Lock
- Current Policy: Nested Objects Use the Lock in Enclosing Object
- Finds Sequences of Operations
- Access Different Objects
- Acquire and Release Same Lock
- Original Code
- Each Operation Acquires and Releases Lock
- Transformed Code
- Acquires Lock Once At Beginning of Sequence
- Releases Lock Once At End of Sequence
33. Data Lock Coarsening Example

Original Code:

    class vector {
      lock mutex;
      double val[NDIM];
      void add(double *v) {
        mutex.acquire();
        for (int i = 0; i < NDIM; i++) val[i] += v[i];
        mutex.release();
      }
    };

    class body {
      lock mutex;
      double phi;
      vector acc;
      void gravsub(body *b) {
        double p, v[NDIM];
        mutex.acquire();
        p = computeInter(b, v);
        phi -= p;
        mutex.release();
        acc.add(v);
      }
    };

Transformed Code:

    class vector {
      double val[NDIM];
      void add(double *v) {
        for (int i = 0; i < NDIM; i++) val[i] += v[i];
      }
    };

    class body {
      lock mutex;
      double phi;
      vector acc;
      void gravsub(body *b) {
        double p, v[NDIM];
        mutex.acquire();
        p = computeInter(b, v);
        phi -= p;
        acc.add(v);
        mutex.release();
      }
    };
34. Data Lock Coarsening Tradeoff
- Advantage
- Reduces Number of Executed Acquires and Releases
- Reduces Acquire and Release Overhead
- Disadvantage: May Cause False Exclusion
- Multiple Parallel Operations Access Different Objects
- But Operations Attempt to Acquire Same Lock
- Result: Operations Execute Serially
35. Computation Lock Coarsening Transformation
- Compiler Finds Sequences of Operations
- Acquire and Release Same Lock
- Transformed Code
- Acquires Lock Once at Beginning of Sequence
- Releases Lock Once at End of Sequence
- Result
- Replaces Multiple Mutual Exclusion Regions With One Large Mutual Exclusion Region
- Algorithm Based On Local Transformations
- Move Lock Acquire and Release To Become Adjacent
- Eliminate Adjacent Acquire and Release
36. Computation Lock Coarsening Example

Original Code:

    class body {
      lock mutex;
      double phi;
      vector acc;

      void gravsub(body *b) {
        double p, v[NDIM];
        mutex.acquire();
        p = computeInter(b, v);
        phi -= p;
        acc.add(v);
        mutex.release();
      }

      void loopsub(body *b[]) {
        int i;
        for (i = 0; i < N; i++)
          this->gravsub(b[i]);
      }
    };

Optimized Code:

    class body {
      lock mutex;
      double phi;
      vector acc;

      void gravsub(body *b) {
        double p, v[NDIM];
        p = computeInter(b, v);
        phi -= p;
        acc.add(v);
      }

      void loopsub(body *b[]) {
        int i;
        mutex.acquire();        // one acquire for the whole sequence
        for (i = 0; i < N; i++)
          this->gravsub(b[i]);
        mutex.release();
      }
    };
37. Computation Lock Coarsening Tradeoff
- Advantage
- Reduces Number of Executed Acquires and Releases
- Reduces Acquire and Release Overhead
- Disadvantage: May Introduce False Contention
- Multiple Processors Attempt to Acquire Same Lock
- Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region
38. Managing Tradeoff: Lock Coarsening Policies
- To Manage Tradeoff, Compiler Must Successfully
- Reduce Lock Overhead by Increasing Lock Granularity
- Avoid Excessive False Exclusion and False Contention
- Original Policy
- Use Original Lock Algorithm
- Bounded Policy
- Apply Transformation Unless Transformed Code
- Holds Lock During a Recursive Call, or
- Holds Lock During a Loop that Invokes Operations
- Aggressive Policy
- Always Apply Transformation
39. Choosing Best Policy
- Best Policy May Depend On
- Topology of Data Structures
- Dynamic Schedule Of Computation
- Information Required to Choose Best Policy Unavailable At Compile Time
- Complications
- Different Phases May Have Different Best Policy
- In Same Phase, Best Policy May Change Over Time
40. Use Dynamic Feedback to Choose Best Policy
- Sampling Phase Measures Overhead of Different Policies
- Production Phase Uses Best Policy From Sampling Phase
- Periodically Resample to Discover Changes in Best Policy
- Guaranteed Performance Bounds

[Figure: overhead over time, alternating sampling phases (which try the Original, Bounded, and Aggressive policies) with production phases that run the best-measured policy.]
42. Methodology
- Built Prototype Compiler for Subset of C++
- Built Run Time System for Shared Memory Machines
- Concurrency Generation and Task Management
- Dynamic Load Balancing and Synchronization
- Acquired Three Complete Applications
- Barnes-Hut
- Water
- String
- Automatically Parallelized Applications
- Ran Applications on Stanford DASH Machine
- Compare with Highly Tuned, Explicitly Parallel Versions
43. Major Assumptions and Restrictions
- Assumption: No Violation of Type Declarations
- Restrictions
- Restrictions
- Conceptually Significant
- No Virtual Functions
- No Function Pointers
- No Exceptions
- Operations Access Only
- Parameters
- Read-Only Data
- Data Members Declared in Class of the Receiver
- Implementation Convenience
- No Multiple Inheritance
- No Templates
- No union, struct or enum Types
- No typedef Declarations
- Global Variables Must Be of Class Types
- No Static Members
- No Default Arguments or Variable Numbers of Arguments
- No Numeric Casts
44. Applications
- Barnes-Hut
- O(N log N) N-Body Solver
- Space Subdivision Tree
- 1500 Lines of C++ Code
- Water
- Simulates Liquid Water
- O(N^2) Algorithm
- 1850 Lines of C++ Code
- String
- Computes Model of Geology Between Two Oil Wells
- 2050 Lines of C++ Code
45. Obtaining Serial C++ Version of Barnes-Hut
- Started with Explicitly Parallel Version (SPLASH-2)
- Removed Parallel Constructs to get Serial C
- Converted to Clean Object-Based C++
- Major Structural Changes
- Eliminated Scheduling Code and Data Structures
- Split a Loop in Force Computation Phase
- Introduced New Field into Particle Data Structure
46. Obtaining Serial C++ Version of Water
- Started with Serial C Translated from Fortran
- Converted to Clean Object-Based C++
- Major Structural Change
- Auxiliary Objects for O(N^2) Phases
47. Obtaining Serial C++ Version of String
- Started With Serial C Translated From Fortran
- Converted to Clean C++
- No Major Structural Changes
48. Performance Results for Barnes-Hut and Water

[Figure: two speedup curves, speedup vs. number of processors (0-16). Left: Barnes-Hut on DASH, 16K Particles. Right: Water on DASH, 512 Molecules.]
49. Performance Results for String

[Figure: speedup vs. number of processors for String on DASH, Big Well Model.]
50. Synchronization Optimizations
- Generated A Version of Each Application for Each Lock Coarsening Policy
- Original
- Bounded
- Aggressive
- Dynamic Feedback
- Ran Applications on Stanford DASH Machine
51. Lock Overhead
- Percentage of Time that the Single Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks

[Figure: percentage lock overhead per policy for Barnes-Hut (16K Particles), Water (512 Molecules), and String (Big Well Model) on DASH.]
52. Contention Overhead for Barnes-Hut
- Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

[Figure: contention percentage for Barnes-Hut on DASH, 16K Particles.]
53. Contention Overhead for Water
- Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

[Figure: contention percentage for the Aggressive, Bounded, and Original policies; Water on DASH, 512 Molecules.]
54. Contention Overhead for String
- Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

[Figure: contention percentage for the Aggressive and Original policies; String on DASH, Big Well Model.]
55. Performance Results for Barnes-Hut and Water

[Figure: speedup vs. number of processors (0-16) for the Ideal, Aggressive, Bounded, Dynamic Feedback, and Original versions. Left: Barnes-Hut on DASH, 16K Particles. Right: Water on DASH, 512 Molecules.]
56. Performance Results for String

[Figure: speedup vs. number of processors (0-16) for the Ideal, Dynamic Feedback, Aggressive, and Original versions; String on DASH, Big Well Model.]
57. New Directions For Parallelizing Compilers
- Presented Results
- Complete Computations
- Shared Memory Machines
- Larger Class of Hardware Platforms
- Clusters of Shared Memory Machines
- Fine-Grain Parallel Machines
- Larger Class of Computations
- Migratory Computations
- Persistent Data
- Multithreaded Servers
58. Migratory Computations
- Goal: Parallelize Computation
- Key Issues: Incomplete Programs, Dynamic Parallelization, Persistent Data
59. Multithreaded Servers

[Figure: server threads accepting requests and sending responses.]

- Goals: Better Throughput, Better Response Time
- Key Issues: Semantics of Communication and Resource Allocation Primitives, Consistency of Shared State
60. Future
- Key Trends
- Increasing Need For Parallel Computing
- Increasing Availability of Cheap Parallel Hardware
- Key Challenge
- Effectively Support Development of Robust, Efficient Parallel Software for Wide Range of Applications
- Successful Approach Will
- Leverage Structure in Modern Programming Paradigms
- Use Advanced Compiler Technology
- Provide A Simpler, More Effective Programming Model
61. Conclusion
- New Analysis Framework for Parallelizing Compilers
- Commutativity Analysis
- New Class of Applications for Parallelizing Compilers
- Irregular Computations
- Dynamic, Linked Data Structures
- Current Status
- Implemented Prototype
- Good Results
- Future
- New Hardware Platforms
- New Application Domains