Title: Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
1. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
- Martin C. Rinard
- University of California, Santa Barbara
2. Goal
- Automatically Parallelize Irregular, Object-Based Computations That Manipulate Dynamic, Linked Data Structures
3. Structure of Talk
- Model of Computation
- Graph Traversal Example
- Commutativity Testing
- Basic Technique
- Practical Extensions
- Advanced Techniques
- Synchronization Optimizations
- Experimental Results
- Future Research
4. Model of Computation

[Figure: a collection of objects and invoked operations; executing an operation takes an initial object state to a new object state.]
5. Example
- Weighted In-Degree Computation
- Weighted Graph With Weights On Edges
- Goal Is to Compute, For Each Node, Sum of Weights on All Incoming Edges
- Serial Algorithm: Marked Depth-First Traversal
6. Serial Code For Example

    class node {
      node *left, *right;
      int left_weight, right_weight;
      int sum;
      boolean marked;
      void traverse(int weight);
    };

    void node::traverse(int weight) {
      sum += weight;
      if (!marked) {
        marked = true;
        if (left != NULL) left->traverse(left_weight);
        if (right != NULL) right->traverse(right_weight);
      }
    }

- Goal: Execute left and right traverse Operations In Parallel
7. Parallel Traversal

[Figure: snapshot of the parallel traversal over the example graph; only the edge-weight labels survive extraction.]
8. Parallel Traversal

[Figure: later stages of the parallel traversal, with multiple operations in flight; only the edge-weight labels survive extraction.]
9. Traditional Approach
- Data Dependence Analysis
- Compiler Analyzes Reads and Writes
- Finds Independent Pieces of Code
- Independent Pieces of Code Execute in Parallel
- Demonstrated Success for Array-Based Programs
- Dense Matrices
- Affine Access Functions
10. Data Dependence Analysis in Example
- For Data Dependence Analysis to Succeed in Example
- left and right traverse Operations Must Be Independent
- left and right Subgraphs Must Be Disjoint
- Graph Must Be a Tree
- Depends on Global Topology of Data Structure
- Analyze Code that Builds Data Structure
- Extract and Propagate Topology Information
- Fails for Graphs - Computations Are Not Independent!
11. Commuting Operations In Parallel Traversal

[Figure: two traversal orders of the same graph reaching the same final node sums; only the edge-weight labels survive extraction.]
12. Commutativity Analysis
- Compiler Computes Extent of the Computation
- Representation of all Operations in Computation
- Algorithm Traverses Call Graph
- In Example: node::traverse
- Do All Pairs of Operations in Extent Commute?
- No - Generate Serial Code
- Yes - Generate Parallel Code
- In Example, All Pairs Commute
13. Generated Code In Example

Class Declaration:

    class node {
      lock mutex;
      node *left, *right;
      int left_weight, right_weight;
      int sum;
      boolean marked;
    };

Driver Version:

    void node::traverse(int weight) {
      parallel_traverse(weight);
      wait();
    }
14. Generated Code In Example

    void node::parallel_traverse(int weight) {
      mutex.acquire();        // critical region
      sum += weight;
      if (!marked) {
        marked = true;
        mutex.release();
        if (left != NULL)
          spawn(left->parallel_traverse(left_weight));
        if (right != NULL)
          spawn(right->parallel_traverse(right_weight));
      } else {
        mutex.release();
      }
    }
15. Properties of Commutativity Analysis
- Oblivious to Data Structure Topology
- Local Analysis
- Simple Analysis
- Suitable for a Wide Range of Programs
- Programs that Manipulate Lists, Trees and Graphs
- Commuting Updates to Central Data Structure
- General Reductions
- Incomplete Programs
- Introduces Synchronization
17. Separable Operations
- Each Operation Consists of Two Sections
- Object Section: Only Accesses Receiver Object
- Invocation Section: Only Invokes Operations
- Both Sections May Access Parameters and Local Variables
18. Commutativity Testing Conditions
- Do Two Operations A and B Commute?
- Compiler Must Consider Two Potential Execution Orders
- A executes before B
- B executes before A
- Compiler Must Check Two Conditions
- Instance Variables: In both execution orders, the new values of the instance variables are the same after the execution of the two object sections
- Invoked Operations: In both execution orders, the two invocation sections together directly invoke the same multiset of operations
19. Commutativity Testing Algorithm
- Symbolic Execution
- Compiler Executes Operations
- Computes with Expressions Instead of Values
- Compiler Symbolically Executes Operations
- In Both Execution Orders
- Expressions for New Values of Instance Variables
- Expressions for Multiset of Invoked Operations
20. Checking Instance Variables Condition
- Compiler Generates Two Symbolic Operations
- n->traverse(w1) and n->traverse(w2)
- In Order n->traverse(w1); n->traverse(w2)
- New Value of sum: (sum+w1)+w2
- New Value of marked: true
- In Order n->traverse(w2); n->traverse(w1)
- New Value of sum: (sum+w2)+w1
- New Value of marked: true
21. Checking Invoked Operations Condition
- In Order n->traverse(w1); n->traverse(w2)
- Multiset of Invoked Operations Is
- if (!marked && left != NULL) left->traverse(left_weight),
- if (!marked && right != NULL) right->traverse(right_weight)
- In Order n->traverse(w2); n->traverse(w1)
- Multiset of Invoked Operations Is
- if (!marked && left != NULL) left->traverse(left_weight),
- if (!marked && right != NULL) right->traverse(right_weight)
22. Expression Simplification and Comparison
- Compiler Applies Rewrite Rules to Simplify Expressions
- b+(a+c) → (a+b+c)
- a*(b+c) → (a*b)+(a*c)
- a+if(b<c,d,e) → if(b<c,a+d,a+e)
- Compiler Compares Corresponding Expressions
- If All Equal - Operations Commute
- If Not All Equal - Operations May Not Commute
23. Practical Extensions
- Exploit Read-Only Data
- Recognize When Computed Values Depend Only On
- Unmodified Instance Variables or Global Variables
- Parameters
- Represent Computed Values Using Opaque Constants
- Increases Set of Programs that Compiler Can Analyze
- Operations Can Freely Access Read-Only Data
- Coarsen Commutativity Testing Granularity
- Integrate Operations into Callers for Analysis Purposes
- Mechanism: Interprocedural Symbolic Execution
- Increases Effectiveness of Commutativity Testing
24. Advanced Techniques
- Relative Commutativity
- Recognize Commuting Operations That Generate Equivalent But Not Identical Data Structures
- Techniques for Operations that Contain Conditionals
- Distribute Conditionals Out of Expressions
- Test for Equivalence By Doing Case Analysis
- Techniques for Operations that Access Arrays
- Use Array Update Expressions to Represent New Values
- Rewrite Rules for Array Update Expressions
- Techniques for Operations that Execute Loops
25. Commutativity Testing for Operations With Loops
- Prerequisite: Represent Values Computed In Loops
- View Body of Loop as an Expression Transformer
- Input Expressions: Values Before Iteration Executes
- Output Expressions: Values After Iteration Executes
- Represent Values Computed In Loop Using Recursively Defined Symbolic Loop Modeling Functions
- int t = sum; for (i) t = t + a[i]; sum = t;
- s(e,0) = e;  s(e,i+1) = s(e,i) + a[i]
- New Value of sum: s(sum,n)
- Use Nested Induction Proofs to Determine Equivalence of Expressions With Symbolic Loop Modeling Functions
26. Important Special Case
- Independent Operations Commute
- Analysis in Current Compiler
- Dependence Analysis
- Operations on Objects of Different Classes
- Independent Operations on Objects of Same Class
- Symbolic Commutativity Testing
- Dependent Operations on Objects of Same Class
- Future
- Integrate Shape Analysis
- Integrate Array Data Dependence Analysis
28. Programming Model Extensions
- Extensions for Read-Only Data
- Allow Operations to Freely Access Read-Only Data
- Enhances Ability of Compiler to Represent Expressions
- Increases Set of Programs that Compiler Can Analyze
- Analysis Granularity Extensions
- Integrate Operations Into Callers for Analysis Purposes
- Coarsens Commutativity Testing Granularity
- Reduces Number of Pairs Tested for Commutativity
- Enhances Effectiveness of Commutativity Testing
29. Optimizations
- Parallel Loop Optimization
- Suppress Exploitation of Excess Concurrency
- Synchronization Optimizations
- Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data
- Lock Coarsening
- Replaces Multiple Mutual Exclusion Regions with Single Larger Mutual Exclusion Region
30. Synchronization Optimizations
31. Default Code Generation Strategy
- Each Object Has its Own Mutual Exclusion Lock
- Each Operation Acquires and Releases Lock
- Simple Lock Optimization: Eliminate Lock Constructs In Operations That Only Access Read-Only Data
32. Data Lock Coarsening Transformation
- Compiler Gives Multiple Objects the Same Lock
- Current Policy: Nested Objects Use the Lock in Enclosing Object
- Finds Sequences of Operations
- Access Different Objects
- Acquire and Release Same Lock
- Original Code
- Each Operation Acquires and Releases Lock
- Transformed Code
- Acquires Lock Once At Beginning of Sequence
- Releases Lock Once At End of Sequence
33. Data Lock Coarsening Example

Original Code:

    class vector {
      lock mutex;
      double val[NDIM];
      void add(double *v) {
        mutex.acquire();
        for (int i = 0; i < NDIM; i++) val[i] += v[i];
        mutex.release();
      }
    };

    class body {
      lock mutex;
      double phi;
      vector acc;
      void gravsub(body *b) {
        double p, v[NDIM];
        mutex.acquire();
        p = computeInter(b, v);
        phi -= p;
        mutex.release();
        acc.add(v);
      }
    };

Transformed Code:

    class vector {
      double val[NDIM];
      void add(double *v) {
        for (int i = 0; i < NDIM; i++) val[i] += v[i];
      }
    };

    class body {
      lock mutex;
      double phi;
      vector acc;
      void gravsub(body *b) {
        double p, v[NDIM];
        mutex.acquire();
        p = computeInter(b, v);
        phi -= p;
        acc.add(v);
        mutex.release();
      }
    };
34. Data Lock Coarsening Tradeoff
- Advantage
- Reduces Number of Executed Acquires and Releases
- Reduces Acquire and Release Overhead
- Disadvantage: May Cause False Exclusion
- Multiple Parallel Operations Access Different Objects
- But Operations Attempt to Acquire Same Lock
- Result: Operations Execute Serially
35. Computation Lock Coarsening Transformation
- Compiler Finds Sequences of Operations
- Acquire and Release Same Lock
- Transformed Code
- Acquires Lock Once at Beginning of Sequence
- Releases Lock Once at End of Sequence
- Result
- Replaces Multiple Mutual Exclusion Regions With One Large Mutual Exclusion Region
- Algorithm Based On Local Transformations
- Move Lock Acquire and Release To Become Adjacent
- Eliminate Adjacent Acquire and Release
36. Computation Lock Coarsening Example

Original Code:

    class body {
      lock mutex;
      double phi;
      vector acc;

      void gravsub(body *b) {
        double p, v[NDIM];
        mutex.acquire();
        p = computeInter(b, v);
        phi -= p;
        acc.add(v);
        mutex.release();
      }

      void loopsub(body *b[]) {
        int i;
        for (i = 0; i < N; i++)
          this->gravsub(b[i]);
      }
    };

Optimized Code:

    class body {
      lock mutex;
      double phi;
      vector acc;

      void gravsub(body *b) {
        double p, v[NDIM];
        p = computeInter(b, v);
        phi -= p;
        acc.add(v);
      }

      void loopsub(body *b[]) {
        int i;
        mutex.acquire();        // one acquire for the whole sequence
        for (i = 0; i < N; i++)
          this->gravsub(b[i]);
        mutex.release();
      }
    };
37. Computation Lock Coarsening Tradeoff
- Advantage
- Reduces Number of Executed Acquires and Releases
- Reduces Acquire and Release Overhead
- Disadvantage: May Introduce False Contention
- Multiple Processors Attempt to Acquire Same Lock
- Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region
38. Managing Tradeoff: Lock Coarsening Policies
- To Manage Tradeoff, Compiler Must Successfully
- Reduce Lock Overhead by Increasing Lock Granularity
- Avoid Excessive False Exclusion and False Contention
- Original Policy
- Use Original Lock Algorithm
- Bounded Policy
- Apply Transformation Unless Transformed Code
- Holds Lock During a Recursive Call, or
- Holds Lock During a Loop that Invokes Operations
- Aggressive Policy
- Always Apply Transformation
39. Choosing Best Policy
- Best Policy May Depend On
- Topology of Data Structures
- Dynamic Schedule Of Computation
- Information Required to Choose Best Policy Unavailable At Compile Time
- Complications
- Different Phases May Have Different Best Policy
- In Same Phase, Best Policy May Change Over Time
40. Use Dynamic Feedback to Choose Best Policy
- Sampling Phase Measures Overhead of Different Policies
- Production Phase Uses Best Policy From Sampling Phase
- Periodically Resample to Discover Changes in Best Policy
- Guaranteed Performance Bounds

[Figure: overhead over time, alternating sampling phases (which try the Original, Bounded, and Aggressive policies) with production phases that run the best-measured policy.]
42. Methodology
- Built Prototype Compiler for Subset of C++
- Built Run Time System for Shared Memory Machines
- Concurrency Generation and Task Management
- Dynamic Load Balancing and Synchronization
- Acquired Three Complete Applications
- Barnes-Hut
- Water
- String
- Automatically Parallelized Applications
- Ran Applications on Stanford DASH Machine
- Compare with Highly Tuned, Explicitly Parallel Versions
43. Major Assumptions and Restrictions
- Assumption: No Violation of Type Declarations
- Restrictions
- Restrictions
- Conceptually Significant
- No Virtual Functions
- No Function Pointers
- No Exceptions
- Operations Access Only
- Parameters
- Read-Only Data
- Data Members Declared in Class of the Receiver
- Implementation Convenience
- No Multiple Inheritance
- No Templates
- No union, struct or enum Types
- No typedef Declarations
- Global Variables Must Be of Class Types
- No Static Members
- No Default Arguments or Variable Numbers of Arguments
- No Numeric Casts
44. Applications
- Barnes-Hut
- O(N log N) N-Body Solver
- Space Subdivision Tree
- 1500 Lines of C++ Code
- Water
- Simulates Liquid Water
- O(N^2) Algorithm
- 1850 Lines of C++ Code
- String
- Computes Model of Geology Between Two Oil Wells
- 2050 Lines of C++ Code
45. Obtaining Serial C++ Version of Barnes-Hut
- Started with Explicitly Parallel Version (SPLASH-2)
- Removed Parallel Constructs to get Serial C
- Converted to Clean Object-Based C++
- Major Structural Changes
- Eliminated Scheduling Code and Data Structures
- Split a Loop in Force Computation Phase
- Introduced New Field into Particle Data Structure
46. Obtaining Serial C++ Version of Water
- Started with Serial C Translated from Fortran
- Converted to Clean Object-Based C++
- Major Structural Change
- Auxiliary Objects for O(N^2) Phases
47. Obtaining Serial C++ Version of String
- Started With Serial C Translated From Fortran
- Converted to Clean C++
- No Major Structural Changes
48. Performance Results for Barnes-Hut and Water

[Figure: two speedup curves, speedup vs. number of processors (0-16). Left: Barnes-Hut on DASH, 16K Particles. Right: Water on DASH, 512 Molecules.]
49. Performance Results for String

[Figure: speedup vs. number of processors for String on DASH, Big Well Model.]
50. Synchronization Optimizations
- Generated A Version of Each Application for Each Lock Coarsening Policy
- Original
- Bounded
- Aggressive
- Dynamic Feedback
- Ran Applications on Stanford DASH Machine
51. Lock Overhead
- Percentage of Time that the Single Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks

[Figure: percentage lock overhead per policy for Barnes-Hut (16K Particles), Water (512 Molecules), and String (Big Well Model) on DASH.]
52. Contention Overhead for Barnes-Hut
- Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

[Figure: contention percentage for Barnes-Hut on DASH, 16K Particles.]
53. Contention Overhead for Water
- Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

[Figure: contention percentage for the Aggressive, Bounded, and Original policies; Water on DASH, 512 Molecules.]
54. Contention Overhead for String
- Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

[Figure: contention percentage for the Aggressive and Original policies; String on DASH, Big Well Model.]
55. Performance Results for Barnes-Hut and Water

[Figure: speedup vs. number of processors (0-16) for the Ideal, Aggressive, Bounded, Dynamic Feedback, and Original versions. Left: Barnes-Hut on DASH, 16K Particles. Right: Water on DASH, 512 Molecules.]
56. Performance Results for String

[Figure: speedup vs. number of processors (0-16) for the Ideal, Dynamic Feedback, Aggressive, and Original versions; String on DASH, Big Well Model.]
57. New Directions For Parallelizing Compilers
- Presented Results
- Complete Computations
- Shared Memory Machines
- Larger Class of Hardware Platforms
- Clusters of Shared Memory Machines
- Fine-Grain Parallel Machines
- Larger Class of Computations
- Migratory Computations
- Persistent Data
- Multithreaded Servers
58. Migratory Computations
- Goal: Parallelize Computation
- Key Issues: Incomplete Programs, Dynamic Parallelization, Persistent Data
59. Multithreaded Servers

[Figure: server threads accepting requests and sending responses.]

- Goals: Better Throughput, Better Response Time
- Key Issues: Semantics of Communication and Resource Allocation Primitives, Consistency of Shared State
60. Future
- Key Trends
- Increasing Need For Parallel Computing
- Increasing Availability of Cheap Parallel Hardware
- Key Challenge
- Effectively Support Development of Robust, Efficient Parallel Software for Wide Range of Applications
- Successful Approach Will
- Leverage Structure in Modern Programming Paradigms
- Use Advanced Compiler Technology
- Provide A Simpler, More Effective Programming Model
61. Conclusion
- New Analysis Framework for Parallelizing Compilers
- Commutativity Analysis
- New Class of Applications for Parallelizing Compilers
- Irregular Computations
- Dynamic, Linked Data Structures
- Current Status
- Implemented Prototype
- Good Results
- Future
- New Hardware Platforms
- New Application Domains