Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers

1
Commutativity Analysis: A New Analysis Framework
for Parallelizing Compilers
  • Martin C. Rinard
  • University of California, Santa Barbara

2
Goal
  • Automatically Parallelize
  • Irregular, Object-Based Computations
  • That Manipulate
  • Dynamic, Linked Data Structures

3
Structure of Talk
  • Model of Computation
  • Graph Traversal Example
  • Commutativity Testing
  • Basic Technique
  • Practical Extensions
  • Advanced Techniques
  • Synchronization Optimizations
  • Experimental Results
  • Future Research

4
Model of Computation
(Diagram: objects with initial object states; invoked operations execute on the objects, and each executing operation produces a new object state)
5
Example
  • Weighted In-Degree Computation
  • Weighted Graph With Weights On Edges
  • Goal Is to Compute, For Each Node, Sum of Weights
    on All Incoming Edges
  • Serial Algorithm: Marked Depth-First Traversal

6
Serial Code For Example
class node {
  node *left, *right;
  int left_weight, right_weight;
  int sum;
  boolean marked;
};

void node::traverse(int weight) {
  sum += weight;
  if (!marked) {
    marked = true;
    if (left != NULL) left->traverse(left_weight);
    if (right != NULL) right->traverse(right_weight);
  }
}

  • Goal: Execute the left and right traverse Operations in Parallel

7
Parallel Traversal
(Figure: parallel traversal of the example weighted graph, with edge weights being accumulated into the node sums)
8
Parallel Traversal
(Figure: a later stage of the same parallel traversal, with more edge weights accumulated into the node sums)
9
Traditional Approach
  • Data Dependence Analysis
  • Compiler Analyzes Reads and Writes
  • Finds Independent Pieces of Code
  • Independent Pieces of Code Execute in Parallel
  • Demonstrated Success for Array-Based Programs
  • Dense Matrices
  • Affine Access Functions

10
Data Dependence Analysis in Example
  • For Data Dependence Analysis to Succeed in
    Example
  • left and right Traverse Must Be Independent
  • left and right Subgraphs Must Be Disjoint
  • Graph Must Be a Tree
  • Depends on Global Topology of Data Structure
  • Analyze Code that Builds Data Structure
  • Extract and Propagate Topology Information
  • Fails for Graphs - Computations Are Not
    Independent!

11
Commuting Operations In Parallel Traversal
(Figure: the parallel traversal again, highlighting pairs of traverse operations on the same node whose updates commute)
12
Commutativity Analysis
  • Compiler Computes Extent of the Computation
  • Representation of all Operations in Computation
  • Algorithm Traverses Call Graph
  • In Example: node::traverse
  • Do All Pairs of Operations in Extent Commute?
  • No - Generate Serial Code
  • Yes - Generate Parallel Code
  • In Example, All Pairs Commute (see the sketch below)
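A minimal sketch of this analysis driver, assuming hypothetical Operation, computeExtent, and operationsCommute interfaces (these names are illustrative, not the compiler's actual ones):

// Hypothetical driver: compute the extent from the root operation by
// traversing the call graph, then test every pair of operations for
// commutativity.
#include <set>

struct Operation;                                     // e.g. node::traverse
std::set<Operation*> computeExtent(Operation* root);  // call graph traversal
bool operationsCommute(Operation* a, Operation* b);   // symbolic test (slides 18-22)

bool canParallelize(Operation* root) {
  std::set<Operation*> extent = computeExtent(root);  // in the example: { node::traverse }
  for (Operation* a : extent)
    for (Operation* b : extent)                       // all pairs, including a paired with itself
      if (!operationsCommute(a, b))
        return false;                                 // some pair may not commute: generate serial code
  return true;                                        // all pairs commute: generate parallel code
}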

13
Generated Code In Example
Class Declaration:

class node {
  lock mutex;
  node *left, *right;
  int left_weight, right_weight;
  int sum;
  boolean marked;
};

Driver Version:

void node::traverse(int weight) {
  parallel_traverse(weight);
  wait();
}

14
Generated Code In Example
void node::parallel_traverse(int weight) {
  mutex.acquire();                 // critical region: from acquire to release
  sum += weight;
  if (!marked) {
    marked = true;
    mutex.release();
    if (left != NULL)
      spawn(left->parallel_traverse(left_weight));
    if (right != NULL)
      spawn(right->parallel_traverse(right_weight));
  } else {
    mutex.release();
  }
}
15
Properties of Commutativity Analysis
  • Oblivious to Data Structure Topology
  • Local Analysis
  • Simple Analysis
  • Suitable for a Wide Range of Programs
  • Programs that Manipulate Lists, Trees and Graphs
  • Commuting Updates to Central Data Structure
  • General Reductions
  • Incomplete Programs
  • Introduces Synchronization

16
  • Commutativity Testing

17
Separable Operations
  • Each Operation Consists of Two Sections

Object Section: Only Accesses the Receiver Object
Invocation Section: Only Invokes Operations
Both Sections May Access Parameters and Local
Variables (an annotated example follows)
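As an illustration, the node::traverse operation from the earlier example is separable; the commented split below is only an annotated reading of that code, not compiler output:

void node::traverse(int weight) {
  sum += weight;                      // object section: accesses to the receiver
  if (!marked) {                      // (reads and writes of sum and marked)
    marked = true;
    if (left != NULL)                 // invocation section: only invokes operations,
      left->traverse(left_weight);    // guarded by conditions over values the object
    if (right != NULL)                // section read (compare the multisets on slide 21)
      right->traverse(right_weight);
  }
}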
18
Commutativity Testing Conditions
  • Do Two Operations A and B Commute?
  • Compiler Must Consider Two Potential Execution
    Orders
  • A executes before B
  • B executes before A
  • Compiler Must Check Two Conditions

Instance Variables: In both execution orders, the new
values of the instance variables are the same after
the execution of the two object sections
Invoked Operations: In both execution orders, the
two invocation sections together directly invoke
the same multiset of operations
19
Commutativity Testing Algorithm
  • Symbolic Execution
  • Compiler Executes Operations
  • Computes with Expressions Instead of Values
  • Compiler Symbolically Executes Operations
  • In Both Execution Orders
  • Expressions for New Values of Instance Variables
  • Expressions for Multiset of Invoked Operations

20
Checking Instance Variables Condition
  • Compiler Generates Two Symbolic Operations
  • n->traverse(w1) and n->traverse(w2)
  • In Order n->traverse(w1), n->traverse(w2)
  • New Value of sum: (sum+w1)+w2
  • New Value of marked: true
  • In Order n->traverse(w2), n->traverse(w1)
  • New Value of sum: (sum+w2)+w1
  • New Value of marked: true

21
Checking Invoked Operations Condition
  • In Order n->traverse(w1), n->traverse(w2)
  • Multiset of Invoked Operations Is
  • if (!marked && left != NULL) left->traverse(left_weight),
  • if (!marked && right != NULL) right->traverse(right_weight)
  • In Order n->traverse(w2), n->traverse(w1)
  • Multiset of Invoked Operations Is
  • if (!marked && left != NULL) left->traverse(left_weight),
  • if (!marked && right != NULL) right->traverse(right_weight)

22
Expression Simplification and Comparison
  • Compiler Applies Rewrite Rules to Simplify
    Expressions
  • b+(a+c) → (a+b+c)
  • a*(b+c) → (a*b)+(a*c)
  • a*if(b<c,d,e) → if(b<c,a*d,a*e)
  • Compiler Compares Corresponding Expressions (a minimal sketch follows)
  • If All Equal - Operations Commute
  • If Not All Equal - Operations May Not Commute
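A minimal sketch of the simplify-and-compare step, under the simplifying assumption that the relevant expressions are sums of symbolic terms (the compiler itself works on full expression trees):

#include <algorithm>
#include <string>
#include <vector>

using Sum = std::vector<std::string>;     // a sum of symbolic terms, e.g. {"sum","w1","w2"}

Sum canonicalize(Sum terms) {
  std::sort(terms.begin(), terms.end());  // b+(a+c) -> (a+b+c): the order of terms is irrelevant
  return terms;
}

bool sameSymbolicValue(const Sum& a, const Sum& b) {
  return canonicalize(a) == canonicalize(b);
}

// New value of sum in the two execution orders of the example:
//   traverse(w1) then traverse(w2):  (sum + w1) + w2  ->  {"sum", "w1", "w2"}
//   traverse(w2) then traverse(w1):  (sum + w2) + w1  ->  {"sum", "w2", "w1"}
// sameSymbolicValue returns true, so the instance variables condition holds for sum.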

23
Practical Extensions
  • Exploit Read-Only Data
  • Recognize When Computed Values Depend Only On
  • Unmodified Instance Variables or Global Variables
  • Parameters
  • Represent Computed Values Using Opaque Constants (see the sketch after this list)
  • Increases Set of Programs that Compiler Can
    Analyze
  • Operations Can Freely Access Read-Only Data
  • Coarsen Commutativity Testing Granularity
  • Integrate Operations into Callers for Analysis
    Purposes
  • Mechanism: Interprocedural Symbolic Execution
  • Increases Effectiveness of Commutativity Testing
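A hypothetical illustration of the opaque constant idea (scale_traverse is invented for this sketch and does not appear in the talk): because left_weight is never written by any operation in the extent, a value computed from it and the parameters can be represented by a single opaque constant.

void node::scale_traverse(int factor) {  // hypothetical operation, for illustration only
  // factor is a parameter and left_weight is read-only in this extent, so the
  // compiler may represent factor * left_weight as one opaque constant c;
  // the symbolic new value of sum is then just sum + c.
  sum += factor * left_weight;
  if (left != NULL) left->scale_traverse(factor);
  if (right != NULL) right->scale_traverse(factor);
}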

24
Advanced Techniques
  • Relative Commutativity
  • Recognize Commuting Operations That Generate
    Equivalent But Not Identical Data Structures
  • Techniques for Operations that Contain
    Conditionals
  • Distribute Conditionals Out of Expressions
  • Test for Equivalence By Doing Case Analysis
  • Techniques for Operations that Access Arrays
  • Use Array Update Expressions to Represent New
    Values
  • Rewrite Rules for Array Update Expressions
  • Techniques for Operations that Execute Loops

25
Commutativity Testing for Operations With Loops
  • Prerequisite: Represent Values Computed In Loops
  • View Body of Loop as an Expression Transformer
  • Input Expressions: Values Before the Iteration Executes
  • Output Expressions: Values After the Iteration Executes
  • Represent Values Computed In Loop Using
  • Recursively Defined Symbolic Loop Modeling
    Functions
  • int t = sum;  for (i) { t = t + a[i]; }  sum = t;
  • s(e,0) = e    s(e,i+1) = s(e,i) + a[i]
  • New Value of sum = s(sum,n) (see the code sketch after this list)
  • Use Nested Induction Proofs to Determine
    Equivalence of Expressions With Symbolic Loop
    Modeling Functions
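A direct, deliberately naive rendering of the loop modeling function s as recursive code; the compiler manipulates the recursive definition symbolically and compares such functions with induction proofs rather than by evaluating them (the array a is the hypothetical loop data from the bullet above):

extern int a[];                     // the loop's data, a[0..n-1]

int s(int e, int i) {               // s(e,0)   = e
  if (i == 0) return e;             // s(e,i+1) = s(e,i) + a[i]
  return s(e, i - 1) + a[i - 1];
}
// After the loop, the new value of sum is s(sum, n).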

26
Important Special Case
  • Independent Operations Commute
  • Analysis in Current Compiler
  • Dependence Analysis
  • Operations on Objects of Different Classes
  • Independent Operations on Objects of Same Class
  • Symbolic Commutativity Testing
  • Dependent Operations on Objects of Same Class
  • Future
  • Integrate Shape Analysis
  • Integrate Array Data Dependence Analysis

27
  • Steps to Practicality

28
Programming Model Extensions
  • Extensions for Read-Only Data
  • Allow Operations to Freely Access Read-Only Data
  • Enhances Ability of Compiler to Represent
    Expressions
  • Increases Set of Programs that Compiler Can
    Analyze
  • Analysis Granularity Extensions
  • Integrate Operations Into Callers for Analysis
    Purposes
  • Coarsens Commutativity Testing Granularity
  • Reduces Number of Pairs Tested for Commutativity
  • Enhances Effectiveness of Commutativity Testing

29
Optimizations
  • Parallel Loop Optimization
  • Suppress Exploitation of Excess Concurrency
  • Synchronization Optimizations
  • Eliminate Synchronization Constructs in Methods
    that Only Access Read-Only Data
  • Lock Coarsening
  • Replaces Multiple Mutual Exclusion Regions with
  • Single Larger Mutual Exclusion Region

30
Synchronization Optimizations
31
Default Code Generation Strategy
  • Each Object Has its Own Mutual Exclusion Lock
  • Each Operation Acquires and Releases Lock

Simple Lock Optimization
Eliminate Lock Constructs In Operations That Only
Access Read-Only Data
32
Data Lock Coarsening Transformation
  • Compiler Gives Multiple Objects the Same Lock
  • Current Policy: Nested Objects Use the Lock in the
    Enclosing Object
  • Finds Sequences of Operations
  • Access Different Objects
  • Acquire and Release Same Lock
  • Transformed Code
  • Acquires Lock Once At Beginning of Sequence
  • Releases Lock Once At End of Sequence
  • Original Code
  • Each Operation Acquires and Releases Lock

33
Data Lock Coarsening Example
Original Code:

class vector {
  lock mutex; double val[NDIM];
  void add(double *v) {
    mutex.acquire();
    for (int i = 0; i < NDIM; i++) val[i] += v[i];
    mutex.release();
  }
};
class body {
  lock mutex; double phi; vector acc;
  void gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b, v); phi -= p;
    mutex.release();
    acc.add(v);
  }
};

Transformed Code:

class vector {
  double val[NDIM];
  void add(double *v) { for (int i = 0; i < NDIM; i++) val[i] += v[i]; }
};
class body {
  lock mutex; double phi; vector acc;
  void gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b, v); phi -= p;
    acc.add(v);
    mutex.release();
  }
};
34
Data Lock Coarsening Tradeoff
  • Advantage
  • Reduces Number of Executed Acquires and Releases
  • Reduces Acquire and Release Overhead
  • Disadvantage: May Cause False Exclusion
  • Multiple Parallel Operations Access Different
    Objects
  • But Operations Attempt to Acquire Same Lock
  • Result: Operations Execute Serially

35
Computation Lock Coarsening Transformation
  • Compiler Finds Sequences of Operations
  • Acquire and Release Same Lock
  • Transformed Code
  • Acquires Lock Once at Beginning of Sequence
  • Releases Lock Once at End of Sequence
  • Result
  • Replaces Multiple Mutual Exclusion Regions With
  • One Large Mutual Exclusion Region
  • Algorithm Based On Local Transformations
  • Move Lock Acquire and Release To Become Adjacent
  • Eliminate Adjacent Acquire and Release

36
Computation Lock Coarsening Example
Original Code:

class body {
  lock mutex; double phi; vector acc;
  void gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b, v); phi -= p;
    acc.add(v);
    mutex.release();
  }
  void loopsub(body *b) {
    int i;
    for (i = 0; i < N; i++)
      this->gravsub(&b[i]);
  }
};

Optimized Code:

class body {
  lock mutex; double phi; vector acc;
  void gravsub(body *b) {
    double p, v[NDIM];
    p = computeInter(b, v); phi -= p;
    acc.add(v);
  }
  void loopsub(body *b) {
    int i;
    mutex.acquire();
    for (i = 0; i < N; i++)
      this->gravsub(&b[i]);
    mutex.release();
  }
};

37
Computation Lock Coarsening Tradeoff
  • Advantage
  • Reduces Number of Executed Acquires and Releases
  • Reduces Acquire and Release Overhead
  • Disadvantage: May Introduce False Contention
  • Multiple Processors Attempt to Acquire Same Lock
  • Processor Holding the Lock is Executing Code that
    was Originally in No Mutual Exclusion Region

38
Managing the Tradeoff: Lock Coarsening Policies
  • To Manage Tradeoff, Compiler Must Successfully
  • Reduce Lock Overhead by Increasing Lock
    Granularity
  • Avoid Excessive False Exclusion and False
    Contention
  • Original Policy
  • Use Original Lock Algorithm
  • Bounded Policy
  • Apply Transformation Unless Transformed Code
  • Holds Lock During a Recursive Call, or
  • Holds Lock During a Loop that Invokes Operations
  • Aggressive Policy
  • Always Apply Transformation

39
Choosing Best Policy
  • Best Policy May Depend On
  • Topology of Data Structures
  • Dynamic Schedule Of Computation
  • Information Required to Choose Best Policy
    Unavailable At Compile Time
  • Complications
  • Different Phases May Have Different Best Policy
  • In Same Phase, Best Policy May Change Over Time

40
Use Dynamic Feedback to Choose Best Policy
  • Sampling Phase: Measures Overhead of Different
    Policies
  • Production Phase: Uses Best Policy From Sampling
    Phase
  • Periodically Resample to Discover Changes in Best
    Policy (see the sketch after this list)
  • Guaranteed Performance Bounds
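A sketch of this strategy, assuming hypothetical measuredOverhead and runWithPolicy hooks and illustrative phase lengths (the policy names come from the talk; everything else here is an assumption):

#include <initializer_list>

enum Policy { ORIGINAL, BOUNDED, AGGRESSIVE };

double measuredOverhead(Policy p, double sampleSeconds);  // run a slice of work with p, return its overhead
void   runWithPolicy(Policy p, double seconds);           // run the computation with p for a bounded interval

void dynamicFeedback(double sampleSeconds, double productionSeconds) {
  for (;;) {                                              // resample periodically
    // Sampling phase: measure the overhead of each lock coarsening policy.
    Policy best = ORIGINAL;
    double bestOverhead = measuredOverhead(ORIGINAL, sampleSeconds);
    for (Policy p : {BOUNDED, AGGRESSIVE}) {
      double o = measuredOverhead(p, sampleSeconds);
      if (o < bestOverhead) { bestOverhead = o; best = p; }
    }
    // Production phase: use the best sampled policy; bounding its length is
    // what gives the guaranteed performance bounds mentioned above.
    runWithPolicy(best, productionSeconds);
  }
}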

(Figure: overhead over time; sampling phases measure the Original, Bounded, and Aggressive policies, and each production phase runs the policy with the lowest measured overhead)
41
  • Experimental Results

42
Methodology
  • Built Prototype Compiler for a Subset of C++
  • Built Run Time System for Shared Memory Machines
  • Concurrency Generation and Task Management
  • Dynamic Load Balancing and Synchronization
  • Acquired Three Complete Applications
  • Barnes-Hut
  • Water
  • String
  • Automatically Parallelized Applications
  • Ran Applications on Stanford DASH Machine
  • Compared with Highly Tuned, Explicitly Parallel
    Versions

43
Major Assumptions and Restrictions
  • Assumption: No Violation of Type Declarations
  • Restrictions
  • Conceptually Significant
  • No Virtual Functions
  • No Function Pointers
  • No Exceptions
  • Operations Access Only
  • Parameters
  • Read-Only Data
  • Data Members Declared in Class of the Receiver
  • Implementation Convenience
  • No Multiple Inheritance
  • No Templates
  • No union, struct or enum Types
  • No typedef Declarations
  • Global Variables Must Be of Class Types
  • No Static Members
  • No Default Arguments or Variable Numbers of
    Arguments
  • No Numeric Casts

44
Applications
  • Barnes-Hut
  • O(N lg N) N-Body Solver
  • Space Subdivision Tree
  • 1500 Lines of C++ Code
  • Water
  • Simulates Liquid Water
  • O(N^2) Algorithm
  • 1850 Lines of C++ Code
  • String
  • Computes Model of Geology Between Two Oil Wells
  • 2050 Lines of C++ Code

45
Obtaining Serial C++ Version of Barnes-Hut
  • Started with Explicitly Parallel Version
    (SPLASH-2)
  • Removed Parallel Constructs to get Serial C
  • Converted to Clean Object-Based C++
  • Major Structural Changes
  • Eliminated Scheduling Code and Data Structures
  • Split a Loop in Force Computation Phase
  • Introduced New Field into Particle Data Structure

46
Obtaining Serial C++ Version of Water
  • Started with Serial C Translated from Fortran
  • Converted to Clean Object-Based C++
  • Major Structural Change
  • Auxiliary Objects for O(N^2) phases

47
Obtaining Serial C++ Version of String
  • Started With Serial C Translated From Fortran
  • Converted to Clean C++
  • No Major Structural Changes

48
Performance Results for Barnes-Hut and Water
(Speedup graphs, speedup vs. number of processors, 0 to 16: Barnes-Hut on DASH (16K Particles) and Water on DASH (512 Molecules))
49
Performance Results for String
(Speedup graph, speedup vs. number of processors: String on DASH, Big Well Model)
50
Synchronization Optimizations
  • Generated A Version of Each Application for Each
    Lock Coarsening Policy
  • Original
  • Bounded
  • Aggressive
  • Dynamic Feedback
  • Ran Applications on Stanford DASH Machine

51
Lock Overhead
  • Percentage of Time that the Single Processor
    Execution Spends Acquiring and Releasing Mutual
    Exclusion Locks

(Charts: percentage lock overhead for Barnes-Hut on DASH (16K Particles), Water on DASH (512 Molecules), and String on DASH (Big Well Model))
52
Contention Overhead for Barnes-Hut
  • Percentage of Time that Processors Spend Waiting
    to Acquire Locks Held by Other Processors

(Chart: contention percentage for Barnes-Hut on DASH, 16K Particles)
53
Contention Overhead for Water
  • Percentage of Time that Processors Spend Waiting
    to Acquire Locks Held by Other Processors

(Chart: contention percentage for Water on DASH (512 Molecules) under the Original, Bounded, and Aggressive policies)
54
Contention Overhead for String
  • Percentage of Time that Processors Spend Waiting
    to Acquire Locks Held by Other Processors

(Chart: contention percentage for String on DASH (Big Well Model) under the Original and Aggressive policies)
55
Performance Results for Barnes-Hut and Water
(Speedup graphs, speedup vs. number of processors: Barnes-Hut on DASH (16K Particles) and Water on DASH (512 Molecules), comparing the Original, Bounded, Aggressive, and Dynamic Feedback lock coarsening policies against Ideal speedup)
56
Performance Results for String
(Speedup graph, speedup vs. number of processors: String on DASH (Big Well Model), comparing the Original, Aggressive, and Dynamic Feedback policies against Ideal speedup)
57
New Directions For Parallelizing Compilers
  • Presented Results
  • Complete Computations
  • Shared Memory Machines
  • Larger Class of Hardware Platforms
  • Clusters of Shared Memory Machines
  • Fine-Grain Parallel Machines
  • Larger Class of Computations
  • Migratory Computations
  • Persistent Data
  • Multithreaded Servers

58
Migratory Computations
  • Goal: Parallelize Computation
  • Key Issues: Incomplete Programs, Dynamic
    Parallelization, Persistent Data

59
Multithreaded Servers
(Figure: a multithreaded server accepting requests and sending responses)
  • Goals: Better Throughput, Better Response Time
  • Key Issues: Semantics of Communication and
    Resource Allocation Primitives, Consistency of
    Shared State

60
Future
  • Key Trends
  • Increasing Need For Parallel Computing
  • Increasing Availability of Cheap Parallel
    Hardware
  • Key Challenge
  • Effectively Support Development of Robust,
    Efficient Parallel Software for Wide Range of
    Applications
  • Successful Approach Will
  • Leverage Structure in Modern Programming
    Paradigms
  • Use Advanced Compiler Technology
  • Provide A Simpler, More Effective Programming
    Model

61
Conclusion
  • New Analysis Framework for Parallelizing
    Compilers
  • Commutativity Analysis
  • New Class of Applications for Parallelizing
    Compilers
  • Irregular Computations
  • Dynamic, Linked Data Structures
  • Current Status
  • Implemented Prototype
  • Good Results
  • Future
  • New Hardware Platforms
  • New Application Domains