Title: PPoPP
PPoPP'06: ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming, New York,
March 29-31, 2006
- Conference Review
- Presented by Utku Aydonat
Outline
- Conference overview
- Brief summaries of sessions
- Keynote speeches and panel
- Best paper
Conference Overview
- History: 1990, 1991, 1993, 1995, 1997, 1999, 2001, 2003, 2005, 2006
- Primary focus: anything related to parallel programming
  - Algorithms
  - Communication
  - Languages
- 8 sessions, 26 papers
- Dominating topics: multicores and parallelization techniques
Conference Overview
- PPoPP Paper Acceptance Statistics:

  Year   Submitted   Accepted   Rate
  2006   91          25         27%
  2005   87          27         31%
  2003   45          20         44%
  1999   79          17         22%
  1997   86          26         30%
Overview of Sessions
- Communication
- Languages
- Performance Characterization
- Shared Memory Parallelism
- Atomicity Issues
- Multicore Software
- Transactional Memory
- Potpourri
Session 1: Communication
- Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links, E. Chan, R. van de Geijn (UTexas), W. Gropp, R. Thakur (Argonne National Lab)
  - Adapts MPI collective communication algorithms to supercomputer architectures that support simultaneous communication with multiple nodes.
  - Theoretically, latency can be reduced; in practice the reduction is not achieved because of algorithmic constraints and overheads.
- Performance Evaluation of Adaptive MPI, Chao Huang, Gengbin Zheng (UIUC), Sameer Kumar (IBM T. J. Watson), Laxmikant Kale (UIUC)
  - Designs and evaluates AMPI, an MPI implementation that supports processor virtualization.
  - Benefits: load balancing, adaptive overlap of communication and computation, independence from the available number of processors, etc.
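For reference, here is a minimal sketch of the kind of ordinary MPI program AMPI virtualizes; as I understand the system, each rank becomes a migratable user-level thread, so the same code can be launched with more virtual processors than physical ones. Nothing AMPI-specific appears in the source; this is plain MPI, written for illustration.

```c
/* ring.c - a plain MPI ring exchange. Ordinary MPI code like this is
 * what AMPI virtualizes: each rank runs as a user-level thread, so the
 * rank count need not match the physical processor count. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, token;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("token returned to rank 0: %d\n", token);
    } else {
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```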
Session 1: Communication
- Mobile MPI Programs on Computational Grids, Rohit Fernandes, Keshav Pingali, Paul Stodghill (Cornell)
  - Checkpointing system for C programs that use MPI.
  - Able to take checkpoints on an Alpha cluster and restart them on Windows.
- RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits, Sayantan Sur, Hyun-Wook Jin, Lei Chai, Dhabaleswar K. Panda (Ohio State)
  - A rendezvous protocol for MPI using RDMA read.
  - Increases communication/computation overlap.
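The overlap benefit is easiest to see in nonblocking code like the sketch below: with an RDMA-read-based rendezvous, the large transfer can make progress while both ranks keep computing. The buffer size and compute kernel are illustrative, not taken from the paper.

```c
/* overlap.c - communication/computation overlap with nonblocking MPI.
 * An RDMA-read-based rendezvous lets the large transfer proceed while
 * both sides compute. Run with at least two ranks. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)

static void compute(double *x, int n) {  /* stand-in for useful work */
    for (int i = 0; i < n; i++) x[i] = x[i] * 0.5 + 1.0;
}

int main(int argc, char **argv) {
    int rank;
    double *buf  = malloc(N * sizeof *buf);
    double *work = calloc(N, sizeof *work);
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; i++) buf[i] = rank;

    if (rank == 0)
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    compute(work, N);             /* overlapped with the transfer */

    if (rank <= 1) MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    free(buf); free(work);
    return 0;
}
```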
Session 2: Languages
- Global-View Abstractions for User-Defined Reductions and Scans, Steve J. Deitz, David Callahan, Bradford L. Chamberlain (Cray), Lawrence Snyder (U. of Washington)
  - Chapel, a programming language developed by Cray Inc. as part of the DARPA High-Productivity Computing Systems program.
  - Global-view abstractions for user-defined reductions and scans.
- Programming for Parallelism and Locality with Hierarchically Tiled Arrays, Ganesh Bikshandi, Jia Guo, Daniel Hoeflinger (UIUC), Gheorghe Almasi (IBM T. J. Watson), Basilio B. Fraguela (Universidade da Coruña), Maria Jesus Garzaran, David Padua (UIUC), Christoph von Praun (IBM T. J. Watson)
  - Hierarchically Tiled Arrays (HTAs) define a tiling structure for arrays.
  - Reduction, mapping, scan, transpose, and shift operations are defined on tiles.
MinK in Chapel
- A user-defined reduction that yields the 10 smallest elements of A; mink is called for each element of A:

```chapel
var minimums: [1..10] integer;
minimums = mink(integer, 10) reduce A;
```
HTA (figure slide: example of a hierarchically tiled array and its tile structure)
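To make the layout idea concrete, here is a minimal illustration in plain C of addressing an array as (tile, element-within-tile) and reducing hierarchically. This is only a sketch of the concept, not the HTA library API; the tile counts are arbitrary.

```c
/* tiles.c - the tiling idea behind HTAs, flattened into plain C:
 * operations are expressed tile-by-tile, so tiles can be mapped to
 * processors and locality is explicit in the program structure. */
#include <stdio.h>

#define TILES 4
#define TILE_SIZE 8

int main(void) {
    double a[TILES][TILE_SIZE];

    /* initialize: global index = t * TILE_SIZE + i */
    for (int t = 0; t < TILES; t++)
        for (int i = 0; i < TILE_SIZE; i++)
            a[t][i] = t * TILE_SIZE + i;

    /* a reduction expressed hierarchically: reduce within each tile,
     * then across tiles - the structure HTA operators exploit */
    double partial[TILES], total = 0.0;
    for (int t = 0; t < TILES; t++) {
        partial[t] = 0.0;
        for (int i = 0; i < TILE_SIZE; i++) partial[t] += a[t][i];
    }
    for (int t = 0; t < TILES; t++) total += partial[t];

    printf("sum = %g\n", total);  /* 0 + 1 + ... + 31 = 496 */
    return 0;
}
```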
Session 3: Performance Characterization
- Performance characterization of bio-molecular simulations using molecular dynamics, Sadaf Alam, Pratul Agarwal, Al Geist, Jeffrey Vetter (ORNL)
  - Investigates performance bottlenecks in molecular dynamics applications on supercomputers.
  - Finds that the implementations of the algorithms are not scalable.
- On-line Automated Performance Diagnosis on Thousands of Processors, Philip C. Roth (ORNL), Barton P. Miller (U. of Wisconsin, Madison)
  - Distributed and scalable performance-analysis tool.
  - Can analyze a large application with 1024 processes and present the results as a folded graph.
Session 3: Performance Characterization
- A Case Study in Top-Down Performance Estimation for a Large-Scale Parallel Application, Ilya Sharapov, Robert Kroeger, Guy Delamarter (Sun Microsystems)
  - Performance estimation of HPC workloads on future architectures.
  - Based on low-level analysis and scalability predictions.
  - Predicts the performance of the Gyrokinetic Toroidal Code on Sun's future architectures.
Session 4: Shared Memory Parallelism
- Hardware Profile-guided Automatic Page Placement for ccNUMA Systems, Jaydeep Marathe, Frank Mueller (North Carolina State U.)
  - Profiles memory accesses and places pages accordingly (a sketch of the underlying mechanism follows this list).
  - 20% performance improvement with 2.7% overhead.
- Adaptive Scheduling with Parallelism Feedback, Kunal Agrawal, Yuxiong He, Wen Jing Hsu, Charles Leiserson (Mass. Inst. of Tech.)
  - Allocates processors to jobs based on the past parallelism of each job.
  - Uses an R-trimmed mean for the feedback.
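For the page-placement paper, the sketch below shows the kind of mechanism a profile-guided placer could use on Linux: move_pages(2) from libnuma. The choice of target node here is a stand-in for the profile-derived decision; this is not the authors' tool. Link with -lnuma.

```c
/* place.c - moving a "hot" page to a chosen NUMA node with move_pages(2).
 * A profile-guided placer would pick the target node from per-node
 * access counts; node 0 below is illustrative. Linux-specific. */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    long pagesize = sysconf(_SC_PAGESIZE);
    void *page;
    if (posix_memalign(&page, pagesize, pagesize)) return 1;
    ((char *)page)[0] = 1;             /* touch so the page is backed */

    void *pages[1]  = { page };
    int  status[1];
    int  target[1]  = { 0 };           /* stand-in for the node that
                                          accesses this page most */

    /* nodes == NULL only queries the current placement */
    if (move_pages(0, 1, pages, NULL, status, 0) == 0)
        printf("page currently on node %d\n", status[0]);

    if (move_pages(0, 1, pages, target, status, MPOL_MF_MOVE) == 0)
        printf("page now on node %d\n", status[0]);
    else
        perror("move_pages");
    free(page);
    return 0;
}
```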
Session 4: Shared Memory Parallelism
- Predicting Bounds on Queuing Delay for Batch-scheduled Parallel Machines, John Brevik, Daniel Nurmi, Rich Wolski (UCSB)
  - Binomial Method Batch Predictor (BMBP), which bases its predictions on past wait times.
  - Uses the 95th percentile; its predictions are close to the wait times actually experienced.
- Optimizing Irregular Shared-Memory Applications for Distributed-Memory Systems, Ayon Basumallik, Rudolf Eigenmann (Purdue)
  - Converts OpenMP applications to MPI-based applications.
  - Uses an inspector loop to find non-local accesses and reorders loops.
OpenMP-to-MPI (figure slide: example of the translation)
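A hand-written sketch of the translation's flavor: a parallel loop becomes a block-distributed loop plus explicit communication. This assumes a regular access pattern and a rank count that divides N; the actual compiler also generates inspector loops for irregular accesses.

```c
/* omp2mpi.c - the flavor of OpenMP-to-MPI translation, by hand.
 * Assumes the number of ranks divides N evenly. */
#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv) {
    static double a[N], b[N];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* OpenMP original:
     *   #pragma omp parallel for
     *   for (i = 0; i < N; i++) a[i] = 2.0 * b[i];
     * MPI version: each rank computes its block of iterations ... */
    int lo = rank * N / size, hi = (rank + 1) * N / size;
    for (int i = 0; i < N; i++) b[i] = i;
    for (int i = lo; i < hi; i++) a[i] = 2.0 * b[i];

    /* ... then data written by each rank is made visible to all
     * (a real compiler communicates only what later code may read) */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  a, N / size, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0) printf("a[10] = %g\n", a[10]);
    MPI_Finalize();
    return 0;
}
```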
Session 5: Atomicity Issues
- Proving Correctness of Highly-Concurrent Linearizable Objects, Viktor Vafeiadis (U. of Cambridge), Maurice Herlihy (Brown U.), Tony Hoare (Microsoft Research Cambridge), Marc Shapiro (INRIA Rocquencourt/LIP6)
  - Proves the safety of concurrent objects using the rely-guarantee method.
  - For every operation, each thread's rely condition must be satisfied, and each thread's guarantee condition must imply the other threads' rely conditions (stated compactly after this list).
- Accurate and Efficient Runtime Detection of Atomicity Errors in Concurrent Programs, Liqiang Wang, Scott D. Stoller (SUNY Stony Brook)
  - Instruments the program to obtain a profile of memory accesses.
  - Builds a tree of the conflicting accesses and applies algorithms to check conflict- and view-equivalence.
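Stated compactly, the rely-guarantee obligation above amounts to the following (my paraphrase, not the paper's exact formulation):

```latex
% Each thread j's steps satisfy its guarantee G_j provided its rely R_j
% holds of the environment; composition is sound when every thread's
% guarantee implies every other thread's rely:
\forall i \neq j :\; G_j \Rightarrow R_i
```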
Session 5: Atomicity Issues
- Scalable Synchronous Queues, William N. Scherer III (U. of Rochester), Doug Lea (SUNY Oswego), Michael L. Scott (U. of Rochester)
  - Best Paper.
  - Details are coming up.
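Until then, the semantics being made scalable can be pinned down with a deliberately naive sketch: a put blocks until a matching take (and vice versa), so each item is handed off directly with no buffering. The paper's contribution is scalable nonblocking algorithms for exactly this interface; the lock-based version below is for illustration only.

```c
/* syncq.c - synchronous-queue semantics, naive lock-based sketch.
 * put() blocks until a taker arrives; take() blocks until data exists.
 * One producer/one consumer pair for simplicity. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int slot, full = 0, taken = 1;

static void put(int v) {
    pthread_mutex_lock(&m);
    while (full) pthread_cond_wait(&cv, &m);   /* one item at a time */
    slot = v; full = 1; taken = 0;
    pthread_cond_broadcast(&cv);
    while (!taken) pthread_cond_wait(&cv, &m); /* wait for the taker */
    pthread_mutex_unlock(&m);
}

static int take(void) {
    pthread_mutex_lock(&m);
    while (!full) pthread_cond_wait(&cv, &m);
    int v = slot; full = 0; taken = 1;
    pthread_cond_broadcast(&cv);               /* release the putter */
    pthread_mutex_unlock(&m);
    return v;
}

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 3; i++) put(i);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    for (int i = 0; i < 3; i++) printf("took %d\n", take());
    pthread_join(t, NULL);
    return 0;
}
```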
Session 6: Multicore Software
- POSH: A TLS Compiler that Exploits Program Structure, Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn (UIUC), Karin Strauss, Jose Renau (UCSC), Josep Torrellas (UIUC)
  - A thread-level speculation (TLS) compiler that divides the program into tasks and prunes the inefficient ones.
  - Uses profiling to detect tasks that are likely to violate frequently.
- High-performance IPv6 Forwarding Algorithm for Multi-core and Multithreaded Network Processors, Xianghui Hu (U. of Sci. and Tech. of China), Xinan Tang (Intel), Bei Hua (U. of Sci. and Tech. of China)
  - New IPv6 forwarding algorithm optimized for Intel NPU features.
  - Achieves 10 Gbps for large routing tables with up to 400K entries.
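The core operation any IPv6 forwarding engine must make fast is longest-prefix match. Below is a minimal binary-trie version in plain C for orientation; the paper's algorithm restructures this around the NPU's hardware threads and memory hierarchy, which this sketch does not attempt.

```c
/* lpm.c - longest-prefix match on 128-bit addresses via a binary trie. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    struct node *child[2];
    int next_hop;                      /* -1 = no route stored here */
} node;

static node *new_node(void) {
    node *n = calloc(1, sizeof *n);
    n->next_hop = -1;
    return n;
}

/* i-th bit of a 16-byte address, most significant first */
static int bit(const uint8_t a[16], int i) {
    return (a[i / 8] >> (7 - i % 8)) & 1;
}

static void insert(node *root, const uint8_t p[16], int plen, int hop) {
    for (int i = 0; i < plen; i++) {
        int b = bit(p, i);
        if (!root->child[b]) root->child[b] = new_node();
        root = root->child[b];
    }
    root->next_hop = hop;
}

static int lookup(node *root, const uint8_t a[16]) {
    int best = -1;
    for (int i = 0; root; i++) {
        if (root->next_hop >= 0) best = root->next_hop; /* longest so far */
        root = root->child[bit(a, i)];
    }
    return best;
}

int main(void) {
    node *root = new_node();
    uint8_t p1[16] = { 0x20, 0x01 };             /* 2001::/16    -> hop 1 */
    uint8_t p2[16] = { 0x20, 0x01, 0x0d, 0xb8 }; /* 2001:db8::/32 -> hop 2 */
    insert(root, p1, 16, 1);
    insert(root, p2, 32, 2);

    uint8_t dst[16] = { 0x20, 0x01, 0x0d, 0xb8, 0x00, 0x01 };
    printf("next hop: %d\n", lookup(root, dst)); /* matches /32 -> 2 */
    return 0;
}
```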
Session 6: Multicore Software
- MAMA! A Memory Allocator for Multithreaded Architectures, Simon Kahan, Petr Konecny (Cray Inc.)
  - A memory allocator that aggregates requests to reduce fragmentation.
  - Transforms contention into collaboration.
  - Experiments with micro-benchmarks show that it works.
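A toy rendering of the aggregation idea: requests that arrive together are satisfied from one large allocation that a single thread obtains on behalf of the batch. The fixed batch size and leader choice are simplifications of mine; the real allocator forms batches dynamically and without a fixed leader.

```c
/* agg_alloc.c - aggregation sketch: one malloc serves a whole batch of
 * concurrent requests, turning heap contention into collaboration. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define THREADS 4
#define REQ_SIZE 64

static pthread_barrier_t bar;
static char *chunk;                    /* one big allocation */
static void *slots[THREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    pthread_barrier_wait(&bar);        /* batch point: all requests in */
    if (id == 0)                       /* one thread serves the batch */
        chunk = malloc(THREADS * REQ_SIZE);
    pthread_barrier_wait(&bar);
    slots[id] = chunk + id * REQ_SIZE; /* everyone gets a carved piece */
    sprintf(slots[id], "thread %ld", id);
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    pthread_barrier_init(&bar, NULL, THREADS);
    for (long i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < THREADS; i++)
        printf("%s at %p\n", (char *)slots[i], slots[i]);
    free(chunk);
    pthread_barrier_destroy(&bar);
    return 0;
}
```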
Session 7: Transactional Memory
- A High Performance Software Transactional Memory System For A Multi-Core Runtime, Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson (Intel), Chi Cao Minh, Ben Hertzberg (Stanford)
  - Maps each memory location to a unique lock and acquires all the relevant locks before committing a transaction (see the sketch after this list).
  - Undo logging, write locking/read versioning, and cache-line conflict detection.
- Exploiting Distributed Version Concurrency in a Transactional Memory Cluster, Kaloian Manassiev, Madalin Mihailescu, Cristiana Amza (UofT)
  - Transactional memory system on commodity clusters for generic C and SQL applications.
  - Diffs are applied by readers on demand and may violate writers.
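A structure-only sketch of the lock-mapping scheme from the first paper: every memory word hashes to a versioned lock in a global table, and commit acquires the locks covering the write set before publishing the writes. Read-set validation, undo logging on abort, atomics, and contention management are all elided here, so this shows the mapping rather than a usable STM.

```c
/* stm_sketch.c - address-to-versioned-lock mapping, structure only.
 * A real STM uses atomic operations, validates its read set after
 * locking, and can abort; none of that is modeled here. */
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 1024

typedef struct { uint32_t version; int locked; } vlock;
static vlock table[TABLE_SIZE];

/* map a memory location to its (shared) versioned lock */
static vlock *lock_for(void *addr) {
    return &table[((uintptr_t)addr >> 3) % TABLE_SIZE];
}

typedef struct { int *addr; int val; } wentry;

/* commit: lock the write set, apply it, bump versions, unlock */
static void commit(wentry *wset, int n) {
    for (int i = 0; i < n; i++) lock_for(wset[i].addr)->locked = 1;
    for (int i = 0; i < n; i++) *wset[i].addr = wset[i].val;
    for (int i = 0; i < n; i++) {
        vlock *l = lock_for(wset[i].addr);
        l->version++;               /* readers detect the change */
        l->locked = 0;
    }
}

int main(void) {
    int x = 0, y = 0;
    wentry wset[2] = { { &x, 1 }, { &y, 2 } };
    commit(wset, 2);
    printf("x=%d y=%d version(x)=%u\n", x, y, lock_for(&x)->version);
    return 0;
}
```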
Session 7: Transactional Memory
- Hybrid Transactional Memory, Sanjeev Kumar (Intel), Michael Chu (U. of Mich.), Christopher Hughes, Partha Kundu, Anthony Nguyen (Intel)
  - Combines hardware and software TM.
  - Extends DSTM.
  - Conflict detection is based on loading and storing the state field of the object wrapper and the locator field.
Session 8: Potpourri
- Fast and Transparent Recovery for Continuous Availability of Cluster-based Servers, Rosalia Christodoulopoulou, Kaloian Manassiev (UofT), Angelos Bilas (U. of Crete), Cristiana Amza (UofT)
  - Recovery from failures in virtual shared-memory systems.
  - Based on page replication on backup nodes.
  - Failure-free overhead of 38%; recovery cost below 600 ms.
- Minimizing Execution Time in MPI Programs on an Energy-Constrained, Power-Scalable Cluster, Rob Springer, David K. Lowenthal, Barry Rountree (U. of Georgia), Vincent W. Freeh (North Carolina State U.)
  - Finds the combination of processor gears (frequency settings) that minimizes execution time within the energy constraint.
  - Found the optimum schedule in 50% of the programs while exploring only 7% of the search space.
Session 8: Potpourri
- Teaching parallel computing to science faculty: best practices and common pitfalls, David Joiner (Kean U.), Paul Gray (U. of Northern Iowa), Thomas Murphy (Contra Costa College), Charles Peck (Earlham College)
  - Experience teaching parallel programming in a community-college setting.
Keynote Speeches and Panel
- Parallel Programming and Code Selection in Fortress, Guy L. Steele Jr., Sun Fellow, Sun Microsystems Laboratories
- Parallel Programming in Modern Web Search Engines, Raymie Stata, Chief Architect for Search Marketplace, Yahoo!, Inc.
- Software Issues for Multicore Systems. Moderator: James Larus (Microsoft Research). Panelists: Saman Amarasinghe (MIT), Richard Brunner (AMD), Luddy Harrison (UIUC), David Kuck (Intel), Michael Scott (U. Rochester), Burton Smith (Microsoft), Kevin Stoodley (IBM)
Guy L. Steele: Parallel Programming and Code Selection in Fortress
- To do for Fortran what Java did for C:
  - Dynamic compilation
  - Platform independence
  - Security model, including type checking
- Research funded in part by DARPA through its High Productivity Computing Systems program.
- Don't build the language, grow it.
- Make programming notation closer to mathematics.
- Ease the use of parallelism.
- Can a feature be provided by a library rather than by the compiler?
- Programmers (especially library writers) need not fear subroutines, functions, methods, and interfaces for performance reasons.
Guy L. Steele: Parallel Programming and Code Selection in Fortress
- Type system: objects and traits
  - Traits are like interfaces, but may contain code.
- Primitive types are first-class: booleans, integers, floats, and characters are all objects.
- Transactional access to shared variables.
- Fortress loops are parallel by default.
- Programming language notation can become closer to mathematical notation.
Guy L. Steele: Parallel Programming and Code Selection in Fortress (figure slide)
Panel: Software Issues for Multicore Systems
- Performance-conscious languages: languages that increase programmer productivity while making code easier to optimize.
- New compiler opportunities:
  - New languages that take performance seriously
  - Possible compiler support for using multicores for purposes other than parallelism:
    - Security enforcement
    - Program introspection
- Meanwhile, the vast majority of application programmers have no idea about parallelism.
- More cores coming: dual-core in mid-2006, quad-core in 2007 (AMD).
- Software architecture challenges (debugging, profiling, making multi-threading easier, etc.)
Panel: Software Issues for Multicore Systems
- Some successes in using multi-core: OS support, transactional memory, virtualization, efficient JVMs.
- Parallel software systems must be much simpler, architecturally, than sequential ones if they are to have a chance of holding together.
- We will struggle before finally accepting that the cache abstraction does not scale.
- Efficient point-to-point communication is required.
- Most success will be achieved on nonstandard multicore platforms such as graphics processors, network processors, and signal processors, where there is less investment in caches.
- We need new applications to drive interest toward multicores.
- Where will the parallelism come from? (dataflow, reduce/map/scan, speculative parallelization, etc.)
Panel: Software Issues for Multicore Systems
- The explicit sacrifice of single-thread performance in favor of parallel performance.
- Most vulnerable communities:
  - Those who have not previously been exposed to, or had a need for, parallel systems, for example:
    - Typical client software, mobile devices
    - Server transactions with significant internal complexity
  - Those who chronically need to drive the maximum performance from their computer systems, for example:
    - High performance computing
    - Gamers
- Above 8 cores, we do not know whether multicores will be useful or not.
Readings for Future CARG
- Optimizing Irregular Shared-Memory Applications for Distributed-Memory Systems, Ayon Basumallik, Rudolf Eigenmann (Purdue)
- POSH: A TLS Compiler that Exploits Program Structure, Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn (UIUC), Karin Strauss, Jose Renau (UCSC), Josep Torrellas (UIUC)
- MAMA! A Memory Allocator for Multithreaded Architectures, Simon Kahan, Petr Konecny (Cray Inc.)
- Hybrid Transactional Memory, Sanjeev Kumar (Intel), Michael Chu (U. of Mich.), Christopher Hughes, Partha Kundu, Anthony Nguyen (Intel)