Title: PPoPP
PPoPP'06: ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming, New York,
March 29-31, 2006
- Conference Review
- Presented by Utku Aydonat
Outline
- Conference overview
- Brief summaries of sessions
- Keynote speeches and panel
- Best paper
Conference Overview
- History: 1990, 1991, 1993, 1995, 1997, 1999, 2001, 2003, 2005, 2006
- Primary focus: anything related to parallel programming
  - Algorithms
  - Communication
  - Languages
- 8 sessions, 26 papers
- Dominating topics: multicores and parallelization techniques
Conference Overview
- PPoPP Paper Acceptance Statistics:

  Year   Submitted   Accepted   Rate
  2006   91          25         27%
  2005   87          27         31%
  2003   45          20         44%
  1999   79          17         22%
  1997   86          26         30%
Overview of Sessions
- Communication
- Languages
- Performance Characterization
- Shared Memory Parallelism
- Atomicity Issues
- Multicore Software
- Transactional Memory
- Potpourri
Session 1: Communication
- Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links, E. Chan, R. van de Geijn (UTexas), W. Gropp, R. Thakur (Argonne National Lab)
  - Adapts MPI collective communication algorithms to supercomputer architectures that support simultaneous communication with multiple nodes.
  - Theoretically, latency can be reduced; in practice the reduction is not achieved because of algorithmic constraints and overheads.
- Performance Evaluation of Adaptive MPI, Chao Huang, Gengbin Zheng (UIUC), Sameer Kumar (IBM T. J. Watson), Laxmikant Kale (UIUC)
  - Designs and evaluates AMPI, an MPI implementation that supports processor virtualization.
  - Benefits: load balancing, adaptive overlap of communication and computation, independence from the available number of processors, etc.
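For reference, here is a minimal sketch of the kind of ordinary MPI program AMPI virtualizes; as I understand the system, each rank becomes a migratable user-level thread, so the same code can be launched with more virtual processors than physical ones. Nothing AMPI-specific appears in the source; this is plain MPI, written for illustration.

```c
/* ring.c - a plain MPI ring exchange. Ordinary MPI code like this is
 * what AMPI virtualizes: each rank runs as a user-level thread, so the
 * rank count need not match the physical processor count. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, token;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("token returned to rank 0: %d\n", token);
    } else {
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```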
Session 1: Communication
- Mobile MPI Programs on Computational Grids, Rohit Fernandes, Keshav Pingali, Paul Stodghill (Cornell)
  - Checkpointing system for C programs that use MPI.
  - Able to take checkpoints on an Alpha cluster and restart them on Windows.
- RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits, Sayantan Sur, Hyun-Wook Jin, Lei Chai, Dhabaleswar K. Panda (Ohio State)
  - A rendezvous protocol for MPI using RDMA read.
  - Increases communication/computation overlap.
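The overlap benefit is easiest to see in nonblocking code like the sketch below: with an RDMA-read-based rendezvous, the large transfer can make progress while both ranks keep computing. The buffer size and compute kernel are illustrative, not taken from the paper.

```c
/* overlap.c - communication/computation overlap with nonblocking MPI.
 * An RDMA-read-based rendezvous lets the large transfer proceed while
 * both sides compute. Run with at least two ranks. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)

static void compute(double *x, int n) {  /* stand-in for useful work */
    for (int i = 0; i < n; i++) x[i] = x[i] * 0.5 + 1.0;
}

int main(int argc, char **argv) {
    int rank;
    double *buf  = malloc(N * sizeof *buf);
    double *work = calloc(N, sizeof *work);
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; i++) buf[i] = rank;

    if (rank == 0)
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    compute(work, N);             /* overlapped with the transfer */

    if (rank <= 1) MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    free(buf); free(work);
    return 0;
}
```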
Session 2: Languages
- Global-View Abstractions for User-Defined Reductions and Scans, Steve J. Deitz, David Callahan, Bradford L. Chamberlain (Cray), Lawrence Snyder (U. of Washington)
  - Chapel, a programming language developed by Cray Inc. as part of the DARPA High-Productivity Computing Systems program.
  - Global-view abstractions for user-defined reductions and scans.
- Programming for Parallelism and Locality with Hierarchically Tiled Arrays, Ganesh Bikshandi, Jia Guo, Daniel Hoeflinger (UIUC), Gheorghe Almasi (IBM T. J. Watson), Basilio B. Fraguela (Universidade da Coruña), Maria Jesus Garzaran, David Padua (UIUC), Christoph von Praun (IBM T. J. Watson)
  - Hierarchically Tiled Arrays (HTAs) define a tiling structure for arrays.
  - Reduction, mapping, scan, transpose, and shift operations are defined on tiles.
MinK in Chapel
- A user-defined reduction that yields the 10 smallest elements of A; mink is called for each element of A:

```chapel
var minimums: [1..10] integer;
minimums = mink(integer, 10) reduce A;
```
HTA (figure slide: example of a hierarchically tiled array and its tile structure)
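To make the layout idea concrete, here is a minimal illustration in plain C of addressing an array as (tile, element-within-tile) and reducing hierarchically. This is only a sketch of the concept, not the HTA library API; the tile counts are arbitrary.

```c
/* tiles.c - the tiling idea behind HTAs, flattened into plain C:
 * operations are expressed tile-by-tile, so tiles can be mapped to
 * processors and locality is explicit in the program structure. */
#include <stdio.h>

#define TILES 4
#define TILE_SIZE 8

int main(void) {
    double a[TILES][TILE_SIZE];

    /* initialize: global index = t * TILE_SIZE + i */
    for (int t = 0; t < TILES; t++)
        for (int i = 0; i < TILE_SIZE; i++)
            a[t][i] = t * TILE_SIZE + i;

    /* a reduction expressed hierarchically: reduce within each tile,
     * then across tiles - the structure HTA operators exploit */
    double partial[TILES], total = 0.0;
    for (int t = 0; t < TILES; t++) {
        partial[t] = 0.0;
        for (int i = 0; i < TILE_SIZE; i++) partial[t] += a[t][i];
    }
    for (int t = 0; t < TILES; t++) total += partial[t];

    printf("sum = %g\n", total);  /* 0 + 1 + ... + 31 = 496 */
    return 0;
}
```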
Session 3: Performance Characterization
- Performance characterization of bio-molecular simulations using molecular dynamics, Sadaf Alam, Pratul Agarwal, Al Geist, Jeffrey Vetter (ORNL)
  - Investigates performance bottlenecks in molecular dynamics applications on supercomputers.
  - Finds that the implementations of the algorithms are not scalable.
- On-line Automated Performance Diagnosis on Thousands of Processors, Philip C. Roth (ORNL), Barton P. Miller (U. of Wisconsin, Madison)
  - Distributed and scalable performance-analysis tool.
  - Can analyze a large application with 1024 processes and present the results as a folded graph.
Session 3: Performance Characterization
- A Case Study in Top-Down Performance Estimation for a Large-Scale Parallel Application, Ilya Sharapov, Robert Kroeger, Guy Delamarter (Sun Microsystems)
  - Performance estimation of HPC workloads on future architectures.
  - Based on low-level analysis and scalability predictions.
  - Predicts the performance of the Gyrokinetic Toroidal Code on Sun's future architectures.
Session 4: Shared Memory Parallelism
- Hardware Profile-guided Automatic Page Placement for ccNUMA Systems, Jaydeep Marathe, Frank Mueller (North Carolina State U.)
  - Profiles memory accesses and places pages accordingly (a sketch of the underlying mechanism follows this list).
  - 20% performance improvement with 2.7% overhead.
- Adaptive Scheduling with Parallelism Feedback, Kunal Agrawal, Yuxiong He, Wen Jing Hsu, Charles Leiserson (Mass. Inst. of Tech.)
  - Allocates processors to jobs based on the past parallelism of each job.
  - Uses an R-trimmed mean for the feedback.
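For the page-placement paper, the sketch below shows the kind of mechanism a profile-guided placer could use on Linux: move_pages(2) from libnuma. The choice of target node here is a stand-in for the profile-derived decision; this is not the authors' tool. Link with -lnuma.

```c
/* place.c - moving a "hot" page to a chosen NUMA node with move_pages(2).
 * A profile-guided placer would pick the target node from per-node
 * access counts; node 0 below is illustrative. Linux-specific. */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    long pagesize = sysconf(_SC_PAGESIZE);
    void *page;
    if (posix_memalign(&page, pagesize, pagesize)) return 1;
    ((char *)page)[0] = 1;             /* touch so the page is backed */

    void *pages[1]  = { page };
    int  status[1];
    int  target[1]  = { 0 };           /* stand-in for the node that
                                          accesses this page most */

    /* nodes == NULL only queries the current placement */
    if (move_pages(0, 1, pages, NULL, status, 0) == 0)
        printf("page currently on node %d\n", status[0]);

    if (move_pages(0, 1, pages, target, status, MPOL_MF_MOVE) == 0)
        printf("page now on node %d\n", status[0]);
    else
        perror("move_pages");
    free(page);
    return 0;
}
```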
Session 4: Shared Memory Parallelism
- Predicting Bounds on Queuing Delay for Batch-scheduled Parallel Machines, John Brevik, Daniel Nurmi, Rich Wolski (UCSB)
  - Binomial Method Batch Predictor (BMBP), which bases its predictions on past wait times.
  - Uses the 95th percentile; its predictions are close to the wait times actually experienced.
- Optimizing Irregular Shared-Memory Applications for Distributed-Memory Systems, Ayon Basumallik, Rudolf Eigenmann (Purdue)
  - Converts OpenMP applications to MPI-based applications.
  - Uses an inspector loop to find non-local accesses and reorders loops.
OpenMP-to-MPI (figure slide: example of the translation)
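A hand-written sketch of the translation's flavor: a parallel loop becomes a block-distributed loop plus explicit communication. This assumes a regular access pattern and a rank count that divides N; the actual compiler also generates inspector loops for irregular accesses.

```c
/* omp2mpi.c - the flavor of OpenMP-to-MPI translation, by hand.
 * Assumes the number of ranks divides N evenly. */
#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv) {
    static double a[N], b[N];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* OpenMP original:
     *   #pragma omp parallel for
     *   for (i = 0; i < N; i++) a[i] = 2.0 * b[i];
     * MPI version: each rank computes its block of iterations ... */
    int lo = rank * N / size, hi = (rank + 1) * N / size;
    for (int i = 0; i < N; i++) b[i] = i;
    for (int i = lo; i < hi; i++) a[i] = 2.0 * b[i];

    /* ... then data written by each rank is made visible to all
     * (a real compiler communicates only what later code may read) */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  a, N / size, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0) printf("a[10] = %g\n", a[10]);
    MPI_Finalize();
    return 0;
}
```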
Session 5: Atomicity Issues
- Proving Correctness of Highly-Concurrent Linearizable Objects, Viktor Vafeiadis (U. of Cambridge), Maurice Herlihy (Brown U.), Tony Hoare (Microsoft Research Cambridge), Marc Shapiro (INRIA Rocquencourt/LIP6)
  - Proves the safety of concurrent objects using the rely-guarantee method.
  - For every operation, each thread's rely condition must be satisfied, and each thread's guarantee condition must imply the other threads' rely conditions (stated compactly after this list).
- Accurate and Efficient Runtime Detection of Atomicity Errors in Concurrent Programs, Liqiang Wang, Scott D. Stoller (SUNY Stony Brook)
  - Instruments the program to obtain a profile of memory accesses.
  - Builds a tree of the conflicting accesses and applies algorithms to check conflict- and view-equivalence.
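Stated compactly, the rely-guarantee obligation above amounts to the following (my paraphrase, not the paper's exact formulation):

```latex
% Each thread j's steps satisfy its guarantee G_j provided its rely R_j
% holds of the environment; composition is sound when every thread's
% guarantee implies every other thread's rely:
\forall i \neq j :\; G_j \Rightarrow R_i
```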
Session 5: Atomicity Issues
- Scalable Synchronous Queues, William N. Scherer III (U. of Rochester), Doug Lea (SUNY Oswego), Michael L. Scott (U. of Rochester)
  - Best Paper.
  - Details are coming up.
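Until then, the semantics being made scalable can be pinned down with a deliberately naive sketch: a put blocks until a matching take (and vice versa), so each item is handed off directly with no buffering. The paper's contribution is scalable nonblocking algorithms for exactly this interface; the lock-based version below is for illustration only.

```c
/* syncq.c - synchronous-queue semantics, naive lock-based sketch.
 * put() blocks until a taker arrives; take() blocks until data exists.
 * One producer/one consumer pair for simplicity. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int slot, full = 0, taken = 1;

static void put(int v) {
    pthread_mutex_lock(&m);
    while (full) pthread_cond_wait(&cv, &m);   /* one item at a time */
    slot = v; full = 1; taken = 0;
    pthread_cond_broadcast(&cv);
    while (!taken) pthread_cond_wait(&cv, &m); /* wait for the taker */
    pthread_mutex_unlock(&m);
}

static int take(void) {
    pthread_mutex_lock(&m);
    while (!full) pthread_cond_wait(&cv, &m);
    int v = slot; full = 0; taken = 1;
    pthread_cond_broadcast(&cv);               /* release the putter */
    pthread_mutex_unlock(&m);
    return v;
}

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 3; i++) put(i);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    for (int i = 0; i < 3; i++) printf("took %d\n", take());
    pthread_join(t, NULL);
    return 0;
}
```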
Session 6: Multicore Software
- POSH: A TLS Compiler that Exploits Program Structure, Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn (UIUC), Karin Strauss, Jose Renau (UCSC), Josep Torrellas (UIUC)
  - A thread-level speculation (TLS) compiler that divides the program into tasks and prunes the inefficient ones.
  - Uses profiling to detect tasks that are likely to violate frequently.
- High-performance IPv6 Forwarding Algorithm for Multi-core and Multithreaded Network Processors, Xianghui Hu (U. of Sci. and Tech. of China), Xinan Tang (Intel), Bei Hua (U. of Sci. and Tech. of China)
  - New IPv6 forwarding algorithm optimized for Intel NPU features.
  - Achieves 10 Gbps for large routing tables with up to 400K entries.
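The core operation any IPv6 forwarding engine must make fast is longest-prefix match. Below is a minimal binary-trie version in plain C for orientation; the paper's algorithm restructures this around the NPU's hardware threads and memory hierarchy, which this sketch does not attempt.

```c
/* lpm.c - longest-prefix match on 128-bit addresses via a binary trie. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    struct node *child[2];
    int next_hop;                      /* -1 = no route stored here */
} node;

static node *new_node(void) {
    node *n = calloc(1, sizeof *n);
    n->next_hop = -1;
    return n;
}

/* i-th bit of a 16-byte address, most significant first */
static int bit(const uint8_t a[16], int i) {
    return (a[i / 8] >> (7 - i % 8)) & 1;
}

static void insert(node *root, const uint8_t p[16], int plen, int hop) {
    for (int i = 0; i < plen; i++) {
        int b = bit(p, i);
        if (!root->child[b]) root->child[b] = new_node();
        root = root->child[b];
    }
    root->next_hop = hop;
}

static int lookup(node *root, const uint8_t a[16]) {
    int best = -1;
    for (int i = 0; root; i++) {
        if (root->next_hop >= 0) best = root->next_hop; /* longest so far */
        root = root->child[bit(a, i)];
    }
    return best;
}

int main(void) {
    node *root = new_node();
    uint8_t p1[16] = { 0x20, 0x01 };             /* 2001::/16    -> hop 1 */
    uint8_t p2[16] = { 0x20, 0x01, 0x0d, 0xb8 }; /* 2001:db8::/32 -> hop 2 */
    insert(root, p1, 16, 1);
    insert(root, p2, 32, 2);

    uint8_t dst[16] = { 0x20, 0x01, 0x0d, 0xb8, 0x00, 0x01 };
    printf("next hop: %d\n", lookup(root, dst)); /* matches /32 -> 2 */
    return 0;
}
```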
Session 6: Multicore Software
- MAMA! A Memory Allocator for Multithreaded Architectures, Simon Kahan, Petr Konecny (Cray Inc.)
  - A memory allocator that aggregates requests to reduce fragmentation.
  - Transforms contention into collaboration.
  - Experiments with micro-benchmarks show that it works.
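A toy rendering of the aggregation idea: requests that arrive together are satisfied from one large allocation that a single thread obtains on behalf of the batch. The fixed batch size and leader choice are simplifications of mine; the real allocator forms batches dynamically and without a fixed leader.

```c
/* agg_alloc.c - aggregation sketch: one malloc serves a whole batch of
 * concurrent requests, turning heap contention into collaboration. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define THREADS 4
#define REQ_SIZE 64

static pthread_barrier_t bar;
static char *chunk;                    /* one big allocation */
static void *slots[THREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    pthread_barrier_wait(&bar);        /* batch point: all requests in */
    if (id == 0)                       /* one thread serves the batch */
        chunk = malloc(THREADS * REQ_SIZE);
    pthread_barrier_wait(&bar);
    slots[id] = chunk + id * REQ_SIZE; /* everyone gets a carved piece */
    sprintf(slots[id], "thread %ld", id);
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    pthread_barrier_init(&bar, NULL, THREADS);
    for (long i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < THREADS; i++)
        printf("%s at %p\n", (char *)slots[i], slots[i]);
    free(chunk);
    pthread_barrier_destroy(&bar);
    return 0;
}
```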
Session 7: Transactional Memory
- A High Performance Software Transactional Memory System For A Multi-Core Runtime, Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson (Intel), Chi Cao Minh, Ben Hertzberg (Stanford)
  - Maps each memory location to a unique lock and acquires all the relevant locks before committing a transaction (see the sketch after this list).
  - Undo logging, write locking/read versioning, and cache-line conflict detection.
- Exploiting Distributed Version Concurrency in a Transactional Memory Cluster, Kaloian Manassiev, Madalin Mihailescu, Cristiana Amza (UofT)
  - Transactional memory system on commodity clusters for generic C and SQL applications.
  - Diffs are applied by readers on demand and may violate writers.
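A structure-only sketch of the lock-mapping scheme from the first paper: every memory word hashes to a versioned lock in a global table, and commit acquires the locks covering the write set before publishing the writes. Read-set validation, undo logging on abort, atomics, and contention management are all elided here, so this shows the mapping rather than a usable STM.

```c
/* stm_sketch.c - address-to-versioned-lock mapping, structure only.
 * A real STM uses atomic operations, validates its read set after
 * locking, and can abort; none of that is modeled here. */
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 1024

typedef struct { uint32_t version; int locked; } vlock;
static vlock table[TABLE_SIZE];

/* map a memory location to its (shared) versioned lock */
static vlock *lock_for(void *addr) {
    return &table[((uintptr_t)addr >> 3) % TABLE_SIZE];
}

typedef struct { int *addr; int val; } wentry;

/* commit: lock the write set, apply it, bump versions, unlock */
static void commit(wentry *wset, int n) {
    for (int i = 0; i < n; i++) lock_for(wset[i].addr)->locked = 1;
    for (int i = 0; i < n; i++) *wset[i].addr = wset[i].val;
    for (int i = 0; i < n; i++) {
        vlock *l = lock_for(wset[i].addr);
        l->version++;               /* readers detect the change */
        l->locked = 0;
    }
}

int main(void) {
    int x = 0, y = 0;
    wentry wset[2] = { { &x, 1 }, { &y, 2 } };
    commit(wset, 2);
    printf("x=%d y=%d version(x)=%u\n", x, y, lock_for(&x)->version);
    return 0;
}
```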
Session 7: Transactional Memory
- Hybrid Transactional Memory, Sanjeev Kumar (Intel), Michael Chu (U. of Mich.), Christopher Hughes, Partha Kundu, Anthony Nguyen (Intel)
  - Combines hardware and software TM.
  - Extends DSTM.
  - Conflict detection is based on loading and storing the state field of the object wrapper and the locator field.
Session 8: Potpourri
- Fast and Transparent Recovery for Continuous Availability of Cluster-based Servers, Rosalia Christodoulopoulou, Kaloian Manassiev (UofT), Angelos Bilas (U. of Crete), Cristiana Amza (UofT)
  - Recovery from failures in virtual shared-memory systems.
  - Based on page replication on backup nodes.
  - Failure-free overhead of 38%; recovery cost below 600 ms.
- Minimizing Execution Time in MPI Programs on an Energy-Constrained, Power-Scalable Cluster, Rob Springer, David K. Lowenthal, Barry Rountree (U. of Georgia), Vincent W. Freeh (North Carolina State U.)
  - Finds the combination of processor gears (frequency settings) that minimizes execution time within the energy constraint.
  - Found the optimum schedule in 50% of the programs while exploring only 7% of the search space.
Session 8: Potpourri
- Teaching parallel computing to science faculty: best practices and common pitfalls, David Joiner (Kean U.), Paul Gray (U. of Northern Iowa), Thomas Murphy (Contra Costa College), Charles Peck (Earlham College)
  - Experience teaching parallel programming in a community-college setting.
Keynote Speeches and Panel
- Parallel Programming and Code Selection in Fortress, Guy L. Steele Jr., Sun Fellow, Sun Microsystems Laboratories
- Parallel Programming in Modern Web Search Engines, Raymie Stata, Chief Architect for Search Marketplace, Yahoo!, Inc.
- Software Issues for Multicore Systems. Moderator: James Larus (Microsoft Research). Panelists: Saman Amarasinghe (MIT), Richard Brunner (AMD), Luddy Harrison (UIUC), David Kuck (Intel), Michael Scott (U. Rochester), Burton Smith (Microsoft), Kevin Stoodley (IBM)
Guy L. Steele: Parallel Programming and Code Selection in Fortress
- To do for Fortran what Java did for C:
  - Dynamic compilation
  - Platform independence
  - Security model, including type checking
- Research funded in part by DARPA through its High Productivity Computing Systems program.
- Don't build the language, grow it.
- Make programming notation closer to mathematics.
- Ease the use of parallelism.
- Can a feature be provided by a library rather than by the compiler?
- Programmers (especially library writers) need not fear subroutines, functions, methods, and interfaces for performance reasons.
Guy L. Steele: Parallel Programming and Code Selection in Fortress
- Type system: objects and traits
  - Traits are like interfaces, but may contain code.
- Primitive types are first-class: booleans, integers, floats, and characters are all objects.
- Transactional access to shared variables.
- Fortress loops are parallel by default.
- Programming language notation can become closer to mathematical notation.
Guy L. Steele: Parallel Programming and Code Selection in Fortress (figure slide)
Panel: Software Issues for Multicore Systems
- Performance-conscious languages: languages that increase programmer productivity while making code easier to optimize.
- New compiler opportunities:
  - New languages that take performance seriously
  - Possible compiler support for using multicores for purposes other than parallelism:
    - Security enforcement
    - Program introspection
- Meanwhile, the vast majority of application programmers have no idea about parallelism.
- More cores coming: dual-core in mid-2006, quad-core in 2007 (AMD).
- Software architecture challenges (debugging, profiling, making multi-threading easier, etc.)
Panel: Software Issues for Multicore Systems
- Some successes in using multi-core: OS support, transactional memory, virtualization, efficient JVMs.
- Parallel software systems must be much simpler, architecturally, than sequential ones if they are to have a chance of holding together.
- We will struggle before finally accepting that the cache abstraction does not scale.
- Efficient point-to-point communication is required.
- Most success will be achieved on nonstandard multicore platforms such as graphics processors, network processors, and signal processors, where there is less investment in caches.
- We need new applications to drive interest toward multicores.
- Where will the parallelism come from? (dataflow, reduce/map/scan, speculative parallelization, etc.)
Panel: Software Issues for Multicore Systems
- The explicit sacrifice of single-thread performance in favor of parallel performance.
- Most vulnerable communities:
  - Those who have not previously been exposed to, or had a need for, parallel systems, for example:
    - Typical client software, mobile devices
    - Server transactions with significant internal complexity
  - Those who chronically need to drive the maximum performance from their computer systems, for example:
    - High performance computing
    - Gamers
- Above 8 cores, we do not know whether multicores will be useful or not.
Readings for Future CARG
- Optimizing Irregular Shared-Memory Applications for Distributed-Memory Systems, Ayon Basumallik, Rudolf Eigenmann (Purdue)
- POSH: A TLS Compiler that Exploits Program Structure, Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn (UIUC), Karin Strauss, Jose Renau (UCSC), Josep Torrellas (UIUC)
- MAMA! A Memory Allocator for Multithreaded Architectures, Simon Kahan, Petr Konecny (Cray Inc.)
- Hybrid Transactional Memory, Sanjeev Kumar (Intel), Michael Chu (U. of Mich.), Christopher Hughes, Partha Kundu, Anthony Nguyen (Intel)