PPoPP - Presentation Transcript
1
PPoPP'06: ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming, New York,
March 29-31, 2006
  • Conference Review
  • Presented by Utku Aydonat

2
Outline
  • Conference overview
  • Brief summaries of sessions
  • Keynote speeches and panel
  • Best paper

3
Conference Overview
  • History: '90, '91, '93, '95, '97, '99, '01, '03,
    '05, '06
  • Primary focus: anything related to parallel
    programming
  • Algorithms
  • Communication
  • Languages
  • 8 sessions, 26 papers
  • Dominating topics: multicores, parallelization
    techniques

4
Conference Overview
  • PPoPP paper acceptance statistics:

    Year   Submitted   Accepted   Rate
    2006       91          25      27%
    2005       87          27      31%
    2003       45          20      44%
    1999       79          17      22%
    1997       86          26      30%

5
Overview of Sessions
  • Communication
  • Languages
  • Performance Characterization
  • Shared Memory Parallelism
  • Atomicity Issues
  • Multicore Software
  • Transactional Memory
  • Potpourri

6
Session 1: Communication
  • Collective Communication on Architectures that
    Support Simultaneous Communication over Multiple
    Links, E. Chan, R. van de Geijn (UTexas), W.
    Gropp, R. Thakur (Argonne National Lab.)
  • Adapts MPI collective communication algorithms to
    supercomputer architectures that support
    simultaneous communication with multiple nodes.
  • Theoretically, latency can be reduced; in
    practice the reduction is not achieved, due to
    the algorithms and their overheads.
  • Performance Evaluation of Adaptive MPI, Chao
    Huang, Gengbin Zheng (UIUC), Sameer Kumar (IBM T.
    J. Watson), Laxmikant Kale (UIUC)
  • Design and evaluate AMPI that supports processor
    virtualization
  • Benefits: load balancing, adaptive overlapping,
    independence from the number of available
    processors, etc.
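The latency claim in the first paper can be made concrete with a toy counting model (an illustration of the general idea, not taken from the paper): if each node can send on k links simultaneously, the set of informed nodes grows by a factor of k+1 per broadcast round.

```python
def broadcast_rounds(p, k):
    """Minimum rounds to broadcast among p nodes when each node
    can send on k links simultaneously (idealized model)."""
    rounds, informed = 0, 1
    while informed < p:
        informed *= (k + 1)   # every informed node informs k more nodes
        rounds += 1
    return rounds
```

With one link, 8 nodes need 3 rounds; with 3 links, only 2. The paper's point is that real algorithms and per-message overheads keep practice short of this bound.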

7
Session 1: Communication
  • Mobile MPI Programs on Computational Grids,
    Rohit Fernandes, Keshav Pingali, Paul Stodghill
    (Cornell)
  • Checkpointing system for C programs using MPI
  • Able to take checkpoints on an Alpha cluster and
    restart them on Windows
  • RDMA Read Based Rendezvous Protocol for MPI over
    InfiniBand: Design Alternatives and Benefits,
    Sayantan Sur, Hyun-Wook Jin, Lei Chai,
    Dhabaleswar K Panda (Ohio State)
  • A rendezvous protocol in MPI using RDMA read.
  • Increases communication / computation overlap.

8
Session 2: Languages
  • Global-View Abstractions for User-Defined
    Reductions and Scans, Steve J. Deitz, David
    Callahan, Bradford L. Chamberlain (Cray),
    Lawrence Snyder (U. of Washington)
  • Chapel programming language developed by Cray
    Inc. as a part of DARPA High-Productivity
    Computing Systems program
  • Global view abstractions for user-defined
    reductions and scans
  • Programming for Parallelism and Locality with
    Hierarchically Tiled Arrays, Ganesh Bikshandi,
    Jia Guo, Daniel Hoeflinger (UIUC), Gheorghe
    Almasi (IBM T. J. Watson), Basilio B Fraguela
    (Universidade da Coruña), Maria Jesus Garzaran,
    David Padua (UIUC), Christoph von Praun (IBM T.
    J. Watson)
  • Hierarchically Tiled Arrays (HTAs) define a
    tiling structure for arrays
  • Reduction, map, scan, transpose, and shift
    operations are defined.

9
MinK in Chapel
  // invoked once for each element of A
  var minimums: [1..10] integer;
  minimums = mink(integer, 10) reduce A;
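The mink reduction keeps the 10 smallest elements of A. A Python analogue of such a user-defined reduction (a sketch of the concept, not the paper's code): the key requirement is an associative combiner over partial results, so the reduction can run in parallel in any grouping.

```python
import heapq
from functools import reduce

def mink_combiner(k):
    """Combiner for a user-defined 'k smallest' reduction.
    Each partial result is a list of at most k elements."""
    def combine(a, b):
        # merging two partials keeps the reduction associative
        return heapq.nsmallest(k, a + b)
    return combine

A = [42, 7, 19, 3, 88, 15, 3, 21, 64, 9, 2, 50]
# each element starts as a singleton partial result, then partials merge
minimums = reduce(mink_combiner(10), ([x] for x in A))
```

Because `combine` is associative, a global-view language like Chapel is free to evaluate the partials across processors and merge them in any order.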
10
HTA
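A toy illustration of the HTA idea (a stdlib-only sketch, not the paper's C++/MATLAB API): a 4x4 array stored as a 2x2 grid of 2x2 tiles, with a tiling-preserving elementwise map and a whole-array reduce.

```python
class TinyHTA:
    def __init__(self, tiles):
        self.tiles = tiles  # tiles[i][j] is a 2-D list (one tile)

    def map(self, f):
        """Apply f elementwise; the tiling structure is preserved."""
        return TinyHTA([[[[f(v) for v in row] for row in tile]
                         for tile in tile_row]
                        for tile_row in self.tiles])

    def reduce(self, f, init):
        """Fold f over every element of every tile."""
        acc = init
        for tile_row in self.tiles:
            for tile in tile_row:
                for row in tile:
                    for v in row:
                        acc = f(acc, v)
        return acc

h = TinyHTA([[[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
             [[[9, 10], [11, 12]], [[13, 14], [15, 16]]]])
total = h.reduce(lambda a, b: a + b, 0)   # sums all 16 elements
doubled = h.map(lambda v: 2 * v)
```

In the real system each tile can live on a different processor, so the tile boundaries express both locality and parallelism.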
11
Session 3: Performance Characterization
  • Performance characterization of bio-molecular
    simulations using molecular dynamics, Sadaf
    Alam, Pratul Agarwal, Al Geist, Jeffrey Vetter
    (ORNL)
  • Investigated performance bottlenecks in MD
    applications on supercomputers
  • Found that the implementations of the algorithms
    are not scalable
  • On-line Automated Performance Diagnosis on
    Thousands of Processors, Philip C. Roth (ORNL),
    Barton P. Miller (U. of Wisconsin, Madison)
  • Distributed and scalable performance analysis
    tool
  • Can analyze large applications with 1024
    processes and present the results in a folded
    graph.

12
Session 3: Performance Characterization
  • A Case Study in Top-Down Performance Estimation
    for a Large-Scale Parallel Application, Ilya
    Sharapov, Robert Kroeger, Guy Delamarter (Sun
    Microsystems)
  • Performance estimation of HPC workloads on future
    architectures
  • Based on low-level analysis and scalability
    predictions.
  • Predicts the performance of the Gyrokinetic
    Toroidal Code executed on Sun's future
    architectures

13
Session 4: Shared Memory Parallelism
  • Hardware Profile-guided Automatic Page Placement
    for ccNUMA Systems, Jaydeep Marathe, Frank
    Mueller (North Carolina State U.)
  • Profiles memory accesses and places pages
    accordingly.
  • 20% performance improvement with 2.7% overhead.
  • Adaptive Scheduling with Parallelism Feedback,
    Kunal Agrawal, Yuxiong He, Wen Jing Hsu, Charles
    Leiserson (Mass. Inst. of Tech.)
  • Allocates processors to jobs based on the past
    parallelism of the job.
  • Uses an R-trimmed mean for the feedback.
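One plausible reading of the trimmed-mean feedback (an assumption for illustration, not the paper's exact definition): drop the r largest and r smallest of the recent parallelism samples and average the rest, so transient bursts do not swing the processor allotment.

```python
def trimmed_mean(samples, r):
    """Average the samples after dropping the r smallest and r
    largest values (falls back to a plain mean if too few samples)."""
    s = sorted(samples)
    kept = s[r:len(s) - r] if len(s) > 2 * r else s
    return sum(kept) / len(kept)

# a one-step burst to 100 threads barely moves the feedback signal
feedback = trimmed_mean([4, 5, 4, 100, 3], 1)
```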

14
Session 4: Shared Memory Parallelism
  • Predicting Bounds on Queuing Delay for
    Batch-scheduled Parallel Machines, John Brevik,
    Daniel Nurmi, Rich Wolski (UCSB)
  • Binomial Method Batch Predictor (BMBP) that bases
    its predictions on the past wait times.
  • Uses 95th percentile and its predictions are
    close to real wait times experienced.
  • Optimizing Irregular Shared-Memory Applications
    for Distributed-Memory Systems, Ayon Basumallik,
    Rudolf Eigenmann (Purdue)
  • Converts OpenMP applications to MPI-based
    applications
  • Uses an inspector loop to find non-local accesses
    and to reorder loops.
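The queuing-delay predictor above can be approximated by a simple empirical quantile (a simplification for illustration; the actual BMBP derives a confidence bound on the quantile from the binomial distribution):

```python
def wait_bound(history, q=0.95):
    """Bound a new job's queue wait by the q-quantile of past waits."""
    s = sorted(history)
    return s[min(len(s) - 1, int(q * len(s)))]

past_waits = list(range(1, 101))   # minutes, synthetic data
bound = wait_bound(past_waits)     # 95th-percentile bound
```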

15
OpenMP-to-MPI
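The inspector-loop idea can be sketched as a classic inspector/executor pair (an illustration of the general technique, not the paper's implementation) for an irregular loop a[i] = b[idx[i]] where b is block-distributed and this process owns b[lo:hi]:

```python
def inspect(idx, lo, hi):
    """Pass 1: find which remote indices this process must fetch."""
    return sorted({j for j in idx if not (lo <= j < hi)})

def execute(idx, lo, hi, b_local, fetched):
    """Pass 2: run the loop using owned data plus the fetched values."""
    return [b_local[j - lo] if lo <= j < hi else fetched[j] for j in idx]

# this process owns b[0:5] of a global b = [100, 101, ..., 109]
b_local = [100, 101, 102, 103, 104]
idx = [1, 7, 3, 9]
remote = inspect(idx, 0, 5)               # indices to request via MPI
fetched = {7: 107, 9: 109}                # values returned by their owners
a = execute(idx, 0, 5, b_local, fetched)
```

The inspection cost is paid once per communication pattern, after which the loop runs with purely local accesses.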
16
Session 5: Atomicity Issues
  • Proving Correctness of Highly-Concurrent
    Linearizable Objects, Viktor Vafeiadis (U. of
    Cambridge), Maurice Herlihy (Brown U.), Tony
    Hoare (Microsoft Research Cambridge), Marc
    Shapiro (INRIA Rocquencourt LIP6)
  • Proves the safety of concurrent objects using
    Rely-Guarantee method
  • Each thread's rely condition must be satisfied,
    and each thread's guarantee condition must imply
    the others' rely conditions for every operation.
  • Accurate and Efficient Runtime Detection of
    Atomicity Errors in Concurrent Programs, Liqiang
    Wang, Scott D. Stoller (SUNY at Stony Brook)
  • Instruments the program to obtain a profile of
    memory accesses
  • Builds a tree of the conflicting accesses and
    applies algorithms to check conflict and view
    equivalence.

17
Session 5: Atomicity Issues
  • Scalable Synchronous Queues, William N. Scherer
    III (U. of Rochester), Doug Lea (SUNY Oswego),
    Michael L. Scott (U. of Rochester)
  • Best Paper
  • Details are coming up.

18
Session 6: Multicore Software
  • POSH: A TLS Compiler that Exploits Program
    Structure, Wei Liu, James Tuck, Luis Ceze,
    Wonsun Ahn (UIUC), Karin Strauss, Jose Renau
    (UCSC), Josep Torrellas (UIUC)
  • TLS compiler that divides the program into tasks
    and prunes the inefficient ones
  • Uses profiling to detect tasks that may violate
    frequently.
  • High-performance IPv6 Forwarding Algorithm for
    Multi-core and Multithreaded Network Processors,
    Hu Xianghui (U. of Sci. and Tech. of China),
    Xinan Tang (Intel), Bei Hua (U. of Sci. and Tech.
    of China)
  • New IPv6 forwarding algorithm optimized for Intel
    NPU features
  • Achieves 10Gbps speed for large routing tables
    with up to 400K entries.
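Longest-prefix match is the core operation behind any IPv6 forwarding algorithm. A minimal binary-trie sketch of that general technique (illustrative only; the paper's contribution lies in Intel NPU-specific optimizations); prefixes and addresses are written as bit strings:

```python
def insert(trie, prefix, nexthop):
    """Add a route: walk/create trie nodes bit by bit, store the hop."""
    node = trie
    for bit in prefix:
        node = node.setdefault(bit, {})
    node['nh'] = nexthop

def lookup(trie, addr):
    """Follow addr's bits, remembering the last next-hop seen, so the
    longest matching prefix wins."""
    node, best = trie, None
    for bit in addr:
        if 'nh' in node:
            best = node['nh']
        if bit not in node:
            return best
        node = node[bit]
    return node.get('nh', best)

table = {}
insert(table, '1', 'A')        # toy prefix 1/1 -> next hop A
insert(table, '101', 'B')      # longer, more specific prefix -> B
route = lookup(table, '1011')  # matches both; longest prefix wins
```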

19
Session 6: Multicore Software
  • MAMA! A Memory Allocator for Multithreaded
    Architectures, Simon Kahan, Petr Konecny (Cray
    Inc.)
  • A memory allocator that aggregates requests to
    reduce fragmentation
  • Transforms contention into collaboration
  • Experiments with micro-benchmarks show that it
    works

20
Session 7: Transactional Memory
  • A High Performance Software Transactional Memory
    System For A Multi-Core Runtime, Bratin Saha,
    Ali-Reza Adl-Tabatabai, Richard L. Hudson
    (Intel), Chi Cao Minh, Ben Hertzberg (Stanford)
  • Maps each memory location to a unique lock and
    acquires all the relevant locks before committing
    a transaction
  • Undo-logging, write-locking/read versioning,
    cache-line conflict detection
  • Exploiting Distributed Version Concurrency in a
    Transactional Memory Cluster, Kaloian Manassiev,
    Madalin Mihailescu, Cristiana Amza (UofT)
  • Transactional Memory system on commodity clusters
    for generic C and SQL applications
  • Diffs are applied by readers on demand and may
    violate writers.
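The lock-per-location scheme described for the first paper can be sketched as a tiny STM (a deliberate simplification: eager per-location write locks plus an undo log; the real system also does read versioning and cache-line-granularity conflict detection):

```python
import threading

class TinySTM:
    def __init__(self, size):
        self.mem = [0] * size
        self.locks = [threading.Lock() for _ in range(size)]

    def run(self, body):
        undo, held = {}, []
        def write(addr, val):
            if addr not in undo:
                self.locks[addr].acquire()    # eager write lock
                held.append(addr)
                undo[addr] = self.mem[addr]   # log the old value
            self.mem[addr] = val
        def read(addr):
            return self.mem[addr]
        try:
            body(read, write)                 # commit: keep new values
        except Exception:
            for addr, old in undo.items():    # abort: roll back the log
                self.mem[addr] = old
        finally:
            for addr in held:
                self.locks[addr].release()

stm = TinySTM(2)
stm.mem[0] = 100

def transfer(read, write):
    write(1, read(0))
    write(0, 0)

stm.run(transfer)                # mem becomes [0, 100] atomically

def failing(read, write):
    write(0, 999)
    raise RuntimeError("forced abort")

stm.run(failing)                 # aborts; undo log restores mem
```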

21
Session 7: Transactional Memory
  • Hybrid Transactional Memory, Sanjeev Kumar
    (Intel), Michael Chu (U. of Mich.), Christopher
    Hughes, Partha Kundu, Anthony Nguyen (Intel)
  • Hardware and Software TM together
  • Extends DSTM
  • Conflict detection is based on loading and
    storing the state field of the object wrapper and
    the locator field.

22
Session 8: Potpourri
  • Fast and Transparent Recovery for Continuous
    Availability of Cluster-based Servers, Rosalia
    Christodoulopoulou, Kaloian Manassiev (UofT),
    Angelos Bilas (U. of Crete), Cristiana Amza
    (UofT)
  • Recovery from failure on virtual shared memory
    systems
  • Based on page replication on backup nodes
  • Failure-free overhead of 3-8% and recovery cost
    below 600 ms.
  • Minimizing Execution Time in MPI Programs on an
    Energy-Constrained, Power-Scalable Cluster, Rob
    Springer, David K. Lowenthal, Barry Rountree
    (The U. of Georgia), Vincent W. Freeh (North
    Carolina State U.)
  • Finds the combination of processor count and
    gear (CPU frequency setting) that minimizes
    power and execution time.
  • Found the optimal schedule in 50% of the programs
    by exploring only 7% of the search space.
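The scheduling problem can be stated as a toy search (illustrative numbers and tuple layout, not the paper's data or algorithm): among candidate (node count, gear) configurations, pick the fastest one whose energy consumption fits the budget.

```python
def best_schedule(configs, energy_budget_j):
    """configs: (nodes, gear, time_s, power_w) tuples.
    Return the fastest configuration within the energy budget."""
    feasible = [c for c in configs if c[2] * c[3] <= energy_budget_j]
    return min(feasible, key=lambda c: c[2], default=None)

configs = [
    (4, 0, 120.0, 400.0),   # 4 nodes, top gear: fast but power-hungry
    (4, 1, 150.0, 260.0),
    (8, 1, 100.0, 500.0),
    (8, 2, 130.0, 320.0),
]
sched = best_schedule(configs, energy_budget_j=45000.0)
```

The paper's contribution is avoiding this exhaustive enumeration: it found the optimum for half the programs while visiting only a small fraction of such a space.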

23
Session 8: Potpourri
  • Teaching parallel computing to science faculty:
    best practices and common pitfalls, David Joiner
    (Kean U.), Paul Gray (U. of Northern Iowa),
    Thomas Murphy (Contra Costa College), Charles
    Peck (Earlham College)
  • Experience in teaching parallel programming in a
    community college

24
Keynote Speeches and Panel
  • Parallel Programming and Code Selection in
    Fortress, Guy L. Steele Jr., Sun Fellow, Sun
    Microsystems Laboratories
  • Parallel Programming in Modern Web Search
    Engines, Raymie Stata, Chief Architect for
    Search Marketplace, Yahoo!, Inc.
  • Software Issues for Multicore Systems,
    Moderator: James Larus (Microsoft Research),
    Panelists: Saman Amarasinghe (MIT), Richard
    Brunner (AMD), Luddy Harrison (UIUC), David Kuck
    (Intel), Michael Scott (U. Rochester), Burton
    Smith (Microsoft), Kevin Stoodley (IBM)

25
Guy L. Steele: Parallel Programming and Code
Selection in Fortress
  • To do for Fortran what Java did for C
  • Dynamic compilation
  • Platform independence
  • Security model including type checking
  • Research funded in part by DARPA through its
    High Productivity Computing Systems program
  • Don't build the language; grow it
  • Make programming notation closer to math
  • Ease use of parallelism
  • Can a feature be provided by a library rather
    than by the compiler?
  • Programmers (especially library writers) need not
    fear subroutines, functions, methods, and
    interfaces for performance reasons

26
Guy L. Steele: Parallel Programming and Code
Selection in Fortress
  • Type system: objects and traits
  • Traits like interfaces, but may contain code
  • Primitive types are first-class
  • Booleans, integers, floats, characters are all
    objects
  • Transactional access to shared variables
  • Fortress loops are parallel by default
  • Programming language notation can become closer
    to mathematical notation

27
Guy L. Steele: Parallel Programming and Code
Selection in Fortress
28
Panel Software Issues for Multicore Systems
  • Performance Conscious Languages
  • Languages that increase programmer productivity
    while making it easier to optimize
  • New Compiler Opportunities
  • New languages that take performance seriously
  • Possible compiler support for using multicores
    for other than parallelism
  • Security Enforcement
  • Program Introspection
  • Meanwhile, vast majority of applications
    programmers have no idea about parallelism
  • More dual-core chips in mid-2006, quad-core in
    2007 (AMD)
  • Software architecture challenges (debugging,
    profiling, making multi-threading easier, etc.)

29
Panel Software Issues for Multicore Systems
  • Some Successes in Using Multi-Core (OS support,
    transactional memory, virtualization, efficient
    JVMs)
  • Parallel software systems must be much simpler,
    architecturally, than sequential ones if they
    have a chance of holding together
  • We will struggle before finally accepting that
    the cache abstraction does not scale
  • Efficient point-to-point communication is
    required
  • Most success will be achieved on nonstandard
    multicore platforms like graphics processors,
    network processors, signal processors, where
    there is less investment in caches.
  • We need new apps to drive the interest towards
    multicores
  • Where will the parallelism come from? (dataflow,
    reduce/map/scan, speculative parallelization,
    etc.)

30
Panel Software Issues for Multicore Systems
  • The explicit sacrifice of single-thread
    performance in favor of parallel performance
  • Most vulnerable communities
  • Those who have not previously been exposed to or
    had a need for parallel systems, for example:
  • Typical client software, mobile devices
  • Server transactions with significant internal
    complexity
  • Those who chronically need to drive the maximum
    performance from their computer systems, for
    example:
  • High performance computing
  • Gamers
  • Beyond 8 cores, we do not know whether
    multicores will be useful

31
Readings For Future CARG
  • Optimizing Irregular Shared-Memory Applications
    for Distributed-Memory Systems, Ayon Basumallik,
    Rudolf Eigenmann (Purdue)
  • POSH: A TLS Compiler that Exploits Program
    Structure, Wei Liu, James Tuck, Luis Ceze,
    Wonsun Ahn (UIUC), Karin Strauss, Jose Renau
    (UCSC), Josep Torrellas (UIUC)
  • MAMA! A Memory Allocator for Multithreaded
    Architectures, Simon Kahan, Petr Konecny (Cray
    Inc.)
  • Hybrid Transactional Memory, Sanjeev Kumar
    (Intel), Michael Chu (U. of Mich.), Christopher
    Hughes, Partha Kundu, Anthony Nguyen (Intel)