Using PLAPACK and MPICH-G2 to Grid-Enable Bayesian Geostatistical Models

1
Using PLAPACK and MPICH-G2 to Grid-Enable
Bayesian Geostatistical Models
Wenli He, Shaowen Wang, Jun Yan, Mary Kathryn Cowles, Marc P. Armstrong
  • Grid Research educatiOn group @ IoWa (GROW)
  • The University of Iowa
  • June 13, 2006

2
Outline
  • Motivation
  • Introduction
  • Background
  • Parallel computing methods
      • Strategies
      • Technologies
  • Experiments
  • Conclusion
  • Future work

3
A Motivating Problem: A Lung Cancer Risk Study
Based on Residential Radon Concentrations
Motivation
Introduction
Background
Methods
Experiments
Conclusions
Significance
[Figures: Iowa radon measurements; predicted radon concentrations. From Dr. Brian J. Smith]

4
Introduction
  • Application areas
      • GIScience
      • Large-scale spatial-temporal data mining and
        inference
  • Methods
      • Using geostatistical models to characterize the
        spatial distributions of environmental processes
        or disease-related outcomes
      • Using Bayesian inference with Markov chain Monte
        Carlo (MCMC) methods to fit geostatistical models
  • Challenges
      • Computationally intensive (revisited later)

5
Bayesian Inference
  • Bayesian methods
      • Fit statistical models
  • Markov chain Monte Carlo (MCMC) methods
      • Help fit complex models numerically
      • Iteratively sample from the joint posterior
        distribution of the model parameters
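
To make "iteratively sample from the posterior" concrete, here is a minimal sketch of a random-walk Metropolis sampler for a single parameter. It is not the sampler used in this work: the target `toy_log_post`, the step size, the seed, and the burn-in length are all assumptions chosen for illustration.

```python
import math
import random


def metropolis(log_post, start, n_iter, step, rng):
    """Random-walk Metropolis: returns the list of sampled parameter values."""
    current = start
    current_lp = log_post(current)
    draws = []
    for _ in range(n_iter):
        # Propose a move by perturbing the current value with Gaussian noise.
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio), on the log scale.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        draws.append(current)
    return draws


def toy_log_post(theta):
    # Hypothetical target: unnormalised log density of a Normal(2, 1) posterior.
    return -0.5 * (theta - 2.0) ** 2


rng = random.Random(0)
draws = metropolis(toy_log_post, start=0.0, n_iter=20000, step=1.0, rng=rng)
kept = draws[2000:]  # discard an assumed burn-in of 2000 iterations
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 1))
```

In a geostatistical model each evaluation of the log posterior involves the correlation matrix of all sites, which is where the cost per iteration comes from.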

6
Markov chain Monte Carlo
  • MCMC algorithms for geostatistical models are
    computationally intensive
      • Each iteration of a sampler requires matrix
        calculations, such as inversion of a correlation
        matrix whose order equals the number of
        geographic sites
      • Memory intensive
      • Some samplers must be run for tens or hundreds of
        thousands of iterations
  • Cholesky factorization is used for the matrix
    calculations in our MCMC algorithms
      • O(n^3) operations for a matrix of order n
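
The cubic cost can be seen in a plain, serial Cholesky factorization. The sketch below is illustrative only and is not the PLAPACK routine used in this work; the exponential correlation function, the site coordinates, and the range parameter `phi` are hypothetical.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T = a (a symmetric positive definite)."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract squares of entries already computed in row j.
        diag = a[j][j] - sum(lower[j][k] ** 2 for k in range(j))
        if diag <= 0.0:
            raise ValueError("matrix is not positive definite")
        lower[j][j] = math.sqrt(diag)
        for i in range(j + 1, n):
            # Loops over j, i, and the inner sum over k give O(n^3) work.
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


def exp_correlation(sites, phi):
    """Hypothetical exponential correlation: exp(-distance / phi) between sites."""
    return [[math.exp(-math.dist(p, q) / phi) for q in sites] for p in sites]


sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
corr = exp_correlation(sites, phi=1.5)
factor = cholesky(corr)
# Log-determinant of the correlation matrix, as needed by a Gaussian likelihood.
log_det = 2.0 * sum(math.log(factor[i][i]) for i in range(len(sites)))
```

With the factor in hand, the log-determinant and linear solves come from triangular operations, so the matrix never has to be inverted explicitly.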

7
Background Summary
  • Analyses require days or weeks to complete
  • We have developed parallel algorithms to leverage
    the power of the TeraGrid in order to
    substantially reduce the amount of time needed to
    fit Bayesian geostatistical models to large
    datasets

8
Research Novelty
  • Fully exploit MCMC parallelism using Grid
    technologies
      • Single-chain
      • Multiple-chain

9
Parallel Strategies
  • Parallelize matrix calculations in each single
    chain
      • Distribute data to multiple computing nodes to
        reduce memory requirements
      • Divide matrix calculations among multiple
        computing nodes to reduce computing time
  • Run independent parallel chains
      • Divide computing nodes into groups
      • Each group runs an independent chain
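
As a rough illustration of how distributing the data reduces memory per node, the sketch below splits the rows of an n-by-n matrix into contiguous blocks, one block per node. This simplified one-dimensional layout is an assumption for illustration and is not the distribution PLAPACK manages internally; the node count of 16 is hypothetical.

```python
def block_rows(n_rows, n_nodes):
    """Split row indices 0..n_rows-1 into n_nodes contiguous, near-equal blocks."""
    base, extra = divmod(n_rows, n_nodes)
    blocks = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional row each.
        size = base + (1 if node < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


n_sites = 12000   # matrix order; 12,000 is the largest dataset size in the experiments
n_nodes = 16      # hypothetical node count
blocks = block_rows(n_sites, n_nodes)
# Storage for double-precision (8-byte) entries, per node and for one serial copy.
bytes_per_node = max(len(b) for b in blocks) * n_sites * 8
bytes_serial = n_sites * n_sites * 8
```

Under these assumptions a full copy of the matrix takes about 1.15 GB, while each of the 16 nodes holds 72 MB.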

10
Technologies Used: PLAPACK, SPRNG
  • PLAPACK (Parallel Linear Algebra Package)
      • Based on MPI (Message Passing Interface)
      • Provides C and Fortran interfaces
      • Object-oriented programming style
      • http://www.cs.utexas.edu/users/plapack/
  • SPRNG (Scalable Parallel Random Number Generator)
      • Generates independent pseudo-random number
        streams on each processor
      • Guarantees that the samples from different
        chains are independent
      • http://sprng.cs.fsu.edu/

11
Technologies Used: MPICH-G2
  • Supports the execution of MPI programs on Grids
  • Communication topology management
      • Helps split processors into groups
      • The processors of each group belong to the same
        cluster
      • Each group runs a single chain
      • Cross-cluster communication cost is minimal
  • http://www3.niu.edu/mpi/
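
The grouping rule can be sketched without any MPI calls: given which cluster each process runs on, assign chain ids so that no chain spans two clusters. In an MPI program the resulting chain id could serve as the `color` argument to `MPI_Comm_split`. The function name `assign_chains`, the cluster labels, and the group size are hypothetical, and how the cluster of each rank is discovered is left out.

```python
from collections import defaultdict


def assign_chains(rank_to_cluster, group_size):
    """Map each rank to a chain id so that no chain spans two clusters."""
    by_cluster = defaultdict(list)
    for rank in sorted(rank_to_cluster):
        by_cluster[rank_to_cluster[rank]].append(rank)

    chain_of = {}
    next_chain = 0
    for cluster in sorted(by_cluster):
        ranks = by_cluster[cluster]
        # Cut this cluster's ranks into consecutive groups of `group_size`.
        for start in range(0, len(ranks), group_size):
            for rank in ranks[start:start + group_size]:
                chain_of[rank] = next_chain
            next_chain += 1
    return chain_of


# Hypothetical layout: 8 ranks on cluster "A", 4 ranks on cluster "B".
rank_to_cluster = {r: ("A" if r < 8 else "B") for r in range(12)}
chain_of = assign_chains(rank_to_cluster, group_size=4)
```

Because all matrix communication stays inside a group, only the final collection of samples has to cross the slower links between clusters.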

[Figure: Nodes 1 to 12 in Cluster A and Cluster B, grouped into Chain 1, Chain 2, and Chain 3]

12
Computing Environments
  • A local Beowulf cluster
      • Limited number of computing nodes (<100)
      • Intra-cluster network connection: 1 Gigabit
        Ethernet
  • The NSF TeraGrid
      • 8 supercomputing sites interconnected by three
        to four 10 Gbps optical fiber "lambdas" (light
        pipelines)
      • Thousands of computing nodes at each site
      • Intra-site network: Myrinet 2000 or Gigabit
        Ethernet
      • An excellent environment for parallel
        applications like ours

13
Experiment (1): Computing Time (Seconds) for
Parallelizing a Single Chain
  • Computing nodes from a single TeraGrid cluster
  • Simulated datasets
      • Up to 12,000 data points
  • Length of chains: ONLY 10 iterations

14
Experiment (1): Speedup
15
Experiment (2): Two Test Cases of Running
Parallel Chains
  • Load balancing issues
  • One chain slower by 23 than another chain

16
Speedup Analysis
  • Strictly speaking, speedup can only be determined
    after the statisticians' output analysis
      • Depends on the length of the burn-in period
  • A simple formula to use
      • Burn-in: b iterations; monitoring run: n
        iterations
      • Serial context: one chain of length b + n
      • N parallel chains: each chain of length b + n/N
      • Speed-up = (b + n) / (b + n/N)
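
A short worked example of this formula, assuming the chain lengths on the slide read as b + n for the serial chain and b + n/N for each parallel chain. The run lengths below are hypothetical and the count is in iterations only, ignoring load imbalance and communication.

```python
def parallel_chain_speedup(burn_in, monitored, n_chains):
    """Speed-up of N parallel chains over one serial chain, counted in iterations."""
    serial_length = burn_in + monitored
    # Every parallel chain repeats the full burn-in but shares the monitored draws.
    parallel_length = burn_in + monitored / n_chains
    return serial_length / parallel_length


# Hypothetical run lengths: b = 1,000 burn-in and n = 10,000 monitored iterations.
for n_chains in (1, 2, 4, 8):
    print(n_chains, round(parallel_chain_speedup(1000, 10000, n_chains), 2))
```

Because every chain pays the full burn-in, the speed-up can never exceed (b + n) / b however many chains are added, which is why the burn-in length matters so much.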

17
Conclusions
  • Parallelize MCMC-based Bayesian inference for
    geostatistical models
  • Methods
      • Parallelizing matrix calculations within single
        chains
      • Running parallel chains
  • TeraGrid provides an ideal computing environment
    for solving this class of problems
  • Initial results demonstrate significant
    performance gains

18
Research Significance
  • Impact
      • Significantly reduce the time span for conducting
        large-scale analyses
  • Novelty
      • We are the first to exploit both single-chain and
        multiple-chain parallelism in MCMC-based Bayesian
        geostatistical models
      • Supporting large-scale geographic data analysis
      • Using Grid computing as an enabling technology

19
Future Work
  • Improve the parallelization of matrix
    calculations
  • Address load balancing for parallel chains
  • Conduct further speedup analysis for parallel
    chains
  • Apply our methods to large-scale geographic
    analysis problems

20
Acknowledgement
  • This research is partially supported by the
    HawkGrid project funded by the Office of the Vice
    President for Research at The University of Iowa
  • The NSF TeraGrid resource allocations
      • TG-DMS040004T
      • TG-SES060003T

21
Thanks!
  • Comments or questions?