Using PLAPACK and MPICH-G2 to Grid-Enable Bayesian Geostatistical Models

1
Using PLAPACK and MPICH-G2 to Grid-Enable
Bayesian Geostatistical Models
Wenli He, Shaowen Wang, Jun Yan, Mary Kathryn
Cowles, Marc P. Armstrong

Grid Research educatiOn group @ IoWa (GROW)
The University of Iowa
June 13, 2006

2
Outline

Motivation
Introduction
Background
Parallel computing methods
Strategies
Technologies
Experiments
Conclusion
Future work

3
A Motivating Problem: A Lung Cancer Risk Study
Based on Residential Radon Concentrations
Motivation
Introduction
Background
Methods
Experiments
Conclusions
Significance
[Figure: Iowa Radon Measurements; Predicted Radon Concentrations. From Dr. Brian J. Smith]
4
Introduction

Application areas
GIScience
Large-scale spatial-temporal data mining and
inference
Methods
Using geostatistical models to characterize the
spatial distributions of environmental processes
or disease-related outcomes
Using Bayesian inference with Markov chain Monte
Carlo methods to fit geostatistical models
Challenges
Computationally intensive (revisited later)

5
Bayesian Inference

Bayesian methods
Fit statistical models
Markov chain Monte Carlo (MCMC) methods
Help fit complex models numerically
Iteratively sample from the joint posterior
distribution of the model parameters
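
As a rough illustration of iterative posterior sampling, here is a minimal random-walk Metropolis sketch in Python. It is a generic textbook sampler, not the sampler used in this work. The one-parameter model (normal likelihood with unit variance, normal prior) and the names `log_posterior`, `metropolis`, `prior_sd`, and `step` are assumptions made for the example.

```python
import math
import random


def log_posterior(theta, data, prior_sd=10.0):
    """Unnormalized log posterior: y ~ N(theta, 1), theta ~ N(0, prior_sd^2)."""
    log_lik = -0.5 * sum((y - theta) ** 2 for y in data)
    log_prior = -0.5 * (theta / prior_sd) ** 2
    return log_lik + log_prior


def metropolis(data, n_iter, step=0.5, seed=0):
    """Random-walk Metropolis: return n_iter draws of theta (burn-in included)."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, data)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


if __name__ == "__main__":
    draws = metropolis([1.0, 1.2, 0.8, 1.1, 0.9], n_iter=5000)
    kept = draws[1000:]  # discard a burn-in period
    print(sum(kept) / len(kept))
```

In a geostatistical model each evaluation of the log posterior involves a matrix computation over all sites, which is why the per-iteration cost dominates.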

6
Markov chain Monte Carlo

MCMC algorithms for geostatistical models are
computationally intensive
Each iteration of a sampler requires matrix
calculations such as inversion of a correlation
matrix whose order is equal to the number of
geographic sites
Memory intensive
Some samplers must be run for tens or hundreds of
thousands of iterations
Cholesky factorization is used for the matrix
calculations in our MCMC algorithms
O(n^3) operations for a matrix of order n
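
To make the cost concrete, the following serial Python sketch evaluates a Gaussian log-likelihood through a Cholesky factor instead of an explicit matrix inverse. It is only an illustration under stated assumptions: the function names are hypothetical, the matrix is a plain list of lists, and nothing here uses PLAPACK or reproduces the parallel code described in these slides.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        diag = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if diag <= 0.0:
            raise ValueError("matrix is not positive definite")
        low[j][j] = math.sqrt(diag)
        for i in range(j + 1, n):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def forward_solve(low, b):
    """Solve low z = b by forward substitution."""
    z = []
    for i, row in enumerate(low):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z


def gaussian_loglik(corr, resid, sigma2):
    """Log density of resid ~ N(0, sigma2 * corr), using the Cholesky factor."""
    n = len(resid)
    low = cholesky(corr)
    z = forward_solve(low, resid)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    quad = sum(v * v for v in z) / sigma2
    return -0.5 * (n * math.log(2.0 * math.pi * sigma2) + log_det + quad)
```

The three nested loops in `cholesky` are where the O(n^3) cost comes from, and the factorization has to be repeated whenever the sampler changes the correlation parameters.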

7
Background Summary

Analyses require days or weeks to complete
We have developed parallel algorithms to leverage
the power of the TeraGrid in order to
substantially reduce the amount of time needed to
fit Bayesian geostatistical models to large
datasets

8
Research Novelty

Fully exploit MCMC parallelisms using Grid
technologies
Single-chain
Multiple-chain

9
Parallel Strategies

Parallelize matrix calculations in each single
chain
Distribute data to multiple computing nodes
Reduce memory requirements
Divide matrix calculations among multiple computing
nodes
Reduce computing time
Run independent parallel chains
Divide computing nodes into groups
Each group runs an independent chain
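
The memory benefit of distributing the data can be estimated with a small Python sketch. It assumes a simple contiguous block-row partition of a dense n x n matrix stored in double precision. PLAPACK's own distribution scheme is more elaborate and is not reproduced here, and the function names are hypothetical.

```python
def block_row_ranges(n_rows, n_nodes):
    """Split rows 0..n_rows-1 into n_nodes contiguous, nearly equal blocks."""
    base, extra = divmod(n_rows, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def bytes_per_node(n_rows, n_nodes, bytes_per_entry=8):
    """Largest per-node storage for a dense n_rows x n_rows matrix."""
    largest = max(end - start for start, end in block_row_ranges(n_rows, n_nodes))
    return largest * n_rows * bytes_per_entry


if __name__ == "__main__":
    # A 12,000-site correlation matrix on 1 node versus 16 nodes.
    print(bytes_per_node(12000, 1))
    print(bytes_per_node(12000, 16))
```

Under these assumptions a 12,000-site matrix needs about 1.15 GB on one node and about 72 MB per node when spread over 16 nodes.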

10
Technologies Used: PLAPACK, SPRNG

PLAPACK (Parallel Linear Algebra Package)
Based on MPI (Message Passing Interface)
Provides C and Fortran interfaces
Object-oriented programming style
http://www.cs.utexas.edu/users/plapack/
SPRNG (Scalable Parallel Random Number Generator)
Generates independent pseudo-random number
streams on each processor
Guarantees the samples from different chains are
independent
http://sprng.cs.fsu.edu/

11
Technologies Used: MPICH-G2

Support for the execution of MPI programs on
Grids
Communication topology management
Helps split processors into groups
The processors of each group belong to the same
cluster
Each group runs a single chain
Cross-cluster communication cost is minimal
http://www3.niu.edu/mpi/
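
The grouping rule (every group stays inside one cluster, and each group runs one chain) can be sketched in plain Python. The sketch only computes a chain id per rank. In an MPI program such an id could be passed as the color argument of `MPI_Comm_split` to build one communicator per chain. The input list of cluster names is a hypothetical stand-in for the topology information that MPICH-G2 provides, and the function name is an assumption.

```python
def assign_chain_colors(cluster_of_rank, group_size):
    """Map each rank to a chain id so that every group lies within one cluster.

    cluster_of_rank: list where entry r is the cluster name of rank r.
    Returns one chain id per rank; ranks left over after forming whole
    groups in their cluster get None.
    """
    by_cluster = {}
    for rank, cluster in enumerate(cluster_of_rank):
        by_cluster.setdefault(cluster, []).append(rank)

    colors = [None] * len(cluster_of_rank)
    next_chain = 0
    for cluster in sorted(by_cluster):
        ranks = by_cluster[cluster]
        full = len(ranks) - len(ranks) % group_size
        for start in range(0, full, group_size):
            for rank in ranks[start:start + group_size]:
                colors[rank] = next_chain
            next_chain += 1
    return colors


if __name__ == "__main__":
    # Hypothetical layout: 8 ranks in cluster "A", 4 ranks in cluster "B".
    layout = ["A"] * 8 + ["B"] * 4
    print(assign_chain_colors(layout, group_size=4))
```

Because no group spans two clusters, the communication-heavy matrix operations of a chain stay on the fast intra-cluster network.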

[Figure: twelve computing nodes (Node 1 to Node 12) across Cluster A and Cluster B, divided into groups running Chain 1, Chain 2, and Chain 3]
12
Computing Environments

A local Beowulf cluster
Limited number of computing nodes, < 100
Intra-network connection
1 Gigabit Ethernet
The NSF TeraGrid
8 supercomputing sites interconnected by three to
four 10 Gbps optical fiber "lambdas" (light
pipelines)
Thousands of computing nodes at each site
Intra-site network
Myrinet 2000 or Gigabit Ethernet
An excellent environment for parallel
applications like ours

13
Experiment (1): Computing Time (Seconds) for
Parallelizing a Single Chain

Computing nodes from a single TeraGrid cluster
Simulated datasets
Up to 12,000 data points
Length of chains: ONLY 10 iterations

14
Experiment (1): Speedup
15
Experiment (2): Two Test Cases of Running
Parallel Chains

Load balancing issues
One chain slower by 23 than another chain

16
Speedup Analysis

Strictly speaking, speedup can only be obtained
after the statisticians' output analysis
Depends on the length of the burn-in period
A simple formula to use
Burn-in: b iterations; monitoring run: n
iterations
Serial context: a chain of length b + n
N parallel chains: each chain of length b + n/N
Speed-up: (b + n) / (b + n/N)
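
Taking the serial chain length as b + n and each parallel chain's length as b + n/N, the ratio of the two can be computed with a short Python sketch. The function name and the example values are assumptions, and iterations are treated as equally costly in both settings.

```python
def parallel_chain_speedup(burn_in, monitoring, n_chains):
    """Speedup of n_chains parallel chains over one serial chain.

    Serial cost: burn_in + monitoring iterations.
    Parallel cost: every chain still pays the full burn-in, then runs
    monitoring / n_chains iterations.
    """
    serial = burn_in + monitoring
    parallel = burn_in + monitoring / n_chains
    return serial / parallel


if __name__ == "__main__":
    # Hypothetical run: b = 1000, n = 10000, on 4 and then 64 chains.
    print(parallel_chain_speedup(1000, 10000, 4))
    print(parallel_chain_speedup(1000, 10000, 64))
```

Because every chain repeats the burn-in, the ratio can never exceed (b + n) / b, so a long burn-in limits the benefit of adding chains.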

17
Conclusions

Parallelize MCMC-based Bayesian inference for
geostatistical models
Methods
Parallelizing matrix calculations within single
chains
Running parallel chains
TeraGrid provides an ideal computing environment
for solving this class of problems
Initial results demonstrate significant
performance gains

18
Research Significance

Impact
Significantly reduce the time needed to conduct
large-scale analyses
Novelty
We are the first
Exploiting both single-chain and multiple-chain
parallelism in MCMC-based Bayesian geostatistical
models
Supporting large-scale geographic data analysis
Using Grid computing as enabling technologies

19
Future Work

Improve the parallelization of matrix
calculations
Address load balancing for parallel chains
Further conduct speedup analysis for parallel
chains
Apply our methods to large-scale geographic
analysis problems

20
Acknowledgement

This research is partially supported by the
HawkGrid project funded by the Office of Vice
President for Research at The University of Iowa
The NSF TeraGrid resource allocations
TG-DMS040004T
TG-SES060003T

21
Thanks!