Title: Using PLAPACK and MPICH-G2 to Grid-Enable Bayesian Geostatistical Models
1. Using PLAPACK and MPICH-G2 to Grid-Enable Bayesian Geostatistical Models
Wenli He, Shaowen Wang, Jun Yan, Mary Kathryn
Cowles, Marc P. Armstrong
- Grid Research educatiOn group _at_ IoWa (GROW)
- The University of Iowa
- June 13, 2006
2. Outline
- Motivation
- Introduction
- Background
- Parallel computing methods
- Strategies
- Technologies
- Experiments
- Conclusion
- Future work
3. A Motivating Problem: A Lung Cancer Risk Study Based on Residential Radon Concentrations
Motivation
Introduction
Background
Methods
Experiments
Conclusions
Significance
Iowa Radon Measurements
Predicted Radon Concentrations
From Dr. Brian J. Smith
4. Introduction
- Application areas
  - GIScience
  - Large-scale spatial-temporal data mining and inference
- Methods
  - Using geostatistical models to characterize the spatial distributions of environmental processes or disease-related outcomes
  - Using Bayesian inference with Markov chain Monte Carlo methods to fit geostatistical models
- Challenges
  - Computationally intensive (revisited later)
5. Bayesian Inference
- Bayesian methods
  - Fit statistical models
- Markov chain Monte Carlo (MCMC) methods
  - Help fit complex models numerically
  - Iteratively sample from the joint posterior distribution of the model parameters
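To make "iteratively sample from the posterior" concrete, here is a minimal random-walk Metropolis sketch in Python. It is not the sampler used in this work: the target (a standard normal log-density), the step size, and all function names are illustrative assumptions.

```python
import math
import random

def log_target(theta):
    # Hypothetical log posterior (up to a constant): standard normal.
    return -0.5 * theta * theta

def metropolis(log_post, start, n_iter, step=1.0, seed=0):
    """Random-walk Metropolis; returns the list of sampled values."""
    rng = random.Random(seed)
    theta = start
    current = log_post(theta)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        cand = log_post(proposal)
        # Accept with probability min(1, post(proposal) / post(theta)).
        if math.log(1.0 - rng.random()) < cand - current:
            theta, current = proposal, cand
        samples.append(theta)
    return samples

samples = metropolis(log_target, start=0.0, n_iter=20000)
kept = samples[2000:]                 # discard a burn-in period
post_mean = sum(kept) / len(kept)     # Monte Carlo estimate of the posterior mean
```

In a geostatistical model, each call to the log posterior would involve the matrix calculations on the correlation matrix, which is where the cost per iteration comes from.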
6. Markov chain Monte Carlo
- MCMC algorithms for geostatistical models are computationally intensive
  - Each iteration of a sampler requires matrix calculations such as inversion of a correlation matrix whose order is equal to the number of geographic sites
  - Memory intensive
  - Some samplers must be run for tens or hundreds of thousands of iterations
- Cholesky factorization is used for the matrix calculations in our MCMC algorithms
  - O(n^3)
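As an illustration of why the Cholesky factorization is the key kernel, the sketch below factors a small correlation matrix and reuses the factor to solve a linear system and to obtain the log-determinant, without forming an explicit inverse. It is plain serial Python for readability, not PLAPACK; the exponential correlation function and the site coordinates are assumptions made for the example.

```python
import math

def cholesky(a):
    """Return lower-triangular L with L * L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        low[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            t = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = t / low[j][j]
    return low

def solve_spd(low, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

# Hypothetical example: exponential correlation between four sites on a line.
sites = [0.0, 1.0, 2.5, 4.0]
corr = [[math.exp(-abs(s - t)) for t in sites] for s in sites]
low = cholesky(corr)
log_det = 2.0 * sum(math.log(low[i][i]) for i in range(len(sites)))
x = solve_spd(low, [1.0, 0.0, 0.0, 0.0])
```

The three nested loops in `cholesky` (over j, i, and k) are the source of the cubic growth in work as the number of sites increases.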
7. Background Summary
- Analyses require days or weeks to complete
- We have developed parallel algorithms to leverage the power of the TeraGrid in order to substantially reduce the amount of time needed to fit Bayesian geostatistical models to large datasets
8. Research Novelty
- Fully exploit MCMC parallelisms using Grid technologies
  - Single-chain
  - Multiple-chain
9. Parallel Strategies
- Parallelize matrix calculations in each single chain
  - Distribute data to multiple computing nodes
    - Reduce memory requirements
  - Divide matrix calculations among multiple computing nodes
    - Reduce computing time
- Run independent parallel chains
  - Divide computing nodes into groups
  - Each group runs an independent chain
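A rough back-of-the-envelope sketch of the memory argument, in Python. It assumes the correlation matrix is stored dense in double precision and split evenly across nodes, and it ignores workspace and communication buffers; the node count of 16 is arbitrary, and 12,000 sites is the largest dataset size mentioned for the experiments.

```python
def matrix_bytes(n_sites, bytes_per_value=8):
    """Bytes needed to store a dense n_sites x n_sites matrix of doubles."""
    return n_sites * n_sites * bytes_per_value

def bytes_per_node(n_sites, n_nodes, bytes_per_value=8):
    """Approximate share per node under an even distribution (rounded up)."""
    total = matrix_bytes(n_sites, bytes_per_value)
    return -(-total // n_nodes)  # ceiling division

total = matrix_bytes(12000)        # 1_152_000_000 bytes, about 1.15 GB
share = bytes_per_node(12000, 16)  # 72_000_000 bytes, about 72 MB
```

Under these assumptions a single copy of the matrix exceeds 1 GB on one node, while each of 16 nodes would hold roughly 72 MB of it.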
10. Technologies Used: PLAPACK, SPRNG
- PLAPACK (Parallel Linear Algebra Package)
  - Based on MPI (Message Passing Interface)
  - Provides C and Fortran interfaces
  - Object-oriented programming style
  - http://www.cs.utexas.edu/users/plapack/
- SPRNG (Scalable Parallel Random Number Generator)
  - Generates independent pseudo-random number streams on each processor
  - Guarantees the samples from different chains are independent
  - http://sprng.cs.fsu.edu/
11. Technologies Used: MPICH-G2
- Support for the execution of MPI programs on Grids
- Communication topology management
  - Help split processors into groups
  - The processors of each group belong to the same cluster
  - Each group runs a single chain
  - Cross-cluster communication cost is minimal
- http://www3.niu.edu/mpi/
[Diagram: Node 1 through Node 12, located on Cluster A and Cluster B, divided into three groups running Chain 1, Chain 2, and Chain 3]
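The grouping idea can be sketched as computing one "color" (chain id) per MPI rank such that no chain spans two clusters; such colors are the kind of value one would pass to the standard `MPI_Comm_split` call. The Python function below only computes the assignment. How each rank's cluster is discovered is left out, and the function name, the 8 + 4 layout, and the group size are hypothetical.

```python
def assign_chain_colors(rank_clusters, group_size):
    """Map each rank to a chain id so that no chain spans two clusters.

    rank_clusters: list where entry r is the cluster name of MPI rank r.
    group_size: number of ranks that cooperate on one chain.
    Returns a list of chain ids (None for leftover ranks that do not fill a group).
    """
    by_cluster = {}
    for rank, cluster in enumerate(rank_clusters):
        by_cluster.setdefault(cluster, []).append(rank)

    colors = [None] * len(rank_clusters)
    next_chain = 0
    for cluster in sorted(by_cluster):
        ranks = by_cluster[cluster]
        full = len(ranks) - len(ranks) % group_size
        for start in range(0, full, group_size):
            for rank in ranks[start:start + group_size]:
                colors[rank] = next_chain
            next_chain += 1
    return colors

# Hypothetical layout: 8 ranks on cluster "A", 4 on cluster "B", 4 ranks per chain.
layout = ["A"] * 8 + ["B"] * 4
colors = assign_chain_colors(layout, group_size=4)
```

Because every group stays inside one cluster, the heavy matrix communication of a chain never crosses the wide-area links; only the independent chains are spread across clusters.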
12. Computing Environments
- A local Beowulf cluster
  - Limited number of computing nodes (fewer than 100)
  - Intra-network connection
    - 1 Gigabit Ethernet
- The NSF TeraGrid
  - 8 supercomputing sites interconnected by three to four 10 Gbps optical fiber "lambdas" (light pipelines)
  - Thousands of computing nodes at each site
  - Intra-site network
    - Myrinet 2000 or Gigabit Ethernet
- An excellent environment for parallel applications like ours
13. Experiment (1): Computing Time (Seconds) for Parallelizing a Single Chain
- Computing nodes from a single TeraGrid cluster
- Simulated datasets
  - Up to 12,000 data points
- Length of chains: ONLY 10
14. Experiment (1): Speedup
15. Experiment (2): Two Test Cases of Running Parallel Chains
- Load balancing issues
  - One chain was 23% slower than another chain
16. Speedup Analysis
- Strictly speaking, speedup can only be obtained after the statisticians' output analysis
  - Depends on the length of the burn-in period
- A simple formula to use
  - Burn-in: b iterations; monitoring run: n iterations
  - Serial context: a chain of length b + n
  - N parallel chains: each chain of length b + n/N
  - Speed-up: (b + n) / (b + n/N)
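Taking the serial chain length as b + n and each parallel chain length as b + n/N, which is how this slide's simple formula is read here, the speed-up is their ratio. A small Python sketch follows; the function name and the iteration counts are hypothetical.

```python
def parallel_chain_speedup(burn_in, monitoring, n_chains):
    """Speedup of n_chains parallel chains over one serial chain.

    Serial chain length:   burn_in + monitoring
    Each parallel chain:   burn_in + monitoring / n_chains
    """
    serial = burn_in + monitoring
    parallel = burn_in + monitoring / n_chains
    return serial / parallel

# Hypothetical numbers: 1,000 burn-in iterations, 9,000 monitoring iterations.
s4 = parallel_chain_speedup(1000, 9000, 4)   # 10000 / 3250
upper_bound = (1000 + 9000) / 1000           # limit as n_chains grows: 10.0
```

Since every chain must still pay its own burn-in, the speed-up under this reading can never exceed (b + n) / b, which is why the length of the burn-in period matters.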
17. Conclusions
- Parallelize MCMC-based Bayesian inference for geostatistical models
- Methods
  - Parallelizing matrix calculations within single chains
  - Running parallel chains
- TeraGrid provides an ideal computing environment for solving this class of problems
- Initial results demonstrate significant performance gains
18. Research Significance
- Impact
  - Significantly reduce the time span for conducting the large-scale analyses
- Novelty
  - We are the first
    - Exploiting both single-chain and multiple-chain parallelism in MCMC-based Bayesian geostatistical models
    - Supporting large-scale geographic data analysis
    - Using Grid computing as enabling technologies
19. Future Work
- Improve the parallelization of matrix calculations
- Address load balancing for parallel chains
- Further conduct speedup analysis for parallel chains
- Apply our methods to large-scale geographic analysis problems
20. Acknowledgement
- This research is partially supported by the HawkGrid project funded by the Office of Vice President for Research at The University of Iowa
- The NSF TeraGrid resource allocations
  - TG-DMS040004T
  - TG-SES060003T
21. Thanks!