Title: High Performance Scientific Data Analytics
1 High Performance Scientific Data Analytics
- Nagiza F. Samatova, PhD
- Department of Computer Science, NCSU
- Computer Science and Mathematics Division, ORNL
2 Core Team
Paul Breimyer
Guru
Heshan
Collaborators (co-authors on papers): Scott Klasky, Roselyne, Mladen, Arie and Alex, Marcia, Bill Nevins, Bob Hettich, John Drake, Tony Mezzacappa, etc.
Chandra
3 Publications
- CRAN: Samatova NF, Yoginath S, Kora G, Bauer D, http://cran.r-project.org/mirrors.html.
- SciDAC-06: Samatova NF, Branstetter M, Ganguly AR, Hettich R, Khan S, Kora G, Li J, Ma X, Pan C, Shoshani A, Yoginath S, Journal of Physics: Conference Series 46 (2006) 505-509.
- PDCS-05: Yoginath S, Samatova NF, Bauer D, Kora G, Fann G, Geist A, In Proceedings of the 18th International Conference on Parallel and Distributed Computing Systems (PDCS-2005), September 12-14, 2005, Las Vegas, Nevada.
- AnalChem-06.a: Pan C, Kora G, McDonald WH, Tabb DL, VerBerkmoes NC, Hurst GB, Pelletier DA, Samatova NF, Hettich RL, Anal Chem. 2006 Oct 15;78(20):7121-31.
- AnalChem-06.b: Pan C, Kora G, Tabb DL, Pelletier DA, McDonald WH, Hurst GB, Hettich RL, Samatova NF, Anal Chem. 2006 Oct 15;78(20):7110-20.
- TPAMI-05: Ostrouchov G, Samatova NF, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1340-1343, 2005.
- JCGS-07: Qu YM, Ostrouchov G, Yoginath S, Samatova NF, Journal of Computational and Graphical Statistics, 2007.
- MCP-08: Pan, C., Oda, Y., Lankford, P.K., Zhang, B., Samatova, N.F., Pelletier, D.A., Harwood, C.S., Hettich, R.L., "Characterization of anaerobic catabolism of p-coumarate in Rhodopseudomonas palustris by integrating transcriptomics and quantitative proteomics." Mol Cell Proteomics, vol. 7, no. 5, pp. 938-48, 2008.
- CSDA-07: Park BH, Ostrouchov G, Samatova NF, "Sampling streaming data with replacement." Comput. Stat. Data Anal., vol. 52, no. 2, pp. 750-762, 2007.
- TVCG-07: Sisneros, R., Jones, C., Huang, J., Gao, J., Park, B.H., Samatova, N.F., "A multi-level cache model for run-time optimization of remote visualization." IEEE Trans Vis Computer Graph, vol. 13, no. 5, pp. 991-1003, Sep-Oct 2007.
- DPD-02: Samatova NF, Ostrouchov G, Geist A, Melechko AV, "RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets." Distrib. Parallel Databases, vol. 11, no. 2, pp. 157-180, Mar 2002.
- BIBM-08: Breimyer, P., Green, N., Kumar, V., Samatova, N.F., "BioDEAL: Biological data-evidence-annotation linkage system." Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2008), Philadelphia, PA, USA, Nov. 7-9, 2008.
- Ma, X., Li, J., Samatova, N.F., "Automatic Parallelization of Scripting Languages: Toward Transparent Desktop Parallel Computing." Proceedings of IEEE/ACS International Conference on Parallel and Distributed Processing Symposium (IPDPS 2007), pp. 1-6, 26-30 March, 2007.
4 Publications (cont.)
- Lin H, Ma X, Chandramohan P, Geist A, Samatova NF, "Efficient Data Access for Parallel BLAST." Proceedings of 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), pp. 72, 04-08 April 2005.
- Yoginath S, Samatova NF, Bauer D, Kora G, Fann G, Geist A, "RScaLAPACK: High-performance parallel statistical computing with R and ScaLAPACK." Proceedings of the 18th International Conference on Parallel and Distributed Computing Systems (PDCS-2005), Sep 12-14, 2005, Las Vegas, Nevada.
- Park BH, Ostrouchov G, Samatova NF, "Reservoir-based random sampling from data stream." Proceedings of the Fourth SIAM International Conference on Data Mining, Orlando, FL, April 2004.
- Ostrouchov G, Samatova NF, "Embedding methods and robust statistics for dimension reduction." COMPSTAT 2004: Proceedings in Computational Statistics, Physica-Verlag, A Springer Company, 2004.
- Park, B.-H., Samatova, N.F., Ostrouchov, G., Geist, A., "Xmap: Fast dimension reduction algorithms for multivariate streamline data." Proceedings of the 6th International Workshop on High Performance Data Mining: Pervasive and Data Stream Mining (in conjunction with Third International SIAM Conference on Data Mining), San Francisco, CA, May 1-3, 2003.
- Abu-Khzam FN, Samatova NF, Ostrouchov G, Langston MA, Geist GA, "Distributed dimension reduction algorithms for widely dispersed data." Proceedings of the Fourteenth IASTED International Conference on Parallel and Distributed Computing and Systems (IASTED PDCS 2002), p. 167-174, 2002, ACTA Press.
- Qu Y, Ostrouchov G, Samatova NF, Geist A, "Principal component analysis for dimension reduction in massive distributed data sets." Proceedings of the Second SIAM International Conference on Data Mining, p. 4-9, April 2002.
- Samatova NF, Ostrouchov G, Geist A, Melechko AV, "RACHET: A new algorithm for mining multi-dimensional distributed datasets." Proceedings of the SIAM Third Workshop on Mining Scientific Datasets, Chicago, IL, April 2001.
- Samatova NF, Breimyer P, Kora G, Pan P, Yoginath S, "Parallel R for High Performance Analytics: Applications to Biology." In Scientific Data Management, A. Shoshani and D. Rotem (editors), C. Kamath (co-editor), CRC Press/Taylor and Francis, 2008 (Coming soon).
- Samatova, N.F., Branstetter, M., Ganguly, A.R., Hettich, R., Khan, S., Kora, G., Li, J., Ma, X., Pan, C., Shoshani, A., S. Yoginath, "High performance statistical computing with parallel R: Applications to biology and climate." Journal of Physics: Conference Series, SciDAC 2006, v. 46, p. 505-509, 2006.
- Bethel W, Abram G, Sharf J, Frank R, Ahrens J, Samatova NF, Miller M, "Interoperability of visualization software and data models is not an achievable goal." In Proceedings of the IEEE Visualization, Seattle, Washington, October 19-24, 2003, p. 607-610.
5 Tony's Frustrations
Scientific Computing is not only COMPUTE-INTENSIVE but also DATA-INTENSIVE.
- Visualization
- TSB, ParaView, EnSight, VisBench... Which one to choose? What if I want the best part of each one of them? Will they ever interoperate?
- Will they support HDF directly? What about parallel I/O?
- Will I have viz pipelines/features customized for TSI?
- Multi-resolution, remote, collaborative, interactive, parallel, scalable
- Data analysis
- Will I have data analysis pipelines customized for TSI?
- What features to extract?
- Move from qualitative to quantitative validation and verification of models
- Can I have a compact representation of the entire simulation? How to compare simulations? Will data analysis be coupled w/ data archives?
- Will data analysis ever be coupled with visualization?
6 More Frustrations
Tony wants to remain a Domain Expert, NOT to become a Jack of All Trades
- Data Management and Networking
- Hydro-run 1024^3 produces terabytes per run
- How to efficiently stream directly to/from HPSS?
- PVFS, SRM, HRM: How to utilize them?
- Simultaneous transfer of data from simulation computer to data analysis/viz cluster
- File I/O and data transfer take as much time and effort as simulation, if not more, while limiting data size often results in a rerun due to overly coarse sampling
- What about data reduction/compression techniques? How aggressive can I be? Will it be enough? What about viz and data analysis running on reduced data? Will I still preserve the desired features?
- How to efficiently utilize network resources, including data staging, cataloging, and scheduling of preprocessing, data analysis, and viz tasks?
7 How to Make Tony Happy? Internet Plug-ins for Ultrascale Computing?
Paraview
IEEE Viz-2003
8 End-to-End Data Analytics
9 Programmer's Dilemma
[Figure: language layers ordered from high-level languages (top) to low-level language (bottom), with a Productivity axis]
- Domain-specific (?)
- Scripting (R, Matlab, IDL)
- Object Oriented (C++, Java)
- Functional languages (C, Fortran)
- Assembly
10 Towards High-Performance High-Level Languages
How do we get there? Parallelization
[Figure: language layers ordered from high-level languages (top) to low-level language (bottom), with Productivity and Performance axes]
- Domain-specific (?)
- Scripting (R, Matlab, IDL)
- Object Oriented (C++, Java)
- Functional languages (C, Fortran)
- Assembly
11 One Hat Does NOT Fit All: Parallel R for Data Intensive Statistical Computing
- Technical computing
- Matrix and vector formulations
Statistical computing and graphics
http://www.r-project.org
- Developed by R. Gentleman and R. Ihaka
- Expanded by community as open source
- Extensible via dynamically loadable libs
- Data Visualization and analysis platform
- Image processing, vector computing
12 Statistical Computing with R
- About R (http://www.r-project.org/)
- Open source, most widely used for statistical analysis and graphics; similar to S.
- Extensible via dynamically loadable add-on packages.
- Originally developed by R. Gentleman and R. Ihaka.
Towards Enabling Parallel Computing in R
- snow (Luke Tierney): general API on top of message passing routines to provide high-level (parallel apply) commands; mostly demonstrated for embarrassingly parallel applications.
- Rmpi (Hao Yu): R interface to LAM-MPI.
- rpvm (Na Li and Tony Rossini): R interface to PVM; requires knowledge of parallel programming.
> library (rpvm)
> .PVM.start.pvmd ()
> .PVM.addhosts (...)
> .PVM.config ()
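The "parallel apply" style that the snow bullet describes can be illustrated outside R. The following is a minimal Python sketch, not snow's API: `parallel_apply` and `column_mean` are hypothetical names, and a thread pool stands in for the worker processes that snow coordinates through message passing.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_apply(func, items, workers=4):
    """Apply func to every item independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def column_mean(column):
    """Hypothetical per-item task: the mean of one data column."""
    return sum(column) / len(column)


# Illustrative data: three independent columns, one task per column.
columns = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [5.0, 5.0, 5.0]]
means = parallel_apply(column_mean, columns)
print(means)
```

No task depends on another task's result, which is what makes the workload embarrassingly parallel. `ProcessPoolExecutor` offers the same `map` interface if separate processes are wanted.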
13 Lessons Learned from R/Matlab Parallelization: Interactivity and High-Level, Curse and Blessing
[Figure: parallelization approaches placed along an axis of Abstraction, Interactivity, and Productivity, running from high to low]
pR
- Back-end approach: data parallelism; C/C++/Fortran with MPI; RScaLAPACK (Samatova et al, 2005)
- Automatic parallelization: task parallelism; task-pR (Samatova et al, 2004)
- Embarrassing parallelism: data parallelism; snow (Tierney, Rossini, Li, Sevcikova, 2006)
- Manual parallelization: message passing; Rmpi (Hao Yu, 2006); rpvm (Na Li and Tony Rossini, 2006)
- Compiled approach: Matlab -> C -> automatic parallelization
Packages: http://cran.r-project.org/
14 Task and Data Parallelism in pR
15 pR Multi-tiered Architecture
Interactive R Client
16 pR in Use
- Key Features of pR: User's Perspective
- Be able to use existing high level R code
- Require minimal extra effort for parallelizing
- Have identical/similar (presumably easy-to-use) interface to R's
- Be able to test codes in sequential settings
- Provide efficient and scalable (in terms of problem size and number of processors) performance
- Integrate with Kepler as front-end interface
17 Scalability of pR: RScaLAPACK
R> solve (A,B)
pR> sla.solve (A, B, NPROWS, NPCOLS, MB)
A, B are input matrices; NPROWS and NPCOLS are process grid specs; MB is block size
[Figure: scalability plot; data labels 116, 111, 106, 99, 83, 59]
Architecture: SGI Altix at CCS of ORNL with 256 Intel Itanium2 processors at 1.5 GHz; 8 GB of memory per processor (2 TB system memory); 64-bit Linux OS; 1.5 TeraFLOPs/s theoretical total peak performance.
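The NPROWS, NPCOLS, and MB arguments of sla.solve describe how a matrix is spread over a process grid. The sketch below illustrates the two-dimensional block-cyclic layout that ScaLAPACK uses, assuming square MB x MB blocks and a distribution that starts at process (0, 0). The function names are hypothetical and this is not RScaLAPACK code.

```python
from collections import Counter


def block_cyclic_owner(i, j, nprows, npcols, mb):
    """Return the (process row, process column) owning global element (i, j)."""
    return ((i // mb) % nprows, (j // mb) % npcols)


def elements_per_process(n, nprows, npcols, mb):
    """Count how many elements of an n x n matrix land on each process."""
    counts = Counter()
    for i in range(n):
        for j in range(n):
            counts[block_cyclic_owner(i, j, nprows, npcols, mb)] += 1
    return dict(counts)


# Illustrative case: 8 x 8 matrix, 2 x 2 process grid, 2 x 2 blocks.
print(block_cyclic_owner(5, 6, nprows=2, npcols=2, mb=2))
print(elements_per_process(8, nprows=2, npcols=2, mb=2))
```

Cycling blocks over the grid keeps the per-process load balanced. A smaller block size improves balance, while a larger one means fewer, larger blocks to handle per process.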
18 Overhead due to R pR
19 C/C++/Fortran Plug-in to pR
20 Serial pR Performance over Python and R
[Table: Comparing Method Performance in Seconds; columns: pR, pR Improv. over Python, pR Improv. over R]
21 RedHat and CRAN Distribution
22 End-to-End Data Analytics
23 Outreach Applications Publications
Across Science Applications
- Biology: Quantitative Proteomics (B. Hettich, G. Hurst, C. Harwood, C. Pan)
- Climate: Analysis of Extreme Events (M. Branstetter, A. Ganguly, S. Khan)
- GIS: GRASSpR (G. Fann, B. Budhend)
- Fusion: Scott Klasky, Bill Nevins
24
- Subtract background noise from data
- Generate Covariance Chromatogram
- Apply Savitzky-Golay Smoother
- Calculate cut-off for search
- Find Window with Max. S/N ratio
- ..
ProRata: http://www.MSProRata.org
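One step in the list above, the Savitzky-Golay smoother, can be sketched as follows. This is a minimal illustration assuming a 5-point window with a quadratic fit, whose standard weights are -3, 12, 17, 12, -3 over 35. The slide does not state the window or polynomial order that ProRata uses, and the function name is hypothetical.

```python
# Standard 5-point quadratic Savitzky-Golay smoothing weights.
SG_WEIGHTS = (-3, 12, 17, 12, -3)
SG_NORM = 35


def savitzky_golay_5(signal):
    """Smooth interior points with a 5-point window; edge points are left as-is."""
    smoothed = list(signal)
    for i in range(2, len(signal) - 2):
        window = signal[i - 2:i + 3]
        smoothed[i] = sum(w * y for w, y in zip(SG_WEIGHTS, window)) / SG_NORM
    return smoothed


# Illustrative input: a single noise spike on a flat baseline.
noisy = [0.0, 0.0, 10.0, 0.0, 0.0]
print(savitzky_golay_5(noisy))
```

Because the weights come from a local quadratic fit, a signal that is already quadratic passes through unchanged, while isolated spikes are damped.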
25 ProRata: Bringing pR to Biologists
DOE OBER Projects Using ProRata
- J. Banfield, Bob Hettich AMD Nature-09
- M. Buchanan CMCS Center Bioinformatics08
- J. Mielenz BESC BioEnergy In-submission
- C. Harwood, Bob Hettich R. palustris MCP-08
>1,000 downloads
ProRata: http://www.MSProRata.org
AnalChem-06.a, 06.b
26
- About GRASS (grass.itc.it)
- GRASS (Geographic Resources Analysis Support System) is a raster/vector GIS, image processing system, and graphics production system.
- GRASS contains over 350 programs and tools to render maps and images on monitor and paper; manipulate raster, vector, and sites data; process multi spectral image data; create, manage, and store spatial data.
- It is Free (Libre) Software/Open Source released under GNU GPL.
28 End-to-End Data Analytics
29 Programmatic Backend Access Via Web Services: Integration to Kepler
Kepler Workflow
30 Dashboard Interface to pR
Scott Klasky, Roselyne Nobert
31 End-to-End Data Analytics
32 Parallel, Distributed and Streamline Algorithms
- Clustering
- RACHET REF, REF
- Faisal's
- Dimension Reduction and Data Compression
- Distributed PCA REF
- Streamline XMap
- RobustMap REF
- Outlier/Extreme Event Detection
- RobustMap REF
- Modeling the Usual to Find the Unusual REF
- Climate Extreme Events SciDAC-06
- Streamline Sampling
- With replacement REF, REF
- Parallel Graph Mining
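For the "Streamline Sampling, with replacement" item in the list above, one standard single-pass technique keeps k independent one-element reservoirs. The sketch below shows that textbook approach under that assumption; it is not claimed to be the algorithm of the cited sampling papers, and the function name is hypothetical.

```python
import random


def sample_with_replacement(stream, k, rng=None):
    """Draw k items uniformly, with replacement, in one pass over a stream.

    Each of the k slots is an independent size-1 reservoir: the t-th item
    (1-indexed) overwrites a slot with probability 1/t.
    """
    rng = rng or random.Random()
    slots = [None] * k  # stays None only if the stream is empty
    for t, item in enumerate(stream, start=1):
        for s in range(k):
            if rng.random() < 1.0 / t:
                slots[s] = item
    return slots


# Illustrative use: 5 draws from a stream of 1000 values, seeded for repeatability.
print(sample_with_replacement(range(1000), 5, random.Random(42)))
```

The stream length never has to be known in advance, and memory use is fixed at k items regardless of how long the stream runs.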
33 RACHET: Distributed Hierarchical Clustering
1. Generate Local Dendrogram
2. Transmit
Send the code, NOT the data
RACHET
3. Merge
4. Visualize
Centroid Descriptive Statistics
Merging Theorem for updating DS
Global Dendrogram
Recursive Agglomeration of Clustering Hierarchies by Encircling Tactic (RACHET)
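The idea of merging centroid descriptive statistics instead of raw points can be sketched with a small example. The summary used here (count, centroid, sum of squared distances to the centroid) and the standard pooled sum-of-squares identity are assumptions chosen for illustration. They are not the RACHET merging theorem itself, and all names are hypothetical.

```python
def summarize(points):
    """Reduce a cluster to (count, centroid, sum of squared distances to centroid)."""
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[k] for p in points) / n for k in range(dim)]
    ss = sum((p[k] - centroid[k]) ** 2 for p in points for k in range(dim))
    return n, centroid, ss


def merge(a, b):
    """Combine two cluster summaries without access to the original points."""
    (na, ca, ssa), (nb, cb, ssb) = a, b
    n = na + nb
    centroid = [(na * x + nb * y) / n for x, y in zip(ca, cb)]
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))  # squared centroid distance
    ss = ssa + ssb + na * nb / n * gap
    return n, centroid, ss


# Illustrative data held at two different sites.
site_a = [[0.0, 0.0], [2.0, 0.0]]
site_b = [[4.0, 4.0]]
print(merge(summarize(site_a), summarize(site_b)))
print(summarize(site_a + site_b))  # same statistics, computed from the raw points
```

Only three small quantities per cluster cross the network, yet the merged summary matches what a centralized computation over all points would give.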
34 Distributed Streaming Dimension Reduction: Merging Information Rather Than Raw Data
Stream of simulation data
tt2
new
Incremental update via fusion
- Merge pivotal points only
- Linear time for each chunk
- 5 deviation from monolithic
- Merge few PCs and local means
- One time communication
- Controlled variability preserved
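The bullet "Merge few PCs and local means" can be illustrated with a rough sketch. Each site is assumed to send only its point count, local mean, and leading principal component with its variance; the merge rebuilds an approximate global covariance from those summaries. The rank-1 approximation, the power-iteration solver, and all names are assumptions made for illustration, not the published distributed PCA algorithm.

```python
import math


def leading_pc(cov, iters=100):
    """Power iteration for the top eigenpair; assumes the start vector is not orthogonal to it."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    lam = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return lam, v


def local_summary(rows):
    """What one site transmits: (count, mean, top eigenvalue, top eigenvector)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / n
            for b in range(d)] for a in range(d)]
    lam, v = leading_pc(cov)
    return n, mean, lam, v


def merge_summaries(summaries):
    """Rebuild an approximate global covariance from summaries, then take its top PC."""
    total = sum(s[0] for s in summaries)
    d = len(summaries[0][1])
    gmean = [sum(s[0] * s[1][j] for s in summaries) / total for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for n, mean, lam, v in summaries:
        for a in range(d):
            for b in range(d):
                within = lam * v[a] * v[b]  # rank-1 stand-in for the local covariance
                between = (mean[a] - gmean[a]) * (mean[b] - gmean[b])
                cov[a][b] += n * (within + between) / total
    return gmean, leading_pc(cov)


site_a = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
site_b = [[3.0, 6.0], [4.0, 8.0], [5.0, 10.0]]
print(merge_summaries([local_summary(site_a), local_summary(site_b)]))
```

In this sketch each site communicates once, and the result is exact only when each site's data is well described by the components it sends; otherwise the merged components are an approximation.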
35 Model the Usual to Find the Unusual
Goal: reduce the data in order to detect extreme/specific events in a global context.
3. Reduce data to model parameters
4. Select extremes for global analysis
5. Cluster the extremes (4)
6. Map back to series
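Steps 4 and 6 can be sketched once each segment of a series has been reduced to a model parameter (step 3). The selection rule below, flagging parameters more than k standard deviations from the mean, is an assumption for illustration only; the slide does not say how extremes are chosen, and the function names are hypothetical.

```python
import statistics


def select_extremes(params, k=2.0):
    """Return indices of parameters farther than k standard deviations from the mean."""
    mu = statistics.mean(params)
    sigma = statistics.pstdev(params)
    if sigma == 0:
        return []
    return [i for i, p in enumerate(params) if abs(p - mu) > k * sigma]


def map_back(indices, segment_length):
    """Translate segment indices into (start, end) positions in the original series."""
    return [(i * segment_length, (i + 1) * segment_length) for i in indices]


# Illustrative per-segment parameters: nine ordinary segments and one unusual one.
slopes = [1.0] * 9 + [10.0]
extreme_segments = select_extremes(slopes)
print(extreme_segments)
print(map_back(extreme_segments, segment_length=100))
```

Only the flagged segments need to be revisited in the original series, which keeps the global analysis small.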
36 End-to-End Data Analytics
37 Climate Data Movement ESGSDM
38 mpiBLAST-pio: Exploiting Parallel I/O
- Publications: IPDPS-05, SSDBM-08
- Download: http://mpiblast.lanl.gov or http://www.mpiblast.org
- Collaborators: Xiasong Ma, Heshan Lin, Wu Feng
39 End-to-End Data Analytics Summary
40 How to Make Sense of Data? Know Your Limits and Be Smart
Not humanly possible to browse a petabyte of data. Analysis must reduce data to quantities of interest.
Ultrascale Computations: Must be smart about which probe combinations to see!
Physical Experiments: Must be smart about probe placement!
To see 1 percent of a petabyte at 10 megabytes per second takes 35 8-hour days!
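The 35-day figure can be checked with a short calculation. The sketch assumes decimal units (1 petabyte = 10^15 bytes, 1 megabyte = 10^6 bytes), which the slide does not state explicitly.

```python
PETABYTE = 10**15          # bytes, decimal units (assumption)
MEGABYTE = 10**6           # bytes, decimal units (assumption)

bytes_to_read = PETABYTE // 100      # 1 percent of a petabyte
rate = 10 * MEGABYTE                 # 10 megabytes per second

seconds = bytes_to_read / rate       # 1,000,000 seconds
hours = seconds / 3600
workdays = hours / 8                 # 8-hour days

print(round(workdays, 1))  # 34.7
```

That is roughly 278 hours of continuous reading, or about 35 working days, before any analysis has been done.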
41 Looking into the Future: NSF Expedition
Nagiza Samatova, Mladen Vouk, Scott Klasky, Alok Choudhary, Bertram Ludaescher
42 Concept-Driven Analytics
43 Generating Knowledge Hierarchies via In-X Analytics
Climate Use Case
In-X devices/applications (white spheres) produce Knowledge Layers (pyramid) for annotation and further discussion by scientific social sub-nets (smileys).
- L1: A supercomputer runs a simulation and produces raw data (bottom pyramid layer).
- L2: As the simulation proceeds, the in-X cloud is informed of the pending analytics. While streaming time series to their destination, the cyberinfrastructure cloud segments them on the fly (into 100 time points), fits polynomials to each segment, and reduces the segments to a few polynomial coefficients. The in-network reduced data reaches its remote destination, active disks.
- L3: Disks, while storing the data, perform in-disk clustering to find similar points in low-dimensional coefficient space (the usual) and detect outliers to find local extremes (the unusual).
- L4: Disks fit statistical models to clusters of similar points (e.g., cluster centroids, density).
- L5: Local/global extremes for different variables are analyzed in memory for cause-effect linkages.
- L6: Humanly and/or automatically generated hypotheses are recorded in community knowledgebases.
- L8: Databases, while recording the predicted relationships and hypotheses, compare, contrast, and link them to prior knowledge. In-database comparative analysis results are recorded.
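The in-network reduction described in layer L2 (segment the series, fit a polynomial to each segment, keep only the coefficients) can be sketched as follows. The example assumes a degree-1 polynomial fitted by least squares, since the slide does not give the polynomial degree; the function names are hypothetical.

```python
def fit_line(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...; returns (intercept, slope)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def reduce_series(series, segment_length=100):
    """Replace each segment of the series with its fitted (intercept, slope) pair."""
    return [
        fit_line(series[start:start + segment_length])
        for start in range(0, len(series), segment_length)
    ]


# Illustrative series: 300 time points reduced to 3 coefficient pairs.
series = [2.0 * t + 1.0 for t in range(300)]
coefficients = reduce_series(series)
print(len(series), "points ->", len(coefficients), "coefficient pairs")
```

With 100 points per segment and two coefficients kept, the data volume drops by a factor of 50, and the later clustering and outlier steps operate in this low-dimensional coefficient space.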
44 Semantic Knowledge Annotation with BioDEAL