High Performance Scientific Data Analytics

Learn more at: https://sdm.lbl.gov

1
High Performance Scientific Data Analytics

Nagiza F. Samatova, PhD
Department of Computer Science, NCSU
Computer Science and Mathematics Division, ORNL

2
Core Team
Paul Breimyer
Guru
Heshan
Collaborators: co-authors on papers, Scott Klasky, Roselyne, Mladen, Arie and Alex, Marcia, Bill Nevins, Bob Hettich, John Drake, Tony Mezzacappa, etc.
Chandra
3
Publications

CRAN: Samatova NF, Yoginath S, Kora G, Bauer D. http://cran.r-project.org/mirrors.html.
SciDAC-06: Samatova NF, Branstetter M, Ganguly AR, Hettich R, Khan S, Kora G, Li J, Ma X, Pan C, Shoshani A, Yoginath S. Journal of Physics: Conference Series 46 (2006) 505-509.
PDCS-05: Yoginath S, Samatova NF, Bauer D, Kora G, Fann G, Geist A. In Proceedings of the 18th International Conference on Parallel and Distributed Computing Systems (PDCS-2005), September 12-14, 2005, Las Vegas, Nevada.
AnalChem-06.a: Pan C, Kora G, McDonald WH, Tabb DL, VerBerkmoes NC, Hurst GB, Pelletier DA, Samatova NF, Hettich RL. Anal Chem. 2006 Oct 15;78(20):7121-31.
AnalChem-06.b: Pan C, Kora G, Tabb DL, Pelletier DA, McDonald WH, Hurst GB, Hettich RL, Samatova NF. Anal Chem. 2006 Oct 15;78(20):7110-20.
TPAMI-05: Ostrouchov G, Samatova NF. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1340-1343, 2005.
JCGS-07: Qu YM, Ostrouchov G, Yoginath S, Samatova NF. Journal of Computational and Graphical Statistics, 2007.
MCP-08: Pan C, Oda Y, Lankford PK, Zhang B, Samatova NF, Pelletier DA, Harwood CS, Hettich RL. "Characterization of anaerobic catabolism of p-coumarate in Rhodopseudomonas palustris by integrating transcriptomics and quantitative proteomics." Mol Cell Proteomics, vol. 7, no. 5, pp. 938-48, 2008.
CSDA-07: Park BH, Ostrouchov G, Samatova NF. "Sampling streaming data with replacement." Comput. Stat. Data Anal., vol. 52, no. 2, pp. 750-762, 2007.
TVCG-07: Sisneros R, Jones C, Huang J, Gao J, Park BH, Samatova NF. "A multi-level cache model for run-time optimization of remote visualization." IEEE Trans Vis Computer Graph, vol. 13, no. 5, pp. 991-1003, Sep-Oct 2007.
DPD-02: Samatova NF, Ostrouchov G, Geist A, Melechko AV. "RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets." Distrib. Parallel Databases, vol. 11, no. 2, pp. 157-180, Mar 2002.
BIBM-08: Breimyer P, Green N, Kumar V, Samatova NF. "BioDEAL: Biological data-evidence-annotation linkage system." Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2008), Philadelphia, PA, USA, Nov. 7-9, 2008.
Ma X, Li J, Samatova NF. "Automatic Parallelization of Scripting Languages: Toward Transparent Desktop Parallel Computing." Proceedings of IEEE/ACS International Conference on Parallel and Distributed Processing Symposium (IPDPS 2007), pp. 1-6, 26-30 March, 2007.

4
Publications (cont.)

Lin H, Ma X, Chandramohan P, Geist A, Samatova NF. "Efficient Data Access for Parallel BLAST." Proceedings of 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), pp. 72, 04-08 April 2005.
Yoginath S, Samatova NF, Bauer D, Kora G, Fann G, Geist A. "RScaLAPACK: High-performance parallel statistical computing with R and ScaLAPACK." Proceedings of the 18th International Conference on Parallel and Distributed Computing Systems (PDCS-2005), Sep 12-14, 2005, Las Vegas, Nevada.
Park BH, Ostrouchov G, Samatova NF. "Reservoir-based random sampling from data stream." Proceedings of the Fourth SIAM International Conference on Data Mining, Orlando, FL, April 2004.
Ostrouchov G, Samatova NF. "Embedding methods and robust statistics for dimension reduction." COMPSTAT 2004 Proceedings in Computational Statistics, Physica-Verlag, A Springer Company, 2004.
Park BH, Samatova NF, Ostrouchov G, Geist A. "Xmap: Fast dimension reduction algorithms for multivariate streamline data." Proceedings of the 6th International Workshop on High Performance Data Mining: Pervasive and Data Stream Mining (in conjunction with Third International SIAM Conference on Data Mining), San Francisco, CA, May 1-3, 2003.
Abu-Khzam FN, Samatova NF, Ostrouchov G, Langston MA, Geist GA. "Distributed dimension reduction algorithms for widely dispersed data." Proceedings of the Fourteenth IASTED International Conference on Parallel and Distributed Computing and Systems (IASTED PDCS 2002), p. 167-174, 2002, ACTA Press.
Qu Y, Ostrouchov G, Samatova NF, Geist A. "Principal component analysis for dimension reduction in massive distributed data sets." Proceedings of the Second SIAM International Conference on Data Mining, p. 4-9, April 2002.
Samatova NF, Ostrouchov G, Geist A, Melechko AV. "RACHET: A new algorithm for mining multi-dimensional distributed datasets." Proceedings of the SIAM Third Workshop on Mining Scientific Datasets, Chicago, IL, April 2001.
Samatova NF, Breimyer P, Kora G, Pan P, Yoginath S. "Parallel R for High Performance Analytics: Applications to Biology." In Scientific Data Management, A. Shoshani and D. Rotem (editors), C. Kamath (co-editor), CRC Press/Taylor and Francis, 2008 (coming soon).
Samatova NF, Branstetter M, Ganguly AR, Hettich R, Khan S, Kora G, Li J, Ma X, Pan C, Shoshani A, Yoginath S. "High performance statistical computing with parallel R: Applications to biology and climate." Journal of Physics: Conference Series, SciDAC 2006, v. 46, p. 505-509, 2006.
Bethel W, Abram G, Sharf J, Frank R, Ahrens J, Samatova NF, Miller M. "Interoperability of visualization software and data models is not an achievable goal." In Proceedings of the IEEE Visualization, Seattle, Washington, October 19-24, 2003, p. 607-610.

5
Tony's Frustrations
Scientific Computing is not only
COMPUTE-INTENSIVE but also DATA-INTENSIVE.

Visualization
TSB, ParaView, EnSight, VisBench... Which one
to choose? What if I want the best part of each
one of them? Will they ever interoperate?
Will they support HDF directly? What about
parallel I/O?
Will I have viz pipelines/features customized for
TSI?
Multi-resolution, remote, collaborative,
interactive, parallel, scalable
Data analysis
Will I have data analysis pipelines customized
for TSI?
What features to extract?
Move from qualitative to quantitative validation
and verification of models
Can I have a compact representation of an entire simulation? How to compare simulations? Will data analysis be coupled with data archives?
Will data analysis ever be coupled with visualization?

6
More Frustrations
Tony wants to remain a Domain Expert, NOT to become a Jack of All Trades

Data Management & Networking
Hydro-run 1024^3 produces terabytes per run
How to efficiently stream directly to/from HPSS?
PVFS, SRM, HRM: How to utilize them?
Simultaneous transfer of data from simulation
computer to data analysis/Viz. cluster
File I/O and data transfer take as much time and
effort as simulation if not more, while limiting
data size often results in rerun due to overly
coarse sampling
What about data reduction/compression techniques?
How aggressive can I be? Will it be enough? What
about viz and data analysis running on reduced
data? Will I still preserve the desired features?
How to efficiently utilize network resources, including data staging, cataloging, and scheduling of preprocessing, data analysis, and viz tasks?

7
How to Make Tony Happy? Internet Plug-ins
for Ultrascale Computing?
Paraview
IEEE Viz-2003
8
End-to-End Data Analytics
9
Programmer's Dilemma
Domain-specific (?)
Productivity
high-level languages
Scripting (R, Matlab, IDL)
Object Oriented (C++, Java)
Functional languages (C, Fortran)
Assembly
low-level language
10
Towards High-Performance High-Level Languages
How do we get there? Parallelization
Domain-specific (?)
Productivity
high-level languages
Performance
Scripting (R, Matlab, IDL)
Object Oriented (C++, Java)
Functional languages (C, Fortran)
Assembly
low-level language
11
One Hat Does NOT Fit All: Parallel R for Data-Intensive Statistical Computing

Technical computing
Matrix and vector formulations

Statistical computing and graphics
http://www.r-project.org

Developed by R. Gentleman and R. Ihaka
Expanded by community as open source
Extensible via dynamically loadable libs

Data Visualization and analysis platform
Image processing, vector computing

12
Statistical Computing with R

About R (http://www.r-project.org/)
Open source; most widely used for statistical analysis and graphics; similar to S.
Extensible via dynamically loadable add-on packages.
Originally developed by R. Gentleman and R. Ihaka.

Towards Enabling Parallel Computing in R

snow (Luke Tierney): general API on top of message passing routines that provides high-level (parallel apply) commands; mostly demonstrated for embarrassingly parallel applications.

Rmpi (Hao Yu): R interface to LAM-MPI.

rpvm (Na Li and Tony Rossini): R interface to PVM; requires knowledge of parallel programming.

> library(rpvm)
> .PVM.start.pvmd()
> .PVM.addhosts(...)
> .PVM.config()
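The high-level "parallel apply" style attributed to snow above can be illustrated with a minimal sketch. It is written in Python rather than R, and the names parallel_apply and column_mean are hypothetical helpers invented for this illustration, not part of snow. A thread pool is used only to show the programming model: the caller hands over a function and a list of independent inputs and never touches message passing.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_apply(func, items, workers=4):
    """Apply func to every item using a pool of workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def column_mean(column):
    # Each task is independent of the others (embarrassingly parallel).
    return sum(column) / len(column)


columns = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [5.0, 5.0, 5.0]]
means = parallel_apply(column_mean, columns)
print(means)
```

Lower-level interfaces such as Rmpi and rpvm expose the message passing itself, which is why the slide notes that they require knowledge of parallel programming.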
13
Lessons Learned from R/Matlab Parallelization: Interactivity and High-Level, Curse & Blessing
Axis: Abstraction, Interactivity, Productivity (from high to low)
pR, back-end approach: data parallelism; C/C++/Fortran with MPI; RScaLAPACK (Samatova et al, 2005)
pR, automatic parallelization: task parallelism; task-pR (Samatova et al, 2004)
Embarrassing parallelism: data parallelism; snow (Tierney, Rossini, Li, Sevcikova, 2006)
Manual parallelization: message passing; Rmpi (Hao Yu, 2006), rpvm (Na Li and Tony Rossini, 2006)
Compiled approach: Matlab to C to automatic parallelization
Packages: http://cran.r-project.org/
14
Task and Data Parallelism in pR
15
pR Multi-tiered Architecture
Interactive R Client
16
pR in Use

Key Features of pR: User's Perspective
Be able to use existing high-level R code
Require minimal extra effort for parallelizing
Have identical/similar (presumably easy-to-use) interface to R's
Be able to test codes in sequential settings
Provide efficient and scalable (in terms of
problem size and number of processors)
performance
Integrate with Kepler as front-end interface

17
Scalability of pR: RScaLAPACK
R> solve(A, B)
pR> sla.solve(A, B, NPROWS, NPCOLS, MB)
A, B are input matrices; NPROWS and NPCOLS are process grid specs; MB is block size.
[Plot data labels: 116, 111, 106, 99, 83, 59]
Architecture: SGI Altix at CCS of ORNL with 256 Intel Itanium2 processors at 1.5 GHz; 8 GB of memory per processor (2 TB system memory); 64-bit Linux OS; 1.5 TeraFLOPs/s theoretical total peak performance.
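The process grid (NPROWS by NPCOLS) and block size (MB) arguments reflect the two-dimensional block-cyclic layout that ScaLAPACK uses to spread a matrix over processes. The sketch below shows how such a layout assigns matrix entries to processes. It assumes zero-based indices, square MB by MB blocks, and that the first block sits on process (0, 0); the function names are hypothetical and this is not the RScaLAPACK code itself.

```python
def block_cyclic_owner(i, j, mb, nprows, npcols):
    """Return the (process_row, process_col) that owns matrix entry (i, j).

    Assumes zero-based indices, square mb x mb blocks, and that block (0, 0)
    lives on process (0, 0).
    """
    return (i // mb) % nprows, (j // mb) % npcols


def ownership_map(n, mb, nprows, npcols):
    """Build an n x n table of owning processes for a small matrix."""
    return [[block_cyclic_owner(i, j, mb, nprows, npcols) for j in range(n)]
            for i in range(n)]


# 8 x 8 matrix, 2 x 2 blocks, 2 x 2 process grid.
owners = ownership_map(8, 2, 2, 2)
for row in owners:
    print(" ".join(f"{pr}{pc}" for pr, pc in row))
```

Because blocks are dealt out cyclically, every process in this example ends up with the same number of entries, which is the load-balancing property that motivates the layout.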
18
Overhead due to R pR
19
C/C++/Fortran Plug-in to pR
20
Serial pR Performance over Python and R
[Table: Comparing Method Performance in Seconds; columns: pR, pR Improv. over Python, pR Improv. over R]
21
RedHat and CRAN Distribution
22
End-to-End Data Analytics
23
Outreach Applications Publications
Across Science Applications

Biology: Quantitative Proteomics (B. Hettich, G. Hurst, C. Harwood, C. Pan)
Climate: Analysis of Extreme Events (M. Branstetter, A. Ganguly, S. Khan)
GIS: GRASSpR (G. Fann, B. Budhend)
Fusion: Scott Klasky, Bill Nevins

Subtract background noise from data
Generate Covariance Chromatogram
Apply Savitzky-Golay Smoother
Calculate cut-off for search
Find Window with Max. S/N ratio
..
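The Savitzky-Golay smoothing step in the pipeline above can be sketched as follows. This is a minimal illustration using the standard 5-point quadratic/cubic weights (-3, 12, 17, 12, -3)/35; the window size, the handling of the end points, and the function name savgol5 are assumptions made here, not details of ProRata.

```python
# 5-point quadratic/cubic Savitzky-Golay smoothing weights (they sum to 1).
SG5_WEIGHTS = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]


def savgol5(signal):
    """Smooth a sequence with the 5-point Savitzky-Golay filter.

    The two samples at each end are copied unchanged, a simplification
    chosen for this sketch.
    """
    smoothed = list(signal)
    for k in range(2, len(signal) - 2):
        window = signal[k - 2:k + 3]
        smoothed[k] = sum(w * y for w, y in zip(SG5_WEIGHTS, window))
    return smoothed


noisy = [0.0, 1.2, 3.9, 9.1, 16.0, 24.8, 36.1, 49.0]
print(savgol5(noisy))
```

The filter fits a low-degree polynomial to each window, so a signal that is already quadratic passes through unchanged while point-to-point noise is damped.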

ProRata: http://www.MSProRata.org
25
ProRata: Bringing pR to Biologists
DOE OBER Projects Using ProRata

J. Banfield, Bob Hettich: AMD (Nature-09)
M. Buchanan: CMCS Center (Bioinformatics-08)
J. Mielenz: BESC BioEnergy (in submission)
C. Harwood, Bob Hettich: R. palustris (MCP-08)

>1,000 downloads
ProRata: http://www.MSProRata.org
AnalChem-06.a, 06.b
26

About GRASS (grass.itc.it)
GRASS (Geographic Resources Analysis Support System) is a raster/vector GIS, image processing system, and graphics production system.
GRASS contains over 350 programs and tools to render maps and images on monitor and paper; manipulate raster, vector, and sites data; process multi-spectral image data; and create, manage, and store spatial data.
It is Free (Libre) Software/Open Source released
under GNU GPL.

27
(No Transcript)
28
End-to-End Data Analytics
29
Programmatic Backend Access Via Web Services
Integration to Kepler
Kepler Workflow
30
Dashboard Interface to pR
Scott Klasky Roselyne Nobert
31
End-to-End Data Analytics
32
Parallel, Distributed and Streamline Algorithms

Clustering
RACHET REF, REF
Faisal's
Dimension Reduction and Data Compression
Distributed PCA REF
Streamline XMap
RobustMap REF
Outlier/Extreme Event Detection
RobustMap REF
Modeling the Usual to Find the Unusual REF
Climate Extreme Events SciDAC-06
Streamline Sampling
With replacement REF, REF
Parallel Graph Mining
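For the "Streamline Sampling: With replacement" item above, one simple way to draw a uniform sample with replacement from a stream of unknown length is to keep k independent one-item reservoirs. The sketch below shows that textbook construction; it is an assumption made for illustration and is not claimed to be the algorithm of the cited papers, which may be considerably more efficient than this O(n k) version.

```python
import random


def sample_with_replacement(stream, k, rng=None):
    """Draw k items uniformly with replacement from a stream of unknown length.

    Each of the k slots is an independent size-one reservoir: the i-th item
    (1-based) overwrites a slot with probability 1/i.
    """
    rng = rng or random.Random()
    slots = [None] * k
    for i, item in enumerate(stream, start=1):
        for s in range(k):
            if rng.random() < 1.0 / i:
                slots[s] = item
    return slots


sample = sample_with_replacement(iter(range(1000)), k=5, rng=random.Random(0))
print(sample)
```

After n items, each slot holds any given item with probability 1/n, and the slots are independent, so repeated values are possible, as sampling with replacement requires.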

33
RACHET: Distributed Hierarchical Clustering
1. Generate Local Dendrogram
2. Transmit
Send the code NOT the data
3. Merge
4. Visualize
Centroid Descriptive Statistics
Merging Theorem for updating DS
Global Dendrogram
RACHET: Recursive Agglomeration of Clustering Hierarchies by Encircling Tactic
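The idea of merging centroid descriptive statistics instead of raw data can be sketched as follows. The slide does not list which statistics RACHET keeps, so this example assumes a cluster is summarized by its point count, centroid, and sum of squared distances to the centroid, and uses the standard pooling identity for those quantities. The function names describe and merge are hypothetical.

```python
def describe(points):
    """Summarize a cluster of d-dimensional points without keeping the points."""
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    # Sum of squared distances from each point to the centroid.
    scatter = sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in points)
    return {"n": n, "centroid": centroid, "scatter": scatter}


def merge(a, b):
    """Combine two cluster summaries as if their points had been pooled."""
    n = a["n"] + b["n"]
    centroid = [(a["n"] * ca + b["n"] * cb) / n
                for ca, cb in zip(a["centroid"], b["centroid"])]
    gap = sum((ca - cb) ** 2 for ca, cb in zip(a["centroid"], b["centroid"]))
    scatter = a["scatter"] + b["scatter"] + a["n"] * b["n"] / n * gap
    return {"n": n, "centroid": centroid, "scatter": scatter}


site_a = [(0.0, 0.0), (2.0, 0.0)]
site_b = [(4.0, 4.0), (6.0, 4.0), (5.0, 7.0)]
merged = merge(describe(site_a), describe(site_b))
pooled = describe(site_a + site_b)
print(merged)
print(pooled)
```

Only the small summaries cross the network; the merged summary matches what would have been computed had both sites shipped their raw points to one place.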
34
Distributed & Streaming Dimension Reduction: Merging Information Rather Than Raw Data
Stream of simulation data
tt2
new
Incremental update via fusion

Merge pivotal points only
Linear time for each chunk
5% deviation from monolithic

Merge few PCs and local means
One time communication
Controlled variability preserved

35
Model the Usual to Find the Unusual
Goal: reduce the data in order to detect extreme/specific events in a global context.
3. Reduce data to model parameters
4. Select extremes for global analysis
5. Cluster the extremes (4)
6. Map back to series
36
End-to-End Data Analytics
37
Climate Data Movement ESGSDM
38
mpiBLAST-pio: Exploiting Parallel I/O

Publications: IPDPS-05, SSDBM-08
Download: http://mpiblast.lanl.gov or http://www.mpiblast.org
Collaborators: Xiaosong Ma, Heshan Lin, Wu Feng

39
End-to-End Data Analytics Summary
40
How to Make Sense of Data? Know Your Limits & Be Smart
Not humanly possible to browse a petabyte of data. Analysis must reduce data to quantities of interest.
Ultrascale Computations: Must be smart about which probe combinations to see!
Physical Experiments: Must be smart about probe placement!
To see 1 percent of a petabyte at 10 megabytes per second takes 35 8-hour days!
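The 35-day figure can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 petabyte = 10^15 bytes, 1 megabyte = 10^6 bytes), which the slide does not state explicitly.

```python
PETABYTE = 10 ** 15          # bytes, decimal units assumed
MEGABYTE = 10 ** 6           # bytes

bytes_to_read = PETABYTE // 100          # 1 percent of a petabyte
rate = 10 * MEGABYTE                     # 10 megabytes per second
seconds = bytes_to_read / rate           # time to stream the data once
workdays = seconds / (8 * 3600)          # expressed in 8-hour days

print(f"{seconds:,.0f} s = {workdays:.1f} eight-hour days")
```

One percent of a petabyte is 10 terabytes, which takes one million seconds at this rate, or just under 35 working days of uninterrupted viewing.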
41
Looking into the Future: NSF Expedition
Nagiza Samatova, Mladen Vouk, Scott Klasky, Alok Choudhary, Bertram Ludaescher
42
Concept-Driven Analytics
43
Generating Knowledge Hierarchies via In-X Analytics
Climate Use Case
In-X devices/applications (white spheres) produce Knowledge Layers (pyramid) for annotation and further discussion by scientific social sub-nets (smileys).
L1: A supercomputer runs a simulation and produces raw data (bottom pyramid layer).
L2: As the simulation proceeds, the in-X cloud is informed of the pending analytics. While streaming time series to their destination, the cyberinfrastructure cloud segments them on the fly (into 100 time points), fits polynomials to each segment, and reduces segments to a few polynomial coefficients. The in-network reduced data reaches its remote destination, active disks.
L3: Disks, while storing the data, perform in-disk clustering to find similar points in low-dimensional coefficient space (the usual) and detect outliers to find local extremes (the unusual).
L4: Disks fit statistical models to clusters of similar points (e.g., cluster centroids, density).
L5: Local/global extremes for different variables are analyzed in memory for cause-effect linkages.
L6: Humanly and/or automatically generated hypotheses are recorded in community knowledgebases.
L8: Databases, while recording the predicted relationships and hypotheses, compare, contrast, and link them to prior knowledge. In-database comparative analysis results are recorded.
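The in-network reduction described for layer L2 (cut the series into segments of 100 time points, fit a polynomial to each, keep only the coefficients) can be sketched as below. The polynomial degree of 2, the rescaling of time to [0, 1] within each segment, the dropping of a trailing partial segment, and all function names are assumptions made for this illustration; the slide only says "a few polynomial coefficients".

```python
def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_polynomial(values, degree):
    """Least-squares coefficients (constant term first) over t in [0, 1]."""
    count = len(values)
    ts = [k / (count - 1) for k in range(count)]
    normal = [[sum(t ** (i + j) for t in ts) for j in range(degree + 1)]
              for i in range(degree + 1)]
    rhs = [sum(v * t ** i for v, t in zip(values, ts)) for i in range(degree + 1)]
    return solve_linear(normal, rhs)


def reduce_series(series, segment_length=100, degree=2):
    """Replace each full segment of the series by its polynomial coefficients."""
    usable = len(series) - len(series) % segment_length
    return [fit_polynomial(series[s:s + segment_length], degree)
            for s in range(0, usable, segment_length)]


# 300 samples shrink to 3 segments x 3 coefficients.
series = [1.0 + 2.0 * (k % 100) / 99 for k in range(300)]
coefficients = reduce_series(series)
print(coefficients)
```

Each 100-point segment is reduced to three numbers here, and those coefficient vectors are the low-dimensional points that layer L3 would then cluster.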
44
Semantic Knowledge Annotation with BioDEAL
