1. Challenges for Scalable Scientific Knowledge Discovery
- Alok Choudhary
- EECS Department, Northwestern University
- Wei-keng Liao, Kui Gao, Arifa Nisar
- Rob Ross, Rajeev Thakur, Rob Latham (ANL)
- Many people from SDM center
2. Outline
- Achievements
- Success stories
- Vision for the future (and of the past!)
3. Achievements
- Parallel NetCDF
- New parallel I/O APIs
- Scalable data file (64-bit) implementation
- Application communities: DOE climate, astrophysics, ocean modeling
- MPI-IO
- A coherent cache layer in ROMIO
- Locking-protocol-aware file domain partitioning methods
- Many optimizations
- Use in production applications
- PVFS
- Datatype I/O
- Distributed file locking
- I/O benchmark
- S3aSim: a sequence similarity search framework
4. Success stories
- Parallel NetCDF
- Application communities: DOE climate, astrophysics, ocean modeling
- FLASH-IO benchmark with pnetCDF method
- Application
- S3D combustion simulation from Jacqueline Chen at SNL
- MPI collective I/O method
- PnetCDF method
- HDF5 method
- ADIOS method
- I/O benchmark
- S3aSim: a sequence similarity search framework
- Lots of downloads of software in the public domain
- Techniques directly and indirectly used by many applications
5. Illustrative pnetCDF users
- FLASH: astrophysical thermonuclear application from the ASCI/Alliances center at the University of Chicago
- ACTM: atmospheric chemical transport model, LLNL
- WRF-ROMS: regional ocean model system I/O module from the Scientific Data Technologies group, NCSA
- ASPECT: data understanding infrastructure, ORNL
- pVTK: parallel visualization toolkit, ORNL
- PETSc: portable, extensible toolkit for scientific computation, ANL
- PRISM: PRogram for Integrated Earth System Modeling, users from CC Research Laboratories, NEC Europe Ltd.
- ESMF: Earth System Modeling Framework, National Center for Atmospheric Research
J. Li, W. Liao, A. Choudhary, R. Ross, R. Thakur, W. Gropp, R. Latham, A. Siegel, B. Gallagher, and M. Zingale. Parallel netCDF: A High-Performance Scientific I/O Interface. SC 2003.
6. PnetCDF large array support
- The limitations of current pnetCDF
- CDF-1: < 2 GB file size and < 2 GB array size
- CDF-2: > 2 GB file size, but still < 2 GB array size
- File format uses only 32-bit signed integers
- Implementations: MPI datatype constructors use only 32-bit integers
- Large array support
- CDF-5: > 2 GB file size and > 2 GB array size (see the sketch after this slide)
- Changes in file format and APIs
- Replace all 32-bit integers with 64-bit integers
- New 64-bit integer attributes
- Changes in implementation
- Replace MPI functions and maintain or enhance
optimizations
(Current/future work)
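For concreteness, a minimal sketch of what CDF-5 support would look like from the application side, assuming a PnetCDF build that provides the NC_64BIT_DATA create mode (the file name and sizes are illustrative only):

/* Sketch: create a CDF-5 file with an array larger than the 2 GB
 * limits of CDF-1/CDF-2.  Assumes a PnetCDF build that defines
 * NC_64BIT_DATA; file name and sizes are illustrative only. */
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv)
{
    int ncid, dimid, varid;
    /* 3G elements: more than a 32-bit signed size can describe */
    MPI_Offset nx = (MPI_Offset)3 * 1024 * 1024 * 1024;

    MPI_Init(&argc, &argv);

    /* NC_64BIT_DATA selects the CDF-5 format (64-bit sizes throughout) */
    ncmpi_create(MPI_COMM_WORLD, "big.nc", NC_CLOBBER | NC_64BIT_DATA,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", nx, &dimid);
    ncmpi_def_var(ncid, "field", NC_FLOAT, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* ... each process writes its subarray collectively, e.g. with
     *     ncmpi_put_vara_float_all(ncid, varid, start, count, buf); ... */

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}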
7. PnetCDF subfiling
- As the number of processes increases in today's HPC systems, the problem domain size increases, and so do array sizes
- Storing global arrays larger than 100 GB in a single netCDF file may not be effective or efficient for post-processing data analysis
- Subfiling divides a netCDF dataset into multiple files while still maintaining the canonical data structure (see the sketch after this slide)
- Automatically reconstruct arrays and subarrays based on the subfiling metadata
(Current/future work)
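A hedged sketch of how subfiling could be requested without changing the application's single-file view; the hint name nc_num_subfiles is an assumption borrowed from a later PnetCDF release, not necessarily the interface described here:

/* Sketch: ask the library to split the dataset into subfiles while
 * keeping the canonical (single-file) view in the API.  The hint name
 * "nc_num_subfiles" is an assumption; all other calls are standard
 * MPI/PnetCDF. */
#include <mpi.h>
#include <pnetcdf.h>

int create_with_subfiling(MPI_Comm comm, const char *path, int *ncid)
{
    MPI_Info info;
    int err;

    MPI_Info_create(&info);
    MPI_Info_set(info, "nc_num_subfiles", "4");  /* divide into 4 subfiles */

    err = ncmpi_create(comm, path, NC_CLOBBER | NC_64BIT_DATA, info, ncid);

    MPI_Info_free(&info);
    return err;  /* reads later reconstruct arrays from the subfiling metadata */
}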
8. Analytical functions for pnetCDF
(Future work)
- A new set of APIs
- Reduction functions, statistical functions, histograms, multidimensional transformations, and data mining
- Enable on-line processing while data is generated
- Built on top of the existing pnetCDF data access infrastructure (see the sketch after this slide)
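These APIs do not exist yet; as a rough illustration of the idea, the sketch below layers a global-mean reduction on top of the existing collective read path (the 1-D partitioning and variable handling are illustrative only):

/* Sketch of the idea only: a reduction computed on top of the existing
 * pnetCDF access path.  The proposed analytical APIs themselves do not
 * exist; the 1-D partitioning here is illustrative. */
#include <mpi.h>
#include <pnetcdf.h>
#include <stdlib.h>

double global_mean(int ncid, int varid, MPI_Offset nx, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* each process reads an equal slice of the 1-D variable */
    MPI_Offset count = nx / nprocs;
    MPI_Offset start = rank * count;
    if (rank == nprocs - 1) count = nx - start;   /* last rank takes the remainder */

    double *buf = malloc(count * sizeof(double));
    ncmpi_get_vara_double_all(ncid, varid, &start, &count, buf);

    double local = 0.0, total = 0.0;
    for (MPI_Offset i = 0; i < count; i++) local += buf[i];
    free(buf);

    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
    return total / (double)nx;
}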
9. MPI-IO persistent file domain
(Past work)
- Aims to reduce the cost of cache coherence control across multiple MPI-IO calls
- Keeps file access domains unchanged from one I/O call to another
- Cached data can safely stay in client-side memory without being evicted
- Implementations
- User-provided domain size
- Automatically determined by the aggregate access
region
K. Coloma, A. Choudhary, W. Liao, L. Ward, E.
Russell, and N. Pundit. Scalable High-level
Caching for Parallel I/O. IPDPS 2004.
10. MPI-IO file caching
- A coherent client-side file caching system
- Aims to improve performance across multiple I/O calls
- Implementations
- I/O threads: one POSIX thread in each I/O aggregator
- MPI remote memory access functions
- I/O delegates: using MPI dynamic process management functions
(Current/future work)
(Figure: FLASH-IO benchmark results)
- W. Liao, A. Ching, K. Coloma, A. Choudhary, and L. Ward. An Implementation and Evaluation of Client-side File Caching for MPI-IO. IPDPS 2007.
- K. Coloma, A. Choudhary, W. Liao, L. Ward, and S. Tideman. DAChe: Direct Access Cache System for Parallel I/O. International Supercomputer Conference, 2005.
11. Caching with I/O delegate
- Allocate a dedicated group of processes to perform I/O (see the sketch after this slide)
- Uses a small percentage (< 10%) of additional resources
- The entire memory space at the delegates can be used for caching
- Collective I/O off-load
(Figure: I/O delegate size is 3%)
(Current/future work)
A. Nisar, W. Liao, and A. Choudhary. Scaling
Parallel I/O Performance through I/O Delegate and
Caching System. SC 2008.
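A rough sketch of the dynamic-process-management route: compute processes spawn a small delegate group that caches data and performs the actual I/O. The delegate executable name and the sizing heuristic are illustrative assumptions, not the published implementation:

/* Rough sketch of allocating a small delegate group with MPI dynamic
 * process management.  The delegate program name ("io_delegate") and
 * the ~3% sizing are illustrative assumptions, not the SC'08 code. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm delegates;          /* intercommunicator to the delegate group */
    int nprocs, ndelegates;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    ndelegates = nprocs / 32;                 /* a few percent of additional resources */
    if (ndelegates < 1) ndelegates = 1;

    /* all compute processes collectively spawn the delegate processes,
     * which cache file data and perform the actual collective I/O */
    MPI_Comm_spawn("io_delegate", MPI_ARGV_NULL, ndelegates, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &delegates, MPI_ERRCODES_IGNORE);

    /* ... compute processes ship write requests to the delegates over
     *     the intercommunicator instead of calling MPI_File_write ... */

    MPI_Comm_disconnect(&delegates);
    MPI_Finalize();
    return 0;
}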
12. Operations off-load
(Future work)
- I/O delegates are additional compute resource
- Idle while the parallel program is in its computation stage
- Powerful enough to run complete parallel programs
- Potential operations
- On-line data analytical processing
- Operations for active disk with caching support
- Parallel programs, since delegates can communicate with each other
- Data redundancy and reliability support: parity, mirroring across all delegates
13. MPI file domain partitioning methods
(Current/future work)
- Partitioning methods are based on the underlying file system's locking protocol
- GPFS: token-based protocol
- Align the partitioning with the lock boundaries
- Lustre: server-based protocol (see the hint sketch after this slide)
- Static-cyclic based
- Group-cyclic based
W. Liao and A. Choudhary. Dynamically Adapting
File Domain Partitioning Methods for Collective
I/O Based on Underlying Parallel File System
Locking Protocols. SC 2008.
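For illustration, a sketch of steering the collective I/O path with standard ROMIO file hints that these partitioning methods interact with; the particular values (1 MB stripes, 8 OSTs, 8 aggregators) are placeholders, not recommendations:

/* Sketch: passing striping and collective-buffering hints at open time.
 * The hint names are standard ROMIO hints; the values are illustrative. */
#include <mpi.h>

MPI_File open_for_collective_write(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_unit",   "1048576"); /* align with stripe/lock boundaries */
    MPI_Info_set(info, "striping_factor", "8");       /* number of Lustre OSTs */
    MPI_Info_set(info, "cb_nodes",        "8");       /* I/O aggregators for two-phase I/O */
    MPI_Info_set(info, "romio_cb_write",  "enable");  /* force collective buffering on writes */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}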
14. S3D-IO on Cray XT: Performance/Productivity
(Current work)
- Problem
- One output file is often created per processor
- Causes problems with archiving and future access
- Approach
- Parallel I/O (MPI-IO) optimization
- One file per variable during I/O
- Requires multi-processor coordination during I/O
- Achievement
- Shown to scale to tens of thousands of processors on production systems
- Better performance while eliminating the need to create 100K files (a minimal sketch of the one-file-per-variable pattern follows this slide)
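A minimal sketch of the one-shared-file-per-variable pattern: every process writes its block of a global 3-D array into a single file with one collective call. Array sizes and the file name are placeholders; this is not the actual S3D-IO module:

/* Sketch of one shared file per variable: each process writes its block
 * of a global 3-D array collectively.  Sizes and the file name are
 * placeholders. */
#include <mpi.h>

void write_variable(MPI_Comm comm, const char *fname, double *local,
                    int gsize[3], int lsize[3], int start[3])
{
    MPI_Datatype filetype;
    MPI_File fh;
    MPI_Status status;
    int nlocal = lsize[0] * lsize[1] * lsize[2];

    /* describe this process's block within the global array */
    MPI_Type_create_subarray(3, gsize, lsize, start, MPI_ORDER_C,
                             MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* all processes coordinate in one collective write */
    MPI_File_write_all(fh, local, nlocal, MPI_DOUBLE, &status);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
}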
15. Optimizations for PVFS
(Past work)
- Datatype I/O
- Packing non-contiguous I/O requests into a single request
- The data layout is presented as a concise description, which is passed over the network instead of (offset, length) lists
- Distributed locking component
- Datatype lock consisting of many non-contiguous regions
- Try-lock protocol
- When failed, fall back to ordered two-phase lock
- A. Ching, A. Choudhary, W. Liao, R. Ross, and W. Gropp. Efficient Structured Data Access in Parallel File Systems. Cluster Computing 2003.
- A. Ching, R. Ross, W. Liao, L. Ward, and A. Choudhary. Noncontiguous Locking Techniques for Parallel File Systems. SC 2007.
16. I/O benchmark
(Past work)
- S3aSim
- A sequence similarity search algorithm framework for MPI-IO evaluation. It uses a master-slave parallel programming model with database segmentation, which mimics the mpiBLAST access pattern (a minimal sketch of this pattern follows the citation below).
A. Ching, W. Feng, H. Lin, X. Ma, and A.
Choudhary. Exploring I/O strategies for parallel
sequence database search tools with S3aSim. HPDC
2006.
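A minimal master-worker sketch of the access pattern S3aSim mimics: rank 0 hands out database-fragment indices on demand (run with at least two processes). The tags, fragment count, and the elided search/write step are invented for illustration:

/* Master-worker sketch of the mpiBLAST-like pattern S3aSim mimics.
 * Tags, fragment count, and the search/write step are illustrative. */
#include <mpi.h>

#define TAG_WORK 1
#define TAG_DONE 2
#define NFRAGS   1000   /* number of database fragments (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs, frag;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                 /* master: distribute fragments on demand */
        int next = 0, done = 0;
        while (done < nprocs - 1) {
            MPI_Recv(&frag, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);        /* a worker asks for work */
            if (next < NFRAGS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
                done++;
            }
        }
    } else {                         /* worker: request and search fragments */
        int dummy = 0;
        for (;;) {
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&frag, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            /* ... search the fragment, then write results with MPI-IO ... */
        }
    }
    MPI_Finalize();
    return 0;
}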
17. Data analytic run-time library at active storage nodes
(Future work)
- Enhance the MPI-IO interfaces and functionality
- Pre-defined functions
- Plug-in user-defined functions
- Embedded functions in MPI data representation
- Active storage infrastructure
- General-purpose CPU with GPUs and/or FPGA
- FPGAs for reconfiguration and acceleration of analysis functions
- Software programming model
- Traditional application codes
- Acceleration codes for GPUs and FPGAs
18. The VISION THING!
19. Discovery of Patterns from Global Earth Science Data Sets (Instruments, Sensors, and/or Simulations)
- Science goal: understand global-scale patterns in biosphere processes
- Earth science questions
- When and where do ecosystem disturbances occur?
- What is the scale and location of land cover change, and what is its impact?
- How are ocean, atmosphere, and land processes coupled?
- Data sources
- Weather observation stations
- High-resolution EOS satellites
- 1982-2000: AVHRR at 1° x 1° resolution (115 km x 115 km); 2000-present: MODIS at 250 m x 250 m resolution
- Model-based data from forecast and other models: sea level pressure, 1979-present, at 2.5° x 2.5°; sea surface temperature, 1979-present, at 1° x 1°
- Data sets created by data fusion
(Figures: Monthly Average Temperature; Earth Observing System)
20. Analytics/Knowledge Discovery Challenges
- Spatio-temporal nature of data
- Traditional data mining techniques do not take advantage of spatial and temporal autocorrelation.
- Scalability
- The size of Earth Science data sets can be very large, especially for data such as high-resolution vegetation
- Grid cells can range from a resolution of 2.5° x 2.5° (10K locations for the globe) to 250 m x 250 m (15M locations for just California; about 10 billion for the globe)
- High-dimensionality
- Long time series are common in Earth Science
21. Some Climate Problems and Knowledge Discovery Challenges
- Challenges
- Spatio-temporal nature of data
- Traditional data mining techniques do not take advantage of spatial and temporal autocorrelation.
- Scalability
- The size of Earth Science data sets has increased six orders of magnitude in 20 years, and continues to grow with higher resolution data.
- Grid cells have gone from a resolution of 2.5° x 2.5° (10K points for the globe) to 250 m x 250 m (15M points for just California; about 10 billion for the globe)
- High-dimensionality
- Long time series are common in Earth Science
- Climate Problems
- Extend the range, accuracy, and utility of weather prediction
- Improve our understanding and timely prediction of severe weather, pollution, and climate events.
- Improve understanding and prediction of seasonal, decadal, and century-scale climate variation on global, regional, and local scales
- Create the ability to make accurate predictions of global climate and carbon-cycle response to various forcing scenarios over the next 100 years.
22. Astrophysics
- Cosmological Simulations
- Simulate formation and evolution of galaxies
- What is dark matter?
- What is the nature of dark energy?
- How did galaxies, quasars, and supermassive black holes form from the initial conditions in the early universe?
(Figure: snapshot from a pure N-body simulation with 1B particles, showing the distribution of dark matter at the present time; light colors represent greater density of dark matter. Post-processed to demonstrate the impact of ionizing radiation from galaxies.)
23. SDM Future Vision
- Build a Science Intelligence and Knowledge Discoverer
- Think of this as Oracle, SAS, NetApp, and Amazon combined into one
- Build tools for customization to application domains (potential verticals)
- Provide a Toolbox for common applications
- Develop a Scientific Warehouse infrastructure
- Build intelligence into the I/O Stack
- Develop an analytics appliance
- Develop a language and support for specifying management and analytics
- Focus on needs as a more important consideration than features
24. Large-Scale Scientific Data Management and Analysis
- Prof. Alok Choudhary
- ECE Department, Northwestern University
- Evanston, IL
- Email: choudhar@ece.northwestern.edu
- ACKNOWLEDGEMENTS: Wei-Keng Liao, M. Kandemir, X. Shen, S. More, R. Thakur, G. Memik, J. No, R. Stevens
- Project Web Page: http://www.ece.northwestern.edu/wkliao/MDMS
Salishan Conference on High-Speed Computing,
April 2001
25. Cosmology Application
(Figure: datasets organized by variables and time)
26. Virtuous Cycle
(Cycle diagram with stages: Simulation (execute app, generate data); Problem setup (mesh, domain decomposition); Manage, visualize, analyze; Measure results, learn, archive)
27. Problems and Challenges
- Large-scale data (TB, PB ranges)
- Large-scale parallelism (unmanageable)
- Complex data formats and hierarchies
- Sharing, analysis in a distributed environment
- Non-standard systems and interoperability problems (e.g., file systems)
- Technology driven by commercial applications
- Storage
- File systems
- Data management
- What about analysis? Feature extraction, mining, pattern recognition, etc.
28. MDMS - Goals and Objectives
- High-performance data access
- Determine optimal parallel I/O techniques for applications
- Data access prediction
- Transparent data pre-fetching, pre-staging, caching, and subfiling on the storage system
- Automatic data analysis for data mining
- Data management for large-scale scientific computations
- Use a database to store all metadata for performance (and other information); in the future, XML?
- Static metadata: data location, access, storage pattern, underlying storage device, etc.
- Dynamic metadata: data usage, historical performance and access patterns, associations and relationships among datasets
- Support for on-line and off-line data analysis and mining
29. Architecture
(Architecture diagram: user applications (simulation, data analysis, visualization) exchange queries, input metadata, hints, directives, and dataset associations with the MDMS; the MDMS returns I/O function choices (the best I/O method for the given parameters), OIDs, and I/O parameters, plus schedule, prefetch, and cache hints (collective I/O); data moves through the storage systems' I/O interface via MPI-IO and other interfaces, which feed performance input, system metadata, access patterns, and history back to the MDMS.)
30. Metadata
- Application Level
- Date, run-time parameters, execution environment, comments, result summary, etc.
- Program Level
- Data types, structures
- Association of multiple datasets and files
- File location, file structures (single/multiple datasets in multiple/single files)
- Performance Level
- I/O functions (e.g., collective/non-collective I/O parameters)
- Access hints, access pattern, storage pattern, dataset associations
- Striping, pooled striping, storage association
- Prefetching, staging, migration, caching hints
- Historical performance
31. Interface
32. Run Application
33. Dataset and Access Pattern Table
34. Data Analysis
35. Visualize
36. Incorporating Data Analysis, Mining, and Feature Detection
- Can these tasks be performed on-line?
- It is expensive to write data and read it back for future analysis
- Why not embed analysis functions within the storage (I/O) runtime systems?
- Utilize resources by partitioning the system into data generators and analyzers
37. Integrating Analysis
(Cycle diagram with stages: Simulation (execute app, generate data); On-line analysis and mining; Problem setup (mesh, domain decomposition); Manage, visualize, analyze; Measure results, learn, archive)
38. Some Publications
- A. Choudhary, M. Kandemir, J. No, G. Memik, X. Shen, W. Liao, H. Nagesh, S. More, V. Taylor, R. Thakur, and R. Stevens. "Data Management for Large-Scale Scientific Computations in High Performance Distributed Systems." Cluster Computing: the Journal of Networks, Software Tools and Applications, 2000.
- A. Choudhary, M. Kandemir, H. Nagesh, J. No, X. Shen, V. Taylor, S. More, and R. Thakur. "Data Management for Large-Scale Scientific Computations in High Performance Distributed Systems." High-Performance Distributed Computing Conference '99, San Diego, CA, August 1999.
- A. Choudhary and M. Kandemir. "System-Level Metadata for High-Performance Data Management." IEEE Metadata Conference, April 1999.
- X. Shen, W. Liao, A. Choudhary, G. Memik, M. Kandemir, S. More, G. Thiruvathukal, and A. Singh. "A Novel Application Development Environment for Large-Scale Scientific Computations." International Conference on Supercomputing, 2000.
- These and more available at http://www.ece.northwestern.edu/wkliao/MDMS
39. Internal Architecture and Data Flow
40. In-Place On-Line Analytics Software Architecture
41. Statistical and Data Mining Functions on Active Storage Cluster
(Future work)
- Develop computational kernels common in analytics, data mining, and statistical operations for acceleration on FPGAs
- NU-MineBench data mining package
- Develop parallel version of the data mining
kernels that can be accelerated using GPUs and
FPGAs
MineBench project homepage: http://cucis.ece.northwestern.edu/projects/DMS
42. Accelerating and Computing in the Storage
43. Illustration of Acceleration: (1) Classification, (2) PCA
44. GPU Coprocessing
- Compared to CPUs, GPUs offer 10x higher computational capability and 10x greater memory bandwidth.
- Lower operating speed, but higher transistor count.
- More transistors devoted to computation.
- In the past, general-purpose computation on GPUs was difficult.
- Hardware was specialized.
- Programming required knowledge of the rendering pipeline.
- Now, however, GPUs look much more like SIMD machines.
- More of the GPU's resources can be applied toward general-purpose computation.
- Coding for the GPU no longer requires background knowledge in graphics rendering.
- Performance gains of 1-2 orders of magnitude are possible for data-parallel applications.
45. k-Means Performance (compared with host processor)
46. Results
47. Challenges in Scientific Knowledge Discovery
(Diagram: Knowledge Discovery at the center, supported by Scientific Data Management, High-Performance I/O, and Analytics and Mining)
- Scientific Data Management: data management, query of scientific DBs, performance optimizations
- Knowledge Discovery: in-place analytics, customized acceleration, scalable mining, a high-level proactive interface ("what", not "how")
48. SDM Future Vision
- Build a Science Intelligence and Knowledge Discoverer
- Think of this as Oracle, SAS, NetApp, and Amazon combined into one
- Build tools for customization to application domains (potential verticals)
- Provide a Toolbox for common applications
- Develop a Scientific Warehouse infrastructure
- Build intelligence into the I/O Stack
- Develop an analytics appliance
- Develop a language and support for specifying management and analytics
- Focus on needs as a more important consideration than features