1
Challenges for Scalable Scientific Knowledge
Discovery
  • Alok Choudhary
  • EECS Department, Northwestern University
  • Wei-keng Liao, Kui Gao, Arifa Nisar
  • Rob Ross, Rajeev Thakur, Rob Latham (ANL)
  • Many people from the SDM Center

2
Outline
  • Achievements
  • Success stories
  • Vision for the future (and of the past!)

3
Achievements
  • Parallel netCDF
    • New parallel I/O APIs
    • Scalable data file (64-bit) implementation
    • Application communities: DOE climate, astrophysics, ocean modeling
  • MPI-IO
    • A coherent cache layer in ROMIO
    • Locking-protocol-aware file domain partitioning methods
    • Many optimizations
    • Use in production applications
  • PVFS
    • Datatype I/O
    • Distributed file locking
  • I/O benchmark
    • S3aSim: a sequence similarity search framework

4
Success stories
  • Parallel netCDF
    • Application communities: DOE climate, astrophysics, ocean modeling
    • FLASH-IO benchmark with the PnetCDF method
  • Application
    • S3D combustion simulation from Jacqueline Chen at SNL
      • MPI collective I/O method
      • PnetCDF method
      • HDF5 method
      • ADIOS method
  • I/O benchmark
    • S3aSim: a sequence similarity search framework
  • Lots of downloads of software in the public domain; techniques
    directly and indirectly used by many applications

5
Illustrative pnetCDF users
  • FLASH: astrophysical thermonuclear application from the
    ASCI/Alliances Center at the University of Chicago
  • ACTM: atmospheric chemical transport model, LLNL
  • WRF-ROMS: regional ocean model system I/O module from the Scientific
    Data Technologies group, NCSA
  • ASPECT: data understanding infrastructure, ORNL
  • pVTK: parallel visualization toolkit, ORNL
  • PETSc: portable, extensible toolkit for scientific computation, ANL
  • PRISM: PRogram for Integrated Earth System Modeling, users from CC
    Research Laboratories, NEC Europe Ltd.
  • ESMF: Earth System Modeling Framework, National Center for
    Atmospheric Research

J. Li, W. Liao, A. Choudhary, R. Ross, R. Thakur,
W. Gropp, R. Latham, A. Siegel, B. Gallagher, and
M. Zingale. Parallel netCDF: A Scientific
High-Performance I/O Interface. SC 2003.
6
PnetCDF large array support
  • The limitations of the current PnetCDF
    • CDF-1: < 2GB file size and < 2GB array size
    • CDF-2: > 2GB file size but still < 2GB array size
    • File format uses only 32-bit signed integers
    • Implementation: MPI Datatype constructors use only 32-bit integers
  • Large array support
    • CDF-5: > 2GB file size and > 2GB array size
  • Changes in file format and APIs
    • Replace all 32-bit integers with 64-bit integers
    • New 64-bit integer attributes
  • Changes in implementation
    • Replace MPI functions and maintain or enhance optimizations
      (a usage sketch follows below)

(Current/future work)
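
A minimal sketch of what writing a CDF-5 file looks like with the PnetCDF
C API once 64-bit support is in place (in released PnetCDF versions the
CDF-5 format is selected with the NC_64BIT_DATA create flag). The file
name, dimension sizes, and variable name are illustrative, and error
checking is omitted.

  /* Sketch: creating a CDF-5 file whose arrays may exceed 2G elements.
   * File/variable names and sizes are illustrative; error checks omitted. */
  #include <mpi.h>
  #include <pnetcdf.h>

  void write_big_array(double *local_buf, MPI_Offset *start, MPI_Offset *count)
  {
      int ncid, dimid[2], varid;

      /* NC_64BIT_DATA selects the CDF-5 format (64-bit sizes throughout) */
      ncmpi_create(MPI_COMM_WORLD, "big.nc",
                   NC_CLOBBER | NC_64BIT_DATA, MPI_INFO_NULL, &ncid);

      /* Dimension lengths are MPI_Offset, so each can exceed 2^31 */
      ncmpi_def_dim(ncid, "y", (MPI_Offset)1 << 32, &dimid[0]);
      ncmpi_def_dim(ncid, "x", (MPI_Offset)1024,    &dimid[1]);
      ncmpi_def_var(ncid, "temperature", NC_DOUBLE, 2, dimid, &varid);
      ncmpi_enddef(ncid);

      /* Each process writes its block of the global array collectively */
      ncmpi_put_vara_double_all(ncid, varid, start, count, local_buf);
      ncmpi_close(ncid);
  }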
7
PnetCDF subfiling
  • As the number of processes in today's HPC systems increases, the
    problem domain size grows, and so do the array sizes
  • Storing global arrays larger than 100GB in a single netCDF file may
    not be effective or efficient for post-run data analysis
  • Subfiling divides a netCDF dataset into multiple files while still
    maintaining the canonical data structure
  • Arrays and subarrays are automatically reconstructed from the
    subfiling metadata (a sketch of the idea follows below)

(Current/future work)
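
The real subfiling support lives inside the PnetCDF library, but the basic
idea can be sketched with plain MPI and PnetCDF calls: split the processes
into groups and let each group write its portion of the global array to
its own file. The group size, file naming, and handling of reconstruction
metadata below are assumptions for illustration only.

  /* Sketch of the subfiling idea: each group of processes writes its own
   * netCDF file; the metadata needed to reassemble the global array
   * (group id, offsets) would be recorded separately.  The group size and
   * file naming are illustrative, not PnetCDF's actual implementation. */
  #include <stdio.h>
  #include <mpi.h>
  #include <pnetcdf.h>

  #define PROCS_PER_SUBFILE 256   /* assumed tuning parameter */

  void write_subfile(double *local_buf, MPI_Offset *start, MPI_Offset *count,
                     MPI_Offset sub_ny, MPI_Offset sub_nx)
  {
      int rank, ncid, dimid[2], varid;
      char fname[64];
      MPI_Comm subcomm;

      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_split(MPI_COMM_WORLD, rank / PROCS_PER_SUBFILE, rank, &subcomm);

      snprintf(fname, sizeof(fname), "field.%04d.nc", rank / PROCS_PER_SUBFILE);
      ncmpi_create(subcomm, fname, NC_CLOBBER, MPI_INFO_NULL, &ncid);
      ncmpi_def_dim(ncid, "y", sub_ny, &dimid[0]);
      ncmpi_def_dim(ncid, "x", sub_nx, &dimid[1]);
      ncmpi_def_var(ncid, "field", NC_DOUBLE, 2, dimid, &varid);
      ncmpi_enddef(ncid);

      /* start/count are relative to this subfile's portion of the array */
      ncmpi_put_vara_double_all(ncid, varid, start, count, local_buf);
      ncmpi_close(ncid);
      MPI_Comm_free(&subcomm);
  }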
8
Analytical functions for pnetCDF
(Future work)
  • A new set of APIs
    • Reduction functions, statistical functions, histograms,
      multidimensional transformations, and data mining operations
  • Enable on-line processing while the data is being generated
    (sketched below)
  • Built on top of the existing PnetCDF data access infrastructure
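
The analytical APIs themselves are future work, so no call names are shown
here; the fragment below only illustrates the kind of on-line processing
they would fold into the write path, computing a global statistic with
plain MPI while the data is still in memory rather than re-reading the
file later.

  /* Sketch: compute a simple global statistic at write time instead of
   * re-reading the file afterwards.  Plain MPI; the proposed PnetCDF
   * analytics APIs would hide this kind of step inside the I/O call. */
  #include <mpi.h>

  double online_max(const double *buf, long n, MPI_Comm comm)
  {
      double local_max = buf[0], global_max;
      for (long i = 1; i < n; i++)
          if (buf[i] > local_max) local_max = buf[i];
      MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, comm);
      return global_max;   /* could be stored as a netCDF attribute */
  }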

9
MPI-IO persistent file domain
(Past work)
  • Aim: reduce the cost of cache coherence control across multiple
    MPI-IO calls
  • Keep file access domains unchanged from one I/O call to the next
  • Cached data can safely stay in client-side memory without being
    evicted
  • Implementations
    • User-provided domain size
    • Automatically determined from the aggregate access region

K. Coloma, A. Choudhary, W. Liao, L. Ward, E.
Russell, and N. Pundit. Scalable High-level
Caching for Parallel I/O. IPDPS 2004.
10
MPI-IO file caching
  • A coherent client-side file caching system
  • Aim: improve performance across multiple I/O calls
  • Implementations
    • I/O threads: one POSIX thread in each I/O aggregator
    • MPI remote memory access functions
    • I/O delegates: using MPI dynamic process management functions

(Current/future work)
FLASH-IO
  • W. Liao, A. Ching, K. Coloma, A. Choudhary, and L. Ward. An
    Implementation and Evaluation of Client-side File Caching for MPI-IO.
    IPDPS 2007.
  • K. Coloma, A. Choudhary, W. Liao, L. Ward, and S. Tideman. DAChe:
    Direct Access Cache System for Parallel I/O. International
    Supercomputer Conference, 2005.

11
Caching with I/O delegate
  • Allocate a dedicated group of processes to perform I/O
  • Uses a small percentage (< 10%) of additional resources
  • The entire memory space at the delegates can be used for caching
  • Collective I/O off-load (a sketch of the compute/delegate split
    follows below)

I/O delegate size is 3%
(Current/future work)
A. Nisar, W. Liao, and A. Choudhary. Scaling
Parallel I/O Performance through I/O Delegate and
Caching System. SC 2008.
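
The cited system sets the delegates up transparently inside MPI-IO using
dynamic process management; the fragment below is only a minimal sketch of
the underlying idea, reserving roughly 3% of the ranks as delegates by
splitting MPI_COMM_WORLD. The names and the split strategy are assumptions
for illustration.

  /* Sketch: reserve a small fraction of the MPI ranks as I/O delegates by
   * splitting MPI_COMM_WORLD.  The cited system instead uses MPI dynamic
   * process management and intercepts MPI-IO calls transparently. */
  #include <mpi.h>

  #define DELEGATE_PERCENT 3   /* roughly the 3% used on the slide */

  void split_compute_and_delegates(MPI_Comm *newcomm, int *is_delegate)
  {
      int rank, size, ndelegates;

      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      ndelegates = (size * DELEGATE_PERCENT) / 100;
      if (ndelegates < 1) ndelegates = 1;

      /* The last 'ndelegates' ranks cache data and perform the actual I/O;
       * the other ranks keep computing and forward write requests to them. */
      *is_delegate = (rank >= size - ndelegates);
      MPI_Comm_split(MPI_COMM_WORLD, *is_delegate, rank, newcomm);
  }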
12
Operations off-load
(Future work)
  • I/O delegates are additional compute resources
    • Idle while the parallel program is in its computation stage
    • Powerful enough to run complete parallel programs
  • Potential operations
    • On-line data analytical processing
    • Operations for active disks, with caching support
    • Parallel programs, since delegates can communicate with each other
    • Data redundancy and reliability support: parity, mirroring across
      all delegates

13
MPI file domain partitioning methods
(Current/future work)
  • Partitioning methods are based on the underlying file system's
    locking protocol (a hint-based example follows below)
  • GPFS: token-based protocol
    • Align the partitioning with the lock boundaries
  • Lustre: server-based protocol
    • Static-cyclic partitioning
    • Group-cyclic partitioning

W. Liao and A. Choudhary. Dynamically Adapting
File Domain Partitioning Methods for Collective
I/O Based on Underlying Parallel File System
Locking Protocols. SC 2008.
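
The partitioning methods in this paper live inside ROMIO's collective I/O
path, so they need no application changes; what an application can already
do is pass standard ROMIO hints that influence collective buffering and
file-domain alignment. The hint values below are illustrative, and the set
of recognized hints varies by MPI implementation and file system.

  /* Sketch: passing ROMIO hints that influence collective buffering and
   * file-domain alignment.  Values are illustrative only. */
  #include <mpi.h>

  void open_with_hints(const char *path, MPI_File *fh)
  {
      MPI_Info info;
      MPI_Info_create(&info);

      MPI_Info_set(info, "romio_cb_write", "enable");   /* use collective buffering */
      MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MB aggregator buffers */
      MPI_Info_set(info, "striping_unit",  "1048576");  /* align to 1 MB stripes */

      MPI_File_open(MPI_COMM_WORLD, (char *)path,
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
      MPI_Info_free(&info);
  }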
14
S3D-IO on Cray XT: Performance/Productivity
(Current work)
  • Problem
    • Output files are often created one per processor
    • This causes problems with archiving and future access
  • Approach
    • Parallel I/O (MPI-IO) optimization
    • One shared file per variable during I/O
    • Requires multi-processor coordination during I/O (see the sketch
      below)
  • Achievement
    • Shown to scale to tens of thousands of processors on production
      systems
    • Better performance while eliminating the need to create 100K files
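
A minimal sketch of the shared-file pattern: each process describes its
block of a variable's global array with an MPI subarray type and all
processes write collectively to a single file for that variable, instead
of one file per processor. The 3-D decomposition arrays and the file name
are placeholders, not S3D's actual I/O code.

  /* Sketch: all processes write their block of one variable into a single
   * shared file using a subarray file view and a collective write.
   * gsizes/lsizes/starts describe the block decomposition (placeholders). */
  #include <mpi.h>

  void write_shared_variable(const char *fname, double *local_block,
                             int gsizes[3], int lsizes[3], int starts[3])
  {
      MPI_File fh;
      MPI_Datatype filetype;
      int local_count = lsizes[0] * lsizes[1] * lsizes[2];

      MPI_Type_create_subarray(3, gsizes, lsizes, starts,
                               MPI_ORDER_C, MPI_DOUBLE, &filetype);
      MPI_Type_commit(&filetype);

      MPI_File_open(MPI_COMM_WORLD, (char *)fname,
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
      MPI_File_write_all(fh, local_block, local_count, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);
      MPI_File_close(&fh);
      MPI_Type_free(&filetype);
  }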

15
Optimizations for PVFS
(Past work)
  • Datatype I/O
    • Packs non-contiguous I/O requests into a single request
    • The data layout is presented as a concise datatype description,
      which is passed over the network instead of an (offset, length)
      list (see the sketch below)
  • Distributed locking component
    • Datatype locks consisting of many non-contiguous regions
    • Try-lock protocol
    • When the try-lock fails, fall back to an ordered two-phase lock
  • A. Ching, A. Choudhary, W. Liao, R. Ross, and W.
    Gropp. Efficient Structured Data Access in
    Parallel File Systems. Cluster Computing 2003
  • A. Ching, R. Ross, W. Liao, L. Ward, and A.
    Choudhary. Noncontiguous Locking Techniques for
    Parallel File Systems. SC 2007.
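
Datatype I/O pushes this idea into the file system itself, but the same
principle is visible at the MPI-IO level: a regular, non-contiguous
pattern is described once by a derived datatype rather than enumerated as
offset/length pairs. The column-read example below is an illustration with
placeholder dimensions, not PVFS code.

  /* Sketch: a strided access pattern described by one derived datatype
   * instead of an explicit (offset, length) list.  Dimensions are
   * placeholders. */
  #include <mpi.h>

  void read_matrix_column(MPI_File fh, double *buf, int nrows, int ncols, int col)
  {
      MPI_Datatype column;

      /* nrows blocks of 1 double, spaced ncols doubles apart: one column */
      MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column);
      MPI_Type_commit(&column);

      MPI_File_set_view(fh, (MPI_Offset)col * sizeof(double),
                        MPI_DOUBLE, column, "native", MPI_INFO_NULL);
      MPI_File_read_all(fh, buf, nrows, MPI_DOUBLE, MPI_STATUS_IGNORE);

      MPI_Type_free(&column);
  }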

16
I/O benchmark
(Past work)
  • S3aSim
    • A sequence similarity search algorithm framework for MPI-IO
      evaluation. It uses a master-slave parallel programming model with
      database segmentation, mimicking the mpiBLAST access pattern
      (skeleton sketched below).

A. Ching, W. Feng, H. Lin, X. Ma, and A.
Choudhary. Exploring I/O strategies for parallel
sequence database search tools with S3aSim. HPDC
2006.
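
S3aSim itself is distributed by its authors; the skeleton below only
illustrates the master-slave, database-segmentation style of work
distribution that it mimics. The tags, fragment counter, and placement of
the result I/O are assumptions for illustration.

  /* Sketch of master-slave distribution of database fragments, in the
   * style S3aSim mimics.  Tags and fragment handling are illustrative. */
  #include <mpi.h>

  #define TAG_WORK 1
  #define TAG_STOP 2

  void search_all_fragments(int nfragments)
  {
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0) {                     /* master: hand out fragments */
          int next = 0, stopped = 0, msg;
          MPI_Status st;
          while (stopped < size - 1) {
              MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                       MPI_COMM_WORLD, &st);       /* work request */
              if (next < nfragments) {
                  MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                           MPI_COMM_WORLD);
                  next++;
              } else {
                  MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                           MPI_COMM_WORLD);
                  stopped++;
              }
          }
      } else {                             /* worker: search fragments */
          int frag = -1;
          MPI_Status st;
          for (;;) {
              MPI_Send(&frag, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
              MPI_Recv(&frag, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
              if (st.MPI_TAG == TAG_STOP) break;
              /* search this database fragment; write results with MPI-IO */
          }
      }
  }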
17
Data analytic run-time library at active storage
nodes
(Future work)
  • Enhance the MPI-IO interfaces and functionality
    • Pre-defined functions
    • Plug-in user-defined functions
    • Functions embedded in the MPI data representation
  • Active storage infrastructure
    • General-purpose CPUs with GPUs and/or FPGAs
    • FPGAs for reconfiguration and acceleration of analysis functions
  • Software programming model
    • Traditional application code
    • Acceleration code for GPUs and FPGAs

18
The VISION THING!
19
Discovery of Patterns from Global Earth Science
Data Sets (Instruments, Sensors, and/or
Simulations)
  • Science Goal: understand global-scale patterns in biosphere processes
  • Earth Science questions
    • When and where do ecosystem disturbances occur?
    • What is the scale and location of land cover change and its impact?
    • How are ocean, atmosphere, and land processes coupled?
  • Data sources
    • Weather observation stations
    • High-resolution EOS satellites: 1982-2000, AVHRR at 1° x 1°
      resolution (115 km x 115 km); 2000-present, MODIS at 250m x 250m
      resolution
    • Model-based data from forecast and other models: sea level
      pressure, 1979-present, at 2.5° x 2.5°; sea surface temperature,
      1979-present, at 1° x 1°
    • Data sets created by data fusion

(Figures: Monthly Average Temperature; Earth Observing System)
20
Analytics/Knowledge Discovery Challenges
  • Spatio-temporal nature of the data
    • Traditional data mining techniques do not take advantage of spatial
      and temporal autocorrelation
  • Scalability
    • Earth Science data sets can be very large, especially for data such
      as high-resolution vegetation
    • Grid cells range from a resolution of 2.5° x 2.5° (10K locations
      for the globe) to 250m x 250m (15M locations for just California,
      about 10 billion for the globe)
  • High dimensionality
    • Long time series are common in Earth Science

21
Some Climate problems and Knowledge Discovery
Challenges
  • Challenges
    • Spatio-temporal nature of the data: traditional data mining
      techniques do not take advantage of spatial and temporal
      autocorrelation
    • Scalability: the size of Earth Science data sets has increased six
      orders of magnitude in 20 years and continues to grow with
      higher-resolution data
    • Grid cells have gone from a resolution of 2.5° x 2.5° (10K points
      for the globe) to 250m x 250m (15M points for just California,
      about 10 billion for the globe)
    • High dimensionality: long time series are common in Earth Science
  • Climate problems
    • Extend the range, accuracy, and utility of weather prediction
    • Improve our understanding and timely prediction of severe weather,
      pollution, and climate events
    • Improve understanding and prediction of seasonal, decadal, and
      century-scale climate variation on global, regional, and local
      scales
    • Create the ability to make accurate predictions of global climate
      and carbon-cycle response to various forcing scenarios over the
      next 100 years

22
Astrophysics
  • Cosmological simulations
    • Simulate the formation and evolution of galaxies
    • What is dark matter?
    • What is the nature of dark energy?
    • How did galaxies, quasars, and supermassive black holes form from
      the initial conditions in the early universe?

(Figure: snapshot from a pure N-body simulation with 1B particles,
showing the distribution of dark matter at the present time; light colors
represent greater density of dark matter. Postprocessed to demonstrate
the impact of ionizing radiation from galaxies.)
23
SDM Future Vision
  • Build a Science Intelligence and Knowledge Discoverer
    • Think of this as Oracle, SAS, NetApp, and Amazon combined into one
  • Build tools for customization to application domains (potential
    verticals)
  • Provide a toolbox for common applications
  • Develop a scientific warehouse infrastructure
  • Build intelligence into the I/O stack
  • Develop an analytics appliance
  • Develop a language and support for specifying management and
    analytics
  • Focus on needs as a more important consideration than features

24
Large-Scale Scientific Data Management and
Analysis
  • Prof. Alok Choudhary
  • ECE Department, Northwestern University
  • Evanston, IL
  • Email: choudhar@ece.northwestern.edu
  • ACKNOWLEDGEMENTS: Wei-Keng Liao, M. Kandemir, X. Shen, S. More,
    R. Thakur, G. Memik, J. No, R. Stevens
  • Project web page: http://www.ece.northwestern.edu/wkliao/MDMS

Salishan Conference on High-Speed Computing,
April 2001
25
Cosmology Application
(Figure: data organized by variables and time)
26
Virtuous Cycle
(Cycle diagram with stages: Simulation (execute app, generate data);
Problem setup (mesh, domain decomposition); Manage, Visualize, Analyze;
Measure Results, Learn, Archive)
27
Problems and Challenges
  • Large-scale data (TB, PB ranges)
  • Large-scale parallelism (unmanageable)
  • Complex data formats and hierarchies
  • Sharing, analysis in a distributed environment
  • Non-standard systems and interoperability
    problems (e.g., file systems)
  • Technology driven by commercial applications
    • Storage
    • File systems
    • Data management
  • What about analysis? Feature extraction, mining, pattern recognition,
    etc.

28
MDMS - Goals and Objectives
  • High-performance data access
    • Determine optimal parallel I/O techniques for applications
    • Data access prediction
    • Transparent data prefetching, pre-staging, caching, and subfiling
      on the storage system
    • Automatic data analysis for data mining
  • Data management for large-scale scientific computations
    • Use a database to store all metadata for performance (and other
      information); in the future, possibly XML
    • Static metadata: data location, access and storage pattern,
      underlying storage device, etc.
    • Dynamic metadata: data usage, historical performance and access
      patterns, associations and relationships among datasets
  • Support for on-line and off-line data analysis and mining

29
Architecture
(Architecture diagram: User Applications (Simulation, Data Analysis,
Visualization) exchange data, queries, input metadata, hints, directives,
and associations with the MDMS; the MDMS returns OIDs and parameters for
I/O, the best I/O function for the given parameters, and schedule,
prefetch, and cache hints for collective I/O; performance input, system
metadata, and access pattern/history metadata flow between the MDMS and
the Storage Systems, which are accessed through an I/O interface such as
MPI-IO or other interfaces.)
30
Metadata
  • Application level
    • Date, run-time parameters, execution environment, comments, result
      summary, etc.
  • Program level
    • Data types, structures
    • Association of multiple datasets and files
    • File location, file structures (single/multiple datasets in
      multiple/single files)
  • Performance level
    • I/O functions (e.g., collective/non-collective I/O parameters)
    • Access hints, access pattern, storage pattern, dataset associations
    • Striping, pooled striping, storage association
    • Prefetching, staging, migration, caching hints
    • Historical performance

31
Interface
32
Run Application
33
Dataset and Access Pattern Table
34
Data Analysis
35
Visualize
36
Incorporating Data Analysis, Mining and Feature
Detection
  • Can these tasks be performed on-line?
  • It is expensive to write data out and read it back for later analysis
  • Why not embed analysis functions within the storage (I/O) runtime
    systems?
  • Utilize resources by partitioning the system into data generators and
    analyzers

37
Integrating Analysis
(Cycle diagram with stages: Simulation (execute app, generate data);
On-line analysis and mining; Problem setup (mesh, domain decomposition);
Manage, Visualize, Analyze; Measure Results, Learn, Archive)
38
Some Publications
  • A. Choudhary, M. Kandemir, J. No, G. Memik, X. Shen, W. Liao,
    H. Nagesh, S. More, V. Taylor, R. Thakur, and R. Stevens. "Data
    Management for Large-Scale Scientific Computations in High
    Performance Distributed Systems," Cluster Computing: The Journal of
    Networks, Software Tools and Applications, 2000.
  • A. Choudhary, M. Kandemir, H. Nagesh, J. No, X. Shen, V. Taylor,
    S. More, and R. Thakur. "Data Management for Large-Scale Scientific
    Computations in High Performance Distributed Systems,"
    High-Performance Distributed Computing Conference '99, San Diego, CA,
    August 1999.
  • A. Choudhary and M. Kandemir. "System-Level Metadata for
    High-Performance Data Management," IEEE Metadata Conference, April
    1999.
  • X. Shen, W. Liao, A. Choudhary, G. Memik, M. Kandemir, S. More,
    G. Thiruvathukal, and A. Singh. "A Novel Application Development
    Environment for Large-Scale Scientific Computations," International
    Conference on Supercomputing, 2000.
  • These and more available at
    http://www.ece.northwestern.edu/wkliao/MDMS

39
Internal Architecture and Data Flow
40
In-Place On-Line Analytics Software Architecture
41
Statistical and Data Mining Functions on Active
Storage Cluster
(Future work)
  • Develop computational kernels common in analytics, data mining, and
    statistical operations for acceleration on FPGAs
  • NU-MineBench data mining package
  • Develop parallel versions of the data mining kernels that can be
    accelerated using GPUs and FPGAs (one such kernel is sketched below)

MineBench Project Homepage:
http://cucis.ece.northwestern.edu/projects/DMS
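
As an example of the kind of kernel involved, here is a plain C version of
the k-means assignment step, the data-parallel inner loop that maps
naturally onto GPUs or FPGAs. The row-major layout, dimensions, and
single-precision choice are assumptions; this is not the NU-MineBench
source.

  /* Sketch: the k-means assignment step, a data-parallel kernel of the
   * sort proposed for GPU/FPGA acceleration.  Layout and types assumed. */
  void assign_clusters(const float *points,    /* n x d, row-major */
                       const float *centroids, /* k x d, row-major */
                       int *membership, int n, int k, int d)
  {
      for (int i = 0; i < n; i++) {            /* independent per point */
          int best = 0;
          float best_dist = 0.0f;
          for (int c = 0; c < k; c++) {
              float dist = 0.0f;
              for (int j = 0; j < d; j++) {
                  float diff = points[i*d + j] - centroids[c*d + j];
                  dist += diff * diff;
              }
              if (c == 0 || dist < best_dist) {
                  best_dist = dist;
                  best = c;
              }
          }
          membership[i] = best;
      }
  }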
42
Accelerating and Computing in Storage
43
Illustration of Acceleration: (1) Classification,
(2) PCA
44
GPU Coprocessing
  • Compared to CPUs, GPUs offer 10x higher computational capability and
    10x greater memory bandwidth
    • Lower operating speed, but higher transistor count
    • More transistors devoted to computation
  • In the past, general-purpose computation on GPUs was difficult
    • Hardware was specialized
    • Programming required knowledge of the rendering pipeline
  • Now, however, GPUs look much more like SIMD machines
    • More of the GPU's resources can be applied toward general-purpose
      computation
    • Coding for the GPU no longer requires background knowledge in
      graphics rendering
  • Performance gains of 1-2 orders of magnitude are possible for
    data-parallel applications

45
k-Means Performance (compared with host processor)
46
Results
  • Matrix size: 2048

47
Challenges in Scientific Knowledge Discovery
Scientific Data Management
  • Data management
  • Query of scientific DBs
  • Performance optimizations

Knowledge Discovery
  • In-place analytics
  • Customized acceleration
  • Scalable mining
  • High-level interface
  • Proactive
  • What, not how?

(Diagram also labels: High-Performance I/O; Analytics and Mining)
48
SDM Future Vision
  • Build a Science Intelligence and Knowledge Discoverer
    • Think of this as Oracle, SAS, NetApp, and Amazon combined into one
  • Build tools for customization to application domains (potential
    verticals)
  • Provide a toolbox for common applications
  • Develop a scientific warehouse infrastructure
  • Build intelligence into the I/O stack
  • Develop an analytics appliance
  • Develop a language and support for specifying management and
    analytics
  • Focus on needs as a more important consideration than features