Scientific Data Mining - PowerPoint PPT Presentation

Learn more at: https://sdm.lbl.gov
Transcript and Presenter's Notes

1
Scientific Data Mining
  • Chandrika Kamath
  • October 7, 2008
  • Lawrence Livermore National Laboratory

2
Goal: solving the problem of data overload
  • Use scientific data mining techniques to analyze
    data from various SciDAC applications
  • Techniques borrowed from image and video
    processing, machine learning, statistics, pattern
    recognition,
  • Leveraging the Sapphire scientific data mining
    software, with functions added as required
  • Contributors to the SciDAC part: Erick Cantú-Paz,
    Imola K. Fodor, Siddharth Manay, Nicole S. Love

3
  • Overview of Sapphire

4
Sapphire: scientific data mining (1998-2008)
  • We analyze science data from experiments,
    observations, and simulations: massive and
    complex
  • Sapphire has a three-fold focus:
    • research in robust, accurate, scalable
      algorithms
    • modular, extensible software
    • analysis of data from practical problems
  • Funded through DOE NNSA, LLNL LDRD, SDM SciDAC
    Center, GSEP SciDAC project

https://computation.llnl.gov/casc/sapphire
5
Scientific data mining - from a Terabyte to a
Megabyte
  • Data flow: Raw Data, Target Data, Preprocessed
    Data, Transformed Data, Patterns, Knowledge
  • Data Preprocessing: de-noising, object
    identification, feature extraction,
    normalization, dimension reduction, data fusion,
    sampling, multi-resolution analysis
  • Pattern Recognition: classification, clustering,
    regression
  • Interpreting Results: visualization, validation
  • An iterative and interactive process
6
The Sapphire system architecture: flexible,
portable, scalable
  • Data formats: FITS, BSQ, PNM, View, ...
  • Data items
  • Sample data, fuse data, multi-resolution
    analysis
  • De-noise data, background subtraction, identify
    objects, extract features
  • Features
  • Normalization, dimension reduction
  • RDB Data Store
  • Decision trees, neural networks, SVMs, k-nearest
    neighbors, clustering, evolutionary algorithms,
    tracking, ...
  • Display patterns
  • User input and feedback
  • Components linked by Python
  • Legend: Sapphire Software, Public Domain
    Software, Sapphire Domain Software
  • US Patents: 6675164 (1/04), 6859804 (2/05),
    6879729 (4/05), 6938049 (8/05), 7007035 (2/06),
    7062504 (6/06)
7
The modular software is used to meet the needs of
different applications
  • Interfaces: command-line interface, graphical
    interface
  • Applications: remote sensing, fragmentation of
    materials, plasma physics, astronomy, video
    surveillance, climate simulations, sim/expt
    comparison, fluid mix, turbulence
  • Sapphire software: drivers, support functions
  • Sapphire libraries: scientific data processing,
    dimension reduction, pattern recognition
In this talk, I focus only on SciDAC applications
8
  • SciDAC achievements

9
Application 1: Separating signals in climate data
  • We used independent component analysis to
    separate El Niño and volcano signals in climate
    simulations
  • Showed that the technique can be used to enable
    better comparisons of simulations

Collaboration with Ben Santer (LLNL)
10
Application 2: Identifying key features for EHOs
in DIII-D
  • We used dimension reduction techniques from
    statistics and machine learning to identify key
    features associated with edge harmonic
    oscillations in the DIII-D tokamak
  • H-mode is the preferred mode of operation, but
    it is associated with ELMs, which can damage
    components of the tokamak
  • A quiescent H-mode has been observed in
    association with EHOs; we need to understand
    EHOs better
  • The key variables identified are being used to
    understand the cause of EHOs; the software has
    been licensed to GAT

Collaboration with Keith Burrell and Mike Walker
(GAT)
11
The data is from sensors in DIII-D
  • 700 experiments, each lasting 6 seconds
  • Each 50ms window of an experiment is assigned a
    low or high EHO-ness label
  • Each window is described by 37 sensor
    measurements
  • Data cleanup:
    • discard windows with at least one missing
      sensor value
    • use median value of variable in window
    • discard windows with at least one variable in
      the top or bottom percentile of its range
    • resulted in 41818 instances
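
The cleanup steps above can be sketched as follows. This is a minimal illustration and not the code used on the DIII-D data. It assumes each window is a dictionary mapping a sensor name to its samples, that a missing sample is stored as None, and that "top or bottom percentile" means trimming the given percentage of smallest and largest values of each variable across windows. The function names and the data layout are hypothetical.

```python
from statistics import median

def window_medians(window):
    """Collapse each sensor's samples in a window to their median.
    Returns None if any sample is missing, so the window is discarded."""
    row = {}
    for sensor, samples in window.items():
        if any(v is None for v in samples):
            return None
        row[sensor] = median(samples)
    return row

def trim_bounds(values, pct):
    """Bounds that exclude the pct% smallest and pct% largest values."""
    ordered = sorted(values)
    k = int(len(ordered) * pct / 100.0)
    return ordered[k], ordered[len(ordered) - 1 - k]

def clean(windows, pct=1.0):
    """Apply the three cleanup steps and return the surviving rows."""
    rows = [r for r in map(window_medians, windows) if r is not None]
    if not rows:
        return []
    bounds = {s: trim_bounds([r[s] for r in rows], pct) for s in rows[0]}
    return [r for r in rows
            if all(bounds[s][0] <= r[s] <= bounds[s][1] for s in r)]
```

Computing the trimming bounds after the missing-value filter is a design choice of this sketch; the slides do not state the order in which the steps were applied.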

12
Challenge: no preconceived notion of which sensor
values are important
  • Data cleanup prevents outliers from influencing
    results
  • Use different feature selection methods to gain
    confidence
  • PCA filter: use magnitude of coefficients
  • Distance filter: Kullback-Leibler distance
    between histograms
  • Stump filter
  • Chi-square filter
  • Boosting approach
  • Introduce a noise feature
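
One way to read the distance filter and the noise feature together is sketched below. For each feature, it builds one histogram per class and scores the feature by the Kullback-Leibler distance between the two histograms; an appended random feature then gives a baseline score that informative features should exceed. Several choices here are assumptions rather than details from the slides: the symmetrized form of the distance, add-one smoothing of the bins, a uniform random noise feature, and all function names.

```python
import math
import random

def histogram(values, lo, hi, bins, smoothing=1.0):
    """Normalized histogram on [lo, hi] with add-one smoothing (assumed)."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0
    for v in values:
        counts[min(bins - 1, int((v - lo) / width))] += 1
    total = len(values) + smoothing * bins
    return [(c + smoothing) / total for c in counts]

def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p); the symmetrized form is an assumption."""
    return sum(a * math.log(a / b) + b * math.log(b / a) for a, b in zip(p, q))

def kl_filter(rows, labels, bins=10):
    """Score every feature by the distance between its class histograms."""
    scores = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        lo, hi = min(column), max(column)
        low = histogram([v for v, y in zip(column, labels) if y == 0], lo, hi, bins)
        high = histogram([v for v, y in zip(column, labels) if y == 1], lo, hi, bins)
        scores.append(symmetric_kl(low, high))
    return scores

def kl_filter_with_noise(rows, labels, bins=10, seed=0):
    """Append a uniform random feature and return (scores, noise_score)."""
    rng = random.Random(seed)
    augmented = [list(r) + [rng.random()] for r in rows]
    scores = kl_filter(augmented, labels, bins)
    return scores[:-1], scores[-1]
```

Features that score at or below the noise feature carry no more class information than random numbers under this filter, which gives a natural cutoff for the ranking.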

13
We evaluated the features using a naïve Bayes
classifier
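
Evaluating a feature subset with a naïve Bayes classifier can be sketched as below. The slides do not say which variant was used or how error was estimated; this sketch assumes a Gaussian model per feature and class and reports the error rate on a supplied test set. The function names are hypothetical.

```python
import math

def fit_gaussian_nb(rows, labels):
    """Per class: log prior plus (mean, variance) for every feature."""
    model = {}
    for c in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == c]
        stats = []
        for j in range(len(rows[0])):
            column = [r[j] for r in members]
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[c] = (math.log(len(members) / len(rows)), stats)
    return model

def predict(model, row):
    """Return the class with the largest log posterior."""
    def log_posterior(c):
        prior, stats = model[c]
        return prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            for v, (mean, var) in zip(row, stats))
    return max(model, key=log_posterior)

def error_rate(train_rows, train_labels, test_rows, test_labels, features):
    """Train on the chosen feature indices and return the test error."""
    pick = lambda r: [r[j] for j in features]
    model = fit_gaussian_nb([pick(r) for r in train_rows], train_labels)
    wrong = sum(predict(model, pick(r)) != y
                for r, y in zip(test_rows, test_labels))
    return wrong / len(test_rows)
```

Calling error_rate with the top k features from each selection method, for increasing k, gives one error curve per method to compare.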
14
We also considered the top ten features selected
by the methods
15
Several features are common across different
methods
Multiple methods provide confidence in results
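
Agreement between methods can be quantified by counting how many of them place a feature in their top k. The sketch below assumes each method returns a ranked list of feature names; the method and feature names used with it are hypothetical placeholders, not the DIII-D sensors.

```python
from collections import Counter

def feature_votes(rankings, top_k=10):
    """Count, for each feature, how many methods rank it in their top_k."""
    votes = Counter()
    for ranked in rankings.values():
        votes.update(ranked[:top_k])
    return votes

def consensus(rankings, top_k=10, min_methods=2):
    """Features selected by at least min_methods methods, most votes first."""
    votes = feature_votes(rankings, top_k)
    chosen = [f for f, n in votes.items() if n >= min_methods]
    return sorted(chosen, key=lambda f: (-votes[f], f))
```

For example, with rankings from three methods and top_k=2, a feature that appears in all three short lists is returned first, followed by those chosen by two methods.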
16
Application 3: Classifying and characterizing
orbits in Poincaré plots
  • I am using techniques from scientific data mining
    to assign one of four labels to an orbit and
    extract characteristics of separatrix and island
    chain orbits.

Collaboration with J. Breslau, N. Pomphrey, D.
Monticello (PPPL), S. Klasky (ORNL)
17
There are four classes of orbits based on the
location of the initial point
Island chain
Quasi-periodic
Stochastic
Separatrix
18
Challenge: there is a large variation in the
orbits of any one class
quasiperiodic orbits
19
Variation in island-chain orbits
20
Variation in separatrix orbits
1000 points
5000 points
21
How do we extract representative features for an
orbit?
  • Variation in the data makes it difficult to
    identify good features and extract them in a
    robust way
  • Issues with labels assigned to orbits
  • Next steps: characterizing island chains and
    separatrix orbits

Identifying missing orbits
22
Application 4: Tracking blobs in fusion plasma
  • We are using image and video processing
    techniques to identify and track blobs in
    experimental data from NSTX to validate and
    refine theories of edge turbulence

Figure panels (times t, t1, t2): denoised original;
after removal of background; detection of blobs
Collaboration with S. Zweben, R. Maqueda, and D.
Stotler (PPPL)
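
The three stages named for this application (denoised frame, background removal, blob detection) can be illustrated with a simple sketch. The slides do not say how the background was estimated or how blobs were detected; the sketch assumes a per-pixel temporal median as the background, a fixed intensity threshold, and 4-connected component labeling. The function names are hypothetical, and frames are plain nested lists.

```python
from collections import deque
from statistics import median

def estimate_background(frames):
    """Per-pixel median over all frames (an assumed background model)."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(cols)]
            for r in range(rows)]

def subtract(frame, background):
    return [[v - b for v, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

def detect_blobs(image, threshold):
    """Return a list of blobs, each a list of 4-connected (row, col) pixels
    whose intensity exceeds the threshold."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(pixels)
    return blobs
```

A temporal median works as a background only when a blob covers any given pixel in fewer than half of the frames; slow or stationary structures would be absorbed into the background.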
23
Goal: understand the turbulence which causes
leakage of the plasma
  • Requirements for fusion: high temperature and
    confined plasma
  • Fine-scale turbulence at the edge causes leakage
    of plasma from the center to the edge
    • Loss of confinement
    • Heat loss of plasma
    • Erosion or vaporization of the containment
      wall

24
The Gas-Puff Imaging diagnostic is used to view
the coherent structures
  • Turbulence in the form of density filaments
    highly elongated in the direction of the magnetic
    field
  • Inject a gas cloud in the torus, and capture the
    intersection of the cloud with the filament using
    a camera which views the filament along the
    magnetic field

GPI view 16x32 cm
25
Data from GPI in NSTX
  • PSI-5 camera captures GPI images
  • 300 frame sequences taken at 250,000 frames/sec
  • 16-bit images with 64x64 pixels
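
A quick back-of-the-envelope calculation puts these numbers in perspective. It assumes that "300 frame sequences" means each sequence holds 300 frames and that a 16-bit pixel is stored in 2 bytes with no header or compression; both are assumptions, not statements from the slides.

```python
# Assumed reading: one sequence = 300 frames of 64x64 16-bit pixels.
frames_per_sequence = 300
frame_rate_hz = 250_000
width, height = 64, 64
bytes_per_pixel = 2  # 16 bits

frame_interval_us = 1_000_000 / frame_rate_hz                 # time between frames
sequence_duration_ms = 1000 * frames_per_sequence / frame_rate_hz
bytes_per_frame = width * height * bytes_per_pixel
bytes_per_sequence = bytes_per_frame * frames_per_sequence

print(f"frame interval: {frame_interval_us} microseconds")
print(f"sequence duration: {sequence_duration_ms} ms")
print(f"frame size: {bytes_per_frame} bytes")
print(f"sequence size: {bytes_per_sequence} bytes")
```

Under these assumptions each sequence spans only about a millisecond of plasma time and occupies a few megabytes, so the difficulty lies in the image content rather than the data volume of a single sequence.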

26
Why is this difficult?
  • coherent structures are poorly understood
    empirically and not understood theoretically
  • no known ground-truth
  • noisy images
  • variation within a sequence

27
Example frames to segment (sequence 113734,
frames 1-50)
28
We are investigating several image segmentation
methods
  • Immersion-based: basic immersion, constrained
    watershed, watershed merging
  • Region growing: seeded region growing, seed
    competition
  • Model-based: 2-D Gaussian fit
  • Challenges: how do we select the parameters in
    an algorithm, how do we handle the variability
    in the data, especially for longer sequences,
    how do the choices of algorithms and parameters
    influence the science,

Ongoing work: see AHM 2007 slides
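
Of the methods listed, seeded region growing is the simplest to sketch. The version below is a simplified single-seed variant and not the implementation under investigation: it grows one region from one seed, accepting 4-connected neighbors whose intensity is within a tolerance of the running region mean. The tolerance parameter and the function name are assumptions, and the sketch also shows why parameter selection matters, since the result depends directly on the seed and the tolerance.

```python
from collections import deque

def seeded_region_growing(image, seed, tolerance):
    """Grow a region from seed (row, col). A 4-connected neighbor joins
    if its intensity is within tolerance of the current region mean."""
    rows, cols = len(image), len(image[0])
    region = {seed}
    total = image[seed[0]][seed[1]]
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if not (0 <= ny < rows and 0 <= nx < cols):
                continue
            if (ny, nx) in region:
                continue
            if abs(image[ny][nx] - total / len(region)) <= tolerance:
                region.add((ny, nx))
                total += image[ny][nx]
                queue.append((ny, nx))
    return region
```

A full seeded region growing scheme would grow several seeded regions at once and process candidate pixels in order of similarity; this sketch omits both for brevity.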
29
Vision for the future
  • Meeting algorithm requirements of current
    applications
    • Robust extraction of feature vectors (orbit
      characterization)
    • Improved algorithms for image analysis (blob
      characterization)
    • Uncertainty quantification (how much can we
      trust the result?)
  • Meeting the science goals
    • Classification and characterization of
      Poincaré plots
    • Tracking the blobs in NSTX
    • Extraction of coherent structures in fluid and
      particle data and their non-linear
      interactions (GSEP)
  • Addressing requests from new applications: SNS,
    materials science, combustion, power grid,
  • Deploy as requested