Title: Scientific Data Mining
1Scientific Data Mining
- Chandrika Kamath
- October 7, 2008
- Lawrence Livermore National Laboratory
2Goal: solving the problem of data overload
- Use scientific data mining techniques to analyze data from various SciDAC applications
- Techniques borrowed from image and video processing, machine learning, statistics, pattern recognition, etc.
- Leveraging the Sapphire scientific data mining software, with functions added as required
- Contributors to the SciDAC part: Erick Cantú-Paz, Imola K. Fodor, Siddharth Manay, Nicole S. Love
3 4Sapphire: scientific data mining (1998-2008)
- We analyze science data from experiments, observations, and simulations: massive and complex
- Sapphire has a three-fold focus
  - research in robust, accurate, scalable algorithms
  - modular, extensible software
  - analysis of data from practical problems
- Funded through DOE NNSA, LLNL LDRD, SDM SciDAC Center, GSEP SciDAC project
- https://computation.llnl.gov/casc/sapphire
5Scientific data mining - from a Terabyte to a Megabyte
[Figure: the scientific data mining process, an iterative and interactive process]
Data states: Raw Data -> Target Data -> Preprocessed Data -> Transformed Data -> Patterns -> Knowledge
- Data Preprocessing: data fusion, sampling, multi-resolution analysis; de-noising, object identification, feature extraction, normalization; dimension reduction
- Pattern Recognition: classification, clustering, regression
- Interpreting Results: visualization, validation
6The Sapphire system architecture: flexible, portable, scalable
[Figure: Sapphire system architecture; components linked by Python, with user input and feedback]
- Data items: FITS, BSQ, PNM, View, ...
- RDB data store
- Data processing modules: sample data, fuse data, multi-resolution analysis; de-noise data, background subtraction, identify objects, extract features; normalization, dimension reduction
- Pattern recognition modules: decision trees, neural networks, SVMs, k-nearest neighbors, clustering, evolutionary algorithms, tracking, ...
- Data flow labels: Data items, Features, Display Patterns
- Legend: Sapphire Software, Public Domain Software, Sapphire Domain Software
US Patents 6675164 (1/04), 6859804 (2/05), 6879729 (4/05), 6938049 (8/05), 7007035 (2/06), 7062504 (6/06)
7The modular software is used to meet the needs of different applications
[Figure: application-specific drivers built on shared Sapphire libraries]
- Interfaces: command-line interface, graphical interface
- Applications: remote sensing, fragmentation of materials, plasma physics, astronomy, video surveillance, climate simulations, sim/expt comparison, fluid mix and turbulence
- Sapphire software: drivers, support functions
- Sapphire libraries: scientific data processing, dimension reduction, pattern recognition
In this talk, I focus only on SciDAC applications
8 9Application 1: Separating signals in climate data
- We used independent component analysis to separate El Niño and volcano signals in climate simulations
- Showed that the technique can be used to enable better comparisons of simulations
Collaboration with Ben Santer (LLNL)
10Application 2: Identifying key features for EHOs in DIII-D
- We used dimension reduction techniques from statistics and machine learning to identify key features associated with edge harmonic oscillations (EHOs) in the DIII-D tokamak
- H-mode is the preferred mode of operation, but it is associated with ELMs, which can damage components of the tokamak
- A quiescent H-mode has been observed in association with EHOs, so we need to understand EHOs better
- The key variables identified are being used to understand the cause of EHOs; the software has been licensed to GAT
Collaboration with Keith Burrell and Mike Walker
(GAT)
11The data is from sensors in DIII-D
- 700 experiments, each lasting 6 seconds
- Each 50 ms window of an experiment is assigned a low or high EHO-ness label
- Each window is described by 37 sensor measurements
- Data cleanup
  - discard windows with at least one missing sensor value
  - use median value of variable in window
  - discard windows with at least one variable in the top or bottom percentile of its range
  - resulted in 41818 instances
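The cleanup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Sapphire implementation: the function names, the nested-list data layout, and the reading of "top or bottom percentile" as the 1st and 99th percentiles of each variable across windows are all assumptions made for this example.

```python
from statistics import median

def summarize_window(samples_by_sensor):
    """Return one median per sensor, or None if any sensor value is missing."""
    row = []
    for samples in samples_by_sensor:
        if not samples or any(s is None for s in samples):
            return None  # window has a missing sensor value: discard it
        row.append(median(samples))
    return row

def percentile_bounds(values, pct=1.0):
    """Nearest-rank cut-offs: values outside [low, high] fall in the
    bottom or top `pct` percent of the sorted values (assumed reading)."""
    ordered = sorted(values)
    skip = int(len(ordered) * pct / 100.0)
    return ordered[skip], ordered[len(ordered) - 1 - skip]

def clean(windows, pct=1.0):
    """Apply the three cleanup steps and return the surviving rows."""
    rows = [summarize_window(w) for w in windows]
    rows = [r for r in rows if r is not None]
    if not rows:
        return []
    n_vars = len(rows[0])
    bounds = [percentile_bounds([r[j] for r in rows], pct) for j in range(n_vars)]
    return [r for r in rows
            if all(lo <= r[j] <= hi for j, (lo, hi) in enumerate(bounds))]

# Hypothetical toy input: one sensor, three samples per window.
windows = [[[float(i)] * 3] for i in range(10)] + [[[1.0, None, 2.0]]]
kept = clean(windows, pct=10.0)
```

With the toy input, the window containing `None` is dropped first, and then the windows holding the smallest and largest medians are dropped by the percentile rule.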
12Challenge: no preconceived notion of which sensor values are important
- Data cleanup prevents outliers from influencing results
- Use different feature selection methods to gain confidence
  - PCA filter: use magnitude of coefficients
  - Distance filter: Kullback-Leibler distance between histograms
  - Stump filter
  - Chi-square filter
  - Boosting approach
- Introduce a noise feature
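As an illustration of the distance filter and the noise feature, the sketch below scores each feature by the Kullback-Leibler distance between its histograms for the two classes. Several choices here are assumptions not stated on the slide: the symmetric form of the distance, ten bins, add-one smoothing, and the use of the noise feature as a baseline below which a feature is treated as uninformative. The data is synthetic.

```python
import math
import random

def histogram(values, lo, hi, bins):
    """Normalized histogram with add-one smoothing so no bin is empty."""
    counts = [1.0] * bins
    width = (hi - lo) / bins or 1.0
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def kl_distance(p, q):
    """Symmetric Kullback-Leibler distance: KL(p||q) + KL(q||p)."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

def rank_features(rows, labels, bins=10):
    """Return (score, feature index) pairs, highest score first."""
    scores = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        low = histogram([v for v, y in zip(col, labels) if y == 0], lo, hi, bins)
        high = histogram([v for v, y in zip(col, labels) if y == 1], lo, hi, bins)
        scores.append((kl_distance(low, high), j))
    return sorted(scores, reverse=True)

# Synthetic data: labels 0/1 stand in for low/high EHO-ness.
rng = random.Random(0)
labels = [i % 2 for i in range(400)]
rows = [[rng.gauss(2.0 * y, 1.0),  # feature 0: shifts with the label
         rng.gauss(0.0, 1.0),      # feature 1: unrelated to the label
         rng.random()]             # feature 2: injected noise feature
        for y in labels]

NOISE = 2
ranking = rank_features(rows, labels)
noise_score = next(s for s, j in ranking if j == NOISE)
selected = [j for s, j in ranking if s > noise_score]
```

Any feature that scores no better than the injected noise column is left out of `selected`, which gives a data-driven cut-off instead of a hand-picked threshold.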
13We evaluated the features using a naïve Bayes
classifier
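One way such an evaluation could look is sketched below: a naïve Bayes classifier is trained on a chosen subset of features and its hold-out error is compared across subsets. The Gaussian likelihood, the simple first-half/second-half split, and all function names are assumptions for illustration; the slide does not say which variant or validation scheme was used.

```python
import math
import random

def fit_gnb(rows, labels):
    """Estimate a log-prior plus per-feature mean and variance for each class."""
    model = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*members), means)]
        model[c] = (math.log(n / len(rows)), means, variances)
    return model

def predict(model, row):
    """Return the class with the largest log-posterior for this row."""
    def log_posterior(c):
        log_prior, means, variances = model[c]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances))
    return max(model, key=log_posterior)

def error_rate(rows, labels, features, split=0.5):
    """Train on the first part of the data, test on the rest,
    using only the listed feature indices."""
    sub = [[r[j] for j in features] for r in rows]
    cut = int(len(sub) * split)
    model = fit_gnb(sub[:cut], labels[:cut])
    wrong = sum(predict(model, r) != y
                for r, y in zip(sub[cut:], labels[cut:]))
    return wrong / (len(sub) - cut)

# Synthetic data: feature 0 depends on the label, feature 1 does not.
rng = random.Random(1)
labels = [i % 2 for i in range(400)]
rows = [[rng.gauss(3.0 * y, 1.0), rng.gauss(0.0, 1.0)] for y in labels]
err_informative = error_rate(rows, labels, [0])
err_uninformative = error_rate(rows, labels, [1])
```

Comparing `err_informative` with `err_uninformative` mirrors the idea of judging a feature subset by the error of a classifier built on it.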
14We also considered the top ten features selected
by the methods
15Several features are common across different
methods
Multiple methods provide confidence in results
16Application 3: Classifying and characterizing orbits in Poincaré plots
- I am using techniques from scientific data mining to assign one of four labels to an orbit and extract characteristics of separatrix and island chain orbits.
Collaboration with J. Breslau, N. Pomphrey, D. Monticello (PPPL), S. Klasky (ORNL)
17There are four classes of orbits based on the
location of the initial point
Island chain
Quasi-periodic
Stochastic
Separatrix
18Challenge: there is a large variation in the orbits of any one class
Quasi-periodic orbits
19Variation in island-chain orbits
20Variation in separatrix orbits
1000 points
5000 points
21How do we extract representative features for an
orbit?
- Variation in the data makes it difficult to identify good features and extract them in a robust way
- Issues with labels assigned to orbits
- Next steps: characterizing island chains and separatrix orbits
Identifying missing orbits
22Application 4: Tracking blobs in fusion plasma
- We are using image and video processing
techniques to identify and track blobs in
experimental data from NSTX to validate and
refine theories of edge turbulence
[Figure: three frames (t, t1, t2), each shown as the denoised original, after removal of background, and after detection of blobs]
Collaboration with S. Zweben, R. Maqueda, and D.
Stotler (PPPL)
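The three stages shown in the figure (denoised original, background removal, blob detection) can be illustrated with a deliberately simple sketch. It is not the method used in this work: the per-pixel temporal median as the background, the fixed threshold, the 4-connected labeling, and the synthetic frames are all assumptions chosen to keep the example short.

```python
from statistics import median

def subtract_background(frames):
    """Subtract the per-pixel temporal median from every frame."""
    rows, cols = len(frames[0]), len(frames[0][0])
    background = [[median(f[r][c] for f in frames) for c in range(cols)]
                  for r in range(rows)]
    return [[[f[r][c] - background[r][c] for c in range(cols)]
             for r in range(rows)] for f in frames]

def detect_blobs(frame, threshold):
    """Label 4-connected regions above threshold.
    Returns (centroid row, centroid column, pixel count) per region."""
    rows, cols = len(frame), len(frame[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or frame[r0][c0] <= threshold:
                continue
            stack, pixels = [(r0, c0)], []
            seen[r0][c0] = True
            while stack:
                r, c = stack.pop()
                pixels.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and frame[nr][nc] > threshold):
                        seen[nr][nc] = True
                        stack.append((nr, nc))
            n = len(pixels)
            blobs.append((sum(p[0] for p in pixels) / n,
                          sum(p[1] for p in pixels) / n, n))
    return blobs

def synthetic_frame(t, size=6):
    """Flat background of 10 with a bright 2x2 patch that moves right."""
    frame = [[10.0] * size for _ in range(size)]
    for r in (1, 2):
        for c in (2 * t, 2 * t + 1):
            frame[r][c] = 50.0
    return frame

frames = [synthetic_frame(t) for t in range(3)]
cleaned = subtract_background(frames)
tracks = [detect_blobs(f, threshold=20.0) for f in cleaned]
```

Following the centroid in `tracks` from frame to frame gives a crude trajectory for the patch. Real GPI sequences are noisy and have no ground truth, which is why the choice of method and parameters matters so much in practice.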
23Goal: understand the turbulence which causes leakage of the plasma
- Requirements for fusion: high temperature and confined plasma
- Fine-scale turbulence at the edge causes leakage of plasma from the center to the edge
  - Loss of confinement
  - Heat loss of plasma
  - Erosion or vaporization of the containment wall
24The Gas-Puff Imaging diagnostic is used to view
the coherent structures
- Turbulence takes the form of density filaments highly elongated in the direction of the magnetic field
- Inject a gas cloud in the torus, and capture the intersection of the cloud with the filament using a camera which views the filament along the magnetic field
GPI view: 16x32 cm
25Data from GPI in NSTX
- A PSI-5 camera captures GPI images
- 300-frame sequences taken at 250,000 frames/sec
- 16-bit images with 64x64 pixels
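The figures above imply a very short observation window and a small data volume per sequence. The arithmetic below works this out, assuming each 16-bit pixel is stored unpacked in 2 bytes with no file header or compression (an assumption, since the storage format is not given).

```python
FRAMES = 300            # frames per sequence (from the slide)
FPS = 250_000           # frames per second (from the slide)
WIDTH = HEIGHT = 64     # pixels per side (from the slide)
BYTES_PER_PIXEL = 2     # 16-bit pixels, assumed stored unpacked

duration_ms = FRAMES * 1000 / FPS          # length of one sequence
frame_interval_us = 1_000_000 / FPS        # time between frames
bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL
bytes_per_sequence = bytes_per_frame * FRAMES

print(f"sequence length: {duration_ms} ms")
print(f"frame interval:  {frame_interval_us} microseconds")
print(f"frame size:      {bytes_per_frame} bytes")
print(f"sequence size:   {bytes_per_sequence} bytes")
```

Each sequence therefore spans about 1.2 ms of plasma activity, with one frame every 4 microseconds and roughly 2.5 MB of raw pixel data.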
26Why is this difficult?
- coherent structures are poorly understood empirically and not understood theoretically
- no known ground-truth
- noisy images
- variation within a sequence
27Example frames to segment (sequence 113734, frames 1-50)
28We are investigating several image segmentation
methods
- Immersion-based: basic immersion, constrained watershed, watershed merging
- Region growing: seeded region growing, seed competition
- Model-based: 2-D Gaussian fit
- Challenges: how do we select the parameters in an algorithm, how do we handle the variability in the data, especially for longer sequences, and how do the choices of algorithms and parameters influence the science?
Ongoing work: see AHM 2007 slides
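To make the parameter question concrete, here is a minimal sketch of seeded region growing, one of the methods listed above. The acceptance rule (a pixel joins if its intensity is within a tolerance of the running region mean), the 4-connectivity, and the toy image are assumptions for illustration; the variants investigated in this work may differ.

```python
from collections import deque

def seeded_region_growing(image, seed, tolerance):
    """Grow a 4-connected region from `seed`, adding neighbors whose
    intensity is within `tolerance` of the current region mean."""
    rows, cols = len(image), len(image[0])
    region = {seed}
    total = float(image[seed[0]][seed[1]])
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or (nr, nc) in region:
                continue
            if abs(image[nr][nc] - total / len(region)) <= tolerance:
                region.add((nr, nc))
                total += image[nr][nc]
                frontier.append((nr, nc))
    return region

# Toy image: a bright 3x3 patch on a dark background.
image = [
    [0,   0,   0,  0, 0],
    [0,  98, 100, 97, 0],
    [0, 101, 105, 99, 0],
    [0, 100, 102, 98, 0],
    [0,   0,   0,  0, 0],
]
blob = seeded_region_growing(image, seed=(2, 2), tolerance=10)
tight = seeded_region_growing(image, seed=(2, 2), tolerance=1)
```

With a tolerance of 10 the region covers the whole bright patch, while a tolerance of 1 leaves only the seed pixel. The result depends strongly on both the seed and the tolerance, which is exactly the sensitivity the challenges bullet refers to.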
29Vision for the future
- Meeting algorithm requirements of current applications
  - Robust extraction of feature vectors (orbit characterization)
  - Improved algorithms for image analysis (blob characterization)
  - Uncertainty quantification (how much can we trust the result?)
- Meeting the science goals
  - Classification and characterization of Poincaré plots
  - Tracking the blobs in NSTX
  - Extraction of coherent structures in fluid and particle data and their non-linear interactions (GSEP)
- Addressing requests from new applications: SNS, materials science, combustion, power grid, etc.
- Deploy as requested