Title: Information Technology and Systems Center
1Data Mining in Earth Sciences
Rahul Ramachandran, Sara Graves and Ken Keiser
Mathematical Challenges in Scientific Data Mining, IPAM, January 14-18, 2002
- Information Technology and Systems Center
- University of Alabama in Huntsville
- http://datamining.itsc.uah.edu
2Outline
- Introduction
- ADaM System
- Data Mining Taxonomy for Earth Science
- Event/Relationship based
- Application Examples
- Dimensionality Reduction
- References
3Reasons for Data Mining of Earth Science Data
- Greatly increased data volume due to improvements in data collection/access/availability/storage technology (instruments, computational resources, internet)
- Data from Terra alone are about 1 terabyte per day, more than can be analyzed by conventional means
- High variability in data formats and content
- Need for high returns on expensive data investments
- Need for improved access/availability of data, information and knowledge
- Need for higher-level products for non-specialists and interdisciplinary/cross-domain researchers
- Questions/queries are getting more complex due, in part, to the heterogeneous nature of the data
4Characteristics of Earth Science Data
- High variability
- Type
- Geostationary
- Polar Orbiting
- Structure
- Raster
- Vector
- Resolution
- Fine: AVHRR, 1 km
- Coarse: SSM/I, 20 km
- Multi/Hyper Spectral
- Processing stage
- Level 0: Raw data, instrument counts
- Level 1: Annotated with geo-reference information
- Level 2: Transformed by algorithm into geophysical parameter
- Level 3: Spatial/temporal resampling
- Level 4: Includes additional model data
5Characteristics of Earth Science Data
- Need to know physical basis (domain knowledge) before applying statistical techniques
- Multiple time scales
- Wide variety of data formats
- Includes spatial/temporal information
- Typically needs domain-specific algorithms
6ADaM History
- Algorithm Development and Mining (ADaM) System
- The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata.
- It contains over 120 different operations that can be performed on the input data stream.
- Operations vary from specialized atmospheric science data-set-specific algorithms to digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks and genetic algorithms.
- Developed an Event/Relationship Search System for the environment
7ADaM Engine Architecture
[Diagram labels: Data, Translated Data, Preprocessed Data, Patterns/Models, Results; stages: Processing, Preprocessing, Analysis]
- Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
- Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
- Image Processing: Cropping, Inversion, Thresholding
- Others...
- Clustering: K Means, Isodata, Maximum
- Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
- Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
- Genetic Algorithms
- Neural Networks
- Others...
8 ADaM Mining Environment
Data Mining Server
Mining Results
Event/ Relationship Search System
9Data Mining Taxonomy
10Event-based Mining
- Known events/Known algorithms
- Tropical Cyclones from AMSU-A data
- Known events/Learned algorithms
- Rainfall estimation from SSM/I data
- Lightning Detection from OLS data
- Unknown event/Unknown algorithm
- Target Independent Mining
11Known Event/Known Algorithm
I know what phenomena to detect and I have
the algorithm to do so!
Results
Add algorithm to Mining Environment
- Relationship analysis
- Coincidence searches
- Input for other algorithms
Earth Science Data Sets
12Tropical Cyclone Detection: Estimating Maximum Wind Speed
- Scientist: Dr. Roy Spencer (GHCC/MSFC NASA)
- Data used: Advanced Microwave Sounding Unit-A (AMSU-A)
- Radiometer can detect temperatures at different levels of the atmosphere
- Surface winds in tropical cyclones are directly related to the warm middle- and upper-atmosphere temperatures which exist around the cyclone center
- AMSU-A measures this warmth at several frequencies near 55 gigahertz (GHz)
- Calibrated using aircraft reconnaissance measurements in tropical depressions, tropical storms, and hurricanes from the 1998 Atlantic hurricane season
- Tropical cyclone detection based on ice scattering, water vapor and wind speed
13Tropical Cyclone Detection: Estimating Maximum Wind Speed
Advanced Microwave Sounding Unit (AMSU-A) Data
- Water cover mask to eliminate land
- Laplacian filter to compute temperature gradients
- Science algorithm to estimate wind speed
- Contiguous regions with wind speeds above a desired threshold identified
- Additional test to eliminate false positives
- Maximum wind speed and location produced
[Diagram labels: Calibration / Limb Correction / Converted to Tb; Hurricane Floyd; Data Archive; Mining Environment; Result]
Results are placed on the web and made available to the National Hurricane Center and the Joint Typhoon Warning Center
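The region-finding and maximum-wind steps listed on this slide can be sketched as a threshold followed by a connected-component search. This is a minimal illustration, not the ADaM implementation: the grid values, the threshold, the use of `None` for masked land cells, and the names `find_regions` and `peak` are all assumptions made for the example.

```python
from collections import deque

def find_regions(wind, threshold):
    """Group cells whose wind estimate exceeds threshold into
    4-connected regions. Cells set to None (masked land) are skipped."""
    rows, cols = len(wind), len(wind[0])
    seen = [[False] * cols for _ in range(rows)]

    def above(r, c):
        return wind[r][c] is not None and wind[r][c] > threshold

    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or not above(r, c):
                continue
            # Breadth-first search collects one contiguous region.
            queue = deque([(r, c)])
            seen[r][c] = True
            region = []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and above(ny, nx)):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions

def peak(wind, region):
    """Maximum wind speed in a region and the cell where it occurs."""
    y, x = max(region, key=lambda cell: wind[cell[0]][cell[1]])
    return wind[y][x], (y, x)

# Hypothetical wind-speed grid; None marks land removed by the water mask.
wind = [
    [5, 8, 20, 22, None],
    [4, 19, 35, 21, None],
    [3, 6, 18, 7, 5],
    [2, 3, 4, 5, 25],
]
regions = find_regions(wind, threshold=17)
peaks = [peak(wind, region) for region in regions]
```

With this made-up grid the search yields two regions, and `peaks` holds the maximum value and its grid position for each. The additional false-positive test mentioned on the slide is not modeled here.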
14Known Event/Learned Algorithm
Data Mining System
I know what phenomena I want to detect but I
do not know the characteristics of the phenomena
Results
Refine your algorithm using iteration
- Relationship analysis
- Coincidence searches
- Input for other algorithms
Earth Science Data Sets
15Rainfall Estimation and Identification Study
using SSM/I data
- Scientist: Dr. Steve Goodman (GHCC/MSFC NASA)
- Goal: to determine whether generic pattern recognition techniques could be applied to SSM/I data to detect rain
- Minimum Distance Classifier, Back-propagation Neural Network (BPNN) and Discrete Bayes Classifier were compared against a science algorithm (WetNet PIP algorithm)
- US composite rainfall product was used as ground truth
[Figure panels: Subsetted SSM/I data; NEXRAD Composite data]
16Rainfall Estimation and Identification Study
using SSM/I data
- SSM/I and US rain data over the southeastern United States for January and July 1995 were compared in the study
- SSM/I and radar data were gridded and registered to establish spatial and temporal coincidence
- BPNN performance was comparable to that of the WetNet PIP SSM/I rain rate algorithm
- Performance of the Bayes classifier was not as good as that of the WetNet PIP SSM/I rain rate algorithm. This is perhaps due to the small sample size used for estimating density functions of the two classes (rain and non-rain)
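Of the three classifiers compared in this study, the minimum distance classifier is the simplest to illustrate: each class is represented by the mean of its training vectors, and a new vector is assigned to the class with the nearest mean. The sketch below assumes Euclidean distance and uses made-up two-channel feature values; the feature set and training data actually used in the study are not given on the slides.

```python
import math

def class_means(samples):
    """samples maps a class label to a list of feature vectors.
    Returns a dict mapping each label to its mean feature vector."""
    means = {}
    for label, vectors in samples.items():
        n = len(vectors)
        dims = len(vectors[0])
        means[label] = [sum(v[i] for v in vectors) / n for i in range(dims)]
    return means

def classify(means, x):
    """Assign x to the class whose mean is nearest (Euclidean distance)."""
    return min(means, key=lambda label: math.dist(means[label], x))

# Hypothetical training vectors for the two classes (values are invented).
training = {
    "rain": [[250.0, 240.0], [255.0, 245.0]],
    "no_rain": [[200.0, 180.0], [210.0, 190.0]],
}
means = class_means(training)
label_a = classify(means, [248.0, 238.0])
label_b = classify(means, [207.0, 188.0])
```

Because the class description is only a mean vector, this classifier needs very few training samples, in contrast to the density estimates required by the Bayes classifier.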
17Lightning Detection in Operational Linescan
System (OLS) Images
- Scientist: Dr. Steve Goodman (GHCC/MSFC NASA)
- Goal: to identify lightning streaks in night-time portions of OLS images
- OLS is carried by DMSP satellites and produces a visible and a thermal image
- Lightning shows up as bright horizontal streaks, as do city lights and moonlight reflected off the clouds
- Approach based on morphological filtering and gradient detection was selected
- Both visible and thermal bands used
18Lightning Detection in Operational Linescan
System (OLS) Images
- Erosion and dilation were used to find areas in/near clouds; other areas were removed
- Gradient detection in the direction of satellite propagation was applied to the visible image to extract horizontal streaks
- Texture measures were used to identify areas of small patchy cloud cover which exhibited small bright streaks
- Genetic algorithm was used to tune parameters of the classification during training

Results (% Accuracy)
                   Correctly Detected   False Positives   False Negatives
Training Results   80                   0.7               19.2
Test Results       78.2                 4.3               17.3
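For readers unfamiliar with the morphological operators named on this slide, the sketch below shows binary erosion and dilation with a 3x3 structuring element. How the authors combined the two operators on OLS imagery is not stated, so the erosion-then-dilation sequence (an "opening") and the small mask are assumptions chosen only to show the effect: small isolated specks disappear while larger areas survive.

```python
def _neighbourhood(mask, r, c):
    """Yield the 3x3 neighbourhood of (r, c); cells outside the mask count as 0."""
    rows, cols = len(mask), len(mask[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            yield mask[rr][cc] if 0 <= rr < rows and 0 <= cc < cols else 0

def erode(mask):
    """A cell stays set only if its whole 3x3 neighbourhood is set."""
    return [[int(all(_neighbourhood(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def dilate(mask):
    """A cell becomes set if any cell in its 3x3 neighbourhood is set."""
    return [[int(any(_neighbourhood(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]

# Hypothetical binary mask: a 3x3 block plus one isolated pixel at (2, 5).
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
opened = dilate(erode(mask))  # the block survives, the isolated pixel is removed
```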
19Unknown Event/Unknown Algorithm
Data Mining System
I want to find anomalies in the data sets !
Results
Let the miner discover it
- Relationship analysis
- Coincidence searches
- Input for other algorithms
Earth Science Data Sets
Example Target Independent Mining
20Target Independent Mining of SSM/I Data
- Mine for data in a target-independent manner (no specific phenomena under consideration)
- Interested in transient phenomena that move through an area
- Transient phenomena characterized as deviation from normal
- Objective: data reduction with minimum loss of information
- Size of remotely sensed data prevents it from being maintained on-line
- Data is archived in much slower tertiary storage
- Need to develop techniques to minimize the need for data access from the tertiary storage
- Procedure: overlay the Earth's surface with a constant grid consisting of cells
- For each cell, a maximum and a minimum trend line are computed
- Maximum trend line is computed by forming a set of maximum values for a day over some period (month)
- Median for a series of months is used to form the maximum trend line
- Same procedure is used to calculate the minimum trend line
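One possible reading of the trend-line procedure for a single grid cell is sketched below: for each month, take the daily maxima and daily minima, and use their medians as that month's points on the maximum and minimum trend lines. The slide leaves the exact aggregation open, so this interpretation, the `margin` tolerance, the function names, and the sample numbers are all assumptions.

```python
from statistics import median

def trend_lines(months):
    """months: list of months, each a list of days, each a list of
    observations for one grid cell. Returns (max_trend, min_trend),
    with one point per month."""
    max_trend = [median(max(day) for day in month) for month in months]
    min_trend = [median(min(day) for day in month) for month in months]
    return max_trend, min_trend

def deviates(value, low, high, margin=0.0):
    """Flag a value lying outside the normal band [low - margin, high + margin]."""
    return value > high + margin or value < low - margin

# Hypothetical observations for one cell: two months of three days each.
months = [
    [[10, 12], [11, 15], [9, 13]],
    [[20, 22], [19, 30], [21, 23]],
]
max_trend, min_trend = trend_lines(months)
flag_high = deviates(30, min_trend[1], max_trend[1], margin=5)
flag_normal = deviates(24, min_trend[1], max_trend[1], margin=5)
```

Only the values that deviate from the band would need to be kept as metadata, which is where the data reduction comes from.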
21Target Independent Mining of SSM/I Data
Trend Lines Represent What Is Normal
22Target Independent Mining of SSM/I Data
- Extracted metadata not oriented toward any particular transient phenomena
- Laboratory tests show 98% data compression while preserving 92% of the MCSs (mesoscale convective systems) detectable in raw data
- MCS events represented only 6.7% of extracted metadata
23Relationship-based Mining
- Coincident Association
- VARGA Algorithm for multispectral data
- Localized Spatial Association
- Cumulus Cloud Classification in GOES Imagery
- Temporal Association
24Coincident Association Mining
- Use market basket analysis to mine for association rules in vector data
- Rule has the form X -> Y
- Rule characterized by
- Support
- % of vector instances that contain both X and Y
- How likely is it that the rule is applicable?
- Confidence
- What % of vector instances that contain X also contain Y?
- Estimate of the conditional probability of Y given X
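Support and confidence as described on this slide reduce to two counts over the vector instances. The sketch below treats each instance as a set of (channel, value) items; the instances are invented for illustration and the function name `support_confidence` is hypothetical.

```python
def support_confidence(instances, antecedent, consequent):
    """instances: list of item sets; antecedent (X) and consequent (Y): item sets.
    support    = fraction of instances containing both X and Y
    confidence = fraction of instances containing X that also contain Y"""
    n_x = sum(1 for inst in instances if antecedent <= inst)
    n_xy = sum(1 for inst in instances if (antecedent | consequent) <= inst)
    support = n_xy / len(instances)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

# Hypothetical vector instances; each item is a (channel, binned value) pair.
instances = [
    {("19V", 180), ("37H", 140), ("37V", 200)},
    {("19V", 180), ("37H", 140), ("37V", 200)},
    {("19V", 180), ("37H", 140), ("37V", 210)},
    {("19V", 190), ("37H", 150), ("37V", 210)},
]
support, confidence = support_confidence(
    instances,
    antecedent={("19V", 180), ("37H", 140)},
    consequent={("37V", 200)},
)
```

Here the antecedent occurs in three of the four instances and the full rule in two, giving a support of 2/4 and a confidence of 2/3.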
25Coincident Association Applied to Multi-spectral
Data Mining
- Developed and implemented the Vector Association Rule Generation Algorithm (VARGA) as a modification to market-basket association rule mining.
- Modified to minimize memory usage for large multi-spectral satellite data such as SSM/I (90 megabytes per day uncompressed)
- Example SSM/I rule
- (19V, 180.0) (37H, 140.0) -> (37V, 200.0)   0.117037   0.945986
26Localized Spatial Association Mining
- Extract association rules to characterize texture (dissertation of Dr. John Rushing)
- Each pixel in an n x n neighborhood is characterized by the triple (X, Y, I)
- X and Y: the offsets from the pixel at the neighborhood center
- I: its intensity
- Association rules can then be characterized by relationships between the triples
27Association Rule Example
- The rule specified in the figure can be applied to this image at 9 of the 16 pixel locations because of the pixel offsets in the rule.
- Of these 9 locations, the antecedent matches at 5 locations, and both the antecedent and consequent match at 3 locations.
- This yields a support of 3/9 = 33.33% and a confidence of 3/5 = 60%.
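The counting behind these figures can be sketched in a few lines. The slide's image and rule are not reproduced in the text, so the 4x4 image below is hypothetical and was constructed to give the same counts (9 applicable locations, 5 antecedent matches, 3 full matches); the rule (0,0,2) (1,1,2) -> (1,0,0) is used as an illustrative rule, and the sketch assumes non-negative offsets.

```python
def rule_stats(image, antecedent, consequent):
    """image[y][x] holds pixel intensities; rule parts are lists of
    (dx, dy, intensity) triples with non-negative offsets.
    Returns (applicable locations, antecedent matches, full matches)."""
    rows, cols = len(image), len(image[0])
    triples = antecedent + consequent
    max_dx = max(dx for dx, _, _ in triples)
    max_dy = max(dy for _, dy, _ in triples)

    def matches(x, y, part):
        return all(image[y + dy][x + dx] == i for dx, dy, i in part)

    applicable = n_ante = n_both = 0
    # Only locations where every offset stays inside the image are applicable.
    for y in range(rows - max_dy):
        for x in range(cols - max_dx):
            applicable += 1
            if matches(x, y, antecedent):
                n_ante += 1
                if matches(x, y, consequent):
                    n_both += 1
    return applicable, n_ante, n_both

# Hypothetical 4x4 image with three gray levels (0, 1, 2).
image = [
    [2, 0, 2, 0],
    [1, 2, 0, 2],
    [2, 1, 2, 1],
    [0, 2, 1, 2],
]
applicable, n_ante, n_both = rule_stats(
    image, antecedent=[(0, 0, 2), (1, 1, 2)], consequent=[(1, 0, 0)]
)
support = n_both / applicable      # 3/9
confidence = n_both / n_ante       # 3/5
```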
28Association Rule Example
29GOES Cumulus Cloud Classification Why Texture
Features?
- Cumulus cloud fields have a very characteristic
texture signature in the GOES visible imagery
30GOES Cumulus Cloud Classification The Need
- Cloud systems are important modulators of Earth's radiation budget
- Large uncertainties are associated with cloud radiative forcing
- Radiative energy budget is impacted by change in distribution of clouds
- Cumulus clouds are a cloud field type that could respond strongly to climate change
- Knowledge of cloud geometry, size and spatial distribution is needed for the representation of cumulus clouds in radiative transfer models
- To derive models of cloud field characteristics, automated cumulus cloud detection schemes are required to analyze large amounts of data
31GOES Cumulus Cloud Classification Purpose of
this study
- Compare different techniques for detecting cumulus cloud fields in Geostationary Operational Environmental Satellite (GOES) imagery
- Comparison based on
- Accuracy of detection
- Amount of time required to classify
- Feature measures used along with the Maximum Likelihood Classifier
- Texture features
- Gray Level Co-Occurrence Matrix (GLCM)
- Gray Level Run Length (GLRL) features
- Association Rules
- Edge detection features
- Sobel filter
- Laplacian filter
- Combination of Sobel and Laplacian filters
32GOES Cumulus Cloud Classification Texture
Features (1)
- Gray Level Co-Occurrence Matrix (GLCM)
- First texture feature vector to be developed
- GLCM is used as a benchmark
- It is based on a positional operator
- Positional operator defines the relationship between pixels in terms of an (x, y) offset or a (distance, angle) offset
- Co-occurrence matrix is an N x N matrix, where N is the number of gray levels; feature functions are computed on the matrix
- Gray Level Run Length (GLRL) features
- Gray level statistical features based on homogeneous gray level runs
- A run is a series of consecutive pixels of the same intensity
- Run lengths are computed at orientations in increments of 45 degrees, starting at 0 degrees
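To make the co-occurrence matrix concrete, the sketch below counts gray-level pairs for one positional operator given as an (x, y) offset and then computes contrast, one commonly used function on the matrix. The tiny image, the choice of contrast, and the function names are assumptions for illustration; the slides do not say which GLCM features were used in the study.

```python
def glcm(image, dx, dy, levels):
    """Count how often gray level i at (x, y) co-occurs with gray level j
    at (x + dx, y + dy). Returns a levels x levels matrix of counts."""
    rows, cols = len(image), len(image[0])
    matrix = [[0] * levels for _ in range(levels)]
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                matrix[image[y][x]][image[ny][nx]] += 1
    return matrix

def contrast(matrix):
    """Sum of (i - j)^2 weighted by the normalised co-occurrence frequency."""
    total = sum(map(sum, matrix))
    return sum((i - j) ** 2 * count / total
               for i, row in enumerate(matrix)
               for j, count in enumerate(row))

# Hypothetical 3x3 image with N = 3 gray levels.
image = [
    [0, 0, 1],
    [1, 2, 2],
    [0, 1, 1],
]
matrix = glcm(image, dx=1, dy=0, levels=3)  # operator: "one pixel to the right"
texture_contrast = contrast(matrix)
```

The matrix size depends only on the number of gray levels, which is why images are usually quantised to a small N before GLCM features are computed.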
33GOES Cumulus Cloud Classification Texture
Features (2)
- Association Rules
- Often used in business applications to identify relationships in databases
- Adapted to discriminate textures in images
- Based on frequently occurring local image structures
- Triples: (Pos X, Pos Y, Pixel Intensity)
- Rule: (0,0,2) (1,1,2) -> (1,0,0)
- Then calculate support and confidence of this rule
34GOES Cumulus Cloud Classification Edge Detection
Features
- These techniques are used for detecting discontinuities in an image
- These techniques apply a local derivative operator on the image
- Sobel filters
- Calculate the magnitude of the rate of change of gray level and the direction of this change vector
- Magnitude = |G_x| + |G_y|
- Direction = \tan^{-1}(G_x / G_y)
- G_x = (z_7 + 2z_8 + z_9) - (z_1 + 2z_2 + z_3)
- G_y = (z_3 + 2z_6 + z_9) - (z_1 + 2z_4 + z_7)
- Laplacian filters
- Second-order derivative
- F(z) = 4z_5 - (z_2 + z_4 + z_6 + z_8)
3x3 neighborhood used by the filters:
z_1 z_2 z_3
z_4 z_5 z_6
z_7 z_8 z_9
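The Sobel and Laplacian expressions on this slide can be evaluated directly on a single 3x3 neighborhood. The sketch below assumes the conventional forms, a difference between the two weighted sums for each gradient component and |Gx| + |Gy| for the magnitude, since the operators are not legible in the extracted text. The sample neighborhood is invented.

```python
def sobel(z):
    """z is a 3x3 neighbourhood [[z1, z2, z3], [z4, z5, z6], [z7, z8, z9]].
    Returns (gx, gy, magnitude) with magnitude approximated as |gx| + |gy|."""
    (z1, z2, z3), (z4, _z5, z6), (z7, z8, z9) = z
    gx = (z7 + 2 * z8 + z9) - (z1 + 2 * z2 + z3)  # bottom row minus top row
    gy = (z3 + 2 * z6 + z9) - (z1 + 2 * z4 + z7)  # right column minus left column
    return gx, gy, abs(gx) + abs(gy)

def laplacian(z):
    """Second-derivative response: 4 * centre minus the four edge neighbours."""
    (_z1, z2, _z3), (z4, z5, z6), (_z7, z8, _z9) = z
    return 4 * z5 - (z2 + z4 + z6 + z8)

# Hypothetical neighbourhood containing a horizontal edge along the bottom row.
z = [
    [0, 0, 0],
    [0, 0, 0],
    [10, 10, 10],
]
gx, gy, magnitude = sobel(z)
lap = laplacian(z)
```

Sliding these functions over every interior pixel of an image gives the edge-feature maps; a window of such responses would then feed the classifier.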
35GOES Cumulus Cloud Classification Experiment
Process
- Training
- Samples selected from a 1000x1000 GOES scene
- Only two classes are used: Cumulus and Others (includes background)
- For validation, samples were labeled by at least two experts, and only pixels where the experts agreed were used for training
- Maximum likelihood classifier was trained using GLCM, GLRL, association rule and edge detection features
- Window size was varied from 5x5 to 11x11
- Testing
- 12 different GOES images (512x512) were used for testing
- Classification results were compared against expert-labeled images
- Confusion matrix, classification accuracy and experiment run times were calculated
36GOES Cumulus Cloud Classification Sample Result
[Figure panels: Original, Expert Labeled, GLCM, GLRL, Association Rules, Sobel, Laplacian, Sobel + Laplacian]
37GOES Cumulus Cloud Classification Conclusions
- Accuracy
- Best results using texture features
- GLRL (78%) with a filter size of 11x11
- Association Rules (75%) with a filter size of 5x5
- GLCM gave the worst results (51-55%)
- Best results using edge detection filters
- Sobel filter (78%) with a filter size of 11x11
- Laplacian (73%) with a filter size of 9x9
- Laplacian and Sobel (75%) with a filter size of 9x9
- Timing results
- Times were measured on a 933 MHz Pentium III PC with 512 MB memory
- Texture feature techniques in general required an order of magnitude more time than edge detection filters
38Dimensionality Reduction Mesoscale Convective
System (MCS) Detection
Populating Knowledge Base (reducing data volume)
- Define the experiment
- Select algorithm (Devlin)
- Automatic extraction of MCSs from SSM/I data
[Diagram labels: Scientists; SSM/I Data; Mining Results (MCSs); Knowledge Base; Event/Relationship Search System]
39Dimensionality Reduction Research Analysis
- Reduced amount of data
- Allow scientists to pose questions and get results
- Allow easy visualization
- Maximize knowledge discovery / minimize data handling
- Scientists can refine their knowledge repository
- Answer the science questions
- Analysis
- Find MCSs over river basins in the Middle East?
- Data sets
- MCSs
- River basin data set
- Political boundaries
40Dimensionality Reduction Knowledge Reuse
- Climatological study of MCSs
- What is the latitudinal distribution of MCSs?
- Which continent has more MCSs?
- What is the size distribution of the MCSs for JUN-JUL-AUG?
- What is the relationship between the number of MCSs and their intensities?
- Do results vary for El Niño years?
41Event/Relationship Search System
- Allows users to conduct coincidence searches and relationship tests between mined phenomena and a variety of parameters
- Parameters include geographic regions, political boundaries, or other named phenomena for a specific time period
42References
- Graves, Sara J., Thomas Hinke, Shanlini Kansal, "Metadata: The Golden Nuggets of Data Mining", First IEEE Metadata Conference, Bethesda, Maryland, April 16-18, 1996
- Hinke, Thomas, John Rushing, Shanlini Kansal, Sara J. Graves, Heggere S. Ranganath, "For Scientific Data Discovery: Why Can't the Archive be More Like the Web", Proceedings Ninth International Conference on Scientific Database Management, Evergreen State College, Olympia, Washington, August 11-13, 1997
- Hinke, Thomas, John Rushing, Heggere S. Ranganath, Sara J. Graves, "Techniques and Experience in Mining Remotely Sensed Satellite Data", Artificial Intelligence Review 14 (6), Issues on the Application of Data Mining, pp 503-531, December 2000
- Hinke, Thomas, John Rushing, Shanlini Kansal, Sara J. Graves, Heggere S. Ranganath, Evans Criswell, "Eureka Phenomena Discovery and Phenomena Mining System", AMS 13th Intl Conference on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography and Hydrology, 1997
43References
- Hinke, Thomas, John Rushing, Heggere S. Ranganath, Sara J. Graves, "Target-Independent Mining for Scientific Data: Capturing Transients and Trends for Phenomena Mining", Proceedings Third International Conference on Data Mining (KDD-97), Newport Beach, California, August 14-17, 1997
- Keiser, Ken, John Rushing, Helen Conover, Sara J. Graves, "Data Mining System Toolkit for Earth Science Data", Earth Observation (EO) Geo-Spatial (GEO) Web and Internet Workshop, Washington, D.C., February 1999
- Rushing, John, Heggere S. Ranganath, Thomas Hinke, Sara J. Graves, "Using Association Rules as Texture Features", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 23, No. 8, 845-858, 2001
- Nair, Udaysankar J., John Rushing, Rahul Ramachandran, Kwo-Sen Kuo, Sara J. Graves, Ron Welch, "Detection of Cumulus Cloud Fields in Satellite Imagery", The International Symposium on Optical Science, Engineering and Instrumentation, Denver, 1999
- Nair, U., J. Rushing, R. Ramachandran, R. Welch, and S. J. Graves, "Detection of boundary layer cumulus cloud fields in GOES satellite imagery", submitted to Journal of Applied Meteorology, September 2001