Title: Environmental Data Warehousing and Mining
1Environmental Data Warehousing and Mining
- Nabil R. Adam
- Vijay Atluri, Dihua Guo, Songmei Yu
- Rutgers University CIMIC
- NSF Workshop on Next Generation Data Mining (NGDM02), November 1-3, 2002
2Outline
- Setting
- A Real-world Lab: The NJ Meadowlands Area
- Motivating Examples
- Environmental Data Warehousing
- Environmental Data Mining
3MERI (Meadowlands Environmental Research
Institute)
- Established in 1998 as a collaboration between the New Jersey Meadowlands Commission and Rutgers CIMIC.
- Provides a world-class environmental research institute for urban and coastal wetlands, focused on the district
- Administered by Rutgers-CIMIC
- Mission
- Conduct and sponsor research in ecology,
environmental science and information technology
to monitor, preserve and improve ecological and
human health and welfare in the Meadowlands
District, NJ.
4MERI
- Budget: 1.6 Million/year (2002-2007)
- Staff
- Faculty, Students, FT NJMC/Rutgers Scientists/Staff
- Disciplines
- Biology, Ecology, Geology, Environmental Sc., Hydrological Modeling, Remote Sensing/Geographic Information Systems, and Information Technology
- Work closely with the NJ Meadowlands Commission to disseminate research results
- To the scientific community and the various government agencies
- Information and technology transfer to local municipalities
- Develop scientific content for education and exhibits
- Provide high school and college students with science internships
5Digital Meadowlands
[Diagram labels around "Digital Meadowlands": Processed Satellite Images, 3D Visualization, NASA Archives, Environmental Parameters, Reports, Fly-by/Drill-down, Radar, Interactive Maps, Monitoring Stations, Sensors, Satellite Imagery (AVHRR), Aerial Photos, Documents, Maps]
6Visualization Drill-down
7- ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer)
- Ground resolution: 15 m (bands 1-3), 30 m (bands 4-9), 90 m (bands 10-14)
- Spectral bands: 14
- Swath width: 60 km
- Application
- Daily monitoring of flood-prone areas. Flood-prone areas are shown in red. Under flood conditions the sensor would detect water (blue) covering flood-prone areas (red).
8Satellite Images
- Various data sources and data types
- Various types of satellite images with different resolutions, captured by different sensors
- AVHRR: direct downloads from polar-orbiting satellites (NOAA 12, NOAA 14 and NOAA 15), 1 km resolution, 5 bands
- LANDSAT and RADAR: obtained from NASA archives, 30 m resolution, 7 bands
- Hyper-spectral images: 34-band images from the AISA (Airborne Imaging Spectrometer for Applications) sensor, 250-1000 m resolution
- Aerial ortho-photographs: high-resolution (1 m) images (IKONOS, QUICKBIRD)
- MODIS: images for global dynamics and processes occurring on the land, in the oceans, and in the lower atmosphere, from NASA's satellites Terra and Aqua, 90 m resolution, 36 bands
- ASTER: detailed maps of land surface temperature, emissivity, reflectance and elevation from NASA's satellite Terra, 14 bands, 15-90 m resolution
9Users access database through the Worldwide Web
Automated, near real-time monitoring system
Weather station with data logger
Water monitor wired to data logger
WWW interface
modem link
Central computer ingests data and stores it in a
database
10Real Time Data from Monitoring Stations
11Use of Water Quality Data: Tracking the effectiveness of pollution control measures
[Chart annotation: Regulatory minimum level]
12One Example of Satellite Imagery AVHRR
14Information Sources -- Traditional
- Include
- A Library of a large variety of Documents
- Scientific Publication
- Guidelines and Regulations
- Measurements and Impact Studies
- Documents contain Text, Tables, Pictures, Drawings and Maps
- Census information that describes the socio-economic and health characteristics of the population
15The User Community
- Researchers, faculty, graduate students in a variety of disciplines including biology, ecology, geology, environmental science, and IT
- make scientific observations such as the changes in vegetation pattern and their effect on temperature over the years
- Policy Makers
- query various critical parameters such as ambient air and water quality and visualize the results in a graphical form
- gain help in the evaluation and formulation of environmental policies
- The Public
- learn information about their county, community, home on such issues as environment, health, and infrastructure
- K-12 Educators and students
16The Data Volume
- Satellite images
- AVHRR: 50 MB per image, 2-4 images per satellite per day; CIMIC downloads images from 3 satellites, generating about 15 GB of data per month
- MODIS and ASTER images are available every day or every other day.
- IKONOS, QUICKBIRD: 1 m resolution images (7.5 quad); each image would be roughly 15000 × 12000 × 8, which means 1.44 GB.
- Our Mass Storage
- EMC CLARiiON FC4500
- Capacity up to 18 TB
- Good cost per MB, excellent performance, scalability, and flexibility
- It satisfies the needs of the Online Querying Information System
- One GB cache optimized for r/w at different times by a script
- Backplane provides a data transfer rate of 200 MB/second from the disks to the fiber channel port, which transfers the data over the fiber channel cables to the host at 100 MB/second.
- Additional fiber channel
- Very flexible configuration capabilities
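The volume figures on this slide can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: it assumes a 30-day month and decimal units (1 GB = 1000 MB), and it takes the slide's 15000 × 12000 × 8 product at face value as the 1.44 GB image size. None of these assumptions come from the slides themselves.

```python
# Back-of-the-envelope check of the data volumes quoted on the slide.
# Assumptions (not from the slides): 30-day month, decimal units.
MB_PER_AVHRR_IMAGE = 50
SATELLITES = 3
DAYS_PER_MONTH = 30


def avhrr_gb_per_month(images_per_satellite_per_day):
    """Monthly AVHRR volume in GB for a given daily image count per satellite."""
    mb = (MB_PER_AVHRR_IMAGE * images_per_satellite_per_day
          * SATELLITES * DAYS_PER_MONTH)
    return mb / 1000


low_gb = avhrr_gb_per_month(2)    # 2 images per satellite per day
high_gb = avhrr_gb_per_month(4)   # 4 images per satellite per day

# The slide's product for one 1 m resolution image.
ikonos_size = 15000 * 12000 * 8

# Time to move one such image to the host at the quoted 100 MB/second.
seconds_to_host = (ikonos_size / 1_000_000) / 100

print(low_gb, high_gb, ikonos_size, seconds_to_host)
```

Under these assumptions the quoted 15 GB per month falls between the low and high AVHRR estimates, and a single 1 m image takes on the order of seconds to reach the host.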
17Environmental Knowledge Discovery
- Examples
- Data Warehousing
- Data Mining
- How and Why
- Hypothesis testing
18Motivating Examples (1)
- Identify a natural disturbance affecting wetland vegetation, such as fire, pathogen infestation or wilting by drought, in the New Jersey Meadowlands
- What should we have?
- A time series of satellite images (a few years)
- Calculated soil and vegetation indices for images
- Digital elevation models (DEM) of Meadowlands
- Precipitation record for time series
- Zoning designation for area being observed
- What do we need to do?
- Identify the sudden drop in the vegetation index (NDVI) in areas where NDVI has been consistently high through time (outlier detection)
- Determine areas where the soil index is suddenly high due to the exposure of bare mineral soil (classification)
- Combine high soil index with low NDVI and precipitation record to determine the occurrence of vegetation disturbance (characterization)
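The first step above (a sudden NDVI drop where NDVI has been consistently high) can be sketched for a single pixel's time series. This is a minimal illustration, not the authors' method: the function name, the "consistently high" threshold of 0.6, the 3-sigma rule, and the sample values are all assumptions made for the example.

```python
from statistics import mean, pstdev


def sudden_ndvi_drops(series, high=0.6, k=3.0, min_history=4):
    """Return the time indices at which NDVI drops sharply.

    A value is flagged when every earlier value was at least `high`
    (consistently high NDVI) and the new value lies more than `k`
    standard deviations below the mean of that history.
    All thresholds are illustrative assumptions.
    """
    flagged = []
    for t in range(min_history, len(series)):
        history = series[:t]
        if min(history) < high:
            continue  # the pixel was not consistently high before time t
        # Floor the spread so a perfectly flat history does not flag noise.
        spread = max(pstdev(history), 0.01)
        if series[t] < mean(history) - k * spread:
            flagged.append(t)
    return flagged


# Hypothetical yearly NDVI values for one pixel.
burned_pixel = [0.71, 0.74, 0.72, 0.73, 0.75, 0.31, 0.35]
stable_pixel = [0.70, 0.72, 0.71, 0.70, 0.69, 0.71]

print(sudden_ndvi_drops(burned_pixel))
print(sudden_ndvi_drops(stable_pixel))
```

A flagged index would then be cross-checked against the soil index and the precipitation record, as the last bullet describes, before calling it a disturbance.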
19Motivating Examples (2)
- Find bird resting patterns along the eastern seaboard migration corridor
- Data needs
- Extent of ecosystems that support invertebrates along the migration corridor
- Availability of invertebrates in water and sediments through the migrating period.
- What we need to know
- The number of birds and bird types as related to the availability of food at each rest stop (trend detection)
- Detect abnormal bird populations (low or high) which are not explained by availability of food at specific resting stops (outlier detection)
20Motivating Examples (3)
- Investigate the associations between change in forest cover and illegal exploitation of protected tropical forests
- Data we need
- Satellite image/maps
- Calculated deforestation rates using NDVI indices
- Data on truck movement
- Records on ship movements from local ports
- Data on migrant worker camps
- What can we get?
- Relate deforestation rate to new road construction and truck traffic in areas where the topography and local ecosystems support exotic tropical trees (association detection)
21Other Motivating Examples
- Hydroclimatological study (Praveen Kumar and Amanda BT. White, 2002)
- How can we link the changes in NDVI to changes in the hydrologic condition?
- Can we distinguish between the changes due to various factors, such as inter-annual climate variability and human action impact?
- Is it possible to distinguish between variabilities related to inter-annual and long-term trends?
- Is there correlation between NDVI variations and ecoregion, or between NDVI and other parameters, such as climate, physiography, topography, or hydrology?
- Are the trends confined to certain regions? Is the nature of the variability and trend different in different regions?
- Are there any systematic changes over the last 10 years?
- Are there regions where changes are attributable to human impact, such as logging?
22Environmental Data Warehousing (EDW)
- Poses a number of challenging requirements with respect to
- The design of the data model, due to the nature of analytical operations to be performed
- The nature of the views to be maintained by the environmental warehouse.
23EDW Challenges: Nature of the Environmental Data
- Each dimension in itself is multi-dimensional in nature, e.g.,
- raster images such as satellite downloads
- used to generate various images of different types including land-use, water, temperature, NDVI
- each of them has multiple dimensions
- the geographic extent and coordinates
- the time and date of its capture
- resolution, ...
- regional maps represented as vector data
- temporal and spatial
- streaming data collected from various sensors
- Temperature, air quality, atmospheric pressure, water quality (dissolved oxygen, mineral contents, salinity)
- geographic location (spatial dimension)
- temporal dimension
24Nature of the Environmental data
25Nature of the Environmental data
- Each dimensional table is itself multi-dimensional by nature
- Traditional data warehouse models are not suitable for an environmental data warehouse
- Our proposal: cascaded star schema
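One way to picture a cascaded star is as a star whose dimensions may themselves be stars. The sketch below is only one possible reading of that idea, not the schema from the talk: the `Star` class, the `attribute_paths` helper, and every table and attribute name are hypothetical, chosen to echo the raster and sensor examples on the earlier slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Star:
    """A star whose dimensions are either plain attribute lists or nested stars."""
    name: str
    dimensions: Dict[str, Union["Star", List[str]]] = field(default_factory=dict)


def attribute_paths(star, prefix=()):
    """List every leaf attribute as a dotted path through the cascade."""
    paths = []
    for dim, target in star.dimensions.items():
        path = prefix + (dim,)
        if isinstance(target, Star):
            # The dimension is itself a star: descend one level.
            paths.extend(attribute_paths(target, path))
        else:
            paths.extend(".".join(path + (attr,)) for attr in target)
    return paths


# Hypothetical dimensions, each multi-dimensional in its own right.
raster = Star("raster_image", {
    "space": ["extent", "coordinates"],
    "time": ["capture_date"],
    "resolution": ["meters_per_pixel"],
})
stream = Star("sensor_stream", {
    "space": ["station_location"],
    "time": ["timestamp"],
})
warehouse = Star("environment", {"raster": raster, "stream": stream})

for p in attribute_paths(warehouse):
    print(p)
```

In a flat star schema the `raster` and `stream` entries would be single dimension tables; here each one fans out into its own spatial and temporal dimensions.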
26EDW Challenges: Complex Nature of Queries (1)
- Retrieve changes in the vegetation pattern over a certain region during the last 10 years, and their effect on the regional maps over that time period
- requires
- layering of the images representing the vegetation patterns with those of the maps whose time intervals of validity overlap
- traversing along this temporal dimension with the overlaid image
- In the traditional data warehouse sense,
- first construct two data cubes along the time dimensions for each of the vegetation images and maps
- then fuse these two cubes into one
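The layering step amounts to a join on overlapping validity intervals. The following sketch shows that join in its simplest form, assuming each image or map carries an inclusive year range; the `Layer` type, the `fuse` function, and the sample layers are hypothetical and stand in for whatever the warehouse actually stores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """An image or map with an inclusive validity interval in years."""
    name: str
    valid_from: int
    valid_to: int


def overlap(a, b):
    """Return the shared (start, end) interval of two layers, or None."""
    start = max(a.valid_from, b.valid_from)
    end = min(a.valid_to, b.valid_to)
    return (start, end) if start <= end else None


def fuse(images, maps):
    """Pair every image with every map whose validity interval overlaps it."""
    fused = []
    for img in images:
        for m in maps:
            span = overlap(img, m)
            if span is not None:
                fused.append((span[0], span[1], img.name, m.name))
    return sorted(fused)  # ordered along the temporal dimension


# Hypothetical layers.
vegetation = [Layer("ndvi_1993", 1993, 1993), Layer("ndvi_1998", 1998, 1998)]
regional_maps = [Layer("map_a", 1990, 1995), Layer("map_b", 1996, 2002)]

for row in fuse(vegetation, regional_maps):
    print(row)
```

Sorting the fused pairs by interval start gives the order in which the overlaid images would be traversed along the time dimension.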
27Demo
- http://cimic.rutgers.edu/songmei/dw.html
31EDW Challenges: Complex Nature of Queries (2)
- Observe the changes in the surface water and population due to the changes in the vegetation pattern
- fusion of multiple cubes is required
- Simulate a fly-by over a region starting with a specific point and elevation, and traverse the region on a specific path with reducing elevation levels at a certain speed, reaching a destination (a 3-dimensional trajectory)
- Requires
- retrieving images that span adjacent regions that overlap the spatial trajectory, but with increasing resolution levels to simulate the effect of reduced elevation level
- displaying them at a speed that matches the desired velocity of the fly-by.
32EDW Challenges: Efficient Software and Mature Technology
- We need software applications to efficiently manage and manipulate images, either by pre-setting or ad hoc
- Example of calculating NDVI
- select (char) (255.0 * (band2 - band1) / (band2 + band1))
- [1000:1500, 1000:1500]
- from landsat_band1 as band1, landsat_band2 as band2
- Example in the area of DB -- RasDaMan
- A basic research project sponsored by the European Community to develop comprehensive MDD database technology
- Multi-dimensional data models (MDD) to store images
- Interacts with Oracle for meta data and blob management
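For readers without an array database at hand, the same per-pixel NDVI computation can be sketched in plain Python. This is an illustration only: the function name and the tiny sample bands are made up, the band naming follows the query on the slide (band2 minus band1 over their sum), the window bounds are treated as inclusive, and pixels where both bands are zero are mapped to 0, which is an assumption the slide does not address.

```python
def scaled_ndvi(band1, band2, rows, cols):
    """Compute 255 * (band2 - band1) / (band2 + band1) over a window.

    `rows` and `cols` are (first, last) index pairs, treated as inclusive.
    Pixels where both bands are zero are mapped to 0 (an assumption).
    """
    (r0, r1), (c0, c1) = rows, cols
    window = []
    for r in range(r0, r1 + 1):
        out_row = []
        for c in range(c0, c1 + 1):
            b1, b2 = band1[r][c], band2[r][c]
            total = b2 + b1
            ndvi = (b2 - b1) / total if total else 0.0
            out_row.append(int(255.0 * ndvi))  # truncate, like a cast
        window.append(out_row)
    return window


# Tiny hypothetical 2 x 2 bands standing in for the two Landsat arrays.
band1 = [[50, 100], [30, 0]]
band2 = [[150, 100], [90, 0]]

print(scaled_ndvi(band1, band2, rows=(0, 1), cols=(0, 1)))
```

The point of the slide stands: an array query language states this in one line and lets the database choose how to tile and scan the data, whereas the loop above fixes the access pattern in application code.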
33RasDaMan (1)
34RasDaMan (2)
- Distinguished Features
- A clear distinction is made between the logical (query) level and the physical (storage organization and data transmission) level of array management.
- On the conceptual level, arrays are treated as a general data abstraction: they can be of any dimensionality, they can have an arbitrary (fixed or variable) number of elements per dimension, and both primitive and derived types are admissible as array base types.
- The model has formal set-algebraic semantics based on AFATL Image Algebra, a rigid mathematical framework able to express any image transformation.
- On the physical level, a novel combination of tiling and spatial indexing allows for the efficient execution of queries on MDD while offering the benefits of conventional database technology, such as query performance depending on the result set (and not on the overall data set size), concurrency control, support for crash recovery, and transaction management.
- A data definition language for multidimensional arrays, together with an SQL-based, optimized query language called RasQL, allows for powerful associative retrieval and data manipulation
35Ongoing Work
- Formulating the necessary primitives for the specification and execution of queries
- Extending the OLAP operations for the cascaded star
- roll-up: aggregating on a specific dimension, i.e., summarizing data
- drill-down: from higher-level summary to lower-level detail
- slicing: projecting data along a subset of dimensions with an equality selection on the other dimensions
- dicing: similar to slicing, except that a range selection is used on the other dimensions instead of an equality selection
- pivoting: reorienting the multidimensional cube
- zoom-in, zoom-out, aggregation of views using the above OLAP operations
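The standard OLAP operations listed above can be illustrated on a small in-memory cube. The sketch below uses a plain dictionary keyed by (region, year, quarter); the cube contents, the function names, and the choice of summation as the aggregate are assumptions for illustration, and nothing here reflects how the cascaded-star extensions are actually defined.

```python
from collections import defaultdict

# Hypothetical cube: (region, year, quarter) -> a measure, e.g. rainfall in mm.
CUBE = {
    ("north", 2001, 1): 10,
    ("north", 2001, 2): 20,
    ("north", 2002, 1): 30,
    ("south", 2001, 1): 5,
    ("south", 2002, 2): 15,
}


def roll_up(cube, keep):
    """Sum the measure over every axis whose position is not in `keep`."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


def slice_cube(cube, axis, value):
    """Equality selection on one axis; that axis is dropped from the keys."""
    return {key[:axis] + key[axis + 1:]: v
            for key, v in cube.items() if key[axis] == value}


def dice(cube, axis, low, high):
    """Range selection on one axis; all axes are kept."""
    return {key: v for key, v in cube.items() if low <= key[axis] <= high}


def pivot(cube, order):
    """Reorient the cube by permuting the axes of every key."""
    return {tuple(key[i] for i in order): v for key, v in cube.items()}


print(roll_up(CUBE, keep=(0, 1)))       # quarters summed into years
print(slice_cube(CUBE, axis=0, value="north"))
print(dice(CUBE, axis=1, low=2002, high=2002))
print(pivot(CUBE, order=(1, 0, 2)))     # year becomes the leading axis
```

Drill-down is the inverse of `roll_up`: it returns from the yearly summary to the quarterly cells, which is only possible while the finer cube is still stored.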
36Environmental Data Mining Challenges (1)
- How can we mine spatial data and non-spatial data from multispectral satellite images and thematic maps? (Krzysztof Koperski, Junas Adhikary, and Jiawei Han, 1996)
- Current research uses only a single type of map or image
- Mine them at the same time
- Resolutions are different
- The representations of the thematic maps are different
- How to deal with the complex relationships among objects (Krzysztof Koperski, Junas Adhikary, and Jiawei Han, 1996)
- Relationships
- Spatial relationship: distance
- Topological relationship: disjoint, overlap, far away, etc.
- Direction
- Current clustering represents a big object by its centroid, which is misleading when, e.g., the objects are of similar size and regular shape but one of them is a very narrow, long band
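The centroid problem in the last bullet is easy to reproduce. In the sketch below, a long narrow object (think of a river or road, here a hypothetical row of 101 points) is reduced to its centroid, and a nearby site then appears far from the object even though it almost touches it. All coordinates are invented for the example.

```python
from math import dist


def centroid(points):
    """Arithmetic mean of a list of 2-D points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def min_distance(point, points):
    """Distance from `point` to the nearest point of the object."""
    return min(dist(point, q) for q in points)


# A long, narrow band: 101 points along the x-axis from x=0 to x=100.
band = [(float(x), 0.0) for x in range(101)]
site = (98.0, 1.0)  # a site right next to the far end of the band

to_centroid = dist(site, centroid(band))
to_object = min_distance(site, band)

print(to_centroid, to_object)
```

Any clustering or neighborhood rule that relies on `to_centroid` would treat the site as unrelated to the band, which is why distance and topological relationships between extended objects need more than a single representative point.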
37Environmental Data Mining Challenges (2)
- How to utilize the various data seamlessly
- The diverse data types
- Structured data: vector, raster, relational database
- Unstructured data: text, multimedia, and geo-referenced stream data.
- Needs supporting data
- Some can be found in the data warehouse: summary, average
- Some need to be created on the fly: variation, etc.
- How to utilize the geographic visualization tool
- Can it replace the statistical visualization tools in some areas?
38Data Mining Techniques
- From the motivating examples we notice that several data mining techniques are involved
- Segmentation
- Clustering
- Classification
- Rule detection
- Trend detection
- Outlier detection
39EDM Techniques Rule detection
- Examples
- Can we distinguish between the changes due to various factors, such as inter-annual climate variability and human action impact?
- Can we link the changes in NDVI to changes in the hydrologic conditions, or changes in population?
- Various rules
- Characteristic rules: one characteristic of data
- Discriminant rules: the feature discriminating or contrasting a class of data from other classes
- Association rules: one set of features is correlated with another set of data
40EDM Techniques Association Rule detection
- Algorithms
- Classic algorithms
- Apriori: for Boolean association rules, to find frequent itemsets (Jiawei Han and Micheline Kamber, 2000)
- Statistical techniques: regression model
- Spatial data mining algorithm (Krzysztof Koperski and Jiawei Han, 1996)
- a top-down search technique
- Uses spatial approximation
- Pre-processing is required for object recognition
- Needs a comprehensive algorithm for mining a combination of spatial and non-spatial data at the same time
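As a reminder of how Apriori finds frequent itemsets, here is a compact sketch of the level-wise search with its join and prune steps. It is a textbook-style illustration, not the spatial algorithm cited above; the transactions (one set of Boolean observations per map cell, echoing the deforestation example) and the support threshold are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for all itemsets seen in at least
    `min_count` transactions, using level-wise candidate generation."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        # Count support and keep the candidates that are frequent.
        current = {}
        for candidate in level:
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_count:
                current[candidate] = count
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent


# Hypothetical Boolean observations, one set per map cell.
cells = [
    {"new_road", "truck_traffic", "ndvi_drop"},
    {"new_road", "truck_traffic", "ndvi_drop"},
    {"new_road", "ndvi_drop"},
    {"truck_traffic"},
    {"new_road", "truck_traffic", "ndvi_drop"},
]

itemsets = apriori(cells, min_count=3)
for itemset, count in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

From these counts the confidence of a rule such as new_road => ndvi_drop is the support of the pair divided by the support of new_road. Note that the spatial predicates themselves (which cells contain a new road) must already have been extracted, which is the pre-processing burden the slide points out.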
41EDM Techniques - Segmentation
- Example
- Are the trends confined to certain regions? Is the nature of the variability and trend different in different regions?
- Is it possible to distinguish between variability related to inter-annual and long-term trends?
- Clustering
- groups spatial objects such that objects in the same group are similar and objects in different groups are unlike each other.
- Classification
- Selects a relevant set of attributes and attribute values that determine an effective mapping of spatial objects into pre-defined target classes. (H. J. Miller and J. Han, 2001)
- Name a set of pre-determined classes (inter-annual changes, long-term changes)
42EDM Techniques Segmentation (Contd)
- Algorithms
- Classification: the classes are pre-defined
- Decision tree induction
- Bayesian classification
- Clustering
- Partitioning algorithms: k-means method, k-medoids method
- The problem here is that the result strongly depends on the initial guess of the centroids
- Hierarchical algorithms: AGNES, DIANA, BIRCH, CURE
- The hierarchical algorithms are not optimal for large datasets
- Density-based: DBSCAN, OPTICS, DENCLUE
- Only dots, without meaningful interpretation
- Grid-based: STING, WaveCluster, CLIQUE
- How to partition high-dimensional data
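The sensitivity of k-means to its initial centroids can be shown with four points. The sketch below is a bare-bones k-means written for this illustration (the function names, the data, and both initial guesses are made up); it converges to a poor partition from one starting guess and to the natural partition from another.

```python
from math import dist


def kmeans(points, centroids, iterations=20):
    """Plain k-means from the given initial centroids (list of tuples)."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, clusters


def sse(centroids, clusters):
    """Sum of squared distances of the points to their cluster centroid."""
    return sum(dist(p, c) ** 2 for c, members in zip(centroids, clusters) for p in members)


# Four corners of a wide, flat rectangle: the natural split is left vs right.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

bad = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])    # splits bottom vs top
good = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])   # splits left vs right

print(sse(*bad), sse(*good))
```

Both runs have converged in the sense that no further iteration changes them, yet the first is clearly worse. Running from several initial guesses and keeping the lowest error is the usual mitigation.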
43EDM Techniques Outlier Detection
- To find inconsistencies and abnormalities
- Example
- Can we identify the abnormal changes in NDVI or particular species?
- Has it been unusually hot this October?
- Algorithms (Raymond T. Ng, 2001)
- Distribution-based approach: outliers are the objects that do not follow the standard distribution.
- Hard to know the distribution
- Not suitable for high-dimensional datasets
- Depth-based method: represent the data in k-dimensional space and assign a depth to each object.
- Does not scale up for more than 3-D
- Distance-based outlier detection
- Requires the existence of an appropriate distance function
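A distance-based outlier can be defined as an object for which at least a fraction p of the dataset lies farther away than a distance D. The sketch below is a naive quadratic-time check of that definition, written for illustration; the temperatures, the thresholds, and the function name are hypothetical, and the distance function is passed in explicitly to underline the requirement stated in the last bullet.

```python
def distance_based_outliers(objects, p, d, distance):
    """Return the objects for which at least a fraction `p` of all objects
    lies farther away than `d` under the supplied distance function.
    Naive O(n^2) scan, for illustration only."""
    n = len(objects)
    outliers = []
    for o in objects:
        far = sum(1 for q in objects if distance(o, q) > d)
        if far >= p * n:
            outliers.append(o)
    return outliers


# Hypothetical mean October temperatures (degrees C) for six years.
october_means = [14.1, 13.8, 14.5, 13.9, 14.3, 19.6]

unusual = distance_based_outliers(
    october_means, p=0.8, d=2.0, distance=lambda a, b: abs(a - b)
)
print(unusual)
```

For one-dimensional temperatures the absolute difference is an obvious distance. For mixed spatial and non-spatial attributes, choosing that function is the hard part.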
44References
- [1] Praveen Kumar and Amanda BT. White, "Scalable Knowledge Discovery for Hydroclimatological Studies," University of Illinois, 2002.
- [2] H. J. Miller and J. Han, Geographic Data Mining and Knowledge Discovery, Taylor & Francis, 2001.
- [3] Nabil Adam, Vijay Atluri, Songmei Yu and Yelena Yesha, "Efficient Storage and Management of Environmental Information," presented at the 11th Mass Storage Conference held by IEEE and NASA, Maryland, April 2002.
- [4] Wendolin Bosques, Ricardo Rodriguez, Angelica Rondon and Ramon Vasquez, "A Spatial Data Retrieval and Image Processing Expert System for the World Wide Web," 21st International Conference on Computers and Industrial Engineering, 1997, pp. 433-436.
- [5] Krzysztof Koperski and Jiawei Han, "Discovery of Spatial Association Rules in Geographic Information Databases," Proceedings of the 4th International Symposium on Advances in Spatial Databases (SSD), Vol. 951, Springer-Verlag, pp. 47-66.
- [6] Kirk Barrett, "The Meadowlands Environmental Research Institute," Science on the Semantic Web (SWS) Workshop, Oct 2002.
- [7] Jiawei Han, Russ B. Altman, Vipin Kumar, Heikki Mannila, and Daryl Pregibon, "Emerging Scientific Applications in Data Mining," Communications of the ACM, Vol. 45, No. 8, August 2002, pp. 54-58.
- [8] Krzysztof Koperski, Junas Adhikary, and Jiawei Han, "Spatial Data Mining: Progress and Challenges" (survey paper), SIGMOD 96 Workshop on Research Issues in Data Mining and Knowledge Discovery.
- [9] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.
- [10] Raymond T. Ng, "Detecting Outliers from Large Datasets," in Geographic Data Mining and Knowledge Discovery, Taylor & Francis, 2001.
45Focus Areas
- Environmental monitoring
- Remote sensing/GIS for land use planning
- Plant and animal inventory and assessment
- Salt-marsh and Landfill Characterization and Restoration
- Assessment and Remediation of Contaminated Sediments
- Land use information management for planning and engineering (predict land use trends for planners, code enforcement for engineers)
- Scientific data warehousing for efficient management of environmental and remote sensing data
- Scientific data mining for discovering trends, patterns and relationships among land use and environmental data
- Automating land use permit processing workflows through transparent inter-agency interaction
46Introduction to Environmental Data (contd.)
- Value-added products
- water
- vegetation
- temperature
- true colors (composites)
- models of the topography and spatial attributes of the landscape
- roads, rivers, parcels, schools, zip code areas, city streets and administrative boundaries
- Maps, reports, data sets from government agencies
- census information that describes the socio-economic and health characteristics of the population
- real-time data from ground monitoring stations