Environmental Data Warehousing and Mining

Nabil R. Adam, Vijay Atluri, Dihua Guo, Songmei Yu (Rutgers University CIMIC). NSF Workshop on Next Generation Data Mining (NGDM02).

1
Environmental Data Warehousing and Mining
  • Nabil R. Adam
  • Vijay Atluri, Dihua Guo, Songmei Yu
  • Rutgers University CIMIC
  • NSF Workshop on Next Generation Data Mining
    NGDM02
  • November 1-3, 2002

2
Outline
  • Setting
  • A Real-world Lab: The NJ Meadowlands Area
  • Motivating Examples
  • Environmental Data Warehousing
  • Environmental Data Mining

3
MERI (Meadowlands Environmental Research
Institute)
  • Established in 1998 as a collaboration between
    the New Jersey Meadowlands Commission and Rutgers
    CIMIC
  • Provides a world-class environmental research
    institute for urban and coastal wetlands, focused
    on the district
  • Administered by Rutgers-CIMIC
  • Mission: conduct and sponsor research in ecology,
    environmental science and information technology
    to monitor, preserve and improve ecological and
    human health and welfare in the Meadowlands
    District, NJ

4
MERI
  • Budget: 1.6 million/year (2002-2007)
  • Staff: faculty, students, FT NJMC/Rutgers
    scientists and staff
  • Disciplines: biology, ecology, geology,
    environmental science, hydrological modeling,
    remote sensing/geographic information systems,
    and information technology
  • Works closely with the NJ Meadowlands Commission
    to disseminate research results to the scientific
    community and the various government agencies
  • Information and technology transfer to local
    municipalities
  • Develops scientific content for education and
    exhibits
  • Provides high school and college students with
    science internships

5
Digital Meadowlands
[Diagram labels: Processed Satellite Images; 3D
Visualization; NASA Archives; Environmental
Parameters; Reports; Fly-by/Drill-down; Radar; Users;
Interactive Maps; Monitoring Stations; Sensors;
Satellite Imagery (AVHRR); Aerial Photos; Documents;
Maps]
6
Visualization: Drill-down
7
  • ASTER (Advanced Spaceborne Thermal Emission and
    Reflection Radiometer)
  • Ground resolution: 15 m (bands 1-3), 30 m (bands
    4-9), 90 m (bands 10-14)
  • Spectral bands: 14
  • Swath width: 60 km
  • Application: daily monitoring of flood-prone
    areas. Flood-prone areas are shown in red. Under
    flood conditions the sensor would detect water
    (blue) covering flood-prone areas (red).

8
Satellite Images
  • Various data sources and data types
  • Various types of satellite images with different
    resolutions, captured by different sensors
  • AVHRR: direct downloads from polar-orbiting
    satellites (NOAA 12, NOAA 14 and NOAA 15), 1 km
    resolution, 5 bands
  • LANDSAT and RADAR: obtained from NASA archives,
    30 m resolution, 7 bands
  • Hyper-spectral images: 34-band images from the
    AISA (Airborne Imaging Spectrometer for
    Applications) sensor, 250-1000 m resolution
  • Aerial ortho-photographs: high-resolution (1 m)
    images (IKONOS, QUICKBIRD)
  • MODIS: images for global dynamics and processes
    occurring on the land, in the oceans, and in the
    lower atmosphere, from NASA's satellites Terra
    and Aqua, 90 m resolution, 36 bands
  • ASTER: detailed maps of land surface temperature,
    emissivity, reflectance and elevation from NASA's
    satellite Terra, 14 bands, 15-90 m resolution

9
Automated, near real-time monitoring system
[Diagram: a weather station and a water monitor are
wired to a data logger; a modem link carries the data
to a central computer, which ingests it and stores it
in a database; users access the database through a
WWW interface.]
10
Real Time Data from Monitoring Stations
11
Use of Water Quality Data: Tracking the
Effectiveness of Pollution Control Measures
[Chart label: regulatory minimum level]
12
One Example of Satellite Imagery: AVHRR
14
Information Sources: Traditional
  • A library of a large variety of documents:
    scientific publications, guidelines and
    regulations, measurements and impact studies
  • Documents contain text, tables, pictures,
    drawings and maps
  • Census information that describes the
    socio-economic and health characteristics of the
    population

15
The User Community
  • Researchers, faculty, graduate students in a
    variety of disciplines including biology,
    ecology, geology, environmental science, and IT
  • make scientific observations such as the changes
    in vegetation pattern and its effect on
    temperature over the years
  • Policy Makers
  • query various critical parameters such as ambient
    air and water quality and visualize the results
    in a graphical form
  • gain help in the evaluation and formulation of
    environmental policies
  • The Public
  • learn about their county, community, and home
    on such issues as environment, health, and
    infrastructure
  • K-12 Educators and students

16
The Data Volume
  • Satellite images
  • AVHRR: 50 MB per image, 2-4 images per satellite
    per day; CIMIC downloads images from 3
    satellites, which generates 15 GB of data per
    month
  • MODIS and ASTER images are available every day
    or every other day
  • IKONOS, QUICKBIRD: 1 m resolution images (7.5
    quad); each image would be roughly 15000 × 12000
    × 8, which means 1.44 GB
  • Our mass storage
  • EMC CLARiiON FC4500
  • Capacity up to 18 TB
  • Good cost per MB, excellent performance,
    scalability, and flexibility
  • It satisfies the needs of the Online Querying
    Information System
  • One GB cache, optimized for r/w at different
    times by a script
  • The backplane provides a data transfer rate of
    200 MB/second from the disks to the fiber channel
    port, which transfers the data over the fiber
    channel cables to the host at 100 MB/second
  • Additional fiber channel
  • Very flexible configuration capabilities
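
The volume figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes a 30-day month, decimal units (1 GB = 1000 MB), and that the product 15000 × 12000 × 8 is a byte count; none of these assumptions is stated on the slide.

```python
# Back-of-the-envelope check of the data volumes quoted on this slide.
# Assumptions (not from the slide): 30-day month, 1 GB = 1000 MB,
# and 15000 x 12000 x 8 is interpreted as a number of bytes.

AVHRR_IMAGE_MB = 50
SATELLITES = 3
DAYS_PER_MONTH = 30


def avhrr_monthly_gb(images_per_satellite_per_day):
    """Monthly AVHRR download volume in GB for a given daily image count."""
    megabytes = (AVHRR_IMAGE_MB * images_per_satellite_per_day
                 * SATELLITES * DAYS_PER_MONTH)
    return megabytes / 1000


low = avhrr_monthly_gb(2)
high = avhrr_monthly_gb(4)
print(f"AVHRR: {low:.1f} to {high:.1f} GB per month")

high_res_bytes = 15000 * 12000 * 8
print(f"High-resolution image: {high_res_bytes / 1e9:.2f} GB")
```

Under these assumptions the quoted 15 GB per month falls inside the computed AVHRR range, and the high-resolution image size matches the quoted 1.44 GB.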

17
Environmental Knowledge Discovery
  • Examples
  • Data Warehousing
  • Data Mining
  • How and Why
  • Hypothesis testing

18
Motivating Examples (1)
  • Identify a natural disturbance affecting wetland
    vegetation, such as fire, pathogen infestation or
    wilting by drought, in the New Jersey Meadowlands
  • What should we have?
  • A time series of satellite images (a few years)
  • Calculated soil and vegetation indices for images
  • Digital elevation models (DEM) of the Meadowlands
  • Precipitation record for the time series
  • Zoning designation for the area being observed
  • What do we need to do?
  • Identify sudden drops in the vegetation index
    (NDVI) in areas where NDVI has been consistently
    high through time (outlier detection)
  • Determine areas where the soil index is suddenly
    high due to the exposure of bare mineral soil
    (classification)
  • Combine high soil index with low NDVI and the
    precipitation record to determine the occurrence
    of vegetation disturbance (characterization)
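
The first and third steps above can be sketched for a single pixel. This is only an illustration: the function names, the thresholds (0.6, 0.3, 0.4, 0.5, 20 mm) and the sample series are hypothetical and are not taken from the presentation.

```python
def sudden_ndvi_drops(series, high=0.6, drop=0.3, history=3):
    """Return indices where NDVI falls by at least `drop` right after
    `history` consecutive observations at or above `high`."""
    flagged = []
    for t in range(history, len(series)):
        window = series[t - history:t]
        consistently_high = all(value >= high for value in window)
        if consistently_high and window[-1] - series[t] >= drop:
            flagged.append(t)
    return flagged


def characterize(ndvi, soil_index, precipitation_mm,
                 ndvi_low=0.4, soil_high=0.5, dry_mm=20.0):
    """Combine low NDVI, high soil index and precipitation into a label.
    The rule and its thresholds are illustrative assumptions."""
    if ndvi < ndvi_low and soil_index > soil_high:
        if precipitation_mm < dry_mm:
            return "possible drought stress"
        return "possible non-drought disturbance"
    return "no disturbance flagged"


# Hypothetical NDVI time series for one pixel.
pixel_ndvi = [0.72, 0.70, 0.74, 0.71, 0.35, 0.40, 0.68]
drops = sudden_ndvi_drops(pixel_ndvi)
print(drops)
print(characterize(ndvi=0.35, soil_index=0.7, precipitation_mm=5.0))
```

A real pipeline would apply the same test to every pixel of the image time series and would calibrate the thresholds against field observations.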

19
Motivating Examples (2)
  • Find bird resting patterns along the eastern
    seaboard migration corridor
  • Data needs
  • Extent of ecosystems that support invertebrates
    along the migration corridor
  • Availability of invertebrates in water and
    sediments through the migrating period.
  • What we need to know
  • The number of birds and bird types as related to
    the availability of food at each rest stop.
    (trends detection)
  • Detect abnormal bird populations (low or high)
    which are not explained by availability of food
    at specific resting stops. (outlier detection)

20
Motivating Examples (3)
  • Investigate the associations between change in
    forest cover and illegal exploitation of
    protected tropical forests
  • Data we need
  • Satellite image/maps
  • Calculated deforestation rates using NDVI indices
  • Data on truck movement
  • Records on ship movements from local ports
  • Data on migrant worker camps
  • What can we get?
  • Relate deforestation rate to new road
    construction and truck traffic in areas where the
    topography and local ecosystems support exotic
    tropical trees (association detection)

21
Other Motivating Examples
  • Hydroclimatological study (Praveen Kumar and
    Amanda BT. White, 2002)
  • How can we link the changes in NDVI to changes in
    the hydrologic condition?
  • Can we distinguish between the changes due to
    various factors, such as inter-annual climate
    variability and human action impact?
  • Is it possible to distinguish between
    variability related to inter-annual changes and
    variability related to long-term trends?
  • Is there a correlation between NDVI variations
    and ecoregion, or between NDVI and other
    parameters, such as climate, physiography,
    topography, or hydrology?
  • Are the trends confined to certain regions? Is
    the nature of the variability and trend different
    in different regions?
  • Are there any systematic changes over the last 10
    years?
  • Are there regions where changes are attributable
    to human impact, such as logging?

22
Environmental Data Warehousing (EDW)
  • Poses a number of challenging requirements with
    respect to
  • The design of the data model due to the nature of
    analytical operations to be performed
  • The nature of the views to be maintained by the
    environmental warehouse.

23
EDW Challenges: Nature of the Environmental Data
  • Each dimension is itself multi-dimensional in
    nature, e.g.,
  • raster images such as satellite downloads
  • used to generate various images of different
    types, including land use, water, temperature,
    NDVI
  • each of them has multiple dimensions
  • the geographic extent and coordinates
  • the time and date of its capture
  • resolution, ...
  • regional maps represented as vector data
  • temporal and spatial
  • streaming data collected from various sensors
  • temperature, air quality, atmospheric pressure,
    water quality (dissolved oxygen, mineral
    content, salinity)
  • geographic location (spatial dimension)
  • temporal dimension

24
Nature of the Environmental data
25
Nature of the Environmental data
  • Each dimensional table is itself
    multi-dimensional by nature
  • Traditional data warehouse models are not
    suitable for an environmental data warehouse
  • Our proposal: cascaded star schema

26
EDW Challenges: Complex Nature of Queries (1)
  • Retrieve changes in the vegetation pattern over
    a certain region during the last 10 years, and
    their effect on the regional maps over that time
    period
  • Requires
  • layering of the images representing the
    vegetation patterns with those of the maps whose
    time intervals of validity overlap
  • traverse along this temporal dimension with the
    overlaid image
  • In the traditional data warehouse sense,
  • first construct two data cubes along the time
    dimensions for each of the vegetation images and
    maps
  • then fuse these two cubes into one

27
Demo
  • http://cimic.rutgers.edu/songmei/dw.html

31
EDW Challenges: Complex Nature of Queries (2)
  • Observe the changes in the surface water and
    population due to the changes in the vegetation
    pattern
  • fusion of multiple cubes is required
  • Simulate a fly-by over a region starting with a
    specific point and elevation, and traverse the
    region on a specific path with reducing elevation
    levels at a certain speed, and reaching a
    destination (a 3-dimensional trajectory)
  • Requires
  • retrieving images that span adjacent regions
    that overlap the spatial trajectory, but with
    increasing resolution levels to simulate the
    effect of reduced elevation level
  • display them at a speed that matches the desired
    velocity of the fly-by.

32
EDW Challenges: Efficient Software and Mature
Technology
  • We need software applications to efficiently
    manage and manipulate images, either through
    preset operations or ad hoc
  • Example of calculating NDVI:
    select (char) (255.0 * (band2 - band1) /
      (band2 + band1)) [1000:1500, 1000:1500]
    from landsat_band1 as band1,
      landsat_band2 as band2
  • Example in the database area: RasDaMan
  • A basic research project sponsored by the
    European Community to develop comprehensive MDD
    database technology
  • Multi-dimensional data models (MDD) to store
    images
  • Interacts with Oracle for metadata and BLOB
    management
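
For readers without a RasDaMan installation, the NDVI query above can be mirrored in plain Python, assuming it has the standard normalized-difference form restricted to a sub-window. The function name, the tiny sample bands, the zero-sum guard and the clamping to 0-255 are assumptions of this sketch; they are not documented RasDaMan behavior.

```python
def scaled_ndvi(band1, band2, rows, cols):
    """Compute 255 * (band2 - band1) / (band2 + band1) per pixel over the
    inclusive row and column ranges, returning byte-sized integers."""
    row_start, row_end = rows
    col_start, col_end = cols
    result = []
    for r in range(row_start, row_end + 1):
        out_row = []
        for c in range(col_start, col_end + 1):
            total = band2[r][c] + band1[r][c]
            # Assumption: pixels where both bands are zero map to 0.
            value = 0.0 if total == 0 else 255.0 * (band2[r][c] - band1[r][c]) / total
            # Assumption: clamp to the 0-255 range of an unsigned byte.
            out_row.append(max(0, min(255, int(value))))
        result.append(out_row)
    return result


# Tiny hypothetical 2 x 2 bands standing in for full Landsat scenes.
band1 = [[50, 100], [0, 200]]
band2 = [[150, 100], [0, 100]]
window = scaled_ndvi(band1, band2, rows=(0, 1), cols=(0, 1))
print(window)
```

An array database evaluates the same expression on the server and ships only the requested window, which is the point of the slide's example.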

33
RasDaMan (1)
34
RasDaMan (2)
  • Distinguished Features
  • A clear distinction is made between the logical
    (query) level and the physical (storage
    organization and data transmission) level of
    array management.
  • On the conceptual level, arrays are treated as a
    general data abstraction: they can be of any
    dimensionality, they can have an arbitrary (fixed
    or variable) number of elements per dimension,
    and both primitive and derived types are
    admissible as array base types.
  • The model has formal set-algebraic semantics
    based on AFATL Image Algebra, a rigorous
    mathematical framework able to express any image
    transformation.
  • On the physical level, a novel combination of
    tiling and spatial indexing allows for the
    efficient execution of queries on MDD while
    offering the benefits of conventional database
    technology, such as query performance depending
    on the result set (and not on the overall data
    set size), concurrency control, support for crash
    recovery, and transaction management.
  • A data definition language for multidimensional
    arrays, together with an optimized, SQL-based
    query language called RasQL, allows for powerful
    associative retrieval and data manipulation.

35
Ongoing Work
  • Formulating the necessary primitives for the
    specification and execution of queries
  • Extending the OLAP operations for the cascaded
    star
  • Roll-up: aggregating on a specific dimension,
    i.e., summarizing data
  • Drill-down: from higher-level summary to
    lower-level detail
  • Slicing: projecting data along a subset of
    dimensions with an equality selection on the
    other dimensions
  • Dicing: similar to slicing, except that a range
    selection is used on the other dimensions
    instead of an equality selection
  • Pivoting: reorienting the multidimensional cube
  • Zoom-in, zoom-out, and aggregation of views using
    the above OLAP operations
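
As a minimal sketch of the classic operations listed above, consider a tiny cube stored as a dictionary keyed by (region, year, parameter). The dimension names, the values, and the choice of the mean as the aggregate are hypothetical, and the sketch covers an ordinary cube rather than the cascaded star extension.

```python
from collections import defaultdict

DIMENSIONS = ("region", "year", "parameter")

# Hypothetical measurements keyed by (region, year, parameter).
cube = {
    ("north", 2000, "ndvi"): 0.61,
    ("north", 2001, "ndvi"): 0.58,
    ("south", 2000, "ndvi"): 0.70,
    ("south", 2001, "ndvi"): 0.66,
    ("north", 2001, "salinity"): 12.0,
}


def roll_up(cube, dim):
    """Aggregate (here: average) over one dimension and drop it from the key."""
    i = DIMENSIONS.index(dim)
    groups = defaultdict(list)
    for key, value in cube.items():
        groups[key[:i] + key[i + 1:]].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


def slice_cube(cube, dim, value):
    """Equality selection on one dimension, projecting onto the others."""
    i = DIMENSIONS.index(dim)
    return {key[:i] + key[i + 1:]: v for key, v in cube.items() if key[i] == value}


def dice(cube, dim, low, high):
    """Range selection on one dimension; keys keep all dimensions."""
    i = DIMENSIONS.index(dim)
    return {key: v for key, v in cube.items() if low <= key[i] <= high}


by_region = roll_up(cube, "year")
salinity = slice_cube(cube, "parameter", "salinity")
recent = dice(cube, "year", 2001, 2001)
print(by_region)
print(salinity)
print(recent)
```

Drill-down is the inverse of `roll_up`: it returns to the finer-grained cube from which the summary was computed.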

36
Environmental Data Mining Challenges (1)
  • How can we mine spatial data and non-spatial data
    from multispectral satellite images and thematic
    maps? (Krzysztof Koperski, Junas Adhikary, and
    Jiawei Han, 1996)
  • Current research uses only a single type of map
    or image
  • We need to mine them at the same time
  • Resolutions are different
  • The representations of the thematic maps are
    different
  • How to deal with the complex relationships among
    objects (Krzysztof Koperski, Junas Adhikary, and
    Jiawei Han, 1996)
  • Relationships
  • Spatial relationship: distance
  • Topological relationship: disjoint, overlap, far
    away, etc.
  • Direction
  • Current clustering methods represent a large
    object by its centroid, which is misleading
    when, e.g., the objects have similar size and
    regular shape except for one that is a very
    narrow, long band

37
Environmental Data Mining Challenges (2)
  • How to utilize the various data seamlessly
  • The diverse data types
  • Structured data: vector, raster, relational
    database
  • Unstructured data: text, multimedia, and
    geo-referenced stream data
  • Needs supporting data
  • Some can be found in the data warehouse: summary,
    average
  • Some need to be created on the fly: variation,
    etc.
  • How to utilize the geographic visualization tool
  • Can it replace statistical visualization tools
    in some areas?

38
Data Mining Techniques
  • From the motivating examples we notice that
    several data mining techniques are involved
  • Segmentation
  • Clustering
  • Classification
  • Rule detection
  • Trend detection
  • Outlier detection

39
EDM Techniques: Rule Detection
  • Examples
  • Can we distinguish between the changes due to
    various factors, such as inter-annual climate
    variability and human action impact?
  • Can we link the changes in NDVI to changes in the
    hydrologic conditions, or changes in population?
  • Various rules
  • Characteristic rules: describe a characteristic
    of the data
  • Discriminant rules: the features discriminating
    or contrasting a class of data from other classes
  • Association rules: one set of features is
    correlated with another set of features

40
EDM Techniques: Association Rule Detection
  • Algorithms
  • Classic algorithms
  • Apriori for Boolean association rules, to find
    frequent itemsets (Jiawei Han and Micheline
    Kamber, 2000)
  • Statistical techniques: regression models
  • Spatial data mining algorithm (Krzysztof Koperski
    and Jiawei Han, 1996)
  • A top-down search technique
  • Uses spatial approximation
  • Preprocessing is required for object recognition
  • Needs a comprehensive algorithm for mining a
    combination of spatial and non-spatial data at
    the same time
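
The classic Apriori step named above, finding frequent itemsets level by level, can be sketched in a few lines. The transactions are invented observations per map region, loosely inspired by the deforestation example, and the function handles only Boolean, non-spatial items; it is not the spatial algorithm of Koperski and Han.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every itemset that occurs in
    at least `min_support` transactions."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


# Hypothetical Boolean observations, one set per region.
regions = [
    {"new_road", "truck_traffic", "deforestation"},
    {"new_road", "deforestation"},
    {"truck_traffic", "deforestation"},
    {"new_road", "truck_traffic", "deforestation"},
    {"worker_camp"},
]

frequent = apriori(regions, min_support=3)
road_and_loss = frozenset({"new_road", "deforestation"})
confidence = frequent[road_and_loss] / frequent[frozenset({"new_road"})]
print(sorted((sorted(s), n) for s, n in frequent.items()))
print(confidence)
```

The confidence of the rule new_road → deforestation is the support of the pair divided by the support of the antecedent.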

41
EDM Techniques: Segmentation
  • Example
  • Are the trends confined to certain regions? Is
    the nature of the variability and trend different
    in different regions?
  • Is it possible to distinguish between variability
    related to inter-annual changes and long-term
    trends?
  • Clustering
  • Groups spatial objects such that objects in the
    same group are similar and objects in different
    groups are unlike each other.
  • Classification
  • Selects a relevant set of attributes and
    attribute values that determine an effective
    mapping of spatial objects into pre-defined
    target classes. (H. J. Miller and J. Han, 2001)
  • Name a set of pre-determined classes
    (inter-annual changes, long-term changes)

42
EDM Techniques: Segmentation (Contd.)
  • Algorithms
  • Classification: the classes are pre-defined
  • Decision tree induction
  • Bayesian classification
  • Clustering
  • Partitioning algorithms: k-means method,
    k-medoids method
  • The problem here is that the result depends
    strongly on the initial guess of the centroids
  • Hierarchical algorithms: AGNES, DIANA, BIRCH, CURE
  • The hierarchical algorithms are not optimal for
    large datasets
  • Density-based: DBSCAN, OPTICS, DENCLUE
  • Clusters are only sets of points, without
    meaningful interpretation
  • Grid-based: STING, WaveCluster, CLIQUE
  • How to partition high-dimensional data
43
EDM Techniques: Outlier Detection
  • To find inconsistencies and abnormalities
  • Example
  • Can we identify the abnormal changes in NDVI or
    in a particular species?
  • Has it been unusually hot this October?
  • Algorithms (Raymond T. Ng, 2001)
  • Distribution-based approach: an outlier is an
    object that does not follow the assumed standard
    distribution
  • Hard to know the distribution
  • Not suitable for high-dimensional datasets
  • Depth-based method: represent the data in
    k-dimensional space and assign a depth to each
    object
  • Does not scale up beyond 3-D
  • Distance-based outlier detection
  • Requires the existence of an appropriate distance
    function
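
One common formulation of distance-based outliers flags an object when at least a fraction p of the other objects lie farther than a distance D from it. The sketch below applies that rule to scalar data with a naive quadratic scan; the function name, the parameters and the October temperatures are hypothetical, and the slide does not say which variant the authors used.

```python
def distance_based_outliers(values, p, d, distance=lambda a, b: abs(a - b)):
    """Return the values for which at least a fraction `p` of the other
    values lie farther away than `d` under the given distance function."""
    others = len(values) - 1
    outliers = []
    for i, x in enumerate(values):
        far = sum(1 for j, y in enumerate(values)
                  if j != i and distance(x, y) > d)
        if far >= p * others:
            outliers.append(x)
    return outliers


# Hypothetical mean October temperatures (degrees C) for several years.
october_means = [14.1, 13.8, 14.5, 13.9, 14.2, 19.6, 14.0]
unusual = distance_based_outliers(october_means, p=0.9, d=3.0)
print(unusual)
```

The `distance` argument is where the requirement of an appropriate distance function shows up: for images or multi-attribute sensor records it would have to be replaced by a domain-specific measure.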

44
References
  • [1] Praveen Kumar and Amanda BT. White, "Scalable
    Knowledge Discovery for Hydroclimatological
    Studies," University of Illinois, 2002.
  • [2] H. J. Miller and J. Han, Geographic Data
    Mining and Knowledge Discovery, Taylor &
    Francis, 2001.
  • [3] Nabil Adam, Vijay Atluri, Songmei Yu and
    Yelena Yesha, "Efficient Storage and Management
    of Environmental Information," presented at the
    11th Mass Storage Conference held by IEEE and
    NASA, Maryland, April 2002.
  • [4] Wendolin Bosques, Ricardo Rodriguez, Angelica
    Rondon and Ramon Vasquez, "A Spatial Data
    Retrieval and Image Processing Expert System for
    the World Wide Web," 21st International
    Conference on Computers and Industrial
    Engineering, 1997, pp. 433-436.
  • [5] Krzysztof Koperski and Jiawei Han,
    "Discovery of Spatial Association Rules in
    Geographic Information Databases," Proceedings
    of the 4th International Symposium on Advances
    in Spatial Databases (SSD), Vol. 951,
    Springer-Verlag, pp. 47-66.
  • [6] Kirk Barrett, "The Meadowlands Environmental
    Research Institute," Science on the Semantic Web
    (SWS) Workshop, Oct. 2002.
  • [7] Jiawei Han, Russ B. Altman, Vipin Kumar,
    Heikki Mannila, and Daryl Pregibon, "Emerging
    Scientific Applications in Data Mining,"
    Communications of the ACM, Vol. 45, No. 8,
    August 2002, pp. 54-58.
  • [8] Krzysztof Koperski, Junas Adhikary, and
    Jiawei Han, "Spatial Data Mining: Progress and
    Challenges," survey paper, SIGMOD '96 Workshop
    on Research Issues in Data Mining and Knowledge
    Discovery.
  • [9] Jiawei Han and Micheline Kamber, Data Mining:
    Concepts and Techniques, Morgan Kaufmann
    Publishers, 2000.
  • [10] Raymond T. Ng, "Detecting Outliers from
    Large Datasets," in Geographic Data Mining and
    Knowledge Discovery, Taylor & Francis, 2001.

45
Focus Areas
  • Environmental monitoring
  • Remote sensing/GIS for land use planning
  • Plant and animal inventory and assessment
  • Salt-marsh and Landfill Characterization and
    Restoration
  • Assessment and Remediation of Contaminated
    Sediments
  • Land use information management for planning and
    engineering (predict land use trends for planner,
    code enforcement for engineers)
  • Scientific data warehousing for efficient
    management of environmental and remote sensing
    data
  • Scientific data mining for discovering trends,
    patterns and relationships among land use and
    environmental data
  • Automating land use permit processing workflows
    through transparent inter-agency interaction

46
Introduction to Environmental Data (contd.)
  • Value-added products
  • water
  • vegetation
  • temperature
  • true colors (composites)
  • models of the topography and spatial attributes
    of the landscape
  • roads, rivers, parcels, schools, zip code areas,
    city streets and administrative boundaries
  • Maps, reports, data sets from government agencies
  • census information that describes the
    socio-economic and health characteristics of the
    population
  • real-time data from ground monitoring stations