Title: University of Alabama in Huntsville NMI Testing and Experiences
1University of Alabama in Huntsville NMI Testing
and Experiences
Sandra Redman
Information Technology and Systems Center and Information Technology Research Center
National Space Science and Technology Center
256-961-7806
sredman_at_itsc.uah.edu
Sandra.Redman_at_msfc.nasa.gov
www.itsc.uah.edu
2Improving Data Usability
- Advanced Applications Development
- Data organization and management for archival and analysis
- Data Mining in real-time and for post run analysis
- Interchange Technologies for improved data exploitation
- Semantics to transform data exploitation via intelligent automated processing
- Exploiting Technology
- Grid technologies for seamless access to multiple computational and data resources into a virtual computing environment
- Cluster technologies for high speed parallel computation, for multiple agent computations, and other applications
- High-performance networking for advanced applications development and high performance connectivity
- Next generation technologies in videoconferencing and electronic collaboration
3Exploiting Technology to Improve Data Usability
[Figure: chart with axes "Time" (Now to Future) and "Increasing Capability". Labels: Data Mining; Earth Science Markup Language (ESML); Visual navigation aids; 3D and 4D distributed dynamic data fusion; Custom Order Processing; Adaptive/learning; On-Board Mining; Knowledge Discovery; GRID Processing; Customized knowledge delivery; Real-time Data Fusion Information Delivery; Distributed Immersive Collaborative Environments]
4Data Usability Success Builds on the Integration
of User Domains and Information Technology
- Information Technology Scientists
- Information Science Research
- Knowledge Management
- Data Exploitation
- Domain Scientists and Engineers
- Research and Analysis
- Data Set Development
- Collaborations
- Accelerate research process
- Maximize knowledge discovery
- Minimize data handling
- Contribute to both fields
Domain Scientists and Engineers
Information Scientists
5Data Mining
- Automated discovery of patterns and anomalies from vast observational data sets
- Derived knowledge for decision making, predictions and disaster response
- http://datamining.itsc.uah.edu
6Mining Environment: When, Where, Who and Why?
- WHERE
- User Workstation
- Data Mining Center
- Cluster
- Grid
- On-board
- WHEN
- Real Time
- On-Ingest
- On-Demand
- Repeatedly
- WHO
- End Users
- Domain Experts
- Mining Experts
- WHY
- Event
- Relationship
- Association
- Corroboration
- Collaboration
Data Mining
7Creating a Successful Environment for Data Mining
- Provide scientists with the capabilities to allow the flexibility of creative scientific analysis
- Provide data mining benefits of
- Automation of the analysis process
- Reducing data volume
- Provide a framework to allow a well defined structure to the entire process
- Provide a suite of mining algorithms for creative analysis that can adapt to new hypotheses
- Provide capabilities to add science algorithms to the environment
- Exploit emerging technologies in computational and data grids, high-performance networks, and collaborative environments
8Algorithm Development and Mining System (ADaM) -
System Overview
- Consists of over 100 interoperable mining and image processing components
- Each component is provided with a C application programming interface (API) and an executable in support of scripting tools (e.g. Perl, Python, Tcl, Shell)
- ADaM components are lightweight and autonomous, and have been used successfully in a grid environment
- ADaM has several translation components that provide data level interoperability with other mining systems (such as WEKA and Orange), and point tools (such as libSVM and svmLight)
- Components include Python wrappers and web service interfaces
- Visualization of results is easily accomplished with various visualization packages
9ADaM Components
10Current Mining Environments
- Multiple Configurations
- Complete System (Client and Engine)
- Mining Engine (User provides its own client)
- Application Specific Mining Systems
- Operations Tool Kit
- Stand Alone Mining Algorithms
- Distributed/Federated/Grid Mining
- Distributed services
- Distributed data
- Chaining using Interchange Technologies
- On-board Mining
- Real time and distributed mining
- Processing environment constraints
- Space-based/ground-based/unmanned
11ADaM Feature Subset Selection application chosen
for testing
- Supervised pattern classification is a technique important in many domains
- Feature subset selection is used to improve both the runtime and accuracy of a supervised pattern classifier by eliminating noisy, irrelevant or redundant attributes or features from the data set
- Feature subset selection is the process of choosing a subset of the features from the original data set in order to maximize classifier accuracy
- Both processor and data-intensive
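As a rough illustration of the selection process described above, the sketch below runs a greedy forward search over feature indices. It is not ADaM's implementation: the nearest-centroid classifier, the leave-one-out scoring, and the names `nearest_centroid_accuracy` and `forward_select` are all assumptions made to keep the example self-contained.

```python
import math


def nearest_centroid_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier
    restricted to the given feature indices (assumed scoring method)."""
    correct = 0
    for held_out in range(len(rows)):
        # Sum the training samples per class, leaving one sample out.
        sums, counts = {}, {}
        for i, (row, label) in enumerate(zip(rows, labels)):
            if i == held_out:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        # Predict the class whose centroid is closest to the held-out sample.
        def distance(label):
            return math.sqrt(sum(
                (rows[held_out][f] - sums[label][k] / counts[label]) ** 2
                for k, f in enumerate(features)))

        predicted = min(sorted(sums), key=distance)
        correct += predicted == labels[held_out]
    return correct / len(rows)


def forward_select(rows, labels):
    """Greedily add the feature that most improves accuracy and stop
    when no remaining feature helps."""
    selected, best = [], 0.0
    remaining = list(range(len(rows[0])))
    while remaining:
        scored = [(nearest_centroid_accuracy(rows, labels, selected + [f]), f)
                  for f in remaining]
        # Prefer the higher score; break ties toward the lower index.
        score, feature = max(scored, key=lambda s: (s[0], -s[1]))
        if score <= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best
```

Every candidate subset requires retraining and rescoring the classifier, which is why the search is processor-intensive and why independent candidates are natural units of work to spread across grid nodes.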
12Parallel Version of Cloud Extraction
- GOES images used to recognize cumulus cloud fields
- Cumulus clouds are small and do not show up well in 4km resolution IR channels
- Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors
[Figure: parallel processing diagram. Nodes: Master, Slave 1, Slave 2, Slave 3. Stages: GOES Image; Laplacian Filter, Sobel Horizontal Filter, Sobel Vertical Filter; Energy Computation (four instances); Classifier; Cloud Image. Example panels: GOES Image, Cumulus Cloud Mask]
Three edge detection filters are used together to detect cumulus clouds, an approach that lends itself to implementation on a parallel cluster
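The filter-then-classify pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the ADaM code: the per-pixel squared response used as "energy", the threshold classifier, and the names `filter_energy` and `cumulus_mask` are all hypothetical. Threads stand in for the slave nodes only to show the structure; a real cluster run would place each filter on a separate process or machine.

```python
from concurrent.futures import ThreadPoolExecutor

# Standard 3x3 kernels. SOBEL_HORIZONTAL responds to horizontal edges,
# SOBEL_VERTICAL to vertical edges (naming conventions vary).
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_HORIZONTAL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
SOBEL_VERTICAL = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def filter_energy(image, kernel):
    """Apply a 3x3 kernel and return the squared response per pixel.
    Border pixels are left at zero for simplicity."""
    rows, cols = len(image), len(image[0])
    energy = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            response = sum(kernel[i][j] * image[r + i - 1][c + j - 1]
                           for i in range(3) for j in range(3))
            energy[r][c] = float(response * response)
    return energy


def cumulus_mask(image, threshold):
    """Run the three filters concurrently and flag pixels whose summed
    edge energy exceeds the (hypothetical) threshold."""
    kernels = [LAPLACIAN, SOBEL_HORIZONTAL, SOBEL_VERTICAL]
    with ThreadPoolExecutor(max_workers=len(kernels)) as pool:
        energies = list(pool.map(lambda k: filter_energy(image, k), kernels))
    return [[int(sum(e[r][c] for e in energies) > threshold)
             for c in range(len(image[0]))]
            for r in range(len(image))]
```

Because each filter reads the same input image and writes its own energy map, the three branches share no state until the final classification step, which is what makes the problem easy to parallelize.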
13Feature Subset Selection Testing
- Application ported to Linux
- Support Vector Machine downloaded and tested
- Developed application scripts
- Modified for Globus environment by writing simple Globus RSL file
- Ran each combination of tools on a different node on the grid
- Globus used to execute jobs on different machines
- Experimented with both real and synthetic data
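To make the "one combination per node" setup concrete, the sketch below generates a minimal RSL job description for each tool combination and pairs it with a node. The slides do not show the actual RSL file or name the tools, so everything here is an assumption: the reading of "combination" as selector-classifier pairs, the tool names, the executable path, the host names, and the helpers `build_rsl` and `plan_jobs` are hypothetical. Only the general `&(attribute=value)` form of RSL with the `executable`, `arguments`, `stdout`, and `count` attributes is taken from Globus.

```python
from itertools import product


def build_rsl(executable, arguments, stdout_path):
    """Format a minimal RSL job description string."""
    quoted = " ".join('"{}"'.format(a) for a in arguments)
    return '&(executable="{}")(arguments={})(stdout="{}")(count=1)'.format(
        executable, quoted, stdout_path)


def plan_jobs(selectors, classifiers, nodes, executable):
    """Pair every selector/classifier combination with a distinct node."""
    combos = list(product(selectors, classifiers))
    if len(combos) > len(nodes):
        raise ValueError("need at least one node per combination")
    return [(node, build_rsl(executable, [s, c], "{}_{}.out".format(s, c)))
            for node, (s, c) in zip(nodes, combos)]


if __name__ == "__main__":
    # All names below are placeholders, not values from the testbed.
    jobs = plan_jobs(["forward", "backward"], ["svm", "bayes"],
                     ["node1", "node2", "node3", "node4"],
                     "/path/to/run_selection.sh")
    for node, rsl in jobs:
        print(node, rsl)
```

Each resulting string would be saved to a file and submitted to the matching node with the Globus job submission tools.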
14Early Findings (NMI R2)
- Globus documentation improved, installation trouble-free, application port straightforward
- No problems encountered during Condor-G installation, but found problem with Condor-G under Red Hat Linux 7.3 when using nss_ldap. Developer provided workaround: start name service caching daemon (nscd)
- GSI-OpenSSH installed, but Kerberos authentication did not work since Linux was not compiled with PAM option (undocumented)
- Network Weather Service installed, but learned we are more interested in MDS
15MEAD: Modeling Environment for Atmospheric Discovery
- One of the NSF PACI Alliance research Expeditions
- Expeditions ensure intense collaboration among technology developers and application scientists and focus on the deployment of infrastructure that supports computational science and engineering and science in a variety of disciplines.
- MEAD's focus is on retrospective analysis of hurricanes and severe storms using the TeraGrid, integrating computation, grid workflow management, data management, model coupling, data analysis/mining, and visualization.
16MEAD
- Science Objective
- To investigate different thunderstorm cell interactions favorable for subsequent tornado (mesocyclone) formation
- Approach
- Use idealized WRF model simulations with different initial conditions
- Create a large parameter space of thunderstorm cell interaction and storm behavior
- Mine this search space for patterns and trends
17WRF Initializations
- 230 WRF runs were made, two control (single-cell)
- Each corresponded to a particular arrangement of a pair of initial storm cells
- In figure at left
- Each square = 1 simulation
- 1st storm in the middle
- 2nd at one of blue squares
- Center cell stronger
Matrix of WRF simulations
Slide Source: Brian Jewett
18Goals of this Mining Study
- Develop a mesocyclone detection algorithm (in both 2D and 3D)
- Develop an algorithm to track the temporal evolution of the mesocyclone features
- Investigate the use of clustering techniques to
- Summarize differences in simulation runs
- Provide an overview of all the simulations
19Example Tracking Results
20Mesocyclone Detection and Tracking Results
Features with time durations of a single time
step are filtered out
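The filtering rule above can be illustrated with a small tracking sketch. The slides do not describe the tracking algorithm itself, so the nearest-centroid linking, the `max_distance` parameter, and the names `track_features` and `drop_single_step` are assumptions; only the final step, discarding features that last a single time step, comes from the slide. `math.dist` requires Python 3.8 or later.

```python
import math


def track_features(frames, max_distance):
    """Link detections across consecutive time steps by nearest centroid.

    frames holds one list of (x, y) centroids per time step.
    Returns a list of tracks, each a list of (time_step, (x, y)).
    """
    finished, active = [], []
    for t, detections in enumerate(frames):
        unmatched = list(detections)
        still_active = []
        for track in active:
            _, last = track[-1]
            candidates = [d for d in unmatched
                          if math.dist(last, d) <= max_distance]
            if candidates:
                nearest = min(candidates, key=lambda d: math.dist(last, d))
                unmatched.remove(nearest)
                track.append((t, nearest))
                still_active.append(track)
            else:
                # No detection close enough: the feature has ended.
                finished.append(track)
        # Detections that extend no existing track start new ones.
        still_active.extend([[(t, d)] for d in unmatched])
        active = still_active
    return finished + active


def drop_single_step(tracks):
    """Discard features that last only a single time step."""
    return [track for track in tracks if len(track) > 1]
```

For example, with `frames = [[(0, 0), (10, 10)], [(1, 0)], [(2, 1)]]` and `max_distance=2`, the detection at `(10, 10)` appears once and is dropped, while the other feature is kept as a three-step track.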
21Summary: Mesocyclone Detection
- Larger numbers of long-duration mesocyclones tend to be associated with initializations where the second cell is closer to the first
- Mesocyclones found in the storm simulations are sensitive to the particular arrangement of a pair of initial storm cells (secondary storm placement at 45 degrees to the primary storm)
- Clustering techniques are useful to summarize differences in simulation runs
- Clustering techniques provide an overview of all the simulations
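As one way to picture how clustering can summarize a set of simulation runs, the sketch below groups runs by a small feature vector. The slides do not say which clustering technique or which per-run features were used, so plain k-means, the deterministic initialization, and the idea of describing each run by two summary numbers are all assumptions for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic initialization (first k points)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids


if __name__ == "__main__":
    # Hypothetical per-run summaries, e.g. (feature count, longest duration).
    runs = [(1.0, 1.0), (9.0, 1.0), (2.0, 1.0), (10.0, 2.0)]
    labels, centers = kmeans(runs, k=2)
    print(labels, centers)
```

The cluster label per run gives the overview, and comparing the centroids shows how the groups of runs differ.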
22Some Lessons Learned
- NMI Testbed Process working well
- Answers found through NMI discussion lists from developers and other users
- Have to sell the grid concept to developers, administrators, users
- NMI Work proven helpful in other grid work
- TeraGrid
- LEAD: Linked Environments for Atmospheric Discovery
- SpaceDoG: Space Development and Operations Grid
- CEOS: Committee on Earth Observation Satellites
- More Components needed
23Linked Environments for Atmospheric Discovery (LEAD)
- NSF Information Technology Research Program
- Creating a cyberinfrastructure for mesoscale meteorology
- Real-time, on-demand, and dynamically adaptive needs for mesoscale weather research
- High volume data sets and streams
- Computationally demanding numerical models and data assimilation systems
24The LEAD Goal
- To create an integrated, scalable framework in which analysis tools, forecast models, and data repositories can be used as dynamically adaptive, on-demand systems that can
- operate independent of data formats and the physical location of data or computing resources
- change configuration rapidly and automatically in response to weather
- continually be steered by new data (i.e., the weather)
- respond to decision-driven inputs from users
- initiate other processes automatically and
- steer remote observing technologies to optimize data collection for the problem at hand
25The LEAD Vision: Dynamic, Adaptive, Multi-Scale
[Figure: LEAD vision diagram. Labels: NWS National Static Observations Grids; Virtual/Digital Resources and Services; ADAS; ADaM; Mesoscale Weather; MyLEADPortal; Experimental Dynamic Observations; Tools; Remote Physical (Grid) Resources; Local Physical Resources; Local Observations]
26LEAD: An integrated framework for identifying,
accessing, preparing, assimilating, predicting,
managing, analyzing, mining, and visualizing
meteorological data, independent of format and
physical location
27Challenges for Next-generation Mining
- Develop and document common/standard interfaces for interoperability of data and services
- Design new data models for handling
- real-time/streaming input
- data fusion/integration
- Design and develop distributed standardized catalog capabilities
- Develop advanced resource allocation and load balancing techniques
- Exploit the grid concept for enhanced data mining functionality
- Develop more intelligent and intuitive user interfaces
- Integrate with collaborative environments
- Develop ontologies of scientific data, processes and data mining techniques for multiple domains
- Support language and system independent components
- Incorporate data mining into science and engineering curricula
28LEAD GWSTBs: Grid and Web Services Testbeds
- Local User Environment: customized portal, control of information flows, collaboration tools, managing processes
- Productivity Environment: models, tools, and algorithms
- Data Services Environment: data transport, data formatting, and interoperability
- Distributed Technologies Environment: workflow infrastructure to autonomously acquire resources and adapt to changing plans
- Data Archive: recent and historical data, products, and tools
29LEAD Education Testbeds
- Provide hands-on access to assess the effectiveness of LEAD technologies for education
- Provide input and feedback to LEAD developers
- Facilitate knowledge transfer
- Collaborative technologies
30LEAD policy development and implementation
- Define Virtual Organizations
- LEAD designed for use principally by the meteorological higher education and operations research communities
- Develop LEAD policies
- Developing LEAD global policies
- Adhere to local policies of each site (security, resource utilization, etc.)
- Policy management services
- PKI cryptography, X.509 certificates
- Authorization service
- Monitor resource utilization and accounting services
31Other considerations
- Emerging standards and middleware
- Applications development to concentrate on the application, using NMI middleware (Globus, MyProxy, OGCE, etc.) for grid infrastructure; also using additional middleware (MCS, RSL, performance monitoring tools)
- Current software has dependencies on middleware versions
- Configuration management
- Distributed team developing and delivering software to multiple testbeds
- Goal is to allow heterogeneous host environments
- Collaborative technologies
- Access Grid, H.323 videoconferencing facilitate LEAD team project planning and work sessions
- Collaborative technologies will be integrated into testbeds for user education and research
32Data Integration and Mining: From Global Information to Local Knowledge
Emergency Response
Precision Agriculture
Bioinformatics
Urban Environments
Weather Prediction