Title: Focus Study: Mining on the Grid with ADaM
1 Focus Study: Mining on the Grid with ADaM
Sara Graves, Sandra Redman
Information Technology and Systems Center and Information Technology Research Center
University of Alabama in Huntsville
National Space Science and Technology Center
256-961-7806
sgraves_at_itsc.uah.edu, sredman_at_itsc.uah.edu
www.itsc.uah.edu
2 Data Mining
- Automated discovery of patterns and anomalies from vast observational data sets
- Derived knowledge for decision making, predictions, and disaster response
- http://datamining.itsc.uah.edu
3 Creating a Successful Environment for Data Mining
- Provide scientists with capabilities that allow flexible, creative scientific analysis
- Provide the data mining benefits of
  - Automation of the analysis process
  - Reducing data volume
- Provide a framework that gives a well-defined structure to the entire process
- Provide a suite of mining algorithms for creative analysis that can adapt to new hypotheses
- Provide capabilities to add science algorithms to the environment
- Exploit emerging technologies in computational and data grids, high-performance networks, and collaborative environments
4 Challenges for Next-generation Mining
- Develop and document common/standard interfaces for interoperability of data and services
- Design new data models for handling
  - real-time/streaming input
  - data fusion/integration
- Design and develop distributed, standardized catalog capabilities
- Develop advanced resource allocation and load balancing techniques
- Exploit the grid concept for enhanced data mining functionality
- Develop more intelligent and intuitive user interfaces
- Integrate with collaborative environments
- Develop ontologies of scientific data, processes, and data mining techniques for multiple domains
- Support language- and system-independent components
- Incorporate data mining into science and engineering curricula
5 Algorithm Development and Mining System (ADaM) - System Overview
- Consists of over 100 interoperable mining and image processing components
- Each component is provided with a C application programming interface (API) and an executable in support of scripting tools (e.g. Perl, Python, Tcl, Shell)
- ADaM components are lightweight and autonomous, and have been used successfully in a grid environment (NASA IPG, TeraGrid, lab)
- ADaM has several translation components that provide data-level interoperability with other mining systems (such as WEKA and Orange) and point tools (such as libSVM and svmLight)
- Web service interfaces in development
- Executes in multiple environments (e.g. workstation, cluster, grid, on-board)
- NMI Integration Testbed test cases
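Because each component ships as a standalone executable, a scripting language can chain components by passing file names on the command line. The following is a minimal sketch of that pattern in Python. The executable names (`hypothetical_filter`, `hypothetical_classifier`), the `-i`/`-o` flags, and the `-name value` option style are placeholders invented for illustration; they are not documented ADaM component names or options.

```python
import shlex
import subprocess


def build_command(executable, input_path, output_path, **options):
    """Assemble an argv list for one component executable.

    The "-i"/"-o" flags and "-name value" option style are placeholders,
    not documented ADaM options.
    """
    argv = [executable, "-i", input_path, "-o", output_path]
    for name, value in sorted(options.items()):
        argv += ["-" + name, str(value)]
    return argv


def run_pipeline(steps, dry_run=True):
    """Return the command lines in order; execute them unless dry_run is set."""
    commands = []
    for argv in steps:
        commands.append(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing component
    return commands


# Two hypothetical components chained through an intermediate file.
steps = [
    build_command("hypothetical_filter", "radar.bin", "smoothed.bin", window=3),
    build_command("hypothetical_classifier", "smoothed.bin", "labels.bin", k=4),
]
for line in run_pipeline(steps):
    print(line)
```

With `dry_run=True` the sketch only prints the command lines, which makes it easy to inspect a pipeline before running it on a workstation, cluster, or grid node.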
6 MEAD: Modeling Environment for Atmospheric Discovery
- One of the NSF PACI Alliance research Expeditions
- Expeditions ensure intense collaboration among technology developers and application scientists and focus on the deployment of infrastructure that supports computational science and engineering in a variety of disciplines
- MEAD's focus is on retrospective analysis of hurricanes and severe storms using the TeraGrid, integrating computation, grid workflow management, data management, model coupling, data analysis/mining, and visualization
7 MEAD Mining Example: Mesocyclone Detection Algorithm
- Science Objective
  - To investigate different thunderstorm cell interactions favorable for subsequent tornado (mesocyclone) formation
- Goals
  - Develop a mesocyclone detection algorithm (in both 2D and 3D)
  - Develop an algorithm to track the temporal evolution of the mesocyclone features
  - Investigate the use of clustering techniques to
    - Summarize differences in simulation runs
    - Provide an overview of all the simulations
8 Approach
- Mining Approach
  - Use idealized WRF model simulations with different initial conditions
  - Create a large parameter space of thunderstorm cell interaction and storm behavior
  - Mine this search space for patterns and trends
- Grid Approach
  - Application scripts were developed in Python and tested on Linux, then modified for the Globus environment by writing a simple Globus RSL file
  - Application scripts were constructed to run each combination of tools in parallel on a different node on the grid
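The grid approach above can be sketched as a small generator that writes one RSL job description per combination, so that each combination can be submitted as its own job. This is only an illustration: the tool names, run names, the `mine.py` script, and the interpreter path are placeholders, and the attributes used (`executable`, `arguments`, `count`, `stdout`) are a minimal subset of RSL. The RSL file actually used in the study is not shown in the slides.

```python
from itertools import product


def make_rsl(executable, arguments, stdout_path):
    """Format one single-process job description in RSL attribute syntax."""
    quoted_args = " ".join('"%s"' % a for a in arguments)
    return '&(executable="%s")(arguments=%s)(count=1)(stdout="%s")' % (
        executable, quoted_args, stdout_path)


# Placeholder tool and run names; the real workflow's names are not given.
tools = ["detect", "track"]
runs = ["run_001", "run_002"]

jobs = []
for tool, run in product(tools, runs):
    jobs.append(make_rsl("/usr/bin/python",
                         ["mine.py", tool, run],
                         "%s_%s.out" % (tool, run)))

for rsl in jobs:
    print(rsl)
```

Job submission itself is outside the sketch; each generated string would be handed to the site's Globus job submission tool as a separate job.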
9 Example MEAD Workflow
[Figure: workflow diagram. Stages: Initial Setup, Model Execution, Post Run Analysis. Elements: Initial Data and Parameters, Multiple WRF Models (Weather), Multiple ROMS Models (Ocean), Inter-model communications, Model Results, Data Mining (ADaM), Visualization.]
Grid environment supports the demanding computational, data storage, and post-analysis requirements
10 Using the TeraGrid
- Excellent user documentation at http://www.teragrid.org/userinfo/
- Account Management: procedures vary per site
  - Get account at each site
  - Obtain certificate (from one of several sites, X.509 or KX.509)
  - Establish Distinguished Name in grid-mapfile at each site
  - Create certificate proxy (grid-proxy-init, MyProxy, kinit)
- Programming Environment: know your systems
  - Compilers (you have a number of choices)
  - Environment Variables (SoftEnv)
  - Message Passing (several flavors available)
- Executing Jobs
  - Condor-G
  - Globus
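As an illustration of the grid-mapfile step above: a grid-mapfile associates a certificate's Distinguished Name with a local account, one quoted DN per line followed by the account name. The sketch below parses that layout to check which local account a DN maps to. The DNs and account names are made up, and the comma-separated multi-account form is an assumption about the file layout that may not apply at every site.

```python
import shlex


def parse_grid_mapfile(text):
    """Map each quoted Distinguished Name to its local account name(s)."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        dn, accounts = shlex.split(line)  # the quoted DN becomes one token
        mapping[dn] = accounts.split(",")
    return mapping


# Made-up entries for illustration only.
sample = '''
# example grid-mapfile
"/O=Example/OU=People/CN=Jane Doe" jdoe
"/O=Example/OU=People/CN=John Roe" jroe,wrfuser
'''

mapping = parse_grid_mapfile(sample)
for dn, accounts in mapping.items():
    print(dn, "->", accounts)
```

Since procedures vary per site, a check like this would have to be repeated against each site's own mapfile.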
11 WRF Initializations
- 230 WRF runs were made, two control (single-cell)
- Each corresponded to a particular arrangement of a pair of initial storm cells
- In figure at left
  - Each square = 1 simulation
  - 1st storm in the middle
  - 2nd at one of blue squares
  - Center cell stronger
[Figure: Matrix of WRF simulations]
Slide Source: Brian Jewett
12 Example Tracking Results
13 Mesocyclone Detection and Tracking Results
Features with time durations of a single time step are filtered out
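The filtering rule above can be sketched in a few lines. The track representation (a feature id mapped to the list of time-step indices at which it was detected) and the sample values are assumptions for illustration; the slides do not describe the tracker's actual data structures.

```python
def filter_short_tracks(tracks, min_steps=2):
    """Keep only tracked features seen at min_steps or more distinct time steps."""
    return {track_id: steps
            for track_id, steps in tracks.items()
            if len(set(steps)) >= min_steps}


# Hypothetical tracks: feature id -> time-step indices where it was detected.
tracks = {
    "A": [3, 4, 5, 6],
    "B": [7],          # single time step: dropped
    "C": [10, 11],
}
kept = filter_short_tracks(tracks)
print(sorted(kept))
```

Dropping single-step features removes detections that never persist long enough to be tracked across time steps.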
14 Summary: Mesocyclone Detection
- Mesocyclones with longer durations tend to be associated with initializations where the second cell is closer to the first
- Mesocyclones found in the storm simulations are sensitive to the particular arrangement of a pair of initial storm cells (secondary storm placement at 45 degrees to the primary storm)
- Clustering techniques are useful
  - Summarize differences in simulation runs
  - Provide an overview of all the simulations
- Limitations of clustering algorithms
  - Investigated K-Means, DBSCAN, Maximin, and Hierarchical Clustering algorithms
  - K-Means clustering quality is inferior but provides useful cluster centers or profiles
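To make the "cluster centers or profiles" point concrete, here is a minimal sketch of plain K-Means (Lloyd's algorithm) in Python. It is not the ADaM implementation, and the per-simulation features (mesocyclone count, mean duration in time steps) and their values are invented for illustration. Each returned center is the mean feature vector of its cluster, which is what lets it act as a profile of a group of simulation runs.

```python
def kmeans(points, centers, iterations=10):
    """Plain Lloyd's k-means on lists of floats, starting from given centers."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            labels[i] = dists.index(min(dists))
        # Update step: each center becomes the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # leave an empty cluster's center unchanged
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical per-simulation summaries: (mesocyclone count, mean duration).
runs = [[1.0, 2.0], [2.0, 2.0], [1.0, 3.0], [8.0, 9.0], [9.0, 10.0], [8.0, 10.0]]
centers, labels = kmeans(runs, centers=[runs[0], runs[3]])
print(labels)
print(centers)
```

The starting centers are passed in explicitly to keep the sketch deterministic; in practice the result depends on initialization.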
15 LEAD: Linked Environments for Atmospheric Discovery
- A cyberinfrastructure for mesoscale meteorology
- Real-time, on-demand, and dynamically adaptive needs for mesoscale weather research
- High-volume data sets and streams
- Computationally demanding numerical models and data assimilation systems
16 LEAD: NSF Information Technology Research (ITR) program
Multi-disciplinary team contributing expertise in meteorological applications, analysis tools, forecast tools, data distribution and management, portal development, workflow orchestration, and education and outreach
17 LEAD
An integrated framework for identifying, accessing, preparing, assimilating, predicting, managing, analyzing, mining, and visualizing meteorological data, independent of format and physical location
Dynamic workflow orchestration and data management are key elements
18 LEAD GWSTBs: Grid and Web Services Testbeds
- Local User Environment: customized portal, control of information flows, collaboration tools, managing processes
- Productivity Environment: models, tools, and algorithms
- Data Services Environment: data transport, data formatting, and interoperability
- Distributed Technologies Environment: workflow infrastructure to autonomously acquire resources and adapt to changing plans
- Data Archive: recent and historical data, products, and tools
19 The Portal as a Grid Access Point
- The Portal Server provides the user's Grid Context.
[Figure: architecture diagram. Labels: OGCE or GridSphere Grid Portal Server; https; SOAP, WS-Security; Web Services Resource Framework, Web Services Notification; Physical Resource Layer.]
20 Services Oriented Architecture
- User interfaces with portal via browser
- Portal provides tools for users to build and launch workflows
- Portlets (JSR-168) provide interface between user and grid services
- Applications can be wrapped as services via a Portal Factory Service Generator
  - Requires application, script to run it, input parameters, output parameters
  - Write an AppService document and upload to Portal Factory Service Generator (in portal)
  - Service is created as well as the portal client interface
- Security model integral to design
21 Data Integration and Mining: From Global Information to Local Knowledge
Emergency Response
Precision Agriculture
Bioinformatics
Urban Environments
Weather Prediction