1
Metadata, Provenance, and Search in e-Science
  • Beth Plale
  • Director, Center for Data and Search Informatics
  • School of Informatics
  • Indiana University

2
Credits
  • PhD students Yogesh Simmhan, Nithya Vijayakumar, and Scott Jensen.
  • Dennis Gannon, IU, key collaborator on discovery cyberinfrastructure.
3
Nature of Computational Science Discovery
  • Extract data from heterogeneous databases,
  • Execute task sequences (workflows) on your
    behalf,
  • Mine data from sensors and instruments and respond to the results,
  • Try out new algorithms,
  • Explore data through visualization, and
  • Go back and repeat steps again with new data,
    answering new questions, or with new algorithms.
  • How is this discovery process supported today?
  • Through cyberinfrastructure.
  • CyberInfrastructure that supports
  • On-demand knowledge discovery
  • Automated experiment management (data and workflow)
  • Data protection and automated data product provenance tracking.

4
CyberInfrastructure framework for discovery
  • Plug-and-play data sources and analysis tools; complex what-if scenarios. Provided through:
  • User portal
  • Personal metadata catalog of data exploration results
  • Data product index/catalog
  • Data provenance service
  • Workflow engine and composition tools
  • Tied together with an Internet-scale event bus.
  • Results publishable to a digital library.

5
Cyberinfrastructure for computing: the DSI DataCenter
supports analysis, use, visualization, and search research,
and supports multiple datasets.
6
Distributed services provide functional capability.
7
Vision for Data Handling
  • Capturing metadata about data sets as they are generated is key
  • Syntactic: file size, date of creation, etc.
  • Semantic or domain-specific: spatial region, logical time
  • Context of a file is a key search parameter
  • Provenance, or the history of a data product, is needed to assess quality
  • Volume of data used in computational science is too large: manage it on behalf of the user
  • Indexes help efficiency

8
The Realization in Software
[Architecture diagram with the following components: Users Browser, Portal server, Workflow graph, Workflow Engine, App factory, Application services, Compute Engine, Event Notification Bus, MyLEAD Agent service, MyLEAD User Metadata catalog, Data Management Service, Data Catalog service, Provenance Collection service, and Data Storage.]
9
The infrastructure is portal-based; that is, all
services are available through a web server
10
e-Science Gateway Architecture
[Architecture diagram: the user's Grid Desktop connects to the Grid Portal Server.]
Gannon, D., et al., Service Oriented Architectures for Science Gateways on Grid Systems, ICSOC, 2005.
11
LEAD-CI Cyberinfrastructure
  • Workflows run on the LEADgrid and on the TeraGrid.
  • Portal and persistent back-end web services run
    on LEADgrid.
  • Data storage resources for storing user-generated
    data products are provided by Indiana University.

12
Typical weather forecast runs as workflow
[Workflow diagram with four stages: Pre-Processing, Assimilation, Forecast, and Visualization. Inputs include terrain data files; ETA, RUC, and GFS data; surface data files; radar data (Level II and Level III); surface, upper air, mesonet, and wind profiler data; and satellite data. Components include arpstrn, Ext2arps-ibc, Ext2arps-lbc, arpssfc, 88d2arps, nids2arps, mci2arps, ADAS assimilation, arps2wrf, WRF, wrf2arps, arpsplot, and IDV viz.]
400 data products are consumed, produced, or transformed during the workflow lifecycle.
13
To set up a workflow experiment, we select a
workflow (not shown), then set model parameters
here.
14
Supported community data collections
15
Data Integration
  • Local view: a crosswalk point of presence supports crawling and publishes a difference list as LEAD Metadata Schema (LMS) documents
  • Globally integrated view: the Data Catalog Service
  • Crawler crawls the catalogs and builds an index of the results
  • Web service API
  • Boolean search query with spatial/temporal support (see the sketch after this list); results returned as a list of LEAD Metadata Schema documents
  • Index: XMLDB native XML database, with Lucene for indexing
[Diagram: crosswalks at sites in Oklahoma, Indiana, and Colorado feed the globally integrated Data Catalog Service. Source collections include the CASA radar collection (months of data, via ftp), the latest 3 days of the Unidata IDD distribution (XML web server), Level II and III radar for the latest 3 days (XML web server), and ETA, NCEP, NAM, METAR, etc. (XML web server).]
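The Boolean query with spatial/temporal support can be thought of as a conjunction of predicates over catalog entries. The sketch below is a minimal illustration of that semantics, not the Data Catalog Service's actual API; the CatalogEntry fields and class names are assumptions for this example.

    import java.time.Instant;
    import java.util.List;

    // Hypothetical, simplified catalog entry; real LMS documents carry much richer metadata.
    record CatalogEntry(String id, String product, double lat, double lon, Instant observed) {}

    public class CatalogSearch {
        // Boolean query: product name match AND spatial bounding box AND temporal window.
        static List<CatalogEntry> search(List<CatalogEntry> index, String product,
                                         double minLat, double maxLat,
                                         double minLon, double maxLon,
                                         Instant from, Instant to) {
            return index.stream()
                    .filter(e -> e.product().equalsIgnoreCase(product))
                    .filter(e -> e.lat() >= minLat && e.lat() <= maxLat)
                    .filter(e -> e.lon() >= minLon && e.lon() <= maxLon)
                    .filter(e -> !e.observed().isBefore(from) && !e.observed().isAfter(to))
                    .toList();
        }
    }

A production index such as the Lucene-backed one on the slide would evaluate the same conjunction against an inverted index rather than scanning entries linearly, but the query semantics are the same.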
16
LEAD Personal Workspace
  • CyberInfrastructure extends the user's desktop to incorporate a vast data analysis space.
  • As users go about doing scientific experiments, the CI manages back-end storage and compute resources.
  • The portal provides ways to explore, search, and discover this data.
  • Metadata about experiments is largely automatically generated, and highly searchable.
  • It describes the data object (the file) in application-rich terms and provides a URI to a data service that can resolve an abstract unique identifier to a real, online data file (a resolver sketch follows this list).
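A minimal sketch of such a resolver is shown below. The interface, class names, identifier, and URL are all illustrative assumptions, not the actual LEAD data service API.

    import java.net.URI;
    import java.util.Map;
    import java.util.Optional;

    // Hypothetical resolver: maps an abstract, location-independent data product
    // identifier to a concrete, online location.
    interface DataResolver {
        Optional<URI> resolve(String abstractId);
    }

    class InMemoryResolver implements DataResolver {
        private final Map<String, URI> table;
        InMemoryResolver(Map<String, URI> table) { this.table = table; }
        public Optional<URI> resolve(String abstractId) {
            return Optional.ofNullable(table.get(abstractId));
        }

        public static void main(String[] args) {
            // Identifier and URL below are made up for the example.
            DataResolver r = new InMemoryResolver(Map.of(
                    "lead:product/abc123", URI.create("http://datastore.example.org/files/abc123.nc")));
            System.out.println(r.resolve("lead:product/abc123").orElseThrow());
        }
    }

Because the catalog records only the abstract identifier, files can move between storage systems without breaking the metadata; clients re-resolve the identifier whenever they need the current location.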

17
Searching for experiments using model
configuration parameters; two attributes selected
18
Searching for experiments based on model
parameters; four experiments returned, one displayed
19
How forecast model configuration parameters are
stored in the personal catalog
The forecast model configuration file is handed off to
a plugin that shreds the XML document into queryable
attributes associated with the experiment (a sketch follows below).
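As a rough illustration of that shredding step, the sketch below walks a configuration document and flattens its leaf elements into dotted-path name/value attributes. The element names and the file name are invented for the example; the actual LEAD plugin and the LMS schema differ.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class ConfigShredder {
        // Flatten leaf elements of an XML config into dotted-path attribute pairs,
        // e.g. <forecast><duration>6</duration></forecast> becomes forecast.duration = 6.
        static Map<String, String> shred(Document doc) {
            Map<String, String> attrs = new LinkedHashMap<>();
            walk(doc.getDocumentElement(), "", attrs);
            return attrs;
        }

        private static void walk(Element el, String prefix, Map<String, String> attrs) {
            String path = prefix.isEmpty() ? el.getTagName() : prefix + "." + el.getTagName();
            NodeList children = el.getChildNodes();
            boolean hasElementChild = false;
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i) instanceof Element child) {
                    hasElementChild = true;
                    walk(child, path, attrs);
                }
            }
            if (!hasElementChild) {
                attrs.put(path, el.getTextContent().trim());
            }
        }

        public static void main(String[] args) throws Exception {
            // "model-config.xml" is a placeholder file name for this example.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(Files.newInputStream(Path.of("model-config.xml")));
            shred(doc).forEach((k, v) -> System.out.println(k + " = " + v));
        }
    }

Each resulting attribute can then be stored in the personal catalog alongside the experiment and used as a search term, as in the screenshots on the previous slides.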
20
What & Why of Provenance
  • Derivation history of a data product
  • What application created the data (and when and where)
  • Its parameters and configuration
  • Other input data used by the application
  • Workflow is composed from building blocks like these, so provenance for the data used in a workflow gives a workflow trace

Data Provenance: Data.Out.1
  Process: Application_A
  Timestamp: 2006-06-23T12:45:23
  Host: tyr20.cs.indiana.edu
  Input: Data.In.1, Data.In.2
  Config: Config.A
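The record above maps naturally onto a small data structure. The sketch below reflects only the fields shown on the slide; it is an assumption about shape, not the Karma service's actual schema.

    import java.time.Instant;
    import java.util.List;

    // Hypothetical shape of one provenance record, mirroring the fields on the slide.
    record ProvenanceRecord(
            String output,        // data product produced, e.g. "Data.Out.1"
            String process,       // application that produced it, e.g. "Application_A"
            Instant timestamp,    // when it ran
            String host,          // where it ran
            List<String> inputs,  // input data products consumed
            String config) {      // configuration used
    }

Chaining such records, by matching each record's inputs against the outputs of earlier records, is what turns per-application provenance into the workflow trace mentioned above.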
21
The What & Why of Provenance
  • Trace Workflow Execution
  • What services were used during workflow execution?
  • Were all steps of the execution successful?
  • Audit Trail
  • What resources were used during workflow execution?
  • Data Quality & Reuse
  • What applications were used to derive data products?
  • Which workflows use a certain data product?
  • Attribution
  • Who performed the experiment?
  • Who owns the workflow data products?
  • Discovery
  • Locate data generated by a workflow
  • Locate workflows containing App-X that succeeded

22
Collection Framework
Simmhan, Y., et al., A Framework for Collecting Provenance in Data-Centric Scientific Workflows, ICWS, 2006.
23
Generating Karma Provenance Activities
  • Instrument applications to publish provenance
  • A simple Java library is available to
  • Create provenance activities
  • Publish activities as messages
  • Jython wrapper scripts use the library to publish provenance and invoke the application
  • Generic Factory toolkit easily converts applications into web services
  • Built-in provenance instrumentation

24
Sample Sequence of Activities
  • appStarted(App1)
  • info(App1 starting)
  • fileReceiveStarted(File1)
  • -- do gridftp get to stage input file File1 --
  • fileReceiveFinished(File1)
  • fileConsumed(File1)
  • computationStarted(Code1)
  • -- call Fortran code Code1 to process input
    files --
  • computationFinished(Code1)
  • fileProduced(File2)
  • fileSendStarted(File2)
  • -- do gridftp put to save output file File2 --
  • fileSendFinished(File2)
  • publishURL(File2)
  • appFinishedSuccess(App1, File2) or appFinishedFailed(App1, ERR)
  • flush() (a wrapper sketch follows this list)
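Below is a minimal sketch of how a wrapper might emit this activity sequence around an application run. The ActivityPublisher interface and all method names are assumptions for illustration; the actual Karma Java library API is not reproduced here.

    import java.util.Arrays;

    // Hypothetical provenance publisher; the real Karma library's classes and method
    // names may differ. Each call would be serialized and sent as a notification on
    // the event bus (e.g. via WS-Messenger).
    interface ActivityPublisher {
        void activity(String name, String... args); // e.g. activity("fileConsumed", "File1")
        void flush();                               // force out any buffered activities
    }

    public class InstrumentedApp {
        // Wrap an application run with the activity sequence listed on the slide.
        static void run(ActivityPublisher p) {
            p.activity("appStarted", "App1");
            p.activity("info", "App1 starting");
            p.activity("fileReceiveStarted", "File1");
            stageInput("File1");                         // e.g. GridFTP get of the input file
            p.activity("fileReceiveFinished", "File1");
            p.activity("fileConsumed", "File1");
            p.activity("computationStarted", "Code1");
            boolean ok = invokeCode("Code1");            // call the Fortran code
            p.activity("computationFinished", "Code1");
            if (ok) {
                p.activity("fileProduced", "File2");
                p.activity("fileSendStarted", "File2");
                storeOutput("File2");                    // e.g. GridFTP put of the output file
                p.activity("fileSendFinished", "File2");
                p.activity("publishURL", "File2");
                p.activity("appFinishedSuccess", "App1", "File2");
            } else {
                p.activity("appFinishedFailed", "App1", "ERR");
            }
            p.flush();
        }

        // Placeholders standing in for real staging and execution logic.
        static void stageInput(String file) {}
        static boolean invokeCode(String code) { return true; }
        static void storeOutput(String file) {}

        public static void main(String[] args) {
            run(new ActivityPublisher() {
                public void activity(String name, String... a) {
                    System.out.println(name + Arrays.toString(a));
                }
                public void flush() {}
            });
        }
    }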

25
Performance perturbation
26
Standalone tool for provenance collection and
experience reuse (future direction)
27
Forecast start time can also be set to trigger
on severe weather conditions (not shown here)
28
Weather triggered workflows
  • Goal is cyberinfrastructure that allows
    scientists and students to run weather models
    dynamically and adaptively in response to weather
    events.
  • Accomplished by coupling event processing with triggered forecast workflows
  • Vijayakumar et al. (2006) presented a framework for this purpose
  • The event-processing system does temporal and spatial filtering.
  • A storm detection algorithm (SDA) detects storm events in the remaining streams
  • The SDA returns detected storm events
  • The event-processing system generates a trigger to the workflow engine

29
Continuous stream mining
  • In stream mining of weather, events of interest
    are anomalies
  • Event processing queries can be deployed to sites
    in the LEAD grid (rectangles)
  • Data streams delivered to each site through
    Unidata Internet Data Dissemination system
  • CEP enables real-time response to the weather

[Diagram legend: query, computation node, data generation source.]
30
Example CEP query
  • Scientists can set up a 6-hour weather forecast over a region, say a 700 sq. mile bounding box, and submit a workflow that will run sometime in the future
  • A CEP query detects severe storm conditions developing in the region
  • The forecast workflow is started at a future point in time, as determined by the CEP query (a query sketch follows this list)
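In Calder this would be expressed as an SQL-based continuous query; the plain-Java sketch below only illustrates the same logic (spatial filter, sliding time window, trigger to the workflow engine). All thresholds, names, and the triggerWorkflow hook are illustrative assumptions, not Calder's actual query syntax.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayDeque;
    import java.util.Deque;

    // Watch storm-detection events over a bounding box and start the forecast
    // workflow once enough events fall inside a sliding time window.
    public class StormTrigger {
        record StormEvent(double lat, double lon, Instant when) {}

        private final double minLat, maxLat, minLon, maxLon;
        private final Duration window;
        private final int threshold;
        private final Deque<Instant> recent = new ArrayDeque<>();

        StormTrigger(double minLat, double maxLat, double minLon, double maxLon,
                     Duration window, int threshold) {
            this.minLat = minLat; this.maxLat = maxLat;
            this.minLon = minLon; this.maxLon = maxLon;
            this.window = window; this.threshold = threshold;
        }

        // Called for every event arriving on the input stream.
        void onEvent(StormEvent e) {
            if (e.lat() < minLat || e.lat() > maxLat || e.lon() < minLon || e.lon() > maxLon) {
                return;                                   // spatial filter: outside the region
            }
            recent.addLast(e.when());
            while (!recent.isEmpty() && recent.peekFirst().isBefore(e.when().minus(window))) {
                recent.removeFirst();                     // temporal filter: drop stale events
            }
            if (recent.size() >= threshold) {
                triggerWorkflow();                        // notify the workflow engine
                recent.clear();
            }
        }

        void triggerWorkflow() {
            System.out.println("Severe weather detected: start the forecast workflow");
        }
    }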

31
Stream Provenance Tracking
  • Data stream provenance: the derivation history of a data product, where the data product is a derived, time-bounded stream
  • Stream provenance can establish correlations between significant events (e.g., storm occurrences)
  • Anticipate resource needs by examining provenance data and discovering trends in weather forecast model output
  • Determine when the next wave of users will arrive, and where their resources might need to be allocated

32
Stream processing as part of cyberinfrastructure
  • SQL-based queries respond to input streams event-by-event within a stream and concurrently across streams
  • Each query generates a time-bounded output stream

33
Provenance Service in Calder
[Diagram legend: process flow / invocation, Calder internal messaging, WS-Messenger notifications.]
34
Provenance Update Handling Scalability
  • Update processing time: the time taken from the instant a user sends a notification to the instant the provenance service completes the corresponding update
  • Experiment (a harness sketch follows this list)
  • Bombard the provenance service at different update rates by simulating many clients sending provenance updates simultaneously
  • Measure the incoming rate at the provenance service and the overall time taken to handle each update.
  • Overhead includes the time to create the message, send and receive it through WS-Messenger, process it, and store it in the DB
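A rough sketch of such a load generator is shown below; sendUpdate stands in for creating a provenance message and publishing it through WS-Messenger, and the client counts and rates are placeholders, not the values used in the actual experiment.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Simulate many clients publishing provenance updates concurrently and record
    // per-update handling time.
    public class UpdateLoadTest {
        // Placeholder for building the message and publishing it through WS-Messenger.
        static void sendUpdate(int client, int seq) throws InterruptedException {
            Thread.sleep(5);   // stand-in for transport plus service-side DB store
        }

        public static void main(String[] args) throws Exception {
            int clients = 20, updatesPerClient = 50;      // illustrative load only
            ExecutorService pool = Executors.newFixedThreadPool(clients);
            List<Future<List<Long>>> results = new ArrayList<>();
            for (int c = 0; c < clients; c++) {
                final int id = c;
                results.add(pool.submit(() -> {
                    List<Long> latencies = new ArrayList<>();
                    for (int i = 0; i < updatesPerClient; i++) {
                        long start = System.nanoTime();
                        sendUpdate(id, i);
                        latencies.add((System.nanoTime() - start) / 1_000_000); // ms
                    }
                    return latencies;
                }));
            }
            long total = 0, count = 0;
            for (Future<List<Long>> f : results) {
                for (long ms : f.get()) { total += ms; count++; }
            }
            pool.shutdown();
            System.out.printf("average update handling time: %.1f ms over %d updates%n",
                    (double) total / count, count);
        }
    }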

35
  • Problem
  • Severe weather can bring many storms over a local
    region of interest
  • It is infeasible and unnecessary to run the weather model in response to each of them
  • Solution
  • Group storm events into spatial clusters
  • Trigger model runs in response to clusters of
    storms

36
Spatial Clustering: DBSCAN algorithm
  • DBSCAN is a density-based clustering algorithm; it can do spatial clustering when location parameters are treated as features.
  • The DBSCAN algorithm has two parameters:
  • eps: the radius within which a point is considered to be a neighbor of another point
  • minPts: the minimum number of neighboring points a point must have to be considered a core point.
  • These two parameters determine the clustering result (a sketch follows below).

Mining work done by Xiang Li, University of
Alabama Huntsville
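A compact sketch of DBSCAN over 2D storm-event locations follows, assuming plain Euclidean distance in latitude/longitude space. It illustrates the algorithm itself and is not the implementation used in the study.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Deque;
    import java.util.List;

    // Minimal DBSCAN over 2D points (e.g. storm-event locations). Returns a cluster
    // label per point; -1 marks noise.
    public class Dbscan {
        record Point(double x, double y) {}

        static int[] cluster(List<Point> pts, double eps, int minPts) {
            int n = pts.size();
            int[] label = new int[n];
            Arrays.fill(label, Integer.MIN_VALUE);               // MIN_VALUE = unvisited
            int clusterId = -1;
            for (int i = 0; i < n; i++) {
                if (label[i] != Integer.MIN_VALUE) continue;
                List<Integer> seeds = neighbors(pts, i, eps);
                if (seeds.size() < minPts) { label[i] = -1; continue; }   // tentatively noise
                clusterId++;
                label[i] = clusterId;                             // i is a core point
                Deque<Integer> queue = new ArrayDeque<>(seeds);
                while (!queue.isEmpty()) {
                    int j = queue.poll();
                    if (label[j] == -1) label[j] = clusterId;     // noise reached: border point
                    if (label[j] != Integer.MIN_VALUE) continue;  // already assigned
                    label[j] = clusterId;
                    List<Integer> jn = neighbors(pts, j, eps);
                    if (jn.size() >= minPts) queue.addAll(jn);    // j is also core: keep expanding
                }
            }
            return label;
        }

        // eps-neighborhood of point i, including i itself (as in the standard definition).
        static List<Integer> neighbors(List<Point> pts, int i, double eps) {
            List<Integer> out = new ArrayList<>();
            for (int j = 0; j < pts.size(); j++) {
                double dx = pts.get(i).x() - pts.get(j).x();
                double dy = pts.get(i).y() - pts.get(j).y();
                if (Math.sqrt(dx * dx + dy * dy) <= eps) out.add(j);
            }
            return out;
        }
    }

Because clusters grow only where points are dense, storm events in the same 15-minute interval naturally group into a small number of regional clusters, each of which can trigger one model run.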
37
Data
  • WSR-88D radar data on 3/27/2007
  • Total of 134 radar sites covering CONUS
  • The time period examined is between 1:00 pm and 6:00 pm EST.
  • The 5-hour period is divided into 20 time intervals of 15 minutes each. Storm events within the same time interval are clustered.

Storm events detected between 1:00 pm and 1:15 pm
Mining work done by Xiang Li, University of
Alabama Huntsville
38
Algorithm comparison: DBSCAN and k-means
Time period: 1:00 pm to 1:15 pm
Number of clusters: 3
Conclusion: the DBSCAN algorithm performs better than
the k-means algorithm
39
Future Work
  • Publication of provenance to digital library
  • Generalized support for metadata systems
  • Enhanced support for mining triggers
  • Personal weather predictor
  • LEAD framework packaged onto a single 8-16 core multicore machine
  • Expands educational opportunities, suitable for small schools
  • Engage communities beyond meteorologists

40
Thank you for your interest. Thanks to my many
domain science and CS collaborators, to my
students, and to the funding agencies. Please feel
free to contact me at plale@indiana.edu.