Title: The Grid: Bringing data producers and consumers closer?
1The Grid: Bringing data producers and consumers
closer?
- Mark Gahegan
- GeoVISTA Center
- Penn State University, USA
2Status quo
- Producers and consumers of data have become far
removed from each other
- Data producers cannot anticipate what consumers
might do with the data
- Example: conference signs
- The burden of metadata production is significant
- Documenting standards ignored to various degrees
- Are metadata standards failing?
- Many realities: the way we describe the world
keeps changing
According to Heraclitus, panta rhei: everything
is in flux. But what gives that flux its form is
the logos: the words or signs that enable us to
perceive patterns in the flux, remember them,
talk about them, and take action upon them even
while we ourselves are part of the flux we are
acting in and on. --(John Sowa)
3What is The Grid (e-Science)? e.g. The
geosciences network (GEON) in the USA
Following David Ribes, GEON meta researcher
4How does the Grid change anything?
- The Grid brings data producers and data consumers
closer than ever before
- Grid and web service architectures broker the
access to data, and so provide opportunities for
gathering and deploying new kinds of metadata
5Ola Ahlqvist, Penn State
6Example: the metadata Big Five -- uncertainty
- Sinton (1978) and many others since defined the
important aspects of metadata to be positional
error, temporal error, measurement or thematic
error, consistency and completeness
- Sometimes lineage is included
7Are they practical to gather?
- In some cases, the answer is yes, e.g. primary
surveying data producers
- Means of recording accuracy in space, time and
value/theme do exist
- Means of recording consistency, completeness and
lineage are for the most part experimental; there
are no accepted standards
8Are they practical to use?
- Clearly there are examples of metadata fields
that are very useful and easy to deploy, e.g. map
projection and datum
- But often the answer is no, at least not in a
quantitative way
- The propagation of error through data combination
is extremely difficult to achieve in practice
- Communicating uncertainty, either statistically
or visually, is still a research topic
- Communicating lineage is likewise experimental
- Metadata would help more if there were more direct
interaction with it as part of analysis in
existing GIS
- Uncertainty metadata may help in a qualitative
way, but the current onus is on the user: how should
they make sense of it?
9If data could talk?
Is the data suitable / optimal for my current
task?
Do I trust the people who produced this data?
If I dig here, might I hit a gas main?
- Has it been used in this way before, and if so
was it a success? By whom?
Where are the problems / missing values?
Will this work? Will I use it right?
10Approaches to producing metadata
- Use existing and emerging metadata standards
(perspective of data producer, onus on data
producer)
- User ranking and feedback
  - What works? What is missing? What is known?
  What is unknown?
- Use-case logging: monitor use via a web portal /
library, warehouse
  - Use counts by web domains to differentiate
  user communities and to measure impact and
  value to the intended user communities
- Use-case mining and analysis
  - Discover significant usage patterns, use these to
  infer relevance, e.g. recommender systems
- Genesis, derivation, workflows
  - By exposing, analyzing and documenting the means
  by which the dataset was produced
- Ontology mining
  - Ontology creation from either schema (metadata)
  or content (data)
11User ranking and feedback
- Example listing: Virtual Reality in Geography
(Geographic Information Systems Workshop) by Peter
Fisher (Editor), et al (Hardcover - January 15,
2002) (Rate this item). Usually ships in 24 hours.
List price 99.95. Buy new 99.95.
Hypothesis: users have valuable things to say
about the products they use
13Use case logging
- Log each interaction with the cyberinfrastructure
that provides access to data; keep tallies
Hypothesis: knowing what is popular is helpful
when making choices
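
The logging idea on this slide can be sketched in a few lines. This is a minimal illustration only, not the portal's actual logging code: the `Interaction` record, the `UsageLog` class, the host names and the resource names are all hypothetical, and the "web domain" is approximated crudely as the last two labels of the host name.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One logged access to a data resource (all fields are illustrative)."""
    user_host: str   # host name the request came from
    resource: str    # identifier of the dataset or service accessed
    action: str      # e.g. "search", "view", "download"


def web_domain(host: str) -> str:
    """Crude community label: the last two labels of the host name."""
    return ".".join(host.lower().split(".")[-2:])


class UsageLog:
    """Keeps running tallies per resource and per (resource, web domain)."""

    def __init__(self) -> None:
        self.by_resource = Counter()
        self.by_resource_domain = Counter()

    def record(self, event: Interaction) -> None:
        self.by_resource[event.resource] += 1
        key = (event.resource, web_domain(event.user_host))
        self.by_resource_domain[key] += 1

    def most_popular(self, n: int = 3):
        return self.by_resource.most_common(n)


# Hypothetical sample interactions.
log = UsageLog()
for host, resource, action in [
    ("gis.psu.edu", "gravity_grid", "download"),
    ("www.usgs.gov", "gravity_grid", "view"),
    ("geo.psu.edu", "gravity_grid", "download"),
    ("www.usgs.gov", "fault_map", "view"),
]:
    log.record(Interaction(host, resource, action))

print(log.most_popular(1))
print(log.by_resource_domain[("gravity_grid", "psu.edu")])
```

Tallying by (resource, domain) rather than by resource alone is what lets the counts distinguish user communities, as suggested under "Approaches to producing metadata".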
14Use-case mining and analysis
- Learn from the use-cases (Recommender systems)
- Who created this resource?
- When was it created?
- How often has it been used?
- Has it been modified recently?
- Who has used it?
- What has it been used with?
Hypothesis: I can learn from the actions of others
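
Several of the questions above (how often, by whom, used with what) can be answered directly from a use-case log. The following is a small sketch under stated assumptions: the log is a list of hypothetical (user, session, resource) triples, and "used with" is taken to mean "accessed in the same session". Neither assumption comes from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical use-case log: (user, session id, resource) triples.
LOG = [
    ("ana", "s1", "gravity_grid"),
    ("ana", "s1", "fault_map"),
    ("ben", "s2", "gravity_grid"),
    ("ben", "s2", "fault_map"),
    ("ben", "s2", "kriging_tool"),
    ("cai", "s3", "gravity_grid"),
]


def times_used(log, resource):
    """How often has it been used?"""
    return sum(1 for _, _, r in log if r == resource)


def users_of(log, resource):
    """Who has used it?"""
    return sorted({u for u, _, r in log if r == resource})


def used_with(log, resource):
    """What has it been used with? Counts co-occurrence within a session."""
    sessions = defaultdict(set)
    for _, session, r in log:
        sessions[session].add(r)
    counts = Counter()
    for items in sessions.values():
        if resource in items:
            counts.update(items - {resource})
    return counts.most_common()


print(times_used(LOG, "gravity_grid"))
print(users_of(LOG, "gravity_grid"))
print(used_with(LOG, "gravity_grid"))
```

The co-occurrence counts returned by `used_with` are the simplest form of the "people who used this also used" signal that recommender systems build on.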
15Define situations of use
16Mining association rules from use-case logs
- Association rules are mined from user action logs
- Uses the WEKA (Waikato Environment for Knowledge
Analysis) API, which implements the Apriori
algorithm (Agrawal, R. and Srikant, R., 1994)
- Tools added for data preprocessing and
classifying
  - The attribute selector allows the user to select
  a subset of data attributes
  - Data filters allow the user to define filters to
  convert String, Time and Numeric data in any
  attribute column to nominal data for association
  mining
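
To make the mining step concrete, here is a compact pure-Python sketch of the Apriori idea applied to a toy action log. It is not the WEKA API and not the tool described on this slide; the function names, the cut points and the sample rows are assumptions made for illustration. The `to_nominal` helper stands in for a numeric data filter that converts a number to a nominal label before mining.

```python
from itertools import combinations


def to_nominal(value, cut_points, labels):
    """Numeric filter: map a number to a label; needs len(labels) == len(cut_points) + 1."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)

    def support(itemset):
        return sum(1 for t in sets if itemset <= t) / n

    current = {frozenset([item]) for t in sets for item in t}
    frequent = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) tuples meeting min_confidence."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                confidence = sup / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, sup, confidence))
    return found


# Hypothetical log rows: (dataset, tool, session length in minutes).
rows = [
    ("gravity_grid", "kriging", 42),
    ("gravity_grid", "kriging", 95),
    ("gravity_grid", "buffer", 5),
    ("fault_map", "buffer", 3),
]
transactions = [
    {f"data={d}", f"tool={t}",
     "duration=" + to_nominal(m, [10, 60], ["short", "medium", "long"])}
    for d, t, m in rows
]

frequent = apriori(transactions, min_support=0.5)
found_rules = association_rules(frequent, min_confidence=0.9)
for lhs, rhs, sup, conf in found_rules:
    print(sorted(lhs), "=>", sorted(rhs), f"support={sup:.2f} confidence={conf:.2f}")
```

The minimum support and confidence thresholds play the role of the sensitivity settings exposed by the mining tools; lowering them yields more rules, many of them weaker.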
17Data mining tools (association rules)
(Screenshot panels: Results sensitivity settings;
Design; Attribute Selector; Data Filter - String;
Data Filter - Numeric; Data Filter - Time)
18Applying results of mining, e.g. musicplasma.com
19Genesis, derivation, workflows
- Define how a dataset was created
Hypothesis: we may not be able to quantify
uncertainty for your use-case, but we can show
you exactly what we did!
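
One way to "show exactly what we did" is to store, with each dataset, the operation and inputs that produced it. The sketch below assumes a very simple model in which each dataset has at most one producing step; the `Step` and `Dataset` classes, the operation names and the parameters are hypothetical and are not taken from any particular workflow system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One processing step: an operation applied to input datasets."""
    operation: str
    inputs: List["Dataset"]
    parameters: dict = field(default_factory=dict)


@dataclass
class Dataset:
    """A dataset plus the step that produced it (None for source data)."""
    name: str
    derived_by: Optional[Step] = None


def lineage(ds: Dataset, depth: int = 0) -> List[str]:
    """Walk back through the producing steps and describe each one."""
    pad = "  " * depth
    if ds.derived_by is None:
        return [f"{pad}{ds.name} (source data)"]
    step = ds.derived_by
    params = ", ".join(f"{k}={v}" for k, v in step.parameters.items())
    lines = [f"{pad}{ds.name} <- {step.operation}({params})"]
    for parent in step.inputs:
        lines.extend(lineage(parent, depth + 1))
    return lines


# Hypothetical derivation chain.
stations = Dataset("gravity_stations")
grid = Dataset("gravity_grid",
               Step("interpolate", [stations],
                    {"method": "kriging", "cell_size_m": 500}))
anomaly = Dataset("gravity_anomaly",
                  Step("subtract_regional_trend", [grid], {"order": 2}))

print("\n".join(lineage(anomaly)))
# gravity_anomaly <- subtract_regional_trend(order=2)
#   gravity_grid <- interpolate(method=kriging, cell_size_m=500)
#     gravity_stations (source data)
```

A record like this does not quantify uncertainty, but it lets a consumer judge whether each step is acceptable for their own use-case.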
20Ontology mining
- Recent work has shown that ontologies can be
built by mining database schema (e.g. GEON
portal)
- Ontologies can also be built by analyzing the
data itself
- Having a domain ontology is useful for anchoring
Hypothesis: metadata descriptions that are
inferred are useful, consistent and not
burdensome, since they can be mined
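
As a rough illustration of what mining a schema could mean at its very simplest, the sketch below turns table and column definitions into subject-predicate-object statements. It is a toy and makes no claim about how the GEON portal or any real ontology-mining system works; the schema, the predicate names and the rule that a foreign key implies a relation between concepts are all assumptions.

```python
# Hypothetical database schema: tables, their columns, and foreign keys.
SCHEMA = {
    "rock_sample": {
        "columns": ["sample_id", "lithology", "age_ma", "site_id"],
        "foreign_keys": {"site_id": "site"},
    },
    "site": {
        "columns": ["site_id", "latitude", "longitude"],
        "foreign_keys": {},
    },
}


def schema_to_triples(schema):
    """Derive simple (subject, predicate, object) statements from a schema."""
    triples = []
    for table, spec in schema.items():
        # Each table is treated as a candidate concept.
        triples.append((table, "is_a", "Concept"))
        for column in spec["columns"]:
            target = spec["foreign_keys"].get(column)
            if target is not None:
                # A foreign key suggests a relation between two concepts.
                triples.append((table, "related_to", target))
            else:
                # Any other column is treated as a property of the concept.
                triples.append((table, "has_property", column))
    return triples


triples = schema_to_triples(SCHEMA)
for triple in triples:
    print(triple)
```

Because the statements are derived mechanically, they cost the data producer nothing extra to create, which is the point of the hypothesis above; whether they are useful still depends on anchoring them to a domain ontology.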
21Codex: ways of understanding a data resource
22implemented as a web portal
Bill Pike, PSU
http://hero.geog.psu.edu/codex
23Concept maps (gravitational anomaly)
24 extend to data and methods
25and to people
Also to articles / papers using Citeseer metadata
26A learning activity integrated with semantic
search
Conceptual Space
List of concepts
Embedded learning activity (resource)
Searching Digital Libraries for content relating
to selected concepts (DLESE, ADL)
27Conclusions
- Assertion 1: Current attempts to gather and
utilize metadata are failing
- Assertion 2: The burden of tagging existing and
future data with user-relevant metadata is
overwhelming. We cannot realistically expect data
producers to carry this burden alone.
- Many different approaches to metadata creation
are open to us.
- Some are new, facilitated by grid and web
service brokered access to e-resources.
- We need to try some of these on a large scale.
- Incentives, rewards