Title: SC 32/WG 2 Tutorial
1 JTC1 SC32 N1649
- SC 32/WG 2 Tutorial
- Metadata Registry Standards
- July 16, 2007
Bruce Bargmeyer, University of California, Berkeley and Lawrence Berkeley National Laboratory
Tel 1 510-495-2905, bebargmeyer_at_lbl.gov
2 Topics
- Standards development: OMG, ISO (TC 37, JTC 1/SC 32), W3C, OASIS
- Align, Coordinate, Integrate Standards, Recommendations, Specifications
- Semantics Challenges and Future Directions
3 Align, Coordinate, Integrate Standards
WG 2 doing OK internally
24707
11179 E3
19763
20944
4 Align, Coordinate, Integrate Standards
SC 32?
WG 1
WG 2
WG 3
WG 4
Clearwater meeting a step forward
5 Align, Coordinate, Integrate Standards/Recommendations/Specifications for Semantic Computing
Communities and their standards bodies:
Semantic Web: W3C (RDF)
Terminology: ISO TC 37
Object Management: OMG (MOF, ODM, CWM, IMM)
ISO/IEC 11179 Metadata Registries: ISO/IEC JTC 1/SC 32
Common graph model: Subject (Node), Predicate (Edge), Object (Node)
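The subject, predicate, object graph model named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not an RDF implementation: the predicate name `denotes` and the helper names `build_graph` and `objects` are assumptions made for the example, and the sample triples borrow the country codes used later in this tutorial.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object): two nodes joined by a labeled edge.
# The predicate name "denotes" is a hypothetical label chosen for this sketch.
TRIPLES = [
    ("DZ", "denotes", "Algeria"),
    ("DZA", "denotes", "Algeria"),
    ("012", "denotes", "Algeria"),
    ("FR", "denotes", "France"),
]

def build_graph(triples):
    """Index the triples by subject node: subject -> list of (edge, object node)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph

def objects(graph, subject, predicate):
    """Follow every edge labeled `predicate` that leaves the `subject` node."""
    return [obj for edge, obj in graph.get(subject, []) if edge == predicate]

GRAPH = build_graph(TRIPLES)
```

For example, `objects(GRAPH, "DZA", "denotes")` follows the single `denotes` edge from the node `DZA` to the node `Algeria`.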
6 Standards Development: Semantics Management and Semantics Services, Semantic Computing
Align, Co-develop, Fast Track, PAS Submission
OMG
W3C
ISO/IEC JTC 1 SC 32
ISO TC 37
7 Standards Development: Semantics Management and Semantics Services, Semantic Computing
Align, integrate, co-develop, Fast Track, PAS Submission
Can we coordinate content?
OMG
ISO/IEC JTC 1 SC 32
W3C
W3C
8 A Success
Some text and figures are identical in the two standards.
OMG: Ontology Definition Metamodel (ODM)
ISO/IEC JTC 1 SC 32: ISO/IEC 24707 Common Logic
9 Standards Development: Semantics Management and Semantics Services, Semantic Computing
Ongoing effort
ISO/IEC JTC 1 SC 32
ISO/IEC 11179 (Edition 3)
10 Standards Development: Semantics Management and Semantics Services, Semantic Computing
Possible effort
OMG
RFP - MOF? IMM
11179 E3 proposals
11 Standards Development: Semantics Management and Semantics Services, Semantic Computing
Hopeful?
OMG
IMM
ISO/IEC JTC 1 SC 32
ISO/IEC 11179 (Edition 3)
12 Other Possibilities
- OASIS ebXML Registry
- W3C Semantic Web Deployment WG
- TC 37
13 The Ageless Information Problem (cf. Data, Information, Knowledge, Wisdom)
- Getting the information that we need, when we need it, without afflicting the excellent minds of humans with toil and drudgery
- The litany
- Too much or too little, irrelevant, not authoritative, out of date
- Unknown quality, not trustable, lacks provenance, no certainty measures
- Difficult to find, difficult to access, difficult to use
- Meaning not clear, relationship to other information not clear
- Data creators do not have the same understanding of the data as end users
- Recorded data loses much real world meaning, context, relationships
- Much of the meaning of data is buried in the processes used to manipulate the data (e.g., in computer code)
- Need improvements in efficiency and effectiveness
- Every time we solve it, we re-create it.
14 New Semantics Capabilities Proposed for ISO/IEC 11179 MDR (Edition 3)
- Improve traditional data management/data administration
- Use stronger semantics management and semantics services capabilities
- Enable something new
- Semantic computing
15 Semantic Computing: The Nub of It
- Processing that takes meaning into account
- Makes use of concept systems, e.g., thesauri and/or ontologies
- Moves some of the meaning of data from computer code to managed semantics
- Processing that uses (e.g., reasons across) the relations between things, not just computing about the things themselves
- Processing that helps to take people out of the computation, reducing the human toil
- Semantics grounding for data, data discovery, extraction, mapping, translation, formatting, validation, inferencing, etc.
- Delivering higher-level results that are more helpful for the user's thought and action
16 In The Epic Information Struggle We Have Made Heroic Progress
Files
Computer Processing Cards Tape Disk
Machine Processing
17 In The Epic Information Struggle We Have Made Heroic Progress
- In structuring data and text --
- Structured Data
- Columns on cards, tape (possibly comma separated)
- Hierarchical (DBMS)
- Network
- Table (relational DBMS)
- Hierarchy (XML)
- Graph (RDF)
- Semi-structured text
- nroff, troff, LaTeX
- SGML
- HTML
- XML
18 In The Epic Information Struggle We Have Made Heroic Progress
- In documenting data and text (e.g., semantics management)
- Data Standards
- Code sets
- (Meta)Data Standards
- Data element definitions, valid values, value meanings
- Metadata registries (MDR, ISO/IEC 11179)
- Other standards as presented at this conference
- Concept systems (or KOS)
- Glossaries
- Dictionaries
- Thesauri
- Taxonomies
- Ontologies
- Graphs
19 Semantic Management: Proposals for 11179 Edition 3
- Improve data management through use of stronger semantics management
- Databases
- XML data
- Other traditional data
- Enable new wave of semantic computing
- Take meaning of data into account
- Process across relations as well as properties
- May use reasoning engines, e.g., to draw
inferences
20 Semantics Improve Data Management/Data Administration
Conceptual Domain: Agent
Object Class: Chemopreventive Agent
Valid Values: Cyclooxygenase Inhibitor, Doxercalciferol, Eflornithine, Ursodiol
Data Element Concept: Chemopreventive Agent NSC Number
Value Domain: NSC Code
Classification Schemes: caDSRTraining
Property: NSCNumber
Representation: Code
Data Element: Chemopreventive Agent Name
Context: caCORE
Enterprise Vocabulary Services (EVS) Concepts Unite NCI MDR
Source: Denise Warzel, National Cancer Institute
21 Semantic Computing Application: Find and process non-explicit data
For example: Patient data on drugs contains brand names (e.g., Tylenol, Anacin-3, Datril). However, we want to study patients taking analgesic agents.
Concept system:
Analgesic Agent
Non-Narcotic Analgesic
Analgesic and Antipyretic
Acetaminophen
Nonsteroidal Antiinflammatory Drug
Datril
Anacin-3
Tylenol
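A minimal sketch of how a concept system lets a query for "analgesic agents" find records that only mention brand names. The parent links below are an assumption reconstructed from the slide's labels (the original diagram is not recoverable from this text), and the patient records and function names are hypothetical.

```python
# Child -> parent ("is a kind of") links. The exact shape of this hierarchy is
# an assumption; only the concept labels come from the slide.
PARENT = {
    "Tylenol": "Acetaminophen",
    "Anacin-3": "Acetaminophen",
    "Datril": "Acetaminophen",
    "Acetaminophen": "Analgesic and Antipyretic",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Non-Narcotic Analgesic": "Analgesic Agent",
}

def ancestors(concept):
    """Return every broader concept reachable by following parent links."""
    found = []
    while concept in PARENT:
        concept = PARENT[concept]
        found.append(concept)
    return found

def is_a(concept, target):
    """True if `concept` is `target` or a narrower concept beneath it."""
    return concept == target or target in ancestors(concept)

# Hypothetical patient records that mention brand names only.
PATIENTS = [
    {"patient": "P1", "drug": "Tylenol"},
    {"patient": "P2", "drug": "Amoxicillin"},
    {"patient": "P3", "drug": "Datril"},
]

def patients_taking(records, target):
    """Select patients whose recorded drug falls under the target concept."""
    return [r["patient"] for r in records if is_a(r["drug"], target)]
```

Calling `patients_taking(PATIENTS, "Analgesic Agent")` selects P1 and P3 even though no record contains the words "analgesic agent"; the meaning comes from the managed concept system rather than from the query code.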
22 A Semantics Application: Specify and compute across Relations, e.g., within a food web in an Arctic ecosystem
An organism is connected to another organism for which it is a source of food energy and material by an arrow representing the direction of biomass transfer.
Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)
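Computing across relations can be sketched as a graph traversal over the "is a source of food for" arrows. The organisms and links below are illustrative placeholders, not the contents of the SPIRE food web, and the function name is hypothetical.

```python
from collections import deque

# Food source -> consumers. Each entry is an arrow in the direction of
# biomass transfer. These links are illustrative, not taken from SPIRE.
FEEDS = {
    "phytoplankton": ["zooplankton"],
    "zooplankton": ["Arctic cod"],
    "Arctic cod": ["ringed seal", "seabird"],
    "ringed seal": ["polar bear"],
}

def downstream_consumers(food_web, organism):
    """Breadth-first search for every organism that directly or indirectly
    receives biomass from `organism`."""
    seen = set()
    queue = deque([organism])
    while queue:
        current = queue.popleft()
        for consumer in food_web.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen
```

The result for `downstream_consumers(FEEDS, "zooplankton")` includes the polar bear, which never eats zooplankton directly. That answer comes from reasoning across the relation, not from any single stored fact.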
23 Semantics Application: Combine Data, Metadata, and Concept Systems
Inference Search Query: find water bodies downstream from Fletcher Creek where chemical contamination was over 10 micrograms per liter between December 2001 and March 2003
Concept system
Data
ID  Date      Temp  Hg
A   06-09-13  4.4   4
B   06-09-13  9.3   2
X   06-09-13  6.7   78
Metadata
Name  Datatype  Definition                     Units
ID    text      Monitoring Station Identifier  not applicable
Date  date      Date                           yy-mm-dd
Temp  number    Temperature (to 0.1 degree C)  degrees Celsius
Hg    number    Mercury contamination          micrograms per liter
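The following sketch shows how such a query might draw on all three sources at once. The data rows and the units come from the tables on this slide. Everything else is an assumption made for illustration: the station-to-water-body assignments, the `FLOWS_INTO` relation and its placeholder names, the reading of two-digit years as 20yy, and the function names.

```python
from datetime import date

# Data: rows from the slide.
DATA = [
    {"ID": "A", "Date": "06-09-13", "Temp": 4.4, "Hg": 4},
    {"ID": "B", "Date": "06-09-13", "Temp": 9.3, "Hg": 2},
    {"ID": "X", "Date": "06-09-13", "Temp": 6.7, "Hg": 78},
]

# Metadata: the parts of the slide's metadata table that the query needs.
METADATA = {
    "Date": {"format": "yy-mm-dd"},
    "Hg": {"definition": "Mercury contamination", "units": "micrograms per liter"},
}

# Concept system: hypothetical. The slide does not list these relations.
STATION_ON = {"A": "Fletcher Creek", "B": "Water Body 1", "X": "Water Body 2"}
FLOWS_INTO = {"Fletcher Creek": "Water Body 1", "Water Body 1": "Water Body 2"}

def parse_date(text):
    """Read a yy-mm-dd string, assuming (hypothetically) years 2000-2099."""
    yy, mm, dd = (int(part) for part in text.split("-"))
    return date(2000 + yy, mm, dd)

def downstream_of(body):
    """Follow the flows-into relation to list every downstream water body."""
    found = []
    while body in FLOWS_INTO:
        body = FLOWS_INTO[body]
        found.append(body)
    return found

def contaminated_downstream(start, limit, units, first, last):
    """Water bodies downstream of `start` with an Hg reading above `limit`
    taken between `first` and `last` (inclusive)."""
    if units != METADATA["Hg"]["units"]:
        raise ValueError("threshold units do not match the registered units for Hg")
    candidates = set(downstream_of(start))
    hits = set()
    for row in DATA:
        body = STATION_ON[row["ID"]]
        in_window = first <= parse_date(row["Date"]) <= last
        if body in candidates and row["Hg"] > limit and in_window:
            hits.add(body)
    return sorted(hits)
```

The sample rows are dated 06-09-13, so the slide's own window (December 2001 to March 2003) matches nothing under the 20yy assumption. Widening the window to the year 2006 returns the placeholder "Water Body 2", where station X recorded 78 micrograms per liter.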
24 Semantics Application: Use data from systems that record the same facts with different terms
- Reduce the human toil of drawing information
together and performing analysis.
25 Challenge: Use data from systems that record the same facts with different terms
Diagram: several kinds of registries hold common content, illustrated with Country Identifier.
Registry types: Database Catalogs, ISO 11179 Registries, UDDI Registries, OASIS/ebXML Registries, CASE Tool Repositories, Ontological Registries
Terms used for the common content: Table Column, Data Element, Business Specification, XML Tag, Attribute, Business Object, Term Hierarchy
Other label: Coverage
26 Same Fact, Different Terms
Data Elements
ISO 3166      ISO 3166        ISO 3166      ISO 3166      ISO 3166
2-Alpha Code  3-Numeric Code  English Name  French Name   3-Alpha Code
DZ            012             Algeria       L'Algérie     DZA
BE            056             Belgium       Belgique      BEL
CN            156             China         Chine         CHN
DK            208             Denmark       Danemark      DNK
EG            818             Egypt         Egypte        EGY
FR            250             France        La France     FRA
. . .         . . .           . . .         . . .         . . .
ZW            716             Zimbabwe      Zimbabwe      ZWE
Attributes of each data element: Name, Context, Definition, Unique ID 4572, Value Domain, Maintenance Org., Steward, Classification, Registration Authority, Others
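Because the five representations record the same fact, a registered mapping between them lets a program translate any one form into any other. The sketch below uses only the values shown on the slide; the class, field, and function names are assumptions made for the example.

```python
from typing import NamedTuple, Optional

class Country(NamedTuple):
    alpha2: str
    numeric3: str   # kept as text so leading zeros (e.g., "012") survive
    alpha3: str
    english: str
    french: str

# Values copied from the slide; the full ISO 3166 list is not reproduced here.
COUNTRIES = [
    Country("DZ", "012", "DZA", "Algeria", "L'Algérie"),
    Country("BE", "056", "BEL", "Belgium", "Belgique"),
    Country("CN", "156", "CHN", "China", "Chine"),
    Country("DK", "208", "DNK", "Denmark", "Danemark"),
    Country("EG", "818", "EGY", "Egypt", "Egypte"),
    Country("FR", "250", "FRA", "France", "La France"),
    Country("ZW", "716", "ZWE", "Zimbabwe", "Zimbabwe"),
]

# Index every representation (case-insensitively) back to its country.
INDEX = {value.casefold(): country for country in COUNTRIES for value in country}

def normalize(value: str, form: str = "alpha2") -> Optional[str]:
    """Translate any known representation into the requested form,
    or return None when the value is not registered."""
    country = INDEX.get(value.strip().casefold())
    return getattr(country, form) if country else None
```

For instance, `normalize("Chine")` and `normalize("156")` both resolve to the 2-alpha code for China, so two systems that record the same fact with different terms can be joined on a common key.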
27 Challenge: Draw information together from a broad range of studies, databases, reports, etc.
28 A semantics application: Information Extraction and Use
Extraction Engine
Segment, Classify, Associate, Normalize, Deduplicate
Discover patterns, Select models, Fit parameters, Inference, Report results
11179-3 (E3) XMDR
Actionable Information
Decision Support
29 Extraction Engines
- Find concepts and relations between concepts in text, tables, data, audio, video, etc.
- Produce databases (relational tables, graph structures), and other output
- Functions
- Segment: find text snippets (boundaries important)
- Classify: determines database field for text segment
- Association: which text segments belong together
- Normalization: put information into standard form
- Deduplication: collapse redundant information
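The five functions listed above can be sketched as a toy pipeline. This is a deliberately simple, rule-based illustration under stated assumptions, not how a production extraction engine works: the small code set, the field names `country` and `value`, and all function names are hypothetical.

```python
import re

# Illustrative code set (assumption): surface forms -> standard 2-alpha code.
COUNTRY_CODES = {"france": "FR", "fra": "FR", "egypt": "EG", "egypte": "EG"}
NUMBER = r"\d+(?:\.\d+)?"

def segment(text):
    """Segment: cut the text into snippets at sentence-like boundaries."""
    parts = re.split(r"[.;](?:\s+|$)", text)
    return [part.strip() for part in parts if part.strip()]

def classify(snippet):
    """Classify: decide which database field each recognized token fills."""
    pairs = []
    for token in re.findall(r"[A-Za-z]+|" + NUMBER, snippet):
        if token.casefold() in COUNTRY_CODES:
            pairs.append(("country", token))
        elif re.fullmatch(NUMBER, token):
            pairs.append(("value", token))
    return pairs

def associate(pairs):
    """Associate: treat the fields found in one snippet as one record."""
    return dict(pairs)

def normalize(record):
    """Normalize: put each field into a standard form."""
    clean = {}
    if "country" in record:
        clean["country"] = COUNTRY_CODES[record["country"].casefold()]
    if "value" in record:
        clean["value"] = float(record["value"])
    return clean

def deduplicate(records):
    """Deduplicate: drop repeated records, keeping first occurrences."""
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

def extract(text):
    """Run the five functions in order over free text."""
    records = [normalize(associate(classify(s))) for s in segment(text)]
    return deduplicate([r for r in records if r])
```

Given the text `"France reported 12.5. Egypte reported 7; FRA reported 12.5."`, the first and third snippets normalize to the same record and collapse into one, leaving two records. Note that the Normalize step depends entirely on the code set it is given.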
30 Metadata Registries are Useful
- Registered semantics
- For training extraction engines
- The Normalize function can make use of standard code sets that have mappings between representation forms.
- The Classify function can interact with pre-established concept systems.
- Provenance
- High precision for proper nouns, less precision (e.g., 70%) for other concepts -> impacts downstream processing; need to track precision
31 Challenge: Gain Common Understanding of meaning between Data Creators and Data Users
A common interpretation of what the data represents
Diagram: data creators (EEA, USGS, DoD, EPA, others) and users exchange text and data through information systems. The same numeric data appears under different vocabularies, e.g., "environ, agriculture, climate, human health, industry, tourism, soil, water, air" and "ambiente, agricultura, tiempo, salud humano, industria, turismo, tierra, agua, aero".
32 Practical Vocabulary Management
- Vocabulary Management is essential for use of semantic technologies
- Define concepts and relationships
- Harmonize terminology, resolve conflicts
- Collaborate with stakeholders
- An approach
- Select a domain of interest
- Enter core concepts and relationships
- Engage community in vocabulary review
- Harmonize, validate and vet the vocabulary
- Enter metadata describing enterprise data
- Link concept system to metadata
33 Use eXtended MDR Capabilities
- For vocabulary repository
- Register, harmonize, validate, and vet definitions and relations
- To register mappings between multiple vocabularies
- To register mappings of concepts to data
- To provide semantics services
- To register and manage the provenance of data
- 11179-3 (E3) is part of the infrastructure for semantics and data management.
- These capabilities are proposed for ISO/IEC 11179 Edition 3
34 11179 (E3) Use
- Upside
- Collaborative
- Supports interaction with community of interest
- Shared evolution and dissemination
- Enables Review Cycle
- Standards-based: don't lock semantics into proprietary technology
- Foundation for strategic data-centric applications
- Lays the foundation for Ontology-based Information Management
- Content is reusable for many purposes
- Downside
- Managing semantics is HARD WORK, no matter how friendly the tools
- Needs integration with other components
35 Some Challenges
- Data management and metadata management must evolve to address more complex data structures (relational, object, hierarchies, graphs)
- Query capabilities
- More than SQL, XQuery, SPARQL
- Discovery mechanisms
- More than Google
- Access, mining, extraction
- We need stronger semantics management
36 Metadata Registry Support for
- Registering and mapping ontologies
- Ontology Evolution
- Registering Process Ontologies
37 Thank You
- Acknowledgements
- Karlo Berket, LBNL
- Kevin Keck, LBNL
- John McCarthy, LBNL
- Harold Solbrig, Apelon
- This material is based upon work supported by the
National Science Foundation under Grant No.
0637122, USEPA and USDOD. Any opinions, findings,
and conclusions or recommendations expressed in
this material are those of the author(s) and do
not necessarily reflect the views of the National
Science Foundation, USEPA or USDOD.
- Bruce Bargmeyer
- Lawrence Berkeley National Laboratory
- Berkeley Water Center
- University of California, Berkeley
- Tel 1 510-495-2905
- bebargmeyer_at_lbl.gov