Title: Scalable Knowledge Extraction from Legacy Sources with SEEK
1. Scalable Knowledge Extraction from Legacy Sources with SEEK
- Joachim Hammer
- Dept. of CISE
- University of Florida
- jhammer_at_cise.ufl.edu
- 3-June-2003
2. Outline of Talk
- Motivation
- SEEK Information Architecture
- Knowledge Extraction
- Schema Extraction
- Code Analysis
- Status and Future Work
3. SEEK Project
- Faculty
- Joachim Hammer
- Mark Schmalz
- William O'Brien
- Ray Issa
- Joe Geunes
- Sherman Bai
- Current Students
- Oguzhan Topsakal
- Mingxi Wu
- Ivan Mutis
- Haiyan Xie
- Bibo Yang
- Bo Lu
- Departments: Computer Science, Building Construction, Industrial Engineering
- Sponsored by NSF
- Year 2 of 4
4. Motivation
- Need for integrated access to intelligence sources in support of national-security-related applications
- Ability to rapidly sift through massive volumes of data for situational analysis and planning
- Hard: sources have unique and often incompatible information systems with varying levels of sophistication
- E.g., Web, public records, sensor networks, state and federal databases, etc.
- Current integration approaches rely on manual coding of connection software
- Not scalable
- Development of a toolkit to facilitate integration of heterogeneous legacy data and knowledge
5. Information Environment
[Figure labels: reporting authority / collaborative analysis]
Many computers, many users, many information needs
6. Application Areas
- Homeland Defense
- Threat prediction and detection
- Emergency Management
- Emergency response planning, damage assessment
- Extended Enterprise/Supply Network
- Decision/negotiation support to improve
performance and customization
7. Information Integration Problem
- Globally dispersed data
- Within individual organizations
- Among different organizations
- Little will or ability to share the data
- Privacy/security concerns
- Mismatch in schema/data representation
- Data distributed over a wide area
- Very hard to deal with!
8. SEEK Environment Context
[Figure labels: coordinator/lead; Agency (x3); SEEK (x3); Decision Support/Analysis; Secure Hosting Infrastructure]
9. SEEK Information Architecture
[Figure labels: Legacy Source; Connection Toolkit; SEEK Components; Domain Expert; Application; Knowledge Extraction Module; Analysis Module; Legacy Data and Systems; Wrapper]
- Secure, value-added extraction of source data
- AM (Analysis Module): query analysis, knowledge composition (mediation)
- W (Wrapper): source connection and translation
- KEM (Knowledge Extraction Module): configuration of W and AM at build-time
10. Run-Time Querying and Analysis
- Different information contexts: application, analysis module, source
- Translator needed to convert between information contexts
- Assume existence of translator between AM and application contexts
- Analysis module provides robust (value-added) mediation
- Solution strategy based on information available in source
- Capable of composing final answer out of multiple source results
- SEEK wrapper responsible for syntactic and semantic conversions
- Formulates source queries based on capabilities of source
- Restructures source results to conform to information context of AM
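The division of labor between wrapper and analysis module might be sketched as follows. This is a minimal Python illustration under stated assumptions, not SEEK's implementation: the class names `Wrapper` and `AnalysisModule`, the attribute mappings, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Wrapper:
    """Hypothetical wrapper: translates between the AM context and one source."""
    rows: list        # stand-in for the data held by the legacy source
    to_source: dict   # AM attribute name -> source column name

    def query(self, am_fields):
        # Formulate the source query using the source's own column names ...
        cols = [self.to_source[f] for f in am_fields if f in self.to_source]
        # ... then restructure the results to conform to the AM's context.
        back = {src: am for am, src in self.to_source.items()}
        return [{back[c]: row[c] for c in cols} for row in self.rows]


class AnalysisModule:
    """Hypothetical mediator: composes one answer out of several source results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def answer(self, am_fields):
        merged = []
        for wrapper in self.wrappers:
            merged.extend(wrapper.query(am_fields))
        return merged


# Two sources with incompatible column names (sample data is invented).
source_a = Wrapper(rows=[{"T_NAME": "Pour slab", "T_ST_D": "2003-06-03"}],
                   to_source={"task": "T_NAME", "start": "T_ST_D"})
source_b = Wrapper(rows=[{"name": "Frame walls", "begin": "2003-06-10"}],
                   to_source={"task": "name", "start": "begin"})

am = AnalysisModule([source_a, source_b])
result = am.answer(["task", "start"])
```

Both sources answer in the AM's vocabulary (`task`, `start`), so the analysis module can merge them without knowing either source's column names.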
11. Build-Time Knowledge Extraction
- Extract information about legacy source to facilitate development of wrapper and configuration of AM
- Produces description of accessible knowledge in source
- Schema extraction from data source
- Analysis of application code to augment schema with semantics and extract business rules
- Schema matching to infer mappings between information context of AM/application and that of legacy source
- Quality and accuracy of extracted knowledge (and hence of the wrapper and AM) improve over time and with human input
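The schema matching step can be illustrated with a simple name-similarity heuristic. The slides do not describe SEEK's matching algorithm, so this sketch swaps in a plain string comparison from Python's standard `difflib`; the function names, the AM terms, and the threshold of 0.4 are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a name and drop underscores before comparing."""
    return name.lower().replace("_", "")


def name_similarity(a, b):
    """Similarity in [0, 1] between two attribute names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def propose_mappings(am_terms, source_columns, threshold=0.4):
    """For each AM term, propose the best-scoring source column, if any."""
    mappings = {}
    for term in am_terms:
        best = max(source_columns, key=lambda col: name_similarity(term, col))
        if name_similarity(term, best) >= threshold:
            mappings[term] = best
    return mappings


# Column names taken from the legacy schema; the AM terms are hypothetical.
source_columns = ["RES_NAME", "T_NAME", "T_DUR", "AVAIL_FROM"]
candidates = propose_mappings(["ResourceName", "TaskDuration"], source_columns)
```

Mappings proposed this way are only candidates. Abbreviated legacy names such as `T_DUR` score poorly against spelled-out terms, which is one reason human validation remains part of the loop.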
12. Architectural Overview
[Figure labels: Domain Model; Domain Ontology; Data Reverse Engineering (DRE); Schema Extractor (SE); Semantic Analyzer (SA); Schema Matching (SM); Legacy Source; Legacy Application Code; Legacy DB. Flow labels: Schema Information; Embedded Queries; "revise, validate"; "train, validate"; "Schema, semantics, business rules"; "Mapping rules" (to wrapper generator)]
13. Schema Extraction
- Based on data reverse engineering algorithms, e.g., Chiang 94/95, Petit et al. 96
- Reduced dependency on human input
- Eliminated limitations (e.g., consistent naming, legacy schema in 3NF)
- Use database catalog to directly extract concepts and simple constraints
- Use database instances to infer relationships and constraints
- Interact with code analysis to augment schema with semantics
- Produces E/R-like representation of the entities, relationships, and constraints
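Reading concepts and simple constraints from the database catalog could look roughly like this. The sketch uses SQLite from the Python standard library purely as a stand-in for the legacy DBMS; the function name `extract_catalog` and the column types in the sample tables are assumptions, since the slides list only column names.

```python
import sqlite3


def extract_catalog(conn):
    """Read table names, columns, and simple constraints from the catalog."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
        schema[table] = {
            "columns": [col[1] for col in columns],
            "primary_key": [col[1] for col in columns if col[5] > 0],
            "not_null": [col[1] for col in columns if col[3] == 1],
        }
    return schema


# Two tables modeled on the legacy schema; the types are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Proj (P_ID INTEGER PRIMARY KEY, P_NAME TEXT NOT NULL);
    CREATE TABLE Res  (PROJ_ID INTEGER NOT NULL, RES_UID INTEGER NOT NULL,
                       RES_NAME TEXT, PRIMARY KEY (PROJ_ID, RES_UID));
""")
catalog = extract_catalog(conn)
```

The catalog yields only what was declared. Relationships that the legacy schema never declared have to be inferred from the data instances instead.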
14. Semantic Analysis
- Identify semantic descriptions for database schema items in application code
- E.g., trace database schema names back to output statements
- Use code slicing to reduce application code to only those statements that are of interest to the analyzer (Horwitz, Reps 92)
- Apply pattern matcher
- Discover associations among variables
- Identify patterns that encode business information
- E.g., business rules encoded in IF-THEN-ELSE statements
- Versions for C, C++, and Java
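Tracing a schema name back to an output statement can be sketched with two regular expressions. This is a deliberately crude stand-in: the semantic analyzer described above works on sliced code with a pattern matcher, whereas this sketch scans raw text. The function names are hypothetical, and the sample code is the scheduling application from slide 17 with its lost punctuation (`:`, `=`, `%`, quotes) restored by assumption.

```python
import re

CODE = '''
EXEC SQL SELECT T_ST_D, T_FIN_D INTO :aValue, :cValue FROM T WHERE T_PRITY = :bValue;
printf("Task Start Date %d", aValue);
'''


def bind_host_variables(code):
    """Map each host variable to the column a SELECT ... INTO binds it to."""
    bindings = {}
    pattern = r"SELECT\s+(.+?)\s+INTO\s+(.+?)\s+FROM\s+(\w+)"
    for cols, hosts, table in re.findall(pattern, code, flags=re.S):
        columns = [c.strip() for c in cols.split(",")]
        variables = [v.strip().lstrip(":") for v in hosts.split(",")]
        for var, col in zip(variables, columns):
            bindings[var] = f"{table}.{col}"
    return bindings


def label_columns(code, bindings):
    """Attach the literal text of an output statement to the column it prints."""
    meanings = {}
    pattern = r'printf\s*\(\s*"([^"%]*)[^"]*"\s*,\s*(\w+)'
    for text, var in re.findall(pattern, code):
        if var in bindings:
            meanings[bindings[var]] = text.strip()
    return meanings


bindings = bind_host_variables(CODE)
meanings = label_columns(CODE, bindings)
```

Here `aValue` is bound to `T.T_ST_D` by the embedded query and later printed under the label "Task Start Date", so the cryptic column name acquires a human-readable meaning.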
15. DRE Implementation
[Figure: DRE pipeline. Input labels: Legacy Source; Application Code; Data; DB Interface Module; configuration]
- Step 1: AST Generation
- Step 2: Dictionary Extraction
- Step 3: Code Analysis
- Step 4: Inclusion Dependency Mining
- Step 5: Relation Classification
- Step 6: Attribute Classification
- Step 7: Entity Identification
- Step 8: Relationship Classification
[Other figure labels: Queries; AST; Metadata Repository; Business Knowledge; Schema; Knowledge Encoder; XML DTD; XML DOC; To Schema Matcher]
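Inclusion dependency mining, one of the pipeline stages named on this slide, can be illustrated by a brute-force subset test over column values. This is a sketch of the general idea only, not the algorithm SEEK uses; the function name and the sample rows are hypothetical.

```python
def inclusion_dependencies(tables):
    """Return (A.x, B.y) pairs where every value of A.x also occurs in B.y."""
    # Collect the set of distinct values held by each column of each table.
    columns = {
        (name, col): {row[col] for row in rows}
        for name, rows in tables.items() if rows
        for col in rows[0]
    }
    found = []
    for (left_table, left_col), left_vals in columns.items():
        for (right_table, right_col), right_vals in columns.items():
            # Only compare columns of different tables; skip empty columns.
            if left_table != right_table and left_vals and left_vals <= right_vals:
                found.append((f"{left_table}.{left_col}",
                              f"{right_table}.{right_col}"))
    return sorted(found)


# Invented instance data for two relations of the legacy schema.
tables = {
    "Proj": [{"P_ID": 1}, {"P_ID": 2}, {"P_ID": 3}],
    "Res": [{"PROJ_ID": 1, "RES_UID": 10}, {"PROJ_ID": 3, "RES_UID": 11}],
}
deps = inclusion_dependencies(tables)
```

Every `Res.PROJ_ID` value appears in `Proj.P_ID`, which suggests a foreign-key-like link between the two relations. Coincidental value overlaps produce false candidates, so results of this kind still need validation.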
16. Legacy Database Schema
- Proj (P_ID, P_NAME, DES_S, DES_F, A_S, A_F, ...)
- Avail (PROJ_ID, AVAIL_UID, RES_ID, AVAIL_FROM, AVAIL_TO, UNITS)
- Res (PROJ_ID, RES_UID, RES_NAME, R_ACWP, R_BCWP, R_BCWS, ...)
- T (PROJ_ID, T_UID, T_ID, T_NAME, T_DUR, T_ST_D, T_FIN_D, ...)
- Assn (PROJ_ID, ASSN_UID, T_UID, R_ID, ASSN_BASE_C, ASSN_ACT_W, ...)
- ...
17. Scheduling Application
/* program for task scheduling */
char aValue;
char cValue;
int bValue = 0;
/* more code */
EXEC SQL SELECT T_ST_D, T_FIN_D INTO :aValue, :cValue
    FROM T WHERE T_PRITY = :bValue;
/* more code */
int flag = 0;
if (cValue < aValue)
{
    flag = 1; /* exception handling */
}
/* more code */
printf("Task Start Date %d", aValue);
18. Extracted Conceptual Schema
[Figure: E/R diagram. Entities: Proj, Res, Assn, T, Avail. Relationships: three labeled "has", with cardinality labels 1, N, and M. Attributes shown: P_ID, P_Name, Des_S, Proj_ID, Res_UID, Res_Name, Res_ID, Avail_UID, T_ID, T_UID]
19. Result of Code Analysis
20. Extracted Business Rules
- Variables have been replaced by their extracted
meaning (to the extent that they are known)
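The substitution described here might be sketched as follows. The rule text and the meaning table are taken from the scheduling application example; the function name `describe_rule` and the fallback to a qualified column name where no label was found are assumptions.

```python
import re


def describe_rule(condition, meanings):
    """Rewrite a condition, substituting known meanings for variable names."""
    def substitute(match):
        name = match.group(0)
        # Variables with no extracted meaning are left exactly as written.
        return meanings.get(name, name)
    return re.sub(r"[A-Za-z_]\w*", substitute, condition)


# aValue was printed as "Task Start Date"; cValue only has its column name.
meanings = {"aValue": "Task Start Date", "cValue": "T.T_FIN_D"}
rule = describe_rule("cValue < aValue", meanings)
unknown = describe_rule("flag == 1", meanings)
```

The first rule becomes `T.T_FIN_D < Task Start Date`, while `flag == 1` is returned unchanged because nothing is known about `flag`.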
21. Current Status and Future Research
- Current
- Implemented interactive knowledge extraction prototype consisting of SE and SA (supply chain and construction domains)
- Developing schema matching module
- Application of SEEK toolkit to emergency response system
- Data collection in cooperation with City of Gainesville Fire Rescue
- Application to management of EOC planned
- Future
- Development of analysis module
- Enhance DRE with ability to improve with time and usage cases
22. Summary and Conclusion
- SEEK is a structured approach to integrating domain-specific legacy sources
- Modular architecture provides several important capabilities
- (Semi)automatic knowledge extraction
- DRE, semantic analysis, schema matching
- Important contributions to theory of knowledge capture and integration
- Requirement for building scalable sharing architectures
- Enabling technology for (semi)automatic ontology creation
- Enabler for Semantic Web?
23. More Info
- M. S. Schmalz, J. Hammer, M. Wu, and O. Topsakal, "EITH - A unifying representation for database schema and application code in enterprise knowledge extraction." To be presented at 22nd International Conference on Conceptual Modeling (ER 2003), Chicago, IL, 2003.
- "Scalable Extraction of Enterprise Knowledge." Conditionally accepted for publication in Research Frontiers in Supply Chain Management and E-Commerce, E. Akcaly, J. Geunes, P.M. Pardalos, H.E. Romeijn, and Z.J. Shen (eds.). Kluwer Science Series in Applied Optimization. (Accepted for publication in 2004.)
- "SEEKing knowledge in legacy information systems to support interoperability." ECAI-02 Workshop on Ontologies and Semantic Interoperability, Lyon, France, July 21-26, 2002.
- "SEEK: accomplishing enterprise information integration across heterogeneous sources." ITCON Journal of Information Technology in Construction, Special Edition on Knowledge Management, 7, pp. 101-124, 2002.
- "Robust mediation of supply chain information." ASCE Specialty Conference on Fully Integrated and Automated Project Processes (FIAPP) in Civil Engineering, Blacksburg, VA, September 26-28, 2001, 415-425.
- Web Site: http://www.dbcenter.cise.ufl.edu/seek/