Scalable Knowledge Extraction from Legacy Sources with SEEK

1
Scalable Knowledge Extraction from Legacy Sources with SEEK
  • Joachim Hammer
  • Dept. of CISE
  • University of Florida
  • jhammer_at_cise.ufl.edu
  • 3-June-2003

2
Outline of Talk
  • Motivation
  • SEEK Information Architecture
  • Knowledge Extraction
  • Schema Extraction
  • Code Analysis
  • Status and Future Work

3
SEEK Project
  • Faculty
    • Joachim Hammer
    • Mark Schmalz
    • William O'Brien
    • Ray Issa
    • Joe Geunes
    • Sherman Bai
  • Current students
    • Oguzhan Topsakal
    • Mingxi Wu
    • Ivan Mutis
    • Haiyan Xie
    • Bibo Yang
    • Bo Lu
  • Departments: Computer Science, Building
    Construction, Industrial Engineering
  • Sponsored by NSF
  • Year 2 of 4

4
Motivation
  • Need for integrated access to intelligence
    sources in support of national security related
    applications
  • Ability to rapidly sift through massive volumes
    of data for situational analysis and planning
  • Hard: sources have unique and often incompatible
    information systems with varying levels of
    sophistication
  • E.g., Web, public records, sensor networks, state
    and federal databases, etc.
  • Current integration approaches rely on manual
    coding of connection software, which is not
    scalable
  • Development of a toolkit to facilitate
    integration of heterogeneous legacy data and
    knowledge

5
Information Environment
[Figure labels: reporting authority / collaborative analysis. Many computers, many users, many information needs.]

6
Application Areas
  • Homeland Defense
    • Threat prediction and detection
  • Emergency Management
    • Emergency response planning, damage assessment
  • Extended Enterprise/Supply Network
    • Decision/negotiation support to improve
      performance and customization

7
Information Integration Problem
  • Globally dispersed data
  • Within individual organizations
  • Among different organizations
  • Little will or ability to share the data
  • Privacy/security concerns
  • Mismatch in schema/data representation
  • Data distributed over a wide area
  • Very hard to deal with!

8
SEEK Environment Context
[Figure labels: coordinator/lead; Decision Support/Analysis; three Agency nodes, each shown with a SEEK node; Secure Hosting Infrastructure.]

9
SEEK Information Architecture
[Figure labels: Legacy Source (Legacy Data and Systems); Connection Toolkit; SEEK Components (Wrapper, Analysis Module, Knowledge Extraction Module); Domain Expert; Application. Caption: Secure, value-added extraction of source data.]
  • AM: query analysis, knowledge composition
    (mediation)
  • W: source connection and translation
  • KEM: configuration of W and AM at build-time

10
Run-Time Querying and Analysis
  • Different information contexts: application,
    analysis module, source
    • Translator needed to convert between information
      contexts
    • Assume existence of translator between AM and
      application contexts
  • Analysis module provides robust (value-added)
    mediation
    • Solution strategy based on information available
      in source
    • Capable of composing final answer out of multiple
      source results
  • SEEK wrapper responsible for syntactic and
    semantic conversions
    • Formulates source queries based on capabilities
      of source
    • Restructures source results to conform to
      information context of AM
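
The wrapper's two duties listed on this slide (formulating source queries and restructuring source results) can be illustrated with a minimal sketch. It is not the SEEK wrapper. The `MAPPING` table, the AM-side names (`Task`, `name`, `start`), the text-valued date columns, and the use of SQLite as a stand-in legacy source are all assumptions made for illustration; only the legacy table and column names (`T`, `T_UID`, `T_NAME`, `T_ST_D`, `T_FIN_D`) come from the deck.

```python
import sqlite3

# Hypothetical mapping rules from the analysis module's (AM) vocabulary
# to the legacy table and column names. In SEEK such rules would come
# from build-time knowledge extraction; here they are written by hand.
MAPPING = {
    "Task": {
        "table": "T",
        "fields": {
            "id": "T_UID",
            "name": "T_NAME",
            "start": "T_ST_D",
            "finish": "T_FIN_D",
        },
    }
}


def build_source_query(concept, wanted):
    """Translate an AM-level request into SQL over the legacy schema."""
    rule = MAPPING[concept]
    columns = ", ".join(rule["fields"][name] for name in wanted)
    return f"SELECT {columns} FROM {rule['table']}"


def restructure(wanted, rows):
    """Turn positional source rows into records keyed by AM field names."""
    return [dict(zip(wanted, row)) for row in rows]


def query_source(conn, concept, wanted):
    """Run the translated query and return results in the AM's context."""
    sql = build_source_query(concept, wanted)
    return restructure(wanted, conn.execute(sql).fetchall())


def demo():
    # An in-memory database stands in for the legacy source.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE T (T_UID INTEGER, T_NAME TEXT, T_ST_D TEXT, T_FIN_D TEXT)"
    )
    conn.execute(
        "INSERT INTO T VALUES (1, 'Pour foundation', '2003-06-03', '2003-06-10')"
    )
    return query_source(conn, "Task", ["name", "start"])
```

Calling `demo()` returns one record keyed by the AM field names `name` and `start` rather than by the legacy column names, which is the restructuring step in miniature.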

11
Build-Time Knowledge Extraction
  • Extract information about legacy source to
    facilitate development of wrapper and
    configuration of AM
  • Produces description of accessible knowledge in
    source
    • Schema extraction from data source
    • Analysis of application code to augment schema
      with semantics and extract business rules
  • Schema matching to infer mappings between the
    information context of the AM/application and
    that of the legacy source
  • Quality and accuracy of extracted knowledge (and
    hence of the wrapper and AM) improve over time
    and with human input

12
Architectural Overview
[Figure labels: Domain Model; Domain Ontology; Data Reverse Engineering (DRE), comprising Schema Extractor (SE) and Semantic Analyzer (SA); Schema Matching (SM); Legacy Source, comprising Legacy DB and Legacy Application Code. Flow labels: Schema Information; Embedded Queries; revise, validate; train, validate; Schema, semantics, business rules; Mapping rules (to wrapper generator).]

13
Schema Extraction
  • Based on data reverse engineering algorithms,
    e.g., Chiang 94/95, Petit et al. 96
  • Reduced dependency on human input
  • Eliminated limitations (e.g., consistent naming,
    legacy schema in 3-NF)
  • Use database catalog to directly extract concepts
    and simple constraints
  • Use database instances to infer relationships and
    constraints
  • Interact with code analysis to augment schema
    with semantics
  • Produces E/R-like representation of the entities,
    relationships, and constraints
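
A minimal sketch of two of the steps above: reading concepts and simple constraints from the database catalog, then using database instances to test candidate inclusion dependencies. It does not reproduce the algorithms of Chiang or Petit et al. SQLite and its `sqlite_master` table and `PRAGMA table_info` call stand in for the legacy catalog, the sample rows are invented, and the function names are hypothetical. An inclusion that holds in the data is only a candidate relationship, and an empty child table would satisfy every test trivially.

```python
import sqlite3


def extract_catalog(conn):
    """Read table names, columns, and primary keys from the SQLite catalog."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = {
            "columns": [row[1] for row in info],
            "primary_key": [row[1] for row in info if row[5] > 0],
        }
    return schema


def inclusion_holds(conn, child, child_col, parent, parent_col):
    """True if every non-null child value also occurs in the parent column."""
    query = (
        f'SELECT COUNT(*) FROM "{child}" AS c '
        f'WHERE c."{child_col}" IS NOT NULL '
        f'AND NOT EXISTS (SELECT 1 FROM "{parent}" AS p '
        f'WHERE p."{parent_col}" = c."{child_col}")'
    )
    return conn.execute(query).fetchone()[0] == 0


def mine_inclusion_dependencies(conn, schema):
    """Test each column of the other tables against every single-column key."""
    found = []
    for parent, parent_info in schema.items():
        if len(parent_info["primary_key"]) != 1:
            continue  # composite keys are outside this sketch
        key = parent_info["primary_key"][0]
        for child, child_info in schema.items():
            if child == parent:
                continue
            for column in child_info["columns"]:
                if inclusion_holds(conn, child, column, parent, key):
                    found.append((child, column, parent, key))
    return found


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Proj (P_ID INTEGER PRIMARY KEY, P_NAME TEXT);
        CREATE TABLE T (T_UID INTEGER PRIMARY KEY, PROJ_ID INTEGER, T_NAME TEXT);
        INSERT INTO Proj VALUES (1, 'Alpha'), (2, 'Beta');
        INSERT INTO T VALUES (10, 1, 'Survey'), (11, 2, 'Design'), (12, 1, 'Build');
        """
    )
    schema = extract_catalog(conn)
    return schema, mine_inclusion_dependencies(conn, schema)
```

On the invented sample data, the only inclusion that holds is `T.PROJ_ID` within `Proj.P_ID`, which suggests a relationship between tasks and projects that a human expert could then confirm.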

14
Semantic Analysis
  • Find semantic descriptions of database schema
    items in the application code
    • E.g., trace database schema names back to output
      statements
  • Use code slicing to reduce the application code to
    only those statements that are of interest to the
    analyzer (Horwitz, Reps 92)
  • Apply pattern matcher
    • Discover associations among variables
    • Identify patterns that encode business
      information
    • E.g., business rules encoded in IF-THEN-ELSE
      statements
  • Versions for C, C++, and Java
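
The idea behind code slicing can be shown with a deliberately simplified sketch: a backward slice over straight-line code that follows data dependences only. This is far weaker than the slicing technique cited above (no control dependence, no procedures), and the `Stmt` record with hand-written `defs` and `uses` sets is an assumption; a real analyzer would derive these sets from the AST. The statements are modeled on the deck's scheduling application.

```python
from dataclasses import dataclass, field


@dataclass
class Stmt:
    text: str
    defs: set = field(default_factory=set)  # variables the statement writes
    uses: set = field(default_factory=set)  # variables the statement reads


def backward_slice(program, criterion_index):
    """Keep the statements the criterion statement depends on through data.

    Assumes straight-line code in which every definition overwrites the
    previous value of the variable.
    """
    needed = set(program[criterion_index].uses)
    kept = [criterion_index]
    for i in range(criterion_index - 1, -1, -1):
        stmt = program[i]
        if stmt.defs & needed:
            kept.append(i)
            needed -= stmt.defs  # these values are now explained
            needed |= stmt.uses  # and their inputs become relevant
    return [program[i].text for i in sorted(kept)]


def demo():
    program = [
        Stmt("int bValue = 0;", defs={"bValue"}),
        Stmt(
            "EXEC SQL SELECT T_ST_D, T_FIN_D INTO :aValue, :cValue "
            "FROM T WHERE T_PRITY = :bValue;",
            defs={"aValue", "cValue"},
            uses={"bValue"},
        ),
        Stmt("int flag = 0;", defs={"flag"}),
        Stmt("if (cValue < aValue) flag = 1;", defs={"flag"}, uses={"cValue", "aValue"}),
        Stmt('printf("Task Start Date %d", aValue);', uses={"aValue"}),
    ]
    # Slice with respect to the output statement, as when tracing a
    # schema name back from an output statement.
    return backward_slice(program, 4)
```

Slicing on the `printf` statement drops the two `flag` statements and keeps the query and the initialization of `bValue`, so the analyzer only has to inspect the statements that connect column `T_ST_D` to the printed label.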

15
DRE Implementation
[Figure labels: Legacy Source (Application Code, Data); DB Interface Module; configuration; Queries; AST; Metadata Repository; Business Knowledge; Schema; Knowledge Encoder; XML DTD; XML DOC; To Schema Matcher.]
Numbered steps shown in the figure:
  • 1. AST Generation
  • 2. Dictionary Extraction
  • 3. Code Analysis
  • 4. Inclusion Dependency Mining
  • 5. Relation Classification
  • 6. Attribute Classification
  • 7. Entity Identification
  • 8. Relationship Classification

16
Legacy Database Schema
  • Proj: P_ID, P_NAME, DES_S, DES_F, A_S, A_F, ...
  • Avail: PROJ_ID, AVAIL_UID, RES_ID, AVAIL_FROM,
    AVAIL_TO, UNITS
  • Res: PROJ_ID, RES_UID, RES_NAME, R_ACWP, R_BCWP,
    R_BCWS, ...
  • T: PROJ_ID, T_UID, T_ID, T_NAME, T_DUR, T_ST_D,
    T_FIN_D, ...
  • Assn: PROJ_ID, ASSN_UID, T_UID, R_ID, ASSN_BASE_C,
    ASSN_ACT_W, ...
  • ...

17
Scheduling Application
  • /* program for task scheduling */
  • char *aValue;
  • char *cValue;
  • int bValue = 0;
  • /* more code */
  • EXEC SQL SELECT T_ST_D, T_FIN_D INTO :aValue,
    :cValue FROM T WHERE T_PRITY = :bValue;
  • /* more code */
  • int flag = 0;
  • if (cValue < aValue)
  • flag = 1; /* exception handling */
  • /* more code */
  • printf("Task Start Date %d", aValue);

18
Extracted Conceptual Schema
[E/R diagram labels: Proj, Res, Assn, T, Avail; relationship name "has"; cardinalities 1, N, M; attributes Proj_ID, P_ID, P_Name, Des_S, Res_UID, Res_Name, Res_ID, Avail_UID, T_ID, T_UID.]

19
Result of Code Analysis
20
Extracted Business Rules
  • Variables have been replaced by their extracted
    meaning (to the extent that they are known)
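
The substitution described here can be sketched as follows. This is an illustration only, not the SEEK semantic analyzer: the regular expression handles just the simple `SELECT ... INTO ... FROM` shape used in the scheduling application, and the function names are hypothetical. Host variables bound by the embedded query are replaced by the columns they hold, and variables with no known binding are left untouched.

```python
import re


def bind_host_variables(sql):
    """Map host variables in 'SELECT cols INTO :vars FROM table' to table.column."""
    match = re.search(
        r"SELECT\s+(.*?)\s+INTO\s+(.*?)\s+FROM\s+(\w+)",
        sql,
        re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        return {}
    columns = [c.strip() for c in match.group(1).split(",")]
    variables = [v.strip().lstrip(":") for v in match.group(2).split(",")]
    table = match.group(3)
    return {var: f"{table}.{col}" for var, col in zip(variables, columns)}


def substitute(expression, bindings):
    """Replace each identifier that has a known meaning; keep the rest as-is."""

    def swap(match):
        name = match.group(0)
        return bindings.get(name, name)

    return re.sub(r"[A-Za-z_]\w*", swap, expression)


def demo():
    sql = (
        "EXEC SQL SELECT T_ST_D, T_FIN_D INTO :aValue, :cValue "
        "FROM T WHERE T_PRITY = :bValue"
    )
    bindings = bind_host_variables(sql)
    condition = substitute("cValue < aValue", bindings)
    action = substitute("flag = 1", bindings)
    return f"IF {condition} THEN {action}"
```

For the scheduling application, `demo()` rewrites the condition `cValue < aValue` in terms of `T.T_FIN_D` and `T.T_ST_D`, while `flag` stays as written because nothing in the query explains it.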

21
Current Status and Future Research
  • Current
    • Implemented interactive knowledge extraction
      prototype consisting of SE and SA (supply chain
      and construction domains)
    • Developing schema matching module
    • Application of SEEK toolkit to emergency response
      system
      • Data collection in cooperation with City of
        Gainesville Fire Rescue
      • Application to management of EOC planned
  • Future
    • Development of analysis module
    • Enhance DRE with ability to improve with time and
      usage cases

22
Summary and Conclusion
  • SEEK is a structured approach to integrating
    domain-specific legacy sources
  • Modular architecture provides several important
    capabilities
  • (Semi)automatic knowledge extraction
  • DRE, semantic analysis, schema matching
  • Important contributions to theory of knowledge
    capture and integration
  • Requirement for building scalable sharing
    architectures
  • Enabling technology for (semi)automatic ontology
    creation
  • Enabler for Semantic Web?

23
More Info
  • M. S. Schmalz, J. Hammer, M. Wu, and O. Topsakal,
    "EITH - A unifying representation for database
    schema and application code in enterprise
    knowledge extraction." To be presented at 22nd
    International Conference on Conceptual Modeling
    (ER 2003), Chicago, IL, 2003.
  • Scalable Extraction of Enterprise Knowledge.
    Conditionally accepted for publication in
    Research Frontiers in Supply Chain Management and
    E-Commerce, E. Akcali, J. Geunes, P.M. Pardalos,
    H.E. Romeijn, and Z.J. Shen (eds.). Kluwer Science
    Series in Applied Optimization. (Accepted for
    publication in 2004.)
  • SEEKing knowledge in legacy information systems
    to support interoperability. ECAI-02 Workshop on
    Ontologies and Semantic Interoperability, Lyon,
    France, July 21-26, 2002.
  • SEEK: accomplishing enterprise information
    integration across heterogeneous sources. ITcon,
    Journal of Information Technology in
    Construction, Special Edition on Knowledge
    Management, 7, pp. 101-124, 2002.
  • Robust mediation of supply chain information.
    ASCE Specialty Conference on Fully Integrated and
    Automated Project Processes (FIAPP) in Civil
    Engineering, Blacksburg, VA, September 26-28,
    2001, pp. 415-425.
  • Web site: http://www.dbcenter.cise.ufl.edu/seek/