Digital Government Research Center (DGRC) Energy Data Collection (EDC)

1
Digital Government Research Center (DGRC) Energy
Data Collection (EDC)
  • Eduard Hovy, USC Information Sciences Institute
  • Judith L. Klavans, Columbia University
  • http://www.isi.edu/cardgis/about-cardgis.html

2
The Vision: Ask the Government...
We're thinking of moving to Denver... What are the
schools like there?
How many people had breast cancer in the area
over the past 30 years?
Is there an orchestra? An art gallery? How far
are the nightclubs?
How have property values in the area changed over
the past decade?
3
The Problem and the Solution
  • Problem: FedStats has thousands of databases in
    over seventy Government Agencies
    • data is duplicated and near-duplicated
    • even Government officials cannot find it!
  • Solution: Create a system to provide easy
    standardized access
    • 1. need powerful user interface
    • 2. need multi-database access engine
    • 3. need terminology standardization mechanism

4
Digital Government Research Center (DGRC)
  • Funded by NSF Digital Government program
  • Integration of research between Columbia
    University and USC/ISI
  • Enables synergy between near-term development and
    longer-term research
  • Cross-fertilization of complementary approaches
  • Frequent meetings with government sponsors

5
Project Goals
Enable access to multiple, heterogeneous Federal
agency data sources through a single interface
using standardized terminology via structured
metadata, while accounting for cross-agency and
cross-database semantic and syntactic variability.
6
Interface 2000 Focus
Enable access to multiple, heterogeneous Federal
agency data sources through a single interface
using standardized terminology built into an
ontology or knowledge structure via structured
metadata, while accounting for cross-agency and
cross-database semantic and syntactic variability.
7
System Architecture
8
Initial Scope
  • Domain: gasoline prices and sales.
  • Initial sources:
    • Energy Information Administration (quarterly CD-ROM)
    • Bureau of Labor Statistics (http://stats.bls.gov)
    • Census Bureau (CD-ROM for 1992 data)
    • California Energy Commission (weekly data at
      http://energy.ca.gov)

9
Three Core Technologies
  • 1. User Interface
  • 2. Multiple Data Source Access and Integration
  • 3. Standardized terminology and ontology

10
Terminology Standardization (Group name: Ontology
Group)
  • Purposes
    • 1. support automated linking of database metadata
      terms into standard nomenclature, with aliasing
    • 2. provide ontological framework for metadata
    • 3. analyze and contrast definitions of related
      concepts, to assist experts in various agencies.
  • Approach
    • ISI's SENSUS ontology (100,000 nodes)
    • Columbia's NL analysis technology
    • Columbia's Lexical Knowledge Base structure
    • ISI's concept-alignment techniques
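
The first purpose above, linking metadata terms into a standard nomenclature with aliasing, can be illustrated with a minimal sketch. The alias table, its entries, and the function names are hypothetical and are not taken from the DGRC system; the slides do not describe how the actual linking was implemented.

```python
# Hypothetical alias table: agency-specific metadata labels mapped to one
# standard term. The entries are illustrative, not taken from any agency.
ALIASES = {
    "motor gasoline": "gasoline",
    "mogas": "gasoline",
    "gasoline, all grades": "gasoline",
    "retail price": "price",
    "avg. retail price": "price",
}


def normalize(term: str) -> str:
    """Lower-case and collapse whitespace so lookups ignore formatting."""
    return " ".join(term.lower().split())


def standardize(term: str) -> str:
    """Return the standard term for a label, or the normalized label
    itself when no alias is known."""
    key = normalize(term)
    return ALIASES.get(key, key)


print(standardize("  Motor   Gasoline "))
print(standardize("Diesel"))
```

A flat lookup like this only covers exact aliases; the slides pair it with an ontology so that related but non-identical terms can also be compared.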

11
Ontology Extraction Progress
  • Built automatic analysis system to
    • convert definitions into structured
      attribute-value hierarchy
    • extract footnotes and attachment sites
  • Serves as the basis for cross-agency ontology
    linking
  • Export data to SENSUS for testing, identify and
    process additional data

12
Ontologies
  • Representing structure in knowledge from the real
    world
  • Many relations
    • a car is a vehicle with 4 wheels
    • a motorcycle is a vehicle with 2 wheels
    • a van is a vehicle for transporting goods
    • a bond is a vehicle for investing
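
One possible reading of these examples is that "vehicle" names two different concepts, so a bond should not end up as a sibling of a car. The sketch below encodes that reading with a small, hypothetical `Concept` class; the class, the two sense labels, and the helper function are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node with is-a links to parents and attribute-value pairs."""
    name: str
    parents: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)


# Two senses of "vehicle" kept as separate nodes (labels are assumptions).
vehicle_conveyance = Concept("vehicle (conveyance)")
vehicle_means = Concept("vehicle (means to an end)")

car = Concept("car", [vehicle_conveyance], {"wheels": 4})
motorcycle = Concept("motorcycle", [vehicle_conveyance], {"wheels": 2})
van = Concept("van", [vehicle_conveyance], {"purpose": "transporting goods"})
bond = Concept("bond", [vehicle_means], {"purpose": "investing"})


def share_parent(a: Concept, b: Concept) -> bool:
    """True when the two concepts have at least one parent node in common."""
    return any(p is q for p in a.parents for q in b.parents)


print(share_parent(car, van))
print(share_parent(car, bond))
```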

13
Building a Complex Ontology
[Diagram labels: Definitions, Glossaries, Text, Ontology, Metadata, Footnotes]
14
Building a Complex Ontology
[Diagram labels: Definitions, Ontology, Glossaries, Text, Metadata, Footnotes]
15
EDC Challenges
  • Extracting accurate and useful information from
    all sources
  • Relating information between sources
    • Multiple sources from within the same agency
    • Information comes from different agencies
  • Challenges
    • Merging when possible
    • Creating new entries when needed
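
The "merge when possible, create new entries when needed" idea can be sketched as follows. This is a minimal illustration that merges only on an exact match of the normalized term; the agency labels and placeholder definitions are made up, and the matching used in the actual project is not described in the slides.

```python
def merge_glossaries(*glossaries):
    """Combine (agency, {term: definition}) pairs into one table.

    Entries whose normalized terms match are merged under one key;
    any other term gets a new entry of its own.
    """
    merged = {}
    for agency, entries in glossaries:
        for term, definition in entries.items():
            key = " ".join(term.lower().split())
            merged.setdefault(key, {})[agency] = definition
    return merged


# Placeholder data for illustration only.
source_a = ("Agency A", {"Motor Gasoline": "placeholder definition 1"})
source_b = ("Agency B", {"motor  gasoline": "placeholder definition 2",
                         "Diesel": "placeholder definition 3"})

merged = merge_glossaries(source_a, source_b)
for term, by_agency in merged.items():
    print(term, sorted(by_agency))
```

Keeping each agency's definition side by side, rather than picking one, preserves the differences that experts may need to compare.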

16
DGRC Research
  • Gather glossaries, thesauri, definitions from
    participating agencies.
  • Extract ontological information
  • Structure and deliver to ISI for access and
    display
  • Link glossaries and ontology

17
Sample Definition
  • Input:
  • Motor Gasoline Blending Components: Naphthas
    (e.g., straight-run gasoline, alkylate,
    reformate, benzene, toluene, xylene) used for
    blending or compounding into finished motor
    gasoline. These components include reformulated
    gasoline blendstock for oxygenate blending (RBOB)
    but exclude oxygenates (alcohols, ethers),
    butane, and pentanes plus. Note: Oxygenates are
    reported as individual components and are
    included in the total for other hydrocarbons,
    hydrogens, and oxygenates.
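
To give a feel for what converting such a definition into attribute-value form might involve, here is a small pattern-based sketch. It is not the Columbia analysis system: the regular expressions, the field names, and the `parse_definition` function are assumptions, and they are tuned to this one definition rather than to glossary text in general.

```python
import re

TERM = "Motor Gasoline Blending Components"
TEXT = (
    "Naphthas (e.g., straight-run gasoline, alkylate, reformate, benzene, "
    "toluene, xylene) used for blending or compounding into finished motor "
    "gasoline. These components include reformulated gasoline blendstock "
    "for oxygenate blending (RBOB) but exclude oxygenates (alcohols, "
    "ethers), butane, and pentanes plus."
)


def parse_definition(term, text):
    """Pull a few attribute-value pairs out of a glossary definition."""
    record = {"term": term, "examples": [], "includes": None, "excludes": None}

    # Items listed after "e.g.," inside a parenthesis.
    m = re.search(r"\(e\.g\.,\s*([^)]*)\)", text)
    if m:
        record["examples"] = [item.strip() for item in m.group(1).split(",")]

    # An "include X but exclude Y." clause, kept as raw strings.
    m = re.search(r"\binclude\s+(.*?)\s+but exclude\s+(.*?)\.(?:\s|$)", text)
    if m:
        record["includes"] = m.group(1)
        record["excludes"] = m.group(2)
    return record


record = parse_definition(TERM, TEXT)
for key, value in record.items():
    print(key, "=", value)
```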

18
Methods
  • Symbolic analysis techniques
    • based on lexical knowledge base results
  • Statistical analysis techniques
    • provide fully automatic results
    • can be filtered and structured

19
Columbia Output
20
ISI's SENSUS Ontology
  • Taxonomy, multiple superclass links.
  • Approx. 90,000 items.
  • Top level: Penman Upper Model (ISI).
  • Body: WordNet (Princeton), rearranged.
  • Used at ISI for machine translation, text
    summarization, database access.

http://vigor.isi.edu:8002/sensus2/
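
A taxonomy with multiple superclass links is a directed acyclic graph rather than a tree, so a node can have more than one path to the top. The sketch below shows the idea with a toy parent table; the concept names and the `ancestors` helper are hypothetical and do not reflect actual SENSUS content or its interface.

```python
# Toy taxonomy: each concept maps to its direct superclasses.
# The names are illustrative and not taken from SENSUS.
PARENTS = {
    "gasoline": ["fuel", "petroleum product"],
    "fuel": ["substance"],
    "petroleum product": ["substance"],
    "substance": [],
}


def ancestors(node, parents=PARENTS):
    """Collect every superclass reachable from node, visiting each once."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("gasoline")))
```

The `seen` set matters here: with multiple superclass links, "substance" is reachable by two routes and should be reported only once.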
21
Status and Plans
  • Over 4000 new terms and definitions analyzed
  • Being loaded and structured in DGRC ontology
  • www.cs.columbia.edu/digigov

22
Progress
  • 1. Interface
    • initial prototype built and running
  • 2. Information Integration
    • wrappers around databases
    • new aggregation operators
  • 3. Ontology construction and integration
  • NSF Digital Government Program PI meeting in May
    2000
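
The "wrappers around databases" and "aggregation operators" items can be pictured with a rough sketch: each wrapper hides a source's native layout behind a common record format, and an aggregation function works on that common format. The class names, record fields, and sample figures below are all invented for illustration; the slides do not specify the project's actual wrapper interface or operators.

```python
from abc import ABC, abstractmethod
from statistics import mean


class SourceWrapper(ABC):
    """Presents one data source as a stream of uniform records."""

    @abstractmethod
    def rows(self):
        """Yield dicts with 'period' and 'price_usd' keys."""


class WeeklyCentsWrapper(SourceWrapper):
    """Hypothetical source that stores weekly prices in cents."""

    def __init__(self, raw):
        self.raw = raw

    def rows(self):
        for week, cents in self.raw:
            yield {"period": week, "price_usd": cents / 100}


def average_price(wrapper: SourceWrapper) -> float:
    """A simple aggregation operator over any wrapped source."""
    return mean(row["price_usd"] for row in wrapper.rows())


# Made-up figures, used only to exercise the code.
source = WeeklyCentsWrapper([("week 1", 150), ("week 2", 170)])
print(round(average_price(source), 2))
```

Because `average_price` depends only on the common record format, a wrapper for a differently structured source could be added without changing the operator.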