Title: Digital Government Research Center (DGRC) Energy Data Collection (EDC)
1. Digital Government Research Center (DGRC) Energy Data Collection (EDC)
- Eduard Hovy
- USC Information Sciences Institute
- Judith L. Klavans
- Columbia University
- http://www.isi.edu/cardgis/about-cardgis.html
2. The Vision: Ask the Government...
We're thinking of moving to Denver... What are the schools like there?
How many people had breast cancer in the area over the past 30 years?
Is there an orchestra? An art gallery? How far are the nightclubs?
How have property values in the area changed over the past decade?
3. The Problem and the Solution
- Problem: FedStats has thousands of databases in over seventy Government agencies
- data is duplicated and near-duplicated,
- even Government officials cannot find it!
- Solution: Create a system to provide easy, standardized access
- 1. need powerful user interface
- 2. need multi-database access engine
- 3. need terminology standardization mechanism
4. Digital Government Research Center (DGRC)
- Funded by NSF Digital Government program
- Integration of research between Columbia University and USC/ISI
- Enables synergy between near-term development and longer-term research
- Cross-fertilization of complementary approaches
- Frequent meetings with government sponsors
5. Project Goals
Enable access to multiple, heterogeneous Federal
agency data sources through a single interface
using standardized terminology via structured
metadata, while accounting for cross-agency and
cross-database semantic and syntactic variability.
6. Interface 2000 Focus
Enable access to multiple, heterogeneous Federal
agency data sources through a single interface
using standardized terminology built into an
ontology or knowledge structure via structured
metadata, while accounting for cross-agency and
cross-database semantic and syntactic variability.
7. System Architecture
8. Initial Scope
- Domain: gasoline prices and sales.
- Initial sources:
- Energy Information Administration (quarterly CD-ROM)
- Bureau of Labor Statistics (http://stats.bls.gov)
- Census Bureau (CD-ROM for 1992 data)
- California Energy Commission (weekly data at http://energy.ca.gov)
9. Three Core Technologies
- 1. User Interface
- 2. Multiple Data Source Access and Integration
- 3. Standardized terminology and ontology
10. Terminology Standardization (Group name: Ontology Group)
- Purposes
- 1. support automated linking of database metadata terms into standard nomenclature, with aliasing
- 2. provide ontological framework for metadata
- 3. analyze and contrast definitions of related concepts, to assist experts in various agencies.
- Approach
- ISI's SENSUS ontology (100,000 nodes)
- Columbia's NL analysis technology
- Columbia's Lexical Knowledge Base structure
- ISI's concept-alignment techniques
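Purpose 1 above (linking metadata terms into a standard nomenclature, with aliasing) can be illustrated with a minimal sketch. The alias table, the term names, and the `standardize` helper are hypothetical and are not taken from the DGRC system; plain dictionary lookup stands in for the NL analysis and concept-alignment techniques the slide lists.

```python
# Hypothetical alias table: agency-specific metadata term -> standard term.
# These entries are illustrative assumptions, not real agency metadata.
ALIASES = {
    "motor gasoline": "gasoline",
    "mogas": "gasoline",
    "gas price": "gasoline price",
    "retail gasoline price": "gasoline price",
}


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so lookups ignore formatting."""
    return " ".join(term.lower().split())


def standardize(term: str) -> str:
    """Return the standard term for a known alias, else the normalized term."""
    key = normalize(term)
    return ALIASES.get(key, key)


if __name__ == "__main__":
    for raw in ["Motor  Gasoline", "MOGAS", "Diesel Fuel"]:
        print(raw, "->", standardize(raw))
```

Unknown terms pass through unchanged (after normalization), which leaves them visible for an expert to review rather than silently dropping them.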
11. Ontology Extraction Progress
- Built automatic analysis system to
- convert definitions into structured attribute-value hierarchy
- extract footnotes and attachment sites
- Serves as the basis for cross-agency ontology linking
- Export data to SENSUS for testing, identify and process additional data
12. Ontologies
- Representing structure in knowledge from the real world
- Many relations
- a car is a vehicle with 4 wheels
- a motorcycle is a vehicle with 2 wheels
- a van is a vehicle for transporting goods
- a bond is a vehicle for investing
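The four example relations can be written down as a small data structure. This is a sketch only: the `Concept` class and the two sense labels for "vehicle" (transport and finance) are assumptions introduced here to show why the same word needs more than one node in an ontology.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    name: str
    parent: Optional[str] = None  # is-a link to a more general concept
    attributes: Dict[str, object] = field(default_factory=dict)


# The sense labels in parentheses are assumptions made for this sketch.
CONCEPTS = [
    Concept("vehicle (transport)"),
    Concept("vehicle (finance)"),
    Concept("car", parent="vehicle (transport)", attributes={"wheels": 4}),
    Concept("motorcycle", parent="vehicle (transport)", attributes={"wheels": 2}),
    Concept("van", parent="vehicle (transport)",
            attributes={"purpose": "transporting goods"}),
    Concept("bond", parent="vehicle (finance)",
            attributes={"purpose": "investing"}),
]


def children_of(parent: str) -> List[str]:
    """List the concepts whose is-a link points at the given parent."""
    return [c.name for c in CONCEPTS if c.parent == parent]


if __name__ == "__main__":
    print(children_of("vehicle (transport)"))
    print(children_of("vehicle (finance)"))
```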
Judith L. Klavans
13. Building a Complex Ontology
[Diagram; labels: Definitions, Glossaries, Text, Ontology, Metadata, Footnotes]
14. Building a Complex Ontology
[Diagram; labels: Definitions, Ontology, Glossaries, Text, Metadata, Footnotes]
15. EDC Challenges
- Extracting accurate and useful information from all sources
- Relating information between sources
- Multiple sources from within the same agency
- Information comes from different agencies
- Challenges
- Merging when possible
- Creating new entries when needed
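The "merge when possible, create when needed" step can be sketched as follows. The `merge_or_create` function, the glossary layout, and the placeholder definitions are hypothetical; matching on a normalized term string is a deliberately simple stand-in for the real decision of whether two agencies' terms denote the same concept.

```python
def merge_or_create(glossary, term, definition, source):
    """Merge into an existing entry when the normalized term matches,
    otherwise create a new entry. Returns "merged" or "created"."""
    key = " ".join(term.lower().split())
    entry = glossary.get(key)
    if entry is None:
        glossary[key] = {"definitions": {source: definition}}
        return "created"
    # Keep each source's definition side by side so experts can compare them.
    entry["definitions"][source] = definition
    return "merged"


if __name__ == "__main__":
    glossary = {}
    # Placeholder definition strings, not real agency text.
    print(merge_or_create(glossary, "Motor Gasoline", "definition A", "agency A"))
    print(merge_or_create(glossary, "motor gasoline", "definition B", "agency B"))
    print(merge_or_create(glossary, "Distillate Fuel Oil", "definition C", "agency A"))
    print(sorted(glossary))
```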
16. DGRC Research
- Gather glossaries, thesauri, definitions from participating agencies.
- Extract ontological information
- Structure and deliver to ISI for access and display
- Link glossaries and ontology
17. Sample Definition
- Input
- Motor Gasoline Blending Components: Naphthas
(e.g., straight-run gasoline, alkylate,
reformate, benzene, toluene, xylene) used for
blending or compounding into finished motor
gasoline. These components include reformulated
gasoline blendstock for oxygenate blending (RBOB)
but exclude oxygenates (alcohols, ethers),
butane, and pentanes plus. Note: Oxygenates are
reported as individual components and are
included in the total for other hydrocarbons,
hydrogens, and oxygenates.
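To make the idea of turning such a definition into attribute-value pairs concrete, here is a rough sketch using regular expressions. It is not the Columbia analysis system: the `parse_definition` function and the attribute names (`genus`, `examples`, `used_for`, `note`) are assumptions, and the definition text is transcribed from the slide with a colon assumed after "Note".

```python
import re

DEFINITION = (
    "Naphthas (e.g., straight-run gasoline, alkylate, reformate, benzene, "
    "toluene, xylene) used for blending or compounding into finished motor "
    "gasoline. These components include reformulated gasoline blendstock for "
    "oxygenate blending (RBOB) but exclude oxygenates (alcohols, ethers), "
    "butane, and pentanes plus. Note: Oxygenates are reported as individual "
    "components and are included in the total for other hydrocarbons, "
    "hydrogens, and oxygenates."
)


def parse_definition(term, text):
    """Pull a few attribute-value pairs out of a glossary definition."""
    record = {"term": term}
    # A trailing "Note: ..." is kept separately, like a footnote.
    body, _, note = text.partition(" Note: ")
    if note:
        record["note"] = note.strip()
    # Head word of the definition: here simply its first word.
    record["genus"] = body.split()[0]
    # Examples: contents of the first "(e.g., ...)" parenthetical.
    match = re.search(r"\(e\.g\.,\s*([^)]*)\)", body)
    if match:
        record["examples"] = [e.strip() for e in match.group(1).split(",")]
    # Purpose: the clause following "used for", up to the next period.
    match = re.search(r"used for ([^.]*)\.", body)
    if match:
        record["used_for"] = match.group(1).strip()
    return record


if __name__ == "__main__":
    parsed = parse_definition("Motor Gasoline Blending Components", DEFINITION)
    for key, value in parsed.items():
        print(f"{key}: {value}")
```

The include/exclude clauses are left unparsed here; handling them reliably needs more than pattern matching, which is where a lexical knowledge base would come in.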
18. Methods
- Symbolic analysis techniques
- based on lexical knowledge base results
- Statistical analysis techniques
- provide fully automatic results
- can be filtered and structured
19. Columbia Output
20. ISI's SENSUS Ontology
- Taxonomy, multiple superclass links.
- Approx. 90,000 items.
- Top level: Penman Upper Model (ISI).
- Body: WordNet (Princeton), rearranged.
- Used at ISI for machine translation, text summarization, database access.
http://vigor.isi.edu:8002/sensus2/
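A taxonomy with multiple superclass links is a directed acyclic graph rather than a tree, so finding everything a concept "is a kind of" means following every parent link. The sketch below shows that traversal on a made-up fragment; the node names are illustrative and are not taken from SENSUS.

```python
# Hypothetical fragment: each node maps to its list of direct superclasses.
SUPERCLASSES = {
    "gasoline": ["fuel", "liquid"],  # two superclass links
    "fuel": ["substance"],
    "liquid": ["substance"],
    "substance": [],
}


def ancestors(node):
    """Return every superclass reachable from node via any parent link."""
    seen = set()
    stack = list(SUPERCLASSES.get(node, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(SUPERCLASSES.get(parent, []))
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("gasoline")))
```

The `seen` set ensures a shared ancestor such as "substance" is visited once even though two paths lead to it.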
21. Status and Plans
- Over 4000 new terms and definitions analyzed
- Being loaded and structured in DGRC ontology
- www.cs.columbia.edu/digigov
22. Progress
- 1. Interface
- initial prototype built and running
- 2. Information Integration
- wrappers around databases
- new aggregation operators
- 3. Ontology construction and integration
- NSF Digital Government Program PI meeting in May 2000
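For item 2 (Information Integration), the idea of wrappers plus an aggregation operator might look roughly like the sketch below. Everything in it is an assumption made for illustration: the two wrapper classes, the common record shape, the `average_by_quarter` operator, and the numbers, which are made up and are not real price data.

```python
from collections import defaultdict
from statistics import mean


class QuarterlySource:
    """Hypothetical wrapper for a source that already reports by quarter."""

    def __init__(self, rows):
        self.rows = rows  # [(year, quarter, value), ...]

    def records(self):
        return [{"year": y, "quarter": q, "value": v} for y, q, v in self.rows]


class MonthlySource:
    """Hypothetical wrapper for a source that reports by month."""

    def __init__(self, rows):
        self.rows = rows  # [(year, month, value), ...]

    def records(self):
        # Map months 1-12 onto quarters 1-4 so both sources share one shape.
        return [{"year": y, "quarter": (m - 1) // 3 + 1, "value": v}
                for y, m, v in self.rows]


def average_by_quarter(sources):
    """Aggregation operator: mean value per (year, quarter) over all sources."""
    buckets = defaultdict(list)
    for source in sources:
        for rec in source.records():
            buckets[(rec["year"], rec["quarter"])].append(rec["value"])
    return {key: mean(values) for key, values in sorted(buckets.items())}


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    quarterly = QuarterlySource([(1999, 1, 2.0)])
    monthly = MonthlySource([(1999, 1, 1.0), (1999, 2, 3.0), (1999, 4, 3.0)])
    print(average_by_quarter([quarterly, monthly]))
```

Each wrapper hides its source's native layout behind the same `records()` method, so the aggregation operator never needs to know which agency a record came from.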