Carol Jean Godby - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Carol Jean Godby


1
The automatic encoding of lexical knowledge in
RDF topicmaps
  • Carol Jean Godby
  • OCLC
  • Online Computer Library Center
  • March 6, 2001

2
Topicmaps of Web resources
  • For navigating complex Web sites
  • For managing bookmark files
  • For creating views of the Web that are organized
    by subject

3
Terminology identification
  • ...is an essential first step in the analysis of
    a document's content.
  • ...is one of the most mature research subjects in
    natural language processing.

4
Lexical phrases
  • Are the names of persistent concepts.
  • Act like words.
  • Are commonly used to name new concepts in rapidly
    evolving technical subject domains.

5
Not a lexical phrase
  • Recurrent problem

6
A lexical phrase
  • Recurrent erosion

7
Identifying lexical phrases
  • Tokenized text: ...Planetary scientists think
    the convex shape came about as lava welled up
    beneath the crater's solid floor.
  • Ngrams: planetary scientists think, convex
    shape, welled up, coincided with, five times
    greater than, easiest way, Milky Way, absolute
    magnitudes brighter than, added material,
    advanced study, African American
  • Index filter: planetary scientists, convex
    shape, easiest way, Milky Way, absolute
    magnitudes, added material, advanced study,
    African American
  • Topic filter: planetary scientists, Milky Way
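
The first stages above can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the talk: the stopword list and the rule "drop candidates that start or end with a stopword" are assumptions standing in for the index filter, whose actual criteria the slides do not give.

```python
import re
from collections import Counter

# Hypothetical stopword list; the talk does not say which words its filters use.
STOPWORDS = {"the", "a", "an", "as", "up", "with", "than", "of",
             "think", "came", "about", "beneath"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, max_n=3):
    """Yield every contiguous word sequence of length 2..max_n."""
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def boundary_filter(candidates):
    """Keep candidates that neither start nor end with a stopword
    (an assumed stand-in for the index filter)."""
    return [c for c in candidates
            if c[0] not in STOPWORDS and c[-1] not in STOPWORDS]

text = ("Planetary scientists think the convex shape came about as lava "
        "welled up beneath the crater's solid floor.")
counts = Counter(boundary_filter(ngrams(tokenize(text))))
phrases = [" ".join(c) for c in counts]
```

On a single sentence every surviving candidate has a count of 1; on a real corpus the counts would feed the frequency-based topic filter.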

8
Terminology identification process flow
  • Tokenized text: 9.8m
  • Ngrams: 1.9 M
  • Index filter: 59k, 2331 phrases
  • Topic filter: 35k, 1632 phrases

9
Strategies in the topic filter
  • Word/phrase frequency and strength of association
  • Knowledge-poor text analysis
  • More sophisticated but computable text analysis
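
The slides do not say which measure of association strength is used. As one common choice, the sketch below computes pointwise mutual information (PMI) for a two-word phrase; the measure and the toy word sequence are assumptions made for illustration.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information (in bits) of the adjacent pair (w1, w2).

    Assumes the pair occurs at least once in tokens.
    """
    words = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    p_pair = pairs[(w1, w2)] / sum(pairs.values())
    p_w1 = words[w1] / len(tokens)
    p_w2 = words[w2] / len(tokens)
    return math.log2(p_pair / (p_w1 * p_w2))

# Toy word sequence, invented for this example.
tokens = ("the milky way and the spiral galaxy and the sun "
          "and the moon near the milky way").split()
strong = pmi(tokens, "milky", "way")   # words that mostly occur together
weak = pmi(tokens, "the", "milky")     # "the" also occurs in many other pairs
```

A pair whose words rarely occur apart scores higher than a pair that contains a word found in many other contexts.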

10
Word and phrase frequencies
  • Word/phrase frequency
  • high: dublin core, metadata, element, electronic
    resources
  • low: availability period, background, applicable
    terminologies
  • Weighted frequency
  • 1. core element, date element, metadata
    element
  • 2. author name, entity name, corporate name
  • 3. HTML tag, end tag, meta tag

11
Knowledge-poor techniques 1
  • Some noun phrase heads usually appear in text
    only with adjective or noun modifiers.
  • Example: holes--black holes, grey holes,
    central holes
  • Others usually appear without modifiers.
  • Example: galaxy--cartwheel galaxy, spiral
    galaxy; a galaxy, our galaxy, this galaxy
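
The distinction above can be approximated by counting, for each head noun, how often the preceding word is a modifier rather than a determiner. The sketch is deliberately crude and rests on assumptions: the determiner list is hypothetical, and any other preceding word is treated as a modifier without part-of-speech tagging.

```python
from collections import Counter

# Assumed closed-class words that do not count as modifiers.
DETERMINERS = {"a", "an", "the", "our", "this", "that", "these", "those", "its"}

def modified_share(tokens, heads):
    """For each head noun, return the share of its occurrences that are
    directly preceded by a word other than a determiner."""
    modified, total = Counter(), Counter()
    for prev, word in zip([""] + tokens, tokens):
        if word in heads:
            total[word] += 1
            if prev and prev not in DETERMINERS:
                modified[word] += 1
    return {h: modified[h] / total[h] for h in heads if total[h]}

# Toy text built from the examples on the slide.
tokens = ("black holes and grey holes lie near central holes in a galaxy "
          "like our galaxy or this galaxy or a spiral galaxy").split()
shares = modified_share(tokens, {"holes", "galaxy"})
```

A head with a share near 1 behaves like "holes"; a head with a low share behaves like "galaxy" and is a candidate topical single term.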

12
Consequences
  • We can identify topical single terms:
  • galaxy, star, sun, moon
  • government, abortion, communism
  • metadata, html, Internet, information
  • We can create subject taxonomies:
  • galaxy (-ies): cartwheel galaxy, elliptical
    galaxy, spiral galaxy
  • hole(s): black holes, drill holes, grey holes
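
A taxonomy of this kind can be sketched by grouping phrases under their head noun. The sketch assumes the head is the last word of the phrase, which holds for the English examples on the slide, and it does no singular/plural normalization.

```python
from collections import defaultdict

def group_by_head(phrases):
    """Group multi-word phrases under their final word (the assumed head)."""
    taxonomy = defaultdict(list)
    for phrase in phrases:
        words = phrase.split()
        if len(words) > 1:
            taxonomy[words[-1]].append(phrase)
    return dict(taxonomy)

taxonomy = group_by_head([
    "cartwheel galaxy", "elliptical galaxy", "spiral galaxy",
    "black holes", "drill holes", "grey holes",
])
```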

13
Knowledge-poor techniques 2: subject probes
  • Goal: to get high-quality subject terms
  • Look for markers of a subject that is talked
    about, written about or studied: topics in, study
    of, analysis of, (on the) subject of, major in
  • Probes differ in specificity.
  • topics in: sciences, arts, humanities, library
    science, astronomy, physics, business, data
    visualization, computer science, mathematics,
    computer and network security, mathematics,
    number theory, medicine
  • analysis of: metabolic regulation, numerical
    analysis, saline water phenomena, coals, iron
    ore, cereal grains, income dynamics among men,
    working hours, inflation, mass belief systems,
    aerial photography
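
A subject probe can be sketched as a pattern match followed by a cutoff rule. The probe strings come from the slide; the cutoff (stop at a boundary word or after four words) and the boundary word list are assumptions made for this illustration.

```python
import re

PROBES = ["topics in", "study of", "analysis of", "subject of", "major in"]
# Hypothetical boundary words that end a candidate subject term.
BOUNDARY = {"and", "or", "the", "a", "an", "that", "which", "is", "was", "wrote"}

def probe_terms(text, max_words=4):
    """Return (probe, term) pairs for the words that follow each probe."""
    hits = []
    lowered = text.lower()
    for probe in PROBES:
        pattern = r"\b" + re.escape(probe) + r"\s+([a-z][a-z\s-]*)"
        for match in re.finditer(pattern, lowered):
            words = []
            for word in match.group(1).split():
                if word in BOUNDARY or len(words) == max_words:
                    break
                words.append(word)
            if words:
                hits.append((probe, " ".join(words)))
    return hits

# Example sentence invented for this sketch.
hits = probe_terms("She teaches topics in number theory and wrote an "
                   "analysis of income dynamics among men.")
```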

14
Some results
15
The identification of term relationships
  • Singular/Plural: library, libraries
  • Acronyms:
  • Standard Generalized Markup Language--SGML
  • Library of Congress Subject Headings--LCSH
  • Coordination:
  • library and information science--library science,
    information science
  • information storage and retrieval--information
    storage, information retrieval
  • cataloging and interlibrary loan--cataloging,
    interlibrary loan
  • Ellipsis:
  • abbreviated key title--abbreviated title
  • authority file records--authority records
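
Two of these relationships, acronyms and ellipsis, can be checked with string operations alone. The functions below are illustrative assumptions, not the method used in the talk: the acronym rule takes the initials of capitalized words, and the ellipsis rule requires the short phrase to be an in-order subset of the long phrase with the same final word.

```python
def acronym_of(phrase):
    """Build an acronym from the initial letters of capitalized words."""
    return "".join(w[0] for w in phrase.split() if w[0].isupper())

def is_ellipsis(long_phrase, short_phrase):
    """True if short_phrase keeps the head of long_phrase and drops
    one or more of its other words without reordering."""
    long_words = long_phrase.lower().split()
    short_words = short_phrase.lower().split()
    if len(short_words) >= len(long_words) or long_words[-1] != short_words[-1]:
        return False
    remaining = iter(long_words)
    # Each membership test consumes the iterator, so order is enforced.
    return all(word in remaining for word in short_words)

sgml = acronym_of("Standard Generalized Markup Language")
lcsh = acronym_of("Library of Congress Subject Headings")
```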

16
A more abstract relationship: hypernym/hyponym
  • electronic formats, such as text/HTML, ASCII,
    or PostScript.
  • Other examples from our data:
  • Controlled Vocabularies: Medical Subject
    Headings, Art and Architecture Thesaurus
  • metadata element set: Dublin Core
  • protocol server applications: NFS server,
    FTP server, Web server
  • moving images: films, videos, simulations
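
The first example suggests a lexical pattern, "X, such as A, B, or C", that pairs a broader term with narrower ones. The sketch below is one possible reading of that pattern, not the talk's implementation. It assumes the text before "such as" is already a clean noun phrase; a fuller system would need noun-phrase chunking to find where the broader term starts.

```python
import re

# Broader term, optional comma, "such as", then the list up to "." or ";".
SUCH_AS = re.compile(r"([A-Za-z][A-Za-z ]*?),? such as ([^.;]+)")

def hyponyms(sentence):
    """Return (hypernym, [hyponyms]) for a 'such as' construction, or None."""
    match = SUCH_AS.search(sentence)
    if not match:
        return None
    hypernym = match.group(1).strip()
    items = re.split(r",\s*(?:or|and)\s+|,\s*|\s+(?:or|and)\s+",
                     match.group(2).strip())
    return hypernym, [item.strip() for item in items if item.strip()]

result = hyponyms("electronic formats, such as text/HTML, ASCII, or PostScript.")
```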

17
A graph representation of relationships
[Figure: graph whose nodes are terms and whose edges are labeled with relationship types]
  • Node labels: Dewey, Subject Headings, Dewey call
    numbers, Library of Congress Subject Headings,
    Dewey Decimal, Dewey numbers, Dewey decimal
    classification, DDC and LCSH, numbers, cutter
    numbers, DDC
  • Edge labels: B/N (Broad/Narrow), Ellipsis,
    Acronym, Coordination

18
RDF Topic Representation
[Figure: RDF graph. Labels: numbers; name; Numbers; isDefinedIn; http://r1, http://r2, http://r3]

19
System flow 1: processing steps
  • 1. Harvest Web text.
  • 2. Extract terminology and relationships.
  • 3. Organize terminology into an RDF graph.
  • 4. Import the RDF graph into the Extended Open
    RDF Toolkit.
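
Step 3 can be illustrated by writing extracted terms and relationships as N-Triples lines. This is a sketch under stated assumptions: the two example.org namespaces, the `acronymOf` property, and the source URL are hypothetical, `name` and `isDefinedIn` echo the labels on the RDF topic slide, and the input format expected by the Extended Open RDF Toolkit is not described in the talk.

```python
BASE = "http://example.org/topic/"    # hypothetical namespace for topics
VOCAB = "http://example.org/vocab#"   # hypothetical namespace for properties

def topic_uri(term):
    """Turn a term into a URI reference in the hypothetical topic namespace."""
    return f"<{BASE}{term.lower().replace(' ', '_')}>"

def topic_triples(term, sources=(), relations=()):
    """Return N-Triples lines for one topic.

    Assumes the term contains no quotes or other characters that
    would need escaping in a literal.
    """
    lines = [f'{topic_uri(term)} <{VOCAB}name> "{term}" .']
    for url in sources:
        lines.append(f"{topic_uri(term)} <{VOCAB}isDefinedIn> <{url}> .")
    for predicate, other in relations:
        lines.append(f"{topic_uri(term)} <{VOCAB}{predicate}> {topic_uri(other)} .")
    return lines

lines = topic_triples(
    "DDC",
    sources=["http://example.org/doc1"],
    relations=[("acronymOf", "Dewey decimal classification")],
)
```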

20
System flow 2: user interaction
[Figure: diagram with the components The Web, RDF Concept graph, RDF search engine, User]

21
A screen shot
22
Future plans
  • Develop a user interface that fully exploits the
    richness of the RDF graph structure.
  • Merge terminology extracted from source documents
    with other sources of information.
  • Improve processes for automatically extracting
    terminology.

23
References
  • The Extended Open RDF Toolkit.
    Accessible at http://eor.dublincore.org/
  • Automatically generated topic maps of World Wide
    Web resources.
    Accessible at http://www.oclc.org/oclc/research/publications/review99/godby/topicmaps.htm
  • The WordSmith indexing system.
    Accessible at http://www.oclc.org/oclc/research/publications/review98/godby_reighart/wordsmith.htm