Title: Carol Jean Godby
1. The automatic encoding of lexical knowledge in RDF topicmaps
- Carol Jean Godby
- OCLC
- Online Computer Library Center
- March 6, 2001
2. Topicmaps of Web resources
- For navigating complex Web sites
- For managing bookmark files
- For creating views of the Web that are organized
by subject
3. Terminology identification
- ...is an essential first step in the analysis of a document's content.
- ...is one of the most mature research subjects in natural language processing.
4. Lexical phrases
- Are the names of persistent concepts.
- Act like words.
- Are commonly used to name new concepts in rapidly
evolving technical subject domains.
5. Not a lexical phrase: recurrent problem
6. A lexical phrase: recurrent erosion
7. Identifying lexical phrases
- Tokenized text: ...Planetary scientists think the convex shape came about as lava welled up beneath the crater's solid floor.
- Ngrams: planetary scientists think, convex shape, welled up, coincided with, five times greater than, easiest way, Milky Way, absolute magnitudes brighter than, added material, advanced study, African American
- Index filter: planetary scientists, convex shape, easiest way, Milky Way, absolute magnitudes, added material, advanced study, African American
- Topic filter: planetary scientists, Milky Way
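The three stages above can be sketched in a few lines of Python. This is only an illustration of the flow from tokens to ngrams to filtered phrases: the slides do not say how the index filter or the topic filter is implemented, so both are replaced here by hypothetical lookup sets (`INDEX_VOCABULARY`, `TOPIC_VOCABULARY`), and the function names are invented for the example.

```python
import re

# Hypothetical stand-ins for the index filter and the topic filter.
# The slides do not describe how either filter actually decides.
INDEX_VOCABULARY = {"planetary scientists", "convex shape"}
TOPIC_VOCABULARY = {"planetary scientists"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, max_n=3):
    """Return every contiguous run of 2..max_n tokens as a string."""
    found = []
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            found.append(" ".join(tokens[i:i + n]))
    return found


def apply_filter(candidates, vocabulary):
    """Keep candidates found in the vocabulary, in order, without repeats."""
    kept = []
    for candidate in candidates:
        if candidate in vocabulary and candidate not in kept:
            kept.append(candidate)
    return kept


sentence = ("Planetary scientists think the convex shape came about as lava "
            "welled up beneath the crater's solid floor.")
candidates = ngrams(tokenize(sentence))
indexed = apply_filter(candidates, INDEX_VOCABULARY)
topics = apply_filter(indexed, TOPIC_VOCABULARY)
```

Each stage only narrows the output of the previous one, which is why the candidate list shrinks at every step.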
8. Terminology identification process flow
[Diagram: Tokenized text -> Ngrams -> Index filter -> Topic filter. Counts shown in the diagram, in order of appearance: 9.8m; 1.9 M; 59k, 2331 phrases; 35k, 1632 phrases.]
9. Strategies in the topic filter
- Word/phrase frequency and strength of association
- Knowledge-poor text analysis
- More sophisticated but computable text analysis
10. Word and phrase frequencies
- Word/phrase frequency
- high: dublin core, metadata, element, electronic resources
- low: availability period, background, applicable terminologies
- Weighted frequency
- 1. core element, date element, metadata element
- 2. author name, entity name, corporate name
- 3. HTML tag, end tag, meta tag
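As a rough illustration of frequency combined with strength of association, the sketch below counts adjacent word pairs and scores each one with pointwise mutual information (PMI). PMI is an assumption made for this example: the slides name neither the association measure nor the weighting behind "weighted frequency". The function name and sample text are also invented.

```python
import math
from collections import Counter


def association_scores(tokens):
    """Return {bigram: (frequency, PMI)} for each adjacent word pair.

    PMI is used here only as one common association measure; the slides
    do not say which measure the topic filter relies on.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)
    scores = {}
    for (w1, w2), freq in pair_counts.items():
        p_pair = freq / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scores[f"{w1} {w2}"] = (freq, math.log2(p_pair / (p_w1 * p_w2)))
    return scores


tokens = ("the dublin core element set defines the core element and "
          "the date element of the metadata record").split()
scores = association_scores(tokens)
```

In this toy text "core element" occurs twice, and "dublin core" scores a higher PMI than "the core" because "the" combines freely with many other words.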
11. Knowledge-poor techniques 1
- Some noun phrase heads usually appear in text only with adjective or noun modifiers.
- Example: holes--black holes, grey holes, central holes
- Others usually appear without modifiers.
- Example: galaxy--cartwheel galaxy, spiral galaxy
- a galaxy, our galaxy, this galaxy
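One way to make this observation computable is to measure, for each head noun, how often it is preceded by a modifier rather than by a determiner or nothing. The sketch below is a simplification under stated assumptions: it has no part-of-speech tagger, so any preceding word outside a small hypothetical `DETERMINERS` list counts as a modifier, which would miscount verbs and prepositions in real text.

```python
# Assumed closed list of determiners. Any other word directly before the
# head is treated as an adjective or noun modifier (a crude approximation).
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those", "our",
               "their", "its", "his", "her", "my", "your"}


def modifier_ratio(tokens, head_forms):
    """Fraction of a head noun's occurrences that follow a modifier."""
    modified = 0
    total = 0
    for i, token in enumerate(tokens):
        if token not in head_forms:
            continue
        total += 1
        if i > 0 and tokens[i - 1] not in DETERMINERS:
            modified += 1
    return modified / total if total else 0.0


text = ("black holes and grey holes surround central holes while "
        "a galaxy like our galaxy or this galaxy may be a spiral galaxy")
tokens = text.split()
holes_ratio = modifier_ratio(tokens, {"hole", "holes"})
galaxy_ratio = modifier_ratio(tokens, {"galaxy", "galaxies"})
```

A head with a ratio near 1, like "holes" here, rarely stands alone as a topic, while a head with a low ratio, like "galaxy", is a candidate single-word term.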
12. Consequences
- We can identify topical single terms
- galaxy, star, sun, moon
- government, abortion, communism
- metadata, html, Internet, information
- We can create subject taxonomies
- galaxy (-ies): cartwheel galaxy, elliptical galaxy, spiral galaxy
- hole(s): black holes, drill holes, grey holes
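A small taxonomy of this kind can be sketched by grouping phrases under their final word. The code below assumes that the head of an English noun phrase is its last word and uses a deliberately rough singularization rule; both are simplifications for illustration, not the method used to produce the examples on the slide.

```python
from collections import defaultdict


def singular(word):
    """Very rough singularization for this example, not a general stemmer."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_taxonomy(phrases):
    """Group multiword phrases under the singular form of their last word."""
    taxonomy = defaultdict(list)
    for phrase in phrases:
        words = phrase.split()
        if len(words) < 2:
            continue  # single words have no modifier to hang under a head
        taxonomy[singular(words[-1])].append(phrase)
    return dict(taxonomy)


phrases = ["cartwheel galaxy", "elliptical galaxy", "spiral galaxy",
           "black holes", "drill holes", "grey holes"]
taxonomy = build_taxonomy(phrases)
```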
13. Knowledge-poor techniques 2: subject probes
- Goal: to get high-quality subject terms
- Look for markers of a subject that is talked about, written about, or studied: topics in, study of, analysis of, (on the) subject of, major in
- Probes differ in specificity.
- topics in: sciences, arts, humanities, library science, astronomy, physics, business, data visualization, computer science, mathematics, computer and network security, mathematics, number theory, medicine
- analysis of: metabolic regulation, numerical analysis, saline water phenomena, coals, iron ore, cereal grains, income dynamics among men, working hours, inflation, mass belief systems, aerial photography
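A subject probe can be approximated with a regular expression that finds the marker and then collects the words that follow it. The sketch below uses the probe strings listed above, but the rule for where a subject term ends (punctuation, a stop word, or a four-word limit) is an assumption made for this example, as are the `STOP_WORDS` list and the function name.

```python
import re

PROBES = ["topics in", "study of", "analysis of", "on the subject of",
          "subject of", "major in"]

# Assumed boundary words; the slides do not say how a term is delimited.
STOP_WORDS = {"and", "or", "the", "a", "an", "of", "in", "at", "for",
              "with", "is", "are", "was", "to", "that", "which"}
MAX_WORDS = 4

# Longer probes come first so "on the subject of" wins over "subject of".
_alternatives = "|".join(
    re.escape(p) for p in sorted(PROBES, key=len, reverse=True))
PROBE_RE = re.compile(r"\b(?:" + _alternatives + r")\b")
WORD_OR_PUNCT = re.compile(r"[a-z]+|[^\sa-z]")


def find_subjects(text):
    """Return (probe, subject term) pairs found in the text."""
    lowered = text.lower()
    results = []
    for match in PROBE_RE.finditer(lowered):
        term = []
        for token in WORD_OR_PUNCT.findall(lowered[match.end():]):
            if (not token.isalpha() or token in STOP_WORDS
                    or len(term) == MAX_WORDS):
                break
            term.append(token)
        if term:
            results.append((match.group(0), " ".join(term)))
    return results


text = ("She wrote an analysis of income dynamics, taught topics in "
        "number theory, and chose to major in astronomy.")
subjects = find_subjects(text)
```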
14. Some results
15. The identification of term relationships
- Singular/Plural: Library, libraries
- Acronyms
- Standard Generalized Markup Language--SGML
- Library of Congress Subject Headings--LCSH
- Coordination
- library and information science--library science, information science
- information storage and retrieval--information storage, information retrieval
- cataloging and interlibrary loan--cataloging, interlibrary loan
- Ellipsis
- abbreviated key title--abbreviated title
- authority file records--authority records
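Three of these relationships can be sketched with simple string operations. The heuristics below are assumptions chosen to reproduce the examples on the slide, not the procedure actually used: acronyms are built from capitalized words, ellipsis is tested as a shorter phrase that keeps the head and drops inner words, and coordination is resolved by trying each way of sharing words and keeping the first one whose parts are both in a set of already known terms (`known_terms`, a hypothetical input).

```python
def acronym_of(phrase):
    """Join the initial letters of the capitalized words in a phrase."""
    return "".join(w[0] for w in phrase.split() if w[0].isupper())


def is_ellipsis(full, short):
    """True if `short` keeps the head of `full` and drops other words."""
    full_words, short_words = full.split(), short.split()
    if (not short_words or len(short_words) >= len(full_words)
            or full_words[-1] != short_words[-1]):
        return False
    remaining = iter(full_words)
    # Each short word must be found, in order, among the full words.
    return all(word in remaining for word in short_words)


def coordination_candidates(phrase):
    """List the possible readings of an 'X and Y' phrase."""
    left, sep, right = phrase.partition(" and ")
    if not sep:
        return []
    left_words, right_words = left.split(), right.split()
    candidates = []
    if len(right_words) > 1:   # shared head: library and information science
        candidates.append((f"{left} {right_words[-1]}", right))
    if len(left_words) > 1:    # shared modifier: information storage and retrieval
        candidates.append((left, f"{' '.join(left_words[:-1])} {right}"))
    candidates.append((left, right))   # nothing shared
    return candidates


def resolve_coordination(phrase, known_terms):
    """Return the first reading whose two parts are both known terms."""
    for pair in coordination_candidates(phrase):
        if all(term in known_terms for term in pair):
            return list(pair)
    return []


known_terms = {"library science", "information science",
               "information storage", "information retrieval",
               "cataloging", "interlibrary loan"}
```

Checking candidate readings against known terms is what separates "cataloging and interlibrary loan" (nothing shared) from "library and information science" (shared head).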
16. A more abstract relationship: hypernym/hyponym
- electronic formats, such as text/HTML, ASCII, or PostScript.
- Other examples from our data
- Controlled Vocabularies: Medical Subject Headings, Art and Architecture Thesaurus
- metadata element set: Dublin Core
- protocol server applications: NFS server, FTP server, Web server
- moving images: films, videos, simulations
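The "such as" example suggests a pattern-based extractor: the phrase before the marker is taken as the hypernym and the list after it as the hyponyms. The sketch below handles only that one pattern and makes a strong simplifying assumption, labeled in the code: everything from the start of the clause up to "such as" is treated as the hypernym, whereas a real system would need noun phrase boundaries.

```python
import re

# Assumption: the text before "such as" is exactly the hypernym phrase.
SUCH_AS = re.compile(r"([A-Za-z][A-Za-z ]*?),? such as ([^.;]+)")
# Split the hyponym list on commas and on "or" / "and".
LIST_SEPARATOR = re.compile(r",\s*(?:or\s+|and\s+)?|\s+(?:or|and)\s+")


def hyponyms_from_such_as(sentence):
    """Return (hypernym, [hyponyms]) for an 'X, such as A, B, or C' sentence."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return None
    hypernym = match.group(1).strip()
    items = LIST_SEPARATOR.split(match.group(2))
    return hypernym, [item.strip() for item in items if item.strip()]


result = hyponyms_from_such_as(
    "electronic formats, such as text/HTML, ASCII, or PostScript.")
```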
17. A graph representation of relationships
[Graph diagram. Nodes: Dewey; Subject Headings; Dewey call numbers; Library of Congress Subject Headings; Dewey Decimal; Dewey numbers; Dewey decimal classification; DDC and LCSH; numbers; cutter numbers; DDC. Edge labels: B/N (Broad/Narrow), Ellipsis, Acronym, Coordination.]
18. RDF Topic Representation
[RDF graph diagram. Labels: numbers; name; Numbers; isDefinedIn; http://r1; http://r2; http://r3.]
19. System flow 1: processing steps
- 1. Harvest Web text.
- 2. Extract terminology and relationships.
- 3. Organize terminology into an RDF graph.
- 4. Import the RDF graph into the Extended Open
RDF Toolkit.
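Step 3 can be illustrated by turning one extracted term into RDF triples and writing them out as N-Triples. This is a toolkit-independent sketch and does not show the import format of the Extended Open RDF Toolkit. The property names `name` and `isDefinedIn` follow the labels on the RDF topic slide, but the namespaces, the `broader` property, the source URL, and the choice of "numbers" as the broader term are all hypothetical.

```python
# Hypothetical namespaces, used only for this illustration.
TOPIC_NS = "http://example.org/topic/"
VOCAB_NS = "http://example.org/vocab#"


def topic_uri(term):
    """Build a URI for a term by replacing spaces with underscores."""
    return TOPIC_NS + term.lower().replace(" ", "_")


def topic_triples(term, sources, broader=None):
    """Describe one topic as (subject, predicate, (kind, value)) triples."""
    subject = topic_uri(term)
    triples = [(subject, VOCAB_NS + "name", ("literal", term))]
    for url in sources:
        triples.append((subject, VOCAB_NS + "isDefinedIn", ("uri", url)))
    if broader:
        # "broader" is an assumed property name for the B/N relationship.
        triples.append(
            (subject, VOCAB_NS + "broader", ("uri", topic_uri(broader))))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples, one statement per line."""
    lines = []
    for subject, predicate, (kind, value) in triples:
        if kind == "uri":
            obj = f"<{value}>"
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {obj} .")
    return "\n".join(lines)


triples = topic_triples("cutter numbers", ["http://example.org/r1"],
                        broader="numbers")
ntriples = to_ntriples(triples)
```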
20. System flow 2: user interaction
[Diagram. Components: The Web; RDF Concept graph; User; RDF search engine.]
21. A screen shot
22. Future plans
- Develop a user interface that fully exploits the richness of the RDF graph structure.
- Merge terminology extracted from source documents with other sources of information.
- Improve processes for automatically extracting terminology.
23. References
- The Extended Open RDF Toolkit
- Accessible at http://eor.dublincore.org/
- Automatically generated topic maps of World Wide Web resources.
- Accessible at http://www.oclc.org/oclc/research/publications/review99/godby/topicmaps.htm
- The WordSmith indexing system
- Accessible at http://www.oclc.org/oclc/research/publications/review98/godby_reighart/wordsmith.htm