Information Retrieval and the Semantic Web - PowerPoint PPT Presentation

About This Presentation

Title:

Information Retrieval and the Semantic Web



Slides: 38

Provided by: timfi

Learn more at: https://ebiquity.umbc.edu


Transcript and Presenter's Notes

Title: Information Retrieval and the Semantic Web

1

Information Retrieval and the Semantic Web

Tim Finin, James Mayfield, Anupam Joshi, R. Scott Cost and Clay Fink
University of Maryland, Baltimore County
Johns Hopkins University Applied Physics Lab
04 January 2004
DARPA contract F30602-00-0591 and NSF awards
ITR-IIS-0326460 and ITR-IIS-0325464 provided
partial research support for this work
2
Introduction and motivation
3

XML is Lisp's bastard nephew, with uglier syntax
and no semantics. Yet XML is poised to enable the
creation of a Web of data that dwarfs anything
since the Library at Alexandria.
-- Philip Wadler, Et tu XML? The fall of
the relational empire, VLDB, Rome, September
2001.

The web has made people smarter. We need to
understand how to use it to make machines
smarter, too.
-- Michael I. Jordan (UC Berkeley),
paraphrased from a talk at AAAI, July
2002

The Semantic Web will globalize KR, just as the
WWW globalized hypertext
-- Tim Berners-Lee

The multi-agent systems paradigm and the web
both emerged around 1990. One has succeeded
beyond imagination and the other has not yet made
it out of the lab.
-- Anonymous, 2001

8
Vision and Model
9
Vision

Semantic markup (e.g., OWL) as markup:
Web documents are traditional HTML documents,
augmented with machine-readable semantic markup
that describes their content
Inference and retrieval are tightly bound:
Inference over semantic markup improves retrieval
and text retrieval facilitates inference
Agents should use the web like humans do:
Think of a query, encode to retrieve possibly
relevant documents, read some and extract
knowledge, repeat until objectives met

10
Why use IR techniques?

We will want to retrieve over structured and
unstructured knowledge
We should prepare for the appearance of text
documents with embedded SW markup
We may want to get our SWDs into conventional
search engines, such as Google.
Mature, scalable, low cost, deployed
infrastructure
IR techniques also have some unique
characteristics that may be very useful
e.g., ranking matches, document similarity,
clustering, relevance feedback, etc.

11
Framework: Semantic Markup
[Diagram. Labels: agent, Local KB, Semantic Web Query, Inference Engine, Statement to be proved, Encoded Markup, Web Search Engine, Ranked Pages, Filters, Semantic Markup]
12
Framework: Incorporating Text
[Diagram. Labels: Local KB, Semantic Web Query, Inference Engine, Statement to be proved, Encoded Markup, Text Query, Web Search Engine, Ranked Pages, Filters, Text, Semantic Markup]
13
Harnessing Google

Google started indexing RDF documents some time
in late 2003
Can we take advantage of this?
We've developed techniques to get some structured
data to be indexed by Google
And then later retrieved
Technique: give Google enhanced documents with
additional annotations containing Swangle Terms

14
Swangle definition

swangle
Pronunciation: swang-gl
Function: transitive verb
Inflected Forms: swangled, swangling /-g(-)ling/
Etymology: Postmodern English, from C++ mangle
Date: 20th century
1: to convert an RDF triple into one or more IR
indexing terms
2: to process a document or query so that its
content-bearing markup will be indexed by an
IR system
Synonym: see tblify
- swangler /-g(-)lr/ noun

15
Swangling

Swangling turns a SW triple into 7 word-like
terms
One for each non-empty subset of the three
components, with the missing elements replaced by
the special "don't care" URI
Terms are generated by a hashing function (e.g.,
SHA1)
Swangling an RDF document means adding in triples
with swangle terms.
This can be indexed and retrieved via
conventional search engines like Google
Allows one to search for a SWD with a triple that
claims Osama bin Laden is located at X
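
The enumeration of the seven subsets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "don't care" URI below is a hypothetical placeholder, and joining the components with a space, Base32 encoding, and the 26-character truncation are choices made for the sketch, not details given on the slides.

```python
import base64
import hashlib
from itertools import product

# Hypothetical placeholder; the slides do not give the actual "don't care" URI.
DONT_CARE = "http://example.org/swangle#dontCare"


def swangle(subject, predicate, obj, length=26):
    """Return the 7 swangle terms for one triple (illustrative sketch)."""
    terms = []
    for keep in product((True, False), repeat=3):
        if not any(keep):
            continue  # skip the empty subset, where all three would be replaced
        parts = [
            component if kept else DONT_CARE
            for component, kept in zip((subject, predicate, obj), keep)
        ]
        digest = hashlib.sha1(" ".join(parts).encode("utf-8")).digest()
        # Base32 gives upper-case letters and digits, so the term looks like a word.
        terms.append(base64.b32encode(digest).decode("ascii")[:length])
    return terms


terms = swangle(
    "http://www.xfront.com/owl/ontologies/camera/#Camera",
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem",
)
```

A query with an unknown subject is swangled the same way, so the term it produces for the (don't care, predicate, object) pattern equals the term stored with any document containing a matching triple.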

16
A Swangled Triple

<rdf:RDF xmlns:s="http://swoogle.umbc.edu/ontologies/swangle.owl#">
  <s:SwangledTriple>
    <s:swangledText>N656WNTZ36KQ5PX6RFUGVKQ63A</s:swangledText>
    <rdfs:comment>Swangled text for http://www.xfront.com/owl/ontologies/camera/#Camera, http://www.w3.org/2000/01/rdf-schema#subClassOf, http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem</rdfs:comment>
    <s:swangledText>M6IMWPWIH4YQI4IMGZYBGPYKEI</s:swangledText>
    <s:swangledText>HO2H3FOPAEM53AQIZ6YVPFQ2XI</s:swangledText>
    <s:swangledText>2AQEUJOYPMXWKHZTENIJS6PQ6M</s:swangledText>
    <s:swangledText>IIVQRXOAYRH6GGRZDFXKEEB4PY</s:swangledText>
    <s:swangledText>75Q5Z3BYAKRPLZDLFNS5KKMTOY</s:swangledText>
    <s:swangledText>2FQ2YI7SNJ7OMXOXIDEEE2WOZU</s:swangledText>
  </s:SwangledTriple>
</rdf:RDF>

17
What's the point?

We'd like to get our documents into Google
Swangle terms look like words to Google and other
search engines.
Cloaking obviates modifying the document
Add rules to the web server so that, when a
search spider asks for document X, the document
swangled(X) is returned. Caching makes this
efficient
A swangle term length of 7 may be an acceptable
length for a Semantic Web of 10^10 triples --
collision prob. for a triple is about 2 x 10^-6.
We could also use Swanglish: hashing each triple
into N of the 50K most common English words
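
One way the Swanglish idea might work is sketched below. It is an assumption-laden illustration rather than the authors' method: the function name `swanglish`, the use of SHA-1, the 4-byte slicing, and the eight-word stand-in vocabulary (in place of a real 50K-word list) are all hypothetical.

```python
import hashlib

# Tiny stand-in vocabulary; the slide suggests the 50K most common English words.
VOCABULARY = ["camera", "river", "window", "garden", "silver", "planet", "orange", "bridge"]


def swanglish(triple, vocabulary, n=3):
    """Map a triple to n vocabulary words chosen deterministically by hashing."""
    if not 1 <= n <= 5:
        raise ValueError("a 20-byte SHA-1 digest supplies at most five 4-byte slices")
    digest = hashlib.sha1(" ".join(triple).encode("utf-8")).digest()
    words = []
    for i in range(n):
        # Each 4-byte slice of the digest selects one word from the vocabulary.
        index = int.from_bytes(digest[4 * i : 4 * i + 4], "big") % len(vocabulary)
        words.append(vocabulary[index])
    return words


words = swanglish(
    (
        "http://www.xfront.com/owl/ontologies/camera/#Camera",
        "http://www.w3.org/2000/01/rdf-schema#subClassOf",
        "http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem",
    ),
    VOCABULARY,
)
```

Because the words are ordinary English, a search engine that ignores unfamiliar tokens would still index them; the trade-off is a higher collision rate than with opaque hash strings.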

18
OWLIR
19
Student Event Scenario

UMBC sends out descriptions of 50 events a week
to students.
Each student has a standing query used to route
event messages.
A student only receives announcements of events
matching his/her interests and schedule.
Use LMCO's AeroText system to automatically add
DAML+OIL markup to event descriptions.
Categorize text announcements into event types
Identify key elements and add DAML markup
Use JESS to reason over the markup, drawing
ontology-supported inferences

20
Event Ontology

A simple ontology for University events
Includes classes, subclasses, properties, etc.
Can include instance data, e.g., UMBC, NEC,
Fairleigh Dickinson, etc.

21
OWLIR Architecture
[Architecture diagram. Components: Agents; Expand Event Description; Classification; Info Extraction (LMCO AeroText, Java); Jess (extract triples, reason); Convert triples to index terms; SIRE (Index, Retrieve); Query User Interface; Inference on results; Results User Interface; Final Results. Data labels: Event Descriptions; Text; Text + DAML; Text + triples; Must; Must not; OK.]
22
Swoogle
23
Swoogle Search
The web, like Gaul, is divided into three parts:
the regular web (e.g. HTML), Semantic Web
Ontologies (SWOs), and Semantic Web Instance
files (SWIs)
[Architecture diagram (Swoogle 2). Document types: SWD, SWO, SWI. Stages: discovery, digest, analysis, service. Components: The Web, Web Crawler, Candidate URLs, SWD Reader, SWD Cache, SWD Metadata, SWD analyzer, IR analyzer, SWD Rank, Web Server, Web Service, Swoogle Search, Ontology Dictionary, Swoogle Statistics. Users: Human users, Intelligent Agents.]
A SWD's rank is a function of its type (SWO/SWI)
and the rank and types of the documents to which
it is related.
Swoogle uses four kinds of crawlers to discover
semantic web documents and several analysis
agents to compute metadata and relations among
documents and ontologies. Metadata is stored in
a relational DBMS. Services are provided to
people and agents.
http://swoogle.umbc.edu/
Statistics as of November 2004
SWDs: 336,000          Classes: 95,000
Triples: 47,000,000    Properties: 53,000
Ontologies: 4,200      Individuals: 7,200,000
SWD IR Engine
Swoogle provides services to people via a web
interface and to agents as web services.
Swoogle puts documents into a character n-gram
based IR engine to compute document similarity
and do retrieval from queries
Contributors include Tim Finin, Anupam Joshi, Yun
Peng, R. Scott Cost, Jim Mayfield, Joel Sachs,
Pavan Reddivari, Vishal Doshi, Rong Pan, Li Ding,
and Drew Ogle. Partial research support was
provided by DARPA contract F30602-00-0591 and by
NSF awards NSF-ITR-IIS-0326460 and
NSF-ITR-IDM-0219649. November 2004.
24
Concepts

Document
A Semantic Web Document (SWD) is an online
document written in semantic web languages (i.e.
RDF and OWL).
An ontology document (SWO) is a SWD that contains
mostly term definitions (i.e. classes and
properties). It corresponds to the T-Box in
Description Logic.
An instance document (SWI or SWDB) is a SWD that
contains mostly class individuals. It corresponds
to the A-Box in Description Logic.
Term
A term is a non-anonymous RDF resource which is
the URI reference of either a class or a
property.
Individual
An individual refers to a non-anonymous RDF
resource which is the URI reference of a class
member.
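
The distinction between ontology and instance documents can be illustrated with a small sketch over a list of triples. The simple majority rule, the treatment of ties, the `_:` prefix for blank nodes, and the set of "term" classes are assumptions made for this example; the slides only say "mostly" and do not give Swoogle's actual criterion.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Classes whose members count as terms (classes or properties) in this sketch.
TERM_CLASSES = {
    "http://www.w3.org/2000/01/rdf-schema#Class",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property",
    "http://www.w3.org/2002/07/owl#Class",
    "http://www.w3.org/2002/07/owl#ObjectProperty",
    "http://www.w3.org/2002/07/owl#DatatypeProperty",
}


def classify_swd(triples):
    """Label a document 'SWO' if it defines more terms than individuals, else 'SWI'."""
    terms, individuals = set(), set()
    for subject, predicate, obj in triples:
        if predicate != RDF_TYPE or subject.startswith("_:"):
            continue  # only typed, non-anonymous resources are counted
        if obj in TERM_CLASSES:
            terms.add(subject)
        else:
            individuals.add(subject)
    # Assumption: ties fall to 'SWI'.
    return "SWO" if len(terms) > len(individuals) else "SWI"


ontology = [
    ("http://xmlns.com/foaf/0.1/Person", RDF_TYPE, "http://www.w3.org/2000/01/rdf-schema#Class"),
    ("http://xmlns.com/foaf/0.1/mbox", RDF_TYPE, "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"),
]
instances = [
    ("http://foo.com/foaf.rdf#finin", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
]
```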

In Swoogle, a document D is a valid SWD iff
JENA correctly parses D and produces at least
one triple.
JENA is a Java framework for writing Semantic
Web applications. http://www.hpl.hp.com/semweb/jena2.htm
[Diagram: http://.../foaf.rdf#finin rdf:type foaf:Person; foaf:Person rdf:type rdfs:Class]
25
Demo
1. Find Time Ontology (Swoogle Search)
2. Digest Time Ontology
   Document view
   Term view
3. Find Term Person (Ontology Dictionary)
4. Digest Term Person
   Class properties
   (Instance) properties
5. Swoogle Statistics
26
Find Time Ontology
Demo1
We can use a set of keywords to search for an ontology.
For example, time, before, and after are basic
concepts for a Time ontology.
27
Usage of Terms in SWD
[Diagram. Two documents, http://www.cs.umbc.edu/finin/foaf.rdf and http://foo.com/foaf.rdf, each contain an individual (e.g., http://foo.com/foaf.rdf#finin) with rdf:type foaf:Person and foaf:mbox finin_at_umbc.edu. The ontology at http://xmlns.com/foaf/1.0/ contains foaf:Person (rdf:type rdfs:Class, rdfs:subClassOf wordNet:Agent) and foaf:mbox (rdf:type rdf:Property, rdfs:domain foaf:Person). Usage labels: populated Class, populated Property, defined Class, defined Property, defined Individual.]
28
Digest Time Ontology (term view)
Demo2(a)
[Screenshot. Terms shown include TimeZone, before, ..., intAfter]
29
Digest Time Ontology (document view)
Demo2(b)
30
Find Term Person
Demo3
Not capitalized! URIref is case sensitive!
31
Digest Term Person
Demo4
167 different properties
562 different properties
32
Demo5
Swoogle Statistics
33
Swoogle IR Search

This is work in progress, not yet fully
integrated into Swoogle
Documents are put into an n-gram IR engine (after
processing by Jena) in canonical XML form
Each contiguous sequence of N characters is used
as an index term (e.g., N=5)
Queries are processed the same way
Character n-grams work almost as well as words but
have some advantages
No tokenization, so works well with artificial
languages and agglutinative languages
=> good for RDF!
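
A minimal sketch of character n-gram indexing follows, assuming a plain in-memory inverted index. The function names and the sample documents are hypothetical and are not taken from Swoogle's IR engine.

```python
from collections import defaultdict


def char_ngrams(text, n=5):
    """Return the set of all contiguous n-character sequences in text."""
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def build_index(docs, n=5):
    """Build an inverted index mapping each n-gram to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


docs = {
    "doc1": "<rdf:Description rdf:about='#timeInterval'/>",
    "doc2": "<rdf:Description rdf:about='#Camera'/>",
}
index = build_index(docs)
```

No tokenizer is involved: markup characters, punctuation, and camelCase identifiers are all just characters, which is why the approach carries over to RDF without special handling.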

34
Why character n-grams?

Suppose we want to find ontologies for time
We might use the following query:
time temporal interval point before after during
day month year eventually calendar clock duration
end begin zone
And have matches for documents with URIs like
http://foo.com/timeont.owl#timeInterval
http://foo.com/timeont.owl#CalendarClockInterval
http://purl.org/upper/temporal/t13.owl#timeThing
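
The example can be worked through with a rough overlap score: count the 5-grams a URI shares with the query. Lower-casing both sides and scoring by raw overlap are assumptions of this sketch; a real IR engine would apply its own normalization and weighting.

```python
def char_ngrams(text, n=5):
    """Return the set of contiguous n-character sequences in lower-cased text."""
    text = text.lower()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


QUERY = (
    "time temporal interval point before after during day month year "
    "eventually calendar clock duration end begin zone"
)

URIS = [
    "http://foo.com/timeont.owl#timeInterval",
    "http://foo.com/timeont.owl#CalendarClockInterval",
    "http://purl.org/upper/temporal/t13.owl#timeThing",
]


def rank(query, uris, n=5):
    """Rank URIs by the number of n-grams they share with the query, best first."""
    query_grams = char_ngrams(query, n)
    scored = [(uri, len(char_ngrams(uri, n) & query_grams)) for uri in uris]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank(QUERY, URIS)
```

All three URIs share n-grams with the query even though none contains a query word as a separate token. Note that with N=5 the four-letter word "time" contributes nothing on its own; the matches come from longer words such as "interval", "calendar", "clock", and "temporal".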

35
Another approach: URIs as words

Remember: ontologies define vocabularies
In OWL, URIs of classes and properties are the
words
So, take a SWD, reduce to triples, extract the
URIs (with duplicates), discard URIs for blank
nodes, hash each URI to a token (use MD5Hash),
and index the document.
Process queries in the same way
Variation: include literal data (e.g., strings)
too.
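
The steps above might be sketched as follows, assuming triples are given as N-Triples-style strings in which blank nodes start with `_:` and literals are double-quoted. That representation and the function name `uri_tokens` are assumptions of this example; the slides name only MD5 as the hash.

```python
import hashlib


def uri_tokens(triples, include_literals=False):
    """Hash each URI in the triples (duplicates kept) into a word-like token."""
    tokens = []
    for triple in triples:
        for node in triple:
            if node.startswith("_:"):
                continue  # discard blank nodes
            if node.startswith('"') and not include_literals:
                continue  # literals are indexed only in the variation
            tokens.append(hashlib.md5(node.encode("utf-8")).hexdigest())
    return tokens


triples = [
    (
        "http://foo.com/foaf.rdf#finin",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://xmlns.com/foaf/0.1/Person",
    ),
    ("_:b0", "http://xmlns.com/foaf/0.1/name", '"Tim Finin"'),
]
tokens = uri_tokens(triples)
```

The resulting token list can be handed to any word-based IR engine as the document's text; a query is reduced to tokens the same way, so identical URIs produce identical index terms.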

36
Conclusion
37
What we have done