Title: Information Retrieval and the Semantic Web
1- Information Retrieval and the Semantic Web
Tim Finin, James Mayfield, Anupam Joshi, R. Scott Cost and Clay Fink
University of Maryland, Baltimore County
Johns Hopkins University, Applied Physics Lab
04 January 2004
DARPA contract F30602-00-0591 and NSF awards ITR-IIS-0326460 and ITR-IIS-0325464 provided partial research support for this work.
2Introduction and motivation
3- XML is Lisp's bastard nephew, with uglier syntax and no semantics. Yet XML is poised to enable the creation of a Web of data that dwarfs anything since the Library at Alexandria.
   -- Philip Wadler, Et tu XML? The fall of the relational empire, VLDB, Rome, September 2001.
4- The web has made people smarter. We need to understand how to use it to make machines smarter, too.
   -- Michael I. Jordan (UC Berkeley), paraphrased from a talk at AAAI, July 2002
5- The Semantic Web will globalize KR, just as the WWW globalized hypertext.
   -- Tim Berners-Lee
6- The multi-agent systems paradigm and the web both emerged around 1990. One has succeeded beyond imagination and the other has not yet made it out of the lab.
   -- Anonymous, 2001
8Vision and Model
9Vision
- Semantic markup (e.g., OWL) as markup
  - Web documents are traditional HTML documents, augmented with machine-readable semantic markup that describes their content
- Inference and retrieval are tightly bound
  - Inference over semantic markup improves retrieval and text retrieval facilitates inference
- Agents should use the web like humans do
  - Think of a query, encode to retrieve possibly relevant documents, read some and extract knowledge, repeat until objectives met
10Why use IR techniques?
- We will want to retrieve over structured and unstructured knowledge
  - We should prepare for the appearance of text documents with embedded SW markup
- We may want to get our SWDs into conventional search engines, such as Google.
  - Mature, scalable, low cost, deployed infrastructure
- IR techniques also have some unique characteristics that may be very useful
  - e.g., ranking matches, document similarity, clustering, relevance feedback, etc.
11Framework: Semantic Markup
[Figure: framework diagram. Components: agent, Local KB, Inference Engine, Semantic Web Query, Statement to be proved, Encoded Markup, Web Search Engine, Ranked Pages, Filters, Semantic Markup.]
12Framework: Incorporating Text
[Figure: framework diagram with text added. Components: Local KB, Inference Engine, Semantic Web Query, Statement to be proved, Encoded Markup, Text Query, Web Search Engine, Ranked Pages, Filters (for text and for semantic markup), Text, Semantic Markup.]
13Harnessing Google
- Google started indexing RDF documents some time in late 2003
- Can we take advantage of this?
- We've developed techniques to get some structured data to be indexed by Google
  - And then later retrieved
- Technique: give Google enhanced documents with additional annotations containing Swangle Terms
14Swangle definition
- swangle
  - Pronunciation: swang-gl
  - Function: transitive verb
  - Inflected Forms: swangled; swangling /-g(-)ling/
  - Etymology: Postmodern English, from C mangle
  - Date: 20th century
  - 1: to convert an RDF triple into one or more IR indexing terms
  - 2: to process a document or query so that its content-bearing markup will be indexed by an IR system
  - Synonym: see tblify
  - swangler /-g(-)lr/ noun
15Swangling
- Swangling turns a SW triple into 7 word-like terms
  - One for each non-empty subset of the three components, with the missing elements replaced by the special "don't care" URI
  - Terms generated by a hashing function (e.g., SHA1)
- Swangling an RDF document means adding in triples with swangle terms.
  - This can be indexed and retrieved via conventional search engines like Google
- Allows one to search for a SWD with a triple that claims "Osama bin Laden is located at X"
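The term generation described on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the DONT_CARE URI, the way the three components are joined before hashing, and the base32 encoding of the SHA-1 digest are all assumptions made for the example.

```python
import base64
import hashlib
from itertools import product

# Hypothetical placeholder: the slides do not give the real "don't care" URI.
DONT_CARE = "http://example.org/swangle#dontCare"


def swangle_term(subject, predicate, obj):
    """Hash one (possibly wildcarded) triple pattern into a word-like term."""
    key = " ".join((subject, predicate, obj)).encode("utf-8")
    digest = hashlib.sha1(key).digest()
    # Base32 uses only A-Z and 2-7, so the result looks like a single word
    # to a text indexer (the encoding choice is an assumption).
    return base64.b32encode(digest).decode("ascii").rstrip("=")


def swangle(subject, predicate, obj):
    """Return 7 terms for a triple: one per non-empty subset of its components."""
    triple = (subject, predicate, obj)
    terms = []
    for keep in product((True, False), repeat=3):
        if not any(keep):
            continue  # skip the empty subset (all three components wildcarded)
        pattern = [part if k else DONT_CARE for part, k in zip(triple, keep)]
        terms.append(swangle_term(*pattern))
    return terms
```

Under these assumptions, a query such as "any subject with predicate p and object o" is swangled the same way, as swangle_term(DONT_CARE, p, o), and the resulting term is submitted to the search engine as an ordinary keyword.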
16A Swangled Triple
<rdf:RDF
    xmlns:s="http://swoogle.umbc.edu/ontologies/swangle.owl#">
  <s:SwangledTriple>
    <s:swangledText>N656WNTZ36KQ5PX6RFUGVKQ63A</s:swangledText>
    <rdfs:comment>Swangled text for http://www.xfront.com/owl/ontologies/camera/#Camera, http://www.w3.org/2000/01/rdf-schema#subClassOf, http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem</rdfs:comment>
    <s:swangledText>M6IMWPWIH4YQI4IMGZYBGPYKEI</s:swangledText>
    <s:swangledText>HO2H3FOPAEM53AQIZ6YVPFQ2XI</s:swangledText>
    <s:swangledText>2AQEUJOYPMXWKHZTENIJS6PQ6M</s:swangledText>
    <s:swangledText>IIVQRXOAYRH6GGRZDFXKEEB4PY</s:swangledText>
    <s:swangledText>75Q5Z3BYAKRPLZDLFNS5KKMTOY</s:swangledText>
    <s:swangledText>2FQ2YI7SNJ7OMXOXIDEEE2WOZU</s:swangledText>
  </s:SwangledTriple>
</rdf:RDF>
17What's the point?
- We'd like to get our documents into Google
  - Swangle terms look like words to Google and other search engines.
- Cloaking obviates modifying the document
  - Add rules to the web server so that, when a search spider asks for document X, the document swangled(X) is returned. Caching makes this efficient
- A swangle term length of 7 may be an acceptable length for a Semantic Web of 10^10 triples -- collision prob for a triple 2×10^-6.
- We could also use Swanglish: hashing each triple into N of the 50K most common English words
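One way the Swanglish idea could work is sketched below. It is only an illustration under stated assumptions: the slide does not say how words are chosen, so the use of SHA-1, the 4-byte slices, and the tiny stand-in VOCABULARY (in place of a real list of the 50K most common English words) are all hypothetical.

```python
import hashlib

# Tiny stand-in vocabulary (assumption); the slide proposes the 50K most
# common English words, which would be loaded from a word list instead.
VOCABULARY = ["time", "water", "people", "year", "house", "world",
              "school", "music", "river", "paper", "garden", "window"]


def swanglish(subject, predicate, obj, n_words=3, vocabulary=VOCABULARY):
    """Map a triple to n_words vocabulary words selected by its SHA-1 digest."""
    key = " ".join((subject, predicate, obj)).encode("utf-8")
    digest = hashlib.sha1(key).digest()
    bytes_per_word = 4
    if n_words * bytes_per_word > len(digest):
        raise ValueError("n_words is too large for a 20-byte SHA-1 digest")
    words = []
    for i in range(n_words):
        chunk = digest[i * bytes_per_word:(i + 1) * bytes_per_word]
        # Each 4-byte slice of the digest selects one word from the vocabulary.
        words.append(vocabulary[int.from_bytes(chunk, "big") % len(vocabulary)])
    return words
```

With a 50,000-word vocabulary and N = 3 there are 50,000^3 = 1.25×10^14 possible ordered word sequences, so N trades off term length against the chance that two triples map to the same words.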
18OWLIR
19Student Event Scenario
- UMBC sends out descriptions of 50 events a week to students.
- Each student has a standing query used to route event messages.
  - A student only receives announcements of events matching his/her interests and schedule.
- Use LMCO's AeroText system to automatically add DAML+OIL markup to event descriptions.
  - Categorize text announcements into event types
  - Identify key elements and add DAML markup
- Use JESS to reason over the markup, drawing ontology-supported inferences
20Event Ontology
- A simple ontology for University events
- Includes classes, subclasses, properties, etc.
- Can include instance data, e.g., UMBC, NEC, Fairleigh Dickinson, etc.
21OWLIR Architecture
[Figure: OWLIR architecture. Components: Agents, Expand Event Description, Info Extraction (LMCO AeroText, Java), Classification, Jess (extract triples, reason), Convert triples to index terms, SIRE (Index, Retrieve), Query User Interface, Inference on results (Jess), Results User Interface. Data labels: Event Descriptions, Text, Text + DAML, Text + triples, Must, Must not, OK, Final Results.]
22Swoogle
23Swoogle Search
SWOOGLE 2
The web, like Gaul, is divided into three parts: the regular web (e.g. HTML), Semantic Web Ontologies (SWOs), and Semantic Web Instance files (SWIs).
[Figure: Swoogle architecture. Layers: discovery, digest, analysis, service. Components: The Web, Web Crawler, Candidate URLs, SWD Reader, SWD Cache, SWD Metadata, SWD analyzer, IR analyzer, SWD Rank, Web Server, Web Service, Swoogle Search, Ontology Dictionary, Swoogle Statistics. Users: human users, intelligent agents. Document types: SWD, SWO, SWI.]
A SWD's rank is a function of its type (SWO/SWI) and the rank and types of the documents to which it's related.
Swoogle uses four kinds of crawlers to discover
semantic web documents and several analysis
agents to compute metadata and relations among
documents and ontologies. Metadata is stored in
a relational DBMS. Services are provided to
people and agents.
http://swoogle.umbc.edu/
Statistics as of November 2004
SWDs        336,000       Classes      95,000
Triples     47,000,000    Properties   53,000
Ontologies  4,200         Individuals  7,200,000
SWD IR Engine
Swoogle provides services to people via a web
interface and to agents as web services.
Swoogle puts documents into a character n-gram
based IR engine to compute document similarity
and do retrieval from queries
Contributors include Tim Finin, Anupam Joshi, Yun Peng, R. Scott Cost, Jim Mayfield, Joel Sachs, Pavan Reddivari, Vishal Doshi, Rong Pan, Li Ding, and Drew Ogle. Partial research support was provided by DARPA contract F30602-00-0591 and by NSF awards NSF-ITR-IIS-0326460 and NSF-ITR-IDM-0219649. November 2004.
24Concepts
- Document
  - A Semantic Web Document (SWD) is an online document written in semantic web languages (i.e. RDF and OWL).
  - An ontology document (SWO) is a SWD that contains mostly term definitions (i.e. classes and properties). It corresponds to the T-Box in Description Logic.
  - An instance document (SWI or SWDB) is a SWD that contains mostly class individuals. It corresponds to the A-Box in Description Logic.
- Term
  - A term is a non-anonymous RDF resource which is the URI reference of either a class or a property.
- Individual
  - An individual refers to a non-anonymous RDF resource which is the URI reference of a class member.
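The SWO/SWI distinction above can be illustrated with a small Python sketch. It is not Swoogle's actual classifier: the slides say only "mostly", so the 0.5 threshold, the TERM_TYPES list, the prefixed-name strings, and the "_:" convention for anonymous resources are assumptions chosen for the example.

```python
RDF_TYPE = "rdf:type"
# Types whose instances are counted as term definitions (assumed list).
TERM_TYPES = {"rdfs:Class", "owl:Class", "rdf:Property",
              "owl:ObjectProperty", "owl:DatatypeProperty"}


def classify_swd(triples, threshold=0.5):
    """Label a document "SWO" or "SWI" from its (subject, predicate, object) triples."""
    terms, individuals = set(), set()
    for subject, predicate, obj in triples:
        if predicate != RDF_TYPE or subject.startswith("_:"):
            continue  # only typed, non-anonymous resources are counted
        if obj in TERM_TYPES:
            terms.add(subject)        # a class or property definition
        else:
            individuals.add(subject)  # a member of some class
    total = len(terms) + len(individuals)
    if total == 0:
        return "unknown"
    # "Mostly term definitions" is read here as more than `threshold` of the total.
    return "SWO" if len(terms) / total > threshold else "SWI"
```

For example, a document that defines foaf:Person and foaf:mbox and describes one person would be labeled "SWO", while a document containing only people typed as foaf:Person would be labeled "SWI".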
In Swoogle, a document D is a valid SWD iff Jena correctly parses D and produces at least one triple.
Jena is a Java framework for writing Semantic Web applications: http://www.hpl.hp.com/semweb/jena2.htm
[Figure: example triples. foaf:Person rdf:type rdfs:Class; http://.../foaf.rdf#finin rdf:type foaf:Person.]
25Demo
1. Find Time Ontology (Swoogle Search)
2. Digest Time Ontology
   - Document view
   - Term view
3. Find Term Person (Ontology Dictionary)
4. Digest Term Person
   - Class properties
   - (Instance) properties
5. Swoogle Statistics
26Find Time Ontology
Demo1
We can use a set of keywords to search for an ontology. For example, "time", "before", and "after" are basic concepts for a Time ontology.
27Usage of Terms in SWD
[Figure: usage of terms in SWDs. Documents: http://www.cs.umbc.edu/finin/foaf.rdf, http://foo.com/foaf.rdf, and the ontology http://xmlns.com/foaf/1.0/. Nodes and edges: http://foo.com/foaf.rdf#finin, finin_at_umbc.edu, foaf:Person, foaf:mbox, wordNet:Agent, rdfs:Class, rdf:Property, rdf:type, rdfs:subClassOf, rdfs:domain. Annotations: defined Class, populated Class, defined Property, populated Property, defined Individual.]
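The "defined" and "populated" annotations on this slide are not spelled out in the text. One plausible reading, sketched below in Python, is that a class or property is defined where it is declared (as the subject of an rdf:type statement naming a class or property type) and populated where it is used (a class as the object of rdf:type, a property as a predicate). The type lists and the prefixed-name strings are assumptions for the example.

```python
RDF_TYPE = "rdf:type"
CLASS_TYPES = {"rdfs:Class", "owl:Class"}
PROPERTY_TYPES = {"rdf:Property", "owl:ObjectProperty", "owl:DatatypeProperty"}


def term_usage(triples):
    """Group the terms in a list of triples by how they are used."""
    usage = {"defined_classes": set(), "populated_classes": set(),
             "defined_properties": set(), "populated_properties": set()}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            if obj in CLASS_TYPES:
                usage["defined_classes"].add(subject)      # X rdf:type rdfs:Class
            elif obj in PROPERTY_TYPES:
                usage["defined_properties"].add(subject)   # P rdf:type rdf:Property
            else:
                usage["populated_classes"].add(obj)        # i rdf:type X
        else:
            usage["populated_properties"].add(predicate)   # s P o
    return usage
```

Applied to the triples suggested by the figure, foaf:Person would appear as both a defined class (in the FOAF ontology) and a populated class (in the instance documents), and foaf:mbox as both a defined and a populated property.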
28Digest Time Ontology (term view)
Demo2(a)
[Screenshot: term view listing terms such as TimeZone, before, ..., intAfter]
29Digest Time Ontology (document view)
Demo2(b)
30Find Term Person
Demo3
Not capitalized! URIref is case sensitive!
31Digest Term Person
Demo4
167 different properties
562 different properties
32Demo5
Swoogle Statistics
33Swoogle IR Search
- This is work in progress, not yet fully integrated into Swoogle
- Documents are put into an ngram IR engine (after processing by Jena) in canonical XML form
  - Each contiguous sequence of N characters is used as an index term (e.g., N=5)
  - Queries processed the same way
- Character ngrams work almost as well as words but have some advantages
  - No tokenization, so works well with artificial languages and agglutinative languages
  - => good for RDF!
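The indexing scheme on this slide can be sketched as a tiny in-memory index in Python. This is a minimal illustration, not the engine Swoogle uses: the NgramIndex class, the lower-casing step, and the scoring (number of distinct query n-grams found in a document) are assumptions made for the example.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=5):
    """Every contiguous sequence of n characters, in order, lower-cased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramIndex:
    """Inverted index from character n-grams to the documents containing them."""

    def __init__(self, n=5):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> set of document ids

    def add(self, doc_id, text):
        for gram in char_ngrams(text, self.n):
            self.postings[gram].add(doc_id)

    def search(self, query):
        """Rank documents by how many distinct query n-grams they contain."""
        scores = Counter()
        for gram in set(char_ngrams(query, self.n)):  # queries processed the same way
            for doc_id in self.postings.get(gram, ()):
                scores[doc_id] += 1
        return scores.most_common()
```

Because no tokenizer is involved, a query for "interval" can match a document whose only mention of the word is inside an identifier such as timeInterval.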
34Why character n-grams?
- Suppose we want to find ontologies for time
- We might use the following query
  - time temporal interval point before after during day month year eventually calendar clock duration end begin zone
- And have matches for documents with URIs like
  - http://foo.com/timeont.owl#timeInterval
  - http://foo.com/timeont.owl#CalendarClockInterval
  - http://purl.org/upper/temporal/t13.owl#timeThing
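The point of this slide can be checked with a short Python comparison of whole-word matching against character n-gram matching for the query and URIs listed here. The tokenization rule (runs of letters and digits), the lower-casing, and N = 5 are assumptions for the sketch.

```python
import re

QUERY = ("time temporal interval point before after during day month year "
         "eventually calendar clock duration end begin zone")
URIS = [
    "http://foo.com/timeont.owl#timeInterval",
    "http://foo.com/timeont.owl#CalendarClockInterval",
    "http://purl.org/upper/temporal/t13.owl#timeThing",
]


def words(text):
    """Lower-cased alphanumeric tokens, as a word-based indexer might see them."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ngrams(text, n=5):
    """Set of all contiguous character n-grams of the lower-cased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def compare(uri, query=QUERY, n=5):
    """Return (shared whole words, shared character n-grams) for a URI and a query."""
    query_words = words(query)
    query_grams = set()
    for word in query_words:
        query_grams |= ngrams(word, n)
    return words(uri) & query_words, ngrams(uri, n) & query_grams


for uri in URIS:
    word_hits, gram_hits = compare(uri)
    print(uri, "words:", sorted(word_hits), "n-grams:", sorted(gram_hits))
```

Under these assumptions the first two URIs share no whole word with the query, because the query words are embedded in longer identifiers such as timeInterval, yet they still share 5-grams such as "inter" and "calen". Query words shorter than N (e.g., "time" with N = 5) contribute no n-grams, so a smaller N would be needed to match them.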
35Another approach: URIs as words
- Remember: ontologies define vocabularies
- In OWL, URIs of classes and properties are the words
- So, take a SWD, reduce to triples, extract the URIs (with duplicates), discard URIs for blank nodes, hash each URI to a token (use MD5Hash), and index the document.
- Process queries in the same way
- Variation: include literal data (e.g., strings) too.
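The pipeline on this slide can be sketched in Python as follows. It is a minimal illustration under stated assumptions: triples are given as tuples of strings, blank nodes are written with a "_:" prefix and literals in double quotes (an N-Triples-like convention chosen for the example), and the MD5 digest is rendered as a hexadecimal token.

```python
import hashlib


def uri_tokens(triples, include_literals=False):
    """Turn the URIs in a list of triples into hashed, word-like index tokens."""
    tokens = []
    for triple in triples:
        for node in triple:
            if node.startswith("_:"):
                continue  # discard blank nodes
            if node.startswith('"') and not include_literals:
                continue  # literals are only indexed in the variation
            # Duplicates are kept so that term frequency survives indexing.
            tokens.append(hashlib.md5(node.encode("utf-8")).hexdigest())
    return tokens
```

The resulting token list can be joined with spaces and handed to any word-based IR engine; a query URI is hashed the same way before searching.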
36Conclusion
37What we have done
- Developed Swoogle, a crawler-based retrieval system for SWDs
- Developed and implemented a technique to get Google to index and retrieve SWDs
- Prototyped (twice) an ngram-based IR engine for SWDs
- Explored the integration of inference and retrieval
- Used these in several demonstration systems