Title: Information Retrieval and the Semantic Web

1
  • Information Retrieval and the Semantic Web

Tim Finin, James Mayfield, Anupam Joshi, R. Scott Cost and Clay Fink
University of Maryland, Baltimore County
Johns Hopkins University Applied Physics Lab
04 January 2004
DARPA contract F30602-00-0591 and NSF awards
ITR-IIS-0326460 and ITR-IIS-0325464 provided
partial research support for this work
2
Introduction and motivation
3
  • XML is Lisp's bastard nephew, with uglier syntax
    and no semantics. Yet XML is poised to enable the
    creation of a Web of data that dwarfs anything
    since the Library at Alexandria.
  • -- Philip Wadler, Et tu XML? The fall of
    the relational empire, VLDB, Rome, September
    2001.

4
  • The web has made people smarter. We need to
    understand how to use it to make machines
    smarter, too.
  • -- Michael I. Jordan (UC Berkeley),
    paraphrased from a talk at AAAI, July 2002

5
  • The Semantic Web will globalize KR, just as the
    WWW globalized hypertext
  • -- Tim Berners-Lee

6
  • The multi-agent systems paradigm and the web
    both emerged around 1990. One has succeeded
    beyond imagination and the other has not yet made
    it out of the lab.
  • -- Anonymous, 2001

8
Vision and Model
9
Vision
  • Semantic markup (e.g., OWL) as markup
  • Web documents are traditional HTML documents,
    augmented with machine-readable semantic markup
    that describes their content
  • Inference and retrieval are tightly bound
  • Inference over semantic markup improves retrieval
    and text retrieval facilitates inference
  • Agents should use the web like humans do
  • Think of a query, encode to retrieve possibly
    relevant documents, read some and extract
    knowledge, repeat until objectives met

10
Why use IR techniques?
  • We will want to retrieve over structured and
    unstructured knowledge
  • We should prepare for the appearance of text
    documents with embedded SW markup
  • We may want to get our SWDs into conventional
    search engines, such as Google.
  • Mature, scalable, low cost, deployed
    infrastructure
  • IR techniques also have some unique
    characteristics that may be very useful
  • e.g., ranking matches, document similarity,
    clustering, relevance feedback, etc.

11
Framework: Semantic Markup
[Figure: block diagram. Labels: agent, Local KB,
Semantic Web Query, Inference Engine, Statement to
be proved, Encoded Markup, Web Search Engine,
Ranked Pages, Filters, Semantic Markup.]
12
Framework: Incorporating Text
[Figure: block diagram. Labels: Local KB, Semantic
Web Query, Text Query, Inference Engine, Statement
to be proved, Encoded Markup, Web Search Engine,
Ranked Pages, Filters, Text, Semantic Markup.]
13
Harnessing Google
  • Google started indexing RDF documents some time
    in late 2003
  • Can we take advantage of this?
  • We've developed techniques to get some structured
    data to be indexed by Google
  • And then later retrieved
  • Technique: give Google enhanced documents with
    additional annotations containing Swangle Terms

14
Swangle definition
  • swangle
  • Pronunciation: swang-gl. Function: transitive
    verb. Inflected forms: swangled, swangling
    /-g(-)ling/. Etymology: Postmodern English,
    from C mangle. Date: 20th century
  • 1: to convert an RDF triple into one or more IR
    indexing terms
  • 2: to process a document or query so that its
    content-bearing markup will be indexed by an
    IR system
  • Synonym: see tblify
  • - swangler /-g(-)lr/ noun

15
Swangling
  • Swangling turns a SW triple into 7 word-like
    terms
  • One for each non-empty subset of the three
    components, with the missing elements replaced by
    the special "don't care" URI
  • Terms generated by a hashing function (e.g.,
    SHA1)
  • Swangling an RDF document means adding in triples
    with swangle terms.
  • This can be indexed and retrieved via
    conventional search engines like Google
  • Allows one to search for a SWD with a triple that
    claims "Osama bin Laden is located at X"
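
The subset-and-hash step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `DONT_CARE` URI, the space-joined serialization of the triple, the base32 encoding, and the 26-character truncation are all assumptions chosen for the example.

```python
import base64
import hashlib
from itertools import product

# Hypothetical placeholder; the slides do not give the actual "don't care" URI.
DONT_CARE = "urn:example:dont-care"


def swangle(subject, predicate, obj, length=26):
    """Return 7 word-like terms, one per non-empty subset of the triple."""
    triple = (subject, predicate, obj)
    terms = []
    for mask in product((True, False), repeat=3):
        if not any(mask):
            continue  # skip the empty subset (all three components missing)
        # Keep the selected components, replace the rest with the wildcard URI.
        parts = [part if keep else DONT_CARE for part, keep in zip(triple, mask)]
        digest = hashlib.sha1(" ".join(parts).encode("utf-8")).digest()
        # Base32 yields only letters and digits, so the term looks like a word.
        terms.append(base64.b32encode(digest).decode("ascii")[:length])
    return terms


doc_terms = swangle("http://example.org/a", "http://example.org/p", "http://example.org/x")
query_terms = swangle("http://example.org/a", "http://example.org/p", "http://example.org/y")
# The two triples differ only in the object, so they share the terms in which
# the object is the wildcard.
print(sorted(set(doc_terms) & set(query_terms)))
```

A query with an unknown component is swangled the same way, and the shared wildcard terms are what a conventional search engine would match on.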

16
A Swangled Triple
  <rdf:RDF
    xmlns:s="http://swoogle.umbc.edu/ontologies/swangle.owl"
  </rdf>
  <s:SwangledTriple>
    <s:swangledText>N656WNTZ36KQ5PX6RFUGVKQ63A</s:swangledText>
    <rdfs:comment>Swangled text for
      http://www.xfront.com/owl/ontologies/camera/Camera,
      http://www.w3.org/2000/01/rdf-schema#subClassOf,
      http://www.xfront.com/owl/ontologies/camera/PurchaseableItem
    </rdfs:comment>
    <s:swangledText>M6IMWPWIH4YQI4IMGZYBGPYKEI</s:swangledText>
    <s:swangledText>HO2H3FOPAEM53AQIZ6YVPFQ2XI</s:swangledText>
    <s:swangledText>2AQEUJOYPMXWKHZTENIJS6PQ6M</s:swangledText>
    <s:swangledText>IIVQRXOAYRH6GGRZDFXKEEB4PY</s:swangledText>
    <s:swangledText>75Q5Z3BYAKRPLZDLFNS5KKMTOY</s:swangledText>
    <s:swangledText>2FQ2YI7SNJ7OMXOXIDEEE2WOZU</s:swangledText>
  </s:SwangledTriple>

17
What's the point?
  • We'd like to get our documents into Google
  • Swangle terms look like words to Google and other
    search engines.
  • Cloaking obviates modifying the document
  • Add rules to the web server so that, when a
    search spider asks for document X, the document
    swangled(X) is returned. Caching makes this
    efficient
  • A swangle term length of 7 may be an acceptable
    length for a Semantic Web of 10^10 triples --
    collision prob for a triple 2×10^-6.
  • We could also use Swanglish: hashing each triple
    into N of the 50K most common English words

18
OWLIR
19
Student Event Scenario
  • UMBC sends out descriptions of 50 events a week
    to students.
  • Each student has a standing query used to route
    event messages.
  • A student only receives announcements of events
    matching his/her interests and schedule.
  • Use LMCO's AeroText system to automatically add
    DAML+OIL markup to event descriptions.
  • Categorize text announcements into event types
  • Identify key elements and add DAML markup
  • Use JESS to reason over the markup, drawing
    ontology-supported inferences

20
Event Ontology
  • A simple ontology for University events
  • Includes classes, subclasses, properties, etc.
  • Can include instance data, e.g., UMBC, NEC,
    Fairleigh Dickinson, etc.

21
OWLIR Architecture
[Figure: architecture diagram. Labels: Event
Descriptions, Classification, Info Extraction (LMCO
AeroText, Java), Expand Event Description, Extract
triples / reason (Jess), Convert triples to index
terms, Text, Text + DAML, Text + triples, Index,
SIRE, Retrieve, Query User Interface (Must, OK,
Must not), Inference on results (Jess), Final
Results, Results User Interface, Agents.]
22
Swoogle
23
Swoogle Search
[Figure: Swoogle 2 architecture diagram. Labels: The
Web, Web Crawler, Candidate URLs, SWD Reader, SWD
Cache, SWD Metadata, SWD analyzer, IR analyzer, SWD
Rank, SWD IR Engine, Swoogle Search, Ontology
Dictionary, Swoogle Statistics, Web Server, Web
Service, Human users, Intelligent Agents; stages
labeled discovery, digest, analysis, service;
document types SWD, SWO, SWI.]
The web, like Gaul, is divided into three parts:
the regular web (e.g., HTML), Semantic Web
Ontologies (SWOs), and Semantic Web Instance
files (SWIs).
A SWD's rank is a function of its type (SWO/SWI)
and the rank and types of the documents to which
it is related.
Swoogle uses four kinds of crawlers to discover
semantic web documents and several analysis
agents to compute metadata and relations among
documents and ontologies. Metadata is stored in
a relational DBMS. Services are provided to
people and agents.
http://swoogle.umbc.edu/
Statistics as of November 2004
  SWDs        336,000      Classes      95,000
  Triples     47,000,000   Properties   53,000
  Ontologies  4,200        Individuals  7,200,000
Swoogle provides services to people via a web
interface and to agents as web services.
Swoogle puts documents into a character n-gram
based IR engine to compute document similarity
and do retrieval from queries.
Contributors include Tim Finin, Anupam Joshi, Yun
Peng, R. Scott Cost, Jim Mayfield, Joel Sachs,
Pavan Reddivari, Vishal Doshi, Rong Pan, Li Ding,
and Drew Ogle. Partial research support was
provided by DARPA contract F30602-00-0591 and by
NSF awards NSF-ITR-IIS-0326460 and
NSF-ITR-IDM-0219649. November 2004.
24
Concepts
  • Document
  • A Semantic Web Document (SWD) is an online
    document written in semantic web languages (i.e.
    RDF and OWL).
  • An ontology document (SWO) is a SWD that contains
    mostly term definition (i.e. classes and
    properties). It corresponds to T-Box in
    Description Logic.
  • An instance document (SWI or SWDB) is a SWD that
    contains mostly class individuals. It corresponds
    to A-Box in Description Logic.
  • Term
  • A term is a non-anonymous RDF resource which is
    the URI reference of either a class or a
    property.
  • Individual
  • An individual refers to a non-anonymous RDF
    resource which is the URI reference of a class
    member.
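
As a rough illustration of the term/individual distinction, the sketch below sorts the subjects of `rdf:type` statements into the two groups. It is a simplification under assumptions: triples are plain string tuples, blank nodes are written with a `_:` prefix, and only a handful of RDFS/OWL types are treated as marking a class or property. How Swoogle actually classifies resources is not described in the slides.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Types whose instances are classes or properties (an illustrative subset).
TERM_TYPES = {
    "http://www.w3.org/2000/01/rdf-schema#Class",
    "http://www.w3.org/2002/07/owl#Class",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property",
    "http://www.w3.org/2002/07/owl#ObjectProperty",
    "http://www.w3.org/2002/07/owl#DatatypeProperty",
}


def classify(triples):
    """Split non-anonymous, typed resources into terms and individuals."""
    terms, individuals = set(), set()
    for subject, predicate, obj in triples:
        if predicate != RDF_TYPE or subject.startswith("_:"):
            continue  # only rdf:type statements about named resources count
        if obj in TERM_TYPES:
            terms.add(subject)        # a class or property definition
        else:
            individuals.add(subject)  # a member of some class
    return terms, individuals


FOAF = "http://xmlns.com/foaf/0.1/"  # namespace chosen for the example
triples = [
    (FOAF + "Person", RDF_TYPE, "http://www.w3.org/2000/01/rdf-schema#Class"),
    (FOAF + "mbox", RDF_TYPE, "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"),
    ("http://example.org/foaf.rdf#finin", RDF_TYPE, FOAF + "Person"),
    ("_:b1", RDF_TYPE, FOAF + "Person"),  # anonymous, so ignored
]
terms, individuals = classify(triples)
print(sorted(terms), sorted(individuals))
```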

In Swoogle, a document D is a valid SWD iff
JENA correctly parses D and produces at least
one triple.
JENA is a Java framework for writing Semantic
Web applications. http://www.hpl.hp.com/semweb/jena2.htm
[Figure: RDF graph fragments. Labels: rdf:type,
foaf:Person, rdfs:Class, http://.../foaf.rdf#finin.]
25
Demo
  • 1. Find Time Ontology (Swoogle Search)
  • 2. Digest Time Ontology: document view, term view
  • 3. Find Term Person (Ontology Dictionary)
  • 4. Digest Term Person: class properties,
    (instance) properties
  • 5. Swoogle Statistics
26
Find Time Ontology
Demo1
We can use a set of keywords to search for an
ontology. For example, time, before, and after are
basic concepts for a Time ontology.
27
Usage of Terms in SWD
[Figure: two FOAF documents
(http://www.cs.umbc.edu/finin/foaf.rdf and
http://foo.com/foaf.rdf) and the FOAF ontology
(http://xmlns.com/foaf/1.0/). Annotations: defined
Class, populated Class, defined Property, populated
Property, defined Individual. Edge and node labels:
rdf:type, rdfs:subClassOf, rdfs:domain, foaf:mbox,
foaf:Person, wordNet:Agent, rdfs:Class,
rdf:Property, http://foo.com/foaf.rdf#finin,
finin_at_umbc.edu.]
28
Digest Time Ontology (term view)
Demo2(a)
[Screenshot: term list including TimeZone, before,
..., intAfter.]
29
Digest Time Ontology (document view)
Demo2(b)
30
Find Term Person
Demo3
Not capitalized! URIref is case sensitive!
31
Digest Term Person
Demo4
167 different properties
562 different properties
32
Demo5
Swoogle Statistics
33
Swoogle IR Search
  • This is work in progress, not yet fully
    integrated into Swoogle
  • Documents are put into an n-gram IR engine (after
    processing by Jena) in canonical XML form
  • Each contiguous sequence of N characters is used
    as an index term (e.g., N=5)
  • Queries are processed the same way
  • Character n-grams work almost as well as words but
    have some advantages
  • No tokenization, so works well with artificial
    languages and agglutinative languages
  • => good for RDF!
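
A character n-gram index of this kind can be sketched in a few lines. This is a toy illustration, not Swoogle's engine: lowercasing, the set-based inverted index, and scoring by the count of shared n-grams are assumptions made for the example.

```python
from collections import defaultdict


def char_ngrams(text, n=5):
    """Every contiguous run of n characters, lowercased (an assumption)."""
    text = text.lower()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def build_index(documents, n=5):
    """Inverted index: n-gram -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def search(index, query, n=5):
    """Process the query the same way and rank documents by shared n-grams."""
    scores = defaultdict(int)
    for gram in char_ngrams(query, n):
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


documents = {
    "time": '<owl:Class rdf:ID="TimeInterval"/>',
    "camera": '<owl:Class rdf:ID="Camera"/>',
}
index = build_index(documents)
print(search(index, "interval"))
```

No tokenizer is involved at any point, which is why the same code handles XML markup, URIs, and natural-language text alike.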

34
Why character n-grams?
  • Suppose we want to find ontologies for time
  • We might use the following query
  • time temporal interval point before after during
    day month year eventually calendar clock duration
    end begin zone
  • And have matches for documents with URIs like
  • http://foo.com/timeont.owl#timeInterval
  • http://foo.com/timeont.owl#CalendarClockInterval
  • http://purl.org/upper/temporal/t13.owl#timeThing
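
To see why n-grams help here, the sketch below compares whole-word matching with character 5-gram matching on the query and URIs from this slide. The URI punctuation (`://` and `#`) is restored by assumption, and the tokenizer, lowercasing, and overlap counts are illustrative choices rather than the engine's actual behavior.

```python
import re

QUERY = (
    "time temporal interval point before after during day month year "
    "eventually calendar clock duration end begin zone"
)
URIS = [
    "http://foo.com/timeont.owl#timeInterval",
    "http://foo.com/timeont.owl#CalendarClockInterval",
    "http://purl.org/upper/temporal/t13.owl#timeThing",
]


def char_ngrams(text, n=5):
    """Every contiguous run of n characters, lowercased."""
    text = text.lower()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def word_overlap(query, uri):
    """Query words that appear as whole tokens of the URI."""
    uri_tokens = set(re.split(r"[^a-z0-9]+", uri.lower()))
    return len(set(query.lower().split()) & uri_tokens)


def ngram_overlap(query, uri, n=5):
    """Character n-grams shared by the query and the URI."""
    return len(char_ngrams(query, n) & char_ngrams(uri, n))


for uri in URIS:
    print(uri, "words:", word_overlap(QUERY, uri), "n-grams:", ngram_overlap(QUERY, uri))
```

Compound identifiers such as `CalendarClockInterval` are a single token to a word-based tokenizer, so no query word matches them, while n-grams like `clock` and `inter` still line up.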

35
Another approach: URIs as words
  • Remember: ontologies define vocabularies
  • In OWL, URIs of classes and properties are the
    words
  • So, take a SWD, reduce to triples, extract the
    URIs (with duplicates), discard URIs for blank
    nodes, hash each URI to a token (use MD5Hash),
    and index the document.
  • Process queries in the same way
  • Variation: include literal data (e.g., strings)
    too.
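
The pipeline in the bullets above might be sketched like this. The input convention is an assumption made for the example: triples are string tuples, blank nodes start with `_:`, and literals are wrapped in double quotes as in N-Triples. The hex MD5 digest stands in for whatever token format the authors used.

```python
import hashlib
from collections import Counter


def token(value):
    """Hash a URI (or literal) to an opaque, word-like token."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def document_tokens(triples, include_literals=False):
    """Bag of hashed URIs for one document; duplicates are kept as counts."""
    counts = Counter()
    for triple in triples:
        for node in triple:
            if node.startswith("_:"):
                continue  # discard blank nodes
            if node.startswith('"') and not include_literals:
                continue  # literals are only indexed in the variation
            counts[token(node)] += 1
    return counts


triples = [
    ("http://example.org/a", "http://example.org/p", "_:b0"),
    ("http://example.org/a", "http://example.org/q", '"hello"'),
]
bag = document_tokens(triples)
print(bag[token("http://example.org/a")], sum(bag.values()))
```

A query would be reduced to tokens by the same function, so matching becomes ordinary term matching in any word-based IR engine.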

36
Conclusion
37
What we have done
  • Developed Swoogle, a crawler-based retrieval
    system for SWDs
  • Developed and implemented a technique to get
    Google to index and retrieve SWDs
  • Prototyped (twice) an n-gram-based IR engine for
    SWDs
  • Explored the integration of inference and
    retrieval
  • Used these in several demonstration systems