Title: Search Engines for Semantic Web Knowledge
1Search Engines for Semantic Web Knowledge
- Tim Finin
- University of Maryland, Baltimore County
- Joint work with Li Ding, Anupam Joshi, Yun Peng,
Pranam Kolari, Pavan Reddivari, Sandor Dornbush,
Rong Pan, Akshay Java, Joel Sachs, Scott Cost and
Vishal Doshi
http://creativecommons.org/licenses/by-nc-sa/2.0/
This work was partially supported by DARPA
contract F30602-97-1-0215, NSF grants CCR007080
and IIS9875433 and grants from IBM, Fujitsu and
HP.
2This talk
- Motivation
- Semantic web 101
- Swoogle Semantic Web search engine
- Use cases and applications
- State of the Semantic Web
- Conclusions
3Once there were only a few large computers
4Then there were many,
5All connected 24x7,
RFID
802.11
TCP/IP
UltraWideBand
Bluetooth
SoftwareRadio
IRDA
6Interoperating
- tcp/ip ftp smtp
- rpc corba ssh
- http html
- xml
- gif jpg mpg mp3
- pdf
7Access to the world's knowledge
del.icio.us
8Google has made us smarter
9But what about our agents?
- Agents still have a very minimal understanding of
text and images.
10This talk
- Motivation
- Semantic web 101
- Swoogle Semantic Web search engine
- Use cases and applications
- State of the Semantic Web
- Conclusions
11XML helps
- "XML is Lisp's bastard nephew, with uglier syntax
and no semantics. Yet XML is poised to enable the
creation of a Web of data that dwarfs anything
since the Library at Alexandria."
- -- Philip Wadler, Et tu XML? The fall of
the relational empire, VLDB, Rome, September
2001.
12Semantic Web adds semantics
- "The Semantic Web will globalize KR, just as the
WWW globalized hypertext"
- -- Tim Berners-Lee
13Semantic Web 101
- RDF/XML
- rdf:RDF tag
- namespaces ? ontologies
- Semantic graph, URIs as nodes and links
- triples
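As a rough illustration of the graph model on this slide, the sketch below stores RDF statements as (subject, predicate, object) tuples in plain Python. The example.org URIs and the helper name `objects` are made up for illustration; only the FOAF and RDF namespace URIs are real vocabularies.

```python
# Minimal sketch: an RDF graph as a set of (subject, predicate, object) triples.
# The example.org URIs are hypothetical; the FOAF and RDF namespaces are real.
FOAF = "http://xmlns.com/foaf/0.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

graph = {
    ("http://example.org/jo", RDF + "type", FOAF + "Person"),
    ("http://example.org/jo", FOAF + "name", "Jo Lambda"),
    ("http://example.org/jo", FOAF + "knows", "http://example.org/sam"),
}


def objects(graph, subject, predicate):
    """Return every object linked from `subject` by `predicate`, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


names = objects(graph, "http://example.org/jo", FOAF + "name")
print(names)
```

Nodes are URIs (or literals such as the name string), and each tuple is one labeled link in the graph.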
14Where's the semantics?
- URIs as common rigid designators
- Conventions let URIs denote things in the real world
- Namespace URIs give an unambiguous shared vocabulary
- RDF, RDFS and OWL have semantics defined using model theory and also axioms
- Ontologies allow agents to draw inferences
- uni:Student is a subclass of foaf:Person
- Every uni:Student uni:attends at least one uni:School
- A foaf:Person with a uni:school is necessarily a uni:Student
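To make the first inference rule concrete, here is a minimal sketch of subclass reasoning over triples. It covers only the rdfs:subClassOf case (not the cardinality or "necessarily a" rules), and the individual `ex:alice` and the function name `infer_types` are hypothetical.

```python
# Minimal sketch of rdfs:subClassOf inference over (s, p, o) triples.
# The "uni:" names mirror the slide; the individual ex:alice is hypothetical.
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

triples = {
    ("uni:Student", SUBCLASS, "foaf:Person"),
    ("ex:alice", TYPE, "uni:Student"),
}


def infer_types(triples):
    """Add (x, rdf:type, C2) whenever x has type C1 and C1 is a subclass of C2."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new triple appears (fixed point)
        changed = False
        for s, p, c1 in list(inferred):
            if p != TYPE:
                continue
            for c, q, c2 in list(inferred):
                if q == SUBCLASS and c == c1 and (s, TYPE, c2) not in inferred:
                    inferred.add((s, TYPE, c2))
                    changed = True
    return inferred


closure = infer_types(triples)
print(("ex:alice", TYPE, "foaf:Person") in closure)
```

An agent that only knows how to look for foaf:Person instances can now find ex:alice, even though the source data never says so directly.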
15Much of the RDF data will come from databases,
just like HTML content.
17RDF/a
- RDF/a is a W3C proposal for embedding RDF in
XHTML documents
<html xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <head><title>Jo Lambda's Home Page</title></head>
  <body>
    Hello. This is <span property="foaf:name">Jo Lambda</span>'s home page.
    <h2>Work</h2>
    If you want to contact me at work, you can either
    <a rel="foaf:mbox" href="mailto:jo.lambda@example.org">email me</a>,
    or call <span property="foaf:phone">1 777 888 9999</span>.
  </body>
</html>
An HTML Document with RDF embedded
The triples in ntriple format.
<> foaf:name "Jo Lambda"^^rdf:XMLLiteral ;
   foaf:mbox <mailto:jo.lambda@example.org> ;
   foaf:phone "1 777 888 9999"^^rdf:XMLLiteral .
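The mapping from the annotated XHTML to triples can be sketched with Python's standard `html.parser`. This is a simplified illustration, not a conforming RDF/a processor: it only looks at `property` on `span` elements and at `rel` plus `href`, and the class name `RdfaSketch` is an assumption made for this example.

```python
from html.parser import HTMLParser


class RdfaSketch(HTMLParser):
    """Collect (predicate, object) pairs from `property` and `rel` attributes."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # extracted (predicate, object) pairs
        self._pending = None   # property whose text content is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "property" in attrs:
            # The element's text content becomes the literal object.
            self._pending = attrs["property"]
            self._text = []
        elif "rel" in attrs and "href" in attrs:
            # The link target becomes the object.
            self.pairs.append((attrs["rel"], attrs["href"]))

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: only <span> elements carry `property` here.
        if self._pending is not None and tag == "span":
            self.pairs.append((self._pending, "".join(self._text).strip()))
            self._pending = None


doc = """<html xmlns:foaf="http://xmlns.com/foaf/0.1/">
<body>This is <span property="foaf:name">Jo Lambda</span>'s home page.
<a rel="foaf:mbox" href="mailto:jo.lambda@example.org">email me</a>,
or call <span property="foaf:phone">1 777 888 9999</span>.</body></html>"""

parser = RdfaSketch()
parser.feed(doc)
parser.close()
pairs = parser.pairs
print(pairs)
```

Each pair is a predicate and object whose implied subject is the document itself, written `<>` in the triples above.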
18But what about our agents?
- A Google for knowledge on the Semantic Web is
needed by software agents and programs
19This talk
- Motivation
- Semantic web 101
- Swoogle Semantic Web search engine
- Use cases and applications
- State of the Semantic Web
- Conclusions
20- http://swoogle.umbc.edu/
- Running since summer 2004
- 1.4M RDF documents, 250M RDF triples, 10K
ontologies
21Swoogle Architecture
22A Hybrid Harvesting Framework
[Diagram labels: Swoogle sample dataset; manual submission; inductive learner; seeds R, M and H; RDF crawling; bounded HTML crawling; meta crawling (Google API call); the Web]
23Performance: Site Coverage
- SW06MAR - Basic statistics (Mar 31, 2006)
- 1.3M SWDs (Semantic Web documents) from 157K websites
- 268M triples
- 61K SWOs (Semantic Web ontologies), including >10K of high quality
- 1.4M SWTs (Semantic Web terms) using 12K namespaces
- Significance
- Compare with existing work (DAML crawler, scutter)
- Compare SW06MAR with Google's estimated SWDs
[Chart: SWDs per website, by website]
24Performance: crawlers' contribution
- High SWD ratio: 42% of URLs are confirmed as SWDs
- Consistent growth rate: 3000 SWDs per day
- RDF crawler: best harvesting method
- HTML crawler: best accuracy
- Meta crawler: best at detecting websites
[Chart axis: number of documents]
25This talk
- Motivation
- Semantic web 101
- Swoogle Semantic Web search engine
- Use cases and applications
- State of the Semantic Web
- Conclusions
26Applications and use cases
- Supporting Semantic Web developers
- Ontology designers, vocabulary discovery, who's using my ontologies or data?, use analysis, errors, statistics, etc.
- Searching specialized collections
- Spire: aggregating observations and data from biologists
- InferenceWeb: searching over and enhancing proofs
- SemNews: Text Meaning of news stories
- Supporting SW tools
- Triple shop: finding data for SPARQL queries
28 80 ontologies were found that had these three
terms
By default, ontologies are ordered by their
popularity, but they can also be ordered by
recency or size.
Let's look at this one.
29Basic Metadata
- hasDateDiscovered: 2005-01-17
- hasDatePing: 2006-03-21
- hasPingState: PingModified
- type: SemanticWebDocument
- isEmbedded: false
- hasGrammar: RDFXML
- hasParseState: ParseSuccess
- hasDateLastmodified: 2005-04-29
- hasDateCache: 2006-03-21
- hasEncoding: ISO-8859-1
- hasLength: 18K
- hasCntTriple: 311.00
- hasOntoRatio: 0.98
- hasCntSwt: 94.00
- hasCntSwtDef: 72.00
- hasCntInstance: 8.00
32These are the namespaces this ontology uses.
Clicking on one shows all of the documents using
the namespace.
All of this is available in RDF form for the
agents among us.
33Here's what the agent sees. Note the swoogle and
wob (web of belief) ontologies.
34We can also search for terms (classes,
properties) like terms for person.
35 10K terms associated with person! Ordered by
use.
Let's look at foaf:Person's metadata
45UMBC Triple Shop
- http://sparql.cs.umbc.edu/
- Online SPARQL RDF query processing based on HP's Jena and Joseki, with several interesting features
- Selectable level of inference over model
- Automatically finds SWDs for given queries using Swoogle backend database
- Provides dataset creation wizard
- Datasets can be stored on our server or downloaded
- Tag, share and search over saved datasets
46Web-scale semantic web data access
[Diagram labels: agent; data access service; the Web; index RDF data; search vocabulary: ask (person), search URIrefs in SW vocabulary, inform (foaf:Person); compose query: ask (?x rdf:type foaf:Person), search URLs in SWD index, inform (doc URLs); fetch docs; populate RDF database; query local RDF database]
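The exchange on this slide can be sketched from the agent's side as follows. Everything here is a stand-in: `StubService`, its method names, and the example.org documents are hypothetical and do not reflect Swoogle's actual API; a real agent would make network calls where the stub does dictionary lookups.

```python
# Sketch of the agent-side steps, with an in-memory stand-in for the service.
# StubService, its methods and all example.org data are hypothetical.
class StubService:
    vocabulary = {"person": ["foaf:Person"]}
    swd_index = {"foaf:Person": ["http://example.org/a.rdf",
                                 "http://example.org/b.rdf"]}
    web = {
        "http://example.org/a.rdf": [("ex:jo", "rdf:type", "foaf:Person")],
        "http://example.org/b.rdf": [("ex:sam", "rdf:type", "foaf:Person"),
                                     ("ex:doc", "rdf:type", "foaf:Document")],
    }

    def search_terms(self, keyword):      # ask (person) -> inform (term)
        return self.vocabulary.get(keyword, [])

    def search_documents(self, term):     # ask (pattern) -> inform (doc URLs)
        return self.swd_index.get(term, [])

    def fetch(self, url):                 # stands in for an HTTP GET
        return self.web[url]


def find_instances(service, keyword):
    """Search vocabulary, locate documents, load them, then query locally."""
    results = set()
    for term in service.search_terms(keyword):
        local_db = []                     # the agent's local RDF database
        for url in service.search_documents(term):
            local_db.extend(service.fetch(url))
        # Local query: ?x rdf:type <term>
        results.update(s for s, p, o in local_db
                       if p == "rdf:type" and o == term)
    return sorted(results)


people = find_instances(StubService(), "person")
print(people)
```

The point of the split is that the service only answers "which terms?" and "which documents?"; the actual query runs against data the agent has fetched itself.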
47Who knows Anupam Joshi? Show me their names,
email addresses and pictures
48The UMBC ebiquity site publishes lots of RDF
data, including FOAF profiles
50
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?p2name ?p2mbox ?p2pix
WHERE {
  ?p1 foaf:name "Anupam Joshi" .
  ?p1 foaf:mbox ?p1mbox .
  ?p2 foaf:knows ?p3 .
  ?p3 foaf:mbox ?p1mbox .
  ?p2 foaf:name ?p2name .
  ?p2 foaf:mbox ?p2mbox .
  OPTIONAL { ?p2 foaf:depiction ?p2pix } .
}
ORDER BY ?p2name
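To show what the query's graph pattern matches, the sketch below evaluates the same joins by hand over a few toy triples. The people, mailboxes, and helper names are all hypothetical, and a real system would hand the query to a SPARQL engine rather than nested loops. Note that the pattern links ?p3 to ?p1 through a shared foaf:mbox value rather than through node identity.

```python
# Hand-written evaluation of the "who knows ..." pattern over toy triples.
# All people, mailboxes and helper names below are hypothetical.
triples = [
    ("ex:aj", "foaf:name", "Anupam Joshi"),
    ("ex:aj", "foaf:mbox", "mailto:aj@example.org"),
    ("ex:tf", "foaf:name", "Tim"),
    ("ex:tf", "foaf:mbox", "mailto:tim@example.org"),
    ("ex:tf", "foaf:knows", "_:b1"),
    ("_:b1", "foaf:mbox", "mailto:aj@example.org"),
    ("ex:zz", "foaf:name", "Zed"),
    ("ex:zz", "foaf:mbox", "mailto:zed@example.org"),
]


def objects(subject, predicate):
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj):
    return [s for s, p, o in triples if p == predicate and o == obj]


def who_knows(name):
    """Return distinct (name, mbox, picture) rows, ordered by name."""
    rows = set()
    for p1 in subjects("foaf:name", name):
        for p1mbox in objects(p1, "foaf:mbox"):
            for p3 in subjects("foaf:mbox", p1mbox):      # same mailbox as ?p1
                for p2 in subjects("foaf:knows", p3):
                    for p2name in objects(p2, "foaf:name"):
                        for p2mbox in objects(p2, "foaf:mbox"):
                            # OPTIONAL: keep the row even without a picture.
                            pictures = objects(p2, "foaf:depiction") or [None]
                            for p2pix in pictures:
                                rows.add((p2name, p2mbox, p2pix))
    return sorted(rows, key=lambda row: row[0])


answers = who_knows("Anupam Joshi")
print(answers)
```

Here ex:tf refers to the person he knows through a blank node that carries only a mailbox, which is why the join on foaf:mbox is needed to connect the two descriptions.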
52Swoogle found 292 RDF data files that appear
relevant to answering our query
53Let's save the dataset before we use it
55And tag it so we and others can find it more
easily.
56Here we are using it to get an answer to "Who
knows Anupam Joshi?"
57He has many friends!
59This talk
- Motivation
- Semantic web 101
- Swoogle Semantic Web search engine
- Use cases and applications
- State of the Semantic Web
- Conclusions
60Will it Scale? How?
- Here's a rough estimate of the data in RDF
documents on the semantic web based on Swoogle's
crawling
System/date  Terms     Documents  Individuals  Triples   Bytes
Swoogle2     1.5x10^5  3.5x10^5   7x10^6       5x10^7    7x10^9
Swoogle3     2x10^5    7x10^5     1.5x10^7     7.5x10^7  1x10^10
2006         1x10^6    5x10^7     5x10^7       5x10^9    5x10^11
2008         5x10^6    5x10^9     5x10^9       5x10^11   5x10^13
We think Swoogle's centralized approach can be
made to work for the next few years if not longer.
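A few ratios implied by these estimates can be checked with simple arithmetic. The numbers below are copied from the table; the derived quantities (triples per document, bytes per triple, growth factor) are illustrative calculations and do not appear in the original slides.

```python
# Estimates copied from the table; derived ratios are simple arithmetic.
estimates = {
    "Swoogle2": {"documents": 3.5e5, "triples": 5e7, "bytes": 7e9},
    "Swoogle3": {"documents": 7e5, "triples": 7.5e7, "bytes": 1e10},
    "2006": {"documents": 5e7, "triples": 5e9, "bytes": 5e11},
    "2008": {"documents": 5e9, "triples": 5e11, "bytes": 5e13},
}


def ratios(row):
    """Average triples per document and bytes per triple for one table row."""
    return {
        "triples_per_document": row["triples"] / row["documents"],
        "bytes_per_triple": row["bytes"] / row["triples"],
    }


derived = {name: ratios(row) for name, row in estimates.items()}

# Projected growth in triples between the 2006 and 2008 estimates.
growth_2006_to_2008 = estimates["2008"]["triples"] / estimates["2006"]["triples"]

for name, values in derived.items():
    print(name, round(values["triples_per_document"], 1),
          round(values["bytes_per_triple"], 1))
print("triple growth 2006 -> 2008:", growth_2006_to_2008)
```

Both projection rows assume about 100 triples per document and about 100 bytes per triple, so the scaling question reduces to a hundredfold increase in documents over two years.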
61How much reasoning?
- SwoogleN (N<3) does limited reasoning
- It's expensive
- It's not clear how much should be done
- More reasoning would benefit many use cases
- e.g., type hierarchy
- Recognizing specialized metadata
- e.g., that ontology A maps some terms from B to C
62This talk
- Motivation
- Semantic web 101
- Swoogle Semantic Web search engine
- Use cases and applications
- State of the Semantic Web
- Conclusions
63Conclusion
- The web will contain the world's knowledge in forms accessible to people and computers
- We need better ways to discover, index, search and reason over SW knowledge
- SW search engines address different tasks than HTML search engines
- So they require different techniques and APIs
- Swoogle-like systems can help create consensus ontologies and foster best practices
- Swoogle is for Semantic Web 1.0
- Semantic Web 2.0 will make different demands
64For more information
http://ebiquity.umbc.edu/
Annotated in OWL
65backup