Title: Knowledge Access: Semantic technology for KM
1Knowledge Access: Semantic technology for KM
ACAI 05 SEKT SUMMER SCHOOL ON KNOWLEDGE
TECHNOLOGY
- John Davies
- BT Research
- john.nj.davies_at_bt.com
2Overview
- Introduction to the Semantic Web
- Language stack
- Semantic Search and Browse
- Knowledge Sharing
- Natural Language Generation and Summarisation
- Knowledge Delivery via Device Independence
- Quiz!
3Limitations of the Web today
- Machine-to-human, not machine-to-machine
4The Semantic Web
- allowing information to be shared and processed
- adding context and structure (Tim Berners-Lee)
- an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation
- An open platform
5Semantic Web
"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in co-operation." (Berners-Lee et al., 2001)
6... Semantic Web HISTORY
10.2.2004: Resource Description Framework (RDF) and Web Ontology Language (OWL) become W3C recommendations
Source: http://www.zakon.org/robert/internet/timeline/
7Semantic Web Layers
Entailment of the Implicit
Explicit Semantics
Relational Distributed Data
Data Exchange
8Where we are Today the Syntactic Web
Hendler & Miller 02
9i.e. the Syntactic Web is
- A place where
- computers do the presentation (easy) and
- people do the linking and interpreting (hard).
- Why not get computers to do more of the hard
work?
Goble 03
10Hard Work using the Syntactic Web
- Complex queries involving background knowledge
- Find information about animals that use sonar but are not bats, dolphins or whales
- Locating information in data repositories
- Travel enquiries
- Prices of goods and services
- Results of human genome experiments
- Delegating complex tasks to web agents
- Book me a holiday next weekend somewhere warm,
not too far away, and where they speak French or
English
11Motivation Knowledge Management
- Knowledge workers are overwhelmed with information
- from intranets, emails, external newslines
- but may still lack the information required
- They need information identified
- by semantics, not just keywords
- by their interests and their task context
- in a form appropriate to their current physical context
- mobile phone, PDA, BlackBerry, laptop, ...
12Knowledge access
- context-aware tools for access to semantically-annotated knowledge
- search, browse, share, summarise
- integrated into day-to-day business processes
- automatic knowledge delivery based on current context
- activity, location, device, interests
- support multiple end-user devices
13XML is a first step
- Semantic markup
- HTML → layout
- use bold font
- Insert an image here
- XML → content
- this part of the document is the product price
- this document describes a telecommunications service
14XML
- <play>
- <title>The Life and Death of King John</title>
- <Dramatis Personae>
- <persona>The Earl of PEMBROKE</persona>
- <persona>The Earl of ESSEX</persona>
-
- </Dramatis Personae>
- <Stagedir>SCENE England, the Court.</Stagedir>
- <act>Act 1
- <scene>Scene I.
- <speech>
- <speaker>John</speaker>
- <line>Now, Chatillon, what would France with us?</line>
- </speech>
15QuizXML
- Standard search engine
- WWW pages indexed
- maps keywords to WWW pages
- QuizXML
- A finer-grained index
- maps keywords to documents and the XML tags in
which they occur
16 17XML is a first step
- Metadata (with limitations)
- within documents, not across documents
- prescriptive, not descriptive
- No commitment on vocabulary and modelling primitives (subclass, instance, etc)
- <vehicle>
- <car>ford
- <engine>xyz123-4</engine>
- <model>mondeo</model>
- </car>
- </vehicle>
- RDF and ontologies are the next step
18What are Ontologies?
- Ontologies provide a shared and common understanding of a domain (medicine, finance, ...)
- a shared specification of a conceptualisation
- Concept map
- A simple example - Yahoo
- Business & Economy > Finance > Banking
- for WWW, defined using RDF(S) and OWL
19Taxonomies
Animals
Vertebrates
Invertebrates
..
Insects
Arachnids
Reptiles
Mammals
20Ontology of People and their Roles
Employee
Expert
Analyst
Manager
Programme Mgr
Project Mgr
21Structure of an Ontology
- Typically two distinct components
- Names for important concepts and relationships in the domain
- Elephant is a concept whose members are a kind of animal
- Herbivore is a concept whose members are those animals who eat only plants
- Background knowledge/constraints on the domain
- Adult_Elephants weigh at least 2,000 kg
- No individual can be both a Herbivore and a Carnivore
22Why develop an ontology?
- Define web resources more precisely and make them amenable to machine processing
- Make domain assumptions explicit
- Easier to change domain assumptions
- Easier to understand and update legacy data
- Separate domain and operational knowledge
- Re-use separately
- A community reference for applications
- To share a consistent understanding of what
information means
23Ontologies - Some Examples
- General purpose ontologies
- The Upper Cyc Ontology, http://www.cyc.com/cyc-2-1/index.html
- IEEE Standard Upper Ontology, http://suo.ieee.org/
- Domain and application-specific ontologies
- RDF Site Summary RSS, http://groups.yahoo.com/group/rss-dev/files/schema.rdf
- Dublin Core, http://dublincore.org/
- UMLS, http://www.nlm.nih.gov/research/umls/
- Open Biological Ontologies, http://obo.sourceforge.net/
- FOAF, www.foaf.org
- Ontologies in a wider sense
- Agrovoc, http://www.fao.org/agrovoc/
- UNSPSC, http://eccma.org/unspsc/
- DAML.org library, http://www.daml.org/
24Ontology and Logic
- Reasoning over ontologies
- Inferencing capabilities
- X is author of Y ⇒ Y is written by X
- X co-wrote D, Y co-wrote D ⇒ X and Y collaborate
- Cars are a kind of vehicle, Vehicles have 2 or more wheels ⇒ Cars have 2 or more wheels
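Two of the inference patterns above (inverse properties and inheritance along a class hierarchy) can be sketched in a few lines of Python. This is a toy illustration only, not how an OWL reasoner is implemented; the triple layout and the predicate names (authorOf, writtenBy, subClassOf, minWheels) are assumptions made for the example.

```python
# Toy forward inference over (subject, predicate, object) triples.
# Predicate names are illustrative assumptions, not a standard vocabulary.

def infer(triples):
    """Return the input triples plus those derived by two simple rules."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in facts:
            # Rule 1: X authorOf Y  =>  Y writtenBy X (inverse property)
            if p == "authorOf":
                new.add((o, "writtenBy", s))
            # Rule 2: a subclass inherits the restrictions of its superclass
            if p == "subClassOf":
                for s2, p2, o2 in facts:
                    if s2 == o and p2 == "minWheels":
                        new.add((s, "minWheels", o2))
        if not new <= facts:
            facts |= new
            changed = True
    return facts


facts = infer([
    ("Twain", "authorOf", "TomSawyer"),
    ("Car", "subClassOf", "Vehicle"),
    ("Vehicle", "minWheels", 2),
])
```

The loop repeats until no new triples appear, so chains of subclasses are also covered.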
25RDF and RDF-S
- W3C standards
- RDF-S defines the ontology
- classes and their properties and relationships
- There are books and authors. Authors write books.
- RDF defines the instances of these classes and their properties
- Mark Twain is an author
- Mark Twain wrote Adventures of Tom Sawyer
- Adventures of Tom Sawyer is a book
26An example RDF Schema
Annotation of WWW resources and semantic links
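[Figure: example RDF Schema. Schema (RDFS) layer: classes Writer, Book and FamousWriter (subClassOf), property hasWritten with domain and range. Data (RDF) layer: resources /twain.com/mark and books.com/ISBN00010475 linked by hasWritten, with type links to the schema and a DoB value of 25/12/68.]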
27RDF
hasName (http://www.famouswriters.org/twain/mark, "Mark Twain")
hasWritten (http://www.famouswriters.org/twain/mark, http://www.books.org/ISBN00001047582)
title (http://www.books.org/ISBN00001047582, "The Adventures of Tom Sawyer")
XML version:
<rdf:Description rdf:about="http://www.famouswriters.org/twain/mark">
<s:hasName>Mark Twain</s:hasName>
<s:hasWritten rdf:resource="http://www.books.org/ISBN0001047"/>
</rdf:Description>
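The statements above are (subject, predicate, object) triples. A minimal sketch of storing and querying them in plain Python follows. The match helper is hypothetical; a real application would more likely use an RDF library and a query language such as RQL than this toy store.

```python
# The three statements from the slide as (subject, predicate, object) tuples.
TWAIN = "http://www.famouswriters.org/twain/mark"
BOOK = "http://www.books.org/ISBN00001047582"

triples = [
    (TWAIN, "hasName", "Mark Twain"),
    (TWAIN, "hasWritten", BOOK),
    (BOOK, "title", "The Adventures of Tom Sawyer"),
]


def match(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Titles of everything Mark Twain has written:
# follow hasWritten from the author, then title from each book.
titles = [
    title
    for _, _, book in match(triples, s=TWAIN, p="hasWritten")
    for _, _, title in match(triples, s=book, p="title")
]
```

Because both author and book are identified by URI, the second lookup joins across resources, which is what plain XML markup inside a single document cannot express.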
28QuizRDF
- Searching RDF-annotated web resources
29RDF metadata annotations
Data (WWW document)
Annotation (metadata)
Lost information
- Subjective
- One of several interpretations
- Not exhaustive
RDF
30RDF as an Enrichment
Text
Annotation
RDF
Text
31Precision and recall - the IR dilemma
- Trade-off between precision and recall
- recall - what fraction of the relevant documents were found
- precision - what fraction of the documents found were relevant
- Holy grail: high precision and high recall
- QuizRDF offers both
- separately
- closely-coupled
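The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below uses made-up document identifiers purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a relevance set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three relevant overall, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

Returning more documents tends to raise recall and lower precision, which is the trade-off named on the slide.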
32Indexing data model
33Multidimensional Indexing
- Traditional search engine indexing
- term → documents
- employee → URI1, URI3, URI9
- miller → URI3, URI7
- QuizRDF indexing
- <literal, class, property> → URIs
- <george, Employee, first_name> → URI2
- <miller, Employee, last_name> → URI1, URI3
- <miller, Employee, ?> → URI1, URI3, URI7
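A rough sketch of such a multidimensional index is given below. It is not QuizRDF's actual data structure; the TripleIndex class is hypothetical, and the property under which URI7 mentions "miller" is an assumption, since the slide leaves it open.

```python
from collections import defaultdict


class TripleIndex:
    """Index keyed on (literal, class, property), mapping to sets of URIs."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, literal, cls, prop, uri):
        self._index[(literal.lower(), cls, prop)].add(uri)

    def lookup(self, literal, cls=None, prop=None):
        """Find URIs for a literal; None for cls or prop means 'any'."""
        result = set()
        for (lit, c, p), uris in self._index.items():
            if lit == literal.lower() and cls in (None, c) and prop in (None, p):
                result |= uris
        return result


index = TripleIndex()
index.add("george", "Employee", "first_name", "URI2")
index.add("miller", "Employee", "last_name", "URI1")
index.add("miller", "Employee", "last_name", "URI3")
# Assumption: URI7 mentions "miller" under some other Employee property.
index.add("miller", "Employee", "first_name", "URI7")
```

Leaving the property unspecified, as in index.lookup("miller", "Employee"), widens the result set in the same way as the last entry on the slide.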
34QuizRDF demo
35Two Retrieval Channels
Browser interface
RQL
- Precise
- Machine readable
- Subjective
- Incomplete
- Higher precision
Keyword query
- Original content
- Complete
- Imprecise
- Higher recall
36Contribution
- Combination of
- User familiar keyword search
- More precise RDF querying
- Data and metadata as complementary
- Low threshold, high ceiling
- Works on non-RDF information
- Exploits RDF where it exists
- Integrates browsing and querying
- Fits users info seeking behavior
37Conclusions about RDF(S)
- Next step up from plain XML
- (small) ontological commitment to modeling primitives
- possible to define domain vocabulary
- limited reasoning
- subsumption, but no transitivity, symmetry, ...
- limited expressive power
- no cardinality constraints, equality, disjointness, ...
38Web Ontology Language Requirements
- Desirable features identified for Web Ontology Language
- Extends existing Web standards
- Such as XML, RDF, RDFS
- Easy to understand and use
- Should be based on familiar KR idioms
- Formally specified
- Of adequate expressive power
- Possible to provide automated reasoning support
39OWL Language
- OWL is based on Description Logics knowledge representation formalism
- OWL (DL) benefits from many years of DL research
- Well defined semantics
- Formal properties well understood (complexity, decidability)
- Known reasoning algorithms
- Implemented systems (highly optimised)
- Three species of OWL
- OWL Full: maximum expressivity, undecidable
- OWL DL: based on SHIQ DL, decidable
- OWL Lite: subset of OWL DL, most efficient reasoning
40Why OWL?
- OWL: Web Ontology Language
- Owl's superior intelligence is known throughout the Hundred Acre Wood, as are his talents for Writing, Spelling, other Educated and Special tasks.
- "My spelling is Wobbly. It's good spelling, but it Wobbles, and the letters get in the wrong places."
41QuizOWL!
42Re-cap
- XML, RDF, OWL language stack
- Increasingly sophisticated search
- QuizXML
- subdocument searching
- QuizRDF
- browsing by concept and across relations
- searching on metadata and full-text
- Next steps in semantic search
- identification of named entities within documents
- Exploitation of world knowledge
- KIM (Ontotext)
43The KIM Platform
- A platform offering services and infrastructure for
- (semi-) automatic semantic annotation
- ontology population
- semantic indexing and retrieval of content
- query and navigation
- Based on an Information Extraction technology
- Aim to underpin Semantic Web applications
- by providing a metadata generation technology
- in a standard, consistent, and scalable framework
44Ontologies
http//proton.semanticweb.org/
- PROTON - a light-weight upper-level ontology
- 250 NE classes
- 100 relations and attributes
- covers mostly NE classes, and to a smaller degree
general concepts
45Ontologies II
46KIM World KB
- Aims to cover the most popular entities in the world
- Entities of general importance like the ones that appear in the news
- KIM knows about
- Organizations: all important sorts of business, international, political, government, sport, academic
- Specific people (e.g. politicians)
- Locations: countries, regions, cities, roads, etc.
47KIM World KB Content
- Collected from various sources, like geographical and business intelligence gazetteers.
- KIM also learns from documents indexed
- via GATE information extraction
- KB scale
- RDF Statements | Small KB | Full KB
- explicit | 444,086 | 2,248,576
- after inference | 1,014,409 | 5,200,017
48KIM Scaling on Data
- The Semantic Repository is based on Sesame/OWLIM.
- Our practical tests demonstrate a perfect performance on top of
- 1.2M entity descriptions
- about 15M explicit statements
- above 30M statements after forward chaining.
- Fulltext indexing with Lucene
- 0.5M docs, retrieval in milliseconds
49Semantic Annotation
50Simple Usage Highlight, Hyperlink, and
51Simple Usage Explore and Navigate
52People search for People
- A recent large-scale human interaction study on a personal content IR system, carried out by Microsoft, demonstrated that
- "The most common query types in our logs were People/places/things, Computers/internet and Health/science. In the People/places/things category, names were especially prevalent. Their importance is highlighted by the fact that 25% of the queries involved people's names ... . In contrast, general informational queries are less prevalent."
53Semantic Queries
- The standard IR query is
- "give me documents that contain the words company, Europe, telecommunication"
- KIM provides indexing and retrieval with respect to named entities (NEs)
- More precise specification and satisfaction of information needs
- specify the NEs we are interested in, and restrict them by their attributes and relations
- "Give me documents that mention a company in Europe from the telecommunications industry sector"
54Precision in Semantic Search
- KIM can match
- a query: "Documents concerning a telecom company in Europe, John Smith, and a date in the first half of 2002."
- With a document containing: "At its meeting on the 10th of May, the board of Vodafone appointed John G. Smith as CTO"
- Classical IR cannot do the required reasoning
- Vodafone is a mobile operator, which is a kind of telecom company
- Vodafone is in the UK, which is a part of Europe.
- 10th of May is a "date in the first half of 2002"
- "John G. Smith" matches "John Smith".
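The reasoning steps listed above can be sketched with a tiny hand-built knowledge base. This is not KIM's implementation or API: the dictionaries, function names and the crude name-matching rule are all assumptions made for illustration, and the document's date is assumed to have been resolved to 10 May 2002.

```python
from datetime import date

# Hypothetical mini knowledge base (child -> parent links).
SUBCLASS_OF = {"MobileOperator": "TelecomCompany"}
PART_OF = {"UK": "Europe"}
ENTITIES = {"Vodafone": {"type": "MobileOperator", "located_in": "UK"}}


def reaches(start, target, links):
    """Follow child -> parent links from start; True if target is reached."""
    node = start
    while node is not None:
        if node == target:
            return True
        node = links.get(node)
    return False


def is_telecom_company_in_europe(name):
    entity = ENTITIES.get(name)
    return (
        entity is not None
        and reaches(entity["type"], "TelecomCompany", SUBCLASS_OF)
        and reaches(entity["located_in"], "Europe", PART_OF)
    )


def in_first_half_of_2002(d):
    return date(2002, 1, 1) <= d <= date(2002, 6, 30)


def same_person(query_name, doc_name):
    """Crude alias test: first and last name tokens agree, middle parts ignored."""
    q, d = query_name.split(), doc_name.split()
    return q[0] == d[0] and q[-1] == d[-1]


matches = (
    is_telecom_company_in_europe("Vodafone")
    and in_first_half_of_2002(date(2002, 5, 10))
    and same_person("John Smith", "John G. Smith")
)
```

A keyword index sees none of the words "telecom", "Europe" or "2002" in the document, so only the entity-level checks make the match.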
55Entity Pattern Search
56Pattern Search Entity Results
57Entity Pattern Search KIM Explorer
58Predefined Pattern Search
59Pattern Search Multiple-Entity Results
60Pattern Search, Referring Documents
61Document Details
62KIM - summary
- KIM is a platform for
- semantic annotation,
- ontology population,
- semantic indexing and retrieval,
- providing an API for remote access and integration,
- based on Information Extraction (IE) using mature HLT (GATE).
- powered by massive world knowledge
- http://www.ontotext.com/kim
63SEKTAgent
- Periodic agent search for named entities
- e.g. a person in an organisation
- Returns relevant documents and metadata
- Proactive knowledge delivery
- Linked to device independence module (see later)
- Based upon KIM architecture
- Result-led indexing
- Adds relevant pages to next crawl list
64SEKTAgent demo
65TAP
- Uses Google for traditional search
- Augments results with relevant data aggregated from distributed (and semantically annotated) data
- Offers distributed query interface
66TAP
tap.stanford.edu for more information
67Swoogle
- Searching for semantic web documents and ontologies
- See swoogle.umbc.edu
68Google vs. Swoogle
- How to find a popular ontology that defines the concept of person?
- Ask Google?
- Type: Person filetype:rdf
- Type: Person filetype:owl
- More complicated query: person rdfs:Class filetype:rdf
- Ask Swoogle?
- Type "person" in document search
- 1: http://xmlns.com/foaf/0.1/index.rdf
69Find Time Ontology
We can use a set of keywords to search for an ontology. For example, "time", "before" and "after" are basic concepts for a Time ontology.
70Beyond search, beyond documents
- a long list of documents is rarely the ultimate information need of the end user
- there's too much relevant information!
- support for the next step - the analysis of the returned information
- e.g. key points on a topic from a large document you don't want to read
- e.g. creation of a digest of information from multiple documents about Bush's statements on a given topic
71Search Engine trends
markets
- Seamless and integrated
- one search engine for Web and desktop
- implicit queries based on user activity
- Personalisation
- based on user interaction
- Beyond document lists
- sub-document analysis
- Taxonomies and classification
- taxonomy / enterprise search growing at 10% p.a.
- Ontologies and semantic annotation
- A coherent approach to all these issues
72Knowledge Sharing
- Sharing knowledge through an organisation
- learning from success and failures of others
- avoiding duplication of effort
- (Virtual) communities of practice
- Groups with shared interests who will benefit from collaboration and sharing knowledge
- (Using WWW technology to increase collaborative radius)
73Communities the Semantic Web
- Communities require a shared conceptual vocabulary
- Consensual, evolving concept map
- Ontologies!
- OntoShare
- automates sharing of knowledge in an organisation via community-based RDF(S) ontologies
74OntoShare
- Sharing and Classifying resources according to an Ontology
- Informs users when relevant document added to store
- Ontology-based personalisation
- Provides knowledge store for browsing and searching
75(No Transcript)
76OntoShare Sharing knowledge
- User shares knowledge
- WWW document
- Any textual data
- Can supply annotation
77OntoShare Sharing knowledge
- System automatically extracts keywords and summary
- System assigns knowledge to concepts
78OntoShare Sharing knowledge
- System emails an alert to selected users based on
match to user profile
79OntoShare Evolving Ontologies
- OntoShare automatically suggests changes to concept characterisation
- Concept characterisations evolve over time
80OntoShare Evolving Ontologies
- User can suggest new concepts for ontology at any time
- System emails community on suggestion (à la Usenet) and counts votes
81Finding People Collaboration
- Use of personal profiles
- Who else is interested in this document?
- Who else is interested in this topic?
- Encouraging exchange of tacit knowledge
- Discussion threads around shared knowledge
- Adding value to the knowledge stored
82SWAP Semantic Web and Peer-to-Peer
- Distributed Knowledge Management
- Different participants with different conceptualizations of their domain
- Different knowledge sources
- Physically distributed, dynamic environment
- Peer-To-Peer Approach
- Decentralized nature: Local control
- Symmetry: Everyone is provider and consumer
- P2P networks as a reflection of social networks
- Flexible collaboration beyond hierarchical
structures
83Case Study The Bibster System
- Scenario: Sharing of bibliographic metadata in a Peer-to-Peer network
- Bibliographic metadata is created and maintained in a decentralized manner
- Researchers are willing to share their data
- Use of semantics is crucial in this setting
- The Bibster system allows users to
- Easily share bibliographic data
- Save work in finding this data
- Avoid re-typing this data by hand
84Semantic Methods in Bibster
- Semantic representation and querying of metadata
- Extraction and classification from e.g. BibTeX files
- Semantic Web Research Community Ontology and ACM Topic hierarchy as light-weight ontologies
- Peer selection using semantic topologies
- Scalability requires intelligent query routing
- Semantic descriptions of peers' expertise as basis for peer selection
- Semantic duplicate detection
- Highly redundant and inconsistent representation of bibliographic metadata
- Semantic similarity measures to detect duplicates
85Bibster Screenshot
Open Source: http://bibster.sourceforge.net/
86NLG - Summarisation
- NLG takes as input structured data in a knowledge base or ontology and produces natural language text
- Applied to provide automatic documentation of ontologies or generate textual reports from formal knowledge
- Keeps texts constantly up-to-date so they reflect changes in the ontology
- OntoSum, University of Sheffield
87The Property Hierarchy
- Special linguistically-motivated properties introduced to make the NLG modules more generic
- active-action (e.g. works-for)
- passive-action (e.g. published-by)
- Attribute (e.g. has-age, has-web-address)
- part-whole (e.g. consists-of)
- All properties from the ontology were made sub-properties of one of these 4
- Attribute properties recognised using heuristics, such as property name starts with "has" (hasWebPage)
88Summary Structuring
- Capture regular patterns; can be applied recursively
- Describe-Instance -> Describe-Attributes, Describe-Part-Whole, Describe-Active-Actions, Describe-Passive-Actions
- Describe-Attributes ->
- attribute(Instance, Attribute),
- Describe-Attributes
- Collect all subproperties of Attribute property relating to Instance
- Attribute(John, hasMobileNumber)..
89Ontology-Based Aggregation
- Joining attribute and part-whole properties with the same first argument to have more coherent sentences
- ATTR(Researcher XXX, Appellation Dr) ATTR(Researcher XXX, string my_email_at_sheff) ATTR(Researcher XXX, string 012344567) ATTR(Researcher XXX, string www.mypage.ac.uk)
- Without aggregation: Kalina Bontcheva has a Dr appellation. Kalina Bontcheva has email my_email_at_shef.com. Kalina Bon...
- With aggregation: Kalina Bontcheva has a Dr appellation, email my_email_at_shef.com and ...
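Aggregation of this kind can be approximated by grouping facts on their first argument and joining the phrases into a single sentence. The sketch below is a simplification, not OntoSum's algorithm; the aggregate function is hypothetical, and the "web page" phrase is an assumed verbalisation, since the slide text is cut off at that point.

```python
from collections import defaultdict


def aggregate(facts):
    """Group (subject, phrase) pairs by subject; emit one sentence per subject."""
    grouped = defaultdict(list)
    for subject, phrase in facts:
        grouped[subject].append(phrase)
    sentences = []
    for subject, phrases in grouped.items():
        if len(phrases) == 1:
            body = phrases[0]
        else:
            # Comma-separate all but the last phrase, then join with "and".
            body = ", ".join(phrases[:-1]) + " and " + phrases[-1]
        sentences.append(f"{subject} has {body}.")
    return sentences


facts = [
    ("Kalina Bontcheva", "a Dr appellation"),
    ("Kalina Bontcheva", "email my_email_at_shef.com"),
    # Assumed wording for the web-address attribute.
    ("Kalina Bontcheva", "web page www.mypage.ac.uk"),
]
sentences = aggregate(facts)
```

Three separate facts about the same researcher collapse into one sentence, which avoids repeating the subject name.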
90Lexicalisation of Classes Properties
- 3 options
- Specified by ontology engineer
- Same as concept/property name
- Added manually when parameterising OntoSum
91Description of HSBC
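[Figure: ontology fragment describing HSBC. Classes: Organisation, Financial Institution, Bank, Person; property lendsTo (shown twice); instance HSBC with employees and market-cap values (137000, 43bn).]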
92Description of HSBC
93Innovative aspects
- Can tailor summary to device profile
- Apply length restriction
- e.g. for text message for mobile phone
- Generate HTML for web browser or plain text for email
- See device independence (next!)
- Readability heuristics
- introduce lists when verbalising more than 3 attributes
- Use of ontology mapping rules to run same system on multiple ontologies
94Related work
- Wilcock (Helsinki)
- Fully automatic, no lexicon
- Talking OWLs, ISWC-03
- MIAKT
- Some manual input
- More effort, more fluency
- OntoSum based on MIAKT
- Bontcheva, NLDB04
95OntoSum demonstration
96Device Independence
- context-aware tools for access to semantically-annotated knowledge
- search, browse, share, summarise
- integrated into day-to-day business processes
- automatic knowledge delivery based on current context
- activity, location, device, interests
- support multiple end-user devices
97Device independence
- 3 approaches
- Hand-craft different sites for different devices
- Labour intensive, difficult to maintain
- Extend HTML to describe interaction, navigation and selection
- Server software generates output in suitable format using CC/PP
- Inflexible: difficult to control output precisely
- No support for large volume sites
- Unclear what extensions are necessary and sufficient
- SEKT approach
- Use templates to format data content appropriate for each class of device
- Fine control of output based on CC/PP profiles
- can handle large volumes of structured data - XML databases
- device-dependencies coded in the templates, e.g. mouse capability
98Device Profiles in RDF
- CC/PP - W3C RDF standard for describing device characteristics
- CC/PP vocabularies define device components and component attributes
- UAProf is an application of CC/PP adopted by many terminal device manufacturers
- An ontology of devices: inheritance and specialisation
- Profile references and Profile Diffs are sent with an information request
- javax.ccpp package for processing profiles
99User Profiles
- Effective presentation must take user preferences and accessibility issues into account
- Font size
- Colour preference
- Hi res/Lo res
- Device characteristics and preference/accessibility requirements need to be combined
- Effective screen size depends on both physical size and user preferences (e.g. font size)
- Specialisation/extension of UAProf
100Profile Engine
- The Profile engine combines device and user profiles to generate a set of conditions
- The engine can be queried by other applications
- PROLOG is being used as a prototyping language
- Arithmetic calculations of effective screen size (for example) require more than RDF/OWL
- DL (DIG) interface to SWI-Prolog
101Content Adaptation
- The content adaptation engine uses conditions generated by profile engine queries
- Example conditions
- Screen size x font size → number of characters of text
- GraphicsSupported?
- Colour or BW
- Device characteristic or
- Accessibility issue
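How a device profile and a user profile might be combined into such conditions can be sketched as follows. The slides say the prototype uses Prolog; Python is used here only for illustration. All dictionary keys are hypothetical and are not real UAProf or CC/PP attribute names.

```python
def derive_conditions(device, user):
    """Combine a device profile and a user profile into adaptation conditions."""
    # Effective text capacity depends on both screen size and chosen font size.
    chars_per_line = device["screen_width_px"] // user["font_width_px"]
    lines = device["screen_height_px"] // user["font_height_px"]
    return {
        "max_chars": chars_per_line * lines,
        # Graphics are used only if the device can show them and the user wants them.
        "graphics_supported": device["graphics"] and not user["text_only"],
        "colour": device["colour"],
    }


# Hypothetical profiles for a small phone and a user who prefers a larger font.
phone = {"screen_width_px": 128, "screen_height_px": 160,
         "graphics": False, "colour": True}
prefs = {"font_width_px": 8, "font_height_px": 16, "text_only": False}

conditions = derive_conditions(phone, prefs)
```

The arithmetic step is the part that, as noted on the previous slide, goes beyond what RDF/OWL alone can express.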
102Content Generation
- Different content must be generated for different devices
- The current context (set of conditions) will be made available to SEKT applications
- Natural Language Processing techniques are used to generate or modify information
- Mobile phone: 400 character text message
- PC: multimedia document
- NLG: describing ontology-based knowledge in natural language (OntoSum!)
103Device Independence
- A functional presentation of a resource should be available via any suitable device
- Requirements include content selection, layout transformation and style selection
- At present, no one language can be interpreted by all clients
- It follows that content must be formatted for the target device on the server
104Templates
- Declarative templates are used to format the (XML-based) data
- Context (conditions) can be used to select templates, and sections within templates
- Template 1: WML
- InputEnabled?
- Template 2: HTML
- GraphicsWanted?
- Separation of data storage, processing and display
- W3C working group on device independence
- No standard for templates (yet)
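Condition-driven template selection can be illustrated with Python's standard string.Template class. This is a sketch under assumptions, not the SEKT template language: the condition names (markup, max_chars) and the two tiny templates are invented for the example.

```python
from string import Template

# Two hypothetical declarative templates, one per class of device.
TEMPLATES = {
    "wml": Template("<wml><card><p>$title: $body</p></card></wml>"),
    "html": Template("<html><body><h1>$title</h1><p>$body</p></body></html>"),
}


def select_template(conditions):
    """Pick a template name from the context; HTML is the default."""
    return "wml" if conditions.get("markup") == "wml" else "html"


def render(data, conditions):
    """Trim the body to the allowed length, then fill the selected template."""
    limit = conditions.get("max_chars", len(data["body"]))
    body = data["body"][:limit]
    template = TEMPLATES[select_template(conditions)]
    return template.substitute(title=data["title"], body=body)


data = {"title": "HSBC", "body": "A bank."}
wml_page = render(data, {"markup": "wml", "max_chars": 3})
html_page = render(data, {"markup": "html"})
```

The data dictionary is never changed by the formatting step, which mirrors the separation of data storage, processing and display named on the slide.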
105Overview
[Figure: device independence overview. Device Properties (UAProf, RDF(S)) and User preferences form the Context; the Profiling engine and Content Adaptation (syntactic and semantic) turn Raw Information into Repurposed Information.]
106Device Independence demo
107Device Independence Summary
- Device and User profiles need to be combined using a suitable ontology
- A profile reasoning engine is used to generate conditions on the format
- Content can be generated according to the context (set of conditions)
- NLP techniques can be used to generate/summarise text (semantic)
- Templates are used to transform the results to a format suitable for the device at hand (syntactic)
108Conclusion
- Semantic Web technology can offer enhancements to a range of KM tools
- Search, Share, Summarise, Deliver
- Also
- Visualisation
- RDF or OWL statements as a graph
- Integration of heterogeneous information
- Outstanding Issues
- Trade-off between reasoning and scalability
- Where does the metadata come from?
- Only KIM starting to address this point
- See also SEKT project (www.sekt-project.com)
- Who will find the killer app?!
- Plenty of topics still on the research agenda
109Acknowledgements
- Peter Haase, University of Karlsruhe
- Kalina Bontcheva, University of Sheffield
- Naso Kiryakov, Ontotext
- Ian Horrocks, University of Manchester
- Tim Glover and Alistair Duke, BT
110Thank you questions?
- Here's a few for you
- What are the semantic web layers?
- Name 3 ontologies in widespread use today
- Name 3 semantic search tools
- What RDF ontology is used to characterise devices?
- Why use NLG techniques on ontological information?
- What are the advantages of RDF over XML? And OWL over RDF?
- Name 3 trends in search engine development
- Describe briefly the way(s) in which metadata can improve search performance
- WIN A PRIZE!!!!!
John Davies, Next Generation Web Research, BT, john.nj.davies_at_bt.com