Title: After OWL: de facto standards
1 After OWL: de facto standards for semantic
technologies (or: what do you get for 40m EU
research money?)
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham, Kalina Bontcheva, Valentin Tablan,
Diana Maynard, Wim Peters, Niraj Aswani, Milena
Yankova, Yaoyong Li, Akshay Java, Michael Dowman
ILASH workshop, March 2004
2 Structure of the talk
- Context
- increasing use of semantic technology in IT
- the role(s) of human language technology
- substantial investment in the next phase of
semantic web research
- Semantic Web: moving on from formal standards
- Acronym soup
- GATE: HLT API 4 SDK SW KT
- An application: Ontology-Based IE in KIM
- Issues in API design, next steps
3 The Knowledge Economy and Human Language
- Gartner, December 2002:
- taxonomic and hierarchical knowledge mapping and
indexing will be prevalent in almost all
information-rich applications
- through 2012 more than 95% of human-to-computer
information input will involve textual language
- A contradiction:
- to deal with the information deluge we need
formal knowledge in semantics-based systems
- our information spaces are in informal and
ambiguous natural language
- The challenge: to reconcile these two phenomena
4 HLT: Closing the Loop
KEY: MNLG = Multilingual Natural Language Generation;
OIE = Ontology-aware Information Extraction;
AIE = Adaptive IE; CLIE = Controlled Language IE
[Diagram labels: Human Language; Controlled Language; OIE; (A)IE; CLIE; (M)NLG; Formal Knowledge (ontologies and instance bases); Semantic Web, Semantic Grid, Semantic Web Services]
5 SEKT: Semantic Knowledge Technology
- 6th framework IP project
- Duration: 36 months from 1/1/2004, 12.5m
- http://sekt.semanticweb.org/
- Improve automation of ontology and metadata
generation
- Develop highly-scalable solutions
- Research sound inferencing despite inconsistent
models
- Develop semantic knowledge access tools
- Develop methodology for deployment
6 PrestoSpace (20th Century Rot)
- 20th Century audio-visual media is rapidly
disappearing
- Preservation and restoration are high cost
- The costs must be justified by increased access
- Metadata: descriptive information about content
- PrestoSpace (9m IP, 40 months from 02/04):
- rich metadata and semantic access
- cross-lingual access
- syndicated delivery
- repurposeable content
7 The SDK research cluster
- Building the European Research Area in KM
through collaboration with related IP and NoE
projects in this area for a coordinated impact
strategy
- SEKT, DIP, KnowledgeWeb: the SDK cluster, http://sdk.semanticweb.org/
- Other related projects:
- AceMedia IP (semantic knowledge systems)
- PrestoSpace IP (cultural heritage / digital
libraries)
- BRICKS IP (cultural heritage / digital libraries)
- Total EU/6FP investment in semantic tech.
research: 40m; potential to influence the
emergence of de facto standards
8 Next step for semantic tech: from formal to
de facto standards?
- Computer scientists love standards, so we have
many
- For any given problem there are usually 3
standards
- OWL is no exception: Lite, DL, Full
- There are good reasons, but cf. RDF(S)
implementation history: applications will of
necessity mix and match
- If we can achieve standard practice and libraries
in applications we will have made a next step and
will promote takeup
- (Pathological) example: TCP/IP vs. OSI
9 HLT API 4 SDK SW KT
- What sorts of software do we need?
- Ontology and metadata management: storage,
versioning, caching, inferencing etc. (below)
- Human language technology components and services
(not monolithic systems, not unproven research
prototypes)
- The role of measurement in scaling and
robustness: in HLT this means MUC, TREC, ACE,
TIDES, ...
- Here's one we baked earlier....
10 GATE (the Volkswagen Beetle of Language
Processing) is:
- Eight years old, with the largest user
constituency of its type
- An architecture: a macro-level organisational
picture for LE software systems.
- A framework: for programmers, GATE is an
object-oriented class library that implements the
architecture.
- A development environment: for language engineers,
computational linguists et al, a graphical
development environment.
- Some free components... ...and wrappers for other
people's components
- Tools for: evaluation; visualise/edit;
persistence; IR; IE; dialogue; ontologies; etc.
- Free software (LGPL). Download at
http://gate.ac.uk/download/
11 Critical mass: 000s people, 00s sites
- GATE users: significant proportion of community.
A small sample:
- the American National Corpus project
- the Perseus Digital Library project, Tufts
University, US
- Longman Pearson publishing, UK
- Merck KGaA, Germany
- Canon Europe, UK
- Knight Ridder, US
- BBN (leading HLT research lab), US
- SMEs: Melandra, SG-MediaStyle, ...
- Imperial College, London, the University of
Manchester, UMIST, the University of Karlsruhe,
Vassar College, the University of Southern
California and a large number of other UK, US and
EU Universities
- UK and EU projects inc. MyGrid, CLEF, dotkom,
AMITIES, CubReporter, Poesia...
- GATE team projects. Past:
- Conceptual indexing: MUMIS, automatic semantic
indices for sports video
- MUSE, cross-genre entity finder
- HSL, Health-and-safety IE
- Old Bailey: collaboration with HRI on 17th
century court reports
- Multiflora: plant taxonomy text analysis for
biodiversity research e-science
- EMILLE: S. Asian language corpus
- ACE / TIDES: Arabic, Chinese NE
- JHU summer w/s on semtagging
- Present:
- Advanced Knowledge Technologies: 12m UK five
site collaborative project
- ETCSL: Sumerian digital library
- MiAKT: medical informatics / AKT
- SEKT: Semantic Knowledge Tech
- PrestoSpace: AV Preservation
- KnowledgeWeb, h-TechSight
12-
- Architectural principles
- Non-prescriptive, theory neutral (strength and
weakness)
- Re-use, interoperation, not reimplementation
(e.g. diverse XML support, integration of
Protégé, Jena, Weka, interoperation with SCHUG in
MUMIS)
- (Almost) everything is a component, and component
sets are user-extendable
- (Almost) all operations are available both from
API and GUI
- Why does this matter? It means that GATE works
well with other tools, embeds easily, and
achieves robustness through focus (API
requirements)
13 All the world's a Java Bean....
- CREOLE: a Collection of REusable Objects for
Language Engineering
- GATE components: modified Java Beans with XML
configuration
- The minimal component: 10 lines of Java, 10
lines of XML, 1 URL
- Why bother?
- Allows the system to load arbitrary language
processing components
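The idea of loading arbitrary components behind a fixed interface can be sketched in a few lines. This is a minimal Python illustration, not GATE's API: the `REGISTRY` dictionary, the `Component` base class, the `register` decorator and both example components are hypothetical names invented here, and the registry stands in for the XML configuration that a real CREOLE component would ship with.

```python
from typing import Callable, Dict, List

# Hypothetical registry standing in for a component collection; not GATE's API.
REGISTRY: Dict[str, Callable[[], "Component"]] = {}


class Component:
    """Fixed interface that every pluggable component implements."""

    def execute(self, text: str) -> List[str]:
        raise NotImplementedError


def register(name: str):
    """Class decorator that records a component class under a lookup name."""
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap


@register("tokeniser")
class WhitespaceTokeniser(Component):
    def execute(self, text: str) -> List[str]:
        # Split on runs of whitespace.
        return text.split()


@register("upper")
class UpperCaser(Component):
    def execute(self, text: str) -> List[str]:
        # Return the whole text upper-cased as a single item.
        return [text.upper()]


def load(name: str) -> Component:
    """Instantiate a component by name, as a host system might from configuration."""
    return REGISTRY[name]()


tokens = load("tokeniser").execute("GATE loads components by name")
```

The host code only ever calls `load` and `execute`, so a new component can be added without touching it, which is the point the slide makes about arbitrary language processing components.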
14
[Diagram labels: Web Services; GATE APIs; Language Resource Layer (LRs): Ontology, Protégé Ontology, WordNet, Gazetteers, ...]
- NOTES (2)
- e.g. Protégé: LR and VR both wrapped in Res. (bean)
API
- ontology repositories and inference are the same:
KAON? Sesame? Orenge?
- NOTES
- everything is a replaceable bean
- all communication via fixed APIs
- low coupling, high modularity, high
extensibility
15 Issues (1): a common HLT API
- OGSA, WSMO in the web services layer?
- Eclipse: less code for us, more services for
users? (A free OWL/UML drawing tool, for example)
- ISO TC37/SC4; JNLE special; LIRICS consortium
16 API Application: Ontology-based IE
XYZ was established on 03 November 1978 in
London. It opened a plant in Bulgaria in
[Diagram: ontology and KB fragment. Classes: Location, Company, City, Country; relations: type, partOf, HQ, establOn; value: 03/11/1978]
17 Classes, instances, metadata
Gordon Brown met George Bush during his two day
visit.
<metadata>
  <DOC-ID>http:// 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>Person</class>
    <inst>Person67890</inst>
  </Annotation>
</metadata>
[Figure panels: "Classes/instances before" and "Classes/instances after"; label: Bush]
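The metadata above is stand-off annotation: each mention is described by character offsets into the text plus a class and an instance identifier. The following Python sketch shows one way such records could be produced. It assumes 0-based offsets with an exclusive end, so the numbers it computes need not match those printed on the slide; the `Annotation` class and the `annotate` and `to_xml` helpers are hypothetical, and only the element names and instance identifiers come from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    start: int   # 0-based offset of the first character (assumed convention)
    end: int     # offset one past the last character (assumed convention)
    string: str  # the text of the mention
    cls: str     # ontology class, e.g. "Person"
    inst: str    # instance identifier in the knowledge base


def annotate(text: str, mention: str, cls: str, inst: str) -> Annotation:
    """Locate the first occurrence of `mention` and describe it by offsets."""
    start = text.index(mention)
    return Annotation(start, start + len(mention), mention, cls, inst)


def to_xml(a: Annotation) -> str:
    """Serialise one annotation using the element names from the slide."""
    return (
        "<Annotation>"
        f"<s_offset>{a.start}</s_offset><e_offset>{a.end}</e_offset>"
        f"<string>{a.string}</string><class>{a.cls}</class>"
        f"<inst>{a.inst}</inst>"
        "</Annotation>"
    )


text = "Gordon Brown met George Bush during his two day visit."
annotations: List[Annotation] = [
    annotate(text, "Gordon Brown", "Person", "Person12345"),
    annotate(text, "George Bush", "Person", "Person67890"),
]
xml = "<metadata>" + "".join(to_xml(a) for a in annotations) + "</metadata>"
```

Because the annotations live outside the text, the document itself is never modified, and slicing `text[a.start:a.end]` recovers each mention.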
18 OBIE in KIM
- An ontology (KIMO) and 200K instances KB
- High ambiguity of instances with the same label;
uses disambiguation step
- Lookup phase marks mentions from the ontology
- Combined with GATE-based IE system to recognise
new instances of concepts and relations
- KB enrichment stage where some of these new
instances are added to the KB
- Disambiguation uses an Entity Ranking algorithm,
i.e., priority ordering of entities with the same
label based on corpus statistics (e.g., Paris)
Popov et al. KIM. ISWC03
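A priority ordering of same-label entities by corpus statistics might look like the sketch below. It is not KIM's algorithm: the candidate identifiers, the counts and the relative-frequency score are all invented for illustration, and a real system would draw them from its knowledge base and an annotated corpus.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical knowledge base: label -> candidate instance identifiers.
CANDIDATES: Dict[str, List[str]] = {
    "Paris": ["Paris_France", "Paris_Texas", "Paris_Person"],
}

# Hypothetical corpus statistics: how often each instance was the intended
# referent of its label in some reference corpus (invented numbers).
CORPUS_COUNTS: Dict[str, int] = {
    "Paris_France": 950,
    "Paris_Texas": 40,
    "Paris_Person": 10,
}


def rank(label: str) -> List[Tuple[str, float]]:
    """Order the candidates for a label by relative corpus frequency."""
    candidates = CANDIDATES.get(label, [])
    total = sum(CORPUS_COUNTS.get(c, 0) for c in candidates)
    scored = [
        (c, CORPUS_COUNTS.get(c, 0) / total if total else 0.0)
        for c in candidates
    ]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def disambiguate(label: str) -> Optional[str]:
    """Pick the top-ranked instance, or None if the label is unknown."""
    ranking = rank(label)
    return ranking[0][0] if ranking else None


best = disambiguate("Paris")
```

A ranking like this only supplies a default reading; context from the surrounding text would still be needed to override it when a less frequent entity is meant.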
19 OBIE in KIM (2)
Popov et al. KIM. ISWC03
20 KIM demo...
Next steps in OBIE
- Continue to exploit the pluggability and
community effects of GATE (and Sesame, Lucene,
...)
- SWAN: Semantic Web Annotator at DERI/Galway
- Syndication
- Social networking
- Evaluation (below)
21 (The P in OLP) Challenge: Evaluating Richer NE
Tagging
- Need for new metrics when evaluating
hierarchy/ontology-based NE tagging
- Need to take into account distance in the
hierarchy
- Tagging a company as a charity is less wrong than
tagging it as a person
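One way to make "less wrong" measurable is to score a predicted class by its path distance from the correct class in the hierarchy. The Python sketch below assumes a toy hierarchy (`Entity` above `Organisation` and `Person`, with `Company` and `Charity` under `Organisation`) and an arbitrary score of 1 / (1 + distance); both are illustrative choices made here, not a metric proposed in the talk.

```python
from typing import Dict, List

# Toy class hierarchy (child -> parent), invented for illustration.
PARENT: Dict[str, str] = {
    "Organisation": "Entity",
    "Person": "Entity",
    "Company": "Organisation",
    "Charity": "Organisation",
}


def ancestors(cls: str) -> List[str]:
    """Return the path from a class up to the root, starting with the class."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def distance(a: str, b: str) -> int:
    """Number of edges between two classes via their lowest common ancestor."""
    steps_from_b = {cls: i for i, cls in enumerate(ancestors(b))}
    for steps_from_a, cls in enumerate(ancestors(a)):
        if cls in steps_from_b:
            return steps_from_a + steps_from_b[cls]
    raise ValueError(f"{a} and {b} share no ancestor")


def score(gold: str, predicted: str) -> float:
    """Partial credit that shrinks as the prediction moves away from gold."""
    return 1.0 / (1.0 + distance(gold, predicted))


charity_score = score("Company", "Charity")
person_score = score("Company", "Person")
```

Under these assumptions an exact match scores 1.0, and mistaking a company for a charity earns more credit than mistaking it for a person, matching the intuition in the last bullet.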
22 SW IE Evaluation tasks
- Detection of entities and events, given a target
ontology of the domain.
- Disambiguation of the entities and events from
the documents with respect to instances in the
given ontology. For example, measuring whether
the IE correctly disambiguated "Cambridge" in the
text to the correct instance: Cambridge, UK vs
Cambridge, MA.
- Decision when a new instance needs to be added to
the ontology, because the text contains a new
instance that does not already exist in the
ontology.
23 Issues (2): a common OMM API
- Two design approaches:
- A. the "richest set of features" approach: pool
experience, cover all the bases, be relevant to
very many users (top-down)
- B. the "highest common factors" approach: analyse
software, pick common features, create
pluggability layer (bottom-up)
- Both useful; can be combined
- Approach B. has some key advantages:
- leads to quicker version 1.0
- minimises arguments (the criterion is "feature exists in
several systems", not "feature is good")
- Problems:
- features present in several places but not all:
"operation not supported"?
- new work not prefigured in version 1.0:
roadmaps, placeholders
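The bottom-up approach can be sketched as counting which features several systems share and exposing only those through a common layer. In the Python sketch below the system names, feature names and the `min_support` threshold are all hypothetical; it also shows the "operation not supported" problem for features that are common to several systems but not all.

```python
from collections import Counter
from typing import Dict, Set

# Hypothetical systems and the features each one offers (invented names).
SYSTEMS: Dict[str, Set[str]] = {
    "store_a": {"add_statement", "remove_statement", "query", "versioning"},
    "store_b": {"add_statement", "remove_statement", "query", "inference"},
    "store_c": {"add_statement", "query", "inference"},
}


def common_api(systems: Dict[str, Set[str]], min_support: int) -> Set[str]:
    """Keep every feature offered by at least `min_support` systems."""
    counts = Counter(f for features in systems.values() for f in features)
    return {feature for feature, n in counts.items() if n >= min_support}


class Adapter:
    """Pluggability layer exposing the common API over one backing system."""

    def __init__(self, name: str, features: Set[str], api: Set[str]):
        self.name = name
        self.features = features
        self.api = api

    def call(self, operation: str) -> str:
        if operation not in self.api:
            raise KeyError(f"{operation} is not part of the common API")
        if operation not in self.features:
            # In the API, but this particular system lacks it.
            raise NotImplementedError(f"{operation}: operation not supported")
        return f"{self.name}:{operation}"


strict_api = common_api(SYSTEMS, min_support=3)   # features in every system
lenient_api = common_api(SYSTEMS, min_support=2)  # features in several systems
adapter = Adapter("store_c", SYSTEMS["store_c"], lenient_api)
```

Raising the threshold to "all systems" avoids unsupported operations but shrinks the API, which is the trade-off behind the first problem listed above.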
24 The end
- Tutorial on HLT for the Semantic Web at European
Semantic Web Symposium: http://www.esws2004.org/
- These slides: http://gate.ac.uk/sale/talks/ilash-semweb-mar2004.ppt
- More information: http://gate.ac.uk/
http://nlp.shef.ac.uk/
25 What's the difference between Tony Blair and
Mother Theresa?
- There's good news and bad news...
- The good news: the Semantic Web is now a major
focus of some of the world leaders in AI research
- The bad news: AI always fails
- (Or: what succeeds doesn't get called AI any
more)
- How does the machine tell the difference between
"Mother Theresa is a saint" and "Tony Blair is a
saint"? (It doesn't: it has no sense of irony!)
- Needed: clever applications of simple semantics
(contrast the success of RSS or DC with more
complex schemes)
- De facto standards: when we do the simple
stuff robustly and in the large