Title: Language Technology and the Semantic Web
1Language Technology and the Semantic Web
- Thierry Declerck and Paul Buitelaar (Saarland University and DFKI GmbH)
2
- We present collaborative research on combining language technology (LT) with technologies for encoding (domain) knowledge in ontologies, supporting the emergence of the Semantic Web (SW), or, perhaps more appropriately, Semantic Webs
- MUMIS (multimedia content indexing and searching in the soccer domain; finished in December 2002)
- MuchMore (cross-lingual information retrieval in the medical domain; finished in May 2003)
- Esperonto (developing a Semantic Annotation Service for upgrading the current Web to the Semantic Web; Sept. 2002 - May 2005)
3Semantic Web Applications of LT
- Supporting accurate ontology-based semantic annotation of multilingual web documents (Knowledge Markup)
- Supporting Ontology Learning/Construction from linguistically/semantically annotated multilingual text (Knowledge Extraction)
- See also the Special Interest Group (SIG-5) OntoWeb-lt on Language Technology in Ontology Development and Use: http://ontoweb-lt.dfki.de
4Knowledge Markup and Knowledge Extraction
Text/Speech
Text/Speech Mining
Linguistic and Semantic Annotations
Concepts, Relations, Events
Linguistic Analysis: Morpho-Syntactic Analysis and Tagging, Semantic Class Tagging, Term/NE Recognition, Grammatical Function Tagging, Dependency Structure Analysis
5Knowledge Markup and Knowledge Extraction (2)
Text/Speech/Image-Video
Text/Speech/Media Mining
Linguistic, Low-level Image and Semantic
Annotations
Concepts, Relations, Events
Linguistic and Media Analysis
6Integration of Language Technology and Domain
Knowledge
7Linguistic Analysis
Language technology tools are needed to support
the upgrade of the current web to the Semantic Web
(SW) by providing an automatic analysis of the
linguistic structure of textual documents.
Free-text documents that undergo linguistic analysis
become available as semi-structured documents,
from which meaningful units can be extracted
automatically (information extraction) and
organized through clustering or classification
(text mining). Here we focus on the linguistic
analysis steps that underlie the extraction
tasks: morphological analysis, part-of-speech
tagging, chunking, dependency structure analysis,
and semantic tagging.
8Morphological Analysis
Morphological analysis is concerned with the
inflectional, derivational, and compounding
processes in word formation, in order to determine
properties such as stem and inflectional
information. Together with part-of-speech (PoS)
information, this process delivers the
morpho-syntactic properties of a word. When
processing the German word Häusern (houses), the
following morphological information should be
produced: PoS=N NUM=PL CASE=DAT GEN=NEUT
STEM=HAUS
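As a rough illustration of what such an analysis returns, the sketch below looks up word forms in a tiny full-form lexicon. The lexicon entries, the function names and the `FEATURE=value` output format are assumptions made for this example; a real analyser would compute inflection, derivation and compounding rather than list forms.

```python
# Toy full-form lexicon. The entries are illustrative assumptions that follow
# the feature names of the Häusern example; they are not a real resource.
FULL_FORM_LEXICON = {
    "häusern": [
        {"PoS": "N", "NUM": "PL", "CASE": "DAT", "GEN": "NEUT", "STEM": "HAUS"},
    ],
    "haus": [
        {"PoS": "N", "NUM": "SG", "CASE": "NOM", "GEN": "NEUT", "STEM": "HAUS"},
        {"PoS": "N", "NUM": "SG", "CASE": "DAT", "GEN": "NEUT", "STEM": "HAUS"},
        {"PoS": "N", "NUM": "SG", "CASE": "ACC", "GEN": "NEUT", "STEM": "HAUS"},
    ],
}


def analyse(word):
    """Return all morphological readings of a word form (empty list if unknown)."""
    return FULL_FORM_LEXICON.get(word.lower(), [])


def format_reading(reading):
    """Render one reading as a FEATURE=value string."""
    return " ".join(f"{feature}={value}" for feature, value in reading.items())
```

Calling `analyse("Häusern")` yields a single reading, while `analyse("Haus")` yields three; this is the kind of ambiguity that later steps have to resolve from context.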
9Part-of-Speech Tagging
Part-of-Speech (PoS) tagging is the process of
determining the correct syntactic class (a
part-of-speech, e.g. noun, verb, etc.) for a
particular word given its current context. The
word works in the following sentences will be
either a verb or a noun: He works [N,V] the
whole day for nothing. His works [N,V] have all
been sold abroad. PoS tagging involves
disambiguating between multiple candidate
part-of-speech tags, as well as guessing the
correct tag for unknown words on the basis
of context information.
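The disambiguation step can be sketched with a toy tagger that picks between candidate tags by looking at the tag of the preceding word. The lexicon, the tag names and the two context rules are assumptions for illustration only; practical taggers learn such preferences from annotated corpora.

```python
# Candidate tags per word form: a toy lexicon assumed for this example.
LEXICON = {
    "he": ["PRON"], "his": ["POSS"], "works": ["N", "V"],
    "the": ["DET"], "whole": ["ADJ"], "day": ["N"], "have": ["V"],
}


def tag(tokens):
    """Tag tokens left to right, using the previous tag to resolve N/V ambiguity."""
    tags = []
    for token in tokens:
        # Unknown words get a default guess of noun.
        candidates = LEXICON.get(token.lower(), ["N"])
        if len(candidates) == 1:
            chosen = candidates[0]
        else:
            previous = tags[-1] if tags else None
            if previous == "PRON" and "V" in candidates:
                chosen = "V"   # a subject pronoun is typically followed by a verb
            elif previous in ("POSS", "DET", "ADJ") and "N" in candidates:
                chosen = "N"   # possessives, determiners, adjectives precede nouns
            else:
                chosen = candidates[0]
        tags.append(chosen)
    return list(zip(tokens, tags))
```

With these rules, works is tagged V in "He works the whole day" and N in "His works have all been sold abroad".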
10Chunking
Following Abney, chunks are defined as the
non-recursive parts of core phrases, such as
nominal, prepositional, adjectival and adverbial
phrases and verb groups. Chunk parsing is an
important step towards making natural language
processing robust, since its goal is not to
deliver a full analysis of sentences, but to
extract just the linguistic fragments that can be
identified reliably. Even if this strategy
fails to produce an analysis for the whole
sentence, the partial linguistic information
gained will still be useful for many
applications, such as information extraction and
text mining.
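A minimal sketch of chunk parsing over PoS-tagged input follows. The tag set (DET, ADJ, N, P, V) and the three chunk patterns are simplifying assumptions, not the grammar of any particular system; tokens that fit no pattern are simply skipped, which is what makes the approach robust.

```python
def match_np(tags, i):
    """Return the end index of a non-recursive NP (DET? ADJ* N+) starting at i,
    or i itself if no NP starts there."""
    j = i
    if j < len(tags) and tags[j] == "DET":
        j += 1
    while j < len(tags) and tags[j] == "ADJ":
        j += 1
    k = j
    while k < len(tags) and tags[k] == "N":
        k += 1
    return k if k > j else i


def chunk(tagged):
    """Group (word, tag) pairs into NP, PP and VG chunks."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    chunks, i = [], 0
    while i < len(tags):
        if tags[i] == "P":                      # PP = preposition + NP
            end = match_np(tags, i + 1)
            if end > i + 1:
                chunks.append(("PP", words[i:end]))
                i = end
                continue
        end = match_np(tags, i)
        if end > i:
            chunks.append(("NP", words[i:end]))
            i = end
            continue
        if tags[i] == "V":                      # VG = one or more verbs
            end = i
            while end < len(tags) and tags[end] == "V":
                end += 1
            chunks.append(("VG", words[i:end]))
            i = end
            continue
        i += 1                                  # no pattern fits: skip the token
    return chunks
```

For the tagged sentence "The shot by Christian Ziege goes over the goal" this produces an NP, a PP, a VG and a second PP, without attempting to attach the PPs to anything.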
11Named Entities detection
Related to chunking is the recognition of
so-called named entities (names of institutions
and companies, date expressions, etc.). The
extraction of named entities is mostly based on a
strategy that combines look-up in gazetteers
(lists of companies, cities, etc.) with the
definition of regular expression patterns. Named
entity recognition can be included as part of the
linguistic chunking procedure. The sentence
fragment "the secretary-general of
the United Nations, Kofi Annan," will then be
annotated as a nominal phrase including two
named entities: United Nations with named entity
class organization, and Kofi Annan with named
entity class person.
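The combination of gazetteer look-up and regular expression patterns can be sketched as follows. The gazetteer content, the date pattern and the function name are assumptions chosen for this example.

```python
import re

# Toy gazetteer: surface string -> named entity class (illustrative entries).
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}

# A simple pattern for dates such as "12 May 2003" (an assumed format).
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)


def find_named_entities(text):
    """Return (start offset, string, class) triples, sorted by position."""
    entities = []
    for name, ne_class in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            entities.append((match.start(), match.group(), ne_class))
    for match in DATE_PATTERN.finditer(text):
        entities.append((match.start(), match.group(), "date"))
    return sorted(entities)
```

Gazetteers cover names that can be listed in advance; patterns cover open classes such as dates, which cannot be enumerated.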
12Dependency Structure Analysis
A dependency structure consists of two or more
linguistic units that stand in an immediate
dominance relation in a syntax tree. The detection
of such structures is generally not provided by
chunking but builds on top of it. Two main types
of dependencies are relevant for our purposes:
on the one hand, the internal dependency
structure of phrasal units or chunks, and on the
other hand, the so-called grammatical
functions (like subject and direct object).
13Internal Dependency Structure
In linguistic analysis, the internal dependency
structure is described with the terms head,
complements and modifiers: the head is the
dominating node in the syntax tree of a phrase
(chunk), complements are necessary qualifiers
thereof, and modifiers are optional qualifiers.
Consider the following example: The shot by
Christian Ziege goes over the goal. The
prepositional phrase by Christian Ziege
(containing the named entity Christian Ziege)
depends on (and modifies) the head noun shot.
14Grammatical Functions
Grammatical functions determine the role
(function) of each of the linguistic chunks in
the sentence and make it possible to identify the
actors involved in certain events. For example,
in the following sentence the syntactic (and also
the semantic) subject is the NP constituent The
shot by Christian Ziege: The shot by Christian
Ziege goes over the goal. This nominal phrase
depends on (and complements) the verb goes,
whereas the noun shot is the head of the NP (it
is the shot that goes over the goal, not
Christian Ziege!).
15Semantic Tagging
Automatic semantic annotation has developed
within language technology in recent years in
connection with more integrated tasks like
information extraction, which require a certain
level of semantic analysis. Semantic tagging
consists in the annotation of each content word
in a document with a semantic category. Semantic
categories are assigned on the basis of a
semantic resource like WordNet for English or
EuroWordNet, which links words between many
European languages through a common inter-lingua
of concepts.
16Semantic Resources
- Semantic resources are captured in dictionaries, thesauri, and semantic networks, all of which express, either implicitly or explicitly, an ontology of the world in general or of more specific domains, such as medicine.
- They can be roughly divided into the following three groups:
- Thesauri: semantic resources that group together similar words or terms according to a standard set of relations, including broader term, narrower term, sibling, etc. (like Roget)
- Semantic Lexicons: semantic resources that group together words (or more complex lexical items) according to lexical semantic relations like synonymy, hyponymy, meronymy, and antonymy (like WordNet)
- Semantic Networks: semantic resources that group together objects denoted by natural language expressions (terms) according to a set of relations that originate in the nature of the domain of application (like UMLS in the medical domain)
17The MeSH Thesaurus
MeSH (Medical Subject Headings) is a thesaurus
for indexing articles and books in the medical
domain, which may then be used for searching
MeSH-indexed databases. For each term, MeSH
provides a number of term variants that refer to
the same concept. It currently includes a
vocabulary of over 250,000 terms. The following is
a sample entry for the term gene library (MH is
the term itself, ENTRY are term variants):
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
etc.
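To show how term variants support indexing, the sketch below parses one such record and maps every variant to its main heading. It assumes a line-oriented `FIELD = value` layout and handles only the MH and ENTRY fields; the function names are hypothetical.

```python
# Sample record in an assumed "FIELD = value" layout, one field per line.
RECORD = """\
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
"""


def parse_record(text):
    """Return (main heading, list of term variants) for one record."""
    heading, variants = None, []
    for line in text.splitlines():
        if "=" not in line:
            continue
        field, value = (part.strip() for part in line.split("=", 1))
        if field == "MH":
            heading = value
        elif field == "ENTRY":
            variants.append(value)
    return heading, variants


def build_index(records):
    """Map every term and variant (lower-cased) to its main heading."""
    index = {}
    for text in records:
        heading, variants = parse_record(text)
        if heading is None:
            continue  # skip records without a main heading
        for term in [heading] + variants:
            index[term.lower()] = heading
    return index
```

A document mentioning "gene bank" can then be indexed under the heading Gene Library via `build_index([RECORD])["gene bank"]`.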
18The WordNet Semantic Lexicon
WordNet has primarily been designed as a
computational account of the human capacity of
linguistic categorization and covers an extensive
set of semantic classes (called synsets). Synsets
are collections of synonyms, grouping together
lexical items according to meaning similarity.
Strictly speaking, synsets are not made up of
lexical items, but rather of lexical meanings
(i.e. senses).
19WordNet: An example
The word 'tree' has two meanings that roughly
correspond to the classes of plants and of
diagrams, each with its own hierarchy of
classes that are included in more general
super-classes:
09396070 tree 0
  09395329 woody_plant 0 ligneous_plant 0
    09378438 vascular_plant 0 tracheophyte 0
      00008864 plant 0 flora 0 plant_life 0
        00002086 life_form 0 organism 0 being 0 living_thing 0
          00001740 entity 0 something 0
10025462 tree 0 tree_diagram 0
  09987563 plane_figure 0 two-dimensional_figure 0
    09987377 figure 0
      00015185 shape 0 form 0
        00018604 attribute 0
          00013018 abstraction 0
20CYC: A Semantic Network
CYC is a semantic network of over 1,000,000
manually defined rules that cover a large part of
common sense knowledge about the world. For
example, CYC knows that trees are usually
outdoors, or that people who have died stop buying
things. Each concept in this semantic network is
defined as a constant, which can represent a
collection (e.g. the set of all people), an
individual object (e.g. a particular person), a
word (e.g. the English word person), a quantifier
(e.g. there exist), or a relation (e.g. a
predicate, function, slot, attribute). The entry
for the predicate mother:
mother
(mother ANIM FEM)
isa FamilyRelationSlot BinaryPredicate
This says that the predicate mother takes two
arguments, the first of which must be an element
of the collection Animal, and the second of
which must be an element of the collection
FemaleAnimal.
21Word Sense Disambiguation
Words mostly have more than one interpretation,
or sense. If natural language were completely
unambiguous, there would be a one-to-one
relationship between words and senses. In fact,
things are much more complicated, because for
most words not even a fixed number of senses can
be given. Therefore, only in certain
circumstances, and depending on what exactly we
mean by sense, can we give restricted
solutions to the problem of Word Sense
Disambiguation (WSD).
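One restricted solution of this kind fixes a small sense inventory and picks the sense whose gloss shares the most words with the context. The sketch below is a simplified variant of the Lesk algorithm, which is not a method named in these slides; the two glosses for tree and the stopword list are invented for illustration.

```python
# Assumed sense inventory: two senses of "tree" with hand-written glosses.
SENSES = {
    "tree": {
        "tree/plant": "a tall woody plant with a trunk branches and leaves",
        "tree/diagram": "a diagram or figure with branching lines showing a hierarchy",
    }
}
STOPWORDS = {"a", "an", "the", "of", "and", "or", "with", "in", "is", "to"}


def disambiguate(word, context):
    """Pick the sense whose gloss overlaps most with the context words.

    Returns None for words outside the inventory; ties go to the first sense.
    """
    context_words = {w.lower().strip(".,") for w in context.split()} - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES.get(word, {}).items():
        overlap = len(context_words & (set(gloss.split()) - STOPWORDS))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

The approach only works because the number of senses has been fixed in advance, which is exactly the restriction discussed above.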
22A simplified Example of a Domain Ontology
Instances
23Example of RDF Schema for the Movie Ontology
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#'
         xmlns:NS0='http://webode.dia.fi.upm.es/RDFS/MovieOntology#'>
  <rdf:Description rdf:about='http://webode.dia.fi.upm.es/RDFS/MovieOntology#SpecialEffectsCompanyActing'>
    <rdf:type rdf:resource='http://www.w3.org/2000/01/rdf-schema#Class'/>
    <rdfs:comment>Details of company that created special effects in this movie</rdfs:comment>
    <rdfs:subClassOf rdf:resource='http://webode.dia.fi.upm.es/RDFS/MovieOntology#CompanyActing'/>
  </rdf:Description>
  <rdf:Description rdf:about='http://webode.dia.fi.upm.es/RDFS/MovieOntology#Police'>
    <rdf:type rdf:resource='http://www.w3.org/2000/01/rdf-schema#Class'/>
    <rdfs:comment>Films that deal solely with police activity</rdfs:comment>
    <rdfs:subClassOf rdf:resource='http://webode.dia.fi.upm.es/RDFS/MovieOntology#Crime'/>
  </rdf:Description>
  etc.
24Integration of Ontology and Semantic Lexicon
- An example of a Semantic Lexicon is WordNet (sometimes also referred to as a Linguistic Ontology)
- Ontologies are domain-specific models, usually lacking linguistic information (PoS, Morphology, Syntax etc.)
- To be Integrated in One Resource or Kept/Accessed Separately?
- Standardization
- Format: Web-based Standards for Lexical Semantic Representation will Increase their Uptake (Easy Plug-and-Play, Remote Access, etc.)
- Content: Widely Used (Lexical) Semantic Resources will lead to (Further) Semantic Standardization
25Defining a Linguistic Ontology for the Art World
(Tentative)
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:xsd="http://www.daml.org/2000/10/XMLSchema#"
  xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
  xmlns:art="http://www.art-world.org/art-world"
>
<daml:Ontology rdf:about="Concepts in the Art World">
  <daml:imports rdf:resource="http://www.daml.org/2001/03/daml+oil"/>
</daml:Ontology>
26Defining Art World Concepts (Classes,
Synsets) (Tentative)
<daml:Class rdf:ID="art-world.01">
  <rdfs:label>art-world.01</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://www.art-world.org/art-world.00"/>
</daml:Class>
<art-world.01 rdf:ID="work"/>
<art-world.01 rdf:ID="painting"/>
<daml:Class rdf:ID="art-world.02"/>
<art-world.02 rdf:ID="beautiful"/>
<art-world.02 rdf:ID="colourful"/>
<daml:Class rdf:ID="art-world.03"/>
<art-world.03 rdf:ID="paper"/>
<art-world.03 rdf:ID="canvas"/>
27Defining Properties (Selection
Restrictions) (Tentative)
<daml:ObjectProperty rdf:ID="manner">
  <rdfs:range rdf:resource="art-world.02"/>
  <rdfs:domain rdf:resource="art-world.01"/>
</daml:ObjectProperty>
<daml:ObjectProperty rdf:ID="medium">
  <rdfs:range rdf:resource="art-world.03"/>
  <rdfs:domain rdf:resource="art-world.01"/>
</daml:ObjectProperty>
</rdf:RDF>
28(Semantic) Lexicons will be
- an Important Part of the Semantic Web
- Represented Using Markup Languages (RDF)
- Accessible in a Remote, Distributed Fashion
- Central to Further Semantic Standardization
29Multilingual terminological lexicon, attached to
a domain ontology (MUMIS)
<lex-element id="ID" concept="Shot-on-goal">
  <... lang="DE" type="main">Torschuss</term>
  <... lang="EN" type="main">shot on goal</term>
  <... lang="NL" type="main">schot op doel</term>
  <definition>ein Angriffsspieler kickt den Ball zu den gegnerischen Tor</definition>
  <... lang="DE" type="synonym">Distanzschuss</term>
  <... lang="DE" type="synonym">Nachschuss</term>
  <... lang="DE" type="synonym">Schuss</term>
  <... lang="DE" type="synonym">abzieh</term>
</lex-element>
30Extension and Formalization of the multilingual
terminological lexicon, including
syncategorematic information. Supporting WSD.
<lex-element id="ID" concept="Shot-on-goal">
  <... lang="DE" type="main" pos N mod von concept Player concept player gender gen pos posspron>Torschuss</term>
  <... lang="DE" type="synonym" pos V comp SUBJ concept Player>abzieh</term>
  <definition>URL DFB home page/glossary</definition>
</lex-element>
31Integrating Syntactic and Domain Knowledge
Including Syntactic Analysis for a more accurate
tagging of domain specific semantic annotation
32Abstraction over Syntactic Annotation
Ontology_3: Dependencies (Head, Comp, Mod, Spec)
Ontology_4: Grammatical Functions (Subject, Object, Ind. Object, NP Adjunct, PP Adjunct, etc.)
33Merging of Syntactic and Domain Knowledge
Example of a possible rule for conceptual
annotation: If (Head of Subj_NP of a Verb of
type soccer:shot-on-goal is a person) =>
annotate head of NP with semantic class
soccer:player. Example of a rule for
Instance Filling: If (term annotated with
concept soccer:player) => try to find
information about relations Team, Age etc.
(Template Filling in Information Extraction).
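The two rules can be sketched as plain functions over a small clause structure. The `Chunk` and `Clause` classes, the string labels and the look-up table passed to `fill_instance` are all hypothetical names introduced for this example; they only mirror the conditions stated in the rules.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    head: str                          # head word of the chunk
    head_class: Optional[str] = None   # e.g. "person", from NE recognition
    annotations: list = field(default_factory=list)


@dataclass
class Clause:
    verb_type: str   # domain type of the predicate
    subject: Chunk   # chunk filling the subject function


def annotate_player(clause):
    """Conceptual annotation: if the head of the subject NP of a shot-on-goal
    verb is a person, annotate it as a soccer player."""
    if (clause.verb_type == "soccer:shot-on-goal"
            and clause.subject.head_class == "person"):
        clause.subject.annotations.append("soccer:player")
        return True
    return False


def fill_instance(chunk, knowledge):
    """Instance filling: for a term annotated as a player, look up the
    relations Team and Age (None when nothing is known)."""
    if "soccer:player" not in chunk.annotations:
        return None
    facts = knowledge.get(chunk.head, {})
    return {"player": chunk.head, "Team": facts.get("Team"), "Age": facts.get("Age")}
```

The first rule depends on both syntactic knowledge (subject, head) and domain knowledge (the verb type), which is the merging the slide title refers to.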
34NLP-based knowledge markup
35MuchMore DTD for Annotation
[Figure: MuchMore annotation DTD. Element names shown: document, sentence, text, token, chunks, chunk, gramrels, gramrel, ewnterms, ewnterm, sense, umlsterms, umlsterm, xrceterms, xrceterm, semrels, semrel, msh. Attribute names shown: id, from, to, type, pos, lemma, offset, pref, code, cui, tui, term1, term2.]
36MuchMore Linguistic Annotation (Lemmatization,
POS, Basic Chunking)
Balint syndrom is a combination of symptoms
including simultanagnosia, a disorder of spatial
and object-based attention, disturbed spatial
perception and representation, and optic ataxia
resulting from bilateral parieto-occipital
lesions.
<text>
  <token id="w1" pos="NN">Balint</token>
  <token id="w2" pos="NN">syndrom</token>
  <token id="w3" pos="VBZ" lemma="be">is</token>
  <token id="w4" pos="DT" lemma="a">a</token>
  <token id="w5" pos="NN" lemma="combination">combination</token>
  <token id="w6" pos="IN" lemma="of">of</token>
  <token id="w7" pos="NNS" lemma="symptom">symptoms</token>
  ...
  <token id="w20" pos="JJ" lemma="spatial">spatial</token>
  <token id="w21" pos="NN" lemma="perception">perception</token>
  <token id="w22" pos="CC" lemma="and">and</token>
  <token id="w23" pos="NN" lemma="representation">representation</token>
  ...
</text>
<chunks>
  <chunk id="c1" from="w1" to="w2" type="NP"/>
  <chunk id="c7" from="w20" to="w23" type="NP"/>
</chunks>
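Because chunks refer to tokens by id rather than containing them, a consumer has to resolve the from/to references. The sketch below does this with Python's standard `xml.etree.ElementTree`; the enclosing `sentence` element and the shortened token list are assumptions made to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Shortened annotation; wrapping it in <sentence> is an assumption.
ANNOTATION = """
<sentence>
  <text>
    <token id="w1" pos="NN">Balint</token>
    <token id="w2" pos="NN">syndrom</token>
    <token id="w3" pos="VBZ" lemma="be">is</token>
  </text>
  <chunks>
    <chunk id="c1" from="w1" to="w2" type="NP"/>
  </chunks>
</sentence>
"""


def chunk_texts(xml_string):
    """Return (chunk id, chunk type, covered text) for every chunk."""
    root = ET.fromstring(xml_string)
    tokens = root.findall("./text/token")
    # Map each token id to its position so that from/to spans can be sliced.
    position = {token.get("id"): i for i, token in enumerate(tokens)}
    result = []
    for chunk in root.findall("./chunks/chunk"):
        start = position[chunk.get("from")]
        end = position[chunk.get("to")]
        words = [token.text for token in tokens[start:end + 1]]
        result.append((chunk.get("id"), chunk.get("type"), " ".join(words)))
    return result
```

This stand-off style lets several annotation layers (chunks, terms, relations) point at the same tokens without nesting inside one another.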
37MuchMore Semantic Annotation (UMLS, EuroWordNet)
Balint syndrom is a combination of symptoms
including simultanagnosia, a disorder of spatial
and object-based attention, disturbed spatial
perception and representation, and optic ataxia
resulting from bilateral parieto-occipital
lesions.
<umlsterm id="t7" from="w20" to="w21">
  <concept id="t7.1" cui="C0037744" preferred="Space Perception" tui="T041">
    <msh code="F2.463.593.778"/>
    <msh code="F2.463.593.932.869"/>
  </concept>
</umlsterm>
<umlsterm id="t8" from="w26" to="w26">
  <concept id="t8.1" cui="C0029144" preferred="Optics" tui="T090">
    <msh code="H1.671.606"/>
  </concept>
</umlsterm>
<semrel id="r7" term1="t7.1" term2="t8.1" reltype="issue_in"/>
<ewnterm id="e2" from="w21" to="w21">
  <sense offset="0487490"/>
  <sense offset="3955418"/>
  <sense offset="4002483"/>
</ewnterm>
38MUMIS DTD for Linguistic Annotation
[Figure: top-level elements of the MUMIS DTD. Names shown: Document, Paragraph, Sentence, Subord-Clause, NE, NP, PP, VG, AP, AdvP.]
39MUMIS DTD for Linguistic Annotation
[Figure: DTD fragment for the AP element. Names shown: AP, TYPE, STRUK, AP_AGR, AP_HEAD, STRING, W.]
40MUMIS DTD for Linguistic Annotation
[Figure: DTD fragment for the VG element. Names shown: VG, TYPE, VG_TYPE, VG_SUBCAT_STEM, VG_AGR, VG_STRG, VG_HEAD, STRING, SENT_STRING, KLAMMER, STRUK, W, ...]
41MUMIS DTD for Linguistic Annotation
[Figure: DTD fragment for the CLAUSE and W elements. Names shown: W, STEM, INFL, POS, TC, TYPE, STRING, SENT_STRING, CLAUSE, CLAUSE_TYPE, CLAUSE_SUBJ, CLAUSE_PP_ADJUNKT, CLAUSE_PRED_SUBCAT, CLAUSE_PRED_STRG, CLAUSE_PRED_AGR, CLAUSE_VG_LIST, CLAUSE_PP_LIST, CLAUSE_NP_LIST, ...]
42MUMIS Linguistic Annotation (Lemmatization,
Dependency Structure)
Industrie, Handel und Dienstleistungen werden in
der ersten Liste aufgeführt, wobei die in
Klammern gesetzten Zahlen auf die Mutterfirmen
hinweisen. (Industry, trade and services are
mentioned in the first list, in which numbers
within brackets point to parent
companies.)
<chunks>
  <chunk id="c1" from="w1" to="w5" type="NP" head="w1,w3,w5"/>
  <chunk id="c2" from="w6" to="w6" type="VG"/>
  <chunk id="c3" from="w7" to="w10" type="PP" head="w7" complement="w8,w9,w10"/>
  <chunk id="c4" from="w11" to="w11" type="VG"/>
  ...
</chunks>
<clauses>
  <clause id="cl1" from="c1" to="c4" pred_struct="c2 c4" GF_Subj="c1"/>
  <clause id="cl2" from="c6" to="c9" pred_struct="c9" GF_Subj="c6"/>
</clauses>
43MUMIS Semantic Annotation (Events)
7. Ein Freistoss von Christian Ziege aus 25
Metern geht über das Tor.
<chunks>
  <chunk id="c1" from="w1" to="w5" type="NP" head="w2" pp_modifier="w3 w4 w5"/>
  <chunk id="c2" from="w6" to="w8" type="PP" head="w6" complement="w7 w8"/>
  <chunk id="c3" from="w9" to="w9" type="VG"/>
  <chunk id="c4" from="w10" to="w12" type="PP" head="w10" complement="w11 w12"/>
</chunks>
<clauses>
  <clause id="cls1" from="c1" to="c4" pred_struct="c3" GF_Subj="c1"/>
</clauses>
<events>
  <event id="e1" clause="cls1" event-name="free-kick">
    <arguments>
      <argument id="arg1" name="player" value="w4, w5"/>
      <argument id="arg2" name="location" value="25-meter"/>
      <argument id="arg3" name="time" value="07:00"/>
    </arguments>
  </event>
  <event id="e2" clause="cls1" event-name="goal-scene-fail">
    <arguments>
      <argument id="arg1" name="player" value="w4, w5"/>
      <argument id="arg2" name="location" value="25-meter"/>
      <argument id="arg3" name="time" value="07:00"/>
    </arguments>
  </event>
</events>
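Event annotations of this kind can be turned into filled templates by resolving token references in the argument values. The sketch below uses Python's standard `xml.etree.ElementTree`; the shortened event list, the token table and the function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Excerpt of the token table for the sentence (token id -> word).
TOKENS = {"w4": "Christian", "w5": "Ziege"}

EVENTS = """
<events>
  <event id="e1" clause="cls1" event-name="free-kick">
    <arguments>
      <argument id="arg1" name="player" value="w4, w5"/>
      <argument id="arg2" name="location" value="25-meter"/>
    </arguments>
  </event>
</events>
"""


def resolve(value, tokens):
    """Replace token ids by their surface words; keep literal values as they are."""
    parts = [part.strip() for part in value.split(",")]
    if all(part in tokens for part in parts):
        return " ".join(tokens[part] for part in parts)
    return value


def event_templates(xml_string, tokens):
    """Return one dictionary per event: the event name plus its resolved slots."""
    templates = []
    for event in ET.fromstring(xml_string).findall("event"):
        slots = {
            argument.get("name"): resolve(argument.get("value"), tokens)
            for argument in event.findall("./arguments/argument")
        }
        templates.append({"event": event.get("event-name"), **slots})
    return templates
```

The resulting dictionaries correspond to the templates of information extraction and could be stored as index entries for the matching video segment.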
44Conceptual Annotations for Multimedia Indexing
and Retrieval: A multilingual cross-document and
incremental IE approach (MUMIS)
- Technology development to automatically index (with formal annotations) lengthy multimedia recordings (off-line process): find and annotate relevant entities, relations and events
- Technology development to exploit indexed multimedia archives (on-line process): search for interesting scenes and play them via Internet
- Test Domain: Soccer Games / UEFA Tournament 2000
45Off-line Task
Indexing by...
- Automatic Speech Recognition (Radio/TV Broadcasts)
- - Automatically transforms the speech signals into texts (for 3 languages: Dutch, English and German)
- Natural Language Processing (Information Extraction)
- - Analyse all available textual documents (newspapers, speech transcripts, tickers, formal texts ...), identify and extract interesting entities, relations and events
- Merging all the annotations produced so far
- Create a database with formal annotations
- Use video processing to adjust time marks
46Information Extraction
- Information Extraction (IE) is the task of identifying, collecting and normalizing relevant information for a specific application or user.
- The relevant information is typically represented in the form of predefined templates, which are filled by means of Natural Language (NL) analysis.
- IE combines pattern matching mechanisms, (shallow) NLP and domain knowledge (terminology and ontology).
47Information Extraction (2)
- IE is generally subdivided into the following tasks:
- - Named Entity task (NE)
- - Template Element task (TE)
- - Template Relation task (TR)
- - Scenario Template task (ST)
- - Co-reference task (CO)
48Subtasks of IE
- Named Entity task (NE): Mark in the text each string that represents a person, organization, or location name, a date or time, or a currency or percentage figure.
- Template Element task (TE): Extract basic information related to organization, person, and artifact entities, drawing evidence from everywhere in the text.
49Subtasks of IE (2)
- Template Relation task (TR): Extract relational information on employee_of, manufacture_of, location_of relations etc. (TR expresses domain-independent relationships).
- Scenario Template task (ST): Extract pre-specified event information and relate the event information to particular organization, person, or artifact entities (ST identifies domain and task specific entities and relations).
- Co-reference task (CO): Capture information on co-referring expressions, i.e. all mentions of a given entity, including those marked in NE and TE.
50IE applied to soccer
- Terms as descriptors for the NE task
- Team: Titelverteidiger Brasilien, den respektlosen Außenseiter Schottland
- Player: Superstar Ronaldo, von Bewacher Calderwood noch von Abwehrchef Hendry, von Jackson als drittem Stürmer, Torschütze Cesar, von Roberto Carlos (16.),
- Referee: vom spanischen Schiedsrichter Garcia Aranda
- Trainer: Schottlands Trainer Brown, Kapitän Hendry seinen Keeper Leighton
- Location: im Stade de France von St. Denis (a more fine-grained location detection would be Stadion: im Stade de France and City: von St. Denis)
- Attendance: Vor 80000 Zuschauern
51IE applied to soccer (2)
- Terms for NE Task
- Time: in der 73. Minute, nach gerade einmal 350 Minuten, von Roberto Carlos (16.), nach einer knappen halben Stunde, scheiterte Rivaldo (49./52.) jeweils nur knapp, das vor der Pause Versäumte versuchten die Brasilianer nach Wiederbeginn, ...
- Date: am Mittwoch, der Turnierstart (?), im WM-Eröffnungsspiel (?)
- Score/Result: Brasilien besiegt Schottland 2:1, einen 2:1 (1:1)-Sieg, der zwischenzeitliche Ausgleich, in der 4. Minute in Führung gebracht, köpfte zum 1:0 ein
52IE applied to soccer (3)
- Relations for TR Task
- Opponents: Brasilien besiegt Schottland, feierte der Top-Favorit ... einen glücklichen 2:1 (1:1)-Sieg über den respektlosen Außenseiter Schottland,
- Player_of: hatte Cesar Sampaio den vierfachen Weltmeister ... in Führung gebracht, Collins gelang ... der zwischenzeitliche Ausgleich für die Schotten, der Keeper des FC Aberdeen, Brasiliens Keeper Taffarel
- Trainer_of: Schottlands Trainer Brown
- ...
53IE applied to soccer (4)
- Events for ST task
- Goal: in der 4. Minute in Führung gebracht, das schnellste Tor ... markiert, Cesar Sampaio köpfte zum 1:0 ein, Collins (38.) verwandelte den Strafstoß, hätte Kapitän Hendry seinen Keeper Leighton um ein Haar zum zweiten Mal bezwungen, von dem der Ball ins Tor prallte
- Foul: als er den durchlaufenden Gallacher im Strafraum allzu energisch am Trikot zog
- Substitution: und mußte in der 59. Minute für Crespo Platz machen...
54NL Processing and Knowledge Markup of (German)
soccer texts with the SCHUG system
- A multilingual ontological lexicon
- Formal Text1
- Formal Text2
- XML Soccer Annotation for Text1
- XML Soccer Annotation for Text2
- Merging of Annotations for Formal Texts
- Semi-Formal Text
- Semi-Formal Text annotated with Soccer Information (XML)
55Multilingual Extension: Spanish (Esperonto)
- Ontology
<lex-element id="ID" concept="Second-half">
  <... lang="DE" type="main">zweite Halbzeit</term>
  <... lang="EN" type="main">second half</term>
  <... lang="ES" type="main">reanudacion</term>
</lex-element>
- Processing with the SCHUG system
- Example
56Conceptual Annotations for Multimedia Indexing
and Retrieval MUMIS
57The first user interface of MUMIS
58Esperonto Partners
- Intelligent Software Components (Coord): Semantic Web, Annotation Services.
- UPM: ontology development and evaluation.
- University of Innsbruck: Semantic Web languages.
- Saarland University: multilingual Annotation services, using Information Extraction.
- UNILIV: Semantic indexation of Semantic Web content. Routing solutions. Visualization and navigation to make content presentation more user-friendly.
- Residencia de Estudiantes: Content provider. Cultural tour test case. Evaluation.
- CIDEM: Content provider. Fund finder test case. Evaluation.
- BioVista: Content provider. Scientific Discovery test case.
59Aim
Application Service Provision of Semantic
Annotation, Aggregation, Indexing and Routing of
Textual, Multimedia, and Multilingual Web Content.
The project aims at bridging the gap between the
current World Wide Web and the Semantic Web by
providing a service to "upgrade" existing content
to Semantic Web content. Ontologies play a key
role in this effort, together with multilingual
Natural Language Analysis of textual documents
currently on the web as free or HTML-encoded
texts.
60Main Goals
- To bridge the gap between the current web and the Semantic Web: SemASP
- Ontology-based annotation
- Sources
- Static pages
- Pages dynamically generated from DB
- Textual and multimedia information
- Web services
- Added value knowledge-based services on top of the constructed semantic web
- Routing based on P2P communication
- Semantic aggregation
- Meaning negotiation
- Support Multilinguality on ontology construction, ...
61 Applications
[Figure: Esperonto architecture. Labels shown: Agent; Visualization Service Provider; Multilingual NL Generation; Semantic Web (Semantic indices, Concept instances); Tagger/Wrapper (four instances); Certificate; Multilinguality; Ontology Repository Service; Workbench; Reengineering; SemASP; Maintenance; Mapping; Multilingual NL Understanding; World Wide Web; Static Information Provider; Web Server Provider; Dynamic Information Provider; Multimedia Data Provider.]
62Ontology-based Annotation
- Accurately annotate documents with concepts and terms described in various semantic resources: EuroWordNet, UMLS, Soccer ontology etc.
- Annotate documents with relations defined in the ontology
63Ontology construction from Text
There are various methodologies under
investigation for extracting/learning knowledge
from text and encoding it in an ontology (see
Ontology Learning Overview - OntoWeb D1.5,
http://www.ontoweb.org). Many are based on
Machine Learning techniques. We discuss here the
possibility of a rule-based approach for partial
and shallow ontology construction from text,
based on various levels of syntactic patterns
annotated in the documents.
64Ontology construction from Text: A starting
experiment (Medicine)
Document Set: 65 sample phrases that link
symptoms with Rheumatoid Arthritis (RA).
65Ontology construction from Text: Apposition and
Parenthesis (1)
The effects of rheumatoid arthritis on bone
include structural joint damage (erosions) and
osteoporosis.
Linguistic Structure: The effects of rheumatoid
arthritis on bone include structural joint
damage ( erosions ) and osteoporosis
=> The apposition (2 syntactic heads, joint and
erosions, in one NP) including a parenthesis
construction suggests a synonymy relation or a
definition.
Heuristic: Establishing semantic relations on top
of linguistic head-modifier constructions
66Ontology construction from Text: Apposition with
Parenthesis (2)
- For symptoms of rheumatoid arthritis (pain, joint stiffness), the reference treatment is a nonsteroidal antiinflammatory drug (NSAID) such as diclofenac or ibuprofen.
- Linguistic Structure: For symptoms of rheumatoid arthritis ( pain , joint stiffness ) , the reference treatment is a nonsteroidal antiinflammatory drug ( NSAID )
- This suggests a semantic relation between pain and joint stiffness
- Classify pain and joint stiffness as symptoms of RA. The word symptom is linguistically annotated as the head of the Compl-NP of the PP starting with For.
67Ontology construction from Text: Apposition with
Parenthesis (3)
But the hypothesis needs to be constrained:
In patients with rheumatoid arthritis (RA)
=> RA is an abbreviation of rheumatoid arthritis.
And in the sentence: Fourteen consecutive elbows
have been treated for rheumatoid arthritis (9
elbows) and for post-traumatic osteoarthrosis (5
elbows) by total elbow replacement with the GSB
III implant.
the parentheses (9 elbows) and (5 elbows) have no
semantic relation to the preceding head nouns!
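The constrained hypothesis can be sketched as a pattern-based heuristic. The version below works on raw strings and approximates the preceding NP by a window of up to three words, which is an assumption made to keep the example self-contained; the approach described in these slides would use the annotated chunk and its head instead. The relation labels are hypothetical.

```python
import re

# Up to three words followed by a parenthesis: a crude stand-in for "NP ( ... )".
PAREN = re.compile(r"\b((?:[\w-]+ ){0,2}[\w-]+) \(([^)]+)\)")


def parenthesis_relations(sentence):
    """Propose relations between a parenthesis and the words before it."""
    relations = []
    for match in PAREN.finditer(sentence):
        words, inside = match.group(1).split(), match.group(2).strip()
        if not inside or inside[0].isdigit():
            continue  # counts such as "(9 elbows)": no semantic relation
        if inside.isupper():
            # Abbreviation test: compare with the initials of the last words.
            expansion = words[-len(inside):]
            if "".join(word[0] for word in expansion).upper() == inside:
                relations.append(("abbreviation_of", inside, " ".join(expansion)))
                continue
        relations.append(("related_to", inside, " ".join(words)))
    return relations
```

The initials test is deliberately strict: it recognises RA for rheumatoid arthritis but falls back to a generic relation for NSAID, whose letters are not simply word initials.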
68Ontology construction from Text: Apposition with
commas
Etoricoxib, a selective COX2 inhibitor, has been
shown to be as effective as non-selective
non-steroidal anti-inflammatory drugs in the
management of chronic pain in rheumatoid
arthritis and osteoarthritis.
Linguistic Structure: Etoricoxib, a selective
COX2 inhibitor, has been shown ...
The same hypothesis as in the former examples: a
semantic relation between Etoricoxib and
selective COX2 inhibitor, probably an is-a
relation.
69Ontology construction from Text: Compound Analysis
Joint destruction, joint damage, joint
disease, joint stiffness, but joint
cartilage. Knee joints vs. tender joints.
What can happen to joints, and where are joints
located? Use of synsets to detect relations?
Joint cartilage is not a disease.
70Ontology construction from Text: PP
post-modification
inflammation of joints, synovial lining of
joints. Here, could synsets be used for grouping
what can happen to joints?
71Ontology construction from Text: Phrase Internal
Coordination
- The effects of rheumatoid arthritis on bone include structural joint damage (erosions) and osteoporosis
- Linguistic Structure: The effects of rheumatoid arthritis on bone include structural joint damage ( erosions ) and osteoporosis
- RA causes structural joint damage AND osteoporosis (interpreting the head noun effects as a causation).
- Hypothesis: The two heads of an NP coordination are somehow related.
72Ontology construction from Text: Phrase Internal
Coordination (2)
- A study was conducted to determine the incidence of ulnar and peripheral neuropathy
- Linguistic Structure: The incidence of ulnar and peripheral neuropathy
- The AP ulnar and peripheral modifies the head noun neuropathy. The AP is a coordinated one, having two adjectival heads.
- Hypothesis: They correspond to two types of neuropathy
73Ontology construction from Text: Subject Verb
Objects (Ind. Obj. etc.)
Rheumatoid arthritis is an immunologically
mediated inflammation of joints of unknown
aetiology and often leads to disability
=> RA leads to Disability (effect of ellipsis
resolution: RA is detected as the subject of the
verb leads, even though it is not realised in the
text. Reference resolution is very important for
knowledge extraction)
=> Lexical semantic info: collect all objects of
RA leads to
=> Suggest Causality (verb lead to)
74Ontology construction from Text: Subject Verb
Objects (Ind. Obj. etc.)
These changes constitute hallmarks of synovial
cell activation and contribute to both chronic
inflammation and hyperplasia. Online exercise!
75Future Work
We still have to accurately identify the subset of
linguistic tags describing syntactic/semantic
patterns that are relevant for ontology
extraction (or even ontology mark-up).
76First Conclusions
Construction of partial and shallow ontologies
from (complex) syntactic patterns seems feasible.
It might seem expensive in the sense that
documents first have to be (automatically)
linguistically annotated. But Machine Learning
methods also need a lot of semi-automatically
annotated data for training. There is a need to
conduct a comparative evaluation taking into
account as many parameters as possible.
77Practical Sessions (Adrian Raschip)
- Exercise 1: Semi-automatic terminological extension to Romanian and other languages, on the basis of the TMX-encoded MUMIS multilingual terminology
- Exercise 2: (Manual) linguistic annotation of English and Romanian text on soccer
- Exercise 3: Define a soccer ontology in Protégé
- Exercise 4: Search for possible mapping rules between linguistic annotations and relations that might be relevant to extract