Language Technology and the Semantic Web

1
Language Technology and the Semantic Web

Thierry Declerck and Paul Buitelaar (Saarland
University and DFKI GmbH)

We present collaborative research work on the
combination of language technology (LT) and
technologies for encoding (domain) knowledge in
ontologies, supporting the emergence of the
Semantic Web (SW), or perhaps more appropriately,
Semantic Webs:
- MUMIS (multimedia content indexing and
searching in the soccer domain, finished in
December 2002)
- MuchMore (cross-lingual information retrieval
in the medical domain, finished in May 2003)
- Esperonto (developing a Semantic Annotation
Service for upgrading the current Web to the
Semantic Web, Sept. 2002 - May 2005)

3
Semantic Web Applications of LT

- Supporting accurate ontology-based semantic
annotation of multilingual web documents
(Knowledge Markup)
- Supporting ontology learning/construction from
linguistically/semantically annotated
multilingual text (Knowledge Extraction)
See also the Special Interest Group (SIG-5)
OntoWeb-lt on Language Technology in Ontology
Development and Use: http://ontoweb-lt.dfki.de

4
Knowledge Markup and Knowledge Extraction
[Diagram labels: Text/Speech; Text/Speech Mining;
Linguistic and Semantic Annotations; Concepts,
Relations, Events]

Linguistic Analysis: Morpho-Syntactic Analysis and
Tagging, Semantic Class Tagging, Term/NE
Recognition, Grammatical Function Tagging,
Dependency Structure Analysis
5
Knowledge Markup and Knowledge Extraction (2)
[Diagram labels: Text/Speech/Image-Video;
Text/Speech/Media Mining; Linguistic, Low-level
Image and Semantic Annotations; Concepts,
Relations, Events; Linguistic and Media Analysis]
6
Integration of Language Technology and Domain
Knowledge
7
Linguistic Analysis
Language technology tools are needed to support
the upgrade of the current web to the Semantic Web
(SW) by providing an automatic analysis of the
linguistic structure of textual documents.
Free-text documents that undergo linguistic
analysis become available as semi-structured
documents, from which meaningful units can be
extracted automatically (information extraction)
and organized through clustering or
classification (text mining). Here we focus on the
following linguistic analysis steps that underlie
the extraction tasks: morphological analysis,
part-of-speech tagging, chunking, dependency
structure analysis, and semantic tagging.
8
Morphological Analysis
Morphological analysis is concerned with the
inflectional, derivational, and compounding
processes in word formation, in order to determine
properties such as stem and inflectional
information. Together with part-of-speech (PoS)
information, this process delivers the
morpho-syntactic properties of a word. When
processing the German word Häusern (houses), the
following morphological information should be
produced:
PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS
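The analysis of Häusern can be pictured as a lookup in a full-form lexicon. The sketch below is only an illustration under that assumption: the names FULL_FORM_LEXICON, analyse and format_analysis are hypothetical, and a real analyser would derive such entries from inflection, derivation and compounding rules rather than list them.

```python
# Toy full-form lexicon: maps an inflected word form to its analyses.
# The single entry mirrors the Häusern example; everything else here
# (names, structure) is an assumption made for illustration.
FULL_FORM_LEXICON = {
    "Häusern": [
        {"PoS": "N", "NUM": "PL", "CASE": "DAT", "GEN": "NEUT", "STEM": "HAUS"},
    ],
}


def analyse(word_form):
    """Return all morphological analyses of a word form (empty if unknown)."""
    return FULL_FORM_LEXICON.get(word_form, [])


def format_analysis(analysis):
    """Render one analysis as space-separated FEATURE=VALUE pairs."""
    return " ".join(f"{feature}={value}" for feature, value in analysis.items())


if __name__ == "__main__":
    for analysis in analyse("Häusern"):
        print(format_analysis(analysis))
```

A word form can have several analyses, which is why analyse returns a list; choosing between them is left to later steps such as PoS tagging.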
9
Part-of-Speech Tagging
Part-of-Speech (PoS) tagging is the process of
determining the correct syntactic class (a
part-of-speech, e.g. noun, verb, etc.) for a
particular word given its current context. The
word works in the following sentences can be
either a verb or a noun:
He works [N,V] the whole day for nothing.
His works [N,V] have all been sold abroad.
PoS tagging involves disambiguation between
multiple part-of-speech tags, as well as guessing
the correct part-of-speech tag for unknown words
on the basis of context information.
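A minimal sketch of context-based disambiguation for the two example sentences. The mini-lexicon, the tag names and the two context rules are assumptions chosen for this illustration; practical taggers use far larger lexicons and statistical or learned models.

```python
# Hypothetical mini-lexicon: each word form with its possible tags.
LEXICON = {
    "he": ["PRON"],
    "his": ["POSS"],
    "works": ["N", "V"],
    "the": ["DET"],
    "have": ["V"],
}


def tag(tokens):
    """Pick one tag per token, using the tag of the preceding token as context."""
    tags = []
    for token in tokens:
        candidates = LEXICON.get(token.lower(), ["UNKNOWN"])
        if len(candidates) == 1:
            tags.append(candidates[0])
            continue
        previous = tags[-1] if tags else None
        # After a personal pronoun an ambiguous N/V form is read as a verb;
        # after a possessive or a determiner it is read as a noun.
        if previous == "PRON" and "V" in candidates:
            tags.append("V")
        elif previous in ("POSS", "DET") and "N" in candidates:
            tags.append("N")
        else:
            tags.append(candidates[0])  # fall back to the first listed tag
    return tags


if __name__ == "__main__":
    print(tag(["He", "works"]))
    print(tag(["His", "works", "have"]))
```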
10
Chunking
Following Abney, chunks are defined as the
non-recursive parts of core phrases, such as
nominal, prepositional, adjectival and adverbial
phrases and verb groups. Chunk parsing is an
important step towards making natural language
processing robust, since its goal is not to
deliver a full analysis of sentences, but to
extract just the linguistic fragments that can be
identified with certainty. Even if this strategy
fails to produce an analysis for the whole
sentence, the partial linguistic information
gained will still be useful for many
applications, such as information extraction and
text mining.
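Chunk parsing can be sketched as pattern matching over a sequence of PoS tags. The tag set (DET, ADJ, N, PREP, AUX, V) and the three patterns below are assumptions for illustration, not the grammar used in the projects; the point is that tokens matching no pattern are skipped instead of making the whole analysis fail.

```python
import re

# Chunk patterns over a space-terminated tag sequence (toy tag set, assumed).
CHUNK_PATTERNS = [
    ("PP", re.compile(r"PREP (?:DET )?(?:ADJ )*N ")),
    ("NP", re.compile(r"(?:DET )?(?:ADJ )*N ")),
    ("VG", re.compile(r"(?:AUX )*V ")),
]


def chunk(tagged):
    """tagged: list of (word, tag) pairs. Returns a list of (type, [words])."""
    chunks, i = [], 0
    while i < len(tagged):
        rest = "".join(tag + " " for _, tag in tagged[i:])
        for label, pattern in CHUNK_PATTERNS:
            match = pattern.match(rest)
            if match:
                length = match.group(0).count(" ")  # number of tags consumed
                chunks.append((label, [word for word, _ in tagged[i:i + length]]))
                i += length
                break
        else:
            i += 1  # token belongs to no chunk: skip it and carry on
    return chunks


if __name__ == "__main__":
    sentence = [("The", "DET"), ("shot", "N"), ("goes", "V"),
                ("over", "PREP"), ("the", "DET"), ("goal", "N")]
    print(chunk(sentence))
```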
11
Named Entity Detection
Related to chunking is the recognition of
so-called named entities (names of institutions
and companies, date expressions, etc.). The
extraction of named entities is mostly based on a
strategy that combines lookup in gazetteers
(lists of companies, cities, etc.) with the
definition of regular expression patterns. Named
entity recognition can be included as part of the
linguistic chunking procedure. The sentence
fragment "the secretary-general of the United
Nations, Kofi Annan," will then be annotated as a
nominal phrase including two named entities:
United Nations with named entity class
organization, and Kofi Annan with named entity
class person.
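The combination of gazetteer lookup and regular expression patterns might look as follows. The gazetteer contents and the date pattern are assumptions made for this sketch; real gazetteers hold many thousands of entries.

```python
import re

# Tiny gazetteer (assumed contents): surface string -> named entity class.
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}

# One regular expression for date expressions such as "12 May 2003".
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)


def find_named_entities(text):
    """Return (start offset, surface string, class) triples in text order."""
    found = []
    for name, ne_class in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(0), ne_class))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(0), "date"))
    return sorted(found)


if __name__ == "__main__":
    fragment = "the secretary-general of the United Nations, Kofi Annan,"
    for start, surface, ne_class in find_named_entities(fragment):
        print(start, surface, ne_class)
```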
12
Dependency Structure Analysis
A dependency structure consists of two or more
linguistic units that stand in an immediate
dominance relation in a syntax tree. The detection
of such structures is generally not provided by
chunking, but builds on top of it. Two main types
of dependencies are relevant for our purposes: on
the one hand, the internal dependency structure of
phrasal units or chunks, and on the other hand,
the so-called grammatical functions (like subject
and direct object).
13
Internal Dependency Structure
For this, linguistic analysis uses the terms
head, complement and modifier: the head is the
dominating node in the syntax tree of a phrase
(chunk), complements are necessary qualifiers
thereof, and modifiers are optional qualifiers.
Consider the following example:
The shot by Christian Ziege goes over the goal.
The prepositional phrase by Christian Ziege
(containing the named entity Christian Ziege)
depends on (and modifies) the head noun shot.
14
Grammatical Functions
Grammatical functions determine the role
(function) of each of the linguistic chunks in
the sentence and make it possible to identify the
actors involved in certain events. For example,
in the following sentence the syntactic (and also
the semantic) subject is the NP constituent The
shot by Christian Ziege:
The shot by Christian Ziege goes over the goal.
This nominal phrase depends on (and complements)
the verb goes, whereas the noun shot is the head
of the NP (it is the shot that goes over the
goal, not Christian Ziege!).
15
Semantic Tagging
Automatic semantic annotation has developed
within language technology in recent years in
connection with more integrated tasks like
information extraction, which require a certain
level of semantic analysis. Semantic tagging
consists of annotating each content word in a
document with a semantic category. Semantic
categories are assigned on the basis of a
semantic resource like WordNet for English, or
EuroWordNet, which links words between many
European languages through a common inter-lingua
of concepts.
16
Semantic Resources

Semantic resources are captured in dictionaries,
thesauri, and semantic networks, all of which
express, either implicitly or explicitly, an
ontology of the world in general or of more
specific domains, such as medicine.
They can be roughly divided into the following
three groups:
- Thesauri: semantic resources that group
together similar words or terms according to a
standard set of relations, including broader
term, narrower term, sibling, etc. (like Roget)
- Semantic lexicons: semantic resources that
group together words (or more complex lexical
items) according to lexical semantic relations
like synonymy, hyponymy, meronymy, and antonymy
(like WordNet)
- Semantic networks: semantic resources that
group together objects denoted by natural
language expressions (terms) according to a set
of relations that originate in the nature of the
domain of application (like UMLS in the medical
domain)

17
The MeSH Thesaurus
MeSH (Medical Subject Headings) is a thesaurus
for indexing articles and books in the medical
domain, which may then be used for searching
MeSH-indexed databases. MeSH provides for each
term a number of term variants that refer to the
same concept. It currently includes a vocabulary
of over 250,000 terms. The following is a sample
entry for the term gene library (MH is the term
itself, ENTRY lines are term variants):
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
etc.
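Because all term variants refer to the same concept, indexing and search can normalise any variant to its main heading. A minimal sketch, using only the sample entry above; the names MESH_ENTRIES, VARIANT_TO_HEADING and normalise are hypothetical and this is not the MeSH distribution format or any official API.

```python
# Term variants from the sample entry, keyed by the main heading (MH).
MESH_ENTRIES = {
    "Gene Library": ["Bank, Gene", "Banks, Gene", "DNA Libraries", "Gene Bank"],
}

# Inverted index: every variant (and the heading itself) -> main heading.
VARIANT_TO_HEADING = {}
for heading, variants in MESH_ENTRIES.items():
    for term in [heading, *variants]:
        VARIANT_TO_HEADING[term.lower()] = heading


def normalise(term):
    """Map a term variant to its main heading, or None if it is unknown."""
    return VARIANT_TO_HEADING.get(term.lower())


if __name__ == "__main__":
    print(normalise("gene bank"))
    print(normalise("DNA Libraries"))
```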
18
The WordNet Semantic Lexicon
WordNet has primarily been designed as a
computational account of the human capacity of
linguistic categorization and covers an extensive
set of semantic classes (called synsets). Synsets
are collections of synonyms, grouping together
lexical items according to meaning similarity.
Strictly speaking, synsets are not made up of
lexical items, but of lexical meanings (i.e.
senses).
19
WordNet: An Example
The word 'tree' has two meanings that roughly
correspond to the classes of plants and that of
diagrams, each with its own hierarchy of classes
that are included in more general super-classes:

09396070 tree 0
  09395329 woody_plant 0 ligneous_plant 0
    09378438 vascular_plant 0 tracheophyte 0
      00008864 plant 0 flora 0 plant_life 0
        00002086 life_form 0 organism 0 being 0 living_thing 0
          00001740 entity 0 something 0

10025462 tree 0 tree_diagram 0
  09987563 plane_figure 0 two-dimensional_figure 0
    09987377 figure 0
      00015185 shape 0 form 0
        00018604 attribute 0
          00013018 abstraction 0
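The two hierarchies can be stored as a table from synset offset to its members and its super-class, which makes both sense lookup and the walk up to the top class straightforward. The offsets and members below are copied from the example; the data layout and function names are assumptions for this sketch and do not reflect the WordNet database format.

```python
# Each synset offset maps to (lexical items, offset of its super-class).
# None marks the top of a hierarchy.
SYNSETS = {
    "09396070": (["tree"], "09395329"),
    "09395329": (["woody_plant", "ligneous_plant"], "09378438"),
    "09378438": (["vascular_plant", "tracheophyte"], "00008864"),
    "00008864": (["plant", "flora", "plant_life"], "00002086"),
    "00002086": (["life_form", "organism", "being", "living_thing"], "00001740"),
    "00001740": (["entity", "something"], None),
    "10025462": (["tree", "tree_diagram"], "09987563"),
    "09987563": (["plane_figure", "two-dimensional_figure"], "09987377"),
    "09987377": (["figure"], "00015185"),
    "00015185": (["shape", "form"], "00018604"),
    "00018604": (["attribute"], "00013018"),
    "00013018": (["abstraction"], None),
}


def senses(word):
    """Return the offsets of all synsets that contain the word."""
    return sorted(o for o, (members, _) in SYNSETS.items() if word in members)


def hypernym_chain(offset):
    """Follow super-class links upwards, naming each synset by its first item."""
    chain = []
    while offset is not None:
        members, offset = SYNSETS[offset]
        chain.append(members[0])
    return chain


if __name__ == "__main__":
    for sense in senses("tree"):
        print(sense, " > ".join(hypernym_chain(sense)))
```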
20
CYC: A Semantic Network
CYC is a semantic network of over 1,000,000
manually defined rules that cover a large part of
common sense knowledge about the world. For
example, CYC knows that trees are usually
outdoors, or that people who have died stop buying
things. Each concept in this semantic network is
defined as a constant, which can represent a
collection (e.g. the set of all people), an
individual object (e.g. a particular person), a
word (e.g. the English word person), a quantifier
(e.g. there exist), or a relation (e.g. a
predicate, function, slot, attribute). The entry
for the predicate mother:
mother
  (mother ANIM FEM)
  isa FamilyRelationSlot BinaryPredicate
This says that the predicate mother takes two
arguments, the first of which must be an element
of the collection Animal, and the second of
which must be an element of the collection
FemaleAnimal.
21
Word Sense Disambiguation
Words mostly have more than one interpretation,
or sense. If natural language were completely
unambiguous, there would be a one-to-one
relationship between words and senses. In fact,
things are much more complicated, because for
most words not even a fixed number of senses can
be given. Therefore, only in certain
circumstances, and depending on what exactly we
mean by sense, can we give restricted solutions
to the problem of Word Sense Disambiguation
(WSD).
22
A simplified Example of a Domain Ontology
Instances

23
Example of RDF Schema for the Movie Ontology
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#'
         xmlns:NS0='http://webode.dia.fi.upm.es/RDFS/MovieOntology#'>
  <rdf:Description rdf:about='http://webode.dia.fi.upm.es/RDFS/MovieOntology#SpecialEffectsCompanyActing'>
    <rdf:type rdf:resource='http://www.w3.org/2000/01/rdf-schema#Class'/>
    <rdfs:comment>Details of company that created special effects in this movie</rdfs:comment>
    <rdfs:subClassOf rdf:resource='http://webode.dia.fi.upm.es/RDFS/MovieOntology#CompanyActing'/>
  </rdf:Description>
  <rdf:Description rdf:about='http://webode.dia.fi.upm.es/RDFS/MovieOntology#Police'>
    <rdf:type rdf:resource='http://www.w3.org/2000/01/rdf-schema#Class'/>
    <rdfs:comment>Films that deal solely with police activity</rdfs:comment>
    <rdfs:subClassOf rdf:resource='http://webode.dia.fi.upm.es/RDFS/MovieOntology#Crime'/>
  </rdf:Description>

  etc.
24
Integration of Ontology and Semantic Lexicon

- An example of a semantic lexicon is WordNet
(sometimes also referred to as a linguistic
ontology)
- Ontologies are domain-specific models, usually
lacking linguistic information (PoS, morphology,
syntax, etc.)
- To be integrated in one resource, or
kept/accessed separately?

Standardization
- Format: web-based standards for lexical semantic
representation will increase their uptake
(easy plug-and-play, remote access, etc.)
- Content: widely used (lexical) semantic
resources will lead to (further) semantic
standardization

25
Defining a Linguistic Ontology for the Art World
(Tentative)

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:xsd="http://www.daml.org/2000/10/XMLSchema#"
  xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
  xmlns:art="http://www.art-world.org/art-world"
>
<daml:Ontology rdf:about="Concepts in the Art World">
  <daml:imports rdf:resource="http://www.daml.org/2001/03/daml+oil"/>
</daml:Ontology>

26
Defining Art World Concepts (Classes,
Synsets) (Tentative)
<daml:Class rdf:ID="art-world.01">
  <rdfs:label>art-world.01</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://www.art-world.org/art-world.00"/>
</daml:Class>
<art-world.01 rdf:ID="work"/>
<art-world.01 rdf:ID="painting"/>
<daml:Class rdf:ID="art-world.02"/>
<art-world.02 rdf:ID="beautiful"/>
<art-world.02 rdf:ID="colourful"/>
<daml:Class rdf:ID="art-world.03"/>
<art-world.03 rdf:ID="paper"/>
<art-world.03 rdf:ID="canvas"/>
27
Defining Properties (Selection
Restrictions) (Tentative)
<daml:ObjectProperty rdf:ID="manner">
  <rdfs:range rdf:resource="art-world.02"/>
  <rdfs:domain rdf:resource="art-world.01"/>
</daml:ObjectProperty>
<daml:ObjectProperty rdf:ID="medium">
  <rdfs:range rdf:resource="art-world.03"/>
  <rdfs:domain rdf:resource="art-world.01"/>
</daml:ObjectProperty>
</rdf:RDF>
28
(Semantic) Lexicons will be

an Important Part of the Semantic Web
Represented Using Markup Languages (RDF)
Accessible in a Remote, Distributed Fashion
Central to Further Semantic Standardization

29
Multilingual terminological lexicon, attached to
a domain ontology (MUMIS)

<lex-element id="ID" concept="Shot-on-goal">
  <... lang="DE" type="main">Torschuss</term>
  <... lang="EN" type="main">shot on goal</term>
  <... lang="NL" type="main">schot op doel</term>
  <definition>ein Angriffsspieler kickt den Ball zu den gegnerischen Tor</definition>
  <... lang="DE" type="synonym">Distanzschuss</term>
  <... lang="DE" type="synonym">Nachschuss</term>
  <... lang="DE" type="synonym">Schuss</term>
  <... lang="DE" type="synonym">abzieh</term>
</lex-element>
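A lexicon entry of this shape can be queried for all terms of a concept in a given language. The sketch below assumes that the elided opening tag is term (only the closing tag is visible on the slide) and uses a shortened copy of the entry; the function name terms_for is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the entry. The element name "term" for the opening
# tags is an assumption: the slide elides it and shows only </term>.
LEXICON_ENTRY = """
<lex-element id="ID" concept="Shot-on-goal">
  <term lang="DE" type="main">Torschuss</term>
  <term lang="EN" type="main">shot on goal</term>
  <term lang="NL" type="main">schot op doel</term>
  <term lang="DE" type="synonym">Schuss</term>
</lex-element>
"""


def terms_for(xml_string, lang, term_type=None):
    """Return the terms of one entry for a language, optionally by type."""
    root = ET.fromstring(xml_string)
    return [
        term.text
        for term in root.findall("term")
        if term.get("lang") == lang
        and (term_type is None or term.get("type") == term_type)
    ]


if __name__ == "__main__":
    print(terms_for(LEXICON_ENTRY, "DE"))
    print(terms_for(LEXICON_ENTRY, "EN", "main"))
```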

30
Extension and Formalization of the multilingual
terminological lexicon, including
syncategorematic information. Supporting WSD.

<lex-element id="ID" concept="Shot-on-goal">
  <... lang "DE" type "main pos N mod
     von concept Player concept player
     gender gen pos posspron
  >Torschuss</term>
  <... lang "DE" type "synonym pos V comp
     SUBJ concept Player >abzieh</term>
  <definition>URL DFB home page/glossary</definition>
</lex-element>

31
Integrating Syntactic and Domain Knowledge
Including Syntactic Analysis for a more accurate
tagging of domain specific semantic annotation

32
Abstraction over Syntactic Annotation
Ontology_3: Dependencies (Head, Comp, Mod, Spec)

Ontology_4: Grammatical Functions (Subject,
Object, Ind. Object, NP Adjunct, PP Adjunct, etc.)

33
Merging of Syntactic and Domain Knowledge
Example of a possible rule for conceptual
annotation:
If (head of Subj_NP of verb type
soccer:shot-on-goal is a person) => annotate the
head of the NP with semantic class soccer:player
Example of a rule for instance filling:
If (term annotated with concept soccer:player)
=> try to find information about the relations
Team, Age, etc.
(Template Filling in Information Extraction).
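The two rules can be read as a small pipeline: the first turns syntactic information into a conceptual annotation, the second opens a template for each annotated player. In the sketch below the clause representation (a dict with verb_type and subj_head) and the set of known persons are assumptions; in the setting of the slides this information would come from the linguistic annotation and NE recognition.

```python
def apply_rules(clause, persons):
    """Apply the conceptual-annotation rule, then the instance-filling rule.

    clause  -- hypothetical dict with 'verb_type' and 'subj_head'
    persons -- tokens already recognised as persons (assumed to be given)
    """
    annotations = {}
    head = clause["subj_head"]
    # Rule 1: the subject head of a shot-on-goal verb, if a person, is a player.
    if clause["verb_type"] == "soccer:shot-on-goal" and head in persons:
        annotations[head] = "soccer:player"
    # Rule 2: every soccer:player opens a template whose slots still need filling.
    templates = [
        {"concept": concept, "instance": term, "Team": None, "Age": None}
        for term, concept in annotations.items()
        if concept == "soccer:player"
    ]
    return annotations, templates


if __name__ == "__main__":
    clause = {"verb_type": "soccer:shot-on-goal", "subj_head": "Ziege"}
    print(apply_rules(clause, {"Ziege"}))
```

The empty Team and Age slots are what a later template-filling step would try to complete from the same or other documents.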

34
NLP-based knowledge markup
35
MuchMore DTD for Annotation
[Diagram: MuchMore annotation DTD. Elements shown:
document, sentence, text, token, chunks, chunk,
gramrels, gramrel, ewnterms, ewnterm, sense,
umlsterms, umlsterm, msh, xrceterms, xrceterm,
semrels, semrel. Attributes shown: id, from, to,
type, pos, lemma, offset, pref, tui, cui, code,
term1, term2]

36
MuchMore Linguistic Annotation (Lemmatization,
POS, Basic Chunking)
Balint syndrom is a combination of symptoms
including simultanagnosia, a disorder of spatial
and object-based attention, disturbed spatial
perception and representation, and optic ataxia
resulting from bilateral parieto-occipital
lesions.
<text>
  <token id="w1" pos="NN">Balint</token>
  <token id="w2" pos="NN">syndrom</token>
  <token id="w3" pos="VBZ" lemma="be">is</token>
  <token id="w4" pos="DT" lemma="a">a</token>
  <token id="w5" pos="NN" lemma="combination">combination</token>
  <token id="w6" pos="IN" lemma="of">of</token>
  <token id="w7" pos="NNS" lemma="symptom">symptoms</token>
  ...
  <token id="w20" pos="JJ" lemma="spatial">spatial</token>
  <token id="w21" pos="NN" lemma="perception">perception</token>
  <token id="w22" pos="CC" lemma="and">and</token>
  <token id="w23" pos="NN" lemma="representation">representation</token>
  ...
</text>
<chunks>
  <chunk id="c1" from="w1" to="w2" type="NP"/>
  <chunk id="c7" from="w20" to="w23" type="NP"/>
</chunks>
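Chunks refer to tokens only through their from and to attributes, so reading an annotation means resolving those references back to the token sequence. A minimal sketch with the Python standard library; the wrapping sentence element and the function name chunk_texts are assumptions, and only tokens w20 to w23 of the example are included.

```python
import xml.etree.ElementTree as ET

# Excerpt of the annotation above. The <sentence> wrapper is assumed here
# so that the fragment has a single root element.
ANNOTATION = """
<sentence>
  <text>
    <token id="w20" pos="JJ" lemma="spatial">spatial</token>
    <token id="w21" pos="NN" lemma="perception">perception</token>
    <token id="w22" pos="CC" lemma="and">and</token>
    <token id="w23" pos="NN" lemma="representation">representation</token>
  </text>
  <chunks>
    <chunk id="c7" from="w20" to="w23" type="NP"/>
  </chunks>
</sentence>
"""


def chunk_texts(xml_string):
    """Map each chunk id to (chunk type, text covered by the chunk)."""
    root = ET.fromstring(xml_string)
    tokens = root.findall("./text/token")
    ids = [token.get("id") for token in tokens]
    result = {}
    for chunk in root.findall("./chunks/chunk"):
        start = ids.index(chunk.get("from"))
        end = ids.index(chunk.get("to"))
        words = [token.text for token in tokens[start:end + 1]]
        result[chunk.get("id")] = (chunk.get("type"), " ".join(words))
    return result


if __name__ == "__main__":
    print(chunk_texts(ANNOTATION))
```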
37
MuchMore Semantic Annotation (UMLS, EuroWordNet)
Balint syndrom is a combination of symptoms
including simultanagnosia, a disorder of spatial
and object-based attention, disturbed spatial
perception and representation, and optic ataxia
resulting from bilateral parieto-occipital
lesions.
<umlsterm id="t7" from="w20" to="w21">
  <concept id="t7.1" cui="C0037744" preferred="Space Perception" tui="T041">
    <msh code="F2.463.593.778"/>
    <msh code="F2.463.593.932.869"/>
  </concept>
</umlsterm>
<umlsterm id="t8" from="w26" to="w26">
  <concept id="t8.1" cui="C0029144" preferred="Optics" tui="T090">
    <msh code="H1.671.606"/>
  </concept>
</umlsterm>
<semrel id="r7" term1="t7.1" term2="t8.1" reltype="issue_in"/>
<ewnterm id="e2" from="w21" to="w21">
  <sense offset="0487490"/>
  <sense offset="3955418"/>
  <sense offset="4002483"/>
</ewnterm>
38
MUMIS DTD for Linguistic Annotation
[Diagram: top-level elements of the MUMIS DTD:
Document, Paragraph, Sentence, Subord-Clause, NP,
PP, AP, AdvP, VG, NE]
39
MUMIS DTD for Linguistic Annotation
[Diagram: the AP element, with labels AP_HEAD,
AP_AGR, STRUK, STRING, TYPE, W]
40
MUMIS DTD for Linguistic Annotation
[Diagram: the VG element, with labels VG_HEAD,
VG_TYPE, VG_SUBCAT_STEM, VG_AGR, VG_STRG, KLAMMER,
STRUK, STRING, SENT_STRING, TYPE, W, ...]
41
MUMIS DTD for Linguistic Annotation
[Diagram: the CLAUSE element, with labels
CLAUSE_TYPE, CLAUSE_SUBJ, CLAUSE_PP_ADJUNKT,
CLAUSE_PRED_SUBCAT, CLAUSE_PRED_STRG,
CLAUSE_PRED_AGR, CLAUSE_VG_LIST, CLAUSE_PP_LIST,
CLAUSE_NP_LIST, SENT_STRING, STRING, TYPE, W, STEM,
INFL, POS, TC, ...]
42
MUMIS Linguistic Annotation (Lemmatization,
Dependency Structure)
Industrie, Handel und Dienstleistungen werden in
der ersten Liste aufgeführt, wobei die in
Klammern gesetzten Zahlen auf die Mutterfirmen
hinweisen. (Industry, trade and services are
mentioned in the first list, in which numbers
within brackets point to parent companies.)
<chunks>
  <chunk id="c1" from="w1" to="w5" type="NP" head="w1,w3,w5"/>
  <chunk id="c2" from="w6" to="w6" type="VG"/>
  <chunk id="c3" from="w7" to="w10" type="PP" head="w7" complement="w8,w9,w10"/>
  <chunk id="c4" from="w11" to="w11" type="VG"/>
  ...
</chunks>
<clauses>
  <clause id="cl1" from="c1" to="c4" pred_struct="c2 c4" GF_Subj="c1"/>
  <clause id="cl2" from="c6" to="c9" pred_struct="c9" GF_Subj="c6"/>
</clauses>
43
MUMIS Semantic Annotation (Events)
7. Ein Freistoss von Christian Ziege aus 25
Metern geht über das Tor. (7th minute: a free kick
by Christian Ziege from 25 metres goes over the
goal.)
<chunks>
  <chunk id="c1" from="w1" to="w5" type="NP" head="w2" pp_modifier="w3 w4 w5"/>
  <chunk id="c2" from="w6" to="w8" type="PP" head="w6" complement="w7 w8"/>
  <chunk id="c3" from="w9" to="w9" type="VG"/>
  <chunk id="c4" from="w10" to="w12" type="PP" head="w10" complement="w11 w12"/>
</chunks>
<clauses>
  <clause id="cls1" from="c1" to="c4" pred_struct="c3" GF_Subj="c1"/>
</clauses>
<events>
  <event id="e1" clause="cls1" event-name="free-kick">
    <arguments>
      <argument id="arg1" name="player" value="w4, w5"/>
      <argument id="arg2" name="location" value="25-meter"/>
      <argument id="arg3" name="time" value="07:00"/>
    </arguments>
  </event>
  <event id="e2" clause="cls1" event-name="goal-scene-fail">
    <arguments>
      <argument id="arg1" name="player" value="w4, w5"/>
      <argument id="arg2" name="location" value="25-meter"/>
      <argument id="arg3" name="time" value="07:00"/>
    </arguments>
  </event>
</events>

44
Conceptual Annotations for Multimedia Indexing
and Retrieval: A multilingual, cross-document and
incremental IE approach (MUMIS)

- Technology development to automatically index
(with formal annotations) lengthy multimedia
recordings (off-line process): find and annotate
relevant entities, relations and events
- Technology development to exploit indexed
multimedia archives (on-line process): search for
interesting scenes and play them via the Internet
- Test domain: soccer games / UEFA Tournament 2000

45
Off-line Task
Indexing by...

- Automatic Speech Recognition (radio/TV
broadcasts): automatically transforms the speech
signals into texts (for 3 languages: Dutch,
English and German)
- Natural Language Processing (Information
Extraction): analyses all available textual
documents (newspapers, speech transcripts,
tickers, formal texts ...), identifies and
extracts interesting entities, relations and
events
- Merging of all the annotations produced so far:
creates a database with formal annotations
- Video processing, used to adjust time marks

46
Information Extraction

Information Extraction (IE) is the task of
identifying, collecting and normalizing relevant
information for a specific application or user.
The relevant information is typically
represented in the form of predefined templates,
which are filled by means of Natural Language
(NL) analysis.
IE combines pattern matching mechanisms,
(shallow) NLP and domain knowledge (terminology
and ontology).

47
Information Extraction (2)

IE is generally subdivided into the following tasks:
- Named Entity task (NE)
- Template Element task (TE)
- Template Relation task (TR)
- Scenario Template task (ST)
- Co-reference task (CO)

48
Subtasks of IE

- Named Entity task (NE): Mark in the text each
string that represents a person, organization,
or location name, or a date or time, or a
currency or percentage figure.
- Template Element task (TE): Extract basic
information related to organization, person, and
artifact entities, drawing evidence from
everywhere in the text.

49
Subtasks of IE (2)

- Template Relation task (TR): Extract relational
information on employee_of, manufacture_of,
location_of relations etc. (TR expresses
domain-independent relationships).
- Scenario Template task (ST): Extract
pre-specified event information and relate the
event information to particular organization,
person, or artifact entities (ST identifies
domain- and task-specific entities and relations).
- Co-reference task (CO): Capture information on
co-referring expressions, i.e. all mentions of a
given entity, including those marked in NE and
TE.

50
IE applied to soccer

Terms as descriptors for the NE task:
- Team: Titelverteidiger Brasilien, den
respektlosen Außenseiter Schottland
- Player: Superstar Ronaldo, von Bewacher
Calderwood noch von Abwehrchef Hendry, von Jackson
als drittem Stürmer, Torschütze Cesar, von Roberto
Carlos (16.),
- Referee: vom spanischen Schiedsrichter Garcia
Aranda
- Trainer: Schottlands Trainer Brown, Kapitän
Hendry seinen Keeper Leighton
- Location: im Stade de France von St. Denis (a
more fine-grained location detection would be
Stadion: im Stade de France, and City: von St.
Denis)
- Attendance: Vor 80000 Zuschauern

51
IE applied to soccer (2)

Terms for the NE task:
- Time: in der 73. Minute, nach gerade einmal 350
Minuten, von Roberto Carlos (16.), nach einer
knappen halben Stunde, scheiterte Rivaldo
(49./52.) jeweils nur knapp, das vor der Pause
Versäumte versuchten die Brasilianer nach
Wiederbeginn, ...
- Date: am Mittwoch, der Turnierstart (?), im
WM-Eröffnungsspiel (?)
- Score/Result: Brasilien besiegt Schottland 2:1,
einen 2:1 (1:1)-Sieg, der zwischenzeitliche
Ausgleich, in der 4. Minute in Führung gebracht,
köpfte zum 1:0 ein

52
IE applied to soccer (3)

Relations for the TR task:
- Opponents: Brasilien besiegt Schottland, feierte
der Top-Favorit ... einen glücklichen 2:1
(1:1)-Sieg über den respektlosen Außenseiter
Schottland,
- Player_of: hatte Cesar Sampaio den vierfachen
Weltmeister ... in Führung gebracht, Collins
gelang ... der zwischenzeitliche Ausgleich für
die Schotten, der Keeper des FC Aberdeen,
Brasiliens Keeper Taffarel
- Trainer_of: Schottlands Trainer Brown
...

53
IE applied to soccer (4)

Events for the ST task:
- Goal: in der 4. Minute in Führung gebracht, das
schnellste Tor ... markiert, Cesar Sampaio köpfte
zum 1:0 ein, Collins (38.) verwandelte den
Strafstoß, hätte Kapitän Hendry seinen Keeper
Leighton um ein Haar zum zweiten Mal bezwungen,
von dem der Ball ins Tor prallte
- Foul: als er den durchlaufenden Gallacher im
Strafraum allzu energisch am Trikot zog
- Substitution: und mußte in der 59. Minute für
Crespo Platz machen...

54
NL Processing and Knowledge Markup of (German)
soccer texts with the SCHUG system

A multilingual ontological lexicon
Formal Text1
Formal Text2
XML Soccer Annotation for Text1
XML Soccer Annotation for Text2
Merging of Annotations for Formal Texts
Semi-Formal Text
Semi-Formal Text annotated with Soccer
Information (XML)

55
Multilingual Extension: Spanish (Esperonto)

Ontology
<lex-element id="ID" concept="Second-half">
  <... lang="DE" type="main">zweite Halbzeit</term>
  <... lang="EN" type="main">second half</term>
  <... lang="ES" type="main">reanudacion</term>
</lex-element>
...
Processing with the SCHUG system
Example

56
Conceptual Annotations for Multimedia Indexing
and Retrieval MUMIS

57
The first user interface of MUMIS

58
Esperonto: Partners
- Intelligent Software Components (Coord):
Semantic Web, Annotation Services.
- UPM: ontology development and evaluation.
- University of Innsbruck: Semantic Web languages.
- Saarland University: multilingual annotation
services, using Information Extraction.
- UNILIV: semantic indexation of Semantic Web
content. Routing solutions. Visualization and
navigation to make content presentation more
user-friendly.
- Residencia de Estudiantes: content provider.
Cultural tour test case. Evaluation.
- CIDEM: content provider. Fund finder test case.
Evaluation.
- BioVista: content provider. Scientific Discovery
test case.

59
Aim
Application Service Provision of Semantic
Annotation, Aggregation, Indexing and Routing of
Textual, Multimedia, and Multilingual Web Content.
The project aims at bridging the gap between the
current World Wide Web and the Semantic Web by
providing a service to "upgrade" existing content
to Semantic Web content. Ontologies play a key
role in this effort, together with multilingual
Natural Language Analysis of textual documents
currently on the web as free or HTML-encoded
texts.

60
Main Goals

- To bridge the gap between the current web and
the Semantic Web: SemASP
- Ontology-based annotation
- Sources: static pages, pages dynamically
generated from DB, textual and multimedia
information, web services
- Added-value knowledge-based services on top of
the constructed Semantic Web: routing based on P2P
communication, semantic aggregation, meaning
negotiation
- Support multilinguality in ontology
construction, ...

61
Applications

[Diagram labels: Agent; Visualization Service
Provider; Multilingual NL Generation; Semantic Web
(semantic indices, concept instances);
Tagger/Wrapper (four instances); Certificate;
Multilinguality; Ontology Repository Service;
Workbench; Reengineering; SemASP; Maintenance;
Mapping; Multilingual NL Understanding; World Wide
Web (Static Information Provider, Web Server
Provider, Dynamic Information Provider, Multimedia
Data Provider)]
62
Ontology-based Annotation

- Accurately annotate documents with concepts and
terms described in various semantic resources:
EuroWordNet, UMLS, soccer ontology, etc.
- Annotate documents with relations defined in
the ontology

63
Ontology construction from Text
Various methodologies are under investigation for
extracting/learning knowledge from text and
encoding it in an ontology (see Ontology Learning
Overview - OntoWeb D1.5, http://www.ontoweb.org).
Many are based on Machine Learning techniques. We
discuss here the possibility of a rule-based
approach for partial and shallow ontology
construction from text, based on various levels
of syntactic patterns annotated in the documents.

64
Ontology construction from Text: A starting
experiment (Medicine)
Document set: 65 sample phrases that link
symptoms with Rheumatoid Arthritis (RA).

65
Ontology construction from Text: Apposition and
Parenthesis (1)
The effects of rheumatoid arthritis on bone
include structural joint damage (erosions) and
osteoporosis.
Linguistic structure: The effects of rheumatoid
arthritis on bone include structural joint damage
( erosions ) and osteoporosis
=> The apposition (2 syntactic heads, joint and
erosions, in one NP) including a parenthesis
construction suggests a synonymy relation or a
definition.
Heuristic: establish semantic relations on top of
linguistic head-modifier constructions.

66
Ontology construction from Text: Apposition with
Parenthesis (2)

For symptoms of rheumatoid arthritis (pain,
joint stiffness), the reference treatment is a
nonsteroidal antiinflammatory drug (NSAID) such
as diclofenac or ibuprofen.
Linguistic structure:
For symptoms of rheumatoid arthritis ( pain ,
joint stiffness ) , the reference treatment
is a nonsteroidal antiinflammatory drug (
NSAID)
This suggests a semantic relation between pain
and joint stiffness.
Classify pain and joint stiffness as symptoms
of RA. The word symptom is linguistically
annotated as the head of the Compl-NP of the PP
starting with For.

67
Ontology construction from Text: Apposition with
Parenthesis (3)

But the hypothesis needs to be constrained:
In patients with rheumatoid arthritis (RA)
=> RA is an abbreviation of rheumatoid arthritis.
And in the sentence
Fourteen consecutive elbows have been treated for
rheumatoid arthritis (9 elbows) and for
post-traumatic osteoarthrosis (5 elbows) by total
elbow replacement with the GSB III implant.
the parentheses (9 elbows) and (5 elbows) have no
semantic relation to the preceding head nouns!
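The parenthesis heuristic and its constraint can be sketched as follows. This is a deliberate simplification: the slides work on linguistically annotated heads, whereas the sketch uses a regular expression over the raw string and takes up to two words before the bracket as the related phrase. The relation names and the rule that an all-capitals parenthesis is an abbreviation are assumptions for illustration.

```python
import re

# "phrase (parenthesis)": capture up to two words before the opening bracket.
PAREN = re.compile(r"\b((?:[\w-]+ )?[\w-]+) \(([^()]+)\)")


def parenthesis_candidates(sentence):
    """Return (parenthesis content, hypothesised relation, preceding phrase)."""
    candidates = []
    for match in PAREN.finditer(sentence):
        phrase, inside = match.group(1), match.group(2)
        if re.match(r"\d", inside):
            # Constraint: counts such as "(9 elbows)" carry no semantic
            # relation to the preceding head noun.
            continue
        relation = "abbreviation_of" if inside.isupper() else "related_to"
        candidates.append((inside, relation, phrase))
    return candidates


if __name__ == "__main__":
    print(parenthesis_candidates("In patients with rheumatoid arthritis (RA)"))
    print(parenthesis_candidates(
        "treated for rheumatoid arthritis (9 elbows) and for "
        "post-traumatic osteoarthrosis (5 elbows)"))
```

The candidates are hypotheses only; deciding between synonymy, definition or another relation would still need the syntactic and semantic annotation discussed above.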

68
Ontology construction from Text: Apposition with
commas

Etoricoxib, a selective COX2 inhibitor, has been
shown to be as effective as non-selective
non-steroidal anti-inflammatory drugs in the
management of chronic pain in rheumatoid
arthritis and osteoarthritis.
Linguistic structure: Etoricoxib, a selective
COX2 inhibitor, has been shown ...
The same hypothesis as in the former examples: a
semantic relation between Etoricoxib and selective
COX2 inhibitor, probably an isa relation.
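A sketch of the comma-apposition pattern, again over the raw string instead of the annotated NP structure. The regular expression (a capitalised word, a comma, an article, and the phrase up to the next comma) and the function name isa_candidates are assumptions made for this illustration.

```python
import re

# "X, a/an/the Y," where X is a single capitalised word.
APPOSITION = re.compile(r"\b([A-Z][\w-]*), (?:a|an|the) ([^,]+),")


def isa_candidates(sentence):
    """Return (term, 'isa', class phrase) hypotheses from comma appositions."""
    return [
        (match.group(1), "isa", match.group(2))
        for match in APPOSITION.finditer(sentence)
    ]


if __name__ == "__main__":
    print(isa_candidates(
        "Etoricoxib, a selective COX2 inhibitor, has been shown to be effective"))
```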

69
Ontology construction from Text: Compound Analysis
Joint destruction, joint damage, joint disease,
joint stiffness, but joint cartilage. Knee joints
vs. tender joints. What can happen to joints, and
where are joints located? Use of synsets to detect
relations? Joint cartilage is not a disease.

70
Ontology construction from Text: PP
post-modification
inflammation of joints, synovial lining of
joints. Here too, use of synsets for grouping
what can happen to joints?

71
Ontology construction from Text: Phrase-Internal
Coordination

The effects of rheumatoid arthritis on bone
include structural joint damage (erosions) and
osteoporosis
Linguistic structure:
The effects of rheumatoid arthritis on bone
include structural joint damage ( erosions )
and osteoporosis
RA causes structural joint damage AND
osteoporosis (interpreting the head noun effects
as a causation).
Hypothesis: the two heads of an NP coordination
are somehow related.

72
Ontology construction from Text: Phrase-Internal
Coordination (2)

A study was conducted to determine the incidence
of ulnar and peripheral neuropathy
Linguistic structure:
The incidence of ulnar and peripheral
neuropathy
The AP ulnar and peripheral modifies the head
noun neuropathy. The AP is a coordinated one,
having two adjectival heads.
Hypothesis: they correspond to two types of
neuropathy
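The hypothesis amounts to distributing the head noun over the coordinated adjectives. A minimal sketch, assuming the two adjectival heads and the head noun have already been identified by the linguistic annotation; the function name and the type_of relation label are hypothetical.

```python
def expand_coordinated_modifiers(adjectives, head_noun):
    """Turn 'A1 and A2 N' into one (subtype, 'type_of', N) triple per adjective."""
    return [
        (f"{adjective} {head_noun}", "type_of", head_noun)
        for adjective in adjectives
    ]


if __name__ == "__main__":
    for triple in expand_coordinated_modifiers(["ulnar", "peripheral"], "neuropathy"):
        print(triple)
```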

73
Ontology construction from Text: Subject Verb
Objects (Ind. Obj. etc.)
Rheumatoid arthritis is an immunologically
mediated inflammation of joints of unknown
aetiology and often leads to disability
=> RA leads to disability (effect of ellipsis
resolution: RA is detected as the subject of the
verb leads, even if not realised in the text.
Reference resolution is very important for
knowledge extraction)
=> Lexical semantic info: collect all objects of
RA leads to
=> Suggests causality (verb lead to)
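The steps above can be sketched over a list of clauses with their grammatical functions. The clause representation (dicts with subject, verb and object), the small list of causal verbs and the simple rule of inheriting the previous subject for ellipsis are all assumptions made for this illustration.

```python
# Verbs taken to suggest causality (assumed list; the slide names "lead to").
CAUSAL_VERBS = {"lead to", "cause", "result in"}


def causal_relations(clauses):
    """Collect (subject, 'causes', object) triples from causal clauses.

    A clause without a realised subject inherits the subject of the
    preceding clause, a very simple stand-in for ellipsis resolution.
    """
    relations = []
    last_subject = None
    for clause in clauses:
        subject = clause.get("subject") or last_subject
        last_subject = subject
        if clause["verb"] in CAUSAL_VERBS and subject and clause.get("object"):
            relations.append((subject, "causes", clause["object"]))
    return relations


if __name__ == "__main__":
    clauses = [
        {"subject": "rheumatoid arthritis", "verb": "be",
         "object": "inflammation of joints"},
        {"subject": None, "verb": "lead to", "object": "disability"},
    ]
    print(causal_relations(clauses))
```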

74
Ontology construction from Text: Subject Verb
Objects (Ind. Obj. etc.) (2)
These changes constitute hallmarks of synovial
cell activation and contribute to both chronic
inflammation and hyperplasia. Online exercise!

75
Future Work
We still have to identify accurately the subset
of linguistic tags, describing syntactic/semantic
patterns, that are relevant for ontology
extraction (or even ontology mark-up).

76
First Conclusions
Construction of partial and shallow ontologies
from (complex) syntactic patterns seems feasible.
It might seem expensive, in the sense that
documents should first be (automatically)
linguistically annotated. But Machine Learning
methods also need a lot of semi-automatically
annotated data for training. There is a need to
conduct a comparative evaluation taking into
account as many parameters as possible.

77
Practical Sessions (Adrian Raschip)

Exercise 1: Semi-automatic terminological
extension to Romanian and other languages, on the
basis of the TMX-encoded MUMIS multilingual
terminology
Exercise 2: (Manual) linguistic annotation of
English and Romanian text on soccer
Exercise 3: Define a soccer ontology in Protégé
Exercise 4: Search for possible mapping rules
between linguistic annotations and relations that
might be relevant to extract