Title: L-ISA: Learning Domain Specific ISA relations from the WEB
1. L-ISA: Learning Domain Specific ISA relations from the WEB
- Alessandra Potrich and Emanuele Pianta
- Fondazione Bruno Kessler - IRST
- Trento, Italy
LREC 2008
Marrakech, 31 May 2008
2. Overview
- Learning ISA relations in the patent processing domain (the PATExpert Project)
- The L-ISA algorithm
- Evaluation
- Future Work
3. Ontology Learning/Population
- Ontology Learning: acquisition of new concepts and relations between them
  - e.g., a device is an artifact
- Ontology Population: acquisition of factual knowledge about specific instances
  - e.g., Einstein is an instance of a scientist
  - e.g., Einstein was born in 1879
4. PATExpert
- Funded by the European Union
- Aim: improving patent retrieval, summarization, paraphrasing, classification and valuing through shallow and deep semantic analysis
- Main semantic analysis task: recognizing occurrences of KB concepts and relations
- Proof of concept on two domains
  - Optical Recording
  - Machine Tools
- Focus of the presentation: Ontology Learning in the Optical Recording domain
5. Optical Recording Domain Ontology (ORDO)
- Based on the OWL formalism
- Built in three stages
  - 200 manually crafted concepts, starting from a list of the most frequent terms in a reference corpus
  - Pro-ISA: ontology learning algorithm based on projection of WordNet fragments onto ORDO
  - L-ISA: ontology learning algorithm based on acquisition of isa templates from the Web
6. Patent Concept Annotation
- Given a target word
  - disambiguate it, by assigning a WN synset whose domain is compatible with the optical recording domain (exploiting WORDNET-DOMAINS)
  - If the synset is linked to an ORDO concept, annotate the target word with the ORDO concept
  - Otherwise apply Pro-ISA
  - Otherwise apply L-ISA
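The annotation procedure above is a fallback cascade: try the direct link, then Pro-ISA, then L-ISA. A minimal sketch of that control flow follows. The three strategy functions and their toy data are hypothetical stand-ins for illustration, not the project's actual components.

```python
from typing import Callable, List, Optional

# A strategy maps a target word to a concept label, or None if it fails.
Strategy = Callable[[str], Optional[str]]


def annotate(word: str, strategies: List[Strategy]) -> Optional[str]:
    """Return the concept from the first strategy that succeeds, else None."""
    for strategy in strategies:
        concept = strategy(word)
        if concept is not None:
            return concept
    return None


# Toy stand-ins (hypothetical data, not the project's resources).
DIRECT_LINKS = {"CD": "ordo:cd"}
PRO_ISA_RESULTS = {"cross-talk": "auto_ordo:crosstalk"}


def direct_link(word: str) -> Optional[str]:
    return DIRECT_LINKS.get(word)


def pro_isa(word: str) -> Optional[str]:
    return PRO_ISA_RESULTS.get(word)


def l_isa(word: str) -> Optional[str]:
    # Placeholder for the Web-based learner; always "succeeds" in this toy.
    return f"web-learned hypernym of {word}"


CASCADE = [direct_link, pro_isa, l_isa]
```

Calling `annotate("photodetector", CASCADE)` falls through the first two strategies and reaches the L-ISA stand-in.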
7. Choosing the right sense
- Senses for the word CD
  - 1. cadmium, Cd, atomic_number_48 (CHEMISTRY)
  - 2. candle, candela, cd, standard candle (PHYSICS)
  - 3. certificate of deposit, CD (MONEY)
  - 4. compact disk, compact disc, CD (COMPUTER, MUSIC)
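Domain-based sense selection for the CD example can be sketched as a set intersection between each sense's domain labels and the labels considered compatible with the target domain. The sense inventory is taken from the list above. The set of compatible labels is an assumption made for this illustration.

```python
# Senses of "CD" with their domain labels, as listed on the slide.
CD_SENSES = [
    ("cadmium, Cd, atomic_number_48", {"CHEMISTRY"}),
    ("candle, candela, cd, standard candle", {"PHYSICS"}),
    ("certificate of deposit, CD", {"MONEY"}),
    ("compact disk, compact disc, CD", {"COMPUTER", "MUSIC"}),
]

# Assumption: labels treated as compatible with optical recording.
COMPATIBLE_DOMAINS = {"COMPUTER", "MUSIC"}


def compatible_senses(senses, compatible):
    """Keep the synsets that share at least one domain label with `compatible`."""
    return [synset for synset, domains in senses if domains & compatible]
```

With these inputs only the compact disk sense survives.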
8. Direct Concept Annotation
[Diagram: the lemma "CD" maps to the synset {compact_disk, compact_disc, CD}, which is linked to the KB concept ordo:cd]
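9. Pro-ISA 1: Looking for a WN-to-ORDO link
[Diagram: WordNet ISA chain above the lemma "cross-talk", listed from the most general synset down]
- event (linked to sumo:Process)
- happening, occurrence, occurrent, natural_event
- trouble
- noise, interference, disturbance
- crosstalk, XT - cross_talk, cross-talk, crosstalk_amount
- Lemma: cross-talk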
10. Pro-ISA 2: Projecting ISA chains (WN -> ORDO)
[Diagram: each synset in the WordNet ISA chain is paired with the concept it is projected onto]
- event -> sumo:Process
- happening, occurrence, occurrent, natural_event -> auto_ordo:happening
- trouble -> auto_ordo:trouble
- noise, interference, disturbance -> auto_ordo:noise
- crosstalk, XT - cross_talk, cross-talk, crosstalk_amount -> auto_ordo:crosstalk
- Lemma: cross-talk
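The projection in the two Pro-ISA slides can be sketched as a walk up the WordNet hypernym chain until a synset with a known KB link is found, creating one automatic concept per synset passed on the way. This is a sketch under assumptions: the function name, the `auto_ordo:` prefix handling, and naming each new concept after the first lemma of its synset are illustrative choices that happen to match the diagram, not a documented part of Pro-ISA.

```python
def project_chain(chain, kb_links, prefix="auto_ordo:"):
    """Project a hypernym chain onto the ontology.

    chain: synsets ordered from most specific to most general,
           each synset being a list of lemmas.
    kb_links: maps the first lemma of a synset to an existing KB concept.
    Returns (child, parent) ISA pairs, or [] if no synset is linked.
    """
    pending = []  # automatic concepts created below the linked synset
    for synset in chain:
        head = synset[0]
        if head in kb_links:
            concepts = pending + [kb_links[head]]
            return list(zip(concepts, concepts[1:]))
        pending.append(prefix + head)
    return []


# The cross-talk chain from the slides, most specific synset first.
CROSSTALK_CHAIN = [
    ["crosstalk", "XT"],
    ["noise", "interference", "disturbance"],
    ["trouble"],
    ["happening", "occurrence", "occurrent", "natural_event"],
    ["event"],
]
KB_LINKS = {"event": "sumo:Process"}
```

For the cross-talk chain this yields four ISA pairs, ending with `auto_ordo:happening` under `sumo:Process`.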
11. From Pro-ISA to L-ISA
- In 15% of cases, the target word is not in WordNet, so Pro-ISA cannot be applied
- Then try to exploit the Web
- Why not the patent corpus itself?
  - ISA relations are not frequent in restricted corpora
  - Patents often contain concept definitions with local scope
  - We don't want idiosyncratic concept definitions, but common, shared definitions.
12. Learning ISA relations from a corpus
- by exploiting linguistic patterns expressing the ISA relation (Hearst, 1992; Hearst, 1998; Mititelu, 2006)
- Many patterns have been presented in the literature, but
  - Few evaluations of pattern reliability (except Snow 2006)
  - Even fewer task-oriented evaluations in domain-specific, concrete application scenarios.
- This paper attempts to provide both kinds of evaluation in a real-world, challenging scenario such as patent semantic analysis.
13. Lexico-Syntactic Patterns
- Patterns reported in the literature
  NP1 isa-phrase NP2
  (NP1, NP2: syntactic noun phrases; isa-phrase: sequence of tokens)
- In our case we are looking for the hypernym of a specific target term
  Term-NP isa-phrase Hyper-NP
  Hyper-NP isa-phrase Term-NP
14. L-ISA
- Google (or any other web search engine) does not allow searching for lexico-syntactic patterns
- So, we proceed in three steps
  - Snippet acquisition from Google
  - Lexico-syntactic filtering
  - Semantic filtering
15. L-ISA: Snippet Acquisition
- Suppose we cannot link the term photodetector to any ORDO concept.
- We want to exploit the following lexico-syntactic pattern
  - <TERM-NP> is an <HYPER-NP>
- Submit to Google the following string query
  - photodetector is an
- Keep the first 100 snippets (at most), e.g.
  - ... upper frequencies, the PIN waveguide photodetector is an attractive device, since it is possible to reduce transit time without ..
- Transform HTML snippets into plain text.
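The local parts of snippet acquisition (instantiating a template into a query string, capping the result list, and stripping HTML) can be sketched with the Python standard library. The function names are hypothetical, and the actual search-engine call is deliberately left out, since the slides do not say how Google was queried.

```python
from html.parser import HTMLParser

MAX_SNIPPETS = 100  # "Keep the first 100 snippets (at most)"


def build_query(template: str, term: str) -> str:
    """Fill the term slot and drop the hypernym slot, which the search
    engine cannot match."""
    query = template.replace("<TERM-NP>", term).replace("<HYPER-NP>", "")
    return query.strip()


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def snippet_to_text(html_snippet: str) -> str:
    """Strip tags, decode character references, and normalise whitespace."""
    parser = _TextExtractor()
    parser.feed(html_snippet)
    parser.close()
    return " ".join("".join(parser.parts).split())


def keep_first(snippets, limit=MAX_SNIPPETS):
    """Keep at most `limit` snippets, converted to plain text."""
    return [snippet_to_text(s) for s in snippets[:limit]]
```

For the template above and the term photodetector, `build_query` produces the string query shown on the slide.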
16. L-ISA: Lexico-syntactic Filtering
- Annotate snippets with TextPro (PoS, lemma, chunk)
- Recognize <Term-NP> isa-phrase <Hyper-NP> in the annotated snippets

             token          PoS   lemma          chunk
TERM-NP      the            AT0   the            B-NP
             PIN            NN1   pin            I-NP
             waveguide      NN1   waveguide      I-NP
             photodetector  NN1   photodetector  I-NP
isa-phrase   is             VBZ   be             B-VP
             an             AT0   an             B-NP
HYPER-NP     attractive     AJ0   attractive     I-NP
             device         NN1   device         I-NP
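Recognizing the pattern in annotated text amounts to grouping tokens into chunks by their B-/I- tags and then looking for an NP, a verb chunk with lemma "be", and a second NP that opens with an indefinite article. The sketch below covers only the "is a/an" template and assumes the four-column annotation shown in the table. The `Token` type and function names are illustrative and are not TextPro's API.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    token: str
    pos: str
    lemma: str
    chunk: str  # B-NP, I-NP, B-VP, ... or O


def group_chunks(tokens: List[Token]) -> List[Tuple[str, List[Token]]]:
    """Group tokens into (chunk_type, tokens) using the B-/I- chunk tags."""
    chunks: List[Tuple[str, List[Token]]] = []
    for tok in tokens:
        kind = tok.chunk.split("-", 1)[1] if "-" in tok.chunk else tok.chunk
        if tok.chunk.startswith("I-") and chunks and chunks[-1][0] == kind:
            chunks[-1][1].append(tok)
        else:
            chunks.append((kind, [tok]))
    return chunks


def match_is_a(tokens: List[Token]) -> Optional[Tuple[List[Token], List[Token]]]:
    """Return (term_np, hyper_np) for the first 'NP be a/an NP' match."""
    chunks = group_chunks(tokens)
    for (k1, np1), (k2, vp), (k3, np2) in zip(chunks, chunks[1:], chunks[2:]):
        is_be = k2 == "VP" and [t.lemma for t in vp] == ["be"]
        if k1 == "NP" and is_be and k3 == "NP" and np2[0].lemma in ("a", "an"):
            return np1, np2[1:]  # drop the article from the hypernym NP
    return None


# The annotated snippet from the table above.
EXAMPLE = [
    Token("the", "AT0", "the", "B-NP"),
    Token("PIN", "NN1", "pin", "I-NP"),
    Token("waveguide", "NN1", "waveguide", "I-NP"),
    Token("photodetector", "NN1", "photodetector", "I-NP"),
    Token("is", "VBZ", "be", "B-VP"),
    Token("an", "AT0", "an", "B-NP"),
    Token("attractive", "AJ0", "attractive", "I-NP"),
    Token("device", "NN1", "device", "I-NP"),
]
```

On the example, the term NP is "the PIN waveguide photodetector" and the hypernym NP is "attractive device".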
17. L-ISA: Lexico-syntactic Filtering
- Filter out TERM-NP
  - if the target term is modified (e.g. PIN waveguide photodetector above)
  - if it looks like a proper name (e.g. uppercase letter in the middle of a sentence).
- Keep HYPER-NP
  - only if it fits one of a restricted number of PoS patterns
  - (N AN NN NNN ANN XNN R Vpastpart AXN)

TERM-NP             HYPER-NP
the photodetector   analog signal
A photodetector     apparatus
photodetector       effective monitor
The photodetector   electric device
a photodetector     electronic device
photodetector       object
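The two filters on this slide can be sketched as follows. Several details are assumptions made for illustration: the mapping from tags to pattern letters (tags starting with NN as N, AJ as A), the treatment of AT0 as the only determiner tag, and the restriction to the five patterns built from N and A alone, since the slide does not spell out what X, R and Vpastpart stand for.

```python
DETERMINER_TAGS = {"AT0"}  # assumption: articles are the only determiners handled
# Subset of the slide's PoS patterns that use only N (noun) and A (adjective).
ALLOWED_PATTERNS = {"N", "AN", "NN", "NNN", "ANN"}


def content_words(np):
    """np is a list of (token, pos) pairs; drop determiners."""
    return [(tok, pos) for tok, pos in np if pos not in DETERMINER_TAGS]


def is_unmodified_term(np, target: str) -> bool:
    """True if, determiners aside, the NP is exactly the target term."""
    words = [tok.lower() for tok, _ in content_words(np)]
    return words == target.lower().split()


def looks_like_proper_name(np, sentence_initial: bool) -> bool:
    """Heuristic: a capitalised content word that is not sentence-initial."""
    for i, (tok, pos) in enumerate(np):
        if pos in DETERMINER_TAGS:
            continue
        if tok[:1].isupper() and not (i == 0 and sentence_initial):
            return True
    return False


def pos_pattern(np) -> str:
    """Map each content word to N, A or ? (any other tag)."""
    letters = []
    for _, pos in content_words(np):
        if pos.startswith("NN"):
            letters.append("N")
        elif pos.startswith("AJ"):
            letters.append("A")
        else:
            letters.append("?")
    return "".join(letters)


def keep_hypernym(np) -> bool:
    return pos_pattern(np) in ALLOWED_PATTERNS
```

Under these assumptions "the photodetector" passes the term filter, "the PIN waveguide photodetector" does not, and "electronic device" is kept as an AN hypernym.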
18. Semantic Filtering
- Keep only those HYPER-NPs compatible with the Optical Recording domain, by checking
  - whether the HYPER-NP is already a label in one of the known ontologies (SUMO, ORDO, AUTO-ORDO)
  - whether it is present in a WordNet synset with a WORDNET-DOMAIN label compatible with the Optical Recording domain.
HYPER-NP IN KB IN WN DOM. COMPAT
analog signal
apparatus yes yes midlow
effective monitor
electric device
electronic device yes
object yes yes midlow
Candidate hypernyms for photodetector
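The semantic filter can be sketched as two membership checks per candidate. The sketch assumes that passing either check is enough to keep a candidate, which the slide does not state explicitly. The lookup tables are toy values for illustration, not the contents of ORDO, SUMO or WORDNET-DOMAINS.

```python
def semantic_filter(candidates, kb_labels, wn_domains, compatible_domains):
    """Keep candidates that are KB labels or have a compatible WordNet domain.

    kb_labels: set of labels from the known ontologies.
    wn_domains: maps a noun phrase to the set of its WordNet domain labels.
    """
    kept = []
    for np in candidates:
        in_kb = np in kb_labels
        in_compatible_wn = bool(wn_domains.get(np, set()) & compatible_domains)
        if in_kb or in_compatible_wn:
            kept.append(np)
    return kept


# Toy inputs (hypothetical values).
CANDIDATES = ["analog signal", "apparatus", "object", "toy term"]
KB_LABELS = {"apparatus", "object"}
WN_DOMAINS = {"toy term": {"TOY_DOMAIN"}}
COMPATIBLE = {"TOY_DOMAIN"}
```

With the toy inputs, "analog signal" is dropped because it passes neither check.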
19. Candidate Selection
- Candidates are weighted according to
  - Frequency and reliability of the patterns where the hypernym occurs
  - Variety of patterns
  - Belonging to specific ontologies (manual ORDO, AUTO-ORDO or SUMO, in decreasing order of preference)
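The slides name the criteria but give no formula, so the following is only one possible way to combine them: a reliability-weighted occurrence count, a bonus per distinct pattern, and a bonus that decreases from ORDO to AUTO-ORDO to SUMO. Every numeric weight and the additive form itself are hypothetical.

```python
# Hypothetical weights; only the ORDO > AUTO-ORDO > SUMO ordering is from the slide.
ONTOLOGY_BONUS = {"ORDO": 3.0, "AUTO-ORDO": 2.0, "SUMO": 1.0}
VARIETY_WEIGHT = 0.5


def score(occurrences, reliability, ontology=None):
    """Score one hypernym candidate.

    occurrences: maps a pattern name to how often the candidate occurred in it.
    reliability: maps a pattern name to its estimated reliability.
    ontology: name of the ontology the candidate already belongs to, if any.
    """
    evidence = sum(n * reliability.get(p, 0.0) for p, n in occurrences.items())
    variety = VARIETY_WEIGHT * len(occurrences)
    return evidence + variety + ONTOLOGY_BONUS.get(ontology, 0.0)


def best_candidate(candidates, reliability):
    """candidates: list of dicts with 'label', 'occurrences', optional 'ontology'."""
    return max(
        candidates,
        key=lambda c: score(c["occurrences"], reliability, c.get("ontology")),
    )["label"]
```

The Future Work slide mentions learning the template weights, which in this sketch would correspond to fitting the `reliability` values and the constants rather than setting them by hand.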
20. Evaluating the Reliability of ISA Patterns
- Assessment of the reliability of the patterns reported in the literature as predictors of the isa relation
- Around 80 templates
- On three target terms: groove, photodetector and magnetic head.
- Google returned around 9,000 snippets
- Only snippets passing lexico-syntactic filtering were manually evaluated (about 1,450)
- Guideline: try to interpret the intentions of the author (does he/she really intend to say that X is a subclass of Y, beyond inappropriate phrasing, and even if you know that it is not true?)
- Results of this evaluation are exploited in weighting the hypernym candidates
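One simple way to turn such manual judgments into a per-template reliability score is the fraction of evaluated snippets judged correct. That definition is an assumption here, since the slides do not say how reliability was computed. The template names in the usage note are placeholders.

```python
from collections import defaultdict


def pattern_reliability(judgments):
    """judgments: iterable of (template, is_correct) pairs from manual evaluation.

    Returns a dict mapping each template to correct / evaluated.
    """
    evaluated = defaultdict(int)
    correct = defaultdict(int)
    for template, is_correct in judgments:
        evaluated[template] += 1
        if is_correct:
            correct[template] += 1
    return {t: correct[t] / evaluated[t] for t in evaluated}
```

For instance, a template with one correct snippet out of two evaluated gets a reliability of 0.5.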
21. Evaluating the L-ISA accuracy
- Measuring the accuracy of the L-ISA algorithm in finding the hypernym of a given domain concept
- The 100 most frequent terms that we were not able to link to the ORDO ontology using the Pro-ISA learning strategy
- Including wrong target terms (because of errors of the linguistic processors, e.g. a past participle instead of a noun)
- Accuracy: 78.6%
22. Future Work
- Extend evaluation set
- Inter-coder agreement
- Use Machine Learning to optimize the weights associated with templates