L-ISA: Learning Domain Specific ISA relations from the WEB

Provided by: lrec
Learn more at: http://www.lrec-conf.org
1
L-ISA: Learning Domain Specific ISA relations
from the WEB
  • Alessandra Potrich and Emanuele Pianta
  • Fondazione Bruno Kessler - IRST
  • Trento, Italy

LREC 2008
Marrakech, 31 May 2008
2
Overview
  • Learning ISA relations in the patent processing
    domain (the PatExpert Project)
  • The L-ISA algorithm
  • Evaluation
  • Future Work

3
Ontology Learning/Population
  • Ontology Learning: acquisition of new concepts
    and relations between them
  • e.g., a device is an artifact
  • Ontology Population: acquisition of factual
    knowledge about specific instances
  • e.g., Einstein is an instance of a scientist
  • e.g., Einstein was born in 1879

4
PATExpert
  • Funded by the European Union
  • Aim: improving patent retrieval, summarization,
    paraphrasing, classification and valuing through
    shallow and deep semantic analysis
  • Main semantic analysis task: recognizing
    occurrences of KB concepts and relations
  • Proof of concept on two domains:
  • Optical Recording
  • Machine Tools
  • Focus of the presentation: Ontology Learning in
    the Optical Recording domain

5
Optical Recording Domain Ontology (ORDO)
  • Based on the OWL formalism
  • Built in three stages
  • 200 manually crafted concepts, starting
    from a list of the most frequent terms in a
    reference corpus
  • Pro-ISA: ontology learning algorithm based on
    projection of WordNet fragments onto ORDO
  • L-ISA: ontology learning algorithm based on
    acquisition of isa templates from the Web

6
Patent Concept Annotation
  • Given a target word:
  • disambiguate it by assigning a WN synset whose
    domain is compatible with the optical recording
    domain (exploiting WORDNET-DOMAINS)
  • If the synset is linked to an ORDO concept,
    annotate the target word with the ORDO concept
  • Otherwise apply Pro-ISA
  • Otherwise apply L-ISA
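
The fallback order listed above can be sketched as a small cascade. This is only an illustration of the control flow: the function name `annotate` and its four callable parameters are hypothetical stand-ins for the disambiguation step, the synset-to-ORDO lookup, Pro-ISA and L-ISA, none of which are specified at code level in the slides.

```python
def annotate(word, synset_for, ordo_link, pro_isa, l_isa):
    """Try direct annotation first, then Pro-ISA, then L-ISA.

    All four callables are hypothetical placeholders:
      synset_for(word)  -> domain-compatible synset or None
      ordo_link(synset) -> ORDO concept already linked to the synset, or None
      pro_isa(synset)   -> concept obtained by projecting WordNet onto ORDO, or None
      l_isa(word)       -> hypernym learned from the Web, or None
    Returns (concept, strategy_name).
    """
    synset = synset_for(word)
    if synset is not None:
        concept = ordo_link(synset)
        if concept is not None:
            return concept, "direct"
        concept = pro_isa(synset)
        if concept is not None:
            return concept, "pro-isa"
    # The word is not in WordNet, or Pro-ISA found nothing.
    concept = l_isa(word)
    if concept is not None:
        return concept, "l-isa"
    return None, "none"
```

With dictionary lookups standing in for the real components, `annotate("CD", {"CD": "compact_disk"}.get, {"compact_disk": "ordo:cd"}.get, lambda s: None, lambda w: None)` takes the direct route.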

7
Choosing the right sense
  • Senses for the word CD:
  • 1. cadmium, Cd, atomic_number_48 (CHEMISTRY)
  • 2. candle, candela, cd, standard candle (PHYSICS)
  • 3. certificate of deposit, CD (MONEY)
  • 4. compact disk, compact disc, CD (COMPUTER, MUSIC)
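
A minimal sketch of domain-based sense selection for the CD example. The sense inventory is copied from the slide; the set of domains treated as compatible with optical recording (`COMPATIBLE`) is an assumption made for the illustration, and `pick_sense` is a hypothetical helper, not the project's disambiguator.

```python
# Senses of "CD" with their WORDNET-DOMAINS labels, as listed on the slide.
SENSES_CD = [
    (("cadmium", "Cd", "atomic_number_48"), {"CHEMISTRY"}),
    (("candle", "candela", "cd", "standard candle"), {"PHYSICS"}),
    (("certificate of deposit", "CD"), {"MONEY"}),
    (("compact disk", "compact disc", "CD"), {"COMPUTER", "MUSIC"}),
]

# Assumption: domains considered compatible with optical recording.
COMPATIBLE = {"COMPUTER", "MUSIC"}


def pick_sense(senses, compatible_domains):
    """Return the first synset sharing at least one domain with the target set."""
    for synset, domains in senses:
        if domains & compatible_domains:
            return synset
    return None
```

Under these assumptions `pick_sense(SENSES_CD, COMPATIBLE)` selects the compact-disc synset (sense 4).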

8
Direct Concept Annotation
[Diagram] lemma: CD -> synset: {compact_disk, compact_disc, CD} -> KB-concept: ordo:cd
9
Pro-ISA 1: Looking for a WN-to-ORDO link
[Diagram: WordNet ISA chain for the lemma cross-talk, most general synset first]
event -> sumo:Process
happening, occurrence, occurrent, natural_event
trouble
noise, interference, disturbance
crosstalk, XT -cross_talk, cross-talk, crosstalk_amount
Lemma: cross-talk
10
Pro-ISA 2: Projecting ISA chains (WN -> ORDO)
[Diagram: the same chain, with each unlinked synset projected onto a new auto_ordo concept]
event -> sumo:Process
happening, occurrence, occurrent, natural_event -> auto_ordo:happening
trouble -> auto_ordo:trouble
noise, interference, disturbance -> auto_ordo:noise
crosstalk, XT -cross_talk, cross-talk, crosstalk_amount -> auto_ordo:crosstalk
Lemma: cross-talk
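
The projection shown for cross-talk can be sketched as a walk up the hypernym chain until a synset with an existing KB link is found, creating one new concept per unlinked synset on the way. The function `project_chain`, the single-word synset labels and the `linked` dictionary are simplifications assumed for the example; only the `auto_ordo:` naming and the sumo:Process anchor come from the slide.

```python
def project_chain(chain, linked):
    """Project a WordNet ISA chain onto the ontology.

    chain:  synset labels ordered from most specific to most general.
    linked: dict mapping a synset label to an already existing KB concept.
    Returns a list of (child_concept, parent_concept) ISA pairs,
    or None if no synset in the chain is linked to the KB.
    """
    new_concepts = []
    for label in chain:
        if label in linked:
            anchor = linked[label]
            break
        # Unlinked synset: create an automatically generated concept for it.
        new_concepts.append("auto_ordo:" + label)
    else:
        return None  # no anchor found anywhere in the chain
    concepts = new_concepts + [anchor]
    return list(zip(concepts, concepts[1:]))


chain = ["crosstalk", "noise", "trouble", "happening", "event"]
isa_pairs = project_chain(chain, {"event": "sumo:Process"})
```

Here `isa_pairs` links auto_ordo:crosstalk to auto_ordo:noise, and so on up to auto_ordo:happening, whose parent is sumo:Process.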
11
From Pro-ISA to L-ISA
  • In 15% of cases, the target word is not in
    WordNet, so Pro-ISA cannot be applied
  • Then try and exploit the WEB
  • Why not the patent corpus itself?
  • Isa relations are not frequent in restricted
    corpora
  • Patents often contain concept definitions with
    local scope
  • We don't want idiosyncratic concept definitions,
    but common, shared definitions.

12
Learning ISA relations from a corpus
  • by exploiting linguistic patterns expressing
    the ISA relation (Hearst, 1992; Hearst, 1998;
    Mititelu, 2006)
  • Many patterns have been presented in the
    literature, but
  • Few evaluations of pattern reliability
    (except Snow, 2006)
  • Even fewer task-oriented evaluations in
    domain-specific, concrete application scenarios.
  • This paper attempts to provide both kinds of
    evaluation in a real-world, challenging scenario
    such as patent semantic analysis.

13
Lexico-Syntactic Patterns
  • Patterns reported in the literature

NP1 isa-phrase NP2
(isa-phrase: sequence of tokens; NP1, NP2: syntactic noun phrases)
  • In our case we are looking for the hypernym of a
    specific target term

Term-NP isa-phrase Hyper-NP
Hyper-NP isa-phrase Term-NP
14
L-ISA
  • Google (or any other web search engine) does not
    allow searching for lexico-syntactic patterns
  • So, we proceed in three steps:
  • Snippet acquisition from Google
  • Lexico-syntactic filtering
  • Semantic filtering

15
L-ISA: Snippet Acquisition
  • Suppose we cannot link the term "photodetector"
    to any ORDO concept.
  • We want to exploit the following lexico-syntactic
    pattern:
  • <TERM-NP> is an <HYPER-NP>
  • Submit the following string query to Google:
  • "photodetector is an"
  • Keep the first 100 snippets (at most), e.g.
  • ... upper frequencies, the PIN waveguide
    photodetector is an attractive device, since it
    is possible to reduce transit time without ..
  • Transform HTML snippets into plain text.
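
A rough sketch of two of the steps above: turning a template into a query string and reducing an HTML snippet to plain text. It makes no call to any search engine. The helper names `build_query` and `snippet_to_text`, the angle-bracket slot notation and the regex-based tag stripping are assumptions for illustration, not the tooling actually used.

```python
import html
import re


def build_query(template, term):
    """Instantiate the template up to the hypernym slot as a quoted phrase query."""
    prefix = template.split("<HYPER-NP>")[0]
    prefix = prefix.replace("<TERM-NP>", term).strip()
    return '"' + prefix + '"'


def snippet_to_text(snippet_html):
    """Strip tags, decode HTML entities and normalise whitespace."""
    no_tags = re.sub(r"<[^>]+>", "", snippet_html)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


query = build_query("<TERM-NP> is an <HYPER-NP>", "photodetector")
text = snippet_to_text("the PIN waveguide <b>photodetector is an</b>  attractive device")
```

For the running example, `query` is the quoted phrase "photodetector is an", and `text` is the snippet with the bold markup removed.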

16
L-ISA: Lexico-syntactic Filtering
  • Annotate snippets with TextPro (PoS, lemma,
    chunk)
  • Recognize <Term-NP> isa-phrase <Hyper-NP> in the
    annotated snippets

            token          PoS   lemma          chunk
TERM-NP     the            AT0   the            B-NP
            PIN            NN1   pin            I-NP
            waveguide      NN1   waveguide      I-NP
            photodetector  NN1   photodetector  I-NP
isa-phrase  is             VBZ   be             B-VP
            an             AT0   an             B-NP
HYPER-NP    attractive     AJ0   attractive     I-NP
            device         NN1   device         I-NP
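
As an illustration of how the pattern could be recognised on annotated tokens, the sketch below groups tokens into chunks by their B-/I- tags and looks for an NP, a VP consisting of "is", and an NP that opens with "an". The tuple layout mirrors the columns of the table; the functions `chunks` and `match_is_an` are hypothetical and handle only this one template.

```python
def chunks(tokens):
    """Group (token, pos, lemma, chunk_tag) tuples into (chunk_type, tokens) pairs."""
    out = []
    for tok in tokens:
        tag = tok[3]
        if not tag.startswith("I-") or not out:
            out.append((tag.split("-")[-1], [tok]))  # a new chunk starts here
        else:
            out[-1][1].append(tok)
    return out


def match_is_an(tokens):
    """Return (term_np_words, hyper_np_words) for '<TERM-NP> is an <HYPER-NP>', else None."""
    ch = chunks(tokens)
    for (t1, np1), (t2, vp), (t3, np2) in zip(ch, ch[1:], ch[2:]):
        if (t1, t2, t3) != ("NP", "VP", "NP"):
            continue
        if [t[0] for t in vp] == ["is"] and np2[0][0] == "an" and len(np2) > 1:
            # The determiner "an" belongs to the isa-phrase, not to the hypernym.
            return [t[0] for t in np1], [t[0] for t in np2[1:]]
    return None


ROWS = [
    ("the", "AT0", "the", "B-NP"), ("PIN", "NN1", "pin", "I-NP"),
    ("waveguide", "NN1", "waveguide", "I-NP"),
    ("photodetector", "NN1", "photodetector", "I-NP"),
    ("is", "VBZ", "be", "B-VP"), ("an", "AT0", "an", "B-NP"),
    ("attractive", "AJ0", "attractive", "I-NP"), ("device", "NN1", "device", "I-NP"),
]
match = match_is_an(ROWS)
```

On the table's rows, `match` pairs the term NP "the PIN waveguide photodetector" with the hypernym NP "attractive device".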
17
L-ISA: Lexico-syntactic Filtering
  • Filter out TERM-NP
  • if the target term is modified (e.g. PIN waveguide
    photodetector above)
  • if it looks like a proper name (e.g. uppercase
    letter in the middle of a sentence).
  • Keep HYPER-NP
  • only if it fits one of a restricted set of
    PoS patterns
  • (N AN NN NNN ANN XNN R Vpastpart
    AXN)

TERM-NP             HYPER-NP
the photodetector   analog signal
A photodetector     apparatus
photodetector       effective monitor
The photodetector   electric device
a photodetector     electronic device
photodetector       object
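
The two filters can be sketched as follows. Several simplifications are assumed: the proper-name heuristic is left out, PoS tags are reduced to N (tags starting with NN) and A (tags starting with AJ), and only five entries of the slide's pattern list (N, AN, NN, NNN, ANN) are used, read as separate patterns. `term_np_ok`, `pos_shape` and `hyper_np_ok` are hypothetical names.

```python
DETERMINERS = {"the", "a", "an"}

# Assumption: a subset of the slide's PoS patterns, read as separate entries.
ALLOWED_SHAPES = {"N", "AN", "NN", "NNN", "ANN"}


def term_np_ok(term_np_words, target):
    """Accept the TERM-NP only if it is the bare target term, optionally with a determiner."""
    content = [w.lower() for w in term_np_words if w.lower() not in DETERMINERS]
    return content == target.lower().split()


def pos_shape(tags):
    """Map PoS tags to a compact shape string: N for nouns, A for adjectives, ? otherwise."""
    letters = []
    for tag in tags:
        if tag.startswith("NN"):
            letters.append("N")
        elif tag.startswith("AJ"):
            letters.append("A")
        else:
            letters.append("?")
    return "".join(letters)


def hyper_np_ok(tags):
    """Keep the HYPER-NP only if its tag sequence has one of the allowed shapes."""
    return pos_shape(tags) in ALLOWED_SHAPES
```

With these definitions the modified term "the PIN waveguide photodetector" is rejected for the target "photodetector", while "The photodetector" is accepted, and an adjective-noun hypernym such as "electronic device" (AJ0 NN1) passes the shape check.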
18
Semantic Filtering
  • Keep only those HYPER-NPs compatible with the
    Optical Recording domain, by checking
  • whether the HYPER-NP is already a label in one of
    the known ontologies (SUMO, ORDO, AUTO-ORDO)
  • whether it is present in a WordNet synset with a
    WORDNET-DOMAIN label compatible with the Optical
    Recording domain.

HYPER-NP IN KB IN WN DOM. COMPAT
analog signal
apparatus yes yes midlow
effective monitor
electric device
electronic device yes
object yes yes midlow
Candidate hypernyms for photodetector
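
The two checks can be sketched as a simple predicate. The lookup structures are toy stand-ins: `kb_labels` plays the role of the labels of the known ontologies and `wn_domains` the WORDNET-DOMAINS labels of a candidate, with made-up contents for the example. The slide reports a graded compatibility value (e.g. midlow); this sketch reduces it to yes/no.

```python
def semantically_compatible(hyper_np, kb_labels, wn_domains, compatible_domains):
    """Keep a candidate if it is a known ontology label or has a compatible WordNet domain.

    kb_labels:          set of labels from the known ontologies
    wn_domains:         dict mapping a candidate to its set of domain labels
    compatible_domains: domains accepted for the target domain
    """
    if hyper_np in kb_labels:
        return True
    return bool(wn_domains.get(hyper_np, set()) & compatible_domains)


# Toy data, invented for the illustration.
kb = {"apparatus", "object"}
domains = {"electronic device": {"ELECTRONICS"}, "analog signal": {"TELECOMMUNICATION"}}
accepted = {"ELECTRONICS", "COMPUTER"}

kept = [c for c in ["analog signal", "apparatus", "electronic device", "effective monitor"]
        if semantically_compatible(c, kb, domains, accepted)]
```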
19
Candidate Selection
  • Candidates are weighted according to:
  • Frequency and Reliability of patterns where the
    hypernym occurs
  • Variety of patterns
  • Belonging to specific ontologies (manual ORDO,
    AUTO-ORDO or SUMO, in decreasing preference
    order)
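
One way these criteria could be combined is a simple additive score. The slides do not give the actual formula, so everything numeric here is hypothetical: the reliability values, the variety weight and the ontology bonuses (which only respect the stated preference order ORDO, AUTO-ORDO, SUMO).

```python
from collections import defaultdict

# Hypothetical bonuses that follow the slide's decreasing preference order.
ONTOLOGY_BONUS = {"ORDO": 3.0, "AUTO-ORDO": 2.0, "SUMO": 1.0}


def rank(hits, reliability, ontology_of, variety_weight=0.5):
    """Rank hypernym candidates.

    hits:        list of (hypernym, pattern) pairs, one per accepted snippet
    reliability: dict mapping a pattern to its estimated reliability
    ontology_of: dict mapping a hypernym to the ontology it belongs to, if any
    """
    weighted_freq = defaultdict(float)
    patterns = defaultdict(set)
    for hypernym, pattern in hits:
        weighted_freq[hypernym] += reliability.get(pattern, 0.0)  # frequency x reliability
        patterns[hypernym].add(pattern)                           # variety of patterns
    scores = {
        h: weighted_freq[h]
        + variety_weight * len(patterns[h])
        + ONTOLOGY_BONUS.get(ontology_of.get(h), 0.0)
        for h in weighted_freq
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank(
    [("device", "is a"), ("device", "such as"), ("object", "is a")],
    {"is a": 0.5, "such as": 0.25},
    {"device": "ORDO", "object": "SUMO"},
)
```

In this toy run "device" outranks "object" because it occurs with more patterns and belongs to the preferred ontology.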

20
Evaluating the Reliability of ISA Patterns
  • Assessment of the reliability of the patterns
    reported in the literature as predictors of the
    isa relation
  • Around 80 templates
  • On three target terms: groove, photodetector
    and magnetic head.
  • Google returned around 9,000 snippets
  • Only snippets passing lexico-syntactic filtering
    were manually evaluated (about 1,450)
  • Guideline: try to interpret the intentions of the
    author (does he/she really intend to say that X
    is a subclass of Y, beyond inappropriate phrasing,
    and even if you know that it is not true?)
  • Results of this evaluation are exploited in
    weighting the hypernym candidates

21
Evaluating the L-ISA accuracy
  • Measuring the accuracy of the L-ISA algorithm in
    finding the hypernym of a given domain concept
  • The 100 most frequent terms that we were not able
    to link to the ORDO ontology using the Pro-ISA
    learning strategy
  • Including wrong target terms (because of errors
    of the linguistic processors, e.g. a past
    participle instead of a noun)
  • Accuracy: 78.6%

22
Future Work
  • Extend evaluation set
  • Inter-coder agreement
  • Use Machine Learning to optimize the weights
    associated with templates