Title: Linguistic Processing of Classification Hierarchies
1 Linguistic Processing of Classification Hierarchies
Bernardo Magnini, ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, Trento - Italy
2 Current Research Topics on Text Processing at ITC-irst
- Question/Answering
  - TREC style
- Information Extraction
  - ML approach, DOT.KOM project
- Lexical Acquisition and Linguistic Resources
  - MultiWordnet, Wordnet Domains, corpora for Italian
- Word Sense Disambiguation
  - Based on domains, MEANING project
- NLP for Knowledge Management
  - Edamok project
- Evaluation of NLP Technologies
  - QA at CLEF-2003, Senseval-3
4 Outline
- Classification Hierarchies (CH)
- Concept hierarchies
- Approaches toward interoperability of CHs
- Semantic interpretation of CHs
- Making the information explicit: the role of linguistic and world knowledge
- Experimental setting
- Preliminary results with the CTXMATCH algorithm
5 Organizing papers: A senior researcher
- Knowledge about the domain is used
- Classification schemas are repeated
- Labels are interpreted in their context
[Figure: folder tree rooted at "Work", with nodes WSD, QA, Papers, Projects, Experiments, Senseval-2, ACL-02, Submission, Camera ready, Submission]
6 Organizing papers: A young researcher
- A different view of the same documents
- Redundant information
- Different labels for the same concept
[Figure: folder tree rooted at "Home", with nodes Articles, Code, 2002, 2001, 2000, workshops, Int. conferences, journals, Senseval-2, ACL-02]
7 Organizing papers: A student
- Less structure corresponds to more complex labels
- Any kind of document is allowed (text, images, code, ...)
[Figure: folder "Disambiguation" with items Results-all-word-Eng., Senseval-Call-for-paper, Senseval-article, Meaning-project, Algorithm-description, Acl-article-final-version, Lexical-sample-training-data]
8 Questions
- Can a system automatically discover similarities among different views of the same documents?
  - Example: retrieving documents in classification B using the schema of classification A
- How much reasoning is involved?
- Labels are expressed in a natural language.
- Is there a role for NLP technologies?
9 Classification Hierarchies: CH (1)
- Taxonomic organization of documents
- Easy to build: no formal language is required
- Widely used
  - Web directories (Google, Yahoo!, Looksmart, portals)
  - Marketplace catalogues for product classifications
  - File systems
  - Local ontologies
- Documents are classified at all levels of the hierarchy
- CH structure reflects both the documents and world knowledge
10 Classification Hierarchies (2)
- Semi-structured: relations among nodes are not formally defined.
- Document dependent: CHs are organized according to the documents that have to be classified.
- Specificity criterion: a document is classified in the most specific node of the hierarchy.
[Figure: example CH rooted at "Vacation", with nodes 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA]
11 Interoperability among CHs
- Commercial interest: Distributed Knowledge Management in corporations
- Scientific interest: various terms have been recently used, including
  - Meaning negotiation
  - Semantic coordination
  - Mapping between domain models
  - Semantic mediation
  - Ontology merging, integration or alignment
  - Integration of hierarchical categorization
- Fits well in the Semantic Web perspective
- Common goal: find mappings between nodes of two classification hierarchies
12 Interoperability among CHs
[Figure: a source CH rooted at "Vacation" (nodes 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA) next to a target CH rooted at "Sea holidays" (nodes Italy, in Europe)]
14 Interoperability among CHs
[Figure: the source CH ("Vacation") and the target CH ("Sea holidays"), with a "?" on the mapping between their nodes]
15 Qualitative mapping
[Figure: the source CH ("Vacation") and the target CH ("Sea holidays", with nodes Italy, in Europe); relation shown: More general]
16 Qualitative mapping
[Figure: the source CH ("Vacation") and the target CH ("Sea holidays", with nodes Italy, in Europe, 2001, Tuscany); relation shown: More specific]
17 Qualitative mapping
[Figure: the source CH ("Vacation") and the target CH ("Sea holidays", with nodes Italy, in Europe, 2001, Tuscany); relation shown: Equivalent]
18 Qualitative mapping
[Figure: the source CH ("Vacation") and the target CH ("Sea holidays", with nodes Italy, in Europe, 2001, Tuscany); relation shown: Not compatible]
19 Qualitative mapping
[Figure: the source CH ("Vacation") and the target CH ("Sea holidays", with nodes Italy, in Europe, 2001, Tuscany); relation shown: Compatible]
20 Qualitative mapping
[Figure: the source CH ("Vacation") and the target CH ("Sea holidays", with nodes Italy, in Europe, 2001, Tuscany); no relation label survives]
21 Approaches to CH mapping
- Approaches to CH mapping can be grouped in four classes, according to the kind of information used
  - Based on document content
  - Based on document classifications
  - Based on structural information
  - Based on semantic interpretation of labels (CTXMATCH)
22 1. Mapping based on Documents
- Consider the content of the document
- Procedure [Madhavan et al., AAAI-2002]
  - Train a classifier on documents of source CH
  - Apply the classifier to documents of target CH
- Drawbacks
  - Needs the documents
  - Only textual documents can be considered
  - Does not consider structural information
  - Does not produce qualitative mappings
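
The train-then-apply procedure above can be sketched as follows. This is a minimal illustration only, not the method of Madhavan et al.: the multinomial Naive Bayes classifier, the function names (`train_naive_bayes`, `classify`, `map_target_nodes`) and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Train on (source_node, text) pairs taken from the source CH."""
    word_counts = defaultdict(Counter)  # node -> token frequencies
    doc_counts = Counter()              # node -> number of documents
    vocab = set()
    for node, text in labelled_docs:
        tokens = text.lower().split()
        word_counts[node].update(tokens)
        doc_counts[node] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the source node with the highest log-probability for text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_node, best_score = None, float("-inf")
    for node in doc_counts:
        score = math.log(doc_counts[node] / total_docs)
        node_total = sum(word_counts[node].values())
        for token in text.lower().split():
            if token in vocab:  # unseen tokens are ignored
                # Laplace (add-one) smoothing
                score += math.log((word_counts[node][token] + 1)
                                  / (node_total + len(vocab)))
        if score > best_score:
            best_node, best_score = node, score
    return best_node

def map_target_nodes(model, target_docs):
    """For each target node, count the source nodes predicted for its documents."""
    return {t_node: Counter(classify(model, text) for text in texts)
            for t_node, texts in target_docs.items()}

# Hypothetical toy data: two source nodes, one target node.
source_docs = [
    ("Vacation/Sea", "beach sand sea swimming"),
    ("Vacation/Sea", "sea boat beach"),
    ("Vacation/Mountains", "hiking peak snow mountains"),
    ("Vacation/Mountains", "mountains ski snow"),
]
target_docs = {"Sea holidays": ["sunny beach and sea", "boat trip on the sea"]}

votes = map_target_nodes(train_naive_bayes(source_docs), target_docs)
print(votes)
```

The output is a vote count per target node, which gives a similarity ranking but no qualitative relation, and the sketch only works when the documents themselves are available as text. Both points match the drawbacks listed on the slide.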
23 2. Mapping based on Classifications
- Consider the number of documents in common between nodes of different CHs
- Procedure [Ichise et al., IJCAI-2003]
  - Compute a statistical model of classification criteria of source and target CHs
  - Determine similarity between pairs of nodes in source and target
- Drawbacks
  - Needs documents in common
  - Does not produce qualitative mappings
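
A minimal sketch of node similarity computed from shared documents. It does not reproduce the statistical model of Ichise et al.; a plain Jaccard overlap is swapped in as a simple stand-in, and the node names and document identifiers are hypothetical.

```python
def jaccard(docs_a, docs_b):
    """Overlap of two document sets: |A and B| / |A or B|."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_node_pairs(source, target):
    """Score every (source node, target node) pair; keep overlapping pairs, best first."""
    pairs = [(jaccard(s_docs, t_docs), s_node, t_node)
             for s_node, s_docs in source.items()
             for t_node, t_docs in target.items()]
    return sorted((p for p in pairs if p[0] > 0), reverse=True)

# Hypothetical classifications: node -> identifiers of the documents it holds.
source = {"S/Sea": {"d1", "d2", "d3"}, "S/Lake": {"d4"}}
target = {"T/Italy": {"d1", "d2"}, "T/Europe": {"d1", "d2", "d3", "d5"}}

ranked = rank_node_pairs(source, target)
print(ranked)
```

Pairs with no document in common never appear in the ranking, which illustrates the first drawback: the two CHs must classify at least some of the same documents.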
24 3. Mapping Based on Structural Information (1)
- Consider node definitions and their lexical expansions
- Procedure [Calvanese et al., ISWC 2001]
  - Automatically propose candidate mappings based on lexicographic criteria
  - Correct mappings are validated by a domain expert
- Drawbacks
  - Requires human intervention
  - Feasible for ontology integration, not for CHs
25 3. Mapping Based on Structural Information (2)
- Consider structural constraints among nodes
- Procedure [Daude et al., ACL-2000; this conference]
  - Select candidate pairs with lexicographic criteria
  - Select structural constraints
  - Use relaxation labelling to choose the best candidate
- Drawbacks
  - Good for WordNet, but CHs have a lot of implicit knowledge
  - Does not produce qualitative mappings
26 4. Mapping Based on Semantic Interpretation
- Consider linguistic processing of nodes and world knowledge
- Procedure [Bouquet et al., ISWC-2003, to appear]
  - Build a logical interpretation for the source and the target nodes
  - Compute the relation between the two logical forms
- Drawbacks
  - Requires world knowledge
  - Requires tuning of linguistic tools for CHs
27 Semantic Interpretation (1)
- World knowledge is necessary
[Figure: CH fragments with nodes Images, Italy, Beach, Mountain; relations shown: More specific, More specific]
28 Semantic Interpretation (2)
[Figure: CH fragments with nodes Images, Italy, Beach, Mountain; relations shown: More specific, More specific, More specific, Equivalent]
29 Linguistic Processing of CHs
- How do linguistic techniques work on CHs?
  - Tokenization and part-of-speech tagging
  - Multiword recognition
  - Named entity recognition
  - Word sense disambiguation
- Which peculiar problems do CHs pose for semantic interpretation?
- How much implicit information can be extracted from CHs?
30 Part of Speech Tags (1)
- Nouns are prevalent
- Limited context available for solving ambiguities
[Figure: example CH rooted at "Vacation", with nodes 2001, 2000, Sea, Lake, Beach, Mountains, Tuscany, Spain, USA]
31 Part of Speech Tags (2)
- POS tagger: TnT [Brants, ANLP-2000]
- CH: 5k tokens extracted from a balanced set of CHs (web directories, file systems, product catalogues, ontologies), both for English and Italian
- Text
  - English: training over 1M words (BNC)
  - Italian: training over 50k words (Elsnet)
32 Tokenization
Credit agencies
Business credit agencies
Business credit gathering or reporting services
Value added network (VAN) services
From UNSPSC
33 Abbreviations
Potato, pot. product
Semi-instant product (veg.)
From EClass
34 Multiwords
- Multiword on two contiguous levels
- Multiword on one level
Sport
Billiards
Players
United States
From Google
35 Coordination
Healthcare Services
Alternative and Holistic medicine
Witch doctors or voodoo services
From UNSPSC
36 Multilinguality
37 Lexical Ambiguity
- Structural information provides context for word sense disambiguation
- The connections between WSD and web directories have been investigated by Gonzalo et al. 2003
Plants
Trees
Apple tree
From Google
38 Arc Interpretation
- Relations among nodes are not formally defined
  - Instance-of
- In CHs, the documents classified under a node A are a subset of the documents classified under the parent node of A.
- According to our world knowledge, the relation between two nodes can be interpreted in various ways.
39 Arc Interpretation
- Relations among nodes are not formally defined
- Part-of
Images
Tuscany
Pisa
Florence
From Google
40 Arc Interpretation
- Relations among nodes are not formally defined
- Generic Associations
Television
Cable_TV
Satellite
Public_Access
Guides
From Google
41 Arc Interpretation
- Relations among nodes are not formally defined
- Meta-level criteria
World Languages
A
B
Afrikaans
Bali
From Google
43 Implicit Negation
- Trentino is part of North Italy
Origin of ITC-irst employees
Italy
North except Trentino
Center
South
Trentino
From ITC-irst personnel office
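
One way to make the implicit negation explicit is sketched below, under stated assumptions: the part-of fact comes from the slide (Trentino is part of North Italy), while the function name `explicit_meaning`, the `part_of` dictionary and the `AND NOT` notation are hypothetical choices for illustration.

```python
def explicit_meaning(node, siblings, part_of):
    """Spell out a node label, excluding every sibling that world knowledge
    says is a direct part of that node (only direct part-of facts are used)."""
    excluded = sorted(s for s in siblings
                      if s != node and part_of.get(s) == node)
    return node + "".join(f" AND NOT {s}" for s in excluded)

# World knowledge assumed for the example.
part_of = {"Trentino": "North"}
siblings = ["North", "Center", "South", "Trentino"]

for label in siblings:
    print(label, "->", explicit_meaning(label, siblings, part_of))
```

Because "Trentino" appears as a sibling of "North", the sketch reads "North" as "North AND NOT Trentino", which is the interpretation the hierarchy on this slide calls for.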
44 CTXMATCH Algorithm
- Semantic explicitation
  - Linguistic analysis of labels
    - Shallow parsing, access to Wordnet, multiwords
  - Contextualization
    - Sense filtering (use Wordnet as knowledge repository)
    - Sense composition (use Wordnet as knowledge repository)
- Semantic comparison
  - Build a logical form (description logics)
  - Compute the logical relation between the two formulas (SAT solver)
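
The semantic comparison step can be illustrated with a small sketch. It is not the CTXMATCH implementation: propositional formulas stand in for description logics, an exhaustive truth-table check stands in for the SAT solver, and the tuple encoding, the function names and the single world-knowledge axiom are assumptions made for the example.

```python
from itertools import product

def atoms(f):
    """Collect the atom names of a formula (a str atom or an (op, ...) tuple)."""
    if isinstance(f, str):
        return {f}
    return set().union(*(atoms(x) for x in f[1:]))

def holds(f, v):
    """Evaluate formula f under the truth assignment v (dict atom -> bool)."""
    if isinstance(f, str):
        return v[f]
    op = f[0]
    if op == "not":
        return not holds(f[1], v)
    if op == "and":
        return all(holds(x, v) for x in f[1:])
    if op == "or":
        return any(holds(x, v) for x in f[1:])
    if op == "implies":
        return (not holds(f[1], v)) or holds(f[2], v)
    raise ValueError(f"unknown operator: {op}")

def relation(source, target, axioms):
    """Qualitative relation of source to target in all assignments satisfying axioms."""
    names = sorted(atoms(source) | atoms(target) | atoms(axioms))
    s_implies_t = t_implies_s = True
    overlap = False
    for values in product([False, True], repeat=len(names)):
        v = dict(zip(names, values))
        if not holds(axioms, v):
            continue  # assignment contradicts world knowledge
        s, t = holds(source, v), holds(target, v)
        if s and not t:
            s_implies_t = False
        if t and not s:
            t_implies_s = False
        if s and t:
            overlap = True
    if s_implies_t and t_implies_s:
        return "equivalent"
    if s_implies_t:
        return "more specific"
    if t_implies_s:
        return "more general"
    return "compatible" if overlap else "not compatible"

# Assumed world knowledge: whatever is in Tuscany is in Italy.
axioms = ("implies", "tuscany", "italy")
source = ("and", "sea", "tuscany")  # e.g. a node read as "sea holidays in Tuscany"
target = ("and", "sea", "italy")    # e.g. a node read as "sea holidays in Italy"
print(relation(source, target, axioms))
```

Without the axiom the two formulas would only come out as compatible, which shows why world knowledge is needed to obtain the more specific relation. The exhaustive check is exponential in the number of atoms, so a real SAT solver is the sensible choice beyond toy inputs.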
45 An Experimental Setting: Matching Web Directories
- Task: automatically discover qualitative mappings among corresponding directories of Google and Yahoo
- CTXMATCH
  - Input: a pair <N1, N2> of nodes belonging to CH1 and CH2
  - Output: a relation holding between N1 and N2
    - more general, more specific, equivalent, no relation
- Evaluation: define a metric considering the documents (URLs) classified both by Google and Yahoo. Define a mapping between this metric and the CTXMATCH relations.
- Baseline: string match of the paths of the two nodes.
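
The slides do not spell out the metric or the baseline, so the following is only one possible reading, sketched under assumptions: the reference relation is derived by set inclusion over the URLs that both directories classify, and the baseline compares paths label by label. All function names and the toy URL identifiers are hypothetical.

```python
def reference_relation(urls_n1, urls_n2, shared_urls):
    """Relation of N1 to N2 judged only on URLs classified by both directories."""
    a = set(urls_n1) & set(shared_urls)
    b = set(urls_n2) & set(shared_urls)
    if a and a == b:
        return "equivalent"
    if a and a < b:   # N1's shared URLs are a proper subset of N2's
        return "more specific"
    if b and b < a:
        return "more general"
    return "no relation"

def baseline_relation(path_n1, path_n2):
    """String-match baseline: compare the two paths label by label."""
    p1 = [label.lower() for label in path_n1.strip("/").split("/")]
    p2 = [label.lower() for label in path_n2.strip("/").split("/")]
    if p1 == p2:
        return "equivalent"
    if p1[:len(p2)] == p2:   # path 1 extends path 2
        return "more specific"
    if p2[:len(p1)] == p1:   # path 2 extends path 1
        return "more general"
    return "no relation"

def agreement(predicted, reference):
    """Fraction of node pairs on which a matcher agrees with the reference."""
    pairs = list(reference)
    hits = sum(predicted.get(pair) == reference[pair] for pair in pairs)
    return hits / len(pairs) if pairs else 0.0

google = "Architecture/History/Periods_and_Styles/Gothic"
yahoo = "Architecture/History/Medieval"
print(baseline_relation(google, yahoo))
```

On this pair of paths the string-match baseline finds no relation, because neither path is a prefix of the other. Relating "Gothic" to "Medieval" requires interpreting the labels rather than comparing strings.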
46 Matching Google and Yahoo!: Linguistic Analysis
47 Matching Google and Yahoo!: Preliminary Results
Google: Architecture/History/Periods_and_Styles/Gothic
is more specific than
Yahoo: Architecture/History/Medieval
48 Ongoing and Future Experiments
- Web directories: build a reference benchmark for evaluating matching algorithms.
  - Include Looksmart
  - Google English vs Google Italian
- File systems
  - Collaboration: Edamok, SWAP, MEANING
- Domain specific applications
  - Medical classification: integration of UML in the algorithm
  - Public Administration: matching document classification hierarchies for automatic routing
- Edamok project: www.edamok.itc.it
  - Papers, algorithm specifications, case studies
49 Conclusions
- Interoperability of Classification Hierarchies
  - Scientific interest: Semantic Web community
  - Application oriented interest
- NLP can play a crucial role
- A proper experimental setting is necessary for comparing different approaches
- CTXMATCH
  - Qualitative mappings
  - Semantic interpretation based on linguistic analysis
  - Preliminary results