Title: Thesaurus Mapping
1 Thesaurus Mapping
Martin Doerr
Centre for Cultural Informatics and Documentation Systems
Institute of Computer Science
Foundation for Research and Technology - Hellas
Bath, UK, January 11, 2000
2 Thesaurus Mapping: The Problem
- Logical aspects
  - Semantics of involved entities
  - Notions of translation
  - Objectives and logics of mapping
- Production of mappings
  - Human
  - Language engineering, cluster analysis
- Architecture
  - Mapping management
  - Mapping service
  - Integration in IT environment
3 Thesaurus Mapping: Why do we need mapping?
- Thesauri for information retrieval depend on
  - Viewpoint (e.g. functional, morphological, social, special database fields etc.)
  - Language or social group (experts, common people etc.)
  - Size and distribution of target material (effective partitioning)
- Therefore
  - Concepts differ
  - Use of concepts differs
  - Semantic embedding differs
- Even if we agree on the same world
- Research topic: Formalisation of views and context
4 Thesaurus Mapping: Semantics of entities
- Concepts are defined by agreement,
  - e.g. orange (colour)
- Concepts identify sets of real world objects
- Concepts are identified by
  - scope notes, literature references, examples, images
- Concepts should not be changed
  - they should be created or abandoned
  - they should be understood, accepted or rejected
- A Descriptor is a concept identifier
5 Thesaurus Mapping: Semantics of entities
- Links should express opinions and differences
  - about set relation between concepts
    - subsumption, disjointness etc.
  - about derived concepts
  - about term usage
  - opinions may be human or computational!
- Terms (noun phrases) should be used
  - by social groups to refer to (multiple) concepts
  - without direct linguistic meaning
  - one term is selected as concept identifier
6 Thesaurus Mapping: Semantics of entities
- concept - concept relations
  - set semantics: BT, between thesauri/version - for query expansion, users
  - associative: RTs, BTP, etc. - for user guidance
- concept - term
  - authoritative: preferred, used for - for cataloguers, users
  - statistical: possible synonyms - for information retrieval
- term - term relations
  - dictionary entries - limited precision, within LE tools
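The entities described on the last three slides (concepts identified by descriptors, terms attached to concepts, and links that record opinions) can be sketched as a small data model. This is only an illustration under assumed names: `Concept`, `Link`, `LinkKind`, `asserted_by` and the sample scope note are hypothetical and do not come from any actual thesaurus system.

```python
from dataclasses import dataclass
from enum import Enum


class LinkKind(Enum):
    BT = "broader term"            # concept - concept, set semantics
    RT = "related term"            # concept - concept, associative
    PREFERRED = "preferred"        # concept - term, authoritative
    USED_FOR = "used for"          # concept - term, authoritative
    SYNONYM = "possible synonym"   # concept - term, statistical


@dataclass(frozen=True)            # frozen: concepts are created or abandoned, not changed
class Concept:
    descriptor: str                # the term selected as concept identifier
    thesaurus: str
    scope_note: str = ""


@dataclass(frozen=True)
class Link:
    kind: LinkKind
    source: Concept
    target: str                    # descriptor or term the link points to
    asserted_by: str = "human"     # opinions may be human or computational


# Illustrative data only.
orange = Concept("orange (colour)", "Thes A", "Hue between red and yellow")
links = [
    Link(LinkKind.BT, orange, "colour"),
    Link(LinkKind.USED_FOR, orange, "orange-coloured"),
    Link(LinkKind.SYNONYM, orange, "tangerine", asserted_by="computational"),
]

# Select the links that came from a computational opinion.
computational = [link.target for link in links if link.asserted_by != "human"]
print(computational)
```

Keeping the origin of each link as data lets authoritative and statistical relations live side by side and be filtered per use (cataloguing, query expansion, retrieval).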
7 Thesaurus Mapping: What is a Multilingual Thesaurus?
- A translated thesaurus: For comprehension
  - Established concepts and terms from one user group
  - Optimally interpreted in words of another or more languages
  - Translations are not established terms
- Mapped thesauri (ISO 5964): For transition
  - Independent thesauri, each one from another user group
  - Established concepts and terms.
  - links declare overlap between concepts
- Interlingua: For communication and knowledge sharing
  - Compromise to share concepts between many user groups
  - Optimally interpreted in words of another language
8 Thesaurus Mapping: Functionality of Mapping
- Transparent query transformation (Z39.50!)
  - Replace Boolean term combination from thesaurus A with optimal term combination from thesaurus B to retrieve equivalent results
  - Guaranteed transition needed (possibly to higher concepts)
  - Need controlled loss of precision or recall (research!)
- Combinatorial explosion
  - Need cascading: Thes A -> Thes B -> Thes C
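Cascading can be pictured as composing mapping tables instead of maintaining one table per pair of thesauri. The sketch below makes several assumptions: each descriptor maps to a set of target descriptors read as an OR, and the tables `A_TO_B` and `B_TO_C` and their entries are invented for illustration.

```python
# Hypothetical mapping tables: descriptor -> set of target descriptors (read as OR).
A_TO_B = {"chapel": {"church"}, "mill": {"watermill", "windmill"}}
B_TO_C = {
    "church": {"église"},
    "watermill": {"moulin à eau"},
    "windmill": {"moulin à vent"},
}


def translate(query_terms, mapping):
    """Replace every term by its targets; unmapped terms are reported, not dropped silently."""
    translated, unmapped = set(), set()
    for term in query_terms:
        targets = mapping.get(term)
        if targets:
            translated |= targets
        else:
            unmapped.add(term)
    return translated, unmapped


def cascade(query_terms, *mappings):
    """Apply mappings in sequence, e.g. Thes A -> Thes B -> Thes C."""
    current, lost = set(query_terms), set()
    for mapping in mappings:
        current, unmapped = translate(current, mapping)
        lost |= unmapped
    return current, lost


result, lost = cascade({"mill", "chapel", "barn"}, A_TO_B, B_TO_C)
print(sorted(result), sorted(lost))
```

With n thesauri, direct mapping needs up to n(n-1) directed tables, whereas a chain needs only n-1. The price is that every step can lose terms or precision, which is why the sketch returns the terms it could not carry over.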
9 Thesaurus Mapping: Logics of Mapping
- Interthesaurus relations (ISO 5964)
  - (from Descriptor of Thes. A to Descriptor of Thes. B)
  - partial equivalence
    - Better: broader equivalence
    - narrower equivalence
  - exact equivalence
  - inexact equivalence (+/-)
    - good for FTR only
  - single to multiple equivalence
    - Better: exact equivalence to BOOLEAN combination of target terms.
    - AND (intersection), OR (union), NOT (complement)
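An exact equivalence to a Boolean combination of target terms can be evaluated directly against a collection index, because AND, OR and NOT correspond to intersection, union and complement. The following is a minimal sketch: the tuple encoding of expressions, the function `evaluate`, and the sample index are all assumptions made for illustration.

```python
# A mapping target is either a descriptor (str) or a tuple:
# ("AND", x, y), ("OR", x, y) or ("NOT", x).
def evaluate(expr, index, universe):
    """Return the set of record ids matched by a Boolean combination of descriptors."""
    if isinstance(expr, str):
        return index.get(expr, set())
    op, *args = expr
    if op == "AND":
        return evaluate(args[0], index, universe) & evaluate(args[1], index, universe)
    if op == "OR":
        return evaluate(args[0], index, universe) | evaluate(args[1], index, universe)
    if op == "NOT":
        return universe - evaluate(args[0], index, universe)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical index of a collection catalogued with Thes. B: descriptor -> record ids.
INDEX_B = {"church": {1, 2, 3}, "timber building": {2, 3, 4}, "ruin": {3}}
UNIVERSE = {1, 2, 3, 4, 5}

# Hypothetical exact equivalence for a Thes. A descriptor:
# church AND timber building AND NOT ruin.
target = ("AND", ("AND", "church", "timber building"), ("NOT", "ruin"))
hits = evaluate(target, INDEX_B, UNIVERSE)
print(sorted(hits))
```

Note that NOT needs a known universe of records, so a complement is only well defined relative to one collection.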
10 Thesaurus Mapping: Translation and Mapping
[Diagram labels: English Heritage Thesaurus; Merimee Thesaurus; AND; Interthesaurus relations; linguistic translation; Interlingua; English Vocabulary; French Vocabulary]
11 Thesaurus Mapping: Boolean OR-Combinations
- Combines instances of B and C
- Uses properties of either B or C
- Is BT of B, C and NT of their common broader terms.
[Diagram: A in exact equivalence with the Boolean compound "B OR C", which is BT of B and of C]
12 Thesaurus Mapping: Boolean AND-Combinations
- Uses instances of both B and C
- Combines properties of B and C
- Is NT of B, C and BT of their common narrower terms.
[Diagram: A in exact equivalence with the Boolean compound "B AND C", which has B and C as BTs]
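The position of the OR and AND compounds in the hierarchy follows from set semantics: a union contains both operands, an intersection is contained in both. A short check with invented instance sets (all names and values below are illustrative assumptions):

```python
# Hypothetical extensions (sets of instance ids) of descriptors in Thes. B.
B = {1, 2, 3}
C = {3, 4}
COMMON_BROADER = {1, 2, 3, 4, 5}   # a BT shared by B and C
COMMON_NARROWER = {3}              # an NT shared by B and C

b_or_c = B | C     # OR compound: union of instances
b_and_c = B & C    # AND compound: intersection of instances


def is_bt_of(broader, narrower):
    """Under set semantics, BT means 'contains every instance of'."""
    return narrower <= broader


# OR compound: BT of B and C, NT of their common broader terms.
or_ok = is_bt_of(b_or_c, B) and is_bt_of(b_or_c, C) and is_bt_of(COMMON_BROADER, b_or_c)
# AND compound: NT of B and C, BT of their common narrower terms.
and_ok = is_bt_of(B, b_and_c) and is_bt_of(C, b_and_c) and is_bt_of(b_and_c, COMMON_NARROWER)
print(or_ok, and_ok)
```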
13 Thesaurus Mapping: Approximation by Inclusion
[Diagram: A is approximated in the target thesaurus by a broader equivalence to B and by narrower equivalences to C; labels: A, B, C, BT]
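Approximation by inclusion trades precision against recall: querying with a broader equivalent finds everything but adds noise, while querying with narrower equivalents adds no noise but may miss records. A worked sketch with invented instance sets (the sets and the helper `precision_recall` are assumptions for illustration):

```python
def precision_recall(retrieved, relevant):
    """Standard definitions; empty sets are treated as a perfect score by convention."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical instance sets: A is the source concept, B a broader
# equivalent, C1 and C2 narrower equivalents in the target thesaurus.
A = {1, 2, 3, 4}
B = {1, 2, 3, 4, 5, 6, 7, 8}
C1, C2 = {1, 2}, {3}

via_broader = precision_recall(B, A)          # everything found, plus noise
via_narrower = precision_recall(C1 | C2, A)   # no noise, but record 4 is missed
print(via_broader, via_narrower)
```

Here the broader route gives precision 0.5 at recall 1.0, and the narrower route gives precision 1.0 at recall 0.75, which is the kind of controlled loss a mapping service would have to report.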
14 Thesaurus Mapping: Avoid redundant linking!
[Diagram labels: A; B; BT; Exact equivalence; Broader equivalence; Narrower equivalences]
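One reading of this slide is that a link already implied by another link together with the target thesaurus's own BT hierarchy should not be stored separately. Under that assumption, redundant broader-equivalence links can be pruned as sketched below; the hierarchy `BT` and the function names are hypothetical.

```python
# Hypothetical BT hierarchy of the target thesaurus: descriptor -> its direct BTs.
BT = {
    "parish church": {"church"},
    "church": {"religious building"},
    "religious building": set(),
}


def ancestors(term, bt):
    """All transitive BTs of a term."""
    seen, stack = set(), list(bt.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(bt.get(current, ()))
    return seen


def prune_broader_links(targets, bt):
    """Keep only the most specific broader-equivalence targets."""
    implied = set()
    for target in targets:
        implied |= ancestors(target, bt)
    return set(targets) - implied


# "religious building" is already implied by the link to "church".
links = {"church", "religious building"}
pruned = prune_broader_links(links, BT)
print(pruned)
```

The same idea applies in the other direction: narrower-equivalence links to descendants of an already linked narrower term add nothing.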
15 Thesaurus Mapping: Problems of Mapping
- Consistency and reasoning (Description Logics!)
- Optimal substitution of combined query terms
- Protocol to propagate recall/precision control
- Inverse reading of one-to-many links.
- Postcoordination: unclear semantics!
  - e.g. grinding factories, solution by DL?
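The difficulty with reading one-to-many links backwards can be illustrated under set semantics: an exact equivalence from A to a compound of B and C yields only partial equivalences when seen from B or C. The function below is a hypothetical sketch covering OR and AND only; NOT has no such simple inverse and is deliberately left out.

```python
def inverse_links(source, op, targets):
    """Derive what each target descriptor can say about the source.

    `source` is exactly equivalent to `targets` combined with `op`.
    OR:  source = B | C contains B, so seen from B the source is broader.
    AND: source = B & C lies inside B, so seen from B the source is narrower.
    """
    relation = {"OR": "broader equivalence", "AND": "narrower equivalence"}[op]
    return [(target, relation, source) for target in targets]


from_or = inverse_links("A", "OR", ["B", "C"])
from_and = inverse_links("A", "AND", ["B", "C"])
print(from_or)
print(from_and)
```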
16 Thesaurus Mapping: Production of Mappings
- Human assessment needs (see Term-IT)
  - CSCW, work flow, decentralised management tools
  - Excellent comparative presentation of thesaurus contents
- Language engineering (see Term-IT)
  - termhood recognition, automatic translation by parallel texts,
  - filtering by occurrence in target indexing language.
  - Excellent for preprocessing!
- Analysis of use
  - Cluster analysis with doubly indexed entries.
  - Libraries: problem to identify the same work!
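Analysis of doubly indexed entries can propose candidate mappings for human assessment. The sketch below uses Jaccard overlap of the record sets as the similarity measure; that choice, the threshold, and the sample records are assumptions for illustration, not a method prescribed by the slides.

```python
from collections import defaultdict
from itertools import product

# Hypothetical records, each indexed with descriptors from both thesauri.
RECORDS = [
    ({"church"}, {"église"}),
    ({"church", "ruin"}, {"église", "ruine"}),
    ({"mill"}, {"moulin"}),
    ({"church"}, {"église"}),
]


def candidate_mappings(records, threshold=0.5):
    """Score descriptor pairs by Jaccard overlap of the records they index."""
    docs_a, docs_b = defaultdict(set), defaultdict(set)
    for rec_id, (terms_a, terms_b) in enumerate(records):
        for term in terms_a:
            docs_a[term].add(rec_id)
        for term in terms_b:
            docs_b[term].add(rec_id)
    scores = {}
    for a, b in product(docs_a, docs_b):
        overlap = len(docs_a[a] & docs_b[b]) / len(docs_a[a] | docs_b[b])
        if overlap >= threshold:
            scores[(a, b)] = overlap
    return scores


scores = candidate_mappings(RECORDS)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

Such scores only rank candidates. The approach also presupposes that the same work can be identified across collections, which the slide names as the main problem for libraries.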
17 SIS - Thesaurus Management System: Co-operative linking
[Diagram labels: Group 1; Group 2; Version 0; Version 1; Version 2; New Workspace; obsolete term]
18 Thesaurus Mapping: Users Environment
19 Thesaurus Mapping: Three-level Architecture
[Diagram labels: End User; National Authority Providers; Local TMS; CMS; CMS Maintainer; flows: concept proposal, Thesaurus initialization, Update term use]
20 Thesaurus Mapping: Architectural Considerations
- We propose to distinguish
  - Collection Management Systems with local term management
  - National authority providers
  - Mapping service
- Mapping service
  - Co-operative mapping production environment and system,
    - for few languages (3?), domain specific?
  - Large scale mapping tables detached from production system,
    - accessible as replicated Web resource.
- Integration
  - Access engines connect to mapping resources on demand
  - Provision of suitable metadata for CMS capabilities
21 Thesaurus Mapping: Conclusions
- Thesaurus mapping is feasible and the best means to access coherently multiple CMS with controlled vocabulary
- Thesaurus mapping is a major investment in human resources and IT environment
- Targeted research can much improve the currently feasible
  - quality of mapping
  - quality of service
  - and production cost