Title: Towards Gazetteer Integration Through an Instance-based Thesauri Mapping Approach
1Towards Gazetteer Integration Through an
Instance-based Thesauri Mapping Approach
- Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú
- dani, casanova, milidiu_at_inf.puc-rio.br
- Pontifical Catholic University of Rio de Janeiro (PUC-Rio)
- Department of Informatics
2Summary
- Motivation
- Gazetteers Thesauri
- Gazetteer Integration
- Instance-based Thesauri Mapping
- Conceptual and Statistical Model
- Experiments with Geographic Data
- Conclusions
3Motivation
- Goal: Gazetteer Integration
  - how to migrate entries from gazetteer GB to gazetteer GA
- Problems
  - Duplicated Entries Elimination: gazetteers may overlap, which requires detecting and eliminating duplicates
  - Reclassification of migrated entries: gazetteers may adopt different classification schemes, which requires mapping the classification scheme of GB to that of GA
5Gazetteers Thesauri
- Gazetteer
  - a gazetteer is a geographical dictionary (as at the back of an atlas) containing a list of geographic names, together with their geographic locations and other descriptive information (WordNet 2005)
  - a gazetteer is a catalog of geographic features, where each entry has as attributes
    - a unique ID
    - a unique type: a term taken from a feature type thesaurus
    - a name
    - optionally, a location: an approximation of the feature footprint
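The entry structure listed above can be sketched as a small record type. This is only an illustration: the class name, the field names, and the choice of a single (latitude, longitude) point as the footprint approximation are assumptions, not part of the original slides. The sample values come from the ADL example table in this deck, reading gmly as latitude and gmlx as longitude.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GazetteerEntry:
    """Hypothetical record for one gazetteer entry (field names are assumptions)."""
    entry_id: str       # unique ID
    feature_type: str   # unique type: a term from a feature type thesaurus
    name: str           # display name
    # Optional location: here a (latitude, longitude) point approximating the footprint.
    location: Optional[Tuple[float, float]] = None


rio = GazetteerEntry(
    entry_id="adlgaz-1-1457061-32",
    feature_type="populated places",
    name="Rio de Janeiro - Brazil",
    location=(-22.9, -43.2333),
)

# The location is optional, so an entry may omit it.
unlocated = GazetteerEntry("example-1", "streams", "Example stream")
```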
6Gazetteers Thesauri
- Thesauri
  - a thesaurus is a structured and defined list of terms which standardizes words used for indexing (UNESCO 1995)
  - thesaurus relationships
    - NT: narrower term
    - BT: broader term
    - RT: related term
    - ...
7Gazetteers Thesauri
Ex.: ADL Feature Type Thesaurus
identifier | display-name | class | gmly | gmlx
adlgaz-1-1457057-00 | Rio de Janeiro, Estado do - Brazil | administrative areas | -22.0 | -42.5
adlgaz-1-1457059-20 | Rio de Janeiro, Serra do - Brazil | mountains | -17.95 | -44.95
adlgaz-1-1457061-32 | Rio de Janeiro - Brazil | populated places | -22.9 | -43.2333
adlgaz-1-1437138-6b | Janeiro, Rio de - Brazil | streams | -11.85 | -45.15
adlgaz-1-3223719-6f | Rio de Janeiro - Loreto, Departamento de - Peru | populated places | -4.3833 | -71.8167
8Gazetteers Thesauri
- ADL Feature Type Thesaurus (sample terms rooted at regions)
regions
. agricultural regions
. biogeographic regions
. . barren lands
. . deserts
. . forests
. . . petrified forests
. . . rain forests
. . . woods
. . grasslands
. . habitats
. . jungles
. . oases
. . shrublands
. . snow regions
. . tundras
. . wetlands
. climatic regions
. coastal zones
. economic regions
. land regions
. . continents
. . islands
. . . archipelagos
. . subcontinents
. linguistic regions
. map regions
. . chart regions
. . map quadrangle regions
. . UTM zones
9Gazetteers Thesauri
- ADL Feature Type Thesaurus (sample entry)
islands
  A feature type category for places such as the island of Manhattan.
  Used for: the category islands is used instead of any of the following: atolls, cays, island arcs, isles, islets, keys (islands), land-tied islands, mangrove islands
  Broader Terms: land regions (islands is a subtype of "land regions")
  Related Terms: the following is a list of other categories related to islands (non-hierarchical relationships): bars (physiographic)
  Scope Note (definition of islands): Tracts of land smaller than a continent, surrounded by the water of an ocean, sea, lake or stream. (Glossary of Geology, 4th ed.)
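The BT, NT and RT relationships shown in the sample entry can be sketched as a small in-memory structure. The class and function names below are hypothetical, and storing only the broader term (deriving narrower terms as its inverse) is a design assumption made for brevity; the terms themselves are taken from the ADL samples in this deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """Hypothetical thesaurus term; only BT is stored, NT is derived from it."""
    name: str
    broader: Optional[str] = None                       # BT
    related: List[str] = field(default_factory=list)    # RT
    used_for: List[str] = field(default_factory=list)   # non-preferred synonyms


def narrower_terms(thesaurus: Dict[str, Term], name: str) -> List[str]:
    """NT of a term: every term whose BT is the given term."""
    return sorted(t.name for t in thesaurus.values() if t.broader == name)


def ancestors(thesaurus: Dict[str, Term], name: str) -> List[str]:
    """Follow BT links from a term up to the root of the hierarchy."""
    chain = []
    current = thesaurus[name].broader
    while current is not None:
        chain.append(current)
        current = thesaurus[current].broader
    return chain


thesaurus = {t.name: t for t in [
    Term("regions"),
    Term("land regions", broader="regions"),
    Term("continents", broader="land regions"),
    Term("islands", broader="land regions",
         related=["bars (physiographic)"],
         used_for=["atolls", "cays", "isles", "islets"]),
    Term("archipelagos", broader="islands"),
    Term("subcontinents", broader="land regions"),
]}
```

With this layout, `ancestors(thesaurus, "archipelagos")` walks up through islands and land regions to regions, which is the kind of lookup a reclassification step could use to fall back on a broader term.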
10Gazetteers Thesauri
[Figure: mediator architecture. Components shown: Mediator (twice), Local Catalogue (CAT), Reference Gazetteer (GAZ), Local DataSource (DS), External Catalogue (CAT), External Gazetteer (GAZ), Wrapper, DataSource]
12Gazetteer Integration
- Gazetteer Integration Problem
  - how to migrate entries from gazetteer GB to gazetteer GA
[Figure: gazetteer GA with thesaurus TA and gazetteer GB with thesaurus TB; gazetteers shown: GEONet, ADL Gazetteer]
13Gazetteer Integration
- Duplicated Entries Elimination
  - Gazetteers GA and GB may have entries that represent the same real-world features
  - use footprints to detect possible duplicates

FB (sample entries; cell text is truncated on the slide)
identifier | display-name | class | gmly | gmlx
adlgaz-1-1457057- | Rio de Janeiro | administrat | -22.0 | -42.5
adlgaz-1-1457059 | Rio de Janei | mountains | -17.95 | -44.95
adlgaz-1-1457061- | Rio de Jane | populated places | -22.9 | -43.2333
adlgaz-1-1437138 | Janeiro, Rio de | streams | -11.85 | -45.15
adlgaz-1-1437138 | Janeiro, Rio de | streams | -11.85 | -45.15

FA (sample entries; cell text is truncated on the slide)
identifier | display-name | class | gmly | gmlx
adlgaz-1-1457057- | Rio de Janeiro | administrat | -22.0 | -42.5
adlgaz-1-1457061- | Rio de Jane | populated places | -22.9 | -43.2333
adlgaz-1-1437138 | Janeiro, Rio de | streams | -11.85 | -45.15
adlgaz-1-1437138 | Janeiro, Rio de | streams | -11.85 | -45.15

fa ≡ fb (entries fa and fb represent the same real-world feature)
[Figure: duplicate entries of FA and FB linked across gazetteers GA and GB, with thesauri TA and TB; gazetteers shown: GEONet, ADL Gazetteer]
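Footprint-based duplicate detection can be sketched as a proximity test between point locations. The slides only say that footprints are used to detect possible duplicates; the point footprints, the distance measured in plain degrees, the tolerance value, the function name, and the GB-side identifiers and type codes below are all assumptions made for illustration.

```python
from math import hypot
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    entry_id: str
    feature_type: str
    lat: float
    lon: float


def find_duplicate_candidates(
    ga: List[Entry], gb: List[Entry], tol_deg: float = 0.01
) -> List[Tuple[Entry, Entry]]:
    """Pair every fa in GA with every fb in GB whose point footprints lie
    within tol_deg degrees of each other (a naive O(|GA| * |GB|) scan)."""
    pairs = []
    for fa in ga:
        for fb in gb:
            if hypot(fa.lat - fb.lat, fa.lon - fb.lon) <= tol_deg:
                pairs.append((fa, fb))
    return pairs


# GA entries use coordinates from the ADL sample table in this deck.
ga = [
    Entry("adlgaz-1-1457061-32", "populated places", -22.9, -43.2333),
    Entry("adlgaz-1-1437138-6b", "streams", -11.85, -45.15),
]
# GB entries are hypothetical: invented IDs and type codes at matching coordinates.
gb = [
    Entry("gb-1", "PPL", -22.9, -43.2333),
    Entry("gb-2", "STM", -11.85, -45.15),
    Entry("gb-3", "ISL", -4.3833, -71.8167),
]

candidates = find_duplicate_candidates(ga, gb)
```

A real gazetteer would need a spatial index and probably a name comparison as well, since distinct features can share nearly the same point location; the pairs found here are candidates, not confirmed duplicates.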
14Gazetteer Integration
- Reclassification of migrated entries
  - Gazetteers may adopt different classification schemes, which requires mapping the classification scheme of GB to that of GA
[Figure: mapping m(tb) = ta from thesaurus TB of gazetteer GB to thesaurus TA of gazetteer GA; gazetteers shown: GEONet, ADL Gazetteer]
16Gazetteer Integration
- Aligning terms does not work...
...
17Gazetteer Integration
- Aligning term definitions is even worse...
  - (ADL) bay: indentations of a coastline or shoreline enclosing a part of a body of water; bodies of water partly surrounded by land.
  - (GNS) bay: a coastal indentation between two capes or headlands, larger than a cove but smaller than a gulf.
  - (GNS) island: tracts of land, smaller than a continent, surrounded by water at high water.
18Gazetteer Integration
- Formal approaches (based on DL) are hopeless...
...
<owl:Class rdf:ID="Island">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://sweet.jpl.nasa.gov/ontology/space.owl#surroundedBy_2D" />
      <owl:allValuesFrom>
        <owl:Class>
          <owl:unionOf rdf:parseType="Collection">
            <owl:Class rdf:about="#OceanRegion" />
            <owl:Class rdf:about="#LandwaterRegion" />
          </owl:unionOf>
        </owl:Class>
      </owl:allValuesFrom>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf rdf:resource="#LandRegion" />
</owl:Class>
...
</rdf:RDF>
20Gazetteer Integration
- Instance-based Thesauri Mapping
  - use duplicates to figure out how to map the classification scheme of GB to that of GA
[Figure: the sample entry sets FB and FA from the Duplicated Entries Elimination slide, with duplicate pairs fa ≡ fb linked and the induced mapping m(tb) = ta from thesaurus TB to thesaurus TA; gazetteers shown: GEONet, ADL Gazetteer]
22Instance-based Thesauri Mapping Approach
- Conceptual and Statistical Model
  - n(ta, tb): number of occurrences of pairs of objects fa and fb such that
    - fa ∈ GA and fb ∈ GB
    - fa ≡ fb (fa and fb represent the same real-world feature)
    - ta and tb are the types of fa and fb, respectively
  - n(ta): the number of entries in FA classified as ta
[Figure: entry sets FB and FA]
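The two counts defined above can be sketched as a tally over the detected duplicate pairs. The function name and the input format (one (ta, tb) type pair per duplicate) are assumptions. The sketch also assumes that n(ta) is counted over the entries that take part in a duplicate pair, which agrees with the experiment table in this deck (for example, islands: 562 + 38 = 600).

```python
from collections import Counter
from typing import Iterable, Tuple


def count_cooccurrences(
    duplicate_types: Iterable[Tuple[str, str]]
) -> Tuple[Counter, Counter]:
    """Given one (ta, tb) pair per detected duplicate fa ≡ fb, return
    n(ta, tb) and n(ta) as Counters."""
    n_pair = Counter(duplicate_types)      # n(ta, tb)
    n_ta = Counter()                       # n(ta)
    for (ta, _tb), count in n_pair.items():
        n_ta[ta] += count
    return n_pair, n_ta


# Toy input: type pairs of six hypothetical duplicates.
pairs = [("islands", "ISL")] * 3 + [("islands", "ISLS")] + [("bays", "BAY")] * 2
n_pair, n_ta = count_cooccurrences(pairs)
```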
23Instance-based Thesauri Mapping Approach
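- Conceptual and Statistical Model
  - P(ta, tb): Mapping Rate Estimator
    - an estimate of the frequency with which the term ta maps to tb, for each pair of terms ta ∈ TA and tb ∈ TB
[Equation: definition of P(ta, tb) with its "where" clause; not recoverable from the extracted slide text]
[Figure: entry sets FB and FA]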
24Instance-based Thesauri Mapping Approach
- Conceptual and Statistical Model
  - Threshold Mapping Rate
    - m(tb) = ta iff P(ta, tb) ≥ the threshold mapping rate
  - Problem: What is the value of the threshold mapping rate?
[Figure: entry sets FB and FA]
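The thresholding rule can be sketched as follows. Because the estimator formula did not survive extraction, the sketch substitutes the plain relative frequency n(ta, tb) / n(ta); this is an assumption, not the authors' estimator, and the values reported later in the deck need not coincide with it. The counts are copied from the experiment table in this deck, the function names are hypothetical, and 0.4 is the threshold used in the deck's testing step.

```python
from typing import Dict, List, Tuple

# Counts copied from the experiment table (subset).
N_PAIR: Dict[Tuple[str, str], int] = {
    ("islands", "ISL"): 562,
    ("islands", "ISLS"): 38,
    ("mountains", "HLL"): 222,
    ("mountains", "HLLS"): 166,
    ("populated places", "PPL"): 10961,
    ("populated places", "PPLL"): 12,
    ("populated places", "ISL"): 6,
}
N_TA: Dict[str, int] = {"islands": 600, "mountains": 388, "populated places": 10979}


def mapping_rates(n_pair, n_ta) -> Dict[Tuple[str, str], float]:
    """ASSUMED estimator: plain relative frequency n(ta, tb) / n(ta)."""
    return {(ta, tb): count / n_ta[ta] for (ta, tb), count in n_pair.items()}


def threshold_mapping(rates, threshold: float) -> Dict[str, List[str]]:
    """Collect, for each tb, the terms ta with P(ta, tb) >= threshold."""
    mapping: Dict[str, List[str]] = {}
    for (ta, tb), p in rates.items():
        if p >= threshold:
            mapping.setdefault(tb, []).append(ta)
    return mapping


rates = mapping_rates(N_PAIR, N_TA)
mapping = threshold_mapping(rates, threshold=0.4)
```

Under these assumptions ISL maps to islands, because the six populated places that coincide with ISL entries give a rate far below the threshold. A list is kept per tb because nothing in this simplified rule prevents two terms of TA from passing the threshold for the same tb.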
26Experiments with Geographic Data
- Data collection
  - ADL Gazetteer (ADL Feature Type Thesaurus - TA)
    - Instances: 16783
    - Thesaurus terms: 210
  - GEOnet Names Server (GEOnet Thesaurus - TB)
    - Instances: 87608
    - Thesaurus terms: 642
27Experiments with Geographic Data
- Model Evaluation Test
  - The collected data were partitioned into 7 datasets
    - 6 for tuning
    - 1 for testing
[Figure: the 7 datasets, labelled as tuning sets and testing set]
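The partitioning described above can be sketched as a 7-way split in which one part is held out for testing and the other six drive a 6-fold cross-validation. The slides do not say how the partition was drawn, so the random shuffle, the fixed seed, the choice of the last part as the testing set, and the function names are all assumptions.

```python
import random
from typing import Iterable, Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def partition(items: Iterable[T], parts: int = 7, seed: int = 0) -> List[List[T]]:
    """Shuffle the items and deal them into `parts` datasets of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::parts] for i in range(parts)]


def cross_validation_folds(
    tuning_sets: Sequence[List[T]],
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (training, validation) pairs: each tuning set serves once as the
    validation set while the remaining ones are merged into the training set."""
    for k, validation in enumerate(tuning_sets):
        training = [x for i, s in enumerate(tuning_sets) if i != k for x in s]
        yield training, validation


datasets = partition(range(70))             # 70 stand-in items, not the real data
tuning_sets, testing_set = datasets[:6], datasets[6]
folds = list(cross_validation_folds(tuning_sets))
```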
28Experiments with Geographic Data
[Figure: collected data split into a testing set and the sets used for 6-fold cross-validation]
30Experiments with Geographic Data
ADL (ta) | GEOnet (tb) | n(ta, tb)
bays | BAY | 40
beaches | BCH | 66
countries, 2nd order divisions | ADM2 | 30
forests | PRK | 2
islands | ISL | 562
islands | ISLS | 38
mountains | HLL | 222
mountains | HLLS | 166
physiographic features | UPLD | 204
populated places | PPL | 10961
populated places | PPLL | 12
populated places | ISL | 6
power generation sites | PS | 10
railroad features | RSTN | 406
railroad features | RSTP | 201
ridges | RDGE | 131
ridges | SPUR | 55
wetlands | MRSH | 9
wetlands | FLTT | 5
...

ADL (ta) | n(ta)
islands | 600
mountains | 388
physiographic features | 204
populated places | 10979
power generation sites | 10
railroad features | 607
ridges | 186
wetlands | 14
...
32Experiments with Geographic Data
[Figure: validation step of the 6-fold cross-validation over the collected data, showing a Training Set (Tk) and a Validation Set (Vk) for fold k]
34Experiments with Geographic Data
[Figure: the 6-fold cross-validation over the collected data yields the Estimated Threshold Mapping Rate; the testing set is kept apart]
36Experiments with Geographic Data
- Testing Step (on the testing set)
  - Threshold: 0.4

C | P | Accuracy (C/P)
26 | 29 | 89.7%
Legend: C = correct term alignments; P = proposed term alignments

- Example: Aligned terms

ta | tb | P(ta, tb)
agricultural sites | FRM | 0.96974
bays | BAY | 0.50039
forests | RESF | 0.50078
islands | ISL | 0.93422
lakes | LK | 0.91849
...
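The accuracy figure is the ratio C/P of correct to proposed term alignments. A minimal sketch of that computation follows; the function name and the idea of checking proposals against a set of reference alignments are assumptions, and the toy pairs are placeholders rather than data from the experiment.

```python
from typing import Set, Tuple

Alignment = Tuple[str, str]   # (ta, tb)


def alignment_accuracy(proposed: Set[Alignment], reference: Set[Alignment]) -> float:
    """Accuracy = C / P, where P counts the proposed alignments and
    C counts those that also appear in the reference alignments."""
    if not proposed:
        raise ValueError("no alignments were proposed")
    correct = sum(1 for pair in proposed if pair in reference)
    return correct / len(proposed)


# Placeholder alignments: three proposals, two of them in the reference set.
proposed = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
reference = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
toy_accuracy = alignment_accuracy(proposed, reference)

# The figure reported on the slide: C = 26, P = 29.
reported_accuracy_percent = round(100 * 26 / 29, 1)
```

This ratio only measures how many proposed alignments are right; it says nothing about alignments that the threshold caused the method to miss.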
38Conclusions
- Conclusions
  - duplicates help reclassification!
  - a semantic approach may work when syntactic approaches fail (badly)
- If you buy the idea, you also get...
  - a strategy to gradually learn how to reclassify gazetteer entries (as in a mediator)
  - a strategy to mediate access to object catalogs in general (as long as it is possible to detect duplicates)
  - (Gazetteer for the Brazilian territory
    - extracted from the ADL Gazetteer
    - entries classified according to 4 different (aligned) schemes
    - encapsulated by Web services)