Towards Gazetteer Integration Through an Instance-based Thesauri Mapping Approach

1
Towards Gazetteer Integration Through an
Instance-based Thesauri Mapping Approach
  • Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú
  • {dani, casanova, milidiu}@inf.puc-rio.br
  • Pontifical Catholic University of Rio de Janeiro (PUC-Rio)
  • Department of Informatics

2
Summary
  • Motivation
  • Gazetteers and Thesauri
  • Gazetteer Integration
  • Instance-based Thesauri Mapping
  • Conceptual and Statistical Model
  • Experiments with Geographic Data
  • Conclusions

3
Motivation
  • Goal: gazetteer integration
    • how to migrate entries from gazetteer GB to gazetteer GA
  • Problems
    • Duplicated entries elimination: gazetteers may overlap, which requires detecting and eliminating duplicates
    • Reclassification of migrated entries: gazetteers may adopt different classification schemes, which requires mapping the classification scheme of GB to that of GA

5
Gazetteers and Thesauri
  • Gazetteer
    • a gazetteer is a geographical dictionary (as at the back of an atlas) containing a list of geographic names, together with their geographic locations and other descriptive information [WordNet 2005]
    • a gazetteer is a catalog of geographic features, where each entry has as attributes:
      • a unique ID
      • a unique type: a term taken from a feature type thesaurus
      • a name
      • optionally, a location: an approximation of the feature footprint
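As a rough illustration of the entry attributes listed above, a gazetteer entry could be modelled as a small record type. This is only a sketch: the class and field names (GazetteerEntry, feature_id, feature_type, location) are hypothetical and not taken from any gazetteer API, and the footprint is simplified to a single (latitude, longitude) point. The sample values come from the ADL Gazetteer example in this deck.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GazetteerEntry:
    """Hypothetical record for one gazetteer entry (names are illustrative)."""
    feature_id: str      # unique ID
    feature_type: str    # unique type: a term from a feature type thesaurus
    name: str            # display name
    # Optional footprint approximation, simplified here to (lat, lon).
    location: Optional[Tuple[float, float]] = None


entry = GazetteerEntry(
    feature_id="adlgaz-1-1457061-32",
    feature_type="populated places",
    name="Rio de Janeiro - Brazil",
    location=(-22.9, -43.2333),
)

# The location is optional, so an entry without a footprint is still valid.
no_footprint = GazetteerEntry("example-id", "streams", "Example stream")
```

Making the record immutable (frozen) is a design choice for the sketch, so that entries can be used as dictionary keys when counting duplicates.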

6
Gazetteers and Thesauri
  • Thesauri
    • a thesaurus is a structured and defined list of terms which standardizes words used for indexing [UNESCO 1995]
    • thesaurus relationships
      • NT: narrower term
      • BT: broader term
      • RT: related term
      • ...

7
Gazetteers and Thesauri
  • ADL Gazetteer

identifier          | display-name                                    | class                | gml:y   | gml:x
adlgaz-1-1457057-00 | Rio de Janeiro, Estado do - Brazil              | administrative areas | -22.0   | -42.5
adlgaz-1-1457059-20 | Rio de Janeiro, Serra do - Brazil               | mountains            | -17.95  | -44.95
adlgaz-1-1457061-32 | Rio de Janeiro - Brazil                         | populated places     | -22.9   | -43.2333
adlgaz-1-1437138-6b | Janeiro, Rio de - Brazil                        | streams              | -11.85  | -45.15
adlgaz-1-3223719-6f | Rio de Janeiro - Loreto, Departamento de - Peru | populated places     | -4.3833 | -71.8167

class: a term from a feature type thesaurus (ex.: ADL Feature Type Thesaurus)
8
Gazetteers and Thesauri
  • ADL Feature Type Thesaurus (sample terms rooted at regions)

regions
. agricultural regions
. biogeographic regions
. . barren lands
. . deserts
. . forests
. . . petrified forests
. . . rain forests
. . . woods
. . grasslands
. . habitats
. . jungles
. . oases
. . shrublands
. . snow regions
. . tundras
. . wetlands
regions (cont.)
. climatic regions
. coastal zones
. economic regions
. land regions
. . continents
. . islands
. . . archipelagos
. . subcontinents
. linguistic regions
. map regions
. . chart regions
. . map quadrangle regions
. . UTM zones
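In the thesaurus listing above, each leading dot marks one level of the hierarchy, so the broader term (BT) of a term is the nearest preceding term with one dot fewer. A minimal sketch of recovering BT and NT relations from such a listing follows; the function names and the shortened sample are illustrative assumptions, not part of the ADL tooling.

```python
from typing import Dict, List

# A few lines of the dotted hierarchy; each leading "." token is one level.
SAMPLE = """\
regions
. biogeographic regions
. . forests
. . . rain forests
. . wetlands
. land regions
. . islands
. . . archipelagos
"""


def broader_terms(text: str) -> Dict[str, str]:
    """Map each term to its broader term (BT), based on dot depth."""
    bt: Dict[str, str] = {}
    stack: List[str] = []  # stack[d] holds the most recent term at depth d
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        depth = 0
        while depth < len(tokens) and tokens[depth] == ".":
            depth += 1
        term = " ".join(tokens[depth:])
        del stack[depth:]  # forget terms at this depth or deeper
        if stack:
            bt[term] = stack[-1]
        stack.append(term)
    return bt


def narrower_terms(bt: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert the BT relation to list the narrower terms (NT) of each term."""
    nt: Dict[str, List[str]] = {}
    for term, parent in bt.items():
        nt.setdefault(parent, []).append(term)
    return nt


BT = broader_terms(SAMPLE)
NT = narrower_terms(BT)
```

With this sample, BT maps islands to land regions, which agrees with the sample thesaurus entry for islands in the deck.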
9
Gazetteers and Thesauri
  • ADL Feature Type Thesaurus (sample entry)

islands
  A feature type category for places such as the island of Manhattan.
  Used for: the category "islands" is used instead of any of the following: atolls, cays, island arcs, isles, islets, keys (islands), land-tied islands, mangrove islands
  Broader Terms: land regions (islands is a subtype of "land regions")
  Related Terms: other categories related to islands (non-hierarchical relationships): bars (physiographic)
  Scope Note: Tracts of land smaller than a continent, surrounded by the water of an ocean, sea, lake or stream. [Glossary of Geology, 4th ed.]
10
Gazetteers and Thesauri
[Figure: mediator architecture. Components shown: Mediator, Local Catalogue (CAT), Reference Gazetteer (GAZ), Local DataSource (DS), External Catalogue (CAT), External Gazetteer (GAZ), Wrapper, DataSource]
12
Gazetteer Integration
  • Gazetteer Integration Problem
    • how to migrate entries from gazetteer GB to gazetteer GA

[Figure: gazetteers GA and GB with their thesauri TA and TB; the labels GEOnet and ADL Gazetteer name the two gazetteers]
13
Gazetteer Integration
  • Duplicated Entries Elimination
    • Gazetteers GA and GB may have entries that represent the same real-world features
    • use footprints to detect possible duplicates

[Figure: sample entry tables FA and FB (columns: identifier, display-name, class, gml:y, gml:x), with an entry fa in FA linked to an entry fb in FB; thesauri TA and TB, gazetteers GA and GB, labels GEOnet and ADL Gazetteer]
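The slide says only that footprints are used to detect possible duplicates; it does not say how. A minimal sketch, assuming point footprints and a distance tolerance, is given here. The 1 km default, the tuple layout, the function names, and the short IDs and GB type codes in the sample are all assumptions made for illustration; only the coordinates are taken from the deck's example table.

```python
import math
from typing import List, Tuple

# Hypothetical layout for the sketch: (id, type, latitude, longitude).
Entry = Tuple[str, str, float, float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def candidate_duplicates(ga: List[Entry], gb: List[Entry],
                         max_km: float = 1.0) -> List[Tuple[Entry, Entry]]:
    """Pair every fa in GA with every fb in GB whose footprints are close."""
    pairs = []
    for fa in ga:
        for fb in gb:
            if haversine_km(fa[2], fa[3], fb[2], fb[3]) <= max_km:
                pairs.append((fa, fb))
    return pairs


GA = [("a1", "populated places", -22.9, -43.2333),
      ("a2", "mountains", -17.95, -44.95)]
GB = [("b1", "PPL", -22.9, -43.2333),
      ("b2", "STM", -11.85, -45.15)]

pairs = candidate_duplicates(GA, GB)
```

The nested loop is quadratic in the number of entries; for gazetteers of realistic size a spatial index would be the more practical choice, which is outside the scope of this sketch.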
14
Gazetteer Integration
  • Reclassification of migrated entries
    • Gazetteers may adopt different classification schemes, which requires mapping the classification scheme of GB to that of GA

[Figure: a mapping m between the thesauri TB and TA, with m(tb) = ta; gazetteers GA and GB, labels GEOnet and ADL Gazetteer]
15
Gazetteer Integration
16
Gazetteer Integration
  • Aligning terms does not work...

...
17
Gazetteer Integration
  • Aligning term definitions is even worse...
  1. (ADL) bay: indentations of a coastline or shoreline enclosing a part of a body of water; bodies of water partly surrounded by land.
  2. (GNS) bay: a coastal indentation between two capes or headlands, larger than a cove but smaller than a gulf.
  3. (GNS) island: tracts of land, smaller than a continent, surrounded by water at high water.

18
Gazetteer Integration
  • Formal approaches (based on DL) are hopeless...

...
<owl:Class rdf:ID="Island">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://sweet.jpl.nasa.gov/ontology/space.owl#surroundedBy_2D" />
      <owl:allValuesFrom>
        <owl:Class>
          <owl:unionOf rdf:parseType="Collection">
            <owl:Class rdf:about="#OceanRegion" />
            <owl:Class rdf:about="#LandwaterRegion" />
          </owl:unionOf>
        </owl:Class>
      </owl:allValuesFrom>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf rdf:resource="#LandRegion" />
</owl:Class>
...
</rdf:RDF>
19
Gazetteer Integration
20
Gazetteer Integration
  • Instance-based Thesauri Mapping
    • use duplicates to figure out how to map the classification scheme of GB to that of GA

[Figure: sample entry tables FA and FB (columns: identifier, display-name, class, gml:y, gml:x), with an entry fa in FA linked to an entry fb in FB, and the mapping m(tb) = ta between thesauri TB and TA; gazetteers GA and GB, labels GEOnet and ADL Gazetteer]
22
Instance-based Thesauri Mapping Approach
  • Conceptual and Statistical Model
    • n(ta, tb): the number of occurrences of pairs of objects fa and fb such that
      • fa ∈ GA and fb ∈ GB
      • fa and fb are duplicates, i.e. they represent the same real-world feature
      • ta and tb are the types of fa and fb, respectively
    • n(ta): the number of entries in FA classified as ta

[Figure: sets FA and FB]
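The two counts can be accumulated in one pass over the detected duplicate pairs. The sketch below assumes that each duplicate pair has already been reduced to its pair of types (ta, tb), and that n(ta) counts the GA entries of type ta that take part in a duplicate pair; that reading is an assumption, though it agrees with the deck's count table, where n(islands) = 600 = 562 + 38. The function name and the toy input are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_pairs(type_pairs: Iterable[Tuple[str, str]]):
    """Return (n_pair, n_ta) from one (ta, tb) tuple per duplicate pair."""
    n_pair = Counter()  # n(ta, tb)
    n_ta = Counter()    # n(ta)
    for ta, tb in type_pairs:
        n_pair[(ta, tb)] += 1
        n_ta[ta] += 1
    return n_pair, n_ta


# Toy input, not real gazetteer data.
toy = [("islands", "ISL")] * 3 + [("islands", "ISLS")] + [("bays", "BAY")] * 2
n_pair, n_ta = count_pairs(toy)
```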








23
Instance-based Thesauri Mapping Approach
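  • Conceptual and Statistical Model
    • P(ta, tb): Mapping Rate Estimator
      • an estimate of the frequency with which the term ta maps to tb, for each pair of terms ta ∈ TA and tb ∈ TB

[Figure: sets FA and FB, with the formula for P(ta, tb); the formula and its "where" clause are not preserved in this transcript]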
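Since the estimator's formula did not survive in the transcript, the following is only a sketch of two plausible forms, not the authors' definition. The plain form is the relative frequency n(ta, tb) / n(ta). The smoothed form, (n(ta, tb) + 1/|TB|) / (n(ta) + 1), is an assumption: with |TB| = 642 it gives 0.50078 when n(ta, tb) = n(ta) = 1, which coincides with the value reported for forests/RESF in the testing step, but the transcript does not confirm the counts behind that value. The function and parameter names are hypothetical.

```python
def mapping_rate(n_pair: int, n_ta: int, n_terms_tb: int = 0) -> float:
    """Estimate P(ta, tb) from the counts n(ta, tb) and n(ta).

    n_terms_tb == 0: plain relative frequency n(ta, tb) / n(ta).
    n_terms_tb > 0:  ASSUMED smoothing (n(ta, tb) + 1/|TB|) / (n(ta) + 1),
                     where n_terms_tb stands for |TB|.
    """
    if n_terms_tb > 0:
        return (n_pair + 1.0 / n_terms_tb) / (n_ta + 1)
    if n_ta == 0:
        return 0.0  # no evidence for ta at all
    return n_pair / n_ta


plain = mapping_rate(562, 600)          # counts for islands / ISL
smoothed = mapping_rate(1, 1, 642)      # single co-occurrence, |TB| = 642
```

One effect of the assumed smoothing is that a pair seen only once no longer gets a rate of 1.0, so sparse evidence is not treated as certainty.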
24
Instance-based Thesauri Mapping Approach
  • Conceptual and Statistical Model
    • τ: Threshold Mapping Rate
      • m(tb) = ta iff P(ta, tb) ≥ τ
    • Problem: what is the value of τ?

[Figure: sets FA and FB]
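Given estimated mapping rates and a threshold, the mapping m can be built by keeping the pairs whose rate reaches the threshold. The sketch below is one possible reading: when several ta qualify for the same tb it keeps the one with the highest rate, which is an assumption the slide does not state. The rates are the example values from the deck's testing step, except the wetlands/FLTT rate, which is made up to show a rejected pair.

```python
from typing import Dict, Tuple


def build_mapping(rates: Dict[Tuple[str, str], float],
                  threshold: float) -> Dict[str, str]:
    """Return m as a dict tb -> ta for pairs with P(ta, tb) >= threshold."""
    m: Dict[str, str] = {}
    best: Dict[str, float] = {}
    for (ta, tb), p in rates.items():
        # Assumed tie-break: keep the ta with the highest rate for each tb.
        if p >= threshold and p > best.get(tb, -1.0):
            m[tb] = ta
            best[tb] = p
    return m


rates = {
    ("agricultural sites", "FRM"): 0.96974,
    ("bays", "BAY"): 0.50039,
    ("forests", "RESF"): 0.50078,
    ("islands", "ISL"): 0.93422,
    ("lakes", "LK"): 0.91849,
    ("wetlands", "FLTT"): 0.30,  # hypothetical value, below the threshold
}

m = build_mapping(rates, threshold=0.4)
```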








26
Experiments with Geographic Data
  • Data collection
    • ADL Gazetteer (ADL Feature Type Thesaurus, TA)
      • Instances: 16783
      • Thesaurus terms: 210
    • GEOnet Names Server (GEOnet Thesaurus, TB)
      • Instances: 87608
      • Thesaurus terms: 642

27
Experiments with Geographic Data
  • Model Evaluation Test
    • Data collected was partitioned into 7 datasets
      • 6 for tuning
      • 1 for testing

[Figure: the 7 datasets, split into 6 tuning sets and 1 testing set]
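A partition of this kind, with the tuning sets then reused for 6-fold cross-validation, might be sketched as follows. How the deck actually assigned entries to datasets is not stated, so the random shuffle, the fixed seed, the choice of the last part as the testing set, and the function names are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def partition(items: Sequence, k: int = 7, seed: int = 0) -> List[list]:
    """Shuffle the items and deal them into k parts of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def tuning_folds(parts: List[list]) -> Tuple[List[Tuple[list, list]], list]:
    """Hold out the last part for testing; build one fold per tuning part."""
    *tuning, testing = parts
    folds = []
    for i, validation in enumerate(tuning):
        # Train on the other tuning parts, validate on part i.
        training = [x for j, part in enumerate(tuning) if j != i for x in part]
        folds.append((training, validation))
    return folds, testing


parts = partition(range(70))
folds, testing = tuning_folds(parts)
```

With 7 parts this yields 6 folds, each training on 5 tuning sets and validating on the sixth, while the testing set is never seen during tuning.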
28
Experiments with Geographic Data
[Figure: the collected data, with the testing set held out and the remaining sets used for 6-fold cross-validation]
30
Experiments with Geographic Data
ADL (ta)                       | GEOnet (tb) | n(ta, tb)
bays                           | BAY         | 40
beaches                        | BCH         | 66
countries, 2nd order divisions | ADM2        | 30
forests                        | PRK         | 2
islands                        | ISL         | 562
islands                        | ISLS        | 38
mountains                      | HLL         | 222
mountains                      | HLLS        | 166
physiographic features         | UPLD        | 204
populated places               | PPL         | 10961
populated places               | PPLL        | 12
populated places               | ISL         | 6
power generation sites         | PS          | 10
railroad features              | RSTN        | 406
railroad features              | RSTP        | 201
ridges                         | RDGE        | 131
ridges                         | SPUR        | 55
wetlands                       | MRSH        | 9
wetlands                       | FLTT        | 5
...

ADL (ta)               | n(ta)
islands                | 600
mountains              | 388
physiographic features | 204
populated places       | 10979
power generation sites | 10
railroad features      | 607
ridges                 | 186
wetlands               | 14
...
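For the rows listed, each n(ta) equals the sum of n(ta, tb) over tb, for example 562 + 38 = 600 for islands. The short check below recomputes those sums from the counts shown on the slide and also derives the plain relative frequency n(ta, tb) / n(ta) for each pair. The relative frequency is used here as an illustrative assumption and is not necessarily the estimator behind the rates reported in the deck.

```python
# Counts copied from the slide's tables (only types that appear in both).
n_pair = {
    ("islands", "ISL"): 562,
    ("islands", "ISLS"): 38,
    ("mountains", "HLL"): 222,
    ("mountains", "HLLS"): 166,
    ("physiographic features", "UPLD"): 204,
    ("populated places", "PPL"): 10961,
    ("populated places", "PPLL"): 12,
    ("populated places", "ISL"): 6,
    ("power generation sites", "PS"): 10,
    ("railroad features", "RSTN"): 406,
    ("railroad features", "RSTP"): 201,
    ("ridges", "RDGE"): 131,
    ("ridges", "SPUR"): 55,
    ("wetlands", "MRSH"): 9,
    ("wetlands", "FLTT"): 5,
}
n_ta_reported = {
    "islands": 600,
    "mountains": 388,
    "physiographic features": 204,
    "populated places": 10979,
    "power generation sites": 10,
    "railroad features": 607,
    "ridges": 186,
    "wetlands": 14,
}

# Sum n(ta, tb) over tb for each ta.
row_sums = {}
for (ta, _tb), n in n_pair.items():
    row_sums[ta] = row_sums.get(ta, 0) + n

# Plain relative frequency per pair; no smoothing is applied here.
share = {pair: n / n_ta_reported[pair[0]] for pair, n in n_pair.items()}
```

On these counts, islands maps to ISL in about 94% of the duplicate pairs, while populated places maps to ISL in well under 1%.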
32
Experiments with Geographic Data
[Figure: validation step of the 6-fold cross-validation over the collected data; fold k uses a training set (Tk) and a validation set (Vk)]
34
Experiments with Geographic Data
[Figure: the 6-fold cross-validation over the collected data yields the estimated threshold mapping rate; the testing set is held out]
36
Experiments with Geographic Data
Testing Step

[Figure: the threshold obtained from 6-fold cross-validation on the collected data is applied to the testing set]

Threshold: 0.4

C  | P  | Accuracy (C/P)
26 | 29 | 89.7%

Legend: C = correct term alignments; P = proposed term alignments

Example aligned terms:

ta                 | tb   | P(ta, tb)
agricultural sites | FRM  | 0.96974
bays               | BAY  | 0.50039
forests            | RESF | 0.50078
islands            | ISL  | 0.93422
lakes              | LK   | 0.91849
...
38
Conclusions
  • Conclusions
    • duplicates help reclassification!
    • a semantic approach may work when syntactic approaches fail (badly)
  • If you buy the idea, you also get...
    • a strategy to gradually learn how to reclassify gazetteer entries (as in a mediator)
    • a strategy to mediate access to object catalogs in general (as long as it is possible to detect duplicates)
  • (Gazetteer for the Brazilian territory
    • extracted from the ADL Gazetteer
    • entries classified according to 4 different (aligned) schemes
    • encapsulated by Web services)
