Title: Geographic Information Retrieval (GIR)
Geographic Information Retrieval (GIR) Ranking
Methods for Digital Libraries
Ray R. Larson and Patricia Frontiera School of
Information Management Systems and College of
Environmental Design, University of California,
Berkeley -- ray_at_sims.berkeley.edu
- The Geographic Footprint
- In GIR applications, the geographic footprint is
typically the only quantitative spatial
characteristic that is encoded and utilized.
- The footprint is a geometric representation of
the extent of the geographic content of the
information object being described. It is usually
expressed in geographic coordinates (i.e.
latitude and longitude).
- Points maintain a general sense of location but
not extent or shape.
- Polygons identify location, extent, and shape
with varying degrees of precision.
- The minimum aligned bounding rectangle (MBR) is
the most commonly used polygonal spatial
representation in GIR systems.
- Spatial query formation
- A spatial approach to GIR requires a geographic
interface to support spatial thinking and query
formation.
- Spatial Queries: key issues
- Communicating with the user: if the user selects
a place name from a list, what type of geometric
approximation is used to represent the query
region (a point, a simple bounding box polygon, a
complex polygon)?
- The level of detail in a graphic interface needs
to be sufficient to support geographic queries.
- How can queries for more complex spatial
characteristics be supported?
- Density, dispersion, pattern
- Spatial Query Example: 1st and 2nd generation
interfaces from the FGDC/NSDI efforts
Geographic data are an extremely important
resource for a wide range of scientists,
planners, policy makers, and analysts who study
natural and planned environments. Notably, the
landscape of geographic analysis has been
changing rapidly from data- and computation-poor
to data- and computation-rich. Developments in
digital electronic technologies, such as
satellites, integrated GPS units, digital
cameras, and miniature sensors, are dramatically
increasing the types and amounts of digitally
available raw geographic data and derived
information products. At the same time, advances
in computer hardware, software and network
technologies continue to improve our ability to
store and analyze these large, complex data sets.
These factors are contributing to a growing
political, social, scientific and economic
awareness of the value of geographic information
and driving new applications for its use. In
response to this, geographic digital libraries
that specialize in providing access to these data
are growing in number, collection size, and
sophistication. Moreover, mainstream digital
libraries, i.e. those that deal primarily with
text materials, are increasingly considering
geographic access methods for information
resources that, while not specifically about
geographic features, have important geographic
characteristics. Simply stated, most of the
objects in digital libraries are, to a greater or
lesser extent, about or related to particular
places on or near the surface of the Earth.
Place name georeferencing is extremely
effective because names are the primary means by
which people refer to geographic locations.
However, place names have well-documented lexical
and geographical problems. Lexical problems
include lack of uniqueness, alternate names or
spellings, and name changes. Geographical
problems include boundaries that change over
time, places with ambiguous boundaries, and
geographic features or areas of interest without
known place names. Unlike place names,
geographic coordinate representations provide an
unambiguous and persistent method for locating
geographic areas or features. However, the use
of coordinates presents many challenges in terms
of storage, indexing, processing and user
interface design that only recently have begun to
be investigated in the context of geographic
information retrieval (GIR).
- Spatial Similarity Measures: Matching and Spatial
Ranking
- Spatial similarity can be considered an
indicator of relevance: documents whose spatial
content is more similar to the spatial content of
the query will be considered more relevant to the
information need represented by the query.
- Need to consider both qualitative, non-geometric
spatial attributes and quantitative, geometric
spatial attributes.
- Three basic approaches to spatial similarity
measures and ranking
- Method 2: Topological Overlap
- Spatial searches are constrained to only those
candidate GIOs that either
- are completely contained within the query region,
- overlap with the query region,
- or contain the query region.
- Each category is exclusive and all retrieved
items are considered relevant.
- The result set cannot be ranked:
- categorized topological relationship only,
- no metric refinement
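The three exclusive categories of this method can be sketched for footprints stored as MBRs. This is only an illustration under stated assumptions: the `MBR` type and the function names are hypothetical (not from the original system), coordinates are treated as planar longitude/latitude, and rectangles crossing the antimeridian are not handled.

```python
from typing import NamedTuple


class MBR(NamedTuple):
    """Hypothetical axis-aligned rectangle in (longitude, latitude) degrees."""
    west: float
    south: float
    east: float
    north: float


def contains(outer: MBR, inner: MBR) -> bool:
    """True if `inner` lies entirely inside (or exactly on) `outer`."""
    return (outer.west <= inner.west and outer.south <= inner.south
            and outer.east >= inner.east and outer.north >= inner.north)


def intersects(a: MBR, b: MBR) -> bool:
    """True if the two rectangles share some interior area."""
    return (a.west < b.east and b.west < a.east
            and a.south < b.north and b.south < a.north)


def topological_category(query: MBR, gio: MBR) -> str:
    """Assign a candidate GIO to exactly one category relative to the query.

    Assumption: identical rectangles are reported as "within".
    """
    if not intersects(query, gio):
        return "disjoint"   # not retrieved at all
    if contains(query, gio):
        return "within"     # GIO completely contained in the query region
    if contains(gio, query):
        return "contains"   # GIO contains the query region
    return "overlaps"       # partial overlap only
```

A search restricted to one category then keeps only the candidates whose category matches; every kept item has the same status, so there is nothing to rank by.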
- Method 1: Simple Overlap
- Candidate geographic information objects (GIOs)
that have any overlap with the query region are
retrieved.
- Included in the result set are any GIOs that are
contained within, overlap, or contain the query
region.
- The spatial score for all GIOs is either relevant
(1) or not relevant (0).
- The result set cannot be ranked:
- topological relationship only, no metric
refinement
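As a rough sketch of the simple overlap method, the binary score can be computed for rectangles given as `(west, south, east, north)` tuples. The function names and the dictionary-based collection are assumptions made for illustration, not part of the original system.

```python
def simple_overlap_score(query, gio):
    """Return 1 if two (west, south, east, north) rectangles share any area, else 0."""
    qw, qs, qe, qn = query
    gw, gs, ge, gn = gio
    overlaps = qw < ge and gw < qe and qs < gn and gs < qn
    return 1 if overlaps else 0


def simple_overlap_search(query, collection):
    """Return the names of all GIOs scoring 1, in collection order.

    `collection` is assumed to map a GIO name to its MBR tuple.
    Because every hit has the same score, the order carries no ranking.
    """
    return [name for name, mbr in collection.items()
            if simple_overlap_score(query, mbr) == 1]
```

Contained, containing, and partially overlapping GIOs all pass the same test here, which is why this method needs no separate topological classification.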
- Method 3: Degree of Overlap
- Candidate geographic information objects (GIOs)
that have any overlap with the query region are
retrieved.
- A spatial similarity score is determined based on
the degree to which the candidate GIO overlaps
with the query region.
- The greater the overlap with respect to the query
region, the higher the spatial similarity score.
- This method provides a score by which the result
set can be ranked:
- topological relationship: overlap
- metric refinement: area of overlap
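One way to read the degree-of-overlap method is as the area of overlap divided by the area of the query region. The sketch below assumes that reading, treats `(west, south, east, north)` tuples as planar rectangles (areas in square degrees, which ignores map projection), and uses hypothetical function names.

```python
def overlap_area(a, b):
    """Area shared by two (west, south, east, north) rectangles; 0.0 if none."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def degree_of_overlap(query, gio):
    """Fraction of the query region covered by the candidate GIO (0.0 to 1.0)."""
    query_area = (query[2] - query[0]) * (query[3] - query[1])
    if query_area <= 0:
        raise ValueError("query region must have positive area")
    return overlap_area(query, gio) / query_area


def rank_by_overlap(query, collection):
    """Return (score, name) pairs for overlapping GIOs, best score first.

    `collection` is assumed to map a GIO name to its MBR tuple.
    """
    scored = [(degree_of_overlap(query, mbr), name)
              for name, mbr in collection.items()]
    hits = [pair for pair in scored if pair[0] > 0]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)
```

Note that a GIO much larger than the query still scores 1.0 under this reading, since only the query's area appears in the denominator.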
- Our approach estimates the probability of
relevance using logistic regression fitted to a
sample of data with relevance judgements.
- Test Data
- 2554 metadata records indexed by 322 unique
geographic regions (represented as MBRs) and
associated place names.
- 2072 records (81%) indexed by 141 unique CA place
names
- 881 records indexed by 42 unique counties (out of
a total of 46 unique counties indexed in the CEIC
collection)
- 427 records indexed by 76 cities (of 120)
- 179 records by 8 bioregions (of 9)
- 3 records by 2 national parks (of 5)
- 309 records by 11 national forests (of 11)
- 3 records by 1 regional water quality control
board region (of 1)
- 270 records by 1 state (CA)
- 482 records (19%) indexed by 179 unique user
defined areas (approx 240) for regions within or
overlapping CA
- 12% represent onshore regions (within the CA
mainland)
- 88% (158 of 179) offshore or coastal regions
- Geographic Approximations for CA Counties, UDAs,
and training sample
- X1 = area of overlap(query region, candidate GIO)
/ area of query region
- X2 = area of overlap(query region, candidate GIO)
/ area of candidate GIO
- X3 = 1 - abs(fraction of overlap region that is
onshore - fraction of candidate GIO that is
onshore)
- Where: range for all variables is 0 (not similar)
to 1 (same)
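The variables above, and the way a logistic regression model would turn them into a probability of relevance, can be sketched as follows. Everything named here is an assumption for illustration: footprints are planar `(west, south, east, north)` rectangles, the onshore fractions are supplied by the caller (computing them needs coastline data not shown), and the intercept and weights are parameters because the fitted coefficients are not given in these slides.

```python
import math


def rect_area(r):
    """Planar area of a (west, south, east, north) rectangle."""
    return (r[2] - r[0]) * (r[3] - r[1])


def overlap_area(a, b):
    """Area shared by two rectangles; 0.0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def x1_x2(query, gio):
    """X1 = overlap / query area, X2 = overlap / candidate GIO area."""
    shared = overlap_area(query, gio)
    return shared / rect_area(query), shared / rect_area(gio)


def x3(overlap_onshore_fraction, gio_onshore_fraction):
    """X3 = 1 - |onshore fraction of overlap - onshore fraction of GIO|."""
    return 1.0 - abs(overlap_onshore_fraction - gio_onshore_fraction)


def probability_of_relevance(features, intercept, weights):
    """Standard logistic function applied to a weighted sum of features.

    `intercept` and `weights` must come from a fitted model; none are
    supplied here because the source does not state them.
    """
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

Candidates would then be ranked by the estimated probability rather than by any single overlap ratio.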
- Geographic Information Retrieval (GIR)
- Definitions: Geographic information retrieval
(GIR) is concerned with spatial approaches to the
retrieval of geographically referenced, or
georeferenced, information objects (GIOs).
- GIOs are information objects that are about
specific regions or features on or near the
surface of the Earth.
- Geospatial data are a special type of
georeferenced information that encodes a specific
geographic feature or set of features along with
associated attributes:
- maps, air photos, satellite imagery, digital
geographic data, etc.
- Georeferencing and GIR
- Within a GIR system, e.g., a geographic digital
library, information objects can be georeferenced
by place names or by geographic coordinates (i.e.
longitude and latitude).
- GIR is not GIS
- GIS is concerned with spatial representations,
relationships, and analysis at the level of the
individual spatial object or field.
- GIR is concerned with the retrieval of geographic
information resources (and geographic information
objects at the set level) that may be relevant to
a geographic query region.
- Spatial Approaches to GIR
Geodata.gov
NSDI Clearinghouse
- 42 of 58 counties referenced in the test
collection metadata
- 10 counties randomly selected as query regions to
train the LR model
- 32 counties used as query regions to test the
model
The Geodata.gov site provides better support for
a query on wetlands near Petaluma because of the
increased cartographic detail that appears as the
user zooms in. (You can't even find Petaluma on
the NSDI site, and you can get even more lost if
you zoom in further.)
- These results suggest:
- Convex hulls perform better than MBRs
- Expected result given that the CH is a higher
quality approximation
- A probabilistic ranking based on MBRs can perform
as well as, if not better than, a
non-probabilistic ranking method based on convex
hulls
- Since any approximation other than the MBR
requires great expense, this suggests that the
exploration of new ranking methods based on the
MBR is a good way to go.
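To illustrate why a convex hull is a tighter approximation than an MBR, the sketch below computes both for a set of boundary points and compares their areas. It is a generic illustration rather than the study's code: it uses Andrew's monotone chain algorithm for the hull and the shoelace formula for area, treats coordinates as planar, and all function names are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def cross(o, a, b):
    """Z component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain repeats the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Planar polygon area via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0
```

For a roughly triangular region the hull covers about half the area of the MBR, so the MBR includes a large amount of space that is not part of the region. That extra area is what degrades overlap-based scores for irregular or diagonal footprints.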
Acknowledgements
This research was sponsored at U.C. Berkeley by
the National Science Foundation and the Joint
Information Systems Committee (UK) under the
International Digital Libraries Program award
IIS-99755164. Additional support was provided by
the Institute of Museum and Library Services as
part of the Going Places in the Catalog project.
Conservative Approximations