Title: Information Retrieval
1Information Retrieval
Lecture 22 Non-Textual Materials 1
2The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System." 19th ACM Symposium on Operating Systems Principles, October 2003.
http://www.cs.rochester.edu/sosp2003/papers/p125-ghemawat.pdf
"Component failures are the norm rather than the exception.... The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies...."
3Examples of Non-textual Materials
Content                  Attribute
maps                     lat. and long., content
photograph               subject, date and place
bird songs and images    field mark, bird song
software                 task, algorithm
data set                 survey characteristics
video                    subject, date, etc.
4Possible Approaches to Information Discovery for
Non-text Materials
Human indexing: Manually created metadata records
Automated information retrieval: Automatically created metadata records (e.g., image recognition)
Context: associated text, links, etc. (e.g., Google image search)
Multimodal: combine information from several sources
User expertise: browsing, user interface design
5Example 1 Blobworld
8Surrogates
Surrogates for searching: catalog records, finding aids, classification schemes
Surrogates for browsing: summaries (thumbnails, titles, skims, etc.)
9Catalog Records for Non-Textual Materials
General metadata standards, such as Dublin Core
and MARC, can be used to create a textual catalog
record of non-textual items. Subject-based
metadata standards apply to specific categories
of materials, e.g., FGDC for geospatial
materials. Text-based searching methods can be
used to search these catalog records.
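As a rough illustration of text-based searching over catalog records, the sketch below stores two records as dictionaries and matches query words against their text fields. The records, their identifiers, and the `search` helper are hypothetical; the field names merely follow Dublin Core-style element names, and real catalog systems use far richer indexing.

```python
import re

# Hypothetical catalog records with Dublin Core-style field names.
RECORDS = [
    {"id": "photo-001", "title": "Threshing crew at a farm", "type": "StillImage",
     "subject": "agriculture; harvest", "coverage": "North Dakota"},
    {"id": "map-017", "title": "Topographic map of Santa Barbara", "type": "Image",
     "subject": "topography", "coverage": "Santa Barbara, California"},
]


def tokens(text):
    """Lower-case a string and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))


def search(records, query):
    """Return ids of records whose text fields contain every query word."""
    wanted = tokens(query)
    return [
        rec["id"]
        for rec in records
        if wanted <= tokens(" ".join(v for k, v in rec.items() if k != "id"))
    ]
```

Here `search(RECORDS, "santa barbara map")` matches only the map record, even though the item itself contains no searchable text.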
10Automated Creation of Metadata Records
Sometimes it is possible to generate metadata automatically from the content of a digital object. The effectiveness varies from field to field.
Examples:
- Images -- characteristics of color, texture, shape, etc. (crude)
- Music -- optical recognition of score (good)
- Bird song -- spectral analysis of sounds (good)
- Fingerprints (good)
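To make the "characteristics of color" case concrete, here is a minimal sketch that derives a coarse color histogram from raw pixel values. The function names and the choice of bins are assumptions for illustration only; the slides do not say which features any particular system computes.

```python
from collections import Counter


def color_histogram(pixels, levels=2):
    """Count pixels per coarse color bin.

    `pixels` is an iterable of (r, g, b) tuples with values 0-255;
    each channel is divided into `levels` equal ranges.
    """
    return Counter(
        (r * levels // 256, g * levels // 256, b * levels // 256)
        for r, g, b in pixels
    )


def dominant_bin(pixels, levels=2):
    """Return the most frequent color bin, usable as a crude metadata value."""
    return color_histogram(pixels, levels).most_common(1)[0][0]
```

A histogram like this says nothing about what the image depicts, which is one reason such automatically generated image metadata is rated "crude".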
11Collections Finding Aids and the EAD
Finding aid: A list, inventory, index or other textual document created by an archive, library or museum to describe holdings. May provide fuller information than is normally contained within a catalog record or be less specific. Does not necessarily have a detailed record for every item.
The Encoded Archival Description (EAD): A format (XML DTD) used to encode electronic versions of finding aids. Heavily structured -- much of the information is derived from hierarchical relationships.
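The sketch below shows what "derived from hierarchical relationships" can mean in practice: a component's full description is assembled from the titles of its ancestors. The XML is a heavily simplified, made-up fragment that borrows a few EAD element names (`archdesc`, `dsc`, `c01`, `c02`, `did`, `unittitle`); it is not a valid EAD document, and `titled_paths` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified, invented fragment using a few EAD element names.
FRAGMENT = """
<archdesc level="collection">
  <did><unittitle>Example Family Papers</unittitle></did>
  <dsc>
    <c01 level="series">
      <did><unittitle>Correspondence</unittitle></did>
      <c02 level="file">
        <did><unittitle>Letters, 1901</unittitle></did>
      </c02>
    </c01>
  </dsc>
</archdesc>
"""


def titled_paths(elem, ancestors=()):
    """Yield the chain of titles from the collection down to each component."""
    title = elem.findtext("did/unittitle")
    path = ancestors + (title,) if title else ancestors
    if title:
        yield path
    for child in elem:
        if child.tag != "did":
            yield from titled_paths(child, path)


paths = list(titled_paths(ET.fromstring(FRAGMENT.strip())))
```

The file-level entry "Letters, 1901" is only meaningful together with its series and collection titles, which is why a finding aid cannot simply be flattened into independent item records.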
12Collection-Level Metadata
Collection-level metadata is used to describe a
group of items. For example, one record might
describe all the images in a photographic
collection. Note: There are proposals to add
collection-level metadata records to Dublin Core.
However, a collection is not a document-like
object.
13Collection-Level Metadata
14Example 2 Photographs
Photographs in the Library of Congress' American Memory collections
In American Memory, each photograph is described by a MARC record.
The photographs are grouped into collections, e.g., The Northern Great Plains, 1880-1920: Photographs from the Fred Hultstrand and F.A. Pazandak Photograph Collections
Information discovery is by:
- searching the catalog records
- browsing the collections
18Photographs Cataloguing Difficulties
Automatic
- Image recognition methods are very primitive
Manual
- Photographic collections can be very large
- Many photographs may show the same subject
- Photographs have little or no internal metadata (no title page)
- The subject of a photograph may not be known (Who are the people in a picture? Where is the location?)
19Photographs Difficulties for Users
Searching
- Often difficult to narrow the selection down by searching -- browsing is required
- Criteria may be different from those in catalog (e.g., graphical characteristics)
Browsing
- Offline. Handling many photographs is tedious. Photographs can be damaged by repeated handling.
- Online. Viewing many images can be tedious. Screen quality may be inadequate.
20Example 3 Geospatial Information
Example: Alexandria Digital Library at the University of California, Santa Barbara
- Funded by the NSF Digital Libraries Initiative since 1994.
- Collections include any data referenced by a geographical footprint: terrestrial maps, aerial and satellite photographs, astronomical maps, databases, related textual information
- Program of research with practical implementation at the university's map library
21Alexandria User Interface
22Alexandria Computer Systems and User Interfaces
- Computer systems
  - Digitized maps and geospatial information -- large files
  - Wavelets provide multi-level decomposition of image
    -> first level is a small coarse image
    -> extra levels provide greater detail
- User interfaces
  - Small size of computer displays
  - Slow performance of Internet in delivering large files
    -> retain state throughout a session
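The idea of a small coarse image followed by levels of extra detail can be sketched with a simple averaging pyramid. This is a stand-in, not the wavelet transform itself: averaging 2x2 blocks corresponds only to the coarse (low-pass) part of a Haar wavelet step, and the detail coefficients that a real wavelet codec would send at each level are omitted. The function names are hypothetical.

```python
def halve(image):
    """Average each 2x2 block of a grayscale image.

    `image` is a list of equal-length rows with even width and height.
    """
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, len(image[0]), 2)
        ]
        for r in range(0, len(image), 2)
    ]


def pyramid(image, levels):
    """Return [coarsest, ..., original], the order a client could fetch them."""
    stack = [image]
    for _ in range(levels):
        stack.append(halve(stack[-1]))
    return stack[::-1]
```

A client on a slow connection can display the first, smallest level immediately and request finer levels only for the region the user zooms into.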
23Alexandria Information Discovery
Metadata for information discovery
Coverage: geographical area covered, such as the city of Santa Barbara or the Pacific Ocean.
Scope: varieties of information, such as topographical features, political boundaries, or population density.
Latitude and longitude provide basic metadata for maps and for geographical features.
24Gazetteer
Gazetteer: database and a set of procedures that translate representations of geospatial references: place names, geographic features, coordinates, postal codes, census tracts
Search engine tailored to peculiarities of searching for place names.
Research is making steady progress at feature extraction, using automatic programs to identify objects in aerial photographs or printed maps -- topic for long-term research.
25Gazetteers
The Alexandria Digital Library (ADL): geolibrary at University of California at Santa Barbara where a primary attribute of objects is location on Earth (e.g., map, satellite photograph).
Geographic footprint: latitude and longitude values that represent a point, a bounding box, a linear feature, or a complete polygonal boundary.
Gazetteer: list of geographic names, with geographic locations and other descriptive information.
Geographic name: proper name for a geographic place or feature (e.g., Santa Barbara County, Mount Washington, St. Francis Hospital, and Southern California)
26Use of a Gazetteer
- Answers the "Where is" question; for example, "Where is Santa Barbara?"
- Translates between geographic names and locations. A user can find objects by matching the footprint of a geographic name to the footprints of the collection objects.
- Locates particular types of geographic features in a designated area. For example, a user can draw a box around an area on a map and find the schools, hospitals, lakes, or volcanoes in the area.
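The name-to-footprint translation and footprint matching described above can be sketched as follows. Everything here is an assumption made for illustration: the `Box` class, the tiny gazetteer and collection, and the rough coordinate values. Footprints are limited to bounding boxes, and boxes that cross the 180-degree meridian are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding-box footprint in decimal degrees."""
    west: float
    south: float
    east: float
    north: float

    def overlaps(self, other):
        """True if the two boxes share any area or edge."""
        return (self.west <= other.east and other.west <= self.east
                and self.south <= other.north and other.south <= self.north)


# Hypothetical gazetteer and collection; coordinates are rough, illustrative values.
GAZETTEER = {"santa barbara": Box(-119.9, 34.3, -119.6, 34.5)}
COLLECTION = {
    "aerial-photo-12": Box(-119.8, 34.35, -119.7, 34.45),
    "map-of-tulsa": Box(-96.1, 36.0, -95.8, 36.3),
}


def find_objects(place_name):
    """Translate a name to its footprint, then match it against object footprints."""
    footprint = GAZETTEER.get(place_name.lower())
    if footprint is None:
        return []
    return [name for name, box in COLLECTION.items() if footprint.overlaps(box)]
```

Drawing a box on a map corresponds to skipping the gazetteer lookup and calling `overlaps` directly with the user's box.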
27Alexandria Gazetteer Example from a search on
"Tulsa"
Feature name                             State  County  Type     Latitude  Longitude
Tulsa                                    OK     Tulsa   pop pl   360914N   0955933W
Tulsa Country Club                       OK     Osage   locale   360958N   0960012W
Tulsa County                             OK     Tulsa   civil    360600N   0955400W
Tulsa Helicopters Incorporated Heliport  OK     Tulsa   airport  360500N   0955205W
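The coordinate strings in this table appear to pack degrees, minutes, and seconds together with a hemisphere letter (for example, 360914N read as 36 degrees 09 minutes 14 seconds north). Under that assumption, a small hypothetical helper can convert them to the signed decimal degrees that footprint comparisons usually need:

```python
def dms_to_decimal(text):
    """Convert 'DDMMSSN' or 'DDDMMSSW' to signed decimal degrees.

    Assumes the last two digits are seconds, the two before them minutes,
    and the remaining leading digits degrees. South and west are negative.
    """
    digits, hemisphere = text[:-1], text[-1].upper()
    if hemisphere not in "NSEW" or not digits.isdigit() or len(digits) not in (6, 7):
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees = int(digits[:-4])
    minutes = int(digits[-4:-2])
    seconds = int(digits[-2:])
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in "SW" else value
```

With this reading, the first row (Tulsa, populated place) comes out at roughly 36.154 degrees north and 95.993 degrees west.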
28Challenges for the Alexandria Gazetteer
Content standard: A standard conceptual schema for gazetteer information.
Feature types: A type scheme to categorize individual features that is rich in term variants and extensible.
Temporal aspects: Geographic names and attributes change through time.
"Fuzzy" footprints: Extent of a geographic feature is often approximate or ill-defined (e.g., Southern California).
29Challenges for the Alexandria Gazetteer
(continued)
Quality aspects: (a) Indicate the accuracy of latitude and longitude data. (b) Ensure that the reported coordinates agree with the other elements of the description.
Spatial extents: (a) Points do not represent the extent of the geographic locations and are therefore only minimally useful. (b) Bounding boxes often include too much territory (e.g., the bounding box for California also includes Nevada).
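The bounding-box problem is easy to reproduce with a toy shape. The sketch below uses a hypothetical L-shaped region (not real state boundaries) and compares a bounding-box test with a point-in-polygon test based on standard ray casting. All names are illustrative.

```python
def bounding_box(polygon):
    """Return (west, south, east, north) for a list of (x, y) vertices."""
    xs, ys = zip(*polygon)
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point, box):
    x, y = point
    west, south, east, north = box
    return west <= x <= east and south <= y <= north


def in_polygon(point, polygon):
    """Ray casting: count how many edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # edge spans the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical L-shaped region; its bounding box is the full 4 x 4 square.
REGION = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
OUTSIDE_POINT = (3, 3)  # in the empty corner of the L
```

`OUTSIDE_POINT` passes the bounding-box test but fails the polygon test, the same kind of false match as a place in Nevada falling inside California's bounding box. The trade-off is that polygon footprints are more expensive to store and compare.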
30Alexandria Gazetteer
Alexandria Digital Library
Linda L. Hill, James Frew, and Qi Zheng, "Geographic Names: The Implementation of a Gazetteer in a Georeferenced Digital Library." D-Lib Magazine, 5(1), January 1999.
http://www.dlib.org/dlib/january99/hill/01hill.html
31Alexandria Thesaurus Example
canals: A feature type category for places such as the Erie Canal.
Used for: The category canals is used instead of any of the following.
- canal bends
- canalized streams
- ditch mouths
- ditches
- drainage canals
- drainage ditches
- ... more ...
Broader Terms: Canals is a sub-type of hydrographic structures.
32Alexandria Thesaurus Example (continued)
canals (continued)
Related Terms: The following is a list of other categories related to canals (non-hierarchical relationships).
- channels
- locks
- transportation features
- tunnels
Scope Note: Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power. (Definition of canals.)
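The thesaurus entry above maps naturally onto a few lookup tables. The sketch below uses only the terms listed on these two slides (the "... more ..." variants are left out), while the data layout and the `normalize` and `expand` helpers are assumptions, not the Alexandria implementation.

```python
# Terms copied from the slide example; the structure is illustrative.
USED_FOR = {
    "canals": [
        "canal bends", "canalized streams", "ditch mouths",
        "ditches", "drainage canals", "drainage ditches",
    ],
}
BROADER = {"canals": "hydrographic structures"}
RELATED = {"canals": ["channels", "locks", "transportation features", "tunnels"]}

# Invert the used-for lists: variant term -> preferred category.
PREFERRED = {
    variant: term for term, variants in USED_FOR.items() for variant in variants
}


def normalize(term):
    """Map a variant such as 'ditches' to its preferred category."""
    term = term.lower().strip()
    return PREFERRED.get(term, term)


def expand(term):
    """Return the preferred category with its broader and related terms."""
    preferred = normalize(term)
    return {
        "preferred": preferred,
        "broader": BROADER.get(preferred),
        "related": RELATED.get(preferred, []),
    }
```

A search for "drainage ditches" would thus be run against the feature type canals, and could optionally be widened to hydrographic structures.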