Title: Seamless Searching of Numeric and Textual Resources
1. Seamless Searching of Numeric and Textual Resources
Michael Buckland
School of Information Management and Systems
University of California, Berkeley
2. The Significance of Vocabulary
- An economic claim: vocabulary problems reduce the benefits and return on investment in information services.
- Vocabulary is used for indexicality; therefore issues of identity are central to LIS.
- Vocabulary is central to digital libraries.
- Vocabulary is central to explaining the history of conceptions of LIS!
3. God --- Knowableness --- History of doctrines --- Early church, ca. 30-600 --- Congresses.
4. Economic Rationale
- Massive investment in repositories
- Large investment in categorization schemes: classifications, thesauri, concept codes, headings
- Categorization schemes usually specialized and stylized
- Increasingly unfamiliar to searchers, hence ineffective, inefficient use
5. Remedy
Support for searching unfamiliar metadata vocabularies.
Interface to translate the searcher's vocabulary into the system's vocabulary.
6. Examples
Automobile import, export data (Census Bureau)
Automobiles?
No data.
Cars?
Railway or tramway stock
(Passenger motor vehicles, spark ignition engine.)
7. In Library of Congress Classification: TL 205
In U.S. Patent Classification: 180/280
In Standard Industrial Classification: 3711
8. Example: Coastal pollution
F SU COASTAL POLLUTION 0
F TW COASTAL POLLUTION SUMMARIZE SUBJECTS
MeSH: Seawater; Water pollution; Bacteria; Water microbiology; Air pollution; Environmental monitoring; Bathing beaches
LCSH: Marine pollution; Coastal zone management; Water --- Pollution; Petroleum industry and trade; Beach erosion; Coasts; Barrier islands
9. International Harmonized Commodity Classification System: Computer
- HS 84: Nuclear reactors, boilers, machines and mechanical appliances
- HS 8471: Automatic data processing machines and units thereof, magnetic or optical readers, machines for transcribing data
- HS 847120: Digital auto data proc mach contng in the same housing a CPU and input output device
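The nesting shown on this slide (84, then 8471, then 847120) can be sketched in code. The following is a minimal illustration, assuming the usual reading of a six-digit HS code as a two-digit chapter, a four-digit heading, and a six-digit subheading. The function names `hs_levels` and `describe` and the `HS_LABELS` table are hypothetical; the labels are copied from the slide, with the 8471 label shortened.

```python
# Labels taken from the slide; a real application would load the full schedule.
HS_LABELS = {
    "84": "Nuclear reactors, boilers, machines and mechanical appliances",
    "8471": "Automatic data processing machines and units thereof",
    "847120": "Digital auto data proc mach contng in the same housing "
              "a CPU and input output device",
}


def hs_levels(code):
    """Return the chapter, heading, and subheading prefixes of a 6-digit code."""
    digits = code.replace(".", "").replace(" ", "")
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError("expected a six-digit HS code, got %r" % code)
    return [digits[:2], digits[:4], digits[:6]]


def describe(code):
    """Pair each level of the code with its label, from general to specific."""
    return [(p, HS_LABELS.get(p, "(label not loaded)")) for p in hs_levels(code)]


if __name__ == "__main__":
    for prefix, label in describe("8471.20"):
        print(prefix, "-", label)
```

Walking from the specific code back up to its chapter is one way an interface could show a searcher where "computer" sits in a scheme whose top-level label never mentions computers.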
10. INSPEC Thesaurus: subdomain-based indexes
- Water subdomain: Fission reactor safety; Fission reactor fuel; Polymers; Organic insulating materials; Water supply; Cable insulation; Insulation testing; Insulating oils.
- Biology subdomain: Water; Biomechanics; Physiological models; Neurophysiology; Cellular effects of radiation.
- Information Studies subdomain: Agriculture; Natural resources; Forecasting theory; Operations research; Erosion.
11. Example: Vietnam War
U.C. MELVYL Online Catalog
FIND XSU VIETNAM WAR: Search Results 0 records
FIND XSU VIETNAMESE CONFLICT: Search Results 4,190 records
12. Emanuel Goldberg: Aerial photography using a Drachen
Example: Tethered balloons. English: Aerostat. German: Drachen ("Kite" in dictionary)
13. Entry vocabulary search interfaces
- Software and algorithms map natural language vocabulary to specialized metadata terms.
- Allows users to enter ordinary language queries while taking advantage of existing subject headings and categorization schemes.
- Uses co-occurrence statistics to link users' ordinary language terms to system vocabularies.
- Statistical association between lexical items in titles and abstracts and the system's metadata vocabulary.
- Suggests the most likely system vocabulary.
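The co-occurrence idea above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the slides do not say which association measure is used, so pointwise mutual information is assumed here. The titles in `RECORDS` are invented, the two headings come from the earlier example slides, and the names `build_dictionary` and `suggest` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training records: (title text, assigned metadata terms).
# The titles are invented for illustration; they are not real catalog data.
RECORDS = [
    ("vietnam war memoirs", ["Vietnamese Conflict"]),
    ("the war in vietnam", ["Vietnamese Conflict"]),
    ("coastal pollution and beaches", ["Marine pollution"]),
    ("oil pollution on the coast", ["Marine pollution"]),
]


def build_dictionary(records):
    """Count how often each title word co-occurs with each metadata term."""
    word_counts = Counter()
    term_counts = Counter()
    pair_counts = defaultdict(Counter)
    for text, terms in records:
        words = set(text.lower().split())
        for w in words:
            word_counts[w] += 1
        for t in set(terms):
            term_counts[t] += 1
            for w in words:
                pair_counts[w][t] += 1
    return word_counts, term_counts, pair_counts, len(records)


def suggest(query, model, top_n=3):
    """Rank metadata terms by summed PMI with the words of the query."""
    word_counts, term_counts, pair_counts, n = model
    scores = Counter()
    for w in set(query.lower().split()):
        for t, joint in pair_counts.get(w, {}).items():
            # Assumed measure: PMI = log( P(w, t) / (P(w) * P(t)) )
            scores[t] += math.log(joint * n / (word_counts[w] * term_counts[t]))
    return [t for t, _ in scores.most_common(top_n)]


if __name__ == "__main__":
    model = build_dictionary(RECORDS)
    print(suggest("Vietnam War", model))
    print(suggest("coastal pollution", model))
```

With a realistic training set the dictionary would be built from many thousands of indexed records, and a query word that never appears in any title simply contributes no score.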
14. Thesaurus navigation
- Facilitates browsing where structure is present: broader, narrower, related terms
- Guides searcher to other parts of the structure
Retrieval set analysis
- Navigation within micro-domain
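Browsing by broader, narrower, and related terms can be sketched with a small data structure. This is a minimal illustration under stated assumptions: the `Term` class and the helper names are hypothetical, and the links between the sample headings are invented for the example rather than copied from any real thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry with its three kinds of links."""
    name: str
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)


def link_broader(thesaurus, narrow, broad):
    """Record a broader/narrower pair in both directions."""
    thesaurus.setdefault(narrow, Term(narrow)).broader.append(broad)
    thesaurus.setdefault(broad, Term(broad)).narrower.append(narrow)


def link_related(thesaurus, a, b):
    """Record a symmetric related-term link."""
    thesaurus.setdefault(a, Term(a)).related.append(b)
    thesaurus.setdefault(b, Term(b)).related.append(a)


def neighbors(thesaurus, name):
    """Return the terms a searcher could move to from the given term."""
    t = thesaurus[name]
    return {"broader": t.broader, "narrower": t.narrower, "related": t.related}


if __name__ == "__main__":
    thesaurus = {}
    # Illustrative links only; not taken from an actual thesaurus.
    link_broader(thesaurus, "Marine pollution", "Water pollution")
    link_related(thesaurus, "Marine pollution", "Coastal zone management")
    print(neighbors(thesaurus, "Marine pollution"))
```

Storing each link in both directions lets the interface offer a step up, down, or sideways from whatever term the searcher has reached.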
15. Web access
- WWW forms-based application supported by Perl
- Supports searches on remote repositories
- Four subdomain dictionaries in three databases:
--- BIOSIS (Biological Abstracts): subdomain water
--- INSPEC: subdomains information science, water
--- U.S. Patent Office classification
16. Statement of work
- Varied prototype Entry Vocabulary Modules (EVMs).
- Unintrusive development of EVMs by agents.
- Sensitivity to subdomains.
- Natural language processing to augment statistical term frequency.
- Recommendations for metadata codebooks for numeric databases.
- www.sims.berkeley.edu/metadata/