Title: XDMR Project Meeting, July 12, 2005
1. Mapping Metadata Across Genres and Languages
- XDMR Project Meeting, July 12, 2005
- Fredric C. Gey, University of California, Berkeley
- Colleagues: Michael Buckland, Aitao Chen, Ray Larson
- Programmers and Students: Barbara Norgard, Natalia Perelman
- Work performed under:
  - DARPA research contract "Search support for unfamiliar metadata vocabularies" (1997-2001)
  - DARPA TIDES research grant "Translingual Vocabulary Mappings using Domain Ontologies" (1999-2004)
  - Institute for Museum and Library Services grant "Seamless search of textual and numeric databases" (1999-2001), others (2002-2006)
Fredric C. Gey
2. HETEROGENEOUS DIGITAL INFORMATION SEARCH: Current Search Technology (multiple independent searches without search aids)
[Diagram labels: QUERY; Bibliography; Full Text; Maps and other Geospatial data; Music and other media]
3. SEARCHING UNFAMILIAR METADATA: PROBLEM STATEMENT
- Numerous databases are indexed by structured metadata classifications
- Classification schemes are highly specialized
- As digital libraries multiply in size and diversity, there is a need for search engines for non-specialists
- Search engines should translate from ordinary language to specialized classifications
Example: In U.S. Import-Export Database, "computer": No result found
4. Searching Statistical Information: Problem Statement
- Numeric statistical information is often thinly documented or documented with specialized vocabularies
  - Census: welfare -> public assistance
  - Foreign trade harmonized commodity classification: computer -> digital ADP machine
  - Standard Industrial Classification (SIC codes): automobile -> motor vehicle
- In search, the user's ordinary language term is unlikely to match the limited technical vocabulary used to document the statistical resource
- Can we remedy this situation?
5. Searching Statistical Information: Evidence-Poor Data
- Compared to searching textual databases, numeric statistical information and its terminology are both evidence-poor and highly technical
- Statistical databases lack the rich set of textual clues which can identify items of numeric information
- However, if we can find a textual resource associated with the numeric metadata, we may be able to mine the text to improve numeric search
- This is possible for some numeric classification schemes
6. SEARCHING UNFAMILIAR METADATA: EXAMPLES FROM FOREIGN TRADE IMPORTS-EXPORTS
- U.S. Foreign Trade Imports-Exports
  - Two CD-ROMs/month
  - Classified by 16,000 numeric commodity codes in 12 digit hierarchy
- Search for "automobile" or "computer" yields no result
- Search for "Car" yields "Railway or Tramway Stock, Etc."
- Need to know classifications "Passenger Motor Vehicle" or "Digital ADP Machine w/CPU"
Example: In U.S. Import-Export Database, "Car": Railway or Tramway Stock
7. SEARCHING UNFAMILIAR METADATA: U.S. STANDARD INDUSTRIAL CLASSIFICATION SYSTEM
- U.S. Standard Industrial Classification System (SIC)
- Used to classify and aggregate industrial activity in the U.S.
- Codes defined by Office of Management and Budget
- County Business Patterns reports annual employment, payroll, firm size by county, SIC code
Example: In U.S. SIC System, "Lobster": Nothing found
8. Searching Statistical Information: Economic Classification Codes
- Standard Industrial Classification (SIC) and North American Industry Classification System (NAICS) codes have been used to index trade magazines
- This provides a textual resource of hundreds of thousands of documents and millions of words associated with the numeric data
- Thus mappings can be made between the words from the magazine abstracts and the classifications
- User queries can be matched against words and phrases most closely associated with the particular numeric data classification
- A ranked list of classifications can be displayed to the user in order to improve the search
- Harmonized commodity classifications can be searched using SIC codes as a search proxy
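Note: the matching step described on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's implementation. It assumes a "most closely associated" word is one whose lift, P(word | classification) / P(word), exceeds 1. The training abstracts, the labels, and the names `build_profiles` and `rank_classifications` are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-ins for trade-magazine abstracts indexed by classification.
TRAINING = [
    ("lobster and crab catch rises for coastal fishermen", "Shellfish"),
    ("shrimp and lobster prices climb at the docks", "Shellfish"),
    ("new car and truck sales climb at dealers", "Motor vehicle"),
    ("automobile makers expand car assembly plants", "Motor vehicle"),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_profiles(pairs, min_lift=1.0):
    """Keep, per classification, the words whose lift exceeds min_lift.

    lift(word, cls) = P(word | cls) / P(word), estimated from raw counts.
    """
    word_total = Counter()
    word_by_class = defaultdict(Counter)
    for text, label in pairs:
        tokens = tokenize(text)
        word_total.update(tokens)
        word_by_class[label].update(tokens)
    n_all = sum(word_total.values())
    profiles = {}
    for label, counts in word_by_class.items():
        n_class = sum(counts.values())
        lift = {w: (c / n_class) / (word_total[w] / n_all)
                for w, c in counts.items()}
        profiles[label] = {w: v for w, v in lift.items() if v > min_lift}
    return profiles

def rank_classifications(query, profiles):
    """Return (score, classification) pairs, best first, dropping zero scores."""
    tokens = tokenize(query)
    scored = []
    for label, profile in profiles.items():
        score = sum(profile.get(t, 0.0) for t in tokens)
        if score > 0:
            scored.append((score, label))
    return sorted(scored, reverse=True)

profiles = build_profiles(TRAINING)
print(rank_classifications("lobster", profiles))
```

With this toy data the query "lobster" ranks only the "Shellfish" classification, which mirrors the "For Lobster, Try Shellfish" example elsewhere in the deck. A real training collection would need stop-word handling and far more text per classification.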
9. SEARCHING UNFAMILIAR METADATA: SOLUTION -- ENTRY VOCABULARY TECHNOLOGY
- Maps between ordinary language and specialized classifications
- Implemented using text categorization techniques
- Requires training collections which have been manually indexed
- Preserves and leverages investment in creation of complex classification structures
- Can also be applied to the task of multi-lingual information access
Example: In U.S. Patent Database, for "Automobile" try 180/280
10. UNFAMILIAR METADATA: PROJECT OVERVIEW
- Provides entry vocabulary modules for searching unfamiliar metadata (text, Patents, Standard Industrial Classification)
- Fundamental method uses Bayesian inference based upon training data (documents) to map between ordinary language and complex hierarchical metadata structures
- Technique can be applied to searching poorly documented statistical information
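Note: the deck does not specify which Bayesian model the project uses. As one plausible reading, the sketch below substitutes a multinomial naive Bayes classifier with add-one smoothing, trained on manually indexed documents. The class name `EntryVocabularySketch`, the training texts, and the labels are hypothetical, and the hierarchy of the classification scheme is ignored.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical manually indexed training documents: (text, classification).
TRAINING = [
    ("passenger car with improved steering column", "Motor vehicles"),
    ("automobile brake assembly for a car", "Motor vehicles"),
    ("digital processing machine with central processing unit", "Digital ADP machines"),
    ("computer memory and data processing unit", "Digital ADP machines"),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

class EntryVocabularySketch:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, pairs):
        self.doc_counts = Counter()              # classification -> number of documents
        self.word_counts = defaultdict(Counter)  # classification -> word counts
        self.vocab = set()
        for text, label in pairs:
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def rank(self, query):
        """Return classifications ordered by posterior log-probability, best first."""
        tokens = [t for t in tokenize(query) if t in self.vocab]
        n_docs = sum(self.doc_counts.values())
        v = len(self.vocab)
        scored = []
        for label, n in self.doc_counts.items():
            total = sum(self.word_counts[label].values())
            logp = math.log(n / n_docs)  # log prior
            for t in tokens:
                # Smoothed log likelihood of the query word under this classification.
                logp += math.log((self.word_counts[label][t] + 1) / (total + v))
            scored.append((logp, label))
        return [label for _, label in sorted(scored, reverse=True)]

evm = EntryVocabularySketch().fit(TRAINING)
print(evm.rank("automobile"))
```

Query words never seen in training are dropped here rather than smoothed, so a query made up entirely of unseen words is ranked by the class priors alone.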
11. SEARCHING UNFAMILIAR METADATA: ENTRY VOCABULARIES CONSTRUCTED
- Ordinary language to U.S. Patent Classification
- Ordinary language to INSPEC thesaurus terms
- Ordinary language to Library of Congress classification codes
- Ordinary language to Standard Industrial Classification system
- Experience in entry vocabulary technology (training set size, data cleaning, natural language)
- Ordinary language to NAICS with link to 1997 Economic Census
Example: In U.S. SIC classification, for "Lobster" try "Shellfish"
12. Translingual Vocabulary Mappings Using Digital Library Catalogs (Berkeley work with Aitao Chen, Ray Larson, Michael Buckland)
- Applies digital library multilingual catalogs (books, maps, pictures) in 400 languages to develop multilingual vocabulary lexicons
  - 8,000 Afghan (Pushto) words mapped to English in 30 minutes
  - 20,000 word corpus of Afghan (Pushto) text in 3 hours
- Records contain one or more sentences from a particular language
- Each is categorized by human indexers into standardized schemes
  - Library of Congress Subject Heading (Islamic Fundamentalism)
- International standard (MARC) dataset formatting
- Prototype implemented using 12 million records from the University of California electronic catalog (http://melvyl.ucop.edu)
13. Search: SUBJECT "Islamic Fundamentalism" and LANGUAGE "Arabic"
Yield: 119 Arabic language samples on topic "Islamic Fundamentalism"
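Note: a search of this kind amounts to filtering catalog records on two fields. The sketch below assumes records have already been parsed into plain dictionaries. The field names (`title`, `language`, `subjects`), the placeholder titles, and the function `sample_titles` are hypothetical and do not reflect the actual catalog interface. The three-letter codes follow the MARC language code list ("ara", "eng").

```python
# Hypothetical, already-parsed catalog records with placeholder titles.
RECORDS = [
    {"title": "romanized Arabic title A", "language": "ara",
     "subjects": ["Islamic fundamentalism"]},
    {"title": "romanized Arabic title B", "language": "ara",
     "subjects": ["Islamic law"]},
    {"title": "English title C", "language": "eng",
     "subjects": ["Islamic fundamentalism", "Political science"]},
]

def sample_titles(records, subject, language):
    """Return titles of records in `language` indexed under `subject`.

    Subject headings are compared case-insensitively.
    """
    wanted = subject.casefold()
    return [
        r["title"]
        for r in records
        if r["language"] == language
        and any(s.casefold() == wanted for s in r["subjects"])
    ]

print(sample_titles(RECORDS, "Islamic Fundamentalism", "ara"))
```

Each returned title is a text sample in the target language whose topic is already known from the subject heading, which is what makes the catalog usable as training data.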
14. BACKGROUND ON DIGITAL LIBRARY CATALOGS
- Library catalogs are being automated at a furious pace worldwide
- Library objects (books, maps, pictures) in 400 languages
- Records contain one or more sentences from a particular language
- All categorized by specialists into standardized schemes
  - Library of Congress Subject Heading (Islamic Fundamentalism)
  - Library of Congress Classification (BP60, BP63, KF27)
  - Dewey Decimal Classification (297.2, 306.6, 320.5)
- International standard (MARC) dataset formatting
- 1000 (est.) remotely searchable catalogs worldwide accessible using the international search/retrieve protocol Z39.50
15. WHAT CAN DIGITAL LIBRARY CATALOGS PROVIDE?
- Millions of sentence fragments (titles) in many languages
- Sentences with precise topical content identified
  - From 150,000 Library of Congress Subject Headings
- Subject headings form transfer point (interlingua) between English topics and words in other languages
- Can be used for:
  - Query translation in cross-language information retrieval
  - Creation of bilingual dictionaries
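Note: the interlingua idea can be sketched as follows. This is an assumption about one simple way to do it, not the project's algorithm. For each language and subject heading, the sketch picks the word most strongly tied to that heading, then pairs the picks across languages. The toy token strings and headings are hypothetical. The Perevod/Ubersetzung pair is the one this deck cites as a Russian-German dictionary entry.

```python
from collections import Counter, defaultdict

# Toy records: (language code, romanized title tokens, subject heading).
RECORDS = [
    ("rus", "perevod", "Translating and interpreting"),
    ("rus", "perevod istoriia", "Translating and interpreting"),
    ("rus", "istoriia", "History"),
    ("ger", "ubersetzung", "Translating and interpreting"),
    ("ger", "ubersetzung geschichte", "Translating and interpreting"),
    ("ger", "geschichte", "History"),
]

def best_words(records):
    """Map language -> heading -> the word with the highest P(heading | word).

    Ties are broken by co-occurrence count, then alphabetically.
    """
    pair_counts = defaultdict(Counter)  # (language, heading) -> word counts
    word_counts = defaultdict(Counter)  # language -> word counts
    for lang, text, heading in records:
        for w in text.split():
            pair_counts[(lang, heading)][w] += 1
            word_counts[lang][w] += 1
    best = defaultdict(dict)
    for (lang, heading), counts in pair_counts.items():
        best[lang][heading] = max(
            sorted(counts),
            key=lambda w: (counts[w] / word_counts[lang][w], counts[w]),
        )
    return best

def bilingual_entries(records, source, target):
    """Pair the best source and target words that share a subject heading."""
    best = best_words(records)
    return {
        best[source][h]: best[target][h]
        for h in best[source]
        if h in best[target]
    }

print(bilingual_entries(RECORDS, "rus", "ger"))
```

Neither language has to be English: the subject heading serves only as the shared pivot between the two word lists.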
16. BERKELEY'S TECHNOLOGY: VOCABULARY INDEXES
- Takes large training sets of digital objects (book/article titles)
- Software/algorithms perform maximum likelihood mappings between words/phrases and subject headings
- Search software accepts words and returns subject headings and vice versa
- Prototype search software maps from/to English and:
  - Arabic, Chinese, French, German
  - Italian, Japanese, Russian, Spanish
  - currently uses romanized search for non-roman scripts
  - currently phrases in English only
- Language-specific and/or domain-specific
- Our prototype for nine languages available at:
  - http://otlet.sims.berkeley.edu/mulevm2.html
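Note: a two-way vocabulary index can be sketched with relative-frequency (maximum likelihood) estimates, P(heading | word) = count(word, heading) / count(word), and the reverse. This is a minimal sketch, not the prototype's code. It handles single words only, and the class `VocabularyIndex`, its methods, and the toy titles are hypothetical.

```python
from collections import Counter, defaultdict

# Toy training set: (title, subject heading).
RECORDS = [
    ("terrorismus und politik", "Terrorism"),
    ("terrorismus in europa", "Terrorism"),
    ("politik und gesellschaft", "Political science"),
]

class VocabularyIndex:
    """Two-way index between title words and subject headings."""

    def __init__(self, records):
        self.word_heading = defaultdict(Counter)  # word -> heading counts
        self.heading_word = defaultdict(Counter)  # heading -> word counts
        for title, heading in records:
            # Count each word once per title; sorting keeps the build deterministic.
            for w in sorted(set(title.lower().split())):
                self.word_heading[w][heading] += 1
                self.heading_word[heading][w] += 1

    def headings_for(self, word, k=3):
        """Top-k headings for a word, with estimated P(heading | word)."""
        counts = self.word_heading.get(word.lower(), Counter())
        total = sum(counts.values())
        return [(h, c / total) for h, c in counts.most_common(k)]

    def words_for(self, heading, k=3):
        """Top-k words for a heading, with their share of its word occurrences."""
        counts = self.heading_word.get(heading, Counter())
        total = sum(counts.values())
        return [(w, c / total) for w, c in counts.most_common(k)]

index = VocabularyIndex(RECORDS)
print(index.headings_for("terrorismus"))
print(index.words_for("Terrorism", k=1))
```

Raw relative frequencies favor common function words such as "und", so a practical index would need stop lists or a stronger association measure.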
17. BERKELEY TRAINING SET AND PROTOTYPE
- University of California digital library catalog
  - Private copy, 10 million records (5 million non-English)
  - Records in over 100 languages
  - Obtained in MARC database standard format
  - Foreign language titles use Library of Congress transliteration (Romanization) standard
- Prototype search software maps from/to English and:
  - Arabic, Chinese, French, German
  - Italian, Japanese, Russian, Spanish
18. Example: Library of Congress Subject Heading "Islamic Fundamentalism" yields most closely associated words in multiple languages
19. LIBRARY RESOURCES vs. WEB RESOURCES
- LIBRARY CATALOGS
  - Databases in standard formats
  - Data tagged with rich metadata
  - Database size limited
  - Data content identified according to well-defined rules
- WEB PAGE MINING
  - Data in quite varied formats
  - Limited or non-existent metadata
  - Database size potentially massive
  - Content rarely identified
20. Foreign search words are mapped to English subject headings
Russian-German bilingual dictionary entries:
Perevod -- Ubersetzung
Ubersetzung -- Perevod
21. Translingual Vocabulary Indexes, Place Names, Stemming: CLIR for the TIDES Surprise Language Exercise
Fredric Gey, Aitao Chen, Vivien Petras (University of California, Berkeley)
- Library catalog records
  - Hindi: 72,967 records (44,366 Berkeley, 28,601 UK)
  - Urdu: 30,206
  - Bengali: 23,430
  - Tamil: 20,232
  - Cebuano: 16 (300 est. worldwide)
- Day 1: English-Hindi vocabulary index http://metadata.sims.berkeley.edu/Evms/cdl2/hindi2/weblcsh_hindi.html
  - Terrorism -> atankavada
- Day 5: 10,839 English words mapped to (transliterated) Hindi (only 492 unique)
  - Unable to resolve LOC back transliteration
- Day 2: 178 city names from Indian Census, hand translated to Hindi by ISI, lat/lon by Beth Sundheim of SPAWAR
- Day 8: 5,161 city and village names
- Day 3: Hindi monolingual UTF-8 retrieval
- Day 6: added ITRANS transliteration
- Day 18: English-Hindi CLIR system
- Day 24: Berkeley's statistical stemmer clusters Hindi words to English stems
  - Derived from the IBM/ISI lexicons
- CLIR evaluation participation
  - 10 runs submitted, all automatic
  - Results very satisfactory
- Also: Geographic visualization for stories associated with cities
- Lessons learned
  - World-wide library retrieval (using Z39.50 protocol) can increase coverage
  - Data limited by language density, timeliness
  - Back-transliteration of scripted languages needs further research
  - Geographic visualization useful
22. HETEROGENEOUS DIGITAL INFORMATION SEARCH: Enhanced Search (augmented with Entry Vocabulary Module (EVM) Technology)
[Diagram labels: QUERY; QUERYplus; EVMp; EVMs; EVMt; EVMg; EVMm]
23. HETEROGENEOUS DIGITAL INFORMATION SEARCH: Direct Mappings and Search Between Multiple Information Types
[Diagram labels: QUERY; QUERYplus; EVMs; EVMp; EVMt; EVMg; EVMm]
24. Geographic Visualization of Hindi Stories (Electronic Cultural Atlas Initiative, University of California)
25. TWO RELEVANT STANDARDS under development
- SDMX: Statistical Data and Metadata Exchange
  - EC,
  - ISO Standard, Technical Committee 154
  - ISO/TS 17369:2005 SDMX
  - www.sdmx.org
- DDI: Data Documentation Initiative
  - XML standard for documenting statistical datasets for archiving
  - International committee of statistical data archivists
  - Coordinated by national archive at University of Michigan
  - Version 2.0 approved May
  - Working on V 3.0 to include detailed geography
  - www.icpsr.umich.edu/DDI
26. SEARCHING UNFAMILIAR METADATA: Web Site and PI Contact Information
- www.sims.berkeley.edu/research/metadata
- www.sims.berkeley.edu/research/seamless
- Michael Buckland and Ray Larson (buckland, ray _at_sims.berkeley.edu)
- Fredric Gey (gey_at_berkeley.edu)