XDMR Project Meeting, July 12, 2005

1
Mapping Metadata Across Genres and Languages
  • XDMR Project Meeting, July 12, 2005
  • Fredric C. Gey, University of California,
    Berkeley
  • Colleagues Michael Buckland, Aitao Chen, Ray
    Larson
  • Programmers and Students Barbara Norgard,
    Natalia Perelman
  • Work performed under
  • DARPA research contract Search support for
    unfamiliar metadata vocabularies (1997-2001)
  • DARPA TIDES research grant Translingual
    Vocabulary Mappings using Domain Ontologies
    (1999-2004)
  • Institute for Museum and Library Services Grant
    Seamless search of textual and numeric databases
    (1999-2001), others (2002-2006)

Fredric C. Gey
2
HETEROGENEOUS DIGITAL INFORMATION SEARCH: Current
Search Technology (multiple independent searches
without search aids)
[Diagram labels: QUERY; Bibliography; Full Text; Maps and other Geospatial data; Music and other media]
3
SEARCHING UNFAMILIAR METADATA: PROBLEM STATEMENT
  • Numerous databases are indexed by structured
    metadata classifications
  • Classification schemes are highly specialized
  • As digital libraries multiply in size and
    diversity, there is a need for search engines for
    non-specialists
  • Search engines should translate from ordinary
    language to specialized classifications

In U.S. Import-Export Database: "computer" --> No result found
4
Searching Statistical Information: Problem
Statement
  • Numeric statistical information is often thinly
    documented or documented with specialized
    vocabularies
  • Census: welfare --> public assistance
  • Foreign trade harmonized commodity
    classification: computer --> digital ADP machine
  • Standard Industrial Classification (SIC
    codes): automobile --> motor vehicle
  • In search, the user's ordinary language term is
    unlikely to match the limited technical
    vocabulary used to document the statistical
    resource
  • Can we remedy this situation?

5
Searching Statistical Information: Evidence-Poor
Data
  • Compared to searching textual databases, numeric
    statistical information and its terminology are
    both evidence-poor and highly technical
  • Statistical databases lack the rich set of
    textual clues which can identify items of numeric
    information
  • However, if we can find a textual resource
    associated with the numeric metadata, we may be
    able to mine the text to improve numeric search.
  • This is possible for some numeric classification
    schemes

6
SEARCHING UNFAMILIAR METADATA: EXAMPLES FROM
FOREIGN TRADE IMPORTS-EXPORTS
  • U.S. Foreign Trade Imports-Exports
  • Two CD-ROMs per month
  • Classified by 16,000 numeric commodity codes in
    a 12-digit hierarchy
  • Search for "automobile" or "computer" yields no
    result
  • Search for "Car" yields "Railway or Tramway
    Stock, Etc."
  • Need to know the classifications "Passenger Motor
    Vehicle" or "Digital ADP Machine w/CPU"

In U.S. Import-Export Database: "Car" --> Railway or Tramway Stock
7
SEARCHING UNFAMILIAR METADATA: U.S. STANDARD
INDUSTRIAL CLASSIFICATION SYSTEM
  • U.S. Standard Industrial Classification System
    (SIC)
  • Used to classify and aggregate industrial
    activity in the U.S.
  • Codes defined by Office of Management and Budget
  • County Business Patterns reports annual
    employment, payroll, firm size by county, SIC
    code

In U.S. SIC System: "Lobster" --> Nothing found
8
Searching Statistical Information: Economic
Classification Codes
  • Standard Industrial Classification (SIC) and
    North American Industry Classification System
    (NAICS) codes have been used to index trade
    magazines
  • This provides a textual resource of hundreds of
    thousands of documents and millions of words
    associated with the numeric data.
  • Thus mappings can be made between the words from
    the magazine abstracts and the classifications
  • User queries can be matched against words and
    phrases most closely associated with the
    particular numeric data classification
  • A ranked list of classifications can be displayed
    to the user in order to improve the search
  • Harmonized commodity classifications can be
    searched using SIC codes as a search proxy

9
SEARCHING UNFAMILIAR METADATA: SOLUTION -- ENTRY
VOCABULARY TECHNOLOGY
  • Maps between ordinary language and specialized
    classifications
  • Implemented using text categorization techniques
  • Requires training collections which have been
    manually indexed
  • Preserves and leverages investment in creation of
    complex classification structures
  • Can also be applied to the task of multi-lingual
    information access

In U.S. Patent Database: for "Automobile", try 180/280
10
UNFAMILIAR METADATA PROJECT OVERVIEW
  • Provides entry vocabulary modules for searching
    unfamiliar metadata (text, Patents, Standard
    Industrial Classification)
  • Fundamental method uses Bayesian inference based
    upon training data (documents) to map between
    ordinary language and complex hierarchical
    metadata structures
  • Technique can be applied to searching poorly
    documented statistical information
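
The Bayesian mapping described above can be illustrated with a small naive Bayes ranker. This is only a sketch under stated assumptions: the slides do not give the project's actual model, so the add-one smoothing, the whitespace tokenizer, the function names `train` and `rank`, and the four toy training texts are all invented for illustration. Class labels are plain strings rather than real SIC codes.

```python
import math
from collections import Counter, defaultdict

# Toy, invented training pairs: (document text, assigned classification).
TOY_EXAMPLES = [
    ("lobster and crab catch rises", "Shellfish"),
    ("shrimp lobster fishing season", "Shellfish"),
    ("automobile sales climb", "Motor vehicles"),
    ("new car and truck assembly plant", "Motor vehicles"),
]

def train(examples):
    """Count words per classification from manually indexed documents."""
    word_counts = defaultdict(Counter)  # classification -> word frequencies
    doc_counts = Counter()              # classification -> number of documents
    vocab = set()
    for text, code in examples:
        words = text.lower().split()
        word_counts[code].update(words)
        doc_counts[code] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def rank(query, model):
    """Return classifications ordered by naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for code in doc_counts:
        total_words = sum(word_counts[code].values())
        score = math.log(doc_counts[code] / total_docs)  # class prior
        for word in query.lower().split():
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            score += math.log(
                (word_counts[code][word] + 1) / (total_words + len(vocab))
            )
        scores[code] = score
    return sorted(scores, key=scores.get, reverse=True)

model = train(TOY_EXAMPLES)
print(rank("lobster", model))
```

With this toy data, the query "lobster" ranks "Shellfish" ahead of "Motor vehicles", which mirrors the kind of suggestion an entry vocabulary module is meant to give a non-specialist.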

Unfamiliar Metadata
11
SEARCHING UNFAMILIAR METADATA: ENTRY
VOCABULARIES CONSTRUCTED
  • Ordinary language to U.S. Patent Classification
  • Ordinary language to INSPEC thesaurus terms
  • Ordinary language to Library of Congress
    classification codes
  • Ordinary language to Standard Industrial
    Classification system
  • Experience in entry vocabulary technology
    (training set size, data cleaning, natural
    language)
  • Ordinary Language to NAICS with link to 1997
    Economic Census

In U.S. SIC classification: for "Lobster", try "Shellfish"
12
Translingual Vocabulary Mappings Using Digital
Library Catalogs (Berkeley work with Aitao Chen,
Ray Larson, Michael Buckland)
  • Applies digital library multilingual catalogs
    (books, maps, pictures) in 400 languages to
    develop multilingual vocabulary lexicons
  • 8,000 Afghan (Pushto) words mapped to English in
    30 minutes
  • 20,000 word corpus of Afghan (Pushto) text in 3
    hours
  • Records contain one or more sentences from a
    particular language
  • Each is categorized by human indexers into
    standardized schemes
  • Library of Congress Subject Heading (Islamic
    Fundamentalism)
  • International standard (MARC) dataset formatting
  • Prototype implemented using 12 million records
    from the University of California electronic
    catalog (http://melvyl.ucop.edu)

13
Search: SUBJECT "Islamic Fundamentalism" and
LANGUAGE "Arabic"
Yield: 119 Arabic-language samples on the topic
"Islamic Fundamentalism"
14
BACKGROUND ON DIGITAL LIBRARY CATALOGS
  • Library catalogs are being automated at a furious
    pace worldwide
  • Library objects (books, maps, pictures) in 400
    languages
  • Records contain one or more sentences from a
    particular language
  • All categorized by specialists into standardized
    schemes
  • Library of Congress Subject Heading (Islamic
    Fundamentalism)
  • Library of Congress Classification (BP60, BP63,
    KF27)
  • Dewey Decimal Classification (297.2, 306.6,
    320.5)
  • International standard (MARC) dataset formatting
  • 1000 (est.) remotely searchable catalogs
    worldwide accessible using the international
    search/retrieve protocol Z39.50

15
WHAT CAN DIGITAL LIBRARY CATALOGS PROVIDE?
  • Millions of sentence fragments (titles). Many
    languages
  • Sentences with precise topical content identified
  • From 150,000 Library of Congress Subject Headings
  • Subject headings form transfer point
    (interlingua) between English topics and words in
    other languages
  • Can be used for
  • Query translation in cross-language information
    retrieval
  • Creation of bilingual dictionaries
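
The interlingua idea above can be sketched as a pivot through shared subject headings: title words in two languages that co-occur with the same heading become candidate dictionary pairs. This is an illustrative sketch, not the project's method. The record layout, the language codes, the function names, and the romanized titles are invented; only the word pair perevod/ubersetzung is taken from the slides' Russian-German example.

```python
from collections import Counter, defaultdict

# Invented toy catalog records: (language, romanized title, subject heading).
RECORDS = [
    ("rus", "teoriia perevod", "Translating and interpreting"),
    ("rus", "perevod praktika", "Translating and interpreting"),
    ("ger", "ubersetzung theorie", "Translating and interpreting"),
    ("ger", "praxis ubersetzung", "Translating and interpreting"),
]

def heading_profiles(records, language):
    """Map each subject heading to title-word counts for one language."""
    profiles = defaultdict(Counter)
    for lang, title, heading in records:
        if lang == language:
            profiles[heading].update(title.lower().split())
    return profiles

def bilingual_pairs(records, source_lang, target_lang):
    """Pair the most frequent title word per shared heading across languages."""
    source = heading_profiles(records, source_lang)
    target = heading_profiles(records, target_lang)
    pairs = {}
    # Only headings attested in both languages can act as a transfer point.
    for heading in source.keys() & target.keys():
        source_word = source[heading].most_common(1)[0][0]
        target_word = target[heading].most_common(1)[0][0]
        pairs[source_word] = target_word
    return pairs

print(bilingual_pairs(RECORDS, "rus", "ger"))
```

Raw frequency is the simplest possible association measure; on real catalog data it would favor function words, so a measure that discounts words common to many headings would likely be needed.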

16
BERKELEY'S TECHNOLOGY: VOCABULARY INDEXES
  • Takes large training sets of digital objects
    (book/article titles)
  • Software/algorithms perform maximum likelihood
    mappings between words/phrases and subject
    headings
  • Search software accepts words and returns subject
    headings and vice versa
  • Prototype search software maps from/to English
    and Arabic, Chinese, French, German, Italian,
    Japanese, Russian, Spanish
  • Currently uses romanized search for non-Roman
    scripts
  • Currently supports phrases in English only
  • Language-specific and/or domain-specific
  • Our prototype for nine languages is available at
    http://otlet.sims.berkeley.edu/mulevm2.html
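
One simple reading of "maximum likelihood mappings" is a relative-frequency estimate: the probability of a heading given a word is the share of that word's training titles carrying the heading, and the reverse table supports the "vice versa" lookup. The sketch below assumes that reading. The function names and the romanized Hindi titles are invented; the atankavada/Terrorism pairing is borrowed from the Hindi example later in the deck.

```python
from collections import Counter, defaultdict

# Invented toy training pairs: (romanized title, subject heading).
TRAINING = [
    ("atankavada ka itihasa", "Terrorism"),
    ("atankavada aura rajaniti", "Terrorism"),
    ("rajaniti ka itihasa", "Political science"),
]

def build_index(pairs):
    """Build word->heading and heading->word co-occurrence tables."""
    word_to_heading = defaultdict(Counter)
    heading_to_word = defaultdict(Counter)
    for title, heading in pairs:
        for word in set(title.lower().split()):  # count each word once per title
            word_to_heading[word][heading] += 1
            heading_to_word[heading][word] += 1
    return word_to_heading, heading_to_word

def lookup(table, key, top=3):
    """Return the top associations for key with relative-frequency estimates."""
    counts = table.get(key, Counter())
    total = sum(counts.values())
    return [(item, n / total) for item, n in counts.most_common(top)]

word_to_heading, heading_to_word = build_index(TRAINING)
print(lookup(word_to_heading, "atankavada"))
print(lookup(heading_to_word, "Terrorism", top=1))
```

The same two tables serve both directions of search, so adding a language only requires more (title, heading) pairs, not a new dictionary.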

17
BERKELEY TRAINING SET AND PROTOTYPE
  • University of California digital library catalog
  • Private copy, 10 million records (5 million
    non-English)
  • Records in over 100 languages
  • Obtained in MARC database standard format
  • Foreign language titles use Library of Congress
    transliteration (Romanization) standard
  • Prototype search software maps from/to English
    and Arabic, Chinese, French, German, Italian,
    Japanese, Russian, Spanish

18
Example: Library of Congress Subject Heading
"Islamic Fundamentalism" yields the most closely
associated words in multiple languages
19
LIBRARY RESOURCES vs. WEB RESOURCES
  • LIBRARY CATALOGS
  • Databases in standard formats
  • Data tagged with rich metadata
  • Database size limited
  • Data content identified according to well-defined
    rules
  • WEB PAGE MINING
  • Data in quite varied formats
  • Limited or non-existent metadata
  • Database size potentially massive
  • Content rarely identified

20
Foreign search words are mapped to English
subject headings
Russian-German bilingual dictionary entries:
Perevod -- Ubersetzung; Ubersetzung -- Perevod
21
Translingual Vocabulary Indexes, Place Names,
Stemming: CLIR for the TIDES Surprise Language
Exercise. Fredric Gey, Aitao Chen, Vivien Petras
(University of California, Berkeley)
  • Library catalog records
  • Hindi: 72,967 records (44,366 Berkeley,
    28,601 UK)
  • Urdu: 30,206
  • Bengali: 23,430
  • Tamil: 20,232
  • Cebuano: 16 (300 est. worldwide)
  • Day 1: English-Hindi vocabulary index
    http://metadata.sims.berkeley.edu/Evms/cdl2/hindi2/weblcsh_hindi.html
  • Terrorism --> atankavada (आतंकवाद)
  • Day 5: 10,839 English words mapped to
    (transliterated) Hindi (only 492 unique)
  • Unable to resolve LOC back-transliteration
  • Day 2: 178 city names from Indian Census, hand
    translated to Hindi by ISI, lat/lon by Beth
    Sundheim of SPAWAR
  • Day 8: 5,161 city and village names
  • Day 3: Hindi monolingual UTF-8 retrieval
  • Day 6: added ITRANS transliteration
  • Day 18: English-Hindi CLIR system
  • Day 24: Berkeley's statistical stemmer clusters
    Hindi words to English stems
  • Derived from the IBM/ISI lexicons
  • CLIR evaluation participation
  • 10 runs submitted, all automatic
  • Results very satisfactory
  • Also: geographic visualization for stories
    associated with cities
  • Lessons learned
  • World-wide library retrieval (using Z39.50
    protocol) can increase coverage
  • Data limited by language density, timeliness
  • Back-transliteration of scripted languages needs
    further research
  • Geographic visualization useful

22
HETEROGENEOUS DIGITAL INFORMATION SEARCH: Enhanced
Search (augmented with Entry Vocabulary Module
(EVM) Technology)
[Diagram labels: QUERY; QUERYplus; EVMp; EVMs; EVMt; EVMg; EVMm]
23
HETEROGENEOUS DIGITAL INFORMATION SEARCH: Direct
Mappings and Search Between Multiple Information
Types
[Diagram labels: QUERY; QUERYplus; EVMs; EVMp; EVMt; EVMg; EVMm]
24
Geographic Visualization of Hindi
Stories (Electronic Cultural Atlas Initiative,
University of California)
25
TWO RELEVANT STANDARDS under development
  • SDMX: Statistical Data and Metadata Exchange
  • EC,
  • ISO Standard, Technical Committee 154
  • ISO/TS 17369:2005 SDMX
  • www.sdmx.org
  • DDI: Data Documentation Initiative
  • XML standard for documenting statistical datasets
    for archiving
  • International committee of statistical data
    archivists
  • Coordinated by national archive at University of
    Michigan
  • Version 2.0 approved May
  • Working on V 3.0 to include detailed geography
  • www.icpsr.umich.edu/DDI

26
SEARCHING UNFAMILIAR METADATA: Web Site and PI
Contact Information
  • www.sims.berkeley.edu/research/metadata
  • www.sims.berkeley.edu/research/seamless
  • Michael Buckland and Ray Larson (buckland, ray
    _at_sims.berkeley.edu)
  • Fredric Gey (gey_at_berkeley.edu)