Current Issues in Multilanguage Databases - PowerPoint PPT Presentation

Provided by: off685
Learn more at: http://www.unc.edu
1
Current Issues in Multi-language Databases
  • Amy Hawkins
  • Tammy Wells-Angerer

2
Why Multi-Language Databases?
  • "We are talking about increasing the world-wide potential for
    access to knowledge and implicitly to progress and development.
    Not only should it be possible for users throughout the world to
    have access to the massive amounts of information of all types --
    scientific, economic, literary, news, etc. -- now available over
    the networks, but also for information providers to make their
    work and ideas available in their preferred language, confident
    that this does not in itself preclude or limit access. The
    diversity of the world's languages and cultures gives rise to an
    enormous wealth of knowledge and ideas. It is thus essential that
    we study and develop computational methodologies and tools that
    help us to preserve and exploit this heritage. The survival of
    languages which are not available for electronic communication
    will become increasingly problematic in the future."
  • - Peters & Picchi, D-Lib Magazine, 1997

3
Characteristics of a Multi-Language Database
  • Contains text-based materials (for our
    discussion)
  • Either materials in multiple languages or
    (research definition) in languages other than the
    user's native language.
  • Also known as cross-language or inter-lingual

4
Multi-lingual presence on the Internet
  • 82.3% English
  • 4.0% German
  • 3.1% Japanese
  • 1.8% French
  • 1.1% Spanish
  • - Source: Babel Team (1997)
  • Grefenstette and Nioche reported similar figures
    in 2000 but found that non-English documents were
    increasing at a significantly faster rate.

5
Multi-Language Research: Inherently Interdisciplinary
  • Linguists
  • Business
  • Information & Library Scientists
  • Computer Scientists
  • Engineers
  • Intelligence Community

6
History
  • Information retrieval (IR) research into
    multi-language retrieval began in the early 1970s
  • The research area has developed rapidly since then,
    pushed by database and PC development in the
    1980s and the Internet in the 1990s

7
Users' Typical Needs
  • Search a monolingual collection in a language
    that the user cannot read.
  • Retrieve information from a multilingual
    collection using a query in a single language.
  • Select images from a collection indexed with
    free text captions in an unfamiliar language.
  • Locate documents in a multilingual collection of
    scanned page images.
  • - Source: Oard (1996)

8
User Groups: Researchers
  • Generally require a greater degree of linguistic
    precision and have a low tolerance for
    poor-quality translation
  • Require comprehensive coverage of all
    scholarship in a given area along with precision
    in definitions when searching multilingual
    databases
  • Most likely to encounter the technology when
    searching published literature, digital
    libraries, and the Internet.

9
User Groups: Military Intelligence
  • "Every day the NSA pulls in data equivalent to
    the contents of the Library of Congress book
    materials." - Glenn Zorpette of IEEE Spectrum,
    quoting Jim Bamford
  • Require precision, speed and security in
    processing and interpreting data
  • TIDES Project: sponsored by DARPA (the Defense
    Advanced Research Projects Agency)
  • Goal: to convert free text from a variety of
    languages into English

10
User Groups: Business
  • Want to reach wide markets and meet the needs of
    clients who are largely interested in searching
    and translating across several languages.

11
User Groups: Casual Users
  • Want to search the Internet and find all
    relevant documents/resources.
  • More tolerant of lower-quality translation and
    higher error rates.
  • Not currently seen as a large market
  • Most likely to encounter the technology built
    into Internet search engines, library catalogs,
    etc.

12
Globalization
  • Electronic commerce
  • Database design issues
  • Static vs. derived fields: currency
  • Dates
  • Multilanguage products
  • Multilanguage queries
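
One way to read the static-versus-derived distinction: store a single locale-neutral value (an amount in a base currency, a date as a date type) and derive the localized display on demand. A minimal Python sketch under that assumption follows. The exchange rates, the LOCALES table, and the function names are hypothetical and not taken from the presentation.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical exchange rates relative to the stored base currency (USD).
RATES = {"USD": Decimal("1"), "EUR": Decimal("0.90"), "JPY": Decimal("150")}

# Hypothetical per-locale display rules: currency, decimal places, date pattern.
LOCALES = {
    "en_US": {"currency": "USD", "places": 2, "date": "%m/%d/%Y"},
    "de_DE": {"currency": "EUR", "places": 2, "date": "%d.%m.%Y"},
    "ja_JP": {"currency": "JPY", "places": 0, "date": "%Y/%m/%d"},
}

def derive_price(base_amount: Decimal, locale: str) -> str:
    """Derive a display price from the static base-currency amount."""
    rule = LOCALES[locale]
    converted = base_amount * RATES[rule["currency"]]
    quantum = Decimal(10) ** -rule["places"]  # 0.01 for two places, 1 for none
    rounded = converted.quantize(quantum, rounding=ROUND_HALF_UP)
    return f"{rounded} {rule['currency']}"

def derive_date(stored: date, locale: str) -> str:
    """Derive a display date from the static stored date."""
    return stored.strftime(LOCALES[locale]["date"])

# Static fields as they might be stored in one product row.
row = {"price_usd": Decimal("19.99"), "released": date(2002, 3, 4)}

for locale in LOCALES:
    print(locale,
          derive_price(row["price_usd"], locale),
          derive_date(row["released"], locale))
```

Storing only the base amount keeps one source of truth, at the cost of recomputing display values whenever rates change. A design that must preserve the price quoted at sale time would store the converted amount as a static field instead.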

13
Designers and Developers
  • Unicode: What is it?
  • Character charts
  • UTF-8, UTF-16
  • Applications and databases written to the Unicode
    standard are portable
  • For mostly ASCII text, data stored as UTF-16 is
    twice the size of data stored as UTF-8 (for East
    Asian scripts UTF-8 can be the larger of the two)
  • Database support:
  • Oracle 8i and 9i
  • IBM's DB2 Universal Database (UDB)
  • Microsoft SQL Server
  • Sybase Adaptive Server
  • - Source: Unicode Consortium (2002)
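
The storage cost of UTF-8 versus UTF-16 depends on the script being stored. A minimal Python sketch using only the standard str.encode method makes the comparison concrete. The sample strings and the encoded_sizes helper are illustrative assumptions, not material from the presentation.

```python
# Compare how many bytes the same text needs in UTF-8 and UTF-16.
# The sample strings are illustrative assumptions, not data from the slides.

def encoded_sizes(text: str) -> dict:
    """Return the number of bytes needed to store text in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" omits the 2-byte byte-order mark that "utf-16" prepends.
        "utf-16": len(text.encode("utf-16-le")),
    }

samples = {
    "English": "database",
    "Japanese": "\u65e5\u672c\u8a9e",  # the three characters of "Japanese language"
}

for language, text in samples.items():
    sizes = encoded_sizes(text)
    print(f"{language}: {len(text)} chars, "
          f"UTF-8 {sizes['utf-8']} bytes, UTF-16 {sizes['utf-16']} bytes")
```

The English sample doubles in size under UTF-16, while the Japanese sample is smaller in UTF-16 than in UTF-8, so the better choice depends on which languages dominate the collection.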

14
Benefits and Shortcomings
  • Benefit - Store documents in different languages
    in the same database
  • Shortcoming - Users have to type the query in
    every language they want to search
  • Passive vocabulary exceeds active vocabulary
  • Typing the same query in multiple languages is
    slow and error-prone

15
Multi-Language results
  • On-the-fly query translation
  • Sybase Web Demo
  • http://www.sybase.com/detail?id1009189webdemo
  • Networked University Digital Library
  • Networked Digital Library of Theses and
    Dissertations
  • http://www.ndltd.org/
  • Heterogeneity: character encoding, language,
    metadata, protocols, repository technologies,
    structure of the data
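
On-the-fly query translation lets a user type a query once and have it expanded into each language of the collection. A minimal Python sketch of dictionary-based query translation follows. The tiny dictionary, the sample documents, and the function names are hypothetical toy data, and this is not a description of how the Sybase or NDLTD systems named above work.

```python
# Toy bilingual dictionary (hypothetical): English term -> translations.
DICTIONARY = {
    "library": {"de": ["bibliothek"], "fr": ["biblioth\u00e8que"], "es": ["biblioteca"]},
    "digital": {"de": ["digital"], "fr": ["num\u00e9rique"], "es": ["digital"]},
}

# Toy multilingual collection (hypothetical): (document id, language, text).
DOCUMENTS = [
    (1, "en", "A digital library of theses"),
    (2, "de", "Die digitale Bibliothek der Universit\u00e4t"),
    (3, "fr", "La biblioth\u00e8que num\u00e9rique nationale"),
    (4, "es", "Historia de la m\u00fasica"),
]

TARGET_LANGUAGES = ("de", "fr", "es")

def translate_query(query: str) -> dict:
    """Expand an English query into one list of terms per language."""
    terms = query.lower().split()
    expanded = {"en": terms}
    for lang in TARGET_LANGUAGES:
        expanded[lang] = [
            translation
            for term in terms
            for translation in DICTIONARY.get(term, {}).get(lang, [])
        ]
    return expanded

def search(query: str) -> list:
    """Return ids of documents that contain any query term in their own language."""
    expanded = translate_query(query)
    hits = []
    for doc_id, lang, text in DOCUMENTS:
        words = text.lower().split()
        if any(term in words for term in expanded.get(lang, [])):
            hits.append(doc_id)
    return hits

print(search("digital library"))
```

The user types one English query and retrieves the English, German, and French documents. Exact word matching is deliberately naive here: "digitale" does not match "digital", which is the kind of gap that stemming or a richer lexicon would need to close.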

16
Middleware diagram

17
Summary
  • Critical mass of interest, technology, tools,
    funding, content
  • Poised to move forward on optimizing
    multi-language databases to meet current and
    future user needs