Multilingual Information Access for - PowerPoint PPT Presentation

1
Culture OnLine, 5-7.6
Multilingual Information Access for Cultural Heritage
Carol Peters, ISTI-CNR, Italy
2
Outline
  • What is MLIA/CLIR?
  • What is the State-of-the-Art?
  • Where are the Problems?
  • MLIA for Cultural Heritage
  • The MultiMatch project

3
Europe's Linguistic Diversity
4
There are 6,800 known languages spoken in 200
countries. 2,261 have writing systems (the others
are only spoken).
Just 300 have some kind of language processing
tools.
5
Multilinguality in the Information Society
  • Acquisition and dissemination of information in
    digital form over language boundaries
  • Web as platform for knowledge dissemination and
    acquisition
  • Content available and needs to be accessed in
    many languages
  • Information providers and seekers should have
    equal opportunities
  • Preservation of national languages

7
Trend Toward Multilingual Web
Source: http://global-reach.biz/globstats/evol.html
8
i2010 Digital Libraries (Europeana)
  • Multilinguality is a key issue and impacts not
    only on
  • online accessibility of Europe's cultural
    heritage (a common multilingual access point)
  • but also on
  • digitisation
  • preservation and storage

9
Online Access
  • Increasing pressure for access to information
    without language or cultural barriers
  • Find information in foreign languages
  • Read and interpret that information
  • Merge with information in other languages
  • Need for Multilingual Information Access

10
What is MLIA?
  • MLIA-related research concerns the storage,
    access, retrieval and presentation of information
    in any of the world's languages.
  • Two main areas of interest:
  • multiple language access, browsing, display
  • cross-language information discovery and
    retrieval

11
Multi-Language Access, Browsing, Display
  • The enabling technology
  • character encoding
  • specific requirements of particular languages and
    scripts
  • internationalization and localization

12
Cross-Language Information Retrieval
  • Crossing the language barrier
  • querying a multilingual collection in one
    language against documents in many other
    languages
  • filtering, selecting, ranking retrieved documents
  • presenting retrieved information in an
    interpretable and exploitable fashion

13
The Problem
16
CLIR methods
  • How is it done?
  • Pre-process and index both documents and queries
    generally using language dependent techniques
    (tokenisation, stopwords, stemming, morphological
    analysis, decompounding, etc.)
  • Translate queries or documents (or both)
  • Translation resources
  • Machine Translation (MT)
  • Parallel/comparable corpora
  • Bilingual Dictionaries
  • Multilingual Thesauri
  • Conceptual Interlingua
  • Find relevant documents in target collection(s)
    and present results
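
The steps on this slide can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: the stopword lists, the Italian-to-English lexicon, the documents and the term-count scoring are all toy assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical toy resources; a real system would use per-language tools.
STOPWORDS = {
    "en": {"the", "of", "a", "in"},
    "it": {"il", "la", "di", "del", "un", "in"},
}
IT_TO_EN = {"pittura": ["painting"], "rinascimento": ["renaissance"]}

def preprocess(text, lang):
    """Lowercase, tokenise on word characters, drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS.get(lang, set())]

def translate_query(tokens, lexicon):
    """Replace each token by its translations; keep unknown (OOV) tokens as-is."""
    out = []
    for t in tokens:
        out.extend(lexicon.get(t, [t]))
    return out

def rank(query_tokens, docs, lang):
    """Score each document by how often the query terms occur in it."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(preprocess(text, lang))
        score = sum(counts[t] for t in query_tokens)
        if score:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": "The painting of the Renaissance in Florence",
    "d2": "A history of photography",
    "d3": "Renaissance sculpture",
}
query = translate_query(preprocess("la pittura del rinascimento", "it"), IT_TO_EN)
results = rank(query, docs, "en")
print(query, results)
```

Here the query is translated rather than the documents; translating the documents instead would move the `translate_query` step to indexing time.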

17
CLIR for Multimedia
  • Retrieval from a mixed media collection is a
    non-trivial problem
  • Different media processed in different ways and
    suffer from different kinds of indexing errors
  • spoken documents indexed using speech recognition
  • handwritten documents indexed using OCR
  • images indexed using significant features
  • Need for complex integration of multiple
    technologies

18
Main CLIR Difficulties (I)
  • Language identification
  • Morphology: inflection, derivation, compounding, ...
  • OOV terms, e.g. proper names, terminology
  • Multi-word concepts, e.g. phrases and idioms
  • Ambiguity, e.g. polysemy
  • Handling many languages L1 -> Ln
  • Merging results from different sources / media
  • Presenting the results in a useful fashion

19
Main CLIR Difficulties (II)
  • MLIA systems need clever pre-processing of target
    collections (e.g. semantic analysis,
    classification, information extraction)
  • MLIA systems need intelligent post-processing of
    results: merging / summarization / translation
  • MLIA systems need well-developed resources
  • Language Processing Tools
  • Language Resources
  • Resources are expensive to acquire, maintain,
    update

20
Involving the user
  • Interactive CLIR systems can help users locate
    and identify relevant foreign-language documents
  • Formulate and translate the query (e.g. entering
    diacritics, selecting translation alternatives)
  • Query re-formulation (e.g. selecting query
    expansion terms)
  • Browsing/navigating results (e.g. translating
    metadata)
  • Identifying relevant documents (e.g. summarising
    and translating results)

21
Implementing MLIA is Complex
  • Multilingual Portals
  • How many languages / how many levels should be
    multilingual / how to handle updates / linguistically
    and culturally dependent issues
  • Monolingual Search for Multiple Languages
  • encoding and representation issues / language
    identification / indexing issues (stop words,
    stemmers, morphological analysers, named entity
    recognition, ...)
  • Cross-Language Search
  • translation resources (dictionaries, corpora, MT
    systems)
  • Presentation of Results
  • in a form interpretable and exploitable by the user

22
MLIA for Cultural Heritage
  • The main problems are the same
  • BUT
  • Need for fine tuning with respect to
  • Specific terminology
  • Specific media involved
  • Specific user profile

Case Study: The MultiMatch project
23
MultiMatch: The Initial Idea
  • Problem
  • The Web contains a wealth of fragmented CH
    information but users are left to discover,
    interpret and aggregate it.
  • Objectives
  • Develop a search engine that provides targeted,
    enriched access to heterogeneous CH objects
  • across all media types and language boundaries
  • supporting various user classes
  • with aggregate views on complex task scenarios
  • Assist CH institutions to raise visibility and
    disseminate content

24
(Diagram; surviving labels: acquisition, crawling)
25
Main Research Challenges
  • From document retrieval to complex object retrieval
  • Focused crawling for acquisition of CH-related
    information from heterogeneous MM resources
  • CH concept and relation extraction using
    information extraction and text mining techniques
  • Multimedia search and mixed media search
  • Multilingual management with support for query
    formulation, cross-language retrieval and
    summarization
  • Integration and representation of related objects
  • Presentation of aggregated search results
  • User support (e.g. search history, annotation
    facilities, personalized presentation of results,
    etc.)

26
Multilingual Support
  • Provide system with monolingual and multilingual
    search functionalities (initially four languages,
    extended to six)
  • Provide effective translation strategies, e.g.
    multilingual dictionaries, machine translation,
    thesaurus term expansion for domain specific
    content
  • Multilingual query expansion
  • Dynamic summarization of results over languages

27
Multilingual Indexing
  • Six separate monolingual index files
  • Dutch, English, German, Italian, Polish, Spanish
  • One multilingual (English) index file
  • Translate all incoming docs to English
  • Store all in a single index
  • Translate incoming queries to English
  • Submit to English index
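
The two indexing strategies on this slide can be contrasted in a small sketch. It is an illustration under stated assumptions: `to_english` is a hypothetical stand-in for an MT call (here just a tiny lookup table), and the `Indexes` class and the sample documents are invented for the example.

```python
from collections import defaultdict

LANGS = ["nl", "en", "de", "it", "pl", "es"]

def to_english(text, lang):
    """Hypothetical placeholder for machine translation: a tiny word lookup."""
    table = {("it", "pittura"): "painting", ("es", "pintura"): "painting"}
    return " ".join(table.get((lang, w), w) for w in text.lower().split())

class Indexes:
    def __init__(self):
        # One inverted index per language, plus one English index for everything.
        self.mono = {lang: defaultdict(set) for lang in LANGS}
        self.english = defaultdict(set)

    def add(self, doc_id, text, lang):
        # Index the original text in its own language...
        for term in text.lower().split():
            self.mono[lang][term].add(doc_id)
        # ...and its English translation in the shared index.
        for term in to_english(text, lang).split():
            self.english[term].add(doc_id)

    def search_mono(self, query, lang):
        return self._lookup(self.mono[lang], query.lower().split())

    def search_cross(self, query, lang):
        # Translate the incoming query to English, then search the shared index.
        return self._lookup(self.english, to_english(query, lang).split())

    @staticmethod
    def _lookup(index, terms):
        hits = set()
        for term in terms:
            hits |= index.get(term, set())
        return hits

idx = Indexes()
idx.add("d1", "pittura fiorentina", "it")
idx.add("d2", "pintura barroca", "es")
idx.add("d3", "painting techniques", "en")
mono_hits = idx.search_mono("pittura", "it")
cross_hits = idx.search_cross("pintura", "es")
print(sorted(mono_hits), sorted(cross_hits))
```

The monolingual search finds only documents in the query language, while the search through the English index reaches documents in all languages at the cost of translating every incoming document.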

28
Query Translation Methods
  • Machine translation
  • Commercial MT system
  • Dictionary-based translation
  • General Machine-readable dictionaries
  • CH Domain-specific lexicons

29
Machine Translation
  • WorldLingo commercial machine translation system
    used under licence
  • Supports all language pairs for the six selected
    languages
  • Easy to use and integrate into prototype
  • Good, well-documented API

30
Dictionary-based Query Translation
  • Word-by-word (phrase-by-phrase) translation via
    bilingual dictionary look-up
  • Translation resources
  • General translation lexicon
  • FREELANG dictionary (www.freelang.net)
  • Universal dictionary (www.dicts.info)
  • Domain-specific translation lexicon
  • Wikipedia (www.wikipedia.org)
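
Word-by-word and phrase-by-phrase look-up can be combined by trying the longest phrase first. The sketch below assumes a hypothetical in-memory lexicon mapping source phrases to lists of translation alternatives; it does not read the FREELANG or dicts.info data, whose formats are not described here.

```python
def translate_query(query, lexicon, max_phrase_len=3):
    """Greedy longest-match look-up: prefer phrases, fall back to single words."""
    tokens = query.lower().split()
    translated = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                translated.append(lexicon[phrase])
                i += n
                break
        else:
            # Out-of-vocabulary term (e.g. a proper name): keep it untranslated.
            translated.append([tokens[i]])
            i += 1
    return translated

# Hypothetical Italian-to-English entries, for illustration only.
lexicon = {
    "natura morta": ["still life"],
    "natura": ["nature"],
    "morta": ["dead"],
    "olio": ["oil"],
}
result = translate_query("natura morta olio caravaggio", lexicon)
print(result)
```

Matching the phrase first avoids translating "natura morta" as "nature dead"; each inner list holds the alternatives a user or a disambiguation step could choose from.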

31
Domain-specific Lexicon Construction
  • Multilingual Wikipedia
  • A Wikipedia page written in one language often
    contains hyperlinks to counterparts in other
    languages
  • For example
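
A hedged sketch of the idea: if each page is represented by its titles across language editions, a bilingual lexicon falls out of pairing the titles. The `pages` data below is hand-written for illustration and was not extracted from Wikipedia; `build_lexicon` is a hypothetical helper name.

```python
def build_lexicon(pages, source_lang, target_lang):
    """Pair the titles of linked pages to obtain source -> target translations."""
    lexicon = {}
    for titles in pages:
        src = titles.get(source_lang)
        tgt = titles.get(target_lang)
        if src and tgt:  # skip pages that lack a counterpart in either language
            lexicon.setdefault(src.lower(), []).append(tgt)
    return lexicon

# Each dict stands for one page and its cross-language counterparts
# (illustrative titles, entered by hand).
pages = [
    {"en": "Mona Lisa", "it": "Gioconda", "es": "La Gioconda"},
    {"en": "Still life", "it": "Natura morta"},
    {"en": "Fresco", "it": "Affresco", "nl": "Fresco"},
]
it_en = build_lexicon(pages, "it", "en")
es_en = build_lexicon(pages, "es", "en")
print(it_en, es_en)
```

Pages without a counterpart in the target language simply contribute nothing, so coverage differs from one language pair to another.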

32
Domain-specific Translation Example
34
Query expansion
  • Thesaurus expansion
  • Terms from EuroWordNet
  • Relevance Feedback
  • Terms from user-selected relevant documents
  • Blind feedback
  • Terms added from the system's top-ranked,
    assumed-relevant documents
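
Blind feedback can be sketched as follows. This is a simplified illustration that picks expansion terms by raw frequency in the top-ranked documents; the function name, the parameters `top_k` and `n_terms`, and the sample documents are assumptions, and real systems typically use more careful term weighting.

```python
from collections import Counter

def blind_feedback(query_terms, ranked_docs, top_k=2, n_terms=2, stopwords=frozenset()):
    """Expand the query with frequent terms from the top_k ranked documents."""
    counts = Counter()
    for doc in ranked_docs[:top_k]:  # the top documents are assumed relevant
        counts.update(
            t for t in doc.lower().split()
            if t not in stopwords and t not in query_terms
        )
    expansion = [term for term, _ in counts.most_common(n_terms)]
    return list(query_terms) + expansion

ranked = [
    "renaissance painting florence uffizi",
    "florence renaissance fresco uffizi florence",
    "modern photography",
]
expanded = blind_feedback(["renaissance"], ranked)
print(expanded)
```

Relevance feedback differs only in where the documents come from: the user selects them instead of the system assuming the top results are relevant.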

35
Cross Media Data Aggregation
  • Enhance retrieval techniques by better
    integration of modalities
  • Combined visual plus text search (free text
    metadata): machine-learning based fusion of
    mono-modal search results
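
To make the fusion step concrete, here is a minimal sketch. Note the swap: the slide mentions machine-learning based fusion, whereas this example uses a fixed-weight sum of min-max normalised scores, a much simpler stand-in. The weight `w_text` and all scores are invented for the example.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so that different modalities are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def fuse(text_scores, visual_scores, w_text=0.5):
    """Weighted sum of normalised text and visual scores, best first."""
    t, v = min_max(text_scores), min_max(visual_scores)
    fused = {
        d: w_text * t.get(d, 0.0) + (1 - w_text) * v.get(d, 0.0)
        for d in set(t) | set(v)
    }
    # Sort by descending fused score; break ties by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))

text_scores = {"a": 10, "b": 5, "c": 0}
visual_scores = {"b": 0.9, "c": 0.8, "d": 0.1}
ranking = fuse(text_scores, visual_scores)
print(ranking)
```

A learned fusion would replace the fixed `w_text` with weights estimated from training data.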

36
User Interaction and Interface Design
  • User-centred design process
  • Evaluate and refine the interface based on
    empirical evaluation and usability testing
  • Interface supporting multilingual and multimedia
    search
  • Default search on all types of content
  • Specialized search on metadata fields and on
    different media
  • Use of semantic structures for search and browse

38
Cross-Language Search
39
The Consortium
  • Academia
  • Istituto di Scienza e Tecnologie
    dell'Informazione
  • University of Sheffield
  • Dublin City University
  • University of Amsterdam
  • University of Geneva
  • Universidad Nacional de Educación a Distancia
  • Industry
  • OCLC
  • WIND Telecomunicazioni S.p.A.
  • Cultural Heritage
  • Fratelli Alinari Istituto Edizioni Artistiche SpA
  • Netherlands Institute for Sound and Vision
  • University of Alicante, Biblioteca Virtual
    Miguel de Cervantes
  • Coordinator: Pasquale Savino (savino_at_isti.cnr.it)

40
More Information
  • Website with Online Demo
  • www.multimatch.eu
  • MultiMatch Final User Workshop, 25 October,
    Florence, Italy
  • Everyone is welcome
  • Contact: Sam Minelli, User Groups Coordinator
    (sam_at_alinari.it)