Title: Multilingual Information Access for
1. Culture OnLine, 5-7.6
Multilingual Information Access for Cultural Heritage
Carol Peters, ISTI-CNR, Italy
2. Outline
- What is MLIA/CLIR?
- What is the State-of-the-Art?
- Where are the Problems?
- MLIA for Cultural Heritage
- The MultiMatch project
3. Europe's Linguistic Diversity
4. There are 6,800 known languages spoken in 200 countries.
2,261 have writing systems (the others are only spoken).
Just 300 have some kind of language processing tools.
5. Multilinguality in the Information Society
- Acquisition and dissemination of information in digital form over language boundaries
- Web as platform for knowledge dissemination and acquisition
- Content is available in, and needs to be accessed in, many languages
- Information providers and seekers should have equal opportunities
- Preservation of national languages
7. Trend Toward Multilingual Web
Source: http://global-reach.biz/globstats/evol.html
8. i2010 Digital Libraries (Europeana)
- Multilinguality is a key issue and impacts not only on
  - Online accessibility of Europe's cultural heritage (a common multilingual access point)
- but also
  - Digitisation
  - Preservation and storage
9. Online Access
- Increasing pressure for access to information without language or cultural barriers
  - Find information in foreign languages
  - Read and interpret that information
  - Merge with information in other languages
- Need for Multilingual Information Access
10. What is MLIA?
- MLIA-related research regards the storage, access, retrieval and presentation of information in any of the world's languages.
- Two main areas of interest
  - multiple language access, browsing, display
  - cross-language information discovery and retrieval
11. Multi-Language Access, Browsing, Display
- The enabling technology
  - character encoding
  - specific requirements of particular languages and scripts
  - internationalization and localization
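
To make the character-encoding point concrete, here is a minimal sketch using only the Python standard library. The sample strings and the helper name `normalize_for_index` are assumptions chosen for illustration; they do not come from the slides. It shows that two visually identical strings can differ at the code-point level, and that normalizing them before indexing makes them comparable.

```python
import unicodedata

# Two ways to write "café": a precomposed "é" versus "e" + combining acute accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

def normalize_for_index(text: str) -> str:
    """Normalize to NFC and case-fold so equivalent strings index identically.

    Hypothetical helper for this sketch only.
    """
    return unicodedata.normalize("NFC", text).casefold()

# The raw strings differ, although they render the same on screen.
print(composed == decomposed)  # False
# After normalization they compare equal.
print(normalize_for_index(composed) == normalize_for_index(decomposed))  # True
# The UTF-8 byte lengths also differ before normalization.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 5 6
```

Applying the same normalization to both documents and queries is one common way to avoid silent mismatches between scripts and input methods.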
12. Cross-Language Information Retrieval
- Crossing the language barrier
  - querying a multilingual collection in one language against documents in many other languages
  - filtering, selecting, ranking retrieved documents
  - presenting retrieved information in an interpretable and exploitable fashion
13. The Problem
14. (No Transcript)
15. (No Transcript)
16. CLIR methods
- How is it done?
  - Pre-process and index both documents and queries, generally using language-dependent techniques (tokenisation, stopwords, stemming, morphological analysis, decompounding, etc.)
  - Translate queries or documents (or both)
    - Translation resources
      - Machine Translation (MT)
      - Parallel/comparable corpora
      - Bilingual Dictionaries
      - Multilingual Thesauri
      - Conceptual Interlingua
  - Find relevant documents in target collection(s) and present results
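
The three steps above (pre-process, translate, find and present) can be sketched end to end as follows. This is a toy illustration under stated assumptions, not the method of any particular system: the stopword lists, the English-to-Italian lexicon, the documents and all function names are invented here, stemming is omitted, and ranking is a simple term-overlap count.

```python
import re

# Toy resources, invented for illustration only.
STOPWORDS = {"en": {"the", "of", "in", "a"}, "it": {"il", "la", "di", "in", "un"}}
EN_TO_IT = {"painting": ["dipinto", "pittura"], "church": ["chiesa"], "florence": ["firenze"]}

def preprocess(text, lang):
    """Tokenise, lowercase and drop stopwords (stemming omitted for brevity)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS[lang]]

def translate_query(tokens, lexicon):
    """Replace each token by all of its translations; keep unknown (OOV) tokens as-is."""
    out = []
    for t in tokens:
        out.extend(lexicon.get(t, [t]))
    return out

def search(query_terms, docs, lang):
    """Rank documents by the number of distinct query terms they contain."""
    q = set(query_terms)
    scored = []
    for doc_id, text in docs.items():
        overlap = len(q & set(preprocess(text, lang)))
        if overlap:
            scored.append((doc_id, overlap))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

italian_docs = {
    "d1": "Un dipinto nella chiesa di Firenze",
    "d2": "La chiesa di Roma",
    "d3": "Il museo di Milano",
}

query = preprocess("painting of the church in Florence", "en")
translated = translate_query(query, EN_TO_IT)
results = search(translated, italian_docs, "it")
print(translated)  # ['dipinto', 'pittura', 'chiesa', 'firenze']
print(results)     # [('d1', 3), ('d2', 1)]
```

Keeping every translation alternative, as done here, favours recall; a real system would typically weight or disambiguate the alternatives.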
17. CLIR for Multimedia
- Retrieval from a mixed media collection is a non-trivial problem
- Different media are processed in different ways and suffer from different kinds of indexing errors
  - spoken documents indexed using speech recognition
  - handwritten documents indexed using OCR
  - images indexed using significant features
- Need for complex integration of multiple technologies
18. Main CLIR Difficulties (I)
- Language identification
- Morphology: inflection, derivation, compounding
- OOV terms, e.g. proper names, terminology
- Multi-word concepts, e.g. phrases and idioms
- Ambiguity, e.g. polysemy
- Handling many languages: L1 -> Ln
- Merging results from different sources / media
- Presenting the results in a useful fashion
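
As an illustration of the first difficulty in the list, language identification, the sketch below guesses a language by counting hits against small stopword profiles. The profiles and the function name `identify_language` are assumptions made for this example; production systems generally rely on much larger profiles or character n-gram models.

```python
import re

# Tiny stopword profiles, invented for illustration only.
PROFILES = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "it": {"il", "la", "di", "e", "che", "in"},
    "de": {"der", "die", "und", "das", "ist", "in"},
}

def identify_language(text):
    """Return the language whose stopword profile matches most tokens, or None."""
    tokens = re.findall(r"\w+", text.lower())
    counts = {lang: sum(t in words for t in tokens) for lang, words in PROFILES.items()}
    best = max(counts, key=counts.get)
    # No stopword matched at all: refuse to guess.
    return best if counts[best] > 0 else None

print(identify_language("The history of the city is long"))   # en
print(identify_language("La storia di Firenze e la chiesa"))  # it
print(identify_language("Die Kirche ist alt und das Haus"))   # de
```

Short inputs such as one-word queries or proper names often match no profile, which is one reason language identification is harder for queries than for documents.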
19. Main CLIR Difficulties (II)
- MLIA systems need clever pre-processing of target collections (e.g. semantic analysis, classification, information extraction)
- MLIA systems need intelligent post-processing of results: merging / summarization / translation
- MLIA systems need well-developed resources
  - Language Processing Tools
  - Language Resources
- Resources are expensive to acquire, maintain, update
20. Involving the user
- Interactive CLIR systems can help users locate and identify relevant foreign-language documents
  - Formulate and translate the query (e.g. entering diacritics, selecting translation alternatives)
  - Query re-formulation (e.g. selecting query expansion terms)
  - Browsing/navigating results (e.g. translating metadata)
  - Identifying relevant documents (e.g. summarising and translating results)
21. Implementing MLIA is Complex
- Multilingual Portals
  - How many languages / how many levels should be multilingual / how to handle updates / linguistically and culturally dependent issues
- Monolingual Search for Multiple Languages
  - encoding and representation issues / language identification / indexing issues (stop words, stemmers, morphological analysers, named entity recognition, ...)
- Cross-Language Search
  - translation resources (dictionaries, corpora, MT systems)
- Presentation of Results
  - in a form interpretable and exploitable by the user
22. MLIA for Cultural Heritage
- The main problems are the same
- BUT
- Need for fine tuning with respect to
  - Specific terminology
  - Specific media involved
  - Specific user profile
Case Study: The MultiMatch project
23. MultiMatch: The Initial Idea
- Problem
  - The Web contains a wealth of fragmented cultural heritage (CH) information, but users are left to discover, interpret and aggregate it.
- Objectives
  - Develop a search engine that provides targeted, enriched access to heterogeneous CH objects
    - across all media types and language boundaries
    - supporting various user classes
    - with aggregate views on complex task scenarios
  - Assist CH institutions to raise visibility and disseminate content
24. (Diagram; surviving labels: acquisition, crawling)
25. Main Research Challenges
- From document retrieval to complex object retrieval
- Focused crawling for acquisition of CH-related information from heterogeneous multimedia (MM) resources
- CH concept and relation extraction using information extraction and text mining techniques
- Multimedia search and mixed media search
- Multilingual management with support for query formulation, cross-language retrieval and summarization
- Integration and representation of related objects
- Presentation of aggregated search results
- User support (e.g. search history, annotation facilities, personalized presentation of results)
26. Multilingual Support
- Provide system with monolingual and multilingual search functionalities (initially four languages, extended to six)
- Provide effective translation strategies, e.g. multilingual dictionaries, machine translation, thesaurus term expansion for domain-specific content
- Multilingual query expansion
- Dynamic summarization of results over languages
27. Multilingual Indexing
- Six separate monolingual index files
  - Dutch, English, German, Italian, Polish, Spanish
- One multilingual (English) index file
  - Translate all incoming docs to English
  - Store all in a single index
  - Translate incoming queries to English
  - Submit to English index
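
The two indexing strategies on this slide can be contrasted with a small sketch. Everything below is an assumption made for illustration: `translate_to_english` is a stand-in for a real MT call and only does word-by-word lookup in a toy lexicon, `InvertedIndex` is a minimal in-memory index, and only three of the six languages appear in the sample data.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def translate_to_english(text, lang):
    """Stand-in for an MT call: word-by-word lookup in a toy lexicon (hypothetical)."""
    toy = {"it": {"chiesa": "church", "dipinto": "painting"},
           "es": {"iglesia": "church", "pintura": "painting"}}
    if lang == "en":
        return text
    return " ".join(toy[lang].get(w, w) for w in tokenize(text))

class InvertedIndex:
    """Minimal in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term."""
        sets = [self.postings.get(t, set()) for t in tokenize(query)]
        return set.intersection(*sets) if sets else set()

docs = [("d1", "it", "dipinto chiesa"),
        ("d2", "es", "pintura iglesia"),
        ("d3", "en", "church tower")]

mono = defaultdict(InvertedIndex)  # Strategy 1: one index per language
english = InvertedIndex()          # Strategy 2: one index of translated documents
for doc_id, lang, text in docs:
    mono[lang].add(doc_id, text)
    english.add(doc_id, translate_to_english(text, lang))

# Monolingual search only sees documents in the query language.
print(sorted(mono["it"].search("chiesa")))  # ['d1']
# Translating the query to English reaches documents from every language.
print(sorted(english.search(translate_to_english("iglesia", "es"))))  # ['d1', 'd2', 'd3']
```

The single English index moves translation cost to indexing time and makes retrieval quality depend on document translation quality, whereas separate monolingual indexes keep the original text but need a query translation per target language.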
28. Query Translation Methods
- Machine translation
  - Commercial MT system
- Dictionary-based translation
  - General machine-readable dictionaries
  - CH domain-specific lexicons
29. Machine Translation
- WorldLingo commercial machine translation system used under licence
- Supports all language pairs for the six selected languages
- Easy to use and integrate into prototype
- Good, well-documented API
30. Dictionary-based Query Translation
- Word-by-word (phrase-by-phrase) translation via bilingual dictionary look-up
- Translation resources
  - General translation lexicon
    - FREELANG dictionary: www.freelang.net
    - Universal dictionary: www.dicts.info
  - Domain-specific translation lexicon
    - Wikipedia: www.wikipedia.org
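
Word-by-word and phrase-by-phrase look-up can be combined with a greedy longest-match strategy, sketched below. The lexicon entries and the function name `translate` are invented for illustration and are not taken from the resources listed above; the slides do not say which matching strategy was actually used.

```python
# Toy English-to-Italian lexicon keyed by token tuples, invented for illustration.
LEXICON = {
    ("still", "life"): ["natura morta"],
    ("oil",): ["olio"],
    ("painting",): ["dipinto", "pittura"],
}
MAX_PHRASE_LEN = max(len(key) for key in LEXICON)

def translate(query):
    """Greedy longest-match look-up: try phrases first, then single words.

    Returns one list of translation alternatives per matched unit.
    """
    tokens = query.lower().split()
    result, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in LEXICON:
                result.append(LEXICON[key])
                i += n
                break
        else:
            # Out-of-vocabulary term: pass it through untranslated.
            result.append([tokens[i]])
            i += 1
    return result

print(translate("still life oil painting Caravaggio"))
# [['natura morta'], ['olio'], ['dipinto', 'pittura'], ['caravaggio']]
```

Matching the phrase "still life" before its individual words avoids a misleading literal translation, and passing proper names through unchanged is a simple fallback for OOV terms.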
31. Domain-specific Lexicon Construction
- Multilingual Wikipedia
  - A Wikipedia page written in one language often contains hyperlinks to counterparts in other languages
  - For example
32. Domain-specific Translation Example
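
One way the inter-language hyperlinks could be harvested into a translation lexicon is sketched here. This is an illustration under stated assumptions, not the MultiMatch implementation: the sample pages are hand-written in the style of older Wikipedia wikitext, where inter-language links appeared as `[[lang:Title]]`, and nothing is fetched from Wikipedia. The names `SAMPLE_PAGES`, `LANGLINK` and `build_lexicon` are hypothetical.

```python
import re
from collections import defaultdict

# Hand-written sample wikitext, invented for illustration only.
SAMPLE_PAGES = {
    "Fresco": "A [[mural]] technique.\n[[it:Affresco]]\n[[de:Fresko]]\n[[es:Fresco]]",
    "Painting": "Article text.\n[[it:Pittura]]\n[[de:Malerei]]\n[[es:Pintura]]",
}

# Matches [[xx:Title]] where xx is a two- or three-letter lowercase language code.
LANGLINK = re.compile(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]")

def build_lexicon(pages):
    """Map each lowercased page title to {language code: lowercased target title}."""
    lexicon = defaultdict(dict)
    for title, wikitext in pages.items():
        for lang, target in LANGLINK.findall(wikitext):
            lexicon[title.lower()][lang] = target.strip().lower()
    return dict(lexicon)

lexicon = build_lexicon(SAMPLE_PAGES)
print(lexicon["fresco"]["it"])  # affresco
print(lexicon["painting"])      # {'it': 'pittura', 'de': 'malerei', 'es': 'pintura'}
```

Because page titles are often multi-word names of works, people and places, a lexicon built this way tends to cover exactly the proper names and terminology that general dictionaries miss.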
33. (No Transcript)
34. Query expansion
- Thesaurus expansion
  - Terms from EuroWordNet
- Relevance feedback
  - Terms from user-selected relevant documents
- Blind feedback
  - Terms added from the system's top-ranked, assumed-relevant documents
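
Blind feedback, the last technique above, can be sketched in a few lines. The documents, stopword list, parameter defaults and the function name `blind_feedback` are assumptions made for this example; term selection here uses raw frequency, whereas real systems usually apply a weighting scheme.

```python
from collections import Counter

# Toy stopword list, invented for illustration.
STOPWORDS = {"the", "of", "in", "a", "and", "by"}

def blind_feedback(query_terms, ranked_docs, top_k=2, n_terms=2):
    """Expand the query with the most frequent new terms from the top_k ranked documents.

    The top_k documents are assumed relevant without asking the user.
    """
    counts = Counter()
    for text in ranked_docs[:top_k]:
        for term in text.lower().split():
            if term not in STOPWORDS and term not in query_terms:
                counts[term] += 1
    # Sort by descending frequency, then alphabetically for a stable order.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]
    return list(query_terms) + [term for term, _ in best]

ranked = [
    "fresco painting by giotto in the chapel",
    "giotto fresco cycle in the chapel of padua",
    "modern sculpture in a garden",
]
print(blind_feedback(["fresco"], ranked))  # ['fresco', 'chapel', 'giotto']
```

If the top-ranked documents are not actually relevant, the added terms pull the query off topic, which is the main risk of the blind variant compared with user-driven relevance feedback.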
35. Cross Media Data Aggregation
- Enhance retrieval techniques by better integration of modalities
  - Combined visual plus text search (free text metadata), with machine-learning based fusion of mono-modal search results
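
A minimal sketch of fusing mono-modal result lists follows. It is an assumption-laden illustration, not the project's fusion method: scores are min-max normalized and combined by a weighted sum, and the fixed weights `w_text` and `w_visual` merely stand in for weights that a machine-learning component would learn. The scores and identifiers are invented.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(text_scores, visual_scores, w_text=0.6, w_visual=0.4):
    """Weighted sum of normalized scores; a missing modality contributes 0."""
    text_n, visual_n = min_max(text_scores), min_max(visual_scores)
    docs = set(text_n) | set(visual_n)
    fused = {d: w_text * text_n.get(d, 0.0) + w_visual * visual_n.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

# Invented scores on different scales, as mono-modal engines typically produce.
text_scores = {"img1": 12.0, "img2": 7.0, "img3": 2.0}
visual_scores = {"img2": 0.9, "img3": 0.8, "img4": 0.1}

ranking = [doc for doc, _ in fuse(text_scores, visual_scores)]
print(ranking)  # ['img2', 'img1', 'img3', 'img4']
```

Normalizing first matters because text and visual engines rarely score on comparable scales; without it the modality with larger raw scores would dominate the fused ranking.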
36. User Interaction and Interface Design
- User-centred design process
  - Evaluate and refine the interface based on empirical evaluation and usability testing
- Interface supporting multilingual and multimedia search
  - Default search on all types of content
  - Specialized search on metadata fields and on different media
  - Use of semantic structures for search and browse
37. (No Transcript)
38. Cross-Language Search
39. The Consortium
- Academia
  - Istituto di Scienza e Tecnologie dell'Informazione
  - University of Sheffield
  - Dublin City University
  - University of Amsterdam
  - University of Geneva
  - Universidad Nacional de Educación a Distancia
- Industry
  - OCLC
  - WIND Telecomunicazioni S.p.A.
- Cultural Heritage
  - Fratelli Alinari Istituto Edizioni Artistiche SpA
  - Netherlands Institute for Sound and Vision
  - University of Alicante, Biblioteca Virtual Miguel de Cervantes
- Coordinator: Pasquale Savino, savino_at_isti.cnr.it
40. More Information
- Website with Online Demo
  - www.multimatch.eu
- MultiMatch Final User Workshop, 25 October, Florence, Italy
  - Everyone is welcome
- Contact: Sam Minelli, User Groups Coordinator, sam_at_alinari.it