Infrastructures and Evaluation

Learn more at: http://www.elsnet.org

1
Infrastructures and Evaluation
  • Donna Harman
  • National Institute of Standards and Technology
  • Gaithersburg, Maryland
  • http://trec.nist.gov

2
TREC Tasks
3
Workshop on Cross-Linguistic Information
Retrieval, SIGIR 1996
  • Paper: "Building a Large Multilingual Test
    Collection from Comparable News Documents" by
    Páraic Sheridan, Jean Paul Ballerini and Peter
    Schäuble
  • Used Swiss news agency (SDA) data in French,
    German and Italian

4
TREC-6 Cross-Language Track
  • In cooperation with the Swiss Federal Institute
    of Technology (ETH)
  • Task summary: retrieval of English, French, and
    German documents, both in a monolingual and a
    cross-lingual mode
  • Guidelines: ad hoc task guidelines, plus all
    groups had to submit a monolingual baseline
  • Documents:
  • Neue Zürcher Zeitung (1994): German (200 MB)
  • SDA (1988-1990): French (250 MB), German (330
    MB)
  • AP (1988-1990): English (759 MB)
  • Topics and relevance assessments all done at NIST

5
TREC-6 Cross-Language Results - revised 01/20/98
6
Major issues with language resources
  • No public domain stopword lists, stemmers, etc.
    for German and French
  • Jacques Savoy contributed a Porter-like stemmer
    for French and a stopword list
  • Martin Braschler and Paul Over from NIST built a
    simple German stemmer and decompounder
  • Questions from participants about how much of the
    final result was based on having access to
    better resources

7
Major issues in CLIR resources
  • Major lack of machine-readable bilingual
    dictionaries
  • Resulted in the use of limited dictionaries
  • Resulted in the use of assorted mapped word lists
    that were found on the web
  • Major lack of parallel corpora
  • Resulted in the use of comparable corpora
  • (Later) resulted in the mining of the web for
    parallel text
  • Heavy use of SYSTRAN in query translation
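
As a rough illustration of query translation with a limited bilingual word list, the sketch below replaces each query term with every translation the list offers and passes unknown terms through unchanged. The toy German-English word list and the function name translate_query are hypothetical, invented for this example; this is not how SYSTRAN or any particular TREC-6 system worked.

```python
# Hypothetical toy German-to-English word list (illustration only).
WORD_LIST = {
    "umwelt": ["environment"],
    "schutz": ["protection", "defence"],
    "gesetz": ["law", "act"],
}


def translate_query(query, word_list):
    """Replace each term with all listed translations; keep unknown terms."""
    translated = []
    for term in query.lower().split():
        # Terms missing from the word list are kept untranslated, which is
        # where the limited coverage of such lists shows up.
        translated.extend(word_list.get(term, [term]))
    return translated


print(translate_query("Umwelt Gesetz Schweiz", WORD_LIST))
```

Keeping all translations of an ambiguous term is the simplest choice; it trades precision for recall and makes the coverage of the word list directly visible in the translated query.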

8
Lessons learned from TREC-6
  • Importance of basic corpora
  • Difficulty in locating public domain tools
  • Problems of building multilingual testing data in
    the U.S.; this led to European cooperation in
    later TRECs

9
Importance of Basic Corpora
  • The public availability of corpora, including
    text, speech and other multimedia data, is the
    most critical infrastructure
  • Newspapers (and their multimedia counterparts)
    are particularly valuable
  • Large volume readily available
  • Available in most languages
  • General purpose domain
  • Other genres are also important

10
Uses of these Corpora
  • The basic building block for IR test collections
  • A rich source of vocabulary and language
    structure information for many tasks
  • Use of comparable corpora, e.g. corpora from the
    same time period, allows statistical mining of
    cross-language, cross-media word pairs
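
One way to picture statistical mining of word pairs from comparable corpora is sketched below. It assumes the documents have already been paired up (for example, stories from the same day on the same event) and scores each cross-language word pair with the Dice coefficient over those document pairs. The function name mine_word_pairs, the min_count threshold, and the sample texts are all hypothetical; real systems of the period used a variety of association measures.

```python
from collections import Counter
from itertools import product


def mine_word_pairs(aligned_docs, min_count=2):
    """Score cross-language word pairs with the Dice coefficient.

    aligned_docs: list of (source_text, target_text) pairs that are assumed
    to cover the same event or time period.
    """
    src_df, tgt_df, pair_df = Counter(), Counter(), Counter()
    for src_text, tgt_text in aligned_docs:
        src_words = set(src_text.lower().split())
        tgt_words = set(tgt_text.lower().split())
        src_df.update(src_words)          # document frequency, source side
        tgt_df.update(tgt_words)          # document frequency, target side
        pair_df.update(product(src_words, tgt_words))
    scores = {}
    for (s, t), n in pair_df.items():
        if n >= min_count:                # ignore one-off co-occurrences
            scores[(s, t)] = 2 * n / (src_df[s] + tgt_df[t])
    return scores


docs = [
    ("bundesrat wahl", "federal council election"),
    ("wahl ergebnis", "election result"),
    ("bundesrat sitzung", "council meeting"),
]
for pair, score in sorted(mine_word_pairs(docs).items()):
    print(pair, score)
```

With only three toy document pairs the scores are trivially high; the point is the shape of the computation, which needs nothing beyond the corpora themselves.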

11
Importance of Basic Tools
  • For IR: stopword lists, stemmers, decompounders,
    segmenters, etc.
  • For other NLP tasks, add parsers, part-of-speech
    taggers, noun phrase detectors, named entity
    recognizers, etc.
  • For MT, add sentence aligners, etc.
  • These need to be readily available for all
    languages
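
To make the IR tools named above concrete, here is a minimal sketch that chains a stopword list, a greedy decompounder, and a crude suffix stripper for German. The stopword list, lexicon, and suffix list are tiny hypothetical stand-ins, and the stemmer is deliberately simplistic (it is not the Porter algorithm, nor the stemmers built for TREC-6); linking elements such as the German "s" between compound parts are not handled.

```python
STOPWORDS = {"der", "die", "das", "und", "in"}    # toy list, assumed
LEXICON = {"umwelt", "schutz", "gesetz", "bund"}  # toy lexicon, assumed
SUFFIXES = ("en", "es", "e", "s")                 # crude, not Porter


def stem(word):
    """Strip the first matching suffix, keeping at least 4 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def decompound(word, lexicon):
    """Split a word into lexicon entries; return [word] if impossible."""
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 0, -1):  # try the longest head first
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon)
            if all(part in lexicon for part in rest):
                return [head] + rest
    return [word]


def index_terms(text):
    """Drop stopwords, split compounds, then stem each part."""
    terms = []
    for token in text.lower().split():
        if token in STOPWORDS:
            continue
        for part in decompound(token, LEXICON):
            terms.append(stem(part))
    return terms


print(index_terms("Das Umweltschutzgesetz und die Gesetze"))
```

Each step depends on a language-specific resource (a stopword list, a lexicon, a suffix table), so the pipeline is only as good as the resources that are available for a given language.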

12
Other Basic Infrastructures
  • Parallel text
  • WordNets
  • Treebanks
  • Thesauri (often domain specific)
  • Machine readable dictionaries
  • Knowledge bases such as CYC
  • Gazetteers, etc.

13
Critical Issues for Infrastructure
  • Widespread availability of what already exists:
    this is both an issue of good dissemination and
    reasonable costs
  • Serious examination of the cost/benefit ratio of
    building any new infrastructure by the funding
    agencies
  • A clearer relationship between infrastructure,
    tools, and evaluation

14
Proposal: Widespread Availability
  • Set up a central worldwide site with links to a
    site in each country that catalogs publicly
    available corpora and tools
  • Be realistic about the costs of corpora: the
    costs of building corpora should be paid by
    funding agencies, and the corpora should
    therefore be available at a TRULY minimal cost

15
Proposal: Cost/Benefit Model
  • Look at basic corpora first
  • Prime target: a worldwide newspaper collection
    with at least 250 MB per language; look for
    publishing locations with multiple languages
  • Look at simple infrastructures also
  • Examples: lists of proper nouns, crude
    bilingual dictionaries, stemmers
  • Continue support of basic infrastructures like
    WordNets

16
Proposal: Role of Evaluation
  • Evaluation forums are critical to making progress
    in language technology
  • Encourage friendly competition: provide a
    common task focal point for research groups
    worldwide
  • Enable identification of good tools for broader
    dissemination
  • Identify what the real issues are: what are the
    most useful types of new infrastructure needed?