Title: Infrastructures and Evaluation
1 Infrastructures and Evaluation
- Donna Harman
- National Institute of Standards and Technology
- Gaithersburg, Maryland
- http://trec.nist.gov
2 TREC Tasks
3 Workshop on Cross-Linguistic Information Retrieval, SIGIR 1996
- Paper: "Building a Large Multilingual Test Collection from Comparable News Documents" by Páraic Sheridan, Jean Paul Ballerini and Peter Schäuble
- Used Swiss news agency (SDA) data in French, German and Italian
4 TREC-6 Cross-Language Track
- In cooperation with the Swiss Federal Institute of Technology (ETH)
- Task summary: retrieval of English, French, and German documents, in both monolingual and cross-lingual modes
- Guidelines: ad hoc task guidelines, plus all groups had to submit a monolingual baseline
- Documents:
  - Neue Zürcher Zeitung (1994): German (200 MB)
  - SDA (1988-1990): French (250 MB), German (330 MB)
  - AP (1988-1990): English (759 MB)
- Topics and relevance assessments: all done at NIST
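One way a required monolingual baseline can be used is to express a cross-lingual run as a percentage of the monolingual score on the same topic. The sketch below assumes non-interpolated average precision as the measure; the document IDs, the two runs, and the function names are hypothetical illustrations, not TREC-6 data or tooling.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision for a single topic."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant_ids)


def percent_of_baseline(cross_ap, mono_ap):
    """Cross-lingual score as a percentage of the monolingual baseline."""
    return 100.0 * cross_ap / mono_ap if mono_ap > 0 else float("nan")


# Hypothetical toy judgments and runs for one topic.
relevant = {"d1", "d4", "d7"}
mono_run = ["d1", "d4", "d2", "d7", "d9"]   # query and documents in the same language
cross_run = ["d2", "d1", "d9", "d4", "d3"]  # translated query against the same documents

mono_ap = average_precision(mono_run, relevant)
cross_ap = average_precision(cross_run, relevant)
print(f"monolingual AP:    {mono_ap:.3f}")
print(f"cross-lingual AP:  {cross_ap:.3f}")
print(f"percent of baseline: {percent_of_baseline(cross_ap, mono_ap):.1f}%")
```

In practice the scores would be averaged over all topics before taking the ratio; a single topic is used here only to keep the arithmetic easy to follow.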
5 TREC-6 Cross-Language Results - revised 01/20/98
6 Major issues with language resources
- No public domain stopword lists, stemmers, etc. for German and French
  - Jacques Savoy contributed a Porter-like stemmer for French and a stopword list
  - Martin Braschler and Paul Over from NIST built a simple German stemmer and decompounder
- Questions from participants about how much of the final result was based on having access to better resources
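To make concrete what a "simple decompounder" might involve, here is a minimal lexicon-based sketch. It is not the tool built for TREC-6; the lexicon, the `min_len` parameter, and the handling of a linking "s" are all assumptions chosen for illustration.

```python
def _split(word, lexicon, min_len):
    """Return a list of lexicon words that cover `word`, or None if there is none."""
    if word in lexicon:
        return [word]
    # Try the longest head first so that larger known words are preferred.
    for end in range(len(word) - min_len, min_len - 1, -1):
        head, rest = word[:end], word[end:]
        if head not in lexicon:
            continue
        candidates = [rest]
        if rest.startswith("s") and len(rest) - 1 >= min_len:
            candidates.append(rest[1:])  # assume the "s" is a linking element
        for tail in candidates:
            parts = _split(tail, lexicon, min_len)
            if parts is not None:
                return [head] + parts
    return None


def decompound(word, lexicon, min_len=3):
    """Split a compound into known parts; return the word unchanged if no split is found."""
    word = word.lower()
    parts = _split(word, lexicon, min_len)
    return parts if parts is not None else [word]


# Hypothetical toy lexicon; a real one would come from a corpus or dictionary.
toy_lexicon = {"arbeit", "markt", "zeitung", "leser", "bund", "bank"}

print(decompound("Arbeitsmarkt", toy_lexicon))
print(decompound("Zeitungsleser", toy_lexicon))
print(decompound("Bundesbank", toy_lexicon))  # left whole: "es" linking is not modeled
```

The last call shows the limits of the approach: coverage depends entirely on the lexicon and on which linking elements are modeled, which is one reason access to better resources can affect final results.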
7 Major issues in CLIR resources
- Major lack of machine-readable bilingual dictionaries
  - Resulted in the use of limited dictionaries
  - Resulted in the use of assorted mapped word lists that were found on the web
- Major lack of parallel corpora
  - Resulted in the use of comparable corpora
  - (Later) resulted in the mining of the web for parallel text
- Heavy use of SYSTRAN in query translation
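The effect of a limited dictionary on query translation can be sketched as follows. This is a generic word-by-word lookup, not the method of any particular TREC-6 participant; the English-to-German word list and the function name are hypothetical.

```python
def translate_query(query, bilingual_dict):
    """Replace each source term with all of its listed translations.

    Terms missing from the dictionary are passed through unchanged, which
    at least lets some proper nouns and cognates match in the target language.
    Returns the translated terms and the list of terms that had no entry.
    """
    translated = []
    missing = []
    for term in query.lower().split():
        targets = bilingual_dict.get(term)
        if targets:
            translated.extend(targets)
        else:
            translated.append(term)
            missing.append(term)
    return translated, missing


# Hypothetical toy English-to-German word list.
toy_en_to_de = {
    "election": ["wahl", "abstimmung"],
    "results": ["ergebnisse", "resultate"],
    "swiss": ["schweizer", "schweizerisch"],
}

terms, untranslated = translate_query("Swiss election results Zurich", toy_en_to_de)
print(terms)
print(untranslated)
```

Two weaknesses show up even in this toy case: every listed translation is added without disambiguation, and words outside the list pass through untranslated.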
8 Lessons learned from TREC-6
- Importance of basic corpora
- Difficulty in locating public domain tools
- Problems of building multilingual testing data in the U.S.; this led to European cooperation in later TRECs
9 Importance of Basic Corpora
- The public availability of corpora, including text, speech and other multimedia data, is the most critical infrastructure
- Newspapers (and their multimedia counterparts) are particularly valuable
  - Large volume readily available
  - Available in most languages
  - General purpose domain
- Other genres are also important
10 Uses of these Corpora
- The basic building block for IR test collections
- A rich source of vocabulary and language structure information for many tasks
- Use of comparable corpora, e.g. corpora from the same time period, allows statistical mining of cross-language, cross-media word pairs
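A minimal sketch of statistical mining over comparable corpora, assuming the two collections are aligned only by publication date and scoring candidate pairs with the Dice coefficient over shared dates. The German and Italian snippets, the function name, and the threshold are hypothetical; real systems use far more data and more robust association measures.

```python
from collections import Counter
from itertools import product


def mine_word_pairs(source_by_date, target_by_date, min_score=0.5):
    """Score cross-language word pairs by how often they occur on the same dates."""
    src_df = Counter()   # number of shared dates on which each source word occurs
    tgt_df = Counter()   # same, for target words
    pair_df = Counter()  # number of shared dates on which both words occur
    for date in source_by_date.keys() & target_by_date.keys():
        src_words = set(source_by_date[date].lower().split())
        tgt_words = set(target_by_date[date].lower().split())
        src_df.update(src_words)
        tgt_df.update(tgt_words)
        pair_df.update(product(src_words, tgt_words))
    scores = {}
    for (s, t), both in pair_df.items():
        dice = 2 * both / (src_df[s] + tgt_df[t])
        if dice >= min_score:
            scores[(s, t)] = dice
    return scores


# Hypothetical toy news text keyed by date (German and Italian).
german = {
    "1990-01-01": "wahl bern",
    "1990-01-02": "wahl genf",
    "1990-01-03": "lawine davos",
}
italian = {
    "1990-01-01": "elezione berna",
    "1990-01-02": "elezione ginevra",
    "1990-01-03": "valanga davos",
}

pairs = mine_word_pairs(german, italian, min_score=0.9)
for (src, tgt), score in sorted(pairs.items()):
    print(f"{src:8s} {tgt:10s} {score:.2f}")
```

With so little data the output mixes good pairs (wahl/elezione) with spurious ones (lawine/davos), which illustrates why large volumes of text from the same time period matter for this technique.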
11 Importance of Basic Tools
- For IR: stopword lists, stemmers, decompounders, segmenters, etc.
- For other NLP tasks, add parsers, part-of-speech taggers, noun phrase detectors, named entity recognizers, etc.
- For MT, add sentence aligners, etc.
- These need to be readily available for all languages
12 Other Basic Infrastructures
- Parallel text
- WordNets
- Treebanks
- Thesauri (often domain-specific)
- Machine-readable dictionaries
- Knowledge bases such as CYC
- Gazetteers, etc.
13 Critical Issues for Infrastructure
- Widespread availability of what already exists; this is an issue of both good dissemination and reasonable costs
- Serious examination by the funding agencies of the cost/benefit ratio of building any new infrastructure
- A clearer relationship between infrastructure, tools, and evaluation
14 Proposal: Widespread availability
- Set up a central worldwide site with links to a site in each country that catalogs publicly available corpora and tools
- Be realistic about the costs of corpora: the costs of building corpora should be paid by funding agencies, and the corpora should therefore be available at a TRULY minimal cost
15 Proposal: Cost/Benefit Model
- Look at basic corpora first
  - Prime target: a worldwide newspaper collection with at least 250 MB per language; look for publishing locations with multiple languages
- Look at simple infrastructures also
  - Examples: lists of proper nouns, crude bilingual dictionaries, stemmers
- Continue support of basic infrastructures like WordNets
16 Proposal: Role of Evaluation
- Evaluation forums are critical to making progress in language technology
  - Encourage friendly competition; provide a common task focal point for research groups worldwide
  - Enable identification of good tools for broader dissemination
  - Identify what the real issues are: what are the most useful types of new infrastructure needed