Title: Next Steps Technical Details
1Next Steps / Technical Details
Bruno Pouliquen, Ralf Steinberger
Addressing the Language Barrier Problem in the Enlarged EU: Automatic Eurovoc Descriptor Assignment
JRC Workshop, Ispra, 16/17 September 2004
http://www.jrc.cec.eu.int/langtech
2Eurovoc indexing
Extend language coverage
- Czech
- Croatian
- Latvian
- Lithuanian
- Polish
- Slovak
- Analysis
- Danish
- Dutch
- English
- Finnish
- French
- German
- (Greek)
- Italian
- Portuguese
- Spanish
- Swedish
- (Lithuanian)
- (Bulgarian)
- (Hungarian)
- Soon also
- Albanian
- Romanian
- Russian
- Slovene
3Incentive for collaboration
- Mutual benefit
- We can provide tools and results to you (to non-commercial Member State organisations)
- JRC will be able to Eurovoc-index documents for news analysis, etc.
- No payments by the JRC are foreseen
- How to go ahead? / What to do next?
- We need Eurovoc-indexed texts in your languages (or translations of Eurovoc-indexed texts!) (Acquis Communautaire)
4Format to provide training texts to the JRC
- Ideally
- Plain text (not MS-Word, RTF, PDF, etc.)
- UTF-8 character encoding
- With CELEX code
- With Eurovoc descriptor code (mentioning Eurovoc version)
- XML format, structured
- Linguistically pre-processed and structured
- lemmatised
- annexes / signatures separate
- title separate
- stop word lists
- MANY texts
- 80,000 English texts were enough to train ca.
3500 descriptors (out of 6000)!
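The ideal delivery format listed above could look roughly like the following sketch. The slides do not specify a schema, so every element and attribute name here (`document`, `celex`, `eurovoc`, `descriptor`, `title`, `body`, `annex`) is an assumption made for illustration, and the CELEX code and descriptor codes are placeholders.

```python
import xml.etree.ElementTree as ET


def build_training_document(celex, eurovoc_version, descriptor_codes,
                            title, body, annex=""):
    """Return one training document as a UTF-8 encoded XML byte string.

    Element and attribute names are hypothetical. Only the kinds of
    information (CELEX code, descriptor codes with Eurovoc version,
    separate title / text / annex) are taken from the slide.
    """
    doc = ET.Element("document", {"celex": celex})
    eurovoc = ET.SubElement(doc, "eurovoc", {"version": eurovoc_version})
    for code in descriptor_codes:
        ET.SubElement(eurovoc, "descriptor", {"code": code})
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "body").text = body
    if annex:
        # Annexes and signatures are kept apart from the main text.
        ET.SubElement(doc, "annex").text = annex
    return ET.tostring(doc, encoding="utf-8")


xml_bytes = build_training_document(
    celex="PLACEHOLDER-CELEX",
    eurovoc_version="4",
    descriptor_codes=["0001", "0002"],  # placeholder descriptor codes
    title="Példa cím",
    body="A dokumentum szövege.",
    annex="Melléklet",
)
print(xml_bytes.decode("utf-8"))
```

Many such documents could then be concatenated under a single root element to form one big file.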
5Descriptor distribution in Spanish EP/EC texts
6Descriptor distribution in Spanish EP/EC texts
7Descriptor distribution in Spanish Congress texts
8Descriptor distribution in Hungarian texts
9Procedure
- You provide us with
- A big XML file containing the documents
- A stop word list
- We will give back to you
- A subset of documents (evaluation set)
- Same format
- Additional information on automatic Eurovoc descriptors assigned
- Some statistics on descriptor usage frequency, etc.
- An online browser interface to see the assignment results
- A validation interface
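As a rough illustration of the descriptor usage statistics mentioned above, the following sketch counts in how many documents each descriptor is used. The input structure (a mapping from document identifier to descriptor codes) and the codes themselves are assumptions for the example, not the JRC's actual data format.

```python
from collections import Counter


def descriptor_usage(documents):
    """Count in how many documents each descriptor code is used.

    `documents` maps a document identifier to the list of Eurovoc
    descriptor codes assigned to it (hypothetical input structure).
    """
    counts = Counter()
    for codes in documents.values():
        counts.update(set(codes))  # count each descriptor once per document
    return counts


documents = {
    "doc1": ["A", "B"],
    "doc2": ["A", "C"],
    "doc3": ["A", "B", "B"],
}
usage = descriptor_usage(documents)
for code, n_docs in usage.most_common():
    print(code, n_docs)
```

Descriptors that occur in very few documents are the ones for which training is likely to be difficult.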
10 <xml> <assignment> Eurovoc Assignment </assignment> </xml>
export
11XML format
14Results of descriptor assignment - interface
15Results of descriptor assignment - XML
<assignment>
<descriptor ID="1006020102000000" COSINE="0.20" OKAPI="8.83">PRESIDENCY OF THE EC COUNCIL</descriptor>
<descriptor ID="1016030000000000" COSINE="0.17" OKAPI="9.08">EUROPEAN UNION</descriptor>
<descriptor ID="1006040100000000" COSINE="0.15" OKAPI="9.63">PRESIDENT</descriptor>
<descriptor ID="2826020000000000" COSINE="0.14" OKAPI="7.82">SOCIAL POLICY</descriptor>
<descriptor ID="1011020102000000" COSINE="0.14" OKAPI="8.22">PRINCIPLE OF SUBSIDIARITY</descriptor>
...
</assignment>
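A minimal sketch of how a recipient might read such an assignment result, assuming the output is well-formed XML with `ID`, `COSINE` and `OKAPI` attributes as on the slide. The function name `read_assignment` is hypothetical, and the sample contains only the five descriptors shown (the elided part is left out).

```python
import xml.etree.ElementTree as ET

ASSIGNMENT_XML = """
<assignment>
<descriptor ID="1006020102000000" COSINE="0.20" OKAPI="8.83">PRESIDENCY OF THE EC COUNCIL</descriptor>
<descriptor ID="1016030000000000" COSINE="0.17" OKAPI="9.08">EUROPEAN UNION</descriptor>
<descriptor ID="1006040100000000" COSINE="0.15" OKAPI="9.63">PRESIDENT</descriptor>
<descriptor ID="2826020000000000" COSINE="0.14" OKAPI="7.82">SOCIAL POLICY</descriptor>
<descriptor ID="1011020102000000" COSINE="0.14" OKAPI="8.22">PRINCIPLE OF SUBSIDIARITY</descriptor>
</assignment>
"""


def read_assignment(xml_text):
    """Parse an <assignment> element into a list of descriptor records."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "id": d.get("ID"),
            "name": (d.text or "").strip(),
            "cosine": float(d.get("COSINE")),
            "okapi": float(d.get("OKAPI")),
        }
        for d in root.iter("descriptor")
    ]


descriptors = read_assignment(ASSIGNMENT_XML)
# The slide lists descriptors by descending COSINE; re-rank by OKAPI instead.
by_okapi = sorted(descriptors, key=lambda d: d["okapi"], reverse=True)
for d in by_okapi:
    print(d["okapi"], d["cosine"], d["name"])
```

Note that the two scores rank the same descriptors differently, so a consumer has to decide which score (or combination) to use.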
16Results of descriptor assignment - validation
Numeric feedback?
17Arranging the collaboration of scientific partners
- The JRC will be able to provide the tool and indexing results.
- The JRC does not have specific funds to pay for this work.
- Possibilities for collaboration between parliament and scientists
- informal collaboration without payment
- formal collaboration (contract, payment)
- apply for a project with national or EU funding (example: Hungary)
- M.Sc. theses (e.g. Lithuanian), internships (e.g. Estonian), ...
- We would like to have lemmatisers for the new languages.
- If necessary, we can train the system without linguistic pre-processing.
18Pre-processing of the texts (by scientists?)
- Linguistic pre-processing, needed for each language
- General and corpus-specific list of stop words (several thousand!)
- For highly inflected languages: some lemmatiser or stemmer
- Multi-word term mark-up for disambiguation purposes?
- Further text processing
- Some document structuring to separate title, text, footer and annex
- Conversion to XML
- Conversion to UTF-8
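Two of the steps above, conversion to UTF-8 and stop word removal, can be sketched as follows. This is only an illustration under simple assumptions: the source encoding is known in advance, tokens are runs of word characters, and the function names and the tiny stop word set are hypothetical. Lemmatisation and multi-word term mark-up are language-specific and are not covered here.

```python
import re


def to_utf8(raw_bytes, source_encoding):
    """Re-encode text from a known legacy encoding to UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


def remove_stop_words(text, stop_words):
    """Lower-case the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]


# A Czech phrase stored in ISO-8859-2, converted to UTF-8.
legacy = "Rada Evropské unie".encode("iso-8859-2")
utf8_bytes = to_utf8(legacy, "iso-8859-2")
print(utf8_bytes.decode("utf-8"))

# Toy English stop word list; a real list would hold several thousand entries.
tokens = remove_stop_words("The Council of the European Union", {"the", "of"})
print(tokens)
```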
19Dealing with different versions of Eurovoc
- Problem has not yet been solved: request for your input
- English training material was indexed with versions 3.1 and 4
- Challenge: new descriptors need new training material → delay
- Re-training required
20Dealing with different versions of Eurovoc (2)
- Case 1: New descriptor
- Search old and new documents for related documents for re-training
- Case 2: New name for old descriptor
- Replace the descriptor name: OLD_NAME → NEW_NAME
- Case 3: New place in hierarchy
- No problem
- Case 4: Disappearing descriptor
- Will no longer be assigned
21Dealing with different versions of Eurovoc (2)
- Case 5: Several descriptors are conflated
- No problem
- Case 6: A descriptor is split into two or more
- Re-training required (see Case 1)
- OLD_NAME → NEW_NAME_1, NEW_NAME_2, NEW_NAME_3
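One possible way to apply Cases 2, 4, 5 and 6 to the descriptors of an already indexed training document is sketched below. The function and its parameters are hypothetical and do not describe the JRC system. Case 3 needs no action, and Case 1 (entirely new descriptors) cannot be derived from old assignments at all.

```python
def migrate_descriptors(assigned, renamed, removed, merged, split):
    """Map descriptor names from an old Eurovoc version to a new one.

    Returns the migrated descriptor set and the set of new descriptors
    for which re-training material still has to be found.
    """
    migrated = set()
    needs_retraining = set()
    for name in assigned:
        if name in removed:            # Case 4: will no longer be assigned
            continue
        if name in split:              # Case 6: no automatic mapping possible
            needs_retraining.update(split[name])
            continue
        name = merged.get(name, name)  # Case 5: conflated descriptors
        name = renamed.get(name, name) # Case 2: OLD_NAME -> NEW_NAME
        migrated.add(name)
    return migrated, needs_retraining


migrated, needs_retraining = migrate_descriptors(
    assigned={"OLD_NAME", "GONE", "A", "B", "WIDE", "KEEP"},
    renamed={"OLD_NAME": "NEW_NAME"},
    removed={"GONE"},
    merged={"A": "AB", "B": "AB"},
    split={"WIDE": ["NARROW_1", "NARROW_2"]},
)
print(sorted(migrated))
print(sorted(needs_retraining))
```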
22Dealing with different versions of Eurovoc (3)
- Changes between Eurovoc versions should not only be described in free text.
- They should be formalised in a machine-readable way (e.g. in XML, in table format, ...).
- This should be done centrally for the thesaurus (i.e. for all thesaurus languages), rather than separately for each language!
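To make the request concrete, a machine-readable change description could look something like the XML in the sketch below, keyed by language-independent descriptor identifiers. No such format exists in the source: the element names, attributes and identifiers are invented purely for illustration, and the version numbers are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical change log between two thesaurus versions.
CHANGES_XML = """
<changes from="3.1" to="4">
  <added id="D_NEW"/>
  <removed id="D_GONE"/>
  <merged into="D_AB"><source id="D_A"/><source id="D_B"/></merged>
  <split id="D_WIDE"><target id="D_N1"/><target id="D_N2"/></split>
</changes>
"""


def read_changes(xml_text):
    """Parse the hypothetical change log into plain Python structures."""
    root = ET.fromstring(xml_text.strip())
    return {
        "from": root.get("from"),
        "to": root.get("to"),
        "added": [e.get("id") for e in root.findall("added")],
        "removed": [e.get("id") for e in root.findall("removed")],
        # old identifier -> identifier of the conflated descriptor
        "merged": {s.get("id"): m.get("into")
                   for m in root.findall("merged")
                   for s in m.findall("source")},
        # old identifier -> identifiers of the descriptors it was split into
        "split": {s.get("id"): [t.get("id") for t in s.findall("target")]
                  for s in root.findall("split")},
    }


changes = read_changes(CHANGES_XML)
print(changes)
```

Because the identifiers are independent of language, one such file would serve all thesaurus languages.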
23Appeal to Eurovoc community / EP / OPOCE
- Make Eurovoc available to the wide public in machine-readable form
- Formalise the version differences (e.g. XML)
- Make Eurovoc-indexed texts available to the scientific community
- Controlled by licences, if necessary
- E.g. via the Evaluations and Language resources Distribution Agency (ELDA)
- See http://www.elda.fr
- ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.
- Wealth of parallel texts to train multilingual text analysis applications
- Machine Translation
- Multilingual Named Entity Recognition
- Multilingual classification
- Multi-document summarisation
- ...
- Automatic indexing
- The benefit is yours!