Next Steps Technical Details

Slides: 25

Provided by: ralf99

Transcript and Presenter's Notes

1
Next Steps / Technical Details
Bruno Pouliquen, Ralf Steinberger
Addressing the Language Barrier Problem in the Enlarged EU: Automatic Eurovoc Descriptor Assignment
JRC Workshop, Ispra, 16/17 September 2004
http://www.jrc.cec.eu.int/langtech
2
Eurovoc indexing: Extend language coverage

Czech
Croatian
Latvian
Lithuanian
Polish
Slovak

Analysis
Danish
Dutch
English
Finnish
French
German
(Greek)
Italian
Portuguese
Spanish
Swedish
(Lithuanian)
(Bulgarian)
(Hungarian)

Soon also
Albanian
Romanian
Russian
Slovene

3
Incentive for collaboration

Mutual benefit
We can provide tools and results to you (to non-commercial Member State organisations)
JRC will be able to Eurovoc-index documents for news analysis, etc.
No payments by the JRC are foreseen
How to go ahead? / What to do next?
We need Eurovoc-indexed texts in your languages (or translations of Eurovoc-indexed texts!) (Acquis Communautaire)

4
Format to provide training texts to the JRC

Ideally
Plain text (not MS-Word, RTF, PDF, etc.)
UTF-8 character encoding
With CELEX code
With Eurovoc descriptor code (mentioning Eurovoc version)
XML format, structured
Linguistically pre-processed and structured
lemmatised
annexes / signatures separate
title separate
stop word lists
MANY texts
80,000 English texts were enough to train ca. 3500 descriptors (out of 6000)!
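
The requirements above (UTF-8, CELEX code, Eurovoc descriptor codes with the thesaurus version, separate title) could be met by a file produced roughly as follows. This is only a sketch: the slide showing the actual XML format has no transcript, so the element and attribute names (document, celex, title, text, descriptors, eurovoc_version, descriptor, id), the helper name build_training_doc and the CELEX value are all hypothetical. The two descriptor IDs are copied from the results slide later in this deck and used purely as sample values.

```python
import xml.etree.ElementTree as ET


def build_training_doc(celex, eurovoc_version, descriptor_codes, title, body):
    """Build one training document as an XML element.

    All element and attribute names are illustrative assumptions,
    not the format requested by the JRC.
    """
    doc = ET.Element("document", {"celex": celex})
    # Title kept separate from the body, as the slide asks.
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "text").text = body
    # Record the Eurovoc version next to the manually assigned codes.
    descs = ET.SubElement(doc, "descriptors", {"eurovoc_version": eurovoc_version})
    for code in descriptor_codes:
        ET.SubElement(descs, "descriptor", {"id": code})
    return doc


doc = build_training_doc(
    celex="CELEX-PLACEHOLDER",  # placeholder, not a real CELEX code
    eurovoc_version="4",
    descriptor_codes=["1006020102000000", "2826020000000000"],
    title="Polityka społeczna",
    body="Plain-text body of the document goes here.",
)

# Serialise as UTF-8 bytes; non-ASCII letters are written as UTF-8.
xml_bytes = ET.tostring(doc, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

Writing many such documents under one root element would give the single large XML file mentioned in the procedure.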

5
Descriptor distribution in Spanish EP/EC texts
6
Descriptor distribution in Spanish EP/EC texts
7
Descriptor distribution in Spanish Congress texts
8
Descriptor distribution in Hungarian texts
9
Procedure

You provide us with
A big XML file containing the documents
A stop word list
We will give back to you
A subset of documents (evaluation set)
Same format
Additional information on automatic Eurovoc descriptors assigned
Some statistics on descriptor usage frequency, etc.
An online browser interface to see the assignment results
A validation interface

10

<xml> <assignment> Eurovoc Assignment </assignment> </xml>
export
11
XML format
14
Results of descriptor assignment - interface
15
Results of descriptor assignment - XML
<assignment>
<descriptor ID="1006020102000000" COSINE="0.20" OKAPI="8.83">PRESIDENCY OF THE EC COUNCIL</descriptor>
<descriptor ID="1016030000000000" COSINE="0.17" OKAPI="9.08">EUROPEAN UNION</descriptor>
<descriptor ID="1006040100000000" COSINE="0.15" OKAPI="9.63">PRESIDENT</descriptor>
<descriptor ID="2826020000000000" COSINE="0.14" OKAPI="7.82">SOCIAL POLICY</descriptor>
<descriptor ID="1011020102000000" COSINE="0.14" OKAPI="8.22">PRINCIPLE OF SUBSIDIARITY</descriptor>
...
</assignment>
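
A partner receiving such an assignment block might read it as sketched below. The sample string repeats three of the descriptors from this slide; the tag and attribute names (assignment, descriptor, ID, COSINE, OKAPI) follow the slide, while the function name read_assignment and the choice to sort by the OKAPI score are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Three descriptors copied from the slide, in restored XML syntax.
ASSIGNMENT_XML = """<assignment>
<descriptor ID="1006020102000000" COSINE="0.20" OKAPI="8.83">PRESIDENCY OF THE EC COUNCIL</descriptor>
<descriptor ID="1016030000000000" COSINE="0.17" OKAPI="9.08">EUROPEAN UNION</descriptor>
<descriptor ID="1006040100000000" COSINE="0.15" OKAPI="9.63">PRESIDENT</descriptor>
</assignment>"""


def read_assignment(xml_text):
    """Return one dict per assigned descriptor, in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for d in root.iter("descriptor"):
        rows.append({
            "id": d.get("ID"),
            "name": (d.text or "").strip(),
            "cosine": float(d.get("COSINE")),
            "okapi": float(d.get("OKAPI")),
        })
    return rows


rows = read_assignment(ASSIGNMENT_XML)

# The slide lists descriptors by falling COSINE; re-ranking by OKAPI
# is shown only to illustrate that both scores are available.
by_okapi = sorted(rows, key=lambda r: r["okapi"], reverse=True)
for r in by_okapi:
    print(r["id"], r["okapi"], r["name"])
```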
16
Results of descriptor assignment - validation
Numeric feedback?
17
Arranging the collaboration of scientific partners

The JRC will be able to provide the tool and indexing results.
The JRC does not have specific funds to pay for this work.
Possibilities for collaboration between parliament and scientists
informal collaboration without payment
formal collaboration (contract, payment)
apply for a project with national or EU funding (example: Hungary)
M.Sc. theses (e.g. Lithuanian), internships (e.g. Estonian), ...
We would like to have lemmatisers for the new languages.
If necessary, we can train the system without linguistic pre-processing.

18
Pre-processing of the texts (by scientists?)

Linguistic pre-processing, needed for each language
General and corpus-specific list of stop words (several thousand!)
For highly inflected languages: some lemmatiser or stemmer
Multi-word term mark-up for disambiguation purposes?
Further text processing
Some document structuring to separate title, text, footer and annex
Conversion to XML
Conversion to XML
Conversion to UTF-8
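
Two of the steps above, conversion to UTF-8 and stop-word removal, can be sketched as follows. This is a minimal illustration, not the JRC pipeline: the function names, the assumed source encoding (ISO-8859-2), the sample Polish sentence and the four-word stop list are all made up for the example, and a real stop list would hold several thousand entries. Lemmatisation is deliberately left out.

```python
import re


def to_utf8(raw_bytes, source_encoding):
    """Re-encode text from a known legacy encoding to UTF-8 bytes."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


def remove_stop_words(text, stop_words):
    """Lower-case, split into word tokens and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]


# Sample sentence stored in a legacy encoding (assumed ISO-8859-2).
original = "Rada przyjęła rozporządzenie w sprawie polityki społecznej"
legacy_bytes = original.encode("iso-8859-2")

utf8_bytes = to_utf8(legacy_bytes, "iso-8859-2")

# Tiny illustrative stop list.
stop_words = {"i", "w", "na", "z"}
tokens = remove_stop_words(utf8_bytes.decode("utf-8"), stop_words)
print(tokens)
```

The source encoding has to be known or detected beforehand; decoding with the wrong one silently corrupts accented letters.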

19
Dealing with different versions of Eurovoc

Problem has not yet been solved: request for your input
English training material was indexed with versions 3.1 and 4
Challenge: new descriptors need new training material → delay
Re-training required

20
Dealing with different versions of Eurovoc (2)

Case 1: New descriptor
Search old and new documents for related documents for re-training
Case 2: New name for old descriptor
Replace the descriptor name: OLD_NAME → NEW_NAME
Case 3: New place in hierarchy
No problem
Case 4: Disappearing descriptor
Will no longer be assigned

21
Dealing with different versions of Eurovoc (2)

Case 5: Several descriptors are conflated
No problem
Case 6: A descriptor is split into two or more
Re-training required (see Case 1)

OLD_NAME → NEW_NAME_1, NEW_NAME_2, NEW_NAME_3
22
Dealing with different versions of Eurovoc (3)

Changes between Eurovoc versions should not only be described in free text.
They should be formalised in a machine-readable way (e.g. in XML, in table format, ...).
This should be done centrally for the thesaurus (i.e. for all thesaurus languages), rather than separately for each language!
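
To illustrate what a machine-readable change description would make possible, the sketch below applies a small change list to the descriptor IDs attached to one training document. No such format exists in the source: the XML vocabulary (changes, rename, remove, merge, split, add), the IDs D1 to D9 and the function name apply_changes are all hypothetical. The handling of each change type follows the cases on the previous slides; dropping the old label on a split is an extra assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical change list between two thesaurus versions.
CHANGES_XML = """<changes from="3.1" to="4">
  <rename id="D1" new_name="NEW_NAME"/>
  <remove id="D2"/>
  <merge into="D5"><source id="D3"/><source id="D4"/></merge>
  <split id="D6"><target id="D7"/><target id="D8"/></split>
  <add id="D9"/>
</changes>"""


def apply_changes(descriptor_ids, changes_xml):
    """Map old descriptor IDs to the new version.

    Returns (updated_ids, ids_needing_retraining).
    """
    ids = set(descriptor_ids)
    retrain = set()
    for change in ET.fromstring(changes_xml):
        if change.tag == "remove":
            # Disappearing descriptor: will no longer be assigned.
            ids.discard(change.get("id"))
        elif change.tag == "merge":
            # Conflated descriptors: old labels map onto the merged one.
            sources = {s.get("id") for s in change.findall("source")}
            if ids & sources:
                ids -= sources
                ids.add(change.get("into"))
        elif change.tag == "split":
            # Split descriptor: the old label cannot be mapped
            # automatically, so it is dropped here (an assumption)
            # and the new descriptors are flagged for re-training.
            ids.discard(change.get("id"))
            retrain |= {t.get("id") for t in change.findall("target")}
        elif change.tag == "add":
            # New descriptor: needs new training material.
            retrain.add(change.get("id"))
        # "rename" changes only the display name; the ID is kept.
    return ids, retrain


new_ids, retrain = apply_changes({"D1", "D2", "D3", "D6"}, CHANGES_XML)
print(sorted(new_ids), sorted(retrain))
```

Because the change list refers to descriptor IDs rather than names, one file would serve all thesaurus languages.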

23
Appeal to Eurovoc community / EP / OPOCE

Make Eurovoc available to the wider public in machine-readable form
Formalise the version differences (e.g. XML)
Make Eurovoc-indexed texts available to the scientific community
Controlled by licences, if necessary
E.g. via the Evaluations and Language resources Distribution Agency (ELDA)
See http://www.elda.fr
ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.
Wealth of parallel texts to train multilingual text analysis applications
Machine Translation
Multilingual Named Entity Recognition
Multilingual classification
Multi-document summarisation
Automatic indexing
The benefit is yours!