Title: Approaching a New Language in Machine Translation
1. Approaching a New Language in Machine Translation
- Anna Sågvall Hein, Per Weijnitz
2. A Swedish example
- Experiences of
  - rule-based translation by means of translation software developed from scratch
  - statistical translation by means of publicly available software
3. Developing a robust transfer-based system for Swedish
- collecting a small sv-en translation corpus from the automotive domain (Scania)
- building a prototype of a core translation engine, Multra
- extending the translation corpus to 50k words per language and scaling up the dictionaries for the extended corpus
- building a translation system, Mats, for hosting Multra and processing real-world documents
- making the system robust, transparent and traceable
- building an extended, more flexible version of Mats, Convertus
4. Features of the Multra engine
- transfer-based
- modular
- analysis by chart parsing
- transfer based on unification
- generation based on unification and concatenation
- non-deterministic processing
- preference machinery
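The unification-based transfer and generation steps can be illustrated with a minimal sketch. This is an assumption-laden toy (nested Python dicts as feature structures, invented NP example), not the actual Multra implementation:

```python
def unify(fs1, fs2):
    """Unify two feature structures represented as nested dicts.

    Returns the merged structure, or None on a feature clash.
    Illustrative sketch of unification-based transfer/generation;
    not the actual Multra code.
    """
    result = dict(fs1)
    for feat, val2 in fs2.items():
        if feat not in result:
            result[feat] = val2
        else:
            val1 = result[feat]
            if isinstance(val1, dict) and isinstance(val2, dict):
                sub = unify(val1, val2)
                if sub is None:
                    return None          # clash inside a substructure
                result[feat] = sub
            elif val1 != val2:
                return None              # atomic value clash
    return result

# A hypothetical noun phrase and a rule constraint that agree on number
np = {"cat": "NP", "agr": {"num": "sg", "def": "+"}}
rule = {"cat": "NP", "agr": {"num": "sg"}}
print(unify(np, rule))                    # structures merge
print(unify(np, {"agr": {"num": "pl"}}))  # None: number clash
```

Non-deterministic processing then amounts to trying several rule structures and letting the preference machinery rank the unifications that succeed.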
5. Features of the host system(s)
- robust
  - always produces a translation
- modular
  - a separate module for each translation step
- transparent
  - text-based communication between modules
- traceable
  - step-wise for each module
- evaluation of the linguistic coverage
  - counting and collecting missing units from each module
- process communication
  - MATS: unidirectional pipe
  - Convertus: blackboard
6. Robustness
- dictionary
  - complementary access to external dictionaries
- analysis
  - exploiting partial analyses
  - concatenation of sub-strings in preserved order
- transfer
  - only differences covered by rules
- generation
  - token translations presented in source-language order
  - fall-back generations cleaned up using a language model
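The generation fall-back can be sketched in a few lines. This is a minimal illustration with an invented toy sv-en dictionary, not the actual Mats/Convertus lexicon or code:

```python
def fallback_translate(tokens, dictionary):
    """Fall-back translation: when no full analysis is available,
    translate token by token and keep source-language order.
    Unknown tokens pass through unchanged, so the system always
    produces some translation (robustness)."""
    return [dictionary.get(t, t) for t in tokens]

# Hypothetical toy dictionary for illustration only
sv_en = {"motorn": "the engine", "startar": "starts", "inte": "not"}
out = fallback_translate(["motorn", "startar", "inte"], sv_en)
print(" ".join(out))  # "the engine starts not"
```

The output keeps source order and may be ungrammatical ("starts not"); in the slides' design such fall-back output is then cleaned up by a target language model.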
7. Language resources, full system
- analysis
  - dictionary
  - grammar
- transfer
  - dictionary
  - grammar
- generation
  - dictionary
  - grammar
- external translation dictionary
- target language model
8. Language resources, simplified direct translation system
- analysis
  - dictionary
- transfer
  - dictionary
- generation
  - dictionary
- external translation dictionary
- target language model
9. Achievements
- BLEU scores of 0.4-0.5 on training materials
  - automotive service literature
  - EU agricultural texts
  - security police communication
  - academic curricula
10. Current project
- Translation of curricula of Uppsala University
from Swedish to English
11. Current development
- initial studies of automatic extraction of grammar rules from text and treebanks for parsing and generation
- inspired by
  - Megyesi, B. (2002). Data-Driven Syntactic Analysis: Methods and Applications for Swedish. Ph.D. thesis, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden.
  - Nivre, J., Hall, J. and Nilsson, J. (2006). MaltParser: A Data-Driven Parser-Generator for Dependency Parsing. In Proceedings of LREC.
12. Statistical MT
- publicly available software
  - decoder
    - Pharaoh (Koehn 2004)
  - translation models
    - UPlug (Tiedemann 2003)
    - GIZA (Och and Ney 2000)
    - Thot (Ortiz-Martínez et al. 2005)
  - language models
    - SRILM (Stolcke 2002)
13. Success factors
- language differences
- translation direction
- size of training corpus
- density of corpus
  - corpus density: lexical openness, degree of repetitiveness of n-grams, plus other significant factors
- How can they be appropriately formalised? Measured? Combined?
14. Experiments
- limited amount of training data (assumed for minority languages): <32k sentence pairs
  - Swedish represents the minority language
- search for correlation between corpus density and translation quality
15. Mats automotive corpus

Languages   BLEU    Training size
sv-en       0.627   16k
en-sv       0.646   16k
sv-de       0.491   16k
de-sv       0.506   16k
16. Europarl

Languages   BLEU    Training size
sv-en       0.225   20k
en-sv       0.247   20k
sv-de       0.201   20k
de-sv       0.231   20k
17. Mats vs. Europarl: density in terms of type/occurrence ratio

Corpus         BLEU   3-gram (%)   4-gram (%)
Mats, 16k      0.63   68.2         78.2
Europarl, 16k  0.23   76.3         92.3
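The density measure used above can be computed directly: distinct n-gram types divided by total n-gram occurrences. A minimal sketch with an invented toy text (a low ratio means a dense, repetitive corpus, which the slides link to higher BLEU):

```python
def ngram_type_occurrence_ratio(tokens, n):
    """Type/occurrence ratio for n-grams, in percent:
    number of distinct n-grams / total n-gram occurrences * 100."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 100.0 * len(set(grams)) / len(grams)

# Hypothetical repetitive "service literature"-style text
text = "the engine starts . the engine stops . the engine starts .".split()
print(round(ngram_type_occurrence_ratio(text, 3), 1))  # 70.0
print(round(ngram_type_occurrence_ratio(text, 4), 1))  # 88.9
```

As in the table, the 4-gram ratio is higher than the 3-gram ratio: longer n-grams repeat less often.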
18. BLEU for Europarl, 10 SL→sv (chart)
19. BLEU for Europarl, sv→10 TL (chart)
20. 4-gram type/occurrence ratio, SL→sv (chart)
21. 3-gram type/occurrence ratio, SL→sv (chart)
22. Detailed view, Europarl, sv→en
- examining the correlation between SL n-gram type/occurrence density and BLEU

Size (k)   1      2      4      8      16     32
3-gram     81.6   81.0   80.0   77.8   74.3   69.9
4-gram     94.0   93.9   93.6   92.8   91.3   89.2
BLEU       0.13   0.16   0.19   0.21   0.23   0.25
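The correlation the slide looks for can be quantified, e.g. with a plain Pearson coefficient over the sv→en figures above. A sketch using only the numbers from the table (the choice of Pearson is my assumption, not stated in the slides):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# sv->en figures from the table: density falls as corpus size grows
density_3gram = [81.6, 81.0, 80.0, 77.8, 74.3, 69.9]
bleu = [0.13, 0.16, 0.19, 0.21, 0.23, 0.25]
print(round(pearson(density_3gram, bleu), 2))  # strongly negative, about -0.91
```

Denser (more repetitive) source text goes with higher BLEU, consistent with the slides' hypothesis.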
23. Detailed view, Europarl, sv→fi
- examining the correlation between SL n-gram type/occurrence density and BLEU

Size (k)   1      2      4      8      16     32
3-gram     82.8   82.3   81.0   78.8   75.4   70.9
4-gram     94.4   94.4   94.0   93.3   92.0   90.0
BLEU       0.05   0.07   0.08   0.09   0.10   0.11
24. Rule-based and statistical: moving slightly off domain
- MATS automotive corpus used for training, 16k
- test data from Mats (outside training data) and from a separate, similar corpus, Scania98

System      Language pair   MATS test, BLEU   Scania98 test, BLEU
Convertus   sv→en           0.345             0.377
Pharaoh     sv→en           0.627             0.324
25. Correlation between overlap and performance: Pharaoh
- MATS automotive corpus used for training, 16k
- test data from MATS and Scania98
- measured occurrences of test data units that also occur in the training data
- test and training source language data overlap: the precondition for successful data-driven MT

(overlap figures in % of test units also found in the training data)

Test data   BLEU    sent   4-gram   3-gram   2-gram   1-gram
MATS        0.627   32     31       46       72       97
Scania98    0.324   6      7        16       44       88
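The overlap measure behind this table can be sketched as follows, with invented toy sentences standing in for the MATS/Scania98 corpora:

```python
def overlap_percentage(test_tokens, train_tokens, n):
    """Share of test-data n-gram occurrences that also appear in the
    training data, in percent."""
    train_ngrams = {tuple(train_tokens[i:i + n])
                    for i in range(len(train_tokens) - n + 1)}
    test_ngrams = [tuple(test_tokens[i:i + n])
                   for i in range(len(test_tokens) - n + 1)]
    hits = sum(1 for g in test_ngrams if g in train_ngrams)
    return 100.0 * hits / len(test_ngrams)

# Hypothetical automotive-style sentences for illustration only
train_sent = "check the oil level before starting the engine".split()
test_sent = "check the oil level after stopping the engine".split()
print(round(overlap_percentage(test_sent, train_sent, 1), 1))  # 75.0
print(round(overlap_percentage(test_sent, train_sent, 2), 1))  # 57.1
```

As in the table, overlap drops sharply as n grows, and low higher-order overlap goes with low BLEU for data-driven MT.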
26. Summary
- development of Convertus, a robust transfer-based system equipped with language resources for sv-en translation in several domains
- BLEU measures of SMT using publicly available software (Pharaoh) and Europarl
  - 10 languages, two translation directions, and training intervals of 5k sentence pairs up to 32k
- data on density of Europarl in terms of overlaps
- comparing RBMT and SMT using Convertus and Pharaoh
- searching for a formal way of quantifying how well a corpus will work for SMT
  - starting with density of the source language
27. Concluding remarks
- building a rule-based system from scratch is a major undertaking
  - customizing existing software is better
- SMT systems can be built fairly easily using publicly available software
  - restrictions on commercial use, though
- factors influencing quality in SMT
  - size of training corpus
  - density of source side of training corpus
  - language differences and translation direction
- other important factors (future work)
  - quality of training corpus, alignment quality, ...
28. Concluding remarks (cont.)
- SMT versus RBMT
  - SMT seems more sensitive to density than RBMT
  - error analysis and correction can be linguistically controlled in RBMT, as opposed to SMT