Approaching a New Language in Machine Translation

1
Approaching a New Language in Machine Translation
  • Anna Sågvall Hein, Per Weijnitz

2
A Swedish example
  • Experiences of
    • rule-based translation by means of translation
      software that was developed from scratch
    • statistical translation by means of publicly
      available software

3
Developing a robust transfer-based system for
Swedish
  • collecting a small sv-en translation corpus from
    the automotive domain (Scania)
  • building a prototype of a core translation
    engine, Multra
  • extending the translation corpus to 50k words for
    each language and scaling-up the dictionaries for
    the extended corpus
  • building a translation system, Mats, for hosting
    Multra and processing real-world documents
  • making the system robust, transparent and
    trace-able
  • building an extended, more flexible version of
    Mats, Convertus

4
Features of the Multra engine
  • transfer-based
  • modular
  • analysis by chart parsing
  • transfer based on unification
  • generation based on unification and concatenation
  • non-deterministic processing
  • preference machinery
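
The unification step that the slide names for transfer and generation can be illustrated with a minimal sketch. This is not Multra's implementation: it assumes feature structures are plain nested Python dicts, and the function name `unify` and the example features are hypothetical.

```python
from typing import Optional


def unify(a: dict, b: dict) -> Optional[dict]:
    """Unify two feature structures; return None if they conflict."""
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            # Feature only present in b: simply add it.
            result[key] = b_val
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            # Both values are structures: unify them recursively.
            sub = unify(a_val, b_val)
            if sub is None:
                return None
            result[key] = sub
        elif a_val != b_val:
            # Atomic values that differ cannot be unified.
            return None
    return result


# Hypothetical feature structures, for illustration only.
noun = {"cat": "N", "agr": {"num": "sg"}}
constraint = {"agr": {"num": "sg", "gender": "utr"}}

print(unify(noun, constraint))                # compatible: features are merged
print(unify(noun, {"agr": {"num": "pl"}}))    # conflicting number: None
```

A rule applies only when unification succeeds, which is one way a non-deterministic engine can keep several competing analyses and let a preference mechanism choose among them.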

5
Features of the host system(s)
  • robust
    • always produces a translation
  • modular
    • a separate module for each translation step
  • transparent
    • text based communication between modules
  • trace-able
    • step-wise for each module
  • evaluation of the linguistic coverage
    • counting and collecting missing units from each
      module
  • process communication
    • MATS, unidirectional pipe
    • Convertus, blackboard
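
The host-system properties listed above (one module per step, text-based communication, step-wise tracing, counting of missing units) can be sketched as a unidirectional pipe. This is only an illustration under stated assumptions: the module functions, the toy dictionary, and the one-token-per-line text format are hypothetical and are not taken from Mats or Convertus.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Toy transfer dictionary; entries are hypothetical.
TRANSFER_DICT = {"bilen": "the car", "startar": "starts"}

# Missing units, keyed by (module name, unit).
missing: Counter = Counter()


def analyse(text: str) -> str:
    # Tokenise and lowercase; output is plain text, one token per line.
    return "\n".join(text.lower().split())


def transfer(text: str) -> str:
    out = []
    for token in text.split("\n"):
        if token in TRANSFER_DICT:
            out.append(TRANSFER_DICT[token])
        else:
            missing[("transfer", token)] += 1   # record the coverage gap
            out.append(token)                   # pass through, never fail
    return "\n".join(out)


def generate(text: str) -> str:
    return " ".join(text.split("\n"))


def run_pipeline(
    source: str, modules: Sequence[Callable[[str], str]]
) -> Tuple[str, List[Tuple[str, str]]]:
    """Run the modules in order; keep each module's text output as a trace."""
    trace = []
    data = source
    for module in modules:
        data = module(data)
        trace.append((module.__name__, data))
    return data, trace


translation, trace = run_pipeline("Bilen startar inte", [analyse, transfer, generate])
print(translation)
print(missing)
```

Because every module reads and writes plain text, the output of each step can be inspected directly, and the `missing` counter gives a per-module list of units to add to the language resources.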

6
Robustness
  • dictionary
    • complementary access to external dictionaries
  • analysis
    • exploiting partial analyses
    • concatenation of sub-strings in preserved order
  • transfer
    • only differences covered by rules
  • generation
    • token translations presented in source language
      order
    • fall back generations cleaned up using a language
      model
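
Two of the fallbacks above, complementary access to an external dictionary and token translations presented in source-language order, can be sketched as follows. The dictionaries and function names are hypothetical, and the language-model clean-up step is not shown.

```python
from typing import Dict, List, Optional

# Toy dictionaries; entries are illustrative, not from the system's resources.
DOMAIN_DICT: Dict[str, str] = {"motorn": "the engine", "startar": "starts"}
EXTERNAL_DICT: Dict[str, str] = {"inte": "not", "startar": "begins"}


def lookup(token: str) -> Optional[str]:
    """Prefer the system's own dictionary, then the external one."""
    if token in DOMAIN_DICT:
        return DOMAIN_DICT[token]
    return EXTERNAL_DICT.get(token)


def fallback_translate(tokens: List[str]) -> str:
    """Translate token by token, keeping source-language order.

    Unknown tokens are passed through unchanged, so a result is
    always produced.
    """
    out = []
    for token in tokens:
        translation = lookup(token.lower())
        out.append(translation if translation is not None else token)
    return " ".join(out)


print(fallback_translate("Motorn startar inte idag".split()))
```

The result keeps Swedish word order ("starts not"), which is the kind of output a target language model would then be used to clean up.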

7
Language resources, full system
  • analysis
    • dictionary
    • grammar
  • transfer
    • dictionary
    • grammar
  • generation
    • dictionary
    • grammar
  • external translation dictionary
  • target language model

8
Language resources, simplified, direct
translation system
  • analysis
    • dictionary
  • transfer
    • dictionary
  • generation
    • dictionary
  • external translation dictionary
  • target language model

9
Achievements
  • BLEU scores 0.4-0.5 for training materials
    • automotive service literature
    • EU agricultural texts
    • security police communication
    • academic curricula

10
Current project
  • Translation of curricula of Uppsala University
    from Swedish to English

11
Current development
  • initial studies of automatic extraction of
    grammar rules from text and tree-banks for
    parsing and generation
  • inspired by
    • Megyesi, B. (2002). Data-Driven Syntactic
      Analysis - Methods and Applications for Swedish.
      Ph.D. Thesis. Department of Speech, Music and
      Hearing, KTH, Stockholm, Sweden.
    • Nivre, J., Hall, J. and Nilsson, J. (2006).
      MaltParser: A Data-Driven Parser-Generator for
      Dependency Parsing. In Proceedings of LREC.

12
Statistical MT
  • Publicly available software
    • decoder
      • Pharaoh (Koehn 2004)
    • translation models
      • UPlug (Tiedemann, J. 2003)
      • GIZA++ (Och, F. J. and Ney, H. 2000)
      • Thot (Ortiz-Martínez, D. et al. 2005)
    • language models
      • SRILM (Stolcke, A. 2002)

13
Success factors
  • language differences
  • translation direction
  • size of training corpus
  • density of corpus
    • corpus density: lexical openness, degree of
      repetitiveness of n-grams, plus other significant
      factors
  • How can they be appropriately formalised?
    Measured? Combined?

14
Experiments
  • limited amount of training data (assumed for
    minority languages): <32k sentence pairs
  • Swedish represents the minority language
  • search for correlation between density of corpus
    and translation quality

15
Mats automotive corpus

languages  BLEU   training size
sv-en      0.627  16k
en-sv      0.646  16k
sv-de      0.491  16k
de-sv      0.506  16k

16
Europarl

languages  BLEU   training size
sv-en      0.225  20k
en-sv      0.247  20k
sv-de      0.201  20k
de-sv      0.231  20k

17
Mats and Europarl, density in terms of
type/occurrence ratio

Corpus         BLEU  3-gram  4-gram
Mats, 16k      0.63  68.2    78.2
Europarl, 16k  0.23  76.3    92.3
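
The slides do not spell out how the type/occurrence ratio is computed. A minimal sketch, assuming it is the number of distinct n-grams (types) per 100 n-gram occurrences, with n-grams not crossing sentence boundaries; the function names and the toy corpus are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of one tokenised sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def type_occurrence_ratio(sentences: Iterable[List[str]], n: int) -> float:
    """Distinct n-grams (types) per 100 n-gram occurrences."""
    counts: Counter = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    occurrences = sum(counts.values())
    if occurrences == 0:
        return 0.0
    return 100.0 * len(counts) / occurrences


# Toy corpus, for illustration only.
corpus = [
    "check the oil level".split(),
    "check the coolant level".split(),
    "check the oil filter".split(),
]

for n in (1, 2, 3):
    print(n, round(type_occurrence_ratio(corpus, n), 1))
```

Under this reading, a lower ratio means more repetition, that is, a denser corpus, which matches the lower values reported for the Mats corpus than for Europarl.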

18
BLEU for Europarl 10 SL->sv

19
BLEU for Europarl sv->10 TL

20
4-gram type/occurrence ratio, SL->sv

21
3-gram type/occurrence ratio, SL->sv

22
Detailed view, Europarl, sv->en
  • Examining the correlation between SL n-gram
    type/occurrence density and BLEU.

Size (k)  1     2     4     8     16    32
3-gram    81.6  81.0  80.0  77.8  74.3  69.9
4-gram    94.0  93.9  93.6  92.8  91.3  89.2
BLEU      0.13  0.16  0.19  0.21  0.23  0.25
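
The slides do not say which correlation measure was used. As one possible check, the sketch below computes the Pearson coefficient between the type/occurrence ratios and the BLEU scores in the table above; the choice of Pearson and the helper name `pearson` are assumptions.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Values copied from the sv->en table (training sizes 1k to 32k).
trigram_ratio = [81.6, 81.0, 80.0, 77.8, 74.3, 69.9]
fourgram_ratio = [94.0, 93.9, 93.6, 92.8, 91.3, 89.2]
bleu = [0.13, 0.16, 0.19, 0.21, 0.23, 0.25]

print(round(pearson(trigram_ratio, bleu), 2))
print(round(pearson(fourgram_ratio, bleu), 2))
```

Both coefficients come out strongly negative: BLEU rises as the ratio falls. In this table the ratio and the training size change together, so these six points alone cannot separate the effect of density from the effect of corpus size.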

23
Detailed view, Europarl sv-fi
  • Examining the correlation between SL n-gram
    type/occurrence density and BLEU.

Size (k)  1     2     4     8     16    32
3-gram    82.8  82.3  81.0  78.8  75.4  70.9
4-gram    94.4  94.4  94.0  93.3  92.0  90.0
BLEU      0.05  0.07  0.08  0.09  0.10  0.11

24
Rule-based and statistical - moving slightly off
domain
  • MATS automotive corpus used for training, 16k
  • test data from MATS (outside training data) and
    from separate, similar corpus Scania98

system     language pair  MATS test, BLEU  Scania98 test, BLEU
Convertus  sv->en         0.345            0.377
Pharaoh    sv->en         0.627            0.324

25
Correlation between overlap and performance -
Pharaoh
  • MATS automotive corpus used for training, 16k
  • test data from MATS and Scania98
  • measured occurrences of test data units that
    also occur in the training data
  • overlap between test and training source language
    data: the precondition for successful data-driven MT

Test data  BLEU   sent  4-gram  3-gram  2-gram  1-gram
MATS       0.627  32    31      46      72      97
Scania98   0.324  6     7       16      44      88
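
The overlap measurement described above can be sketched as the percentage of test-data units (whole sentences or n-gram occurrences) that also occur in the training data. The exact counting used for the table is not given in the slides, so the definitions, function names, and toy data below are assumptions.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of one tokenised sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(train: List[List[str]], test: List[List[str]], n: int) -> float:
    """Percentage of test n-gram occurrences also seen in the training data."""
    seen = set()
    for tokens in train:
        seen.update(ngrams(tokens, n))
    test_ngrams = [g for tokens in test for g in ngrams(tokens, n)]
    if not test_ngrams:
        return 0.0
    hits = sum(1 for g in test_ngrams if g in seen)
    return 100.0 * hits / len(test_ngrams)


def sentence_overlap(train: List[List[str]], test: List[List[str]]) -> float:
    """Percentage of test sentences that occur verbatim in the training data."""
    seen = {tuple(tokens) for tokens in train}
    if not test:
        return 0.0
    return 100.0 * sum(1 for tokens in test if tuple(tokens) in seen) / len(test)


# Toy source-language data, for illustration only.
train = ["check the oil level".split(), "replace the oil filter".split()]
test = ["check the oil filter".split(), "check the brake fluid".split()]

print("sent", sentence_overlap(train, test))
for n in (4, 3, 2, 1):
    print(f"{n}-gram", round(ngram_overlap(train, test, n), 1))
```

As in the table, overlap shrinks as the unit gets longer, so long n-gram and sentence overlap are the figures that differ most between in-domain and slightly off-domain test data.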
26
Summary
  • development of Convertus, a robust transfer-based
    system equipped with language resources for sv-en
    translation in several domains
  • BLEU measures of SMT using publicly available
    software (Pharaoh) and Europarl
    • 10 languages, two translation directions, and
      training intervals of 5k sentence pairs up to 32k
  • data on density of Europarl in terms of overlaps
  • comparing RBMT and SMT using Convertus and
    Pharaoh
  • searching for a formal way of quantifying how
    well a corpus will work for SMT
    • starting with density of source language

27
Concluding remarks
  • building a rule-based system from scratch is a
    major undertaking
    • customizing existing software is better
  • SMT systems can be built fairly easily using
    publicly available software
    • restrictions on commercial use, though
  • factors influencing quality in SMT
    • size of training corpus
    • density of source side of training corpus
    • language differences and translation direction
  • other important factors (future work)
    • quality of training corpus, alignment quality,

28
Concluding remarks (cont.)
  • SMT versus RBMT
    • SMT seems more sensitive to density than RBMT
    • error analysis and correction can be
      linguistically controlled in RBMT as opposed to
      SMT