Alex Fraser - PowerPoint PPT Presentation

Learn more at: http://www.amtaweb.org

Transcript and Presenter's Notes

Title: Alex Fraser


1
Issues in Arabic MT
  • Alex Fraser
  • USC/ISI

2
ISI Arabic System for 2003 TIDES Evaluation
  • Alignment Template Approach (Och and others at
    RWTH Aachen)
  • Maximum BLEU training (Och, ACL 2003)
  • Customization of Training for Arabic System
  • Model and Search are not Arabic dependent
    (currently)
  • Top Scoring System

3
Character Encoding and Normalization
  • Arabic UTF-8 reduced to CP-1256 character set
    (8-bit MS-Windows encoding)
  • Handle non-Arabic characters that look similar
  • Numbers
  • Normalization is important
  • Strip Kashida, vowels, Shadda
  • Normalize Alef variants, Alef Maqsura/Yeh,
    Heh/Teh Marbuta, Hamza variants
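
The normalization steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the ISI preprocessing code: the function names, the exact Unicode code points, and the direction of each mapping (for example Alef Maqsura to Yeh rather than the reverse) are assumptions, since the slide names the character classes but not the mapping.

```python
import re

# Assumed code points; the slide lists the classes, not the exact characters.
KASHIDA = "\u0640"                           # tatweel (elongation mark)
DIACRITICS = re.compile("[\u064B-\u0652]")   # short vowels, tanween, shadda, sukun

# Assumed mapping directions, chosen only for illustration.
CHAR_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> yeh
    "\u0629": "\u0647",  # teh marbuta           -> heh
    "\u0624": "\u0621",  # waw with hamza        -> bare hamza
    "\u0626": "\u0621",  # yeh with hamza        -> bare hamza
})


def normalize_arabic(text: str) -> str:
    """Strip kashida and diacritics, then collapse letter variants."""
    text = text.replace(KASHIDA, "")
    text = DIACRITICS.sub("", text)
    return text.translate(CHAR_MAP)


def to_cp1256(text: str) -> bytes:
    """Normalize, then reduce to the 8-bit CP-1256 character set.

    Characters with no CP-1256 equivalent are replaced with '?'.
    """
    return normalize_arabic(text).encode("cp1256", errors="replace")
```

Applying the same function to training and test data keeps spelling variants of one word from being counted as different vocabulary items.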

4
Morphology
  • Simple morphological segmentation did not improve
    performance at large training sizes
  • MT Extensions to UMass Light Stemming for IR
    (Larkey et al., SIGIR 2002)
  • Modified Buckwalter Stemmer (LDC), conservative
    stems (Xu, Fraser, Weischedel, SIGIR 2002)
  • Space-separated Arabic strings are already
    translated as consecutive-word phrases with
    baseline system
  • Used Buckwalter Stemmer and Gloss for unknown
    words

5
Training on long sentences
  • Realignment of sentences of length > 45 tokens on
    chunk level
  • Virtually all data can be used for training (93M
    words English, 82M words Arabic).
  • English chunks are projected to Arabic
  • IBM Model 1 Viterbi word alignment is used to
    project high precision chunk breaks from English
    to Arabic
  • Dynamic programming search for best chunk
    projection
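
The projection idea on this slide can be illustrated with a simplified sketch. Under IBM Model 1 the Viterbi alignment decomposes word by word, so each Arabic word links to the English word with the highest lexical probability. The dynamic program below then picks monotone Arabic break points that agree with as many links as possible. The function names, the toy probability table, the transliterated tokens, and the monotonicity assumption are all hypothetical; the slide does not specify the actual scoring or search used.

```python
from typing import Dict, List, Optional, Tuple


def viterbi_model1(english: List[str], arabic: List[str],
                   t: Dict[Tuple[str, str], float]) -> List[Optional[int]]:
    """For each Arabic word, index of the English word maximizing t(f|e).

    None means no English word has nonzero probability (unaligned).
    """
    links: List[Optional[int]] = []
    for f in arabic:
        best_i, best_p = None, 0.0
        for i, e in enumerate(english):
            p = t.get((f, e), 0.0)
            if p > best_p:
                best_i, best_p = i, p
        links.append(best_i)
    return links


def project_breaks(chunk_of_en: List[int], links: List[Optional[int]],
                   n_chunks: int) -> List[int]:
    """Return boundaries b so that chunk k covers arabic[b[k]:b[k+1]].

    Assumes chunks keep the same order in both languages (a simplification).
    The score is the number of Arabic words that fall inside the segment of
    the English chunk they are linked to.
    """
    n = len(links)
    chunk_of_ar = [chunk_of_en[i] if i is not None else None for i in links]
    # prefix[k][j]: Arabic words before position j that are linked to chunk k
    prefix = [[0] * (n + 1) for _ in range(n_chunks)]
    for k in range(n_chunks):
        for j in range(n):
            prefix[k][j + 1] = prefix[k][j] + int(chunk_of_ar[j] == k)
    unreachable = -1
    best = [[unreachable] * (n + 1) for _ in range(n_chunks + 1)]
    back = [[0] * (n + 1) for _ in range(n_chunks + 1)]
    best[0][0] = 0
    for k in range(1, n_chunks + 1):
        for j in range(n + 1):
            for i in range(j + 1):  # chunk k-1 covers arabic[i:j]
                if best[k - 1][i] == unreachable:
                    continue
                score = best[k - 1][i] + prefix[k - 1][j] - prefix[k - 1][i]
                if score > best[k][j]:
                    best[k][j], back[k][j] = score, i
    # Follow back-pointers from the end of the sentence.
    breaks, j = [n], n
    for k in range(n_chunks, 0, -1):
        j = back[k][j]
        breaks.append(j)
    return breaks[::-1]


# Toy example with made-up probabilities and transliterated placeholder tokens.
english = ["the", "house", "is", "big"]
chunk_of_en = [0, 0, 1, 1]  # two English chunks: "the house" | "is big"
arabic = ["albayt", "kabir"]
t = {("albayt", "house"): 0.8, ("albayt", "the"): 0.1, ("kabir", "big"): 0.9}
links = viterbi_model1(english, arabic, t)
breaks = project_breaks(chunk_of_en, links, n_chunks=2)
```

Once breaks are projected, each long sentence pair can be cut into shorter chunk pairs and realigned, which is what lets nearly all of the data be used for training.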

6
Error Analysis
  • Verbal movement and form
  • VSO ordering
  • Tense
  • NP structure
  • Missing 'to be' in present tense
  • Also causes spurious 'to be'
  • PRO
  • These are all syntactic problems
  • Also important: named entities, unknown words

7
Future
  • More parallel data: 1 billion words
  • More in-domain data
  • More test sets
  • Named Entity list
  • Research on Syntax