Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System

About This Presentation

Slides: 31

Provided by: chadtl

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
Rapid Prototyping of a Transfer-based
Hebrew-to-English Machine Translation System

Alon Lavie
Language Technologies Institute
Carnegie Mellon University
Joint work with:
Shuly Wintner, Danny Shacham, Nurit Melnik,
Yuval Krymolowski - University of Haifa
Erik Peterson - Carnegie Mellon University

2
Outline

Context of this Work
CMU Statistical Transfer MT Framework
Hebrew and its Challenges for MT
Hebrew-to-English System
Morphological Analysis and Generation
MT Resources: lexicon and grammar
Translation Examples
Performance Evaluation
Conclusions, Current and Future Work

3
Current State-of-the-art in Machine Translation

MT underwent a major paradigm shift over the past
15 years:
From manually crafted rule-based systems with
manually designed knowledge resources
To search-based approaches founded on automatic
extraction of translation models/units from large
sentence-parallel corpora
Current dominant approach: Phrase-based
Statistical MT
Extract and statistically model large volumes of
phrase-to-phrase correspondences from
automatically word-aligned parallel corpora
"Decode" new input by searching for the most
likely sequence of phrase matches, using a
statistical Language Model for the target language

4
Current State-of-the-art in Machine Translation

Phrase-based MT State-of-the-art:
Requires, at minimum, several million words of
parallel text for adequate training
Limited to language pairs for which such data
exists: major European languages, Chinese,
Japanese, a few others
Linguistically shallow and highly lexicalized
models result in weak generalization
Best performance levels (BLEU ~0.6) on
Arabic-to-English provide understandable but
often still somewhat disfluent translations
Ill-suited for Hebrew and most of the world's
minor languages

5
CMU's Statistical-Transfer (XFER) Approach

Framework: Statistical search-based approach with
syntactic translation transfer rules that can be
acquired from data but also developed and
extended by experts
Elicitation: use bilingual native informants to
produce a small high-quality word-aligned
bilingual corpus of translated phrases and
sentences
Transfer-rule Learning: apply ML-based methods to
automatically acquire syntactic transfer rules
for translation between the two languages
XFER + Decoder:
XFER engine produces a lattice of possible
transferred structures at all levels
Decoder searches and selects the best scoring
combination
Rule Refinement: refine the acquired rules via a
process of interaction with bilingual informants
Word and Phrase bilingual lexicon acquisition

7
Transfer Rule Formalism
SL: the old man, TL: ha-ish ha-zaqen

NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)

Type information
Part-of-speech/constituent information
Alignments
x-side constraints
y-side constraints
xy-constraints,
e.g. ((Y1 AGR) = (X1 AGR))
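
To make the role of the alignments concrete, here is a minimal Python sketch that applies the DET ADJ N -> DET N DET ADJ reordering to "the old man". The `Rule` class, the `apply_rule` function and the three-word lexicon are hypothetical names invented for this illustration; they are not the XFER engine's data structures, and the feature constraints (agreement, definiteness) are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: tuple      # source-side constituent sequence (X1..Xn)
    target: tuple      # target-side constituent sequence (Y1..Ym)
    alignments: tuple  # (x_index, y_index) pairs, 1-based as in the rule


# NP::NP [DET ADJ N] -> [DET N DET ADJ], alignments copied from the slide.
NP_RULE = Rule(
    source=("DET", "ADJ", "N"),
    target=("DET", "N", "DET", "ADJ"),
    alignments=((1, 1), (1, 3), (2, 4), (3, 2)),
)

# Toy word-to-word lexicon, assumed only for this example.
LEXICON = {"the": "ha", "old": "zaqen", "man": "ish"}


def apply_rule(rule, tagged_words, lexicon):
    """Return the target-side words, or None if the POS sequence does not match."""
    tags = tuple(tag for _, tag in tagged_words)
    if tags != rule.source:
        return None
    output = [None] * len(rule.target)
    for x, y in rule.alignments:
        word, _ = tagged_words[x - 1]
        output[y - 1] = lexicon[word]  # X1 feeds both Y1 and Y3
    return output


result = apply_rule(NP_RULE, [("the", "DET"), ("old", "ADJ"), ("man", "N")], LEXICON)
print(result)  # ['ha', 'ish', 'ha', 'zaqen']
```

Note that X1 is aligned to two target positions, which is how the single English determiner ends up repeated on both the Hebrew noun and the adjective.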

8
The Transfer Engine

Main algorithm: chart-style bottom-up integrated
parsing+transfer with beam pruning
Seeded by word-to-word translations
Driven by transfer rules
Generates a lattice of transferred translation
segments at all levels
Some Unique Features:
Works with either learned or manually-developed
transfer grammars
Handles rules with or without unification
constraints
Supports interfacing with servers for
morphological analysis and generation
Can handle ambiguous source-word analyses and/or
SL segmentations represented in the form of
lattice structures

9
XFER Output Lattice
(28 28 "AND" -5.6988 "W" "(CONJ,0 'AND')")
(29 29 "SINCE" -8.20817 "MAZ " "(ADVP,0 (ADV,5 'SINCE')) ")
(29 29 "SINCE THEN" -12.0165 "MAZ " "(ADVP,0 (ADV,6 'SINCE THEN')) ")
(29 29 "EVER SINCE" -12.5564 "MAZ " "(ADVP,0 (ADV,4 'EVER SINCE')) ")
(30 30 "WORKED" -10.9913 "BD " "(VERB,0 (V,11 'WORKED')) ")
(30 30 "FUNCTIONED" -16.0023 "BD " "(VERB,0 (V,10 'FUNCTIONED')) ")
(30 30 "WORSHIPPED" -17.3393 "BD " "(VERB,0 (V,12 'WORSHIPPED')) ")
(30 30 "SERVED" -11.5161 "BD " "(VERB,0 (V,14 'SERVED')) ")
(30 30 "SLAVE" -13.9523 "BD " "(NP0,0 (N,34 'SLAVE')) ")
(30 30 "BONDSMAN" -18.0325 "BD " "(NP0,0 (N,36 'BONDSMAN')) ")
(30 30 "A SLAVE" -16.8671 "BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,34 'SLAVE')) ) ) ) ")
(30 30 "A BONDSMAN" -21.0649 "BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,36 'BONDSMAN')) ) ) ) ")
10
The Lattice Decoder

Simple Stack Decoder, similar in principle to
simple Statistical MT decoders
Searches for best-scoring path of non-overlapping
lattice arcs
No reordering during decoding
Scoring based on log-linear combination of
scoring components, with weights trained using
Minimum Error Rate Training (MERT)
Scoring components:
Statistical Language Model
Fragmentation: how many arcs are needed to cover
the entire translation?
Length Penalty
Rule Scores
Lexical Probabilities (not fully integrated)
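
The search for a best-scoring path of non-overlapping arcs with no reordering can be sketched in a few lines of Python. This is a simplified stand-in, not the actual decoder: it uses plain dynamic programming over lattice positions instead of a stack decoder, it has no language model, and it models fragmentation as a flat per-arc penalty, which is an assumption. The `decode` function, the half-open span convention and all arc scores below are invented for the illustration.

```python
# Each arc: (start, end, text, score) over half-open source span [start, end).
# Spans and scores are hypothetical; higher (less negative) scores are better.
ARCS = [
    (0, 1, "AND", -5.0),
    (1, 2, "SINCE", -8.0),
    (1, 2, "EVER SINCE", -12.5),
    (2, 3, "WORKED", -11.0),
    (2, 3, "SERVED", -11.5),
    (1, 3, "SINCE THEN WORKED", -19.5),
]


def decode(arcs, length, frag_penalty=1.0):
    """Best left-to-right sequence of adjacent arcs covering [0, length).

    Returns (score, [arc texts]) or None if no full cover exists.
    """
    # best[i] holds the best (score, path) covering positions [0, i).
    best = {0: (0.0, [])}
    for end in range(1, length + 1):
        for start, arc_end, text, score in arcs:
            if arc_end != end or start not in best:
                continue
            prev_score, prev_path = best[start]
            candidate = prev_score + score - frag_penalty  # one penalty per arc used
            if end not in best or candidate > best[end][0]:
                best[end] = (candidate, prev_path + [text])
    return best.get(length)


print(decode(ARCS, 3, frag_penalty=0.0))  # (-24.0, ['AND', 'SINCE', 'WORKED'])
print(decode(ARCS, 3, frag_penalty=1.0))  # (-26.5, ['AND', 'SINCE THEN WORKED'])
```

With no fragmentation penalty the three single-word arcs win; with a penalty of 1.0 per arc, the path that uses one longer arc comes out ahead, which is the effect a fragmentation feature is meant to have.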

11
XFER Lattice Decoder
0 0 ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL
Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0, Words: 13,13
235 < 0 8 -19.7602 B H IWM RBII (PP,0 (PREP,3 'ON')(NP,2 (LITERAL 'THE') (NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH'))(NP1,0 (NP0,1 (N,6 'DAY')))))))>
918 < 8 14 -46.2973 H ARIH AKL AT H PN (S,2 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,17 'LION')))))(VERB,0 (V,0 'ATE'))(NP,100 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,24 'RABBIT')))))))>
584 < 14 17 -30.6607 L ARWXH BWQR (PP,0 (PREP,6 'TO')(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NNP,3 (NP0,0 (N,32 'MORNING'))(NP0,0 (N,27 'MEAL')))))))>
12
XFER MT Prototypes

General XFER framework under development for the past
five years
Prototype systems so far:
German-to-English
Dutch-to-English
Chinese-to-English
Hindi-to-English
Hebrew-to-English
In progress or planned:
Mapudungun-to-Spanish
Quechua-to-Spanish
Brazilian Portuguese-to-English
Native-Brazilian languages to Brazilian
Portuguese
Hebrew-to-Arabic

13
Challenges for Hebrew MT

Paucity of existing language resources for Hebrew:
No publicly available broad-coverage
morphological analyzer
No publicly available bilingual lexicons or
dictionaries
No POS-tagged corpus or parse tree-bank corpus
for Hebrew
No large Hebrew/English parallel corpus
Scenario well suited for the CMU transfer-based MT
framework for languages with limited resources

14
Modern Hebrew Spelling

Two main spelling variants:
KTIV XASER (deficient): spelling that relies on the
vowel diacritics; words are left as bare consonants
when the diacritics are removed
KTIV MALEH (full): words with I/O/U vowels are
written with long vowels, which include a letter
KTIV MALEH is predominant, but not strictly
adhered to even in newspapers and official
publications -> inconsistent spelling
Example:
niqud (spelling): NIQWD, NQWD, NQD
When written as NQD, could also be niqed, naqed,
nuqad

15
Morphological Analyzer

We use a publicly available morphological
analyzer distributed by the Technion's Knowledge
Center, adapted for our system
Coverage is reasonable (for nouns, verbs and
adjectives)
Produces all analyses or a disambiguated analysis
for each word
Output format includes lexeme (base form), POS,
morphological features
Output was adapted to our representation needs
(POS and feature mappings)

16
Morphology Example

Input word: BWRH
0 1 2 3 4
--------BWRH--------
-----B-----WR--H--
--B---H----WRH---

17
Morphology Example

Y0 ((SPANSTART 0) (SPANEND 4) (LEX BWRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1 ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2 ((SPANSTART 1) (SPANEND 3) (LEX WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3 ((SPANSTART 3) (SPANEND 4) (LEX LH) (POS POSS))
Y4 ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5 ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6 ((SPANSTART 2) (SPANEND 4) (LEX WRH) (POS N) (GEN F) (NUM S))
Y7 ((SPANSTART 0) (SPANEND 4) (LEX BWRH) (POS LEX))
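
One way to read the analyses Y0 to Y7 is as arcs in a small lattice over positions 0 to 4, where each full path is one candidate segmentation of the input word. The Python sketch below enumerates those paths. Treating SPANSTART/SPANEND as lattice node positions is an interpretation of the slide, and the `segmentations` function and the tuple layout are hypothetical; only the span, LEX and POS values are copied from the analyses above.

```python
# (name, span_start, span_end, lexeme, pos), values copied from Y0..Y7.
ARCS = [
    ("Y0", 0, 4, "BWRH", "N"),
    ("Y1", 0, 2, "B", "PREP"),
    ("Y2", 1, 3, "WR", "N"),
    ("Y3", 3, 4, "LH", "POSS"),
    ("Y4", 0, 1, "B", "PREP"),
    ("Y5", 1, 2, "H", "DET"),
    ("Y6", 2, 4, "WRH", "N"),
    ("Y7", 0, 4, "BWRH", "LEX"),
]


def segmentations(arcs, start, end):
    """Yield every sequence of arcs that tiles [start, end) from left to right."""
    if start == end:
        yield []
        return
    for arc in arcs:
        if arc[1] == start and arc[2] <= end:
            for rest in segmentations(arcs, arc[2], end):
                yield [arc] + rest


paths = [
    [f"{lex}/{pos}" for _, _, _, lex, pos in path]
    for path in segmentations(ARCS, 0, 4)
]
for path in paths:
    print(" + ".join(path))
```

Under this reading the eight analyses combine into five complete paths: two whole-word readings (Y0 and Y7), B + WRH, B + H + WRH, and B + WR + LH. The transfer engine can then carry all of them forward instead of committing to one segmentation early.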

18
Translation Lexicon

Constructed our own Hebrew-to-English lexicon,
based primarily on the existing Dahan H-to-E and
E-to-H dictionary made available to us, augmented
by other public sources
Coverage is not great but not bad as a start:
Dahan H-to-E is about 15K translation pairs
Dahan E-to-H is about 7K translation pairs
Base forms, POS information on both sides
Converted Dahan into our representation, added
entries for missing closed-class items
(pronouns, prepositions, etc.)
Had to deal with spelling conventions
Recently augmented with 50K translation pairs
extracted from Wikipedia (mostly proper names and
named entities)

19
Manual Transfer Grammar (human-developed)

Initially developed by Alon in a couple of days,
extended and revised by Nurit over time
Current grammar has 36 rules:
21 NP rules
1 PP rule
6 verb-complex and VP rules
8 higher-phrase and sentence-level rules
Captures the most common (mostly local)
structural differences between Hebrew and English

20
Transfer Grammar: Example Rules

NP1,2
SL: MLH ADWMH
TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
(
(X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1)
)

NP1,3
SL: H MLWT H ADWMWT
TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
(X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1)
)
21
Hebrew-to-English MT Prototype

Initial prototype developed within a two-month
intensive effort
Accomplished:
Adapted available morphological analyzer
Constructed a preliminary translation lexicon
Translated and aligned Elicitation Corpus
Learned XFER rules
Developed (small) manual XFER grammar
System debugging and development
Evaluated performance on unseen test data using
automatic evaluation metrics

22
Example Translation

Input:
???? ?????? ???? ?????? ?????? ????? ???? ??
????? ??????
Gloss: After debates many decided the government to hold
referendum in issue the withdrawal
Output:
AFTER MANY DEBATES THE GOVERNMENT DECIDED TO HOLD
A REFERENDUM ON THE ISSUE OF THE WITHDRAWAL

23
Noun Phrases - Construct State
????? ????? ??????
HXL@T HNSIA HRAWN
decision.3SF-CS the-president.3SM the-first.3SM
THE DECISION OF THE FIRST PRESIDENT
????? ????? ???????
HXL@T HNSIA HRAWNH
decision.3SF-CS the-president.3SM the-first.3SF
THE FIRST DECISION OF THE PRESIDENT
24
Noun Phrases - Possessives
????? ????? ??????? ??????? ??? ???? ????? ?????
?????? ???????
HNSIA HKRIZ HMIMH HRAWNH LW THIH
the-president announced that-the-task.3SF the-first.3SF of-him will.3SF
LMCWA PTRWN LSKSWK BAZWRNW
to-find solution to-the-conflict in-region-POSS.1P
Without transfer grammar: THE PRESIDENT ANNOUNCED
THAT THE TASK THE BEST OF HIM WILL BE TO FIND
SOLUTION TO THE CONFLICT IN REGION OUR
With transfer grammar: THE PRESIDENT ANNOUNCED
THAT HIS FIRST TASK WILL BE TO FIND A SOLUTION TO
THE CONFLICT IN OUR REGION
25
Subject-Verb Inversion
????? ?????? ?????? ??????? ?????? ????? ???
ATMWL HWDIH HMMLH
yesterday announced.3SF the-government.3SF
TRKNH BXIRWT BXWD HBA
that-will-be-held.3PF elections.3PF in-the-month the-next
Without transfer grammar: YESTERDAY ANNOUNCED THE
GOVERNMENT THAT WILL RESPECT OF THE FREEDOM OF
THE MONTH THE NEXT
With transfer grammar: YESTERDAY THE GOVERNMENT
ANNOUNCED THAT ELECTIONS WILL ASSUME IN THE NEXT
MONTH
26
Subject-Verb Inversion
???? ??? ?????? ?????? ????? ????? ?????? ????
???? ????
LPNI KMH BWWT HWDIH HNHLT HMLWN
before several weeks announced.3SF management.3SF.CS the-hotel
HMLWN ISGR BSWF HNH
that-the-hotel.3SM will-be-closed.3SM at-end.3SM.CS the-year
Without transfer grammar: IN FRONT OF A FEW WEEKS
ANNOUNCED ADMINISTRATION THE HOTEL THAT THE HOTEL
WILL CLOSE AT THE END THIS YEAR
With transfer grammar: SEVERAL WEEKS AGO THE
MANAGEMENT OF THE HOTEL ANNOUNCED THAT THE HOTEL
WILL CLOSE AT THE END OF THE YEAR
27
Evaluation Results

Test set of 62 sentences from Haaretz newspaper,
2 reference translations

System       BLEU    NIST    Precision  Recall  METEOR
No Grammar   0.0616  3.4109  0.4090     0.4427  0.3298
Learned      0.0774  3.5451  0.4189     0.4488  0.3478
Manual       0.1026  3.7789  0.4334     0.4474  0.3617
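
To put the rows in proportion, the short Python sketch below computes each grammar's relative BLEU change over the no-grammar baseline. The relative figures are not reported on the slide; they are plain arithmetic on the table values, and with a 62-sentence test set they should be read as indicative only. The `relative_gain` helper is a hypothetical name for this example.

```python
# Scores copied from the evaluation table (no-grammar baseline, learned grammar, manual grammar).
RESULTS = {
    "No Grammar": {"BLEU": 0.0616, "NIST": 3.4109, "METEOR": 0.3298},
    "Learned": {"BLEU": 0.0774, "NIST": 3.5451, "METEOR": 0.3478},
    "Manual": {"BLEU": 0.1026, "NIST": 3.7789, "METEOR": 0.3617},
}


def relative_gain(metric, system, baseline="No Grammar"):
    """Fractional change of `system` over `baseline` on the given metric."""
    base = RESULTS[baseline][metric]
    return (RESULTS[system][metric] - base) / base


for system in ("Learned", "Manual"):
    print(f"{system}: BLEU {relative_gain('BLEU', system):+.1%}")
# Learned: BLEU +25.6%
# Manual: BLEU +66.6%
```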
28
Current and Future Work

Issues specific to the Hebrew-to-English system:
Coverage: further improvements in the translation
lexicon and morphological analyzer
Manual grammar development
Acquiring/training of word-to-word translation
probabilities
Acquiring/training of a Hebrew language model at
a post-morphology level that can help with
disambiguation
General issues related to the XFER framework:
Discriminative Language Modeling for MT
Effective models for assigning scores to transfer
rules
Improved grammar learning
Merging/integration of manual and acquired
grammars

29
Conclusions

Test case for the CMU XFER framework for rapid MT
prototyping
Preliminary system was a two-month, three-person
effort; we were quite happy with the outcome
Core concept of XFER Decoding is very powerful
and promising for MT
We experienced the main bottlenecks of knowledge
acquisition for MT: morphology, translation
lexicons, grammar...

30
Questions?
