Title: AMTEXT: Extraction-based MT for Arabic
1 AMTEXT: Extraction-based MT for Arabic
- Faculty: Alon Lavie, Jaime Carbonell
- Students and Staff: Laura Kieras, Peter Jansen
- Informant: Loubna El Abadi
2 Goals and Approach
- Analysts are often looking for limited, concrete information within the text, so full MT may not be necessary
- Alternative: rather than full MT followed by extraction, first extract and then translate only the extracted information
- But how do we extract just the relevant parts in the source language?
- AMTEXT approach:
  - Learn extraction patterns and their translations from small amounts of human-translated and aligned data
  - Combine with broad-coverage Named-Entity translation lexicons
  - System output: translation of the extracted information, in a structured representation
3 AMTEXT: Extraction-based MT
[Figure: system architecture. Labels shown: Word-aligned elicited data; Learning Module; Transfer Rules; Run-Time Extract/Transfer System; Source Text; Partial Parser; Transfer Engine; NE Translation Lexicon; Word Translation Lexicon; Post-processor; Extractor; Extracted Target Text; Filled Template.]
Example transfer rule shown in the figure:
  S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
  ((X1::Y1) (X4::Y4) (X5::Y5))
4 Elicitation Example
5 Learning Extraction Translation Patterns
- Elicited example:
  - Sharon nifgash hayom im bush
  - Sharon met with Bush today
- After generalization:
  - <PERSON> <MEET-V> <TE> im <PERSON>
  - <PERSON> <MEET-V> with <PERSON> <TE>
- Resulting learned pattern rule:
  S::S [PERSON MEET-V TE im PERSON] -> [PERSON MEET-V with PERSON TE]
  ((X1::Y1)
   (X2::Y2)
   (X3::Y5)
   (X5::Y4))
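The learned pattern rule above can be read as a reordering recipe: each alignment pairs a source position with a target position, and target positions without an alignment keep their literal word. The following is a minimal Python sketch of that reading, assuming the source constituents have already been recognized and translated. The function name `apply_rule` and the data layout are hypothetical illustrations, not the transfer engine's actual implementation.

```python
# Hypothetical sketch: apply the learned rule
#   PERSON MEET-V TE im PERSON -> PERSON MEET-V with PERSON TE
# to a source sentence whose constituents are already translated.

TARGET_PATTERN = ["PERSON", "MEET-V", "with", "PERSON", "TE"]

# Alignments from the slide, 1-based: source position X -> target position Y.
ALIGNMENTS = {1: 1, 2: 2, 3: 5, 5: 4}


def apply_rule(source_translations):
    """source_translations[i] is the translation of source constituent X(i+1),
    or None for an unaligned source word such as 'im'."""
    # Start from the target pattern; unaligned slots keep their literal word.
    target = list(TARGET_PATTERN)
    for x, y in ALIGNMENTS.items():
        target[y - 1] = source_translations[x - 1]
    return " ".join(target)


# Source: Sharon nifgash hayom im bush
print(apply_rule(["Sharon", "met", "today", None, "Bush"]))
```

With the elicited example as input, the temporal expression moves from third position in the source to last position in the target, which is the reordering the alignments encode.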
6 Transfer Rule Formalism
  ;; SL: the old man, TL: ha-ish ha-zaqen
  NP::NP [DET ADJ N] -> [DET N DET ADJ]
  (
   (X1::Y1)
   (X1::Y3)
   (X2::Y4)
   (X3::Y2)
   ((X1 AGR) = 3-SING)
   ((X1 DEF) = DEF)
   ((X3 AGR) = 3-SING)
   ((X3 COUNT) = +)
   ((Y1 DEF) = DEF)
   ((Y3 DEF) = DEF)
   ((Y2 AGR) = 3-SING)
   ((Y2 GENDER) = (Y4 GENDER))
  )
Parts of a rule:
- Type information
- Part-of-speech/constituent information
- Alignments
- x-side constraints
- y-side constraints
- xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
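To make the constraint kinds listed above concrete, here is a small Python sketch that checks value constraints (a feature must equal a constant) and agreement constraints (two features must be equal). It uses plain equality on dictionaries rather than true feature-structure unification, and the names `lookup` and `satisfies`, the tuple encoding, and the `MASC`/`FEM` gender values are assumptions made for illustration only.

```python
# Hypothetical sketch of rule constraints as plain equality checks.
# Feature structures are dicts keyed by constituent name (X1, Y2, ...).

def lookup(features, path):
    """path is (constituent, feature), e.g. ("X1", "AGR")."""
    constituent, feature = path
    return features.get(constituent, {}).get(feature)


def satisfies(features, constraints):
    """Each constraint is (path, value) for a value constraint, or
    (path, other_path) when the right-hand side is itself a path (a tuple)."""
    for left, right in constraints:
        expected = lookup(features, right) if isinstance(right, tuple) else right
        actual = lookup(features, left)
        # A missing feature never satisfies a constraint in this sketch.
        if actual is None or actual != expected:
            return False
    return True


# x-side value constraints from the NP rule for "the old man".
X_SIDE = [(("X1", "AGR"), "3-SING"), (("X1", "DEF"), "DEF"), (("X3", "AGR"), "3-SING")]
# y-side agreement constraint between the noun (Y2) and the adjective (Y4).
Y_AGREEMENT = [(("Y2", "GENDER"), ("Y4", "GENDER"))]

source = {"X1": {"AGR": "3-SING", "DEF": "DEF"}, "X3": {"AGR": "3-SING"}}
target = {"Y2": {"GENDER": "MASC"}, "Y4": {"GENDER": "MASC"}}

print(satisfies(source, X_SIDE))
print(satisfies(target, Y_AGREEMENT))
```

A real unification-based engine would also propagate feature values between constituents; this sketch only tests whether already-filled features are compatible.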
7 The Transfer Engine
8 Partial Parsing
- Input: full text in the foreign language
- Output: translation of extracted/matched text
- Goal: extract by effectively matching transfer rules with the full text
  - Identify/parse NEs and words in restricted vocabulary
  - Identify transfer-rule (source-side) patterns
  - Transfer Engine produces a complete lattice of transfer translations
Example:
  Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
  (spans labeled in the figure: NE-P, NE-P, NE-P, TE)
  Sharon will meet with Bush today
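The example above shows a pattern matching only part of the sentence, with intervening material skipped. A minimal Python sketch of that idea follows: it matches a source-side pattern against labeled tokens as an ordered subsequence, skipping anything in between. The function `match_with_skips`, the token segmentation, and the greedy leftmost strategy are all assumptions for illustration; the actual system produces a full lattice of matches rather than a single greedy one.

```python
def match_with_skips(pattern, tokens):
    """Return the tokens matched by pattern, in order, skipping unmatched
    material in between; return None if the pattern cannot be completed.
    tokens is a list of (label, text) pairs from NE/vocabulary tagging."""
    matched = []
    position = 0
    for wanted in pattern:
        # Advance past tokens that do not carry the wanted label.
        while position < len(tokens) and tokens[position][0] != wanted:
            position += 1
        if position == len(tokens):
            return None
        matched.append(tokens[position])
        position += 1
    return matched


# Assumed segmentation and labels for the example sentence.
TOKENS = [
    ("NE-P", "Sharon"),
    ("UNK", "meluve"),
    ("NE-P", "b-sar ha-xuc shalom"),
    ("MEET-V", "yipagesh im"),
    ("NE-P", "bush"),
    ("TE", "hayom"),
]

result = match_with_skips(["NE-P", "MEET-V", "NE-P", "TE"], TOKENS)
print([text for _, text in result])
```

Because matching is leftmost, the first person slot binds to "Sharon" and the accompanying phrase is skipped, which corresponds to the translation "Sharon will meet with Bush today".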
9 Post Processing
- Translation Selection Module
  - Select the most complete and coherent translation from the lattice based on scoring heuristics
- Structure Extraction
  - Extract translated entities from the pattern and display them in a structured table format
- Output Display
  - Perl scripts construct an HTML page for displaying complete translation results
10 Translation Selection Module Features
- Goal: a scoring function that can identify the most likely best match
- Lattice arc features from the transfer engine:
  - matched range of source
  - matched parts of target
  - transfer score
  - partial parse
11 Lattice Example
- Arafat to meet Peres in Brussels on Monday
- ErfAt yltqy byryz msAA AlAvnyn fy brwksl
- (1 1 "Arafat" 3 "ErfAt" "(PNAME,0 "Arafat")")
- (2 2 "will meet with" 3 "yltqy" "(MEET-V,5 "will meet with")")
- (3 3 "Peres" 3 "byryz" "(PNAME,1 "Peres")")
- (1 3 "Arafat will meet with Peres" 3 "ErfAt yltqy byryz" "((S,11 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) ) )")
- (4 4 "msAA" 3 "msAA" "(UNK,0 "msAA")")
- (5 5 "Monday" 3 "AlAvnyn" "(DAY,0 "Monday")")
- (4 5 "on Monday" 2.9 "msAA AlAvnyn" "((TE,4 (LITERAL "on")(DAY,0 "Monday") ) )")
- (1 5 "Arafat will meet with Peres on Monday" 3.2 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )")
- (1 5 "Arafat will meet with Peres Monday" 3.1 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )")
- (6 6 "fy" 3 "fy" "(UNK,2 "fy")")
- (7 7 "Brussels" 3 "brwksl" "(PLACE,0 "Brussels")")
12 Example: Extracting Features
- 1 5 -> length (tokens) of source segment (ar) (1)
- "Arafat will meet with Peres Monday" -> length of translated segment (2)
- 3.1 -> transfer engine score (3)
- "ErfAt yltqy byryz msAA AlAvnyn" (tokens 1 2 3 4 5) -> length of source segment (4)
- "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )" -> transfer structure: full frame (S) or not? (5)
- Secondary feature (6): relative lengths of (2) over (4); the smaller, the more concise the source language match (less extraneous material, i.e. less chance of mistranslation).
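The feature list above can be illustrated with a short Python sketch that parses one lattice arc and computes features (1) through (6). The field order (start, end, target text, score, source text, parse tree) is inferred from the example arcs; the regular expression, the function names, and the whitespace tokenization are assumptions, not the system's actual code.

```python
import re

# Assumed field order: start, end, target text, score, source text, parse tree.
ARC_RE = re.compile(r'^\((\d+) (\d+) "(.*?)" ([\d.]+) "(.*?)" "(.*)"\)$')


def parse_arc(line):
    """Split one lattice arc into its six fields."""
    match = ARC_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a lattice arc: {line!r}")
    start, end, target, score, source, tree = match.groups()
    return {"start": int(start), "end": int(end), "target": target,
            "score": float(score), "source": source, "tree": tree}


def arc_features(arc):
    """Compute the numbered features from the slide for one parsed arc."""
    target_len = len(arc["target"].split())
    source_len = len(arc["source"].split())
    return {
        "source_span": arc["end"] - arc["start"] + 1,   # (1) tokens covered
        "target_len": target_len,                        # (2)
        "transfer_score": arc["score"],                  # (3)
        "source_len": source_len,                        # (4)
        "full_frame": arc["tree"].startswith("((S,"),    # (5) full S frame?
        "length_ratio": target_len / source_len,         # (6) (2) over (4)
    }


ARC = ('(1 5 "Arafat will meet with Peres Monday" 3.1 '
       '"ErfAt yltqy byryz msAA AlAvnyn" '
       '"((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") '
       '(PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )")')

features = arc_features(parse_arc(ARC))
print(features)
```

The non-greedy groups in the pattern stop at the first closing quote that is followed by the next field, which is what lets the unescaped quotes inside the parse tree pass through into the last field.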
13 Selecting Best Translation
- For each parse Pj in the lattice, calculate a score Sj from features fi with weight coefficients wi
- Weights wi are trained by hill climbing (training set / manual reference parse)
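The scoring formula itself is not preserved in the slide text, so the sketch below assumes the simplest form consistent with the description, a linear weighted sum S_j = sum_i w_i * f_i(P_j). It then trains the weights with a basic coordinate-wise hill climb that accepts a change only when more training sentences select their reference parse. The function names, step size, feature vectors, and toy data are all hypothetical.

```python
def score(weights, features):
    """Assumed linear form: S_j = sum_i w_i * f_i(P_j)."""
    return sum(w * f for w, f in zip(weights, features))


def best_parse(weights, lattice):
    """Index of the highest-scoring parse; lattice is a list of feature vectors."""
    return max(range(len(lattice)), key=lambda j: score(weights, lattice[j]))


def accuracy(weights, training_set):
    """Fraction of sentences whose top-scoring parse is the reference parse."""
    hits = sum(best_parse(weights, lattice) == ref for lattice, ref in training_set)
    return hits / len(training_set)


def hill_climb(training_set, n_features, step=0.5, max_rounds=50):
    """Coordinate-wise hill climbing from all-zero weights."""
    weights = [0.0] * n_features
    best = accuracy(weights, training_set)
    for _ in range(max_rounds):
        improved = False
        for i in range(n_features):
            for delta in (step, -step):
                candidate = list(weights)
                candidate[i] += delta
                value = accuracy(candidate, training_set)
                if value > best:  # accept only strict improvements
                    weights, best, improved = candidate, value, True
        if not improved:
            break  # local optimum reached
    return weights, best


# Toy data: each feature vector is [transfer score, full frame (0/1), source span];
# the second element of each pair is the index of the manual reference parse.
TRAINING = [
    ([[3.0, 0, 3], [3.1, 1, 5], [2.9, 0, 2]], 1),
    ([[3.0, 0, 1], [3.0, 1, 3]], 1),
]

weights, train_accuracy = hill_climb(TRAINING, n_features=3)
print(weights, train_accuracy)
```

Hill climbing of this kind can stall in a local optimum, so in practice one would likely restart from several initial weight vectors; that refinement is left out here for brevity.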
14 Proof-of-Concept System
- Arabic-to-English
- Newswire text (available from TIDES)
- Very limited set of actions (X meet Y)
- Limited collection of translation patterns:
  - <Person-NE> <meet-verb> <Person-NE> <LOC> <TE>
- Limited vocabulary and NE lexicon
15 System Development
- Training corpus of 535 short sentences translated and aligned by a bilingual informant:
  - 258 simple meeting sentences
  - 120 Temporal Expressions
  - 105 Location Expressions
  - 52 Title Expressions
- Translation Lexicon of Named Entities (person names, organizations and locations) converted from Fei Huang's NE translation/transliteration work
- Pattern Generalizations semi-automatically learned from the training data
- Patterns manually enhanced with skipping markers
- Initial System integrated
- Development with informant on 74-sentence dev data
16 Resulting System
- Transfer Grammar contains:
  - 21 transfer pattern rules
  - 12 Meet Verb rules
  - 4/17/11/17 Person/TE/LOC/PTitle high-level rules
- Transfer Lexicon contains 3070 entries (mostly names and locations)
- Estimated development effort/time:
  - 20 hours with informant
  - 50 hours of lexical and rule development
17 Evaluation
- Development set of 74 sentences
- Test set of 76 unseen sentences with meeting information
- Identified the subset of each set on which meeting patterns could potentially apply ("Good"):
  - 53 development sentences
  - 44 test sentences
18 Evaluation
- Translation-based
  - Unigram token-based retrieval metrics: precision / recall / F1
- Entity-based
  - Recall for each role in the meeting frame (V, P1, P2, LOC and TE)
  - Partial recall credit for partial matches
  - Partial credit (50%) for P1/P2 role interchange
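The two kinds of metric above can be sketched in a few lines of Python. The first function computes unigram precision, recall, and F1 with clipped token counts; the second gives per-role credit with half credit when P1 and P2 are interchanged. Lowercasing, whitespace tokenization, the function names, and the dictionary layout of a meeting frame are assumptions, and partial credit for partial matches is not modeled because the slides do not say how it was computed.

```python
from collections import Counter


def unigram_prf(hypothesis, reference):
    """Unigram precision, recall, and F1 using clipped token counts."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # shared tokens, clipped per type
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def role_credit(predicted, gold, role):
    """Credit for one role in one sentence: 1.0 for an exact match,
    0.5 when P1 and P2 are interchanged, otherwise 0.0."""
    value = predicted.get(role)
    if value is None:
        return 0.0
    if value == gold.get(role):
        return 1.0
    swapped = {"P1": "P2", "P2": "P1"}.get(role)
    if swapped is not None and value == gold.get(swapped):
        return 0.5
    return 0.0


# Illustrative inputs only; not results reported for the system.
p, r, f = unigram_prf("Arafat will meet with Peres Monday",
                      "Arafat to meet Peres in Brussels on Monday")
print(round(p, 3), round(r, 3), round(f, 3))

gold = {"P1": "Arafat", "P2": "Peres", "TE": "Monday", "LOC": "Brussels"}
predicted = {"P1": "Peres", "P2": "Arafat", "TE": "Monday"}
print([role_credit(predicted, gold, role) for role in ("P1", "P2", "TE", "LOC")])
```

Per-role recall over a test set would then be the average of `role_credit` over all sentences whose reference frame fills that role.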
19 Evaluation Results
20 Demonstration
- http://www-2.cs.cmu.edu/afs/cs/user/alavie/Avenue/tmp/demo20sep/met.dev.htm
21 Conclusions
- Attractive methodology for joint extraction and translation of Essential Elements of Information (EEIs) from full foreign language texts
- Rapid development: circumvents the need for developing high-quality full MT or high-quality IE technology for the foreign source language
- Effective use of bilingual informants
- Main open question: scalability
  - Can this methodology be effective with much broader and more complex types of extracted EEIs?
  - Is automatic learning of generalized patterns feasible and effective in such more complex scenarios?
  - Can the selection heuristics effectively cope with the vast amounts of ambiguity expected in a large-scale system?