AMTEXT: Extraction-based MT for Arabic

1
AMTEXT: Extraction-based MT for Arabic
  • Faculty
  • Alon Lavie, Jaime Carbonell
  • Students and Staff
  • Laura Kieras, Peter Jansen
  • Informant
  • Loubna El Abadi

2
Background and Objectives
  • Full MT of text is problematic
  • Requires large amounts of resources, long
    development time
  • Quality of output varies
  • Analysts are often looking for limited concrete
    information within the text → full MT may not be
    necessary
  • Alternative: rather than full MT followed by
    extraction, first extract and then translate only
    the extracted information
  • Text Extraction technology has made much progress
    in the past decade (TIPSTER, TREC, EELD)
  • Research Question: Can Extraction-based MT result
    in improved accuracy and utility of information
    for analysts?

3
Extraction-based MT
  • Traditional Approach
  • Develop information extraction capability for the
    source language
  • Runtime: Extractor produces a template of
    extracted feature-value information
  • If desired, English Generator can render the
    information in the form of text
  • Drawback: Adapting extraction technology to a new
    foreign language is difficult
  • Requires significant expertise in the foreign
    language
  • Significant amounts of human development time
  • Not clear that it is an attractive solution

4
AMTEXT Approach
  • Attempt to leverage our work on automatic
    learning of MT transfer rules
  • Develop an elicitation corpus specifically
    designed for targeted extraction patterns
  • Learn generalized transfer rules for targeted
    extraction patterns from the elicitation corpus
  • Acquire a high-accuracy Named-Entity translation
    lexicon and a limited translation lexicon for
    targeted vocabulary
  • Runtime: use the partial parser and transfer rules
    to translate only the matched portions of SL text

5
AMTEXT Extraction-based MT
[Architecture diagram; component labels:]
  • Word-aligned elicited data
  • Learning Module
  • Transfer Rules
  • Source Text
  • Run Time Transfer System
  • Partial Parser
  • Transfer Engine
  • NE Translation Lexicon
  • Word Translation Lexicon
  • Extracted Target Text

Example transfer rule:
S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
((X1::Y1) (X4::Y4) (X5::Y5))
6
Elicitation Example
7
Elicitation Example
8
Elicitation Example
9
Elicitation Example
10
Learning Transfer Rules
  • Different notion of rule generalization than in
    our full XFER approach
  • Generalize from examples to NEs that play
    specific roles in target extraction pattern
  • Verbs and function words may not be generalized
  • Example:

Target: Sharon will meet with Bush today
Source: sharon yipagesh im bush hayom
Goal Rule:
S::S [NE-P yipagesh im NE-P TE] -> [NE-P will meet with NE-P TE]
((X1::Y1) (X4::Y5) (X5::Y6))
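The generalization step on this slide can be sketched as follows. This is a minimal illustration, assuming the example is already tokenized, word-aligned, and tagged with NE-P/TE roles. The function name `generalize`, the argument layout, and the `S::S [...] -> [...]` string rendering are assumptions made for illustration, not the project's actual learning module.

```python
def generalize(src_tokens, tgt_tokens, src_tags, tgt_tags, alignments):
    """Build a rule string from one aligned example (hypothetical helper).

    src_tags / tgt_tags map a 1-based token position to a role label
    such as "NE-P" or "TE". Untagged positions (verbs, function words)
    are kept as literal words, i.e. they are not generalized.
    alignments is a list of (source_position, target_position) pairs.
    """
    src = [src_tags.get(i, tok) for i, tok in enumerate(src_tokens, start=1)]
    tgt = [tgt_tags.get(j, tok) for j, tok in enumerate(tgt_tokens, start=1)]
    # Keep only the alignments that link two generalized positions.
    links = [(i, j) for i, j in alignments if i in src_tags and j in tgt_tags]
    link_str = " ".join(f"(X{i}::Y{j})" for i, j in sorted(links))
    return f"S::S [{' '.join(src)}] -> [{' '.join(tgt)}] ({link_str})"


rule = generalize(
    src_tokens=["sharon", "yipagesh", "im", "bush", "hayom"],
    tgt_tokens=["Sharon", "will", "meet", "with", "Bush", "today"],
    src_tags={1: "NE-P", 4: "NE-P", 5: "TE"},
    tgt_tags={1: "NE-P", 5: "NE-P", 6: "TE"},
    alignments=[(1, 1), (2, 3), (3, 4), (4, 5), (5, 6)],
)
print(rule)
```

Only the tagged positions become variables; the verb and function words stay lexicalized, which mirrors the restriction stated in the bullets.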
11
Acquisition of Named Entity Translation Lexicon
  • Utilize Fei Huang's work on building Named Entity
    Translation Lexicons based on transliteration
    models
  • NE Lexicon will be split into meaningful
    sub-categories: PNs, Organizations, Locations,
    etc.
  • NE translation lexicon augmented with NEs from
    elicited data
  • Goal: High coverage and high accuracy
    identification of NEs that play a part in the
    transfer rules

12
Named Entity Translation Lexicon
  • English-Arabic lexicon from Fei
  • Trained on TIDES Newswire Data
  • 7522 entries sorted by transliteration score
  • Example:

4.51948528108464  XXX  Israel       AsrAAyl
4.05498190544419  XXX  Kabul        kAbwl
3.66368346525326  XXX  Paris        bArys
3.65527347080481  XXX  Afghanistan  AfgAnstAn
3.47030997281853  XXX  Pakistan     bAkstAn
3.23199522148251  XXX  Moscow       mwskw
3.20392400497002  XXX  Arafat       ErfAt
3.13060360328543  XXX  Beirut       byrwt
3.06872591580516  XXX  Russia       rwsyA
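A lexicon in this form could be loaded and filtered by score roughly as shown below. This is a sketch under stated assumptions: each entry is taken to be four whitespace-separated fields (score, placeholder, English name, transliterated form), and the names `NEEntry`, `parse_lexicon`, `build_lookup`, and the `min_score` threshold are hypothetical. Multi-word names would need a different field separator.

```python
from typing import NamedTuple


class NEEntry(NamedTuple):
    score: float
    english: str
    translit: str


def parse_lexicon(lines):
    """Parse 'score placeholder English translit' lines, best score first."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not fit the assumed layout
        score, _placeholder, english, translit = parts
        entries.append(NEEntry(float(score), english, translit))
    entries.sort(key=lambda e: e.score, reverse=True)
    return entries


def build_lookup(entries, min_score=0.0):
    """Map transliterated form -> English, keeping entries above a threshold."""
    lookup = {}
    for e in entries:
        if e.score >= min_score:
            # Entries are sorted, so the first (highest-scoring) one wins.
            lookup.setdefault(e.translit, e.english)
    return lookup


sample = [
    "3.66368346525326 XXX Paris bArys",
    "4.51948528108464 XXX Israel AsrAAyl",
    "4.05498190544419 XXX Kabul kAbwl",
]
entries = parse_lexicon(sample)
lookup = build_lookup(entries, min_score=4.0)
print(lookup)
```

A threshold of this kind trades coverage for accuracy; the value 4.0 is arbitrary and chosen only to show the filtering.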
13
Named Entity Identification
  • NE Identifinder for English
  • Available from BBN
  • Will be used for identifying English NEs within
    elicited data → Arabic NEs from word alignments
  • NE Identifinder for Arabic
  • Requested from BBN, so far no response
  • Will use if available, can manage without it
    (naïve identification based on NE translation
    lexicon)
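The naïve lexicon-based identification mentioned in the last bullet might look like the sketch below. It assumes a lookup from transliterated source forms to English names and uses greedy longest-match over token spans; the function name `tag_named_entities`, the `max_len` parameter, and the three-token input are illustrative assumptions, not part of the described system.

```python
def tag_named_entities(tokens, lexicon, max_len=3):
    """Greedy longest-match lookup of token spans in an NE lexicon.

    Returns (start, end, source_span, english) tuples, with `end` exclusive.
    """
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in lexicon:
                matches.append((i, i + n, span, lexicon[span]))
                i += n
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return matches


# Lexicon entries taken from the example slide; the token sequence is invented.
lexicon = {"ErfAt": "Arafat", "byrwt": "Beirut", "mwskw": "Moscow"}
tokens = ["ErfAt", "fy", "byrwt"]
found = tag_named_entities(tokens, lexicon)
print(found)
```

Pure lookup cannot resolve names missing from the lexicon or words that are ambiguous between a name and a common noun, which is presumably why a dedicated identifier is preferred when available.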

14
Acquisition of Limited Word Translation Lexicon
  • Vocabulary of interest is limited based on
    specific actions and objects that are of interest
    → scopeable on the English side
  • Elicitation corpus serves as a high-quality
    initial source for extracting this translation
    lexicon
  • Statistical word-to-word translation dictionary
    from SMT or EBMT can be used as a source for
    expanding coverage on the foreign language side
  • Experiment (if time/resources permit) with
    incorporating expanded vocabulary into transfer
    rules

15
Partial Parsing
  • Input: Full text in the foreign language
  • Output: Translation of extracted/matched text
  • Goal: Extract by effectively matching transfer
    rules with the full text
  • Identify/parse NEs and words in restricted
    vocabulary
  • Identify transfer-rule (source-side) patterns
  • Handle expected high levels of ambiguity

Source: Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
Identified constituents: NE-P, NE-P, NE-P, TE
Output: Sharon will meet with Bush today
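Matching a rule's source side against tagged text and emitting the translation can be sketched as below. This is a simplified illustration, not the Transfer Engine: it matches contiguous spans only, so it does not handle an intervening phrase such as the appositive in the example above, and it ignores ambiguity. The rule dictionary layout and the names `symbol_matches` and `apply_rule` are assumptions.

```python
# Hypothetical in-memory form of the "will meet with" rule.
RULE = {
    "source": ["NE-P", "yipagesh", "im", "NE-P", "TE"],
    "target": ["NE-P", "will", "meet", "with", "NE-P", "TE"],
    "align": {1: 1, 4: 5, 5: 6},  # source position -> target position (1-based)
}


def symbol_matches(item, symbol):
    """Category symbols match on the tag; anything else matches the literal word."""
    surface, tag = item
    if symbol in ("NE-P", "TE"):
        return tag == symbol
    return surface == symbol


def apply_rule(tagged, rule, translate):
    """Slide the source pattern over the text; translate every matching window."""
    src, tgt = rule["source"], rule["target"]
    outputs = []
    for start in range(len(tagged) - len(src) + 1):
        window = tagged[start:start + len(src)]
        if all(symbol_matches(it, sym) for it, sym in zip(window, src)):
            words = list(tgt)
            for s_pos, t_pos in rule["align"].items():
                words[t_pos - 1] = translate(window[s_pos - 1][0])
            outputs.append(" ".join(words))
    return outputs


tagged = [("sharon", "NE-P"), ("yipagesh", None), ("im", None),
          ("bush", "NE-P"), ("hayom", "TE")]
lexicon = {"sharon": "Sharon", "bush": "Bush", "hayom": "today"}
result = apply_rule(tagged, RULE, lambda w: lexicon.get(w, w))
print(result)
```

Text that matches no rule produces no output at all, which is the intended behavior of extraction-based translation: only the matched portions are translated.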
16
Scope of Pilot System
  • Arabic-to-English
  • Newswire text (available from TIDES)
  • Limited set of actions: (X meet Y) (X attend Y)
    (X hold Y) (X kill Y) (X announce Y)
  • Limited translation patterns
  • <subj-NE> <verb> <obj> <LOC> <TE>
  • Limited vocabulary

17
Evaluation Plan
  • Compare AMTEXT approach to full-text
    Arabic-to-English SMT, on a limited task of
    translation of relations within the scope of
    coverage
  • Establish a test set for evaluation
  • Define an appropriate metric: Precision/Recall/F1
    of relations and entities
  • Compare performance
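For reference, Precision/Recall/F1 over extracted relations can be computed as in the sketch below. It assumes relations are compared by exact match on (action, subject, object) tuples, which is one possible matching criterion and not necessarily the one the evaluation would adopt. The sample tuples are invented placeholders.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 between two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder relations: one shared between system output and reference.
gold = {("meet", "Sharon", "Bush"), ("hold", "A", "B")}
predicted = {("meet", "Sharon", "Bush"), ("kill", "C", "D")}
p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)
```

The same function applies to entities by passing sets of entity strings instead of relation tuples.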

18
Current Status
  • Initial small elicitation corpus translated and
    aligned
  • Extraction of elicitation phrases from Penn-TB in
    advanced stages
  • Identifying scope of coverage: relations,
    actions, translation patterns
  • Preliminary NE translation lexicon available

19
Work Plan
  • Creation of full elicitation corpus: Nov-03
  • Translation/alignment of elicitation corpus:
    Nov/Dec-03
  • Install and integrate BBN English Identifinder:
    Dec-03
  • Acquire initial NE translation lexicon: Dec-03
  • Acquire initial word translation lexicon: Dec-03
  • Develop and integrate partial parser:
    Dec-03/Feb-04
  • Modify Transfer Engine for AMTEXT configuration:
    Dec-03/Jan-04
  • Integration of preliminary complete system:
    Feb-04
  • Design of evaluation: Feb-04
  • System testing and modifications: Feb/Apr-04
  • Test-set evaluation: Apr-04