Coping with Surprise: Multiple CMU MT Approaches

1
Coping with Surprise: Multiple CMU MT Approaches
  • Alon Lavie
  • Lori Levin, Jaime Carbonell,
  • Alex Waibel, Stephan Vogel,
  • Ralf Brown, Robert Frederking
  • Language Technologies Institute
  • Carnegie Mellon University
  • Joint work with
  • Katharina Probst, Erik Peterson, Joy Zhang,
  • Fei Huang, Alicia Tribble, Ariadna Font-Llitjos,
  • Rachel Reynolds, Richard Cohen

2
Main Hindi SLE Efforts
  • Data Collection
      • Elicited Data Collection
      • Data from contacts in India
      • Web Crawling
  • Language Processing Utilities
      • Morphology
      • Encoding identification and conversion
  • MT system development
      • XFER system
      • SMT system
      • EBMT system

3
Elicited Data Collection
  • Goal: Acquire high-quality, word-aligned
    Hindi-English data to support XFER system
    development (grammar learning)
  • Recruited team of 20 bilingual speakers at CMU
    and in India
  • Extracted a corpus of phrases (NPs and PPs) from
    Brown Corpus section of Penn TreeBank
  • Controlled Elicitation Corpus (typologically
    diverse, limited vocabulary) also translated into
    Hindi
  • Resulting in total of 17589 word aligned
    translated phrases (50KB words)

4
The CMU Elicitation Tool

5
Elicited Data Collection: High quality,
word-aligned data
  • Controlled elicitation corpus, translated and
    aligned by Hindi speakers
  • Typologically diverse, vocabulary limited

6
Elicited Data Collection: High quality,
word-aligned data
  • Uncontrolled elicitation corpus: English phrases
    extracted from the Brown Corpus, translated by
    Hindi speakers
  • Specific constituent types, large vocabulary

7
Elicited Data Collection: High quality,
word-aligned data
  • Variety of phrase complexities and phrase lengths

8
Elicited Data Collection
  • Problems and issues
  • English → Hindi direction allowed us to use the
    Penn TreeBank to extract accurate phrases
  • However, bilingual informants were not well
    accustomed to typing Hindi → some typos
  • Limits utility of the data, little effect on
    accuracy
  • Using the WSJ portion of the PennTB may have been
    a better fit for genre

9
Main CMU Contributions to SLE Shared Resources
  • Elicited Data Corpus (50KB)
  • Indian Government Parallel Text: ERDC.tgz (338 MB)
  • CMU Phrase Lexicon: Joyphrase.gz (3.5 MB)
  • Cleaned IBM lexicon: ibmlex-cleaned.txt.gz (1.5 MB)
  • CMU Aligned Sentences: CMU-aligned-sentences.tar.gz
    (1.3 MB)
  • CMU Phrases and sentences: CMU-phrasessentences.zip
    (468 KB)
  • Bilingual Named Entity List:
    IndiaTodayLPNETranslists.tar.gz (54KB)
  • Web Crawling
      • Most sites with possible parallel texts had
        Hindi in proprietary encodings
      • Osho: http://www.osho.com/Content.cfm?LanguageHindi

10
Hindi Morphological Analyzer
  • http://www.iiit.net/ltrc/morph/index.htm
  • High quality and high coverage morphological
    analyzer from IIIT
  • Input: full inflected forms (RomanWX encoding)
  • Output: root form + collection of features
  • Installing as a local server required some
    effort, e.g. UTF-8 → RomanWX conversion
  • Used primarily in our XFER system
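
The UTF-8 → RomanWX step mentioned above can be illustrated with a minimal sketch. This is not the converter used in the project; the function name `devanagari_to_wx` and the table names are hypothetical, and the mapping covers only common letters (no nukta forms, digits, or rarer vowels). It assumes the standard WX convention in which a consonant carries an inherent "a" unless a vowel sign or virama follows.

```python
# Partial Devanagari -> WX (Roman) tables; illustrative subset only.
VOWELS = {"अ": "a", "आ": "A", "इ": "i", "ई": "I", "उ": "u",
          "ऊ": "U", "ए": "e", "ऐ": "E", "ओ": "o", "औ": "O"}
# Dependent vowel signs (matras), written as code points for clarity.
MATRAS = {"\u093e": "A", "\u093f": "i", "\u0940": "I", "\u0941": "u",
          "\u0942": "U", "\u0947": "e", "\u0948": "E", "\u094b": "o",
          "\u094c": "O"}
CONSONANTS = {"क": "k", "ख": "K", "ग": "g", "घ": "G", "च": "c", "छ": "C",
              "ज": "j", "झ": "J", "ट": "t", "ठ": "T", "ड": "d", "ढ": "D",
              "ण": "N", "त": "w", "थ": "W", "द": "x", "ध": "X", "न": "n",
              "प": "p", "फ": "P", "ब": "b", "भ": "B", "म": "m", "य": "y",
              "र": "r", "ल": "l", "व": "v", "श": "S", "ष": "R", "स": "s",
              "ह": "h"}
# Candrabindu, anusvara, visarga.
SIGNS = {"\u0901": "z", "\u0902": "M", "\u0903": "H"}
VIRAMA = "\u094d"


def devanagari_to_wx(text: str) -> str:
    """Transliterate Devanagari text to WX, passing unknown characters through."""
    out = []
    pending_a = False  # True after a consonant whose inherent 'a' is unwritten
    for ch in text:
        if ch in MATRAS:            # vowel sign replaces the inherent 'a'
            out.append(MATRAS[ch])
            pending_a = False
            continue
        if ch == VIRAMA:            # virama suppresses the inherent 'a'
            pending_a = False
            continue
        if pending_a:               # anything else: emit the inherent 'a' first
            out.append("a")
            pending_a = False
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch in SIGNS:
            out.append(SIGNS[ch])
        else:
            out.append(ch)
    if pending_a:
        out.append("a")
    return "".join(out)


print(devanagari_to_wx("भारत"))    # BArawa
print(devanagari_to_wx("हिन्दी"))  # hinxI
```

A table-driven pass like this is enough for feeding full inflected forms to a WX-based analyzer in a demo; a production converter would need the complete character inventory.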

11
Other Hindi Processing Utilities
  • Encoding identification and conversion tools
  • Built two automatic encoding identifiers, used
    for web data collection
  • Located and installed encoding converters from a
    variety of encodings
  • Most widely used was UTF-8 to RomanWX
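
As a rough illustration of what an automatic encoding identifier has to decide, here is a minimal sketch. It is not either of the identifiers built for the project: the function names, the labels it returns, and the 0.5 threshold are all assumptions. It only separates valid UTF-8 Devanagari text from everything else, which is the easy half of the problem; recognizing specific proprietary font encodings needs per-encoding byte statistics.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return hits / len(chars)


def guess_encoding(data: bytes, threshold: float = 0.5) -> str:
    """Label raw page bytes; the labels and threshold are illustrative only."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: possibly ISCII or a proprietary font encoding.
        return "not-utf8"
    if devanagari_ratio(text) >= threshold:
        return "utf8-devanagari"
    return "utf8-other"


print(guess_encoding("हिन्दी भाषा".encode("utf-8")))  # utf8-devanagari
print(guess_encoding(b"hello world"))                 # utf8-other
print(guess_encoding(b"\xff\xfe\xa4"))                # not-utf8
```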

12
XFER System for Hindi
  • Three transfer strategies
      • Match against phrase-to-phrase entries
        (full forms, no morphology)
      • Morphologically analyze input words and match
        against lexicon; matches feed into manual and
        learned transfer rules
      • Match original word against lexicon: provides
        word-to-word translation as fall-back for input
        not otherwise covered
  • Simple decoding (greedy left-to-right search that
    prefers longer input segments): NIST 5.35
  • Strong decoding with lattices + LM: NIST 5.47
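
The simple decoding strategy, a greedy left-to-right search that prefers longer input segments, can be sketched as follows. This is a toy illustration under stated assumptions, not the XFER decoder: the function name `greedy_decode`, the `max_len` parameter, and the three-entry lexicon (romanized in WX) are hypothetical, and the sketch does no reordering, rule application, or scoring.

```python
def greedy_decode(tokens, lexicon, max_len=4):
    """At each position take the longest source span found in the lexicon,
    emit its translation, and move past it. Uncovered words are copied."""
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + span])
            if key in lexicon:
                output.append(lexicon[key])
                i += span
                break
        else:
            # No entry covers this word: pass it through unchanged.
            output.append(tokens[i])
            i += 1
    return output


# Toy lexicon keyed by tuples of source tokens (hypothetical entries).
toy_lexicon = {
    ("nayI", "xillI"): "New Delhi",
    ("nayI",): "new",
    ("meM",): "in",
}
print(greedy_decode("nayI xillI meM".split(), toy_lexicon))
# ['New Delhi', 'in']
```

Because the two-word entry is tried before the one-word entry, "nayI xillI" is translated as a unit rather than word by word, which is the preference for longer segments named on the slide.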

13
Examples of Learned Rules
14
SMT System for Hindi
  • Resources
  • Trained on commonly available bilingual corpora
  • Used bilingual Hindi-English dictionary
  • Named Entities
  • 70 million word English LM
  • CMU SMT System
  • Tuned on ISI devtest data
  • Monotone decoding, as reordering did not result
    in improvement on this test set
  • Mixed casing based on Named Entities and simple
    rules
  • NIST score 6.74
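
The slide does not say which casing rules were used. As a hedged sketch of what "mixed casing based on Named Entities and simple rules" could look like, the following assumes lowercased decoder output, a hypothetical named-entity set, and a single simple rule (capitalize the first token of the sentence).

```python
def restore_case(tokens, named_entities):
    """Capitalize the sentence-initial token and any token in the named-entity set."""
    cased = []
    for i, tok in enumerate(tokens):
        if i == 0 or tok in named_entities:
            cased.append(tok[:1].upper() + tok[1:])
        else:
            cased.append(tok)
    return cased


entities = {"delhi", "india"}  # hypothetical named-entity list, lowercased
print(" ".join(restore_case("the minister visited delhi".split(), entities)))
# The minister visited Delhi
```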

15
EBMT System for Hindi
  • Training data: same as SMT + a few hand-written
    equivalence class generalizations
  • English LM built from APW portion of GigaWord
    Corpus (600M words)
  • Encoding variation: raw training data in a
    variety of different encodings → all converted to
    UTF-8 (already supported by EBMT)
  • Preprocessing of example phrases to improve word
    matching
  • Match Hindi possessive with English 's
  • NIST score: 5.98

16
A Truly Limited Data Scenario for Hindi-to-English
  • Put together a scenario with very miserly data
    resources
      • Elicited Data corpus: 17589 phrases
      • Cleaned portion (top 12) of LDC dictionary:
        2725 Hindi words (23612 translation pairs)
      • Manually acquired resources during the SLE
          • 500 manual bigram translations
          • 72 manually written phrase transfer rules
          • 105 manually written postposition rules
          • 48 manually written time expression rules
  • No additional parallel text!!
  • Results presented tomorrow

17
Other CMU Contributions to SLE Shared Resources
  • FOUND RESOURCES not on LDC Website
  • From TidesSLList Archive website
  • Vogel email 6/2
  • Hindi Language Resources:
    http://www.cs.colostate.edu/malaiya/hindilinks.html
  • General Information on Hindi Script:
    http://www.latrobe.edu.au/indiangallery/devanagari.htm
  • Dictionaries at
    http://www.iiit.net/ltrc/Dictionaries/Dict_Frame.html
  • English to Hindi dictionary in different formats:
    http://sanskrit.gde.to/hindi/
  • A small English to Urdu dictionary:
    http://www.cs.wisc.edu/navin/india/urdu.dictionary
  • The Bible at http://www.gospelcom.net/ibs/bibles/
  • The Emille Project:
    http://www.emille.lancs.ac.uk/home.htm
  • Hardcopy phrasebook references
  • A Monthly Newsletter of Vigyan Prasar:
    http://www.vigyanprasar.com/dream/index.asp
  • Morphological Analyser:
    http://www.iiit.net/ltrc/morph/index.htm

18
Other CMU Contributions to SLE Shared Resources
  • FOUND RESOURCES not on LDC Website (cont.)
  • From TidesSLList Archive website
  • Tribble email, via Vogel 6/2: possible parallel
    websites
  • http://www.bbc.co.uk (English)
  • http://www.bbc.co.uk/urdu/ (Hindi)
  • http://sify.com/news_info/news/
  • http://sify.com/hindi/
  • http://in.rediff.com/index.html (English)
  • http://www.rediff.com/hindi/index.html (Hindi)
  • http://www.indiatoday.com/itoday/index.html
  • http://www.indiatodayhindi.com
  • Vogel email 6/2
  • http://us.rediff.com/index.html
  • http://www.rediff.com/hindi/index.html (already
    listed)
  • http://www.niharonline.com/
  • http://www.niharonline.com/hindi/index.html
  • http://www.boloji.com/hindi/index.html
  • http://www.boloji.com/hindi/hindi/index.htm
  • The Gita Supersite:
    http://www.gitasupersite.iitk.ac.in/

19
Other CMU Contributions to SLE Shared Resources
  • FOUND RESOURCES not on LDC Website (cont.)
  • From TidesSLList Archive website
  • 6/20: Parallel Hindi/English webpages
  • GAIL (Natural Gas Co.): http://gail.nic.in/
    (UTF-8). Found by CMU undergrad Web team; Mike
    Maxwell, LDC, found it at the same time.
  • SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE
  • From TidesSLList Archive website
  • Frederking email: 6/3 announced, 6/4 provided
  • Ralf Brown's idenc encoding classifier
  • Frederking email 6/5
  • PDF extractions from LanguageWeaver URLs:
    http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/English/
    http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/Hindi/
  • Frederking email 6/5
  • Richard Wang's Perl ident.pl encoding classifier
    and ISCII-UTF8.pl converter
  • Frederking email 6/11
  • Erik Peterson has put together a Perl wrapper
    for the IIIT Morphology package, so that the
    input can be UTF-8:
    http://progress.is.cs.cmu.edu/surprise/morph_wrapper.tar.gz

20
Other CMU Contributions to SLE Shared Resources
  • SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE
    (cont.)
  • From TidesSLList Archive website
  • Levin email 6/13
  • Directory of Elicited Word-Aligned English-Hindi
    Translated Phrases:
    http://progress.is.cs.cmu.edu/surprise/Elicited-Data/
  • Frederking email 6/20
  • Undecoded but believed to be parallel webpages:
    http://progress.is.cs.cmu.edu/surprise/merged_urls.txt
  • PDF extractions from same:
    http://progress.is.cs.cmu.edu/surprise/merged_urls/
  • Frederking email 6/24
  • Several individual parallel webpages; sites may
    have more:
    www.commerce.nic.in/setup.htm
    www.commerce.nic.in/hindi/setup.html
    mohfw.nic.in/kk/95/books1.htm
    mohfw.nic.in/oph.htm
    www.mp.nic.in