Title: Coping with Surprise: Multiple CMU MT Approaches
1Coping with Surprise: Multiple CMU MT Approaches
- Alon Lavie
- Lori Levin, Jaime Carbonell,
- Alex Waibel, Stephan Vogel,
- Ralf Brown, Robert Frederking
- Language Technologies Institute
- Carnegie Mellon University
- Joint work with
- Katharina Probst, Erik Peterson, Joy Zhang,
- Fei Huang, Alicia Tribble, Ariadna Font-Llitjos,
- Rachel Reynolds, Richard Cohen
2Main Hindi SLE Efforts
- Data Collection
- Elicited Data Collection
- Data from contacts in India
- Web Crawling
- Language Processing Utilities
- Morphology
- Encoding identification and conversion
- MT system development
- XFER system
- SMT system
- EBMT system
3Elicited Data Collection
- Goal: Acquire high-quality word-aligned Hindi-English data to support XFER system development (grammar learning)
- Recruited a team of 20 bilingual speakers at CMU and in India
- Extracted a corpus of phrases (NPs and PPs) from the Brown Corpus section of the Penn TreeBank
- Controlled Elicitation Corpus (typologically diverse, limited vocabulary) also translated into Hindi
- Resulting in a total of 17589 word-aligned translated phrases (50KB)
4The CMU Elicitation Tool
5Elicited Data Collection
- Problems and issues
- English → Hindi direction allowed us to use the Penn TreeBank to extract accurate phrases
- However, bilingual informants were not accustomed to typing Hindi → typos
- Limits utility of the data, less effect on accuracy
- Using the WSJ portion of the Penn TreeBank may have been a better fit for genre
6Main CMU Contributions to SLE Shared Resources
- Elicited Data Corpus (50KB)
- Indian Government Parallel Text: ERDC.tgz (338 MB)
- CMU Phrase Lexicon: Joyphrase.gz (3.5 MB)
- Cleaned IBM lexicon: ibmlex-cleaned.txt.gz (1.5 MB)
- CMU Aligned Sentences: CMU-aligned-sentences.tar.gz (1.3 MB)
- CMU Phrases and sentences: CMU-phrasessentences.zip (468 KB)
- Bilingual Named Entity List: IndiaTodayLPNETranslists.tar.gz (54KB)
- Web Crawling
- Most sites with possible parallel texts had Hindi in proprietary encodings
- Osho: http://www.osho.com/Content.cfm?LanguageHindi
7Hindi Morphological Analyzer
- http://www.iiit.net/ltrc/morph/index.htm
- High-quality, high-coverage morphological analyzer from IIIT
- Input: full inflected forms (RomanWX encoding)
- Output: root form and a collection of features
- Installing as a local server required some effort, e.g. UTF-8 → RomanWX
- Used primarily in our XFER system
8Other Hindi Processing Utilities
- Encoding identification and conversion tools
- Built two automatic encoding identifiers, used for web data collection
- Located and installed encoding converters from a variety of encodings
- Most widely used was UTF-8 to RomanWX
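The slides do not say how the encoding identifiers work internally. As a minimal sketch of the general idea only, the Python below assumes a simple heuristic: a page counts as UTF-8 Hindi if its bytes decode strictly as UTF-8 and most non-space characters fall in the Unicode Devanagari block (U+0900 to U+097F). The function names and the 0.5 threshold are illustrative assumptions, not part of the CMU tools.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    devanagari = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return devanagari / len(chars)


def looks_like_utf8_hindi(raw: bytes, threshold: float = 0.5) -> bool:
    """Heuristic check (an assumption, not the CMU classifiers)."""
    try:
        # Strict decoding: invalid byte sequences raise UnicodeDecodeError.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return devanagari_ratio(text) >= threshold


if __name__ == "__main__":
    sample = "\u0939\u093f\u0902\u0926\u0940".encode("utf-8")  # the word "Hindi" in Devanagari
    print(looks_like_utf8_hindi(sample))
    print(looks_like_utf8_hindi(b"plain ASCII text"))
```

Pages in proprietary font encodings would fail this check and would need a dedicated classifier and converter, which is the gap the tools listed above were built to fill.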
9XFER System for Hindi
- Three passes
- match against phrase-to-phrase entries (full-forms, no morphology)
- morphologically analyze input words and match against lexicon; matches feed into manual and learned transfer rules
- match original word against lexicon; provides word-to-word translation as fall-back for input not otherwise covered
- Simple decoding: greedy left-to-right search that prefers longer input segments (NIST 5.35)
- Strong decoding with lattices and LM (NIST 5.47)
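The simple decoding strategy can be sketched as follows. This is an illustration under stated assumptions, not the XFER decoder itself: translation candidates are represented as hypothetical arcs `(start, end, translation)` over input positions, ties between equally long arcs go to the first one seen, and uncovered words are passed through unchanged.

```python
from typing import Dict, List, Tuple

# (start, end, translation): the arc covers words[start:end]. Hypothetical format.
Arc = Tuple[int, int, str]


def greedy_decode(words: List[str], arcs: List[Arc]) -> List[str]:
    """Greedy left-to-right search that prefers the longest input segment."""
    # Keep only the longest valid arc starting at each input position.
    longest: Dict[int, Arc] = {}
    for arc in arcs:
        start, end, _ = arc
        if start < 0 or end <= start or end > len(words):
            continue  # skip malformed arcs
        if start not in longest or end > longest[start][1]:
            longest[start] = arc

    output: List[str] = []
    pos = 0
    while pos < len(words):
        if pos in longest:
            _, end, translation = longest[pos]
            output.append(translation)
            pos = end  # jump past the covered segment
        else:
            output.append(words[pos])  # assumption: pass uncovered words through
            pos += 1
    return output


if __name__ == "__main__":
    words = ["w1", "w2", "w3", "w4"]  # placeholder input tokens
    arcs = [(0, 1, "A"), (0, 2, "A B"), (2, 3, "C")]
    print(greedy_decode(words, arcs))  # ['A B', 'C', 'w4']
```

Because the search commits to the longest arc at each position without looking ahead or scoring alternatives, it cannot recover from a poor early choice. That is the limitation the lattice-based decoding with a language model addresses.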
10Examples of Learned Rules
{NP,14244} ;;Score: 0.0429
NP::NP [N] -> [DET N]
( (X1::Y2) )
{NP,14434} ;;Score: 0.0040
NP::NP [ADJ CONJ ADJ N] -> [ADJ CONJ ADJ N]
( (X1::Y1) (X2::Y2) (X3::Y3) (X4::Y4) )
{PP,4894} ;;Score: 0.0470
PP::PP [NP POSTP] -> [PREP NP]
( (X2::Y1) (X1::Y2) )
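To illustrate how the constituent alignments in such a rule drive reordering, here is a minimal sketch for the PP rule above, which maps a Hindi NP followed by a POSTP to an English PREP followed by an NP, with source constituent 2 aligned to target slot 1 and source constituent 1 to target slot 2. The function name, the dictionary representation of alignments, and the example translations are assumptions made for illustration; the real transfer engine also handles features and unaligned target-side material.

```python
from typing import Dict, List, Optional


def apply_alignments(source_translations: List[str],
                     alignments: Dict[int, int],
                     target_length: int) -> List[Optional[str]]:
    """Place each translated source constituent X_i into its target slot Y_j.

    Indices are 1-based, as in the rule notation. Target slots with no
    aligned source constituent are left as None.
    """
    target: List[Optional[str]] = [None] * target_length
    for x_index, y_index in alignments.items():
        target[y_index - 1] = source_translations[x_index - 1]
    return target


if __name__ == "__main__":
    # PP rule: NP POSTP -> PREP NP, X2 aligned to Y1 and X1 aligned to Y2.
    pp_alignments = {2: 1, 1: 2}
    # Hypothetical translations of the NP and the postposition, in source order.
    reordered = apply_alignments(["the house", "in"], pp_alignments, target_length=2)
    print(" ".join(part for part in reordered if part is not None))  # in the house
```

For the first NP rule, the only alignment puts the source noun in target slot 2, leaving slot 1 (the determiner) to be filled from the target side of the rule.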
11SMT System for Hindi
- Resources
- Trained on commonly available bilingual corpora
- Used bilingual Hindi-English dictionary
- Named Entities
- 70 million word English LM
- CMU SMT System
- Tuned on ISI devtest data
- Monotone decoding, as reordering did not result in improvement on this test set
- Mixed casing based on Named Entities and simple rules
- NIST score: 6.74
12EBMT System for Hindi
- Training data: same as SMT, plus a few hand-written equivalence class generalizations
- English LM built from APW portion of GigaWord Corpus (600M words)
- Encoding variation: raw training data in a variety of different encodings → all converted to UTF-8 (already supported by EBMT)
- Preprocessing of example phrases to improve word matching
- Match Hindi possessive with English 's
- NIST score: 5.98
13A Truly Limited Data Scenario for Hindi-to-English
- Put together a scenario with very miserly data resources
- Elicited Data corpus: 17589 phrases
- Cleaned portion (top 12) of LDC dictionary: 2725 Hindi words (23612 translation pairs)
- Manually acquired resources during the SLE
- 500 manual bigram translations
- 72 manually written phrase transfer rules
- 105 manually written postposition rules
- 48 manually written time expression rules
- No additional parallel text!!
- Results presented tomorrow
14Other CMU Contributions to SLE Shared Resources
- FOUND RESOURCES not on LDC Website
- From TidesSLList Archive website
- Vogel email 6/2
- Hindi Language Resources: http://www.cs.colostate.edu/malaiya/hindilinks.html
- General Information on Hindi Script: http://www.latrobe.edu.au/indiangallery/devanagari.htm
- Dictionaries at http://www.iiit.net/ltrc/Dictionaries/Dict_Frame.html
- English to Hindi dictionary in different formats: http://sanskrit.gde.to/hindi/
- A small English to Urdu dictionary: http://www.cs.wisc.edu/navin/india/urdu.dictionary
- The Bible at http://www.gospelcom.net/ibs/bibles/
- The Emille Project: http://www.emille.lancs.ac.uk/home.htm
- Hardcopy phrasebook references
- A Monthly Newsletter of Vigyan Prasar
- http://www.vigyanprasar.com/dream/index.asp
- Morphological Analyser: http://www.iiit.net/ltrc/morph/index.htm
15Other CMU Contributions to SLE Shared Resources
- FOUND RESOURCES not on LDC Website (cont.)
- From TidesSLList Archive website
- Tribble email, via Vogel 6/2: Possible parallel websites
- http://www.bbc.co.uk (English)
- http://www.bbc.co.uk/urdu/ (Hindi)
- http://sify.com/news_info/news/
- http://sify.com/hindi/
- http://in.rediff.com/index.html (English)
- http://www.rediff.com/hindi/index.html (Hindi)
- http://www.indiatoday.com/itoday/index.html
- http://www.indiatodayhindi.com
- Vogel email 6/2
- http://us.rediff.com/index.html
- http://www.rediff.com/hindi/index.html (already listed)
- http://www.niharonline.com/
- http://www.niharonline.com/hindi/index.html
- http://www.boloji.com/hindi/index.html
- http://www.boloji.com/hindi/hindi/index.htm
- The Gita Supersite: http://www.gitasupersite.iitk.ac.in/
16Other CMU Contributions to SLE Shared Resources
- FOUND RESOURCES not on LDC Website (cont.)
- From TidesSLList Archive website
- 6/20: Parallel Hindi/English webpages
- GAIL (Natural Gas Co.): http://gail.nic.in/ (UTF-8). Found by CMU undergrad Web team; Mike Maxwell, LDC, found it at the same time.
- SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE
- From TidesSLList Archive website
- Frederking email: 6/3 announced, 6/4 provided
- Ralf Brown's idenc encoding classifier
- Frederking email 6/5
- PDF extractions from LanguageWeaver URLs: http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/English/ and http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/Hindi/
- Frederking email 6/5
- Richard Wang's Perl ident.pl encoding classifier and ISCII-UTF8.pl converter
- Frederking email 6/11
- Erik Peterson has put together a Perl wrapper for the IIIT Morphology package, so that the input can be UTF-8: http://progress.is.cs.cmu.edu/surprise/morph_wrapper.tar.gz
17Other CMU Contributions to SLE Shared Resources
- SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE (cont.)
- From TidesSLList Archive website
- Levin email 6/13
- Directory of Elicited Word-Aligned English-Hindi Translated Phrases: http://progress.is.cs.cmu.edu/surprise/Elicited-Data/
- Frederking email 6/20
- Undecoded but believed to be parallel webpages: http://progress.is.cs.cmu.edu/surprise/merged_urls.txt
- PDF extractions from same: http://progress.is.cs.cmu.edu/surprise/merged_urls/
- Frederking email 6/24
- Several individual parallel webpages; sites may have more: www.commerce.nic.in/setup.htm www.commerce.nic.in/hindi/setup.html mohfw.nic.in/kk/95/books1.htm mohfw.nic.in/oph.htm wwww.mp.nic.in