Coping with Surprise: Multiple CMU MT Approaches

August 5, 2003. TIDES PI Meeting / SLE

1
Coping with Surprise: Multiple CMU MT Approaches

Alon Lavie
Lori Levin, Jaime Carbonell,
Alex Waibel, Stephan Vogel,
Ralf Brown, Robert Frederking
Language Technologies Institute
Carnegie Mellon University
Joint work with
Katharina Probst, Erik Peterson, Joy Zhang,
Fei Huang, Alicia Tribble, Ariadna Font-Llitjos,
Rachel Reynolds, Richard Cohen

2
Main Hindi SLE Efforts

Data Collection
  Elicited Data Collection
  Data from contacts in India
  Web Crawling
Language Processing Utilities
  Morphology
  Encoding identification and conversion
MT system development
  XFER system
  SMT system
  EBMT system

3
Elicited Data Collection

Goal: Acquire high-quality word-aligned Hindi-English data to support XFER system development (grammar learning)
Recruited a team of 20 bilingual speakers at CMU and in India
Extracted a corpus of phrases (NPs and PPs) from the Brown Corpus section of the Penn TreeBank
Controlled Elicitation Corpus (typologically diverse, limited vocabulary) also translated into Hindi
Resulting in a total of 17589 word-aligned translated phrases (50KB)

4
The CMU Elicitation Tool

5
Elicited Data Collection

Problems and issues:
English → Hindi direction allowed us to use the Penn TreeBank to extract accurate phrases
However, bilingual informants were not accustomed to typing Hindi → typos
Typos limit the utility of the data, with less effect on accuracy
Using the WSJ portion of the Penn TreeBank may have been a better fit for genre

6
Main CMU Contributions to SLE Shared Resources

Elicited Data Corpus (50KB)
Indian Government Parallel Text: ERDC.tgz (338 MB)
CMU Phrase Lexicon: Joyphrase.gz (3.5 MB)
Cleaned IBM lexicon: ibmlex-cleaned.txt.gz (1.5 MB)
CMU Aligned Sentences: CMU-aligned-sentences.tar.gz (1.3 MB)
CMU Phrases and sentences: CMU-phrasessentences.zip (468 KB)
Bilingual Named Entity List: IndiaTodayLPNETranslists.tar.gz (54KB)
Web Crawling
  Most sites with possible parallel texts had Hindi in proprietary encodings
  Osho: http://www.osho.com/Content.cfm?Language=Hindi

7
Hindi Morphological Analyzer

http://www.iiit.net/ltrc/morph/index.htm
High-quality, high-coverage morphological analyzer from IIIT
Input: full inflected forms (RomanWX encoding)
Output: root form and a collection of features
Installing as a local server required some effort, e.g. UTF-8 → RomanWX conversion
Used primarily in our XFER system

8
Other Hindi Processing Utilities

Encoding identification and conversion tools
  Built two automatic encoding identifiers, used for web data collection
  Located and installed encoding converters for a variety of encodings
  Most widely used was UTF-8 to RomanWX
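
As a rough illustration of what an automatic encoding identifier for crawled Hindi pages might check, the sketch below tests whether a byte string decodes as UTF-8 and whether most of its characters fall in the Unicode Devanagari block (U+0900 to U+097F). This is not the method of the CMU identifiers, which the slides do not describe; the function name and the 0.5 threshold are assumptions made for the example.

```python
def looks_like_utf8_devanagari(data: bytes, threshold: float = 0.5) -> bool:
    """Return True if data decodes as UTF-8 and is mostly Devanagari.

    Hypothetical helper; the threshold is an arbitrary illustrative value.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: possibly a legacy or proprietary font encoding.
        return False
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    # Count characters inside the Devanagari block U+0900..U+097F.
    devanagari = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return devanagari / len(chars) >= threshold
```

A page that fails this check would then be a candidate for one of the other encodings and for the matching converter.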

9
XFER System for Hindi

Three passes:
  Match against phrase-to-phrase entries (full forms, no morphology)
  Morphologically analyze input words and match against lexicon
    Matches feed into manual and learned transfer rules
  Match original word against lexicon: provides word-to-word translation as fall-back for input not otherwise covered
Simple decoding (greedy left-to-right search that prefers longer input segments): NIST 5.35
Strong decoding with lattices and LM: NIST 5.47
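
The simple decoding strategy named above can be sketched as a greedy left-to-right pass that, at each position, takes the longest input segment for which a translation is available. This is a minimal illustration and not the XFER decoder itself: the toy phrase table, the romanized Hindi tokens, the segment length limit, and the policy of passing uncovered words through unchanged are all assumptions.

```python
def greedy_decode(tokens, phrase_table, max_len=4):
    """Greedy left-to-right decoding that prefers longer input segments."""
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest segment starting at position i first.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            segment = tuple(tokens[i:i + length])
            if segment in phrase_table:
                output.append(phrase_table[segment])
                i += length
                break
        else:
            # No entry covers this token: pass it through untranslated.
            output.append(tokens[i])
            i += 1
    return " ".join(output)


# Toy table with romanized Hindi keys (illustrative entries only).
table = {
    ("bhaarat",): "India",
    ("kii",): "of",
    ("bhaarat", "kii", "raajadhaanii"): "the capital of India",
}
result = greedy_decode(["bhaarat", "kii", "raajadhaanii", "dillii"], table)
```

Here the three-word entry wins over the one-word entries for "bhaarat" and "kii" because longer segments are tried first, and the uncovered last token is copied through.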

10
Examples of Learned Rules
NP,14244  Score: 0.0429  NP::NP  N -> DET N  ((X1::Y2))
NP,14434  Score: 0.0040  NP::NP  ADJ CONJ ADJ N -> ADJ CONJ ADJ N  ((X1::Y1) (X2::Y2) (X3::Y3) (X4::Y4))
PP,4894  Score: 0.0470  PP::PP  NP POSTP -> PREP NP  ((X2::Y1) (X1::Y2))
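
To show how the alignments in such rules drive reordering, here is a small sketch. In the PP rule, the Hindi side is NP followed by POSTP and the English side is PREP followed by NP, with X2 aligned to Y1 and X1 aligned to Y2. The function below only places already-translated source constituents into target slots; it ignores rule scores and any feature constraints, and its name and signature are hypothetical.

```python
def apply_rule(source_constituents, target_len, alignments):
    """Reorder translated source constituents according to X/Y alignments.

    source_constituents: translations of X1..Xn, in source order.
    alignments: (x, y) pairs with 1-based indices, e.g. (2, 1) for X2 to Y1.
    Target slots with no aligned source constituent are left as None.
    """
    target = [None] * target_len
    for x, y in alignments:
        target[y - 1] = source_constituents[x - 1]
    return target


# PP rule: NP POSTP -> PREP NP, with X2 aligned to Y1 and X1 to Y2.
pp = apply_rule(["the house", "in"], 2, [(2, 1), (1, 2)])

# NP rule: N -> DET N, with X1 aligned to Y2; the DET slot has no source.
np_ = apply_rule(["book"], 2, [(1, 2)])
```

In the NP example the first slot stays empty because the determiner is introduced on the English side rather than translated from a Hindi word.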
11
SMT System for Hindi

Resources
  Trained on commonly available bilingual corpora
  Used bilingual Hindi-English dictionary
  Named Entities
  70 million word English LM
CMU SMT System
  Tuned on ISI devtest data
  Monotone decoding, as reordering did not result in improvement on this test set
  Mixed casing based on Named Entities and simple rules
  NIST score: 6.74
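
Mixed casing based on named entities and simple rules could look roughly like the following post-processing step on lowercased output. The two rules used here (use the listed surface form for named entities, capitalize the sentence-initial token) and the sample entity list are assumptions; the slides do not say which rules the CMU system applied.

```python
def restore_case(tokens, named_entities):
    """Apply simple casing rules to a list of lowercased tokens."""
    cased = []
    for i, tok in enumerate(tokens):
        if tok in named_entities:
            # Named entity: use the surface form stored in the list.
            cased.append(named_entities[tok])
        elif i == 0:
            # Sentence-initial word: capitalize the first letter.
            cased.append(tok.capitalize())
        else:
            cased.append(tok)
    return cased


# Illustrative named-entity list mapping lowercased forms to cased forms.
ne = {"delhi": "Delhi", "india": "India", "bbc": "BBC"}
sentence = " ".join(restore_case("the bbc reported from delhi".split(), ne))
```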

12
EBMT System for Hindi

Training data: same as SMT, plus a few hand-written equivalence class generalizations
English LM built from APW portion of GigaWord Corpus (600M words)
Encoding variation: raw training data in a variety of different encodings → all converted to UTF-8 (already supported by EBMT)
Preprocessing of example phrases to improve word matching
Match Hindi possessive with English 's
NIST score: 5.98

13
A Truly Limited Data Scenario for Hindi-to-English

Put together a scenario with very miserly data resources:
  Elicited Data corpus: 17589 phrases
  Cleaned portion (top 12) of LDC dictionary: 2725 Hindi words (23612 translation pairs)
  Manually acquired resources during the SLE:
    500 manual bigram translations
    72 manually written phrase transfer rules
    105 manually written postposition rules
    48 manually written time expression rules
No additional parallel text!!
Results presented tomorrow

14
Other CMU Contributions to SLE Shared Resources

FOUND RESOURCES not on LDC Website
From TidesSLList Archive website
Vogel email 6/2:
  Hindi Language Resources: http://www.cs.colostate.edu/malaiya/hindilinks.html
  General Information on Hindi Script: http://www.latrobe.edu.au/indiangallery/devanagari.htm
  Dictionaries at http://www.iiit.net/ltrc/Dictionaries/Dict_Frame.html
  English to Hindi dictionary in different formats: http://sanskrit.gde.to/hindi/
  A small English to Urdu dictionary: http://www.cs.wisc.edu/navin/india/urdu.dictionary
  The Bible at http://www.gospelcom.net/ibs/bibles/
  The Emille Project: http://www.emille.lancs.ac.uk/home.htm
  Hardcopy phrasebook references
  A Monthly Newsletter of Vigyan Prasar: http://www.vigyanprasar.com/dream/index.asp
  Morphological Analyser: http://www.iiit.net/ltrc/morph/index.htm

15
Other CMU Contributions to SLE Shared Resources

FOUND RESOURCES not on LDC Website (cont.)
From TidesSLList Archive website
Tribble email, via Vogel 6/2: Possible parallel websites
  http://www.bbc.co.uk (English)
  http://www.bbc.co.uk/urdu/ (Hindi)
  http://sify.com/news_info/news/
  http://sify.com/hindi/
  http://in.rediff.com/index.html (English)
  http://www.rediff.com/hindi/index.html (Hindi)
  http://www.indiatoday.com/itoday/index.html
  http://www.indiatodayhindi.com
Vogel email 6/2:
  http://us.rediff.com/index.html
  http://www.rediff.com/hindi/index.html (already listed)
  http://www.niharonline.com/
  http://www.niharonline.com/hindi/index.html
  http://www.boloji.com/hindi/index.html
  http://www.boloji.com/hindi/hindi/index.htm
  The Gita Supersite: http://www.gitasupersite.iitk.ac.in/

16
Other CMU Contributions to SLE Shared Resources

FOUND RESOURCES not on LDC Website (cont.)
From TidesSLList Archive website
6/20: Parallel Hindi/English webpages
  GAIL (Natural Gas Co.): http://gail.nic.in/ (UTF-8). Found by CMU undergrad Web team; Mike Maxwell, LDC, found it at the same time.
SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE
From TidesSLList Archive website
Frederking email 6/3 (announced), 6/4 (provided):
  Ralf Brown's idenc encoding classifier
Frederking email 6/5:
  PDF extractions from LanguageWeaver URLs:
  http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/English/
  http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/Hindi/
Frederking email 6/5:
  Richard Wang's Perl ident.pl encoding classifier and ISCII-UTF8.pl converter
Frederking email 6/11:
  Erik Peterson here has put together a Perl wrapper for the IIIT Morphology package, so that the input can be UTF-8: http://progress.is.cs.cmu.edu/surprise/morph_wrapper.tar.gz

17
Other CMU Contributions to SLE Shared Resources

SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE (cont.)
From TidesSLList Archive website
Levin email 6/13:
  Directory of Elicited Word-Aligned English-Hindi Translated Phrases: http://progress.is.cs.cmu.edu/surprise/Elicited-Data/
Frederking email 6/20:
  Undecoded but believed to be parallel webpages: http://progress.is.cs.cmu.edu/surprise/merged_urls.txt
  PDF extractions from same: http://progress.is.cs.cmu.edu/surprise/merged_urls/
Frederking email 6/24:
  Several individual parallel webpages; sites may have more:
  www.commerce.nic.in/setup.htm
  www.commerce.nic.in/hindi/setup.html
  mohfw.nic.in/kk/95/books1.htm
  mohfw.nic.in/oph.htm
  www.mp.nic.in