Title: Named Entity Transliteration
1Named Entity Transliteration
- Richard Sproat
- University of Illinois at Urbana-Champaign
- rws_at_uiuc.edu
- In collaboration with ChengXiang Zhai, Su-youn
Yoon, Andrew Fister, Tao Tao, Kyoung-young Kim,
Xiao Hu, Xuanhui Wang, Alex Kotov, Dan Roth, Alex
Klementiev
The Second Workshop on Computational Approaches to Arabic Script-based Languages, July 21-22, 2007
LSA 2007 Linguistic Institute, Stanford University
2Which are (or have been) Arabic script-based languages: an incomplete list
- Arabic (duh)
- Indo-European languages: Persian, Urdu, Kurdish, Pashto, Kashmiri, Sindhi, (Albanian, Bosnian, Spanish)
- Altaic languages: (Ottoman Turkish), Uyghur
- Austronesian languages: (Malay), (Malagasy)
- Sino-Tibetan: (Hui) Chinese
- Various sub-Saharan languages (Swahili, Kanuri, Hausa, Fulani)
3Why are Arabic script-based languages interesting?
- Only three things make Arabic script-based languages interesting qua Arabic script-based languages
- Layout properties of the script itself, e.g. the paper by Aamir Wali and Shafiq-ur Rahman in this conference
- In Urdu, the Nastaliq script is heavily structured around ligatures, resulting in word segmentation problems similar to Chinese.
- Arabic-based writing systems are (usually) consonantal alphabets: abjads.
- Severe ambiguity problems associated with not writing vowels, and various devices used to write (even short) vowels for foreign words
- These devices came to be used in some Arabic-based writing systems (Kurdish, Uyghur, Hui Chinese) to turn the writing systems into alphabets
- They also show up in foreign word transliteration
4Importance of transliteration
[Example news passages in non-Latin scripts; the original characters were lost in extraction]
5Transliteration in Egyptian
6New Kingdom Egyptian (1550 BC onwards)
7Transliteration in Early Semitic
- Bilingual Etruscan/Phoenician text from Pyrgi,
mentioning Caere (Etruscan Cisra), 500 BC
Thefarie (Tiberius)
Cisria
8Three methods for rating transliteration pairs
- Phonetic distance
- Supervised models
- Lightly supervised or unsupervised models
- Temporal cooccurrence
- Comparable corpora
- Document cooccurrence
9Previous Work
- Transliteration: Knight & Graehl 1998; Meng et al. 2001; Gao et al. 2004; inter alia.
- Comparable corpora: Fung 1995; Rapp 1995; Tanaka and Iwasaki 1996; Franz et al. 1998; Ballesteros and Croft 1998; Masuichi et al. 2000; Sadat et al. 2003; Tao and Zhai 2005.
- Mining transliterations from multilingual web pages: Zhang & Vines 2004
10Comparable Corpora
[Chinese news passage on the same matches; the original characters were lost in extraction]
In the
day's other matches, second seed Zhou Mi
overwhelmed Ling Wan Ting of Hong Kong, China
11-4, 11-4, Zhang Ning defeat Judith Meulendijks
of Netherlands 11-2, 11-9 and third seed Gong
Ruina took 21 minutes to eliminate Tine
Rasmussen of Denmark 11-1, 11-1, enabling China
to claim five quarterfinal places in the women's
singles.
11Named Entity Transliteration with Comparable
Corpora (ACL06)
- Supervised transliteration model trained on small lexicon
- Temporal alignment across English-Chinese corpora
- Within-document cooccurrence
Richard Sproat, Tao Tao and ChengXiang Zhai.
"Named Entity Transliteration with Comparable
Corpora". ACL 2006, July 17-21, 2006, Sydney,
Australia.
12Method
- We assume that we have comparable corpora, consisting of newspaper articles in a pair of languages
- In our experiments we use English and Chinese stories from the Xinhua News Agency for about 6 months of 2001.
- Identify English NEs using the method of Li et al. 2004 (based on SNoW; Carlson et al. 1999)
- Potential Chinese NEs can be found by finding spans of characters typically used for foreign names
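The candidate-finding step in the last bullet can be sketched as a scan for maximal runs of "foreign name" characters. This is only an illustration: the character set below is a small hand-picked sample rather than the list used in the talk, and the function name `candidate_spans` and the `min_len` cutoff are assumptions.

```python
# Hypothetical, tiny sample of characters often used to transliterate
# foreign names; the list used in the talk is much longer.
FOREIGN_NAME_CHARS = set("阿巴布达德尔菲加卡拉马梅尼斯塔蒂瓦")


def candidate_spans(text, charset=FOREIGN_NAME_CHARS, min_len=2):
    """Return maximal runs of characters from `charset` with length >= min_len."""
    spans = []
    run = []
    for ch in text:
        if ch in charset:
            run.append(ch)
        else:
            # The run ended; keep it only if it is long enough to be a name.
            if len(run) >= min_len:
                spans.append("".join(run))
            run = []
    if len(run) >= min_len:
        spans.append("".join(run))
    return spans


print(candidate_spans("印尼总统梅加瓦蒂访问马来西亚"))
```

Single characters such as the 尼 in 印尼 are discarded by the length cutoff, which is one plausible way to limit spurious candidates.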
13Foreign name characters
[Table of Chinese characters typically used in foreign names; the original characters were lost in extraction]
???? (Megawati)
14Method
- Given an English name, identify candidate Chinese character n-grams as possible transliterations.
- Score each candidate based on how likely the candidate is to be a transliteration of the English name.
- Two initial scoring methods:
  - Phonetic scoring
  - Frequency profile of the candidate pair over time
- Propagate scores of all the candidate transliteration pairs globally based on their cooccurrences in document pairs in the comparable corpora.
15Method 1: Phonetic Transliteration
- Much work using the source-channel approach
- Cast as a problem where you have a clean source (e.g. a Chinese name) and a noisy channel that corrupts the source into the observed form (e.g. an English name)
- $P(E \mid C)\,P(C)$
- E.g. $P(f_{i,E}\, f_{i+1,E}\, f_{i+2,E} \ldots f_{i+n,E} \mid s_C)$
- Chinese characters represent syllables ($s$); we match these to sequences of English phonemes ($f$)
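As a rough illustration of the channel term, the sketch below scores an English phoneme string against a sequence of Chinese syllables by searching over all ways of splitting the phonemes into one chunk per syllable. The probability table, the phoneme symbols, the pinyin-like syllable labels and the function name `best_log_prob` are all invented for the example, and the prior P(C) is left out.

```python
import math

# Hypothetical toy table of P(English phoneme chunk | Chinese syllable).
P_CHUNK_GIVEN_SYL = {
    ("ka", ("k", "a")): 0.5,
    ("ka", ("k",)): 0.2,
    ("bu", ("b", "u")): 0.6,
    ("bu", ("b",)): 0.1,
    ("er", ("l",)): 0.4,
}


def best_log_prob(phonemes, syllables, table=P_CHUNK_GIVEN_SYL, max_chunk=3):
    """Best sum of log P(chunk_k | syllable_k) over all segmentations.

    Returns -inf when no segmentation is licensed by the table.
    """
    n, m = len(phonemes), len(syllables)
    # best[i][j]: best log-prob of matching the first i phonemes
    # to the first j syllables.
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            for k in range(1, min(max_chunk, i) + 1):
                chunk = tuple(phonemes[i - k:i])
                p = table.get((syllables[j - 1], chunk))
                if p and best[i - k][j - 1] > -math.inf:
                    cand = best[i - k][j - 1] + math.log(p)
                    best[i][j] = max(best[i][j], cand)
    return best[n][m]


score = best_log_prob(["k", "a", "b", "u", "l"], ["ka", "bu", "er"])
print(math.exp(score))  # probability of the best segmentation
```

Taking the maximum over segmentations is a simplification; summing over them would give the full channel probability under the same toy table.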
16Method 1: Phonetic Transliteration
- Train a transliteration model from a dictionary of known transliterations (720 entries)
- Identify names in English news text for a given day using an existing named entity recognizer
- Process the same day of Chinese text looking for sequences of characters used in foreign names
- Do an all-pairs match using the transliteration model to find possible transliteration pairs
17Some Automatically Found Pairs
Pairs found in same day of newswire text
18Method 2: Frequency Correlation
- For any term W, record its term frequency on each day along the time line (Day 1, Day 2, Day 3, ..., Day n)
- Normalized to obtain a distribution
19Megawati-English vs. Megawati-Chinese
[Plots of frequency probability over time: Megawati-English against Megawati-Chinese, and Megawati-English against Arafat-Chinese]
- Quantitative measure: Pearson correlation
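A minimal sketch of the frequency-correlation idea: daily counts for a term are normalized into a distribution over days, and two terms are compared with the Pearson correlation of those distributions. The daily counts in the usage lines are made up for illustration, and the helper names are assumptions.

```python
import math


def normalize(counts):
    """Turn per-day term counts into a probability distribution over days."""
    total = sum(counts)
    if total == 0:
        raise ValueError("term never occurs in the time window")
    return [c / total for c in counts]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Invented daily counts over a five-day window.
english_term = normalize([0, 5, 1, 0, 4])
matching_chinese_term = normalize([0, 10, 2, 0, 8])
unrelated_chinese_term = normalize([6, 0, 0, 3, 0])

print(pearson(english_term, matching_chinese_term))
print(pearson(english_term, unrelated_chinese_term))
```

Normalizing first means a term that is simply more frequent in one language does not by itself change the comparison; only the shape of the profile over time matters.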
20Time correlation, hand-assigned phonetic scores
(work with Xiao Hu, Su-youn Yoon)
English     Chinese  time correlation  phonetic distance
lhasa       ??       0.632006          2.875000
taliban     ???      0.828756          3.071429
mustafa     ????     0.819008          3.187500
pakistan    ????     0.652715          3.222222
hutu        ??       0.518008          3.375000
killian     ???      0.765722          3.583333
durban      ??       0.646309          3.700000
mahathir    ????     0.560175          3.785714
kabul       ???      0.857423          3.900000
kuchma      ???      0.516287          3.916667
paper       ??       0.54034           4.000000
wahid       ???      0.852819          4.000000
cuba        ??       0.509132          4.100000
kashmir     ????     0.641191          4.142857
tajikistan  ?????    0.59021           4.227273
musharraf   ????     0.605773          4.250000
ramallah    ???      0.719745          4.250000
nathan      ??       0.714915          4.333333
malta       ???      0.715252          4.400000
kazakhstan  ?????    0.721586          4.409091
- English candidates are the most frequent 2000 words with (non-sentence-initial) capital letters
- Chinese candidates found using foreign name characters
- Pair-wise time correlation scores calculated
- Pairs with correlation of 0.5 or more kept
- Phonetic distances calculated using EMNLP method
- Rank by increasing phonetic distance
- Keep only the first appearance of each word in either language
21Method 3: Combining Phonetic and Time Correlation Methods
- Two methods of combination:
  - Phonetic filter: use the phonetic model to filter out (clearly impossible) candidates and then use the frequency correlation method to rank the candidates.
  - Score combination: mean of normalized scores
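The two combination strategies might look roughly like this for the candidates of a single English name. The slide does not say how scores are normalized or where the filter threshold sits, so min-max normalization, the `threshold` parameter and the toy scores are assumptions; phonetic scores here are similarities, with higher meaning better.

```python
def min_max(scores):
    """Rescale a {candidate: score} dict to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {c: 1.0 for c in scores}
    return {c: (v - lo) / (hi - lo) for c, v in scores.items()}


def phonetic_filter_then_time(phonetic, time_corr, threshold):
    """Drop candidates below the phonetic threshold, rank the rest by time correlation."""
    kept = [c for c in phonetic if phonetic[c] >= threshold]
    return sorted(kept, key=lambda c: time_corr[c], reverse=True)


def mean_of_normalized(phonetic, time_corr):
    """Average the two normalized scores for every candidate."""
    p, t = min_max(phonetic), min_max(time_corr)
    return {c: (p[c] + t[c]) / 2 for c in phonetic}


# Invented scores for three hypothetical candidates.
phonetic = {"c1": 0.9, "c2": 0.2, "c3": 0.6}
time_corr = {"c1": 0.5, "c2": 0.95, "c3": 0.7}

print(phonetic_filter_then_time(phonetic, time_corr, threshold=0.5))
print(mean_of_normalized(phonetic, time_corr))
```

In this toy case the filter removes the phonetically implausible candidate even though it has the highest time correlation, which is the behaviour the first bullet describes.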
22Method 4: Score Propagation
- The methods so far score each transliteration pair independently
- But knowing that two transliteration pairs co-occur in the same cross-lingual document pair should increase our confidence in both transliteration pairs
- Similarly, document pairs that contain lots of plausible transliteration pairs are likely comparable content-wise
- Thus, document/transliteration pairs reinforce each other
23Score Propagation
Assume w1 w4
Candidate transliteration pairs (ei, ci, wi):
(e1, c1, w1)  (e2, c2, w2)  (e3, c3, w3)
(e4, c4, w4)  (e5, c5, w5)  (e6, c6, w6)
- ei: English terms; ci: Chinese terms; wi: transliteration scores
- Initial ws are calculated individually
Given the co-occurrence, we reestimate w'1 w'4
24Score Propagation
Translit.Score
Translit. Pair
25Score Propagation
Similar to PageRank (Brin & Page, 1998)
26Estimate of P(j|i)
- Two ways
- Cooccurrences (CO) in whole collection
- Mutual Information (MI)
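The update rule itself is not reproduced in this transcript, so the sketch below assumes a PageRank-style form: each pair keeps a fraction α of its own score and receives the remainder from pairs it co-occurs with, weighted by P(j|i). Here P(j|i) is estimated from raw co-occurrence counts (the CO option); the function names and the toy counts are hypothetical.

```python
def transition_probs(cooc):
    """Estimate P(j|i) from a symmetric co-occurrence count matrix.

    cooc[i][j] is the number of document pairs in which transliteration
    pairs i and j both occur. Rows with no co-occurrences stay all zero.
    """
    n = len(cooc)
    probs = []
    for i in range(n):
        total = sum(cooc[i][j] for j in range(n) if j != i)
        probs.append([
            cooc[i][j] / total if total and j != i else 0.0
            for j in range(n)
        ])
    return probs


def propagate(w, probs, alpha=0.9, iterations=10):
    """Assumed update: w_i <- alpha * w_i + (1 - alpha) * sum_{j != i} P(j|i) * w_j."""
    w = list(w)
    n = len(w)
    for _ in range(iterations):
        w = [
            alpha * w[i]
            + (1 - alpha) * sum(probs[i][j] * w[j] for j in range(n) if j != i)
            for i in range(n)
        ]
    return w


# Pairs 0 and 1 start with equal scores; only pair 0 co-occurs with the
# high-scoring pair 2, so it ends up ranked above pair 1.
cooc = [[0, 0, 3], [0, 0, 0], [3, 0, 0]]
print(propagate([0.5, 0.5, 0.9], transition_probs(cooc), alpha=0.9, iterations=1))
```

Under this form a larger α keeps more of each pair's own score, so less is propagated between pairs.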
27Evaluation
- One day's worth of comparable news articles (234 Chinese stories and 322 English stories) from the Chinese and English Gigaword Corpora (LDC)
- Generate 600 English names with the NER (Li et al., 2004)
- Find potential Chinese names by looking for strings of foreign name characters
- 627 Chinese candidates
- In principle any of the 600 x 627 pairings could be correct
- Use phonetic, time-correlation and score propagation methods to rank the candidate pairings.
- Evaluate using Mean Reciprocal Rank (MRR)
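For reference, Mean Reciprocal Rank averages 1/rank of the correct answer over all queries. The sketch below assumes each English name has one gold transliteration and that a name whose answer is missing from its ranked list contributes zero; the data structures and toy names are illustrative only.

```python
def mean_reciprocal_rank(rankings, gold):
    """Compute MRR.

    rankings: {english_name: [candidates, best first]}
    gold:     {english_name: correct transliteration}
    A name whose answer is absent from its ranking contributes 0.
    """
    total = 0.0
    for name, ranked in rankings.items():
        answer = gold.get(name)
        if answer in ranked:
            total += 1.0 / (ranked.index(answer) + 1)
    return total / len(rankings)


rankings = {"a": ["x", "y"], "b": ["y", "x", "z"], "c": ["q"]}
gold = {"a": "x", "b": "z", "c": "w"}
print(mean_reciprocal_rank(rankings, gold))  # (1 + 1/3 + 0) / 3
```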
28Evaluation: Further Details
- A small number of English names do not seem to have any standard transliteration
- Removing these, we have a list of 490 out of the original 600 English names
- Furthermore, about 20% of answers are not in the Chinese candidate list
  - Either they are really not there
  - Or our candidate selection process missed them.
- This motivates two scores:
  - AllMRR: using the original list of 600 English names
  - CoreMRR: for just those English names whose answers are also in our Chinese candidate list
29Evaluation
- Phonetic correspondence yielded an AllMRR score of 0.3 and a CoreMRR score of 0.89
- Time-correlation scores yielded results as follows, with different correlation measures
30Score Combination
31Score Propagation
32Score Propagation (Core)
α = 0.95, CO
α = 0.95, MI
α = 0.9, CO
α = 0.9, MI
higher α, less propagation
33Summary
- Three complementary methods
- Phonetic match
- Time correlation
- Document cooccurrence
- Combining all three improves MRR
34Unsupervised Named Entity Transliteration Using
Temporal and Phonetic Correlation (EMNLP 06)
- Use linguistically derived edit distance based on phonetic features
- Time correlation, as before
- Tested on several language pairs
- English/Chinese
- English/Arabic
- English/Hindi
- (English/Korean)
Tao Tao, Su-Youn Yoon, Andrew Fister, Richard
Sproat and ChengXiang Zhai. "Unsupervised Named
Entity Transliteration Using Temporal and
Phonetic Correlation." EMNLP, July 22-23, 2006,
Sydney, Australia.
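A feature-based edit distance of the kind named above can be sketched as ordinary weighted edit distance in which the substitution cost depends on how many phonetic features two phonemes share. The feature table and the cost function (Jaccard distance between feature sets) are stand-ins invented for illustration; they are not the hand-assigned costs or pseudofeatures from the paper.

```python
# Hypothetical toy feature table; a real system would cover a full
# phonetic alphabet and use linguistically motivated costs.
FEATURES = {
    "p": {"labial", "stop"},
    "b": {"labial", "stop", "voiced"},
    "t": {"alveolar", "stop"},
    "d": {"alveolar", "stop", "voiced"},
    "m": {"labial", "nasal", "voiced"},
    "a": {"vowel", "low"},
    "i": {"vowel", "high"},
}


def sub_cost(x, y):
    """Substitution cost in [0, 1]: share of features on which x and y differ."""
    fx, fy = FEATURES[x], FEATURES[y]
    return len(fx ^ fy) / len(fx | fy)


def phonetic_distance(a, b, indel=1.0):
    """Weighted edit distance between two phoneme sequences."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * indel]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + indel,               # delete x
                cur[j - 1] + indel,            # insert y
                prev[j - 1] + sub_cost(x, y),  # substitute x -> y
            ))
        prev = cur
    return prev[-1]


print(phonetic_distance(["p", "a"], ["b", "a"]))  # differs only in voicing
print(phonetic_distance(["p", "a"], ["m", "a"]))  # differs in more features
```

Because near-identical phonemes are cheap to substitute, a pair such as p/b costs less than p/m, which is the property that lets the distance work without training data.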
35Phonetic Transcriptions (WorldBet)
36Pseudofeatures based on L2 learner errors (Swan & Smith 2002)
37Example Pseudofeatures
38Hand-assigned costs
39Example phoneme distances
40Example transliterations
41Two ways to combine time and phonetic distance
- Rank combination
- Score combination
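One plausible reading of the two options, for the candidates of a single English name: rank combination sums each candidate's rank under the two measures, while score combination interpolates the scores with a weight α. The slides do not give the exact formulas, so the rank sum, the conversion of distance to similarity, and the choice that α weights the time correlation are all assumptions, as are the toy values.

```python
def ranks(scores, higher_is_better=True):
    """Map each candidate to its 1-based rank under one measure."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {c: r for r, c in enumerate(ordered, 1)}


def rank_combination(time_corr, phon_dist):
    """Order candidates by the sum of their time-correlation and phonetic ranks."""
    rt = ranks(time_corr, higher_is_better=True)
    rp = ranks(phon_dist, higher_is_better=False)  # smaller distance is better
    return sorted(time_corr, key=lambda c: rt[c] + rp[c])


def score_combination(time_corr, phon_dist, alpha=0.5):
    """Order candidates by alpha * time correlation + (1 - alpha) * phonetic similarity."""
    max_d = max(phon_dist.values()) or 1.0
    combined = {
        c: alpha * time_corr[c] + (1 - alpha) * (1 - phon_dist[c] / max_d)
        for c in time_corr
    }
    return sorted(combined, key=combined.get, reverse=True)


# Invented values for three hypothetical candidates.
time_corr = {"c1": 0.86, "c2": 0.63, "c3": 0.90}
phon_dist = {"c1": 2.9, "c2": 3.9, "c3": 4.1}

print(rank_combination(time_corr, phon_dist))
print(score_combination(time_corr, phon_dist, alpha=0.5))
print(score_combination(time_corr, phon_dist, alpha=1.0))  # time correlation only
```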
42Evaluation data
43Evaluation
Phonetic
Time
Combination
44Score combination with different αs
[Plot of MRR against α; one end of the α axis puts more weight on time correlation, the other more weight on phonetics]
45Training Phonetic Weights (ACL 07)
Su-Youn Yoon, Kyoung-Young Kim and Richard Sproat. "Multilingual Transliteration Using Feature based Phonetic Method." ACL 2007, Prague.
- Train linear classifier (from SNoW) on a per-language basis.
- Training data consists of
  - 500 pairs from training dictionary
  - 750,000 negative examples: 500 English words crossed with 1500 randomly selected words from target language
- Similarly generated held-out test data
46Results on held-out data
47Comparable corpora test data
- Comparable corpora, 200 most frequent English names in 7-day window
- For Arabic, Hindi, Korean: pick all words in same 7-day window. (Korean picked from a one-month window.)
- Chinese candidates generated as before.
48Example results
49Results
50Results on cross-lingual training (CoreMRR)
51Arabic, trained on Arabic (CoreMRR 0.72)
52Arabic, trained on Hindi (CoreMRR 0.62)
53What do we learn about Arabic Script-based
Languages?
- Maybe not too much so far, since the techniques are blind to the properties of the script
- But can one show that there's a measurable difference in transliteration accuracy due to the defective nature of the writing system for ASBLs?
54JHU CLSP 2008 Workshop??
- Proposal for workshop on transliteration was accepted at the WS2007 [sic] planning meeting in early November
- Plan will be to produce the definitive public-domain transliteration toolkit
55Proposed toolkit
- Should allow one to develop orthographic/phonetic models using minimal training data. (I.e., one should not assume a large, or indeed any, seed lexicon.)
- A part of this will be a system that can guess a ballpark pronunciation for any word in any language based on the UTF-8 codepoints in the word.
- Provide a suite of methods to compare time-frequency distributions of terms across comparable corpora
- Use document cooccurrence to boost confidence in transliteration candidates
- Be able to leverage lexical resources when available.
- Provide a set of methods whereby the above methods can be combined and improved through iterative retraining.
- Provide high quality transliterators for 10 languages: Arabic, Chinese, Urdu, Farsi, Russian, Hindi, Macedonian, Thai, Korean, Bengali.
- Provide a suite of visualization tools that allow one to track names in comparable corpora from two or more languages.
56UTF-8 pronouncer (work with Kyoung-young Kim)
Armenian_unicode.txt    Greek_unicode.txt     Limbu_unicode.txt        Tagbanwa_unicode.txt
Bengali_unicode.txt     Gujarati_unicode.txt  Malayalam_unicode.txt    Tamil_unicode.txt
Buhid_unicode.txt       Gurmukhi_unicode.txt  Mongolian_unicode.txt    Telugu_unicode.txt
Cherokee_unicode.txt    Hanunoo_unicode.txt   Myanmar_unicode.txt      Thaana_unicode.txt
Coptic_unicode.txt      Hebrew_unicode.txt    Oriya_unicode.txt        Thai_unicode.txt
Cyrillic_unicode.txt    Japanese_unicode.txt  Sinhala_unicode.txt      Tibetan_unicode.txt
Devanagari_unicode.txt  Kannada_unicode.txt   SylotiNagri_unicode.txt  Tifinagh_unicode.txt
Ethiopic_unicode.txt    Korean_unicode.txt    Syriac_unicode.txt       UnifiedCanadianAboriginalSyllabics_unicode.txt
Georgian_unicode.txt    Lao_unicode.txt       Tagalog_unicode.txt      Yi_unicode.txt
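The contents and format of these per-script files are not shown, so the following is only a guess at the general idea: a table from Unicode code points to rough phoneme symbols, applied character by character. The table entries, the phoneme symbols and the function name `ballpark_pronunciation` are invented for illustration.

```python
# Hypothetical fragment of a code-point-to-phoneme table covering a few
# Greek and Cyrillic letters.
PRONUNCIATION_TABLE = {
    0x03B1: "a",  # GREEK SMALL LETTER ALPHA
    0x03BA: "k",  # GREEK SMALL LETTER KAPPA
    0x03BC: "m",  # GREEK SMALL LETTER MU
    0x0430: "a",  # CYRILLIC SMALL LETTER A
    0x043A: "k",  # CYRILLIC SMALL LETTER KA
    0x043C: "m",  # CYRILLIC SMALL LETTER EM
    0x043E: "o",  # CYRILLIC SMALL LETTER O
    0x0442: "t",  # CYRILLIC SMALL LETTER TE
}


def ballpark_pronunciation(word, table=PRONUNCIATION_TABLE, unknown="?"):
    """Look up each character's code point; unknown characters map to `unknown`."""
    return [table.get(ord(ch), unknown) for ch in word.lower()]


print(ballpark_pronunciation("\u0430\u0442\u043e\u043c"))  # Cyrillic "atom"
print(ballpark_pronunciation("\u043c\u0430\u043a"))        # Cyrillic "mak"
```

A context-free lookup like this cannot handle scripts where pronunciation depends on neighbouring characters or on unwritten vowels, which is why the output is only a ballpark guess.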
58Acknowledgments
- National Security Agency Contract NBCHC040176, REFLEX
- Google Research Grant
- Team members: ChengXiang Zhai, Su-youn Yoon, Andrew Fister, Tao Tao, Kyoung-young Kim, Xiao Hu, Xuanhui Wang, Alex Kotov, Dan Roth, Alex Klementiev