1
Named Entity Transliteration
  • Richard Sproat
  • University of Illinois at Urbana-Champaign
  • rws_at_uiuc.edu
  • In collaboration with ChengXiang Zhai, Su-youn
    Yoon, Andrew Fister, Tao Tao, Kyoung-young Kim,
    Xiao Hu, Xuanhui Wang, Alex Kotov, Dan Roth, Alex
    Klementiev

The Second Workshop on Computational Approaches
to Arabic Script-based Languages, July 21-22,
2007, LSA 2007 Linguistic Institute, Stanford
University
2
Which are (or have been) Arabic script-based
languages? An incomplete list
  • Arabic (duh)
  • Indo-European languages: Persian, Urdu, Kurdish,
    Pashto, Kashmiri, Sindhi, (Albanian, Bosnian,
    Spanish)
  • Altaic languages: (Ottoman Turkish), Uyghur
  • Austronesian languages: (Malay), (Malagasy)
  • Sino-Tibetan: (Hui) Chinese
  • Various sub-Saharan languages (Swahili, Kanuri,
    Hausa, Fulani)

3
Why are Arabic script-based languages
interesting?
  • Only three things that make Arabic script-based
    languages interesting qua Arabic script-based
    languages:
  • Layout properties of the script itself, e.g. the
    paper by Aamir Wali and Shafiq-ur Rahman in this
    conference
  • In Urdu, the Nastaliq script is heavily
    structured around ligatures, resulting in word
    segmentation problems similar to Chinese.
  • Arabic-based writing systems are (usually)
    consonantal alphabets (abjads).
  • Severe ambiguity problems associated with not
    writing vowels, and various devices used to write
    (even short) vowels for foreign words
  • These devices came to be used in some Arabic-based
    writing systems (Kurdish, Uyghur, Hui Chinese)
    to turn the writing systems into alphabets
  • They also show up in foreign word transliteration

4
Importance of transliteration
[Sample passages in several non-Latin scripts; the characters did not survive extraction.]
5
Transliteration in Egyptian
6
New Kingdom Egyptian (1550 BC onwards)
7
Transliteration in Early Semitic
  • Bilingual Etruscan/Phoenician text from Pyrgi,
    mentioning Caere (Etruscan Cisra), 500 BC

Thefarie (Tiberius)
Cisria
8
Three methods for rating transliteration pairs
  • Phonetic distance
  • Supervised models
  • Lightly supervised or unsupervised models
  • Temporal cooccurrence
  • Comparable corpora
  • Document cooccurrence

9
Previous Work
  • Transliteration: Knight & Graehl 1998; Meng et
    al. 2001; Gao et al. 2004; inter alia.
  • Comparable corpora: Fung 1995; Rapp 1995; Tanaka
    and Iwasaki 1996; Franz et al. 1998; Ballesteros
    and Croft 1998; Masuichi et al. 2000; Sadat et
    al. 2003; Tao and Zhai 2005.
  • Mining transliterations from multilingual web
    pages: Zhang & Vines 2004

10
Comparable Corpora
[Chinese version of the story; the characters did not survive extraction.]
In the day's other matches, second seed Zhou Mi
overwhelmed Ling Wan Ting of Hong Kong, China
11-4, 11-4, Zhang Ning defeat Judith Meulendijks
of Netherlands 11-2, 11-9 and third seed Gong
Ruina took 21 minutes to eliminate Tine
Rasmussen of Denmark 11-1, 11-1, enabling China
to claim five quarterfinal places in the women's
singles.
11
Named Entity Transliteration with Comparable
Corpora (ACL06)
  • Supervised transliteration model trained on small
    lexicon
  • Temporal alignment across English-Chinese corpora
  • Within-document cooccurrence

Richard Sproat, Tao Tao and ChengXiang Zhai.
"Named Entity Transliteration with Comparable
Corpora". ACL 2006, July 17-21, 2006, Sydney,
Australia.
12
Method
  • We assume that we have comparable corpora,
    consisting of newspaper articles in a pair of
    languages
  • In our experiments we use data from the English
    and Chinese stories from the Xinhua News agency
    for about 6 months of 2001.
  • Identify English NEs using the method of Li et al.
    2001 (based on SNoW; Carlson et al. 1999)
  • Potential Chinese NEs can be found by finding
    spans of characters typically used for foreign
    names
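
The span-finding step above can be sketched as a simple character-class match. This is only an illustration: `FOREIGN_NAME_CHARS` is a tiny hand-picked subset assumed for the example (the talk's actual character list is much longer), and the minimum span length of two characters is likewise an assumption.

```python
import re

# Hypothetical, tiny subset of characters often used to write foreign
# names in Chinese; the real list used in the talk is far longer.
FOREIGN_NAME_CHARS = "梅加瓦蒂阿拉法特斯尔克巴布卡"

# Two or more consecutive characters drawn from the set above
# (the minimum length of 2 is an assumption of this sketch).
SPAN_PATTERN = re.compile("[" + FOREIGN_NAME_CHARS + "]{2,}")


def candidate_spans(text):
    """Return maximal runs of foreign-name characters found in text."""
    return SPAN_PATTERN.findall(text)


sentence = "印尼总统梅加瓦蒂今天会见了阿拉法特。"
print(candidate_spans(sentence))  # ['梅加瓦蒂', '阿拉法特']
```

Each returned span is only a candidate; the scoring methods later in the talk decide which candidates are plausible transliterations.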

13
Foreign name characters
[Table of Chinese characters typically used to write foreign names; the characters did not survive extraction. Slide example: the four-character transliteration of Megawati.]
14
Method
  • Given an English name, identify candidate Chinese
    character n-grams as possible transliterations.
  • Score each candidate based on how likely the
    candidate is to be a transliteration of the
    English name.
  • Two initial scoring methods:
  • Phonetic scoring
  • Frequency profile of the candidate pair over
    time
  • Propagate scores of all the candidate
    transliteration pairs globally based on their
    cooccurrences in document pairs in the comparable
    corpora.

15
Method 1: Phonetic Transliteration
  • Much work using the source-channel approach
  • Cast as a problem where you have a clean
    source (e.g. a Chinese name) and a noisy
    channel that corrupts the source into the
    observed form (e.g. an English name)
  • $P(E \mid C)\,P(C)$
  • E.g. $P(f_{i,E}\, f_{i+1,E}\, f_{i+2,E} \ldots f_{i+n,E} \mid s_C)$
  • Chinese characters represent syllables ($s$); we
    match these to sequences of English phonemes ($f$)
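
A minimal sketch of source-channel scoring, assuming a fixed alignment between Chinese syllables and chunks of English phonemes. The probability tables `CHANNEL` and `PRIOR`, the romanized syllable labels, and the phoneme symbols are all invented for illustration; a real model would estimate them from a dictionary of known transliterations.

```python
import math

# Hypothetical toy tables (not taken from the talk).
# CHANNEL[(syllable, phoneme_chunk)] stands for P(phoneme chunk | syllable).
CHANNEL = {
    ("ka", ("k", "a")): 0.6,
    ("ka", ("k", "ae")): 0.3,
    ("bu", ("b", "u")): 0.7,
    ("er", ("l",)): 0.5,
    ("er", ("r",)): 0.4,
}
# PRIOR[syllable sequence] stands for P(C).
PRIOR = {("ka", "bu", "er"): 0.01}


def log_score(syllables, phoneme_chunks):
    """Return log P(E|C) + log P(C) for one fixed syllable/chunk alignment."""
    if len(syllables) != len(phoneme_chunks):
        return float("-inf")
    # Unseen syllable sequences get a small floor probability (an assumption).
    total = math.log(PRIOR.get(tuple(syllables), 1e-9))
    for syllable, chunk in zip(syllables, phoneme_chunks):
        p = CHANNEL.get((syllable, tuple(chunk)), 0.0)
        if p == 0.0:
            return float("-inf")  # impossible under the toy channel model
        total += math.log(p)
    return total


good = log_score(["ka", "bu", "er"], [["k", "a"], ["b", "u"], ["l"]])
bad = log_score(["ka", "bu", "er"], [["k", "a"], ["b", "u"], ["s"]])
print(good, bad)
```

A full system would also sum or maximize over alternative alignments rather than assume one, which this sketch leaves out.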

16
Method 1: Phonetic Transliteration
  • Train a transliteration model from a dictionary
    of known transliterations (720 entries)
  • Identify names in English news text for a given
    day using an existing named entity recognizer
  • Process same day of Chinese text looking for
    sequences of characters used in foreign names
  • Do an all-pairs match using the transliteration
    model to find possible transliteration pairs

17
Some Automatically Found Pairs
Pairs found in same day of newswire text
18
Method 2: Frequency Correlation

For any term W, count its frequency on each day
along the time line (Day 1, Day 2, Day 3, ..., Day n).
The term frequencies are normalized to obtain a
distribution.
19
[Plots of frequency (probability) over time: Megawati-English vs. Megawati-Chinese, and Megawati-English vs. Arafat-Chinese]
Quantitative measure: Pearson correlation
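
The frequency-correlation idea can be sketched in a few lines: normalize each term's daily counts into a distribution, then compare two terms with the Pearson correlation. The daily counts below are made up for illustration and do not come from the Xinhua data.

```python
import math


def normalize(counts):
    """Turn per-day counts into a distribution over days."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


# Invented daily counts over six days.
megawati_en = normalize([0, 3, 12, 4, 0, 1])
megawati_zh = normalize([0, 2, 9, 3, 0, 1])
arafat_zh = normalize([5, 0, 1, 0, 6, 2])

print(pearson(megawati_en, megawati_zh))  # close to 1
print(pearson(megawati_en, arafat_zh))    # negative
```

Because Pearson correlation is invariant to scaling, the normalization matters mainly when other similarity measures over distributions are tried in its place.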
20
Time correlation, hand-assigned phonetic scores
(work with Xiao Hu, Su-youn Yoon)
word_pair            time correlation   phonetic distance
lhasa ??             0.632006           2.875000
taliban ???          0.828756           3.071429
mustafa ????         0.819008           3.187500
pakistan ????        0.652715           3.222222
hutu ??              0.518008           3.375000
killian ???          0.765722           3.583333
durban ??            0.646309           3.700000
mahathir ????        0.560175           3.785714
kabul ???            0.857423           3.900000
kuchma ???           0.516287           3.916667
paper ??             0.54034            4.000000
wahid ???            0.852819           4.000000
cuba ??              0.509132           4.100000
kashmir ????         0.641191           4.142857
tajikistan ?????     0.59021            4.227273
musharraf ????       0.605773           4.250000
ramallah ???         0.719745           4.250000
nathan ??            0.714915           4.333333
malta ???            0.715252           4.400000
kazakhstan ?????     0.721586           4.409091
  • English candidates are the most frequent 2000
    words with (non-sentence-initial) capital letters
  • Chinese candidates found using foreign name
    characters
  • Pair-wise time correlation scores calculated
  • Pairs with correlation of 0.5 or more kept
  • Phonetic distances calculated using EMNLP method
  • Rank by increasing phonetic distance
  • Keep only the first appearance of each word in
    either language
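
The selection steps listed above (keep pairs with correlation of 0.5 or more, rank by increasing phonetic distance, keep only the first appearance of each word) might be sketched as follows. The tuple layout and the placeholder Chinese identifiers such as `zh_A` are assumptions made for the example.

```python
def select_pairs(scored_pairs, min_corr=0.5):
    """scored_pairs: iterable of (english, chinese, time_corr, phon_dist)."""
    # Step 1: keep pairs whose time correlation meets the threshold.
    kept = [p for p in scored_pairs if p[2] >= min_corr]
    # Step 2: rank by increasing phonetic distance.
    kept.sort(key=lambda p: p[3])
    # Step 3: keep only the first appearance of each word in either language.
    seen_en, seen_zh, result = set(), set(), []
    for en, zh, corr, dist in kept:
        if en in seen_en or zh in seen_zh:
            continue
        seen_en.add(en)
        seen_zh.add(zh)
        result.append((en, zh, corr, dist))
    return result


pairs = [
    ("kabul", "zh_A", 0.857, 3.9),
    ("kabul", "zh_B", 0.60, 5.0),
    ("cuba", "zh_A", 0.70, 4.5),
    ("cuba", "zh_C", 0.51, 4.1),
    ("paper", "zh_D", 0.30, 2.0),  # dropped: correlation below 0.5
]
for row in select_pairs(pairs):
    print(row)
```

The greedy first-appearance rule means an early wrong match can block a later correct one, which is one reason the ranking quality of the phonetic distance matters.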

21
Method 3: Combining Phonetic and Time Correlation
Methods
  • Two methods of combination:
  • Phonetic filter: use the phonetic model to filter
    out (clearly impossible) candidates and then use
    the frequency correlation method to rank the
    candidates.
  • Score combination: mean of normalized scores
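
Both combination strategies can be sketched on toy scores. The slide does not say how scores are normalized, so min-max scaling is assumed here; the threshold value and the scores themselves are invented, and phonetic scores are treated as "higher is better".

```python
def min_max(scores):
    """Scale a dict of scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def phonetic_filter(phon, corr, threshold):
    """Drop candidates below the phonetic threshold, rank the rest by correlation."""
    survivors = [c for c in phon if phon[c] >= threshold]
    return sorted(survivors, key=lambda c: corr[c], reverse=True)


def score_combination(phon, corr):
    """Rank candidates by the mean of their normalized scores."""
    p, t = min_max(phon), min_max(corr)
    combined = {c: (p[c] + t[c]) / 2 for c in phon}
    return sorted(combined, key=combined.get, reverse=True)


phon = {"A": 0.9, "B": 0.2, "C": 0.6}    # invented phonetic scores
corr = {"A": 0.5, "B": 0.9, "C": 0.95}   # invented time correlations

print(phonetic_filter(phon, corr, threshold=0.5))  # ['C', 'A']
print(score_combination(phon, corr))               # ['C', 'A', 'B']
```

The filter discards candidate B outright, while score combination keeps it in the ranking with a low combined score.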

22
Method 4: Score Propagation
  • The methods so far score each transliteration
    pair independently
  • But knowing that two transliteration pairs
    co-occur in the same cross-lingual document pair
    should increase our confidence in both
    transliteration pairs
  • Similarly, document pairs that contain lots of
    plausible transliteration pairs are likely
    comparable content-wise
  • Thus, document/transliteration pairs reinforce
    each other

23
Score Propagation
Assume w1 w4
(e1, c1, w1)   (e2, c2, w2)   (e3, c3, w3)
(e4, c4, w4)   (e5, c5, w5)   (e6, c6, w6)

Initial w's are calculated individually.
ei: English terms; ci: Chinese terms; wi:
transliteration scores
Given the co-occurrence, we reestimate w'1 w'4
24
Score Propagation
[Figure labels: Translit. Score, Translit. Pair]
25
Score Propagation
Similar to PageRank (Brin & Page, 1998)
26
Estimate of $P(j \mid i)$
  • Two ways:
  • Cooccurrences (CO) in whole collection
  • Mutual Information (MI)
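
The update formula itself is not preserved in this transcript, so the following is a sketch under an assumed PageRank-style form: each pair keeps a fraction `alpha` of its own score and receives the rest from pairs it cooccurs with, weighted by an estimate of P(j|i) taken from cooccurrence counts. The function names and toy data are hypothetical.

```python
def transition_probs(doc_pairs):
    """Estimate P(j|i) from how often pairs i and j share a document pair."""
    counts = {}
    for pairs_in_doc in doc_pairs:
        for i in pairs_in_doc:
            for j in pairs_in_doc:
                if i != j:
                    row = counts.setdefault(i, {})
                    row[j] = row.get(j, 0) + 1
    probs = {}
    for i, row in counts.items():
        total = sum(row.values())
        probs[i] = {j: c / total for j, c in row.items()}
    return probs


def propagate(scores, probs, alpha=0.9, iterations=10):
    """Assumed update: w_i <- alpha * w_i + (1 - alpha) * sum_j P(j|i) * w_j."""
    w = dict(scores)
    for _ in range(iterations):
        w = {
            i: alpha * w[i]
            + (1 - alpha) * sum(p * w[j] for j, p in probs.get(i, {}).items())
            for i in w
        }
    return w


# p1 and p2 appear in the same document pair; p3 never cooccurs with anything.
scores = {"p1": 0.8, "p2": 0.4, "p3": 0.4}
probs = transition_probs([["p1", "p2"]])
updated = propagate(scores, probs, alpha=0.9, iterations=1)
print(updated)
```

With these numbers, p2 and p3 start with equal scores, but p2 ends up ahead because it cooccurs with the strong pair p1. A larger `alpha` leaves less room for this effect.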

27
Evaluation
  • One day's worth of comparable news articles (234
    Chinese stories and 322 English stories) from the
    Chinese and English Gigaword corpora (LDC)
  • Generate 600 English names with the NER (Li et
    al., 2004)
  • Find potential Chinese names by looking for
    strings of foreign name characters
  • 627 Chinese candidates
  • In principle any of the 600 x 627 pairings could
    be correct
  • Use phonetic, time-correlation and score
    propagation methods to rank the candidate
    pairings.
  • Evaluate using Mean Reciprocal Rank (MRR)
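
For reference, Mean Reciprocal Rank is conventionally defined as below, where $N$ is the set of English names being evaluated and $\mathrm{rank}_n$ is the position of the correct Chinese transliteration in the ranked candidate list for name $n$. The symbols are notation chosen here, not taken from the slides; a name whose answer never appears in the list contributes 0.

```latex
\mathrm{MRR} = \frac{1}{|N|} \sum_{n \in N} \frac{1}{\mathrm{rank}_n}
```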

28
Evaluation Further Details
  • Small number of English names do not seem to have
    any standard transliteration
  • Removing these, we have a list of 490 out of
    original 600 English names
  • Furthermore, about 20 of answers are not in
    Chinese candidate list
  • Either they are really not there
  • Or our candidate selection process missed them.
  • This motivates two scores:
  • AllMRR: using the original list of 600 English
    names
  • CoreMRR: for just those English names whose
    answers are also in our Chinese candidate list
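
The difference between the two scores can be sketched as follows, assuming each English name is mapped either to the rank of its correct transliteration or to `None` when the answer is missing from the Chinese candidate list. The ranks below are invented.

```python
def all_and_core_mrr(answer_ranks):
    """answer_ranks: name -> 1-based rank of the correct answer, or None
    when the answer is absent from the Chinese candidate list."""
    reciprocal = [1.0 / r for r in answer_ranks.values() if r is not None]
    # AllMRR averages over every name; missing answers contribute 0.
    all_mrr = sum(reciprocal) / len(answer_ranks)
    # CoreMRR averages only over names whose answer is in the candidate list.
    core_mrr = sum(reciprocal) / len(reciprocal) if reciprocal else 0.0
    return all_mrr, core_mrr


ranks = {"kabul": 1, "cuba": 2, "nathan": None, "malta": None}
print(all_and_core_mrr(ranks))  # (0.375, 0.75)
```

Names with no findable answer pull AllMRR down regardless of how good the ranking is, which is why CoreMRR is reported alongside it.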

29
Evaluation
  • Phonetic correspondence yielded an AllMRR score
    of 0.3 and a CoreMRR score of 0.89
  • Time-correlation scores yielded results as
    follows, with different correlation measures

30
Score Combination
31
Score Propagation
32
Score Propagation (Core)
α = 0.95, CO
α = 0.95, MI
α = 0.9, CO
α = 0.9, MI
higher α, less propagation
33
Summary
  • Three complementary methods
  • Phonetic match
  • Time correlation
  • Document cooccurrence
  • Combining all three improves MRR

34
Unsupervised Named Entity Transliteration Using
Temporal and Phonetic Correlation (EMNLP 06)
  • Use linguistically derived edit distance based on
    phonetic features
  • Time correlation, as before
  • Tested on several language pairs
  • English/Chinese
  • English/Arabic
  • English/Hindi
  • (English/Korean)
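
An edit distance driven by phonetic features can be sketched as a standard dynamic program whose substitution cost shrinks when two phonemes share features. The feature table, the cost rule (fraction of differing features), and the unit insertion/deletion cost are assumptions for illustration; the talk uses its own hand-assigned costs and pseudofeatures.

```python
# Hypothetical features: (voicing, place, manner).
FEATURES = {
    "p": ("-", "labial", "stop"),
    "b": ("+", "labial", "stop"),
    "t": ("-", "alveolar", "stop"),
    "d": ("+", "alveolar", "stop"),
    "k": ("-", "velar", "stop"),
    "s": ("-", "alveolar", "fricative"),
}


def sub_cost(a, b):
    """Substitution cost: fraction of features on which a and b differ."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # unknown phoneme: fall back to a full-cost substitution
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def phonetic_distance(src, tgt, indel=1.0):
    """Weighted edit distance between two phoneme sequences."""
    prev = [j * indel for j in range(len(tgt) + 1)]
    for i, a in enumerate(src, 1):
        cur = [i * indel]
        for j, b in enumerate(tgt, 1):
            cur.append(min(
                prev[j] + indel,               # delete a
                cur[j - 1] + indel,            # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a -> b
            ))
        prev = cur
    return prev[-1]


print(phonetic_distance("pat", "bat"))  # p/b differ only in voicing
print(phonetic_distance("pat", "sat"))  # p/s differ in place and manner
```

In this toy setup "pat" is closer to "bat" than to "sat", which is the kind of graded similarity a plain character edit distance cannot express.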

Tao Tao, Su-Youn Yoon, Andrew Fister, Richard
Sproat and ChengXiang Zhai. "Unsupervised Named
Entity Transliteration Using Temporal and
Phonetic Correlation." EMNLP, July 22-23, 2006,
Sydney, Australia.
35
Phonetic Transcriptions (WorldBet)
36
Pseudofeatures based on L2 learner errors (Swan &
Smith 2002)
37
Example Pseudofeatures
38
Hand-assigned costs
39
Example phoneme distances
40
Example transliterations
41
Two ways to combine time and phonetic distance
  • Rank combination
  • Score combination
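
The slides do not spell out either combination, so this sketch assumes rank combination means ordering candidates by the sum of their two ranks, and score combination means a weighted sum in which `alpha` is the weight on time correlation. The phonetic similarity values are assumed to have been derived from distances beforehand; all numbers are invented.

```python
def ranks(values, higher_is_better):
    """Map each candidate to its 1-based rank under one scoring method."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    return {c: r for r, c in enumerate(ordered, 1)}


def rank_combination(time_corr, phon_dist):
    """Order candidates by the sum of their time rank and phonetic rank."""
    rt = ranks(time_corr, higher_is_better=True)
    rp = ranks(phon_dist, higher_is_better=False)
    return sorted(time_corr, key=lambda c: rt[c] + rp[c])


def weighted_score_combination(time_corr, phon_sim, alpha):
    """Order candidates by alpha * time + (1 - alpha) * phonetic similarity."""
    combined = {
        c: alpha * time_corr[c] + (1 - alpha) * phon_sim[c] for c in time_corr
    }
    return sorted(combined, key=combined.get, reverse=True)


time_corr = {"A": 0.9, "B": 0.6, "C": 0.7}
phon_dist = {"A": 4.0, "B": 3.0, "C": 5.0}   # lower is better
phon_sim = {"A": 0.5, "B": 0.9, "C": 0.2}    # hypothetical, higher is better

print(rank_combination(time_corr, phon_dist))                  # ['A', 'B', 'C']
print(weighted_score_combination(time_corr, phon_sim, 0.9))    # ['A', 'C', 'B']
print(weighted_score_combination(time_corr, phon_sim, 0.1))    # ['B', 'A', 'C']
```

Rank combination ignores the size of score gaps, whereas the weighted sum is sensitive to them and to how the two scores are scaled.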

42
Evaluation data
43
Evaluation
Phonetic
Time
Combination
44
Score combination with different αs
[Plot: MRR against α; axis annotations "more weight on time corr." and "more weight on phonetics"]
45
Training Phonetic Weights (ACL 07)
Su-Youn Yoon, Kyoung-Young Kim and Richard Sproat.
"Multilingual Transliteration Using Feature based
Phonetic Method." ACL 2007, Prague.
  • Train a linear classifier (from SNoW) on a
    per-language basis.
  • Training data consists of:
  • 500 pairs from the training dictionary
  • 750,000 negative examples: 500 English words
    crossed with 1500 randomly selected words from
    the target language
  • Similarly generated held-out test data
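
The construction of the training set can be sketched as a cross product between the English side of the dictionary and one random sample of target-language words. The function name, the labels 1/0, and the exclusion of known positives from the negatives are assumptions; the demo uses tiny sizes, whereas at the slide's scale 500 English words times 1500 sampled words gives 750,000 negatives.

```python
import random


def build_training_set(positive_pairs, target_vocab, sample_size, seed=0):
    """Return (english, target, label) triples: 1 = known pair, 0 = negative."""
    rng = random.Random(seed)
    known = set(positive_pairs)
    sampled_targets = rng.sample(target_vocab, sample_size)
    examples = [(en, tgt, 1) for en, tgt in positive_pairs]
    for en, _ in positive_pairs:
        for tgt in sampled_targets:
            # Skip the rare case where a sampled word is the true answer.
            if (en, tgt) not in known:
                examples.append((en, tgt, 0))
    return examples


# Tiny synthetic demo: 5 dictionary pairs, 20 sampled target words.
positives = [("en%d" % i, "tgt%d" % i) for i in range(5)]
vocab = ["w%d" % i for i in range(100)]
data = build_training_set(positives, vocab, sample_size=20)

n_neg = sum(1 for _, _, label in data if label == 0)
print(len(data), n_neg)  # 105 100
```

The resulting triples would then be turned into feature vectors and fed to a linear classifier; that step depends on the feature design and is not shown here.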

46
Results on held-out data
47
Comparable corpora test data
  • Comparable corpora, 200 most frequent English
    names in 7-day window
  • For Arabic, Hindi, and Korean, pick all words in
    the same 7-day window. (Korean picked from a
    one-month window.)
  • Chinese candidates generated as before.

48
Example results
49
Results
50
Results on cross-lingual training (CoreMRR)
51
Arabic, trained on Arabic (CoreMRR 0.72)
52
Arabic, trained on Hindi (CoreMRR 0.62)
53
What do we learn about Arabic Script-based
Languages?
  • Maybe not too much so far, since the techniques
    are blind to the properties of the script
  • But can one show that there's a measurable
    difference in transliteration accuracy due to the
    defective nature of the writing system for
    ASBLs?

54
JHU CLSP 2008 Workshop??
  • Proposal for workshop on transliteration was
    accepted at the WS2007 [sic] planning meeting in
    early November
  • Plan will be to produce the definitive
    public-domain transliteration toolkit

55
Proposed toolkit
  • Should allow one to develop orthographic/phonetic
    models using minimal training data. (I.e., one
    should not assume a large, or indeed any, seed
    lexicon).
  • A part of this will be a system that can guess a
    ballpark pronunciation for any word in any
    language based on the UTF-8 codepoints in the
    word.
  • Provide a suite of methods to compare
    time-frequency distributions of terms across
    comparable corpora
  • Use document cooccurrence to boost confidence in
    transliteration candidates
  • Be able to leverage lexical resources when
    available.
  • Provide a set of methods whereby the above
    methods can be combined and improved through
    iterative retraining.
  • Provide high quality transliterators for 10
    languages: Arabic, Chinese, Urdu, Farsi, Russian,
    Hindi, Macedonian, Thai, Korean, Bengali.
  • Provide a suite of visualization tools that allow
    one to track names in comparable corpora from two
    or more languages.

56
UTF-8 pronouncer (work with Kyoung-young Kim)
Armenian_unicode.txt  Greek_unicode.txt  Limbu_unicode.txt  Tagbanwa_unicode.txt
Bengali_unicode.txt  Gujarati_unicode.txt  Malayalam_unicode.txt  Tamil_unicode.txt
Buhid_unicode.txt  Gurmukhi_unicode.txt  Mongolian_unicode.txt  Telugu_unicode.txt
Cherokee_unicode.txt  Hanunoo_unicode.txt  Myanmar_unicode.txt  Thaana_unicode.txt
Coptic_unicode.txt  Hebrew_unicode.txt  Oriya_unicode.txt  Thai_unicode.txt
Cyrillic_unicode.txt  Japanese_unicode.txt  Sinhala_unicode.txt  Tibetan_unicode.txt
Devanagari_unicode.txt  Kannada_unicode.txt  SylotiNagri_unicode.txt  Tifinagh_unicode.txt
Ethiopic_unicode.txt  Korean_unicode.txt  Syriac_unicode.txt  UnifiedCanadianAboriginalSyllabics_unicode.txt
Georgian_unicode.txt  Lao_unicode.txt  Tagalog_unicode.txt  Yi_unicode.txt
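
How the actual pronouncer works is not described in the slides. As a loose illustration of guessing a ballpark pronunciation from codepoints alone, the sketch below leans on the official Unicode character names available through Python's `unicodedata` module; taking the first letter of each name is a deliberately crude heuristic assumed here, not the authors' method.

```python
import unicodedata


def letter_names(word):
    """Last word of each character's Unicode name, lowercased."""
    names = []
    for ch in word:
        name = unicodedata.name(ch, "")
        if name:
            names.append(name.split()[-1].lower())
    return names


def rough_pronunciation(word):
    """Crude guess: first letter of each character's Unicode letter name."""
    return "".join(name[0] for name in letter_names(word))


# Arabic-script spelling of Kabul: KAF, ALEF, BEH, LAM.
kabul = "\u0643\u0627\u0628\u0644"
print(letter_names(kabul))         # ['kaf', 'alef', 'beh', 'lam']
print(rough_pronunciation(kabul))  # kabl
```

The output "kabl" also shows the vowel problem raised earlier in the talk: the short vowel of the second syllable is simply not written, so no codepoint-based guess can recover it without further knowledge.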

58
Acknowledgments
  • National Security Agency Contract NBCHC040176,
    REFLEX
  • Google Research Grant
  • Team members: ChengXiang Zhai, Su-youn Yoon,
    Andrew Fister, Tao Tao, Kyoung-young Kim, Xiao
    Hu, Xuanhui Wang, Alex Kotov, Dan Roth, Alex
    Klementiev