Title: GeoName: a system for back-transliterating pinyin place names
1 GeoName: a system for back-transliterating pinyin place names
Kui-Lam Kwok, Qiang Deng
Computer Science Dept., Queens College, City University of New York
email: kwok_at_ir.cs.qc.edu
email: peterqc_at_yahoo.com
2 Or: issues involving cross-language referencing of a Chinese place by name
3 Content
1. Back-transliteration problem
2. GeoName system - a proposed approach
3. Evaluation
4. Observation/conclusion
4 Transliteration
- alphabet mismatch when expressing Chinese (place) names in English texts
- names represented by PRC Pinyin code
- e.g. Beijing, Shenzhen
5 Back-Transliteration: given the Pinyin code, what are the original Chinese characters?
6 Back-Transliteration
- Why are Chinese characters needed?
- remove ambiguity of the referenced Pinyin place
- reconcile names in English and Chinese texts
- may assist alignment in E/C parallel texts
- necessary for E-C cross-language IR (when translating English queries containing Pinyin place, person, organization names)
7 Four Possible Ambiguities in English-Chinese cross-language place name references
8 Ambiguity 3: Back-transliteration
- --> which character string is correct?
- e.g.
- China's capital in Chinese: 北京
- PRC Pinyin (1 char, 1 syllable): 北 --> bei, 京 --> jing
- map back from Pinyin to characters:
- bei --> ?,?,?,?,?,?,?,?, (total 23)
- jing --> ?,?,?,?,?,?,?,?, (total 20)
- ambiguous candidates: ??, ??, ??, ??
- which one?
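
To make the size of the candidate space concrete, here is a minimal Python sketch of the "map back from Pinyin to characters" step. The table `SYLLABLE_TO_CHARS` and the function `candidates` are hypothetical names, and the table holds only four illustrative characters per syllable rather than the 23 and 20 the slide reports.

```python
from itertools import product

# Toy syllable-to-character table (illustrative subset only; the slide
# reports 23 characters for "bei" and 20 for "jing").
SYLLABLE_TO_CHARS = {
    "bei": ["北", "贝", "背", "被"],
    "jing": ["京", "经", "精", "井"],
}

def candidates(syllables):
    """Return every character string whose per-character readings match."""
    per_syllable = [SYLLABLE_TO_CHARS[s] for s in syllables]
    return ["".join(chars) for chars in product(*per_syllable)]

cands = candidates(["bei", "jing"])
print(len(cands))       # one string per combination of the toy entries
print("北京" in cands)  # the intended name is only one of the candidates
```

With the full tables the same product would yield 23 x 20 strings for this two-syllable name, which is why a ranking step is needed.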
9 Ambiguity 4: Name Reference --> same name, different places
Suppose the result of back-transliteration is beijing --> 北京; then which 北京? (longitude, latitude)
10 Ambiguity 1: E/C Pinyin Systems --> which Pinyin system was used?
e.g. Hong Kong in characters: 香港
PRC Pinyin: 香 -> xiang, 港 -> gang
Wade-Giles: 香 -> hsiang, 港 -> kang
Hong Kong: 香 -> hong, 港 -> kong
hong kong back-transliterated using PRC Pinyin:
hong -> ???????????? (19)
kong -> ??????? (7)
Original 香港 is NOT one of these 7x19 combinations!
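
As a rough illustration of why the romanization system matters, the sketch below lists the PRC Pinyin syllables a romanized syllable could stand for when both systems are allowed. The table `WG_TO_PINYIN`, the set `PINYIN_SYLLABLES` and the function `readings` are hypothetical and cover only the syllables of the Hong Kong example, plus the fact that "kang" is also a valid PRC Pinyin syllable.

```python
# Hypothetical, tiny Wade-Giles -> PRC Pinyin table (entries taken from
# the Hong Kong example above; a real table would cover all syllables).
WG_TO_PINYIN = {"hsiang": "xiang", "kang": "gang"}

# Toy inventory of syllables that are valid as PRC Pinyin.
PINYIN_SYLLABLES = {"xiang", "gang", "kang", "hong", "kong"}

def readings(syllable):
    """Return the PRC Pinyin syllables a romanized syllable could stand for."""
    s = syllable.lower()
    out = []
    if s in PINYIN_SYLLABLES:
        out.append(s)                # read as PRC Pinyin, unchanged
    if s in WG_TO_PINYIN:
        out.append(WG_TO_PINYIN[s])  # read as Wade-Giles, converted
    return out

print(readings("hsiang"))  # only a Wade-Giles reading exists
print(readings("kang"))    # ambiguous: valid in both systems
print(readings("hong"))    # valid Pinyin, but not the syllable of 香
```

The last call mirrors the slide's point: "hong" and "kong" are well-formed Pinyin syllables, so a system that assumes PRC Pinyin will happily back-transliterate them to the wrong characters.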
11 Ambiguity 2: Syllable Segmentation
- which segmentation is correct?
- e.g. 秦皇岛 - possible pinyin writing styles
- Qin Huang Dao
- QinHuangDao
- Qinhuangdao <-- most common, used in NYT
- --> how many syllables?
- Qin huang dao: 3 char
- Qin huang da o: 4 char
- Qin hu ang dao: 4 char
- Qin hu ang da o: 5 char
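
The four segmentations listed above can be enumerated mechanically. The following is a minimal sketch, assuming a toy syllable inventory (`SYLLABLES`) that contains only the syllables needed for this example; the function name `segmentations` is hypothetical, and a real system would use the full Pinyin syllable table.

```python
# Toy syllable inventory: just enough to reproduce the example above.
SYLLABLES = {"qin", "huang", "dao", "hu", "ang", "da", "o"}

def segmentations(name, syllables=SYLLABLES):
    """Enumerate every way to split `name` into known syllables."""
    name = name.lower()
    if not name:
        return [[]]          # one way to segment the empty string
    results = []
    for end in range(1, len(name) + 1):
        head = name[:end]
        if head in syllables:
            for rest in segmentations(name[end:], syllables):
                results.append([head] + rest)
    return results

# Fewest syllables first, so the 3-character reading leads the list.
segs = sorted(segmentations("Qinhuangdao"), key=len)
for seg in segs:
    print(len(seg), " ".join(seg))
```

Sorting by length puts the three-syllable reading ahead of the four- and five-syllable ones.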
12 Summary: given a Pinyin geographic name
- 1. Pinyin system -- which?
- 2. segmentation -- how many syllables?
- 3. back-transliterate -- which candidate character string?
- 4. resolve same name, different places.
13 GeoName: a system for back-transliterating Pinyin place names
14 GeoName: E-C cross-language place reference
- 1. which Pinyin system?
- -- user chooses, or allow both PY and WG
- 2. how many segmented syllables?
- -- fewest syllables ranked first
- 3. back-transliterate: which candidate?
- -- a) bi-list, b) confirm by web/Chinese place list, c) rank candidates by frequency
- 4. resolve same name, different places
- -- not considered
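15 GeoName
Given an English Pinyin place name $E = e_1 e_2 \ldots e_n$ ($n$ syllables), there are many possible Chinese character string candidates $C = c_1 c_2 \ldots c_n$.
$$\arg\max_C P(C \mid E) = \arg\max_C P(E \mid C)\,P(C) = \arg\max_C P(C)$$
by assuming
$$P(E \mid C) = \prod_i P(e_i \mid C) \quad \text{(i.e. } e_i, e_k \text{ independent)}$$
$$= \prod_i P(e_i \mid c_i) \quad \text{(i.e. } e_i, c_k \text{ independent)}$$
$$= 1 \quad \text{(i.e. all } c_i \text{ map to unique } e_i\text{)}$$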
16 GeoName
P(C): language model of Chinese place names
(training data obtained by processing the TREC and NTCIR Chinese collections with BBN IdentiFinder: approximately 80K unique place names)
Use P(C) to sort candidates; fewest syllables ranked earlier
(bigram model P(c2|c1)P(c3|c2).. not too good)
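
For reference, a bigram model of the kind mentioned above can be sketched in a few lines of Python. This is a maximum-likelihood estimate without smoothing over a hypothetical five-name training list (`TRAINING_NAMES`); the leading factor P(c1) and the function name `bigram_prob` are assumptions of this sketch, not details given on the slide.

```python
from collections import Counter

# Hypothetical training list; the slides use about 80K extracted names.
TRAINING_NAMES = ["北京", "北京", "南京", "北海", "上海"]

# How often each character starts a name, precedes another character,
# and how often each adjacent pair occurs.
first = Counter(name[0] for name in TRAINING_NAMES)
history = Counter(ch for name in TRAINING_NAMES for ch in name[:-1])
bigram = Counter(name[i:i + 2] for name in TRAINING_NAMES
                 for i in range(len(name) - 1))

def bigram_prob(candidate):
    """P(c1) * P(c2|c1) * P(c3|c2) ... by maximum likelihood, unsmoothed."""
    p = first[candidate[0]] / len(TRAINING_NAMES)
    for i in range(len(candidate) - 1):
        pair = candidate[i:i + 2]
        if bigram[pair] == 0:
            return 0.0           # unseen pair: probability collapses to 0
        p *= bigram[pair] / history[pair[0]]
    return p

for cand in ["北京", "北海", "背景"]:
    print(cand, bigram_prob(cand))
```

Any candidate containing an unseen character pair scores zero in this unsmoothed form, so many candidates end up tied.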
17 GeoName: a heuristic weighting formula based on whole-string, bigram and character frequencies
$$g(C) = a_1 \log f(C)^{a_1} + \sum a_2 \log f(c_i c_j)^{a_2} + \sum a_3 \log f(c_i)^{a_3}$$
- a factor is ignored if f(.) = 0
- $a_1 > a_2 > a_3$
- $a_1 \log f(C)^{a_1}$ --> a string seen before is probably correct
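
The weighting can be tried out on toy counts. The sketch below reads the formula as g(C) = a1 log f(C)^a1 + sum of a2 log f(ci cj)^a2 + sum of a3 log f(ci)^a3, which is one interpretation of the slide. The weight values, the restriction of bigrams to adjacent characters, and the frequency tables are all assumptions made for illustration.

```python
import math

# Hypothetical weights satisfying a1 > a2 > a3; the slide gives no values.
A1, A2, A3 = 3.0, 2.0, 1.0

def term(a, freq):
    """a * log(freq ** a); contributes nothing when the frequency is 0."""
    return a * math.log(freq ** a) if freq > 0 else 0.0

def g(candidate, f_string, f_bigram, f_char):
    """Heuristic weight from whole-string, bigram and character counts."""
    score = term(A1, f_string.get(candidate, 0))
    for i in range(len(candidate) - 1):      # adjacent pairs (assumed)
        score += term(A2, f_bigram.get(candidate[i:i + 2], 0))
    for ch in candidate:
        score += term(A3, f_char.get(ch, 0))
    return score

# Toy frequency tables (made-up counts, for illustration only).
f_string = {"北京": 50}
f_bigram = {"北京": 60}
f_char = {"北": 100, "京": 80, "背": 30, "景": 40}

seen = g("北京", f_string, f_bigram, f_char)
unseen = g("背景", f_string, f_bigram, f_char)
print(seen > unseen)   # the string seen before as a place name wins
```

Because a1 is the largest weight, a candidate that has been seen as a whole string tends to dominate candidates supported only by character counts.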
18 Evaluation: use the frequency formula only, on 162 Pinyin city names from a bilingual map (no bilingual pair lists were employed)
20 Examples of Correct Names ranked 1
- Daqiu (??), Wanbi (??), Gongzhuling (???), ..
Examples of Failed Names
- Non-Pinyin:
- Qarqi (???), Yengisar (??), Jorra (??), Dongkar (??), ..
- mainly longer names:
- Tuolu (??), Fenglingguan (???), Qingguandu (???), Dating (??), Shasonggang (???), Denglonghe (???), ..
21 GeoName: further improvement
- Hypothesis: prefer candidate strings that have been seen before as location names
- confirm candidates on:
- a bilingual list (4K): tag 100
- ftp://ftpserver.ciesin.columbia.edu/pub/data/China/CITAS/gb_code/
- a Chinese monolingual place name list (80K+4K): tag 010
- web data via Google search: tag 001
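
The three confirmation sources combine naturally into a 3-bit tag. The sketch below shows one way to compute it, assuming the lists are held as Python sets and that web confirmation is available as a hit-count callable. The names `tag` and `web_hits` and the toy data are hypothetical, and no real search API is called.

```python
def tag(candidate, bilingual, monolingual, web_hits):
    """Build the 3-bit tag: bilingual list, monolingual list, web."""
    bits = [
        candidate in bilingual,        # 100
        candidate in monolingual,      # 010
        web_hits(candidate) > 0,       # 001
    ]
    return "".join("1" if b else "0" for b in bits)

# Toy confirmation sources (made-up contents).
bilingual = {"北京"}
monolingual = {"北京", "北海"}
fake_hits = {"北京": 1000, "背景": 5}   # stand-in for web search counts

def web_hits(candidate):
    return fake_hits.get(candidate, 0)

for cand in ["北京", "北海", "背景", "倍井"]:
    print(cand, tag(cand, bilingual, monolingual, web_hits))
```

A candidate found nowhere keeps tag 000 and is ranked by its frequency weight alone.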
22 GeoName flowchart
1. Pinyin place name input; user indicates PRC or WG system.
2. Pinyin segmentation; map to all possible GB character strings. tag 000
3. Bilingual table (4k) lookup. tag 100
4. Merge GB candidates.
5. Monolingual name list (84k) confirmation. tag 110, 010
6. WWW confirmation. tag 101, 001, 111, 011
7. Evaluate weight g(C); rank according to (1) tag, (2) name character length, (3) g(C).
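
The final ranking step amounts to a three-part sort key. The sketch below assumes that tags are compared as binary numbers (so a bilingual-list hit outranks a monolingual-list hit, which outranks a web hit), that shorter names come first, and that a larger g(C) is better. The candidate strings are placeholders, and the first three weights are borrowed from the example listings later in the deck.

```python
def rank(cands):
    """Sort by (1) tag, high to low; (2) length, short first; (3) g, high first."""
    return sorted(
        cands,
        key=lambda c: (-int(c["tag"], 2), len(c["chars"]), -c["g"]),
    )

# Placeholder candidates: "chars" stands in for a Chinese character string.
cands = [
    {"chars": "XXX", "tag": "000", "g": 15.68423107},
    {"chars": "YYY", "tag": "001", "g": 1.38629436},
    {"chars": "WWW", "tag": "000", "g": 9.24647942},
    {"chars": "ZZZZ", "tag": "000", "g": 20.0},
]

ranked = rank(cands)
for c in ranked:
    print(c["tag"], c["g"], c["chars"])
```

Note that the web-confirmed candidate is ranked first despite its low weight, since the tag is compared before g(C).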
23 GeoName Evaluation: evaluate system result using
- tag000: rank by g(C)
- tag001: web confirmation + g(C)
- tag010: mono-list confirmation + g(C)
- tag111: bi-list + all above
25 Example of back-transliteration: web vs. no-web

Tag 111 (with web confirmation): Chagugang
tag  g(C)         candidate
001  1.38629436   ???
000  15.68423107  ???
000  9.24647942   ???
000  9.24647942   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???

Tag 110 (without web confirmation): Chagugang
tag  g(C)         candidate
000  15.68423107  ???
000  9.24647942   ???
000  9.24647942   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
26 Examples

Luliangqu (district/region)
tag  g(C)         candidate
010  40.02587171  ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???

Xiaoyishi (city)
tag  g(C)         candidate
110  40.18588115  ???
000  9.24647942   ???
000  9.24647942   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???
000  8.55333224   ???

Yimaxiang (village)
tag  g(C)         candidate
000  15.68423107  ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???
000  9.24647942   ???

Mengnanzhuang (place)
tag  g(C)         candidate
000  14.95494484  ???
000  8.51719319   ???
000  8.51719319   ???
000  8.51719319   ???
000  8.51719319   ???
000  7.82404601   ???
000  7.82404601   ???
000  7.82404601   ???
000  7.82404601   ???
000  7.82404601   ???
27 Conclusion
- reasonable back-transliteration results for map cities
- longer names (>2 char): more errors
- non-pinyin names: does not work
Future Work
- increase training data
- improve ranking function
- direct translation (not just confirmation) using web
- better/more realistic evaluation
28 If interested, can demonstrate GeoName (needs Linux re-boot).
Try GeoName at http://post.cs.qc.edu/spell2gb/ (needs Chinese character display).
Feedback appreciated.
Thank You!