Title: Cross-Lingual Named Entity Retrieval
1. Cross-Lingual Named Entity Retrieval
- ChengXiang Zhai, Tao Tao
- Department of Computer Science
- Univ. of Illinois at Urbana-Champaign
- Jan. 4, 2005
2. Outline
- Problem definition (unsupervised learning from comparable corpora)
- General ideas (freq. correlation, iterative feedback, mixture models)
- Preliminary results (lexical retrieval, transliteration)
3. Problem Definition
- The general problem: unsupervised learning from comparable corpora
  - Given a set of comparable corpora (e.g., news articles published on the same day)
  - Assuming no additional resources (e.g., no bilingual dictionary)
  - How do we perform
    - Document alignment
    - Entity extraction
    - Transliteration
- Different from most existing work in the emphasis on completely unsupervised learning and robustness of methods
4. A More Specific Problem: Cross-Lingual Named Entity Retrieval
- Given a name in English (e.g., Bush), how do we find its translation(s) in Chinese?
- More generally, given a word/phrase in one language, how do we exploit comparable corpora to find related words/phrases in another language?
- Challenges
  - No additional resources to leverage (completely unsupervised)
  - Need to figure out word/phrase boundaries in languages such as Chinese
5. Our Basic Idea
- Exploit the correlation between words in different languages that are about the same topic
- Observations
  - When a major event happens (e.g., the recent sea surge disaster), it is very likely covered by news articles in multiple languages
  - Each event/topic tends to have its own associated vocabulary (e.g., names such as Sri Lanka and India may occur in recent news articles)
  - We are thus likely to see the frequency of a name such as Sri Lanka peak recently compared with other time periods, and the pattern is likely the same across languages
6. An Example of Frequency Correlation ("swimming")
7. Unsupervised Cross-Lingual Lexical Retrieval
- Represent each lexical unit with a frequency distribution over the dates
- Given a lexical unit X in language A, compute the similarity between its freq. distribution and that of any unit Y in language B
- Return the top-ranked Ys in language B as possibly related units in B for X in A
- The top-ranked Ys may suggest transliterations of X in B if X is a name in A
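The three steps above can be sketched in a few lines. This is an illustration, not the talk's implementation: the function names are made up, and plain cosine similarity stands in for whichever scoring formula is used (slide 8 mentions BM25).

```python
import math

def cosine(p, q):
    """Cosine similarity between two per-day frequency vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def retrieve(query_freq, candidate_freqs, k=10):
    """Rank candidate units in language B by similarity of their
    date-frequency distribution to that of the query unit in A."""
    scored = [(unit, cosine(query_freq, freq))
              for unit, freq in candidate_freqs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Given the query's per-day counts and a dictionary mapping each candidate unit to its counts, `retrieve` returns the top-k (unit, score) pairs.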
8. Top Words for "Swimming" (from both English and Chinese Articles)
1. swimming 23.6389
2. ? 22.5897
3. Swimming 22.497
4. Russia 21.9056
5. record 21.7685
6. Medal 21.6855
7. won 21.6112
8. ? 21.5027
9. ? 21.3815
10. Hungary 21.2824
11. third 21.1592
12. took 21.1154
13. meters 21.087
14. Korea 21.0268
15. Women's 21.0154
16. She 21.0062
17. ? 20.9737
18. ? 20.9366
19. Japan 20.9341
20. Japan, 20.9245
21. half 20.8783
22. medals 20.8348
(The "?" entries are Chinese terms; they include the correct Chinese translation of "swimming" and other correct translations.)
- A standard retrieval formula (BM25) is used
9. The Choice of Lexical Units
- The frequency distribution clearly depends on the choice of lexical units
- The method can be expected to work well for some unigrams
- For many names, we will have to consider n-grams of Chinese characters
10. Another Example ("Bush") Showing the Need for N-grams
- "Bu": correlation 0.286
- "Shi": correlation 0.386
- "Bu-Shi": correlation 0.448
11. However, even if we consider n-grams, the method would still work well in only very special cases
- Two additional questions
  - How can we make use of such partially correct results?
  - How do we further improve the results?
12. Exploiting Cross-Lingual Lexical Retrieval
- As additional bias/evidence for transliteration
  - E.g., take the top-k candidates from any preliminary transliteration results and rerank them
- As a basis for document alignment
  - Define the similarity between two articles in different languages based on the similarities between the freq. distributions of the words in each of them
  - Align a document in language A with the top-ranked document in language B
  - A possible similarity function
13. An Iterative Algorithm for Mapping Lexical Units and Aligning Articles
- Start with the most reliable mappings of lexical units
- Align articles based on these reliable mappings
- Re-compute the frequency distributions based on the aligned articles and generate a new generation of mappings
- Re-align the articles using the new mappings
14. How Do We Do All This in a Principled Way?
15. A Coordinated Mixture Model
(Diagram: comparable articles grouped by date — Day 1, Day 2, ..., Day n)
16. Details of the Mixture Model
(Diagram: the coordinated mixture model, with components for lexical translation and document alignment)
17. Results from a Similar Mixture Model for News Article Comparison
18. Preliminary Experiments: Lexical Retrieval
- Data set: Chinese-English comparable corpora
  - About 150 days; 87,000 English articles, 35,000 Chinese articles
- Task: given an English word, retrieve the top-k most correlated Chinese n-grams
- Research questions
  - How to efficiently compute the frequency of n-grams?
  - How to represent a lexical unit with a freq. distribution?
  - How to measure lexical similarity?
19. Efficient N-gram Freq. Counting
- Index: inverted index built with the Lemur toolkit (http://www-2.cs.cmu.edu/lemur)
  - Quickly access the documents containing a word
  - Quickly access word position information
- Single-word frequency vector: the Lemur toolkit provides this function
- N-gram frequency vector: merge the single-word position vectors. For example:
  - t1 in d1: positions 1, 3, 9, 34
  - t2 in d1: positions 2, 7, 10, 18, 56
  - "t1 t2" in d1: positions 1, 9
  - "t2 t1" in d1: position 2
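The position-list merge above can be implemented directly (a minimal sketch; the function name is illustrative): an adjacent bigram "first second" occurs at every position p of the first term such that p + 1 is a position of the second term.

```python
def ngram_positions(pos_first, pos_second):
    """Merge two single-word position lists into the position list of
    the adjacent bigram "first second": keep each position p of the
    first term for which p + 1 is a position of the second term."""
    second = set(pos_second)
    return [p for p in pos_first if p + 1 in second]
```

On the slide's example this reproduces the listed results: t1 at {1, 3, 9, 34} and t2 at {2, 7, 10, 18, 56} give "t1 t2" at {1, 9} and "t2 t1" at {2}. Longer n-grams follow by merging repeatedly.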
20. Frequency Vector Normalization

day:              1  2  3  4  5  ... 156
English term E:   2  0  4  7  5  ... 8
Chinese n-gram C: 1  2  4  5  6  ... 2

- TF normalization (a count change 0 -> 1 is more important than 100 -> 101)
  1) TF -> log(1 + TF), e.g., 4 -> log 5
  2) Okapi
- Normalize into a probability distribution: divide by the sum of all counts
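The two normalizations above (excluding Okapi) are one-liners; a sketch with illustrative names:

```python
import math

def log_tf(counts):
    """Dampen raw counts: TF -> log(1 + TF), so the change 0 -> 1
    matters more than 100 -> 101 (e.g., 4 -> log 5)."""
    return [math.log(1 + c) for c in counts]

def to_distribution(counts):
    """Turn a per-day count vector into a probability distribution
    by dividing each count by the sum of all counts."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)
```

Either transform (or both, log first, then normalize) can be applied before computing the similarities on the next slide.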
21. Similarity Measures

X = (x1 = 0.02, x2 = 0.00, x3 = 0.04, x4 = 0.07)
Y = (y1 = 0.01, y2 = 0.02, y3 = 0.00, y4 = 0.05)

- Mutual information: only considers zero vs. non-zero
- Cosine distance
- JS divergence: JS = H(p1 P(X) + p2 P(Y)) - p1 H(P(X)) - p2 H(P(Y))
- Dynamic Time Warping
  - Considers time shifting (with a constraint on how far the shift may go)
  - Similar to edit distance (dynamic programming)
22. Comparison ("bush")

Cosine dist.:
- ?? 0.835038
- ?? 0.774567
- ?? 0.769585
- ?? 0.764102
- ?? 0.76317
- ?? 0.76276
- ?? 0.758067
- ?? 0.753101
- ?? 0.745806
- ?? 0.743761

JS divergence:
- ?? 0.0635597
- ?? 0.0658826
- ?? 0.0681127
- ?? 0.0689964
- ?? 0.0717511
- ?? 0.0729297
- ?? 0.0748018
- ?? 0.078494
- ...
- ?? 0.097773 (rank 21)

Mutual info.:
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.0827685 (rank 129)
23. Another Example ("blair")

Cosine distance:
- ?? 0.646626
- ?? 0.641913
- ?? 0.638979
- ?? 0.629399
- ?? 0.623625
- ?? 0.621927
- ???
24. Transliteration Prior
- Intuition: the retrieval score can be used as the prior probability for transliteration
- Examples

palestine:
???? 0.761795, ??? 0.443768, ??? 0.353061, ??? 0.310424, ??? 0.263955, ??? 0.259201, ??? 0.227172, ??? 0.202834, ???? 0.199544, ??? 0.196019, ??? 0.10867, ??? 0.106474

kazakhstan:
???? 0.62962, ??? 0.487948, ??? 0.231453, ???? 0.147453, ??? 0.139291, ??? 0.136079, ??? 0.0783344, ??? 0.0760304, ??? 0.0760304

colombia:
????? 0.797614, ??? 0.403872, ??? 0.316003, ??? 0.175327, ???? 0.138001, ??? 0.131324, ??? 0.126798, ??? 0.0426488
25. Next Steps
- Further improve the similarity measures
- Apply lexical retrieval to transliteration
- Apply lexical retrieval to alignment
- Explore the iterative feedback strategy
- Explore mixture models
26. The End