Title: Cross-Lingual Named Entity Retrieval
1. Cross-Lingual Named Entity Retrieval
- ChengXiang Zhai, Tao Tao
- Department of Computer Science
- Univ. of Illinois at Urbana-Champaign
- Jan. 4, 2005
2. Outline
- Problem definition (unsupervised learning from comparable corpora)
- General ideas (freq. correlation, iterative feedback, mixture models)
- Preliminary results (lexical retrieval, transliteration)
3. Problem Definition
- The general problem: unsupervised learning from comparable corpora
  - Given a set of comparable corpora (e.g., news articles published on the same day)
  - Assuming no additional resources (e.g., no bilingual dictionary)
  - How do we perform
    - Document alignment
    - Entity extraction
    - Transliteration
- Different from most existing work in the emphasis on completely unsupervised learning and robustness of methods
4. A More Specific Problem: Cross-Lingual Named Entity Retrieval
- Given a name in English (e.g., Bush), how do we find its translation(s) in Chinese?
- More generally, given a word/phrase in one language, how do we exploit comparable corpora to find related words/phrases in another language?
- Challenges
  - No additional resources to leverage (completely unsupervised)
  - Need to figure out word/phrase boundaries in languages such as Chinese
5. Our Basic Idea
- Exploit the correlation between words in different languages that are about the same topic
- Observations
  - When a major event happens (e.g., the recent sea surge disaster), it is very likely covered by news articles in multiple languages
  - Each event/topic tends to have its own associated vocabulary (e.g., names such as Sri Lanka and India may occur in recent news articles)
  - We are thus likely to see the frequency of a name such as Sri Lanka peak recently compared with other time periods, and the pattern is likely the same across languages
6. An Example of Frequency Correlation ("swimming")
7. Unsupervised Cross-Lingual Lexical Retrieval
- Represent each lexical unit with a frequency distribution over the dates
- Given a lexical unit X in language A, compute the similarity between its freq. distribution and that of any unit Y in language B
- Return the top-ranked Ys in language B as possibly related units in B for X in A
- The top-ranked Ys may suggest transliterations of X in B if X is a name in A
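The three steps above can be sketched in a few lines. This is an illustration, not the talk's implementation: the function names are made up, and plain cosine similarity stands in for whichever scoring formula is used (slide 8 mentions BM25).

```python
import math

def cosine(p, q):
    """Cosine similarity between two per-day frequency vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def retrieve(query_freq, candidate_freqs, k=10):
    """Rank candidate units in language B by similarity of their
    date-frequency distribution to that of the query unit in A."""
    scored = [(unit, cosine(query_freq, freq))
              for unit, freq in candidate_freqs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Given the query's per-day counts and a dictionary mapping each candidate unit to its counts, `retrieve` returns the top-k (unit, score) pairs.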
8. Top Words for "Swimming" (from both English and Chinese Articles)
1. swimming 23.6389
2. ? 22.5897
3. Swimming 22.497
4. Russia 21.9056
5. record 21.7685
6. Medal 21.6855
7. won 21.6112
8. ? 21.5027
9. ? 21.3815
10. Hungary 21.2824
11. third 21.1592
12. took 21.1154
13. meters 21.087
14. Korea 21.0268
15. Women's 21.0154
16. She 21.0062
17. ? 20.9737
18. ? 20.9366
19. Japan 20.9341
20. Japan, 20.9245
21. half 20.8783
22. medals 20.8348
(The "?" entries are Chinese terms; they include the correct Chinese translation of "swimming" and other correct translations.)
- A standard retrieval formula (BM25) is used
9. The Choice of Lexical Units
- The frequency distribution clearly depends on the choice of lexical units
- The method can be expected to work well for some unigrams
- For many names, we will have to consider n-grams of Chinese characters
10. Another Example ("Bush") Showing the Need for N-grams
- "Bu": correlation 0.286
- "Shi": correlation 0.386
- "Bu-Shi": correlation 0.448
11. However, even if we consider n-grams, the method would still work well in only very special cases
- Two additional questions
  - How can we make use of such partially correct results?
  - How do we further improve the results?
12. Exploiting Cross-Lingual Lexical Retrieval
- As additional bias/evidence for transliteration
  - E.g., take the top-k candidates from any preliminary transliteration results and rerank them
- As a basis for document alignment
  - Define the similarity between two articles in different languages based on the similarities between the freq. distributions of the words in each of them
  - Align a document in language A with the top-ranked document in language B
  - A possible similarity function
13. An Iterative Algorithm for Mapping Lexical Units and Aligning Articles
- Start with the most reliable mappings of lexical units
- Align articles based on these reliable mappings
- Re-compute the frequency distributions based on the aligned articles and generate a new generation of mappings
- Re-align the articles using the new mappings
14. How Do We Do All This in a Principled Way?
15. A Coordinated Mixture Model
(Diagram: comparable articles grouped by date — Day 1, Day 2, ..., Day n)
16. Details of the Mixture Model
(Diagram: the coordinated mixture model, with components for lexical translation and document alignment)
17. Results from a Similar Mixture Model for News Article Comparison
18. Preliminary Experiments: Lexical Retrieval
- Data set: Chinese-English comparable corpora
  - About 150 days; 87,000 English articles, 35,000 Chinese articles
- Task: given an English word, retrieve the top-k most correlated Chinese n-grams
- Research questions
  - How to efficiently compute the frequency of n-grams?
  - How to represent a lexical unit with a freq. distribution?
  - How to measure lexical similarity?
19. Efficient N-gram Freq. Counting
- Index: inverted index built with the Lemur toolkit (http://www-2.cs.cmu.edu/lemur)
  - Quickly access the documents containing a word
  - Quickly access word position information
- Single-word frequency vector: the Lemur toolkit provides this function
- N-gram frequency vector: merge the single-word position vectors. For example:
  - t1 in d1: positions 1, 3, 9, 34
  - t2 in d1: positions 2, 7, 10, 18, 56
  - "t1 t2" in d1: positions 1, 9
  - "t2 t1" in d1: position 2
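The position-list merge above can be implemented directly (a minimal sketch; the function name is illustrative): an adjacent bigram "first second" occurs at every position p of the first term such that p + 1 is a position of the second term.

```python
def ngram_positions(pos_first, pos_second):
    """Merge two single-word position lists into the position list of
    the adjacent bigram "first second": keep each position p of the
    first term for which p + 1 is a position of the second term."""
    second = set(pos_second)
    return [p for p in pos_first if p + 1 in second]
```

On the slide's example this reproduces the listed results: t1 at {1, 3, 9, 34} and t2 at {2, 7, 10, 18, 56} give "t1 t2" at {1, 9} and "t2 t1" at {2}. Longer n-grams follow by merging repeatedly.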
20. Frequency Vector Normalization

day:              1  2  3  4  5  ... 156
English term E:   2  0  4  7  5  ... 8
Chinese n-gram C: 1  2  4  5  6  ... 2

- TF normalization (a count change 0 -> 1 is more important than 100 -> 101)
  1) TF -> log(1 + TF), e.g., 4 -> log 5
  2) Okapi
- Normalize into a probability distribution: divide by the sum of all counts
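The two normalizations above (excluding Okapi) are one-liners; a sketch with illustrative names:

```python
import math

def log_tf(counts):
    """Dampen raw counts: TF -> log(1 + TF), so the change 0 -> 1
    matters more than 100 -> 101 (e.g., 4 -> log 5)."""
    return [math.log(1 + c) for c in counts]

def to_distribution(counts):
    """Turn a per-day count vector into a probability distribution
    by dividing each count by the sum of all counts."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)
```

Either transform (or both, log first, then normalize) can be applied before computing the similarities on the next slide.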
21. Similarity Measures

X = (x1 = 0.02, x2 = 0.00, x3 = 0.04, x4 = 0.07)
Y = (y1 = 0.01, y2 = 0.02, y3 = 0.00, y4 = 0.05)

- Mutual information: only considers zero vs. non-zero
- Cosine distance
- JS divergence: JS = H(p1 P(X) + p2 P(Y)) - p1 H(P(X)) - p2 H(P(Y))
- Dynamic Time Warping
  - Considers time shifting (with a constraint on how far the shift may go)
  - Similar to edit distance (dynamic programming)
22. Comparison ("bush")

Cosine dist.:
- ?? 0.835038
- ?? 0.774567
- ?? 0.769585
- ?? 0.764102
- ?? 0.76317
- ?? 0.76276
- ?? 0.758067
- ?? 0.753101
- ?? 0.745806
- ?? 0.743761

JS divergence:
- ?? 0.0635597
- ?? 0.0658826
- ?? 0.0681127
- ?? 0.0689964
- ?? 0.0717511
- ?? 0.0729297
- ?? 0.0748018
- ?? 0.078494
- ...
- ?? 0.097773 (rank 21)

Mutual info.:
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.154381
- ?? 0.0827685 (rank 129)
23. Another Example ("blair")

Cosine distance:
- ?? 0.646626
- ?? 0.641913
- ?? 0.638979
- ?? 0.629399
- ?? 0.623625
- ?? 0.621927
- ???
24. Transliteration Prior
- Intuition: the retrieval score can be used as the prior probability for transliteration
- Examples

palestine:
???? 0.761795, ??? 0.443768, ??? 0.353061, ??? 0.310424, ??? 0.263955, ??? 0.259201, ??? 0.227172, ??? 0.202834, ???? 0.199544, ??? 0.196019, ??? 0.10867, ??? 0.106474

kazakhstan:
???? 0.62962, ??? 0.487948, ??? 0.231453, ???? 0.147453, ??? 0.139291, ??? 0.136079, ??? 0.0783344, ??? 0.0760304, ??? 0.0760304

colombia:
????? 0.797614, ??? 0.403872, ??? 0.316003, ??? 0.175327, ???? 0.138001, ??? 0.131324, ??? 0.126798, ??? 0.0426488
25. Next Steps
- Further improve the similarity measures
- Apply lexical retrieval to transliteration
- Apply lexical retrieval to alignment
- Explore the iterative feedback strategy
- Explore mixture models
26. The End