Jianfeng Gao - PowerPoint PPT Presentation

About This Presentation
Title:

Jianfeng Gao


Slides: 36
Provided by: yan4

Transcript and Presenter's Notes



1
TREC-9 CLIR Experiments at MSRCN
  • Jianfeng Gao
  • Microsoft Research China (MSRCN)

2
People
  • Jianfeng Gao, Microsoft Research China
  • Jian-Yun Nie, Université de Montréal
  • Jian Zhang, Tsinghua University, China
  • Endong Xun, Microsoft Research China
  • Yi Su, Tsinghua University, China
  • Ming Zhou, Microsoft Research China
  • Changning Huang, Microsoft Research China

3
What is TREC?
  • A workshop series that provides the
    infrastructure for large-scale testing of text
    retrieval technology
  • Realistic test collection
  • Uniform, appropriate scoring procedures
  • A forum for the exchange of research ideas and
    for the discussion of research methodology
  • Sponsored by NIST, DARPA/ITO, ARDA

4
TREC-9 Task Tracks
  • Cross-Language Information Retrieval (CLIR)
  • Filtering
  • Interactive
  • Query
  • Question Answering
  • Spoken Document Retrieval
  • Web Track

5
TREC-9 CLIR Task
  • Given a topic in English, retrieve the top 1000
    documents ranked by similarity to the topic from
    a collection of Chinese newspaper/wire documents.

6
TREC-9 CLIR Topics
  • 25 English topics (CH55-79) created at NIST
  • Example:
  • Number: CH55
  • Title: World Trade Organization membership
  • Description: What speculations on the
    effects of the entry of China or Taiwan into the
    World Trade Organization (WTO) are being reported
    in the Asian press?
  • Narrative: Documents reporting support by
    other nations for China's or Taiwan's entry into
    the World Trade Organization (WTO) are not
    relevant.

7
TREC-9 CLIR Document Collection
  • 126,937 documents, 188 MB
  • Traditional Chinese, BIG5 encoding
  • Sources:
  • Hong Kong Commercial Daily: 11 Aug 98 - 31 Jul 99
  • Hong Kong Daily News: 1 Feb 99 - 31 Jul 99
  • Takongpao: 21 Oct 98 - 4 Mar 99

8
Participants
  • BBN Technologies
  • Fudan University
  • IBM T.J. Watson Research Center
  • Johns Hopkins University
  • Korea Advanced Institute of Science and
    Technology
  • Microsoft Research, China
  • MNIS-TextWise Labs
  • National Taiwan University

9
Participants (cont.)
  • Queens College, CUNY
  • RMIT University
  • Telcordia Technologies, Inc.
  • The Chinese University of Hong Kong
  • Trans-EZ Inc.
  • University of California at Berkeley
  • University of Maryland
  • University of Massachusetts

10
Outline
  • Introduction
  • Finding the best indexing units for Chinese IR
  • Query translation
  • Query expansion
  • Experimental results in TREC 9
  • Conclusion

11
Introduction (1)
  • First participation in TREC
  • System: a modified version of SMART
  • Pre-processing: word segmentation

12
Introduction (2)
  • Our work involves two aspects:
  • Chinese IR
  • Finding the best indexing unit
  • Query expansion, etc.
  • CLIR: query translation
  • Translation disambiguation using co-occurrence
  • Phrase detection and translation using a language
    model
  • Translation coverage enhancement using a
    translation model
  • Resources:
  • Lexicon: Chinese, bilingual (LDC, HIT, etc.)
  • Corpus: Chinese, bilingual
  • Software tools: NLPWin, IBM MT, etc.

13
Outline
  • Introduction
  • Finding the best indexing units for Chinese IR
  • Query translation
  • Query expansion
  • Experimental results in TREC 9
  • Conclusion

14
Characteristics of Chinese IR
  • Chinese language issues:
  • No standard definition of word and lexicon
  • No space between words
  • Word is the basic unit of indexing in traditional
    IR
  • In this study:
  • Basic unit of indexing in Chinese IR: word,
    n-gram, or mixed
  • Does the accuracy of word segmentation have a
    significant impact on IR performance?

15
Indexing Units for Chinese IR
  • Using n-grams:
  • No linguistic knowledge required
  • Character unigrams and bigrams are widely used
    (average length of a Chinese word is 1.6
    characters)
  • Using words:
  • Linguistic knowledge is required for word
    segmentation: dictionary, heuristic rules,
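
The n-gram option above can be illustrated with a minimal Python sketch. The function names `char_ngrams` and `index_terms` are hypothetical and are not taken from the MSRCN system; the sketch only shows how overlapping character unigrams and bigrams might be produced as index terms without any dictionary.

```python
def char_ngrams(text, n):
    """Return overlapping character n-grams of `text`, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def index_terms(text):
    """Combine character unigrams and bigrams into one list of index terms.

    Assumption: both n-gram sizes are simply pooled; the slides do not say
    how the original system combined or weighted them.
    """
    return char_ngrams(text, 1) + char_ngrams(text, 2)


if __name__ == "__main__":
    print(index_terms("信息检索"))
```

Because the bigrams overlap, a four-character string yields three bigrams, so a two-character word is always covered by at least one bigram regardless of where the word boundaries fall.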

16
Possible representations in Chinese IR

17
Experiments
  • Impact of the dictionary: longest matching with
    a small dictionary and with a large dictionary
  • Combining the first method with single characters
  • Using full segmentation
  • Using bi-grams and uni-grams (characters)
  • Combining words with bi-grams and characters
  • Unknown word detection using NLPWin
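
As a rough illustration of the longest-matching segmentation used in the first experiment, here is a minimal Python sketch. It assumes a forward (left-to-right) scan and a fallback to single characters for unmatched text; the function name `longest_match_segment`, the `max_len` parameter, and the toy dictionaries are hypothetical and do not describe the actual segmenter.

```python
def longest_match_segment(text, dictionary, max_len=4):
    """Greedy forward longest matching.

    At each position the longest dictionary word (up to `max_len`
    characters) is taken; if nothing matches, a single character is
    emitted as its own token.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # size == 1 is the fallback, so the loop always advances
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    small = {"信息"}
    large = {"信息", "检索", "信息检索", "系统"}
    print(longest_match_segment("信息检索系统", small))
    print(longest_match_segment("信息检索系统", large))
```

Running the same text through a small and a large dictionary shows why dictionary size matters: the small dictionary leaves most of the text as single characters, while the large one produces longer units.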

18
Summary of Experiments

19
Outline
  • Introduction
  • Finding the best indexing units for Chinese IR
  • Query translation
  • Query expansion
  • Experimental results in TREC 9
  • Conclusion

20
Query Translation
  • Problems of simple lexicon-based approaches:
  • Lexicon is incomplete
  • Difficult to select correct translations
  • Our improved lexicon-based approach:
  • Term disambiguation using co-occurrence
  • Phrase detection and translation using LM
  • Translation coverage enhancement using TM

21
Term disambiguation
  • Assumption: correct translation words tend to
    co-occur in the Chinese language
  • A greedy algorithm:
  • for English terms $T_e = (e_1, \ldots, e_n)$,
  • find their Chinese translations $T_c = (c_1, \ldots, c_n)$
    such that $T_c = \arg\max SIM(c_1, \ldots, c_n)$
  • Term-similarity matrix trained on a Chinese corpus
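
The slide only states that the selection is greedy, so the following Python sketch is one possible reading, not the authors' algorithm. It assumes that terms are processed in order and that each candidate is scored by its summed similarity to the translations already chosen, or to the best candidate of terms not yet decided. The names `sim_lookup` and `greedy_disambiguate` and the toy similarity table are hypothetical.

```python
def sim_lookup(sim, a, b):
    """Symmetric lookup in a term-similarity table; missing pairs score 0."""
    return sim.get((a, b), sim.get((b, a), 0.0))


def greedy_disambiguate(candidates, sim):
    """Pick one translation per source term, one term at a time.

    candidates: list of non-empty candidate lists, one per English term.
    sim: dict mapping (term_a, term_b) to a co-occurrence similarity.
    """
    chosen = [None] * len(candidates)
    for i, options in enumerate(candidates):

        def cohesion(c):
            total = 0.0
            for j, others in enumerate(candidates):
                if j == i:
                    continue
                if chosen[j] is not None:
                    # term j is already fixed: compare with its translation
                    total += sim_lookup(sim, c, chosen[j])
                else:
                    # term j is undecided: use its most similar candidate
                    total += max(sim_lookup(sim, c, o) for o in others)
            return total

        chosen[i] = max(options, key=cohesion)
    return chosen


if __name__ == "__main__":
    # Toy candidates for the English terms "bank" and "interest"
    candidates = [["银行", "河岸"], ["利率", "兴趣"]]
    sim = {("银行", "利率"): 0.9, ("河岸", "兴趣"): 0.1}
    print(greedy_disambiguate(candidates, sim))
```

A greedy pass of this kind avoids enumerating every combination of candidates, which grows exponentially with the number of query terms, at the cost of possibly missing the global maximum.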

22
Phrase detection and translation
  • Multi-word phrases are detected by BaseNP
    identification (Xun, 2000)
  • Translation pattern ($PATT_e$), e.g.
  • ??
  • ??
  • Phrase translation:
  • $T_c = \arg\max P(O_{T_c} \mid PATT_e)\, P(T_c)$
  • $P(O_{T_c} \mid PATT_e)$: prob. of the translation pattern
  • $P(T_c)$: prob. of the phrase in Chinese LM
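
The phrase-translation score on this slide multiplies a pattern probability by a language-model probability. The Python sketch below illustrates that product under stated assumptions: a translation pattern is modeled as a word-order permutation with a probability, and the language model is an arbitrary callable. Both are simplifications made for illustration, and `best_phrase_translation` and its parameters are hypothetical names.

```python
import math
from itertools import product


def best_phrase_translation(word_options, patterns, lm_prob):
    """Return the highest-scoring candidate phrase, or None.

    word_options: list of candidate-translation lists, one per English word.
    patterns: dict mapping an ordering tuple such as (1, 0) to its
        probability (assumed stand-in for the translation pattern).
    lm_prob: function giving the LM probability of a tuple of Chinese words.
    """
    best, best_score = None, -math.inf
    for words in product(*word_options):
        for order, p_pattern in patterns.items():
            candidate = tuple(words[k] for k in order)
            p_lm = lm_prob(candidate)
            if p_pattern <= 0 or p_lm <= 0:
                continue  # a zero-probability candidate can never win
            # log-space sum equals the log of the product in the formula
            score = math.log(p_pattern) + math.log(p_lm)
            if score > best_score:
                best, best_score = candidate, score
    return best


if __name__ == "__main__":
    table = {("B1", "A2"): 0.5, ("A1", "B1"): 0.4}
    lm = lambda seq: table.get(seq, 0.01)
    print(best_phrase_translation([["A1", "A2"], ["B1"]],
                                  {(0, 1): 0.3, (1, 0): 0.7}, lm))
```

Working in log space keeps the comparison numerically stable when the probabilities are small, which is typical for language-model scores over multi-word phrases.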

23
Using translation model (TM)
  • Enhance the coverage of the lexicon
  • Using TM:
  • $T_c = \arg\max P(T_e \mid T_c)\, SIM(T_c)$
  • Mining parallel texts from the Web for TM training

24
Experiments on TREC-5&6
  • Monolingual
  • Simple translation: lexicon lookup
  • Best-sense translation: translations selected manually
  • Improved translation (our method)
  • Machine translation: using the IBM MT system

25
Summary of Experiments

26
Outline
  • Introduction
  • Finding the best indexing units for Chinese IR
  • Query translation
  • Query expansion
  • Experimental results in TREC 9
  • Conclusion

27
Query Expansion (QE)
  • Pseudo-relevance feedback:
  • Top-ranked documents (n)
  • Term selection (m)
  • Term weighting (w)
  • Document length normalization:
  • Sub-document (500 characters)
  • Pre-translation QE and post-translation QE
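
The parameters n, m, and w above can be made concrete with a small Python sketch of pseudo-relevance feedback. It rests on assumptions that the slides do not confirm: expansion terms are ranked by raw frequency in the top n documents, w is read as a pair of weights for original and expansion terms, and sub-documents are fixed-size character windows. `split_subdocuments` and `expand_query` are hypothetical names.

```python
from collections import Counter


def split_subdocuments(text, size=500):
    """Cut a document into consecutive passages of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def expand_query(query_weights, ranked_docs, n=10, m=300, w_orig=0.6, w_exp=0.4):
    """Add the m most frequent terms of the top n documents to the query.

    query_weights: dict mapping term to weight in the original query.
    ranked_docs: list of term lists, best-ranked document first.
    """
    counts = Counter()
    for doc_terms in ranked_docs[:n]:
        counts.update(doc_terms)

    # original terms are scaled by w_orig
    expanded = {t: w_orig * wt for t, wt in query_weights.items()}
    top = counts.most_common(m)
    if top:
        max_count = top[0][1]
        for term, count in top:
            # expansion weight is frequency normalized to [0, 1], scaled by w_exp
            expanded[term] = expanded.get(term, 0.0) + w_exp * count / max_count
    return expanded


if __name__ == "__main__":
    docs = [["a", "a", "b", "b", "b"], ["b", "c"], ["z"]]
    print(expand_query({"a": 1.0}, docs, n=2, m=2))
    print([len(p) for p in split_subdocuments("x" * 1200)])
```

Splitting long documents into fixed-size passages before feedback is one simple way to keep a long document from dominating term selection purely because of its length.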

28
Experiments on TREC-5&6 (1)
  • Post-translation QE
  • ltu: n = 10, m = 300, w = 0.6/0.4
  • ltc: n = 20, m = 500, w = 0.3/0.7

29
Experiments on TREC-5&6 (2)
  • Pre-translation QE
  • English collection: FBIS
  • ltu: n = 10, m = 10, w = 0.5/0.5

30
Outline
  • Introduction
  • Finding the best indexing units for Chinese IR
  • Query translation
  • Query expansion
  • Experimental results in TREC 9
  • Conclusion

31
Experiments in TREC 9

32
Outline
  • Introduction
  • Finding the best indexing units for Chinese IR
  • Query translation
  • Query expansion
  • Experimental results in TREC 9
  • Conclusion

33
Conclusion
  • Best indexing unit for Chinese IR:
  • Words + characters + unknown words
  • Improved lexicon-based query translation:
  • Translation disambiguation using co-occurrence
  • Phrase detecting and translation using language
    model
  • Translation coverage enhancement using
    translation model
  • Query expansion

34
Conclusion: TREC 9
  • Pre-translation QE does not help
  • Our approach achieves the same effectiveness as the
    IBM MT system
  • The best result is obtained by combining IBM MT
    system and our approach
  • OOV is still the bottleneck for improving the
    performance of CLIR

35
Thanks! More information: jfgao_at_microsoft.com,
mingzhou_at_microsoft.com