Jianfeng Gao - PowerPoint PPT Presentation

Title: Jianfeng Gao

1
TREC-9 CLIR Experiments at MSRCN

Jianfeng Gao
Microsoft Research China (MSRCN)

2
People

Jianfeng Gao, Microsoft Research China
Jian-Yun Nie, Université de Montréal
Jian Zhang, Tsinghua University, China
Endong Xun, Microsoft Research China
Yi Su, Tsinghua University, China
Ming Zhou, Microsoft Research China
Changning Huang, Microsoft Research China

3
What is TREC?

A workshop series that provides the infrastructure for large-scale testing of text retrieval technology
Realistic test collection
Uniform, appropriate scoring procedures
A forum for the exchange of research ideas and for the discussion of research methodology
Sponsored by NIST, DARPA/ITO, ARDA

4
TREC-9 Task Tracks

Cross-Language Information Retrieval (CLIR)
Filtering
Interactive
Query
Question Answering
Spoken Document Retrieval
Web Track

5
TREC-9 CLIR Task

Given a topic in English, retrieve the top 1000
documents ranked by similarity to the topic from
a collection of Chinese newspaper/wire documents.

6
TREC-9 CLIR Topics

25 English topics (CH55-79) created at NIST
Example:
Number: CH55
Title: World Trade Organization membership
Description: What speculations on the effects of the entry of China or Taiwan into the World Trade Organization (WTO) are being reported in the Asian press?
Narrative: Documents reporting support by other nations for China's or Taiwan's entry into the World Trade Organization (WTO) are not relevant.

7
TREC-9 CLIR Document Collection

126,937 documents, 188 MB
Traditional Chinese, BIG5 encoding
Sources:
Hong Kong Commercial Daily (11 Aug 98 - 31 Jul 99)
Hong Kong Daily News (1 Feb 99 - 31 Jul 99)
Takungpao (21 Oct 98 - 4 Mar 99)

8
Participants

BBN Technologies
Fudan University
IBM T.J. Watson Research Center
Johns Hopkins University
Korea Advanced Institute of Science and Technology
Microsoft Research, China
MNIS-TextWise Labs
National Taiwan University

9
Participants (cont.)

Queens College, CUNY
RMIT University
Telcordia Technologies, Inc.
The Chinese University of Hong Kong
Trans-EZ Inc.
University of California at Berkeley
University of Maryland
University of Massachusetts

10
Outline

Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion

11
Introduction (1)

Participating in TREC for the first time
System: modified version of SMART
Pre-processing: word segmentation

12
Introduction (2)

Our work involves two aspects:
Chinese IR
Finding the best indexing unit
Query expansion, etc.
CLIR: query translation
Translation disambiguation using co-occurrence
Phrase detection and translation using a language model
Translation coverage enhancement using a translation model
Resources
Lexicon: Chinese, bilingual (LDC, HIT, etc.)
Corpus: Chinese, bilingual
Software tools: NLPWin, IBM MT, etc.

13
Outline

Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion

14
Characteristics of Chinese IR

Chinese language issues
No standard definition of word and lexicon
No space between words
The word is the basic unit of indexing in traditional IR
In this study
Basic unit of indexing in Chinese IR: word, n-gram, or mixed?
Does the accuracy of word segmentation have a significant impact on IR performance?

15
Indexing Units for Chinese IR

Using n-grams
No linguistic knowledge required
Character unigrams and bigrams are widely used (the average length of a Chinese word is 1.6 characters)
Using words
Linguistic knowledge is required for word segmentation: dictionary, heuristic rules, ...
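The n-gram indexing described on this slide can be sketched as follows. This is a minimal illustration, not the indexing code used in the experiments; the function names `char_ngrams` and `unigram_bigram_terms` are hypothetical.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of text, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def unigram_bigram_terms(text):
    """Index terms for a mixed unigram + bigram representation (illustrative)."""
    return char_ngrams(text, 1) + char_ngrams(text, 2)


if __name__ == "__main__":
    # A four-character string yields three overlapping bigrams.
    print(char_ngrams("信息检索", 2))
```

No dictionary is consulted at any point, which is why this representation needs no linguistic knowledge.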

16
Possible representations in Chinese IR

17
Experiments

Impact of the dictionary: longest matching with a small dictionary and with a large dictionary
Combining the first method with single characters
Using full segmentation
Using bigrams and unigrams (characters)
Combining words with bigrams and characters
Unknown word detection using NLPWin
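A minimal sketch of dictionary-based longest matching, the segmentation method named in the first experiment. It assumes forward (left-to-right) matching with a fallback to single characters; the slides do not say which direction was used, and `longest_match_segment`, `lexicon`, and `max_len` are hypothetical names.

```python
def longest_match_segment(text, lexicon, max_len=4):
    """Segment text greedily, always taking the longest lexicon entry.

    Assumption: forward matching; characters not covered by the lexicon
    are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    large = {"信息", "检索", "信息检索", "系统"}
    small = {"信息"}
    print(longest_match_segment("信息检索系统", large))
    print(longest_match_segment("信息检索系统", small))
```

With the smaller lexicon more of the text falls back to single characters, which is the kind of effect the small-versus-large dictionary comparison measures.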

18
Summary of Experiments

19
Outline

Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion

20
Query Translation

Problems of simple lexicon-based approaches
Lexicon is incomplete
Difficult to select correct translations
Our improved lexicon-based approach
Term disambiguation using co-occurrence
Phrase detecting and translation using LM
Translation coverage enhancement using TM

21
Term disambiguation

Assumption: correct translation words tend to co-occur in the Chinese language
A greedy algorithm:
for English terms $T_e = (e_1, \ldots, e_n)$,
find their Chinese translations $T_c = (c_1, \ldots, c_n)$,
such that $T_c = \arg\max \mathrm{SIM}(c_1, \ldots, c_n)$
Term-similarity matrix trained on a Chinese corpus
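One possible greedy selection can be sketched as below. The slides do not spell out the greedy procedure, so the strategy here is an assumption: for each English term, pick the candidate whose summed similarity to the candidates of all other terms is highest. The names `disambiguate`, `candidates`, and `table` are hypothetical, and the similarity values are made up for illustration.

```python
def sim(a, b, table):
    """Look up a symmetric term similarity; unseen pairs score 0.0."""
    return table.get((a, b), table.get((b, a), 0.0))


def disambiguate(candidates, table):
    """Greedy selection: one translation per English term.

    candidates: list of candidate-translation lists, one per English term.
    table: dict mapping (term, term) pairs to similarity scores.
    Assumption: cohesion is the sum of similarities to all candidates
    of the other terms.
    """
    chosen = []
    for i, options in enumerate(candidates):
        others = [c for j, opts in enumerate(candidates) if j != i for c in opts]

        def cohesion(c):
            return sum(sim(c, o, table) for o in others)

        chosen.append(max(options, key=cohesion))
    return chosen


if __name__ == "__main__":
    # "bank interest": illustrative candidates and similarity scores.
    candidates = [["银行", "河岸"], ["利率", "兴趣"]]
    table = {("银行", "利率"): 0.9, ("银行", "兴趣"): 0.2, ("河岸", "兴趣"): 0.1}
    print(disambiguate(candidates, table))
```

Scoring each term independently avoids enumerating every combination of candidates, at the cost of not guaranteeing the global maximum of SIM.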

22
Phrase detection and translation

Multi-word phrases are detected by BaseNP identification (Xun, 2000)
Translation pattern ($\mathrm{PATT}_e$), e.g.
??
??
Phrase translation
$T_c = \arg\max P(O_{T_c} \mid \mathrm{PATT}_e) \, P(T_c)$
$P(O_{T_c} \mid \mathrm{PATT}_e)$: probability of the translation pattern
$P(T_c)$: probability of the phrase in the Chinese LM
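The phrase-translation rule above picks the candidate that maximizes the product of the two probabilities. A minimal sketch follows, assuming both probabilities have already been estimated elsewhere and are strictly positive; the function name and the numbers in the example are hypothetical.

```python
import math


def best_phrase_translation(candidates):
    """Return the Chinese phrase maximizing P(pattern) * P(phrase under LM).

    candidates: iterable of (chinese_phrase, p_pattern, p_lm) tuples.
    Assumption: both probabilities are > 0, so logs are defined.
    Scores are summed in log space to avoid underflow on long phrases.
    """
    def log_score(item):
        _, p_pattern, p_lm = item
        return math.log(p_pattern) + math.log(p_lm)

    return max(candidates, key=log_score)[0]


if __name__ == "__main__":
    # Illustrative values only, not figures from the experiments.
    candidates = [
        ("世界贸易组织", 0.6, 0.02),
        ("贸易世界组织", 0.3, 0.001),
    ]
    print(best_phrase_translation(candidates))
```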

23
Using translation model (TM)

Enhance the coverage of the lexicon
Using TM
$T_c = \arg\max P(T_e \mid T_c) \, \mathrm{SIM}(T_c)$
Mining parallel texts from the Web for TM training

24
Experiments on TREC-5&6

Monolingual
Simple translation: lexicon lookup
Best-sense translation: manually selected
Improved translation (our method)
Machine translation: using the IBM MT system

25
Summary of Experiments

26
Outline

Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion

27
Query Expansion (QE)

Pseudo-relevance feedback
Top-ranked documents (n)
Term selection (m)
Term weighting (w)
Document length normalization
Sub-document (500 characters)
Pre-translation QE and post-translation QE
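Pseudo-relevance feedback with the three parameters on this slide (n top-ranked documents, m selected terms, weighting w) can be sketched as follows. This is an assumption-laden illustration: the slides do not say how terms are selected or how w is applied, so the sketch selects terms by raw frequency in the feedback documents and treats w as an original/expansion weight pair. All names are hypothetical.

```python
from collections import Counter


def expand_query(query, ranked_docs, n, m, w_orig, w_exp):
    """Expand a weighted query using the top-n documents of a first retrieval.

    query: dict term -> weight.
    ranked_docs: list of token lists, best-ranked first.
    Assumptions: the m most frequent feedback terms are selected, and
    their normalized counts are mixed in with weight w_exp.
    """
    feedback = Counter()
    for doc in ranked_docs[:n]:
        feedback.update(doc)
    top_terms = feedback.most_common(m)
    total = sum(count for _, count in top_terms) or 1

    expanded = {term: w_orig * weight for term, weight in query.items()}
    for term, count in top_terms:
        expanded[term] = expanded.get(term, 0.0) + w_exp * count / total
    return expanded


if __name__ == "__main__":
    docs = [["wto", "china", "trade"], ["trade", "taiwan"], ["sports"]]
    print(expand_query({"wto": 1.0}, docs, n=2, m=1, w_orig=0.6, w_exp=0.4))
```

The same routine applies to pre-translation and post-translation QE; only the language of the query and of the feedback collection differs.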

28
Experiments on TREC-5&6 (1)

Post-translation QE
ltu: n = 10, m = 300, w = 0.6/0.4
ltc: n = 20, m = 500, w = 0.3/0.7
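The slides do not define ltu and ltc. In standard SMART notation, ltc means logarithmic term frequency, idf, and cosine normalization, while ltu replaces the cosine step with pivoted unique normalization. A minimal sketch of ltc only, assuming natural logarithms; the function and argument names are hypothetical.

```python
import math


def ltc_weights(tf, df, n_docs):
    """Compute ltc term weights for one document or query.

    tf: dict term -> raw term frequency in this text.
    df: dict term -> number of documents containing the term.
    n_docs: total number of documents in the collection.
    l: 1 + ln(tf), t: ln(N / df), c: divide by the vector's L2 norm.
    """
    raw = {
        term: (1.0 + math.log(freq)) * math.log(n_docs / df[term])
        for term, freq in tf.items()
        if freq > 0
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    if norm == 0.0:
        return raw
    return {term: value / norm for term, value in raw.items()}


if __name__ == "__main__":
    print(ltc_weights({"a": 1, "b": 1}, {"a": 10, "b": 10}, n_docs=100))
```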

29
Experiments on TREC-5&6 (2)

Pre-translation QE
English collection: FBIS
ltu: n = 10, m = 10, w = 0.5/0.5

30
Outline

Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion

31
Experiments in TREC 9

32
Outline

Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion

33
Conclusion

Best indexing unit for Chinese IR
Words + characters + unknown words
Improved lexicon-based query translation
Translation disambiguation using co-occurrence
Phrase detection and translation using a language model
Translation coverage enhancement using a translation model
Query expansion

34
Conclusion: TREC 9

Pre-translation QE does not help
Our approach achieves the same effectiveness as the IBM MT system
The best result is obtained by combining the IBM MT system and our approach
Out-of-vocabulary (OOV) terms are still the bottleneck for improving the performance of CLIR

35
Thanks! More information: jfgao_at_microsoft.com, mingzhou_at_microsoft.com
