Title: Statistical Machine Translation Models for Personalized Search
1 Statistical Machine Translation Models for Personalized Search
- Rohini U
- AOL India R&D, Bangalore, India
- Rohini.uppuluri_at_corp.aol.com
- Vamshi Ambati
- Language Technologies Institute
- Carnegie Mellon University Pittsburgh, USA
- vamshi_at_cs.cmu.edu
- Vasudeva Varma
- SIEL, LTRC, IIIT Hyderabad, India
- vv_at_iiit.ac.in
2 Agenda
- Introduction
- Related Work
- Background
- User Profile as Translation Model
- Personalized Search
- Learning User Profile
- Re-ranking
- Experiments
- Conclusions and Future Work
3 Introduction
- Current Web search engines
  - Provide users with documents relevant to their information need
- Issues
  - Information overload
    - Must cater to hundreds of millions of users
    - Terabytes of data
  - Poor description of the information need
    - Short queries are difficult to understand
    - Word ambiguities
  - Users see only the top few results
- Relevance
  - Is subjective and depends on the user
  - One size fits all?
4 Introduction (continued)
- Search is not a solved problem!
- Poorly described information need
  - Java (Java island / Java programming language)
  - Jaguar (cat / car)
  - Lemur (animal / Lemur toolkit)
  - SBH (State Bank of Hyderabad / Syracuse Behavioral Healthcare)
- Given prior information
  - "I am into biology": best guess for Jaguar?
  - Past queries "information retrieval", "language modeling": best guess for Lemur?
5 Review of Personalized Search
- Personalized search approaches
  - Query logs
  - Machine learning
  - Language modeling
  - Community based
  - Others
6 Statistical Language Modeling based Approaches: Introduction
- Statistical language modeling: the task of estimating a probability distribution that captures the statistical regularities of natural language
- Applied to a number of problems: speech, machine translation, IR, summarization
7 Statistical Language Modeling based Approaches: Background
[Figure: query formulation model linking the user's information need (the ideal document) to the query, e.g. "Lemur"]
- Given a query, which document is most likely to be the ideal document?
- $P(Q \mid D) = P(q_1 \ldots q_n \mid D) = \prod_{i=1}^{n} P(q_i \mid D)$
- In spite of the progress, not much work has been done to capture, model and integrate user context!
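A minimal sketch of the unigram query likelihood on this slide. The slide does not say how word probabilities are estimated, so the Jelinek-Mercer smoothing, the `lam` weight, and the toy documents below are assumptions made only for illustration.

```python
import math
from collections import Counter

def query_log_likelihood(query, document, collection, lam=0.5):
    """log P(Q | D) as a sum of log P(q_i | D) over the query words.

    P(q_i | D) is a mix of the document and collection frequencies
    (Jelinek-Mercer smoothing, an assumption not stated on the slide).
    """
    doc_tf = Counter(document)
    col_tf = Counter(collection)
    score = 0.0
    for q in query:
        p_doc = doc_tf[q] / len(document) if document else 0.0
        p_col = col_tf[q] / len(collection) if collection else 0.0
        p = lam * p_doc + (1 - lam) * p_col
        if p == 0.0:
            return float("-inf")  # query word unseen anywhere in the collection
        score += math.log(p)
    return score

# Toy documents (hypothetical, loosely based on the Lemur example in the deck).
docs = {
    "animal": "lemur encyclopedia brief description animal".split(),
    "toolkit": "lemur toolkit language modeling information retrieval".split(),
}
collection = [w for d in docs.values() for w in d]
query = ["lemur", "retrieval"]
scores = {name: query_log_likelihood(query, d, collection) for name, d in docs.items()}
best = max(scores, key=scores.get)
```

Here `best` is the document with the highest likelihood of having generated the query. Nothing in this score depends on the user, which is the gap the rest of the deck addresses.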
8 Noisy Channel based approach: Motivation
[Figure: ideal document, query generation process (noisy channel), retrieval]
9 Similar to Statistical Machine Translation
- Given an English sentence, translate it into French
- Given a query, retrieve documents closer to the ideal document
- Noisy channel 1: English sentence and French sentence, P(e | f)
- Noisy channel 2: ideal document and query, P(q | w)
10 Learning user profile
- User profile = translation model
  - Triples (qw, dw, p(qw | dw)), where qw is a query word and dw is a document word
- Use Statistical Machine Translation methods
- Learning a user profile = training a translation model
- In SMT, a translation model is trained
  - From parallel texts
  - Using the EM algorithm
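The EM training mentioned above can be sketched with IBM Model 1, the simplest translation model and one of the training models compared later in the deck. This is a sketch under stated assumptions, not the authors' implementation: it omits the NULL word, runs a fixed number of iterations, and the function name and toy pairs are hypothetical.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t(qw | dw) by EM from (query_words, document_words) pairs."""
    q_vocab = {q for q_words, _ in pairs for q in q_words}
    # Start from a uniform distribution over the query vocabulary.
    t = defaultdict(lambda: 1.0 / len(q_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected (qw, dw) alignment counts
        total = defaultdict(float)  # expected counts per document word
        # E-step: spread each query word over the document words of its pair.
        for q_words, d_words in pairs:
            for q in q_words:
                norm = sum(t[(q, d)] for d in d_words)
                for d in d_words:
                    frac = t[(q, d)] / norm
                    count[(q, d)] += frac
                    total[d] += frac
        # M-step: renormalise so that sum over qw of t(qw | dw) is 1 for each dw.
        t = defaultdict(float, {(q, d): c / total[d] for (q, d), c in count.items()})
    return dict(t)

# Hypothetical parallel text for one user: (query words, clicked-snippet words).
pairs = [
    (["lemur"], ["lemur", "toolkit", "retrieval"]),
    (["lemur", "download"], ["lemur", "toolkit", "download"]),
    (["language", "modeling"], ["language", "modeling", "retrieval"]),
]
profile = train_ibm_model1(pairs)
```

The returned dictionary maps (qw, dw) to p(qw | dw), which is the triple form of the user profile described on this slide.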
11 Learning user profile
- Extracting parallel texts
  - From queries and the corresponding snippets of clicked documents
- Training a translation model
  - GIZA++: an open source toolkit widely used for training translation models in Statistical Machine Translation research
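A possible way to build the parallel text from a click log is sketched below. The record layout (`query` and `snippet` keys), the tokenizer, and the tiny stopword list are all hypothetical; the deck does not describe its preprocessing.

```python
import re

# Tiny illustrative stopword list (an assumption, not taken from the deck).
STOPWORDS = {"a", "an", "and", "for", "is", "of", "the", "this", "to"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def build_parallel_corpus(click_log):
    """Turn (query, clicked snippet) records into (query_words, snippet_words) pairs."""
    pairs = []
    for record in click_log:
        q_words = tokenize(record["query"])
        d_words = tokenize(record["snippet"])
        if q_words and d_words:  # skip records that are empty after filtering
            pairs.append((q_words, d_words))
    return pairs

click_log = [
    {"query": "lemur",
     "snippet": "The Lemur toolkit for language modeling and information retrieval "
                "is documented and made available for download."},
]
pairs = build_parallel_corpus(click_log)
```

Each pair plays the role of a sentence pair in SMT: the snippet is the "source" side and the query is the "target" side.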
12 Sample user profile
13 Reranking
- Recall the general LM approach for IR
- Noisy channel based approach
- Example: query "lemur", user profile entry P(lemur | retrieval)
  - Snippet: "Lemur - Encyclopedia gives a brief description of the physical traits of this animal." (terms: Lemur, encyclopedia, brief)
  - Snippet: "The Lemur toolkit for language modeling and information retrieval is documented and made available for download." (terms: Lemur, toolkit, information, retrieval)
  - Documents shown in the figure: D1, D4
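One way to turn the user profile into a reranking score is to sum, for each query word, the profile probabilities over the words of a candidate document, weighted by how often each word occurs in it. The deck does not spell out its exact scoring formula, so the function below, the `floor` value, and the profile numbers are assumptions for illustration only.

```python
import math
from collections import Counter

def noisy_channel_score(query, doc_words, profile, floor=1e-6):
    """log score = sum over q of log( sum over w in D of t(q | w) * P(w | D) )."""
    tf = Counter(doc_words)
    n = len(doc_words)
    score = 0.0
    for q in query:
        p = sum(profile.get((q, w), 0.0) * c / n for w, c in tf.items())
        score += math.log(max(p, floor))  # floor avoids log(0) for unseen pairs
    return score

def rerank(query, results, profile):
    """results: list of (doc_id, doc_words) in the search engine's original order."""
    return sorted(results,
                  key=lambda r: noisy_channel_score(query, r[1], profile),
                  reverse=True)

# Hypothetical profile of a user whose past clicks were about IR tools.
profile = {
    ("lemur", "lemur"): 0.4,
    ("lemur", "toolkit"): 0.3,
    ("lemur", "retrieval"): 0.3,
}
results = [
    ("animal", "lemur encyclopedia brief description physical traits animal".split()),
    ("toolkit", "lemur toolkit language modeling information retrieval".split()),
]
reranked = [doc_id for doc_id, _ in rerank(["lemur"], results, profile)]
```

With this profile the toolkit page moves above the encyclopedia page, because the user's profile associates "lemur" with "toolkit" and "retrieval".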
14 Experiments
- Performed evaluation on explicit feedback data collected from 7 users
- Experiments
  - Comparison with contextless ranking
  - Comparison between different training models and contexts
15 Data and Set up
- Data
  - Explicit feedback data collected from 7 users
  - For each query, each user examined the top 10 documents and identified the relevant ones
  - Collected the top 10 results for all queries: 3469 documents in total
- Set up
  - Created a Lucene index over the 3469 documents
  - For reranking, first retrieve the results using Lucene and then rerank them using the noisy channel approach
  - We perform 10-fold cross validation
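A sketch of how the 10-fold cross validation could be organised. The deck does not say what unit is split into folds; splitting each user's queries is an assumption here, and the helper name and seed are hypothetical.

```python
import random

def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists so that every item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Placeholder query ids; 37 matches the number of queries issued by user 1.
queries = [f"q{i}" for i in range(37)]
splits = list(k_fold_splits(queries))
```

In each split the user profile would be learned from the training queries and the reranker evaluated on the held-out ones.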
16 Data
User | No. of Q | Unique words in Q | Total Rel | Avg. Rel
1    | 37       | 89                | 236       | 6.378
2    | 50       | 68.42             | 178       | 3.56
3    | 61       | 82.63             | 298       | 4.885
4    | 26       | 86.95             | 101       | 3.884
5    | 33       | 80.76             | 134       | 4.06
6    | 29       | 78.08             | 98        | 3.379
7    | 29       | 88.31             | 115       | 3.965
17 Metrics
- Precision@n
  - Number of relevant documents among the top n results / n
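The metric can be computed as in the short sketch below. The function name and the document ids are hypothetical.

```python
def precision_at_n(ranked_doc_ids, relevant_doc_ids, n=10):
    """Fraction of the top n ranked documents that the user marked relevant."""
    top_n = ranked_doc_ids[:n]
    hits = sum(1 for doc_id in top_n if doc_id in relevant_doc_ids)
    return hits / n

ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}
p_at_5 = precision_at_n(ranked, relevant, n=5)  # d3 and d1 are relevant: 2 / 5
```

Dividing by n rather than by the number of returned documents means a query with fewer than n results is penalised for the missing ones.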
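18 Set up
[Figure: the data is split into train data and test data; the User Profile Learner builds user profiles from the train data, and the Reranker applies them to the test data to produce reranked results]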
19 Results: comparison with contextless ranking
User | Contextless | Proposed
1    | 0.1433      | 0.2445
2    | 0.1426      | 0.2445
3    | 0.1016      | 0.1216
4    | 0.0557      | 0.1541
5    | 0.1887      | 0.3933
6    | 0.1566      | 0.3941
7    | 0.1         | 0.1833
Avg  | 0.1268      | 0.2332
20 Results
Training model | IBM Model 1 | IBM Model 1 | GIZA++   | GIZA++
Training data  | Document    | Snippet     | Document | Snippet
Document test  | 0.2062      | 0.2333      | 0.1799   | 0.2075
Snippet test   | 0.2028      | 0.2488      | 0.1834   | 0.2034
21 Results
- I: Document training and document testing
- II: Document training and snippet testing
- III: Snippet training and document testing
- IV: Snippet training and snippet testing
22 Conclusions and Future Work
- Proposed a statistical MT based approach for modeling the user profile
  - Captures richer context and relations between q and w
- In future
  - N-gram based method: trigrams etc.
  - Noisy channel based method: bigram