Statistical Machine Translation Models for Personalized Search

1
Statistical Machine Translation Models for
Personalized Search
  • Rohini U
    AOL India R&D, Bangalore, India
    Rohini.uppuluri_at_corp.aol.com
  • Vamshi Ambati
    Language Technologies Institute,
    Carnegie Mellon University, Pittsburgh, USA
    vamshi_at_cs.cmu.edu
  • Vasudeva Varma
    SIEL, LTRC, IIIT Hyderabad, India
    vv_at_iiit.ac.in

2
Agenda
  • Introduction
  • Related Work
  • Background
  • User Profile as Translation Model
  • Personalized Search
  • Learning User Profile
  • Re-ranking
  • Experiments
  • Conclusions and Future Work

3
Introduction
  • Current Web search engines
    • Provide users with documents relevant to their
      information need
  • Issues
    • Information overload
      • Must cater to hundreds of millions of users
      • Terabytes of data
    • Poor description of the information need
      • Short queries are difficult to understand
      • Word ambiguities
    • Users only see the top few results
  • Relevance
    • Subjective: depends on the user
    • Does one size fit all?

4
Introduction (continued)
  • Search is not a solved problem!
  • Poorly described information need
    • Java (Java island / Java programming language)
    • Jaguar (cat / car)
    • Lemur (animal / Lemur toolkit)
    • SBH (State Bank of Hyderabad / Syracuse
      Behavioral Healthcare)
  • Given prior information
    • "I am into biology": best guess for Jaguar?
    • Past queries "information retrieval" and "language
      modeling": best guess for Lemur?

5
Review of Personalized Search
  • Personalized Search
    • Query logs
    • Machine learning
    • Language modeling
    • Community based
    • Others

6
Statistical Language Modeling based Approaches
Introduction
  • Statistical language modeling: the task of
    estimating a probability distribution that captures
    the statistical regularities of natural language
  • Applied to a number of problems: speech, machine
    translation, IR, summarization

7
Statistical Language Modeling based Approaches
Background
[Diagram labels: User Information need (Ideal Document); Query Formulation Model; Query (example: "Lemur")]
Given a query, which document is most likely to be the
Ideal Document?
$P(Q \mid D) = P(q_1 \ldots q_n \mid D) = \prod_{i=1}^{n} P(q_i \mid D)$
In spite of the progress, there has not been much work to
capture, model and integrate user context!
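The query-likelihood product above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the add-`mu` smoothing, the function name `query_log_likelihood`, and the two toy documents are assumptions made here so that unseen query words do not force a zero score.

```python
import math
from collections import Counter

def query_log_likelihood(query, document, vocab_size, mu=1.0):
    """Return log P(Q|D) = sum_i log P(q_i|D) under a unigram model.

    P(q_i|D) is estimated with add-mu smoothing (an assumption of this
    sketch; the slides do not say which smoothing was used).
    """
    counts = Counter(document)
    total = len(document)
    score = 0.0
    for q in query:
        p = (counts[q] + mu) / (total + mu * vocab_size)
        score += math.log(p)
    return score

# Hypothetical toy documents, loosely modelled on the Lemur example.
docs = {
    "D1": "lemur encyclopedia gives a brief description of this animal".split(),
    "D4": "the lemur toolkit for language modeling and information retrieval".split(),
}
vocab = {w for tokens in docs.values() for w in tokens}
query = ["lemur", "retrieval"]

scores = {name: query_log_likelihood(query, tokens, len(vocab))
          for name, tokens in docs.items()}
best = max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when the product runs over many query words.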
8
Noisy Channel based Approach: Motivation

[Diagram labels: Ideal Document; Query Generation Process (Noisy Channel); Retrieval]
9
Similar to Statistical Machine Translation
  • Given an English sentence, translate it into French
  • Given a query, retrieve documents closer to the ideal
    document

Noisy Channel 1: English Sentence, French Sentence, $P(e \mid f)$
Noisy Channel 2: Ideal Document, Query, $P(q \mid w)$
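The parallel between the two channels can be written out with Bayes' rule. The block below is the standard noisy-channel decomposition from the SMT literature rather than a formula shown on the slide; $d$ (a candidate document) and the priors $P(e)$ and $P(d)$ are notation assumed here for illustration.

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e) \\
\hat{d} &= \arg\max_{d} P(d \mid q) = \arg\max_{d} P(q \mid d)\, P(d)
\end{align*}
```

In both lines the denominator of Bayes' rule is dropped because it does not depend on the quantity being maximized.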
10
Learning user profile
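  • User profile as a translation model
    • Triples $(q_w, d_w, p(q_w \mid d_w))$: a query word, a
      document word, and the probability of the query
      word given the document word
    • Uses Statistical Machine Translation methods
  • Learning a user profile means training a translation
    model
  • In SMT, a translation model is trained
    • From parallel texts
    • Using the EM algorithm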

11
Learning User profile
  • Extracting parallel texts
    • From queries and the corresponding snippets of
      clicked documents
  • Training a translation model
    • GIZA: an open-source toolkit widely used for
      training translation models in Statistical
      Machine Translation research
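To make the training step concrete, here is a minimal sketch of EM for IBM Model 1 over (query, snippet) pairs. It is a toy re-implementation for illustration only and does not use or reproduce the GIZA toolkit; the function name `train_ibm_model1`, the omission of the NULL word, and the three training pairs are assumptions of this sketch.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(qw, dw)] = p(qw | dw) with EM (IBM Model 1, no NULL word).

    pairs: list of (query_tokens, snippet_tokens) treated as parallel text.
    """
    query_vocab = {qw for query, _ in pairs for qw in query}
    # Start from a uniform distribution over query words.
    t = defaultdict(lambda: 1.0 / len(query_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of qw aligned to dw
        total = defaultdict(float)  # expected count of any qw aligned to dw
        # E-step: distribute each query word over the snippet words.
        for query, snippet in pairs:
            for qw in query:
                norm = sum(t[(qw, dw)] for dw in snippet)
                for dw in snippet:
                    frac = t[(qw, dw)] / norm
                    count[(qw, dw)] += frac
                    total[dw] += frac
        # M-step: renormalise so that sum over qw of p(qw | dw) is 1.
        t = defaultdict(float, {(qw, dw): c / total[dw]
                                for (qw, dw), c in count.items()})
    return t

# Hypothetical click data: a query and the snippet of a clicked document.
pairs = [
    (["lemur"], ["lemur", "toolkit", "retrieval"]),
    (["lemur", "download"], ["toolkit", "download"]),
    (["jaguar"], ["jaguar", "car"]),
]
profile = train_ibm_model1(pairs)
```

The resulting dictionary has the same shape as the triples described above: each key is a (query word, document word) pair and each value is the learned probability.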

12
Sample user profile
13
Reranking
  • Recall, in general LM for IR
  • Noisy Channel based approach
  • $P(Q \mid D) = \prod_{i} P(q_i \mid D)$

Example query: lemur
Profile entry: $P(\text{lemur} \mid \text{retrieval})$
D1 (Lemur, encyclopedia, brief): Lemur - Encyclopedia gives a brief description of
the physical traits of this animal.
D4 (Lemur, toolkit, information retrieval): The Lemur toolkit for language modeling and
information retrieval is documented and made
available for download.
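One way to rerank with a learned profile is to score each query word as a mixture over document words, $P(q \mid D) = \sum_{w} p(q \mid w)\, P(w \mid D)$, in the style of translation-based retrieval models. The slide's own scoring formula did not survive in the transcript, so this form is an assumption, as are the function names, the probability floor, and the profile values, which are invented for illustration.

```python
import math
from collections import Counter

def translation_score(query, document, profile, floor=1e-6):
    """log P(Q|D) with P(q|D) = sum_w p(q|w) * P(w|D).

    profile maps (query_word, doc_word) -> p(query_word | doc_word).
    floor avoids log(0) when no document word translates to q.
    """
    counts = Counter(document)
    n = len(document)
    score = 0.0
    for q in query:
        p = sum(profile.get((q, w), 0.0) * c / n for w, c in counts.items())
        score += math.log(max(p, floor))
    return score

def rerank(query, results, profile):
    """Sort document names by descending translation score."""
    return sorted(results,
                  key=lambda name: translation_score(query, results[name], profile),
                  reverse=True)

# Hypothetical profile of a user interested in information retrieval.
profile = {
    ("lemur", "lemur"): 0.4,
    ("lemur", "retrieval"): 0.3,
    ("lemur", "toolkit"): 0.3,
}
results = {
    "D1": ("lemur encyclopedia gives a brief description of the "
           "physical traits of this animal").split(),
    "D4": ("the lemur toolkit for language modeling and information retrieval "
           "is documented and made available for download").split(),
}
ranking = rerank(["lemur"], results, profile)
```

With this profile, document words such as "toolkit" and "retrieval" contribute probability mass to the query word "lemur", so the toolkit page is placed ahead of the encyclopedia page.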
14
Experiments
  • Performed evaluation on explicit feedback data
    collected from 7 users
  • Experiments
    • Comparison with contextless ranking
    • Comparison between different training models and
      contexts

15

Data and Set up
  • Data
    • Explicit feedback data collected from 7 users
    • For each query, each user examined the top 10
      documents and identified the relevant ones
    • Collected the top 10 results for all queries:
      3469 documents in total
  • Set up
    • Created a Lucene index over the 3469 documents
    • For reranking, first retrieve the results using
      Lucene, then rerank them using the noisy
      channel approach
    • We perform 10-fold cross-validation
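The cross-validation step could be organised as in the following sketch. It assumes that folds are formed over one user's queries, which the slides do not state; the helper name `k_fold_splits`, the fixed seed, and the placeholder query identifiers are likewise illustrative.

```python
import random

def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists for k-fold cross-validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Placeholder identifiers for a user with 37 queries.
queries = [f"q{i}" for i in range(37)]
splits = list(k_fold_splits(queries))
```

In each round the profile would be learned from the training queries and the reranker evaluated on the held-out ones.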

16
Data
User   No. of queries (Q)   Unique words in Q   Total Rel   Avg. Rel per query
1      37                   89                  236         6.378
2      50                   68.42               178         3.56
3      61                   82.63               298         4.885
4      26                   86.95               101         3.884
5      33                   80.76               134         4.06
6      29                   78.08               98          3.379
7      29                   88.31               115         3.965
17
Metrics
  • Precision@n
    • Number of relevant documents among the top n
      results / n
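As a small worked example of this metric, the function below counts how many of the first n ranked documents are in the set judged relevant. The function name and the document identifiers are hypothetical.

```python
def precision_at_n(ranked_docs, relevant, n):
    """Fraction of the top-n ranked documents that are relevant."""
    top = ranked_docs[:n]
    return sum(1 for doc in top if doc in relevant) / n

ranked = ["D4", "D1", "D7", "D2"]   # hypothetical ranking
relevant = {"D4", "D2"}             # hypothetical relevance judgments
p_at_2 = precision_at_n(ranked, relevant, 2)
p_at_4 = precision_at_n(ranked, relevant, 4)
```

The denominator is always n, so a ranking that returns fewer than n documents is penalised for the missing positions.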

18
Set up
Data → Train Data → User Profile Learner → User Profiles
Data → Test Data → Reranker (using the User Profiles) → Reranked Results
19
User   Contextless   Proposed
1      0.1433        0.2445
2      0.1426        0.2445
3      0.1016        0.1216
4      0.0557        0.1541
5      0.1887        0.3933
6      0.1566        0.3941
7      0.1           0.1833
Avg    0.1268        0.2332
20
Results
Training Model   IBM Model1       IBM Model1      GIZA             GIZA
                 Document Train   Snippet Train   Document Train   Snippet Train
Document Test    0.2062           0.2333          0.1799           0.2075
Snippet Test     0.2028           0.2488          0.1834           0.2034
21
Results
I   - Document Training and Document Testing
II  - Document Training and Snippet Testing
III - Snippet Training and Document Testing
IV  - Snippet Training and Snippet Testing
22
Conclusions and Future Work
  • Proposed a statistical MT based approach for user
    modeling
  • Captures richer context: relations between q and
    w
  • In future
    • N-gram based method: trigrams, etc.
    • Noisy channel based method: bigrams

23
  • Questions?

24
  • Thank you
