Title: Statistical Machine Translation Models for Personalized Search
- Rohini U
- AOL India R&D, Bangalore, India
- Rohini.uppuluri_at_corp.aol.com
- Vamshi Ambati
- Language Technologies Institute
- Carnegie Mellon University, Pittsburgh, USA
- vamshi_at_cs.cmu.edu
- Vasudeva Varma
- SIEL, LTRC, IIIT Hyderabad, India
- vv_at_iiit.ac.in
- Introduction
- Related Work
- Background
- User Profile as Translation Model
- Personalized Search
- Learning User Profile
- Re-ranking
- Experiments
- Conclusions and Future Work
- Current Web Search engines
- Provide users with documents relevant to their information need
- Issues
- Information overload
- Catering to hundreds of millions of users
- Terabytes of data
- Poor description of Information need
- Short queries - Difficult to understand
- Word ambiguities
- Users only see top few results
- Relevance
- Subjective: depends on the user
- One size fits all?
- Search is not a solved problem!
- Poorly described information need
- Java (Java island / Java programming language)
- Jaguar (cat / car)
- Lemur (animal / Lemur toolkit)
- SBH (State Bank of Hyderabad / Syracuse Behavioral Healthcare)
- Given prior information
- "I am into biology": best guess for Jaguar?
- Past queries "information retrieval", "language modeling": best guess for Lemur?
Review of Personalized Search
- Personalized Search
- Query logs
- Machine learning
- Language modeling
- Community based
Statistical Language Modeling based Approaches
- Statistical language modeling: the task of estimating a probability distribution that captures the statistical regularities of natural language
- Applied to a number of problems: speech, machine translation, IR, summarization
Statistical Language Modeling based Approaches: Query Formulation Model
- Given a query, which document is most likely to be the Ideal Document?
- $P(Q/D) = P(q_1 \ldots q_n/D) = \prod_{i=1}^{n} P(q_i/D)$ (query words assumed independent given D)
- The user's information need is represented by the Ideal Document
- Despite this progress, little work has been done to capture, model, and integrate user context!
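The query likelihood P(Q/D) above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say how P(qi/D) is estimated, so the linear interpolation with a collection model, the weight `lam`, and the function name `query_log_likelihood` are all assumptions made here.

```python
import math
from collections import Counter


def query_log_likelihood(query, document, collection, lam=0.5):
    """Return log P(Q/D) for a unigram model.

    query, document, collection: lists of tokens.
    lam: interpolation weight between document and collection estimates
         (an assumed smoothing choice, not taken from the slides).
    """
    doc_counts = Counter(document)
    coll_counts = Counter(collection)
    doc_len = len(document)
    coll_len = len(collection)

    score = 0.0
    for qw in query:
        p_doc = doc_counts[qw] / doc_len if doc_len else 0.0
        p_coll = coll_counts[qw] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            # The word occurs nowhere, so the product over query words is zero.
            return float("-inf")
        score += math.log(p)  # sum of logs == log of the product over qi
    return score


toolkit_doc = "lemur toolkit language modeling information retrieval".split()
animal_doc = "lemur animal madagascar".split()
collection = toolkit_doc + animal_doc
query = ["lemur", "toolkit"]
print(query_log_likelihood(query, toolkit_doc, collection))
print(query_log_likelihood(query, animal_doc, collection))
```

Working in log space avoids numeric underflow when the product runs over many query words.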
Noisy Channel based Approach: Motivation
- [Diagram: Ideal Document and the Query Generation Process (Noisy Channel)]
Similar to Statistical Machine Translation
- Given an English sentence, translate it into French
- Given a query, retrieve documents closer to the Ideal Document
- Noisy Channel 1: English sentence and French sentence
- Noisy Channel 2: Ideal Document and query
Learning User Profile
- User profile = translation model
- Triples (qw, dw, p(qw/dw)), where qw is a query word and dw is a document word
- Use Statistical Machine Translation methods
- Learning a user profile = training a translation model
- In SMT, a translation model is trained
- From parallel texts
- Using the EM algorithm
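To make the EM step concrete, here is a minimal sketch of estimating the triples (qw, dw, p(qw/dw)) from parallel texts. It assumes an IBM Model 1 style word translation model without a NULL word; the slides only say that SMT methods and EM are used, so the model choice, the function name `train_model1`, and the toy data are assumptions.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(qw, dw)], an approximation of p(qw/dw), with EM.

    pairs: list of (query_words, document_words) token lists.
    Returns a dict mapping (qw, dw) to a probability.
    """
    query_vocab = {qw for q_words, _ in pairs for qw in q_words}
    # Uniform initialisation over the query vocabulary.
    t = defaultdict(lambda: 1.0 / len(query_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (qw, dw) alignments
        total = defaultdict(float)  # expected count of dw being aligned

        # E-step: spread each query word over the words of its paired text.
        for q_words, d_words in pairs:
            for qw in q_words:
                norm = sum(t[(qw, dw)] for dw in d_words)
                for dw in d_words:
                    frac = t[(qw, dw)] / norm
                    count[(qw, dw)] += frac
                    total[dw] += frac

        # M-step: renormalise so that sum over qw of p(qw/dw) is 1 for each dw.
        t = defaultdict(
            float, {(qw, dw): c / total[dw] for (qw, dw), c in count.items()}
        )
    return dict(t)


toy_pairs = [
    (["lemur", "download"], ["lemur", "toolkit", "download"]),
    (["lemur"], ["lemur", "toolkit"]),
]
profile = train_model1(toy_pairs)
for (qw, dw), p in sorted(profile.items()):
    print(qw, dw, round(p, 3))
```

Each printed row is one triple of the user profile. A production toolkit would add a NULL word, further alignment models, and pruning of low-probability entries.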
Learning User Profile
- Extracting Parallel Texts
- From queries and the corresponding snippets of clicked documents
- Training a Translation Model
- GIZA++: an open source toolkit widely used for training translation models in Statistical Machine Translation research
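The extraction of parallel texts can be sketched as follows. The click-log format (an iterable of query and clicked-snippet strings), the tokenizer, and the function names are hypothetical; the slides do not describe the actual log format or any preprocessing.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_parallel_corpus(click_log):
    """Turn (query, clicked_snippet) string pairs into token-list pairs.

    click_log: iterable of (query, snippet) tuples, a hypothetical format.
    Pairs in which either side is empty after tokenization are skipped.
    """
    pairs = []
    for query, snippet in click_log:
        q_words = tokenize(query)
        s_words = tokenize(snippet)
        if q_words and s_words:
            pairs.append((q_words, s_words))
    return pairs


log = [
    ("Lemur toolkit",
     "The Lemur toolkit for language modeling and information retrieval."),
    ("lemur", "Lemur - Encyclopedia gives a brief description of this animal."),
]
for q_words, s_words in build_parallel_corpus(log):
    print(q_words, "|||", s_words)
```

Each resulting pair plays the role of one sentence pair in SMT training: the snippet is the source side and the query is the target side.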
Sample User Profile
- Recall, in general LM for IR
- Noisy Channel based approach
- Lemur encyclopedia brief
- Lemur toolkit information retrieval
- Lemur - Encyclopedia gives a brief description of the physical traits of this animal.
- The Lemur toolkit for language modeling and information retrieval is documented and made available for download.
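A sketch of how a learned profile could re-rank the two Lemur results above. The scoring formula is an assumption, since the slide's own formula did not survive: it uses a common translation-based form, P(Q/D) = prod over qi of (sum over dw of p(qi/dw) * P(dw/D)). The profile values, the `floor` constant, and the function names are hypothetical.

```python
import math
from collections import Counter


def noisy_channel_score(query, document, profile, floor=1e-6):
    """Return log P(Q/D) where each query word is 'translated' from document words.

    profile: dict mapping (qw, dw) to p(qw/dw), i.e. the user profile triples.
    floor: small probability used when no document word explains a query word
           (an assumed smoothing choice).
    """
    doc_counts = Counter(document)
    doc_len = len(document)
    if doc_len == 0:
        return float("-inf")

    score = 0.0
    for qw in query:
        p = sum(
            profile.get((qw, dw), 0.0) * c / doc_len
            for dw, c in doc_counts.items()
        )
        score += math.log(max(p, floor))
    return score


def rerank(query, documents, profile):
    """Sort documents by descending noisy channel score."""
    return sorted(
        documents,
        key=lambda d: noisy_channel_score(query, d, profile),
        reverse=True,
    )


# Hypothetical profile of a user whose past queries concern information retrieval.
ir_user_profile = {
    ("lemur", "lemur"): 0.2,
    ("lemur", "toolkit"): 0.4,
    ("lemur", "retrieval"): 0.3,
    ("lemur", "animal"): 0.01,
}
animal_doc = ["lemur", "encyclopedia", "brief", "animal"]
toolkit_doc = ["lemur", "toolkit", "information", "retrieval"]
for doc in rerank(["lemur"], [animal_doc, toolkit_doc], ir_user_profile):
    print(doc)
```

With this profile the toolkit document is ranked first, because more of its words are likely sources of the query word "lemur" for this user.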
- Performed evaluation on explicit feedback data collected from 7 users
- Experiments
- Comparison with Contextless Ranking
- Comparison between different training models and
Data and Set up
- Data
- Explicit Feedback data collected from 7 users
- For each query, each user examined the top 10 documents and identified the relevant ones
- Collected the top 10 results for all queries; 3469 documents in total
- Set up
- Built a Lucene index over the 3469 documents
- For re-ranking, first retrieve the results using Lucene, then re-rank them using the noisy channel approach
- We perform 10-fold cross-validation
- Precision@n
- Number of relevant documents among the top n results / n
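The metric above is simple to compute. In this sketch the function name and the document identifiers are illustrative only.

```python
def precision_at_n(ranked_doc_ids, relevant_doc_ids, n):
    """Fraction of the top n ranked documents that are relevant."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    relevant = set(relevant_doc_ids)
    hits = sum(1 for doc_id in ranked_doc_ids[:n] if doc_id in relevant)
    return hits / n


ranking = ["d3", "d1", "d7", "d2", "d9"]
judged_relevant = {"d1", "d2"}
print(precision_at_n(ranking, judged_relevant, 5))  # 2 relevant in top 5 -> 0.4
```

Dividing by n rather than by the number of returned documents follows the definition on the slide, so a ranking shorter than n is penalised.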
Set up
- [Diagram components: Train Data, User Profile Learner, User Profiles, Test Data, Reranked Results]
- I: Document Training and Document Testing
- II: Document Training and Snippet Testing
- III: Snippet Training and Document Testing
- IV: Snippet Training and Snippet Testing
Conclusions and Future Work
- Proposed a statistical MT based approach for building the user model
- Captures richer context and relations between q and w
- In future,
- N-gram based method: trigrams, etc.
- Noisy channel based method: bigrams
