Language Modeling

1
Language Modeling
  • Putting a curve to the bag of words

2
What models have we covered in class so far?
  • Boolean
  • Extended Boolean
  • Vector Space
  • TFIDF
  • Probabilistic Modeling
  • log P(D|R) / P(D|N)

3
Probability Ranking Principle
  • If a reference retrieval system's response to
    each request is a ranking of the documents in the
    collection in order of decreasing probability of
    relevance to the user who submitted the request,
    where the probabilities are estimated as
    accurately as possible on the basis of whatever
    data have been made available to the system for
    this purpose, the overall effectiveness of the
    system to its user will be the best that is
    obtainable on the basis of those data.
  • - Robertson

4
Bag of words? What bag?
  • Documents are a vector of term occurrences
  • Assumption of exchangeability
  • What is this really?
  • A hyperspace where each dimension is represented
    by a term
  • Values are term occurrences

5
Can we model this bag?
  • Binomial Distribution
  • Bernoulli / success-fail trials
  • e.g., flipping a coin: chance of getting a head
  • Multinomial
  • Probability of each of several outcomes occurring
  • e.g., flipping a coin: chance of head, chance of
    tail
  • e.g., die roll: chance of 1, 2, ..., 6
  • e.g., document: chance of a term occurring
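As a toy illustration of the multinomial view of a document (all terms and probabilities here are hypothetical, not from the slides), a "document" can be generated by repeated independent draws from a term distribution:

```python
import random

random.seed(0)  # for reproducibility

# a document modeled as a multinomial over its vocabulary (hypothetical probabilities)
term_probs = {"retrieval": 0.5, "model": 0.3, "query": 0.2}
terms, weights = zip(*term_probs.items())

# "generate" a 10-term document by repeated independent draws
doc = random.choices(terms, weights=weights, k=10)
```

Each draw is one multinomial trial, exactly like one roll of a weighted die.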

6
Review
  • What is the Probability Ranking Principle?
  • What is the bag of words model?
  • What is exchangeability?
  • What is a binomial?
  • What is a multinomial?

7
Some Terminology
  • Term t
  • Vocabulary V = {t1, t2, ..., tn}
  • Document dx = (tdx1, ..., tdxm), each tdxi ∈ V
  • Corpus C = {d1, d2, ..., dk}
  • Query Q = (q1, q2, ..., qi), each qj ∈ V

8
Language Modeling
  • A document is represented by a multinomial
  • Unigram model
  • A piece of text is generated term by term, each
    term independently
  • p(t1 t2 ... tn) = p(t1) p(t2) ... p(tn)
  • p(t1) + p(t2) + ... + p(tn) = 1
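A minimal sketch of the unigram independence assumption, using a hypothetical helper that estimates term probabilities from whitespace-tokenized text:

```python
from collections import Counter

def unigram_model(text):
    # MLE term probabilities from whitespace-tokenized text (toy tokenizer)
    counts = Counter(text.split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def sequence_prob(model, terms):
    # p(t1 t2 ... tn) = p(t1) * p(t2) * ... * p(tn), by term independence
    p = 1.0
    for t in terms:
        p *= model.get(t, 0.0)
    return p

model = unigram_model("the cat sat on the mat")
```

The probabilities over the vocabulary sum to 1, as the slide's constraint requires.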

9
Why Unigram
  • Easy to implement
  • Reasonable performance
  • Word order and structure not captured
  • How much benefit would they add?
  • Open question
  • More parameters to tune in complex models
  • Need more data to train
  • Need more time to compute
  • Need more space to store

10
Enough! How do I retrieve documents?
  • p(Q|d) = p(q1|d) p(q2|d) ... p(qn|d)
  • How do we estimate p(q|d)?
  • Maximum Likelihood Estimate
  • MLE(q|d) = freq(q, d) / Σi freq(i, d)
  • Probability Ranking Principle
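The query-likelihood ranking above can be sketched as follows (toy documents and whitespace tokenization are assumptions for illustration):

```python
from collections import Counter

def query_likelihood(query, doc):
    # p(Q|d) = product over query terms of MLE(q|d) = freq(q, d) / sum_i freq(i, d)
    counts = Counter(doc.split())
    total = sum(counts.values())
    p = 1.0
    for q in query.split():
        p *= counts.get(q, 0) / total
    return p

docs = ["the cat sat", "the dog ran fast"]  # toy corpus
scores = [query_likelihood("cat sat", d) for d in docs]
```

Documents are then ranked by score, per the Probability Ranking Principle; note that the second document scores exactly 0 because "cat" never occurs in it, which is the problem the next slides address.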

11
Review
  • What is the unigram model?
  • Is the language model a binomial or multinomial?
  • Why use the unigram model?
  • Given a query, how do we use a language model to
    retrieve documents?

12
What is wrong with MLE?
  • Creates 0 probabilities for terms that do not
    occur
  • 0 probabilities break similarity scoring function
  • Is a 0 probability sensible?
  • Can a word never ever occur?

13
How can we fix this?
  • How do we get around the zero probabilities?
  • New similarity function?
  • Remove zero probabilities?
  • Build a different model?

14
Smoothing Approaches
  • Laplace / Additive
  • Mixture Models
  • Interpolation
  • Jelinek-Mercer
  • Dirichlet
  • Absolute Discounting
  • Backoff

15
Laplace
  • Just add 1 to every term frequency
  • Where have you seen this before?
  • Is this a good idea?
  • Strengths
  • Weaknesses
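A minimal sketch of add-one (Laplace) smoothing, assuming the vocabulary size is known; the counts and vocabulary size below are hypothetical:

```python
def laplace(term, counts, vocab_size):
    # add 1 to every frequency, so unseen terms get a small nonzero probability
    total = sum(counts.values()) + vocab_size
    return (counts.get(term, 0) + 1) / total

counts = {"cat": 2, "sat": 1}  # toy document frequencies
V = 4                          # assumed vocabulary size
```

One often-cited weakness: it shifts a lot of mass to unseen terms when the vocabulary is large relative to the document.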

16
Interpolation
  • Mixture model approach
  • Combine probability models
  • Traditionally combine document model with the
    corpus model
  • Is this a good idea?
  • What else is the corpus model used for?
  • Strengths
  • Weaknesses
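A sketch of Jelinek-Mercer interpolation, mixing the document MLE with the corpus model; the weight `lam` and the toy counts are assumptions for illustration:

```python
def jelinek_mercer(term, doc_counts, corpus_counts, lam=0.5):
    # p(t|d) = lam * p_mle(t|d) + (1 - lam) * p(t|C)
    d_total = sum(doc_counts.values())
    c_total = sum(corpus_counts.values())
    p_d = doc_counts.get(term, 0) / d_total if d_total else 0.0
    p_c = corpus_counts.get(term, 0) / c_total if c_total else 0.0
    return lam * p_d + (1 - lam) * p_c

doc = {"cat": 2}
corpus = {"cat": 2, "dog": 2}
```

An unseen term like "dog" now gets nonzero probability from the corpus component, so the product over query terms no longer collapses to zero.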

17
Backoff
  • Only add probability mass to terms that are not
    seen
  • What does this do to the probability model?
  • Flatter?
  • Is this a good idea?
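One possible backoff sketch (Katz-style, an assumption rather than the slide's exact scheme): seen terms keep a discounted share of their MLE mass, and the reserved mass `alpha` is shared among unseen terms in proportion to the corpus model.

```python
def backoff(term, doc_counts, corpus_probs, alpha=0.1):
    # seen terms: keep (1 - alpha) of their MLE probability
    d_total = sum(doc_counts.values())
    if term in doc_counts:
        return (1 - alpha) * doc_counts[term] / d_total
    # unseen terms: share the reserved alpha in proportion to the corpus model
    unseen_mass = sum(p for t, p in corpus_probs.items() if t not in doc_counts)
    if unseen_mass == 0:
        return 0.0
    return alpha * corpus_probs.get(term, 0.0) / unseen_mass

doc = {"a": 1, "b": 1}
corpus = {"a": 0.5, "b": 0.3, "c": 0.2}
```

Unlike interpolation, probabilities of seen terms are only discounted, not mixed, and the result still sums to 1 over the vocabulary.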

18
Are there other sources of probability mass?
  • Document Clusters
  • Document Classes
  • User Profiles
  • Topic models

19
Review
  • What is wrong with 0 probabilities?
  • How does smoothing fix it?
  • What is smoothing really doing?
  • What is Interpolation?
  • What is that mixture model really representing?
  • What can we use to mix with the document model?

20
Bored yet? Let's do something complicated
  • Entropy - Information Theory
  • H(X) = -Σx p(x) log p(x)
  • Good for data compression
  • Relative Entropy
  • D(p||q) = Σx p(x) log (p(x) / q(x))
  • Not a true distance measure
  • Used to find differences between probability
    models
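The two formulas above, sketched directly in code (base-2 logarithms assumed, per information-theory convention; the distributions are hypothetical):

```python
import math

def entropy(p):
    # H(X) = -sum_x p(x) * log2 p(x), in bits
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def relative_entropy(p, q):
    # D(p || q) = sum_x p(x) * log2(p(x) / q(x))
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

fair = {"h": 0.5, "t": 0.5}
biased = {"h": 0.8, "t": 0.2}
```

Note that D(p||q) and D(q||p) generally differ, which is why the slide calls it "not a true distance measure."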

21
OK, that's nice
  • What does relative entropy give us?
  • Why not just subtract probabilities?
  • On your calculators, calculate
  • p(x) log (p(x) / q(x)) for
  • p(x) = 0.8, q(x) = 0.6
  • p(x) = 0.6, q(x) = 0.4
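For checking against your calculator, the same single-term contribution in code (base-2 logs are an assumption; the slide does not fix a base):

```python
import math

def kl_term(p, q):
    # single-term contribution p(x) * log2(p(x) / q(x)) to relative entropy
    return p * math.log2(p / q)

a = kl_term(0.8, 0.6)
b = kl_term(0.6, 0.4)
```

Both pairs differ by the same 0.2 in absolute terms, yet the contributions differ: the log of the ratio, not the raw gap, is what drives relative entropy. That is the answer to "why not just subtract probabilities?"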

22
Clarity Score
  • Calculate the relative entropy between the result
    set and the corpus
  • Positive correlation between high clarity score /
    relative entropy and query performance
  • So what is that actually saying?

23
Relative Entropy Query Expansion
  • Relevance Feedback
  • Blind Relevance Feedback
  • Expand query with terms that contribute the most
    to relative entropy
  • What are we doing to the query when we do this?
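A hedged sketch of the expansion step: rank candidate terms by their per-term contribution to the relative entropy between a (blind) feedback model and the corpus model. All distributions below are hypothetical:

```python
import math

def top_expansion_terms(feedback_probs, corpus_probs, k=2):
    # score each term by its contribution p(t|F) * log2(p(t|F) / p(t|C))
    contrib = {
        t: p * math.log2(p / corpus_probs[t])
        for t, p in feedback_probs.items()
        if p > 0 and corpus_probs.get(t, 0) > 0
    }
    return sorted(contrib, key=contrib.get, reverse=True)[:k]

# toy feedback-document model vs. corpus model
feedback = {"neural": 0.4, "the": 0.3, "network": 0.3}
corpus = {"neural": 0.01, "the": 0.5, "network": 0.02}
```

Terms that are common in the feedback documents but rare in the corpus score highest, while common stopword-like terms score low or negative, so expansion sharpens the query toward the topic of the top-ranked documents.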

24
Controlled Query Generation
  • Some of my research
  • p(x) log (p(x)/q(x)) is a good term
    discrimination function
  • Regulate the construction of queries for
    evaluating retrieval algorithms
  • First real controlled reaction experiments with
    retrieval algorithms

25
Review
  • Who is the father of Information Theory?
  • What is Entropy?
  • What is Relative Entropy?
  • What is the Clarity Score?
  • What are the terms that contribute the most to
    relative entropy?
  • Are they useful?

26
You have been a good class
  • Introduced to the language model for information
    retrieval
  • Documents represented as multinomial
    distributions
  • Generative model
  • Queries are generated
  • Smoothing
  • Applications in IR

27
Questions for me?
28
Questions for you
  • What is the Maximum Likelihood Estimate?
  • Why is smoothing important?
  • What is interpolation?
  • What is entropy?
  • What is relative entropy?
  • Does language modeling make sense?