EECS 395/495 Lecture 6: Document Ranking and Link Analysis

1
EECS 395/495 Lecture 6: Document Ranking and Link
Analysis
  • Doug Downey

2
How Does Web Search Work?
  • Inverted Indices
  • Web Crawlers
  • Ranking Algorithms
  • Basics of Machine Learning and Text
    Classification
  • Document Ranking
  • Link Analysis

3
Document Ranking
  • Static Features
  • Which Web pages are good?
  • Dynamic Features
  • Which pages are most relevant to the query?

4
Static Ranking (1)
  • Known Reputable Sites
  • Wikipedia, .edu domains, etc.
  • Spam detection
  • Link Analysis
  • More later
  • (Real) Popularity
  • Page features
  • Length of page, length of URL
  • Anchor Text features
  • How much text is in links to the page, how varied it is, etc.

5
Static Ranking (2)
  • "Beyond PageRank: Machine Learning for Static
    Ranking", Richardson et al., 2005
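
A hedged illustration only: the feature names, the model choice, and the toy data below are placeholder assumptions, not the setup of Richardson et al.; the point is just that a static score can be learned from per-page features.

    # Illustrative sketch: learn a static page score from page features.
    # Feature names, model choice, and toy data are assumptions for
    # illustration, not the method of Richardson et al.
    from sklearn.ensemble import GradientBoostingRegressor

    # Hypothetical static features per page:
    # [page_length, url_length, anchor_text_words, distinct_linking_sites]
    X = [
        [5200, 28, 340, 45],
        [800, 95, 12, 2],
        [15000, 40, 980, 120],
    ]
    y = [0.8, 0.1, 0.9]  # hypothetical human quality ratings

    model = GradientBoostingRegressor().fit(X, y)
    print(model.predict([[4000, 30, 200, 30]]))  # predicted static score for a new page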

6
Dynamic Ranking
  • Vector-space model
  • Term frequency (tf) vectors
  • The bag of words representation
  • Idea: score(q, d) = q · d (i.e., the dot product of
    the frequency vectors)
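
A minimal sketch of this scoring rule in Python (the toy documents and whitespace tokenization are illustrative assumptions):

    # Naive vector-space scoring: score(q, d) = q . d over raw
    # term-frequency ("bag of words") vectors.
    from collections import Counter

    def tf_vector(text):
        """Bag-of-words term-frequency vector as a {term: count} mapping."""
        return Counter(text.lower().split())

    def score(query, doc):
        """Dot product of the query and document frequency vectors."""
        q, d = tf_vector(query), tf_vector(doc)
        return sum(count * d[term] for term, count in q.items())

    docs = ["doctors and physicians at the clinic",
            "notes on user interface design"]
    print([score("doctors", d) for d in docs])  # [1, 0]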

7
Normalization
  • Problem with naïve vector space model
  • Unimportant terms dominate
  • Solution: Normalize frequencies by
  • Overall frequency in the document collection?
  • Number of distinct docs mentioning the term?
  • Christopher D. Manning, Prabhakar Raghavan, and
    Hinrich Schütze, Introduction to Information
    Retrieval, Cambridge University Press, 2008.

8
TF-IDF Weighting
  • tf = term frequency
  • idf = inverse document frequency = log(N / df),
    where N = total number of documents and
    df = number of documents containing the term
  • Score by tf · idf (also denoted tf-idf; see the
    sketch below)
  • Christopher D. Manning, Prabhakar Raghavan, and
    Hinrich Schütze, Introduction to Information
    Retrieval, Cambridge University Press, 2008.
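
A minimal sketch of tf-idf scoring (the toy corpus and whitespace tokenization are illustrative assumptions):

    # tf-idf scoring as on the slide: idf(t) = log(N / df(t)),
    # score(q, d) = sum over query terms of tf(t, d) * idf(t).
    import math
    from collections import Counter

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "doctors and physicians at the clinic"]
    tf = [Counter(d.split()) for d in docs]            # term frequencies per doc
    N = len(docs)
    df = Counter(t for counts in tf for t in counts)   # docs containing each term
    idf = {t: math.log(N / df[t]) for t in df}

    def score(query, doc_id):
        return sum(tf[doc_id][t] * idf.get(t, 0.0) for t in query.split())

    # "the" appears in every document, so idf("the") = log(3/3) = 0 and it
    # no longer dominates the score.
    print(score("the cat", 0), score("the cat", 1))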

9
On the Web: Other Weighting Factors
  • Title
  • Anchor Text
  • Font/Color
  • Position on page
  • Implicit Feedback (more on this later from
    Justin, Arun, and Brent)

10
Putting it all together
  • Google's 400 lines of code (?)
  • Microsoft Live Search
  • Static and dynamic features computed for several
    query/document sets
  • Neural network trained on human-annotated result
    ratings (more on this later from Gang, Samir, and
    Hongyu)

11
Problem: synonyms
  • Query for "doctors"
  • Also want documents containing "physicians"
  • Query for "user interface design"
  • Also want documents containing "human-computer
    interaction"

12
Latent Semantic Analysis (LSA)
  • Idea:
  • Represent documents/words in terms of semantics
  • E.g.
  • If two words tend to occur in similar documents,
    the words are similar
  • If two documents tend to include similar words,
    the documents are similar

13
LSA: Linear Algebra Formulation (1)
  • X = term-frequency matrix (t × d)

14
LSA: Linear Algebra Formulation (2)
  • Write X (t × d) as X = W S Pᵀ
  • Where:
  • W (t × r) and P (d × r) have orthonormal columns
  • S (r × r) is a diagonal matrix, entries sorted in
    decreasing order
  • ...where r = min(t, d)

15
LSA: Linear Algebra Formulation (3)
  • X = W S Pᵀ
  • Xk = Wk Sk Pkᵀ (all but the first k entries of S
    set to 0)
  • Key: Xk is the best rank-k approx. to X (in mean
    squared error)

(Figure: the SVD X = W S Pᵀ and its truncated form Xk = Wk Sk Pkᵀ)
16
Rank-k approximation
  • Xk = Wk Sk Pkᵀ
  • Put another way:
  • Represent each term as a k-vector of numbers
  • Represent each document as another k-vector
  • Vectors represent semantics in the sense that:
  • Entry in Xk is the dot product of the term and doc
    vectors (with dims weighted by Sk)
  • No other length-k vectorization approximates X
    better (see the NumPy sketch below)
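
A minimal NumPy sketch of the rank-k approximation (the random matrix stands in for a real term-document matrix):

    # Rank-k approximation Xk = Wk Sk Pk^T via the SVD, as on the slide.
    import numpy as np

    t, d, k = 6, 5, 2
    rng = np.random.default_rng(0)
    X = rng.random((t, d))          # stand-in for a real t x d term-doc matrix

    W, s, Pt = np.linalg.svd(X, full_matrices=False)    # X = W diag(s) Pt
    Wk, Sk, Pkt = W[:, :k], np.diag(s[:k]), Pt[:k, :]   # keep top-k singular values
    Xk = Wk @ Sk @ Pkt                                  # best rank-k approx. (MSE)

    # Term i is the k-vector Wk[i]; document j is the k-vector Pkt[:, j].
    # The (i, j) entry of Xk is their dot product, weighted by the entries of Sk.
    i, j = 0, 0
    assert np.isclose(Xk[i, j], Wk[i] @ Sk @ Pkt[:, j])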

17
Example
  • From

18
(No Transcript)
19
(No Transcript)
20
Xk
21
To Review
  • Represent each term as a k-vector of numbers
  • human = (0.22, -0.11)
  • Represent each document as another k-vector
  • d1 = (0.2, 0.61)
  • Entry in Xk for "human" appearing in d1 is the dot
    product of the vectors, weighted by the entries of Sk:
    Sk(1,1) · 0.22 · 0.2 + Sk(2,2) · (-0.11) · 0.61 ≈ -0.02
  • So using k = 1, what does the word vector
    signify? The document vector?

22
LSA vs. Stemming
  • Stemming often conflates meanings, e.g. "flower"
    becomes "flow"; compare with LSA:
  • flower-flow have cos = -.01,
  • dish-dishes cos = .68
  • More examples of cos in latent space capturing
    word meaning:
  • Flower: petals .93, gymnosperms .47
  • Flow: flows .84, opens .46
  • Dish: sauce .70, bowl .63
  • Dishes: kitchen .75, cup .57
  • Thomas K. Landauer and Susan Dumais (2008),
    Scholarpedia, 3(11):4356.
    http://www.scholarpedia.org/article/Latent_semantic_analysis

23
LSA for IR
  • Score = cos(query k-vector, document k-vector)
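
A sketch of this scoring step, reusing the Wk, Sk, Pkt notation from the SVD sketch above; folding the query into the latent space via q · Wk · Sk⁻¹ is a common convention and an assumption here, not something stated on the slide:

    # LSA retrieval sketch: score = cos(query k-vector, document k-vector).
    # Assumes Wk (t x k), Sk (k x k), Pkt (k x d) from a truncated SVD as above.
    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def lsa_scores(q_tf, Wk, Sk, Pkt):
        """q_tf: length-t term-frequency vector for the query."""
        q_k = q_tf @ Wk @ np.linalg.inv(Sk)   # fold query into the k-dim space
        return [cosine(q_k, Pkt[:, j]) for j in range(Pkt.shape[1])]

Ranking is then just sorting documents by these cosine scores.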

24
LSA Limitations
  • Ignores polysemy
  • Chicago represented by only one fixed vector
  • Ignores word order
  • Starts from term vectors
  • Scalability
  • Advances have made training on far larger
    corpora possible; for example, a 500-million-word
    corpus in 2007 took less than a day on a
    modest-sized cluster (Landauer & Dumais, 2008)
  • Increasingly fancy probabilistic models have been
    posited to solve the first two problems (and
    potentially exacerbate the third)

25
How Does Web Search Work?
  • Inverted Indices
  • Web Crawlers
  • Ranking Algorithms
  • Basics of Machine Learning and Text
    Classification
  • Document Ranking
  • Link Analysis