EECS 395/495 Lecture 6: Document Ranking and Link Analysis

1
EECS 395/495 Lecture 6: Document Ranking and Link
Analysis
  • Doug Downey

2
How Does Web Search Work?
  • Inverted Indices
  • Web Crawlers
  • Ranking Algorithms
  • Basics of Machine Learning and Text
    Classification
  • Document Ranking
  • Link Analysis

3
Document Ranking
  • Static Features
  • Which Web pages are good?
  • Dynamic Features
  • Which pages are most relevant to the query?

4
Static Ranking (1)
  • Known Reputable Sites
  • Wikipedia, .edu domains, etc.
  • Spam detection
  • Link Analysis
  • More later
  • (Real) Popularity
  • Page features
  • Length of page, length of URL
  • Anchor Text features
  • How much text is in links to the page, how varied it is, etc.

5
Static Ranking (2)
  • "Beyond PageRank: Machine Learning for Static
    Ranking", Richardson et al., 2005
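
A hedged illustration only: the feature names, the model choice, and the toy data below are placeholder assumptions, not the setup of Richardson et al.; the point is just that a static score can be learned from per-page features.

    # Illustrative sketch: learn a static page score from page features.
    # Feature names, model choice, and toy data are assumptions for
    # illustration, not the method of Richardson et al.
    from sklearn.ensemble import GradientBoostingRegressor

    # Hypothetical static features per page:
    # [page_length, url_length, anchor_text_words, distinct_linking_sites]
    X = [
        [5200, 28, 340, 45],
        [800, 95, 12, 2],
        [15000, 40, 980, 120],
    ]
    y = [0.8, 0.1, 0.9]  # hypothetical human quality ratings

    model = GradientBoostingRegressor().fit(X, y)
    print(model.predict([[4000, 30, 200, 30]]))  # predicted static score for a new page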

6
Dynamic Ranking
  • Vector-space model
  • Term frequency (tf) vectors
  • The bag of words representation
  • Idea: score(q, d) = q · d (i.e., the dot product of
    the frequency vectors)
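
A minimal sketch of this scoring rule in Python (the toy documents and whitespace tokenization are illustrative assumptions):

    # Naive vector-space scoring: score(q, d) = q . d over raw
    # term-frequency ("bag of words") vectors.
    from collections import Counter

    def tf_vector(text):
        """Bag-of-words term-frequency vector as a {term: count} mapping."""
        return Counter(text.lower().split())

    def score(query, doc):
        """Dot product of the query and document frequency vectors."""
        q, d = tf_vector(query), tf_vector(doc)
        return sum(count * d[term] for term, count in q.items())

    docs = ["doctors and physicians at the clinic",
            "notes on user interface design"]
    print([score("doctors", d) for d in docs])  # [1, 0]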

7
Normalization
  • Problem with naïve vector space model
  • Unimportant terms dominate
  • Solution: Normalize frequencies by
  • Overall frequency in the document collection?
  • Number of distinct docs mentioning the term?
  • Christopher D. Manning, Prabhakar Raghavan, and
    Hinrich Schütze, Introduction to Information
    Retrieval, Cambridge University Press, 2008.

8
TF-IDF Weighting
  • tf = term frequency
  • idf = inverse document frequency = log(N / df),
    where N = total number of documents and
    df = number of documents containing the term
  • Score by tf · idf (also denoted tf-idf; see the
    sketch below)
  • Christopher D. Manning, Prabhakar Raghavan, and
    Hinrich Schütze, Introduction to Information
    Retrieval, Cambridge University Press, 2008.
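
A minimal sketch of tf-idf scoring (the toy corpus and whitespace tokenization are illustrative assumptions):

    # tf-idf scoring as on the slide: idf(t) = log(N / df(t)),
    # score(q, d) = sum over query terms of tf(t, d) * idf(t).
    import math
    from collections import Counter

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "doctors and physicians at the clinic"]
    tf = [Counter(d.split()) for d in docs]            # term frequencies per doc
    N = len(docs)
    df = Counter(t for counts in tf for t in counts)   # docs containing each term
    idf = {t: math.log(N / df[t]) for t in df}

    def score(query, doc_id):
        return sum(tf[doc_id][t] * idf.get(t, 0.0) for t in query.split())

    # "the" appears in every document, so idf("the") = log(3/3) = 0 and it
    # no longer dominates the score.
    print(score("the cat", 0), score("the cat", 1))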

9
On the Web: Other Weighting Factors
  • Title
  • Anchor Text
  • Font/Color
  • Position on page
  • Implicit Feedback (more on this later from
    Justin, Arun, and Brent)

10
Putting it all together
  • Google's 400 lines of code (?)
  • Microsoft Live Search
  • Static and dynamic features computed for several
    query/document sets
  • Neural network trained on human-annotated result
    ratings (more on this later from Gang, Samir, and
    Hongyu)

11
Problem: synonyms
  • Query for "doctors"
  • Also want documents containing "physicians"
  • Query for "user interface design"
  • Also want documents containing "human-computer
    interaction"

12
Latent Semantic Analysis (LSA)
  • Idea:
  • Represent documents/words in terms of semantics
  • E.g.
  • If two words tend to occur in similar documents,
    the words are similar
  • If two documents tend to include similar words,
    the documents are similar

13
LSA: Linear Algebra Formulation (1)
  • X = term-frequency matrix (t × d)

14
LSA: Linear Algebra Formulation (2)
  • Write X (t × d) as X = W S Pᵀ
  • Where:
  • W (t × r) and P (d × r) have orthonormal columns
  • S (r × r) is a diagonal matrix, entries sorted in
    decreasing order
  • ...where r = min(t, d)

15
LSA: Linear Algebra Formulation (3)
  • X = W S Pᵀ
  • Xk = Wk Sk Pkᵀ (all but the first k entries of S
    set to 0)
  • Key: Xk is the best rank-k approx. to X (in mean
    squared error)

(Figure: the SVD X = W S Pᵀ and its truncated form Xk = Wk Sk Pkᵀ)
16
Rank-k approximation
  • Xk = Wk Sk Pkᵀ
  • Put another way:
  • Represent each term as a k-vector of numbers
  • Represent each document as another k-vector
  • Vectors represent semantics in the sense that:
  • Entry in Xk is the dot product of the term and doc
    vectors (with dims weighted by Sk)
  • No other length-k vectorization approximates X
    better (see the NumPy sketch below)
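
A minimal NumPy sketch of the rank-k approximation (the random matrix stands in for a real term-document matrix):

    # Rank-k approximation Xk = Wk Sk Pk^T via the SVD, as on the slide.
    import numpy as np

    t, d, k = 6, 5, 2
    rng = np.random.default_rng(0)
    X = rng.random((t, d))          # stand-in for a real t x d term-doc matrix

    W, s, Pt = np.linalg.svd(X, full_matrices=False)    # X = W diag(s) Pt
    Wk, Sk, Pkt = W[:, :k], np.diag(s[:k]), Pt[:k, :]   # keep top-k singular values
    Xk = Wk @ Sk @ Pkt                                  # best rank-k approx. (MSE)

    # Term i is the k-vector Wk[i]; document j is the k-vector Pkt[:, j].
    # The (i, j) entry of Xk is their dot product, weighted by the entries of Sk.
    i, j = 0, 0
    assert np.isclose(Xk[i, j], Wk[i] @ Sk @ Pkt[:, j])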

17
Example
  • From

18
(No Transcript)
19
(No Transcript)
20
Xk
21
To Review
  • Represent each term as a k-vector of numbers
  • human = (0.22, -0.11)
  • Represent each document as another k-vector
  • d1 = (0.2, 0.61)
  • Entry in Xk for "human" appearing in d1 is the dot
    product of the vectors, weighted by the entries of Sk:
    Sk(1,1) · 0.22 · 0.2 + Sk(2,2) · (-0.11) · 0.61 ≈ -0.02
  • So using k = 1, what does the word vector
    signify? The document vector?

22
LSA vs. Stemming
  • Stemming often conflates meanings, e.g. "flower"
    becomes "flow"; compare with LSA:
  • flower-flow have cos = -.01,
  • dish-dishes cos = .68
  • More examples of cos in latent space capturing
    word meaning:
  • Flower: petals .93, gymnosperms .47
  • Flow: flows .84, opens .46
  • Dish: sauce .70, bowl .63
  • Dishes: kitchen .75, cup .57
  • Thomas K. Landauer and Susan Dumais (2008),
    Scholarpedia, 3(11):4356.
    http://www.scholarpedia.org/article/Latent_semantic_analysis

23
LSA for IR
  • Score = cos(query k-vector, document k-vector)
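
A sketch of this scoring step, reusing the Wk, Sk, Pkt notation from the SVD sketch above; folding the query into the latent space via q · Wk · Sk⁻¹ is a common convention and an assumption here, not something stated on the slide:

    # LSA retrieval sketch: score = cos(query k-vector, document k-vector).
    # Assumes Wk (t x k), Sk (k x k), Pkt (k x d) from a truncated SVD as above.
    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def lsa_scores(q_tf, Wk, Sk, Pkt):
        """q_tf: length-t term-frequency vector for the query."""
        q_k = q_tf @ Wk @ np.linalg.inv(Sk)   # fold query into the k-dim space
        return [cosine(q_k, Pkt[:, j]) for j in range(Pkt.shape[1])]

Ranking is then just sorting documents by these cosine scores.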

24
LSA Limitations
  • Ignores polysemy
  • Chicago represented by only one fixed vector
  • Ignores word order
  • Starts from term vectors
  • Scalability
  • Advances have made training on far larger
    corpora possible; for example, a 500-million-word
    corpus in 2007 took less than a day on a
    modest-sized cluster (Landauer & Dumais, 2008)
  • Increasingly fancy probabilistic models have been
    posited to solve the first two problems (and
    potentially exacerbate the third)

25
How Does Web Search Work?
  • Inverted Indices
  • Web Crawlers
  • Ranking Algorithms
  • Basics of Machine Learning and Text
    Classification
  • Document Ranking
  • Link Analysis