CS511 Design of Database Management Systems

1
CS511: Design of Database Management Systems
  • Lecture 18
  • New Frontier: Web Information Access
  • Kevin C. Chang

2
How Is the Web Different from a Text DB?
3
How Is the Web Different from a Text DB?
  • Hypertext vs. text: a lot of additional clues
  • graph vs. set
  • anchor text vs. text: what do others say about you?
  • famous story: search for "more evil than Satan himself"
  • Web scale is super-huge
  • scalability is the key
  • Precision is valued more than recall
  • quality is more important than quantity, especially for broad queries
  • Geographically distributed vs. centralized
  • so you need to build a crawler
  • Spamming
  • Hoaxes and more (search Google with ranking)

4
IR Search: More of an Art than Science
  • Most web queries are a single keyword
  • What do you want by:
  • "Java": PL? Coffee? Island?
  • "weather"?
  • Perhaps you do not even know what you want!
  • What does this paper try to solve?
  • Are the authority pages always the ones wanted?

5
Link Analysis: Beyond Content-based IR
  • Traditional IR: content-based retrieval
  • documents match queries by contents
  • contents: what documents say about themselves
  • Web IR
  • documents are interconnected in a hyper-graph
  • contents are just one thing to distinguish pages
  • also, what others say about a page (spam-proof?)
  • link analysis: this paper
  • click analysis?

6
Hubs and Authorities
  • An intuitive/informal definition
  • authorities: highly-regarded, authoritative pages
  • hubs: pages that refer you to authorities
  • A recursive/formal definition
  • mutually reinforcing relationships
  • hub: a page that links to many authorities
  • authority: a page that is linked to by many hubs

7
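Web Adjacency Matrix
  • Web: $G = (V, E)$
  • $V = \{x, y, z\}$, $|V| = n$
  • $E = \{(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)\}$
  • $A$: $n \times n$ matrix, $A_{ij} = 1$ if page i links to page j, 0 if not

\[
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\]
(rows: source node; columns: target node; both ordered x, y, z)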
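The matrix above can be rebuilt from the edge list. This is a minimal sketch in plain Python; the names `pages`, `edges`, and `index` are illustrative choices, not from the slides.

```python
# Pages and links from the slide's example graph.
pages = ["x", "y", "z"]
edges = [("x", "x"), ("x", "y"), ("x", "z"),
         ("y", "z"),
         ("z", "x"), ("z", "y")]

# Map each page name to its row/column position.
index = {p: i for i, p in enumerate(pages)}
n = len(pages)

# A[i][j] = 1 if page i links to page j, else 0.
A = [[0] * n for _ in range(n)]
for src, dst in edges:
    A[index[src]][index[dst]] = 1

for page, row in zip(pages, A):
    print(page, row)
# x [1, 1, 1]
# y [0, 0, 1]
# z [1, 1, 0]
```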
8
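Transposed Adjacency Matrix
  • Adjacency matrix $A$
  • what does row j represent?
  • Transpose $A^T$
  • what does row j represent?

\[
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix},
\qquad
A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\]
(rows and columns ordered x, y, z)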
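One way to explore the two "row j" questions is to read the rows back as page lists. The sketch below assumes the row/column order x, y, z; the helpers `targets` and `sources` are hypothetical names introduced here.

```python
pages = ["x", "y", "z"]
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

# Transpose: At[j][i] = A[i][j].
At = [list(col) for col in zip(*A)]

def targets(j):
    """Row j of A: the pages that page j links to."""
    return [pages[k] for k, bit in enumerate(A[j]) if bit]

def sources(j):
    """Row j of At: the pages that link to page j."""
    return [pages[k] for k, bit in enumerate(At[j]) if bit]

print(targets(1))  # ['z']        (y links only to z)
print(sources(2))  # ['x', 'y']   (z is linked from x and y)
```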
9
Hubbiness and Authority
  • Hubbiness: a vector $h$
  • $h_i$ is a value representing the hubbiness of page i
  • Authority: a vector $a$
  • $a_i$ is a value representing the authority of page i
  • Mutually recursive definition in terms of $h$ and $a$
  • $h_x = ?$
  • $a_x = ?$

[Figure: example graph on pages x, y, z]
10
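Hubbiness
  • Hubbiness
  • $h_x = a_x + a_y + a_z$
  • $h_y = a_z$
  • $h_z = a_x + a_y$
  • $h = \alpha A a$
  • $A$: links-to nodes
  • $a$: their authority weights
  • $\alpha$: scaling factor (why?)

\[
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\]
(rows and columns ordered x, y, z)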
11
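Authority
  • Authority
  • $a_x = h_x + h_z$
  • $a_y = h_x + h_z$
  • $a_z = h_x + h_y$
  • $a = \beta A^T h$
  • $A^T$: linked-from nodes
  • $h$: their hub weights
  • $\beta$: scaling factor

\[
A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\]
(rows and columns ordered x, y, z)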
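As a rough illustration of the two update rules, the sketch below applies one hub update and then one authority update to the example graph. It assumes both scaling factors are 1 and starts from authority weights of all ones; `matvec` and `transpose` are helper names introduced here.

```python
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

def matvec(M, v):
    """Multiply matrix M by column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

a = [1, 1, 1]                  # initial authority weights (assumed)
h = matvec(A, a)               # hub update: h = A a
a = matvec(transpose(A), h)    # authority update: a = A^T h

print(h)  # [3, 1, 2]
print(a)  # [5, 5, 4]
```

Each hub score is the sum of the authority scores of the pages it links to, and each authority score is the sum of the hub scores of the pages linking to it.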
12
Finding Hubbiness and Authority
  • Recursive definition
  • $a = \beta A^T h$, $h = \alpha A a$
  • Authority: $a = \alpha\beta (A^T A) a$
  • $a$ is an eigenvector of $A^T A$
  • Hubbiness: $h = \alpha\beta (A A^T) h$
  • $h$ is an eigenvector of $A A^T$
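
The eigenvector statements follow by substituting each recursive definition into the other. A short derivation, writing the two scaling factors as $\alpha$ and $\beta$:

```latex
\begin{align*}
a &= \beta A^{T} h = \beta A^{T} (\alpha A a) = \alpha\beta \, (A^{T} A) \, a, \\
h &= \alpha A a = \alpha A (\beta A^{T} h) = \alpha\beta \, (A A^{T}) \, h.
\end{align*}
```

So $a$ is an eigenvector of $A^T A$ and $h$ is an eigenvector of $A A^T$, both with eigenvalue $1/(\alpha\beta)$.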

13
Computing Hubbiness and Authority
  • Computation by relaxation
  • start from some initial values of $a$ and $h$
  • $z = (1, 1, \ldots, 1)$
  • $a_0 = z$, $h_0 = z$
  • repeat until fixpoint: apply the equations
  • $a_i = \alpha\beta (A^T A) a_{i-1}$
  • $h_i = \alpha\beta (A A^T) h_{i-1}$
  • fixpoint: $a_i = a_{i-1}$, $h_i = h_{i-1}$
  • Convergence
  • for $a$: $A^T A$ is symmetric (and $z$ is right)?
  • relaxation will converge to the principal eigenvector of $A^T A$
  • for $h$: similarly, the principal eigenvector of $A A^T$
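
The relaxation can be sketched as a power iteration. The version below is an assumption-laden illustration, not the lecture's implementation: it rescales to unit length after every step (one possible choice of scaling factor), stops when successive vectors differ by less than a tolerance, and uses the helper names `power_iterate`, `normalize`, and `matmul` introduced here.

```python
import math

A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    Yt = transpose(Y)
    return [[sum(p * q for p, q in zip(row, col)) for col in Yt] for row in X]

def normalize(v):
    """Rescale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def power_iterate(M, tol=1e-12, max_iter=1000):
    """Repeatedly apply M and rescale until the vector stops changing."""
    v = normalize([1.0] * len(M))          # start from z = (1, ..., 1)
    for _ in range(max_iter):
        nxt = normalize(matvec(M, v))
        if max(abs(p - q) for p, q in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v

At = transpose(A)
authority = power_iterate(matmul(At, A))   # principal eigenvector of A^T A
hub = power_iterate(matmul(A, At))         # principal eigenvector of A A^T

print([round(x, 2) for x in authority])    # [0.63, 0.63, 0.46]
```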

14
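Computing Hubbiness and Authority
  • Assume $\alpha = 1$, $\beta = 1$, initial $h = a = (1, 1, 1)$
  • note: $A^T A$ and $A A^T$ are both symmetric matrices
  • Will converge, e.g., with some scaling
  • $a \to (1.36, 1.36, 1)$ (or $(0.63, 0.63, 0.46)$ as a unit vector)

\[
h = (A A^T)\, h = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix} h,
\qquad
a = (A^T A)\, a = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix} a
\]

iteration    1    2    3    4
a_x          1    5   24  114
a_y          1    5   24  114
a_z          1    4   18   84
h_x          1    6   28  132
h_y          1    2    8   36
h_z          1    4   20   96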
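
The iteration values can be checked with a few lines of integer arithmetic. This sketch assumes both scaling factors are 1 and the start vector (1, 1, 1), as stated on the slide; `iterate` is a helper name introduced here.

```python
AtA = [[2, 2, 1],
       [2, 2, 1],
       [1, 1, 2]]
AAt = [[3, 1, 2],
       [1, 1, 0],
       [2, 0, 2]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def iterate(M, steps):
    """Return the start vector followed by `steps` unscaled updates."""
    v = [1] * len(M)
    history = [v]
    for _ in range(steps):
        v = matvec(M, v)
        history.append(v)
    return history

a_hist = iterate(AtA, 3)
h_hist = iterate(AAt, 3)

print(a_hist)  # [[1, 1, 1], [5, 5, 4], [24, 24, 18], [114, 114, 84]]
print(h_hist)  # [[1, 1, 1], [6, 2, 4], [28, 8, 20], [132, 36, 96]]
print(round(a_hist[-1][0] / a_hist[-1][2], 2))  # 1.36
```

Without scaling the entries grow at every step, but their ratios settle; this is one way to see why a scaling factor is needed.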
15
Why Did This Work Fail?
  • Theory declared a success (?), but the system a failure
  • As a refining service: extra time to process
  • As an add-on to existing search engines
  • While they were still arguing over who invented link analysis (perhaps not a CS credit), Google had surely built a system!

16
Google PageRank
  • Reference: http://www7.scu.edu.au/
  • S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, J. M. Kleinberg: Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. WWW7 / Computer Networks 30(1-7): 65-74 (1998)
  • S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
  • Google.com
  • google = googol = $10^{100}$
  • in the Stanford Digital Libraries project, 1996-98
  • around the same time as Kleinberg's paper
  • first demo: reorder "weather" from AltaVista.com
  • tried to sell to Infoseek in 1997
  • founded in 1998 by Brin and Page

17
PageRank: Importance of Pages
  • PageRank (or importance): again, recursively
  • a page P is important if important pages link to it
  • importance of P
  • proportionally contributed by the back-linked pages
  • Example
  • $r_x = \frac{1}{2} r_x + \frac{1}{2} r_z$
  • $r_y = \frac{1}{2} r_z$
  • $r_z = \frac{1}{2} r_x + 1 \cdot r_y$
  • Random-surfer interpretation
  • surfer randomly follows links to navigate
  • PageRank: the probability that the surfer will visit the page

[Figure: example graph on pages x, y, z]
18
Computing PageRank
  • Importance-propagation equation
  • Computation by relaxation
  • linked-from ($A^T$) or links-to matrix ($A$)?
  • column-normalized
  • column x is all that x points to
  • sum of each column = 1, why?

\[
r = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix} r
\]

iteration    1     2     3    ...  fixpoint
r_x          1     1    5/4   ...    6/5
r_y          1    1/2   3/4   ...    3/5
r_z          1    3/2    1    ...    6/5
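The relaxation for this example can be reproduced with exact fractions. In this sketch the link structure (x links to x and z, y links to z, z links to x and y) is inferred from the example equations, and `column_normalized` is a helper name introduced here.

```python
from fractions import Fraction

pages = ["x", "y", "z"]
# Out-links inferred from the example equations (an assumption).
links = {"x": ["x", "z"], "y": ["z"], "z": ["x", "y"]}

def column_normalized(pages, links):
    """M[j][i] = 1/outdegree(i) if page i links to page j, else 0."""
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, outs in links.items():
        for dst in outs:
            M[idx[dst]][idx[src]] = Fraction(1, len(outs))
    return M

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

M = column_normalized(pages, links)
r = [Fraction(1)] * 3
for _ in range(2):                 # two relaxation steps
    r = matvec(M, r)
print([str(x) for x in r])         # ['5/4', '3/4', '1']

# The fixpoint from the table is unchanged by one more step.
fix = [Fraction(6, 5), Fraction(3, 5), Fraction(6, 5)]
print(matvec(M, fix) == fix)       # True
```

Because every column sums to 1, each step only redistributes importance, so the total (3 here) is preserved.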
19
Problems: Dead Ends
  • Dead ends
  • a page without successors has nowhere to send its importance
  • eventually, what would happen to $r$?
  • Example
  • $r_a = 0 \cdot r_a + 0 \cdot r_b$
  • $r_b = 1 \cdot r_a + 0 \cdot r_b$

[Figure: example graph on pages x, y, z, a, b]
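A quick way to explore the dead-end question is to iterate the two example equations directly. The sketch assumes a start value of 1 for both pages, which the slide does not specify.

```python
from fractions import Fraction

# Matrix form of the example: a links only to b,
# and b (the dead end) links to nothing.
M = [[Fraction(0), Fraction(0)],
     [Fraction(1), Fraction(0)]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

r = [Fraction(1), Fraction(1)]     # (r_a, r_b), assumed start
history = [r]
for _ in range(3):
    r = matvec(M, r)
    history.append(r)

for step, (ra, rb) in enumerate(history):
    print(step, ra, rb)
# 0 1 1
# 1 0 1
# 2 0 0
# 3 0 0
```

The column for b sums to 0 rather than 1, so importance that reaches b is lost at the next step.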
20
Problems: Spider Traps
  • Spider traps
  • a group of pages without out-of-group links will trap a spider inside
  • what would happen to $r$?
  • Example
  • $r_a = \frac{1}{2} r_a + 0 \cdot r_b$
  • $r_b = \frac{1}{2} r_a + 1 \cdot r_b$
  • eventually $r_a = ?$, $r_b = ?$
  • Solutions??

[Figure: example graph on pages x, y, z, a, b]
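The spider-trap example can be iterated the same way. This sketch again assumes a start value of 1 for both pages; here b links only to itself.

```python
from fractions import Fraction

# Matrix form of the example: a links to a and b, b links only to b.
M = [[Fraction(1, 2), Fraction(0)],
     [Fraction(1, 2), Fraction(1)]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

r = [Fraction(1), Fraction(1)]     # (r_a, r_b), assumed start
for _ in range(10):
    r = matvec(M, r)

print(r[0], r[1])  # 1/1024 2047/1024
```

The total stays at 2, but r_a halves at every step while r_b absorbs the rest, so the trap page ends up holding all the importance.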
21
Solutions: Surfer's Random Jump
  • Surfer can randomly jump to a new page
  • without following links
  • $d$: damping factor (set to 0.85 in the paper)
  • the $(1 - d)$ term models the probability of randomly jumping to this page
  • another interpretation
  • tax the importance of each page and distribute it to all pages

\[
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
\]
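
The formula can be applied iteratively. The sketch below is one possible reading, not Google's implementation: it uses the two-page spider-trap example (a links to a and b, b links only to itself), starts every page at 1, and runs a fixed number of rounds. The function name `pagerank` and the `links` dictionary are illustrative.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to;
    C(T) is the number of out-links of T.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}           # assumed starting values
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - d) + d * incoming
        pr = new
    return pr

# Spider trap: a links to a and b, b links only to itself.
trap = {"a": ["a", "b"], "b": ["b"]}
ranks = pagerank(trap)
print({p: round(v, 3) for p, v in ranks.items()})  # {'a': 0.261, 'b': 1.739}
```

With the random jump, page a keeps a nonzero score instead of draining to zero, although the trap page b still ranks much higher.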
22
Anti-Spamming
  • Spamming
  • attempt to create artifacts to please search
    engines
  • so that ranking will be high
  • e.g., commercial search engine optimization
    service
  • Google's anti-spam device
  • unlike other search engines, tends to believe what others say about you
  • by links and anchor texts
  • recursive importance also works
  • importance (not just links) propagates
  • Still, not a perfect solution

23
How about Hub and Authority for Spamming?
  • Is it easier to spam a hub or an authority?
  • So what?

24
Search Engine Architecture
art: balancing contents and popularity
inverted index: word -> doc
see ref. for details
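
As a toy illustration of the word-to-document mapping, the sketch below builds an inverted index over a few made-up documents. The document texts, the whitespace tokenization, and the name `build_inverted_index` are all assumptions for illustration; a real engine stores far more (positions, anchor text, ranking signals).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical documents, keyed by id.
docs = {
    1: "Java programming language",
    2: "Java coffee beans",
    3: "weather on the island of Java",
}
index = build_inverted_index(docs)

print(sorted(index["java"]))    # [1, 2, 3]
print(sorted(index["coffee"]))  # [2]
```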
25
Hot Research: Web as the Ultimate Info System
  • The web search problem (or potential) is far from solved!
  • Web modeling: e.g., for algorithm analysis
  • Focused crawling
  • Semantics extraction for data-intensive pages
  • page type classification and data extraction
  • Natural-language queries / question answering
  • AskJeeves.com
  • The deep Web: data hidden in Web databases
  • More on this? E.g., WWW conference, SIGMOD, VLDB!

26
End Of Talk