Title: CS511 Design of Database Management Systems
1. CS511: Design of Database Management Systems
- Lecture 18
- New Frontier Web Information Access
- Kevin C. Chang
2. Question: How Is the Web Different from a Text DB?
3. How Is the Web Different from a Text DB?
- Hypertext vs. text: a lot of additional clues
- graph vs. set
- anchor text vs. text: what do others say about you?
- famous story: search for "more evil than Satan himself"
- Web scale is super-huge
- scalability is the key
- Precision is valued more than recall
- quality is more important than quantity, especially for broad queries
- Geographically distributed vs. centralized
- so you need to build a crawler
- Spamming
- Hoaxes and more (search Google with ranking)
4. IR Search: More of an Art than a Science
- Most web queries are a single keyword
- What do you want when you search for
- "Java": programming language? Coffee? Island?
- "weather"?
- Perhaps you do not even know what you want!
- What does this paper try to solve?
- Are the authority pages always what is wanted?
5. Link Analysis: Beyond Content-based IR
- Traditional IR: content-based retrieval
- documents match queries by contents
- contents: what documents say about themselves
- Web IR
- documents are interconnected in a hyperlink graph
- contents are just one thing to distinguish pages
- also, what others say about a page (spam-proof?)
- link analysis: this paper
- click analysis?
6. Hubs and Authorities
- An intuitive/informal definition
- authorities: highly regarded, authoritative pages
- hubs: pages that refer you to authorities
- A recursive/formal definition
- mutually reinforcing relationships
- hub: a page that links to many authorities
- authority: a page that is linked to by many hubs
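7. Web Adjacency Matrix
- Web: $G = (V, E)$
- $V = \{x, y, z\}$, $|V| = n$
- $E = \{(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)\}$
- $A$: an $n \times n$ matrix with $A_{ij} = 1$ if page $i$ links to page $j$, and $0$ if not
- rows are source nodes and columns are target nodes, both in the order $x, y, z$

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$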
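As a minimal sketch of the definition above, the matrix can be built from the edge list. The function name `adjacency_matrix` and the list-of-rows representation are illustrative choices, not something the slides prescribe.

```python
# Pages and links from the slide's three-page example.
pages = ["x", "y", "z"]
edges = [("x", "x"), ("x", "y"), ("x", "z"),
         ("y", "z"),
         ("z", "x"), ("z", "y")]

def adjacency_matrix(pages, edges):
    """Return A as a list of rows, with A[i][j] = 1 if page i links to page j."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    A = [[0] * n for _ in range(n)]
    for source, target in edges:
        A[index[source]][index[target]] = 1
    return A

A = adjacency_matrix(pages, edges)
for page, row in zip(pages, A):
    print(page, row)  # one row per source page
```

Each printed row lists, for one source page, which target pages it links to.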
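8. Transposed Adjacency Matrix
- Adjacency matrix $A$
- what does row $j$ represent?
- Transpose $A^T$
- what does row $j$ represent?

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$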
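One way to explore the question on this slide is to read the rows of both matrices back as page lists. This is a small sketch under the convention that rows are source pages and columns are target pages; the helper names `transpose` and `linked_pages` are made up for illustration.

```python
pages = ["x", "y", "z"]
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

def transpose(M):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*M)]

def linked_pages(row):
    """Names of the pages whose entry in this row is 1."""
    return [pages[j] for j, bit in enumerate(row) if bit]

At = transpose(A)

# Row j of A lists the pages that page j links to.
out_links = {page: linked_pages(A[i]) for i, page in enumerate(pages)}
# Row j of the transpose lists the pages that link to page j.
in_links = {page: linked_pages(At[i]) for i, page in enumerate(pages)}

print(out_links)
print(in_links)
```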
9. Hubbiness and Authority
- Hubbiness: a vector $h$
- $h_i$ is a value representing the hubbiness of page $i$
- Authority: a vector $a$
- $a_i$ is a value representing the authority of page $i$
- Mutually recursive definition in terms of $h$ and $a$
- $h_x = {}$ ?
- $a_x = {}$ ?
[Figure: link graph over pages x, y, z]
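10. Hubbiness
- Hubbiness
- $h_x = a_x + a_y + a_z$
- $h_y = a_z$
- $h_z = a_x + a_y$
- $h = \alpha A a$
- $A$: links-to nodes
- $a$: their authority weights
- $\alpha$: scaling factor (why?)

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$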
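11. Authority
- Authority
- $a_x = h_x + h_z$
- $a_y = h_x + h_z$
- $a_z = h_x + h_y$
- $a = \beta A^T h$
- $A^T$: linked-from nodes
- $h$: their hub weights
- $\beta$: scaling factor

$$A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$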
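The two update rules can be tried on the three-page example with one matrix-vector product each. The sketch below assumes both scaling factors are 1 and starts from authority weights of all ones; `mat_vec` is a hypothetical helper name.

```python
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
At = [list(column) for column in zip(*A)]  # transpose of A

def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

a = [1, 1, 1]          # initial authority weights (assumed)
h = mat_vec(A, a)      # hubbiness: sum of authorities of the pages linked to
a_new = mat_vec(At, h) # authority: sum of hubbiness of the pages linking in

print(h)      # [3, 1, 2]
print(a_new)  # [5, 5, 4]
```

Page x gets the largest hub weight because it links to all three pages.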
12. Finding Hubbiness and Authority
- Recursive definition
- $a = \beta A^T h$, $h = \alpha A a$
- Authority: $a = \alpha\beta (A^T A) a$
- $a$ is an eigenvector of $A^T A$
- Hubbiness: $h = \alpha\beta (A A^T) h$
- $h$ is an eigenvector of $A A^T$
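The eigenvector claims follow by substituting each equation into the other. A short derivation, writing the two scaling factors as $\alpha$ (hubbiness) and $\beta$ (authority):

```latex
\begin{aligned}
a &= \beta A^{T} h = \beta A^{T} (\alpha A a) = \alpha\beta\,(A^{T} A)\,a, \\
h &= \alpha A a = \alpha A (\beta A^{T} h) = \alpha\beta\,(A A^{T})\,h.
\end{aligned}
```

So, for nonzero $a$ and $h$, both vectors are eigenvectors with the same eigenvalue $1/(\alpha\beta)$, of $A^T A$ and $A A^T$ respectively.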
13. Computing Hubbiness and Authority
- Computation by relaxation
- start from some initial values of $a$ and $h$
- $z = (1, 1, \ldots, 1)$
- $a_0 = z$, $h_0 = z$
- repeat until fixpoint: apply the equations
- $a_i = \alpha\beta (A^T A) a_{i-1}$
- $h_i = \alpha\beta (A A^T) h_{i-1}$
- fixpoint: $a_i = a_{i-1}$, $h_i = h_{i-1}$
- Convergence
- for $a$: $A^T A$ is symmetric (and $z$ is the right starting vector)?
- relaxation will converge to the principal eigenvector of $A^T A$
- for $h$: similarly, the principal eigenvector of $A A^T$
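The relaxation can be sketched as a plain power iteration. This version is an illustration, not the paper's implementation: it rescales each vector to unit length after every step (one possible choice for the scaling factors), and the names `hits`, `tol`, and `max_iter` are assumptions.

```python
import math

def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def normalize(v):
    """Rescale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def hits(A, tol=1e-12, max_iter=1000):
    """Iterate a <- (A^T A) a and h <- (A A^T) h until both stop changing."""
    At = [list(column) for column in zip(*A)]
    n = len(A)
    a = normalize([1.0] * n)
    h = normalize([1.0] * n)
    for _ in range(max_iter):
        new_a = normalize(mat_vec(At, mat_vec(A, a)))
        new_h = normalize(mat_vec(A, mat_vec(At, h)))
        change = max(abs(p - q) for p, q in zip(new_a + new_h, a + h))
        a, h = new_a, new_h
        if change < tol:
            break
    return a, h

A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
a, h = hits(A)
print([round(x, 2) for x in a])
print([round(x, 2) for x in h])
```

On the three-page example the authority vector settles near (0.63, 0.63, 0.46) as a unit vector.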
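14. Computing Hubbiness and Authority
- Assume $\alpha = 1$, $\beta = 1$, initial $h = a = (1, 1, 1)$
- note $A^T A$ and $A A^T$ are both symmetric matrices
- Will converge, e.g., with some scaling
- $a \to (1.36, 1.36, 1)$ (or $(0.63, 0.63, 0.46)$ as a unit vector)

$$h = (A A^T)\, h = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix} h, \qquad a = (A^T A)\, a = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix} a$$

Iterates without scaling (columns are steps 1 to 4):

step     1     2     3     4
a_x      1     5    24   114
a_y      1     5    24   114
a_z      1     4    18    84
h_x      1     6    28   132
h_y      1     2     8    36
h_z      1     4    20    96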
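The two product matrices and the unscaled iterates on this slide can be rechecked with integer arithmetic. A small sketch, with both scaling factors taken as 1 and `mat_mul` as a hypothetical helper:

```python
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
At = [list(column) for column in zip(*A)]

def mat_mul(P, Q):
    """Matrix product of P and Q, both stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

AtA = mat_mul(At, A)  # drives the authority vector
AAt = mat_mul(A, At)  # drives the hubbiness vector

a_steps = [[1, 1, 1]]
h_steps = [[1, 1, 1]]
for _ in range(3):
    a_steps.append(mat_vec(AtA, a_steps[-1]))
    h_steps.append(mat_vec(AAt, h_steps[-1]))

print(a_steps)
print(h_steps)
```

The ratio 114/84 in the last authority iterate is already about 1.36, which is the limit the slide quotes.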
15. Why Did This Work Fail?
- Theory declared a success (?), but the system was a failure
- As a refining service: extra time to process
- As an add-on to existing search engines
- While they were still arguing over who invented link analysis (perhaps not a CS credit), Google had surely built a system!
16. Google PageRank
- Reference: http://www7.scu.edu.au/
- S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, J. M. Kleinberg: Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. WWW7 / Computer Networks 30(1-7): 65-74 (1998)
- S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
- Google.com
- google: from googol, $10^{100}$
- in the Stanford Digital Libraries project, 1996-98
- around the same time as Kleinberg's paper
- first demo: reorder "weather" results from AltaVista.com
- tried to sell to Infoseek in 1997
- founded in 1998 by Brin and Page
17. PageRank: Importance of Pages
- PageRank (or importance): again, recursively
- a page P is important if important pages link to it
- importance of P
- proportionally contributed by the pages that link to it (its back-links)
- Example
- $r_x = \tfrac{1}{2} r_x + \tfrac{1}{2} r_z$
- $r_y = \tfrac{1}{2} r_z$
- $r_z = \tfrac{1}{2} r_x + 1 \cdot r_y$
- Random-surfer interpretation
- surfer randomly follows links to navigate
- PageRank: the probability that the surfer will visit the page
[Figure: link graph over pages x, y, z]
18. Computing PageRank
- Importance-propagation equation
- Computation by relaxation
- linked-from ($A^T$) or links-to matrix ($A$)?
- column-normalized
- column $x$ is all that $x$ points to
- each column sums to 1; why?

$$r = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix} r$$

Iterates of $r$:

step     1     2     3    ...   fixpoint
r_x      1     1    5/4   ...   6/5
r_y      1    1/2   3/4   ...   3/5
r_z      1    3/2    1    ...   6/5
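The relaxation on this slide can be replayed with exact fractions. The link structure below (x links to x and z, y links to z, z links to x and y) is inferred from the example equations rather than stated on the slides, and `transition_matrix` is a hypothetical helper name.

```python
from fractions import Fraction

pages = ["x", "y", "z"]
# Out-links inferred from the example equations (an assumption).
links = {"x": ["x", "z"], "y": ["z"], "z": ["x", "y"]}

def transition_matrix(pages, links):
    """Column-normalized matrix: column s spreads page s's rank over its out-links."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            M[index[target]][index[source]] = Fraction(1, len(targets))
    return M

def step(M, r):
    """One relaxation step: r <- M r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

M = transition_matrix(pages, links)
r = [Fraction(1)] * 3
history = [r]
for _ in range(2):
    r = step(M, r)
    history.append(r)

fixpoint = [Fraction(6, 5), Fraction(3, 5), Fraction(6, 5)]
print(history)
print(step(M, fixpoint) == fixpoint)  # True: the slide's fixpoint is stable
```

Because every column sums to 1, each step keeps the total importance at 3, which is why the fixpoint entries also add up to 3.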
19. Problems: Dead Ends
- Dead ends
- a page without successors has nowhere to send its importance
- eventually, what would happen to $r$?
- Example
- $r_a = 0 \cdot r_a + 0 \cdot r_b$
- $r_b = 1 \cdot r_a + 0 \cdot r_b$
[Figure: link graph with pages x, y, z, a, b]
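Iterating the two example equations shows where the importance goes. A minimal sketch, assuming both pages start with importance 1:

```python
# Column-normalized matrix for the example: a links to b, b links nowhere.
M = [[0, 0],
     [1, 0]]

def step(M, r):
    """One relaxation step: r <- M r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

r = [1, 1]  # assumed starting importance for (a, b)
history = [r]
for _ in range(2):
    r = step(M, r)
    history.append(r)

for k, vector in enumerate(history):
    print(k, vector, "total =", sum(vector))
```

The column for b sums to 0 instead of 1, so the total importance shrinks at every step until nothing is left.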
20. Problems: Spider Traps
- Spider traps
- a group of pages without out-of-group links will trap a spider inside
- what would happen to $r$?
- Example
- $r_a = \tfrac{1}{2} r_a + 0 \cdot r_b$
- $r_b = \tfrac{1}{2} r_a + 1 \cdot r_b$
- eventually $r_a = {}$ ?, $r_b = {}$ ?
- Solutions??
[Figure: link graph with pages x, y, z, a, b]
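The same kind of experiment works for the spider-trap equations. This sketch uses exact fractions and assumes both pages start with importance 1:

```python
from fractions import Fraction

half = Fraction(1, 2)
# Column-normalized matrix for the example: a links to a and b, b links only to b.
M = [[half, 0],
     [half, 1]]

def step(M, r):
    """One relaxation step: r <- M r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

r = [Fraction(1), Fraction(1)]  # assumed starting importance for (a, b)
for _ in range(10):
    r = step(M, r)

print(r)       # importance of a halves at every step
print(sum(r))  # the total stays at 2
```

Unlike a dead end, the total is preserved, but it drains steadily from a into the trap page b.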
21. Solutions: Surfer's Random Jump
- Surfer can randomly jump to a new page
- without following links
- $d$: damping factor (set to 0.85 in the paper)
- $1 - d$ models the probability of randomly jumping to this page
- another interpretation
- tax the importance of each page and distribute it to all pages

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
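The formula can be applied iteratively, with T1..Tn the pages linking to A and C(T) the number of out-links of T. The sketch below is an illustration rather than Google's implementation: the function name `pagerank`, the fixed iteration count, and the starting value of 1 per page are assumptions, and it expects every page to have at least one out-link. It is run on a two-page spider trap (a links to a and b, b links only to b).

```python
def pagerank(links, d=0.85, iterations=200):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            incoming = sum(pr[source] / len(targets)
                           for source, targets in links.items()
                           if page in targets)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Spider-trap example: without damping, b would absorb all importance.
links = {"a": ["a", "b"], "b": ["b"]}
pr = pagerank(links)
print(pr)
```

With damping, page a keeps a positive rank of about 0.26 instead of draining to zero, because every page receives the (1 - d) share at each step.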
22. Anti-Spamming
- Spamming
- attempt to create artifacts to please search engines
- so that ranking will be high
- e.g., commercial search engine optimization service
- Google anti-spam device
- unlike other search engines, tends to believe what others say about you
- by links and anchor texts
- recursive importance also works
- importance (not just links) propagates
- Still, not a perfect solution
23. How about Hubs and Authorities for Spamming?
- Is it easier to spam a hub or an authority?
- So what?
24. Search Engine Architecture
[Figure: search engine architecture diagram]
ranking is an art: balancing contents and popularity
inverted index: word → doc
see ref. for details
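The word-to-document mapping of an inverted index can be illustrated in a few lines. This is a toy sketch with made-up documents and whitespace tokenization, not the architecture described in the reference:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical documents, for illustration only.
docs = {
    1: "Java programming language",
    2: "Java coffee beans",
    3: "island of Java",
}
index = build_inverted_index(docs)
print(sorted(index["java"]))    # [1, 2, 3]
print(sorted(index["coffee"]))  # [2]
```

A single-keyword query then becomes one dictionary lookup, and ranking decides the order of the returned documents.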
25. Hot Research: Web as the Ultimate Info System
- The web search problem (or potential) is far from solved!
- Web modeling, e.g., for algorithm analysis
- Focused crawling
- Semantics extraction for data-intensive pages
- page type classification and data extraction
- Natural-language queries / question answering
- AskJeeves.com
- The deep Web: data hidden in Web databases
- More on this? E.g., WWW conference, SIGMOD, VLDB!
26. End of Talk