Title: T?
1 Topics in Mining the World Wide Web
2 Text Retrieval (introductory topics)
3 Information Retrieval
Document databases
- Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, blogs, library databases, etc.
- The data does not follow a strict model: it is semi-structured
Information retrieval
- The information is organized in a (large) number of documents
Information retrieval problem
- Locating the relevant documents based on user input, such as keywords or example documents
4 Information Retrieval
- Basic concepts
- A document (text file) can be described by a set of representative keywords, called index terms
- Different terms, with different degrees of relevance, can be used to describe documents with different content
- This is achieved by assigning numerical weights to each index term of the document (e.g. frequency, tf-idf)
- Analogy with a DBMS
- Index terms correspond to attributes
- Weights correspond to attribute values
5 Information Retrieval
A model of a document: assume a set of terms.
Boolean model: 1 if the term is present, 0 if the term is absent.
Query: (t_{11} OR t_{12} OR ... OR t_{1 i_1}) AND (t_{21} OR t_{22} OR ... OR t_{2 i_2}) AND ... AND (t_{j1} OR t_{j2} OR ... OR t_{j i_j}), where the t_{ij} are terms. The answer is all documents whose terms satisfy the expression.
6 Indexing for Text Retrieval
Example
DocID  Keywords
1      agent James Bond
2      agent mobile computer
3      James Madison movie
4      James Bond movie
Example queries: agent; James and agent; agent or James
7 Indexing for Text Retrieval
Usually, indexes are built that contain pairs (term, document id), possibly with additional fields such as the frequency of the term in the document.
Search engines use similar indexes.
8 Indexing for Text Retrieval
A sorted list (inverted list) for each term (inverted file, inverted list, inverted index)
Term      DocIDs
Agent     1, 2
Bond      1, 4
Computer  2
James     1, 3, 4
Madison   3
Mobile    2
Movie     3, 4
Example documents: 1 agent James Bond; 2 agent mobile computer; 3 James Madison movie; 4 James Bond movie
Postings: (keyword, DocID) pairs
Each list is sorted by DocID
Example query
9 Indexing for Text Retrieval
Vocabulary index: to locate the list of each term faster, the set of terms can be organized with an index structure (e.g. a B-tree). The leaves hold pointers to the corresponding inverted list.
Example: a single term, a conjunction, a disjunction
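The inverted lists and the three kinds of query (single term, conjunction, disjunction) can be sketched as follows. This is a minimal illustration on the four example documents; the function names are hypothetical, terms are lower-cased as a simplifying assumption, and a plain dictionary stands in for the B-tree vocabulary index.

```python
from collections import defaultdict

# The four example documents from the slides (DocID -> keywords).
docs = {
    1: "agent James Bond",
    2: "agent mobile computer",
    3: "James Madison movie",
    4: "James Bond movie",
}

def build_inverted_index(documents):
    """Map each lower-cased term to its posting list, sorted by DocID."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(a, b):
    """Conjunction (AND): merge two DocID-sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """Disjunction (OR): all DocIDs that appear in either list."""
    return sorted(set(a) | set(b))

index = build_inverted_index(docs)
print(index["james"])                              # [1, 3, 4]
print(intersect(index["james"], index["agent"]))   # [1]
print(union(index["agent"], index["james"]))       # [1, 2, 3, 4]
```

Keeping every list sorted by DocID is what lets the conjunction run as a single linear merge.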
10 Information Retrieval
Basic metrics
Precision: the fraction of the retrieved documents that are relevant to the query (that is, the fraction of correct answers)
Recall: the fraction of the relevant documents that are retrieved
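As a small worked example of the two metrics, the sketch below computes them from a set of retrieved DocIDs and a set of relevant DocIDs. The function name and the sample sets are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 3 documents retrieved, 4 documents are actually relevant.
p, r = precision_recall(retrieved=[1, 2, 4], relevant=[1, 3, 4, 7])
print(p, r)
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2).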
11 Text Retrieval Models
The Boolean model we have seen so far assumes that index terms are either present or absent in a document (text).
Queries are terms connected with not, and, and or, e.g. car and repair, plane or airplane.
The Boolean model predicts that a document is either relevant or not relevant. There is no ranking: how relevant is it?
Vector model: again a document is described by its terms, but each term carries a weight (related to the frequency of the term in the document). In the Boolean model, by contrast, all weights are binary (0 or 1).
12 Text Retrieval Models
Vector model
- Term frequency: how many times a term appears in a document
- Normalized, so that long documents do not get a larger weight
- Measures the importance of term ti in a document
- Local measure
13 Text Retrieval Models
Inverse document frequency: measures how important a term is in general
- Global measure
- |D|: the number of documents
- The documents that contain term ti: how many documents contain it
14 Text Retrieval Models
- The tf-idf weight is high when
- the term appears frequently (in a specific document) and
- the term appears rarely in the collection as a whole
- So it is useful for filtering out common terms
15 Text Retrieval Models
Similarity to the query
Documents and queries are represented as m-dimensional vectors, where m is the total number of terms in the collection. The degree of similarity between a document d and a query q is computed as their correlation, using measures such as the Euclidean distance or the cosine of the angle between the two vectors.
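The weighting and similarity steps can be sketched together as below. The slides' own tf and idf formulas did not survive, so this sketch assumes one common textbook variant (tf normalized by document length, idf = log(|D| / number of documents containing the term)); all function names are hypothetical.

```python
import math

# Example documents from the slides (DocID -> keywords).
docs = {
    1: "agent James Bond",
    2: "agent mobile computer",
    3: "James Madison movie",
    4: "James Bond movie",
}
corpus = {doc_id: text.lower().split() for doc_id, text in docs.items()}
vocabulary = sorted({t for tokens in corpus.values() for t in tokens})

def tf(term, tokens):
    """Term frequency, normalized by document length (assumed variant)."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """log(|D| / number of documents containing the term) (assumed variant)."""
    containing = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / containing) if containing else 0.0

def tfidf_vector(tokens):
    """One weight per vocabulary term, i.e. an m-dimensional vector."""
    return [tf(t, tokens) * idf(t) for t in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query_vec = tfidf_vector("james bond".split())
scores = {d: cosine(tfidf_vector(tokens), query_vec) for d, tokens in corpus.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Unlike the Boolean model, every document gets a graded score: documents 1 and 4 contain both query terms and rank first, document 3 shares only one term, and document 2 shares none.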
16 Text Retrieval Models
Other issues
- Word stem
- Many words are small variations that share a common stem, e.g., drug, drugs, drugged
- Synonymy: the keyword T does not appear in the text even though the text is relevant
- Polysemy: the same word can mean different things depending on the context
- Stop list
- A set of words that are not relevant even though they appear often, e.g., a, the, of, for, to, with, etc.
17 Search Engines
18 Search Engines
- Index-based: they search pages, index them, and build huge keyword-based indexes
- Useful for locating pages that contain specific keywords
- Problems
- A topic may include thousands of documents
- Many documents relevant to a topic may not contain the keywords that define it
19 Search Engines
- We will look at
- Page Rank
- HITS
- Both exploit the existence of links (connections) between pages
20 PageRank
21 PageRank: introduction
PageRank: Capturing Page Popularity (Brin & Page 98)
Google's original algorithm, presented in the classic paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page.
The paper includes a very interesting introduction of historical significance:
"We chose our system name, Google, because it is a common spelling of googol, or 10^100 and fits well with our goal of building very large-scale search engines."
The verb "google" was added to the Merriam Webster Collegiate Dictionary and the Oxford English Dictionary in 2006, meaning "to use the Google search engine to obtain information on the Internet." (source: Wikipedia)
22 PageRank: basic idea
Basic idea: even with a huge index over all words and pages, what matters is the important pages (precision vs recall): the first 10 results.
GOAL: compute a value for each page that characterizes how important the page is. This quantity is called page rank.
When is a page important?
23 PageRank: basic idea
Basic idea
- Web pages are not all equally important
- www.joe-schmoe.com vs www.stanford.edu
- References (inlinks) as votes
- www.stanford.edu: 23,400 inlinks
- www.joe-schmoe.com: 1 inlink
Links: a page that receives many references is generally expected to be more important.
24 PageRank: basic idea
Basic idea (continued)
PageRank is based on counting the references to a page (citation counting), but with an improvement.
Not all references are equally important! It takes indirect citations into account: references from important pages (that is, from pages that themselves have many references) count as more important. A recursive definition!
25 Definition of PageRank
Simple recursive formulation
Each page has a quantity that characterizes its importance (this quantity is called page rank). The quantity is divided equally among the outgoing edges of the page. Specifically:
- The vote of each edge (reference) is proportional to the importance (PR) of the page it comes from
- If a page P with importance (PR) y has n outlinks, each link gets y/n votes
26 Definition of PageRank
Example
Suppose there is a total quantity of PR that is shared among the pages of the system. Take 4 pages A, B, C and D, each with an initial approximate value PR = 0.25.
- Suppose B, C and D each link only to A
- Then all of their PageRank would be collected at A
- Now suppose that B also links to C, and D also links to B and C
- The PR value of a page is divided among its outgoing edges
- So B's vote is worth 0.125 for A and 0.125 for C
- Likewise, only 1/3 of D's PageRank counts towards the PageRank of A (about 0.083)
27 Definition of PageRank
General definition of PageRank for a page A
Let T1, ..., Tn be the pages that point to A (that is, its references), and let C(T) be the number of outgoing edges of a page T. Then
PR(A) = PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)
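One application of this definition to the four-page example can be sketched as below. The link structure is an assumption chosen to match the numbers quoted in the example (B links to A and C, C links to A, D links to A, B and C, and A has no outlinks); the function name is hypothetical.

```python
# Assumed link structure for the A, B, C, D example (page -> pages it links to).
links = {
    "A": [],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

# Initial approximate value: 0.25 for every page.
pr = {page: 0.25 for page in links}

def pagerank_step(pr, links):
    """One pass of PR(A) = PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) for every page."""
    new_pr = {page: 0.0 for page in pr}
    for page, outlinks in links.items():
        for target in outlinks:
            # Each outlink carries an equal share of the page's current value.
            new_pr[target] += pr[page] / len(outlinks)
    return new_pr

step1 = pagerank_step(pr, links)
print(step1["A"])   # 0.125 (from B) + 0.25 (from C) + 0.25/3 (from D)
```

Because A has no outlinks in this assumed graph, the value it collects is not passed on in the next step, which is the dead-end problem treated later in the slides.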
28 Computing PageRank
Simple flow model
[Figure: three pages y, a and m; y links to itself and to a, a links to y and to m, m links to a]
y = y/2 + a/2
a = y/2 + m
m = a/2
29 Computing PageRank
Solving the flow equations
- 3 equations, 3 unknowns, no constants
- No unique solution
- All solutions are equivalent up to a scale factor
- An additional constraint forces uniqueness
- y + a + m = 1 (the total PR shared among the pages)
- y = 2/5, a = 2/5, m = 1/5
30 Computing PageRank
Matrix formulation
- The matrix M has one row and one column for each web page (adjacency matrix)
- Suppose page j has n outlinks
- If j -> i, then M_ij = 1/n
- Otherwise, M_ij = 0
- M is a column stochastic matrix
- Its columns sum to 1
31 Computing PageRank
Matrix formulation (example)
      y    a    m
y    1/2  1/2   0
a    1/2   0    1
m     0   1/2   0
Flow equations: y = y/2 + a/2, a = y/2 + m, m = a/2
Each column sums to 1 (e.g. the first column holds the votes of y)
32 Computing PageRank
Matrix formulation
- Let r be a vector with one entry per web page
- r_i is the importance (PR) of page i
- r: the rank vector, here r = (PR(y), PR(a), PR(m))
33 Computing PageRank
PR sum (example)
      y    a    m
y    1/2  1/2   0
a    1/2   0    1
m     0   1/2   0
Flow equations: y = y/2 + a/2, a = y/2 + m, m = a/2
34 Computing PageRank
Suppose page j links to 3 pages, including page i. Then M_ij = 1/3, so page i receives r_j/3 from page j.
35 Computing PageRank
Eigenvectors
- The flow equations can be written as
- r = M r
- That is, the rank vector is an eigenvector of the stochastic adjacency matrix of the web
- Specifically, it is the principal eigenvector (the one that corresponds to the eigenvalue λ = 1)
36 Computing PageRank
Power Iteration method (an iterative method)
A simple iterative scheme (relaxation). Suppose there are N web pages.
- Initialization: r_0 = [1/N, ..., 1/N]^T
- Iteration: r_{k+1} = M r_k
- Termination: when |r_{k+1} - r_k|_1 < ε
- |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm
- Other measures can also be used, e.g. Euclidean
37 Computing PageRank
Example
      y    a    m
y    1/2  1/2   0
a    1/2   0    1
m     0   1/2   0

Iterations (r_{k+1} = M r_k):
       y      a      m
r_0   1/3    1/3    1/3
r_1   1/3    1/2    1/6
r_2   5/12   1/3    1/4
r_3   3/8    11/24  1/6
...
      2/5    2/5    1/5
It converges. Unique solution.
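The iteration can be sketched in a few lines of plain Python on the three-page example. The function names, the tolerance and the iteration cap are assumptions made for this sketch; the stopping rule is the L1 criterion from the power iteration slide.

```python
# Column-stochastic matrix of the y, a, m example (rows and columns in that order).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]

def mat_vec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]

def power_iteration(matrix, eps=1e-10, max_iter=1000):
    """Iterate r_{k+1} = M r_k from the uniform vector until the L1 change is below eps."""
    n = len(matrix)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = mat_vec(matrix, r)
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

r = power_iteration(M)
print([round(x, 4) for x in r])   # [0.4, 0.4, 0.2]
```

The first products reproduce the rows of the table (1/3, 1/2, 1/6, then 5/12, 1/3, 1/4), and the limit agrees with the solution y = 2/5, a = 2/5, m = 1/5 of the flow equations.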
38 Computing PageRank
- The random web surfer model (random walk)
- The PageRank of a page can also be seen as the probability that a random surfer reaches it (that is, it expresses how popular the page is)
- A random surfer starts at a random page and keeps clicking on links, never going back to a previous page
- At time t, the surfer is at some page P
- At time t + 1, the surfer follows an outgoing link (outlink) of P, chosen uniformly at random
- The surfer arrives at some page Q linked from P
- The process continues indefinitely
- Let p(t) be the vector whose i-th entry is the probability that the surfer is at page i at time t
- p(t) is a probability distribution over the pages
39 Computing PageRank
The stationary distribution
- Where is the surfer at time t + 1?
- The surfer follows a link uniformly at random
- p(t + 1) = M p(t)
- Suppose the random walk reaches a state where p(t + 1) = M p(t) = p(t)
- Then p(t) is called a stationary distribution of the random walk
- Since the rank vector r satisfies r = M r,
- it is a stationary distribution for the random surfer
40 Computing PageRank
- A basic result from the theory of random walks (and Markov processes)
- For graphs that satisfy certain conditions, the stationary distribution is unique and is eventually reached regardless of the initial probability distribution at time t = 0 (convergence)
41 Extensions (random jump)
Spider traps
- A group of pages is a spider trap if there are no edges from the group to pages outside the group
- The random surfer gets trapped
- The conditions required by the random walk theorem no longer hold
42 Extensions (random jump)
Spider traps (example)
Pages: y = Yahoo, a = Amazon, m = Msoft (Msoft links only to itself)
      y    a    m
y    1/2  1/2   0
a    1/2   0    0
m     0   1/2   1

Iterations (r_{k+1} = M r_k):
y      a      m
1      1      1
1      1/2    3/2
3/4    1/2    7/4
5/8    3/8    2
...
0      0      3
43 Extensions (random jump)
Extending the model
- At each step, the random surfer has two options
- With probability β, follow a random link
- With probability 1 - β, jump to some other page at random
- Typical values for β: 0.8 to 0.9
- The surfer manages to leave the trap after a few time steps
44 Extensions (random jump)
Extending the model
Original definition of PageRank for a page A:
PR(A) = PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)
Definition with the damping factor d, between 0 and 1:
PR(A) = (1 - d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
The first term is (1 - d)/N so that the values sum to 1. It says that any page is chosen with the same probability.
45 Extensions (random jump)
- Construct the N x N matrix A
- A_ij = β M_ij + (1 - β)/N
- A is a stochastic matrix
- The page rank vector r is the principal eigenvector of this matrix
- r = A r
- Equivalently, r is the stationary distribution of the random walk with teleports
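46 Extensions (random jump)
Example (d = 0.8), pages y = Yahoo, a = Amazon, m = Msoft

A = 0.8 * M + 0.2 * U, where M is the spider-trap matrix and every entry of U is 1/3:

M:    1/2  1/2   0        U:   1/3  1/3  1/3
      1/2   0    0             1/3  1/3  1/3
       0   1/2   1             1/3  1/3  1/3

A:    y     a     m
y    7/15  7/15  1/15
a    7/15  1/15  1/15
m    1/15  7/15  13/15

Iterations (r_{k+1} = A r_k):
y      a      m
1      1      1
1.00   0.60   1.40
0.84   0.60   1.56
0.776  0.536  1.688
...
7/11   5/11   21/11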
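The spider-trap example with teleports can be reproduced with the short sketch below. The helper names and the fixed iteration count are assumptions made for illustration; the matrix and β = 0.8 are the ones from the example.

```python
# Spider-trap matrix from the example: Msoft (m) links only to itself.
M_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]

def teleport_matrix(matrix, beta):
    """A_ij = beta * M_ij + (1 - beta) / N."""
    n = len(matrix)
    return [[beta * matrix[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

def mat_vec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]

A = teleport_matrix(M_trap, beta=0.8)

# Start from (1, 1, 1), as in the example; the entries keep summing to 3.
r = [1.0, 1.0, 1.0]
for _ in range(100):   # fixed iteration count, chosen generously for this sketch
    r = mat_vec(A, r)

print([round(x, 3) for x in r])   # [0.636, 0.455, 1.909], i.e. 7/11, 5/11, 21/11
```

Msoft still ends up with the largest value, but Yahoo and Amazon no longer drop to 0 as they do without teleports.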
47 Extensions (random jump)
The random surfer model (physical interpretation)
A random surfer starts at a random page and keeps clicking on links, never going back to a previous page, but eventually gets bored and restarts from some other random page.
Since the random-jump term of the definition is (1 - d)/N, the surfer follows a link with probability d (the damping factor) and, with probability 1 - d, gets bored at the current page and restarts from some other random page.
48 Extensions (random jump)
Matrix formulation of the extension with teleports
- Suppose there are N pages
- Consider a page j with a set of outlinks O(j)
- M_ij = 1/|O(j)| if j -> i, and M_ij = 0 otherwise
- The random teleport is equivalent to
- adding a random link from j to every other page, with probability (1 - β)/N
- reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
- Or, equivalently: tax each page a fraction (1 - β) of its value and redistribute that amount uniformly
49 Extensions (dead ends)
Dead ends
- Pages without outlinks, where the random surfer has nowhere to go
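50 Extensions (dead ends)
Example, pages y = Yahoo, a = Amazon, m = Msoft (Msoft has no outlinks)

A = 0.8 * M + 0.2 * U, where every entry of U is 1/3:

M:    1/2  1/2   0        U:   1/3  1/3  1/3
      1/2   0    0             1/3  1/3  1/3
       0   1/2   0             1/3  1/3  1/3

A:    y     a     m
y    7/15  7/15  1/15
a    7/15  1/15  1/15
m    1/15  7/15  1/15

Iterations (r_{k+1} = A r_k):
y      a      m
1      1      1
1      0.6    0.6
0.787  0.547  0.387
0.648  0.430  0.333
...
0      0      0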
51 Extensions (dead ends)
Handling dead ends
- Teleport
  - At a dead end, follow a random teleport with probability 1
  - Modify the matrix accordingly
- Prune the dead ends and adjust the graph
  - Pre-process the graph to delete the dead ends
  - Possibly over multiple iterations
  - Compute page rank on the reduced graph
  - Compute approximate values for the dead ends by propagating the values from the reduced graph
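The first option (teleport with probability 1 at a dead end) can be sketched as follows on the Yahoo, Amazon, Msoft graph in which Msoft has no outlinks. This is one possible way to write it, not the slides' own code; the function name, β = 0.8 and the iteration count are assumptions.

```python
# Dead-end example: y links to y and a, a links to y and m, m has no outlinks.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": []}

def pagerank_with_dead_ends(links, beta=0.8, iterations=100):
    """PageRank with teleports; a dead end teleports with probability 1."""
    pages = sorted(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Teleport share: every page gives up a fraction (1 - beta) of its value.
        r_next = {p: (1 - beta) / n for p in pages}
        for p in pages:
            outlinks = links[p]
            if outlinks:
                for q in outlinks:
                    r_next[q] += beta * r[p] / len(outlinks)
            else:
                # Dead end: the remaining fraction beta is also spread uniformly,
                # so the page's whole value is teleported and nothing leaks out.
                for q in pages:
                    r_next[q] += beta * r[p] / n
        r = r_next
    return r

r = pagerank_with_dead_ends(links)
print({p: round(v, 4) for p, v in r.items()})
```

With this change the values keep summing to 1 instead of draining to 0 as in the dead-end example; on this graph they settle at y = 35/81, a = 25/81, m = 21/81.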
52 The PageRank Algorithm
- A page can have a high PR if
- many pages point to it, or
- some of the pages that point to it have a high PR
- Both cases matter
- E.g., in the second case, a link from a page such as Yahoo!
53 Spamdexing
- Content spam
- Link spam
- Google bombing: references that directly influence the PR
- Link farms: pages that refer to one another
54 PageRank (continued)
55 The PageRank Algorithm
If we view the Web as a graph, we want to find its important (central) nodes.
According to PageRank, a node is important if it is linked to important nodes.
- Each page (node) has a quantity
- The quantity depends on how many pages point to it, and is shared among the pages it points to (a recursive definition)
56 The PageRank Algorithm
Example
Each node has an initial PageRank value, which it shares equally among the nodes it points to. E.g., each edge of node 2 carries 1/2, each edge of node 3 carries 1, and so on.
Equivalently, this is the probability of moving to a given node: random walks.
M: the adjacency matrix (transition matrix for Markov chains); r: the PageRank vector; r = M r
57 The PageRank Algorithm
Example
M: the adjacency matrix; r: the PageRank vector; r = M r
r is the eigenvector that corresponds to the eigenvalue λ = 1 (which is the largest eigenvalue, because the matrix is column-stochastic)
58 The PageRank Algorithm
Teleport
- Matrix A
- A_ij = β M_ij + (1 - β)/N
- 1 - β: fly-out probability
- r = A r
59 The PageRank Algorithm
Topic-Specific PageRank
- Compute popularity for a particular topic
- E.g., computer science, health
- Bias the random walk
- When the random walker teleports, it picks a page from a set S of web pages
- S contains only pages that are relevant to the topic
- E.g., the Open Directory (DMOZ) pages for a topic (www.dmoz.org)
- For each teleport set S, a different rank vector r_S
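The biased random walk can be sketched as below: it is the teleport iteration, except that teleports land only on pages of S. The function name, the toy graph, β = 0.8 and the iteration count are assumptions for illustration, and the sketch assumes a graph without dead ends.

```python
# Toy graph without dead ends: y links to y and a, a links to y and m, m links to a.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def topic_pagerank(links, teleport_set, beta=0.8, iterations=100):
    """PageRank in which teleports land only on pages of teleport_set (the set S)."""
    pages = sorted(links)
    # Start with all of the rank on the topic pages (any distribution summing to 1 works).
    r = {p: (1.0 / len(teleport_set) if p in teleport_set else 0.0) for p in pages}
    for _ in range(iterations):
        r_next = {p: 0.0 for p in pages}
        for p in pages:
            for q in links[p]:
                r_next[q] += beta * r[p] / len(links[p])
        # The (1 - beta) teleport mass is shared among the pages of S only.
        for s in teleport_set:
            r_next[s] += (1 - beta) / len(teleport_set)
        r = r_next
    return r

r_y = topic_pagerank(links, {"y"})
r_m = topic_pagerank(links, {"m"})
print(round(r_y["m"], 4), round(r_m["m"], 4))
```

The two calls give different rank vectors for the same graph: page m scores higher when it belongs to the teleport set, which is the sense in which each set S has its own vector r_S.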
60 HITS
61 Introduction
- Problems with using the link structure of the Web
- It is not enough that many links point to a page
- A link does not necessarily express a positive opinion (recognition of the page)
- (some links are advertisements, others are for navigation, etc.)
- An authority on a topic will rarely link to a rival authority in the same field
- Authoritative pages are rarely descriptive or representative
62 HITS: definition
The HITS algorithm (Hyperlink-Induced Topic Search)
For each topic there are two kinds of pages:
- Authorities: a page that is an authority on a topic and is recognized as such by other pages (that is, many links point to it)
- Hubs: a page that refers to authoritative pages
Basic idea
- Pages that are often referenced by other pages should be authorities
- Pages that reference many other pages should be good hubs
[Figure: hubs pointing to authorities]
63 The HITS Algorithm
64 The HITS Algorithm
- Basic idea of HITS
- Good authorities are those referenced by good hubs
- Good hubs are those that refer to good authorities
- A recursive statement
Note: each page is assigned two values per topic, giving a vector h (hub) and a vector a (authority)
65 The HITS Algorithm
The web as a directed graph: nodes are web pages, and an edge from A to B means that web page A has a hyperlink to web page B.
The algorithm has 2 phases:
- Phase I (sampling stage): a set of pages that forms the base set for a topic
- Phase II (iterative stage): processing the base set to identify good authority pages and good hub pages
66 The HITS Algorithm
Phase I: computing the base set
1. Compute the initial set, the root set: classic methods, e.g. retrieve all pages that contain the keywords (we expect it to contain at least references to relevant pages)
67 The HITS Algorithm
- Phase I: computing the base set
- (expanding the root set)
- 2. Link pages
- A page that either contains a link to a node p in the root set (p is an authority), or
- is referenced by a link from a node p in the root set (p is a hub)
- Base set: the root set expanded to include the link pages as well
[Figure: base pages]
68 The HITS Algorithm
Phase II: which base pages are hubs and which are authorities?
Each base page p has two values:
- h_p: hub weight (many pointers to authorities)
- a_p: authority weight (many pointers to it from hubs)
69 The HITS Algorithm
- Main differences from Page Rank
- Two values per page (authority, hub)
- Topical: works on subsets of the web graph
- We start from the base set
70 The HITS Algorithm
Phase II: which base pages are hubs and which are authorities?
- Initialization: for every p, h_p = 1 and a_p = 1
- Iteratively update:
  a_p = Σ h_q, summed over the base pages q that point to p
  h_p = Σ a_q, summed over the base pages q that p points to
71 Adjacency Matrix
Matrix representation
- Let the base set of pages be {1, 2, ..., n}
- Adjacency matrix B (n x n): B[i, j] = 1 if page i contains a link that points to page j
- Let h = (h1, h2, ..., hn) be the vector of hub weights and a = (a1, a2, ..., an) the vector of authority weights (the counterparts of the rank vector r)
72 The HITS Algorithm
The update rules become: h = B a, a = B^T h
- 1st iteration: h = B B^T h = (B B^T) h, a = B^T B a = (B^T B) a
- 2nd iteration: h = (B B^T)^2 h, a = (B^T B)^2 a
Convergence to the eigenvectors of B B^T and B^T B, provided the weights are normalized
73 The HITS Algorithm
Matrix formulation (example), pages: Netscape, Msoft, Amazon

B B^T:   3  1  2
         1  1  0
         2  0  2

h = B B^T h: starting from h = (1, 1, 1), one step gives h = (6, 2, 4)
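The update rules h = B a and a = B^T h can be sketched as below. The adjacency matrix B is an assumption: the slides show only the product B B^T, and this B (pages in the order Netscape, Msoft, Amazon) is one matrix whose product matches it. Normalizing by the largest entry and the iteration count are also choices made for this sketch.

```python
# Assumed adjacency matrix, order Netscape, Msoft, Amazon: B[i][j] = 1 if i links to j.
B = [
    [1, 1, 1],   # Netscape links to all three pages
    [0, 0, 1],   # Msoft links to Amazon
    [1, 1, 0],   # Amazon links to Netscape and Msoft
]
n = len(B)

# B B^T, for comparison with the matrix shown in the example.
BBt = [[sum(B[i][k] * B[j][k] for k in range(n)) for j in range(n)] for i in range(n)]

def normalize(vector):
    """Scale so that the largest entry is 1 (one of several possible normalizations)."""
    top = max(vector)
    return [x / top for x in vector]

def hits(B, iterations=50):
    """Alternate a = B^T h and h = B a, normalizing after every round."""
    size = len(B)
    h = [1.0] * size
    a = [1.0] * size
    for _ in range(iterations):
        a = [sum(B[i][j] * h[i] for i in range(size)) for j in range(size)]   # a = B^T h
        h = [sum(B[i][j] * a[j] for j in range(size)) for i in range(size)]   # h = B a
        a, h = normalize(a), normalize(h)
    return h, a

h, a = hits(B)
print(BBt)
print([round(x, 3) for x in h])
```

With this B, the first unnormalized round gives a = (2, 2, 2) and h = (6, 2, 4), the vector shown in the example, and Netscape ends up as the strongest hub.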
74 The HITS Algorithm
[Figure: adjacency matrix of documents d1, d2, d3 and d4; initial values a = h = 1; iterate; normalize; eigenvectors]
75 The HITS Algorithm
- Problems
- Drifting: when a hub covers many topics
- Topic hijacking: when many pages from the same web site point to the same popular node
76 A little more on search engines
77 Google: other features
- Anchor text
- The text that appears in links is treated differently
- Most search engines associate it with the page on which it appears
- Google also associates it with the page it points to
- It gives more accurate information about the pages the links point to than about the pages on which they appear
- It can point to pages that have no text, only images, programs, etc.
78 Google: architecture
Most of Google is implemented in C or C++ for efficiency and can run in either Solaris or Linux. The web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository.
79 Google: architecture
Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. Hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index.
80 Google: architecture
Indexer: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
81 Google: architecture
URLresolver: converts relative URLs into absolute URLs and in turn into docIDs.
The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index.
Lexicon
82 Google: architecture
The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
83 Categories of Web Mining
Web Mining
- Structure Mining
- Content Mining
  - Web page content mining
  - Search result mining
- Usage Mining
  - General access pattern tracking
  - Customized usage tracking
84 Categories of Web Mining
Web Mining
- Structure Mining: PageRank, HITS, small-world models, ...
- Content Mining
  - Web page content mining
  - Search result mining
- Usage Mining
  - General access pattern tracking
  - Customized usage tracking
85?????