Title: T?
1 Topics Related to Mining the World Wide Web
2 Text Retrieval (introductory topics)
3 Information Retrieval
- Text databases (document databases)
- Large collections of documents from various sources:
news articles, research papers, books, digital
libraries, e-mail messages, Web pages,
library databases, etc.
- The data do not follow a strict model:
they are semi-structured
- Information retrieval
- Information is organized into (a large number of)
documents
- Information retrieval problem: locating the
relevant documents based on the user's input,
such as keywords or example
documents
4 Information Retrieval
- IR systems
- Library catalogs
- Online document management systems
- IR vs. DBMS
- Updates, transaction processing, etc. (DBMS concerns)
- Keyword queries, ranking/relevance (IR concerns)
5 Information Retrieval
- Basic concepts
- A document (text file) can be
described by a set of representative
keywords called
index terms.
- Different terms, with different degrees of
relevance, can be used to
describe documents with different content
- This is achieved by assigning numerical
weights to each index
term of a document (e.g. frequency,
tf-idf)
- Analogy with a DBMS
- Index terms correspond to attributes
- Weights correspond to attribute values
6 Information Retrieval
- Searching by keyword (keyword queries)
- Boolean query
- (t11 ∨ t12 ∨ … ∨ t1i1) ∧ (t21 ∨ t22 ∨ … ∨ t2i2)
∧ … ∧ (tj1 ∨ tj2 ∨ … ∨ tjij)
- where the tij are terms
- ????? ? ??? t??? ?????
- 2. Ranked query (Ranking): degree of
relevance
7 Information Retrieval
Basic metrics
Precision: the percentage of retrieved
documents that are relevant to the query
(that is, the percentage of correct answers)
Recall: the percentage of relevant
documents that are retrieved
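The two metrics can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function name `precision_recall` and the sample document ids are assumptions made for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids that are truly relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 documents are actually relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 7])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 1/2), and two of the three relevant documents were found (recall 2/3).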
8 Indexing for Text Retrieval
Typically, indexes are built that contain
pairs <term, file-id>, possibly with additional
fields such as the frequency of the term in the
file. Search engines use similar
indexes as well
9 Indexing for Text Retrieval
A sorted list (inverted list)
(inverted file, inverted list, inverted index)
for each term
Agent     <1,2>
Bond      <1,4>
Computer  <2>
James     <1,3,4>
Madison   <3>
Mobile    <2>
Movie     <3,4>
Example
Rid  Keywords
1    agent James Bond
2    agent mobile computer
3    James Madison movie
4    James Bond movie
Postings: (keyword, DocID). Each list is sorted
by DocID
Example query
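The inverted lists for the four example records can be rebuilt with a short sketch. The helper names (`build_inverted_index`, `query_and`, `query_or`) are assumptions made for illustration; terms are lower-cased, which the slides do not specify.

```python
from collections import defaultdict

# The four example records: Rid -> keywords.
docs = {
    1: "agent James Bond",
    2: "agent mobile computer",
    3: "James Madison movie",
    4: "James Bond movie",
}

def build_inverted_index(docs):
    """Map each lower-cased term to a sorted list of DocIDs (its postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def query_and(index, *terms):
    """Conjunction: documents containing every term."""
    postings = [set(index.get(t.lower(), [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

def query_or(index, *terms):
    """Disjunction: documents containing at least one term."""
    return sorted(set().union(*(index.get(t.lower(), []) for t in terms)))

index = build_inverted_index(docs)
print(index["james"])                        # [1, 3, 4]
print(query_and(index, "James", "Bond"))     # [1, 4]
print(query_or(index, "Madison", "mobile"))  # [2, 3]
```

Keeping each postings list sorted by DocID is what lets a real system intersect lists with a linear merge; the sets used here are only a shortcut for a small example.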
10 Indexing for Text Retrieval
Vocabulary index: to locate the list for each
term faster, the set of terms
can be organized with an index
structure (e.g. a B-tree). The leaves hold pointers
to the corresponding inverted list
Example: a single term, conjunction, disjunction
11 Indexing for Text Retrieval
Document signature (File Signature): one index
record for each document in the
database. Each record has a fixed size of b bits,
the signature width.
Building the signature of a
file: a hash function is applied to every term that
appears in the file; it
returns a number from 1 to b, and the
corresponding bit of the file's signature is set to
1.
For a query, we build its signature
and scan the file signatures to
find those that match.
False positives are possible.
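A minimal sketch of the signature scheme follows. The width `B = 16`, the use of MD5 as the hash function, and all function names are assumptions chosen for the example; bit positions run from 0 to b-1 here rather than 1 to b.

```python
import hashlib

B = 16  # signature width in bits (hypothetical; real systems use a larger b)

def term_bit(term, b=B):
    """Hash a term to one bit position in 0..b-1."""
    digest = hashlib.md5(term.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % b

def signature(terms, b=B):
    """OR together the bits of all terms into one b-bit integer."""
    sig = 0
    for term in terms:
        sig |= 1 << term_bit(term, b)
    return sig

def candidates(query_terms, doc_signatures, b=B):
    """Scan all signatures; keep documents whose signature covers the query's."""
    q = signature(query_terms, b)
    return [doc_id for doc_id, s in doc_signatures.items() if (s & q) == q]

def search(query_terms, docs):
    """Filter with signatures, then check the text to drop false positives."""
    sigs = {doc_id: signature(text.split()) for doc_id, text in docs.items()}
    wanted = {t.lower() for t in query_terms}
    return [d for d in candidates(query_terms, sigs)
            if wanted <= {w.lower() for w in docs[d].split()}]

docs = {1: "agent James Bond", 2: "agent mobile computer",
        3: "James Madison movie", 4: "James Bond movie"}
print(search(["James", "Bond"], docs))
```

A signature match only says that the document may contain the terms, because different terms can hash to the same bit; the final text check in `search` removes such false positives. A document that does contain all query terms always matches.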
12 Indexing for Text Retrieval
Avoiding a scan of the whole signature
file: signature files with vertical
partitioning into single-bit columns (bit slices). We partition
a signature file into a set of vertical
binary columns. For a query signature with k ones, only
k columns need to be retrieved
13 Text Retrieval Models
Boolean Model
- The models we have seen so far assume that
index terms are either present or
absent in a file (document)
- That is, all weights are binary (0 or 1)
- Queries are terms connected with not,
and, and or
- e.g. car and repair, plane or airplane
- The Boolean model predicts that a file is
either relevant or not relevant, based on
whether the query matches the file
14 Text Retrieval Models
Weighted models
Term frequency: how many times
a term appears in a document. It is normalized
so that we avoid giving greater
weight to long documents. It captures the importance of term ti in
a document
Local measure
15 Text Retrieval Models
Inverse document
frequency: measures how important a term is
in general
Global measure
|D|: the number of documents in the collection
Documents that contain
term ti:
how many documents contain it
16 Text Retrieval Models
- The weight is high when there is
- a high frequency of occurrence (in a specific
document) and
- a low frequency of occurrence of the term across the whole
collection
- Hence it is useful for filtering out common terms
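The formulas themselves did not survive in the slide text, so the sketch below assumes the common textbook definitions: tf is the term count divided by the document length, and idf is log(|D| / number of documents containing the term). The function names and the reuse of the four keyword records as a corpus are also assumptions.

```python
import math
from collections import Counter

def tf(term, doc_terms):
    """Normalized term frequency: occurrences of term / total terms in the document."""
    counts = Counter(doc_terms)
    return counts[term] / len(doc_terms) if doc_terms else 0.0

def idf(term, corpus):
    """log(|D| / number of documents containing the term); 0.0 if no document has it."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(term, doc_terms, corpus):
    return tf(term, doc_terms) * idf(term, corpus)

corpus = [doc.lower().split() for doc in (
    "agent James Bond", "agent mobile computer",
    "James Madison movie", "James Bond movie")]

print(tf_idf("madison", corpus[2], corpus))  # rare term: higher weight
print(tf_idf("james", corpus[2], corpus))    # common term: lower weight
```

Both terms occur once in the third document, so their tf is equal; "madison" gets the larger weight only because it appears in fewer documents of the collection.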
17 Text Retrieval Models
Other models
- A table of term frequencies (term
frequency table)
- Each entry frequent_table(i, j) = number of
occurrences of the word ti in document dj
- Usually the ratio is used instead of the actual
number of occurrences
- Similarity metrics: measure the similarity between
a document and a query (a set of
keywords, i.e. terms)
- Relative term occurrences
- Cosine distance
18 Text Retrieval Models
Vector Model
- Files and queries are represented as
m-dimensional vectors, where m is the total
number of terms in the collection
- The degree of similarity between a file d and a
query q is computed as their correlation,
using metrics such as the Euclidean
distance or the cosine of the angle between the two
vectors
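As an illustration of the vector model, the sketch below ranks the four example keyword records against a query by cosine similarity. Raw term counts are used as weights for simplicity (an assumption; tf-idf weights would work the same way), and the names `to_vector` and `cosine` are made up for the example.

```python
import math

def to_vector(terms, vocabulary):
    """Term-count vector of a document or query over a fixed vocabulary."""
    return [terms.count(t) for t in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = {1: "agent james bond", 2: "agent mobile computer",
        3: "james madison movie", 4: "james bond movie"}
vocabulary = sorted({t for text in docs.values() for t in text.split()})  # the m terms

query = to_vector("james bond".split(), vocabulary)
ranking = sorted(
    docs,
    key=lambda d: cosine(to_vector(docs[d].split(), vocabulary), query),
    reverse=True,
)
print(ranking)
```

Documents 1 and 4 contain both query terms and come first, document 3 shares one term, and document 2 shares none, so its similarity is zero.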
19 Text Retrieval Models
Latent Semantic Indexing
- Basic idea
- Similar documents have similar word frequencies
- Difficulty: the size of the term frequency matrix
is very large
- Use singular value decomposition (SVD)
techniques to reduce the size of the frequency table
- Retain the K most significant rows of the
frequency table
20 Text Retrieval Models
Other issues
- Word stem
- Many words are small variations of one another because they share
a common stem, e.g., drug, drugs, drugged
- Synonymy: the keyword T does not
appear in the document even though the document is
relevant
- Polysemy: the same word can
mean different things depending on the
context
21 Text Retrieval Models
Other issues
- Stop list
- A set of words that are not relevant even though they
appear frequently, e.g., a, the, of, for, to,
with, etc.
22 Search Engines
23 Search Engines
- Index-based: they search for pages,
index them, and build huge indexes
based on keywords
- Useful for locating pages that contain
specific keywords
- Problems
- A topic may contain thousands of documents
- Many documents relevant to a topic may
not contain the keywords that
define it
24 Search Engines
- We will look at
- Page Rank
- HITS
- Both exploit the existence of links
between pages
25 PageRank
26 PageRank: introduction
PageRank: Capturing Page Popularity (Brin and
Page, 1998)
Google's original algorithm, presented
in the classic paper The Anatomy of a
Large-Scale Hypertextual Web Search Engine,
by Sergey Brin and Lawrence Page
The paper includes a very interesting
introduction of historical importance
We chose our system name, Google, because it is
a common spelling of googol, or 10^100 and fits
well with our goal of building very large-scale
search engines.
The verb, "google", was added to the Merriam
Webster Collegiate Dictionary and the Oxford
English Dictionary in 2006, meaning, "to use the
Google search engine to obtain information on the
Internet." (source: Wikipedia)
27 PageRank: basic idea
Basic idea: even with a huge index
of all the words and the pages, what
matters is the important
pages (precision vs recall). GOAL: compute
a value for every page that characterizes
how important that page is; this quantity
is called page rank
- Web pages are not all equally important
- www.joe-schmoe.com vs www.stanford.edu
- References (inlinks) as votes
- www.stanford.edu: 23,400 inlinks
- www.joe-schmoe.com: 1 inlink
So links matter: a page that
receives many references is expected to be
generally more important
28 PageRank: basic idea
Basic idea (continued)
PageRank is based on
citation counting, but with one improvement:
not all references are equally
important! It considers indirect
citations: references from important pages
(that is, from pages that themselves have many
references) are considered more important. A recursive
definition!
29 PageRank definition
Simple recursive formulation
- The vote of each edge (reference) is proportional to the
importance (PR) of the page it
comes from
- If a page P with importance (PR) y has n
outlinks, each link gets y/n votes
30 PageRank definition
Example
There is a total amount of PR that is shared
among the pages of the system. Consider 4 pages A,
B, C and D. The initial approximate value for
each is PR = 0.25
- Suppose B, C and D each link only to A;
- then all of their PageRank PR( ) would accumulate at
A
- Now suppose that B also links to C, and D
also links to B and C
- The PR value of a page is split among
its outgoing edges
- So B's vote is worth 0.125 to A and
0.125 to C.
- Similarly, only 1/3 of D's PageRank counts
toward A's PageRank (about 0.083).
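The arithmetic of this example can be checked with exact fractions. The link structure below (B links to A and C, C links to A, D links to A, B and C) is the one implied by the vote values 0.125 and 0.083; the variable names are illustrative.

```python
from fractions import Fraction

# Outlinks as described in the example: B -> A, C; C -> A; D -> A, B, C.
outlinks = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pr = {page: Fraction(1, 4) for page in outlinks}  # initial value 0.25 each

# One update for A: every page linking to A contributes PR / (number of outlinks).
pr_a = sum(pr[p] / len(targets) for p, targets in outlinks.items() if "A" in targets)
print(pr_a, float(pr_a))
```

A collects 0.125 from B, 0.25 from C and about 0.083 from D, which adds up to 11/24 (about 0.458) after this single step.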
31 PageRank definition
General definition of PageRank for a page A
Suppose A has pages T1, ..., Tn that
point to it (that is, references). Let C(T) be the
number of outgoing edges of a page T
PR(A) = PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)
32 Computing PageRank
Simple flow model
[Figure: three pages y, a and m; y sends y/2 to itself and y/2 to a, a sends a/2 to y and a/2 to m, m sends m to a]
y = y/2 + a/2
a = y/2 + m
m = a/2
33 Computing PageRank
Solving the flow equations
- 3 equations, 3 unknowns, no constants
- No unique solution
- All solutions are equivalent up to a scale factor
- An additional constraint makes the solution
unique
- y + a + m = 1 (the total PR shared among the
pages)
- y = 2/5, a = 2/5, m = 1/5
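A quick check with exact fractions confirms that this solution satisfies the three flow equations of the y, a, m example (y = y/2 + a/2, a = y/2 + m, m = a/2) together with the sum constraint. The scale factor `k` is an arbitrary value chosen for the illustration.

```python
from fractions import Fraction as F

y, a, m = F(2, 5), F(2, 5), F(1, 5)

# Flow equations of the three-page example (y, a, m).
assert y == y / 2 + a / 2
assert a == y / 2 + m
assert m == a / 2
assert y + a + m == 1  # the extra constraint that makes the solution unique

# Any rescaling also satisfies the flow equations, but not the sum constraint.
k = 5
assert k * a == k * y / 2 + k * m
assert k * (y + a + m) != 1
print("flow equations hold for", y, a, m)
```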
34 Computing PageRank
Matrix formulation
- The matrix M has one row and one column for
each web page (adjacency matrix)
- Suppose page j has n outlinks
- If j -> i, then Mij = 1/n
- Otherwise, Mij = 0
- M is a column stochastic matrix
- Its columns sum to 1
35 Computing PageRank
Matrix formulation (example)
     y    a    m
y   1/2  1/2   0
a   1/2   0    1
m    0   1/2   0
y = y/2 + a/2
a = y/2 + m
m = a/2
Column y sums to 1 (the votes of y)
36 Computing PageRank
Matrix formulation
- Let r be a vector with one entry per web page
- ri is the importance (PR) of page i
- r: rank vector
r = [PR(y), PR(a), PR(m)]^T
37 Computing PageRank
PR sum (example)
     y    a    m
y   1/2  1/2   0
a   1/2   0    1
m    0   1/2   0
y = y/2 + a/2
a = y/2 + m
m = a/2
38 Computing PageRank
Suppose page j has links to 3 pages,
including i. Then Mij = 1/3, so page j contributes rj/3 to ri
39 Computing PageRank
Eigenvectors
- The flow equations can be written as
- r = M r
- That is, the rank vector is an eigenvector
of the stochastic adjacency matrix
of the web
- Specifically, it is the principal eigenvector (the one
that corresponds to the eigenvalue λ = 1)
40 Computing PageRank
Power Iteration method
A simple iterative scheme (relaxation). Let there be N
web pages
Initialization: r_0 = [1/N, ..., 1/N]^T
Iteration: r_{k+1} = M r_k
Termination: when |r_{k+1} - r_k|_1 < ε
- |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm
- Other metrics can also be used, e.g.
Euclidean
41 Computing PageRank
Example
     y    a    m
y   1/2  1/2   0
a   1/2   0    1
m    0   1/2   0
Iterates (y, a, m):
1/3   1/3    1/3
1/3   1/2    1/6
5/12  1/3    1/4
3/8   11/24  1/6
. . .
2/5   2/5    1/5
Does it converge? Is the solution unique?
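The iteration on this example can be reproduced with a short sketch. It uses exact fractions so that the first iterates match the values listed; the tolerance `eps`, the iteration cap and the function names are assumptions made for the example.

```python
from fractions import Fraction as F

# Column-stochastic matrix of the y, a, m example: M[i][j] = 1/n if j links to i.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]

def step(M, r):
    """One iteration r_{k+1} = M r_k."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]

def power_iteration(M, eps=F(1, 10**6), max_iter=1000):
    n = len(M)
    r = [F(1, n)] * n                      # r_0 = [1/N, ..., 1/N]^T
    for _ in range(max_iter):
        nxt = step(M, r)
        if sum(abs(x - y) for x, y in zip(nxt, r)) < eps:  # L1 norm of the change
            return nxt
        r = nxt
    return r

r1 = step(M, [F(1, 3)] * 3)
print(r1)                                      # first iterate: 1/3, 1/2, 1/6
print([float(x) for x in power_iteration(M)])  # close to 0.4, 0.4, 0.2
```

On this matrix the iterates approach (2/5, 2/5, 1/5), the same solution obtained from the flow equations with the constraint y + a + m = 1.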
42 Computing PageRank
- Random surfer model
(random walk)
- The PageRank of a page can also be
seen as expressing the probability that a random
surfer reaches it (that is, it expresses
how popular the page is)
- A random surfer starts from a random
page and keeps clicking on links,
without going back to a previous page
- At time t, the surfer is on
some page P
- At time t+1, the surfer follows
an outlink of P chosen
uniformly at random
- and reaches some page Q linked from P
- This process continues indefinitely
- Let p(t) be the vector whose i-th
component is the probability that the surfer is
on page i at time t
- p(t) is a probability distribution
over the pages
43 Computing PageRank
The stationary distribution
- Where is the surfer at time t+1?
- It follows a link uniformly at random
- p(t+1) = M p(t)
- Suppose the random walk reaches a state
where p(t+1) = M p(t) = p(t)
- Then p(t) is called a stationary distribution for
the random walk
- Because the rank vector r satisfies r = M r,
- it is a stationary distribution for the random
surfer
44 Computing PageRank
- A basic result from the theory of random
walks (Markov processes)
- For graphs that satisfy certain
conditions, the stationary distribution is
unique and is eventually reached regardless
of the initial probability distribution at
time t = 0 (convergence).
45 Extensions (random jump)
Spider traps
- A group of pages is a spider
trap if there are no edges from the group to
pages outside the group
- The random surfer gets trapped
- The conditions required by the random walk
theorem no longer hold
46 Extensions (random jump)
Spider traps (example)
[Figure: pages Yahoo (y), Amazon (a) and Msoft (m); Msoft links only to itself]
     y    a    m
y   1/2  1/2   0
a   1/2   0    0
m    0   1/2   1
Iterates (y, a, m):
1    1    1
1    1/2  3/2
3/4  1/2  7/4
5/8  3/8  2
. . .
0    0    3
47 Extensions (random jump)
Model extension
- At each step, the random surfer has two
options
- With probability β, follow a random link
- With probability 1-β, jump to some other page
at random
- Typical values for β: 0.8 - 0.9
- The surfer manages to leave the trap after
a few time steps
48 Extensions (random jump)
Model extension
Original definition of PageRank for a page A:
PR(A) = PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)
Definition with the damping factor d,
between 0 and 1:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
For the values to sum to 1, the first term becomes (1-d)/N. The first
term: any page is chosen with the same
probability
49 Extensions (random jump)
Random surfer model (physical interpretation): a
random surfer starts from a random page and
keeps clicking on links, without
going back to a previous page, but eventually
gets bored and starts again from some other random
page. With the formula above, d (the damping factor) is the
probability that, at each page, the random surfer
follows a link; with probability 1-d the surfer gets bored and starts again from some other random
page
50 Extensions (random jump)
Example (d = 0.8)
[Figure: pages Yahoo (y), Amazon (a) and Msoft (m); Msoft links only to itself]
      | 1/2  1/2   0 |         | 1/3  1/3  1/3 |
0.8 × | 1/2   0    0 | + 0.2 × | 1/3  1/3  1/3 |
      |  0   1/2   1 |         | 1/3  1/3  1/3 |
      y     a     m
y   7/15  7/15  1/15
a   7/15  1/15  1/15
m   1/15  7/15  13/15
Iterates (y, a, m):
1      1      1
1.00   0.60   1.40
0.84   0.60   1.56
0.776  0.536  1.688
. . .
7/11   5/11   21/11
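The effect of the random jump on the spider-trap example can be reproduced with a small sketch. It builds the matrix 0.8 M + 0.2/N and iterates from (1, 1, 1); the function names `teleport_matrix` and `iterate` and the fixed number of steps are assumptions made for the illustration.

```python
def teleport_matrix(M, beta):
    """A[i][j] = beta * M[i][j] + (1 - beta) / N."""
    n = len(M)
    return [[beta * M[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

def iterate(A, r, steps):
    """Apply r <- A r the given number of times."""
    for _ in range(steps):
        r = [sum(A[i][j] * r[j] for j in range(len(r))) for i in range(len(A))]
    return r

# Spider-trap example: m links only to itself (rows and columns ordered y, a, m).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
A = teleport_matrix(M, beta=0.8)

print([round(x, 3) for x in iterate(M, [1.0, 1.0, 1.0], 100)])  # no jump: m absorbs everything
print([round(x, 3) for x in iterate(A, [1.0, 1.0, 1.0], 1)])    # 1.0, 0.6, 1.4
print([round(x, 3) for x in iterate(A, [1.0, 1.0, 1.0], 100)])  # near 7/11, 5/11, 21/11
```

Without the jump the trap page m ends up with the whole score; with it, m still ranks highest, but y and a keep a positive share.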
51 Extensions (random jump)
Formulating the extension with teleports in
matrix form
- Let there be N pages
- Consider a page j with a set of outlinks O(j)
- Mij = 1/|O(j)| if j -> i and Mij = 0 otherwise
- The random teleport is equivalent to
- adding a random link from j to
every other page with probability (1-β)/N
- reducing the probability of following an
outlink from 1/|O(j)| to β/|O(j)|
- Or equivalently: charge every page a fraction
(1-β) of its value and redistribute it
uniformly
52 Extensions (random jump)
- Construct the NxN matrix A
- Aij = β Mij + (1-β)/N
- A is a stochastic matrix
- The page rank vector r is the principal
eigenvector of this matrix
- r = A r
- Equivalently, r is the stationary distribution of the
random walk with teleports
53 Extensions (dead ends)
Dead ends
- Pages without outlinks are dead ends for the random surfer
54 Extensions (dead ends)
[Figure: pages Yahoo (y), Amazon (a) and Msoft (m); Msoft has no outlinks]
      | 1/2  1/2   0 |         | 1/3  1/3  1/3 |
0.8 × | 1/2   0    0 | + 0.2 × | 1/3  1/3  1/3 |
      |  0   1/2   0 |         | 1/3  1/3  1/3 |
      y     a     m
y   7/15  7/15  1/15
a   7/15  1/15  1/15
m   1/15  7/15  1/15
Iterates (y, a, m):
1      1      1
1      0.6    0.6
0.787  0.547  0.387
0.648  0.430  0.333
. . .
0      0      0
55 Extensions (dead ends)
Handling dead ends
- Teleport
- From a dead end, follow a random teleport with
probability 1
- Modify the matrix accordingly
- Prune the dead ends and adjust the graph
- Pre-process the graph to delete the
dead ends
- May require multiple passes
- Compute page rank on the reduced graph
- Compute approximate values for the dead ends
by propagating values from the reduced graph
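The first option (always teleport from a dead end) can be sketched as follows. The function name, the dictionary-based graph representation and the fixed number of steps are assumptions made for the illustration; the graph is the dead-end variant of the y, a, m example.

```python
def pagerank_dead_end_teleport(links, beta=0.8, steps=100):
    """PageRank where a dead end teleports with probability 1.

    links: dict page -> list of pages it links to (every page appears as a key).
    """
    pages = list(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(steps):
        nxt = {p: 0.0 for p in pages}
        for p, targets in links.items():
            if targets:
                for q in targets:                 # follow a link with probability beta
                    nxt[q] += beta * r[p] / len(targets)
                share = (1 - beta) * r[p] / n     # teleport with probability 1 - beta
            else:
                share = r[p] / n                  # dead end: always teleport
            for q in pages:
                nxt[q] += share
        r = nxt
    return r

# Dead-end variant of the y, a, m example: m has no outlinks.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": []}
r = pagerank_dead_end_teleport(links)
print({p: round(v, 3) for p, v in r.items()}, round(sum(r.values()), 6))
```

Because every page now redistributes all of its score, the total stays at 1 instead of leaking to (0, 0, 0) as in the dead-end example.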
56 The PageRank Algorithm
A page can have a high PR if
many pages point to it, or when
some of the pages that point to it have a high
PR. Both cases matter: in the
second case, for example, there may be a link from Yahoo!
57 Spamdexing
Content spam
Link spam
- Google bombing: adding
references that directly affect the PR
- Link farms: pages that reference one another
58 PageRank (continued)
59 The PageRank Algorithm
If we view the Web as a graph, we want to find its
important/central nodes
According to PageRank, a node is important
if it is connected to important nodes. A quantity is assigned
to each page (node). The quantity depends on
how many pages point to it and is shared
among the pages it points to (recursive definition)
60 The PageRank Algorithm
Example
Each node has an initial PageRank value, which it
shares equally among the nodes it
points to. So on each edge: e.g. node 2 gives ½,
node 5 gives 1, etc.
Equivalently, this is the probability of moving to a given
node: random walks
M is the adjacency matrix (the transition matrix of a
Markov chain), r is the PageRank vector, r = M r
61 The PageRank Algorithm
Example
M is the adjacency matrix, r is the PageRank vector, r =
M r. r is the eigenvector that corresponds
to the eigenvalue λ = 1 (it is the largest eigenvalue,
because the matrix is column-stochastic)
62 The PageRank Algorithm
Teleport
- Matrix A
- Aij = β Mij + (1-β)/N
Fly-out probability: 1-β
r = A r
63 The PageRank Algorithm
Topic-Specific PageRank
- Computes popularity for a particular
topic
- E.g., computer science, health
- Bias the random walk
- When the random walker teleports, it picks
a page from a set S of web
pages
- S contains only pages that are relevant to the
topic
- E.g., the Open Directory (DMOZ) pages for a given
topic (www.dmoz.org)
- For each teleport set S, we get a different vector
rS
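A minimal sketch of the biased walk: it is ordinary PageRank with teleports, except that the teleport probability is spread only over the pages of S. The function name, the small three-page graph and the choice β = 0.8 are assumptions; every page is assumed to have at least one outlink.

```python
def topic_pagerank(links, topic_set, beta=0.8, steps=100):
    """Random walk whose teleports land uniformly on topic_set only.

    links: dict page -> list of outlinks (assumed non-empty for every page).
    """
    pages = list(links)
    r = {p: 1.0 / len(pages) for p in pages}
    for _ in range(steps):
        nxt = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:                     # follow a link with probability beta
                nxt[q] += beta * r[p] / len(targets)
        for s in topic_set:                       # teleport mass goes only to S
            nxt[s] += (1 - beta) / len(topic_set)
        r = nxt
    return r

links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
print(topic_pagerank(links, topic_set={"m"}))
print(topic_pagerank(links, topic_set={"y"}))
```

Changing S changes the resulting vector: each page scores higher when it belongs to the teleport set than when it does not.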
64 HITS
65 Introduction
Problems with using the link structure
of the Web: it is not enough for many
links to point to a page; a link does not necessarily imply
a positive opinion (endorsement of the page)
(some links are advertisements, plain
navigation, etc.). An authority on
a topic rarely links to a rival
authority in the same field. Authoritative pages
are rarely descriptive/representative
66 HITS: definition
The HITS algorithm (Hyperlink-Induced Topic Search)
For each topic there are two kinds of pages.
Authorities: a page that is an authority on a
topic and is recognized as such by other
pages (that is, many links point to
it).
Hubs: a page that
points to authoritative pages.
Basic idea: pages that are often referenced by other
pages should be authorities.
Pages that reference many
other pages should be good hubs.
[Figure: hubs pointing to authorities]
67 The HITS Algorithm
- Basic idea of HITS
- Good authorities are those that
good hubs point to
- Good hubs are those that
point to good authorities
- A recursive expression
Note: each page is assigned two values for
each topic, collected in the vectors h (hub) and a (authority)
68 The HITS Algorithm
The web as a directed graph. Nodes:
web pages. An edge from A to B: web page A
has a hyperlink to web page B. The
algorithm has 2 phases. Phase I
(sampling stage): collect a set of pages that
forms the base set for a topic. Phase
II (iterative stage): process the base
set to identify good authorities and
good hub pages
69 The HITS Algorithm
Phase I: computing the base set
1. Compute the initial set, the root set, with classic
methods, e.g. retrieving all pages that
contain the keywords (we expect it to
contain (at least) references to relevant
pages)
70 The HITS Algorithm
- Phase I: computing the base set
- (expanding the root set)
- 2. Link pages
- A page that either contains a link
to a node p in the root set (p
is an authority), or
- is linked to by a node p in the root set (p is a
hub)
- Base set: the root set expanded to
include the link pages as well. Its members are the base
pages
71 The HITS Algorithm
Phase II: which base pages are hubs
and authorities? Each base page p has two
values: hp, the hub score (many
pointers to authorities), and ap, the
authority score (many pointers from hubs to
it)
72 The HITS Algorithm
- Main differences from Page Rank
- Two values per page (authority, hub)
- Topic-specific: works on subsets of the web graph; we start from
the base set
73 The HITS Algorithm
Phase II: which base pages are hubs
and authorities?
Initialization: for all p, hp = 1 and ap = 1
Iteratively update:
ap = Σ hq, summed over the base pages q that
point to p
hp = Σ aq, summed over the base pages q
that p points to
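The two update rules can be sketched directly on a small graph. The four-page base set below is hypothetical, and so are the function name and the number of rounds; scores are rescaled to unit length after every round so that they stay bounded, and each hub update uses the authority scores computed in the same round.

```python
import math

def hits(links, iterations=50):
    """Iterate a_p = sum of h_q over q -> p and h_p = sum of a_q over p -> q.

    links: dict page -> list of pages it links to, restricted to the base set.
    """
    pages = list(links)
    h = {p: 1.0 for p in pages}
    a = {p: 1.0 for p in pages}
    for _ in range(iterations):
        a = {p: sum(h[q] for q in pages if p in links[q]) for p in pages}
        h = {p: sum(a[q] for q in links[p]) for p in pages}
        for scores in (a, h):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return h, a

# Hypothetical base set of four pages.
links = {"p1": ["p3", "p4"], "p2": ["p3", "p4"], "p3": ["p4"], "p4": []}
h, a = hits(links)
print(max(a, key=a.get), {p: round(v, 3) for p, v in h.items()})
```

In this graph p4 is pointed to by every other page and gets the highest authority score, while p1 and p2, which point to both p3 and p4, share the highest hub score.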
74 Adjacency Matrix
Matrix representation
Let the base set of pages be {1, 2, ...,
n}. Adjacency matrix B, of size n x
n: B[i, j] = 1 if page i contains a link
that points to page j (0 otherwise). Let h = <h1, h2, ...,
hn> be the vector of hub scores and a =
<a1, a2, ..., an> the vector of
authority scores (the counterpart of the rank vector r)
75 The HITS Algorithm
The update rules become h = B a, a = B^T h
1st iteration: h = B B^T h = (B B^T) h, a = B^T B a =
(B^T B) a
2nd iteration: h = (B B^T)^2 h, a = (B^T
B)^2 a
Converges to the eigenvectors of B B^T and B^T B, provided
the scores are normalized
76 The HITS Algorithm
Matrix formulation (example)
Pages: Netscape, Msoft, Amazon
        | 3  1  2 |
B B^T = | 1  1  0 |
        | 2  0  2 |
h = B B^T h: starting from h = (1, 1, 1), one step gives h = (6, 2, 4)
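The numbers in this example can be checked with plain list arithmetic. The adjacency matrix `B` below is an assumption: it is one link structure over Netscape, Msoft and Amazon whose product B B^T equals the matrix shown, since the slide text does not list B itself.

```python
# Assumed adjacency matrix (rows/columns: Netscape, Msoft, Amazon);
# B[i][j] = 1 if page i links to page j.
B = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(X, v):
    return [sum(x * y for x, y in zip(row, v)) for row in X]

BBt = matmul(B, transpose(B))
print(BBt)                     # [[3, 1, 2], [1, 1, 0], [2, 0, 2]]
print(matvec(BBt, [1, 1, 1]))  # [6, 2, 4]
```

Repeating the last step and rescaling after each multiplication is the power iteration for the principal eigenvector of B B^T, which gives the hub scores.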
77 The HITS Algorithm
[Figure: adjacency matrix of four documents d1, d2, d3, d4]
Initial values: a = h = 1
Iterate
Normalize
Again eigenvector problems
78 The HITS Algorithm
- Problems
- Drifting: when a hub covers many
topics
- Topic hijacking: when many pages from the same
web site point to the same popular node
79 A Few More Points on Search Engines
80 Google: other features
- Anchor Text
- The text in links is treated
specially
- Most search engines associate it
with the page on which it appears
- Google also associates it with the page the link points to
- It gives more accurate information about the pages it
points to than about the pages on which it
appears
- It can point to pages that have no
text, only images, programs, etc.
81 Google: architecture
Most of Google is implemented in C or C++ for
efficiency and can run in either Solaris or
Linux. The web crawling (downloading of web
pages) is done by several distributed crawlers.
There is a URLserver that sends lists of URLs to
be fetched to the crawlers. The web pages that
are fetched are then sent to the storeserver.
The storeserver then compresses and stores the
web pages into a repository.
82 Google: architecture
Every web page has an associated ID number called
a docID which is assigned whenever a new URL is
parsed out of a web page. The indexing function
is performed by the indexer and the sorter. The
indexer reads the repository, uncompresses the
documents, and parses them. Each document is converted into a set
of word occurrences called hits. Hits record the word,
position in document, an approximation of font
size, and capitalization. The indexer
distributes these hits into a set of "barrels",
creating a partially sorted forward index.
83 Google: architecture
Indexer: it also parses out all the links in every
web page and stores important information about
them in an anchors file. This file contains
enough information to determine where each link
points from and to, and the text of the link.
84 Google: architecture
URLresolver: relative URLs -> absolute URLs ->
docIDs. The sorter takes the barrels, which
are sorted by docID, and resorts them by wordID
to generate the inverted index.
Lexicon
85 Google: architecture
The searcher is run by a web server and uses the
lexicon built by DumpLexicon together with the
inverted index and the PageRanks to answer
queries.
86 Mining the World Wide Web
87 Introduction
- The WWW is a huge, widely distributed, global
information service center for
- Information services: news,
advertisements, consumer information, financial
management, education, government, e-commerce,
etc.
- Hyper-link information
- Access and usage information
- WWW provides rich sources for data mining
- Challenges
- Too huge for effective data warehousing and data
mining
- Too complex and heterogeneous: no standards and
structure
88 Introduction
- Growing and changing very rapidly
- Broad diversity of user communities
- Only a small portion of the information on the
Web is truly relevant or useful
- 99% of the Web information is useless to 99% of
Web users
- How can we find high-quality Web pages on a
specified topic?
89 Categories of Web Mining
Web Mining
- Web Structure Mining
- Web Content Mining: Web page content mining,
search result mining
- Web Usage Mining: general access pattern tracking,
customized usage tracking
90 Web Mining
- We look for
- Web access patterns
- Web structures
- Regularity and dynamics of Web contents
- Problems
- The abundance problem: the number of pages
associated with a term can be
very large
- Limited coverage of the Web: hidden Web sources,
majority of data in DBMS
- Limited query interface based on keyword-oriented
search
- Limited customization to individual users
91 Categories of Web Mining
Web Content Mining: Web Page Content Mining
- Web page summarization: WebLog (Lakshmanan et al. 1996),
WebOQL (Mendelzon et al. 1998): Web structuring
query languages; can identify information within
given web pages
- Ahoy! (Etzioni et al. 1997): uses
heuristics to distinguish personal home pages
from other web pages
- ShopBot (Etzioni et al.
1997): looks for product prices within web pages
92 Categories of Web Mining
Web Content Mining: Search Result Mining
- Search engine
result summarization
- Clustering search results
(Leouski and Croft, 1996; Zamir and Etzioni,
1997): groups the results
using phrases in titles and snippets
93 Categories of Web Mining
Web Structure Mining
- PageRank, HITS, small-world models
94 Categories of Web Mining
Web Usage Mining
- Mining Web log records to discover user access
patterns of Web pages
- Applications
- Target potential customers for electronic
commerce
- Enhance the quality and delivery of Internet
information services to the end user
- Improve Web server system performance
- Identify potential prime advertisement locations
- Web logs provide rich information about Web
dynamics
- Typical Web log entry includes the URL requested,
the IP address from which the request originated,
and a timestamp
95 Categories of Web Mining
Techniques for Web Usage Mining
- Construct multidimensional view on the Weblog
database
- Perform multidimensional OLAP analysis to find
the top N users, top N accessed Web pages, most
frequently accessed time periods, etc.
- Perform data mining on Weblog records
- Find association patterns, sequential patterns,
and trends of Web accessing
- May need additional information, e.g., user
browsing sequences of the Web pages in the Web
server buffer
- Conduct studies to
- Analyze system performance, improve system design
by Web caching, Web page prefetching, and Web
page swapping
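The simplest of these analyses (top N users, top N pages, busiest periods) reduce to counting over log entries. The sketch below uses made-up entries holding the three fields a typical Web log entry includes (IP address, timestamp, requested URL); the tuple layout, the ISO-style timestamp format and the function names are assumptions.

```python
from collections import Counter

# Hypothetical log entries: (ip_address, timestamp, requested_url).
log = [
    ("10.0.0.1", "2024-01-01T10:00:00", "/index.html"),
    ("10.0.0.2", "2024-01-01T10:00:05", "/products.html"),
    ("10.0.0.1", "2024-01-01T10:01:00", "/products.html"),
    ("10.0.0.3", "2024-01-01T11:15:00", "/index.html"),
    ("10.0.0.1", "2024-01-01T11:20:00", "/products.html"),
]

def top_n(log, field, n):
    """Most frequent values of one field (0 = IP, 1 = timestamp, 2 = URL)."""
    return Counter(entry[field] for entry in log).most_common(n)

def busiest_hours(log, n):
    """Most frequent hour of day, read from the ISO-style timestamp."""
    return Counter(ts[11:13] for _, ts, _ in log).most_common(n)

print(top_n(log, field=2, n=1))  # most accessed page
print(top_n(log, field=0, n=1))  # most active user (by IP address)
print(busiest_hours(log, 1))     # most frequently accessed hour
```

Identifying users by IP address is only a rough approximation; real usage mining usually needs session reconstruction on top of such counts.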