Title: Web Search, Web Community and Web Mining
1 Web Search, Web Community and Web Mining
- The lecture notes are edited based on:
- the lecture notes for "Mining the Web: Discovering Knowledge from Hypertext Data" by Soumen Chakrabarti
- a talk by Soumen Chakrabarti et al. on "Mining the Link Structure of the World Wide Web"
- a talk by Sanjay Kumar Madria on "Web Mining: A Bird's Eye View"
- a talk by Chen Li on "Searching and Integrating Information on the Web"
- a talk by Yanchun Zhang on "Web Search and Web Community"
- a talk by S. D. Kamvar et al. on "Extrapolation Methods for Accelerating PageRank Computations"
2 Web Bigger Than We Think
- The Web is expanding continuously
- Today's search engines cover only a fraction of the existing pages
- The Web is 500 times larger than the segment covered by standard search engines such as Yahoo! and AltaVista
- The Web holds about 550 billion documents, while search engines index a combined total of 1 billion pages (CNET, 26 July 2000)
3 Web Search Problem
- Web search tools such as Yahoo!, AltaVista, and Google return more information than users require
- Example search: "Java Programming". Google returns 1,330,000 (more than 1 million) Web pages; AltaVista returns 16,921,862 (more than 16 million) Web pages
- Users are only concerned with a small, interesting portion of the Web search results
4 Closer Look at the Problems
- Lacking the concept of the importance of each page on each topic
- E.g., my homepage is not as important as Yahoo's main page
- A link from Yahoo is more important than a link from a personal homepage
- But how do we capture the importance of a page?
- A guess: # of hits? But where do we get that info?
- # of inlinks to a page: Google's main idea
5 (Google) PageRank
- Intuition
- The importance of each page should be decided by what other pages say about this page
- One naïve implementation: count the # of pages pointing to each page (i.e., the # of inlinks)
- Problem
- We can easily fool this technique by generating many dummy pages that point to our class page
6 Google (PageRank)
- Assumption
- The importance of a page is proportional to the sum of the prestige scores of the pages linking to it
- Random surfer on a strongly connected web graph
[Figure: A's home page is linked by 2 important pages (Yahoo! and CNN), while B's home page is linked by 1 unimportant page (DB Pub Server)]
7 Page Importance
- The importance of a page is given by the importance of the pages that link to it:
$p_i = \sum_{j \in B(i)} \frac{p_j}{N_j}$
where $p_i$ is the importance of page $i$, $p_j$ is the importance of page $j$, $N_j$ is the number of outlinks from page $j$, and $B(i)$ is the set of pages $j$ that link to page $i$.
8 Link Counts Example
[Figure: example link graph over pages A and B with links from Yahoo!, CNN, and DB Pub Server]
9 Computing PageRank
- Initialize all pages to the same rank
- Repeat until convergence:
$p_i = \sum_{j \in B(i)} \frac{p_j}{N_j}$
where $p_i$ is the importance of page $i$, $N_j$ is the number of outlinks from page $j$, and $B(i)$ is the set of pages $j$ that link to page $i$.
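A minimal Python sketch of this iteration (not from the original slides; the three-page graph, tolerance, and iteration cap are illustrative choices):

# Power-iteration PageRank on a toy strongly connected graph.
# links[j] lists the pages that page j points to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, tol=1e-8, max_iter=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}        # initialize uniformly
    for _ in range(max_iter):
        new = {p: 0.0 for p in pages}
        for j, outlinks in links.items():
            share = rank[j] / len(outlinks)   # p_j / N_j
            for i in outlinks:
                new[i] += share               # p_i = sum over in-neighbors j
        if max(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new
    return rank

print(pagerank(links))  # A and C end up at 0.4, B at 0.2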
10 PageRank Diagram
- Initialize all nodes to the same rank
- Propagate ranks across links (multiplying by link weights)
[Figure: example graph with initial node ranks of 0.167 and 0.333 being propagated across the links]
11 PageRank Diagram
[Figure: the same graph after one more propagation step, with updated node ranks]
12 PageRank Diagram: The Surfing Model
- Correspondence between the surfer model and the notion of importance
- Page v has high prestige if its visit rate is high
- This happens if there are many neighbors u with high visit rates leading to v
[Figure: after a while, the ranks settle at 0.4, 0.4, and 0.2]
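A small simulation sketch (my own, not from the slides) showing that a random surfer's visit rates on the same illustrative graph approach the ranks computed by the iteration above:

import random

# Random-surfer simulation: long-run visit rates approximate
# PageRank on a strongly connected graph (hypothetical example).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def visit_rates(links, steps=100_000, seed=0):
    rng = random.Random(seed)
    counts = {p: 0 for p in links}
    page = next(iter(links))            # start anywhere
    for _ in range(steps):
        page = rng.choice(links[page])  # follow a random outlink
        counts[page] += 1
    return {p: c / steps for p, c in counts.items()}

print(visit_rates(links))  # close to the power-iteration ranks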
13 Example: MiniWeb
- Our MiniWeb has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS)
- Their weights are represented as a vector (ne, am, ms)
- For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS
[Figure: link graph over Ne, Am, and MS]
Materials by courtesy of Jeff Ullman
14 Iterative computation
[Table: the weight vector (ne, am, ms) over successive iterations]
- Final result: Netscape and Amazon have the same importance, and twice the importance of Microsoft
- Does it capture the intuition? Yes
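The iteration can be reproduced in a few lines. This sketch assumes the link structure of Ullman's classic example (Ne links to itself and to Am; Am links to Ne and MS; MS links to Am), which is consistent with the transition described on the previous slide and with the final result above:

import numpy as np

# Column j of M distributes page j's weight across its outlinks.
#               Ne   Am   MS
M = np.array([[0.5, 0.5, 0.0],   # Ne receives half of Ne, half of Am
              [0.5, 0.0, 1.0],   # Am receives half of Ne, all of MS
              [0.0, 0.5, 0.0]])  # MS receives half of Am

w = np.ones(3) / 3               # initial weight vector (ne, am, ms)
for _ in range(50):
    w = M @ w
print(w)  # -> approximately [0.4, 0.4, 0.2]: ne = am = 2 * ms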
15 Problem 1: Dead Ends!
- MS does not point to anybody
- Result: the weight of the Web leaks out (see the sketch after the next slide)
[Figure: MiniWeb graph in which MS has no outlinks]
16 Problem 2: Spider Traps
- MS only points to itself
- Result: all weight goes to MS!
[Figure: MiniWeb graph in which MS links only to itself]
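Both pathologies are easy to see numerically. This sketch (mine, not from the slides) modifies the MiniWeb matrix so that MS is first a dead end and then a spider trap:

import numpy as np

w0 = np.ones(3) / 3  # uniform start: (ne, am, ms)

# Dead end: MS has no outlinks, so its column is all zeros
# and total weight leaks out of the system.
dead_end = np.array([[0.5, 0.5, 0.0],
                     [0.5, 0.0, 0.0],
                     [0.0, 0.5, 0.0]])

# Spider trap: MS points only to itself and soaks up all weight.
spider_trap = np.array([[0.5, 0.5, 0.0],
                        [0.5, 0.0, 0.0],
                        [0.0, 0.5, 1.0]])

w = w0.copy()
for _ in range(50):
    w = dead_end @ w
print(w.sum())  # -> near 0: the weight has leaked out

w = w0.copy()
for _ in range(50):
    w = spider_trap @ w
print(w)        # -> approximately [0, 0, 1]: MS gets everything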
17 PageRank at Google
- The ranking of pages matters more than the exact values of $p_i$
- Page ranks converged in 52 iterations for a crawl with 322 million links
- Pre-compute and store the PageRank of each page
- PageRank is independent of any query or textual content
- The ranking scheme combines PageRank with textual match
- Unpublished
- Many empirical parameters, human effort, and regression testing
- Criticism: ad-hoc coupling and decoupling between relevance and importance
18 Hubs and Authorities
- Motivation: find web pages related to a topic
- E.g., find all web sites about automobiles
- Authority: a page that offers info about a topic
- E.g., DBLP is a page about papers
- E.g., google.com, aj.com, teoma.com, lycos.com
- Hub: a page that doesn't provide much info itself, but tells us where to find pages about a topic
- E.g., www.searchenginewatch.com is a hub of search engines
19 Two Values of a Page
- Each page has a hub value and an authority value
- In PageRank, each page has only one value (its weight)
- Two vectors:
- H: hub values
- A: authority values
20 HITS: Find Hubs and Authorities
- First step: find pages related to the topic (e.g., automobile), and construct the corresponding focused subgraph
- Find the pages S containing the keyword (automobile)
- Find all pages these S pages point to, i.e., their forward neighbors
- Find all pages that point to S pages, i.e., their backward neighbors
- Compute the subgraph of these pages
[Figure: root set and the focused subgraph built from its forward and backward neighbors]
21 Computing H and A
- Initially, set every hub and authority value to 1
- In each iteration, the hub value of a page is the total authority value of its forward neighbors
- The authority value of each page is the total hub value of its backward neighbors
- Iterate until convergence (see the sketch after slide 23)
[Figure: hubs on one side pointing to authorities on the other]
22 HITS: How to Count?
- Hubs count OUT links
- Authorities count IN links
23 HITS: Calculating Values
- Authority value: $a_i = \sum_{j: j \to i} h_j$
- Hub value: $h_i = \sum_{j: i \to j} a_j$
- Matrix notation, with A the adjacency matrix: $a = A^T h$ and $h = A a$ (normalized after each step)
- A(i, j) = 1 if the i-th page points to the j-th page
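A compact sketch of the HITS iteration in matrix form, on a hypothetical 4-page focused subgraph (the graph, iteration count, and normalization by Euclidean norm are my illustrative choices):

import numpy as np

# adj[i, j] = 1 if page i points to page j (made-up subgraph:
# pages 0, 1, and 3 all point to page 2, the likely authority).
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 0],
                [0, 0, 1, 0]], dtype=float)

def hits(adj, iters=50):
    n = adj.shape[0]
    h = np.ones(n)              # hub scores, all initialized to 1
    a = np.ones(n)              # authority scores
    for _ in range(iters):
        a = adj.T @ h           # authority = total hub value of backward neighbors
        h = adj @ a             # hub = total authority value of forward neighbors
        a /= np.linalg.norm(a)  # normalize so the values don't blow up
        h /= np.linalg.norm(h)
    return h, a

h, a = hits(adj)
print("hubs:", h.round(3))
print("authorities:", a.round(3))  # page 2 dominates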
24 PageRank vs. HITS
- PageRank advantages over HITS
- Query-time cost is low
- HITS computes an eigenvector for every query
- Less susceptible to localized link spam
- HITS advantages over PageRank
- HITS ranking is sensitive to the query
- HITS has the notion of hubs and authorities
25 Web Community
- Suppose one is familiar with some Web pages on a specific topic, such as sports
- Goal: find more pages about the same topic
- Web community: an entity of related web pages (centers)
26 What is a Cyber-Community?
- A community on the web is a group of web pages sharing a common interest
- E.g., a group of web pages talking about pop music
- E.g., a group of web pages interested in data mining
- Main properties:
- Pages in the same community should be similar to each other in content
- Pages in one community should differ from pages in another community
- Similar to a cluster
27 Two Different Types of Communities
- Explicitly-defined communities
- They are well-known ones, such as the resources listed by Yahoo!
- Implicitly-defined communities
- They are communities unexpected by or invisible to most users
- How to find them!?
[Figure: example hierarchy: Arts branches into Music and Painting; Music branches into Classic and Pop; e.g., the group of web pages interested in a particular singer]
28 Similarity of Web Pages
- Discovering web communities is similar to clustering. For clustering, we must define the similarity of two nodes
- Method I
- For page A and page B, A is related to B if there is a hyperlink from A to B, or from B to A
- Not so good. Consider the home pages of IBM and Microsoft: as competitors, they don't point to each other
[Figure: page A with a hyperlink to page B]
29 Similarity of Web Pages
- Method II (from bibliometrics)
- Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B
- Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B
[Figure: co-citation (pages pointing to both A and B) vs. bibliographic coupling (pages pointed to by both A and B)]
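Both measures can be read directly off the adjacency matrix: with A(i, j) = 1 when page i points to page j, co-citation counts are the entries of $A^T A$ and bibliographic-coupling counts are the entries of $A A^T$. A small sketch on a made-up graph:

import numpy as np

# adj[i, j] = 1 if page i points to page j (hypothetical graph:
# pages 0 and 1 both cite pages 2 and 3).
adj = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])

cocitation = adj.T @ adj  # (i, j): # of pages citing both i and j
coupling = adj @ adj.T    # (i, j): # of pages cited by both i and j

print(cocitation[2, 3])   # -> 2: pages 0 and 1 both cite 2 and 3
print(coupling[0, 1])     # -> 2: pages 0 and 1 share two references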
30 Methods of Clustering
- All of them can discover meaningful communities
- But these methods are very expensive when applied to the whole World Wide Web, with its billions of web pages
31 An Effective Method
- The method is from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins (IBM Almaden Research Center)
- They call their method community trawling (CT)
- They implemented it on a graph of 200 million pages, and it worked very well
32 Basic Idea of CT
- Dense directed bipartite subgraphs
- Bipartite graph: nodes are partitioned into two sets, F and C
- Every directed edge in the graph is directed from a node u in F to a node v in C
- Dense: many of the possible edges between F and C are present
[Figure: bipartite graph with node sets F and C]
33 Basic Idea of CT
- Bipartite cores
- An (i, j) bipartite core is a complete bipartite subgraph with at least i nodes from F and at least j nodes from C
- i and j are tunable parameters
- Every community has such a core for certain i and j
- A bipartite core is the identity of a community
- To extract all the communities is to enumerate all the bipartite cores on the web (a brute-force sketch follows below)
[Figure: an (i = 3, j = 3) bipartite core]
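A brute-force sketch (mine; the trawling paper instead uses iterative pruning so it can scale to the whole web) that enumerates (i, j) cores in a toy graph:

from itertools import combinations

# links[u] = set of pages u points to (hypothetical tiny graph).
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3", "c4"},
}

def find_cores(links, i, j):
    """Yield (F, C) pairs forming a complete (i, j) bipartite core."""
    fans = list(links)
    for F in combinations(fans, i):
        # Candidate centers: pages pointed to by every fan in F.
        common = set.intersection(*(links[f] for f in F))
        for C in combinations(sorted(common), j):
            yield F, C

for F, C in find_cores(links, i=3, j=3):
    print(F, "->", C)  # f1, f2, f3 all point to c1, c2, c3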
34 Experiment on CT
- 200 million web pages
- An IBM PC with an Intel 300MHz Pentium II processor and 512MB of memory, running Linux
- i from 3 to 10 and j from 3 to 20
- 200k potential communities were discovered
- 29% of them cannot be found in Yahoo!
35 Mining the World-Wide Web
- The WWW is a huge, widely distributed, global information source for:
- Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
- Hyperlink information
- Access and usage information
- Web site contents and organization
36 Web Mining Taxonomy
[Figure: taxonomy tree: Web Mining branches into Web Content Mining, Web Structure Mining, and Web Usage Mining]
37 Web Content Mining
- Discovery of useful information from web contents / data / documents
- Web data contents: text, image, audio, video, metadata, and hyperlinks
- Information Retrieval view (structured and semi-structured):
- Assist / improve information finding
- Filter information to users based on user profiles
- Database view:
- Model data on the web
- Integrate it for more sophisticated queries
38 Web Structure Mining
- Discover the link structure of the hyperlinks at the inter-document level to generate a structural summary of the Web site and its pages
- Direction 1: based on the hyperlinks, categorize the Web pages and generate information
- Direction 2: discover the structure of the Web document itself
- Direction 3: discover the nature of the hierarchy or network of hyperlinks in the Web sites of a particular domain
39 Web Structure Mining
- Finding authoritative Web pages
- Retrieving pages that are not only relevant but also of high quality, or authoritative on the topic
- Hyperlinks can infer the notion of authority
- The Web consists not only of pages, but also of hyperlinks pointing from one page to another
- These hyperlinks contain an enormous amount of latent human annotation
- A hyperlink pointing to another Web page can be considered the author's endorsement of that page
40 Web Usage Mining
- Also known as Web log mining
- Mining techniques to discover interesting usage patterns from the secondary data derived from users' interactions while surfing the web
41 Web Usage Mining
- Applications
- Target potential customers for electronic commerce
- Enhance the quality and delivery of Internet information services to the end user
- Improve Web server system performance
- Identify potential prime advertisement locations
- Facilitate personalization / adaptive sites
- Improve site design
- Fraud / intrusion detection
- Predict users' actions (allows prefetching)