Title: Web Search, Web Community and Web Mining
1 Web Search, Web Community and Web Mining
- The lecture notes are edited based on:
- the lecture notes for "Mining the Web: Discovering Knowledge from Hypertext Data" by Soumen Chakrabarti
- a talk by Soumen Chakrabarti et al. on "Mining the Link Structure of the World Wide Web"
- a talk by Sanjay Kumar Madria on "Web Mining: A Bird's Eye View"
- a talk by Chen Li on "Searching and Integrating Information on the Web"
- a talk by Yanchun Zhang on "Web Search and Web Community"
- a talk by S. D. Kamvar et al. on "Extrapolation Methods for Accelerating PageRank Computations"
2 Web Bigger Than We Think
- The Web is expanding continuously
- Today's search engines cover only a fraction of the existing pages
- The Web is 500 times larger than the segment covered by standard search engines such as Yahoo! and AltaVista
- The Web holds about 550 billion documents, while search engines index a combined total of 1 billion pages (CNET, 26 July 2000)
3 Web Search Problem
- Web search tools such as Yahoo!, AltaVista, and Google return more information than users require
- Example search: "Java Programming". Google returns 1,330,000 (more than 1 million) Web pages; AltaVista returns 16,921,862 (more than 16 million) Web pages
- Users are only concerned with a small, interesting portion of the Web search results
4 Closer Look at the Problems
- Lacking the concept of the importance of each page on each topic
- E.g., my homepage is not as important as Yahoo's main page
- A link from Yahoo is more important than a link from a personal homepage
- But how do we capture the importance of a page?
- A guess: # of hits? But where do we get that info?
- # of inlinks to a page: Google's main idea
5 (Google) PageRank
- Intuition
- The importance of each page should be decided by what other pages say about this page
- One naïve implementation: count the # of pages pointing to each page (i.e., the # of inlinks)
- Problem
- We can easily fool this technique by generating many dummy pages that point to our class page
6 Google (PageRank)
- Assumption
- The importance of a page is proportional to the sum of the prestige scores of the pages linking to it
- Random surfer on a strongly connected web graph
[Figure: A's home page is linked by 2 important pages (Yahoo! and CNN), while B's home page is linked by 1 unimportant page (DB Pub Server)]
7 Page Importance
- The importance of a page is given by the importance of the pages that link to it:
$p_i = \sum_{j \in B(i)} \frac{p_j}{N_j}$
where $p_i$ is the importance of page $i$, $p_j$ is the importance of page $j$, $N_j$ is the number of outlinks from page $j$, and $B(i)$ is the set of pages $j$ that link to page $i$.
8 Link Counts Example
[Figure: example link graph over pages A and B with links from Yahoo!, CNN, and DB Pub Server]
9 Computing PageRank
- Initialize all pages to the same rank
- Repeat until convergence:
$p_i = \sum_{j \in B(i)} \frac{p_j}{N_j}$
where $p_i$ is the importance of page $i$, $N_j$ is the number of outlinks from page $j$, and $B(i)$ is the set of pages $j$ that link to page $i$.
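A minimal Python sketch of this iteration (not from the original slides; the three-page graph, tolerance, and iteration cap are illustrative choices):

# Power-iteration PageRank on a toy strongly connected graph.
# links[j] lists the pages that page j points to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, tol=1e-8, max_iter=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}        # initialize uniformly
    for _ in range(max_iter):
        new = {p: 0.0 for p in pages}
        for j, outlinks in links.items():
            share = rank[j] / len(outlinks)   # p_j / N_j
            for i in outlinks:
                new[i] += share               # p_i = sum over in-neighbors j
        if max(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new
    return rank

print(pagerank(links))  # A and C end up at 0.4, B at 0.2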
10 PageRank Diagram
- Initialize all nodes to the same rank
- Propagate ranks across links (multiplying by link weights)
[Figure: example graph with initial node ranks of 0.167 and 0.333 being propagated across the links]
11 PageRank Diagram
[Figure: the same graph after one more propagation step, with updated node ranks]
12 PageRank Diagram: The Surfing Model
- Correspondence between the surfer model and the notion of importance
- Page v has high prestige if its visit rate is high
- This happens if there are many neighbors u with high visit rates leading to v
[Figure: after a while, the ranks settle at 0.4, 0.4, and 0.2]
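A small simulation sketch (my own, not from the slides) showing that a random surfer's visit rates on the same illustrative graph approach the ranks computed by the iteration above:

import random

# Random-surfer simulation: long-run visit rates approximate
# PageRank on a strongly connected graph (hypothetical example).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def visit_rates(links, steps=100_000, seed=0):
    rng = random.Random(seed)
    counts = {p: 0 for p in links}
    page = next(iter(links))            # start anywhere
    for _ in range(steps):
        page = rng.choice(links[page])  # follow a random outlink
        counts[page] += 1
    return {p: c / steps for p, c in counts.items()}

print(visit_rates(links))  # close to the power-iteration ranks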
13 Example: MiniWeb
- Our MiniWeb has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS)
- Their weights are represented as a vector (ne, am, ms)
- For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS
[Figure: link graph over Ne, Am, and MS]
Materials by courtesy of Jeff Ullman
14 Iterative computation
[Table: the weight vector (ne, am, ms) over successive iterations]
- Final result: Netscape and Amazon have the same importance, and twice the importance of Microsoft
- Does it capture the intuition? Yes
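The iteration can be reproduced in a few lines. This sketch assumes the link structure of Ullman's classic example (Ne links to itself and to Am; Am links to Ne and MS; MS links to Am), which is consistent with the transition described on the previous slide and with the final result above:

import numpy as np

# Column j of M distributes page j's weight across its outlinks.
#               Ne   Am   MS
M = np.array([[0.5, 0.5, 0.0],   # Ne receives half of Ne, half of Am
              [0.5, 0.0, 1.0],   # Am receives half of Ne, all of MS
              [0.0, 0.5, 0.0]])  # MS receives half of Am

w = np.ones(3) / 3               # initial weight vector (ne, am, ms)
for _ in range(50):
    w = M @ w
print(w)  # -> approximately [0.4, 0.4, 0.2]: ne = am = 2 * ms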
15 Problem 1: Dead Ends!
- MS does not point to anybody
- Result: the weight of the Web leaks out (see the sketch after the next slide)
[Figure: MiniWeb graph in which MS has no outlinks]
16 Problem 2: Spider Traps
- MS only points to itself
- Result: all weight goes to MS!
[Figure: MiniWeb graph in which MS links only to itself]
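Both pathologies are easy to see numerically. This sketch (mine, not from the slides) modifies the MiniWeb matrix so that MS is first a dead end and then a spider trap:

import numpy as np

w0 = np.ones(3) / 3  # uniform start: (ne, am, ms)

# Dead end: MS has no outlinks, so its column is all zeros
# and total weight leaks out of the system.
dead_end = np.array([[0.5, 0.5, 0.0],
                     [0.5, 0.0, 0.0],
                     [0.0, 0.5, 0.0]])

# Spider trap: MS points only to itself and soaks up all weight.
spider_trap = np.array([[0.5, 0.5, 0.0],
                        [0.5, 0.0, 0.0],
                        [0.0, 0.5, 1.0]])

w = w0.copy()
for _ in range(50):
    w = dead_end @ w
print(w.sum())  # -> near 0: the weight has leaked out

w = w0.copy()
for _ in range(50):
    w = spider_trap @ w
print(w)        # -> approximately [0, 0, 1]: MS gets everything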
17 PageRank at Google
- The ranking of pages matters more than the exact values of $p_i$
- Page ranks converged in 52 iterations for a crawl with 322 million links
- Pre-compute and store the PageRank of each page
- PageRank is independent of any query or textual content
- The ranking scheme combines PageRank with textual match
- Unpublished
- Many empirical parameters, human effort, and regression testing
- Criticism: ad-hoc coupling and decoupling between relevance and importance
18 Hubs and Authorities
- Motivation: find web pages related to a topic
- E.g., find all web sites about automobiles
- Authority: a page that offers info about a topic
- E.g., DBLP is a page about papers
- E.g., google.com, aj.com, teoma.com, lycos.com
- Hub: a page that doesn't provide much info itself, but tells us where to find pages about a topic
- E.g., www.searchenginewatch.com is a hub of search engines
19 Two Values of a Page
- Each page has a hub value and an authority value
- In PageRank, each page has only one value (its weight)
- Two vectors:
- H: hub values
- A: authority values
20 HITS: Find Hubs and Authorities
- First step: find pages related to the topic (e.g., automobile), and construct the corresponding focused subgraph
- Find the pages S containing the keyword (automobile)
- Find all pages these S pages point to, i.e., their forward neighbors
- Find all pages that point to S pages, i.e., their backward neighbors
- Compute the subgraph of these pages
[Figure: root set and the focused subgraph built from its forward and backward neighbors]
21 Computing H and A
- Initially, set every hub and authority value to 1
- In each iteration, the hub value of a page is the total authority value of its forward neighbors
- The authority value of each page is the total hub value of its backward neighbors
- Iterate until convergence (see the sketch after slide 23)
[Figure: hubs on one side pointing to authorities on the other]
22 HITS: How to Count?
- Hubs count OUT links
- Authorities count IN links
23 HITS: Calculating Values
- Authority value: $a_i = \sum_{j: j \to i} h_j$
- Hub value: $h_i = \sum_{j: i \to j} a_j$
- Matrix notation, with A the adjacency matrix: $a = A^T h$ and $h = A a$ (normalized after each step)
- A(i, j) = 1 if the i-th page points to the j-th page
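A compact sketch of the HITS iteration in matrix form, on a hypothetical 4-page focused subgraph (the graph, iteration count, and normalization by Euclidean norm are my illustrative choices):

import numpy as np

# adj[i, j] = 1 if page i points to page j (made-up subgraph:
# pages 0, 1, and 3 all point to page 2, the likely authority).
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 0],
                [0, 0, 1, 0]], dtype=float)

def hits(adj, iters=50):
    n = adj.shape[0]
    h = np.ones(n)              # hub scores, all initialized to 1
    a = np.ones(n)              # authority scores
    for _ in range(iters):
        a = adj.T @ h           # authority = total hub value of backward neighbors
        h = adj @ a             # hub = total authority value of forward neighbors
        a /= np.linalg.norm(a)  # normalize so the values don't blow up
        h /= np.linalg.norm(h)
    return h, a

h, a = hits(adj)
print("hubs:", h.round(3))
print("authorities:", a.round(3))  # page 2 dominates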
24 PageRank vs. HITS
- PageRank advantages over HITS
- Query-time cost is low
- HITS computes an eigenvector for every query
- Less susceptible to localized link spam
- HITS advantages over PageRank
- HITS ranking is sensitive to the query
- HITS has the notion of hubs and authorities
25 Web Community
- Suppose one is familiar with some Web pages on a specific topic, such as sports
- Goal: find more pages about the same topic
- Web community: an entity of related web pages (centers)
26 What is a Cyber-Community?
- A community on the web is a group of web pages sharing a common interest
- E.g., a group of web pages talking about pop music
- E.g., a group of web pages interested in data mining
- Main properties:
- Pages in the same community should be similar to each other in content
- Pages in one community should differ from pages in another community
- Similar to a cluster
27 Two Different Types of Communities
- Explicitly-defined communities
- They are well-known ones, such as the resources listed by Yahoo!
- Implicitly-defined communities
- They are communities unexpected by or invisible to most users
- How to find them!?
[Figure: example hierarchy: Arts branches into Music and Painting; Music branches into Classic and Pop; e.g., the group of web pages interested in a particular singer]
28 Similarity of Web Pages
- Discovering web communities is similar to clustering. For clustering, we must define the similarity of two nodes
- Method I
- For page A and page B, A is related to B if there is a hyperlink from A to B, or from B to A
- Not so good. Consider the home pages of IBM and Microsoft: as competitors, they don't point to each other
[Figure: page A with a hyperlink to page B]
29 Similarity of Web Pages
- Method II (from bibliometrics)
- Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B
- Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B
[Figure: co-citation (pages pointing to both A and B) vs. bibliographic coupling (pages pointed to by both A and B)]
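Both measures can be read directly off the adjacency matrix: with A(i, j) = 1 when page i points to page j, co-citation counts are the entries of $A^T A$ and bibliographic-coupling counts are the entries of $A A^T$. A small sketch on a made-up graph:

import numpy as np

# adj[i, j] = 1 if page i points to page j (hypothetical graph:
# pages 0 and 1 both cite pages 2 and 3).
adj = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])

cocitation = adj.T @ adj  # (i, j): # of pages citing both i and j
coupling = adj @ adj.T    # (i, j): # of pages cited by both i and j

print(cocitation[2, 3])   # -> 2: pages 0 and 1 both cite 2 and 3
print(coupling[0, 1])     # -> 2: pages 0 and 1 share two references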
30 Methods of Clustering
- All of them can discover meaningful communities
- But these methods are very expensive when applied to the whole World Wide Web, with its billions of web pages
31 An Effective Method
- The method is from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins (IBM Almaden Research Center)
- They call their method community trawling (CT)
- They implemented it on a graph of 200 million pages, and it worked very well
32 Basic Idea of CT
- Dense directed bipartite subgraphs
- Bipartite graph: nodes are partitioned into two sets, F and C
- Every directed edge in the graph is directed from a node u in F to a node v in C
- Dense: many of the possible edges between F and C are present
[Figure: bipartite graph with node sets F and C]
33 Basic Idea of CT
- Bipartite cores
- An (i, j) bipartite core is a complete bipartite subgraph with at least i nodes from F and at least j nodes from C
- i and j are tunable parameters
- Every community has such a core for certain i and j
- A bipartite core is the identity of a community
- To extract all the communities is to enumerate all the bipartite cores on the web (a brute-force sketch follows below)
[Figure: an (i = 3, j = 3) bipartite core]
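A brute-force sketch (mine; the trawling paper instead uses iterative pruning so it can scale to the whole web) that enumerates (i, j) cores in a toy graph:

from itertools import combinations

# links[u] = set of pages u points to (hypothetical tiny graph).
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3", "c4"},
}

def find_cores(links, i, j):
    """Yield (F, C) pairs forming a complete (i, j) bipartite core."""
    fans = list(links)
    for F in combinations(fans, i):
        # Candidate centers: pages pointed to by every fan in F.
        common = set.intersection(*(links[f] for f in F))
        for C in combinations(sorted(common), j):
            yield F, C

for F, C in find_cores(links, i=3, j=3):
    print(F, "->", C)  # f1, f2, f3 all point to c1, c2, c3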
34 Experiment on CT
- 200 million web pages
- An IBM PC with an Intel 300MHz Pentium II processor and 512MB of memory, running Linux
- i from 3 to 10 and j from 3 to 20
- 200k potential communities were discovered
- 29% of them cannot be found in Yahoo!
35 Mining the World-Wide Web
- The WWW is a huge, widely distributed, global information source for:
- Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
- Hyperlink information
- Access and usage information
- Web site contents and organization
36 Web Mining Taxonomy
[Figure: taxonomy tree: Web Mining branches into Web Content Mining, Web Structure Mining, and Web Usage Mining]
37 Web Content Mining
- Discovery of useful information from web contents / data / documents
- Web data contents: text, image, audio, video, metadata, and hyperlinks
- Information Retrieval view (structured and semi-structured):
- Assist / improve information finding
- Filter information to users based on user profiles
- Database view:
- Model data on the web
- Integrate it for more sophisticated queries
38 Web Structure Mining
- Discover the link structure of the hyperlinks at the inter-document level to generate a structural summary of the Web site and its pages
- Direction 1: based on the hyperlinks, categorize the Web pages and generate information
- Direction 2: discover the structure of the Web document itself
- Direction 3: discover the nature of the hierarchy or network of hyperlinks in the Web sites of a particular domain
39 Web Structure Mining
- Finding authoritative Web pages
- Retrieving pages that are not only relevant but also of high quality, or authoritative on the topic
- Hyperlinks can infer the notion of authority
- The Web consists not only of pages, but also of hyperlinks pointing from one page to another
- These hyperlinks contain an enormous amount of latent human annotation
- A hyperlink pointing to another Web page can be considered the author's endorsement of that page
40 Web Usage Mining
- Also known as Web log mining
- Mining techniques to discover interesting usage patterns from the secondary data derived from users' interactions while surfing the web
41 Web Usage Mining
- Applications
- Target potential customers for electronic commerce
- Enhance the quality and delivery of Internet information services to the end user
- Improve Web server system performance
- Identify potential prime advertisement locations
- Facilitate personalization / adaptive sites
- Improve site design
- Fraud / intrusion detection
- Predict users' actions (allows prefetching)