Title: Web Search, Web Community and Web Mining
1Web Search, Web Community and Web Mining
- These lecture notes are based on: the lecture notes for Mining the Web: Discovering Knowledge from Hypertext Data by Soumen Chakrabarti; a talk by Soumen Chakrabarti et al. on Mining the Link Structure of the World Wide Web; a talk by Sanjay Kumar Madria on Web Mining: A Bird's Eye View; a talk by Chen Li on Searching and Integrating Information on the Web; a talk by Yanchun Zhang on Web Search Web Community; and a talk by S. D. Kamvar et al. on Extrapolation Methods for Accelerating PageRank Computations.
2Web Bigger Than We Think
- Web is expanding continuously
- Today's search engines cover only a fraction of the existing pages.
- The Web is 500 times larger than the segment covered by standard search engines such as Yahoo! and AltaVista.
- The Web holds about 550 billion documents, while search engines index a combined total of 1 billion pages (CNet, 26 July 2000).
3Web Search Problem
- Web search tools such as Yahoo!, AltaVista, and Google return more information than users require.
- Example: a search for Java Programming returns 1,330,000 (more than 1 million) Web pages on Google and 16,921,862 (more than 16 million) Web pages on AltaVista.
- Users care about only a small, interesting portion of the Web search results.
4Closer Look at the Problems
- Search tools lack a concept of the importance of each page on each topic.
- E.g., my homepage is not as important as Yahoo!'s main page.
- A link from Yahoo! is more important than a link from a personal homepage.
- But how do we capture the importance of a page?
- A guess: the number of hits? But where would we get that info?
- The number of inlinks to a page? This is Google's main idea.
5(Google) PageRank
- Intuition
- The importance of each page should be decided by what other pages say about it.
- One naïve implementation: count the number of pages pointing to each page (i.e., the number of inlinks).
- Problem
- We can easily fool this technique by generating many dummy pages that point to our class page.
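The weakness of plain inlink counting can be sketched in a few lines of Python. This is only an illustration: the page names and the tiny link graph are hypothetical, not taken from the lecture.

```python
from collections import Counter

def inlink_counts(links):
    """Count how many pages point to each page.

    `links` maps a page to the set of pages it links to.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            counts[target] += 1
    return counts

# Hypothetical web: Yahoo and a personal page both link to the class page.
web = {
    "yahoo": {"class_page", "news"},
    "personal": {"class_page"},
    "news": {"yahoo"},
}
honest = inlink_counts(web)

# Spam: 100 dummy pages that exist only to link to the class page.
spammed = dict(web)
for k in range(100):
    spammed[f"dummy{k}"] = {"class_page"}
gamed = inlink_counts(spammed)
```

With the dummy pages added, the class page's count jumps from 2 to 102 while Yahoo's stays at 1, even though nothing about the class page's real importance has changed.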
6Google (PageRank)
- Assumption
- The importance of a page is proportional to the sum of the prestige scores of the pages linking to it.
- Random surfer on a strongly connected web graph
[Figure: A's home page is linked by 2 important pages; B's home page is linked by 1 unimportant page. Node shown: DB Pub Server.]
7Page Importance
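- The importance of a page is given by the importance of the pages that link to it:
$$p_i = \sum_{j \in B_i} \frac{p_j}{N_j}$$
where $p_i$ is the importance of page i, $p_j$ is the importance of page j, $N_j$ is the number of outlinks from page j, and $B_i$ is the set of pages j that link to page i.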
8Link Counts Example
[Figure: link counts example (DB Pub Server)]
9Computing PageRank
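- Initialize: give every page the same starting importance $p_i^{(0)}$.
- Repeat until convergence:
$$p_i^{(k+1)} = \sum_{j \in B_i} \frac{p_j^{(k)}}{N_j}$$
where $p_i$ is the importance of page i, $p_j$ is the importance of page j, $N_j$ is the number of outlinks from page j, and $B_i$ is the set of pages j that link to page i.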
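The initialize-and-repeat procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the uniform starting value 1/n, the tolerance, the function name `pagerank_simple`, and the three-page graph are all choices made here for illustration, and the basic form shown requires every page to have at least one outlink.

```python
def pagerank_simple(links, tol=1e-10, max_iter=1000):
    """Repeat: importance of i = sum over pages j linking to i of
    (importance of j) / (number of outlinks from j).

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # assumed uniform start
    for _ in range(max_iter):
        nxt = {page: 0.0 for page in pages}
        for j, targets in links.items():
            share = rank[j] / len(targets)  # importance of j split over its outlinks
            for i in targets:
                nxt[i] += share
        change = sum(abs(nxt[page] - rank[page]) for page in pages)
        rank = nxt
        if change < tol:  # converged: the values no longer move
            break
    return rank

# Hypothetical strongly connected graph: a -> b, c; b -> c; c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_simple(graph)
```

On this graph the values settle at 0.4 for `a` and `c` and 0.2 for `b`: page `b` receives only half of `a`'s importance, while `c` collects the other half plus all of `b`'s.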
10PageRank Diagram
Initialize all nodes to rank
Propagate ranks across links (multiplying by link
11PageRank Diagram
12PageRank Diagram
The Surfing Model
- Correspondence between the surfer model and the notion of importance
- Page v has high prestige if its visit rate is high.
- This happens if there are many neighbors u with high visit rates leading to v.
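The link between visit rate and prestige can be checked with a small simulation. This is a rough sketch, not part of the original slides: the graph, the page names `u`, `v`, `w`, the step count, and the fixed random seed are all assumptions made for illustration.

```python
import random

def surf(links, steps, seed=0):
    """Follow a uniformly random outlink at every step and
    return the fraction of steps spent on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = pages[0]
    for _ in range(steps):
        current = rng.choice(links[current])
        visits[current] += 1
    return {page: visits[page] / steps for page in pages}

# Hypothetical graph: u -> u, v; v -> u, w; w -> v.
graph = {"u": ["u", "v"], "v": ["u", "w"], "w": ["v"]}
rates = surf(graph, steps=200_000)
```

For this graph the long-run visit rates are 0.4, 0.4 and 0.2 for `u`, `v` and `w`, and the simulated rates should land close to those values. Page `w` is visited least because its only inlink comes from `v`, which passes on just half of its own visits.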
After a while
13Example MiniWeb
- Our MiniWeb has only three web sites: Netscape (NE), Amazon (AM), and Microsoft (MS).
- Their weights are represented as a vector.
- For instance, in each iteration, half of the weight of AM goes to NE, and half goes to MS.
Materials by courtesy of Jeff Ullman
14Iterative computation
- Final result
- Netscape and Amazon have the same importance, which is twice the importance of Microsoft.
- Does it capture the intuition? Yes.
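The iteration on the MiniWeb can be reproduced with exact fractions. Note the assumptions: only AM's links (to NE and MS) are stated in the text; the other links used here (NE to itself and AM, MS to AM) and the starting weight of 1 per site are assumed, chosen because they are consistent with the stated final result.

```python
from fractions import Fraction

# Assumed link structure (only AM's outlinks are given in the text):
#   NE -> NE, AM
#   AM -> NE, MS
#   MS -> AM
LINKS = {"NE": ["NE", "AM"], "AM": ["NE", "MS"], "MS": ["AM"]}

def step(weights):
    """One iteration: each site splits its weight evenly over its outlinks."""
    nxt = {site: Fraction(0) for site in weights}
    for site, targets in LINKS.items():
        share = weights[site] / len(targets)
        for target in targets:
            nxt[target] += share
    return nxt

weights = {site: Fraction(1) for site in LINKS}  # assumed start: weight 1 each
history = [weights]
for _ in range(60):
    weights = step(weights)
    history.append(weights)

final = {site: float(w) for site, w in weights.items()}
```

Under these assumptions the total weight stays at 3 in every iteration, and the values approach 6/5 for NE, 6/5 for AM and 3/5 for MS, which matches the 2:2:1 ratio stated in the final result.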
15Problem 1 Dead Ends!
- MS does not point to anybody
- Result: the weights of the Web leak out
16Problem 2 Spider Traps
- MS only points to itself
- Result: all weights go to MS!
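Both problems can be observed by rerunning the simple weight-passing iteration on two modified MiniWebs. This is a sketch under assumptions: the links of NE (to itself and AM) and the starting weight of 1 per site are assumed for illustration; only MS's behavior comes from the two problem statements.

```python
def iterate(links, rounds=200):
    """Repeatedly pass each site's weight evenly along its outlinks."""
    weights = {site: 1.0 for site in links}  # assumed start: weight 1 each
    for _ in range(rounds):
        nxt = {site: 0.0 for site in links}
        for site, targets in links.items():
            # A dead end has no targets, so its weight is simply lost.
            for target in targets:
                nxt[target] += weights[site] / len(targets)
        weights = nxt
    return weights

# Dead end: MS does not point to anybody.
dead_end = {"NE": ["NE", "AM"], "AM": ["NE", "MS"], "MS": []}
# Spider trap: MS only points to itself.
spider_trap = {"NE": ["NE", "AM"], "AM": ["NE", "MS"], "MS": ["MS"]}

leaked = iterate(dead_end)
trapped = iterate(spider_trap)
```

In the dead-end case every weight decays toward 0, because whatever reaches MS disappears. In the spider-trap case the total stays at 3, but MS ends up holding essentially all of it.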
17PageRank at Google
- The ranking of pages is more important than the exact values of $p_i$.
- Page ranks converged in 52 iterations for a crawl with 322 million links.
- Pre-compute and store the PageRank of each page.
- PageRank is independent of any query or textual content.
- The ranking scheme combines PageRank with textual match.
- Unpublished
- Many empirical parameters, human effort and regression testing
- Criticism: ad-hoc coupling and decoupling between relevance and importance
18Hubs and Authorities
- Motivation: find web pages on a topic
- E.g., find all web sites about automobiles
- Authority: a page that offers info about a topic
- E.g., DBLP is a page about papers
- E.g., google.com, aj.com, teoma.com, lycos.com
- Hub: a page that doesn't provide much info itself, but tells us where to find pages about a topic
- E.g., www.searchenginewatch.com is a hub of search engines
19Two Values of a Page
- Each page has a hub value and an authority value.
- In PageRank, each page has only one value: its weight.
- Two vectors
- H: hub values
- A: authority values
20HITS Find Hubs and Authorities
- First step: find pages related to the topic (e.g., automobile), and construct the corresponding focused subgraph
- Find the pages S containing the keyword (automobile)
- Find all pages these S pages point to, i.e., their forward neighbors
- Find all pages that point to S pages, i.e., their backward neighbors
- Compute the subgraph of these pages
[Figure: focused subgraph]
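The construction of the focused subgraph can be sketched as follows. This is an illustrative sketch only: the function name, the page names, and the tiny link graph are hypothetical, and the keyword search that produces S is assumed to have been done already.

```python
def focused_subgraph(links, keyword_pages):
    """Restrict the link graph to S plus its forward and backward neighbors.

    `links` maps each page to the set of pages it points to;
    `keyword_pages` is the set S of pages containing the keyword.
    """
    root = set(keyword_pages)
    # Forward neighbors: pages that S pages point to.
    forward = {t for page in root for t in links.get(page, ())}
    # Backward neighbors: pages that point to some S page.
    backward = {page for page, targets in links.items() if root & set(targets)}
    nodes = root | forward | backward
    # Keep only the links whose two endpoints both survive.
    return {page: {t for t in links.get(page, ()) if t in nodes} for page in nodes}

# Hypothetical web; assume only "ford" contains the keyword.
web = {
    "cars_review": {"ford", "toyota"},
    "ford": {"dealer"},
    "toyota": set(),
    "dealer": set(),
    "auto_portal": {"ford", "cooking"},
    "cooking": {"recipes"},
    "recipes": set(),
}
sub = focused_subgraph(web, {"ford"})
```

Here the subgraph keeps `ford`, its forward neighbor `dealer`, and its backward neighbors `cars_review` and `auto_portal`. Pages such as `toyota` and `cooking` are dropped because they neither contain the keyword nor neighbor a page that does.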
21Computing H and A
- Initially, set the hub and authority values to 1.
- In each iteration, the hub value of a page is the total authority value of its forward neighbors.
- The authority value of each page is the total hub value of its backward neighbors.
- Iterate until convergence.
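The hub and authority updates can be sketched in Python. Two points here are assumptions rather than something the text states: the values are rescaled to unit length after every update (without some normalization they grow without bound), and a fixed number of iterations stands in for a convergence test. The graph and page names are hypothetical.

```python
import math

def normalize(values):
    """Scale the values to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in values.values())) or 1.0
    return {page: v / norm for page, v in values.items()}

def hits(links, iterations=50):
    """`links` maps a page to the list of pages it points to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Hub value: total authority value of the forward neighbors.
        hub = normalize({page: sum(auth[t] for t in links.get(page, ()))
                         for page in pages})
        # Authority value: total hub value of the backward neighbors.
        new_auth = {page: 0.0 for page in pages}
        for page, targets in links.items():
            for t in targets:
                new_auth[t] += hub[page]
        auth = normalize(new_auth)
    return hub, auth

# Hypothetical graph: h1 points to x and y, h2 points to x only.
web = {"h1": ["x", "y"], "h2": ["x"]}
hub, auth = hits(web)
```

In this example `x` ends up with a higher authority value than `y` because both hubs point to it, and `h1` ends up the better hub because it points to both authorities.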
22HITS How to Count?
- Count OUT links
- Count IN links
23HITS Calculating Values
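- Authority value: $a(p) = \sum_{q \to p} h(q)$, summing over the pages q that point to p
- Hub value: $h(p) = \sum_{p \to q} a(q)$, summing over the pages q that p points to
- Matrix notation: A is the adjacency matrix
- A(i, j) = 1 if the i-th page points to the j-th page, and 0 otherwise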
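With the adjacency matrix A defined above, the two update rules can be written compactly. The lowercase vector symbols used here (a for the authority values, h for the hub values) are notation assumed for this sketch, chosen to avoid a clash with the matrix A.

```latex
\mathbf{h} = A\,\mathbf{a}, \qquad \mathbf{a} = A^{T}\mathbf{h}
\quad\Longrightarrow\quad
\mathbf{a} \leftarrow A^{T}A\,\mathbf{a}, \qquad
\mathbf{h} \leftarrow A\,A^{T}\mathbf{h}
```

The first pair follows directly from the definitions: row i of A selects the pages that page i points to, and column j selects the pages that point to page j. Substituting one rule into the other gives the second pair, so, with rescaling after each step and under the usual conditions for power iteration, a and h tend toward the principal eigenvectors of $A^{T}A$ and $AA^{T}$.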
24PageRank vs HITS
- PageRank advantage over HITS
- Query-time cost is low
- HITS computes an eigenvector for every query
- Less susceptible to localized link-spam
- HITS advantage over PageRank
- HITS ranking is sensitive to query
- HITS has notion of hubs and authorities
25Web Community
- Suppose one is familiar with some Web pages on a specific topic, such as sports.
- Find more pages about the same topic.
- Web community: a collection of related web pages (centers)
26What is cyber-community
- A community on the web is a group of web pages sharing a common interest.
- E.g., a group of web pages talking about pop music
- E.g., a group of web pages interested in data mining
- Main properties
- Pages in the same community should be similar to each other in content.
- The pages in one community should differ from the pages in another community.
- Similar to a cluster
27Two different types of communities
- Explicitly-defined communities
- They are the well-known ones, such as the resources listed by Yahoo!
- Implicitly-defined communities
- They are communities that are unexpected or invisible to most users
- How to find them?
- E.g., the group of web pages interested in a particular singer
28Similarity of Web Pages
- Discovering web communities is similar to clustering. For clustering, we must define the similarity of two nodes.
- Method I
- For page A and page B, A is related to B if there is a hyperlink from A to B, or from B to A.
- Not so good. Consider the home pages of IBM and Microsoft. They do not point to each other, so Method I would treat them as unrelated.
[Figure: Page A and Page B]
29Similarity of Web Pages
- Method II (from Bibliometrics)
- Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B.
- Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B.
[Figures: co-citation of Page A and Page B; bibliographic coupling of Page A and Page B]
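The two bibliometric measures are easy to compute once hyperlinks are treated as citations. The sketch below is illustrative only: the function names and the small link graph (including which pages link to IBM and Microsoft) are hypothetical.

```python
def cocitation(links, a, b):
    """Number of pages that cite (link to) both a and b."""
    return sum(1 for targets in links.values() if a in targets and b in targets)

def bibliographic_coupling(links, a, b):
    """Number of pages cited by (linked from) both a and b."""
    return len(set(links.get(a, ())) & set(links.get(b, ())))

# Hypothetical link graph; IBM and Microsoft do not link to each other.
web = {
    "tech_news": {"ibm", "microsoft"},
    "it_directory": {"ibm", "microsoft", "w3c"},
    "fan_page": {"ibm"},
    "ibm": {"w3c", "ieee", "acm"},
    "microsoft": {"w3c", "ieee", "acm", "msn"},
}
shared_citers = cocitation(web, "ibm", "microsoft")
shared_cited = bibliographic_coupling(web, "ibm", "microsoft")
```

In this made-up graph two pages cite both sites and both sites cite three pages in common, so either measure marks them as related even though there is no direct link between them.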
30Methods of Clustering
- All of them can discover meaningful communities.
- But these methods are very expensive to apply to the whole World Wide Web, with its billions of web pages.
31 An Effective Method
- The method is from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins.
- IBM Almaden Research Center
- They call their method communities trawling (CT).
- They implemented it on a graph of 200 million pages, and it worked very well.
32Basic idea of CT
- Dense directed bipartite subgraphs
- Bipartite graph: nodes are partitioned into two sets, F and C
- Every directed edge in the graph is directed from a node u in F to a node v in C
- Dense: many of the possible edges between F and C are present
33Basic idea of CT
- Bipartite cores
- A complete bipartite subgraph with at least i nodes from F and at least j nodes from C
- i and j are tunable parameters
- An (i, j) bipartite core
- Every community has such a core for a certain i and j.
- A bipartite core is the identity of a community.
- To extract all the communities is to enumerate all the bipartite cores on the web.
[Figure: an (i = 3, j = 3) bipartite core]
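To make the definition concrete, the sketch below enumerates bipartite cores by brute force. It is not the trawling method itself: it lists cores with exactly i nodes in F and j nodes in C, and an exhaustive search like this would not scale to a graph of hundreds of millions of pages. The function name and the example graph are hypothetical.

```python
from itertools import combinations

def bipartite_cores(links, i, j):
    """List (F, C) pairs where each of the i pages in F links to all j pages in C.

    `links` maps a page to the set of pages it points to.
    """
    cores = []
    # A page can only belong to F if it has at least j outlinks.
    fans = sorted(page for page, targets in links.items() if len(targets) >= j)
    for fan_set in combinations(fans, i):
        # Pages that every page in fan_set links to.
        common = set.intersection(*(set(links[f]) for f in fan_set))
        common -= set(fan_set)  # keep F and C disjoint
        for center_set in combinations(sorted(common), j):
            cores.append((fan_set, center_set))
    return cores

# Hypothetical graph: three pages that all point to c1, c2 and c3.
web = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3", "c4"},
    "noise": {"c1"},
}
cores_3_3 = bipartite_cores(web, 3, 3)
```

On this graph the only (3, 3) core is F = (f1, f2, f3) with C = (c1, c2, c3); lowering i to 2 yields three cores, one for each pair of pages in F.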
34Experiment on CT
- 200 million web pages
- IBM PC with an Intel 300 MHz Pentium II processor and 512 MB of memory, running Linux
- i from 3 to 10 and j from 3 to 20
- 200k potential communities were discovered
- 29 of them cannot be found in Yahoo!.
35Mining the World-Wide Web
- The WWW is a huge, widely distributed, global information source for:
- Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
- Hyper-link information
- Access and usage information
- Web Site contents and Organization
36Web Mining Taxonomy
Web Mining
Web Content Mining
Web Usage Mining
Web Structure Mining
37Web Content Mining
- Discovery of useful information from web contents / data / documents
- Web data contents: text, image, audio, video, metadata and hyperlinks
- Information Retrieval View (Structured, Semi-Structured)
- Assist / improve information finding
- Filter information for users based on user profiles
- Database View
- Model Data on the web
- Integrate them for more sophisticated queries
38Web Structure Mining
- To discover the link structure of the hyperlinks at the inter-document level, in order to generate a structural summary of the Website and Web pages.
- Direction 1: based on the hyperlinks, categorizing the Web pages and generating information.
- Direction 2: discovering the structure of the Web document itself.
- Direction 3: discovering the nature of the hierarchy or network of hyperlinks in the Website of a particular domain.
39Web Structure Mining
- Finding authoritative Web pages
- Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
- Hyperlinks can be used to infer the notion of authority
- The Web consists not only of pages, but also of hyperlinks pointing from one page to another
- These hyperlinks contain an enormous amount of latent human annotation
- A hyperlink pointing to another Web page can be considered the author's endorsement of that page
40Web Usage Mining
- Also known as Web log mining
- Mining techniques to discover interesting usage
patterns from the secondary data derived from the
interactions of the users while surfing the web
41Web Usage Mining
- Applications
- Target potential customers for electronic commerce
- Enhance the quality and delivery of Internet information services to the end user
- Improve Web server system performance
- Identify potential prime advertisement locations
- Facilitate personalization / adaptive sites
- Improve site design
- Fraud / intrusion detection
- Predict users' actions (allows prefetching)
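As one illustration of the last application, a next-page predictor can be built from page-to-page transition counts in the usage logs. This is a deliberately simple sketch: the session data, the URL paths, and the function names are hypothetical, and real logs would first need cleaning and splitting into sessions.

```python
from collections import Counter, defaultdict

def transition_counts(sessions):
    """Count how often each page is followed by each other page.

    `sessions` is a list of page sequences, one per user visit.
    """
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    return counts

def predict_next(counts, page):
    """Return the most frequent follow-up page, or None if the page was never left."""
    if page not in counts:
        return None
    return counts[page].most_common(1)[0][0]

# Hypothetical sessions reconstructed from a web server log.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
counts = transition_counts(sessions)
```

With these sessions, a visitor on `/home` would be predicted to go to `/products` next, so a server could prefetch that page; no prediction is made for `/cart` because no recorded session continues past it.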