Title: Web Modeling
1 Web Modeling. Mark Joseph, SI 767, 2-15-06
2 Why do we want to know what the Web "looks like"? Why do we care how it grows? Why should we care how pages are connected? Simple reason: as NLP and IR professionals, this knowledge can help us design techniques to retrieve information.
3 I am going to be building off the foundation that "The web can be modeled as a mathematical graph by considering its pages to be nodes connected by arcs corresponding to hyperlinks" (Thelwall & Wilkinson; we'll see this later).
4 - The Web is the largest repository of data, and it grows exponentially.
  - 320 million Web pages (Lawrence & Giles, 1998)
  - 800 million Web pages, 15 TB (Lawrence & Giles, 1999)
  - 8 billion Web pages indexed (Google, 2005)
- Amount of data
  - roughly 200 TB (Lyman et al., 2003)
- In order to search the web, and because the web is ever changing, we need to create a model to explain how the web grows.
(Radev lecture, 2005)
5 To start: Graph Theory
6 There is no central authority over the web. Web pages are created by individuals, who then wish to connect their pages to other web pages (using hyperlinks). These designers hope to be linked to in return, to create more traffic to their sites, and thus justify, inform through, or profit from their web pages. This is all done at random. Hence, the Random Graph.
7 - The Random Graph was first defined by Paul Erdős and Alfréd Rényi in their 1959 paper "On Random Graphs I." It works something like this:
  1. Start with n vertices.
  2. After each time unit, one new edge is created at random, with every possible edge equally likely.
(Wikipedia, http://en.wikipedia.org/wiki/Random_graph)
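A minimal sketch of this growth process in Python; the graph size and number of time steps are illustrative choices, not from the slides:

```python
import itertools
import random

def grow_random_graph(n, steps):
    """Start with n isolated vertices; at each time step add one new edge
    chosen uniformly at random among all vertex pairs not yet connected."""
    edges = set()
    all_pairs = list(itertools.combinations(range(n), 2))
    for _ in range(steps):
        absent = [e for e in all_pairs if e not in edges]
        if not absent:            # graph is complete, nothing left to add
            break
        edges.add(random.choice(absent))
    return edges

# Example: 20 vertices, 30 time steps
print(len(grow_random_graph(20, 30)))   # 30, unless the graph filled up first
```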
8–22 (No transcript: figure slides)
23 Random Graph Measurements
24 Diameter: the longest shortest path of a network (excluding backtracks, detours, and loops). Assuming I'm not losing it, the diameter here is 3.
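For concreteness, here is a small Python sketch of computing the diameter as the longest shortest path via breadth-first search; the example graph is made up (it is not the figure from the slide) but also has diameter 3:

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """Breadth-first search: shortest path length from source to every reachable node.
    graph maps each node to the set of its neighbours."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def diameter(graph):
    """Longest shortest path over all pairs of mutually reachable nodes."""
    return max(max(shortest_path_lengths(graph, s).values()) for s in graph)

# A small undirected example (adjacency sets); its diameter is 3 (a-b-c-d).
g = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(diameter(g))   # 3
```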
25 Clustering Coefficient: a test for a small-world network, introduced by Duncan J. Watts and Steven Strogatz in 1998. C_i = (number of connections between i's neighbors) / (maximum number of possible connections between those same neighbors).
26 Because the web is directed (hyperlinks only go one way), the calculation is as follows: C_i = e_jk / [k_i(k_i - 1)], where e_jk is the number of directed connections between i's neighbors and k_i(k_i - 1) is the number of potential connections.
27 Clustering Coefficient
28 Clustering Coefficient
C_i = e_jk / [k_i(k_i - 1)]; here e_jk = 3 and k_i = 6, so C_i = 3 / (6 · (6 - 1)) = 3/30 = 1/10.
29 The clustering coefficient for an entire network is the average of all clustering coefficients in the network.
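A rough Python sketch of these two calculations. It assumes the graph is stored as a dict of out-link sets, and it treats any node with a link to or from i as a neighbor of i, which is one reasonable reading of the slide's definition:

```python
def clustering_coefficient(graph, i):
    """C_i = (directed links among i's neighbours) / (k_i * (k_i - 1)).
    graph maps each node to the set of nodes it links to (its out-links)."""
    neighbours = set(graph.get(i, set()))                      # i's out-links
    neighbours |= {n for n, outs in graph.items() if i in outs}  # nodes linking to i
    neighbours.discard(i)
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for j in neighbours for m in graph.get(j, set())
                if m in neighbours and m != j)                 # directed links among neighbours
    return links / (k * (k - 1))

def network_clustering(graph):
    """Average of the node clustering coefficients over the whole network."""
    nodes = set(graph) | {n for outs in graph.values() for n in outs}
    return sum(clustering_coefficient(graph, n) for n in nodes) / len(nodes)

# Tiny made-up example: i has 3 neighbours with 2 directed links among them -> 2/6
g = {"i": {"a", "b", "c"}, "a": {"b", "i"}, "b": {"c"}, "c": set()}
print(clustering_coefficient(g, "i"))   # 0.333...
```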
30 Degree of a node/vertex: the number of arcs or edges connected to that node/vertex. In-degree and out-degree: the number of directed arcs going in and out, respectively, in a directed graph.
31 Degree Distributions (Mark Newman, "The Structure and Function of Complex Networks"): p_k = the fraction of vertices in the network that have degree k; equivalently, p_k = the probability that a vertex chosen uniformly at random has degree k.
32 A plot of p_k for any given network can be formed by making a histogram of the degrees of vertices. This histogram is known as the degree distribution. NOTE: these are often plotted on a log-log scale to cut down on the noise.
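A quick sketch of turning a list of vertex degrees into p_k; the degrees in the example are made up:

```python
from collections import Counter

def degree_distribution(degrees):
    """p_k: the fraction of vertices with degree k, from a list of vertex degrees."""
    n = len(degrees)
    counts = Counter(degrees)
    return {k: counts[k] / n for k in sorted(counts)}

print(degree_distribution([1, 2, 2, 3, 3, 3, 5]))
# {1: 0.142..., 2: 0.285..., 3: 0.428..., 5: 0.142...}
```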
33 Another important note when graphing a Web network: because the representation is a directed graph, each vertex has both an in-degree and an out-degree, and the degree distribution becomes a function p_jk, representing the fraction of vertices that simultaneously have in-degree j and out-degree k.
(Mark Newman, "The Structure and Function of Complex Networks")
34 - Two Important Degree Distribution Models
- Assuming the random graph from above, where:
  1. Start with n vertices.
  2. After each time unit, one new edge is created at random, with every possible edge equally likely.
35 Poisson distribution: if new edges between the vertices are added independently of the presence or absence of any other edge, so that each edge is present with independent probability p, you get (in the large-n limit) a Poisson degree distribution.
(Newman et al., "Random Graphs with Arbitrary Degree Distributions and Their Applications")
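For reference, the degree distribution of this model is binomial and approaches the Poisson form in the large-n limit (z denotes the mean degree; this is the standard form used in Newman's random-graph work):

```latex
p_k \;=\; \binom{n-1}{k} p^k (1-p)^{\,n-1-k}
      \;\longrightarrow\;
      \frac{z^k e^{-z}}{k!},
\qquad z = p\,(n-1),\ n \to \infty .
```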
36 Problem: a number of studies of different networks have found degree distributions measurably different from a Poisson distribution, one of which is the Web. In these networks the probability of large in- or out-degree falls off far more slowly than the exponential tail of a Poisson distribution. Solution: the Power Law Distribution, or Scale-Free Graph.
37 A Quick Comparison of Poisson vs. Power Law
38 Poisson graph (figure; annotation: number of nodes found). Thanks Lada! (Adamic lecture, 2006)
39 Power-law graph (figure; annotation: number of nodes found). Thanks Lada! (Adamic lecture, 2006)
40 Power-law networks
- Many real-world networks contain hubs: highly connected nodes.
- Usually the distribution of edges is extremely skewed: many nodes with few edges, and a fat tail of a few nodes with a very large number of edges; there is no typical number of edges.
(Figure axes: number of nodes with a given number of edges vs. number of edges.)
Thanks Lada! (Adamic lecture, 2006)
41 But is it really a power law?
- A power law will appear as a straight line on a log-log plot.
- A deviation from a straight line could indicate a different distribution:
  - exponential
  - lognormal
(Figure axes: log(number of nodes) vs. log(number of edges).)
Thanks Lada! (Adamic lecture, 2006)
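A rough sketch of the straight-line check: fit a least-squares line to log(count) versus log(degree) and read the slope as an estimate of the (negative) exponent. This only illustrates the idea; the degree counts below are made up, and serious power-law fitting uses maximum-likelihood methods rather than a least-squares fit on a log-log plot:

```python
import math

def loglog_slope(degree_counts):
    """Least-squares slope of log(count) vs log(degree). If the points lie on a
    straight line, the slope estimates the power-law exponent (as -gamma).
    degree_counts maps degree k -> number of nodes with that degree."""
    pts = [(math.log(k), math.log(c)) for k, c in degree_counts.items() if k > 0 and c > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    return sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)

# Made-up counts roughly following k^-2: the slope comes out near -2
print(loglog_slope({1: 1000, 2: 250, 4: 62, 8: 16, 16: 4}))
```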
42 Barabási/Albert Model for Preferential Attachment
43 The easiest way to describe a Power Law Distribution: "the rich get richer." Applied to networks, the argument goes that vertices with a higher degree have a higher probability of being linked to by newly added vertices.
44 This theory, applied to networks, first gained wider acceptance after the 1999 publication of a paper by Barabási and Albert entitled "Emergence of Scaling in Random Networks." In this paper they coined the term "preferential attachment" to describe this weighted probability of new vertex attachment.
(Mark Newman, "The Structure and Function of Complex Networks")
45 Barabási and Albert's model is undirected, which can be construed as problematic when looking at the web, but what it sacrifices in realism it makes up for in simplicity.
(Mark Newman, "The Structure and Function of Complex Networks")
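A small simulation sketch of the rich-get-richer rule in the spirit of the Barabási–Albert model (undirected, as slide 45 notes). The seed-clique construction and the parameter values are my own illustrative choices:

```python
import random

def barabasi_albert(n, m):
    """Preferential-attachment sketch: start with a small clique of m nodes, then
    each new node attaches to m existing nodes chosen with probability
    proportional to their current degree. Assumes m >= 2 so the seed has an edge."""
    # a node with degree d appears d times in `targets`, so uniform sampling
    # from this list is sampling proportional to degree
    targets, edges = [], []
    for u in range(m):
        for v in range(u + 1, m):
            edges.append((u, v))
            targets += [u, v]
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:              # m distinct targets per new node
            chosen.add(random.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets += [new, t]
    return edges

edges = barabasi_albert(10000, 2)
```

Building a histogram of node degrees from the returned edge list and plotting it on a log-log scale should show the heavy tail discussed above.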
46 Broder et al.: Bow Tie Structure of the Web
47 Numerous large-scale studies have developed the "Bow Tie Structure" of the Web, a model initially coined by Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener in their paper "Graph Structure in the Web." These studies have been duplicated a number of times.
48 Bow-tie model of the Web (figure): SCC 56M pages, IN 44M, OUT 44M, TENDRILS 44M, DISCONNECTED 17M.
(Broder et al., WWW 2000; Dill et al., VLDB 2001)
Thanks Drago!
49 Statistics from the Bow Tie Study of Broder et al.: SCC 27.5%; IN and OUT 21.5% each; Tendrils and tubes 21.5%; Disconnected 8%. About 24% of pages are reachable from a given page.
(Broder et al., WWW 2000; Dill et al., VLDB 2001)
Thanks Drago and Lada!
50 Now to Get Into the Mathematics of the Power Law in the Internet
"Heuristically Optimized Trade-offs: A New Paradigm for Power Laws in the Internet," by Alex Fabrikant, Elias Koutsoupias, and Christos H. Papadimitriou
(Fabrikant et al.)
51 First and foremost: the Web is scale-free. The model of the web: a tree is built as nodes arrive uniformly at random. When the i-th node arrives, it attaches itself to one of the previous nodes.
(Fabrikant et al.)
52 We assume this node would like to connect to a centrally located node, a node whose distances to other nodes are minimized. The arriving node i attaches to the existing node j that minimizes α·d_ij + h_j, where d_ij is the Euclidean distance, h_j is some measure of the centrality of node j, and α is a parameter, a function of the final number n of points, gauging the relative importance of the two objectives.
(Fabrikant et al.)
53 Fabrikant et al. define 3 possible measures of the centrality h_j:
1. The average number of hops from other nodes
2. The maximum number of hops from another node
3. The number of hops from a fixed center of the tree
(Fabrikant et al.)
54 α is the crux of the theorem! Why? Here are some examples.
(Fabrikant et al.)
55 If α is too low, then the Euclidean distances become unimportant, and the network resembles a star.
(Fabrikant et al.)
56 But if α grows at least as fast as √n, where n is the final number of points, then distance becomes too important; the tree behaves like a minimum spanning tree, and high degrees occur only with exponentially vanishing probability, thus not a power law. If α is anywhere in between, we have a power law. Through a rather complex and elaborate proof, Fabrikant et al. show that this setup produces a power-law degree distribution. I'll save you the math!
(Fabrikant et al.)
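A simulation sketch of the trade-off model under the definitions above, using centrality measure 3 (hops from a fixed center, here the first node). The values of n and α are arbitrary illustrative choices:

```python
import math
import random

def fkp_tree(n, alpha):
    """Fabrikant-Koutsoupias-Papadimitriou-style trade-off sketch: points arrive
    uniformly at random in the unit square; node i attaches to the existing node j
    minimising alpha * d_ij + h_j, where h_j is j's hop count from the root."""
    pts = [(random.random(), random.random()) for _ in range(n)]
    parent = [None] * n      # parent[0] is the root
    hops = [0] * n           # hop distance to the root
    for i in range(1, n):
        best_j = min(range(i),
                     key=lambda j: alpha * math.dist(pts[i], pts[j]) + hops[j])
        parent[i] = best_j
        hops[i] = hops[best_j] + 1
    return parent

# alpha between a constant and sqrt(n) is the regime the paper associates with
# power-law degrees; alpha = 10 here is just an illustrative value.
tree = fkp_tree(5000, alpha=10)
```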
57 Information Retrieval Applications
"Growing and navigating the small world Web by local content," Filippo Menczer
58 The degree sequence of Web pages has a power-law distribution, Pr(k) ~ k^(-γ), where k is the degree of a page (number of in-links or out-links) and γ is a constant exponent.
(Menczer)
59 The goal of Menczer's study:
"to propose a Web growth model that is shown to accurately predict the distribution of Web page degree, based on textual content and assuming only local knowledge of degree for existing pages. Efficient paths can be discovered by decentralized Web navigation algorithms based on textual and/or categorical cues."
So let's step through the argument.
(Menczer)
60 Menczer's Model to Explain How Web Pages Are Generated and Why the Popular Are Popular
(Menczer)
61 "To gain insight into the Web's scale-free growth and mechanisms for efficient navigation, I want to study the connection between the two topologies induced over the Web by links and textual content."
(Menczer)
62 Start by introducing a lexical distance measure based on similarity, where (p1, p2) is a pair of web pages and s(p1, p2) is the cosine similarity function traditionally used in information retrieval (w_kp is some weight function for term k in page p, e.g., term frequency); the distance between two pages grows as their similarity shrinks.
(Menczer)
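A small sketch of the lexical measure, using raw term frequencies as the weights w_kp (as the slide suggests). The conversion from similarity to distance shown here is just one simple choice, not necessarily the exact form Menczer uses:

```python
import math
from collections import Counter

def cosine_similarity(text1, text2):
    """Cosine similarity with raw term frequencies as the weights w_kp."""
    w1, w2 = Counter(text1.lower().split()), Counter(text2.lower().split())
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    norm = math.sqrt(sum(v * v for v in w1.values())) * \
           math.sqrt(sum(v * v for v in w2.values()))
    return dot / norm if norm else 0.0

def lexical_distance(text1, text2):
    """One simple way to turn similarity s into a distance that grows as pages
    become less similar (identical pages get distance 0, unrelated pages infinity)."""
    s = cosine_similarity(text1, text2)
    return float("inf") if s == 0 else 1.0 / s - 1.0

print(lexical_distance("web graph model", "web graph theory and models"))
```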
63 Finding the relationship between the lexical topology (the distance from above) and the link topology requires measuring the probability that two pages at a certain lexical distance have a direct link between them. But this measure is extremely hard to get, because the size of the web makes this probability negligibly small. Instead, focus on a neighborhood relation in link space, which approximates link probability but is easier to measure and is used to identify Web communities.
(Menczer)
64 NOTE: A neighborhood is the set of URLs representing a web page, all of its in-links, and all of its out-links.
(Menczer)
65 Measure the frequency of neighbor pairs of pages as a function of the lexical distance. U_p is the URL set representing p's neighborhood, and a neighborhood threshold models the ratio of local versus long-range links.
(Menczer)
66 Here is the plot of the probability that two pages are neighbors against their lexical distance, for various values of the neighborhood threshold.
(Menczer)
67 As you can see from the graph, up to a lexical distance of about 1 there is no correlation, but after that point the probability that two pages are neighbors decreases with lexical distance, like a power law.
(Menczer)
68 The conclusion: aside from immediate neighbors (one step away from the original link), for whom the relation shows no distinct features, pages more similar in content have a higher likelihood of being neighbors.
(Menczer)
69 Next we get Menczer's Web growth model, based on generative models. But first, a short history of proposed Web growth models.
70 Barabási–Albert (BA) Model: a preferential attachment model; node i receives a new edge with probability proportional to its current degree, Pr(i) ∝ k(i).
Pros: produces power-law degree distributions.
Cons: based on the unrealistic assumption that Web authors have complete global knowledge of Web degrees.
(Menczer)
71 Extension of the BA Model: Pr(i) ∝ η(i)·k(i), where η(i) is the fitness of page i.
Pros: still yields power-law degree distributions.
Cons: over time, pages with high fitness win out.
(Menczer)
72 Another Extension of the BA Model: linking to a node is based on its degree with probability f, or to a uniformly chosen node with probability 1 - f.
Pros: fits not only the power-law degree distributions of the entire web, but also the unimodal degree distribution of subsets of Web pages (like university, company, or newspaper homepages).
Cons: still relies on global knowledge of degree.
(Menczer)
73 Menczer's Proposal: a Content-Based Generative Model. It attempts to model the urge of page authors to link to similar (hence probably related) and popular (hence probably important) sites. It also assumes that page popularity is correlated with degree, but that a page author has only local knowledge of degree.
(Menczer)
74 At each step t, one new page p_t is added, and m new links are created from p_t to m existing pages p_i, i < t, each selected with a probability that depends on k(i), the in-degree of p_i, on a lexical distance threshold, and on the constants c1 and α.
(Menczer)
75 Now we have a growth process driven by local link decisions based on content, one that mirrors the phase transition above: lexical independence for close pages and an inverse power-law dependence for distant pages.
(Menczer)
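A rough simulation sketch of this kind of content-driven, locally preferential growth. The slides give the ingredients (in-degree k(i), a lexical distance threshold, constants c1 and α) but not the exact attachment formula, so the weighting rule below, the random term-vector pages, and all parameter values are my own illustrative assumptions:

```python
import math
import random

def grow_content_model(n, m=3, rho_star=1.0, c1=0.1, a=2.0, vocab=50, page_len=20):
    """Content-driven growth sketch: each new page links to m existing pages,
    weighting each candidate by its in-degree, damped by an inverse power of
    lexical distance when that distance exceeds the threshold rho_star."""
    def random_page():
        return [random.randrange(vocab) for _ in range(page_len)]

    def distance(p, q):
        # cosine-similarity-based lexical distance (term-frequency weights)
        cp, cq = {}, {}
        for t in p: cp[t] = cp.get(t, 0) + 1
        for t in q: cq[t] = cq.get(t, 0) + 1
        dot = sum(cp[t] * cq.get(t, 0) for t in cp)
        norm = math.sqrt(sum(v * v for v in cp.values()) * sum(v * v for v in cq.values()))
        s = dot / norm if norm else 0.0
        return float("inf") if s == 0 else 1.0 / s - 1.0

    pages = [random_page() for _ in range(m)]
    indeg = [0] * m
    links = []
    for t in range(m, n):
        new = random_page()
        weights = []
        for i in range(len(pages)):
            r = distance(new, pages[i])
            w = indeg[i] + 1                    # +1 so degree-0 pages can still be chosen
            if r > rho_star:
                w *= c1 * r ** (-a)             # inverse power-law damping for distant pages
            weights.append(max(w, 1e-12))       # keep weights strictly positive
        for i in random.choices(range(len(pages)), weights=weights, k=m):
            links.append((t, i))
            indeg[i] += 1
        pages.append(new)
        indeg.append(0)
    return links

links = grow_content_model(2000)
```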
76 Next Step: Define an Optimal Navigation Algorithm for Small-World Networks for Efficient Web Crawling
77 Given the Web's small-world and power-law topology, its diameter scales as Θ(log N / log log N); therefore, if two pages belong to a connected component of the Web, some short path exists between them. Given the need to find unknown target pages, we are interested only in decentralized crawling algorithms, which can use only information available locally about a page and its neighborhood.
(Menczer)
78 Starting from some source Web page, we aim to visit a target page by visiting l << N pages, where N is the size of the Web, several billion pages. The Web is a small-world network, so we know its diameter, or the diameter of its largest connected component, scales logarithmically with N.
(Menczer)
79 Therefore a short path of length l ~ log N / log log N is likely to exist between some source (a bookmarked page or a search engine result) and some unknown relevant target page.
(Menczer)
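As a rough order-of-magnitude check (natural logarithms, ignoring the constants hidden in the Θ(·)), this predicts very short paths even at Web scale:

```python
import math

N = 8_000_000_000                       # ~8 billion pages (the Google 2005 figure from slide 4)
l = math.log(N) / math.log(math.log(N))
print(round(l, 1))                      # about 7.3 hops, up to the hidden constant
```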
80 Simple greedy algorithms that always pick the neighbor with the highest degree end up being too costly, so what is the alternative? Use Kleinberg's hierarchical model and knowledge of semantic distance.
(Menczer)
81 First, define a semantic distance between topics, where p0 is the lowest common ancestor of p1 and p2 in the topic hierarchy and Pr(p) represents the fraction of pages classified at node p.
(Menczer)
82 The relationship between this measure of semantic distance and link topology can be analyzed as was done for lexical distance earlier, by measuring the frequency of neighbor pairs of pages as a function of semantic distance.
(Menczer)
83 This yields a plot of the probability that two pages are neighbors against their semantic distance, for various values of the neighborhood threshold.
(Menczer)
84 Menczer observed a good fit between the data and the exponential model, and, using Kleinberg's greedy algorithm, the majority of relevant pages are located based on local content before 10^4 pages have been crawled.
(Menczer)
85 "Graph Structure in Three National Academic Webs: Power Laws with Anomalies," Mike Thelwall and David Wilkinson
86 What anomalies can be found in the network of the Web? To start, a summary of the Broder et al. 2000 study of two complete AltaVista crawls and the connectivity of the Web, on which Thelwall and Wilkinson's work is based.
(Thelwall & Wilkinson)
87 (Thelwall & Wilkinson)
88 Five parts: IN, OUT, STRONGLY CONNECTED COMPONENT (SCC), TENDRILS, DISCONNECTED. The first four had roughly equal sizes.
(Thelwall & Wilkinson)
89 The SCC is a collection of pages from which a crawl following only links in pages could start anywhere in the set and reach every other page in the set. OUT is the set of pages outside the SCC that can be reached from the SCC but do not connect back to it. IN is the set of pages that connect to the SCC but cannot be reached from it, so a crawl starting in IN would contain all of the SCC and OUT. TENDRILS is a separate set linked to by a page in IN or OUT but not itself in IN, OUT, or the SCC. DISCONNECTED pages are just that: disconnected from the other four.
(Thelwall & Wilkinson)
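A sketch of this decomposition using the networkx package; for brevity it lumps tendrils, tubes, and disconnected pages into a single "rest" set rather than separating them as the study does:

```python
import networkx as nx

def bow_tie(G):
    """Bow-tie decomposition sketch for a directed graph G (networkx DiGraph):
    the largest strongly connected component (SCC), the pages that can reach it (IN),
    the pages reachable from it (OUT), and everything else."""
    scc = max(nx.strongly_connected_components(G), key=len)
    seed = next(iter(scc))
    out = nx.descendants(G, seed) - scc        # reachable from the SCC, not in it
    in_ = nx.ancestors(G, seed) - scc          # can reach the SCC, not in it
    rest = set(G) - scc - out - in_            # tendrils, tubes, disconnected pieces
    return scc, in_, out, rest

# Tiny example: 0->1->2->0 is the SCC, 3 links into it, it links out to 4, 5 is isolated.
G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (3, 0), (2, 4)])
G.add_node(5)
scc, in_, out, rest = bow_tie(G)
print(len(scc), len(in_), len(out), len(rest))   # 3 1 1 1
```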
90 Methodological Issues that Arise from this Study:
- Without access to a major search engine database, researchers may only be able to study SCC and OUT systematically.
- A crawler may not be able to retrieve all the out-links, especially those created by JavaScript, server-side image maps, and embedded applications, so any of these components could be larger.
- Some links may be ignored due to a policy decision, such as database queries (URLs including a "?") and frameset pages. Search engines may also ban spam sites, thus losing out on more potential links.
- The AltaVista data set included duplicates which, if eliminated, could have caused some of the components to shrink in size.
- The AltaVista data set only included HTML pages, thus missing out on other potential resources.
(Thelwall & Wilkinson)
91 Thelwall and Wilkinson then did a study over three countries' publicly indexable university Web sites, with some updates to fix the methodological issues just mentioned in Broder et al.'s 2000 study.
(Thelwall & Wilkinson)
92 Their Results
(Thelwall & Wilkinson)
93 And here are the logarithmic graphs of in-link and out-link counts for the schools:
(Thelwall & Wilkinson)
94 Australia
(Thelwall & Wilkinson)
95 New Zealand
(Thelwall & Wilkinson)
96 The United Kingdom
(Thelwall & Wilkinson)
97 Explanations for the Anomalies: In New Zealand, it was because of a set of highly interlinking software documentation. In Australia, the biggest came from an online course handbook with a standard navigation bar. In the UK, the huge in-link counts also came from a standard navigation bar. The biggest anomalies are produced by internal links within data-driven sites.
(Thelwall & Wilkinson)
98 Some other interesting observations:
1. All SCC pages have out-degree of at least 1, by definition, whereas the median out-degree of OUT pages is 0.
2. The median for the SCC of between 6 and 8 shows that there is significantly more out-linking from within SCC pages. The same is true for in-linking, but to a lesser degree. In fact, in-linking and out-linking from both SCC and OUT display power laws, so although there are generally more links within the SCC, OUT also has a spectrum of more highly connected pages.
3. As a final point, a median of two or three in-links for the SCC shows that this area is actually very sparsely interconnected. The average SCC page can only be reached from two (Australia) or three (UK, New Zealand) other SCC pages.
(Thelwall & Wilkinson)
99 Another Strange Anomaly: They also found a huge number of components of size 1 (2,220,070, containing 49% of pages, for the UK), all of which must be linked to by pages from the SCC, indicating that the SCC must be surrounded by a fuzz of individual pages that do not link to any other national university pages. Many of these were non-HTML resources that cannot or do not contain crawled links (PDF, PPT, image files).
(Thelwall & Wilkinson)
100 The Longest Shortest Paths: A shortest path is the least number of links that need to be traversed to get from the first page to the second. The longest found were: Australia 362 links, New Zealand 1445 links, UK 1022 links. From these results it can be seen that very long paths do exist in the data set, with the end pages being buried deeply in obscure places.
(Thelwall & Wilkinson)
101 Summation of Thelwall and Wilkinson: Power laws are clearly evident in many aspects of the topology of national university Webs. There is evidence for a rich-get-richer model of new links. However, there is also evidence for a small degree of linking at random.
(Thelwall & Wilkinson)
102 Also, anomalies were present, caused by:
1. Automatically generated pages served to the crawler, and those produced by automatically fixed link errors.
2. The inclusion of non-HTML Web pages, in particular because these cannot host links or the crawler did not extract links from them.
3. Large resource-driven Web sites.
These anomalies need to be accounted for or segregated to get the most meaningful results about the Web.
(Thelwall & Wilkinson)