1
World Wide Web
Size, Growth, Shape, Structure, Search, Mining
  • Li Xiaoming
  • Network Computing and Information Systems
  • Peking University
  • http://ncis.pku.edu.cn
  • June 2, 2008

2
Size
3
As of 2007
  • Surface web / indexable web: roughly 15-30
    billion pages
  • http://www.worldwidewebsize.com: 15.46 billion
  • http://www.boutell.com: 29.7 billion
  • A more authoritative number: 11.5 billion in
    2005 (see Gulli and Signorini, WWW 2005)
  • Deep web: 544 times larger
  • The average page size is 15 KB (in China, the
    number is 20 KB)

4
Estimation: the capture-recapture model
  • Notation and assumptions
  • w: the size of the web
  • n, m: two independent, random subsets of the
    web; we also use n and m for their sizes
  • x: the size of the intersection of n and m
  • Reasoning
  • By the independence assumption on n and m,
    n/w = x/m, thus w = m*n/x
  • It is tempting to take n and m to be two search
    engines' collections, assuming the owners will
    tell us their sizes. But how do we figure out x?
    (a numerical sketch follows)
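
As a numerical sketch of the reasoning above (the
three numbers below are made-up placeholders, not
measurements):

    # Capture-recapture: independence gives
    # n/w = x/m, hence w = m*n/x.
    n = 4.0e9   # size of the first random subset (hypothetical)
    m = 5.0e9   # size of the second random subset (hypothetical)
    x = 2.5e9   # size of their intersection (hypothetical)
    w = m * n / x
    print(f"estimated web size: {w:.2e} pages")   # 8.00e+09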

5
Intersection of two search engines
  • It would be easy to count, if we could get the
    two collections. But we can't.
  • It would also be easy to figure out if we could
    take a random subset n1 of n and ask the owner
    of m how many (m1) of n1's members intersect
    with m. We can't ask, but we have got the hint!
  • By the randomness of n, m, and n1, we are able
    to assume m1/n1 = x/n, thus x = n*m1/n1.
  • In practice, researchers explore search engines
    (via queries) to estimate these numbers, as
    sketched below.
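
A minimal sketch of this sampling trick in Python;
check_in_engine_2 is a hypothetical stand-in for
"probe engine 2 for this url" (in practice, via the
strong queries of the next slide):

    import random

    def check_in_engine_2(url):
        # Placeholder: in practice, issue a strong query for the
        # page to engine 2 and see whether the url (or an alias)
        # comes back.
        raise NotImplementedError

    def estimate_web_size(engine1_urls, n, m, k=1000):
        """n, m: index sizes of engines 1 and 2; k: size of n1."""
        n1 = random.sample(list(engine1_urls), k)    # random subset of n
        m1 = sum(check_in_engine_2(u) for u in n1)   # members also in m
        x = n * m1 / k                               # x = n*m1/n1
        return m * n / x                             # w = m*n/x (= m*k/m1)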

6
Exploring search engines (E1, E2)
  • Sampling at search engine E1
  • queries -> urls (pages), hopefully random
  • random queries drawn from a large corpus
  • random selection of urls from the returned lists
  • Checking against search engine E2 to see whether
    the urls found in E1 are also found in E2
  • pages -> strong queries -> urls
  • a strong query is constructed from a particular
    page, trying to pinpoint its url while allowing
    for possible aliases, replicas, etc. (a sketch
    follows)
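
A hedged sketch of strong-query construction: pick
the page's rarest terms so that a conjunctive query
pins down that page. Here df, a term-to-document-
frequency dictionary derived from some reference
corpus, is an assumption:

    import re

    def strong_query(page_text, df, num_terms=8):
        # Candidate terms: words of the page the corpus knows about.
        words = set(re.findall(r"[a-z]+", page_text.lower()))
        known = [w for w in words if w in df]
        # The rarest terms discriminate best among pages.
        rarest = sorted(known, key=lambda w: df[w])[:num_terms]
        return " ".join(rarest)   # submit as a conjunctive query to E2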

7
..., and you get the idea
  • Krishna Bharat and Andrei Broder, "A technique
    for measuring the relative size and overlap of
    public Web search engines," WWW 1998.
  • A. Gulli and A. Signorini, "The indexable Web is
    more than 11.5 billion pages," WWW 2005.
  • There have been more attempts, but the basic
    approaches are similar
  • Theoretically: some variation of the
    capture-recapture model
  • Practically: exploring two or more search
    engines

8
Growth
  • For the world or a country, how many web pages
    have there been since 1995, year by year?

9
We start with some assumptions
  • Practical assumptions
  • We know the current number (size)
  • We are able to get a sufficiently large, random
    subset of the total (for whatever scope is of
    interest)
  • The last-modified-time attribute is valid for
    each member of the subset
  • Scientific assumption
  • Page changes (modification and deletion) follow
    a Poisson process
  • independent increments, exponentially
    distributed waiting times

10
  • Let ci denote the number of web pages at the
    indicated time. Assume
  • c7 = 0, and c0 is known
  • Let mi denote the number of web pages with
    last-modified-time 2001-i, learnt at the time
    of c0
  • Let ai denote the number of pages with
    last-modified-time 2001-i, observed at the time
    of ci
  • The exponential assumption: 1 - e^(-µt) is the
    probability that a page gets deleted within time
    t after its last modification; denote it p(µ,t)
    (see the sketch below)
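
Under these assumptions, the deletion model is only
a few lines; µ = 0.7 below anticipates the value
estimated on the next slides:

    import math

    def p_deleted(mu, t):
        # Probability a page is deleted within t years of its
        # last modification: p(mu, t) = 1 - e^(-mu*t).
        return 1.0 - math.exp(-mu * t)

    mu = 0.7
    print(p_deleted(mu, 1.0))   # ~0.50: half the pages gone in a year
    print(math.log(2) / mu)     # ~0.99 years: the half-life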

11
Growth (continued)
  • With ai as a bridge, we were able to establish
    ci as a function of ci+1 and mi, with µ as a
    to-be-determined parameter.
  • Setting the proper boundary conditions on ci, we
    were able to solve for the unknown ci's, and
    µ ≈ 0.7.

This growth curve has been nicely extended by five
more years of actual data gathered since 2002.
12
Shape
  • What does the web (of the world, or of a
    country) look like, in terms of shape?
  • A bow-tie! But how is that determined?

13
A practical approach
  • The model: a directed graph
  • Nodes: pages; edges: hyperlinks
  • The problem: given the set of all web pages of
    interest, determine the shape of the web graph
    that best represents its overall structure
  • The challenges
  • "Shape"? "Best represent"? What do you mean?
    We are in search of a path while the destination
    is unknown.
  • The sheer amount of data overloads any standard
    graph algorithm; we need to re-engineer the
    algorithms

14
For the challenges
  • Broder et al.'s work (WWW 2000) provides an
    insight into the features of the shape, i.e.,
    look for
  • the strongly connected component (SCC)
  • the IN and OUT components
  • tendrils
  • tubes
  • disconnected parts
  • (which is useful, but may be refined, e.g. into
    a "daisy", or redefined)
  • But how do we figure them out effectively?

15
An example
  • In January-February 2006, we conducted a
    complete crawl of the Chinese web; 830 million
    pages were collected
  • From these we constructed a huge directed graph
    of 830 million nodes, totaling 400 GB of data
  • A program ran for one week on 16 computers and
    generated the shape parameters

16
An algorithm of engineering flavor
  • Represent the 830-million-node graph as an
    adjacency list (sequential files on disk)
  • Pick some seeds that are certainly in the SCC
  • BFS forward until convergence, obtaining a set
    FS
  • BFS backward, obtaining a set BS
  • The intersection of FS and BS is the SCC
  • FS - SCC is OUT
  • BS - SCC is IN
  • BFS from the union of FS and BS, ignoring edge
    direction, obtains the WCC
  • Total - WCC gives the disconnected parts
  • WCC - (SCC ∪ IN ∪ OUT) gives the TENDRILS
    (a toy version is sketched below)
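
An in-memory toy version of this decomposition (the
real computation was disk-based and distributed;
the adjacency dicts and the seed choice below are
assumptions):

    from collections import deque

    def bfs(adj, seeds):
        # Plain BFS over an adjacency dict; returns the reached set.
        seen, queue = set(seeds), deque(seeds)
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return seen

    def bowtie(nodes, adj, radj, seeds):
        # adj: forward adjacency; radj: reversed adjacency;
        # seeds: nodes assumed to lie in the SCC.
        fs = bfs(adj, seeds)                  # forward-reachable set FS
        bs = bfs(radj, seeds)                 # backward-reachable set BS
        scc = fs & bs
        out, in_ = fs - scc, bs - scc
        undirected = {u: list(adj.get(u, ())) + list(radj.get(u, ()))
                      for u in nodes}
        wcc = bfs(undirected, fs | bs)        # ignore edge direction
        disc = set(nodes) - wcc               # disconnected pieces
        tendrils = wcc - scc - in_ - out      # tendrils (and tubes)
        return scc, in_, out, tendrils, disc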

17
Structure
  • Discover meaningful local structures, such as
    web communities, etc.
  • Hierarchies of the web
  • page, host, domain (organization)
  • page, city, province, country

18
Page, host, and domain
  • Page: http://.../....html (a complete url)
  • Host: http://.../, such as a departmental
    website of a university
  • Domain: http://..../, such as the collection of
    all departmental websites of a university
  • Weighted digraphs can be constructed over the
    hosts and the domains, respectively (a sketch
    follows)
  • Then shapes can be figured out for them, too.
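
For instance, a host-level weighted digraph can be
collapsed from a page-level edge list in a few
lines; this is a sketch, not the actual pipeline,
and the (src_url, dst_url) input format is an
assumption (domain aggregation is analogous):

    from collections import Counter
    from urllib.parse import urlsplit

    def host_graph(page_edges):
        # page_edges: iterable of (src_url, dst_url) hyperlink pairs.
        # Edge weight = number of page-level hyperlinks between hosts.
        weights = Counter()
        for src, dst in page_edges:
            h1, h2 = urlsplit(src).hostname, urlsplit(dst).hostname
            if h1 and h2 and h1 != h2:
                weights[(h1, h2)] += 1
        return weights    # {(src_host, dst_host): weight}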

19
The result (our WWW 2008 paper)
Donato, D., Leonardi, S., Millozzi, S., and
Tsaparas, P. Mining the inner structure of the Web
graph. Eighth International Workshop on the Web and
Databases (WebDB 2005), June 16-17, 2005,
Baltimore, Maryland.
20
Search
  • A dream rooted in the ephemeral nature of web
    content: a web archive
  • Searching from the archive: temporal, an
    additional dimension

21
More on the dynamics of the web size
  • Recall the previous exponential model
  • Since we have determined µ ≈ 0.7, setting the
    deletion probability p(µ,t) = 0.5 gives
    t = ln 2/µ ≈ 0.99 years. This says half of the
    current pages will disappear within a year.
  • Also recall the growth formula

We arrive at a simple conclusion.
22
Web InfoMall: the Chinese Web Archive since 2001
  • The crawl started in 2001, and the first batch
    of data was put online on Jan 18, 2002.
  • As of today, the repository totals over 3
    billion Chinese web pages; more precisely, pages
    crawled from web servers hosted in mainland
    China
  • It grows by about 12 million pages every day.
  • Total online (compressed) data volume: 30 TB.

23
InfoMall (Chinese slide title garbled in this
transcript)
24
www.sina.com.cn (Chinese slide title garbled)
25
2002.1.18: The headquarters of Bin Laden was
bombed.
26
The first air strike of the new year: the American
Air Force bombed the headquarters of Bin Laden.
27
(Chinese slide title not recoverable from this
transcript)
28
2002.10.8
29
2003.9.2
30
Search from the archive: an example
  • A comparison with an ordinary search engine

http://hist.infomall.cn
31
HisTrace
32
What does it take to build such a search engine?
  • Step 1: take the 2.5 billion pages in Web
    InfoMall
  • Step 2: pick out all the article pages,
    resulting in 430 million in total
  • Step 3: partition the article-page set into
    replica subsets, resulting in 70 million in
    total
  • Figure out the earliest date of publication for
    each replica subset (sketched below)
  • Create an index over the articles
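
A hedged sketch of the replica-grouping and dating
steps: real replica detection would use
near-duplicate fingerprints (e.g., shingling); an
exact content hash stands in for it here:

    import hashlib
    from collections import defaultdict

    def earliest_publication(articles):
        # articles: iterable of (article_text, crawl_date) pairs.
        groups = defaultdict(list)
        for text, date in articles:
            key = hashlib.md5(text.encode("utf-8")).hexdigest()
            groups[key].append(date)
        # The earliest crawl date of a replica subset approximates
        # the article's publication date.
        return {key: min(dates) for key, dates in groups.items()}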

33
Mining
  • The difference between search and mining
  • Search: the answer is in some page
  • Mining: the answer is in a set of pages

34
WebDigest: looking for the Ws
  • When
  • The time an event occurred; the publication time
    of a report about the event
  • Where
  • The venue of an event; the location of the
    publisher
  • Who
  • Not only a name, but also attributes
  • What
  • Planned events, breaking news

35
Example: who -- about persons
  • Problem 1: given a set of person names, find
    all web pages about each of them
  • Easy: a search engine will do
  • Not easy: what about two persons with the same
    name?
  • Problem 2: given a snapshot (say 1B pages) of
    the web, find the top N celebrities
  • Not easy: we don't even know who should be
    compared!
  • Problem 3: given a snapshot of the web, find
    all the people who were mentioned
  • Not easy: where to start?

These should be done efficiently.
36
Determining the top N
  • The brute-force approach
  • Analyze each page of the snapshot, extracting
    every person name.
  • Count the occurrences of each person and
    declare success!
  • It is not going to work!
  • Typically, analyzing a web page to extract
    names takes about 1 second; for 1 billion
    pages, on the order of 10,000 days (roughly 30
    years) would be needed!

37
Assumptions and observations
  • top N must be famous people (celebrities), if N
    is not too big
  • For a celebrity, there are many web pages
    describing him/her, in terms of not only name,
    but also many other attributes
  • e.g., age, job, representative work, height,
    weight, birth place,
  • Those information occurs often with certain
    common patterns
  • e.g., ??,?????,the pattern isname
    ,???place
  • Another example, ??,???,the pattern isname
    ,place?
  • Of course, we dont have complete knowledge of
    the patterns and relations in advance.

38
DIPRE (Sergey Brin, 1998)
  • Dual Iterative Pattern-Relation Expansion
  • Use these two kinds of incomplete information to
    iteratively enrich each other, discovering more
    and more celebrities
  • Start from some seed persons and their known
    relations; search to get related pages and
    discover patterns from those pages
  • (the slide's Chinese example sentence is
    garbled in this transcript)

39
DIPRE
  • With these patterns, search again to find pages
    containing other attributes
  • (the slide's Chinese example is garbled in this
    transcript)
  • In the next round, the newly found relation is
    used to get new pages, and probably to discover
    a new pattern; the new pattern then helps us
    find new relations, and so on (a schematic loop
    is sketched below)
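
A schematic of the DIPRE loop; search,
extract_patterns, and match_pattern are
hypothetical stand-ins for "query a search engine",
"derive a textual pattern around a known pair", and
"apply a pattern to find new pairs":

    def dipre(seed_relations, search, extract_patterns,
              match_pattern, rounds=3):
        # seed_relations: known (name, attribute) string pairs.
        relations, patterns = set(seed_relations), set()
        for _ in range(rounds):
            # Relations -> patterns: find pages mentioning a known
            # pair and learn the surrounding textual patterns.
            for rel in list(relations):
                for page in search(" ".join(rel)):
                    patterns.update(extract_patterns(page, rel))
            # Patterns -> relations: each pattern harvests new pairs.
            for pat in list(patterns):
                for page in search(pat):
                    relations.update(match_pattern(page, pat))
        return relations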

40
2006.7: the top 100 (Chinese title garbled in this
transcript)
41
Why can you claim they are really the top 100?
  • Proof: it suffices to show that if somebody
    belongs to the top 100, then he will be caught
    by the above process
  • If he belongs to the top 100, then he must have
    a lot of occurrences
  • Some of those occurrences must be in some
    common pattern
  • The common pattern will be discovered sooner or
    later in the iteration
  • Then he will be discovered when the pattern is
    issued as a search
  • Once discovered, his number of occurrences can
    be obtained and compared with others'.

42
Who was mentioned on the Web?
  • These are not necessarily famous people, so we
    cannot assume many occurrences and common
    patterns. As such, DIPRE is not applicable in
    this case
  • Instead, we make use of the small-world idea as
    an assumption: if someone occurs in a web page,
    the probability that he co-occurs with another
    person in some page is very high; the
    co-occurrence graph has a small diameter.
  • Thus, we start from some people (seeds), get
    related pages, extract new names, and use them
    to get new pages, ...
  • (this way, only pages containing names are
    processed; a sketch follows)
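
A minimal sketch of this expansion; pages_mentioning
and extract_names are hypothetical stand-ins for
"retrieve pages containing a name" and "run the name
extractor on one page":

    from collections import deque

    def expand_names(seeds, pages_mentioning, extract_names,
                     limit=2_000_000):
        # extract_names(page) returns the set of names found.
        known, queue = set(seeds), deque(seeds)
        while queue and len(known) < limit:
            name = queue.popleft()
            # Only pages containing a known name are ever processed.
            for page in pages_mentioning(name):
                for new_name in extract_names(page) - known:
                    known.add(new_name)
                    queue.append(new_name)
        return known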

43
A program ran for 7 days
  • It had found 2.1 million names by the time the
    seed set reached 1,500
  • Among pages containing names, there are on
    average 32 names per page.
  • It discovered one page containing 11,480 names!

44
2006: the page containing the largest number of
person names: 11,480
45
Going beyond entity extraction: relations
  • Relations between entities of the same type: a
    social network
  • Co-occurrence (in a page)
  • Linkage (between two pages)
  • Relations between entities of different types: a
    much broader space!
  • Who appeared where, and when
  • Who did what, at when1, when2, ...

46
Our demo at WWW 2008
47
A scary example
Returning person names for a given event
48
Summary and conclusions
  • World Wide Web: size and growth, shape and
    structure, search and mining
  • The topics are not new, but recurrent, and the
    space is still largely unexplored
  • For any method or algorithm that is supposed to
    deal with web-scale data, efficiency should be
    the priority, while the web's large amount of
    redundancy helps to improve accuracy