Title: World Wide Web
1. World Wide Web
Size, Growth, Shape, Structure, Search, Mining
- Li Xiaoming
- Network Computing and Information Systems
- Peking University
- http://ncis.pku.edu.cn
- June 2, 2008
2. Size
3. As of 2007
- Surface web / indexable web: roughly 15 to 30 billion pages
  - http://www.worldwidewebsize.com: 15.46 billion
  - http://www.boutell.com: 29.7 billion
  - A more authoritative number: 11.5 billion in 2005 (see Gulli, WWW 2005)
- Deep web: 544 times more
- Each page has an average size of 15 KB (in China, the number is 20 KB)
4. Estimation: the capture-recapture model
- Notation and assumptions
  - w: the size of the web
  - n, m: two independent, random subsets of the web; we also use n and m to denote their sizes
  - x: the size of the intersection of n and m
- Reasoning
  - By the independence assumption, n/w = x/m, thus w = mn/x (see the derivation below)
  - It is tempting to take n and m to be the collections of two search engines, assuming their owners will tell us the sizes. But how do we figure out x?
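The step from independence to the estimate is the classical capture-recapture (Lincoln-Petersen) argument: since m is itself a random subset, the fraction of m that falls inside n should match the fraction of the whole web covered by n. Written out:

```latex
% Capture-recapture (Lincoln-Petersen) estimate of the web size w:
% n, m are independent random subsets of the web, x is the size of their intersection.
\[
  \frac{n}{w} \;\approx\; \frac{x}{m}
  \qquad\Longrightarrow\qquad
  w \;\approx\; \frac{n\,m}{x}.
\]
```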
5. Intersection of two search engines
- It would be easy to count x if we could get the two collections. But we cannot.
- It would also be easy if we could take a random subset n1 of n and ask the owner of m how many of them (call it m1) intersect with m. We cannot, but this gives us the hint!
- Due to the randomness of n, m, and n1, we may assume m1/n1 = x/n, thus x = n*m1/n1 (see the derivation below).
- In practice, researchers explore search engines (via queries) to estimate these numbers.
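Combining the two approximations gives an estimator that needs only the sample counts n1 and m1 together with the size m of the other engine; a one-line derivation in the slide's notation:

```latex
\[
  x \;\approx\; \frac{n\,m_1}{n_1}
  \qquad\Longrightarrow\qquad
  w \;\approx\; \frac{n\,m}{x} \;=\; \frac{n\,m}{\,n\,m_1/n_1\,} \;=\; \frac{m\,n_1}{m_1}.
\]
```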
6. Exploring search engines (E1, E2)
- Sampling at search engine E1
  - queries -> URLs (pages), hopefully at random
  - random queries drawn from a large corpus
  - random selection of URLs from the returned list
- Checking against search engine E2 to see whether the URLs found in E1 are also found in E2 (sketched below)
  - pages -> strong queries -> URLs
  - a strong query is constructed from a particular page, trying to pinpoint its URL while accounting for possible aliases, replicas, etc.
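A minimal sketch of the sampling-and-checking procedure, under assumptions: the helper functions `search_e1(query)`, `search_e2(query)`, and `fetch_page(url)` are hypothetical stand-ins for calls to real search-engine APIs and a fetcher, and the "strong query" heuristic here (rarest-looking terms of the page) only illustrates the idea.

```python
import random
import re
from collections import Counter

def sample_urls_from_e1(corpus_terms, search_e1, n_samples=100, k=5):
    """Sample (hopefully random) URLs from engine E1 via random conjunctive queries."""
    sampled = []
    for _ in range(n_samples):
        query = " ".join(random.sample(corpus_terms, k))   # random query from a large corpus
        results = search_e1(query)
        if results:
            sampled.append(random.choice(results))          # random URL from the returned list
    return sampled

def strong_query(page_text, num_terms=8):
    """Build a 'strong' query from a page: terms meant to pin down that particular page."""
    words = re.findall(r"[a-z]{4,}", page_text.lower())
    # Heuristic: the least frequent words within the page stand in for globally rare terms.
    rarest = [w for w, _ in Counter(words).most_common()][-num_terms:]
    return " ".join(rarest)

def overlap_fraction(sampled_urls, fetch_page, search_e2):
    """Fraction m1/n1 of E1-sampled pages that E2 also appears to index."""
    found = 0
    for url in sampled_urls:
        q = strong_query(fetch_page(url))
        # Count a hit if E2 returns the URL; alias/replica resolution would go here.
        if url in search_e2(q):
            found += 1
    return found / len(sampled_urls) if sampled_urls else 0.0
```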
7. ..., you get the idea
- Krishna Bharat and Andrei Broder, A technique for measuring the relative size and overlap of public Web search engines, WWW 1998.
- A. Gulli and A. Signorini, The indexable Web is more than 11.5 billion pages, WWW 2005.
- There have been more attempts, but the basic approaches are similar:
  - Theoretically, some variation of the capture-recapture model
  - Practically, exploring two or more search engines
8. Growth
- For the world or for a country, how many web pages have there been since 1995, year by year?
9. We start with some assumptions
- Practical assumptions
  - We know the current number (size)
  - We are able to get a sufficiently large and random subset of the total (for whatever scope is of interest)
  - The last-modified-time attribute is valid for each member of the subset
- Scientific assumption
  - Page changes (modification and deletion) follow a Poisson process: independence and exponentially distributed intervals
10.
- Let ci denote the number of web pages at the indicated time (year 2001-i). Assume c7 = 0 and c0 is known.
- Let mi denote the number of web pages whose last-modified-time falls in year 2001-i, observed at the time of c0.
- Let ai denote the number of pages with LMT in year 2001-i, observed at the time of ci.
- The exponential assumption: 1 - e^(-µt) is the probability that a page gets deleted within time t after its last modification; denote it by p(µ, t). (A note on how these quantities connect follows below.)
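A hedged note on how these quantities connect; the slide does not spell out the actual recurrence, so this only records the survival relation implied by the stated assumptions, assuming µ is the rate at which a page last modified in year 2001-i ceases to show that LMT (by deletion, or by being modified again):

```latex
\[
  p(\mu, t) \;=\; 1 - e^{-\mu t},
  \qquad
  m_i \;\approx\; a_i \, e^{-\mu i} \quad (i = 0, 1, \dots, 7).
\]
```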
11. Growth (continued)
- With ai as a bridge, we were able to express ci as a function of ci+1 and mi, with µ as a to-be-determined parameter.
- Setting the proper boundary condition on ci (c7 = 0), we were able to solve for the unknown ci's and obtained µ ≈ 0.7.
- This growth curve has since been nicely extended by five more years of actual data collected after 2002.
12. Shape
- What does the web (of the world, or of a country) look like in terms of shape?
- A bow-tie! But how is that determined?
13. A practical approach
- The model: a directed graph
  - nodes: pages; edges: hyperlinks
- The problem: given the set of all web pages of interest, determine the shape of the web graph that best represents its overall structure
- The challenges
  - "Shape"? "Best represents"? What do you mean? We are searching for a path while the destination is unknown.
  - The sheer amount of data overloads any graph algorithm; we need to re-engineer the algorithms
14. For the challenges
- Broder et al.'s work (Graph structure in the Web, WWW 2000) provides an insight into the features of the shape, i.e., look for
  - the strongly connected component (SCC)
  - IN and OUT
  - tendrils
  - tubes
  - disconnected parts
- (which is useful, but may be refined, e.g., into a "daisy", or redefined)
- But how to figure them out effectively?
15. An example
- In Jan-Feb 2006, we conducted a complete crawl of the Chinese web; 830 million pages were collected
- As a result, we constructed a huge directed graph of 830 million nodes, amounting to 400 GB of data
- A program ran for one week on 16 computers and generated the shape parameters
16. An algorithm of engineering flavor
- Represent the 830M-node graph as adjacency lists (sequential files on disk)
- Pick some seeds that are surely in the SCC
- BFS forward until it converges, obtaining a set FS
- BFS backward, obtaining a set BS
- The intersection of FS and BS is the SCC
- FS − SCC is OUT
- BS − SCC is IN
- BFS from the union of FS and BS, ignoring edge direction, obtains the WCC
- Total − WCC is DISC (the disconnected part)
- WCC − (FS ∪ BS) is the TENDRILS (together with the tubes); a sketch follows below
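A small in-memory sketch of the decomposition described above (the real computation streamed 400 GB of adjacency lists across 16 machines); the names `graph` (node -> out-neighbors) and `seeds` (nodes assumed to lie in the SCC) are mine.

```python
from collections import deque, defaultdict

def bfs(adj, starts):
    """Plain BFS over an adjacency dict; returns the set of reached nodes."""
    seen, queue = set(starts), deque(starts)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bowtie(graph, seeds):
    """Decompose a directed graph into bow-tie components, starting from SCC seed nodes."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    # Reverse and undirected adjacency for the backward / direction-free BFS passes.
    rgraph, ugraph = defaultdict(list), defaultdict(list)
    for u, vs in graph.items():
        for v in vs:
            rgraph[v].append(u)
            ugraph[u].append(v)
            ugraph[v].append(u)
    fs = bfs(graph, seeds)         # forward-reachable set FS
    bs = bfs(rgraph, seeds)        # backward-reachable set BS
    scc = fs & bs                  # FS ∩ BS is the SCC
    out = fs - scc                 # FS − SCC is OUT
    in_ = bs - scc                 # BS − SCC is IN
    wcc = bfs(ugraph, fs | bs)     # weakly connected component around the core
    disc = nodes - wcc             # everything outside the WCC is disconnected
    tendrils = wcc - (fs | bs)     # the rest of the WCC: tendrils (and tubes)
    return {"SCC": scc, "IN": in_, "OUT": out, "TENDRILS": tendrils, "DISC": disc}
```

On a toy graph such as `{'a': ['b'], 'b': ['a', 'c'], 'd': ['a']}` with seed `'a'`, this puts a and b in the SCC, c in OUT, and d in IN.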
17. Structure
- Discover meaningful local structures, such as web communities
- Hierarchy of the web
  - page, host, domain (organization)
  - page, city, province, country
18. Page, host, and domain
- Page: http://.../....html (a complete URL)
- Host: http://.../, such as a departmental website of a university
- Domain: http://..../, such as the collection of all departmental websites of a university
- Weighted digraphs can be constructed at the host and domain levels, respectively (a sketch follows below)
- Then shapes can be figured out for them, too.
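A minimal sketch of lifting the page-level graph to a weighted host-level digraph, assuming the input is a list of (source URL, target URL) hyperlinks; the example URL in the comment is illustrative only, and a domain-level graph can be built the same way with a different key function.

```python
from collections import Counter
from urllib.parse import urlsplit

def host_of(url):
    """Key function: the host part of a URL, e.g. http://www.cs.pku.edu.cn/a.html -> www.cs.pku.edu.cn."""
    return urlsplit(url).netloc.lower()

def lift_graph(page_links, key=host_of):
    """Aggregate page-level hyperlinks (src_url, dst_url) into a weighted digraph over hosts.

    Returns a Counter mapping (src_host, dst_host) -> number of page-level links."""
    weights = Counter()
    for src, dst in page_links:
        s, d = key(src), key(dst)
        if s != d:                     # drop intra-host links; keep them if self-loops matter
            weights[(s, d)] += 1
    return weights
```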
19. The result (our WWW 2008 paper)
Donato, D., Leonardi, S., Millozzi, S., Tsaparas, P.: Mining the inner structure of the Web graph. Eighth International Workshop on the Web and Databases (WebDB 2005), June 16-17, 2005, Baltimore, Maryland.
20. Search
- A dream rooted in the ephemeral nature of web content: a web archive
- Search over the archive: time as an additional dimension
21. More on the dynamics of the web size
- Recall the previous exponential model
  - Since we have determined µ ≈ 0.7, we obtain t ≈ 0.99 when we set the probability to 0.5 (see the computation below). This says that half of the current pages will disappear within a year.
- Also recall the growth formula
We arrive at a simple conclusion
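The half-life figure quoted above follows directly from the exponential model with µ ≈ 0.7:

```latex
\[
  1 - e^{-\mu t} = \tfrac{1}{2}
  \;\Longrightarrow\;
  t = \frac{\ln 2}{\mu} \approx \frac{0.693}{0.7} \approx 0.99 \ \text{years}.
\]
```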
22. Webinfomall: the Chinese Web Archive since 2001
- The crawl started in 2001 and the first batch of data was put online on Jan 18, 2002.
- As of today, the repository holds a total of over 3 billion Chinese web pages; more precisely, pages crawled from web servers hosted in mainland China.
- About 12 million pages are added every day.
- Total online (compressed) data volume: 30 TB.
23. InfoMall
24. www.sina.com.cn
25. 2002.1.18
Bin Laden's headquarters was bombed.
26.
The first air strike of the new year: the American air force bombed Bin Laden's headquarters.
27.
28. 2002.10.8
29. 2003.9.2
30. Search from the archive: an example
- A comparison with an ordinary search engine
- http://hist.infomall.cn
31. HisTrace
32. What does it take to build such a search engine?
- Step 1: take the 2.5 billion pages in Webinfomall
- Step 2: pick out all the article pages, resulting in 430 million in total
- Step 3: partition the article-page set into replica subsets, resulting in 70 million in total (sketched below)
- Figure out the earliest publication date for each replica subset
- Create an index over the articles
33. Mining
- The difference between search and mining
  - Search: the answer is in some page
  - Mining: the answer is in a set of pages
34. WebDigest: looking for the W's
- When
  - The time an event occurred; the publication time of a report about the event
- Where
  - The venue of an event; the location of the publisher
- Who
  - Not only a name, but also attributes
- What
  - Planned events, breaking news
35. Example: who -- about persons
- Problem 1: given a set of person names, find all web pages about each of them
  - Easy: a search engine will do
  - Not easy: what about two persons with the same name?
- Problem 2: given a snapshot (say 1 billion pages) of the web, find the top N celebrities
  - Not easy: we do not even know who should be compared!
- Problem 3: given a snapshot of the web, find all the people who were mentioned
  - Not easy: where to start?
- All of these should be done efficiently
36. Determine the top N
- Brute-force approach
  - Analyze each page of the snapshot and extract every person name.
  - Compare the occurrence counts of the persons and declare success!
- It is not going to work!
  - Typically, analyzing a web page to extract names takes about 1 second; for 1 billion pages, more than 10,000 days would be needed (see the arithmetic below)!
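The back-of-the-envelope arithmetic behind that estimate:

```latex
\[
  10^{9}\ \text{pages} \times 1\ \tfrac{\text{s}}{\text{page}}
  \;=\; 10^{9}\ \text{s}
  \;\approx\; \frac{10^{9}}{86\,400\ \text{s/day}}
  \;\approx\; 1.2 \times 10^{4}\ \text{days}
  \quad(\text{over 30 years on a single machine}).
\]
```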
37. Assumptions and observations
- The top N must be famous people (celebrities), if N is not too big
- For a celebrity, there are many web pages describing him/her, in terms of not only the name but also many other attributes
  - e.g., age, job, representative work, height, weight, birth place, ...
- Such information often occurs in certain common patterns
  - e.g., a sentence of the form "<name>, <relation phrase> <place>" gives the pattern <name>, ... <place>
  - another example: "<name>, <place> ..." gives the pattern <name>, <place> ...
- Of course, we do not have complete knowledge of the patterns and relations in advance.
38. DIPRE (Sergey Brin, 1998)
- Dual Iterative Pattern-Relation Expansion
- Use these two kinds of incomplete information (patterns and relations) to iteratively enrich each other, so as to discover more and more celebrities
- Start from some seed persons and their known relations, search to get related pages, and discover patterns from those pages
  - e.g., the pages returned for a seed person contain sentences mentioning the person together with a known attribute, and the surrounding text gives a pattern
39. DIPRE (continued)
- With these patterns, search again to find pages containing other persons and attributes
  - e.g., matching a pattern in new pages yields new <name, attribute> pairs
- In the next round, the newly found relations are used to get new pages, and probably to discover new patterns; the new patterns then help us find new relations, and so on (a sketch of the iteration follows below)
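A compact sketch of the dual iteration described on the last two slides, under assumptions: `search(query)` is a hypothetical function returning text snippets, and a "pattern" here is simply the text sandwiched between a known name and attribute, which follows the spirit of DIPRE rather than Brin's exact implementation.

```python
import re

def find_patterns(snippets, name, attribute):
    """From snippets mentioning a known (name, attribute) pair, extract surrounding-text patterns."""
    patterns = set()
    for text in snippets:
        for m in re.finditer(re.escape(name) + r"(.{1,15}?)" + re.escape(attribute), text):
            patterns.add(m.group(1))          # the 'glue' text between name and attribute
    return patterns

def apply_pattern(snippets, pattern):
    """Use a pattern as a template to pull out new (name, attribute) candidates."""
    pairs = set()
    regex = r"([^\s,。]{2,10})" + re.escape(pattern) + r"([^\s,。]{2,10})"
    for text in snippets:
        for m in re.finditer(regex, text):
            pairs.add((m.group(1), m.group(2)))
    return pairs

def dipre(seed_pairs, search, rounds=3):
    """Dual Iterative Pattern-Relation Expansion: relations -> patterns -> more relations -> ..."""
    relations, patterns = set(seed_pairs), set()
    for _ in range(rounds):
        for name, attr in list(relations):
            patterns |= find_patterns(search(f'"{name}" "{attr}"'), name, attr)
        for p in patterns:
            relations |= apply_pattern(search(f'"{p}"'), p)
    return relations
```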
40. 2006.7: the top 100 celebrities
41. Why can you claim they are really the top 100?
- Proof: it suffices to show that if somebody belongs to the top 100, then he will be caught by the above process
  - If he belongs to the top 100, then he must have a lot of occurrences
  - Some of the occurrences must follow some common pattern
  - The common pattern will be discovered sooner or later in the iteration
  - Then he will be discovered when the pattern is issued as a search query
  - Once discovered, his number of occurrences can be obtained and compared with the others'.
42. Who was mentioned on the Web?
- Not necessarily famous people, so we cannot rely on the assumptions of many occurrences and common patterns. As such, DIPRE is not applicable in this case.
- Instead, we use the small-world idea as an assumption: if someone appears in a web page, the probability that he co-occurs with another person in some page is very high, and the graph of co-occurrences has a small diameter.
- Thus, we start from some people (seeds), get related pages, extract new names, use them to get new pages, and so on (a sketch follows below).
- (This way, only pages containing names are processed.)
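A sketch of the seed-driven expansion, assuming hypothetical helpers `pages_mentioning(name)` (e.g., a query against the local snapshot index) and `extract_person_names(page_text)` (a named-entity extractor); note that only pages retrieved through some already-known name are ever analyzed.

```python
from collections import deque

def expand_names(seeds, pages_mentioning, extract_person_names, max_names=2_000_000):
    """Breadth-first expansion over the person co-occurrence graph implied by web pages."""
    known = set(seeds)
    frontier = deque(seeds)
    while frontier and len(known) < max_names:
        name = frontier.popleft()
        for page_text in pages_mentioning(name):   # only pages containing a known name are processed
            for other in extract_person_names(page_text):
                if other not in known:
                    known.add(other)
                    frontier.append(other)
    return known
```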
43. A program ran for 7 days
- It obtained 2.1 million names by the time the seed set reached 1,500
- Among pages containing names, there are on average 32 names per page.
- It discovered a page containing 11,480 names!
44. 2006: the page containing the largest number of person names, 11,480
45. Going beyond entity extraction: relations
- Relations between entities of the same type: a social network
  - co-occurrence (within a page); a sketch follows below
  - linked (between two pages)
- Relations between entities of different types: a much broader space!
  - who appeared where, and when
  - who did what, at when1, when2, ...
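A small sketch of the same-type case: building a co-occurrence social network from pages and the person names found in them (the linked-pages variant would instead add an edge when two names sit at the two ends of a hyperlink); the input format `pages_to_names` is my assumption.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(pages_to_names):
    """pages_to_names: dict mapping a page URL to the set of person names it mentions.

    Returns a Counter over unordered name pairs, weighted by the number of shared pages."""
    edges = Counter()
    for names in pages_to_names.values():
        for a, b in combinations(sorted(names), 2):   # every pair co-occurring in this page
            edges[(a, b)] += 1
    return edges
```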
46. Our demo at WWW 2008
47. A scary example
Returning person names for a given event
48. Summary and conclusions
- World Wide Web: size and growth, shape and structure, search and mining
- The topics are not new but recurrent, and the space is still largely unexplored
- For any method or algorithm that is meant to deal with web-scale data, efficiency should be the priority, while the large amount of redundancy helps to improve accuracy