1
World Wide Web
Size, Growth, Shape, Structure, Search, Mining
  • Li Xiaoming
  • Network Computing and Information Systems
  • Peking University
  • http://ncis.pku.edu.cn
  • June 2, 2008

2
Size
3
As of 2007
  • Surface web / indexable web: roughly 15-30
    billion pages
  • http://www.worldwidewebsize.com: 15.46 billion
  • http://www.boutell.com: 29.7 billion
  • A more authoritative number: 11.5 billion in
    2005 (see Gulli and Signorini, WWW 2005)
  • Deep web: 544 times larger
  • The average page size is 15 KB (in China, the
    number is 20 KB)

4
Estimation: the capture-recapture model
  • Notation and assumptions
  • w: the size of the web
  • n, m: two independent, random subsets of the
    web; we also use n and m for their sizes
  • x: the size of the intersection of n and m
  • Reasoning
  • By the independence assumption on n and m,
    n/w = x/m, thus w = m*n/x
  • It is tempting to take n and m to be two search
    engines' collections, assuming the owners will
    tell us their sizes. But how do we figure out x?
    (a numerical sketch follows)
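
As a numerical sketch of the reasoning above (the
three numbers below are made-up placeholders, not
measurements):

    # Capture-recapture: independence gives
    # n/w = x/m, hence w = m*n/x.
    n = 4.0e9   # size of the first random subset (hypothetical)
    m = 5.0e9   # size of the second random subset (hypothetical)
    x = 2.5e9   # size of their intersection (hypothetical)
    w = m * n / x
    print(f"estimated web size: {w:.2e} pages")   # 8.00e+09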

5
Intersection of two search engines
  • It would be easy to count, if we could get the
    two collections. But we can't.
  • It would also be easy to figure out if we could
    take a random subset n1 of n and ask the owner
    of m how many (m1) of n1's members intersect
    with m. We can't ask, but we have got the hint!
  • By the randomness of n, m, and n1, we are able
    to assume m1/n1 = x/n, thus x = n*m1/n1.
  • In practice, researchers explore search engines
    (via queries) to estimate these numbers, as
    sketched below.
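
A minimal sketch of this sampling trick in Python;
check_in_engine_2 is a hypothetical stand-in for
"probe engine 2 for this url" (in practice, via the
strong queries of the next slide):

    import random

    def check_in_engine_2(url):
        # Placeholder: in practice, issue a strong query for the
        # page to engine 2 and see whether the url (or an alias)
        # comes back.
        raise NotImplementedError

    def estimate_web_size(engine1_urls, n, m, k=1000):
        """n, m: index sizes of engines 1 and 2; k: size of n1."""
        n1 = random.sample(list(engine1_urls), k)    # random subset of n
        m1 = sum(check_in_engine_2(u) for u in n1)   # members also in m
        x = n * m1 / k                               # x = n*m1/n1
        return m * n / x                             # w = m*n/x (= m*k/m1)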

6
Exploring search engines (E1, E2)
  • Sampling at search engine E1
  • queries -> urls (pages), hopefully random
  • random queries drawn from a large corpus
  • random selection of urls from the returned lists
  • Checking against search engine E2 to see whether
    the urls found in E1 are also found in E2
  • pages -> strong queries -> urls
  • a strong query is constructed from a particular
    page, trying to pinpoint its url while allowing
    for possible aliases, replicas, etc. (a sketch
    follows)
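
A hedged sketch of strong-query construction: pick
the page's rarest terms so that a conjunctive query
pins down that page. Here df, a term-to-document-
frequency dictionary derived from some reference
corpus, is an assumption:

    import re

    def strong_query(page_text, df, num_terms=8):
        # Candidate terms: words of the page the corpus knows about.
        words = set(re.findall(r"[a-z]+", page_text.lower()))
        known = [w for w in words if w in df]
        # The rarest terms discriminate best among pages.
        rarest = sorted(known, key=lambda w: df[w])[:num_terms]
        return " ".join(rarest)   # submit as a conjunctive query to E2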

7
..., and you get the idea
  • Krishna Bharat and Andrei Broder, "A technique
    for measuring the relative size and overlap of
    public Web search engines," WWW 1998.
  • A. Gulli and A. Signorini, "The indexable Web is
    more than 11.5 billion pages," WWW 2005.
  • There have been more attempts, but the basic
    approaches are similar
  • Theoretically: some variation of the
    capture-recapture model
  • Practically: exploring two or more search
    engines

8
Growth
  • For the world or a country, how many web pages
    have there been since 1995, year by year?

9
We start with some assumptions
  • Practical assumptions
  • We know the current number (size)
  • We are able to get a sufficiently large, random
    subset of the total (for whatever scope is of
    interest)
  • The last-modified-time attribute is valid for
    each member of the subset
  • Scientific assumption
  • Page changes (modification and deletion) follow
    a Poisson process
  • independent increments, exponentially
    distributed waiting times

10
  • Let ci denote the number of web pages at the
    indicated time. Assume
  • c7 = 0, and c0 is known
  • Let mi denote the number of web pages with
    last-modified-time 2001-i, learnt at the time
    of c0
  • Let ai denote the number of pages with
    last-modified-time 2001-i, observed at the time
    of ci
  • The exponential assumption: 1 - e^(-µt) is the
    probability that a page gets deleted within time
    t after its last modification; denote it p(µ,t)
    (see the sketch below)
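
Under these assumptions, the deletion model is only
a few lines; µ = 0.7 below anticipates the value
estimated on the next slides:

    import math

    def p_deleted(mu, t):
        # Probability a page is deleted within t years of its
        # last modification: p(mu, t) = 1 - e^(-mu*t).
        return 1.0 - math.exp(-mu * t)

    mu = 0.7
    print(p_deleted(mu, 1.0))   # ~0.50: half the pages gone in a year
    print(math.log(2) / mu)     # ~0.99 years: the half-life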

11
Growth (continued)
  • With ai as a bridge, we were able to establish
    ci as a function of ci+1 and mi, with µ as a
    to-be-determined parameter.
  • Setting the proper boundary conditions on ci, we
    were able to solve for the unknown ci's, and
    µ ≈ 0.7.

This growth curve has been nicely extended by five
more years of actual data gathered since 2002.
12
Shape
  • What does the web (of the world, or of a
    country) look like, in terms of shape?
  • A bow-tie! But how is that determined?

13
A practical approach
  • The model: a directed graph
  • Nodes: pages; edges: hyperlinks
  • The problem: given the set of all web pages of
    interest, determine the shape of the web graph
    that best represents its overall structure
  • The challenges
  • "Shape"? "Best represent"? What do you mean?
    We are in search of a path while the destination
    is unknown.
  • The sheer amount of data overloads any standard
    graph algorithm; we need to re-engineer the
    algorithms

14
For the challenges
  • Broder et al.'s work (WWW 2000) provides an
    insight into the features of the shape, i.e.,
    look for
  • the strongly connected component (SCC)
  • the IN and OUT components
  • tendrils
  • tubes
  • disconnected parts
  • (which is useful, but may be refined, e.g. into
    a "daisy", or redefined)
  • But how do we figure them out effectively?

15
An example
  • In January-February 2006, we conducted a
    complete crawl of the Chinese web; 830 million
    pages were collected
  • From these we constructed a huge directed graph
    of 830 million nodes, totaling 400 GB of data
  • A program ran for one week on 16 computers and
    generated the shape parameters

16
An algorithm of engineering flavor
  • Represent the 830-million-node graph as an
    adjacency list (sequential files on disk)
  • Pick some seeds that are certainly in the SCC
  • BFS forward until convergence, obtaining a set
    FS
  • BFS backward, obtaining a set BS
  • The intersection of FS and BS is the SCC
  • FS - SCC is OUT
  • BS - SCC is IN
  • BFS from the union of FS and BS, ignoring edge
    direction, obtains the WCC
  • Total - WCC gives the disconnected parts
  • WCC - (SCC ∪ IN ∪ OUT) gives the TENDRILS
    (a toy version is sketched below)
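
An in-memory toy version of this decomposition (the
real computation was disk-based and distributed;
the adjacency dicts and the seed choice below are
assumptions):

    from collections import deque

    def bfs(adj, seeds):
        # Plain BFS over an adjacency dict; returns the reached set.
        seen, queue = set(seeds), deque(seeds)
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return seen

    def bowtie(nodes, adj, radj, seeds):
        # adj: forward adjacency; radj: reversed adjacency;
        # seeds: nodes assumed to lie in the SCC.
        fs = bfs(adj, seeds)                  # forward-reachable set FS
        bs = bfs(radj, seeds)                 # backward-reachable set BS
        scc = fs & bs
        out, in_ = fs - scc, bs - scc
        undirected = {u: list(adj.get(u, ())) + list(radj.get(u, ()))
                      for u in nodes}
        wcc = bfs(undirected, fs | bs)        # ignore edge direction
        disc = set(nodes) - wcc               # disconnected pieces
        tendrils = wcc - scc - in_ - out      # tendrils (and tubes)
        return scc, in_, out, tendrils, disc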

17
Structure
  • Discover meaningful local structures, such as
    web communities, etc.
  • Hierarchies of the web
  • page, host, domain (organization)
  • page, city, province, country

18
Page, host, and domain
  • Page: http://.../....html (a complete url)
  • Host: http://.../, such as a departmental
    website of a university
  • Domain: http://..../, such as the collection of
    all departmental websites of a university
  • Weighted digraphs can be constructed over the
    hosts and the domains, respectively (a sketch
    follows)
  • Then shapes can be figured out for them, too.
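
For instance, a host-level weighted digraph can be
collapsed from a page-level edge list in a few
lines; this is a sketch, not the actual pipeline,
and the (src_url, dst_url) input format is an
assumption (domain aggregation is analogous):

    from collections import Counter
    from urllib.parse import urlsplit

    def host_graph(page_edges):
        # page_edges: iterable of (src_url, dst_url) hyperlink pairs.
        # Edge weight = number of page-level hyperlinks between hosts.
        weights = Counter()
        for src, dst in page_edges:
            h1, h2 = urlsplit(src).hostname, urlsplit(dst).hostname
            if h1 and h2 and h1 != h2:
                weights[(h1, h2)] += 1
        return weights    # {(src_host, dst_host): weight}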

19
The result (our WWW 2008 paper)
Donato, D., Leonardi, S., Millozzi, S., and
Tsaparas, P. Mining the inner structure of the Web
graph. Eighth International Workshop on the Web and
Databases (WebDB 2005), June 16-17, 2005,
Baltimore, Maryland.
20
Search
  • A dream rooted in the ephemeral nature of web
    content: a web archive
  • Searching from the archive: temporal, an
    additional dimension

21
More on the dynamics of the web size
  • Recall the previous exponential model
  • Since we have determined µ ≈ 0.7, setting the
    deletion probability p(µ,t) = 0.5 gives
    t = ln 2/µ ≈ 0.99 years. This says half of the
    current pages will disappear within a year.
  • Also recall the growth formula

We arrive at a simple conclusion.
22
Web InfoMall: the Chinese Web Archive since 2001
  • The crawl started in 2001, and the first batch
    of data was put online on Jan 18, 2002.
  • As of today, the repository totals over 3
    billion Chinese web pages; more precisely, pages
    crawled from web servers hosted in mainland
    China
  • It grows by about 12 million pages every day.
  • Total online (compressed) data volume: 30 TB.

23
InfoMall (Chinese slide title garbled in this
transcript)
24
www.sina.com.cn (Chinese slide title garbled)
25
2002.1.18: The headquarters of Bin Laden was
bombed.
26
The first air strike of the new year: the American
Air Force bombed the headquarters of Bin Laden.
27
(Chinese slide title not recoverable from this
transcript)
28
2002.10.8
29
2003.9.2
30
Search from the archive: an example
  • A comparison with an ordinary search engine

http://hist.infomall.cn
31
HisTrace
32
What does it take to build such a search engine?
  • Step 1: take the 2.5 billion pages in Web
    InfoMall
  • Step 2: pick out all the article pages,
    resulting in 430 million in total
  • Step 3: partition the article-page set into
    replica subsets, resulting in 70 million in
    total
  • Figure out the earliest date of publication for
    each replica subset (sketched below)
  • Create an index over the articles
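
A hedged sketch of the replica-grouping and dating
steps: real replica detection would use
near-duplicate fingerprints (e.g., shingling); an
exact content hash stands in for it here:

    import hashlib
    from collections import defaultdict

    def earliest_publication(articles):
        # articles: iterable of (article_text, crawl_date) pairs.
        groups = defaultdict(list)
        for text, date in articles:
            key = hashlib.md5(text.encode("utf-8")).hexdigest()
            groups[key].append(date)
        # The earliest crawl date of a replica subset approximates
        # the article's publication date.
        return {key: min(dates) for key, dates in groups.items()}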

33
Mining
  • The difference between search and mining
  • Search: the answer is in some page
  • Mining: the answer is in a set of pages

34
WebDigest: looking for the Ws
  • When
  • The time an event occurred; the publication time
    of a report about the event
  • Where
  • The venue of an event; the location of the
    publisher
  • Who
  • Not only a name, but also attributes
  • What
  • Planned events, breaking news

35
Example: who -- about persons
  • Problem 1: given a set of person names, find
    all web pages about each of them
  • Easy: a search engine will do
  • Not easy: what about two persons with the same
    name?
  • Problem 2: given a snapshot (say 1B pages) of
    the web, find the top N celebrities
  • Not easy: we don't even know who should be
    compared!
  • Problem 3: given a snapshot of the web, find
    all the people who were mentioned
  • Not easy: where to start?

These should be done efficiently.
36
Determining the top N
  • The brute-force approach
  • Analyze each page of the snapshot, extracting
    every person name.
  • Count the occurrences of each person and
    declare success!
  • It is not going to work!
  • Typically, analyzing a web page to extract
    names takes about 1 second; for 1 billion
    pages, on the order of 10,000 days (roughly 30
    years) would be needed!

37
Assumptions and observations
  • top N must be famous people (celebrities), if N
    is not too big
  • For a celebrity, there are many web pages
    describing him/her, in terms of not only name,
    but also many other attributes
  • e.g., age, job, representative work, height,
    weight, birth place,
  • Those information occurs often with certain
    common patterns
  • e.g., ??,?????,the pattern isname
    ,???place
  • Another example, ??,???,the pattern isname
    ,place?
  • Of course, we dont have complete knowledge of
    the patterns and relations in advance.

38
DIPRE (Sergey Brin, 1998)
  • Dual Iterative Pattern-Relation Expansion
  • Use these two kinds of incomplete information to
    iteratively enrich each other, discovering more
    and more celebrities
  • Start from some seed persons and their known
    relations; search to get related pages and
    discover patterns from those pages
  • (the slide's Chinese example sentence is
    garbled in this transcript)

39
DIPRE
  • With these patterns, search again to find pages
    containing other attributes
  • (the slide's Chinese example is garbled in this
    transcript)
  • In the next round, the newly found relation is
    used to get new pages, and probably to discover
    a new pattern; the new pattern then helps us
    find new relations, and so on (a schematic loop
    is sketched below)
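
A schematic of the DIPRE loop; search,
extract_patterns, and match_pattern are
hypothetical stand-ins for "query a search engine",
"derive a textual pattern around a known pair", and
"apply a pattern to find new pairs":

    def dipre(seed_relations, search, extract_patterns,
              match_pattern, rounds=3):
        # seed_relations: known (name, attribute) string pairs.
        relations, patterns = set(seed_relations), set()
        for _ in range(rounds):
            # Relations -> patterns: find pages mentioning a known
            # pair and learn the surrounding textual patterns.
            for rel in list(relations):
                for page in search(" ".join(rel)):
                    patterns.update(extract_patterns(page, rel))
            # Patterns -> relations: each pattern harvests new pairs.
            for pat in list(patterns):
                for page in search(pat):
                    relations.update(match_pattern(page, pat))
        return relations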

40
2006.7: the top 100 (Chinese title garbled in this
transcript)
41
Why can you claim they are really the top 100?
  • Proof: it suffices to show that if somebody
    belongs to the top 100, then he will be caught
    by the above process
  • If he belongs to the top 100, then he must have
    a lot of occurrences
  • Some of those occurrences must be in some
    common pattern
  • The common pattern will be discovered sooner or
    later in the iteration
  • Then he will be discovered when the pattern is
    issued as a search
  • Once discovered, his number of occurrences can
    be obtained and compared with others'.

42
Who was mentioned on the Web?
  • These are not necessarily famous people, so we
    cannot assume many occurrences and common
    patterns. As such, DIPRE is not applicable in
    this case
  • Instead, we make use of the small-world idea as
    an assumption: if someone occurs in a web page,
    the probability that he co-occurs with another
    person in some page is very high; the
    co-occurrence graph has a small diameter.
  • Thus, we start from some people (seeds), get
    related pages, extract new names, and use them
    to get new pages, ...
  • (this way, only pages containing names are
    processed; a sketch follows)
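
A minimal sketch of this expansion; pages_mentioning
and extract_names are hypothetical stand-ins for
"retrieve pages containing a name" and "run the name
extractor on one page":

    from collections import deque

    def expand_names(seeds, pages_mentioning, extract_names,
                     limit=2_000_000):
        # extract_names(page) returns the set of names found.
        known, queue = set(seeds), deque(seeds)
        while queue and len(known) < limit:
            name = queue.popleft()
            # Only pages containing a known name are ever processed.
            for page in pages_mentioning(name):
                for new_name in extract_names(page) - known:
                    known.add(new_name)
                    queue.append(new_name)
        return known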

43
A program ran for 7 days
  • It had found 2.1 million names by the time the
    seed set reached 1,500
  • Among pages containing names, there are on
    average 32 names per page.
  • It discovered one page containing 11,480 names!

44
2006: the page containing the largest number of
person names: 11,480
45
Going beyond entity extraction: relations
  • Relations between entities of the same type: a
    social network
  • Co-occurrence (in a page)
  • Linkage (between two pages)
  • Relations between entities of different types: a
    much broader space!
  • Who appeared where, and when
  • Who did what, at when1, when2, ...

46
Our demo at WWW 2008
47
A scary example
Returning person names for a given event
48
Summary and conclusions
  • World Wide Web: size and growth, shape and
    structure, search and mining
  • The topics are not new, but recurrent, and the
    space is still largely unexplored
  • For any method or algorithm that is supposed to
    deal with web-scale data, efficiency should be
    the priority, while the web's large amount of
    redundancy helps to improve accuracy