WebBase: Building a Web Warehouse

Slides: 44

Provided by: Hec17

Learn more at: http://dbpubs.stanford.edu


1
WebBase: Building a Web Warehouse

Hector Garcia-Molina
Stanford University

Work with Sergey Brin, Junghoo Cho, Taher
Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar,
Sep Kamvar, Wang Lam, Larry Page, Andreas
Paepcke, Sriram Raghavan, Gary Wesley
2
The Web

A universal information resource
Model: weak, strong agreement
How to exploit it?

3
WebBase
WEB PAGE
4
WebBase Goals

Manage very large collections of Web pages
Today: 1500 GB HTML, 200 M pages
Enable large-scale Web-related research
Locally provide a significant portion of the Web
Efficient wide-area Web data distribution

5
WebBase Architecture
6
WebBase Remote Users

Berkeley
Columbia
U. Washington
Harvey Mudd
Università degli Studi di Milano
U. of Arizona

California Digital Library
Cornell
U. of Houston
Learning Lab Lower Saxony (L3S)
France Telecom
U. Texas

7
Outline

Technical Challenges
WebBase Use
The Future

8
Challenges

Archiving
units
coordination
IP Management
copy access
link access
access control
Hidden Web
Topic-Specific Collection Building

Scalability
crawling
archive distribution
index construction
storage
Consistency
freshness
versions
Dissemination

9
What is a Crawler?
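[Diagram: crawler loop. init loads the initial urls into "to visit urls"; get next url; get page from the web; extract urls, which feed back into "to visit urls". Fetched urls are recorded in "visited urls" and fetched content in "web pages".]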
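
The loop named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not WebBase's crawler: the function name crawl, the get_page callback, the regular expression used to extract links, and the in-memory TOY_WEB dictionary standing in for the web are all assumptions made for the example.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Simplistic link extractor (assumption: links appear as href="...").
HREF = re.compile(r'href="([^"]+)"')

def crawl(initial_urls, get_page, max_pages=100):
    """Breadth-first crawl. get_page(url) returns HTML text or None."""
    to_visit = deque(initial_urls)       # "to visit urls", seeded by init
    visited = set()                      # "visited urls"
    pages = {}                           # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()         # "get next url"
        if url in visited:
            continue
        visited.add(url)
        html = get_page(url)             # "get page"
        if html is None:                 # unreachable page: nothing to store
            continue
        pages[url] = html
        for link in HREF.findall(html):  # "extract urls"
            absolute = urljoin(url, link)
            if absolute not in visited:
                to_visit.append(absolute)
    return pages

# Hypothetical three-page web used instead of real HTTP requests.
TOY_WEB = {
    "http://a.example/": '<a href="/x">x</a> <a href="http://b.example/">b</a>',
    "http://a.example/x": '<a href="/">home</a>',
    "http://b.example/": '<a href="http://missing.example/">gone</a>',
}

pages = crawl(["http://a.example/"], TOY_WEB.get)
print(sorted(pages))
```

Passing the fetch function in as a parameter keeps the sketch runnable offline; a real crawler would also need politeness delays, robots.txt handling, and error handling, none of which are shown here.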
10
Parallel Crawling
web
...
11
Independent Crawlers
12
Partition Firewall
partition

URL hash
Site hash
Hierarchical
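
The three partitioning options listed above decide which crawler process is responsible for a URL. The sketch below shows one plausible reading of each. The function names, the use of MD5 as a stable hash, and the interpretation of "hierarchical" as a split by top-level domain are assumptions for illustration, not details given on the slide.

```python
import hashlib
from urllib.parse import urlsplit

def _bucket(key, n):
    """Map a string to one of n crawlers with a hash that is stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def url_hash_partition(url, n):
    # URL hash: pages of the same site may land on different crawlers.
    return _bucket(url, n)

def site_hash_partition(url, n):
    # Site hash: every page of a host goes to the same crawler.
    return _bucket(urlsplit(url).hostname or "", n)

def hierarchical_partition(url, tld_to_crawler, default=0):
    # Hierarchical (assumed reading): assign by top-level domain.
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return tld_to_crawler.get(tld, default)

print(site_hash_partition("http://stanford.edu/a", 4),
      site_hash_partition("http://stanford.edu/b", 4))
print(hierarchical_partition("http://www.stanford.edu/", {"edu": 1, "com": 2}))
```

With site hashing, links that stay inside a site never leave their partition, which is one reason it tends to produce fewer inter-partition links than URL hashing.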

13
Partition Cross-Over
partition
14
Partition Cross-Over
partition
15
Partition Exchange
partition
16
Partition Exchange
partition
17
Coverage vs Overlap
[Plot: coverage vs. overlap for the cross-over crawler, 5 random seeds per C-proc]
18
WebBase Parallel Crawling
computer
coordinator
web
...
other computers
19
WebBase Parallel Crawling
[Plot: 2 cpu utilization (axis 0 to 200) vs. number of processes]
20
Challenges

Archiving
units
coordination
IP Management
copy access
link access
access control
Hidden Web
Topic-Specific Collection Building

Scalability
crawling
archive distribution
index construction
storage
Consistency
freshness
versions
Dissemination

done
next
21
How to Refresh?
a changes daily
b changes once a week
can visit one page per week
[Diagram: pages a and b on the web, with copies of a and b in the repository]

How should we visit pages?
a a a a a a a ...
b b b b b b b ...
a b a b a b a b ... uniform
a a a a a a b a a a ... proportional
?
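
One way to compare the candidate schedules is to estimate the long-run fraction of time each local copy is up to date. The sketch below assumes that pages change as a Poisson process and that each page is revisited at a fixed interval; under those assumptions, a page with change rate λ revisited every I time units is fresh for an expected fraction (1 - e^(-λI)) / (λI) of the time. The modeling assumptions, function names, and policy table are illustrative and are not taken from the slide.

```python
import math

def expected_freshness(change_rate, visits_per_week):
    """Long-run fraction of time a copy is fresh (Poisson changes, periodic visits)."""
    if visits_per_week == 0:
        return 0.0                       # never refreshed: stale in the long run
    x = change_rate / visits_per_week    # expected number of changes between visits
    return (1.0 - math.exp(-x)) / x

# Rates from the slide: a changes daily, b changes once a week.
CHANGE_RATE = {"a": 7.0, "b": 1.0}       # changes per week

def average_freshness(allocation):
    """Average freshness over both pages for a visits-per-week allocation."""
    total = sum(expected_freshness(CHANGE_RATE[p], allocation[p]) for p in CHANGE_RATE)
    return total / len(CHANGE_RATE)

# Each policy spends the same budget: one page visit per week in total.
policies = {
    "only a":       {"a": 1.0,   "b": 0.0},
    "only b":       {"a": 0.0,   "b": 1.0},
    "uniform":      {"a": 0.5,   "b": 0.5},
    "proportional": {"a": 7 / 8, "b": 1 / 8},
}

results = {name: average_freshness(alloc) for name, alloc in policies.items()}
for name, value in results.items():
    print(f"{name:>12}: {value:.3f}")
```

Under this particular model, the uniform schedule scores higher than the proportional one, because visits spent on the fast-changing page a buy very little freshness. A different change model or a different metric (for example, age instead of freshness) could rank the policies differently.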

22
Using WebBase

Fast Page Rank
Complex Queries

23
Structure of the Web
Color the nodes by their domain: red = stanford.edu, green = berkeley.edu, blue = mit.edu
24
Structure of the Web
berkeley.edu
stanford.edu
mit.edu
25
Nested Block Structure of the Web
[Figure: link matrix with "from" and "to" axes; Berkeley and Stanford appear as blocks]
26
Personalized Page Rank
a
b
27
Complex Queries
Text search, e.g., search for "SARS symptoms"
Stanford WebBase Repository
Complex queries: declarative analysis interface
28
Example of a Complex Query
Find universities collaborating with Stanford on mobile networking
Compute S: stanford.edu pages containing the phrase "Mobile networking"
Compute R: set of all .edu domains pointed to by pages in S
[Diagram: the entire Web, containing stanford.edu; within it, the mobile networking pages (S); links from S lead to R]
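
The two steps of this query can be sketched over a small in-memory page table. The PAGES dictionary, its URLs and link lists, and the helper names domain, compute_S, and compute_R are hypothetical and exist only to make the example runnable; WebBase's actual query interface is not shown on the slide.

```python
from urllib.parse import urlsplit

# Hypothetical repository: url -> (page text, list of outgoing links).
PAGES = {
    "http://cs.stanford.edu/mobile": (
        "Research on mobile networking protocols",
        ["http://www.berkeley.edu/wireless", "http://example.com/"],
    ),
    "http://ee.stanford.edu/radio": (
        "Mobile Networking lab",
        ["http://www.mit.edu/nets"],
    ),
    "http://www.stanford.edu/news": (
        "Campus news",
        ["http://www.cornell.edu/"],
    ),
    "http://www.berkeley.edu/wireless": (
        "mobile networking at Berkeley",
        ["http://www.columbia.edu/"],
    ),
}

def domain(url):
    """Last two labels of the host name, e.g. 'stanford.edu'."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def compute_S(pages, site, phrase):
    """S: pages within `site` whose text contains `phrase` (case-insensitive)."""
    return {url for url, (text, _) in pages.items()
            if domain(url) == site and phrase.lower() in text.lower()}

def compute_R(pages, S):
    """R: all .edu domains pointed to by pages in S."""
    return {domain(target)
            for url in S
            for target in pages[url][1]
            if domain(target).endswith(".edu")}

S = compute_S(PAGES, "stanford.edu", "mobile networking")
R = compute_R(PAGES, S)
print(sorted(S))
print(sorted(R))
```

In this toy data, the Stanford news page and the Berkeley page are excluded from S, so their outgoing links do not contribute to R.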
29
Supernodes
[Diagram: Web graph with nodes P1 to P5, grouped into supernodes N1, N2, N3]
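
As a rough illustration of the idea, a supernode graph can be derived from a page-level graph and a partition of the pages: two supernodes are connected whenever some page in one links to some page in the other. The assignment of P1 to P5 to N1, N2, N3 and the page-level edges below are hypothetical, since the slide's diagram is not recoverable, and the function name supernode_graph is an assumption.

```python
def supernode_graph(page_edges, partition):
    """Collapse page-level edges into edges between supernodes.

    page_edges: iterable of (source_page, target_page) pairs.
    partition:  dict mapping each page to its supernode.
    Edges inside a single supernode are dropped in this sketch.
    """
    super_edges = set()
    for src, dst in page_edges:
        a, b = partition[src], partition[dst]
        if a != b:
            super_edges.add((a, b))
    return super_edges

# Hypothetical partition and links.
partition = {"P1": "N1", "P2": "N1", "P3": "N2", "P4": "N3", "P5": "N3"}
page_edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P5", "P1")]

edges = supernode_graph(page_edges, partition)
print(sorted(edges))
```

Because many page-level links collapse into a single supernode edge, the resulting graph is much smaller than the page graph it summarizes.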
30
Growth of Supernode Graph
[Plot: size of supernode graph (MB) vs. number of pages (millions). Annotated point: 82 MB at 115 M pages (830 GB of raw HTML)]
31
Query Execution Times
[Bar chart: time for navigation operation (secs), axis 0 to 600, for Query 1 through Query 6. Series: S-Node representation, Relational DB, Connectivity Server, Files of adjacency lists]
32
Query Optimization
33
Impact of cluster-based optimization
35-million page dataset, 600 million links, 300 GB of HTML
40-45% reduction in query execution times
34
Conclusion (So Far)

Web is universal information resource
WebBase exploits this resource
WebBase Challenges
scalability, consistency, complex queries...

35
Will WebBase Scale?
[Plot over time, with "today" marked. Curves: web content (indexable), WebBase capacity (optimistic), WebBase capacity (pessimistic)]
36
Pessimistic Scenario
web content (indexable)