Title: WebBase: Building a Web Warehouse
1WebBase: Building a Web Warehouse
- Hector Garcia-Molina
- Stanford University
Work with Sergey Brin, Junghoo Cho, Taher
Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar,
Sep Kamvar, Wang Lam, Larry Page, Andreas
Paepcke, Sriram Raghavan, Gary Wesley
2The Web
- A universal information resource
- Model weak, strong agreement
- How to exploit it?
3WebBase
WEB PAGE
4WebBase Goals
- Manage very large collections of Web pages
- Today: 1,500 GB of HTML, 200 M pages
- Enable large-scale Web-related research
- Locally provide a significant portion of the Web
- Efficient wide-area Web data distribution
5WebBase Architecture
6WebBase Remote Users
- Berkeley
- Columbia
- U. Washington
- Harvey Mudd
- Università degli Studi di Milano
- U. of Arizona
- California Digital Library
- Cornell
- U. of Houston
- Learning Lab Lower Saxony (L3S)
- France Telecom
- U. Texas
7Outline
- Technical Challenges
- WebBase Use
- The Future
8Challenges
- Archiving
  - units
  - coordination
- IP Management
  - copy access
  - link access
  - access control
- Hidden Web
- Topic-Specific Collection Building
- Scalability
  - crawling
  - archive distribution
  - index construction
  - storage
- Consistency
  - freshness
  - versions
- Dissemination
9What is a Crawler?
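[Figure: crawler loop. init loads the initial urls into the to-visit urls; the crawler then repeats: get next url, get page from the web, extract urls. It maintains the visited urls and stores the web pages.]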
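The loop named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the WebBase crawler: `crawl`, `LinkExtractor`, `get_page`, and the three-page `FAKE_WEB` are hypothetical names introduced here, and the "web" is an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(initial_urls, get_page, max_pages=100):
    """Run the crawl loop; get_page(url) returns HTML text or None."""
    to_visit = deque(initial_urls)   # "to visit urls", seeded by "init"
    visited = set()                  # "visited urls"
    pages = {}                       # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # "get next url"
        if url in visited:
            continue
        visited.add(url)
        html = get_page(url)         # "get page"
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()     # "extract urls"
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if link not in visited:
                to_visit.append(link)
    return pages


# Hypothetical three-page "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">a</a> <a href="/b">b</a>',
    "http://example.org/a": '<a href="/">home</a>',
    "http://example.org/b": '<a href="/a">a</a>',
}

crawled = crawl(["http://example.org/"], FAKE_WEB.get)
print(sorted(crawled))
```

Passing `get_page` as a parameter keeps the loop independent of how pages are fetched; a real crawler would also need politeness delays and robots.txt handling, which this sketch omits.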
10Parallel Crawling
web
...
11Independent Crawlers
12Partition Firewall
partition
- URL hash
- Site hash
- Hierarchical
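The three partitioning options listed here can be sketched as functions that map a URL to the crawling process responsible for it. This is an illustrative sketch only: the function names, the use of MD5 as a stable hash, and the suffix table passed to `hierarchical_partition` are assumptions made for the example, not details taken from the slides.

```python
import hashlib
from urllib.parse import urlsplit


def _bucket(key, n_procs):
    """Map a string to a process id in [0, n_procs) with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_procs


def url_hash_partition(url, n_procs):
    """URL hash: pages of the same site may land on different processes."""
    return _bucket(url, n_procs)


def site_hash_partition(url, n_procs):
    """Site hash: every page of a host goes to the same process."""
    return _bucket(urlsplit(url).hostname or "", n_procs)


def hierarchical_partition(url, suffix_to_proc, default_proc=0):
    """Hierarchical: assign by domain-name suffix (hypothetical table)."""
    host = urlsplit(url).hostname or ""
    for suffix, proc in suffix_to_proc.items():
        if host == suffix or host.endswith("." + suffix):
            return proc
    return default_proc


if __name__ == "__main__":
    for u in ("http://www.stanford.edu/a", "http://www.stanford.edu/b/c"):
        print(u, url_hash_partition(u, 4), site_hash_partition(u, 4))
```

With a site hash, links that stay within one site never cross a partition boundary, which is one reason a crawler might prefer it over a plain URL hash.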
13Partition Cross-Over
partition
14Partition Cross-Over
partition
15Partition Exchange
partition
16Partition Exchange
partition
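The slides name three ways a crawling process can treat a link that points outside its own partition but do not spell them out. The sketch below assumes the usual reading of those names: firewall drops such links, cross-over follows them only when the process has run out of URLs in its own partition, and exchange forwards them to the owning process. The class `CProc`, its fields, and `demo_owner` are hypothetical names for this illustration.

```python
from collections import deque


class CProc:
    """One crawling process; mode is 'firewall', 'cross-over', or 'exchange'."""

    def __init__(self, proc_id, mode, owner_of):
        if mode not in ("firewall", "cross-over", "exchange"):
            raise ValueError("unknown mode: " + mode)
        self.proc_id = proc_id
        self.mode = mode
        self.owner_of = owner_of       # function: url -> owning process id
        self.own_queue = deque()       # URLs inside this partition
        self.foreign_queue = deque()   # used only in cross-over mode
        self.outbox = []               # (owner id, url), exchange mode only

    def handle_link(self, url):
        """Decide what to do with a URL extracted from a downloaded page."""
        owner = self.owner_of(url)
        if owner == self.proc_id:
            self.own_queue.append(url)
        elif self.mode == "cross-over":
            self.foreign_queue.append(url)
        elif self.mode == "exchange":
            self.outbox.append((owner, url))
        # firewall mode: the inter-partition link is simply dropped

    def next_url(self):
        """Prefer own-partition URLs; cross over only when those run out."""
        if self.own_queue:
            return self.own_queue.popleft()
        if self.mode == "cross-over" and self.foreign_queue:
            return self.foreign_queue.popleft()
        return None


def demo_owner(url):
    """Hypothetical two-way partition used only for the demo."""
    return 0 if "a.example" in url else 1


procs = {m: CProc(0, m, demo_owner) for m in ("firewall", "cross-over", "exchange")}
for proc in procs.values():
    proc.handle_link("http://a.example/1")
    proc.handle_link("http://b.example/1")
```

Under this reading, firewall mode never downloads a page twice but can miss pages, cross-over mode can download pages that another process also fetches, and exchange mode needs communication between processes.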
17Coverage vs Overlap
[Figure: coverage vs. overlap for a cross-over crawler, 5 random seeds per C-proc]
18WebBase Parallel Crawling
[Figure: a coordinator, one computer shown in detail, the other computers (...), and the web]
19WebBase Parallel Crawling
[Figure: 2 cpu utilization (axis 0 to 200) vs. number of processes]
20Challenges
- Archiving
  - units
  - coordination
- IP Management
  - copy access
  - link access
  - access control
- Hidden Web
- Topic-Specific Collection Building
- Scalability
  - crawling
  - archive distribution
  - index construction
  - storage
- Consistency
  - freshness
  - versions
- Dissemination
done
next
21How to Refresh?
- a changes daily
- b changes once a week
- can visit one page per week
[Figure: pages a and b on the web, with their copies a and b in the repository]
- How should we visit pages?
  - a a a a a a a ...
  - b b b b b b b ...
  - a b a b a b a b ... (uniform)
  - a a a a a a b a a a ... (proportional)
- ?
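One way to compare these visiting policies is to compute the long-run fraction of time each repository copy is up to date. The sketch below assumes, beyond what the slide states, that each page changes as a Poisson process (a at 7 changes per week, b at 1) and that a page's visits are evenly spaced at its share of the one-visit-per-week budget. Under those assumptions a copy refreshed every I weeks with change rate λ is fresh for an expected fraction (1 − e^(−λI)) / (λI) of the time. All names in the code are hypothetical.

```python
import math


def expected_freshness(change_rate, visit_interval):
    """Time-averaged probability that a copy is current (Poisson changes)."""
    if math.isinf(visit_interval):
        return 0.0                    # never refreshed: stale in the long run
    x = change_rate * visit_interval
    if x == 0:
        return 1.0                    # page never changes
    return (1.0 - math.exp(-x)) / x


CHANGE_RATE = {"a": 7.0, "b": 1.0}    # changes per week (from the slide)
BUDGET = 1.0                          # page visits per week (from the slide)


def average_freshness(share):
    """Mean freshness over both pages for a given split of the budget."""
    total = 0.0
    for page, rate in CHANGE_RATE.items():
        visits_per_week = BUDGET * share[page]
        interval = math.inf if visits_per_week == 0 else 1.0 / visits_per_week
        total += expected_freshness(rate, interval)
    return total / len(CHANGE_RATE)


TOTAL_RATE = sum(CHANGE_RATE.values())
POLICIES = {
    "only a": {"a": 1.0, "b": 0.0},
    "only b": {"a": 0.0, "b": 1.0},
    "uniform": {"a": 0.5, "b": 0.5},
    "proportional": {p: r / TOTAL_RATE for p, r in CHANGE_RATE.items()},
}

for name, share in POLICIES.items():
    print(f"{name:>12}: average freshness {average_freshness(share):.3f}")
```

The ranking this prints holds only for the assumed change model and the even-spacing approximation; a different model of page change could order the policies differently.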
22Using WebBase
- Fast Page Rank
- Complex Queries
23Structure of the Web
Color the nodes by their domain: red = stanford.edu, green = berkeley.edu, blue = mit.edu
24Structure of the Web
berkeley.edu
stanford.edu
mit.edu
25Nested Block Structure of the Web
[Figure: link matrix with "from" and "to" axes, showing blocks labeled Berkeley and Stanford]
26Personalized Page Rank
a
b
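The slides give only the name, so the following is a generic sketch of personalized PageRank by power iteration, not the specific fast or personalized algorithms developed on WebBase. The function name, the damping factor of 0.85, the iteration count, and the three-page toy graph are assumptions. The only difference from ordinary PageRank is that the random jump goes to pages in proportion to a user-supplied preference vector instead of uniformly.

```python
def personalized_pagerank(links, preference, damping=0.85, iterations=50):
    """Power iteration; every link target must also be a key of `links`."""
    nodes = sorted(links)
    total_pref = sum(preference.get(n, 0.0) for n in nodes)
    if total_pref <= 0:
        raise ValueError("preference must put weight on at least one page")
    # Teleport distribution: where the random surfer jumps to.
    teleport = {n: preference.get(n, 0.0) / total_pref for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for src in nodes:
            targets = links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Dangling page: hand its rank to the teleport distribution.
                for n in nodes:
                    nxt[n] += damping * rank[src] * teleport[n]
        rank = nxt
    return rank


# Hypothetical toy graph: page -> list of pages it links to.
TOY_LINKS = {"x": ["y"], "y": ["x", "z"], "z": ["x"]}

rank_for_x = personalized_pagerank(TOY_LINKS, {"x": 1.0})
rank_for_z = personalized_pagerank(TOY_LINKS, {"z": 1.0})
print(rank_for_x)
print(rank_for_z)
```

Changing the preference vector shifts rank toward the preferred pages and the pages they reach, which is the sense in which the ranking is personalized.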
27Complex Queries
- Text search, e.g., search for "SARS Symptoms"
- Complex queries: declarative analysis interface
[Figure: both run against the Stanford WebBase Repository]
28Example of a Complex Query
Goal: find universities collaborating with Stanford on mobile networking
- Compute S: stanford.edu pages containing the phrase "Mobile networking"
- Compute R: set of all .edu domains pointed to by pages in S
[Figure: the entire Web, with stanford.edu inside it, the mobile networking pages (S) inside stanford.edu, and links from S to R]
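The two steps of this query can be sketched over a toy in-memory repository. This is not the WebBase query interface: `REPOSITORY`, `domain_of`, `pages_with_phrase`, and `edu_domains_linked_from` are hypothetical, the page texts and links are made up for the example, and treating the last two host labels as "the domain" is a simplification. Dropping stanford.edu itself from R is also an assumption, made because the goal is to find other universities.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the repository: url -> (text, outgoing links).
REPOSITORY = {
    "http://cs.stanford.edu/mobile": (
        "Our mobile networking group and its partners",
        ["http://www.berkeley.edu/wireless", "http://www.example.com/"],
    ),
    "http://www.stanford.edu/news": (
        "Campus news",
        ["http://www.mit.edu/"],
    ),
    "http://www.berkeley.edu/wireless": (
        "Mobile networking at Berkeley",
        ["http://cs.stanford.edu/mobile"],
    ),
    "http://www.example.com/": ("Mobile networking products", []),
}


def domain_of(url):
    """Last two labels of the host name (a simplification)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def pages_with_phrase(repo, domain, phrase):
    """Step 1: compute S, the pages in `domain` whose text has `phrase`."""
    return {
        url
        for url, (text, _links) in repo.items()
        if domain_of(url) == domain and phrase.lower() in text.lower()
    }


def edu_domains_linked_from(repo, pages, exclude=None):
    """Step 2: compute R, the .edu domains pointed to by `pages`."""
    result = set()
    for url in pages:
        for target in repo[url][1]:
            dom = domain_of(target)
            if dom.endswith(".edu") and dom != exclude:
                result.add(dom)
    return result


S = pages_with_phrase(REPOSITORY, "stanford.edu", "mobile networking")
R = edu_domains_linked_from(REPOSITORY, S, exclude="stanford.edu")
print(S)
print(R)
```

The query mixes a text predicate (step 1) with link navigation (step 2), which is why it needs more than a text search index.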
29Supernodes
[Figure: a Web graph over pages P1 to P5 and its partition into supernodes N1, N2, N3]
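A supernode graph can be derived from a page-level graph once every page is assigned to a supernode: an edge runs from one supernode to another whenever some page in the first links to some page in the second. The sketch below shows only that top-level step. The page links and the assignment of P1 to P5 to N1, N2, N3 are hypothetical (the slide does not give them), and `build_supernode_graph` is a name introduced for this example.

```python
def build_supernode_graph(page_links, supernode_of):
    """Collapse a page graph into edges between supernodes."""
    superedges = {}
    for src, targets in page_links.items():
        for dst in targets:
            s, d = supernode_of[src], supernode_of[dst]
            if s != d:  # links inside one supernode stay at the page level
                superedges.setdefault(s, set()).add(d)
    return superedges


# Hypothetical page graph and partition, for illustration only.
PAGE_LINKS = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P3": ["P5"],
    "P4": [],
    "P5": ["P1"],
}
SUPERNODE_OF = {"P1": "N1", "P2": "N1", "P3": "N2", "P4": "N3", "P5": "N2"}

supergraph = build_supernode_graph(PAGE_LINKS, SUPERNODE_OF)
print(supergraph)
```

Because many page-level links collapse into one supernode edge, the top-level graph is far smaller than the page graph, which is what makes it practical to keep in memory.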
30Growth of Supernode Graph
[Figure: size of supernode graph (MB) vs. number of pages (millions). Labeled point: 82 MB at 115M pages (830 GB of raw HTML).]
31Query Execution Times
[Figure: time for navigation operation (secs), for Query 1 through Query 6, comparing four representations: S-Node representation, Relational DB, Connectivity Server, and Files of adjacency lists]
32Query Optimization
33Impact of cluster-based optimization
- 35-million-page dataset: 600 million links, 300 GB of HTML
- 40-45% reduction in query execution times
34Conclusion (So Far)
- Web is universal information resource
- WebBase exploits this resource
- WebBase Challenges
  - scalability, consistency, complex queries...
35Will WebBase Scale?
[Figure: web content (indexable) plotted over time against WebBase capacity in an optimistic and a pessimistic projection, with "today" marked]
36Pessimistic Scenario
- Specialized WebBases
  - sports
  - shopping
  - ...
[Figure: web content (indexable) and WebBase capacity (pessimistic) plotted over time, with "today" marked]
37Optimistic Scenario
- Web in a Box
  - web delivered on CD monthly
  - search engine handles updates
[Figure: WebBase capacity (optimistic) and web content (indexable) plotted over time, with "today" marked]
38Legal Issues?
- Is WebBase legal?
- copies
- links, deep linking
- International regulations
39Biasing Results
- How long will Google, AltaVista, etc. resist temptations?
- Biasing Crawler
- Link and Content Spam
40Access Data
- WebBase does not capture access patterns
WebBase
?
41Semantic Web?
semantic tags
WebBase
?
- Will tags be generated?
- By whom?
- Agreement?
42Future Technical Challenges
- Incremental Updates
- Query Optimization
- Crawling Deep Web
43Final Conclusion
- Many challenges ahead...
- Additional information: Google "Stanford WebBase"