Title: The Web Laboratory
1. The Web Laboratory
Goals, progress report, and research challenges
http://www.cs.cornell.edu/wya/weblab/
A project of Cornell University and the Internet
Archive
2. The Internet Archive
3. The Internet Archive Web Collection
The Data
- Complete crawls of the Web, every two months since 1996, with some gaps
- Range of formats and depth of crawl have increased with time
- No sites that are protected by robots.txt or where owners requested not to be archived
- Some missing or lost data
4. The Internet Archive Web Collection
Sizes
- Current crawls are about 40-60 TByte (compressed)
- Total archive is about 600 TByte (compressed)
- Compression ratio is up to 25:1; best guess of the overall average is 10:1
- Rate of increase is about 1 TByte/day (compressed)
- Note that the total storage requirement is reduced because much data does not change between crawls
5. Motivation: Social Science Research
The Web as a social phenomenon
- Political campaigns
- Online retailing
The Web as evidence
- The spread of urban legends ("Einstein failed mathematics")
- Development of legal concepts across time
6. The Petabyte Data Store
- A project of the Cornell CS database group and the Theory Center, to support research projects that manage large data sets
- Physical Gantry
  - Measure light-scattering properties of objects
  - Create accurate physical models for graphical rendering
  - Each dataset is 14 TB
- Arecibo Telescope
  - Perform surveys of parts of the sky
  - Analyze the data to find high red-shift pulsars
  - 1 TB/day
- The Web Laboratory
7. Year One System
- 2 × 16-processor Unisys ES7000 servers
- 64 GByte RAM
- 8 GByte/sec aggregate I/O bandwidth
- 2 × 50 TByte RAID online storage
- ADIC Scalar 10K robotic tape library for archive
8. Unisys Server ES7000/430
9. RAID Storage System
10. Web Laboratory
The petabyte data store will allow us to mount several very large portions of the Web online for all types of Web research.
- Copy snapshots of the Web from the Internet Archive
- Transport the data to Cornell on a regular basis
- Store it at Cornell and load parts on demand
- Extract feature sets
- Present APIs to researchers (program API, Web Services API)
11. Research Using Web Data
The Web Graph
- Structure and evolution of the Web graph
- Hubs and authorities, PageRank, etc.
- Social networks
- Many of the basic studies have only been done once
- Few if any large-scale studies across time
Typical research needs
- Graphs of 1 billion pages should be possible in memory (64 GBytes); see the estimate below
- Algorithms are needed for processing larger graphs
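A rough feasibility check of the in-memory claim, assuming a compressed sparse row (CSR) layout with 32-bit node IDs and an average of about ten links per page; both figures are assumptions for illustration, not Web Laboratory measurements.

```python
# Back-of-the-envelope check that a 1-billion-page graph fits in 64 GBytes,
# assuming a CSR adjacency structure with 32-bit node IDs and an average
# out-degree of about 10 links per page (assumptions, not measured figures).
nodes = 1_000_000_000
avg_out_degree = 10
links = nodes * avg_out_degree

col_index_bytes = links * 4          # one 32-bit node ID per link
row_offset_bytes = (nodes + 1) * 8   # one 64-bit offset per page
total_gb = (col_index_bytes + row_offset_bytes) / 2**30

print(f"approximate CSR size: {total_gb:.0f} GB")   # about 45 GB, under 64 GBytes
```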
12. In-Memory Web Graph
- The graph is represented by its adjacency matrix, stored in a compressed sparse row (CSR) representation
- The Cuthill-McKee algorithm is used to reorder the nodes to create dense blocks within the matrix
Work by Karthik Jeyabalan and Jerrin Kallukalam
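A minimal sketch of this representation using SciPy: the adjacency matrix is built in CSR form and the nodes are permuted with SciPy's reverse Cuthill-McKee routine, a close stand-in for the Cuthill-McKee ordering described above. The tiny edge list is illustrative, not Web Laboratory data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Hypothetical edge list: (source page id, target page id)
edges = np.array([[0, 3], [3, 1], [1, 4], [4, 0], [2, 4], [3, 2]])

n = int(edges.max()) + 1
data = np.ones(len(edges), dtype=np.int8)
adj = csr_matrix((data, (edges[:, 0], edges[:, 1])), shape=(n, n))

# Permutation that reduces matrix bandwidth; apply it to rows and columns
# so that nonzeros cluster into dense blocks.
perm = reverse_cuthill_mckee(adj, symmetric_mode=False)
reordered = adj[perm][:, perm]

print("original:\n", adj.toarray())
print("reordered:\n", reordered.toarray())
```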
13. Research Using Web Data
Pseudo-crawling: experiments on crawled data
- Focused or selective Web crawling
- Burst analysis
- Digital library selection
Crawling the Web is complex and unsatisfactory
- Time consuming and unreliable
- Experiments cannot be repeated because data changes
- Cannot study changes across time
A pseudo-crawl applies the same algorithms but retrieves the pages from the Web Laboratory; filters allow experiments on subsets of the Web.
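A minimal sketch of a pseudo-crawl loop: the usual crawl logic is unchanged, but pages come from the archived collection rather than the live Web, and a filter restricts the run to a subset. The store interface (fetch_page, extract_links) is a hypothetical stand-in for the Web Laboratory APIs.

```python
from collections import deque

def pseudo_crawl(store, seeds, keep, max_pages=10_000):
    """BFS over archived pages; keep(url) filters the subset of interest."""
    frontier = deque(url for url in seeds if keep(url))
    seen = set(frontier)
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        page = store.fetch_page(url)        # from the archive, not the live Web
        if page is None:                    # page missing from this crawl
            continue
        crawled.append(url)
        for link in store.extract_links(page):
            if link not in seen and keep(link):
                seen.add(link)
                frontier.append(link)
    return crawled
```

Because the archived data does not change, the same pseudo-crawl can be repeated exactly, and re-run against different crawl dates to study change over time.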
14. Storing the Web Data
- SQL Server database of structural metadata and links
- Content store: page content stored compressed (zip), keyed by its MD5 content hash
- Web graphs
Work by Pavel Dmitriev and Richard Wang
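A minimal sketch of the content-store idea, assuming content is compressed and keyed by its MD5 hash so identical content is stored only once; the class name and directory layout are hypothetical, and the structural metadata and links would live in the SQL Server database.

```python
import gzip
import hashlib
from pathlib import Path

class ContentStore:
    """Hypothetical content store: compressed content keyed by MD5 hash."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, content: bytes) -> str:
        """Store gzip-compressed content under its MD5 hash; return the key."""
        key = hashlib.md5(content).hexdigest()
        path = self.root / key[:2] / key        # fan out by hash prefix
        if not path.exists():                   # identical content stored once
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(gzip.compress(content))
        return key

    def get(self, key: str) -> bytes:
        return gzip.decompress((self.root / key[:2] / key).read_bytes())
```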
16. Benchmarking the Synthetic Web
A Synthetic Web is a generated graph with graph properties and distributions of domain names similar to a Web crawl.
- The R-MAT algorithm is used to generate (URL1, URL2) link pairs
- Satisfies Web graph power laws
- Used for benchmarking and experimentation
Work by Pavel Dmitriev and Shantanu Shah
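A minimal sketch of the R-MAT idea: each link is placed by recursively choosing one of the four quadrants of the adjacency matrix with fixed probabilities (a, b, c, d), which reproduces the skewed, power-law-like degree distributions of real Web crawls. The parameter values and sizes below are illustrative defaults, not those used for the Synthetic Web, and the mapping from node IDs to synthetic URLs and domain names is omitted.

```python
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19, d=0.05):
    """Return one (src, dst) node pair for a graph with 2**scale nodes."""
    src, dst = 0, 0
    for level in range(scale):
        r = random.random()
        bit = 1 << level
        if r < a:                 # quadrant with neither bit set
            pass
        elif r < a + b:           # destination bit set
            dst |= bit
        elif r < a + b + c:       # source bit set
            src |= bit
        else:                     # both bits set
            src |= bit
            dst |= bit
    return src, dst

def rmat_graph(scale, n_links):
    return [rmat_edge(scale) for _ in range(n_links)]

links = rmat_graph(scale=20, n_links=100_000)   # ~1M nodes, 100K links
```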
17. Social Science Research
The Web as a social phenomenon
- Political campaigns
- Online retailing
The Web as evidence
- Urban legends
- Development of legal concepts across time
Requires
- Access to Web data by content (Quark?)
- Automated tools to replace hand-coding (NLP, machine learning)
- Straightforward interfaces for non-computing specialists (HCI)
18. Work Flow System
- Transfer 1 TByte per day
  - Internet2 -- 1 gigabit, off peak
  - Store 10 GByte batches of compressed files
- Process raw data
  - Uncompress and unpack ARC and DAT files
  - Create IDs for pages and content hashes
  - Extract links from HTML pages
- Database
  - Load batches of pages, metadata, and links
  - Compress and store content files
Work by Mayank Gandhi and Jimmy Sun
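A minimal sketch of the per-batch processing steps listed above, written against hypothetical stand-ins: records is an iterable of (url, content) pairs from unpacked ARC/DAT files, content_store is any object with a put() method such as the content-store sketch above, and db_load stands in for a SQL Server bulk load.

```python
import hashlib
import re

LINK_RE = re.compile(rb'href="(.*?)"', re.IGNORECASE)

def ingest_batch(records, content_store, db_load):
    """Process one batch: IDs, content hashes, link extraction, bulk load."""
    page_rows, link_rows = [], []
    for url, content in records:
        content_hash = hashlib.md5(content).hexdigest()   # hash for de-duplication
        page_id = hashlib.md5(url.encode()).hexdigest()   # stable page ID from the URL
        page_rows.append((page_id, url, content_hash))
        link_rows.extend(
            (page_id, href.decode(errors="replace"))
            for href in LINK_RE.findall(content)          # crude HTML link extraction
        )
        content_store.put(content)                        # compressed, stored once per hash
    db_load("Pages", page_rows)                           # hypothetical table names
    db_load("Links", link_rows)
```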
19. Current Status
Data Capture
- Delays in connecting the Internet Archive to Internet2
- Testing using a 250 GByte test data set
Ingest and Workflow
- Under test -- performance challenges
Database and Content Store
- Under test for scalability
Synthetic Web
- 500 million links generated -- data structure under revision
Web Graph
- Completion scheduled for end of semester
20. The Cornell Team
Researchers: William Arms, Dan Huttenlocher, Jon Kleinberg, Carl Lagoze
Ph.D. Students: Pavel Dmitriev, Selcuk Aya
M.Eng. Students: Mayank Gandhi, Karthik Jeyabalan, Jerrin Kallukalam, Shantanu Shah, Jimmy Yanbo Sun, Richard Wang
Petabyte Data Store: Al Demers, Johannes Gehrke, Dave Lifka, Jai Shanmugasundaram, John Zollweg
21. Thanks
This work would not be possible without the
forethought and longstanding commitment of the
Internet Archive to capture and preserve the
content of the Web for future generations. The
petabyte data store is funded in part by National
Science Foundation grant 0403340, with equipment
support from Unisys. The Cornell Theory Center's
support for this project is funded in part by
Microsoft, Dell and Intel.