Title: Web Characterization
1Web Characterization
- Week 9
- LBSC 690
- Information Technology
2Agenda
- Midterm results, questions, answers
- Project proposal
- Web characteristics
3The Why of the Web (in 1995)
- Affordable storage
- 300,000 words/
- Adequate backbone capacity
- 25,000 simultaneous transfers
- Adequate last mile bandwidth
- 1 second/screen
- Display capability
- 10% of US population
- Effective search capabilities
- Lycos, Yahoo
4Defining the Web
- HTTP, HTML, or URL?
- Static, dynamic or streaming?
- Public, protected, or internal?
5Total Sites Across All Domains, August 1995 - December 2005
6Discussion Topic: What's a Web Site?
- OCLC counted any server at port 80
- Misses many servers at other ports
- Some servers host unrelated content
- Geocities
- Some content requires specialized servers
- rtsp
7Crawling the Web
8Web Crawl Challenges
- Discovering islands and peninsulas
- Duplicate and near-duplicate content
- 30-40% of total content
- Server and network loads
- Dynamic content generation
- Link rot
- Changes at 1% per week
- Temporary server interruptions
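The first challenge above can be illustrated with a minimal sketch of a breadth-first crawl over a toy link graph. The graph, the page names, and the `crawl` function are all hypothetical. The sketch also assumes that "island" means a page with no links to or from the rest of the graph, and "peninsula" means a page that links out but has no inbound links; the slide does not define either term.

```python
from collections import deque

def crawl(link_graph, seeds):
    """Breadth-first traversal of a link graph; returns pages in discovery order."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        # Follow every outbound link that has not been queued before.
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Hypothetical toy Web: each page maps to the pages it links to.
toy_web = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["archive"],
    "archive": [],
    "peninsula": ["home"],  # links in to the graph, but nothing links to it
    "island": [],           # no links in or out
}

reached = crawl(toy_web, ["home"])
missed = sorted(set(toy_web) - set(reached))
print("reached:", reached)
print("missed:", missed)
```

A crawler that only follows links never finds the two unreachable pages, however long it runs; they have to come from another source, such as a submitted URL list.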
9Link Structure of the Web
10Duplicate Detection
- Structural
- Identical directory structure (e.g., mirrors, aliases)
- Syntactic
- Identical bytes
- Identical markup (HTML, XML, ...)
- Semantic
- Identical content
- Similar content (e.g., with a different banner ad)
- Related content (e.g., translated)
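Two of the levels above can be sketched in a few lines. The example below checks for identical bytes with a hash, then estimates content similarity with word shingles and Jaccard overlap. Shingling is one common technique chosen here for illustration, not one the slides name; the function names, the shingle size of 3, and the two sample pages are all assumptions.

```python
import hashlib
import re

def byte_fingerprint(data: bytes) -> str:
    """Identical bytes give identical SHA-256 digests."""
    return hashlib.sha256(data).hexdigest()

def strip_markup(html: str) -> str:
    """Crude tag removal and whitespace/case normalization (illustration only)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.lower().split())

def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences in the text."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical pages: same article text, different banner ad.
page_a = ("<html><body><p>Ad: buy shoes</p>"
          "<p>The crawler visits each page once and stores a copy.</p></body></html>")
page_b = ("<html><body><p>Ad: buy hats</p>"
          "<p>The crawler visits each page once and stores a copy.</p></body></html>")

same_bytes = byte_fingerprint(page_a.encode()) == byte_fingerprint(page_b.encode())
similarity = jaccard(shingles(strip_markup(page_a)), shingles(strip_markup(page_b)))
print("identical bytes:", same_bytes)
print("shingle similarity:", round(similarity, 2))
```

The byte comparison fails because of the differing ad, while the shingle overlap stays well above zero. Translated content would score near zero on both tests, which is why the "related content" case is harder.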
11Robots Exclusion Protocol
- Requires voluntary compliance by crawlers
- Exclusion by site
- Create a robots.txt file at the server's top level
- Indicate which directories not to crawl
- Exclusion by document (in HTML head)
- Not implemented by all crawlers
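As a rough illustration of both exclusion mechanisms, the sketch below uses Python's standard `urllib.robotparser` to apply a robots.txt file, and `html.parser` to read a robots `meta` tag from a document head. The robots.txt contents, the crawler name `ExampleBot`, the example.com URLs, and the `RobotsMetaFinder` class are hypothetical. Because compliance is voluntary, a crawler only honors these rules if it performs checks like these itself.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Exclusion by site: a hypothetical robots.txt listing directories not to crawl.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed_public = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "http://www.example.com/private/notes.html")

# Exclusion by document: a robots meta tag in the HTML head.
class RobotsMetaFinder(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]

finder = RobotsMetaFinder()
finder.feed('<html><head><meta name="robots" content="noindex, nofollow"></head></html>')

print("public page allowed:", allowed_public)
print("private page allowed:", allowed_private)
print("document directives:", finder.directives)
```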
12Hands on: The Internet Archive
- alexa.com Web crawls since 1997
- http://archive.org
- Check out Maryland's Web site in 1997
- Check out the history of your favorite site
13Discussion Point
- Can we save everything?
- Should we?
- Do people have a right to remove things?
14The Deep Web
- Dynamic pages, generated from databases
- Not easily discovered using crawling
- Perhaps 400-500 times larger than surface Web
- Fastest growing source of new information
16Content of the Deep Web
17Deep Web
- 60 Deep Sites Exceed Surface Web by 40 Times
18Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm
19Global Internet Users
Native speakers, Global Reach projection for 2004 (as of Sept. 2003)
20Global Internet Users
Web Pages
Native speakers, Global Reach projection for 2004 (as of Sept. 2003)
21Leading exporters and importers in world merchandise trade, 2004 (billion dollars and percentage)
Source: World Trade Organization
22Blogs
- 18.9 million weblogs tracked
- Doubling in size approx. every 5 months
- Consistent doubling over the last 36 months
23Blue: Mainstream Media
Red: Blog
Challenge: Fight, or Embrace?
24Daily Posting Volume
- 1.2 million legitimate posts/day
- Spam posts marked in red
- On average, an additional 5.8% are spam posts
- Some spam spikes as high as 18%
- Chart annotations: Katrina, London Bombings, Justice O'Connor, Live 8 Concerts, Deepthroat Revealed, Kryptonite Lock Controversy, Newsweek Koran, Schiavo Dies, US Election Day, Superbowl, Indian Ocean Tsunami
26A Web of Speech?
27Rethinking the Spoken Word
- Speech is better for some things than writing
- Spoken bits are as persistent as written bits
- Storage cost is 80 times that of text
- Disk cost falls by a factor of 80 in 16 years
- If speech is searchable, we will keep lots of it
28A Little Math
- Collectable spoken words: 10 Tw/day
- 1 billion users × 100 words/min × 200 min/day / 2
- Compressed speech: 2 words/kiloByte
- (100/60 w/sec) / (6.5 kb/sec / 8 b/B)
- Required storage: 5 PetaBytes/day
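The arithmetic on this slide can be checked with a short script. It is only a sketch: the variable names are illustrative, the missing operators are read as multiplication for the user estimate and division for the compression rate (the readings that reproduce the slide's totals), and 1 kiloByte is taken as 1,000 bytes. The division by 2 is used as given on the slide.

```python
# Collectable spoken words per day.
users = 1_000_000_000
words_per_min = 100
minutes_per_day = 200
words_per_day = users * words_per_min * minutes_per_day / 2

# Compressed speech: words spoken per second over kilobytes produced per second.
words_per_sec = words_per_min / 60
kbytes_per_sec = 6.5 / 8          # 6.5 kilobits/sec at 8 bits per byte
words_per_kbyte = words_per_sec / kbytes_per_sec

# Required storage per day, in petabytes (1 PB = 1e15 bytes).
bytes_per_day = words_per_day / words_per_kbyte * 1000
petabytes_per_day = bytes_per_day / 1e15

print(f"words/day:   {words_per_day:.1e}")
print(f"words/kByte: {words_per_kbyte:.2f}")
print(f"PB/day:      {petabytes_per_day:.2f}")
```

The unrounded result comes out slightly under 5 PB/day; the slide's figure follows from rounding the compression rate to 2 words per kiloByte.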
29A Little Math
- Storage array sales: 5 PB/day
- 457 PB in 2Q 2005 (increasing 59% per year)
- 22/person/year (decreasing at 31%/year)
Source: IDC Worldwide Disk Storage Systems Tracker, 2Q 2005
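The 5 PB/day sales figure follows from the quarterly total. The check below is a sketch that assumes 2Q 2005 runs April through June (91 days) and reads "59 per year" as 59 percent annual growth; the variable names are illustrative.

```python
petabytes_sold_in_quarter = 457
days_in_quarter = 30 + 31 + 30    # April, May, June 2005

sales_pb_per_day = petabytes_sold_in_quarter / days_in_quarter

# Assumed reading: shipments grow 59% per year.
annual_growth = 0.59
sales_pb_per_day_next_year = sales_pb_per_day * (1 + annual_growth)

print(f"2Q 2005 sales rate: {sales_pb_per_day:.2f} PB/day")
print(f"one year later:     {sales_pb_per_day_next_year:.2f} PB/day")
```

Under these assumptions, storage array shipments in 2005 were already roughly equal to the daily volume needed to keep the collectable speech estimated earlier.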
30Human History
Oral Tradition
Writing
31Hands On: Speech on the Web
- singingfish.com
- blinkx.com
- ocw.mit.edu
- podcasts.yahoo.com