Title: Web Characterization
1. Web Characterization
- Week 9
- LBSC 690: Information Technology
2. Outline
- What is the Web?
- What's on the Web?
- What is the nature of the Web?
- Preserving the Web
3. Defining the Web
- HTTP, HTML, or URL? (see the sketch after this slide)
- Static, dynamic or streaming?
- Public, protected, or internal?
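A minimal sketch of how those three pieces fit together, using only Python's standard library: the URL names the resource, HTTP is the protocol used to fetch it, and HTML is (usually) what comes back. The host example.com is just an illustration.

    from urllib.request import urlopen

    url = "http://example.com/"                 # the URL names the resource
    with urlopen(url) as response:              # urlopen issues an HTTP GET
        print(response.status)                  # HTTP status code, e.g. 200
        print(response.getheader("Content-Type"))  # usually text/html
        html = response.read().decode("utf-8", errors="replace")
    print(html[:100])                           # the start of the HTML markup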
4. Economics of the Web in 1995
- Affordable storage
- 300,000 words per dollar
- Adequate backbone capacity
- 25,000 simultaneous transfers
- Adequate last mile bandwidth
- 1 second/screen
- Display capability
- 10% of US population
- Effective search capabilities
- Lycos (now Google), Yahoo
5. Nature of the Web
- Over one billion pages by 1999
- Growing at 25% per month!
- Google indexed about 3 billion pages in 2003
- Unstable
- Changing at 1% per week
- Redundant
- 30-40% (near) duplicates
- e.g., unix man page tree
6. Source: Michael Lesk, "How Much Information Is There in the World?"
7. Number of Web Sites
8. Web Sites by Country, 2002
9. What's a Web Site?
- OCLC counts any server at port 80 (see the sketch below)
- Misses many servers at other ports
- Some servers host unrelated content
- e.g., GeoCities
- Some content requires specialized servers
- e.g., RTSP
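A rough sketch of that counting method, assuming we simply test whether a host accepts a TCP connection on port 80; the host names are illustrative, and a real census like OCLC's samples IP addresses at scale. A server listening only on, say, port 8080 would be missed, as the slide notes.

    import socket

    def answers_on_port_80(host, timeout=3):
        """True if the host accepts a TCP connection on port 80."""
        try:
            with socket.create_connection((host, 80), timeout=timeout):
                return True
        except OSError:
            return False

    for host in ["www.example.com", "www.umd.edu"]:   # illustrative hosts
        print(host, answers_on_port_80(host))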
10. World Trade in 2001
Source: World Trade Organization
11. World Trade
12. Global Internet User Population
[Chart: Internet users by language, 2000 and 2005; English and Chinese among the largest groups]
Source: Global Reach
13. Widely Spoken Languages
Source: http://www.g11n.com/faq.html
14. Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm
15. Web Page Languages
Source: Jack Xu, Excite@Home, 1999
16. European Web Size: Exponential Growth
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
17. European Web Content
Source: European Commission, "Evolution of the Internet and the World Wide Web in Europe," 1997
18. Live Streams
Almost 2,000 Internet-accessible radio and television stations
Source: www.real.com, Feb 2000
19. Streaming Media
- SingingFish indexes 35 million streams
- 60% of queries are for music
- Then movies
- Then sports
- Then news
20. The Deep Web
- Web pages generated dynamically from databases (see the sketch below)
- Traditional search engines cannot retrieve them
- 400 to 500 times bigger than the surface Web
- The fastest-growing category of new information on the Web
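A toy sketch of why such pages are "deep": the HTML below is built only when someone submits a query that is turned into a database lookup, so there is no fixed page for a link-following crawler to discover. The database, table, and fields are made up for illustration.

    import sqlite3

    def render_results(query_term):
        """Generate an HTML results page on the fly from a database query."""
        conn = sqlite3.connect("catalog.db")      # hypothetical database
        rows = conn.execute(
            "SELECT title, author FROM books WHERE title LIKE ?",
            (f"%{query_term}%",),
        ).fetchall()
        conn.close()
        items = "".join(f"<li>{title} ({author})</li>" for title, author in rows)
        return f"<html><body><ul>{items}</ul></body></html>"

    # The page exists only per request; unless something links to
    # /search?q=whales, a crawler that follows links never sees it.
    print(render_results("whales"))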
21. Information in the Deep Web
- Relevant to most information needs
22. Deep Web
- The 60 largest deep Web sites exceed the surface Web by 40 times
23. Link Structure of the Web
24. Crawling the Web
25. Web Crawl Challenges
- Temporary server interruptions
- Discovering islands and peninsulas
- Duplicate and near-duplicate content
- Dynamic content
- Link rot
- Server and network loads
- Have I seen this page before? (see the crawler sketch below)
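A minimal crawler skeleton, standard library only, that touches a few of these challenges: it skips URLs already queued, uses a content hash to answer "have I seen this page before?", and sleeps between requests to limit server load. It is a sketch under simplifying assumptions (crude link extraction, no retry policy, no robots.txt handling, which the next slides cover), not a production crawler.

    import hashlib, re, time
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    from urllib.error import URLError

    def crawl(seed, max_pages=20):
        frontier = deque([seed])
        queued = {seed}           # URLs already scheduled
        seen_hashes = set()       # content fingerprints: "have I seen this page?"
        while frontier and max_pages > 0:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=10).read()
            except URLError:
                continue          # temporary interruption; a real crawler retries later
            digest = hashlib.sha1(html).hexdigest()
            if digest in seen_hashes:
                continue          # exact duplicate content
            seen_hashes.add(digest)
            max_pages -= 1
            # Crude link extraction; a real crawler parses the HTML properly.
            for href in re.findall(rb'href="([^"#]+)"', html):
                link = urljoin(url, href.decode("ascii", errors="ignore"))
                if link.startswith("http") and link not in queued:
                    queued.add(link)
                    frontier.append(link)
            time.sleep(1)         # be gentle on servers and the network

    crawl("http://example.com/")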
26. Duplicate Detection
- Structural
- Identical directory structure (e.g., mirrors, aliases)
- Syntactic
- Identical bytes
- Identical markup (HTML, XML, ...)
- Semantic
- Identical content
- Similar content (e.g., with a different banner ad; see the sketch below)
- Related content (e.g., translated)
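One common way to catch "similar content" is to fingerprint each page by its word n-grams (shingles) and compare fingerprints with Jaccard similarity, so two copies that differ only in a banner ad still score close to 1.0. A small sketch, with an illustrative shingle size and threshold:

    import hashlib, re

    def shingles(text, k=5):
        """Set of hashed k-word shingles from a page's text."""
        words = re.findall(r"\w+", text.lower())
        return {
            hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
            for i in range(max(len(words) - k + 1, 1))
        }

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    page = ("ls lists information about the files in the current directory "
            "and by default it sorts the entries alphabetically")
    page_with_ad = "BUY NOW " + page            # same page, extra banner text

    similarity = jaccard(shingles(page), shingles(page_with_ad))
    print(f"similarity = {similarity:.2f}")
    print("near-duplicate" if similarity > 0.8 else "different")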
27. Robots Exclusion Protocol
- Based on voluntary compliance by crawlers
- Exclusion by site
- Create a robots.txt file at the server's top level (example below)
- Indicate which directories not to crawl
- Exclusion by document (in the HTML head)
- Not implemented by all crawlers
- <meta name="robots" content="noindex,nofollow">
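Both halves of the site-level mechanism, sketched with the standard library. The robots.txt paths are made up; can_fetch() comes from Python's urllib.robotparser, and compliance is still voluntary on the crawler's part.

    # robots.txt at the server's top level, e.g. http://www.example.com/robots.txt
    #   User-agent: *
    #   Disallow: /private/
    #   Disallow: /cgi-bin/

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()                                   # fetch and parse the file
    print(rp.can_fetch("MyCrawler", "http://www.example.com/private/data.html"))
    print(rp.can_fetch("MyCrawler", "http://www.example.com/index.html"))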
28. Hands On: The Wayback Machine
- Internet Archive
- Has stored Alexa.com Web crawls since 1997
- http://archive.org
- Check out Maryland's Web site in 1997
- Check out your college's Web site back in time (or query the archive programmatically; see the sketch below)
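The Internet Archive also exposes a simple "availability" API, so the same lookup can be done programmatically. The sketch below asks for the snapshot of www.umd.edu closest to 1 January 1997; the endpoint is real, but check the Archive's documentation for the exact response fields before relying on them.

    import json
    from urllib.request import urlopen

    url = ("http://archive.org/wayback/available"
           "?url=www.umd.edu&timestamp=19970101")
    with urlopen(url) as response:
        data = json.load(response)

    closest = data.get("archived_snapshots", {}).get("closest", {})
    print(closest.get("timestamp"), closest.get("url"))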
29. Discussion Point
- Can we save everything?
- Should we?
- Do people have a right to remove things?