Web Characterization - PowerPoint PPT Presentation


Provided by: Doug9

1
Web Characterization
  • Week 9
  • LBSC 690
  • Information Technology

2
Outline
  • What is the Web?
  • What's on the Web?
  • What is the nature of the Web?
  • Preserving the Web

3
Defining the Web
  • HTTP, HTML, or URL?
  • Static, dynamic or streaming?
  • Public, protected, or internal?

4
Economics of the Web in 1995
  • Affordable storage
    • 300,000 words/
  • Adequate backbone capacity
    • 25,000 simultaneous transfers
  • Adequate last-mile bandwidth
    • 1 second/screen
  • Display capability
    • 10% of US population
  • Effective search capabilities
    • Lycos (now Google), Yahoo

5
Nature of the Web
  • Over one billion pages by 1999
    • Growing at 25% per month!
    • Google indexed about 3 billion pages in 2003
  • Unstable
    • Changing at 1% per week
  • Redundant
    • 30-40% (near) duplicates
    • e.g., Unix man page tree
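
The growth and change rates on this slide compound quickly. The following is a minimal sketch of that arithmetic, assuming the figures are percentages (25 percent growth per month, 1 percent of pages changing per week) and that each rate stays constant and independent from period to period, which the slide does not claim. The function names are illustrative only.

```python
def compound_growth(rate_per_period: float, periods: int) -> float:
    """Return the size multiplier after compounding a growth rate."""
    return (1.0 + rate_per_period) ** periods


def fraction_unchanged(change_rate_per_period: float, periods: int) -> float:
    """Return the share of pages never changed, assuming each period
    independently changes the given share of pages."""
    return (1.0 - change_rate_per_period) ** periods


if __name__ == "__main__":
    yearly_multiplier = compound_growth(0.25, 12)
    unchanged_after_year = fraction_unchanged(0.01, 52)
    print(f"Growth over 12 months: about {yearly_multiplier:.1f}x")
    print(f"Pages unchanged after 52 weeks: about {unchanged_after_year:.0%}")
```

Under these assumptions, 25 percent per month works out to roughly a 14.6-fold increase per year, and about 59 percent of pages would still be untouched after a year of 1 percent weekly change.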

6
Source: Michael Lesk, How Much Information Is There in the World?
7
Number of Web Sites
8
Web Sites by Country, 2002
9
What's a Web Site?
  • OCLC counts any server at port 80
    • Misses many servers at other ports
  • Some servers host unrelated content
    • Geocities
  • Some content requires specialized servers
    • rtsp

10
World Trade in 2001
Source: World Trade Organization
11
World Trade
12
Global Internet User Population
[Charts for 2000 and 2005; language labels include English and Chinese]
Source: Global Reach
13
Widely Spoken Languages
Source: http://www.g11n.com/faq.html
14
Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm
15

Web Page Languages
Source: Jack Xu, Excite@Home, 1999
16
European Web Size: Exponential Growth
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
17
European Web Content
Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
18
Live Streams
Almost 2000 Internet-accessible Radio and Television Stations
Source: www.real.com, Feb 2000
19
Streaming Media
  • SingingFish indexes 35 million streams
  • 60% of queries are for music
    • Then movies
    • Then sports
    • Then news

20
The Deep Web
  • Dynamically generated Web pages from databases
  • Traditional search engines cannot retrieve them
  • 400 to 500 times bigger than the surface Web
  • Largest growing category of new information

21
Information in Deep Web
  • Relevant to most information needs

22
Deep Web
  • 60 Deep Sites Exceed Surface Web by 40 Times

23
Link Structure of the Web
24
Crawling the Web
25
Web Crawl Challenges
  • Temporary server interruptions
  • Discovering islands and peninsulas
  • Duplicate and near-duplicate content
  • Dynamic content
  • Link rot
  • Server and network loads
  • Have I seen this page before?
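
The last question on this slide, "Have I seen this page before?", is often answered at the URL level before any content is fetched. The following is a minimal sketch, assuming a crawler that keeps every accepted URL in memory; the `normalize` rules and the `Frontier` class are hypothetical names for illustration, not something the slides describe. It does not handle user-info, IPv6 hosts, or equivalent query strings.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Reduce trivially different spellings of a URL to one form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_ports = {"http": 80, "https": 443}
    if parts.port is not None and parts.port != default_ports.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


class Frontier:
    """Queue of URLs to fetch that refuses URLs it has already accepted."""

    def __init__(self) -> None:
        self._queue = deque()
        self._seen = set()

    def add(self, url: str) -> bool:
        """Queue the URL and return True, or return False if already seen."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._queue.append(key)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

A URL-level check like this only catches repeated addresses. The same page served under different addresses (mirrors, aliases) still needs content-level duplicate detection.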

26
Duplicate Detection
  • Structural
    • Identical directory structure (e.g., mirrors, aliases)
  • Syntactic
    • Identical bytes
    • Identical markup (HTML, XML, ...)
  • Semantic
    • Identical content
    • Similar content (e.g., with a different banner ad)
    • Related content (e.g., translated)
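
Two of the cases above lend themselves to a short illustration. The following is a minimal sketch, not a method the slides name: "identical bytes" is checked with a SHA-256 hash, and "similar content" is approximated with word shingles compared by Jaccard similarity, a common technique chosen here as an assumption. The shingle size of 3 and the sample text are arbitrary.

```python
import hashlib
import re


def content_fingerprint(data: bytes) -> str:
    """Hash of the raw bytes: equal fingerprints mean identical bytes
    (up to the negligible chance of a hash collision)."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word windows over the lowercased words of the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap of two shingle sets, from 0.0 (disjoint) to 1.0 (equal)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    page = "the quick brown fox jumps over the lazy dog near the river bank"
    with_ad = "BUY NOW " + page  # same page with a banner prepended
    same_bytes = content_fingerprint(page.encode()) == content_fingerprint(with_ad.encode())
    similarity = jaccard(shingles(page), shingles(with_ad))
    print("identical bytes:", same_bytes)
    print("shingle similarity:", round(similarity, 2))
```

The byte hash reports the two versions as different, while the shingle overlap stays high. Deciding what counts as "similar enough" then comes down to a similarity threshold, which has to be tuned for the collection.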

27
Robots Exclusion Protocol
  • Based on voluntary compliance by crawlers
  • Exclusion by site
    • Create a robots.txt file at the server's top level
    • Indicate which directories not to crawl
  • Exclusion by document (in HTML head)
    • Not implemented by all crawlers
    • <meta name="robots" content="noindex,nofollow">
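
Site-level exclusion can be tried out without touching a live server. The following is a minimal sketch using Python's standard `urllib.robotparser` module; the robots.txt contents, the `ExampleBot` user agent, and the example.com URLs are made up for illustration. It covers only the robots.txt file, not the per-document meta tag.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in ("http://example.com/index.html",
                "http://example.com/private/notes.html"):
        # A well-behaved crawler asks before every fetch.
        print(url, rules.can_fetch("ExampleBot", url))
```

Nothing here stops a crawler from ignoring the answer, which is the sense in which the protocol rests on voluntary compliance.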

28
Hands on: The Wayback Machine
  • Internet Archive
    • Stored Alexa.com Web crawls since 1997
    • http://archive.org
  • Check out Maryland's Web site in 1997
  • Check out your college Web site back in time
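
For the exercise above, archived pages can also be reached directly by address. The following is a minimal sketch, assuming the Wayback Machine's usual address pattern of `web.archive.org/web/<timestamp>/<original URL>`, where the service redirects to the capture nearest the requested timestamp. The helper name and the Maryland URL are assumptions for illustration; the slides do not give either.

```python
from datetime import date


def wayback_url(original_url: str, when: date) -> str:
    """Build a Wayback Machine address for the capture nearest a date."""
    timestamp = when.strftime("%Y%m%d")  # YYYYMMDD prefix of the full timestamp
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical target: the University of Maryland home page in mid-1997.
    print(wayback_url("http://www.umd.edu/", date(1997, 6, 1)))
```

Pasting the printed address into a browser should land on whatever capture the archive holds closest to that date, if any exists.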

29
Discussion Point
  • Can we save everything?
  • Should we?
  • Do people have a right to remove things?