Surfing the Invisible Web - PowerPoint PPT Presentation

1
Surfing the Invisible Web
  • by Katie Hobbins
  • 20 November 2001

2
Overview
  • Introduction
  • The Invisible Web defined
  • Why is it invisible to search engines?
  • Finding and identifying Invisible Web content
  • Why use Invisible Web resources?
  • When to use Invisible Web resources
  • The future of the Invisible Web

3
Introduction
  • General-purpose search engines do not have
    comprehensive coverage of the Web
  • Often we find just what we need, but equally
    often, we do not (love-hate relationship)
  • One reason for this: a great deal of Web content
    is invisible to search engines (the Invisible Web)

4
The Invisible Web Defined
  • Very difficult to define
  • Also referred to as the deep Web or the hidden
    Web
  • Term "Invisible Web" first coined in 1994 (Dr.
    Jill Ellsworth)
  • Consists largely of content-rich databases from
    universities, libraries, associations, businesses
    and government
  • Ex. MapQuest maps, New York Times stories

5
BrightPlanet Study (2000)
  • White paper "The Deep Web: Surfacing Hidden Value"
    studied the size, quality and traffic of the
    Deep Web
  • Concluded that the Deep/Invisible Web is approx.
    500 times larger than the surface Web and growing
    faster
  • 7500 terabytes of info on Deep Web vs. 19
    terabytes on surface Web

6
BrightPlanet Study (2000), cont'd
  • BrightPlanet breakdown of Deep Web content:
  • Topical databases: 54%
  • Internal sites: 13%
  • Publications: 11%
  • Shopping / Auctions: 5%
  • Classifieds: 5%
  • Portals: 3%
  • Libraries: 2%
  • Yellow & White Pages: 2%
  • Calculators: 2%
  • Jobs: 1%
  • Message or Chat: 1%
  • General Search: 1%

7
Invisible Web Size (Sherman and Price)
  • Chris Sherman and Gary Price, Internet search
    gurus and authors of The Invisible Web (2001),
    disagree with BrightPlanet's sizing of the
    Invisible Web
  • They estimate it to be 2-50 times larger than the
    visible Web

8
Why Search Engines Can't Find It
  • Technical and non-technical issues prevent search
    engines from indexing the Invisible Web
  • Spiders/crawlers don't index information stored
    in databases
  • Costs prohibit search engines from searching more
    often or more deeply
  • Some content is non-textual, which is a problem
    for search engines
  • Spider traps

9
Why Search Engines Can't Find It, cont'd
  • Typical Invisible Web content that is resistant
    to search engines:
  • Dynamically generated based on users' queries
  • Requires registration or login (NYTimes)
  • Fee-based or licensed (Electric Library)
  • Resides on an Intranet
  • Archives (newspapers)
  • Newly added pages
  • Noindex meta tags

10
Four Types of Invisibility
  • The Opaque Web
  • Files that can be, but are not, included in search
    engine indices because of issues such as:
  • Depth of crawl
  • Frequency of crawl
  • Maximum number of viewable results
  • Disconnected URLs
  • Sherman and Price (2001)
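
The effect of crawl depth and disconnected URLs can be illustrated with a small sketch. The site map below is hypothetical, and the depth limit of 2 is an assumption chosen for illustration, not a figure from Sherman and Price. The crawler follows links breadth-first and stops following them once it reaches the limit.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/about", "/reports"],
    "/about": [],
    "/reports": ["/reports/2001"],
    "/reports/2001": ["/reports/2001/q3"],
    "/reports/2001/q3": [],
    "/orphan": [],  # no page links here: a disconnected URL
}

def crawl(links, start="/", max_depth=2):
    """Breadth-first crawl that stops following links at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # the page is indexed, but its links are not followed
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

indexed = crawl(LINKS, max_depth=2)
missed = sorted(set(LINKS) - indexed)
print(missed)  # ['/orphan', '/reports/2001/q3']
```

Both missed pages are ordinary files that could be indexed. One sits deeper than the crawler goes, and the other has no inbound link, which is roughly what makes this part of the Web "opaque" rather than truly invisible.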

11
Four Types of Invisibility
  • The Private Web
  • Sites that are technically indexable, but have
    been excluded by the Webmaster
  • Password protection, robots.txt, noindex meta
    tag
  • Sherman and Price (2001)
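
As a rough sketch of how a well-behaved crawler might honour two of these exclusions, the example below checks a robots.txt rule and looks for a noindex meta tag using only the Python standard library. The robots.txt content, the example.org URLs, and the helper names `allowed_by_robots` and `has_noindex` are all hypothetical and not from the presentation.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed_by_robots(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the robots.txt rules permit the agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

class NoindexDetector(HTMLParser):
    """Sets self.noindex when a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True

def has_noindex(html: str) -> bool:
    """Return True if the page asks search engines not to index it."""
    detector = NoindexDetector()
    detector.feed(html)
    return detector.noindex

print(allowed_by_robots(ROBOTS_TXT, "http://www.example.org/private/report.html"))
print(has_noindex('<head><meta name="robots" content="noindex, nofollow"></head>'))
```

Neither mechanism blocks access technically. Both rely on the crawler choosing to respect the Webmaster's request, which is why these pages are indexable but excluded.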

12
Four Types of Invisibility
  • The Proprietary Web
  • Sites only accessible to those who register
    (NYTimes)
  • Fee-based sites (Electric Library, Northern Light
    Special Collections)
  • Does not include Lexis-Nexis, Dialog, etc.
  • Sherman and Price (2001)

13
Four Types of Invisibility
  • The Truly Invisible Web
  • Cannot be indexed for truly technical reasons
  • Crawlers can't handle the file formats
  • Dynamically generated information
  • Stored in relational databases
  • Sherman and Price (2001)

14
Finding Invisible Web Content
  • Knowing the Invisible Web exists is the first step
  • Seven strategies from Gary Price (2001)
  • Adopt the mindset of a hunter
  • Use search engines
  • Examine your Bookmarks / Favorites
  • Monitor discussion lists in your subject area
  • Use Invisible Web pathfinders
  • Try offline finding aids
  • Create your own monitoring service

15
Invisible Web Collection Development
  • Continuous education and internet collection
    development are key to managing the Invisible Web
  • Familiarize yourself with your collection
  • Get used to knowing when to use it vs. when to go
    to a search engine

16
Identifying Invisible Web Sites
  • First, understand the difference between
    Navigation and Content sites
  • Navigation: sites that facilitate navigation and
    resource discovery
  • Content: sites that provide content
  • All truly Invisible Web sites are fundamentally
    providers of content

17
Direct vs. Indirect URLs
  • Examining URLs is the easiest way to determine
    if a Web page is invisible
  • Direct URLs
  • point to a specific Web page
  • Ex. www.yahoo.ca or www.tc.gc.ca/en/modes/htm
  • Crawlers can follow these URLs
  • Indirect URLs
  • Don't point to a specific page
  • Contain information to be executed by a script on
    the server
  • Contain symbols (?) or words (cgi-bin or
    javascript)
  • Ex. www.elections.ca/scripts/info/edMap_e.asp?edID=35059&showLink=no
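
The URL test described on this slide can be sketched as a simple string check. This is a minimal heuristic that assumes the markers listed above ("?", "cgi-bin", "javascript") are the only signals. The function name `is_indirect` and the example.org URL are hypothetical.

```python
# Words the slide names as signs of an indirect URL.
SCRIPT_MARKERS = ("cgi-bin", "javascript")

def is_indirect(url: str) -> bool:
    """Return True if the URL looks like it is executed by a script on the server."""
    lowered = url.lower()
    has_query = "?" in lowered
    has_marker = any(marker in lowered for marker in SCRIPT_MARKERS)
    return has_query or has_marker

for url in (
    "www.yahoo.ca",
    "www.tc.gc.ca/en/modes/htm",
    "www.example.org/cgi-bin/search?q=maps",  # hypothetical indirect URL
):
    print(url, "->", "indirect" if is_indirect(url) else "direct")
```

A check like this only flags likely candidates. A page behind an indirect URL may still be indexed if a search engine chooses to follow it, so the result is a hint rather than proof of invisibility.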

18
Specialized vs. Invisible
  • Specialized search directories share some
    characteristics with Invisible Web sites, but are
    visible to search engines
  • Structured hierarchically as navigational hubs;
    consist of 100s or 1000s of HTML pages (ex.
    www.lawcrawler.com)
  • To test: start browsing the directory, drilling
    down. Are the URLs direct or indirect?

19
Comparison: Specialized vs. Invisible

20
Why Use the Invisible Web?
  • General-purpose search engines are mass-audience
    resources
  • Invisible Web sites are more focused
  • We may be doing our clients a disservice if we
    stop searching at Google
  • Consider the point of view of the provider of the
    resource (search engines are trying to please
    everyone, motivated by profit)

21
Why Use the Invisible Web?
  • Specialized content focus: more comprehensive
    results
  • Specialized search interface: more control over
    search input and output
  • Increased precision and recall
  • Invisible Web resources: higher level of
    authority
  • The answer may not be available elsewhere.

22
When to Use the Invisible Web: Rules of Thumb
  • When you are familiar with a subject
  • When you are familiar with specific search tools
    and techniques
  • When you are looking for a precise answer
  • When you want authoritative, exhaustive results
  • When the timeliness of content is an issue

23
Top 25 Invisible Web Categories
  • 1. Public company filings
  • 2. Telephone numbers
  • 3. Customized maps & driving directions
  • 4. Clinical trials
  • 5. Patents
  • 6. Out of Print Books
  • 7. Library catalogues
  • 8. Authoritative dictionaries
  • 9. Environmental information
  • 10. Historical stock quotes
  • 11. Historical documents and images
  • 12. Company directories
  • 13. Searchable subject bibliographies
  • 14. Economic information
  • 15. Award winners
  • 16. Job postings
  • 17. Philanthropy grant information
  • 18. Translation tools
  • 19. Postal codes
  • 20. Basic demographic information
  • 21. Interactive school finders
  • 22. Campaign financing information
  • 23. Weather data
  • 24. Product catalogues
  • 25. Art gallery holdings

Sherman and Price (2001)

24
The Future of the Invisible Web
  • Question: Will traditional search engines ever be
    able to index the Invisible Web? Answer: Yes and
    no
  • Yes, we will see advances in search engine
    technology and approaches to search, such as
  • Indexing of new file formats (PDF, Word, Excel
    and non-textual multi-media)
  • Smarter Crawlers
  • Metadata
  • Ability to search databases and interact with
    query forms
  • Real-time crawling
  • No, the Invisible Web will probably always exist
    because information growth is just too great for
    search engines to keep up with.

25
Conclusion: What Is a Web Searcher to Do?
  • Gain an understanding of the Invisible Web
  • Develop your own Invisible Web collection
  • You will be expanding the number of tools
    available to you, thus making you a more
    efficient searcher
  • Keep current with new developments and new
    Invisible Web resources