Title: Surfing the Invisible Web
1. Surfing the Invisible Web
- by Katie Hobbins
- 20 November 2001
2. Overview
- Introduction
- The Invisible Web defined
- Why is it invisible to search engines?
- Finding and identifying Invisible Web content
- Why use Invisible Web resources?
- When to use Invisible Web resources
- The future of the Invisible Web
3. Introduction
- General-purpose search engines do not have comprehensive coverage of the Web
- Often we find just what we need, but equally often we do not (a love-hate relationship)
- One reason for this is that a great deal of Web content is invisible to search engines: the Invisible Web
4. The Invisible Web Defined
- Very difficult to define
- Also referred to as the deep Web or the hidden Web
- Term "Invisible Web" first coined in 1994 (Dr. Jill Ellsworth)
- Consists largely of content-rich databases from universities, libraries, associations, businesses and government
- Ex. MapQuest maps, New York Times stories
5. BrightPlanet Study (2000)
- White paper: "The Deep Web: Surfacing Hidden Value"
- Studied the size, quality and traffic of the Deep Web
- Concluded that the Deep/Invisible Web is approx. 500 times larger than the surface Web and growing faster
- 7,500 terabytes of info on the Deep Web vs. 19 terabytes on the surface Web
6. BrightPlanet Study (2000) (cont'd)
- BrightPlanet breakdown of Deep Web content
- Topical databases: 54%
- Internal sites: 13%
- Publications: 11%
- Shopping / Auctions: 5%
- Classifieds: 5%
- Portals: 3%
- Libraries: 2%
- Yellow & White Pages: 2%
- Calculators: 2%
- Jobs: 1%
- Message or Chat: 1%
- General Search: 1%
7. Invisible Web Size (Sherman and Price)
- Chris Sherman and Gary Price, Internet search gurus and authors of The Invisible Web (2001), disagree with BrightPlanet's sizing of the Invisible Web
- They estimate it to be 2-50 times larger than the visible Web
8. Why Search Engines Can't Find It
- Technical and non-technical issues prevent search engines from indexing the Invisible Web
- Spiders/crawlers don't index information stored in databases
- Costs prohibit search engines from searching more often or more deeply
- Some content is non-textual: a problem for search engines
- Spider traps
9. Why Search Engines Can't Find It (cont'd)
- Typical Invisible Web content that is resistant to search engines
- Dynamically generated based on users' queries
- Requires registration or login (NYTimes)
- Fee-based or licensed (Electric Library)
- Resides on an Intranet
- Archives (newspapers)
- Newly added pages
- Noindex meta tags
10. Four Types of Invisibility
- The Opaque Web
- Files that can be, but are not, included in search engine indices because of issues such as
- Depth of crawl
- Frequency of crawl
- Maximum number of viewable results
- Disconnected URLs
- Sherman and Price (2001)
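Two of the Opaque Web causes listed above, depth of crawl and disconnected URLs, can be illustrated with a small sketch. The site map `LINKS`, its page paths and the `crawl` function are all hypothetical, made up for illustration; real crawlers are far more elaborate and their depth limits are not stated in the source.

```python
from collections import deque

# Hypothetical toy site: each page maps to the pages it links to.
LINKS = {
    "/": ["/about", "/reports"],
    "/about": [],
    "/reports": ["/reports/2001"],
    "/reports/2001": ["/reports/2001/q3"],
    "/reports/2001/q3": [],
    "/orphan": [],  # disconnected URL: no other page links to it
}

def crawl(start, max_depth):
    """Breadth-first crawl that does not follow links from pages at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth limit reached: this page's links are ignored
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

indexed = crawl("/", max_depth=2)
opaque = sorted(set(LINKS) - indexed)
print(opaque)  # pages that exist but were never indexed
```

With a depth limit of 2 the page three clicks from the home page is missed. Raising the limit would pick that page up, but the orphan page stays out of the index at any depth because no link leads to it.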
11. Four Types of Invisibility
- The Private Web
- Sites that are technically indexable, but have been excluded by the Webmaster
- Password protection, robots.txt, noindex meta tag
- Sherman and Price (2001)
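As a rough illustration of how a well-behaved crawler might honour the two exclusion mechanisms named above, the sketch below checks a robots.txt rule and looks for a noindex meta tag using only the Python standard library. The robots.txt content, the `ExampleBot` user agent, the example.com URLs and the helper `has_noindex` are assumptions invented for the example, not taken from any real site.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks all crawlers to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

class RobotsMetaParser(HTMLParser):
    """Records whether a page carries <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True

def has_noindex(html):
    """Return True if the HTML asks search engines not to index the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex

print(rules.can_fetch("ExampleBot", "http://www.example.com/private/report.html"))
print(rules.can_fetch("ExampleBot", "http://www.example.com/public/index.html"))
print(has_noindex('<head><meta name="robots" content="noindex, nofollow"></head>'))
```

Both mechanisms are requests rather than barriers: the page is technically reachable, and it stays out of the index only because the crawler chooses to comply.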
12. Four Types of Invisibility
- The Proprietary Web
- Sites only accessible to those who register (NYTimes)
- Fee-based sites (Electric Library, Northern Light Special Collections)
- Does not include Lexis-Nexis, Dialog, etc.
- Sherman and Price (2001)
13. Four Types of Invisibility
- The Truly Invisible Web
- Cannot be indexed for truly technical reasons
- Crawlers can't handle the file formats
- Dynamically generated information
- Stored in relational databases
- Sherman and Price (2001)
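The point about dynamically generated information stored in relational databases can be sketched as follows. The in-memory database, the `books` table with its made-up records, the `STATIC_PAGES` dictionary and the `results_page` function are all hypothetical; the sketch only assumes a crawler that reads static pages and follows links, without filling in query forms.

```python
import sqlite3

# Hypothetical catalogue held in an in-memory relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE books (title TEXT, year INTEGER)")
db.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("Example Atlas", 1999), ("Example Almanac", 2000)],  # made-up records
)

# The only static page on the site: a search form with no links to any record.
STATIC_PAGES = {
    "/search.html": "<form action='/results'><input name='q'></form>",
}

def results_page(query):
    """Build a results page on demand, as a server-side script would."""
    rows = db.execute(
        "SELECT title, year FROM books WHERE title LIKE ? ORDER BY title",
        (f"%{query}%",),
    ).fetchall()
    items = "".join(f"<li>{title} ({year})</li>" for title, year in rows)
    return f"<ul>{items}</ul>"

# A link-following crawler only ever sees the static page text.
crawler_view = " ".join(STATIC_PAGES.values())
print("Example Atlas" in crawler_view)  # the record is not in any static page
print(results_page("atlas"))            # it appears only in response to a query
```

The record exists on the server the whole time, but no page containing it exists until someone submits a query, so there is nothing for a link-following crawler to fetch.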
14. Finding Invisible Web Content
- Knowing the Invisible Web exists is the first step
- Seven strategies from Gary Price (2001)
- Adopt the mindset of a hunter
- Use search engines
- Examine your Bookmarks / Favorites
- Monitor discussion lists in your subject area
- Use Invisible Web pathfinders
- Try offline finding aids
- Create your own monitoring service
15. Invisible Web Collection Development
- Continuous education and Internet collection development are key to managing the Invisible Web
- Familiarize yourself with your collection
- Get used to knowing when to use it vs. when to go to a search engine
16. Identifying Invisible Web Sites
- First, understand the difference between Navigation and Content sites
- Navigation: Sites that facilitate navigation and resource discovery
- Content: Sites that provide content
- All truly Invisible Web sites are fundamentally providers of content
17. Direct vs. Indirect URLs
- Examining URLs: the easiest way to determine if a Web page is invisible
- Direct URLs
- Point to a specific Web page
- Ex. www.yahoo.ca or www.tc.gc.ca/en/modes/htm
- Crawlers can follow these URLs
- Indirect URLs
- Don't point to a specific page
- Contain information to be executed by a script on the server
- Contain symbols (?) or words (cgi-bin or javascript)
- Ex. www.elections.ca/scripts/info/edMap_e.asp?edID=35059&showLink=no
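The rule of thumb on this slide can be written as a short check. This is only a sketch of the heuristic as stated (a "?" symbol, or the words cgi-bin or javascript, anywhere in the URL); the function name `is_indirect`, the `SCRIPT_MARKERS` tuple and the example.com address are assumptions added for illustration.

```python
# Words that, per the slide, suggest the URL is handled by a script.
SCRIPT_MARKERS = ("cgi-bin", "javascript")

def is_indirect(url):
    """Return True if the URL looks indirect under the slide's rule of thumb."""
    lowered = url.lower()
    return "?" in lowered or any(marker in lowered for marker in SCRIPT_MARKERS)

examples = [
    "www.yahoo.ca",
    "www.tc.gc.ca/en/modes/htm",
    "www.elections.ca/scripts/info/edMap_e.asp?edID=35059&showLink=no",
    "www.example.com/cgi-bin/search",  # hypothetical address
]
for url in examples:
    print("indirect" if is_indirect(url) else "direct", url)
```

A plain substring test like this is deliberately crude. An ordinary static page whose path happens to contain the word "javascript" would be flagged as indirect, so the result is a hint to investigate rather than a verdict.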
18. Specialized vs. Invisible
- Specialized search directories share some characteristics with Invisible Web sites, but are visible to search engines
- Structured hierarchically as navigational hubs; consist of 100s or 1000s of HTML pages (ex. www.lawcrawler.com)
- To test: start browsing the directory, drilling down. Are the URLs direct or indirect?
19. Comparison: Specialized vs. Invisible
20. Why Use the Invisible Web?
- General-purpose search engines: mass audience resources
- Invisible Web sites: more focused
- We may be doing our clients a disservice if we stop searching at Google
- Consider the point of view of the provider of the resource (search engines are trying to please everyone, motivated by profit)
21. Why Use the Invisible Web?
- Specialized content focus: more comprehensive results
- Specialized search interface: more control over search input and output
- Increased precision and recall
- Invisible Web resources: higher level of authority
- The answer may not be available elsewhere
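Precision and recall, mentioned above, have standard definitions: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. The sketch below computes both for two made-up result lists. The document IDs and both result sets are invented purely to show the arithmetic and say nothing about how any real search tool performs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: four documents truly answer the question.
relevant = {"d1", "d2", "d3", "d4"}

# Made-up result lists; "x" items are irrelevant hits.
general_engine = ["d1", "x1", "x2", "x3", "x4", "x5", "x6", "x7"]
specialized_db = ["d1", "d2", "d3", "x1"]

print(precision_recall(general_engine, relevant))
print(precision_recall(specialized_db, relevant))
```

In this invented case the general-purpose list scores 0.125 precision and 0.25 recall, while the focused list scores 0.75 on both, which is the kind of gain the slide describes.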
22. When to Use the Invisible Web
Rules of Thumb
- When you are familiar with a subject
- When you are familiar with specific search tools and techniques
- When you are looking for a precise answer
- When you want authoritative, exhaustive results
- When the timeliness of content is an issue
23. Top 25 Invisible Web Categories
- 1. Public company filings
- 2. Telephone numbers
- 3. Customized maps and driving directions
- 4. Clinical trials
- 5. Patents
- 6. Out of Print Books
- 7. Library catalogues
- 8. Authoritative dictionaries
- 9. Environmental information
- 10. Historical stock quotes
- 11. Historical documents and images
- 12. Company directories
- 13. Searchable subject bibliographies
- 14. Economic information
- 15. Award winners
- 16. Job postings
- 17. Philanthropy grant information
- 18. Translation tools
- 19. Postal codes
- 20. Basic demographic information
- 21. Interactive school finders
- 22. Campaign financing information
- 23. Weather data
- 24. Product catalogues
- 25. Art gallery holdings
Sherman and Price (2001)
24. The Future of the Invisible Web
- Question: Will traditional search engines ever be able to index the Invisible Web? Answer: Yes and No
- Yes, we will see advances in search engine technology and approaches to search, such as
- Indexing of new file formats (PDF, Word, Excel and non-textual multimedia)
- Smarter crawlers
- Metadata
- Ability to search databases and interact with query forms
- Real-time crawling
- No, the Invisible Web will probably always exist because information growth is just too great for search engines to keep up with
25. Conclusion: What is a Web Searcher to do?
- Gain an understanding of the Invisible Web
- Develop your own Invisible Web collection
- You will be expanding the number of tools available to you, thus making you a more efficient searcher
- Keep current with new developments and new Invisible Web resources