Title: Searching the Deep Web
1. Searching the Deep Web
- Abe Lederman, Deep Web Technologies, LLC, Los Alamos, NM
- Arlington, Virginia
- October 25, 2002
2. What is the Deep Web? (1 of 2)
- Web content that crawlers cannot get to because it is:
  - inside databases
  - behind firewalls
  - available only for a fee
- Publicly available information is 400-550 times larger than the known web (surface web)
- Consists of 7,500 terabytes of information
3. What is the Deep Web? (2 of 2)
- 200,000 deep web websites have been identified
- Content is of higher quality than the surface web
- Growing faster than the surface web
- (according to a March 2000 BrightPlanet study)
4. Background
- Co-founder of Verity in 1988
- Began consulting for Los Alamos National Laboratory in 1994
- Demonstrated a web search retrieval application at a Verity Users Conference in April 1994
- Developed the SciSearch@LANL application in 1995
- Founded Innovative Web Applications in 1996
- Deployed the first "deep web" application in the Federal government, the Environmental Science Network, in February 1999
5. Distributed Explorit Features (1 of 2)
- Search engine independent
- Configurable user interface
- Common, consistent user interface across heterogeneous document collections
- Built-in navigation capabilities
- Built-in mark-and-download capability
6. Distributed Explorit Features (2 of 2)
- Field searching, including date-range searching
- Supports access to login-restricted sites
- Supports access to sites that use cookies or other session IDs
- Distributed Alerts for notification of new relevant content
- Personal Library to organize search results
7. Distributed Explorit: How it Works (1 of 2)
- Translates user search requests into the syntax understood by the web database being searched
- Example: the search deep web may be translated into deep web, "deep web", deep AND web, or deep ADJ web (a sketch of this translation step follows)
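The translation step above can be pictured with a short, illustrative sketch. This is not Distributed Explorit code; the engine names and syntax rules below are assumptions chosen only to show how one user query might be rewritten for different back-end databases.

```python
def translate_query(terms, engine):
    """Rewrite a list of search terms for a hypothetical target engine."""
    if engine == "phrase":       # engine that wants a quoted phrase
        return '"{}"'.format(" ".join(terms))
    if engine == "boolean":      # engine that requires an explicit AND
        return " AND ".join(terms)
    if engine == "adjacency":    # engine with an ADJ proximity operator
        return " ADJ ".join(terms)
    return " ".join(terms)       # default: plain keyword query


if __name__ == "__main__":
    for engine in ("plain", "phrase", "boolean", "adjacency"):
        print(engine, "->", translate_query(["deep", "web"], engine))
```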
8. Distributed Explorit: How it Works (2 of 2)
- Maps the fields the user is searching to the fields available in the database
- Submits searches in parallel, in real time, to the various databases selected by the user
- Sends requests to the web server in the same format a web browser would
- Result lists are parsed, and field values (author, title, etc.) are extracted (see the sketch after this list)
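A minimal sketch of the fan-out described on this slide, assuming two hypothetical sources, a browser-style User-Agent header, and an invented result-list markup; the URLs, parameter names, and regular expression are illustrations, not the product's actual interfaces.

```python
import concurrent.futures
import re
import urllib.parse
import urllib.request

# Hypothetical sources; {query} is filled with the URL-encoded search text.
SOURCES = {
    "source_a": "https://example.org/db-a/search?q={query}",
    "source_b": "https://example.org/db-b/find?term={query}",
}


def search_one(name, url_template, query):
    """Query one source the way a browser would and extract fielded values."""
    url = url_template.format(query=urllib.parse.quote(query))
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    # Invented result-list markup: <li class="hit"><b>TITLE</b> by AUTHOR</li>
    hits = re.findall(r'<li class="hit"><b>(.*?)</b> by (.*?)</li>', html)
    return name, [{"title": t, "author": a} for t, a in hits]


def search_all(query):
    """Submit the same query to every configured source in parallel."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_one, name, template, query)
                   for name, template in SOURCES.items()]
        return dict(f.result() for f in concurrent.futures.as_completed(futures))


if __name__ == "__main__":
    print(search_all("deep web"))
```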
9. Deep Web Search Application Advantages
- No crawling or indexing of data required
- Centralized access to multiple databases from one search form
- User needs to learn only one search language
- Common user interface across all sources accessed
- Improved functionality, in many cases, as compared to what is provided by the source itself
- Configurable for a wide variety of web databases and information sources
10. Deep Web Search Application Disadvantages
- Changes in the database engine used, result-list format, etc., will temporarily disable access to a source
- Not all search functionality of the source may be exposed
- Some of the information that a source returns may be lost
- Increased network/bandwidth requirements on the server
11. Performance Characteristics of a Deep Web Search Application
- Requires no disk space for indices
- Requires minimal CPU resources
- Requires more network resources than a comparable application that only searches local content
12. Distributed Explorit is a Stepping Stone
- From legacy content in government databases
- To content that is re-architected and re-purposed to be accessed via emerging technologies (e.g., XML, Web Services)
- Enables cheap and quick development of applications that showcase the value of universal access to content
- Gateway between the legacy world and emerging technologies
13. Demo
- U.S. Government Science Portal
- Collaborative effort of 10 Federal Government agencies
- http://www.science.gov
- We searched three collections for "cyber security"
19. The Problem
- The haystack is getting bigger
- How do you find that needle?
20. The Future (1 of 2)
- Clustering of results (a toy sketch follows this list)
- In-depth searching
- Indexing of results (uniform relevance ranking)
- Sophisticated result analysis
- Assist the user in identifying the best sources for a given search
- Collaborative discovery
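As a purely illustrative aside, the clustering item above might look something like the following toy sketch, which groups merged result titles using TF-IDF vectors and k-means; the sample titles and the use of scikit-learn are assumptions for demonstration, not part of the stated roadmap.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented result titles standing in for hits merged from several sources.
titles = [
    "Cyber security incident response",
    "Network security and cyber threats",
    "Deep web search applications",
    "Searching deep web databases",
]

# Vectorize the titles and split them into two clusters.
vectors = TfidfVectorizer().fit_transform(titles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, title in sorted(zip(labels, titles)):
    print(label, title)
```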
21. The Future (2 of 2)
- Web Service compliant interface
- Output of search results in both HTML and XML (see the sketch after this list)
- Support for querying with XQuery when this W3C standard is finalized
- Dynamic cross-content hyperlinking
- Leads to an "UberPortal", a portal of portals
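The dual HTML/XML output mentioned above could be sketched as follows; the record fields and element names are assumptions for illustration, not a description of the planned Web Service interface.

```python
import xml.etree.ElementTree as ET


def results_to_xml(results):
    """Serialize a result list as XML."""
    root = ET.Element("results")
    for record in results:
        hit = ET.SubElement(root, "hit")
        ET.SubElement(hit, "title").text = record["title"]
        ET.SubElement(hit, "author").text = record["author"]
    return ET.tostring(root, encoding="unicode")


def results_to_html(results):
    """Render the same result list as an HTML list."""
    items = "".join(
        "<li><b>{title}</b> by {author}</li>".format(**record) for record in results
    )
    return "<ul>{}</ul>".format(items)


if __name__ == "__main__":
    sample = [{"title": "Searching the Deep Web", "author": "A. Lederman"}]
    print(results_to_xml(sample))
    print(results_to_html(sample))
```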
29. For additional information
- Searching the Deep Web: Directed Query Engine Applications at the Department of Energy, www.dlib.org/dlib/january01/warnick/01warnick.html
- The Deep Web: Surfacing Hidden Value, www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp
30. Contact Information
- Abe Lederman, Deep Web Technologies, 154 Piedra Loop, Los Alamos, NM 87544, (505) 672-0007
- abe@deepwebtech.com