Title: Searching the Deep Web
1. Searching the Deep Web
- Abe Lederman, Deep Web Technologies, LLC, Los Alamos, NM
- Arlington, Virginia
- October 25, 2002
2. What is the Deep Web? (1 of 2)
- Web content that crawlers cannot get to because it is:
  - inside databases
  - behind firewalls
  - available only for a fee
- Publicly available information is 400-550 times larger than the known web (surface web)
- Consists of 7,500 terabytes of information
3. What is the Deep Web? (2 of 2)
- 200,000 deep web websites have been identified
- Content is of higher quality than the surface web
- Growing faster than the surface web
- (according to a March 2000 BrightPlanet study)
4. Background
- Co-founder of Verity in 1988
- Began consulting for Los Alamos National Laboratory in 1994
- Demonstrated a web search retrieval application at a Verity Users Conference in April 1994
- Developed the SciSearch@LANL application in 1995
- Founded Innovative Web Applications in 1996
- Deployed the first "deep web" application in the Federal government, the Environmental Science Network, in February 1999
5. Distributed Explorit Features (1 of 2)
- Search engine independent
- Configurable user interface
- Common, consistent user interface across heterogeneous document collections
- Built-in navigation capabilities
- Built-in mark-and-download capability
6. Distributed Explorit Features (2 of 2)
- Field searching, including date-range searching
- Supports access to login-restricted sites
- Supports access to sites that use cookies or other session IDs
- Distributed Alerts for notification of new relevant content
- Personal Library to organize search results
7. Distributed Explorit: How it Works (1 of 2)
- Translates user search requests into the syntax understood by the web database being searched
- Example: the search deep web may be translated into deep web, "deep web", deep AND web, or deep ADJ web (a sketch of this translation step follows)
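The translation step above can be pictured with a short, illustrative sketch. This is not Distributed Explorit code; the engine names and syntax rules below are assumptions chosen only to show how one user query might be rewritten for different back-end databases.

```python
def translate_query(terms, engine):
    """Rewrite a list of search terms for a hypothetical target engine."""
    if engine == "phrase":       # engine that wants a quoted phrase
        return '"{}"'.format(" ".join(terms))
    if engine == "boolean":      # engine that requires an explicit AND
        return " AND ".join(terms)
    if engine == "adjacency":    # engine with an ADJ proximity operator
        return " ADJ ".join(terms)
    return " ".join(terms)       # default: plain keyword query


if __name__ == "__main__":
    for engine in ("plain", "phrase", "boolean", "adjacency"):
        print(engine, "->", translate_query(["deep", "web"], engine))
```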
8. Distributed Explorit: How it Works (2 of 2)
- Maps the fields the user is searching to the fields available in the database
- Submits searches in parallel, in real time, to the various databases selected by the user
- Sends requests to the web server in the same format a web browser would
- Result lists are parsed, and field values (author, title, etc.) are extracted (see the sketch after this list)
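A minimal sketch of the fan-out described on this slide, assuming two hypothetical sources, a browser-style User-Agent header, and an invented result-list markup; the URLs, parameter names, and regular expression are illustrations, not the product's actual interfaces.

```python
import concurrent.futures
import re
import urllib.parse
import urllib.request

# Hypothetical sources; {query} is filled with the URL-encoded search text.
SOURCES = {
    "source_a": "https://example.org/db-a/search?q={query}",
    "source_b": "https://example.org/db-b/find?term={query}",
}


def search_one(name, url_template, query):
    """Query one source the way a browser would and extract fielded values."""
    url = url_template.format(query=urllib.parse.quote(query))
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    # Invented result-list markup: <li class="hit"><b>TITLE</b> by AUTHOR</li>
    hits = re.findall(r'<li class="hit"><b>(.*?)</b> by (.*?)</li>', html)
    return name, [{"title": t, "author": a} for t, a in hits]


def search_all(query):
    """Submit the same query to every configured source in parallel."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_one, name, template, query)
                   for name, template in SOURCES.items()]
        return dict(f.result() for f in concurrent.futures.as_completed(futures))


if __name__ == "__main__":
    print(search_all("deep web"))
```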
9. Deep Web Search Application Advantages
- No crawling or indexing of data required
- Centralized access to multiple databases from one search form
- User needs to learn only one search language
- Common user interface across all sources accessed
- Improved functionality, in many cases, as compared to what is provided by the source itself
- Configurable for a wide variety of web databases and information sources
10. Deep Web Search Application Disadvantages
- Changes in the database engine used, result-list format, etc., will temporarily disable access to a source
- Not all search functionality of the source may be exposed
- Some of the information that a source returns may be lost
- Increased network/bandwidth requirements on the server
11. Performance Characteristics of a Deep Web Search Application
- Requires no disk space for indices
- Requires minimal CPU resources
- Requires more network resources than a comparable application that only searches local content
12. Distributed Explorit is a Stepping Stone
- From legacy content in government databases
- To content that is re-architected and re-purposed to be accessed via emerging technologies (e.g., XML, Web Services)
- Enables cheap and quick development of applications that showcase the value of universal access to content
- Gateway between the legacy world and emerging technologies
13. Demo
- U.S. Government Science Portal
- Collaborative effort of 10 Federal Government agencies
- http://www.science.gov
- We searched three collections for "cyber security"
19. The Problem
- The haystack is getting bigger
- How do you find that needle?
20. The Future (1 of 2)
- Clustering of results (a toy sketch follows this list)
- In-depth searching
- Indexing of results (uniform relevance ranking)
- Sophisticated result analysis
- Assist the user in identifying the best sources for a given search
- Collaborative discovery
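As a purely illustrative aside, the clustering item above might look something like the following toy sketch, which groups merged result titles using TF-IDF vectors and k-means; the sample titles and the use of scikit-learn are assumptions for demonstration, not part of the stated roadmap.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented result titles standing in for hits merged from several sources.
titles = [
    "Cyber security incident response",
    "Network security and cyber threats",
    "Deep web search applications",
    "Searching deep web databases",
]

# Vectorize the titles and split them into two clusters.
vectors = TfidfVectorizer().fit_transform(titles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, title in sorted(zip(labels, titles)):
    print(label, title)
```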
21. The Future (2 of 2)
- Web Service compliant interface
- Output of search results in both HTML and XML (see the sketch after this list)
- Support for querying with XQuery when this W3C standard is finalized
- Dynamic cross-content hyperlinking
- Leads to an "UberPortal", a portal of portals
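The dual HTML/XML output mentioned above could be sketched as follows; the record fields and element names are assumptions for illustration, not a description of the planned Web Service interface.

```python
import xml.etree.ElementTree as ET


def results_to_xml(results):
    """Serialize a result list as XML."""
    root = ET.Element("results")
    for record in results:
        hit = ET.SubElement(root, "hit")
        ET.SubElement(hit, "title").text = record["title"]
        ET.SubElement(hit, "author").text = record["author"]
    return ET.tostring(root, encoding="unicode")


def results_to_html(results):
    """Render the same result list as an HTML list."""
    items = "".join(
        "<li><b>{title}</b> by {author}</li>".format(**record) for record in results
    )
    return "<ul>{}</ul>".format(items)


if __name__ == "__main__":
    sample = [{"title": "Searching the Deep Web", "author": "A. Lederman"}]
    print(results_to_xml(sample))
    print(results_to_html(sample))
```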
29. For additional information
- Searching the Deep Web: Directed Query Engine Applications at the Department of Energy, www.dlib.org/dlib/january01/warnick/01warnick.html
- The Deep Web: Surfacing Hidden Value, www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp
30. Contact Information
- Abe Lederman, Deep Web Technologies, 154 Piedra Loop, Los Alamos, NM 87544, (505) 672-0007
- abe@deepwebtech.com