1
Identifying Redundant Search Engines in
Metasearch Context
  • By Jamie McPeek

2
Overview
  • Background Information
  • Metasearch
  • Sets
  • Surface Web/Deep Web
  • The Problem
  • Application Goals

3
Metasearch
  • Metasearch is a way of querying any number of
    other search engines and combining their results
    to produce more precise data.
  • Metasearch engines began appearing on the web in
    the mid-1990s.
  • Many metasearch engines exist today but are
    generally overlooked due to Google's grasp on the
    search industry.
  • An example: www.info.com

4
Sets
  • A set is a collection of distinct objects with no
    repeated values.
  • For our purposes, a set consists of any number of
    web pages, with each set named after the search
    engine that returned its values.
  • When the results from all search engines are
    viewed together, the collection is no longer a
    set, as it may contain many repeated values.
  • Removing search engines to reduce or eliminate
    redundancy is an NP-complete problem.

5
Sets (Continued)
  • A cover is a collection of sets that together
    contain at least one of each element in the
    universe; in our case, at least one of each web
    page among all web pages returned.
  • Keeping the cover in mind, we have a goal:
  • Remove as many search engines as possible while
    maintaining a cover.
  • If possible, remove all redundant search engines.
    This creates a minimal cover (see the example
    below).
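  • A toy illustration (hypothetical data, not from
    the presentation): three engines return small sets
    of page IDs, and the snippet checks whether a
    chosen subset of engines still forms a cover.

    #include <iostream>
    #include <set>
    #include <vector>

    // Toy example: engine B's pages all appear in A or C,
    // so {A, C} is still a cover (here, a minimal one).
    int main() {
        std::vector<std::set<int>> engines = {
            {1, 2, 3},       // engine A
            {3, 4},          // engine B
            {2, 3, 4, 5}     // engine C
        };
        std::set<int> universe;
        for (const auto& e : engines)
            universe.insert(e.begin(), e.end());

        std::vector<int> chosen = {0, 2};   // keep A and C, drop B
        std::set<int> covered;
        for (int i : chosen)
            covered.insert(engines[i].begin(), engines[i].end());

        std::cout << (covered == universe ? "still a cover\n"
                                          : "not a cover\n");
    }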

6
Surface Web/Deep Web
  • The surface web is any content that is publicly
    available and directly accessible on the internet.
  • The deep web is all other content, for example:
  • Data accessible only through an on-site query
    system.
  • Data generated on-the-fly.
  • Frequently updated or changed content.
  • Most conventional search engines are incapable of
    retrieving data from the deep web, or capture it
    in only one state.

7
Surface Web/Deep Web (Cont.)
  • If various search engines capture different
    states, compiling their results can provide a
    clearer picture of the actual content.
  • Specialized and site-only search systems can
    almost always be generalized to allow remote
    searching.
  • With the above in mind, metasearch becomes an
    intriguing way to view not only the surface web,
    but the deep web as well.

8
The Problem
  • A finite and known number of search engines each
    return a set of a finite and known number of web
    pages.
  • Across all search engines, there may be
    redundancy. The idea is to remove as many
    unnecessary search engines from the meta set as
    possible while leaving a complete cover of the
    web pages.
  • Accuracy (relative to the true minimal cover) and
    speed are the most important aspects.

9
Application Goals
  • Using two different languages, we want to:
  • Compare the accuracy and speed of two different
    algorithms.
  • Compare different structures for the data based
    on the same algorithm.
  • Assess the impact of regions on overall time.
  • Regions are a way of grouping elements based on
    which search engines they appear in. At most one
    element in a region is necessary; all others are
    fully redundant (see the sketch below).
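  • A minimal sketch of the region idea, assuming each
    document's region is identified by a bitmask of the
    engines containing it (names are illustrative, and
    a single 64-bit mask is used here for brevity; the
    real code uses a multi-word bitmap).

    #include <cstdint>
    #include <map>
    #include <vector>

    // Group documents into regions keyed by which engines contain
    // them. Documents sharing an identical engine-membership mask
    // are mutually redundant: at most one per region must be kept.
    std::map<std::uint64_t, std::vector<int>>
    buildRegions(const std::vector<std::uint64_t>& docEngineMask) {
        std::map<std::uint64_t, std::vector<int>> regions;
        for (int doc = 0; doc < static_cast<int>(docEngineMask.size()); ++doc)
            regions[docEngineMask[doc]].push_back(doc);
        return regions;   // keep one representative per region
    }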

10
Overview
  • Source Code
  • Original Setup
  • Key Structures - C
  • Key Structures - C++
  • Procedure
  • Reasons For Changing

11
Original Setup
  • System: UW-Platteville's IO System
  • Language: C
  • Minor work was done on this code, as it was
    already written.
  • Used as a baseline for improvements.
  • Managed using Subversion to allow rollbacks and
    check-ins.

12
Key Structures - C
  • Structure for storing the sets.
  • Each web page is mapped to a specific bit.
  • Bitwise operators are used for accessing/editing
    a specific bit.
  • list[index] |= (1 << (bit % (sizeof(unsigned int) * 8)))
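  • A minimal sketch of such a bitmap, with assumed
    field and function names (the transcript does not
    show the actual structure):

    #include <climits>

    // One bit per web page, packed into unsigned ints.
    struct Bitmap {
        unsigned int* words;   // packed bits
        int           nBits;   // number of pages represented
    };

    static const int BITS_PER_WORD = sizeof(unsigned int) * CHAR_BIT;

    // Mark a page as present.
    void setBit(Bitmap* bm, int bit) {
        bm->words[bit / BITS_PER_WORD] |= 1u << (bit % BITS_PER_WORD);
    }

    // Test whether a page is present.
    int testBit(const Bitmap* bm, int bit) {
        return (bm->words[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1u;
    }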

13
Key Structures - C (Cont.)
  • Structure for storing web pages.
  • Stored using a tree for faster insertion.
  • BITMAP in this instance stores the specific
    search engines that the document exists in.
  • nID allows reference back to the specific web
    page without carrying the URL string around the
    entire time.
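  • A sketch of what such a tree node might look like,
    building on the Bitmap sketch above (field names
    other than BITMAP and nID are assumptions):

    // Web-page node stored in a balanced tree, keyed on the URL.
    // The bitmap records which search engines returned the page;
    // nID is a compact integer used in place of the URL later on.
    struct PageNode {
        char*     url;      // full URL, kept only in the tree
        int       nID;      // compact identifier used by the algorithms
        Bitmap    engines;  // BITMAP of engines containing this page
        PageNode* left;
        PageNode* right;
    };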

14
Key Structures - C++
  • The bitmap structure as changed for C++.
  • Added some variables to reduce fixed
    calculations.
  • Added variables to hold new data available when
    reading in the web pages.
  • Converted to a class (OOP).
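  • A sketch of the C++ version as a class, with fixed
    calculations such as the word count cached once
    (member names are illustrative):

    #include <climits>
    #include <vector>

    class BitmapSet {
    public:
        explicit BitmapSet(int nBits)
            : nBits_(nBits),
              nWords_((nBits + kBitsPerWord - 1) / kBitsPerWord),
              words_(nWords_, 0u) {}

        // Inline accessors avoid function-call overhead.
        inline void set(int bit)
            { words_[bit / kBitsPerWord] |= 1u << (bit % kBitsPerWord); }
        inline bool test(int bit) const
            { return (words_[bit / kBitsPerWord] >> (bit % kBitsPerWord)) & 1u; }

    private:
        static const int kBitsPerWord = sizeof(unsigned int) * CHAR_BIT;
        int nBits_;
        int nWords_;                      // cached instead of recomputed
        std::vector<unsigned int> words_;
    };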

15
Key Structures - C++ (Cont.)
  • A new structure implemented for C++.
  • A two-dimensional grid of these nodes implemented
    as a linked list.
  • Eliminates empty bits.
  • The structure is self-destructive in use.
  • Access is coordinate based, matrix-style (i, j).
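  • A sketch of one node in such a grid, assuming each
    node is linked to its neighbors in both directions
    so that empty bits are never stored (names are
    assumptions):

    // One node per (search engine, document) pair that actually
    // occurs; row i and column j are matrix-style coordinates.
    // Nodes are unlinked (consumed) as the algorithm covers them,
    // which is why the structure is "self-destructive in use".
    struct GridNode {
        int       i, j;      // matrix coordinates
        GridNode* right;     // next node in the same row
        GridNode* down;      // next node in the same column
    };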

16
Procedure
  • Read web pages in from file.
  • Number of search engines.
  • Number of documents per search engine.
  • Store each incoming web page as a node in a
    balanced tree.
  • Total number of web pages.
  • Total number of unique web pages.
  • Set up whichever structure is to be used, based
    on the numbers learned from reading in and
    storing the web pages.

17
Procedure (Continued)
  • Populate the structures based on the data
    available in the tree.
  • This can be the original tree or the region tree.
  • The bitmap structure is stored in two ways (see
    the sketch after this slide):
  • Search-engine major: used in the original C code.
  • Document major: used in the new C++ code.
  • Run one of the two algorithms over the structure
    and document the results.
  • Cover size and amount of time taken.
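  • A sketch of the two layouts mentioned above,
    reusing the BitmapSet sketch from earlier (names
    are illustrative; which axis is "major" determines
    what each bitmap row represents):

    #include <vector>

    // Search-engine major: one bitmap per engine, one bit per
    //   document; row e answers "which documents does engine e
    //   return?" (original C code).
    // Document major: one bitmap per document, one bit per engine;
    //   row d answers "which engines return document d?"
    //   (new C++ code).
    struct Layouts {
        std::vector<BitmapSet> engineRows;
        std::vector<BitmapSet> docRows;
    };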

18
Reasons For Changing
  • Personal preference. I'm significantly more
    familiar with C++ than I am with C.
  • Additional compiler options for the language.
  • OOP.
  • Additional language features.
  • Inline functions.
  • Operator overloading.
  • More readable code.

19
Overview
  • Algorithms
  • Greedy Algorithm
  • Check and Remove (CAR) Algorithm
  • Results
  • Data Sets
  • Baseline Results
  • Updated (C++) Results
  • Regions
  • Impact (Pending)

20
Greedy Algorithm
  • Straightforward, brute force.
  • Add the largest set, then the next largest, and
    so on.
  • Easily translated to code.
  • Makes no provision for removing redundant sets
    after reaching a cover set.
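  • A minimal sketch of the greedy heuristic as
    described above (repeatedly take the engine
    covering the most still-uncovered pages); this is
    a generic version, not the presentation's exact
    code.

    #include <set>
    #include <vector>

    // Greedy set cover: keep adding the engine whose result set
    // covers the most still-uncovered pages until everything is
    // covered. Nothing is removed afterwards, even if an earlier
    // choice became redundant.
    std::vector<int> greedyCover(const std::vector<std::set<int>>& engines) {
        std::set<int> uncovered;
        for (const auto& e : engines)
            uncovered.insert(e.begin(), e.end());

        std::vector<int> chosen;
        std::vector<bool> used(engines.size(), false);
        while (!uncovered.empty()) {
            int best = -1, bestGain = 0;
            for (int i = 0; i < static_cast<int>(engines.size()); ++i) {
                if (used[i]) continue;
                int gain = 0;
                for (int page : engines[i])
                    gain += uncovered.count(page);
                if (gain > bestGain) { bestGain = gain; best = i; }
            }
            if (best < 0) break;               // no engine adds coverage
            used[best] = true;
            chosen.push_back(best);
            for (int page : engines[best]) uncovered.erase(page);
        }
        return chosen;
    }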

21
Check and Remove Algorithm
  • A less direct approach: adds sets based on
    uncovered elements.
  • Remove phase makes a single pass at removing any
    redundant sets.
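  • A sketch of the check-and-remove idea as read from
    this slide: the check phase keeps any engine that
    contributes an uncovered page, and a single remove
    pass then drops engines whose pages are all covered
    elsewhere. Details may differ from the
    presentation's implementation.

    #include <map>
    #include <set>
    #include <vector>

    std::vector<int> checkAndRemove(const std::vector<std::set<int>>& engines) {
        // Check phase: keep an engine if it adds at least one new page.
        std::set<int> covered;
        std::vector<int> kept;
        for (int i = 0; i < static_cast<int>(engines.size()); ++i) {
            bool contributes = false;
            for (int page : engines[i])
                if (!covered.count(page)) { contributes = true; break; }
            if (contributes) {
                kept.push_back(i);
                covered.insert(engines[i].begin(), engines[i].end());
            }
        }

        // Remove phase: one pass dropping engines whose every page is
        // also held by another kept engine.
        std::map<int, int> copies;             // kept engines per page
        for (int i : kept)
            for (int page : engines[i]) ++copies[page];

        std::vector<int> result;
        for (int i : kept) {
            bool redundant = true;
            for (int page : engines[i])
                if (copies[page] == 1) { redundant = false; break; }
            if (redundant)
                for (int page : engines[i]) --copies[page];
            else
                result.push_back(i);
        }
        return result;
    }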

22
Data Sets
  • The structures and algorithms were tested on
    moderately large to very large data sets.
  • The number of documents ranged from 100,000 to
    1,000,000.
  • The number of search engines was constant at
    1,000.
  • Distribution was uniform (all search engines
    contained the same number of documents).
  • Non-uniform sets were tested by Dr. Qi. It
    apparently worked or he would have let me know.

23
Baseline Results
  • Greedy
  • Min: 1,500 seconds
  • Max: 29,350 seconds (8h 9m 10s)
  • CAR
  • Min: 16.5 seconds
  • Max: 138.5 seconds

24
Updated (C++) Results
  • Greedy
  • Min: 4.5 sec.
  • Max: 19.25 sec.
  • CAR
  • Min: 1.0 sec.
  • Max: 7.75 sec.
  • Matrix (both algorithms)
  • Min: 0.20 sec.
  • Max: 0.40 sec.

25
Regions Impact (Pending)
  • The idea is to find and remove redundant web
    pages in an intermediary step between reading
    data and performing the algorithm.
  • Redundant web pages are determined based on the
    search engines that contain them.
  • Currently the process of removing these web pages
    takes more time than it saves.
  • This is not true for the baseline code as the
    run-time of the algorithms is significantly
    longer.
  • It has not yet been determined whether this can
    be improved.