1
Identifying Redundant Search Engines in
Metasearch Context
  • By Jamie McPeek

2
Overview
  • Background Information
  • Metasearch
  • Sets
  • Surface Web/Deep Web
  • The Problem
  • Application Goals

3
Metasearch
  • Metasearch is a way of querying any number of
    other search engines and combining their results
    to produce more precise data.
  • Metasearch engines began appearing on the web in
    the mid-1990s.
  • Many metasearch engines exist today but are
    generally overlooked due to Google's grasp on the
    search industry.
  • An example: www.info.com

4
Sets
  • A set is a collection of distinct objects with no
    repeated values.
  • For our purposes, a set consists of any number of
    web pages, with each set named after the search
    engine that returned its values.
  • When the results from all search engines are
    viewed together, the collection is no longer a
    set, as it may contain many repeated values.
  • Removing search engines to reduce or eliminate
    redundancy is an NP-complete problem.

5
Sets (Continued)
  • A cover is a collection of sets that together
    contain at least one of each element in the
    universe; in our case, at least one of each web
    page among all web pages returned.
  • Keeping the cover in mind, we have a goal:
  • Remove as many search engines as possible while
    maintaining a cover.
  • If possible, remove all redundant search engines.
    This creates a minimal cover (see the example
    below).
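  • A toy illustration (hypothetical data, not from
    the presentation): three engines return small sets
    of page IDs, and the snippet checks whether a
    chosen subset of engines still forms a cover.

    #include <iostream>
    #include <set>
    #include <vector>

    // Toy example: engine B's pages all appear in A or C,
    // so {A, C} is still a cover (here, a minimal one).
    int main() {
        std::vector<std::set<int>> engines = {
            {1, 2, 3},       // engine A
            {3, 4},          // engine B
            {2, 3, 4, 5}     // engine C
        };
        std::set<int> universe;
        for (const auto& e : engines)
            universe.insert(e.begin(), e.end());

        std::vector<int> chosen = {0, 2};   // keep A and C, drop B
        std::set<int> covered;
        for (int i : chosen)
            covered.insert(engines[i].begin(), engines[i].end());

        std::cout << (covered == universe ? "still a cover\n"
                                          : "not a cover\n");
    }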

6
Surface Web/Deep Web
  • The surface web is any content that is publicly
    available and directly accessible on the internet.
  • The deep web is all other content, for example:
  • Data accessible only through an on-site query
    system.
  • Data generated on-the-fly.
  • Frequently updated or changed content.
  • Most conventional search engines are incapable of
    retrieving data from the deep web, or capture it
    in only one state.

7
Surface Web/Deep Web (Cont.)
  • If various search engines capture different
    states, compiling their results can provide a
    clearer picture of the actual content.
  • Specialized and site-only search systems can
    almost always be generalized to allow remote
    searching.
  • With the above in mind, metasearch becomes an
    intriguing way to view not only the surface web,
    but the deep web as well.

8
The Problem
  • A finite and known number of search engines each
    return a set of a finite and known number of web
    pages.
  • Across all search engines, there may be
    redundancy. The idea is to remove as many
    unnecessary search engines from the meta set as
    possible while leaving a complete cover of the
    web pages.
  • Accuracy (relative to the true minimal cover) and
    speed are the most important aspects.

9
Application Goals
  • Using two different languages, we want to:
  • Compare the accuracy and speed of two different
    algorithms.
  • Compare different structures for the data based
    on the same algorithm.
  • Assess the impact of regions on overall time.
  • Regions are a way of grouping elements based on
    which search engines they appear in. At most one
    element in a region is necessary; all others are
    fully redundant (see the sketch below).
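  • A minimal sketch of the region idea, assuming each
    document's region is identified by a bitmask of the
    engines containing it (names are illustrative, and
    a single 64-bit mask is used here for brevity; the
    real code uses a multi-word bitmap).

    #include <cstdint>
    #include <map>
    #include <vector>

    // Group documents into regions keyed by which engines contain
    // them. Documents sharing an identical engine-membership mask
    // are mutually redundant: at most one per region must be kept.
    std::map<std::uint64_t, std::vector<int>>
    buildRegions(const std::vector<std::uint64_t>& docEngineMask) {
        std::map<std::uint64_t, std::vector<int>> regions;
        for (int doc = 0; doc < static_cast<int>(docEngineMask.size()); ++doc)
            regions[docEngineMask[doc]].push_back(doc);
        return regions;   // keep one representative per region
    }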

10
Overview
  • Source Code
  • Original Setup
  • Key Structures - C
  • Key Structures - C++
  • Procedure
  • Reasons For Changing

11
Original Setup
  • System: UW-Platteville's IO System
  • Language: C
  • Minor work was done on this code, as it was
    already written.
  • Used as a baseline for improvements.
  • Managed using Subversion to allow rollbacks and
    check-ins.

12
Key Structures - C
  • Structure for storing the sets.
  • Each web page is mapped to a specific bit.
  • Bitwise operators are used for accessing/editing
    a specific bit.
  • list[index] |= (1 << (bit % (sizeof(unsigned int) * 8)))
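  • A minimal sketch of such a bitmap, with assumed
    field and function names (the transcript does not
    show the actual structure):

    #include <climits>

    // One bit per web page, packed into unsigned ints.
    struct Bitmap {
        unsigned int* words;   // packed bits
        int           nBits;   // number of pages represented
    };

    static const int BITS_PER_WORD = sizeof(unsigned int) * CHAR_BIT;

    // Mark a page as present.
    void setBit(Bitmap* bm, int bit) {
        bm->words[bit / BITS_PER_WORD] |= 1u << (bit % BITS_PER_WORD);
    }

    // Test whether a page is present.
    int testBit(const Bitmap* bm, int bit) {
        return (bm->words[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1u;
    }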

13
Key Structures - C (Cont.)
  • Structure for storing web pages.
  • Stored using a tree for faster insertion.
  • BITMAP in this instance stores the specific
    search engines that the document exists in.
  • nID allows reference back to the specific web
    page without carrying the URL string around the
    entire time.
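  • A sketch of what such a tree node might look like,
    building on the Bitmap sketch above (field names
    other than BITMAP and nID are assumptions):

    // Web-page node stored in a balanced tree, keyed on the URL.
    // The bitmap records which search engines returned the page;
    // nID is a compact integer used in place of the URL later on.
    struct PageNode {
        char*     url;      // full URL, kept only in the tree
        int       nID;      // compact identifier used by the algorithms
        Bitmap    engines;  // BITMAP of engines containing this page
        PageNode* left;
        PageNode* right;
    };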

14
Key Structures - C++
  • The bitmap structure as changed for C++.
  • Added some variables to reduce fixed
    calculations.
  • Added variables to hold new data available when
    reading in the web pages.
  • Converted to a class (OOP).
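  • A sketch of the C++ version as a class, with fixed
    calculations such as the word count cached once
    (member names are illustrative):

    #include <climits>
    #include <vector>

    class BitmapSet {
    public:
        explicit BitmapSet(int nBits)
            : nBits_(nBits),
              nWords_((nBits + kBitsPerWord - 1) / kBitsPerWord),
              words_(nWords_, 0u) {}

        // Inline accessors avoid function-call overhead.
        inline void set(int bit)
            { words_[bit / kBitsPerWord] |= 1u << (bit % kBitsPerWord); }
        inline bool test(int bit) const
            { return (words_[bit / kBitsPerWord] >> (bit % kBitsPerWord)) & 1u; }

    private:
        static const int kBitsPerWord = sizeof(unsigned int) * CHAR_BIT;
        int nBits_;
        int nWords_;                      // cached instead of recomputed
        std::vector<unsigned int> words_;
    };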

15
Key Structures - C++ (Cont.)
  • A new structure implemented for C++.
  • A two-dimensional grid of these nodes implemented
    as a linked list.
  • Eliminates empty bits.
  • The structure is self-destructive in use.
  • Access is coordinate based, matrix-style (i, j).
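  • A sketch of one node in such a grid, assuming each
    node is linked to its neighbors in both directions
    so that empty bits are never stored (names are
    assumptions):

    // One node per (search engine, document) pair that actually
    // occurs; row i and column j are matrix-style coordinates.
    // Nodes are unlinked (consumed) as the algorithm covers them,
    // which is why the structure is "self-destructive in use".
    struct GridNode {
        int       i, j;      // matrix coordinates
        GridNode* right;     // next node in the same row
        GridNode* down;      // next node in the same column
    };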

16
Procedure
  • Read web pages in from file.
  • Number of search engines.
  • Number of documents per search engine.
  • Store each incoming web page as a node in a
    balanced tree.
  • Total number of web pages.
  • Total number of unique web pages.
  • Set up whichever structure is to be used, based
    on the numbers learned from reading in and
    storing the web pages.

17
Procedure (Continued)
  • Populate the structures based on the data
    available in the tree.
  • This can be the original tree or the region tree.
  • The bitmap structure is stored in two ways (see
    the sketch after this slide):
  • Search-engine major: used in the original C code.
  • Document major: used in the new C++ code.
  • Run one of the two algorithms over the structure
    and document the results.
  • Cover size and amount of time taken.
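  • A sketch of the two layouts mentioned above,
    reusing the BitmapSet sketch from earlier (names
    are illustrative; which axis is "major" determines
    what each bitmap row represents):

    #include <vector>

    // Search-engine major: one bitmap per engine, one bit per
    //   document; row e answers "which documents does engine e
    //   return?" (original C code).
    // Document major: one bitmap per document, one bit per engine;
    //   row d answers "which engines return document d?"
    //   (new C++ code).
    struct Layouts {
        std::vector<BitmapSet> engineRows;
        std::vector<BitmapSet> docRows;
    };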

18
Reasons For Changing
  • Personal preference. I'm significantly more
    familiar with C++ than I am with C.
  • Additional compiler options for the language.
  • OOP.
  • Additional language features.
  • Inline functions.
  • Operator overloading.
  • More readable code.

19
Overview
  • Algorithms
  • Greedy Algorithm
  • Check and Remove (CAR) Algorithm
  • Results
  • Data Sets
  • Baseline Results
  • Updated (C++) Results
  • Regions
  • Impact (Pending)

20
Greedy Algorithm
  • Straightforward, brute force.
  • Add the largest set, then the next largest, and
    so on.
  • Easily translated to code.
  • Makes no provision for removing redundant sets
    after reaching a cover set.
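  • A minimal sketch of the greedy heuristic as
    described above (repeatedly take the engine
    covering the most still-uncovered pages); this is
    a generic version, not the presentation's exact
    code.

    #include <set>
    #include <vector>

    // Greedy set cover: keep adding the engine whose result set
    // covers the most still-uncovered pages until everything is
    // covered. Nothing is removed afterwards, even if an earlier
    // choice became redundant.
    std::vector<int> greedyCover(const std::vector<std::set<int>>& engines) {
        std::set<int> uncovered;
        for (const auto& e : engines)
            uncovered.insert(e.begin(), e.end());

        std::vector<int> chosen;
        std::vector<bool> used(engines.size(), false);
        while (!uncovered.empty()) {
            int best = -1, bestGain = 0;
            for (int i = 0; i < static_cast<int>(engines.size()); ++i) {
                if (used[i]) continue;
                int gain = 0;
                for (int page : engines[i])
                    gain += uncovered.count(page);
                if (gain > bestGain) { bestGain = gain; best = i; }
            }
            if (best < 0) break;               // no engine adds coverage
            used[best] = true;
            chosen.push_back(best);
            for (int page : engines[best]) uncovered.erase(page);
        }
        return chosen;
    }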

21
Check and Remove Algorithm
  • A less direct approach: adds sets based on
    uncovered elements.
  • Remove phase makes a single pass at removing any
    redundant sets.
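  • A sketch of the check-and-remove idea as read from
    this slide: the check phase keeps any engine that
    contributes an uncovered page, and a single remove
    pass then drops engines whose pages are all covered
    elsewhere. Details may differ from the
    presentation's implementation.

    #include <map>
    #include <set>
    #include <vector>

    std::vector<int> checkAndRemove(const std::vector<std::set<int>>& engines) {
        // Check phase: keep an engine if it adds at least one new page.
        std::set<int> covered;
        std::vector<int> kept;
        for (int i = 0; i < static_cast<int>(engines.size()); ++i) {
            bool contributes = false;
            for (int page : engines[i])
                if (!covered.count(page)) { contributes = true; break; }
            if (contributes) {
                kept.push_back(i);
                covered.insert(engines[i].begin(), engines[i].end());
            }
        }

        // Remove phase: one pass dropping engines whose every page is
        // also held by another kept engine.
        std::map<int, int> copies;             // kept engines per page
        for (int i : kept)
            for (int page : engines[i]) ++copies[page];

        std::vector<int> result;
        for (int i : kept) {
            bool redundant = true;
            for (int page : engines[i])
                if (copies[page] == 1) { redundant = false; break; }
            if (redundant)
                for (int page : engines[i]) --copies[page];
            else
                result.push_back(i);
        }
        return result;
    }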

22
Data Sets
  • The structures and algorithms were tested on
    moderately large to very large data sets.
  • The number of documents ranged from 100,000 to
    1,000,000.
  • The number of search engines was constant at
    1,000.
  • Distribution was uniform (all search engines
    contained the same number of documents).
  • Non-uniform sets were tested by Dr. Qi. It
    apparently worked or he would have let me know.

23
Baseline Results
  • Greedy
  • Min: 1,500 seconds
  • Max: 29,350 seconds (8h 9m 10s)
  • CAR
  • Min: 16.5 seconds
  • Max: 138.5 seconds

24
Updated (C++) Results
  • Greedy
  • Min: 4.5 sec.
  • Max: 19.25 sec.
  • CAR
  • Min: 1.0 sec.
  • Max: 7.75 sec.
  • Matrix (both algorithms)
  • Min: 0.20 sec.
  • Max: 0.40 sec.

25
Regions Impact (Pending)
  • The idea is to find and remove redundant web
    pages in an intermediary step between reading
    data and performing the algorithm.
  • Redundant web pages are determined based on the
    search engines that contain them.
  • Currently the process of removing these web pages
    takes more time than it saves.
  • This is not true for the baseline code as the
    run-time of the algorithms is significantly
    longer.
  • It has not yet been determined whether this can
    be improved.