1
Web Crawlers and Web Indexes
  • ITEC 4020 3.0M Assignment 2
  • Implementing a Web Crawler and Building a Web
    Index
  • Group 11
  • Irina, Punit, Kevin, Mehul, Megha

2
Contents
  • Objective
  • Description of the Web Crawler and Web Index
  • Design of Project
  • Analysis of Problems and Solutions
  • General Scenario

3
The Main Objective
  • Producing an effective web crawler that can
    explore a website, finding documents and their
    contents
  • Producing an efficient web indexer that can
    process all retrieved documents, sorting them
    into alphabetical order

4
Description of the Program
  • Consists of 7 Java files that coordinate with one
    another to search a website and return sorted
    information from that site in text files.
  • Once Start.java is initiated, the entire program
    commences.
  • The WordMiner.java file produces a data.txt file,
    which contains the unsorted data.
  • The WebCrawler.java file produces URLInfo.txt.
  • The WebIndexer.java file produces searchFile.txt.

5
Description of Program Diagram
6
Design of the Program
  • The Start.java file initiates the entire process.
  • The web crawler process executes first; it is
    split into two Java files: WebCrawler.java and
    WordMiner.java.
  • The WebCrawler.java file searches a particular
    website, specified by the user, for links and
    stores them in a visited-URL vector.
  • Once that completes, the WordMiner.java file scans
    each page for key words using the StringTokenizer
    class and eliminates all HTML and JavaScript code
    (see the sketch after this list).
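A minimal, self-contained sketch of the two steps just described: collecting links into a visited-URL Vector and tokenizing the remaining page text with StringTokenizer. The regex-based tag stripping, the class name, and the hard-coded sample page are assumptions for illustration, not the assignment's actual code.

```java
import java.util.StringTokenizer;
import java.util.Vector;

public class CrawlAndMineSketch {
    public static void main(String[] args) {
        // WebCrawler step (assumed): links found on a page go into a visited-URL vector.
        Vector<String> visitedUrls = new Vector<>();
        String page = "<html><body><a href=\"http://example.com/a.html\">Answers</a>"
                    + "<script>var x = 1;</script><p>Antarctica anthropology</p></body></html>";
        int pos = 0;
        while ((pos = page.indexOf("href=\"", pos)) != -1) {
            int end = page.indexOf('"', pos + 6);
            String link = page.substring(pos + 6, end);
            if (!visitedUrls.contains(link)) {
                visitedUrls.add(link);
            }
            pos = end;
        }

        // WordMiner step (assumed): drop script blocks and HTML tags, then
        // tokenize the remaining text into key words with StringTokenizer.
        String text = page.replaceAll("(?s)<script.*?</script>", " ")
                          .replaceAll("<[^>]*>", " ");
        StringTokenizer tokens = new StringTokenizer(text, " \t\n\r\f.,;:!?\"'()");
        while (tokens.hasMoreTokens()) {
            System.out.println(tokens.nextToken().toLowerCase());
        }
        System.out.println("Visited URLs: " + visitedUrls);
    }
}
```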

7
Design of the Program Continued
  • All of this data is stored in a text file called
    data.txt.
  • The WebIndexer.java file then indexes the data
    from data.txt using the StringTokenizer class,
    assigning a position ID and a URL ID to each
    piece of data.
  • Once each piece of data has been assigned its IDs,
    all data is stored in searchFile.txt, where it is
    also alphabetized (a sketch of this step follows).
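A minimal sketch of that indexing step. The data.txt layout assumed here (one line per page: the page URL followed by its words) and the output format (each word followed by URL-ID/position-ID pairs) are guesses based on the searchFile.txt example on the next slide, not the assignment's actual file formats. A TreeMap is used so the output comes out already alphabetized.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WebIndexerSketch {
    public static void main(String[] args) throws IOException {
        // TreeMap keeps its keys sorted, so writing it out yields
        // the alphabetized searchFile.txt directly.
        Map<String, List<int[]>> index = new TreeMap<>();

        try (BufferedReader in = new BufferedReader(new FileReader("data.txt"))) {
            String line;
            int urlId = 0;
            while ((line = in.readLine()) != null) {
                urlId++;
                StringTokenizer tokens = new StringTokenizer(line);
                if (!tokens.hasMoreTokens()) continue;  // skip empty lines
                tokens.nextToken();                     // assumed: first token is the page URL
                int positionId = 0;
                while (tokens.hasMoreTokens()) {
                    positionId++;
                    String word = tokens.nextToken().toLowerCase();
                    index.computeIfAbsent(word, k -> new ArrayList<>())
                         .add(new int[] { urlId, positionId });
                }
            }
        }

        try (PrintWriter out = new PrintWriter("searchFile.txt")) {
            for (Map.Entry<String, List<int[]>> entry : index.entrySet()) {
                StringBuilder row = new StringBuilder(entry.getKey());
                for (int[] ids : entry.getValue()) {
                    row.append(' ').append(ids[0]).append(' ').append(ids[1]);
                }
                out.println(row);
            }
        }
    }
}
```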

8
Example of searchFile.txt
  • answers 26 411 96 18 134 308
  • antarctica 140 123
  • anthropology 147 160
  • anti-barns 197 184
  • anti-spam 166 70
  • antigua 140 124

9
Problems and Solutions
  • Problem 1
  • Sorting searchFile.txt in an organized fashion
    was a concern for the team.
  • Solution
  • The team concluded that an alphabetical sort
    would be the most efficient way to sort the file.

10
Problems and Solutions Cont
  • Problem 2
  • When the data.txt file only had one line, the
    program would crash
  • Solution
  • A boolean check was added that tests whether
    another token exists and proceeds only if it does
    (see the sketch below).
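A small sketch of that guard: call hasMoreTokens() before nextToken(), so a one-line (or empty) data.txt no longer causes a crash. The surrounding class and sample input are illustrative only.

```java
import java.util.StringTokenizer;

public class TokenGuardSketch {
    public static void main(String[] args) {
        String singleLine = "anthropology";      // e.g. a one-line data.txt
        StringTokenizer tokens = new StringTokenizer(singleLine);

        String first = tokens.nextToken();       // always present in this example
        // The added boolean check: only read the next token if one exists.
        if (tokens.hasMoreTokens()) {
            System.out.println(first + " " + tokens.nextToken());
        } else {
            System.out.println(first + " (no further tokens)");
        }
    }
}
```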

11
Problems and Solutions Cont
  • Problem 3
  • When the search was initiated, the program
    originally searched for all the available links
    that were directly connected to the initial
    search page
  • Solution
  • This problem was solved by modifying the code so
    that the program would only search those links
    that are within the target website (a sketch of
    the check follows).
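A minimal sketch of one way that check could work: compare each link's host with the host of the starting page and skip links that point outside the target website. The URL-based comparison here is an assumption about how the restriction might be implemented.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SameSiteFilterSketch {
    public static void main(String[] args) throws MalformedURLException {
        URL start = new URL("http://example.com/index.html");
        String[] foundLinks = {
            "http://example.com/about.html",    // same site: crawl it
            "http://other-site.org/page.html"   // different site: skip it
        };
        for (String link : foundLinks) {
            URL url = new URL(link);
            boolean withinSite = url.getHost().equalsIgnoreCase(start.getHost());
            System.out.println(link + " -> " + (withinSite ? "crawl" : "skip"));
        }
    }
}
```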

12
Program Improvements
  • Search time and index time were added to the
    search file text file.
  • It is now possible to see how long the web
    crawler and web indexer take to generate the
    appropriate text files (a timing sketch follows).
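A short sketch of the timing pattern, assuming the timings are measured with System.currentTimeMillis() and appended to the search file; the placeholder sleeps stand in for the real crawl and index steps, and the output labels are illustrative.

```java
import java.io.FileWriter;
import java.io.IOException;

public class TimingSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        long crawlStart = System.currentTimeMillis();
        Thread.sleep(50);                        // placeholder for the crawl step
        long crawlTime = System.currentTimeMillis() - crawlStart;

        long indexStart = System.currentTimeMillis();
        Thread.sleep(30);                        // placeholder for the index step
        long indexTime = System.currentTimeMillis() - indexStart;

        // Append the timings to searchFile.txt (true = append mode).
        try (FileWriter out = new FileWriter("searchFile.txt", true)) {
            out.write("search time (ms): " + crawlTime + "\n");
            out.write("index time (ms): " + indexTime + "\n");
        }
    }
}
```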

13
Web Crawlers and Web Indexes
  • Thank You!