1
Web Crawlers and Web Indexes
  • ITEC 4020 3.0M Assignment 2
  • Implementing a Web Crawler and Building a Web
    Index
  • Group 11
  • Irina, Punit, Kevin, Mehul, Megha

2
Contents
  • Objective
  • Description of the Web Crawler and Web Index
  • Design of Project
  • Analysis of Problems and Solutions
  • General Scenario

3
The Main Objective
  • Producing an effective web crawler that can
    explore a website, finding documents and their
    contents
  • Producing an efficient web indexer that can
    process all retrieved documents, sorting them
    into alphabetical order

4
Description of the Program
  • Consists of 7 Java files that coordinate with one
    another to search a website and return sorted
    information from that site in text files.
  • Once Start.java is initiated, the entire program
    commences.
  • The WordMiner.java file produces a data.txt file,
    which contains the unsorted data.
  • The WebCrawler.java file produces URLInfo.txt.
  • The WebIndexer.java file produces searchFile.txt.

5
Description of Program Diagram
6
Design of the Program
  • The Start.java file initiates the entire process.
  • The web crawler process executes first; it is
    split into two Java files: WebCrawler.java and
    WordMiner.java.
  • The WebCrawler.java file searches a particular
    website, specified by the user, for links and
    stores them in a visited-URL vector.
  • Once that completes, the WordMiner.java file scans
    each page for key words using the StringTokenizer
    class and eliminates all HTML and JavaScript code
    (see the sketch after this list).
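A minimal, self-contained sketch of the two steps just described: collecting links into a visited-URL Vector and tokenizing the remaining page text with StringTokenizer. The regex-based tag stripping, the class name, and the hard-coded sample page are assumptions for illustration, not the assignment's actual code.

```java
import java.util.StringTokenizer;
import java.util.Vector;

public class CrawlAndMineSketch {
    public static void main(String[] args) {
        // WebCrawler step (assumed): links found on a page go into a visited-URL vector.
        Vector<String> visitedUrls = new Vector<>();
        String page = "<html><body><a href=\"http://example.com/a.html\">Answers</a>"
                    + "<script>var x = 1;</script><p>Antarctica anthropology</p></body></html>";
        int pos = 0;
        while ((pos = page.indexOf("href=\"", pos)) != -1) {
            int end = page.indexOf('"', pos + 6);
            String link = page.substring(pos + 6, end);
            if (!visitedUrls.contains(link)) {
                visitedUrls.add(link);
            }
            pos = end;
        }

        // WordMiner step (assumed): drop script blocks and HTML tags, then
        // tokenize the remaining text into key words with StringTokenizer.
        String text = page.replaceAll("(?s)<script.*?</script>", " ")
                          .replaceAll("<[^>]*>", " ");
        StringTokenizer tokens = new StringTokenizer(text, " \t\n\r\f.,;:!?\"'()");
        while (tokens.hasMoreTokens()) {
            System.out.println(tokens.nextToken().toLowerCase());
        }
        System.out.println("Visited URLs: " + visitedUrls);
    }
}
```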

7
Design of the Program Continued
  • All of this data is stored in a text file called
    data.txt.
  • The WebIndexer.java file then indexes the data
    from data.txt using the StringTokenizer class,
    assigning a position ID and a URL ID to each
    piece of data.
  • Once each piece of data has been assigned its IDs,
    all data is stored in searchFile.txt, where it is
    also alphabetized (a sketch of this step follows).
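A minimal sketch of that indexing step. The data.txt layout assumed here (one line per page: the page URL followed by its words) and the output format (each word followed by URL-ID/position-ID pairs) are guesses based on the searchFile.txt example on the next slide, not the assignment's actual file formats. A TreeMap is used so the output comes out already alphabetized.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WebIndexerSketch {
    public static void main(String[] args) throws IOException {
        // TreeMap keeps its keys sorted, so writing it out yields
        // the alphabetized searchFile.txt directly.
        Map<String, List<int[]>> index = new TreeMap<>();

        try (BufferedReader in = new BufferedReader(new FileReader("data.txt"))) {
            String line;
            int urlId = 0;
            while ((line = in.readLine()) != null) {
                urlId++;
                StringTokenizer tokens = new StringTokenizer(line);
                if (!tokens.hasMoreTokens()) continue;  // skip empty lines
                tokens.nextToken();                     // assumed: first token is the page URL
                int positionId = 0;
                while (tokens.hasMoreTokens()) {
                    positionId++;
                    String word = tokens.nextToken().toLowerCase();
                    index.computeIfAbsent(word, k -> new ArrayList<>())
                         .add(new int[] { urlId, positionId });
                }
            }
        }

        try (PrintWriter out = new PrintWriter("searchFile.txt")) {
            for (Map.Entry<String, List<int[]>> entry : index.entrySet()) {
                StringBuilder row = new StringBuilder(entry.getKey());
                for (int[] ids : entry.getValue()) {
                    row.append(' ').append(ids[0]).append(' ').append(ids[1]);
                }
                out.println(row);
            }
        }
    }
}
```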

8
Example of searchFile.txt
  • answers 26 411 96 18 134 308
  • antarctica 140 123
  • anthropology 147 160
  • anti-barns 197 184
  • anti-spam 166 70
  • antigua 140 124

9
Problems and Solutions
  • Problem 1
  • Sorting searchFile.txt in an organized fashion
    was a concern for the team.
  • Solution
  • The team concluded that an alphabetical sort
    would be the most efficient way to sort the file.

10
Problems and Solutions Cont
  • Problem 2
  • When the data.txt file only had one line, the
    program would crash
  • Solution
  • A boolean check was added that tests whether
    another token exists and proceeds only if it does
    (see the sketch below).
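A small sketch of that guard: call hasMoreTokens() before nextToken(), so a one-line (or empty) data.txt no longer causes a crash. The surrounding class and sample input are illustrative only.

```java
import java.util.StringTokenizer;

public class TokenGuardSketch {
    public static void main(String[] args) {
        String singleLine = "anthropology";      // e.g. a one-line data.txt
        StringTokenizer tokens = new StringTokenizer(singleLine);

        String first = tokens.nextToken();       // always present in this example
        // The added boolean check: only read the next token if one exists.
        if (tokens.hasMoreTokens()) {
            System.out.println(first + " " + tokens.nextToken());
        } else {
            System.out.println(first + " (no further tokens)");
        }
    }
}
```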

11
Problems and Solutions Cont
  • Problem 3
  • When the search was initiated, the program
    originally searched for all the available links
    that were directly connected to the initial
    search page
  • Solution
  • This problem was solved by modifying the code so
    that the program would only search those links
    that are within the target website (a sketch of
    the check follows).
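A minimal sketch of one way that check could work: compare each link's host with the host of the starting page and skip links that point outside the target website. The URL-based comparison here is an assumption about how the restriction might be implemented.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SameSiteFilterSketch {
    public static void main(String[] args) throws MalformedURLException {
        URL start = new URL("http://example.com/index.html");
        String[] foundLinks = {
            "http://example.com/about.html",    // same site: crawl it
            "http://other-site.org/page.html"   // different site: skip it
        };
        for (String link : foundLinks) {
            URL url = new URL(link);
            boolean withinSite = url.getHost().equalsIgnoreCase(start.getHost());
            System.out.println(link + " -> " + (withinSite ? "crawl" : "skip"));
        }
    }
}
```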

12
Program Improvements
  • Search time and index time were added to the
    search file text file.
  • It is now possible to see how long the web
    crawler and web indexer take to generate the
    appropriate text files (a timing sketch follows).
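A short sketch of the timing pattern, assuming the timings are measured with System.currentTimeMillis() and appended to the search file; the placeholder sleeps stand in for the real crawl and index steps, and the output labels are illustrative.

```java
import java.io.FileWriter;
import java.io.IOException;

public class TimingSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        long crawlStart = System.currentTimeMillis();
        Thread.sleep(50);                        // placeholder for the crawl step
        long crawlTime = System.currentTimeMillis() - crawlStart;

        long indexStart = System.currentTimeMillis();
        Thread.sleep(30);                        // placeholder for the index step
        long indexTime = System.currentTimeMillis() - indexStart;

        // Append the timings to searchFile.txt (true = append mode).
        try (FileWriter out = new FileWriter("searchFile.txt", true)) {
            out.write("search time (ms): " + crawlTime + "\n");
            out.write("index time (ms): " + indexTime + "\n");
        }
    }
}
```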

13
Web Crawlers and Web Indexes
  • Thank You!