OUR GROUP MEMBERS:

Slides: 17

Provided by: YOR46

1
Implementing a Prototype of XML Software Tool for
Displaying and Searching Genomics Documents

OUR GROUP MEMBERS
Farhan Khalid
Vladimir Kossoi
Mani Yasrebi
David Feng
Mauricio Franco
Surjeet Roopani

2
INTRODUCTION

The objective of this assignment is to build a web
information system for displaying and searching
XML documents.
The implementation of this project is divided
into three parts:
Creation of a two-level indexer
Searching
Creation of an XSL stylesheet for presenting
XML documents

3
OUR SEARCHING PAGE

The main page is located at
http://unix.aml.yorku.ca:8080/w04_g4/Search.html

4
DISPLAYING GENOMICS DOCUMENTS (This page is the
Java class SearchServlet)

5
DISPLAYING GENOMICS DOCUMENTS (This page is the
Java class SearchServlet)

This page displays a number of important things
to the user:
Number of documents found that contain the
specified term
Number of times the term was found in all the
documents
Time (in seconds) it took to search for the term
Article titles

6
Formatted XML, produced by using the XSL stylesheet
7
INDEXER ARCHITECTURE
[Diagram. Labels: MainApp, Parse files, Sorted
collection, Create index. Files: Positions.txt,
Dictionary.txt, ArticleTitles.txt]
8
General Design

There are two parts to the creation of the indexer:
Parsing all documents
Writing to file and creating a lexicon
Parsing involves:
1) 1139 files are created.
2) All files are parsed using DOM architecture.
Article titles are extracted and saved in a file
using object serialization.
An ArrayList contains a collection of all terms,
documents, and positions.
The ArrayList is sorted using
Collections.sort(List l).
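
The parsing steps above can be sketched roughly as follows. This is a minimal illustration, not the group's code: the class name ParseSketch, the Occurrence record, the element name ArticleTitle, the whitespace-only tokenizing and the 0-based positions are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class; none of these names come from the slides.
class ParseSketch {
    /** One occurrence of a term: the document it is in and its word position. */
    static final class Occurrence implements Comparable<Occurrence> {
        final String term;
        final String doc;
        final int position;

        Occurrence(String term, String doc, int position) {
            this.term = term;
            this.doc = doc;
            this.position = position;
        }

        // Order by term, then document, then position, so that equal terms
        // end up next to each other after sorting.
        @Override
        public int compareTo(Occurrence other) {
            int c = term.compareTo(other.term);
            if (c == 0) c = doc.compareTo(other.doc);
            if (c == 0) c = Integer.compare(position, other.position);
            return c;
        }
    }

    /** Parses XML text into a DOM tree. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Text of the first element with the given tag, or "" if there is none. */
    static String firstText(Document dom, String tag) {
        NodeList nodes = dom.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Walks the tree and records every whitespace-separated token (assumed rule). */
    static void collect(Node node, String doc, List<Occurrence> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            for (String token : node.getNodeValue().split("\\s+")) {
                // out holds one document only, so its size is the next position.
                if (!token.isEmpty()) out.add(new Occurrence(token, doc, out.size()));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, doc, out);
        }
    }

    /** All occurrences in one document, sorted with Collections.sort. */
    static List<Occurrence> sortedOccurrences(Document dom, String doc) {
        List<Occurrence> out = new ArrayList<>();
        collect(dom.getDocumentElement(), doc, out);
        Collections.sort(out);
        return out;
    }
}
```

Sorting by term first is what makes it cheap to merge duplicate terms in a single pass afterwards.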

9
General Design: Second part

Write to file and create a lexicon:
Merge all duplicate terms.
Write every term record to file in binary format.
Store every term in a TreeMap as a Lexicon object.
The connection (pointer) between the lexicon and
Postings.txt is the start and end byte of the
term's record in Postings.txt.
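
The link between the lexicon and the postings file can be sketched as below. This is only an illustration under assumptions: LexiconSketch and Pointer are made-up names, an in-memory buffer stands in for Postings.txt, and the end byte is treated as exclusive, which the slides do not specify.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;

// Hypothetical names; an in-memory buffer stands in for Postings.txt.
class LexiconSketch {
    /** Start (inclusive) and end (exclusive, assumed) byte of one term record. */
    static final class Pointer {
        final long start;
        final long end;

        Pointer(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    final TreeMap<String, Pointer> lexicon = new TreeMap<>();
    final ByteArrayOutputStream postings = new ByteArrayOutputStream();

    /** Appends one merged term record and remembers where it lies. */
    void append(String term, String record) {
        byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
        long start = postings.size();
        postings.write(bytes, 0, bytes.length);
        lexicon.put(term, new Pointer(start, start + bytes.length));
    }

    /** Returns the record for a term, or null if the term is not indexed. */
    String lookup(String term) {
        Pointer p = lexicon.get(term);
        if (p == null) return null;
        byte[] all = postings.toByteArray();
        return new String(all, (int) p.start, (int) (p.end - p.start), StandardCharsets.UTF_8);
    }
}
```

Because only the two byte offsets are kept per term, the in-memory level stays small while the records themselves stay on disk.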

10
General Design: Second part

LEXICON

POSTINGS.TXT
<1,465.xml(1:85)>
<27,1129.xml(1:156)230.xml(2:110,197)437.xml(16:173,8,12,13,3,22,9,3,21,19,10,9,13,3,20,10)446.xml(7:202,19,2,3,14,14,15)705.xml(1:145)>
<534,1.xml(3:218,12,19)1001.xml(1:316)
11
Lexicon

Our lexicon consists of 359,304 terms. A term was
considered to be a sequence of any characters
except for the following:
" \t\n\r\f,'() and a .
These characters were used as delimiters.
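
A small sketch of splitting text into terms with such a delimiter set is shown here. The class name TermTokenizer is made up, and the exact delimiter string is an assumption read off the slide (space, tab, newline, carriage return, form feed, comma, apostrophe, parentheses and period); the slide's list may have lost characters in transcription.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper; the delimiter set is an assumption based on the slide.
class TermTokenizer {
    static final String DELIMITERS = " \t\n\r\f,'().";

    /** Splits text into terms; delimiter characters are dropped. */
    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, DELIMITERS);
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken());
        }
        return out;
    }
}
```

With this set, a string such as "BRCA1 (breast cancer), type 1." is split into BRCA1, breast, cancer, type and 1.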

12
INDEXER

First level:
In memory, a collection of all terms stored in a
TreeMap object.
The key is the term.
The values are the start and end byte in
Postings.txt.
Second level:
Postings.txt
<1,465.xml(1:85)>
<27,1129.xml(1:156)230.xml(2:110,197)437.xml(16:173,8,12,13,3,22,9,3,21,19,10,9,13,3,20,10)>
Record format:
<TotalFreqTerm,DocName(TermFreq:pos1,...,posN)>
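
As an illustration of how one such record could be read back, here is a minimal parser sketch. It assumes the record layout is <TotalFreqTerm,DocName(TermFreq:pos1,...,posN)...>, where the angle brackets and the colon are an assumption about the original slide; the class name PostingsRecord is also made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the assumed record layout.
class PostingsRecord {
    // DocName, then "(", TermFreq, ":", a comma-separated position list, ")".
    private static final Pattern ENTRY =
            Pattern.compile("([^()<>,]+)\\((\\d+):([\\d,]+)\\)");

    /** Maps each document name in the record to its stored positions. */
    static Map<String, int[]> parse(String record) {
        Map<String, int[]> out = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(record);
        while (m.find()) {
            String[] parts = m.group(3).split(",");
            int termFreq = Integer.parseInt(m.group(2));
            if (parts.length != termFreq) {
                throw new IllegalArgumentException("frequency mismatch in " + m.group(1));
            }
            int[] positions = new int[parts.length];
            for (int i = 0; i < parts.length; i++) {
                positions[i] = Integer.parseInt(parts[i]);
            }
            out.put(m.group(1), positions);
        }
        return out;
    }
}
```

The per-document frequency doubles as a consistency check, since it must equal the number of positions listed.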

13
Compression

Store differences of positions.
For each term, positions are compressed within
each document.
E.g. 100,102,105,110.
After compression: 100,2,3,5
In Postings.txt:
<1,465.xml(1:85)>
<27,1129.xml(1:156)230.xml(2:110,197)437.xml(16:173,8,12,13,3,22,9,3,21,19,10,9,13,3,20,10)
For efficiency reasons positions are compressed
only if there are more than 2 positions.
Without compression our Postings.txt file is
3.38 MB (3,546,244 bytes).
With compression our Postings.txt file is 3.25
MB (3,408,825 bytes).
We have saved 137,419 bytes.
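
The difference scheme can be sketched in a few lines. GapCodec is a made-up name; the rule that lists of one or two positions are stored unchanged follows the slide, and the decoder applies the same rule based on the list length.

```java
// Hypothetical helper illustrating the position-difference scheme.
class GapCodec {
    /** Keeps the first position and replaces each later one by its gap. */
    static int[] compress(int[] positions) {
        if (positions.length <= 2) return positions.clone(); // stored as-is
        int[] out = new int[positions.length];
        out[0] = positions[0];
        for (int i = 1; i < positions.length; i++) {
            out[i] = positions[i] - positions[i - 1];
        }
        return out;
    }

    /** Rebuilds absolute positions with a running sum of the gaps. */
    static int[] decompress(int[] stored) {
        if (stored.length <= 2) return stored.clone(); // was never compressed
        int[] out = new int[stored.length];
        out[0] = stored[0];
        for (int i = 1; i < stored.length; i++) {
            out[i] = out[i - 1] + stored[i];
        }
        return out;
    }
}
```

For the example above, compress turns 100,102,105,110 into 100,2,3,5, and decompress restores the original list.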

14
SEARCHING

For searching we have created one servlet that
does all the processing: SearchServlet.java.
The servlet receives the term and looks it up in
the TreeMap object.
If found, it reads Postings.txt from the START
byte until the END byte. All the bytes read before
START are discarded, thereby saving memory.
If this is the first time the servlet is called,
then the init method is executed:
Read two files, Dictionary.txt and
ArticleTitles.txt.
Store them in two TreeMap objects.
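
Reading one record by its byte range might look like the sketch below. Note that it swaps in a different technique from the one described: instead of reading and discarding the bytes before START, it seeks directly with RandomAccessFile. The class name PostingsReader and the exclusive end byte are assumptions.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical reader; uses seek() rather than read-and-discard.
class PostingsReader {
    /** Returns the bytes in [start, end) of the file as text. */
    static String readRecord(String path, long start, long end) {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            byte[] buffer = new byte[(int) (end - start)];
            file.seek(start);       // jump straight to the record
            file.readFully(buffer); // read exactly end - start bytes
            return new String(buffer, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Either way, only the bytes of the requested record are kept in memory, so the cost of a lookup does not grow with the size of the whole postings file in memory terms.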

15
SEARCHING/DECOMPRESSION

In SearchServlet the positions are NOT being
decompressed.
We have created another file called
SearchServletCompression, where positions are
being decompressed. The reason behind this is
SPEED. With decompression our search is much
slower.
Below is a comparison

16
ANY
