Title: OUR GROUP MEMBERS:
1Implementing a Prototype of XML Software Tool for
Displaying and Searching Genomics Documents
- OUR GROUP MEMBERS
- Farhan Khalid
- Vladimir Kossoi
- Mani Yasrebi
- David Feng
- Mauricio Franco
- Surjeet Roopani
2INTRODUCTION
- The objective of this assignment is to build a
web information system for displaying and searching
XML documents.
- The implementation of this project is divided
into three parts:
- Creation of a two-level indexer
- Searching
- Creation of an XSL stylesheet for presenting
XML documents
3OUR SEARCHING PAGE
- The main page is located at
http://unix.aml.yorku.ca:8080/w04_g4/Search.html
4DISPLAYING GENOMICS DOCUMENTS (This page is the
Java class SearchServlet)
- This page displays a number of important things
to the user:
- Number of documents found that contain
the specified term
- Number of times the term was found in all
the documents
- Time (seconds) it took to search for the
term
- Article titles
6Formatted XML by using XSL stylesheet
7INDEXER ARCHITECTURE
[Diagram: MainApp, with the steps Parse files, Sorted collection, and Create index; files shown: Positions.txt, Dictionary.txt, ArticleTitles.txt]
8General Design
- Two parts to the creation of indexer
- Parsing all documents.
- Write to file and create a lexicon
- Parsing involves:
- 1) 1139 files are created.
- 2) All files are parsed using the DOM architecture.
- Article titles are extracted and saved in a file
using object serialization.
- An ArrayList contains a collection of all terms,
documents, and positions.
- The ArrayList is sorted using Collections.sort(List l).
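The collection and sorting step above can be sketched as follows. This is a minimal sketch, not the project's actual code: the `Posting` and `PostingCollector` classes, their fields, and the sort order (term, then document, then position) are assumptions made for illustration. Only the use of an `ArrayList` and `Collections.sort` comes from the slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical record: one occurrence of a term in one document.
class Posting implements Comparable<Posting> {
    final String term;
    final String doc;
    final int position;

    Posting(String term, String doc, int position) {
        this.term = term;
        this.doc = doc;
        this.position = position;
    }

    // Order by term, then document name, then position, so that all
    // occurrences of the same term are adjacent after sorting.
    @Override
    public int compareTo(Posting other) {
        int c = term.compareTo(other.term);
        if (c != 0) return c;
        c = doc.compareTo(other.doc);
        if (c != 0) return c;
        return Integer.compare(position, other.position);
    }
}

class PostingCollector {
    private final List<Posting> postings = new ArrayList<>();

    // Called once per term occurrence while a document is parsed.
    void add(String term, String doc, int position) {
        postings.add(new Posting(term, doc, position));
    }

    // Sorts the whole collection in place and returns it.
    List<Posting> sorted() {
        Collections.sort(postings);
        return postings;
    }
}
```

Sorting first makes the later merge of duplicate terms a single linear pass, since equal terms sit next to each other.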
9General Design Second part
- Write to file and create a lexicon
- Merge all duplicate terms
- Write every term record to file in binary format
- Store every term in a TreeMap as a Lexicon object
- The connection (pointer) between the lexicon and
Postings.txt is the start and end byte of the
term's record in Postings.txt.
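The second stage described above can be sketched like this. It is only an illustration under stated assumptions: the `LexiconEntry` and `LexiconBuilder` names are hypothetical, the record layout (a count followed by the positions as 4-byte integers) is simplified and leaves out document names, and the end offset is treated as exclusive. The slides only state that records are written in binary and that the lexicon keeps each record's start and end byte.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical lexicon value: byte range of one term's record.
class LexiconEntry {
    final long start; // offset of the record's first byte
    final long end;   // offset one past the record's last byte (assumed exclusive)

    LexiconEntry(long start, long end) {
        this.start = start;
        this.end = end;
    }
}

class LexiconBuilder {
    // Merges duplicate terms: terms[i] occurred at positions[i].
    static TreeMap<String, List<Integer>> merge(String[] terms, int[] positions) {
        TreeMap<String, List<Integer>> merged = new TreeMap<>();
        for (int i = 0; i < terms.length; i++) {
            merged.computeIfAbsent(terms[i], k -> new ArrayList<>()).add(positions[i]);
        }
        return merged;
    }

    // Writes one binary record per term and records its byte range.
    static TreeMap<String, LexiconEntry> write(
            TreeMap<String, List<Integer>> merged, DataOutputStream out) {
        TreeMap<String, LexiconEntry> lexicon = new TreeMap<>();
        try {
            for (Map.Entry<String, List<Integer>> e : merged.entrySet()) {
                long start = out.size();            // bytes written so far
                out.writeInt(e.getValue().size());  // simplified record header
                for (int p : e.getValue()) {
                    out.writeInt(p);
                }
                lexicon.put(e.getKey(), new LexiconEntry(start, out.size()));
            }
            out.flush();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return lexicon;
    }
}
```

Because only the two offsets are kept per term, the in-memory lexicon stays small while the full records remain on disk.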
10General Design Second part
POSTINGS.TXT
<1,465.xml(1:85)>
<27,1129.xml(1:156)230.xml(2:110,197)
437.xml(16:173,8,12,13,3,22,9,3,21,19,10,9,13,3,20,10)
446.xml(7:202,19,2,3,14,14,15)705.xml(1:145)>
<534,1.xml(3:218,12,19)1001.xml(1:316)
11Lexicon
- Our lexicon consists of 359,304 terms.
- A term was considered to be a sequence of any
characters except for the following:
- " \t\n\r\f,'() and a .
- These characters were used as delimiters.
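A tokenizer that follows this rule might look like the sketch below. The use of `java.util.StringTokenizer` is an assumption (the slides do not say how the text was split), the `TermTokenizer` class is hypothetical, and the leading double quote on the slide is read here as one of the delimiter characters. No case folding is applied, since the slides do not mention any.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TermTokenizer {
    // Delimiters taken from the slide: space, tab, newline, carriage return,
    // form feed, comma, apostrophe, parentheses, and period. The double
    // quote is included on the assumption that it is a delimiter as well.
    static final String DELIMITERS = " \t\n\r\f,'()\".";

    // Splits text into terms; any run of delimiters separates two terms.
    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIMITERS);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }
}
```

Note that treating the apostrophe as a delimiter splits a word such as "it's" into the two terms "it" and "s".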
12INDEXER
- First level
- In memory, a collection of all terms stored in a
TreeMap object.
- Key is the term.
- Values are the start and end byte in
Postings.txt.
- Second level
- Postings.txt
- <1,465.xml(1:85)>
<27,1129.xml(1:156)230.xml(2:110,197)
437.xml(16:173,8,12,13,3,22,9,3,21,19,10,9,13,3,20,10)>
- <TotalFreqTerm, DocName(TermFreq:pos1,...,posN)>
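To make the record layout concrete, the sketch below renders one term's record as text. The `PostingsRecord` class is hypothetical, and the `:` between the per-document frequency and the position list is an assumption about the original layout, since the separator is not clearly visible in the slide text. The slides also say the real file is written in binary, so this text form is for illustration only.

```java
import java.util.List;
import java.util.Map;

class PostingsRecord {
    // Formats one term's record as
    //   <TotalFreq,Doc(TermFreq:pos1,...,posN)Doc(...)...>
    // 'byDoc' maps a document name to the term's positions in that document;
    // documents are emitted in the map's iteration order.
    static String format(Map<String, List<Integer>> byDoc) {
        int total = 0;
        StringBuilder docs = new StringBuilder();
        for (Map.Entry<String, List<Integer>> e : byDoc.entrySet()) {
            List<Integer> positions = e.getValue();
            total += positions.size();
            docs.append(e.getKey())
                .append('(')
                .append(positions.size())
                .append(':');
            for (int i = 0; i < positions.size(); i++) {
                if (i > 0) docs.append(',');
                docs.append(positions.get(i));
            }
            docs.append(')');
        }
        return "<" + total + "," + docs + ">";
    }
}
```

The total frequency at the front is simply the sum of the per-document frequencies, which is how the sample record with a total of 27 can be checked by hand.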
13Compression
- Store differences of positions.
- For each term, in each document, positions are
being compressed.
- e.g. 100,102,105,110
- After compression: 100,2,3,5
- In Postings.txt:
- <1,465.xml(1:85)>
<27,1129.xml(1:156)230.xml(2:110,197)
437.xml(16:173,8,12,13,3,22,9,3,21,19,10,9,13,3,20,10)
- For efficiency reasons, positions are compressed
only if there are more than 2 positions.
- Without compression our Postings.txt file is
3.38 MB (3,546,244 bytes).
- With compression our Postings.txt file is 3.25
MB (3,408,825 bytes).
- We have saved 137,419 bytes.
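The difference (gap) encoding described above, together with its inverse, can be sketched as follows. The `GapCodec` class and method names are hypothetical. The rule that lists with 2 or fewer positions are left untouched follows the slide, and the decoder applies the same rule so that the two methods stay inverse to each other.

```java
class GapCodec {
    // Replaces every position after the first with its distance from the
    // previous position. Lists with 2 or fewer positions are returned
    // unchanged, following the "more than 2 positions" rule.
    static int[] compress(int[] positions) {
        int[] out = positions.clone();
        if (positions.length <= 2) {
            return out;
        }
        for (int i = 1; i < positions.length; i++) {
            out[i] = positions[i] - positions[i - 1];
        }
        return out;
    }

    // Rebuilds absolute positions by keeping a running sum of the gaps.
    static int[] decompress(int[] gaps) {
        int[] out = gaps.clone();
        if (gaps.length <= 2) {
            return out;
        }
        for (int i = 1; i < out.length; i++) {
            out[i] = out[i - 1] + out[i];
        }
        return out;
    }
}
```

For the slide's example, `compress` turns 100,102,105,110 into 100,2,3,5, and `decompress` restores the original list. The saving comes from the gaps usually having fewer digits than the absolute positions.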
14SEARCHING
- For searching we have created one servlet that
does all the processing: SearchServlet.java.
- The servlet receives the term and looks it up in
the TreeMap object.
- If found, read Postings.txt from START
until END byte. All the bytes read until START
are discarded, thereby saving memory.
- If this is the first time the servlet is called,
then the init method is executed.
- Read two files: Dictionary.txt and
ArticleTitles.txt.
- Store in two TreeMap objects.
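The lookup step can be illustrated with the sketch below. It is not the servlet's actual code: the `PostingsLookup` class is hypothetical, the lexicon value is simplified to a `long[]` holding the start and end byte, and the end byte is assumed to be exclusive. It also swaps in `RandomAccessFile.seek`, which jumps directly to START, whereas the slides describe reading and discarding the bytes before START.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.TreeMap;

class PostingsLookup {
    // Lexicon: term -> {start byte, end byte (assumed exclusive)}.
    private final TreeMap<String, long[]> lexicon;
    private final String postingsPath;

    PostingsLookup(TreeMap<String, long[]> lexicon, String postingsPath) {
        this.lexicon = lexicon;
        this.postingsPath = postingsPath;
    }

    // Returns the raw record bytes for 'term', or null if the term
    // is not in the lexicon.
    byte[] find(String term) {
        long[] range = lexicon.get(term);
        if (range == null) {
            return null;
        }
        try (RandomAccessFile file = new RandomAccessFile(postingsPath, "r")) {
            byte[] record = new byte[(int) (range[1] - range[0])];
            file.seek(range[0]);    // jump to START without reading what precedes it
            file.readFully(record); // read exactly the record's bytes
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Either way, only one record is held in memory per query, which is the point of keeping the byte offsets in the lexicon.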
15SEARCHING/DECOMPRESSION
- In SearchServlet the positions are NOT being
decompressed.
- We have created another file called
SearchServletCompression, where positions are
being decompressed. The reason behind this is
SPEED. With decompression our search is much
slower.
- Below is a comparison
16ANY