Title: Online Search Engine MED SEARCH
Introduction
- The goal of this project is to build a search engine over XML documents that contain medical data.
- We developed an efficient online search engine for MED.
- A search needs either a valid PMID (the unique identification number of each provided XML document) or a keyword from the document.
- Once a PMID or a keyword is provided, the user gets the relevant data.
Class Diagram
How does the Indexer actually work?
- This document was parsed in order for its contents to be used.
- The program that does this is called Indexer.java.
if (totalWord.charAt(i) == '<') test = false;
if (totalWord.charAt(i) == '>') test = true;
// builds words one character at a time
if (test && totalWord.charAt(i) != '>') word = word + totalWord.charAt(i);
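The character loop above can be sketched as a small self-contained Java class. The class name TagStripper and the method textOutsideTags are hypothetical names chosen for illustration, not taken from Indexer.java; the sketch assumes the goal is to keep only the text that lies outside <...> tags.

```java
// Hypothetical helper, not the original Indexer.java: keeps only the text
// that lies outside <...> tags, reading one character at a time.
final class TagStripper {
    static String textOutsideTags(String totalWord) {
        StringBuilder word = new StringBuilder();
        boolean test = true; // true while the loop is outside a tag
        for (int i = 0; i < totalWord.length(); i++) {
            char c = totalWord.charAt(i);
            if (c == '<') test = false; // a tag starts: stop collecting
            if (c == '>') test = true;  // a tag ends: resume collecting
            if (test && c != '>') word.append(c);
        }
        return word.toString();
    }

    public static void main(String[] args) {
        System.out.println(textOutsideTags("<PMID>12345</PMID>"));
    }
}
```

A StringBuilder is used here instead of repeated string concatenation only to avoid creating a new string for every character; the logic is otherwise the same as in the fragment.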
- The indexer also splits the parsed text so that characters such as the following ones are ignored.
StringTokenizer st = new StringTokenizer(word,
    "," + "@" + "-" + "." + "\"" + "(" + ")" + "?" + " "
    + "" + "\n" + "\t" + "\r" + "" + "" + "/" + ""
    + "" + "'" + "", false);
- The indexer runs through all parsed words and writes them to another file, preferably a .txt or .doc file, together with the page/document numbers in which each parsed term appeared.
- For example, if the word dna appears on page/document 67, then in the .txt/.doc file it appears as dna 67.
element = st.nextToken().toLowerCase();
vector.addElement("\n" + element + " " + count);
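As a minimal sketch of the splitting and lowercasing step, the following Java class breaks a piece of text into lowercase words with a StringTokenizer. The class name WordSplitter, the method split, and the shortened delimiter set DELIMS are assumptions made for illustration; the original indexer uses a longer delimiter list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper for illustration only.
final class WordSplitter {
    // Assumed, shortened delimiter set: every character in this string
    // separates words and is itself dropped.
    static final String DELIMS = ",-.\"()? \n\t\r/'";

    static List<String> split(String text) {
        List<String> words = new ArrayList<>();
        // false: delimiters are not returned as tokens
        StringTokenizer st = new StringTokenizer(text, DELIMS, false);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken().toLowerCase());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("DNA-binding (protein), in vitro."));
    }
}
```

Note that StringTokenizer treats its second argument as a set of single characters, so each character listed becomes a separator on its own.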
Storing the Word in an Index
element = st.nextToken().toLowerCase();
vector.addElement("\n" + element + " " + count);
- The above code stores the word in lowercase with the page number attached, e.g. dna 206.
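The "word page" entries described above can be sketched end to end as follows. The names IndexBuilder, addPage and build are hypothetical, the delimiter set is shortened, and the final sort is an assumption: the slides do not show where the entries are ordered, but a search that repeatedly halves the list needs them sorted by word.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch of building "word page" index entries.
final class IndexBuilder {
    // Adds one "word page" entry for every token found on the given page.
    static void addPage(List<String> index, String text, int page) {
        StringTokenizer st = new StringTokenizer(text, " ,.\n\t\r", false);
        while (st.hasMoreTokens()) {
            String element = st.nextToken().toLowerCase();
            index.add(element + " " + page);
        }
    }

    static List<String> build(String[] pages) {
        List<String> index = new ArrayList<>();
        for (int i = 0; i < pages.length; i++) {
            addPage(index, pages[i], i + 1); // pages are numbered from 1 here
        }
        Collections.sort(index); // assumption: sorted so it can be searched by halving
        return index;
    }

    public static void main(String[] args) {
        for (String entry : build(new String[] {"DNA repair", "Repair of dna."})) {
            System.out.println(entry);
        }
    }
}
```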
Writing the XML pages
if (readTemp.indexOf("<PubmedArticle>") != -1) {
    //test = true;
    count++;
    FileWriter file = new FileWriter(count + ".xml", true);
}
- The above code reads through the entire XML dataset, and every time it reads the tag <PubmedArticle> it creates a new XML document.
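The splitting step can be sketched in memory as follows. This is an illustration under stated assumptions rather than the original program: the class ArticleSplitter and the method split are hypothetical, the dataset is assumed to arrive as a list of lines, and each tag is assumed to sit on its own line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: cuts a dataset into one string per <PubmedArticle>.
final class ArticleSplitter {
    static List<String> split(List<String> lines) {
        List<String> docs = new ArrayList<>();
        StringBuilder current = null;
        boolean inside = false; // true between the opening and closing tag
        for (String readTemp : lines) {
            if (readTemp.indexOf("<PubmedArticle>") != -1) {
                inside = true;                 // a new article starts here
                current = new StringBuilder();
            }
            if (inside) {
                current.append(readTemp).append('\n');
            }
            if (inside && readTemp.indexOf("</PubmedArticle>") != -1) {
                docs.add(current.toString());  // the article is complete
                inside = false;
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> docs = split(Arrays.asList(
            "<PubmedArticleSet>",
            "<PubmedArticle>", "<PMID>1</PMID>", "</PubmedArticle>",
            "<PubmedArticle>", "<PMID>2</PMID>", "</PubmedArticle>",
            "</PubmedArticleSet>"));
        System.out.println(docs.size() + " documents");
    }
}
```

Each returned string could then be written to its own file named after a running counter, in the way the fragment above opens count + ".xml" with a FileWriter.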
Doing the Search at Runtime
- The search first checks the entry in the middle of the indexed file, which is held in a vector. If the keyword does not match there, the search continues in the top section of the vector when the keyword sorts before the middle entry, and in the bottom section otherwise.
// perform divide and search sort technique
while (lower + 1 != upper) {
    middle = (lower + upper) / 2;
    String tmpElement = (String) vector.elementAt(middle);
    StringTokenizer tmpToken = new StringTokenizer(tmpElement);
    String tmpString = tmpToken.nextToken();
    if (wordIn.compareTo(tmpString) < 0)
        upper = middle;
    else if (wordIn.compareTo(tmpString) > 0)
        lower = middle;
    else if (wordIn.equals(tmpString)) {
        found = true;
        upper = lower + 1;
    }
}
int mid = middle + 1;
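For comparison, here is a minimal sketch of a textbook binary search over sorted "word page" entries. It is not the servlet's own loop: the class IndexSearch and the method find are hypothetical, and the bounds are handled in the standard lower <= upper form rather than the lower + 1 != upper form used above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch of a standard binary search on "word page" entries.
final class IndexSearch {
    // Returns the position of some entry whose word equals wordIn, or -1.
    static int find(List<String> index, String wordIn) {
        int lower = 0;
        int upper = index.size() - 1;
        while (lower <= upper) {
            int middle = (lower + upper) / 2;
            // the word is the first whitespace-separated token of the entry
            String tmpString = new StringTokenizer(index.get(middle)).nextToken();
            int cmp = wordIn.compareTo(tmpString);
            if (cmp < 0) {
                upper = middle - 1; // keyword sorts before: search the top half
            } else if (cmp > 0) {
                lower = middle + 1; // keyword sorts after: search the bottom half
            } else {
                return middle;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("dna 1", "dna 2", "of 2", "repair 1", "repair 2");
        System.out.println(find(index, "of"));
    }
}
```

When a word occurs on several pages, the position returned is only one of its entries; the neighbouring entries still have to be scanned to collect every page.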
- As soon as a match is found for the keyword, the SearchPageServlet reads the word from the top, middle or bottom section of the vector, depending on the section in which the word was found.
if (found == true) {
    String tmpElement = (String) vector.elementAt(middle);
    StringTokenizer tmpToken = new StringTokenizer(tmpElement, "," + "", false);
    String tmpWord = tmpToken.nextToken();
    String tmpPage = tmpToken.nextToken();
    vectorPage.addElement(tmpPage);
    ...
    String tmpElement2 = (String) vector.elementAt(mid);
    StringTokenizer tmpToken2 = new StringTokenizer(tmpElement2, "," + "", false);
    String tmpWord2 = tmpToken2.nextToken();
    String tmpPage2 = tmpToken2.nextToken();
- Reading the page numbers attached to all matched words.
// adds the page numbers of the words that match the user input
while (tmpWord.equals(tmpWord2)) {
    middle++;
    String tmpE = (String) vector.elementAt(middle);
    StringTokenizer tmp = new StringTokenizer(tmpE, "," + "", false);
    tmpWord = tmp.nextToken();
    tmpPage = tmp.nextToken();
    mid++;
    String tmpE2 = (String) vector.elementAt(mid);
    StringTokenizer tmpT2 = new StringTokenizer(tmpE2, "," + "", false);
    tmpWord2 = tmpT2.nextToken();
}
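One way to collect every page number for a matched word is sketched below. It assumes sorted "word page" entries separated by a space and a known position of one matching entry; the names PageCollector, wordAt, pageAt and pagesFor are hypothetical, and the scan direction (back to the first match, then forward) is a design choice of this sketch rather than the servlet's own loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch: gathers the pages of all entries sharing one word.
final class PageCollector {
    static String wordAt(List<String> index, int i) {
        return new StringTokenizer(index.get(i)).nextToken();
    }

    static String pageAt(List<String> index, int i) {
        StringTokenizer t = new StringTokenizer(index.get(i));
        t.nextToken();        // skip the word
        return t.nextToken(); // the page number follows it
    }

    // found is the position of any entry for the word, e.g. from a binary search.
    static List<String> pagesFor(List<String> index, int found) {
        String word = wordAt(index, found);
        int first = found;
        // step back to the first entry carrying the same word
        while (first > 0 && wordAt(index, first - 1).equals(word)) {
            first--;
        }
        List<String> vectorPage = new ArrayList<>();
        // walk forward while the word still matches
        for (int i = first; i < index.size() && wordAt(index, i).equals(word); i++) {
            vectorPage.add(pageAt(index, i));
        }
        return vectorPage;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("dna 1", "dna 2", "of 2", "repair 1", "repair 2");
        System.out.println(pagesFor(index, 1));
    }
}
```

The bounds checks on both loops keep the scan from running past either end of the list when the matched word is the first or last one in the index.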
- Every time the program finds a matching word, it adds it to the vector, and the results are formatted to be displayed as a list on the website.
- If no matching keywords are found in the indexed file, the program displays on the website that no matching results were found.
- This happens when the vector is empty, because each matching word is stored in the vector as soon as it is found.
- From the search results listed, the user can select the link that best suits their search.
- As soon as they click on the link on the website, the XML document is displayed in the browser, styled with XSL.
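The result list and the empty-vector case described above could be rendered along the following lines. This is a sketch only: the class ResultFormatter, the method toHtml, the link text, and the assumption that each result links to a file named after its page number with an .xml extension are all illustrative.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of turning matched page numbers into an HTML list.
final class ResultFormatter {
    static String toHtml(List<String> vectorPage) {
        if (vectorPage.isEmpty()) {
            // nothing was stored, so no keyword matched
            return "<p>No matching results were found.</p>";
        }
        StringBuilder html = new StringBuilder("<ul>\n");
        for (String page : vectorPage) {
            html.append("<li><a href=\"").append(page).append(".xml\">")
                .append("Document ").append(page).append("</a></li>\n");
        }
        return html.append("</ul>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml(Arrays.asList("67", "206")));
    }
}
```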
<?xml version='1.0'?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/TR/WD-xsl"
  xmlns="http://www.w3.org/TR/REC-html40"
  result-ns="">
<xsl:template match="/">
<HTML>
<HEAD> <TITLE>Pubmed Articles</TITLE> </HEAD>
<BODY BGCOLOR="000000" TEXT="FFFFFF">
<TABLE BORDER="1">
<TR ALIGN="LEFT"><TH>PMID</TH><TH>Year</TH><TH>Title</TH><TH>Affiliation</TH><TH>Abstract</TH></TR>
<xsl:for-each select="PubmedArticleSet/PubmedArticle">
<TR ALIGN="LEFT" VALIGN="TOP">
- This XSL style sheet isolates each XML document by identifying the tags <PubmedArticle> and </PubmedArticle>, and displays the set of information for every document independently, as a separate page.
<TD><xsl:value-of select="MedlineCitation/PMID"/></TD>
<TD><xsl:value-of select="MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"/> <xsl:value-of select="MedlineCitation/Article/Journal/JournalIssue/PubDate/Month"/></TD>
<TD><FONT SIZE="2"><xsl:value-of select="MedlineCitation/Article/ArticleTitle"/></FONT></TD>
<TD><FONT SIZE="2"><xsl:value-of select="MedlineCitation/Article/Affiliation"/></FONT></TD>
<TD><FONT SIZE="2"><xsl:value-of select="MedlineCitation/Article/Abstract/AbstractText"/></FONT></TD>
</TR>
</xsl:for-each>
</TABLE>
</BODY>
</HTML>
</xsl:template>
</xsl:stylesheet>
Website
- http://unix.aml.yorku.ca:8080/w04_g20/searchPage.html