Title: Masters Project Overview
1Masters Project Overview
- A Search Engine Implementation Based on Content Indexing and Web Service Architecture
- Advisor: Dr. Shen
- Department of Computer Science
- Presenter: Vijay Gomatam
2Project Goals
- Implement a travel-centric search engine based on
- Content Indexing Search Mechanism
- Web Service Technology
- Flexible and Scalable User Interface
3What is a Search Engine?
It is an automated program (a robot) that searches a databank for data matching the user's query.
Challenge: how to make it efficient?
4Important factor for an efficient Search Engine is
5TRAVEL SEARCH ENGINE
Users can access the search engine like any other web application, through HTTP URL requests and responses.
6Why Travel Search Engine?
- To retrieve destination- and activity-specific details
  - Learn and implement Lucene
- To retrieve generic web and journal-review results
  - Learn and implement data binding technology
7BACKBONE
- The backbone of the search engine is based on:
- local data queries using content indexing, and
- generic web queries using a web service.
8GUI Presentation Layer
[Diagram: HTTP request, Tomcat, Struts, HTTP response]
The HTML text input field accepts the user's search criteria and sends them as a query to the backend process. The backend business process parses the query to determine the following:
1) Destination
2) Activity
3) Member ID tied to destination
4) Weather tied to destination
5) Combination of destination and/or activity results
   a. Can be retrieved as web results
   b. Can be retrieved as journal-review results
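The query-parsing step above can be sketched in plain Java. This is only an illustration under stated assumptions: the vocabularies, the member ID pattern (the letter m followed by digits) and the category names are hypothetical and do not come from the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class QueryClassifier {
    // Hypothetical vocabularies; a real system would take these from the content index.
    static final Set<String> DESTINATIONS = new HashSet<>(Arrays.asList("paris", "denver", "tokyo"));
    static final Set<String> ACTIVITIES = new HashSet<>(Arrays.asList("skiing", "hiking", "diving"));

    /** Returns the categories detected in the query, always in the same fixed order. */
    static List<String> classify(String query) {
        boolean destination = false, activity = false, member = false, weather = false;
        for (String token : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (DESTINATIONS.contains(token)) {
                destination = true;
            } else if (ACTIVITIES.contains(token)) {
                activity = true;
            } else if (token.equals("weather")) {
                weather = true;
            } else if (token.matches("m\\d+")) {
                member = true; // assumed member ID format, e.g. m1024
            }
        }
        List<String> found = new ArrayList<>();
        if (destination) found.add("DESTINATION");
        if (activity) found.add("ACTIVITY");
        if (member) found.add("MEMBER_ID");
        if (weather) found.add("WEATHER");
        return found;
    }

    public static void main(String[] args) {
        System.out.println(classify("skiing Denver weather"));
    }
}
```

The business layer could then choose which back-end search to run from the returned categories.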
9STRUTS BUSINESS LAYER
- Struts is based on the time-proven MVC design pattern, in which processing is divided into three distinct parts: the Model, the View and the Controller.
- The business layer is implemented using Struts. The controller accepts every user request and dispatches it to the corresponding Search Action implementation object. The mapping from each user request to its business implementation object is declared in the configuration file, which the controller (the ActionServlet class) reads and applies.
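The dispatching idea can be sketched without Struts itself. The example below is not the Struts API: it is a plain-Java stand-in in which a map plays the role of the action mappings declared in the configuration file, and the request paths and JSP names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class FrontControllerSketch {
    /** Stand-in for a Struts Action: takes the query and returns the view to render. */
    interface SearchAction {
        String execute(String query);
    }

    // Stand-in for the action mappings a configuration file would declare.
    // The paths and view names are hypothetical.
    private final Map<String, SearchAction> mappings = new HashMap<>();

    FrontControllerSketch() {
        mappings.put("/searchDestination", query -> "destinationResults.jsp");
        mappings.put("/searchWeb", query -> "webResults.jsp");
    }

    /** Plays the role of the controller servlet: look up the action for the path and delegate. */
    String dispatch(String path, String query) {
        SearchAction action = mappings.get(path);
        if (action == null) {
            return "error.jsp"; // no mapping configured for this request path
        }
        return action.execute(query);
    }

    public static void main(String[] args) {
        FrontControllerSketch controller = new FrontControllerSketch();
        System.out.println(controller.dispatch("/searchDestination", "denver"));
    }
}
```

Keeping the mapping in one place means a new search type only needs a new entry and a new action, without changes to the controller.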
10Model-View-Control
- Model: represents the application's business logic and state, implemented as JavaBeans.
- View: the components that present information to users and accept input from the web pages, built using JSP, XML and HTML.
- Controller: coordinates activities in the application. The ActionServlet, which functions as the controller, centralizes the logic for dispatching requests to the implementing Action object based on the request URL, input parameters and application state.
11Data Flow in Business Layer
[Diagram: client browser, controller servlet, struts-config.xml, business logic, model (application state), view (JSP)]
- The business implementation object, the Search Action class, contains the logic to process the query and get the results from the back-end process. The model data is then passed to the JSP pages, which render the results back to the user.
12Locating and Accessing Search Data
- How fast search data can be retrieved depends directly on how well the data is organized.
- Indexing logic is applied to the data files so that the information is stored in an organized structure, which makes the search mechanism faster.
- Example: Lucene
- Large search providers such as Google rely on the same basic idea of searching a prebuilt index rather than scanning the raw data.
13LUCENE
- Lucene is a Java-based open-source library for text indexing and searching. It is flexible, fully customizable and fast.
- It has its own content indexing algorithms and performs the search after parsing the query string.
- It provides the building blocks for a search engine tailored to the requirements.
- It integrates directly with the web application.
- Any Java application can use Lucene as the core of its search functionality.
- I therefore used Lucene to build the search solution for the Travel Search Engine application, adding Servlets and JSP pages to process the input query and display the results.
14LUCENE FUNDAMENTALS
- Lucene internally creates its own file-based indexes and organizes the text data accordingly.
- Lucene works in two fundamental steps:
- Indexing documents based on an index structure, and
- Searching the indexed documents for the query.
15INDEXING
- Indexing documents is the first step: it creates a Lucene index in a directory, using one of several analyzer algorithms. A Lucene index is a collection of documents organized so that information can be retrieved quickly for arbitrary queries. Each document in a Lucene index is made up of one or more fields, which are name-value pairs, much like entries in a HashMap.
- The fundamental concepts in Lucene are index, document, field and term.
- Index: an index contains a sequence of documents.
- Document: a document is a sequence of fields.
- Field: a field is a named sequence of terms.
- Term: a term is a string.
- Depending on the size of the data, several groups of files with the same name and different extensions are created. Each of these groups is known as a segment. Lucene keeps track of the segments using a file called segments.
16INDEXING FILES
- Lucene is used to convert the large amount of text data, containing destination, activity and member ID details, into well-organized documents. It indexes the documents into segments, which makes the search faster. I first retrieved the content index file and then used Lucene to index it into two folders:
- Destination, which contains the documents related to the destination and activity index.
- /destination/_ckl.cfs: compound file that stores the destination and activity data.
- /destination/segments: keeps track of the destination and activity index segments.
- Members, which contains the documents related to member IDs.
- /members/_asyz.cfs: compound file that stores the member data.
- /members/segments: keeps track of the member index segments.
17SEARCHING
- Once the indexes are built, the search mechanism accesses them and queries their contents. The search can be performed on a single index or on multiple indexes; either way, the results are collected in a single result set. The search method returns Hits, an ordered collection of the documents matching the query. QueryParser parses the query string and builds an appropriate Query object. For consistent results, it is recommended to parse queries with the same Analyzer that was used when indexing the documents.
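Searching several indexes and collecting one ordered result set can be sketched as follows. This is an illustration under simplifying assumptions, not Lucene code: each index is modeled as a plain list of document strings, the score is a simple count of matching terms rather than Lucene's scoring, and all names are hypothetical. The same analyze method is applied to documents and to the query, which is the point of the Analyzer recommendation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class MultiIndexSearch {
    /** One hit: the index it came from, the document text, and a term-count score. */
    static final class Hit {
        final String index;
        final String doc;
        final int score;

        Hit(String index, String doc, int score) {
            this.index = index;
            this.doc = doc;
            this.score = score;
        }
    }

    /** Shared analyzer: the same normalization is used for documents and for queries. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    /** Searches every index and merges the matches into one list, highest score first. */
    static List<Hit> search(Map<String, List<String>> indexes, String query) {
        Set<String> queryTerms = new HashSet<>(analyze(query));
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> index : indexes.entrySet()) {
            for (String doc : index.getValue()) {
                int score = 0;
                for (String term : analyze(doc)) {
                    if (queryTerms.contains(term)) score++;
                }
                if (score > 0) hits.add(new Hit(index.getKey(), doc, score));
            }
        }
        hits.sort((a, b) -> Integer.compare(b.score, a.score)); // stable sort, best match first
        return hits;
    }

    public static void main(String[] args) {
        Map<String, List<String>> indexes = new LinkedHashMap<>();
        indexes.put("destination", Arrays.asList("Denver skiing resorts", "Paris museums"));
        indexes.put("members", Arrays.asList("m1024 reviewed Denver skiing in Denver"));
        for (Hit hit : search(indexes, "DENVER skiing")) {
            System.out.println(hit.score + " " + hit.index + ": " + hit.doc);
        }
    }
}
```

If the query were matched without lower-casing, "DENVER" would find nothing, even though the documents mention Denver.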
18Content Index Searching Work Flow
[Diagram: HTTP request, query parsing using the Lucene API, Lucene file system, Tomcat, Struts, HTTP response; a second path shows HTTP request, web search, XML response]
When the user enters a search query, Lucene first parses the query and then searches for it in its self-organized file system.
19JiBX INTEGRATION TIER
The integration tier uses data binding to process the XML responses retrieved from the third-party service, which is based on a web service architecture, and delivers Java objects to the business tier. JiBX is an open-source framework for binding XML data to Java objects. It handles all the details of converting data to and from XML, based on its own class structures and binding instructions. It works with existing classes, using a flexible mapping definition file to determine how data objects are translated to and from XML. JiBX is designed to perform the translation between internal data structures and XML with very high efficiency, while still allowing a high degree of control over the translation process.
20JiBX Performance Bind Mark
Reference: https://bindmark.dev.java.net/
21JiBX Startup Time
Reference: http://www-128.ibm.com/developerworks/library/x-databdopt2
22JiBX Small Documents Memory Usage
Reference: http://www-128.ibm.com/developerworks/library/x-databdopt2
23JiBX Large Documents Memory Usage
Reference: http://www-128.ibm.com/developerworks/library/x-databdopt2
24JiBX BINDING
- JiBX makes the binding process easier, faster and more efficient. The binding involves:
- the binding definition XML file, i.e., SearchRes-jibx.xml,
- the XML response from the third-party Nutch API, and
- the converted Java objects.
- In JiBX, the binding process is handled in two fundamental steps:
- BINDING COMPILER
- JiBX uses binding definition documents to define the rules for how Java objects are converted to or from XML.
- BINDING RUNTIME (Marshalling / Unmarshalling)
- The enhanced class files generated by the binding compiler use this runtime component both to generate an XML representation of an object in memory (marshalling) and to build an object in memory from an XML representation (unmarshalling).
- The Travel Search Engine performs unmarshalling to build objects from the OTA response.
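What unmarshalling produces can be shown with a hand-written equivalent. The sketch below does not use JiBX: it swaps in the JDK's built-in DOM parser, and the XML layout (searchResponse, result, title, url) and the SearchResult class are hypothetical, not the actual OTA or Nutch response format. With JiBX, the element-to-field rules would live in the binding definition file instead of in Java code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SearchResponseUnmarshaller {
    /** Plain Java object that one result element is bound to (hypothetical shape). */
    static final class SearchResult {
        final String title;
        final String url;

        SearchResult(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    /** Builds SearchResult objects from an XML string: a manual unmarshalling step. */
    static List<SearchResult> unmarshal(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("result");
            List<SearchResult> results = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                results.add(new SearchResult(text(e, "title"), text(e, "url")));
            }
            return results;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XML response", ex);
        }
    }

    /** Returns the trimmed text of the first child element with the given tag, or "". */
    private static String text(Element parent, String tag) {
        NodeList children = parent.getElementsByTagName(tag);
        return children.getLength() == 0 ? "" : children.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<searchResponse>"
                + "<result><title>Ski Denver</title><url>http://example.com/ski</url></result>"
                + "<result><title>Hike Denver</title><url>http://example.com/hike</url></result>"
                + "</searchResponse>";
        for (SearchResult r : unmarshal(xml)) {
            System.out.println(r.title + " -> " + r.url);
        }
    }
}
```

The business tier only ever sees SearchResult objects, so the XML format can change without touching the search actions.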
25Unmarshalling Large Documents
Reference: http://www-128.ibm.com/developerworks/library/x-databdopt2
26Unmarshalling Small Documents
Reference: http://www-128.ibm.com/developerworks/library/x-databdopt2
27Integration Tier Work Flow
[Diagram: HTTP request, Tomcat, Struts, business logic tier, integration tier (binding the XML response to Java objects using JiBX), XML response, Java object, HTTP response. The external service handles the HTTP request and returns an Open Travel Alliance (OTA) standard XML response, based on web search or on site search that is also based on content indexing.]
28PROJECT ENVIRONMENT
INSTALLATION REQUIREMENTS
- JVM environment: JVM version 1.5 is required, because the project uses new features of JDK 1.5.
- Application server: Tomcat 5.0.28
- Browser: Internet Explorer 6 or later
- Project deployment: the Ant build utility creates a WAR file, which can be deployed onto the container.
DEVELOPMENT ENVIRONMENT
- Languages: Java 1.5, JSP, Servlets, XML, XSLT, HTML and JavaScript
- Framework: Apache Jakarta Struts 1.2.7
- Content indexing algorithm: Apache Lucene
- External search API: Nutch, built on Lucene
- Binding: JiBX
- Build: Apache Jakarta Ant 1.6.5
- Web server: Apache Tomcat 5.0.28
29FUTURE WORK
- Include more destination- and activity-specific details in the content index file
- Include zip-code-based destination-details search
- For example, restaurant details based on zip code
- Implement AJAX technology to suggest destinations to the user
- Explore alternative content indexing and data binding technologies
30REFERENCES
1) LUCENE
   a. Apache Lucene, http://lucene.apache.org/
   b. Search-Enable Your Application with Lucene, Java Developer's Journal, Dec 2002
2) STRUTS
   a. Apache Struts, http://struts.apache.org/
   b. Struts Kick Start, James Turner and Kevin Bedell
3) JIBX
   JiBX project website, http://jibx.sourceforge.net/
4) NUTCH
   Apache Nutch website, http://lucene.apache.org/nutch/
5) WEB SERVICES
   Beginning Java Web Services, H. Bequet, M. M. Kunnumpurath, S. Rhody, A. Tost, Wrox Press, Mar 2003.