Masters Project Overview - PowerPoint PPT Presentation

About This Presentation
Title:

Masters Project Overview

Description:

Why Travel Search Engine? To retrieve destination(s) and activity(s) ... Travel Search Engine performs unmarshalling to build objects from the OTA Response. ... – PowerPoint PPT presentation

Slides: 31
Provided by: vijayg8

Transcript and Presenter's Notes

Title: Masters Project Overview


1
Masters Project Overview
  • A Search Engine Implementation based on Content
    Indexing and Web Service Architecture
  • Advisor: Dr. Shen
  • Department of Computer Science
  • Presenter: Vijay Gomatam

2
Project Goals
  • Implement a travel-centric search engine based on:
  • Content Indexing Search Mechanism
  • Web Service Technology
  • Flexible and Scalable User Interface

3
What is a Search Engine?
It is a robot that searches the data in a
databank based on the user's query.
Challenge:
How to make it efficient?
4
An important factor for an efficient search engine is
  • DATA ORGANIZATION

5
TRAVEL SEARCH ENGINE
Users can access the search engine like any
other web application, through HTTP URL requests
and responses.
6
Why Travel Search Engine?
  • To retrieve destination- and activity-specific
    details
  • --- Learn and implement Lucene.
  • To retrieve generic web / journal-review results
  • --- Learn and implement data binding technology.

7
BACKBONE
  • The backbone of the search engine is based on:
  • local data queries using Content Indexing, and
  • generic web queries using a Web Service.

8
GUI Presentation Layer
[Diagram labels: HTTP REQUEST, TOMCAT, STRUTS, HTTP RESPONSE]
An HTML text input field accepts the user's search
criteria and sends them as a query to the
backend process. The backend business process
parses the query to determine the following:
1) Destination
2) Activity
3) Member ID tied to destination
4) Weather tied to destination
5) Combination of destination and/or activity results
   a. Can be retrieved as web results
   b. Can be retrieved as journal-review results
9
STRUTS BUSINESS LAYER
  • Struts is based on the time-proven MVC design
    pattern, in which processing is divided into
    three distinct parts: the Model, the View
    and the Controller.
  • The business layer is implemented using Struts,
    an MVC-based framework in which the controller
    accepts all user requests and dispatches each
    request to the corresponding Search Action
    implementation object. The mapping from each
    user request to its business implementation
    object is provided in the configuration file,
    which is processed by the controller, i.e., the
    ActionServlet class.

10
Model-View-Controller
  • The Model holds the application's business
    logic and state, represented as JavaBeans.
  • View components are the pieces of the application
    that present information to the users and
    accept input from the web pages, built using JSP,
    XML and HTML.
  • The Controller coordinates activities in the
    application. The ActionServlet, which functions
    as the controller, centralizes the logic for
    dispatching requests to the implementation
    Action object based on the request URL, input
    parameters, and application state.
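
The dispatching role of the controller can be illustrated with a minimal plain-Java sketch. This is not the Struts API: SearchHandler, FrontController and the "/search" mapping are hypothetical names chosen for illustration, and the in-memory map only stands in for the mappings that Struts reads from its configuration file. The lambda syntax assumes Java 8 or later, not the JDK 1.5 used in the project.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Struts Action: handles one kind of request
// and returns the name of the view that should render the result.
interface SearchHandler {
    String execute(Map<String, String> params);
}

// Plays the role of the controller: it owns the path-to-action mapping
// (which Struts would read from its configuration file) and dispatches.
class FrontController {
    private final Map<String, SearchHandler> mappings = new HashMap<>();

    void register(String path, SearchHandler handler) {
        mappings.put(path, handler);
    }

    // Returns the view name chosen by the mapped handler,
    // or "error" when no handler is mapped to the path.
    String dispatch(String path, Map<String, String> params) {
        SearchHandler handler = mappings.get(path);
        if (handler == null) {
            return "error";
        }
        return handler.execute(params);
    }
}

class FrontControllerDemo {
    static FrontController build() {
        FrontController controller = new FrontController();
        // Hypothetical mapping: an empty query goes back to the form,
        // anything else goes to the results view.
        controller.register("/search", params -> {
            String query = params.get("query");
            return (query == null || query.trim().isEmpty()) ? "searchForm" : "results";
        });
        return controller;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("query", "hiking in Denver");
        System.out.println(build().dispatch("/search", params));
    }
}
```

Keeping the mapping in one place means a new kind of request only needs a new handler and one registration, which is the benefit the configuration-driven controller provides.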

11
Data Flow in Business Layer
[Diagram labels: CONTROLLER SERVLET, BUSINESS LOGIC, STRUTS-CONFIG.XML, CLIENT BROWSER, VIEW JSP, MODEL APPLICATION STATE]
  • The business implementation object, the Search
    Action class, contains the logic to process the
    query and get the results from the back-end
    process. The model data is then passed to the JSP
    pages, which render the result back to the user.

12
Locating and Accessing Search Data
  • The speed of retrieving search data depends
    directly on how well organized the data is.
  • Indexing logic is used to index the data files,
    which stores the information in an organized
    structure and makes the search mechanism faster.
  • Example: Lucene
  • Large search engines such as Google rely on the
    same general principle of searching a prebuilt
    index.

13
LUCENE
  • Lucene is a Java-based open-source framework for
    text indexing and searching. It is a flexible,
    customizable and fast search library.
  • It has its own content-indexing algorithms and
    performs the search after parsing the query
    string.
  • It provides building blocks to build a search
    engine based on the requirements.
  • It integrates directly with the Web application.
  • Any Java application can use Lucene as the core
    of its search functionality.
  • I have therefore used Lucene to build the search
    solution for the Travel Search Engine
    application, adding Servlets and JSP pages to
    process the input query and display the results.

14
LUCENE FUNDAMENTALS
  • Lucene internally creates its own file-based
    indexes and organizes the text data accordingly.
  • Lucene works in two fundamental steps:
  • Indexing documents based on an index structure,
    and
  • Searching the documents for the query.

15
INDEXING
  • Indexing documents is the first step: it creates
    a Lucene index in a directory, using one of
    several analyzer algorithms. A Lucene index is a
    collection of documents organized in a way that
    allows quick retrieval of information for
    arbitrary queries. Each document in a Lucene
    index is made up of one or more fields that are
    name-value pairs, much like entries in a HashMap.
  • The fundamental concepts in Lucene are index,
    document, field and term.
  • Index: An index contains a sequence of
    documents.
  • Document: A document is a sequence of fields.
  • Field: A field is a named sequence of terms.
  • Term: A term is a string.
  • Depending on the size of the indexed data,
    several groups of files with the same name and
    different extensions are created. Each of these
    groups is known as a segment. Lucene keeps track
    of the segments using a file called segments.
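
The index / document / field / term relationship can be sketched in plain Java. This is a toy in-memory inverted index, not Lucene's API or file format: SimpleDocument, SimpleIndex and the lowercase-and-split analyzer are illustrative assumptions, and the lambda syntax assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// A document is a set of named fields (name-value pairs).
class SimpleDocument {
    final Map<String, String> fields = new LinkedHashMap<>();

    SimpleDocument add(String name, String value) {
        fields.put(name, value);
        return this;
    }
}

// A toy index: maps "field:term" to the ids of the documents containing it.
class SimpleIndex {
    private final List<SimpleDocument> documents = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Toy analyzer: lowercases the text and splits on non-alphanumerics.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Assigns the next id to the document and records every term of every field.
    void addDocument(SimpleDocument doc) {
        int id = documents.size();
        documents.add(doc);
        for (Map.Entry<String, String> field : doc.fields.entrySet()) {
            for (String term : analyze(field.getValue())) {
                postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>()).add(id);
            }
        }
    }

    // Looks up one term in one field; returns matching document ids in ascending order.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(), new TreeSet<>());
    }

    SimpleDocument get(int id) {
        return documents.get(id);
    }
}
```

A lookup such as search("activity", "hiking") touches only the entry for that term instead of scanning every document, which is why organizing the data up front makes retrieval fast.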


16
INDEXING FILES
  • Lucene is used to convert the large amount of
    text data, containing destination, activity and
    member ID details, into well-organized documents.
    It indexes the documents into segments, which
    makes the search faster. I first retrieved the
    content index file and used Lucene to index it
    into two folders:
  • Destination, which contains documents related to
    the destination and activity index.
  • /destination/_ckl.cfs: compound file format for
    storing destinations and activities.
  • /destination/segments: keeps track of the
    destination and activity indexes.
  • Members, which contains documents related to
    member IDs.
  • /members/_asyz.cfs: compound file format for
    storing the members.
  • /members/segments: keeps track of the member
    indexes.

17
SEARCHING
  • Once the indexes are built, the search mechanism
    accesses the indexes and queries their contents.
    The search can be performed on a single index or
    on multiple indexes; either way, the results are
    collected in a single result set. The search
    method returns Hits, an ordered collection of
    documents matching the query. The Query Parser
    parses the query string and builds an
    appropriate Query object. For fast and
    consistent results, it is recommended to parse
    queries with the same Analyzer that was used
    when indexing the documents.
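
The two points above (one result set from several indexes, and the same analyzer at index and query time) can be sketched in plain Java. This is not Lucene's API: MultiIndexSearch, the index names and the sample entries are hypothetical, and unlike Lucene's Hits the merged list is not ranked by score.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class MultiIndexSearch {
    // Shared toy analyzer, used for both indexing and query parsing.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Builds a term -> entry-labels map for one named index.
    static Map<String, List<String>> buildIndex(String indexName, String... entries) {
        Map<String, List<String>> index = new HashMap<>();
        for (String entry : entries) {
            String label = indexName + ": " + entry;
            for (String term : analyze(entry)) {
                List<String> list = index.computeIfAbsent(term, k -> new ArrayList<>());
                if (!list.contains(label)) {
                    list.add(label);
                }
            }
        }
        return index;
    }

    // Runs the analyzed query against every index, in the order given,
    // and merges the hits into a single list without duplicates.
    @SafeVarargs
    static List<String> search(String query, Map<String, List<String>>... indexes) {
        List<String> hits = new ArrayList<>();
        for (Map<String, List<String>> index : indexes) {
            for (String term : analyze(query)) {
                for (String hit : index.getOrDefault(term, new ArrayList<>())) {
                    if (!hits.contains(hit)) {
                        hits.add(hit);
                    }
                }
            }
        }
        return hits;
    }
}
```

Because the query "DENVER" passes through the same analyze method as the indexed text, it is lowercased to "denver" and matches entries that were indexed as "Denver". Analyzing the query differently from the indexed text would produce terms that never appear in the index.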

18
Content Index Searching Work Flow
[Diagram labels: HTTP REQUEST, QUERY PARSING USING LUCENE API, LUCENE FILE SYSTEM, TOMCAT, STRUTS, HTTP RESPONSE, HTTP REQUEST, WEB SEARCH, XML RESPONSE]
When the user enters a search query, Lucene first
parses the query and then searches for it in
its self-organized file system.
19
JiBX INTEGRATION TIER
The integration tier uses data binding to
process the XML responses retrieved from the
third-party service, which is based on a Web Service
architecture, and ultimately delivers Java objects
to the business tier. JiBX is an open-source
framework for binding XML data to Java
objects. The JiBX framework handles all the details
of converting data to and from XML based on its
own class structures and binding
instructions. It works with existing classes,
using a flexible mapping definition file to
determine how data objects are translated to and
from XML. JiBX is designed to perform the
translation between internal data structures and
XML with very high efficiency, while still allowing
a high degree of control over the translation
process.
20
JiBX Performance Bind Mark
Reference: https://bindmark.dev.java.net/
21
JiBX Startup Time
Reference: http://www-128.ibm.com/developerworks/library/x-databdopt2
22
JiBX Small Documents Memory Usage
Reference: http://www-128.ibm.com/developerworks/library/x-databdopt2
23
JiBX Large Documents Memory Usage
Reference: http://www-128.ibm.com/developerworks/library/x-databdopt2
24
JiBX BINDING
  • JiBX makes the binding process easier, faster and
    more efficient. The binding involves:
  • the binding definition XML file, i.e.,
    SearchRes-jibx.xml,
  • the XML response from the third-party Nutch API,
    and
  • the converted Java objects.
  • In JiBX, the binding process is handled in two
    fundamental steps:
  • BINDING COMPILER
  • JiBX uses binding definition documents to define
    the rules for how Java objects are converted
    to or from XML.
  • BINDING RUNTIME (Marshalling / Unmarshalling)
  • The enhanced class files generated by the binding
    compiler use this runtime component both for
    generating an XML representation of an object in
    memory, called marshalling, and for building an
    object in memory from an XML representation,
    called unmarshalling.
  • The Travel Search Engine performs unmarshalling to
    build objects from the OTA Response.
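
Unmarshalling can be illustrated by hand with the JDK's built-in DOM parser. This sketch is not JiBX and does not use a binding definition: it only shows the kind of XML-to-object conversion that a binding framework automates. The SearchResult class and the result, title and url element names are hypothetical, since the slides do not give the actual response schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical result object; the project's real classes are not shown in the slides.
class SearchResult {
    final String title;
    final String url;

    SearchResult(String title, String url) {
        this.title = title;
        this.url = url;
    }
}

class ManualUnmarshaller {
    // Builds SearchResult objects from an XML string (unmarshalling).
    // Parsing failures are rethrown as IllegalArgumentException.
    static List<SearchResult> unmarshal(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("result");
            List<SearchResult> results = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                results.add(new SearchResult(text(e, "title"), text(e, "url")));
            }
            return results;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XML response", ex);
        }
    }

    // Returns the text of the first descendant element with the given tag, or "" if absent.
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }
}
```

Every new element or field requires another hand-written line in this approach. A binding framework replaces that code with rules in a mapping file, which is the motivation for using JiBX in the integration tier.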

25
Unmarshalling Large Documents
Reference: http://www-128.ibm.com/developerworks/library/x-databdopt2
26
Unmarshalling Small Documents
Reference: http://www-128.ibm.com/developerworks/library/x-databdopt2
27
Integration Tier Work Flow
[Diagram labels: SERVICES THE HTTP REQUEST AND RETURNS AN OPEN TRAVEL ALLIANCE (OTA) STANDARD XML RESPONSE BASED ON WEB SEARCH OR SITE SEARCH, ALSO BASED ON CONTENT INDEXING; HTTP REQUEST; HTTP REQUEST; TOMCAT; STRUTS; HTTP RESPONSE; INTEGRATION TIER - BINDING XML RESPONSE TO JAVA OBJECTS USING JIBX; JAVA OBJECT; BUSINESS LOGIC TIER; XML RESPONSE]
28
PROJECT ENVIRONMENT
INSTALLATION REQUIREMENTS
  • JVM Environment: As I have used new features of
    JDK 1.5, JVM version 1.5 is required for the
    development.
  • Application Server: Tomcat 5.0.28
  • Browser: Internet Explorer 6 or later
  • Project Deployment: The Ant build utility creates
    a WAR file, which can be deployed onto the
    container.
DEVELOPMENT ENVIRONMENT
  • Languages: Java 1.5, JSP, Servlets, XML, XSLT,
    HTML and JavaScript
  • Framework: Apache Jakarta Struts 1.2.7
  • Content Indexing Algorithm: Apache Lucene
  • External Search API: Nutch, built on Lucene
  • Binding: JiBX
  • Build: Apache Jakarta Ant 1.6.5
  • Web Server: Apache Tomcat 5.0.28
29
FUTURE WORK
  • Include more destination- and activity-specific
    details in the content index file
  • Include zip-code-based destination-details search
  • For example, restaurant details based on zip code
  • Implement AJAX technology to suggest destinations
    to the user
  • Explore alternative content indexing and data
    binding technologies

30
REFERENCES
1) LUCENE
   a. Apache Lucene: http://lucene.apache.org/
   b. Search-Enable Your Application with Lucene, Java
      Developer's Journal, Dec 2002
2) STRUTS
   a. Apache Struts: http://struts.apache.org/
   b. Struts Kick Start, James Turner and Kevin Bedell
3) JIBX
   JiBX: http://jibx.sourceforge.net/
4) NUTCH
   Apache Nutch Website: http://lucene.apache.org/nutch/
5) WEB SERVICES
   Beginning Java Web Services, H. Bequet, M.M.
   Kunnumpurath, S. Rhody, A. Tost, Wrox Press, Mar 2003.