Title: OneSAF: Indexing and Searching XML Documents 03FSIW006
1. OneSAF Indexing and Searching XML Documents
03F-SIW-006
- Robin Outar
- Boaventura (Ben) DaCosta
2. Agenda
- Overview
- OneSAF
- XML
- OneSAF Repository Framework and Data Repositories
- OneSAF Indexing Mechanism
- Why the Need to Index XML?
- XML to Index XML
- The Lucene Indexing Engine
- Integrating Lucene
- Meta-data Stored in the Index
- Searching the Index
- Conclusion and Lessons Learned
3. OneSAF Overview
A composable, next-generation Computer-Generated Forces (CGF) simulation that can represent a full range of operations, systems, and control processes (TTP) from the entity level up to brigade level, with variable levels of fidelity, supporting applications across multiple Army Modeling and Simulation (M&S) domains (ACR, RDA, TEMO)
- Software only
- Automated, composable, extensible, interoperable
- Platform independent
- Fielded to National Guard armories, RDECs / Battle Labs, Reserve training centers, and all active-duty brigades and battalions
- Designed to eventually replace legacy entity-based simulations: BBS, ModSAF, JANUS, CCTT SAF, AVCATT SAF
4. XML Overview
- XML is a meta-language, not a programming language
- XML is a family of technologies that includes XML, XML Schema, XSL/XSLT, and XPath
- XML helps define rules (grammar) for designing structured data
- XML is a language for describing languages
- XML was created by the World Wide Web Consortium (W3C) and released in 1998
- Think of structured data as spreadsheets, address books, configuration files, etc.
5. XML Overview (2)
- XML looks like HTML, but it isn't
- HTML specifies what each tag means and how the text will be formatted
- HTML example:
<table>
  <tr><td>EMPNO</td><td>EMPNAME</td></tr>
  <tr><td>123456</td><td>John Adams</td></tr>
</table>
- HTML tags elements for presentation
- XML uses tags to delimit data; what the tags mean is left to the client
- XML example:
<?xml version="1.0"?>
<EMPLOYEE>
  <EMPNO>123456</EMPNO>
  <EMPNAME>John Adams</EMPNAME>
</EMPLOYEE>
- XML tags elements as data
- XML is a vehicle for sharing and interchanging structured data
6. OneSAF Repositories Framework Overview
- The Repositories Framework is an integral part of the OneSAF Data Architecture
- The framework provides a mechanism for managing a number of different file formats, including XML, GIF and PNG images, text files, property files, and proprietary binary files, in a distributed manner across multiple hosts that support simulations
- XML is used predominantly as a data interchange format (DIF) so that data and meta-data can be defined, stored, exchanged, and shared both within OneSAF and with other simulations
7. OneSAF Repositories Framework Overview (2)
- The OneSAF Repositories Framework is composed of System Repository Services (SRS), which provide a uniform mechanism to create and manage both local and remote repository data and meta-data across all OneSAF Products and Components
- The data services provided by the SRS act as an Application Programming Interface (API) to the OneSAF Repositories
- OneSAF Products and Components access the SRS in order to work with data stored in the OneSAF Repositories
8. OneSAF Data Repositories
OneSAF data and meta-data are stored in
repositories
- There are 7 OneSAF Repositories
- Software Repository (CVS)
- Knowledge Acquisition/Knowledge Engineering (KA/KE) Repository
- System Composition Repository (SCR)
- Military Scenario Repository (MSR)
- Environment Repository (ER)
- Parametric and Initialization Repository (PAIR)
- Simulation Output Repository (SOR)
These repositories are electronic storage
mechanisms that keep all the information, data,
and meta-data for one logical area pertaining to
OneSAF
9. Why the Need to Index XML?
- Considering the magnitude of information that can potentially be housed in the data repositories, a vehicle is needed to catalog the data so that it can be managed and searched
- OneSAF decided on an indexing solution because it would speed up data retrieval based on certain search conditions
- Unfortunately, at the time OneSAF began examining XML indexing solutions, very little was available in the open-source community
- What was available was either still in beta or came at a monetary cost
10. Using XML to Index XML
- Since little was available, an in-house approach was developed until such time that a more robust solution became available
- This entailed storing meta-data about all XML documents in each directory inside an Index XML document
- For example:
<?xml version="1.0" encoding="UTF-8"?>
<INDEX xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="index.xsd">
  <FILEENTRY filename="M1A1.xml" classification="U" releasability="US Gov only" Last_Modified="2000-01-15T12:00:00" Format="xml" lock="false" status="complete" test="false"/>
  <KEYWORDS key="vehicletype" value="tank"/>
  <KEYWORDS key="tank" value="M1A1"/>
  <KEYWORDS key="chassis" value="tracked"/>
</INDEX>
11. Using XML to Index XML (2)
- The first concern with this solution was that the index files were themselves XML
- Each time an index document was accessed for reading or writing, it needed to be parsed
- Reading was not a problem, since the Simple API for XML (SAX) was used
- Writing was difficult, though, since the Document Object Model (DOM) was used
- A one-megabyte XML document could easily require up to ten megabytes to be represented in memory
- The second concern was keeping the index files up to date
- SRS allowed developers to update the index files
- During development, index files were updated manually. While most developers were diligent in updating the index, some were not
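The streaming read path mentioned above can be sketched in a few lines. This is a minimal illustration using Python's standard-library `xml.sax` module, not the OneSAF implementation. The handler class, the trimmed sample document, and the rule that each KEYWORDS element belongs to the FILEENTRY that precedes it are all assumptions made for the example.

```python
import xml.sax

# Trimmed sample modeled on the index document shown on the previous slide.
INDEX_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<INDEX>
  <FILEENTRY filename="M1A1.xml" classification="U" lock="false"/>
  <KEYWORDS key="vehicletype" value="tank"/>
  <KEYWORDS key="tank" value="M1A1"/>
  <KEYWORDS key="chassis" value="tracked"/>
</INDEX>
"""


class IndexHandler(xml.sax.ContentHandler):
    """Collects file entries while the parser streams through the document."""

    def __init__(self):
        super().__init__()
        self.entries = {}     # filename -> {"attrs": {...}, "keywords": {...}}
        self._current = None  # filename of the most recent FILEENTRY

    def startElement(self, name, attrs):
        if name == "FILEENTRY":
            self._current = attrs["filename"]
            self.entries[self._current] = {
                "attrs": dict(attrs.items()),
                "keywords": {},
            }
        elif name == "KEYWORDS" and self._current is not None:
            # Assumption: keywords describe the FILEENTRY that precedes them.
            self.entries[self._current]["keywords"][attrs["key"]] = attrs["value"]


def read_index(xml_bytes):
    """Parse an index document without building a full in-memory tree."""
    handler = IndexHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.entries


entries = read_index(INDEX_XML)
print(entries["M1A1.xml"]["keywords"])
```

Because the handler keeps only the fields it needs, memory use stays proportional to the extracted meta-data rather than to the size of the whole document, which is the contrast with DOM drawn on this slide.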
12. The Lucene Indexing Engine
- After examining other XML indexing solutions, the Jakarta Lucene Indexing Engine from the Apache Jakarta Group was selected
- Lucene creates an index, which is represented as a series of binary files that are optimized for fast lookup of data
- This index is housed in the local file system
- Attributes about the document can be stored in the index
- Queries can be performed using the attributes as keys
- The index can be rebuilt (re-indexed) as needed
- The OneSAF Parametric and Initialization Repository (PAIR) (275 MB) was indexed in under 10 seconds
13. The Lucene Indexing Engine (2)
- Lucene also showed favorable search performance when average search times were compared between the OneSAF in-house indexing approach and the Lucene Indexing Engine
[Table: average search times, in-house indexing approach vs. Lucene Indexing Engine]
14. Integrating Lucene
- Integration of Lucene with OneSAF was a challenge
- Lucene was not designed for a distributed environment. Since OneSAF is being designed to operate on both a single node and multiple nodes, this posed a problem
- Lucene does not support multiple indexes on multiple nodes. There is no way to synchronize indexes using Lucene
- Lucene also does not support networked queries
15. Integrating Lucene (2)
- In order to allow all OneSAF composers, editors, and tools running on multiple nodes to have a consistent view of the index, the decision was made to keep a single copy of the index set on a server
- This meant that the indexes (one for each repository) would reside on the server
- A basic client-server approach was adopted
16. Integrating Lucene (3)
- Since Lucene does not have network support built in, an Indexer SRS was developed to enable Lucene to operate in a distributed environment
- This Indexer SRS is simply a daemon that listens for requests from clients, processes the requests, and sends back a response
- This Indexer SRS is also capable of arbitration to ensure it is the only Indexer SRS running in the distributed system
17. Meta-data Stored in the Index
- The index houses two categories of meta-data: required attributes and client-defined name-value pairs
- Required meta-data are automatically supplied for all files added to the index
- This includes information such as filename, location, locking state, and last modified date
- Default information is used if no other information is supplied
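The idea of required attributes with fallback defaults can be sketched as below. This is only an illustration in Python: the function name, the dictionary keys, and the specific defaults (unlocked, current time as the last modified date) are assumptions, since the slides do not state which default values OneSAF uses.

```python
import posixpath
from datetime import datetime, timezone


def required_metadata(path, locked=None, last_modified=None, now=None):
    """Build the required record for `path`, falling back to defaults.

    All defaults here are hypothetical stand-ins for whatever the real
    index supplies when the client gives no other information.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    stamp = last_modified if last_modified is not None else now
    return {
        "filename": posixpath.basename(path),
        "location": posixpath.dirname(path),
        "lock": False if locked is None else locked,  # assumed default: unlocked
        "last_modified": stamp.isoformat(timespec="seconds"),
    }


record = required_metadata("pair/vehicles/M1A1.xml", now=datetime(2003, 9, 1, 12, 0, 0))
print(record)
```

Passing `now` explicitly keeps the example deterministic; a caller would normally omit it.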
18. Meta-data Stored in the Index (2)
- Name-value pairs are simply client-defined attributes assigned to desired file references within the index
- These values are not automatically added for each file in the index
- Values must be added using the SRS, or by using XML Processing Instructions (PIs). These PIs are placed in the XML document after the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
<?INDEX keyword="vehicletype" value="tank"?>
<?INDEX keyword="tank" value="M1A1"?>
<?INDEX keyword="chassis" value="tracked"?>
<ELEMENT...
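One way an indexer could pick up such processing instructions is sketched below, again with Python's standard-library `xml.sax`. This is an illustration rather than the OneSAF code: the handler name, the regular expression for the `name="value"` pairs, and the placeholder root element `<ELEMENT/>` are assumptions made so the sample is a complete document.

```python
import re
import xml.sax

# Sample document; <ELEMENT/> is a placeholder root for illustration only.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<?INDEX keyword="vehicletype" value="tank"?>
<?INDEX keyword="tank" value="M1A1"?>
<?INDEX keyword="chassis" value="tracked"?>
<ELEMENT/>
"""

# Matches name="value" pairs inside the processing-instruction data.
PAIR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


class IndexPIHandler(xml.sax.ContentHandler):
    """Collects keyword/value pairs from INDEX processing instructions."""

    def __init__(self):
        super().__init__()
        self.pairs = {}

    def processingInstruction(self, target, data):
        if target != "INDEX":
            return  # ignore PIs intended for other applications
        fields = dict(PAIR.findall(data))
        if "keyword" in fields and "value" in fields:
            self.pairs[fields["keyword"]] = fields["value"]


def extract_index_pairs(xml_bytes):
    handler = IndexPIHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.pairs


pairs = extract_index_pairs(DOC)
print(pairs)
```

The XML declaration itself is not reported as a processing instruction, so only the INDEX entries are collected.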
19. Searching the Index
- Once a file reference has been added to the index, a Find SRS provides the capability to search the index and retrieve specific data from an indexed file
- SQL-like queries can be performed
- The Lucene query language allows nested queries using the operators AND, OR, NOT or phrase relations. This allows the client to pose complex queries against the files that have been indexed
- For example, (filename:"M1A1.xml" AND classification:"U")
- Returns all XML documents that have a filename of M1A1.xml and a classification equal to U (i.e. Unclassified)
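How boolean operators combine over attribute keys can be shown with a toy in-memory index. The sketch below is plain Python and does not use or imitate Lucene's API; the `ToyIndex` class, its method names, and the sample files other than M1A1.xml are hypothetical. Each (field, value) pair maps to the set of files carrying it, so AND, OR, and NOT reduce to set operations.

```python
from collections import defaultdict


class ToyIndex:
    """Maps (field, value) pairs to the set of files that carry them."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._files = set()

    def add(self, filename, **attributes):
        self._files.add(filename)
        self._postings[("filename", filename)].add(filename)
        for field, value in attributes.items():
            self._postings[(field, value)].add(filename)

    def term(self, field, value):
        """Files whose `field` equals `value` (empty set if none)."""
        return set(self._postings.get((field, value), set()))

    def not_(self, matches):
        """Complement of `matches` within all indexed files."""
        return self._files - matches


index = ToyIndex()
index.add("M1A1.xml", classification="U", vehicletype="tank")
index.add("truck.xml", classification="U", vehicletype="truck")
index.add("plan.xml", classification="S")

# filename:"M1A1.xml" AND classification:"U"
both = index.term("filename", "M1A1.xml") & index.term("classification", "U")

# classification:"U" AND NOT vehicletype:"tank"
not_tank = index.term("classification", "U") & index.not_(index.term("vehicletype", "tank"))

print(sorted(both), sorted(not_tank))
```

A real engine adds query parsing, tokenization, and on-disk storage on top of this basic lookup structure.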
20. Conclusion
- Implementing a robust XML indexing solution has been a challenge
- As it stands, using a binary index rather than XML to index XML documents seems to be the best solution right now
21. Contact Information
- Robin Outar
- Science Applications International Corporation
- 321-235-7660
- outarr_at_saic.com
- Boaventura (Ben) DaCosta
- Dynamics Research Corporation
- 407-380-1200
- bdacosta_at_drc.com