OneSAF: Indexing and Searching XML Documents 03FSIW006

1
OneSAF Indexing and Searching XML Documents
03F-SIW-006
  • Robin Outar
  • Boaventura (Ben) DaCosta

2
Agenda
  • Overview
  • OneSAF
  • XML
  • OneSAF Repository Framework and Data Repositories
  • OneSAF Indexing Mechanism
  • Why the Need to Index XML?
  • XML to Index XML
  • The Lucene Indexing Engine
  • Integrating Lucene
  • Meta-data Stored in the Index
  • Searching the Index
  • Conclusion and Lessons Learned

3
OneSAF Overview
  • A composable, next-generation Computer-Generated
    Forces (CGF) that can represent a full range of
    operations, systems, and control processes (TTP)
    from entity up to brigade level, with variable
    levels of fidelity, supporting multiple Army
    Modeling and Simulation (M&S) domain (ACR, RDA,
    TEMO) applications
  • Software only
  • Automated, composable, extensible, interoperable,
    platform independent
  • Fielded to National Guard armories, RDECs /
    Battle Labs, Reserve training centers, and all
    active-duty brigades and battalions
  • Designed to eventually replace legacy entity-based
    simulations: BBS, ModSAF, JANUS, CCTT SAF,
    AVCATT SAF
4
XML Overview
  • XML is a meta-language, not a programming language
  • XML is a family of technologies, which includes
    XML, XML Schema, XSL/XSLT, and XPath
  • XML helps define rules (a grammar) for designing
    structured data
  • XML is a language for describing languages
  • XML was created by the World Wide Web Consortium
    (W3C) and released in 1998
  • Think of structured data as spreadsheets, address
    books, configuration files, etc.
5
XML Overview (2)
XML looks like HTML, but it isn't
HTML
    <table>
      <tr><td>EMPNO</td><td>EMPNAME</td></tr>
      <tr><td>123456</td><td>John Adams</td></tr>
    </table>
  • HTML specifies what each tag means and how the
    text will be formatted
  • HTML tags elements for presentation
XML
    <?xml version="1.0"?>
    <EMPLOYEE>
      <EMPNO>123456</EMPNO>
      <EMPNAME>John Adams</EMPNAME>
    </EMPLOYEE>
  • XML uses tags to delimit data; what the tags
    mean is left to the client
  • XML tags elements as data
  • XML is a vehicle for sharing and interchanging
    structured data
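To make "what the tags mean is left to the client" concrete, here is a minimal Python sketch that parses the EMPLOYEE document from this slide with the standard library. The function name and the choice to treat EMPNO as an integer are assumptions made for illustration; they are not part of the presentation.

```python
import xml.etree.ElementTree as ET

EMPLOYEE_XML = """<?xml version="1.0"?>
<EMPLOYEE>
  <EMPNO>123456</EMPNO>
  <EMPNAME>John Adams</EMPNAME>
</EMPLOYEE>"""


def read_employee(xml_text):
    """Return the employee as a dict; the client decides what each tag means."""
    root = ET.fromstring(xml_text)
    return {
        # This client chooses to interpret EMPNO as a number (an assumption).
        "number": int(root.findtext("EMPNO")),
        "name": root.findtext("EMPNAME"),
    }


print(read_employee(EMPLOYEE_XML))
```

A different client could read the same document and treat EMPNO as an opaque string; nothing in the XML itself dictates either choice.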
6
OneSAF Repositories Framework Overview
  • The Repositories Framework is an integral part
    of the OneSAF Data Architecture
  • The framework provides a mechanism for managing
    a number of different file formats, including
    XML, GIFs, PNGs, text files, property files, and
    proprietary binary files, in a distributed manner
    across the multiple hosts that support simulations
  • XML is used predominantly as a data interchange
    format (DIF) so that data and meta-data can be
    defined, stored, exchanged, and shared both
    within OneSAF and with other simulations

7
OneSAF Repositories Framework Overview (2)
  • The OneSAF Repositories Framework is composed of
    System Repository Services (SRS), which provide a
    uniform mechanism to create and manage both local
    and remote repository data and meta-data across
    all OneSAF Products and Components
  • The data services provided by the SRS act as an
    Application Programming Interface (API) to the
    OneSAF Repositories
  • OneSAF Products and Components access the SRS in
    order to work with data stored in the OneSAF
    Repositories
8
OneSAF Data Repositories
OneSAF data and meta-data are stored in
repositories
  • There are seven OneSAF Repositories:
  • Software Repository (CVS)
  • Knowledge Acquisition/Knowledge Engineering
    (KA/KE) Repository
  • System Composition Repository (SCR)
  • Military Scenario Repository (MSR)
  • Environment Repository (ER)
  • Parametric and Initialization Repository (PAIR)
  • Simulation Output Repository (SOR)

These repositories are electronic storage
mechanisms that keep all the information, data,
and meta-data for one logical area pertaining to
OneSAF
9
Why the Need to Index XML?
  • Considering the magnitude of information that can
    potentially be housed in the data repositories, a
    vehicle is needed to catalog the data so that it
    can be managed and searched
  • OneSAF decided on an indexing solution simply
    because it would speed up data retrieval based on
    certain search conditions
  • Unfortunately, at the time OneSAF began examining
    XML indexing solutions, very little was available
    in the open-source community
  • What was available was either in its beta stage
    or required a monetary cost

10
Using XML to Index XML
  • Since little was available, an in-house approach
    was developed until such time that a more robust
    solution became available
  • This entailed storing meta-data information about
    all XML documents in each directory inside an
    Index XML document
  • For example:
    <?xml version="1.0" encoding="UTF-8"?>
    <INDEX xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="index.xsd">
      <FILEENTRY filename="M1A1.xml" classification="U"
                 releasability="US Gov only"
                 Last_Modified="2000-01-15T12:00:00" Format="xml"
                 lock="false" status="complete" test="false"/>
      <KEYWORDS key="vehicletype" value="tank"/>
      <KEYWORDS key="tank" value="M1A1"/>
      <KEYWORDS key="chassis" value="tracked"/>
    </INDEX>
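As a rough sketch of how a client might read such an Index XML document, the Python below loads a trimmed copy of the example (the namespace attributes and several FILEENTRY attributes are omitted for brevity) and returns the file entry and its keywords. The function name `load_index` is hypothetical and does not reflect the actual OneSAF code.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's example index (an abbreviation, not the full file).
INDEX_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<INDEX>
  <FILEENTRY filename="M1A1.xml" classification="U" Format="xml" lock="false"/>
  <KEYWORDS key="vehicletype" value="tank"/>
  <KEYWORDS key="tank" value="M1A1"/>
  <KEYWORDS key="chassis" value="tracked"/>
</INDEX>"""


def load_index(xml_bytes):
    """Parse an index document into (file entry attributes, keyword dict)."""
    root = ET.fromstring(xml_bytes)
    entry = dict(root.find("FILEENTRY").attrib)
    keywords = {k.get("key"): k.get("value") for k in root.findall("KEYWORDS")}
    return entry, keywords


entry, keywords = load_index(INDEX_XML)
print(entry["filename"], keywords)
```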

11
Using XML to Index XML (2)
  • The first concern with this solution was that the
    index files were themselves XML
  • Each time an index document was accessed for
    reading or writing, it needed to be parsed
  • Reading was not a problem, since the Simple API
    for XML (SAX) was used
  • Writing was more difficult, since the Document
    Object Model (DOM) was used
  • A one-megabyte XML document could easily require
    up to ten megabytes to be represented in memory
  • The second concern was keeping the index files
    up to date
  • SRS allowed developers to update the index files
  • During development, index files were updated
    manually. While most developers were diligent in
    updating the index, some were not
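The reason reading was cheap can be sketched with a SAX handler: elements are processed as they stream past, so no in-memory tree is built. The example below uses Python's standard `xml.sax` module; the handler class and the sample document are assumptions for illustration, not the OneSAF implementation.

```python
import xml.sax


class KeywordHandler(xml.sax.ContentHandler):
    """Collect KEYWORDS key/value pairs while the document streams by."""

    def __init__(self):
        super().__init__()
        self.keywords = {}

    def startElement(self, name, attrs):
        # Called once per opening tag; nothing else is retained in memory.
        if name == "KEYWORDS":
            self.keywords[attrs.getValue("key")] = attrs.getValue("value")


def read_keywords(xml_bytes):
    handler = KeywordHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.keywords


SAMPLE = (b'<INDEX><KEYWORDS key="vehicletype" value="tank"/>'
          b'<KEYWORDS key="chassis" value="tracked"/></INDEX>')
print(read_keywords(SAMPLE))
```

Writing is harder with this model because SAX only reports events; changing a document generally means loading it into an editable structure such as a DOM tree, which is where the memory cost mentioned above comes from.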

12
The Lucene Indexing Engine
  • After examining other XML indexing solutions, the
    Jakarta Lucene Indexing Engine from the Apache
    Jakarta Group was selected
  • Lucene creates an index, which is represented as
    a series of binary files that are optimized for
    fast lookup of data
  • This index is housed in the local file system
  • Attributes about the document can be stored in
    the index
  • Queries can be performed using the attributes as
    keys
  • The index can be re-indexed as needed
  • The OneSAF Parametric and Initialization
    Repository (PAIR) (275 MB) was indexed in under
    10 seconds
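The idea of storing document attributes in an index and querying with those attributes as keys can be illustrated with a toy in-memory index. This is not Lucene and says nothing about its binary file format; the class name, method names, and the second file name are hypothetical.

```python
from collections import defaultdict


class AttributeIndex:
    """Toy index mapping (attribute, value) pairs to the files that carry them."""

    def __init__(self):
        self._postings = defaultdict(set)  # (attribute, value) -> filenames
        self._docs = {}                    # filename -> attributes

    def add(self, filename, attributes):
        # Adding an existing file re-indexes it from scratch.
        self.remove(filename)
        self._docs[filename] = dict(attributes)
        for item in attributes.items():
            self._postings[item].add(filename)

    def remove(self, filename):
        for item in self._docs.pop(filename, {}).items():
            self._postings[item].discard(filename)

    def find(self, attribute, value):
        return sorted(self._postings.get((attribute, value), set()))


index = AttributeIndex()
index.add("M1A1.xml", {"vehicletype": "tank", "chassis": "tracked"})
index.add("truck.xml", {"vehicletype": "truck", "chassis": "wheeled"})
print(index.find("vehicletype", "tank"))
```

Lookup cost depends on the number of matching entries rather than on the size of the underlying XML files, which is the property that makes an index attractive for a large repository.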

13
The Lucene Indexing Engine (2)
  • Lucene also showed favorable search performance.
    The following shows the average search times
    using the OneSAF in-house indexing approach and
    the Lucene Indexing Engine

14
Integrating Lucene
  • Integration of Lucene with OneSAF was a challenge
  • Lucene was not designed for a distributed
    environment. Since OneSAF is being designed to
    operate on both a single node and multiple nodes,
    this posed a problem
  • Lucene does not support multiple indexes on
    multiple nodes. There is no way to synchronize
    indexes using Lucene
  • Lucene also does not support networked queries

15
Integrating Lucene (2)
  • In order to allow all OneSAF composers, editors,
    and tools running on multiple nodes to have a
    consistent view of the index, the decision was
    made to have one copy of the index set running on
    a server
  • This meant that the indexes (one for each
    repository) would reside on the server
  • A basic client-server approach was adopted

16
Integrating Lucene (3)
  • Since Lucene does not have network support built
    in, an Indexer SRS was developed to enable Lucene
    to operate in a distributed environment
  • This Indexer SRS is simply a daemon that listens
    for requests from clients, processes the
    requests, and sends back a response
  • This Indexer SRS is also capable of arbitration
    to ensure it is the only Indexer SRS running in
    the distributed system
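The listen, process, respond loop described above might look roughly like the following Python sketch. The JSON line protocol, the port number, the `find` operation, and all names are assumptions made for illustration; the slides do not describe the actual wire format, and the arbitration step is not shown.

```python
import json
import socketserver

# Stand-in for the server-side index (hypothetical contents).
INDEX = {"M1A1.xml": {"classification": "U", "vehicletype": "tank"}}


def handle_request(line, index=INDEX):
    """Process one JSON request line and return one JSON response line."""
    try:
        request = json.loads(line)
        if request["op"] == "find":
            hits = sorted(name for name, meta in index.items()
                          if meta.get(request["key"]) == request["value"])
            return json.dumps({"ok": True, "files": hits})
        return json.dumps({"ok": False, "error": "unknown op"})
    except (ValueError, KeyError, TypeError) as exc:
        return json.dumps({"ok": False, "error": str(exc)})


class IndexerHandler(socketserver.StreamRequestHandler):
    """Answer each newline-terminated request on a client connection."""

    def handle(self):
        for raw in self.rfile:
            reply = handle_request(raw.decode("utf-8"))
            self.wfile.write(reply.encode("utf-8") + b"\n")


def serve(host="127.0.0.1", port=9090):
    """Run the daemon until interrupted (not started automatically here)."""
    with socketserver.TCPServer((host, port), IndexerHandler) as server:
        server.serve_forever()


print(handle_request('{"op": "find", "key": "vehicletype", "value": "tank"}'))
```

Keeping the request logic in a plain function separate from the socket handling makes it possible to exercise the protocol without opening a network connection.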

17
Meta-data Stored in the Index
  • The index houses two categories of meta-data,
    required attributes and client-defined name-value
    pairs
  • Required meta-data are automatically supplied for
    all files added to the index
  • This includes information such as filename,
    location, locking state, and last modified date
  • Default information is used if no other
    information is supplied
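One way to picture the two categories of meta-data is a record with required fields that fall back to defaults, plus an open dictionary for client-defined pairs. The field names and default values below are assumptions for illustration; the slides do not list the actual defaults.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IndexEntry:
    """Required meta-data with defaults, plus client-defined name-value pairs."""

    filename: str
    location: str = "."          # assumed default location
    locked: bool = False         # assumed default locking state
    last_modified: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    attributes: dict = field(default_factory=dict)  # client-defined pairs


entry = IndexEntry("M1A1.xml")
entry.attributes["vehicletype"] = "tank"
print(entry.filename, entry.locked, entry.attributes)
```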

18
Meta-data Stored in the Index (2)
  • Name-value pairs are simply client-defined
    attributes assigned to desired file references
    within the index
  • These values are not automatically added for each
    file in the index
  • Values must be added using the SRS
  • Or by using XML Processing Instructions (PIs).
    These PIs are placed in the XML document after
    the XML declaration:
    <?xml version="1.0" encoding="UTF-8"?>
    <?INDEX keyword="vehicletype" value="tank"?>
    <?INDEX keyword="tank" value="M1A1"?>
    <?INDEX keyword="chassis" value="tracked"?>
    <ELEMENT...
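A hedged sketch of how an indexer could pick up such processing instructions: Python's standard SAX parser reports each PI through `processingInstruction(target, data)`, and the pseudo-attributes can then be pulled out of the data string. The handler name, the regular expression, and the empty `<ELEMENT/>` root standing in for the elided document body are all assumptions.

```python
import re
import xml.sax

# Matches pseudo-attributes such as keyword="tank" inside a PI's data.
PSEUDO_ATTR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


class IndexPIHandler(xml.sax.ContentHandler):
    """Collect keyword/value pairs from <?INDEX ...?> instructions."""

    def __init__(self):
        super().__init__()
        self.pairs = {}

    def processingInstruction(self, target, data):
        if target == "INDEX":
            fields = dict(PSEUDO_ATTR.findall(data))
            if "keyword" in fields and "value" in fields:
                self.pairs[fields["keyword"]] = fields["value"]


def extract_index_pairs(xml_bytes):
    handler = IndexPIHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.pairs


# <ELEMENT/> is a placeholder root; the real document body is not shown.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<?INDEX keyword="vehicletype" value="tank"?>
<?INDEX keyword="tank" value="M1A1"?>
<?INDEX keyword="chassis" value="tracked"?>
<ELEMENT/>"""

print(extract_index_pairs(DOC))
```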

19
Searching the Index
  • Once a file reference has been added to the
    index, a Find SRS provides the capability to
    search the index and retrieve specific data from
    an indexed file
  • SQL-like queries can be performed
  • The Lucene query language allows nested queries
    using the operators AND, OR, NOT or phrase
    relations. This allows the client to pose complex
    queries against the files that have been indexed
  • For example, (filename:"M1A1.xml" AND
    classification:"U")
  • Returns all XML documents that have a filename of
    M1A1.xml and a classification equal to U
    (i.e. Unclassified)
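To show what such a query means, the sketch below evaluates a conjunction of field:"value" terms against a list of meta-data records, assuming the query uses Lucene's field:"value" syntax. It is a deliberately tiny stand-in written for this example, not Lucene's query parser: it handles only AND, and the record contents and the second file name are hypothetical.

```python
import re

TERM = re.compile(r'(\w+):"([^"]*)"')


def matches(query, record):
    """True if the record satisfies every field:"value" term joined by AND."""
    body = query.strip().strip("()")
    for clause in (c.strip() for c in body.split(" AND ")):
        m = TERM.fullmatch(clause)
        if m is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        field_name, value = m.groups()
        if record.get(field_name) != value:
            return False
    return True


def search(query, records):
    """Return the filenames of all records matching the query."""
    return [r["filename"] for r in records if matches(query, r)]


RECORDS = [
    {"filename": "M1A1.xml", "classification": "U"},
    {"filename": "notes.xml", "classification": "U"},
]
QUERY = '(filename:"M1A1.xml" AND classification:"U")'
print(search(QUERY, RECORDS))
```

Dropping the filename term and searching on classification alone would return both records, which illustrates how each added AND clause narrows the result set.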

20
Conclusion
  • Implementing a robust XML Indexing solution has
    been a challenge
  • As it stands, using a binary index rather than
    XML to index XML documents appears to be the
    best solution

21
Contact Information
  • Robin Outar
  • Science Applications International Corporation
  • 321-235-7660
  • outarr_at_saic.com
  • Boaventura (Ben) DaCosta
  • Dynamics Research Corporation
  • 407-380-1200
  • bdacosta_at_drc.com