Title: EP118 What are you Searching For
1 EP118: What Are You Searching For?
Dmitry Chernizer, Enterprise Systems Architect
Dcherniz_at_sybase.com
2 Agenda
- In My Generation
- Concept Search Module
- Content Search Module
- Sample Configurations
- Summary
3 The e-Volution of Unstructured Data
- Back in my day, people used to walk to their information.
- Uphill both ways!
- Now everything is at your fingertips.
- www.Research.com
- www.StockAdvice.com
- www.NewJob.com
- Billions of pages of text, HTML, XML
- Most of them useless and outdated information
- It's not about the 20 docs you have
- It's about the 5 pages you need!
4 The e-Volution of Unstructured Data: How did we get here?
[Figure labels: Word, Text, Relational, PDF, HTML, "Crash!"]
5 The e-Volution of Unstructured Data: What is knowledge management?
- A way to store non-relational (maybe hierarchical) data
- A standard way to express complex relationships
- Gather data
- Process and store it
- Query and display it
- Ability to assign a life cycle to a piece of information
- Why, you ask?
- Because your brain works that way
6 What you need to know
- Less than Einstein
- More than this guy...
- Two kinds of Search Engines
- Concept Based Search
- Content Based Search
7 What you need to know
- Concept Based Search: deals with processing unstructured requests
- Content Based Search: deals with processing structured requests
8 Concept Based Search: Why and where does it fit in?
Continuous Availability
Personalization
Content Management
Integration
Security
9 Concept Search Engine Basics
Note: The Sybase Concept Search Engine uses embedded technology from Autonomy
- The Purpose
- The Process
- Bayesian Inference
- Shannon's Information Theory
- Adaptive Probabilistic Concept Modeling
- Dynamic Reasoning Engine
- Examples
10 Concept Search Engine Basics: The Purpose
- Automate the process of getting the right information to the right person
- Improve the efficiency of information retrieval
- Enable the dynamic personalization of digital content
- Natural language content search and retrieval
- Automatic categorization by an agent
- Automatic Content Personalization
11 Concept Search Engine Basics: The Process
- Advanced concept matching techniques
- High-performance pattern matching algorithms
- Can analyze a text and identify the key concepts within the document
- Based on frequency and relationships of terms correlated with meaning
- Language independent
12 Concept Search Engine Basics: Limitations of other approaches
- Keyword, Boolean and Proximity Searches
- Exacerbate information overload
- Can't tell how relevant a document is to the subject being researched
- Only track simple occurrence of keywords
- (e.g., "CD AND (NOT (financial OR money OR invest)) AND music")
- May track proximity of content but not relevant content
- Lack of localization (English Wizard: hey, I'm NOT!)
13 Bayesian Inference
- Developed by Thomas Bayes, 18th century cleric and mathematician
- Central tenet of modern statistical probability modeling
- Calculates probabilistic relationship between multiple variables and the extent to which each impacts the other
- Used in pattern and fingerprint recognition
Okay, maybe not this guy
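A minimal sketch of the Bayes' rule arithmetic behind this slide. The function name and all three probabilities are hypothetical, picked only to make the calculation concrete; this is not the formula used inside the Autonomy engine.

```python
def posterior(prior, likelihood, evidence_rate):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence_rate

# Hypothetical numbers, for illustration only:
#   10% of documents are about music             -> P(music)        = 0.10
#   60% of music documents contain the term "CD" -> P("CD" | music) = 0.60
#   15% of all documents contain the term "CD"   -> P("CD")         = 0.15
p_music_given_cd = posterior(prior=0.10, likelihood=0.60, evidence_rate=0.15)
print(round(p_music_given_cd, 2))  # prints 0.4
```

Seeing "CD" raises the estimated chance that a document is about music from 10% to 40%, which is the kind of update a probabilistic engine can make without a hand-written Boolean rule.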
14 Shannon's Information Theory
- Developed by Claude Shannon in 1948
- Words which are less frequent across all documents, but appear in a cluster of documents, are more distinguishing and tend to convey more information
- Ideas can be inferred from related content
- An inference engine may be used to parse and build content
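One way to make "less frequent words convey more information" concrete is to measure a term's information content as -log2 of the fraction of documents that contain it. The sketch below is a simplified stand-in under that assumption; the function name and sample documents are hypothetical, and the actual weighting used by the product is not described on these slides.

```python
import math

def term_information(docs):
    """Return {term: -log2(fraction of documents containing the term)}."""
    n = len(docs)
    doc_freq = {}
    for text in docs:
        # Count each term once per document.
        for term in set(text.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: -math.log2(df / n) for t, df in doc_freq.items()}

docs = [
    "the band released a new cd",
    "the cd drive failed",
    "the bank raised the rate",
    "the rate on the cd account",
]
info = term_information(docs)
# "the" occurs in every document and carries 0 bits;
# "band" occurs in one of four documents and carries 2 bits.
```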
15 Adaptive Probabilistic Concept Modeling
- Bayesian Inference
- Shannon's Information Theory
- Dynamic Reasoning Engine (DRE) generates networks of concepts
- Terms are weighted and relationships are established
- The unstructured content portal metaphor
16 Concept Search Inference Engine: Also known as Dynamic Reasoning Engine
- Core engine of the Concept Search logic
- Uses the APCM algorithms to extract content
- Generates relative weight of document relevance, base summary and/or result set (non-tabular)
- Generates query plans for unstructured data
- May be stored as Templates for reusable queries
- May be used by agent processes for aggregation
- Accessed through Enterprise Portal Search API
17 Auto Indexer
- Automatically gathers text content from local file systems and imports external files into an index
- Can gather document sets in a local file system
- Can spider mapped drives
- Can load a single document as discrete sets
- Uses Verity, KeyView and Adobe filters to work on ASCII text
- Will continually check for new content
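The "continually check for new content" step can be pictured as a repeated scan that compares file modification times with what was seen on the previous pass. This is a sketch under that assumption; `scan_for_changes` and the `seen` dictionary are hypothetical names, not part of the Auto Indexer.

```python
import os

def scan_for_changes(root, seen):
    """Return paths under root that are new or modified since the last scan.

    `seen` maps path -> last known modification time and is updated in place.
    """
    changed = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            mtime = os.path.getmtime(path)
            if seen.get(path) != mtime:
                seen[path] = mtime
                changed.append(path)
    return sorted(changed)
```

Calling the function on a timer with the same `seen` dictionary returns only the files that need to be (re)indexed on each pass.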
18 Agent Process
- Automated Content Categorizer
- Stores categories or reusable queries known as Agents
- Agents can be shared or used to find people with similar interests
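As a rough illustration of agents as stored queries, the sketch below treats each agent as a set of terms, routes a document to every agent with enough matching terms, and compares two agents by the overlap of their term sets. The names, the `min_hits` threshold and the Jaccard measure are all assumptions made for this example.

```python
def matches(agent_terms, text, min_hits=2):
    """An agent fires when at least min_hits of its terms occur in the text."""
    words = set(text.lower().split())
    return len(agent_terms & words) >= min_hits

def similarity(a, b):
    """Jaccard overlap between two agents' term sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b)

# Hypothetical agents, one per user.
agents = {
    "alice": {"jazz", "cd", "concert"},
    "bob": {"cd", "interest", "bank"},
    "carol": {"jazz", "concert", "festival"},
}
doc = "New jazz CD announced ahead of the summer concert"
interested = sorted(name for name, terms in agents.items() if matches(terms, doc))
# interested == ["alice", "carol"]; alice and carol also share half their terms.
```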
19 Knowledge Fetch Process
- Allows auto-spidering of web sites to gather data
- Converts web content to indexable format
- May be used to fetch content from many sites simultaneously
- Can return meta-data and conventional text content
- Obtains web pages behind firewalls and through proxy servers
- Obtains web pages protected by a login
- Obtains web pages using cookies
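The "convert web content to indexable format" step can be sketched with the Python standard library: strip markup, skip script and style blocks, and keep `<meta>` name/content pairs as meta-data. The class and function names are hypothetical, and no fetching, login or cookie handling is attempted here.

```python
from html.parser import HTMLParser

class PageToText(HTMLParser):
    """Collect visible text and <meta name=... content=...> pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.chunks = []
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "meta":
            a = dict(attrs)
            if a.get("name") and a.get("content") is not None:
                self.meta[a["name"].lower()] = a["content"]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def to_indexable(html):
    """Return (plain_text, meta_dict) for one HTML page."""
    parser = PageToText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks), parser.meta

page = """<html><head><title>Rates</title>
<meta name="Keywords" content="cd, bank">
<style>p {color: red}</style></head>
<body><h1>CD rates</h1><p>Rates rose <b>today</b>.</p></body></html>"""
text, meta = to_indexable(page)
```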
20 The Knowledge Management Server: A Portal Service
[Diagram labels: Sybase Enterprise Portal; Application Service Engine; HTTP/HTTPS; Open Client, IBM DRDA, SQLNet, ODBC/JDBC; File I/O, POP3, Exchange, Lotus Notes; Word]
21 Enterprise Portal Search Services
- Encapsulate Search API into a set of EP components
- Components can be accessed by other EP services, such as security servlets, messaging or other EJBs
- Allows load balancing across server clusters
- Secure Search and Profile Locking
- Allows extending of the Dynamic Reasoning Engine via ANY component model (Java, C, ActiveX, Server Side JavaScript, etc.)
22 Sample Architectures
[Diagram labels: Client; Firewall; Load Balancing Hardware; Web Server Presentation Layer; Application Engine; Concept Search Inference Engine; Knowledge Server; Knowledge Server Agents; External Spider Agent; Internal Spider Agents; Fetch Agents; Unstructured Data Repositories; Data repository; Intranet; DMZ Ring]
23 Storage Overhead
- No content stored, just terms and weights: 30-50% of original document size
- Content stored, plus terms and weights: 150% of original document size
- Content, proximity phrase matching, and terms and weights: 250% of original document size
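The overhead figures above translate directly into a sizing estimate. In this sketch the percentages come from the slide, while the mode names and the 1000 MB source size are hypothetical.

```python
# Index size as a percentage range of the original document size (from the slide).
OVERHEAD_PCT = {
    "terms_only": (30, 50),
    "content_plus_terms": (150, 150),
    "content_proximity_terms": (250, 250),
}

def index_size_mb(source_mb, mode):
    """Return the (low, high) estimated index size in MB for a storage mode."""
    low, high = OVERHEAD_PCT[mode]
    return source_mb * low / 100, source_mb * high / 100

# Hypothetical 1000 MB document set:
for mode in OVERHEAD_PCT:
    print(mode, index_size_mb(1000, mode))
```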
24 Content Based Search: Why and where does it fit in?
Continuous Availability
Personalization
Content Management
Integration
Security
25 Content Full Text Search Engine Basics
Note: The Sybase Content Full Text Search Engine uses embedded technology from Verity
- The Purpose
- The Process
- Content Search Basics
- Full Text Search Specialty Data Store
- Sample Architectures
26 Content Search Engine Basics: The Purpose
- Structured (SQL) access to unstructured data
- Adaptive Server (or EP) indexes documents stored in external data stores
- Indexes are maintained within a collection
- It understands words and language constructs
- It understands many document types, e.g. MS Word, HTML, SGML, PDF, etc.
27 Content Search Engine Basics: The Process
[Diagram labels: Sybase Enterprise Portal; Application Service Engine; SQL Query; Specialty Data Store; Text; Word]
28 Content Search Engine Basics: The Process
- Queries are issued against a collection
- Results include a document identifier and a score
- Score indicates how well a document matched the query
- Can understand and index many foreign languages
- Includes rules for understanding words and constructs of the specified language
29 Content Search Engine Basics: The Process
- Queries are issued against a collection
- Results include a document identifier and a score
- Score indicates how well a document matched the query
Collection A
Query: find documents where "blue" is near "red"
Results: ID 68, score 98; ID 17, score 71
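To show how a "near" query can yield an ID and a score, here is a toy proximity scorer. The scoring formula, the `window` parameter and the sample texts are assumptions for illustration; Verity's actual scoring is not given on the slide, so the scores differ from the ones shown above.

```python
def near_score(text, a, b, window=5):
    """Score 0-100: higher when terms a and b occur closer together."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if not pos_a or not pos_b:
        return 0
    gap = min(abs(i - j) for i in pos_a for j in pos_b)
    if gap > window:
        return 0
    return round(100 * (window - gap + 1) / window)

# Hypothetical collection: document ID -> text.
collection = {
    68: "the blue and red flag",
    17: "blue sky over a distant red barn",
    5: "blue whale",
}
scored = ((doc_id, near_score(text, "blue", "red"))
          for doc_id, text in collection.items())
results = sorted((r for r in scored if r[1] > 0),
                 key=lambda r: r[1], reverse=True)
# results == [(68, 80), (17, 20)]
```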
30 Content Search Engine Basics: The Process
- Can understand and index many foreign languages
- Includes rules for understanding words and constructs of the specified language
Hola! Buongiorno! Mahalo! Kem cho!
31 Content Search Engine Basics: The Process
Indexed data and index in two separate data stores
Indexed Data
Indexed Data
Updates, synchronization, backup, recovery?
32 Full Text Search is a Specialty Data Store: Yes, but...
- Data Store propagates source changes to the collection
- An events table (text_events) is used to log changes to the source tables
- Data Store must be notified that changes exist
- Backups of both data stores must be synchronized
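The event-log pattern on this slide can be imitated with SQLite from the Python standard library. Only the table name text_events comes from the slide; the schema, the triggers, the `stories` sample rows and `apply_pending_events` are hypothetical, and this is an analogy rather than the Sybase mechanism.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE stories (id INTEGER PRIMARY KEY, copy TEXT);
CREATE TABLE text_events (event_id INTEGER PRIMARY KEY AUTOINCREMENT,
                          row_id INTEGER, action TEXT);
CREATE TRIGGER stories_ins AFTER INSERT ON stories BEGIN
    INSERT INTO text_events (row_id, action) VALUES (NEW.id, 'insert');
END;
CREATE TRIGGER stories_upd AFTER UPDATE ON stories BEGIN
    INSERT INTO text_events (row_id, action) VALUES (NEW.id, 'update');
END;
CREATE TRIGGER stories_del AFTER DELETE ON stories BEGIN
    INSERT INTO text_events (row_id, action) VALUES (OLD.id, 'delete');
END;
""")

index = {}  # stand-in for the external text collection: row id -> set of words

def apply_pending_events():
    """Replay logged source changes into the index, then clear the log."""
    events = db.execute(
        "SELECT row_id, action FROM text_events ORDER BY event_id").fetchall()
    for row_id, action in events:
        if action == "delete":
            index.pop(row_id, None)
        else:
            row = db.execute("SELECT copy FROM stories WHERE id = ?",
                             (row_id,)).fetchone()
            if row is not None:
                index[row_id] = set(row[0].lower().split())
    db.execute("DELETE FROM text_events")
    return len(events)

db.execute("INSERT INTO stories VALUES (1, 'first draft text')")
db.execute("UPDATE stories SET copy = 'revised story text' WHERE id = 1")
processed = apply_pending_events()  # the "notify" step: 2 events replayed
```

Until `apply_pending_events` runs, the index lags the source table, which is why the data store must be notified of changes and why both stores need synchronized backups.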
33 Full Text Search is a Specialty Data Store: Sybase Provides
- Integrated backup and restore facility: backup / recover database and text indexes
- Online configuration: configure Full Text Search at runtime
dump database...
34 Enhanced Full Text Search Features
- Clustering: a feature for grouping similar documents. Clusters are inherently fuzzy - the algorithm merely attempts to group similar documents
- Query By Example: provides ability to search for documents that are similar to one or more segments of text
select summary, score, copy from t1 t, vt1 v where t.id = v.id and index_any = "<like> (Space the final frontier)"
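Query By Example can be approximated by ranking documents on how similar their word counts are to a sample passage. The sketch below uses cosine similarity over raw term counts, which is a swapped-in textbook technique and not necessarily what the `<like>` operator does; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def query_by_example(example, docs):
    """Return (doc_id, similarity) pairs, most similar first."""
    q = vector(example)
    scored = [(doc_id, cosine(q, vector(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda r: r[1], reverse=True)

docs = {
    1: "space the final frontier these are the voyages",
    2: "the final score of the match",
    3: "quarterly interest rates",
}
ranked = query_by_example("Space the final frontier", docs)
# Document 1 ranks first, document 3 shares no terms and scores 0.0.
```

The same pairwise similarity could feed a clustering step, which is one reason clusters come out fuzzy: membership depends on a similarity threshold rather than an exact match.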
35 Enhanced Full Text Search Features
- Custom Thesaurus: allows users to build a thesaurus specific to their application
- Synonym Maps for proximity search
- control: 1 synonyms: (list: red, ruby, scarlet, fuchsia, magenta list: blue <or> azure)
- A Text Index is used by joining the source table and the index table
select score, copy from story_index i, stories s where i.id = s.id and i.score > 70 and i.index_any = "Digital <near> Compaq"
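The effect of a custom thesaurus on matching can be sketched as synonym expansion at query time. The two synonym groups come from the slide; the function names and the matching rule are assumptions for illustration and do not reflect how the thesaurus file is compiled or applied by the engine.

```python
# Synonym groups taken from the slide's thesaurus example.
SYNONYMS = [
    {"red", "ruby", "scarlet", "fuchsia", "magenta"},
    {"blue", "azure"},
]

def expand(term):
    """Return the synonym group containing term, or just the term itself."""
    term = term.lower()
    for group in SYNONYMS:
        if term in group:
            return group
    return {term}

def matches_any(term, text):
    """True if the text contains the term or any of its synonyms."""
    return bool(expand(term) & set(text.lower().split()))

print(matches_any("red", "a scarlet ribbon"))   # prints True
print(matches_any("blue", "a scarlet ribbon"))  # prints False
```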
36 Summary
- Sybase provides 2 types of Knowledge Management
- Concept Search
- Content Search
- Technology Futures include an unstructured data
server, XML search and indexing, XSL translation
and other ways of managing hybrid data.
37 Summary
- Yes, it can be done
- Content, Concept: we have it all